
TB 2.4 crashing on Asus/AMD and new Dell server spec

james.kemery@ce...
Posts: 10
Member Since:
2007-08-11

I've been testing TB 2.2, and now 2.4, for some time on an Asus motherboard with an Athlon 64 processor and a Rhino 4FXO card with echo cancellation. This setup ran well for about 8 months on TB 2.1 and 2.2; the only issue was an occasional server reboot every 1-3 weeks. Since it's kind of a test system, I never worried about it.
After upgrading to TB 2.4 recently the box keeps crashing. First the FOP errors out on the console, then Asterisk service stops, Hud-Lite stops, SSH server stops and the console goes dark. Usually takes about 3-5 hours to crash after a cold boot.
I have tried:
- Upgrading the RAM to 2GB
- Running YUM and Package mgr updates
- Changed BIOS setting to fix PowerOn error shown in dmesg
- Set acpi=on in grub.conf

Next I'm going to remove the Rhino drivers and reinstall them using the Package manager. I originally installed them manually using YUM, and since then there have been Zaptel module load errors on boot. However, the RCBFX module loads fine and calls go in/out as expected.

My dilemma is that I've been marketing and starting to sell Trixbox CE to customers, and I would like my demo system to run smoothly. So I'm thinking of buying a new server.

I'm partial to Dell Servers so this is what I'm thinking of buying. Please comment on the spec.
Looking at Ebay buys and found:
PowerEdge 2650 Xeon 2.4GHz, 1GB, 36GB PE2650
I think this will fit the PCI Rhino card?
Will this spec be a stable platform to demo TB2.4 and above on?



james.kemery@ce...
Posts: 10
Member Since:
2007-08-11
More strange crash info

First of all, I'm a network engineer and not a Linux pro like others on this list. That said, here is what I found this morning on the console. The system was down and I had to do a hard reset to get it to boot.

EIP: [] page_remove_rmap+0x16/0x6d SS:ESP 0068:f5a8ace4
<0>Kernel panic - not syncing: Fatal Exception
<0>Uhhuh. NMI received for unknown reason 2d on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 2d on CPU 0.
Do you have a strange power saving mode enabled?
dazed and confused, but trying to continue

I tried disabling ACPI in the BIOS, but then the OS would not boot; it kept complaining about one of the two SATA II drives (I have them mirrored, by the way).

I'm leaning towards getting the Dell server and moving off this whitebox setup. Although these issues are helping me learn how to troubleshoot Linux and TB.



SchlingBlade
Posts: 114
Member Since:
2007-11-29

I get the same errors on a Dell PowerEdge 1950 w/4GB RAM running dual dual-core Xeon CPUs (HT disabled) at random intervals running Trixbox 2.3 with a TE110P PRI interface.

I have a second identical 1950 system that was recently installed with 2.4.1 (which will be updated to 2.4.2 ASAP) with a TE210P PRI interface. Haven't had much of a load on that system yet, so I'm interested to see if a difference in cards and OS will result in better uptimes.

The 2.3 machine is due to be updated to 2.4, as soon as I can find the time. Until then, we just power cycle when the system locks up.



james.kemery@ce...
Posts: 10
Member Since:
2007-08-11

Cause is the PCI card. I removed the Rhino 4FXO PCI card from the affected system and it has run stable for about 2 days now. So the crashing has something to do with the PCI bus. I'm not sure what to do about it or how to fix it. If someone can comment on a possible fix I would appreciate it.

Thanks,
JK



zforum69
Posts: 81
Member Since:
2007-04-18
Same issue here

I'm just starting to build a 2.4.2 system and get the same crash issue. I haven't done any investigative work to isolate the problem, but it seems very familiar: the box crashes every 3 hours or so, whereas it was fine on a 2.2 system.

I don't normally have a local console connected, so I can't work out what happened other than that it is no longer responding and I have to do a power off/on. I have put a screen on it now to see what happens.

I'm using a TDM400 clone card from BroadTel.

Is there a log one can interrogate to try to get more information on the problem?

Z



rfernandez
Posts: 8
Member Since:
2006-06-13
Please read the HCL for Asterisk first

zforum69
Posts: 81
Member Since:
2007-04-18
I don't think the answer is there

I just read it but I'm not sure why that makes any sense, as both our systems work on a previous 2.2 install.

The link reports different issues, like sound quality issues relating to interrupts and PC lockups that occur straight away, not 3 to 4 hours after booting up and working properly.

In any event, I am going to turn off everything in the BIOS that is not needed and see how everything goes. The crash occurring every 3 hours is very consistent, so I will know if it did anything.

Z



Hyperus
Posts: 41
Member Since:
2007-12-09
Similar problems on a new Dell R200 (that replaces Dell 860)

I've been trying to nail this down for about 2 weeks, since the server arrived. It started happening from day 1. The PCIe bus has a Sangoma A200 and a Sangoma A500 (2 cards in total). On the Dell R200, CentOS logs an NMI error:

Jan 14 23:20:49 ak kernel: __sdla_bus_read_4:803: wanpipe PCI Error: Illegal Register read: 0x0040 = 0xFFFFFFFF
Jan 14 23:20:49 ak kernel: Uhhuh. NMI received for unknown reason a1 on CPU 0.
Jan 14 23:20:49 ak kernel: You probably have a hardware problem with your RAM chips
Jan 14 23:20:49 ak kernel: Dazed and confused, but trying to continue
Jan 15 04:02:06 ak logrotate: ALERT exited abnormally with [1]
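
For anyone wanting to check their own logs for the same pattern, here is a minimal Python sketch that picks the "NMI received for unknown reason" lines out of syslog-style text like the excerpt above. The function name `find_nmi_events` is made up for illustration, and the default path `/var/log/messages` is an assumption about where the kernel messages end up on a stock CentOS install; adjust it if your syslog is configured differently.

```python
import re
import sys

# Matches syslog lines such as:
#   Jan 14 23:20:49 ak kernel: Uhhuh. NMI received for unknown reason a1 on CPU 0.
NMI_LINE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"Uhhuh\. NMI received for unknown reason (\w+) on CPU (\d+)"
)


def find_nmi_events(lines):
    """Return a list of (timestamp, reason_code, cpu) for each NMI line found."""
    events = []
    for line in lines:
        match = NMI_LINE.match(line)
        if match:
            timestamp, reason, cpu = match.groups()
            events.append((timestamp, reason, int(cpu)))
    return events


if __name__ == "__main__":
    # Assumed default log location; pass another path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    with open(path, errors="replace") as log:
        for timestamp, reason, cpu in find_nmi_events(log):
            print(f"{timestamp}  reason={reason}  cpu={cpu}")
```

Messages that only ever reach the console (for example, a panic that kills the box before syslog can write) will not show up this way, so an empty result does not prove there was no NMI.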

If I removed either card and ran the cards individually in any slot (the exception being that the A500 won't fit in the lower PCIe slot), the system ran perfectly. Putting both cards back in together caused the issue again within 8 hours.

Konrad at Sangoma helped heaps with the debugging of this due to it being Wanpipe reporting the error. Dell washed their hands of it VERY early in the piece, claiming that the Sangoma cards were the problem. Sangoma's support on this issue was nothing short of amazing - they have been extremely supportive during the entire saga.

After kicking heaps of ideas around, I posted a question to these forums about acpi=off being the default for the TB 2.4.0 install. I then decided to remove the "acpi=off" option from grub.conf. I first confirmed that the issue was still occurring with 2.4.2 fully yum-updated, and only then removed the acpi=off option from grub.conf.
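
For reference, the grub.conf change described here amounts to deleting one token from the kernel line. A rough Python sketch of that edit follows, assuming a GRUB legacy style file where boot options sit on lines starting with `kernel`. The helper name `remove_kernel_option` is hypothetical, and the sample file content used to try it out is invented, not copied from a real trixbox install.

```python
def remove_kernel_option(grub_conf_text, option="acpi=off"):
    """Return grub.conf text with `option` dropped from every kernel line.

    Only lines whose first word is `kernel` are touched; comments and
    other directives pass through unchanged.
    """
    result = []
    for line in grub_conf_text.splitlines():
        stripped = line.lstrip()
        if stripped.split(None, 1)[:1] == ["kernel"]:
            indent = line[: len(line) - len(stripped)]
            # Rebuild the line without the unwanted option token.
            tokens = [tok for tok in stripped.split() if tok != option]
            line = indent + " ".join(tokens)
        result.append(line)
    text = "\n".join(result)
    # Preserve a trailing newline if the input had one.
    if grub_conf_text.endswith("\n"):
        text += "\n"
    return text
```

The function only returns the new text; it does not write anything. Keeping a backup copy of grub.conf before saving a modified version is a sensible precaution, since a bad kernel line can leave the box unbootable.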

The system has now been stable for 4 days. It's too early to be completely sure, but on the surface it seems that the Dell R200 doesn't like the default TB 2.4.0 acpi=off option.

/Hyp



zforum69
Posts: 81
Member Since:
2007-04-18

Thanks Hyp that did the trick!

I was in the process of rebuilding a 2.2 config on the same box in a different partition (about 25% done at the time), with the intention of moving back to the 2.4 partition when there was a fix. (BTW, I do this because it allows me to test different builds on one hardware platform, as I don't have a lot of space for test machines and only have one TDM400 card.)

But then I read your post and went back to the 2.4 partition, and it has not crashed since.

BTW for anybody else out there, this is not a DELL specific issue. My box is a custom build using a Jetway Mini-ITX mainboard that uses a VIA C7 processor and VIA chipsets.

Thanks again Hyp,
Z



Hyperus
Posts: 41
Member Since:
2007-12-09
Good to hear.. it might pay to check out this tb2.4 acpi post...

It has some interesting points of view in it.

http://trixbox.org/forums/trixbox-forums/open-discussion/acpi-def...

/Hyp



bltda
Posts: 27
Member Since:
2007-04-13
Check this link

tomo89aus
Posts: 3
Member Since:
2007-11-05

OK, I have a brand new SC440 (dual core Xeon 2.33GHz) with intermittent NMI errors and an onboard NIC that works only intermittently. This is on trixbox 2.4 as well as 2.6.1.

I have a Sangoma A200 PCI Express card in slot 1 and a Realtek 8139 NIC in slot 5 (standard PCI).

I tried today taking the acpi=off out of grub.conf - no success.
I then tried adding acpi=on.....even worse - hung on kernel bootup somewhere.

I think my problem is the Realtek NIC and the Sangoma card fighting on the PCI bus. Of course the Dell BIOS gives you no control over IRQ assignment.

As per the link above I am going to try adding idle=poll in grub.conf and see if that makes any difference.

Next step is just to remove the realtek NIC and try and get the system stable again. If that fixes things I'm just going to have to use VLANs on the OnBoard NIC.

Will report back with what works for me...



tomo89aus
Posts: 3
Member Since:
2007-11-05

just an update:

- removed Realtek NIC.
- fresh install of trixbox 2.6 (hyper threading disabled)
- acpi=on in grub.conf
- stable (10 flawless reboots and counting)
- cat /proc/interrupts shows CPU0 and CPU1 so both cores are being used.
- shutdown works now!
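
The /proc/interrupts check in the list above can also be done programmatically. The sketch below is one way to parse that file's text in Python: it reads the CPU column headers and the per-CPU counts for each interrupt line. The names `parse_interrupts` and `active_cpus` are made up for this example, and it assumes the usual layout of a header row of CPU names followed by `IRQ: count count ... description` rows.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into (cpu_names, {irq_label: [counts]})."""
    lines = text.splitlines()
    cpus = lines[0].split()  # header row, e.g. ['CPU0', 'CPU1']
    table = {}
    for line in lines[1:]:
        if ":" not in line:
            continue
        label, rest = line.split(":", 1)
        counts = []
        # Take up to one numeric field per CPU; stop at the description text.
        for field in rest.split()[: len(cpus)]:
            if not field.isdigit():
                break
            counts.append(int(field))
        table[label.strip()] = counts
    return cpus, table


def active_cpus(cpus, table):
    """Return the CPUs that have serviced at least one interrupt."""
    return [
        cpu
        for index, cpu in enumerate(cpus)
        if any(len(c) > index and c[index] > 0 for c in table.values())
    ]


if __name__ == "__main__":
    with open("/proc/interrupts") as handle:
        names, irq_table = parse_interrupts(handle.read())
    print("CPUs listed:", names)
    print("CPUs handling interrupts:", active_cpus(names, irq_table))
```

Seeing both CPU0 and CPU1 in the header only shows that the kernel brought both cores up; the per-CPU counts show whether each core is actually servicing interrupts.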

Next step is to buy a PCI Express NIC and see if that causes NMI issues or not.


