TB 2.4 crashing on Asus/AMD and new Dell server spec
I've been testing TB 2.2, and now 2.4, for some time on an Asus motherboard with an Athlon 64 processor and a Rhino 4FXO card with echo cancellation. This setup ran well for about 8 months on TB 2.1 and 2.2; the only issue was an occasional server reboot every 1-3 weeks. Since it's a test system, I never worried about it.
After upgrading to TB 2.4 recently, the box keeps crashing. First the FOP errors out on the console, then the Asterisk service stops, Hud-Lite stops, the SSH server stops and the console goes dark. It usually takes about 3-5 hours to crash after a cold boot.
I have tried:
- Upgrading the RAM to 2GB
- Running YUM and Package mgr updates
- Changing a BIOS setting to fix the PowerOn error shown in dmesg
- Setting acpi=on in grub.conf
Next I'm going to remove the Rhino drivers and reinstall them using the Package manager. I installed them manually using YUM, and since then there have been Zaptel module load errors on boot. However, the RCBFX module loads fine and calls go in and out as expected.
My dilemma is that I've been marketing and starting to sell Trixbox CE to customers, and I would like my demo system running smoothly. So I'm thinking of buying a new server.
I'm partial to Dell Servers so this is what I'm thinking of buying. Please comment on the spec.
Looking at eBay listings, I found:
PowerEdge 2650 Xeon 2.4GHz, 1GB, 36GB PE2650
I think this will fit the PCI Rhino card?
Will this spec be a stable platform to demo TB2.4 and above on?
First of all, I'm a network engineer and not a Linux pro like others on this list. That said, here is what I found this morning on the console. The system was down and I had to do a hard reset to get it to boot.
EIP: [
<0>Kernel panic - not syncing: Fatal Exception
<0>Uhhuh. NMI received for unknown reason 2d on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 2d on CPU 0.
Do you have a strange power saving mode enabled?
dazed and confused, but trying to continue
I tried disabling ACPI in the BIOS, but then the OS would not boot. It kept complaining about one of the two SATA II drives (I have them mirrored, by the way).
I'm leaning towards getting the Dell server and moving off this whitebox setup, although these issues are helping me learn how to troubleshoot Linux and TB.
I get the same errors on a Dell PowerEdge 1950 w/4GB RAM running dual dual-core Xeon CPUs (HT disabled) at random intervals running Trixbox 2.3 with a TE110P PRI interface.
I have a second identical 1950 system that was recently installed with 2.4.1 (which will be updated to 2.4.2 ASAP) with a TE210P PRI interface. Haven't had much of a load on that system yet, so I'm interested to see if a difference in cards and OS will result in better uptimes.
The 2.3 machine is due to be updated to 2.4, as soon as I can find the time. Until then, we just power cycle when the system locks up.
Cause is the PCI card. I removed the Rhino 4FXO PCI card from the affected system and it has run stable for about 2 days now. So the crashing has something to do with the PCI bus. I'm not sure what to do about it or how to fix it. If someone can comment on a possible fix I would appreciate it.
Thanks,
JK
I'm just starting to build a 2.4.2 system and get the same crash issue. I haven't done any investigative work to isolate the problem, but it seems very familiar: the box crashes every 3 hours or so, on hardware that was fine with a 2.2 system.
I don't normally have a local console connected, so I can't work out what happened other than that it is no longer responding and I have to do a power off/on. I have put a screen on it now to see what happens.
I'm using a TDM400 clone card from BroadTel.
Is there a log one can interrogate to try to get more information on the problem?
Z
I just read it but I'm not sure why that makes any sense, as both our systems work on a previous 2.2 install.
The link reports different issues, like sound quality issues relating to interrupts, and PC lockups that occur straight away, not 3 to 4 hours after booting up and working properly.
In any event, I am going to turn off everything in the BIOS that is not needed and see how everything goes. The crash occurring every 3 hours is very consistent, so I will know if it did anything.
Z
Been trying to nail it for about 2 weeks since it arrived. It started happening from day 1. The PCIe bus has a Sangoma A200 and a Sangoma A500 (2 cards in total). The Dell R200 caused CentOS to log an NMI error:
Jan 14 23:20:49 ak kernel: __sdla_bus_read_4:803: wanpipe PCI Error: Illegal Register read: 0x0040 = 0xFFFFFFFF
Jan 14 23:20:49 ak kernel: Uhhuh. NMI received for unknown reason a1 on CPU 0.
Jan 14 23:20:49 ak kernel: You probably have a hardware problem with your RAM chips
Jan 14 23:20:49 ak kernel: Dazed and confused, but trying to continue
Jan 15 04:02:06 ak logrotate: ALERT exited abnormally with [1]
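For anyone wanting to check their own logs for the same messages, here is a minimal Python sketch that scans syslog-style lines for the "NMI received for unknown reason" message and pulls out the reason code and CPU number. The function name is made up for illustration, and the assumption that kernel messages end up in /var/log/messages is mine; adjust for your own setup.

```python
import re

# Matches the kernel message seen in this thread, e.g.
# "Uhhuh. NMI received for unknown reason a1 on CPU 0."
NMI_RE = re.compile(r"NMI received for unknown reason ([0-9a-fA-F]+) on CPU (\d+)")


def find_nmi_events(lines):
    """Return a list of (line_number, reason_code, cpu) for each NMI message."""
    events = []
    for number, line in enumerate(lines, start=1):
        match = NMI_RE.search(line)
        if match:
            events.append((number, match.group(1), int(match.group(2))))
    return events


def scan_log(path="/var/log/messages"):
    """Scan a log file (path is an assumption) and print any NMI events found."""
    with open(path, errors="replace") as log:
        for number, reason, cpu in find_nmi_events(log):
            print(f"line {number}: NMI reason {reason} on CPU {cpu}")
```

Calling scan_log() with no arguments reads the assumed default path. Pass a different path if your system logs kernel messages elsewhere.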
If I removed either card to run the cards individually in any slot (the exception being that the A500 won't fit in the lower PCIe slot), the system ran perfectly. Putting both cards back in together caused the issue again within 8 hours.
Konrad at Sangoma helped heaps with the debugging of this, since it was Wanpipe reporting the error. Dell washed their hands of it VERY early in the piece, claiming that the Sangoma cards were the problem. Sangoma's support on this issue was nothing short of amazing; they have been extremely supportive during the entire saga.
After mincing heaps of ideas around, I posted a question to these forums about acpi=off being the default for a TB 2.4.0 install. I then decided to remove the "acpi=off" option from grub.conf. I ensured that the issue was occurring with 2.4.2 fully yummed before I removed this, then and only then removed the acpi=off option from grub.conf.
System now stable for 4 days. It's too early to be completely sure, but it seems on the surface that the Dell R200 doesn't like the default TB 2.4.0 acpi=off option.
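As a rough illustration of the change described above, this Python sketch removes an option such as acpi=off from the kernel lines of grub.conf text. It only transforms a string and never touches the real file. The sample entry is hypothetical, not copied from any actual install, and the function name is made up.

```python
def remove_kernel_option(grub_text, option="acpi=off"):
    """Return grub.conf text with `option` removed from every kernel line."""
    result = []
    for line in grub_text.splitlines():
        words = line.split()
        # Only touch active kernel lines; commented-out lines are left alone.
        if words and words[0] == "kernel" and option in words:
            indent = line[: len(line) - len(line.lstrip())]
            words = [word for word in words if word != option]
            line = indent + " ".join(words)
        result.append(line)
    return "\n".join(result) + "\n"


# Hypothetical grub.conf entry, for illustration only.
SAMPLE = (
    "title trixbox (hypothetical entry)\n"
    "\troot (hd0,0)\n"
    "\tkernel /vmlinuz-2.6.18 ro root=LABEL=/ acpi=off\n"
    "\tinitrd /initrd-2.6.18.img\n"
)

print(remove_kernel_option(SAMPLE))
```

Note that the rewritten kernel line has its spacing collapsed to single spaces. Keep a backup of grub.conf before applying any change by hand.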
/Hyp
Thanks Hyp that did the trick!
I was in the process of rebuilding a 2.2 config on the same box in a different partition (about 25% done at the time), with the intention of moving back to the 2.4 partition when there was a fix. (BTW, I do this because it allows me to test different builds on one hardware platform, as I don't have a lot of space for test machines and only have one TDM400 card.)
But then I read your post and went back to the 2.4 partition, and it has not crashed since.
BTW for anybody else out there, this is not a DELL specific issue. My box is a custom build using a Jetway Mini-ITX mainboard that uses a VIA C7 processor and VIA chipsets.
Thanks again Hyp,
Z
It has some interesting points of view in it.
http://trixbox.org/forums/trixbox-forums/open-discussion/acpi-def...
/Hyp
OK, I have a brand new SC440 (dual core Xeon 2.33GHz) with intermittent NMI errors and an onboard NIC that only works intermittently. This is on trixbox 2.4 as well as 2.6.1.
I have a Sangoma A200 PCI Express card in slot 1 and a Realtek 8139 NIC in slot 5 (standard PCI).
I tried today taking the acpi=off out of grub.conf - no success.
I then tried adding acpi=on. Even worse: it hung somewhere during kernel bootup.
I think my problem is the Realtek NIC and the Sangoma card fighting on the PCI bus. Of course, the Dell BIOS gives you no control over IRQ assignment.
As per the link above I am going to try adding idle=poll in grub.conf and see if that makes any difference.
Next step is just to remove the realtek NIC and try and get the system stable again. If that fixes things I'm just going to have to use VLANs on the OnBoard NIC.
Will report back with what works for me...
Just an update:
- removed Realtek NIC.
- fresh install of trixbox 2.6 (hyper threading disabled)
- acpi=on in grub.conf
- stable (10 flawless reboots and counting)
- cat /proc/interrupts shows CPU0 and CPU1 so both cores are being used.
- shutdown works now!
Next step is to buy a pci express NIC and see if that causes NMI issues or not.
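To go a little further than eyeballing cat /proc/interrupts, here is a small Python sketch that lists the CPU columns and flags any IRQ shared by more than one device. It assumes the layout used by 2.6-era kernels (per-CPU counts, a single-token controller type, then comma-separated device names); newer kernels format this file differently. The sample text and the device names in it are made up for illustration.

```python
def parse_interrupts(text):
    """Return (cpu_names, shared) where shared maps IRQ number -> device list."""
    lines = text.strip("\n").splitlines()
    cpus = lines[0].split()  # header row, e.g. ["CPU0", "CPU1"]
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue  # skip summary rows such as NMI, LOC, ERR
        # Split into per-CPU counts, controller type, and the device remainder.
        fields = rest.split(None, len(cpus) + 1)
        if len(fields) < len(cpus) + 2:
            continue
        devices = [name.strip() for name in fields[-1].split(",")]
        if len(devices) > 1:
            shared[int(label)] = devices
    return cpus, shared


# Hypothetical sample; on a real box, read /proc/interrupts instead.
SAMPLE = """\
           CPU0       CPU1
  0:     123456          0    IO-APIC-edge  timer
 16:       1000       2000   IO-APIC-level  wanpipe1, eth1
NMI:          0          0
"""

cpus, shared = parse_interrupts(SAMPLE)
print("CPUs:", cpus)
print("Shared IRQs:", shared)
```

A shared IRQ is not necessarily a fault, but a telephony card sharing an interrupt with a NIC is one thing worth ruling out when chasing the kind of conflict suspected above.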

