Server Crashes Solved?

Today, we upgraded the crash-prone server to a new BIOS version.

The changelog for this BIOS version claims that it "Improves Memory Compatibility", and so far the results seem promising.

We moved (backwards) from Fedora Development to Trixbox 2.0.

We know the previous install would crash, on most occasions, during any interaction with yum or any compilation, likely because these tasks give various areas of the CPU, memory, motherboard and IDE subsystem a workout.
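Since a yum run or a compile was reliably enough to bring the old install down, a crude burn-in along the same lines is a cheap way to check whether the BIOS update actually helped. This is only a sketch: it streams zeros through md5sum to load the CPU and memory; a proper memory tester like memtest86 is still the right tool for a real verdict.

```shell
# Crude burn-in sketch: repeatedly checksum a large stream of zeros to
# load the CPU and memory, roughly the kind of workout a yum run or a
# compile gives the box. Three passes is arbitrary; on the real server
# you would let this loop for hours.
for i in 1 2 3; do
    dd if=/dev/zero bs=1M count=64 2>/dev/null | md5sum
done
```

If the machine survives a long run of this without locking up, the memory-compatibility fix probably took.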

Just tonight Trixbox 2.0 was installed, and I’m moving it from CentOS 4.4 to CentOS 5 to take advantage of the newer kernel and bug-fixed software. Then we’ll work on destroying everything that makes Trixbox, Trixbox: removing the IRCd, the web admin system, the Flash Operator Panel, some of the databases, and a few other items it installs.

This gives us a base platform on which to apply the usual key modifications and build a stable operating platform, the difference this time being CentOS 5.
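The stripping-down step boils down to a package removal pass. A minimal sketch, with the caveat that the package names below are guesses, not the real Trixbox package names; on the actual box you would first list what is installed (`rpm -qa | grep -i ircd`, and so on) and substitute the right names:

```shell
# Sketch of stripping the Trixbox extras we have no use for.
# NOTE: these package names are hypothetical placeholders; check
# `rpm -qa` on the real box and adjust before running anything.
EXTRAS="ircd flash-operator-panel"
for pkg in $EXTRAS; do
    echo "would remove: $pkg"
    # yum remove -y "$pkg"    # uncomment on the real box
done
```

Dry-running with `echo` first is deliberate: on a distro-in-a-distro like Trixbox, a careless `yum remove` can drag half of Asterisk out with it through dependencies.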

The server has a dual-core, 64-bit CPU. The 64-bit part shouldn’t be a problem; it runs 32-bit software and an i386 kernel happily. What could be a problem, however, is a dual-core CPU running on a kernel that has since been superseded something like eleven times or more.

So, the running theory is that a newer kernel, with better dual-core support baked in, should work well with the system, and moving to CentOS 5 now saves issues down the track if it doesn’t work out.

Of course, we could have gone about it the other way: install CentOS 5, run the ugly Trixbox script, and then remove all the rubbish it includes that we simply have no use for (like its IRCd).

Let’s hope this BIOS update solves the issue completely, so we can get running on some REAL hardware for a change.

In related hardware issues, a P3 was passed to me today; its owner complains that the existing Windows 98 install is running slowly, and wants a format.

Took it to someone else, who said it worked the day before; today they plugged it in, turned it on, and it wouldn’t start. They investigated the hardware and concluded the issue was a faulty power switch. I wasn’t happy with that conclusion, mainly for three reasons:

1. They hadn’t tested the power supply.
2. They hadn’t tested the motherboard.
3. They hadn’t isolated anything else in the machine.

So, I opened it up and tested the machine: no boot. Removed the PCI cards and stripped the machine to bare bones: no boot.
So I dug deeper and swapped the reset button for the power button: no boot. There goes that faulty-switch theory right out the window; the switch is fine.
Digging deeper still, we swapped the motherboard for another board: no boot.

Using the power supply from another machine, the motherboard from another machine, and the power switch from the original machine: voilà, it boots.

So the issue was that not only had the power supply gone, but the motherboard too.

The switch on the case is fine. The problem now is that the case won’t take the other power supply I have, and the law of averages applies: the time it would take to get an angle grinder and cut a hole in the back of the case for the new power supply is probably longer than it would take to move all the components into the case we just emptied.

And so, it was done, and it boots. All successful. My old P3 motherboard that hasn’t been used in a fair while was working fine in a new case.

The message is clear: if you haven’t tested completely, you can’t diagnose an issue. If you have no experience diagnosing issues using proven methods, and lack the resources and knowledge to test each component involved in a problem, you are not going to reach a solid conclusion, and you will not identify the issue.

To accurately resolve an issue, you must first identify the problem, not invent problems as you go to see if fixing them helps. If you get lucky, sure, it’ll solve it; but if you’re wrong, you’ll end up wasting time, resources and money chasing down an issue you haven’t even identified yet.

The total time taken to identify the issue was around an hour, with an extra hour to move the components over, test, and clean up; tomorrow will be spent installing Winblows 98 again.

Back on the topic of my server: I am so happy to have it running stable again (I hope); testing will tell. I think the next outage will be to put the extra 1GB of RAM back in, or the reboot into CentOS 5, assuming that goes successfully.

