System Error, Kernel Panic, Linux and Windows

I’ve got two concurrent machine issues right now.

One is a Linux server, which for the last few months has been kernel panicking every 2 days or so.

The suspected cause is a faulty CPU. Memory tests have come back fine, though, so I’m not too sure we’re going to find it easily.

It only happened every 2 days or so, and this was in a data centre environment (air conditioned, etc). At one stage the system was placed right on the a/c to chill it, which made it no more stable. It did bring the HDD temp right down, though, so there might be minor temperature issues.

What I’m hoping for is that the system crashes again whilst it’s out of the racks, so that more testing can be done to prove that something is actually at fault.

Next, another weird issue: my own system, running 64-bit Windows (Professional) with SP2, has been crashing in game.

I would start a game, like Bus Driver (funny small game), and every time I would attempt 1 particular level, it would blue screen, no matter what. It was certain. So, what I set about doing today was to fix that issue, by narrowing it down.

Anyway, during the process, I was playing the game and sure enough, we’d get a BSOD.

The system configuration is:

Pentium 4 630, 3.0GHz EM64T, 2MB Cache
3GB of DDR400 RAM (in a Dual Channel configuration, with 2 x 1GB, 2 x 512MB)
Gigabyte GA-8IPE775G Motherboard (socket 775, AGP 8x)
nVidia fx5500 256MB AGP Vid card
nVidia fx5200 128MB PCI Vid card
Dual LCD monitors

Some of the BSOD reasons (excuses) are below:
SYSTEM_SERVICE_EXCEPTION
STOP ERROR
PAGE_FAULT_IN_NONPAGED_AREA
DRIVER_IRQL_NOT_LESS_OR_EQUAL
… and so on.

So, by process of elimination (the natural thinking process kicked in here), I zeroed in on the video card, having run memory tests in chunks successfully the day (or so) before.

I came up with this idea: take the card out of my machine and put it into my fiancée’s machine, and put hers in mine. If her machine then did the same thing, that’d be proof enough for me to make a frisbee out of the card.

No dice. It worked fine on hers, no hassles (aside from being an fx5500, and being slower than her fx6200 at playing the game with maximum detail).

So, the next step was to run some 3DMark, which had also produced some of the BSOD messages on my machine, predictably (every time).

Nope. Just low scores (300 ish), and taking a lot of time to complete.

My system did BSOD once only with her card in my machine, but it did work fine continuously after that single BSOD, and the game ran fine, everything was fine.

As was her machine.

But there was no way she’d tolerate rubbish graphics, even if the card seems perfectly fine.

I can nearly guarantee if I started the game with that card in my machine now, I’d get a BSOD. It’s a near certainty.
Nothing else causes a BSOD as yet, everything runs perfectly fine. I stay up and running for days on end, and take it down for a few minutes at a reboot (or BSOD).

So, I’m near certain it’s something about the card, or the drivers, at fault, but that doesn’t explain the differences in behaviour. It runs fine on her machine.

So, digging deeper on this when I can find some more spare time for it.

But, further news. I did test the HDD freeze trick, and unfortunately, it doesn’t work. But that’s fine.
You see, when I bought that drive, I bought two, at exactly the same time, drop shipped from the same supplier. So, chances are, the other drive, at someone else’s place right now, has the same PCB on the back.

I dug deeper on the issue and came across a site called MyHardDriveDied.com, and the presentations he has on YouTube are great.

He goes on to say that most cases, i.e. 85%, are software recoveries. I know this drive is beyond software: SpinRite didn’t have any luck finding it.

But his presentation goes further, to say that of the remaining 15%, 10 points are logic board failures. Incredible. Another 4% are simply head crashes, and the worst situation of all is the final 1%, which are motor failures; motors plus multiple platters equals bad news according to his video.

The good news for us is that we can probably recover the data off the drive, given that nice increased chance of it being fixable with a logic board swap.
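To put a rough number on that "increased chance": if the shares quoted from the presentation (85 / 10 / 4 / 1) are taken at face value, and the drive really is beyond software recovery, the remaining 15 points can be rescaled to add up to 100%. A quick Python sketch of that arithmetic follows; the function and variable names are just mine for illustration, and the result is only as good as those quoted figures.

```python
# Shares of all drive failures, as quoted from the presentation (percent).
shares = {"software": 85, "logic board": 10, "head crash": 4, "motor": 1}


def beyond_software_odds(all_shares):
    """Rescale the non-software shares so they add up to 100%."""
    hardware = {cause: pct for cause, pct in all_shares.items() if cause != "software"}
    total = sum(hardware.values())  # the "remaining 15%"
    return {cause: 100 * pct / total for cause, pct in hardware.items()}


odds = beyond_software_odds(shares)
for cause, pct in odds.items():
    print(f"{cause}: {pct:.1f}%")
# logic board: 66.7%
# head crash: 26.7%
# motor: 6.7%
```

So, on those numbers, once software recovery is ruled out, a logic board fault would be roughly a two-in-three bet, which is why the board swap is worth trying first.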

All we have to do is basically back up the data off that other drive (unfortunately, it also has Windows on it). Actually, I could probably ghost it to a new 320GB drive (payment for stealing the 120GB back), swap its logic board over, and boot from it.

If that works well, we’ll whip the data off, and have some fantastic display items to stick in the wall unit.
If it DOESN’T work, I have my second identical drive to do a head swap in, should I decide to do it.

I only wish I could work this BSOD issue out, so I can get the order in for 3 x 320GB drives and “insert problematic part here”, and get all the issues solved (except the server; that’s pending some severe interrogation to both prove its stability and find the cause behind the kernel panics).

I guess that’s a chunk of my time taken by just 3 separate issues, which are complex in nature.

Computers, why couldn’t they be smart and simply say: replace your graphics card, it’s crap, and it’s causing problems, or, take out bank 4 of RAM, it’s causing problems. Speaking of bank 4, I swapped that stick with my fiancée’s, as hers was 3-3-3-8, all my other sticks are 3-3-3-8, and that one was 3-4-4-8, so the idea of the swap was to match the timings.

Hopefully the reason behind the failures isn’t too much more complex, and the identification and fix aren’t as difficult as this has been. It’s weird, it’s inconsistent, and it breaks the theory that if something’s broken, it’ll perform near the same, or at least have predictable behaviour. It doesn’t. Both machines worked fine. I couldn’t believe it. No conflicts visible, but it’s something.

Those are my system problems for now. Should be resolved soon!

This entry was posted in Linux, Networking, Programming, Random.
