After travelling 460 km today, I’ve discovered how annoying a hardware failure can be.
My colocation box is an HP DL360 G3, so it’s not exactly bleeding edge, but it’s a fantastic server: it does its job and does it reliably.
Last night we had a short planned outage to move the box to a different power setup. This was done and the box came back online.
It dropped off at 6.30pm and came back at 6.45pm. Then, at 7.15pm, there were notices coming through. I’ve been ignoring the warnings for a while now, so I failed there, but it wasn’t too bad, because..
.. my colo provider, Inticon, detected the port was flapping 20 minutes later and went back to have a look. At 8pm he advised that the issue was with the processor fan assembly: only one fan was spinning. Strange, as it had been OK for a while beforehand.
He went out and about to find a replacement part, and tried to find alternative measures, but came up empty. Considering the nature of the part (it’s pretty server-specific), it was understandable that we had no alternatives.
I then began hunting around for the part – found one, $185 online. I thought, wow, this must be a pretty premium part..
The contact at the colo provider went even further, and actually advised me there was a box on eBay for $250 – no kidding.
I looked at it and figured it’s only marginally more expensive, but gives us a lot more hardware to play with as well. I contacted the eBay seller and asked if he could accommodate pickup for tomorrow – he was OK with it, fantastic I thought..
So, I head to bed – we can’t pick it up at night, it’s just not THAT important (and I had started work at 8am, so by that point I was very tired). I think about this for a while in bed, then somehow drift off to sleep.
One of the kids is sick, so they crawl into bed beside me and annoy me, so I get up at 6am, head to sleep on the lounge, peaceful.
The next morning, I get up and sort out a pending PC-related issue for somebody (it was arranged already). The seller was interstate for the morning and not back until 1pm, so that was as good as we could get.
He then contacted us, advising he wouldn’t be back til 3.30pm – considering we were getting ready to leave around then, that’d work out perfectly. We got near the address in the southern Sydney area, but he calls and advises 4.30 – he was delayed. No problem, can’t do much if he isn’t there..
So, 4.30pm, he arrives, we have a look at the box, it works, he upgrades the CPU and RAM and throws in a 15K 36GB SCSI drive – fantastic I thought, more spare parts. The box worked, so in the boot with it, time to hit the highway for Newcastle.
We hit Newcastle at 7pm (after literally racing down the freeway to Newcastle), and the colo provider contact was already there – fantastic.
I take the server into the colo room and upgrade the box with what good parts I could: an extra 1GB of RAM, a 3.06 to 3.2GHz Xeon swap, and the replacement fan assembly. I power it on and it takes some time to boot, which is unusual, but it eventually does boot.
Then I log into Windows, get a ping going, get Task Manager up and make sure the virtual machine kicks over. It does, and the box stays up, stable.
Pack up the changed parts into the ‘new’ box, and get the colo provider to hold those in case something else fails down the track. It’s pretty much a fully working machine, with the exception of the front panel – which shouldn’t fail again.
Then we head back home approximately 30 minutes later. By this time the kid who is sick complains about pain and ends up needing a toilet break. All sorted.
Hit the highway to go back home, arrive at 9.30pm.
Wait for the colo provider contact to remove some trickery done to assist with the downtime, and I modify the SMTP route I put in place to redirect the affected domains to my router.
The outcome – all mail leaves, some spam shows, delete this, and I update the database, all in sync.. fantastic..
All up, nearly a full day, 460 km travelled, $14 spent on food at Aldi in Sydney, $22 spent at Maccas in Newcastle, $250 spent on a pretty darn good deal (compared to $185 for the processor fan assembly alone) – and we are all back and operational.
Hardware failures are REALLY annoying, especially when you can’t just call HP and say ‘fix it in 4 hours’, but we did pretty well for a 23-hour restore time (and still managed to keep some essential services up anyway).
And the dedication of Inticon is outstanding.
Had this been Servers Australia, I have no idea where we’d be. Part of me says ‘online sooner, ’cause they have HP DL360s (or had them)’, the other says ‘getting Jared to get off his arse and take some damned action’. I’m leaning towards the latter; it’s just typical of previous experiences.
Note to self: work harder on high availability.