OzVoIPStatus moved / Unexplained crashes unresolved.

That’s right.

It’s moved data centres this time, to Equinix.

Now, only the stupid would ever think a mass scale move of any type would go without error. And I mean, only the completely stupid.

Every time you move a complete farm of servers, SOMETHING will not go right. Ask anyone that has done it before, and endured the late nights, or the customer complaints from the early risers when a problem occurs.

And it’s even worse when the problems continue later into the day.

And much more so when they creep into the night. Not likely.

Anyway, the move for OzVoIPStatus started at 3.31AM. I understand the move was completed at around 6AM, when the server was showing activity.

I was awake till something like 1.30AM with my little one, who thought being awake after 12AM was pretty good. So good, in fact, that he stayed up for another hour while I quite happily built PHP 5.2.2 on my dev server, running configure and make a few times, having given up waiting for the move to start (that’s not a dig at anyone; it’s no easy task moving several servers without drastically affecting their availability).

So, it got late, I gave up waiting to swap the DNS over, and woke up this morning at something like 10AM with my server offline.
Nothing says “Good Morning” like a server that is down. You know you’re going to have to do something to fix it and get all its services running, and you’re probably going to put that ahead of EVERYTHING else, so that it’s up and not affecting anyone.

So, when I found it was still down, waiting to be moved, and the WHM DNS had been automagically changed, I had a few things on my plate to take care of.

First was making sure the server was online at its intended IP address. It wasn’t, but that was fixed quickly.
The next step was figuring out why OzVoIPStatus’s DNS record had magically changed to point at another IP.

A quick visit to the first DNS server confirmed the IP was correct. A quick visit to the second DNS server saw that the WHM tools had managed to unchange my changes.

A quick change, a router reboot later, and we were up and running.
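For what it’s worth, a mismatch like that is easy to catch with a quick script. Here’s a minimal sketch, assuming the third-party dnspython library is available; the hostname and nameserver addresses are placeholders, not the real ones:

    # Minimal sketch: compare the A records returned by two nameservers.
    # Requires the third-party dnspython package (pip install dnspython).
    # The hostname and nameserver IPs are placeholders, not the real ones.
    import dns.resolver

    HOSTNAME = "status.example.com"            # hypothetical monitored hostname
    NAMESERVERS = ["192.0.2.1", "192.0.2.2"]   # hypothetical primary and secondary DNS

    def a_records(nameserver, hostname):
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [nameserver]
        answer = resolver.resolve(hostname, "A")
        return sorted(rr.address for rr in answer)

    results = {ns: a_records(ns, HOSTNAME) for ns in NAMESERVERS}
    for ns, records in results.items():
        print(ns, records)

    if len(set(map(tuple, results.values()))) > 1:
        print("WARNING: the nameservers disagree about", HOSTNAME)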

The next step was to check all services were running.
Naturally, the server had been sitting there from 6AM to 10AM without internet access, so the predictable had occurred: some applications that require internet access were no longer running, or needed manual work to get running or adjusted.

So, the first job: start any services that weren’t running. Pretty easy.
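A hedged sketch of what I mean, assuming a classic SysV init layout; the service names are examples only, not the actual list on this box:

    # Minimal sketch: restart any listed service that isn't reporting as running.
    # Assumes "service NAME status/restart" works (classic SysV init layout);
    # the service names are examples, not the actual list on this box.
    import subprocess

    SERVICES = ["httpd", "mysqld", "named", "exim"]  # hypothetical list

    for name in SERVICES:
        if subprocess.call(["service", name, "status"]) != 0:
            print("Restarting", name)   # non-zero exit usually means "not running"
            subprocess.call(["service", name, "restart"])
        else:
            print(name, "is running")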

Check OzVoIPStatus to see if we are cooking with 200 OKs and pages load ok. Yep.
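That check amounts to something like this; a minimal sketch, with a placeholder URL rather than the real address:

    # Minimal sketch: confirm the site answers with 200 OK and returns content.
    # The URL is a placeholder, not the real monitoring address.
    import urllib.request

    URL = "http://status.example.com/"

    with urllib.request.urlopen(URL, timeout=10) as response:
        body = response.read()
        print(response.status, len(body), "bytes")
        assert response.status == 200, "site is not returning 200 OK"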

Check if outages are being logged correctly, and no provider is suffering DNS issues or other network issues.
Nope, no joy.

Unfortunately, a link out of the data centre (about two levels above SAU) is having issues.

The traceroute revealed that it wasn’t passing on traffic from the WCG Optus link.

So I contacted SAU and alerted them to it. You can imagine they were busy getting servers into racks, systems running and services online themselves, so it was never going to be their #1 issue, but for me and OzVoIPStatus it was a rather critical issue that needed some attention.

So, I dug a bit deeper later on and contacted Optus’s NOC to alert them to it. The response was that it didn’t seem to be a problem on their side, and that they couldn’t find a route advertised.

I was already aware such a problem existed when I posted in this Whirlpool thread (the hop just came to mind):
http://forums.whirlpool.net.au/forum-replies.cfm?t=740473&r=11610503#r11610503

As you can see, the issue is related exactly to that 59.154.15.66 link, which points to 59.154.15.65 to carry traffic to further hops.

Optus tells me that the traffic should be routed not to .66, but to .65.
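A quick way to keep an eye on which of the two the path actually takes is to run traceroute and look for the suspect hop; a rough sketch, where the destination address is a placeholder:

    # Rough sketch: run traceroute to a destination behind the broken link and
    # flag whether the suspect hop shows up in the path.
    # The destination address is a placeholder; 59.154.15.66 is the hop in question.
    import subprocess

    TARGET = "203.0.113.10"       # hypothetical destination behind the link
    SUSPECT_HOP = "59.154.15.66"  # the hop from the Whirlpool thread

    output = subprocess.run(
        ["traceroute", "-n", TARGET],
        capture_output=True, text=True, timeout=120,
    ).stdout

    if SUSPECT_HOP in output:
        print("Path still goes via", SUSPECT_HOP)
    else:
        print(SUSPECT_HOP, "is not in the path")
    print(output)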

I have no idea; it’s not my network, not my problem, but the issue exists nonetheless. I’m amazed: the poster in that thread claims it was fixed just last Friday. Today, however, it does not seem to be fixed at all, with me experiencing, believe it or not, the exact same issue.

It’s a bit of a long-running issue, and I have had one user complain they can’t access the website. Beyond that, the other problem is that while the link is down I’m logging incorrect outages for 25 providers that share the same 10 servers, so the stats are wrong as a result. There’s little that can be done about it, aside from deleting the false outages once they finish, each time I check how things are going, to reduce the impact.
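Purely as an illustration of the clean-up I mean (this is not OzVoIPStatus’s actual code or data model; the structure, names and values here are invented), the idea is to drop outages that only exist because the upstream link is down:

    # Purely illustrative sketch: drop outages that are really just the upstream
    # link being down. None of these names come from OzVoIPStatus itself; the
    # data structure and values are invented for the example.
    from datetime import datetime

    LINK_DOWN_SINCE = datetime(2007, 5, 20, 6, 0)              # hypothetical start of the link problem
    SHARED_SERVERS = {"sip1.example.net", "sip2.example.net"}  # hypothetical shared hosts

    def is_false_outage(outage):
        """Suspect: targets a shared server, started after the link went down,
        and has already finished (only finished outages get cleaned up)."""
        return (outage["server"] in SHARED_SERVERS
                and outage["started"] >= LINK_DOWN_SINCE
                and outage["finished"] is not None)

    def clean(outages):
        kept = [o for o in outages if not is_false_outage(o)]
        print("Removed", len(outages) - len(kept), "false outages")
        return kept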

I was hoping for a “TODAY” fix. That was more than 12 hours ago, so it’s probably a “next few hours” issue, as it’s already 1.30AM.

Anyway, the move was pretty much a great success, and all worked out nicely, with the exception of this routing issue, which is pretty much in the hands of a third party to fix. I wonder what the SLA is like on the links.

In other “crashing” news, the crashed server remains crashed; as you can imagine, there wasn’t time to get the required photo of the console, or to reboot it.

I’m still not sure of the reasons behind the crashes either, but one thought I had was a 10/100Mbit negotiation issue. That has apparently been ruled out, as it was always running at 100Mbps, both at the testing point and at the data centres (news to me).
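(If anyone wants to check that sort of thing on their own box, the negotiated speed and duplex can be read straight out of sysfs; a small sketch, assuming a Linux machine with an interface called eth0:)

    # Small sketch: read the negotiated link speed and duplex from sysfs.
    # Assumes a Linux box with an interface named eth0 (adjust as needed).
    IFACE = "eth0"

    for attr in ("speed", "duplex", "operstate"):
        path = f"/sys/class/net/{IFACE}/{attr}"
        try:
            with open(path) as f:
                print(attr, "=", f.read().strip())
        except OSError as exc:
            print("could not read", path, "-", exc)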

So, that being the case, there’s still no real explanation for the crashes at this point in time, some months after January when they originally started.

Changed from x86 to x86_64.
Changed from CentOS 4.4 to Fedora 6 (different kernel versions).
Crashed with no additional software running.
Didn’t crash with stress testing.
Memory testing passed successfully.
HDD is fine (unless all 3 are faulty).
Motherboard and CPU would have caused crashes under the tests run if they were faulty, as the operations performed were all generally system-intensive.

I guess the console dump holds the key information here, but that’s still not useful if it’s only short lines like the netconsole output I got, which pointed to the CPU being the issue. That’s not the case, as the tests all passed fine under another OS.

I got excited about the issue being a 10Mbit/100Mbit issue too, but if that’s not the case, what could it be? What on earth could be wrong?! Why does it crash, when it shouldn’t? Why can’t the crashes be replicated?

Why blame hardware when it’s working?
Why blame software when it works on several servers elsewhere?

It’s not an easily identified issue, but hopefully we find out soon enough, so I can start enjoying the dedicated server instead of having it sit there wasting rack space.

I’ve pondered a lot of possibilities, but still don’t have much of an idea what the issue is; nothing is obvious, nothing points anywhere. Well, actually, according to someone I spoke to, the obvious link is that whenever they install it in the data centre and hand it over to me, it crashes. This time it was a clean install with nothing extra on it: SSH and yum were installed by the technician, I installed nano, went to use yum, and it ran away.

That tells me it’s not liking SSH, or yum, or the data centre.

Consider the facts here: yum is used by many servers. In fact, I have 1, 2, 3, 4, 5, 6, 7 servers that I have used yum on, and they have all worked fine. They were all CentOS 4.3/4.4, x86, on differing hardware.

All seemed fine with those, so I think we can rule out yum as having the bugs, considering the server also crashed when yum was not running.

Further, I think we can rule out the OS and kernel, as different versions still see it crash in the data centre.

The only thing the crashes have in common is, well, the data centre. I’m not even logged in when some of the crashes occur, and different kernel versions give the same results.

All testing outside the DC was successful, and without error.

That leads me to the ethernet cable, the rack, the manner in which it is placed in the rack, or some other issue surrounding the server in the rack.

I remain solid on my point that it’s not software: the OS has changed, the software it normally runs wasn’t even loaded (with the exception of SSH and yum), and we don’t see other servers crashing while running yum, nor SSH causing crashes on my other servers.

It doesn’t seem to be hardware either, as the hardware testing they did was all successful, and it even maintained, I think it was 14 days of uptime or more, in their office. It never crashed there, so it can’t be software.

Either something is flowing down the network cable in the data centre that it doesn’t like, the network interface doesn’t like the configuration at the switch, the environment is too cold for it (not likely, it’s an Intel), or the power is dirty (unlikely, it’s all APC gear).

All fingers seem to point at the environment, or perhaps a conflict with the driver for the network adapter.

It’s all very confusing stuff; nothing is clear, and everything we’ve done has only sharpened the contradictory starting points: it’s not hardware, it’s not software, the crashes are random, but they only happen in the data centre, and it doesn’t matter which rack it’s in.

I thought Telstra was weird, expecting Australia to believe their crap about having Australia’s interests at heart when they propose FTTN (yeah, right), but this is by far a weirder issue. It will definitely have a solution, though, if I and the staff at Servers Australia have anything to do with it.
