What do you do when everything is down and your backup failed? Let’s paint a picture using a real scenario.
It’s Monday. I’ve just left my house and notice I have a missed call on my work phone. It’s not from a number I recognize, and I put the voice mail on my car speakers.
The caller is a customer we aren’t responsible for. That is, they don’t pay us to monitor their devices; they just call us every now and then. They have a down event, and they describe the issue as the server starting to boot and then turning back off, in a loop.
I call them, and let them know I’ll be there in their office in about five minutes. I get a verbal nod, and an “okay”, as if it will be longer.
Four minutes later I walk in their door.
The server is indeed stuck in a boot loop, coming to a Windows Stop error every time it attempts to boot. Okay, well, this is a virtual machine on a host, and the host is okay, so I’ll just restore the server from last night’s backup.
Where is last night’s backup? Oh. It’s marked as failed because, guess what, the same power event that caused this issue (buy battery backups!) interrupted the backup. Okay, I’ll just restore from the day before.
Okay, well, we set up this backup to also copy its files to another machine, set up as a NAS, which pushes images to the cloud.
Last copy date is March of last year.
I inspect why this has happened, and the reason is clear: the machine’s last boot time was in March of last year. And the backup is corrupt. The failed server’s virtual hard disk is perfectly readable, by the way, but I can’t even mount it, retrieve the critical files, and restore them onto a year-old backup.
So now what? What do you do in this situation?
First, go take a walk. Because panic has probably set in. I know you don’t want to. I didn’t want to. But what’s the difference, effectively, between telling them their data is gone now, and telling them their data is gone in twenty minutes?
The answer is twenty minutes. There’s no other difference. So go take a walk, and breathe.
When you get back, rethink the issue.
For me, it was a simple case of getting the error code and finding out that there are built-in tools to resolve their particular issue. I found this blessed article that applies to just about every Windows domain controller since Server 2008.
It turns out that when the AD DS log files are corrupt, a Windows Server domain controller can’t boot because, well, it uses that database for all logins, including the administrator. There is, however, a special boot mode (Directory Services Restore Mode) that allows you to log into the DC without needing the database, and then, using a couple of commands, roll any good data into the database, discard any corruption, and recreate the log files.
And, mercifully, after four hours of bad news, you’re back at the admin desktop with all services running.
So they’re back up and running. They’re happy. And I can go back to the office and receive my accolades for somehow turning lemons into lemonade. Again. (This is the literal phrase my boss used.)
Their backup failed, and we saved them from it, so now the client is on our list for daily checkups, as well as basic maintenance and IT calls.
That’s all well and good, but we should talk about how to prevent this scenario. Or rather, what’s being done now to prevent it from ever happening again.
First, the root cause: their server wasn’t on a battery backup. So, when the power went out, the sudden stoppage caused disk errors that led to data corruption.
What does the battery backup do? For starters, it provides a buffer between the server and electrical issues. A power surge, unless the building is struck directly, is unlikely to affect devices attached to Uninterruptible Power Supplies.
These are like surge protectors, except they also have batteries in them to ensure devices keep running for a time after the power goes out.
But what about when the batteries run out? Because they absolutely will.
Any battery backup worth its price has a cable that runs from it to some sort of data I/O on the device it’s supporting. Usually, this is a USB cable. The battery can then let the device know how much time it has left, and the device can decide when to gracefully shut itself down. Ya know, without the data corruption.
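To make that decision concrete, here’s a minimal sketch in Python of the logic a server might apply to the numbers its battery backup reports. Everything here is a stand-in I made up for illustration: the `UpsStatus` fields, the `should_shut_down` function, and the default timings. Real UPS monitoring software has its own settings for this, so treat it as the idea, not the implementation.

```python
from dataclasses import dataclass


@dataclass
class UpsStatus:
    """Hypothetical snapshot of what the UPS reports over its data cable."""
    on_battery: bool       # True when mains power is out
    runtime_seconds: int   # the UPS's own estimate of remaining runtime


def should_shut_down(
    status: UpsStatus,
    shutdown_needs_seconds: int = 180,   # assumed: time a clean shutdown takes
    safety_margin_seconds: int = 120,    # assumed: extra cushion on top of that
) -> bool:
    """Return True when the server should start a graceful shutdown."""
    if not status.on_battery:
        # Mains power is fine, so there is nothing to do.
        return False
    # Shut down while there is still enough battery to finish cleanly.
    return status.runtime_seconds <= shutdown_needs_seconds + safety_margin_seconds


if __name__ == "__main__":
    print(should_shut_down(UpsStatus(on_battery=True, runtime_seconds=3600)))
    print(should_shut_down(UpsStatus(on_battery=True, runtime_seconds=200)))
```

The point of the margin is that the runtime estimate is only an estimate. You want the shutdown finished well before the battery is.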
Second, have a backup. There should not be a circumstance, in a properly managed environment, where a backup fails like this one did. I’ve already outlined the importance of backups, and what to do to ensure you have a good one. Give it a read.
No one was doing those checks on site, because no one had the time to do those checks every morning or afternoon.
Who has the time? I do. It’s my job. It’s a pleasant little hour where I get to sit back and click around on some servers, sipping iced tea every morning.
Have someone who knows what they’re looking at check the backups. If a backup fails, you can at least have someone else fix it, if not the very same person who found the issue.
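A person should still look, but part of that daily check can be scripted. Here’s a rough sketch in Python that flags a backup folder whose newest file is too old, which is exactly what would have caught a last copy date of March of last year. The function names and the 26-hour threshold are my own assumptions for a nightly job, not anything from a particular backup product.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup_age_hours(backup_dir: str, now: Optional[float] = None) -> Optional[float]:
    """Age in hours of the most recently modified file, or None if there are no files."""
    files = [p for p in Path(backup_dir).iterdir() if p.is_file()]
    if not files:
        return None
    newest_mtime = max(p.stat().st_mtime for p in files)
    current = time.time() if now is None else now
    return (current - newest_mtime) / 3600


def backup_is_stale(backup_dir: str, max_age_hours: float = 26.0,
                    now: Optional[float] = None) -> bool:
    """True when there is no backup at all, or the newest one is older than the limit."""
    age = newest_backup_age_hours(backup_dir, now)
    return age is None or age > max_age_hours


if __name__ == "__main__":
    # Hypothetical path: point this at wherever your backup copies land.
    folder = "."
    if backup_is_stale(folder):
        print("Backup is missing or stale. Go look at it.")
    else:
        print("Newest backup is recent.")
```

A recent file date only proves the job ran. It doesn’t prove the backup is readable, and the one in this story was corrupt, so a periodic test restore still belongs on the checklist.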
An entire business was almost brought to its knees because of a power outage. And they could have prevented this nightmare scenario by doing two things.
One of these is more important than the other. You only get one guess.
Backups. It’s the backups.
Check your backups.
Please, if you do nothing else with your technology, check your backups.
Check your backups.
Go check your backups.
Stay safe online, and in the wide world.