A Moving Easter

In the days when I used to visit my Uncle Gerald, who was a keen gardener, he would often present me around this time of year with a large bundle of rhubarb and the instruction to "give these to your Mother and wish her a moving Easter". I suspect that the comment was somehow related to the laxative properties of rhubarb. We haven't had rhubarb in our house lately, but I still managed to have a moving Easter. I was moving all my VMs from a dead server to the backup one.

Yep. Woke up on Good Friday morning with the sun shining and plans for a nice relaxing day in the garden only to find the main server for my network sulking glumly in the corner of the server cabinet with no twinkly lights on the front and no whooshing of stale air from the back. Poke the "on" button and it runs for five seconds then dies again. Start to panic. Keep trying, no luck. Open the box and peer hopefully around inside. Nothing missing, no smoke or burnt bits, nothing looking like it was amiss.

Wiggle some wires and try again (the total extent of my hardware fault diagnosis capabilities). Disconnect the new hard drive I fitted a couple of weeks ago. Look in the BIOS logs, but they're empty. The most I could get it to do on one occasion was run as far as the desktop before it just died again. So, in desperation, phone a local Dell-approved engineer who offers to come and fix it the same day. But after three hours of testing, swapping components, general poking about with a multi-meter, and much huffing and mumbling, he comes to the sad conclusion that the motherboard is faulty. And a new one is going to cost around 500 pounds in real money. Plus shipping and fitting.

The server is only two and a half years old (see Hyper-Ventilation, Act I), and I buy Dell stuff because it usually outlasts the lifespan of the software I run and ends up being donated to a needy acquaintance (with the hard drives removed, of course). But I suppose the sometimes extreme temperatures reached in the server cabinet can't have helped, especially as we've had a couple of very warm years and last week was a scorcher here. Though it has made me trust the backup server I bought at the same time rather less.

Ah, but surely there's no problem when a server fails? Just fire up the exported VM image backups on the other machine and I'm up and running again. Except that, unfortunately, I've been less than strict about setting things up generally on the network. Thing is, I was planning for a disaster such as a disk failure, which is surely more likely than a motherboard failure. With a disk failure it's just a matter of replacing the disk then restoring from a backup or importing the exported VMs. But a completely dead box raises lots of different issues. I know I should have nothing running within the Hyper-V host O/S, but somehow I ended up with one server having the backup domain controller running on the host O/S and the other (the main one) with the host O/S running WSUS, the SMTP server, Windows Media Services, the scheduled backup scripts, the website checker, and probably several other things I haven't discovered yet.

Therefore, while the main server's hosted VMs (the FSMO domain controller, web server, ISA server, and local browser) fired up OK on the backup server, all the other stuff that makes the network work was gone. And then it got worse. The backup of the FSMO domain controller was a week old, and so it kept complaining that it didn't think the FSMO role was valid. And none of the recommended fixes using the GUI tools or ntdsutil worked. So I ended up junking the FSMO domain controller, forcing seizure of the roles on the backup domain controller, and then using ntdsutil to clean up the AD metadata. Afterwards, I discovered this document about virtualizing a domain controller which says "Do not use the Hyper-V Export feature to export a virtual machine that is running a domain controller" and explains why.

I certainly recommend you read the domain controller document. There's a ton of useful information in there, even though much is aimed at enterprise-level usage. However, when you get to the part about disabling write caching and using the virtual SCSI disk controller, look at this document that says you must use the virtual IDE controller for your start-up disk in a VM. But, coming back to the issue of backing up/exporting a VM'd domain controller, it looks like the correct answer is to run a regular automated backup within the DC's VM to a secure networked location instead. I've set it up for both the virtual and physical DCs to run direct to a local share and then get copied to the NAS drive, which will hopefully give me a fighting chance of getting my domain back next time. After you set up a scheduled backup in Windows Server Backup manager you can open Task Scheduler, find the task in the Microsoft | Windows | Backup folder, and change the schedule if you want something different from one or more times a day. And make sure any virtual DC VMs are set to always start up when the host server starts so that the FSMO DC can confirm it actually is the valid owner of the roles.
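For what it's worth, the "copied to the NAS drive" step could be sketched in a few lines of Python. This is only an illustration and not the script I actually run: the function names are made up, and it assumes each backup run lands in its own sub-folder of the local share, which may not match how your backup tool lays things out.

```python
import shutil
from pathlib import Path
from typing import Optional


def newest_backup(share: Path) -> Optional[Path]:
    """Return the most recently modified sub-folder of the share, or None if there is none."""
    folders = [p for p in share.iterdir() if p.is_dir()]
    return max(folders, key=lambda p: p.stat().st_mtime, default=None)


def copy_to_nas(share: Path, nas: Path) -> Optional[Path]:
    """Copy the newest backup folder to the NAS location and return the copy's path.

    Hypothetical helper: if a folder with the same name already exists on the
    NAS it is left alone, so a half-finished earlier copy would not be repaired.
    """
    latest = newest_backup(share)
    if latest is None:
        return None
    target = nas / latest.name
    if not target.exists():
        shutil.copytree(latest, target)
    return target
```

You would call it from a scheduled task with something like `copy_to_nas(Path(r"D:\DCBackups"), Path(r"\\nas\backups\dc"))`, where both paths are placeholders for wherever your own share and NAS folder live.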

It does seem that a workable last resort disaster recovery strategy, if a DC does fail, is to force its removal from the domain and rebuild it from scratch. As long as you have one DC still working, even if it's not the FSMO, you should still be able to get (most of) your domain back by using it to seize the FSMO roles that were held by the dead DC and then cleaning it up afterwards. However, I wouldn't recommend this as a back-up strategy.

So after spending most of the holiday weekend with my head in the server cabinet, I managed to get back to some level of normality. I'm still trying to resolve some of the issues, and still trying to figure the ideal solution for virtualized and physical domain controllers. There's tons of variable advice on the web, and all of it seems to point to running multiple physical servers to overcome the problem of a virtualized DC not being available when a host server starts. Nobody is suggesting running Hyper-V on the domain controller host. However, my backup server that is valiantly and temporarily supporting the still working remnants of my network has both Domain Services (it's the FSMO domain controller) and Hyper-V roles enabled (it's hosting all the Hyper-V VMs).

Even though no-one seems to recommend this, they do grudgingly agree that it works and it does seem to be one way to cope with redundancy and start-up issues on a very small and lightly loaded network like mine, and when I get a new server organized it will also be a DC. Meanwhile I've created a "server operations" VM that contains all the other stuff that I lost - WSUS, SMTP server, Media Services, scheduled backup scripts, web site monitoring, etc. That way all I actually need on the base hosting server is Active Directory (so it is a DC) and the Hyper-V role with the correct network configuration. Oh, and the correct UPS configuration. And probably more esoteric setup stuff I'll only find out about when I get there.

Mind you, after I complained to my Dell sales guy about the failed server he's done me an extremely good deal on a five year pro support warranty with full onsite maintenance for the new box. So next time it fails I can just phone them and tell them to come and fix it. And until it arrives and is working so that I again have some physical server redundancy, I can only ruminate as to whether the fear of waking up to a dead network is as good a laxative as rhubarb...