Thursday, 22 October 2009

Administration: do you know your DR process?

"This computer can't connect to the remote computer" is a verbose error message that us server administrators dread seeing when we connect to a server farm via RDP. A typical reaction would be to blame the local Internet connection (which in my case is a certain privatised former state telecommunications operator that will remain anonymous). Most of the time, we will try to access a site that is known to be "reliable" to confirm this is the case - I normally use Gmail. However, what would YOU do if Gmail was accessible and your server(s) wasn't?

As I have an unusually busy weekend ahead of me, I thought I would quickly share with you a recent experience of mine and the various "what if" questions that sprang to mind.

It was a late night in the office, and the sysadmin team was busy deploying the most recent batch of MS security patches to our farm. Notifications had been sent to our client base warning them of downtime, so I wasn't too concerned when I saw the error above while attempting to RDP to one of our ISA Server boxes.

However, minutes quickly turned into an hour, and the command prompt window I had opened to ping the server ("ping fqdn -t") was still returning timeouts. I knew something was amiss, and after checking with the rest of the team it soon became obvious that the server in question had hung during the compulsory reboot that followed the deployment of the security patches.
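Rather than staring at a window full of "Request timed out" lines, you could leave a small loop like the one below running to record the moment the box starts answering again. Again, just a sketch: server.example.com is a placeholder, it appends to a recovery.log file in the current directory, the pause uses the old ping-localhost trick so it works on older Windows boxes, and if ICMP is blocked on your network it will tell you nothing.

    @echo off
    REM Rough sketch: wait until the server answers ping again, then note the time.
    REM server.example.com is a placeholder for the FQDN you are watching.
    :wait
    ping -n 2 server.example.com | find "TTL=" >nul
    if errorlevel 1 (
        REM Pause roughly 30 seconds before trying again (ping-localhost sleep trick)
        ping -n 31 127.0.0.1 >nul
        goto wait
    )
    echo %date% %time% - server.example.com is answering ping again >> recovery.log
    echo Server is responding again - details appended to recovery.log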

A few questions crossed my mind at this point, and they made me realise that our DR process was more than a little ambiguous:

Ask yourself
  • What number would you use to contact your data centre? Where is it documented (and would it be available in a disaster...)?
  • When would you "sound the alarm" and notify other members of staff? Who would you escalate the issue to internally? What if they weren't available?
  • At what point would you notify clients of the service disruption?
  • What kind of authorisation would be required at the datacentre to approve a hard reset if this was necessary, and would the relevant individuals within your organisation be available?
  • Would you have hardware resources available to rebuild a server at a moment's notice if necessary?
  • Are there any virtual environments available as a temporary resolution?
  • Are software and backups readily available to allow a new server to be rapidly provisioned?
As it happened, I was very lucky. In the absence of a clearly defined process, I decided to call our data centre right away to clarify what would be required to obtain access to the server farm if necessary. They informed me that a phone call to one of our "authorised contacts" would be sufficient for them to reboot the server themselves, which just so happened to resolve the issue.

I still haven't had a chance to identify the root cause of the issue, but the lessons I learnt from this experience, which I hope will be of use to you, are as follows:
  • Ensure all contact details are thoroughly documented for both internal and external escalations. This includes the order in which contacts should be called and (ideally) a minimum of two phone numbers for each contact.
  • Ensure escalation and notification times are clearly specified to remove ambiguity. In this context, a "gut feeling" probably isn't clear enough.
  • Ensure the correct process for obtaining appropriate security clearance is documented and that staff are made aware of said process.
  • Ensure sufficient resources (hardware, software, people) are available to deal with a crisis, especially where the nature of your service depends on high availability.
Ideally, all of this information should already be documented as part of a more general DR strategy, but I would certainly recommend asking yourself a few basic "what if" questions about disaster scenarios to see whether you are prepared.
