Tuesday 20 July 2010

SharePoint disaster recovery on a budget

A post over on the MSDN forums got me thinking about disaster recovery in SharePoint. I thought I would share my thoughts, given that availability and disaster recovery matter to anyone hosting (or using) SharePoint. Throughout the article I draw on the 2010 and 2007 TechNet articles entitled "Plan for availability".

In the forum post, the OP asks for feedback on his SharePoint 2007 DR strategy:
  • The farm consists of two physical DB servers and two physical Web servers. One physical Web server is virtualised meaning there are 3 Web servers in total. Note that it is assumed that these servers also host the query and index role.
  • There is a very limited budget (as is often the case) in that only one server has been approved for DR purposes.
  • No availability (e.g. uptime %) requirements are specified.
  • No capacity (e.g. number of users, RPS) requirements are specified.
  • The OP plans to implement a "stretched farm" over a WAN link. We can assume here that the two sites are not closely linked, as it is stated that the backup server is "in another part of the world". The DR server would be added to the farm as a Web server, and other roles would be added if required.
Frankly, being given an entire server in a separate data centre for DR purposes is more than a lot of SharePoint administrators could hope for. Although in an ideal world we would all have a completely redundant DR farm in a nearby data centre on hot standby, the OP's question is a lot closer to reality for most organisations.

However, I did think of a few potential issues with the OP's suggestion:
  • Stretched farms are only realistic where the servers are in close proximity and connected by high-speed links. That means less than 1 ms of latency and at least 1 Gbps of bandwidth.
  • The SharePoint configuration and central administration content databases contain computer-specific information, meaning that the restoration environment must contain the same topology and server roles, which would not be possible on a single server.
  • There is no mention of the other SharePoint infrastructure requirements, including DNS and a user directory, although we will assume for this scenario that the OP has included those services as part of their DR strategy anyway.
  • Aggregate capacity requirements need to be met on the single DR server for the duration of the server's use. This means whatever resources were used on the "live" farm servers need to be available on the standby. Amongst others, this includes CPU cycles, RAM, disk capacity, disk IO requirements, and network bandwidth.

Sample aggregate resource requirements

Note that the aggregate totals displayed here cover only the minimum system requirements and do not include network and disk capacity. Realistically, each server may well have a lot more hardware resources. For example, assuming each of the five servers (two DB, three Web) has 8 GB of RAM, 100 GB of disk capacity and a 3 GHz dual-core CPU (not uncommon in modern 64-bit MOSS environments), the requirements are suddenly a lot tougher to meet on a budget: the OP would need a server with 40 GB of RAM, 500 GB of disk space and 10 CPU cores!
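
As a rough illustration of that arithmetic, here is a minimal Python sketch that sums the per-server figures from the example. The `ServerSpec` class and `aggregate` function are names I have made up for illustration, and the sketch assumes all five servers share the same spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """Hardware resources of one server (hypothetical helper type)."""
    ram_gb: int
    disk_gb: int
    cpu_cores: int


def aggregate(specs):
    """Sum the resources of every live server the standby must replace."""
    return ServerSpec(
        ram_gb=sum(s.ram_gb for s in specs),
        disk_gb=sum(s.disk_gb for s in specs),
        cpu_cores=sum(s.cpu_cores for s in specs),
    )


# Two DB servers plus three Web servers, each with the example spec
# (8 GB RAM, 100 GB disk, one dual-core CPU).
live_farm = [ServerSpec(ram_gb=8, disk_gb=100, cpu_cores=2)] * 5

total = aggregate(live_farm)
print(total)  # ServerSpec(ram_gb=40, disk_gb=500, cpu_cores=10)
```

In practice the live servers rarely match, so you would list each one with its own figures rather than multiplying a single spec.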

From these observations I concluded that:
  • A separate standby farm is required - as we only have one server available this will be a standalone server with all roles on one box.
  • The standby server needs to meet the aggregate capacity requirements provided by the existing servers for the duration of the outage.
The standby type (cold, warm, hot) depends on the OP's availability requirements. These are detailed in "Plan for disaster recovery (SharePoint Server 2010)". Given that there is clearly a limited budget, one can assume that maintenance and configuration costs need to be kept to a minimum, in which case a hybrid approach seems preferable. Here is one possible approach based on Microsoft's guidance on provisioning a hot standby data centre:
  • Create a separate DR farm and apply all Windows, SQL, WSS and MOSS updates to match the live environment as closely as possible.
  • Deploy all SharePoint customisations to the DR farm.
  • Create and configure all Web applications on the DR farm, restoring all content databases from the live farm.
  • Test it!
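
One small part of testing is confirming that the DR farm really does match the live patch levels. The following is only a sketch in Python: it assumes you have already collected version strings for each component by whatever means suits you, and the `find_version_drift` function, the inventories and the version strings are all hypothetical placeholders rather than real build numbers.

```python
def find_version_drift(live, dr):
    """Return {component: (live_version, dr_version)} for every mismatch.

    A component missing from one side is reported with None on that side.
    """
    drift = {}
    for component in sorted(set(live) | set(dr)):
        live_v, dr_v = live.get(component), dr.get(component)
        if live_v != dr_v:
            drift[component] = (live_v, dr_v)
    return drift


# Hypothetical inventories; the version strings are placeholders.
live_farm = {"Windows": "w-2", "SQL": "s-5", "WSS": "x-3", "MOSS": "x-3"}
dr_farm = {"Windows": "w-2", "SQL": "s-4", "WSS": "x-3"}

for name, (live_v, dr_v) in find_version_drift(live_farm, dr_farm).items():
    print(f"{name}: live={live_v} dr={dr_v}")
```

An empty result means the two inventories agree; anything else is a list of updates still to apply to the DR server.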
Going forward, the maintenance required really depends on required availability. If the business can't cope without SharePoint for more than a few hours, then the OP needs to consider configuring SQL log shipping to keep the content databases synchronised, and will need to ensure that all live updates are also deployed to the DR server at all times. If, on the other hand, a day's worth of downtime is acceptable, the OP may decide to simply document any live changes and deploy them to the DR server on an "as-needed" basis.
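
The trade-off above could be written down as a simple rule of thumb. This Python sketch is only that: the function name is made up, and the 24-hour cut-off is my own assumption standing in for "a day's worth of downtime", not a figure from Microsoft's guidance.

```python
def suggest_maintenance_approach(tolerable_downtime_hours):
    """Map tolerable downtime to one of the two approaches described above.

    The 24-hour threshold is an assumed cut-off for illustration only.
    """
    if tolerable_downtime_hours <= 0:
        raise ValueError("tolerable downtime must be a positive number of hours")
    if tolerable_downtime_hours < 24:
        return "log shipping, plus every live update deployed to DR"
    return "document live changes and deploy to DR as needed"


print(suggest_maintenance_approach(4))
print(suggest_maintenance_approach(24))
```

The point is that the business needs to agree the downtime figure first; the maintenance effort follows from it.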

Of course, this is just one possible approach. If you have any suggestions or improvements I'd like to hear them!

Benjamin Athawes

7 comments:

  1. Good read, Benjamin! Thanks for sharing your observation, analysis, and thought process! I think that a large number of small and mid-size SharePoint shops will benefit from what you've written.

  2. Ah, this post highlights the complex problems that SharePoint system admins face on a daily basis. My company, Azaleos, is showing some love to system admins everywhere- we are giving away a 16GB iPad to the best IT disaster story (and what you did to save the day) to celebrate National System Administrator Appreciation Day (July 30). To submit your story or nominate someone else, go to www.azaleos.com/Company-Info/Celebrate-National-Sys-Admin-Appreciation-Day

  3. Great post!

    I think I am on the right track... I started a thread on the MSDN forums here: http://social.msdn.microsoft.com/Forums/en-GB/sharepoint2010general/thread/c3fa229e-25ac-4df6-a92a-3bafd8a1e016

    Perhaps you could share some insightful comments? :)

    I am doing my first SharePoint implementation that spans multiple sites. Basically we will have an overseas office with 7 people, and they need to have access to http://intranet (an SP 2010 site) in case the link goes down.

    With regard to SQL Server availability, I am pretty sure that asynchronous mirroring is the way to go. Since we will have limited IT administration at this remote office, things must be as automated as possible, so we might as well add a SQL witness server.

    My concern is more with regard to the SharePoint farm. Since this will be a remote office, and thus on a different network segment from my current SharePoint farm, a stretched farm wouldn't apply. So it is my understanding that I would have to do a stand-by farm. Again, because we'll have limited IT administrators at this remote office, I reckon that a hot stand-by farm would be appropriate.

    The servers will all be virtualised, and since this remote office will only have 7 people, I would start with all SharePoint services running on the same server (as we expand I will then consider a separate crawl/index server and a separate server for BI and other applications).

    It is also my understanding that when installing SharePoint it is recommended to use some sort of automation script so all the databases are neatly named (which is important with stand-by farms).

    My concern is about the HA of the front-end servers. In past projects I would often leave my network gurus to take care of ensuring that http://intranet reaches the appropriate server, either by DNS round robin, Windows Load Balancing or some 3rd party accelerator. This time, however, I am pretty much on my own (and a bit rusty on SP). So I am not sure how much configuration I would have to do in Windows DNS, IIS, or even in SharePoint Central Administration on both farms.

    Thanks for the help and congratulations on this great blog!

  4. Cheers for the comment pmdci.

    You are correct in that while SQL mirroring is a form of redundancy, it is primarily a LAN technology.
    Log shipping would be appropriate for keeping content databases in sync with a remote farm over a WAN - see this TechNet article for more info: http://technet.microsoft.com/en-us/library/dd890507(office.12).aspx#requirements

    As regards the configuration database, it is environment specific so you would need to either update this manually or via a script. Therefore this DR farm scenario works best where the IA (Web apps, site collections etc) is relatively static.

    As far as front end HA is concerned it sounds as though this is quite a small farm and Windows NLB will do just fine.
    You basically configure a clustered IP address for the appropriate WFE servers, point your host names to the IP via DNS and the software takes care of the rest.

    Note that if you are using SSL, NLB configuration is complicated slightly in that it is best to ensure that each IIS site has its own dedicated IP address.

    Steve Smith from Combined Knowledge wrote a great white paper on configuring NLB here: http://www.combined-knowledge.com/Downloads/2007/How%20to%20scale%20out%20a%20SharePoint%20farm%20and%20configure%20IIS%207%20Microsoft%20Network%20Load%20Balancing%20on%20windows%20server%202008.pdf

  5. Thanks, it was interesting to hear your take on things, I'll be sure to check in again soon. As for the maintenance part - I hardly know of large businesses that can cope with a day's downtime... As they say, time is money!

  6. Hmm.......nice sharing it's really informative as well as useful info.

  7. Nice post. I bookmark your url and I will recommend this post to my friends.
