Over the last couple of months I have been contacted by various folks who thought long and hard about their Business Continuity and Disaster Recovery design. They bought a great backup solution which integrated with vSphere and they replicated their SAN to a second site. In their mind they were definitely prepared for the worst… I agree with that to a certain extent; their design was well thought-out indeed and carefully covered all the aspects there are to BC/DR. From an operational perspective though, things looked different: the first significant failure occurred and they could not fully recall the steps needed to recover. That is what inspired my tweet below…
I don’t care how solid your infra architecture is; if you don’t have a well documented recovery plan you are doomed. #bcdr #epicfail
— Duncan Epping (@DuncanYB) July 4, 2013
Funny thing is that this tweet also triggered some responses like “Go SRM” or “that is where Zerto comes in”, and again I agree that an orchestration layer should be part of your DR plan. But when talking about BC/DR I think it is more about the strategy, the processes that will need to be triggered in a particular scenario. What is typically involved? I am not even going into the business-specific side of things and all the politics that comes along with it. Instead, look at your process, take one step back and ask yourself: what if this part of the process fails?
One of the things Lee and I will mention multiple times during our VMworld session on Stretched Clusters is: Test It! Not once, not twice, but various times, and be prepared for the worst to happen. Yes, none of us likes to test the most destructive and disruptive failure scenario, but you can bet that when something goes wrong it will be the scenario you did not test. Although I think, for instance, SRM is a rock-solid solution, what if for whatever reason your recovery plan does not work as planned? While testing, make sure you document your recovery plan; even though you might have a bunch of scripts lying around, who knows if they will work as expected? Some scripts (or SRM-type solutions) have a dependency on certain components / services being up, so what if they are not? Besides your BC/DR strategy, of course, a lot of procedures will need to be documented. What kind of procedures are we talking about? At a bare minimum, here are a couple of random ones I would suggest you document while testing your scenarios:
- Order in which to power on (and power off) all physical components in your datacenter
- Location of infrastructure-related services (AD, DNS, vCenter, syslog, NTP, etc); when they are virtual and sitting on the SAN, document which datastore they reside on for instance
- Order in which to power on all infrastructure-related services
- Order in which to power on all remaining virtual machines / vApps
- How to get your vCenter Server up and running from the command line (this will make it a lot easier to get the rest of your VMs up and running)
- How to power on virtual machines from the command line after a failure
- How to re-register a virtual machine from the command line after a failure
- How to mount a LUN from the command line after a failover
- How to resignature a LUN from the command line after a failover (the sketches below this list give an idea of the kind of commands you could capture for these steps)
- How to restore a full datastore
- How to restore a virtual machine
- etc etc
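To give an idea of what that documentation could look like, here is a minimal sketch of the kind of commands you could capture for re-registering and powering on virtual machines straight from the ESXi shell. This assumes an ESXi 5.x host with shell access; the datastore name, VM folder and VM ID below are just examples, so verify everything against your own environment first.

```
# List the VMs registered on this host and note the Vmid column
vim-cmd vmsvc/getallvms

# Re-register a VM that is on a datastore but missing from the inventory
# (example path, use your actual datastore and .vmx location)
vim-cmd solo/registervm /vmfs/volumes/datastore01/vcenter01/vcenter01.vmx

# Check the power state and power the VM on using its Vmid (42 is an example)
vim-cmd vmsvc/power.getstate 42
vim-cmd vmsvc/power.on 42

# If the VM has a pending question (for instance "moved or copied?"), list it
vim-cmd vmsvc/message 42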
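And along the same lines, a sketch of mounting or resignaturing a replicated LUN after a failover, again assuming an ESXi 5.x host and using an example volume label:

```
# List VMFS volumes the host detects as snapshots/replicas
esxcli storage vmfs snapshot list

# Mount the replica while keeping its existing signature
# (only do this when the original volume is not presented to the host)
esxcli storage vmfs snapshot mount -l "datastore01"

# Or assign a new signature so it can be mounted next to the original;
# the volume will show up as "snap-xxxxxxxx-datastore01"
esxcli storage vmfs snapshot resignature -l "datastore01"

# Rescan so the freshly mounted / resignatured volume is picked up
esxcli storage core adapter rescan --all

# On older hosts the esxcfg-volume equivalents are:
#   esxcfg-volume -l               (list snapshot volumes)
#   esxcfg-volume -M datastore01   (persistently mount, keep signature)
#   esxcfg-volume -r datastore01   (resignature)
```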
Now I can hear some of you think: why would I document that, I know all of that stuff inside out? Well, what if you are on holiday or at home sick? Just imagine your junior colleague is by himself when disaster strikes; does he know in which order the services of that business-critical multi-tier application need to start?
When you do document these, make sure to have a (physical) copy available outside of your infrastructure. Believe me… you wouldn’t be the first to find yourself locked out of a system, go looking for the documents to recover it, and then realize they are stored on the very system you need to recover. Those who have ever been in a total datacenter outage know what I am talking about. I have been in the situation where a full datacenter went down due to a power outage; believe me when I say that bringing up over 300 VMs and all associated physical components without documentation was a living nightmare.
Although you probably get it by now… it is not the tool; a proper strategy, procedures and documentation are the key to success! Just do it.
I completely agree with you. Depending on the technology is one step, and that part is guaranteed most of the time. But knowing what the second and third steps are, and in which order they are carried out, is less easy and should be well documented. SRM and Zerto are great tools that help you do an actual failover, but if you don’t keep up with reality you’re doomed. In the end it is not only about technology; think of the organization itself. Some time ago I wrote a knowledge article addressing this issue, which is a rough guide: http://www.mikes.eu/download/2011KS_Mikes-Disater_Recovery_in_a_Cloudy_Landscape.pdf
This is exactly what I say to my clients. SRM and Zerto are just tools. It is policy, procedures and preparedness that are really the key components of a BC/DR strategy. In a true DR situation the 7 P’s are never more needed, because we all know that prior preparation and planning prevents piss-poor performance.
I also agree; having a procedure to follow during a stressful situation helps. Under stress we can make mistakes if we rely only on memory to do everything… Great write-up as usual, Duncan.
I agree about the procedure, but if you are the one who writes it you should not be the one who tests it; having another person test it ensures you end up with a fully documented procedure.
What great timing for this blog post! I have been thinking a lot about this subject over the past week. Completely agree. So in the case where you have a vMSC cluster stretched across sites, it seems to me you should have at least two types of plans: one for a planned site failover (or disaster avoidance) and one for an unplanned failure/failover (DR). In regards to the testing – testing in a lab with vMSC is pretty easy, especially if you use the failure scenarios in the whitepaper. But when it comes to periodically testing a site disaster in a vMSC production environment, how do most folks test that? Unplug everything at one site and pray? Yikes. Testing the disaster avoidance/planned failover in production is a lot easier.
I’m repeating the same things as other people commenting on your post.
Customers take this task far too lightly. They think that SRM or Zerto are the definitive solution. When I tell them that “maybe” a DR plan is needed, they look at me surprised: “doesn’t this tool make the difference in DR/BC?” Yes, it does, but it’s a tool!!! A car is useful to reach some place, but don’t you still need the skill to drive it?
Sometimes I think that these tools are even too simple for certain people…
Sounds like we’re all in vociferous agreement, Duncan!
I’ve often told people the *last* thing they should look at is the technology and tools used for DR. The FIRST should be doing a business impact analysis, finding out dependencies, discovering critical systems that underpin everything else… the result of that analysis will inform the disaster recovery plan so you get it right!
This article made me think of when I was camping a while ago. Some people at the campsite found their tent burned down. Yes, they did have a spare tent, but they kept it in the tent that burned down. Bottom line: keep your vital documentation/information far away from where disaster can strike.
Far too many clients are enthusiastic about the solution we offer and leave all these aspects behind. In order to give them a basis to think about, do you have any link that explains what the structure of a reliable DR plan should look like?
Thank you all
Raffaello
Nice article. Where I work, they are nuts about the DR plan. It’s global, and the amount of specific documentation on execution, preparation, roles, work instructions, service operations procedures, the service operations manual, return to normal, etc., is so big that it made the SRM design and implementation look like a tiny, tiny step in a huge process.