BC-DR

Site Recovery Manager is not about installing… Part II

Duncan Epping · Jan 12, 2009 ·

I’ve been playing around with Site Recovery Manager these last couple of days. Installing it was really easy and same goes for the basic configuration. I already wrote a blog about this topic a month ago or so but now I’ve experienced it myself. Most of the time during a Site Recover Manager project will be spent during the Plan & Design phase and writing documentation. I will just give you one example why. The following was taken from the SRM Course material:

Datastore Group
Replicated datastores containing the complete set of virtual machines that you want to protect with SRM

Protection Group
A group of virtual machines that are failed over together during test and recovery

For those who don’t know, there’s a one on one mapping between Datastore Groups and Protection Groups. So in other words, once you’ve mapped a Datastore Group to a Protection Group there’s no way of changing it without having to recreate the Protection Group.

I think a picture says more than a 1000 words so I stole this one from the Evaluator Guide to clarify the relationship between datastore, Datastore Groups and Protection Groups:

Notice that there are multiple datastores in Datastore Group 2 because VM4 has disks in both datastores. So these datastores are joined into one Datastore Group. This Datastore Group will have a one to one relationship with a Protection Group. Keep in mind, this is really important: a Protection Group contains VM’s that are failed over together during test and recovery.

If you’ve got VM’s with multiple disks on multiple datastores with no logic in which disk is placed on which datastore you could and probably will end up with all datastores being member of the same Datastore Group. Being member of the same Datastore Group means being part of the same Protection Group. Being part of the same Protection Group will result in a less granular fail-over. It’s all or nothing in this case and I can imagine most companies would like to have some sort of tiering model in place or even better fail over services one at a time. (This doesn’t mean by the way that if you create multiple Protection Group that you can’t fail over everything at the same time, they can all be joined in a Recovery Plan)

Some might think that you would be able to randomly add disks to datastores after you finished configuring. This clearly isn’t the case. If you add a disk to a protected(!) VM the Datastore Group will be recomputed. In our situation this meant that all VM’s in the “Medium Priority” Protection Group were moved over to the “High Priority” Protection Group. This was caused by the fact that we added a disk to a “Medium Priority” VM and placed it on a “High Priority” datastore. As you can imagine this also causes your Recovery Plans to end up with a “warning”, you will need to reconfigure the moved VM’s before you can fail them over as part of your “High Priority” datastore. (Which probably wasn’t the desired strategy…)

When I was searching the internet for information on SRM I stumbled upon this article on the VMware Uptime blog by Lee Dilworth. I’ve taken the following from the “What we’ve learnt” post, which confirms what we’ve seen the last couple of days:

Datastore Group computation is triggered by the following events:

Existing VM is deleted or unregistered

VM is storage vmotioned to a different datastore

New disk is attached to VM on a datastore previously not used by the VM

New datastore is created

Existing datastore is expanded

So in other words, moving VM’s from one Datastore to another or creating a new disk on a different Datastore can cause problems because the Datastore Group computation will be re-run. Not only do you need to take virtual disk placement in consideration when configuring SRM, you will also need to be really careful when moving virtual disks. Documentation, Design and Planning is key here.

I would suggest documenting current disk placement before you even start implementing SRM, and given the results you might need to move disks around before you start with SRM. Make sure to check your documentation and design before randomly adding virtual disks when SRM has been implemented. Documenting your current disk placement can be done easily with the script that Hugo created this week by the way, and I would suggest to regularly create reports and save them.

Expect some more SRM stuff coming up over the next couple of weeks.

EnableResignature and/or DisallowSnapshotLUN

Duncan Epping · Dec 11, 2008 ·

I’ve spend a lot of time in the past trying to understand the settings for EnableResignature and DisallowSnapshotLUN. It had me confused and dazzled a couple of times. Every now and then I still seem to have trouble to actually understand these settings, after a quick scan through the VCDX Enterprise Study Guide by Peter I decided to write this post and I took the time to get to the bottom of it. I needed this settled once and for all, especially now I start to focus more on BC/DR.

According to the San Config Guide(vi3_35_25_san_cfg.pdf) there are three states:

EnableResignature=0, DisallowSnapshotLUN=1 (default)
In this state, you cannot bring snapshots or replicas of VMFS volumes by the array into the ESX Server host regardless of whether or not the ESX Server has access to the original LUN. LUNs formatted with VMFS must have the same ID for each ESX Server host.
EnableResignature=1, (DisallowSnapshotLUN is not relevant)
In this state, you can safely bring snapshots or replicas of VMFS volumes into the same servers as the original and they are automatically resignatured.
EnableResignature=0, DisallowSnapshotLUN=0 (This is similar to ESX Server 2.x behavior.)
In this state, the ESX Server assumes that it sees only one replica or snapshot of a given LUN and never tries to resignature. This is ideal in a DR scenario where you are bringing a replica of a LUN to a new cluster of ESX Servers, possibly on another site that does not have access to the source LUN. In such a case, the ESX Server uses the replica as if it is the original.

The advanced LVM setting EnableResignature is used for resignaturing a VMFS volume that has been detected with a different LUN ID. So what does the LUN ID has to do with the VMFS volume? The LUN ID is stored in the LVM Header of the volume. The LUN ID is used to check if it’s the same LUN that’s being (re)discovered or a copy of the LUN that’s being presented with a different ID. If this is the case the VMFS volume needs to be resignatured, in other words the UUID will be renewed and the LUN ID will be updated in the LVM header.

UUID, what’s that? Chad Sakac from EMC described it as follows in his post on VMFS resignaturing:

It’s a VMware generated number – the LVM signature aka the UUID (it’s a long hexadecimal number designed to be unique). The signature itself has little to with anything presented by the storage subsystem (Host LUN ID, SCSI device type), but a change in either will cause a VMFS volume to get resigned (the ESX server says “hey I used to have a LUN with this signature, but it’s parameters were different, so I better resign this”).

Like Chad says the UUID has little to do with anything presented by the storage subsystem. A VMFS volume ID aka UUID looks like this:

42263200-74382e04-b9bf-009c06010000

1st part – The COS Time when the file-system was created or re-signatured
2nd part – The TSC Time; an internal time stamp counter kept by the CPU
3rd part – A random number
4th part – The Mac Address of the COS NIC

Like I said before, and this is a common misconception so I will say it again, the LUN ID and the Storage System product ID are stored in the LVM header and not the actual UUID itself. Not that it really matters for the way the process works though.

I think that makes it clear when to use EnableResignature and when not to use it. Use it when you want to access VMFS volumes of which the LUN ID changed for whatever reason. For instance a fail over to a DR Site with different LUN numbering or SAN upgrades which caused changes in LUN numbering.

That leaves DisallowSnapshotLun. I had a hard time figuring out when to set it to “0” and when to leave it at the default setting “1”. But found the following in a VMworld Europe 2008 presentation:

DisallowSnapshotLun: Should be set to “0” if SCSI Inquiry string differs between the two Array’s in order to allow access to datastores.

I googled for “SCSI Inquiry” and I found the following in a document by HP:

The storage system product ID retrieved from the SCSI Inquiry string (Example: HSV210)

In other words, when you’ve got an HP EVA 4000 and an HP EVA 8000 which are mirrored you need to set DisallowSnapshotLun to 0, when a fail-over has occurred. The SCSI Inquiry string would differ because the controllers would be of a different model. (The SCSI Inquiry string also contains the LUN ID by the way.)

When both sites are exactly the same, including LUN ID’s, you don’t need to change this setting. Leave it set to 1. Be absolutely sure that when you set DisallowSnapshotLun to 0 that there’s only 1 “version” of the VMFS volume presented to the host. If for some reason both are presented data corruption can and probably will occur. If you need to present both LUNs at the same time, use EnableResignature instead of DisallowSnapshotLun.

Depending on the way your environment is setup and the method you chose to re-enable a set of LUNs you may need to re-register your VM’s. The only way to avoid this is to use DisallowSnapshotLun and pre-register all VM’s on the secondary VirtualCenter server or use just one VirtualCenter server.

Re-registering can be done with a couple of lines of script on just one ESX box:

for i in `find /vmfs/volumes/ -name "*.vmx" ` do echo "Registering VM $i" vmware-cmd -s register $i done

You can change the EnableResignature or DisallowSnapshotLun setting as follows:

open vCenter
Click on a host
Click on “Configurations” tab
Click on “Advanced Settings”
Go to “LVM”
Change appropriate setting
Click “Ok”
Rescan your HBA’s (Storage Adapters, Rescan)

It’s also possible to use the command line to enable DisallowSnapshotLun or EnableResignature:

echo 0 > /proc/vmware/config/LVM/DisallowSnapshotLUN echo 1 > /proc/vmware/config/LVM/EnableResignature

I do want to stress that setting the options should always be used temporarily considering the impact these changes can have! When you set any of both options reset them to the default. The big question still remains, would I prefer resignaturing my VMFS volumes or setting “DisallowSnapshotLun” to “0” to be able to access the volumes? Well the answer is:”It depends”. It heavily depends on the type of setup you have, I can’t answer this question without knowing the background of an environment. The safest method definitely is Resignaturing.

Before you decide read this post again and read the articles/pdf’s in the links below that I used as a reference:

Updates for the VMFS volume resignaturing discussion
HP disaster tolerant solutions using Continuous Access for HP EVA in a VI 3 environment
Fibre Channel SAN Configuration Guide
VMFS Resignaturing by Chad Sakac

SRM, it’s just too easy

Duncan Epping · Nov 20, 2008 ·

You’ve probably also noticed a whole bunch of Site Recovery Manager(SRM) related articles popping up with people installing and configuring it in their home lab:

I love these articles because they are prove of the fact that SRM is really easy to set-up. But, and this actually scares me, it might seem a bit too easy. I said “too easy” because implementing a Disaster Recovery solution isn’t about the tools you are using. The tools, which will make your life a lot easier, are not the most important piece of the puzzle. Indeed PUZZLE.

There a whole bunch of SRM projects going on globally where VMware PSO, the department I work for, is assisting. These projects typically have a duration of 3 to 9 months, while it seems that with the ease of VMware Site Recovery Manager this should be a matter of days.

People tend to forget that the most important thing about Distaster Recovery / Business Continuity is the business. You need to know the organisation and IT environment very well before you can even start:

SLA’s? –> RPO / RTO?
Which services are most important to the business?
Which servers are part of the service?
In which order need these be started?
Which service have the highest priority?
Are there any dependencies between services?
What about the desktops?

And these are just a couple of questions one should normally have to answer before even going down the SRM road. The fact that SRM is so easy to setup makes it really hard to actually explain to a customer why a BCDR project will take much longer then he expected. And remember that although SRM is a great tool you would still need to create a Disaster Recovery Plan, SRM will be part of the plan but it needs to be in place!

I’m not saying that you should not go down the BCDR / SRM road, but be sure to be prepared. (read this e-book, it’s good and it’s free) Get to know your “business”, and be prepared for a long engagement… cause my experience is that normally people have a hard time answering really obvious questions.

You will talk to a lot of people who don’t have a clue of what the core business services / applications actually are. And the same goes for the sys admins, dependencies? Why would you want to know about that and how would I know?

Do you know which questions to ask, do you know how to get the right answers… This is why BCDR subject matter experts are needed for SRM engagements, so before you start give VMware a call, or your local VAC partner for that matter and make sure you get the best out of the SRM product.

Practical guide to Business Continuity and Disaster Recovery

Duncan Epping · Aug 13, 2008 ·

VMware released a 232 page PDF titled “A Practical Guide to Business Continuity & Disaster Recovery with VMware Infrastructure”

This VMware® VMbook focuses on business continuity and disaster recovery (BCDR) and is intended to guide the reader through the step-by-step process to set-up a multisite VMware Infrastructure that is capable of supporting BCDR services for designated virtual machines at time of test or during an actual event that necessitated the declaration of a disaster, resulting in the activation of services in a designated BCDR site.

Be sure to pick up this one and read it, it contains a lot of valuable information for every single one of you out there!