
Yellow Bricks

by Duncan Epping


BC-DR

Site Recovery Manager is not about installing… Part II

Duncan Epping · Jan 12, 2009 ·

I’ve been playing around with Site Recovery Manager these last couple of days. Installing it was really easy, and the same goes for the basic configuration. I wrote a blog post about this topic a month or so ago, but now I’ve experienced it myself. Most of the time during a Site Recovery Manager project will be spent on the Plan & Design phase and on writing documentation. I’ll give you one example why. The following was taken from the SRM course material:

Datastore Group
Replicated datastores containing the complete set of virtual machines that you want to protect with SRM

Protection Group
A group of virtual machines that are failed over together during test and recovery

For those who don’t know, there’s a one-to-one mapping between Datastore Groups and Protection Groups. In other words, once you’ve mapped a Datastore Group to a Protection Group, there’s no way of changing it without recreating the Protection Group.

I think a picture says more than a thousand words, so I stole this one from the Evaluator Guide to clarify the relationship between datastores, Datastore Groups and Protection Groups:

Notice that there are multiple datastores in Datastore Group 2 because VM4 has disks on both datastores. These datastores are therefore joined into one Datastore Group, which has a one-to-one relationship with a Protection Group. Keep in mind, this is really important: a Protection Group contains VMs that are failed over together during test and recovery.

If you’ve got VMs with multiple disks on multiple datastores, with no logic behind which disk is placed on which datastore, you could (and probably will) end up with all datastores being members of the same Datastore Group. Being a member of the same Datastore Group means being part of the same Protection Group, and that results in a less granular fail-over. It’s all or nothing in that case, and I can imagine most companies would like to have some sort of tiering model in place, or even better, fail over services one at a time. (By the way, this doesn’t mean that if you create multiple Protection Groups you can’t fail over everything at the same time; they can all be joined in a single Recovery Plan.)

Some might think that you can randomly add disks to datastores after you’ve finished configuring. This clearly isn’t the case. If you add a disk to a protected(!) VM, the Datastore Group will be recomputed. In our situation this meant that all VMs in the “Medium Priority” Protection Group were moved over to the “High Priority” Protection Group, because we added a disk to a “Medium Priority” VM and placed it on a “High Priority” datastore. As you can imagine, this also causes your Recovery Plans to end up with a “warning”: you will need to reconfigure the moved VMs before you can fail them over as part of your “High Priority” datastore. (Which probably wasn’t the desired strategy…)

When I was searching the internet for information on SRM, I stumbled upon this article on the VMware Uptime blog by Lee Dilworth. I’ve taken the following from the “What we’ve learnt” post, which confirms what we’ve seen over the last couple of days:

Datastore Group computation is triggered by the following events:

  • Existing VM is deleted or unregistered
  • VM is storage vmotioned to a different datastore
  • New disk is attached to VM on a datastore previously not used by the VM
  • New datastore is created
  • Existing datastore is expanded

In other words, moving VMs from one datastore to another, or creating a new disk on a different datastore, can cause problems because the Datastore Group computation will be re-run. Not only do you need to take virtual disk placement into consideration when configuring SRM, you also need to be really careful when moving virtual disks afterwards. Documentation, design and planning are key here.
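To make the grouping behaviour concrete, here is a minimal sketch of the idea behind Datastore Group computation. This is my own illustrative logic, not VMware’s actual implementation: datastores that share a VM end up in the same group, so one badly placed disk can merge two groups into one.

```python
# Sketch (assumption: illustrative only, not VMware's actual algorithm).
# Datastores that share a VM are merged into one Datastore Group.

def compute_datastore_groups(vm_disks):
    """vm_disks maps a VM name to the set of datastores holding its disks."""
    groups = []  # each group is a set of datastores
    for datastores in vm_disks.values():
        merged = set(datastores)
        remaining = []
        for group in groups:
            if group & merged:   # shares a datastore -> same group
                merged |= group
            else:
                remaining.append(group)
        remaining.append(merged)
        groups = remaining
    return groups

# VM4 spans DS2 and DS3, so those two datastores end up in one group.
placement = {"VM1": {"DS1"}, "VM4": {"DS2", "DS3"}, "VM5": {"DS3"}}
print(compute_datastore_groups(placement))
```

Add one disk for VM1 on DS2 and the two groups collapse into a single group, which is exactly the “all or nothing” fail-over situation described above.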

I would suggest documenting current disk placement before you even start implementing SRM; given the results, you might need to move disks around before you begin. Make sure to check your documentation and design before randomly adding virtual disks once SRM has been implemented. By the way, documenting your current disk placement can be done easily with the script that Hugo created this week, and I would suggest regularly creating reports and saving them.

Expect some more SRM stuff coming up over the next couple of weeks.

Health Check tools I use

Duncan Epping · Dec 18, 2008 ·

A few days ago Scott Lowe asked me which tools I use to deliver a health check engagement. A health check is a standard VMware PSO engagement: a VMware consultant will be on site to check the status of your environment and draw up a report.

I personally use the following tools:

  • Health Check script by A.Mikkelsen → for a quick overview of the current situation and setup, small files and easy to carry around, runs from the Service Console.
  • VMware Health Analyzer Appliance → A linux appliance that can connect to your VC/ESX and analyze log files. At this point in time it’s only available for VMware Employees or Partners with access to Partner Central.
  • Powershell: Report into MS Word → Alan Renouf created these great reporting PowerShell scripts, which dump info into a Word document. (And I’ve heard he’s also working on a Visio export.)
  • Powershell: Health Check Script → Creates an HTML report with datastore, CPU, memory and snapshot info… and more.
  • RVTools → Gives a quick overview of the current VM setup: snapshots, memory, CPU, etc.
  • Common sense → I hardly ever encounter really huge problems; mostly it’s decreased availability because of choices made during the implementation/design phase without following VMware’s guidelines. Using common sense is the best advice in this case, and read the best-practice documents and VMware’s collection of PDFs!
  • And when there are disturbing errors in one of the various log files, you have the option of running them through one of the many toolkits we have internally.

I’m not using the following tools actively during engagements because of licensing, but they can be very useful in your environment:

  • Replicate Datacenter Analyzer → Analyze your VI3 environment. I wrote an article on RDA a few weeks ago; click here.
  • Veeam Monitor → Monitor your VI3 environment including performance graphs etc.
  • Veeam Reporter → A reporting tool, which will come in handy when documenting environments and comparing the current config to an old config.
  • Vizioncore vFoglight → Might come in handy when analyzing trends and pinpointing resource contention.
  • Tripwire Configcheck → Analyze the security of your VMware ESX environment. Check my blog post on Configcheck here.

Scripts, scripts, scripts… come and get them!

Duncan Epping · Dec 18, 2008 ·

I was just pointed to this amazing topic on the VMTN Forums by William Lam and Tuan Duong. These guys created a whole bunch of scripts and decided to share them with the rest of the world:

My colleague (Tuan Duong) and I (William Lam) have been working on a virtualization/VDI deployment project over the last six months. The result of this work is a set of scripts that assist in provisioning and managing the server and lab environment for the Residential Networking Services (ResNet) at the University of California, Santa Barbara.

We took the approach of developing scripts that would be free in nature to support a variety of offerings that currently exist in the enterprise space. One such tool that we would like to share with the VMware community is our Linked Clones script that was developed at the beginning of the summer of 2008. This script functions similarly to the View Composer component in the recent release of VMware View 3 but with relatively relaxed requirements.

A description and more details of the Linked Clones script can be found at:

http://communities.vmware.com/docs/DOC-9020

Another script that complements the Linked Clone’s script is our custom management script “*my-vmware-cmd*” which can be found at:

http://communities.vmware.com/docs/DOC-9061

An example of our implementation of these scripts can be found at:

http://communities.vmware.com/docs/DOC-9201

We also have other scripts and resources that have been consolidated onto a webpage and would like to share it:

http://www.engineering.ucsb.edu/~duonglt/vmware/

We hope that the community finds some of these scripts to be useful in aiding VI administrators to manage their virtual infrastructure and look forward to any feedback that is provided.

Thanks
William lamw and Tuan tlduong

Check these scripts out if you’re looking for a linked clone solution in an environment without View Composer. Their website also contains a bunch of scripts, tips and tricks. One that really stands out is the RDM script; believe me, this is a must-have for your toolkit:

Download: rdm.sh – 11/03/08
Compatible with: ESX 3.5+ and ESXi

This script is used to locate all virtual machines that have an RDM mapping and provides the VMs Name, Hard Disk label shown on the VIC/VC, Datastore, LUN UUID, HBA/LUN, Compatibility Mode (Phys/Virt), DiskMode and Capacity.

HA: who decides where a VM will be restarted?

Duncan Epping · Dec 15, 2008 ·

During the Dutch VMUG someone walked up to me and asked a question about High Availability. He had read my article on primary and secondary nodes and was wondering who decides where and when a VM will be restarted.

Let’s start with a short recap of the  “primary/secondary” article: “The first five servers that join the cluster will become a primary node, and the others that will join will become a secondary node. Secondary nodes send their state info to primary nodes and also contact the primary nodes for their heartbeat notification. Primary nodes replicate their data with the other primary nodes and also send their heartbeat to other primary nodes.”

The question was: when a fail-over needs to take place because an isolation event occurred, who decides on which host a specific VM will be restarted? The answer is one of the primaries. One of the primary nodes is selected as the “fail-over coordinator”. The fail-over coordinator coordinates the restart of virtual machines on the remaining hosts, taking restart priorities into account. Keep in mind that when two hosts fail at the same time, it handles the restarts sequentially: first restart the VMs of the first failed host (taking restart priorities into account) and then restart the VMs of the second failed host (again taking restart priorities into account). If the fail-over coordinator fails, one of the other primaries will take over.

By the way, this is another reason why you can only account for four host failures: you need at least one primary, and that primary will be the fail-over coordinator. When the last primary dies…
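The coordinator’s behaviour described above can be sketched in a few lines. This is a simplified model of my own, not actual HA code: failed hosts are handled one after the other, and within each host the VMs restart by priority (here, a lower number means a higher priority).

```python
# Sketch (assumption: simplified model, not VMware HA's implementation).
# Hosts are processed sequentially in failure order; within a host,
# VMs restart by priority.

def restart_order(failed_hosts):
    """failed_hosts: list of (host, [(vm, priority), ...]) in failure order."""
    order = []
    for _host, vms in failed_hosts:
        for vm, _prio in sorted(vms, key=lambda v: v[1]):
            order.append(vm)
    return order

failures = [
    ("esx1", [("db01", 1), ("web01", 3)]),  # first host to fail
    ("esx2", [("app01", 2)]),               # second host to fail
]
print(restart_order(failures))  # all of esx1's VMs restart before esx2's
```

Note that app01, despite its higher priority than web01, restarts last: the sequential per-host handling trumps priority across hosts, which matches the behaviour described above.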

EnableResignature and/or DisallowSnapshotLUN

Duncan Epping · Dec 11, 2008 ·

I’ve spent a lot of time in the past trying to understand the settings EnableResignature and DisallowSnapshotLUN. They had me confused and dazzled a couple of times, and every now and then I still seem to have trouble actually understanding them. After a quick scan through the VCDX Enterprise Study Guide by Peter, I decided to write this post and take the time to get to the bottom of it. I needed this settled once and for all, especially now that I’m starting to focus more on BC/DR.

According to the SAN Config Guide (vi3_35_25_san_cfg.pdf) there are three states:

  1. EnableResignature=0, DisallowSnapshotLUN=1 (default)
    In this state, you cannot bring snapshots or replicas of VMFS volumes made by the array into the ESX Server host, regardless of whether or not the ESX Server has access to the original LUN. LUNs formatted with VMFS must have the same ID for each ESX Server host.
  2. EnableResignature=1, (DisallowSnapshotLUN is not relevant)
    In this state, you can safely bring snapshots or replicas of VMFS volumes into the same servers as the original and they are automatically resignatured.
  3. EnableResignature=0, DisallowSnapshotLUN=0 (This is similar to ESX Server 2.x behavior.)
    In this state, the ESX Server assumes that it sees only one replica or snapshot of a given LUN and never tries to resignature. This is ideal in a DR scenario where you are bringing a replica of a LUN to a new cluster of ESX Servers, possibly on another site that does not have access to the source LUN. In such a case, the ESX Server uses the replica as if it is the original.
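The three states above can be condensed into a small lookup. This is my own summary of the guide’s table, not an official decision matrix, so check the SAN Config Guide before relying on it:

```python
# Sketch (assumption: my own condensed summary of the three states
# in the SAN Config Guide, not an official VMware table).

def snapshot_lun_behavior(enable_resignature, disallow_snapshot_lun):
    if enable_resignature == 1:
        # State 2: DisallowSnapshotLUN is not relevant here.
        return "snapshot/replica LUNs are mounted and automatically resignatured"
    if disallow_snapshot_lun == 1:
        # State 1: the default.
        return "snapshot/replica LUNs are ignored (default)"
    # State 3: DR scenario; only safe with a single copy presented.
    return "snapshot/replica LUN is mounted as if it were the original"

print(snapshot_lun_behavior(0, 1))  # the default state
```

The takeaway: EnableResignature=1 wins regardless of the other setting, and the dangerous 0/0 combination should only ever see one copy of the LUN.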

The advanced LVM setting EnableResignature is used for resignaturing a VMFS volume that has been detected with a different LUN ID. So what does the LUN ID have to do with the VMFS volume? The LUN ID is stored in the LVM header of the volume, and it is used to check whether this is the same LUN being (re)discovered or a copy of the LUN being presented with a different ID. If it’s a copy, the VMFS volume needs to be resignatured; in other words, the UUID will be renewed and the LUN ID will be updated in the LVM header.

UUID, what’s that? Chad Sakac from EMC described it as follows in his post on VMFS resignaturing:

It’s a VMware generated number – the LVM signature aka the UUID (it’s a long hexadecimal number designed to be unique). The signature itself has little to do with anything presented by the storage subsystem (Host LUN ID, SCSI device type), but a change in either will cause a VMFS volume to get resigned (the ESX server says “hey, I used to have a LUN with this signature, but its parameters were different, so I better resign this”).

Like Chad says, the UUID has little to do with anything presented by the storage subsystem. A VMFS volume ID, aka UUID, looks like this:

42263200-74382e04-b9bf-009c06010000

1st part – the COS time when the file system was created or resignatured
2nd part – the TSC time, an internal time stamp counter kept by the CPU
3rd part – a random number
4th part – the MAC address of the COS NIC

Like I said before, and this is a common misconception so I will say it again: the LUN ID and the storage system product ID are stored in the LVM header, not in the UUID itself. Not that it really matters for the way the process works, though.
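As a quick illustration, the example UUID above splits cleanly into the four parts. The field names in this sketch are my own labels, not official VMware terminology:

```python
# Sketch (assumption: field names are my own labels for the four parts
# described above, not official VMware terminology).

def parse_vmfs_uuid(uuid):
    cos_time, tsc_time, rand_part, mac = uuid.split("-")
    return {
        "cos_creation_time": cos_time,  # COS time at creation/resignature
        "tsc_time": tsc_time,           # CPU time stamp counter
        "random": rand_part,            # a random number
        "cos_nic_mac": mac,             # MAC address of the COS NIC
    }

print(parse_vmfs_uuid("42263200-74382e04-b9bf-009c06010000"))
```

Notice there is no LUN ID anywhere in those four fields, which is exactly the misconception addressed above.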

I think that makes it clear when to use EnableResignature and when not to. Use it when you want to access VMFS volumes whose LUN ID changed for whatever reason: for instance, a fail-over to a DR site with different LUN numbering, or SAN upgrades that caused changes in LUN numbering.

That leaves DisallowSnapshotLun. I had a hard time figuring out when to set it to “0” and when to leave it at the default setting of “1”, but I found the following in a VMworld Europe 2008 presentation:

DisallowSnapshotLun: should be set to “0” if the SCSI Inquiry string differs between the two arrays, in order to allow access to datastores.

I googled for “SCSI Inquiry” and I found the following in a document by HP:

The storage system product ID retrieved from the SCSI Inquiry string (Example: HSV210)

In other words, when you’ve got an HP EVA 4000 and an HP EVA 8000 which are mirrored, you need to set DisallowSnapshotLun to 0 when a fail-over has occurred. The SCSI Inquiry string will differ because the controllers are of a different model. (The SCSI Inquiry string also contains the LUN ID, by the way.)

When both sites are exactly the same, including LUN IDs, you don’t need to change this setting; leave it set to 1. Be absolutely sure that when you set DisallowSnapshotLun to 0, there’s only one “version” of the VMFS volume presented to the host. If for some reason both are presented, data corruption can and probably will occur. If you need to present both LUNs at the same time, use EnableResignature instead of DisallowSnapshotLun.

Depending on the way your environment is set up and the method you chose to re-enable a set of LUNs, you may need to re-register your VMs. The only way to avoid this is to use DisallowSnapshotLun and pre-register all VMs on the secondary VirtualCenter server, or to use just one VirtualCenter server.

Re-registering can be done with a couple of lines of script on just one ESX box:

find /vmfs/volumes/ -name "*.vmx" | while read -r vmx
do
echo "Registering VM $vmx"
vmware-cmd -s register "$vmx"
done

You can change the EnableResignature or DisallowSnapshotLun setting as follows:

Open vCenter
Click on a host
Click on the “Configuration” tab
Click on “Advanced Settings”
Go to “LVM”
Change the appropriate setting
Click “OK”
Rescan your HBAs (Storage Adapters, Rescan)

It’s also possible to use the command line to enable DisallowSnapshotLun or EnableResignature:

echo 0 > /proc/vmware/config/LVM/DisallowSnapshotLUN
echo 1 > /proc/vmware/config/LVM/EnableResignature

I do want to stress that these settings should only ever be changed temporarily, considering the impact they can have! Whenever you set either option, reset it to the default afterwards.

The big question still remains: would I prefer resignaturing my VMFS volumes, or setting DisallowSnapshotLun to “0” to be able to access the volumes? The answer is: “it depends”. It depends heavily on the type of setup you have; I can’t answer this question without knowing the background of an environment. The safest method is definitely resignaturing.

Before you decide, read this post again and read the articles/PDFs in the links below that I used as a reference:

  • Updates for the VMFS volume resignaturing discussion
  • HP disaster tolerant solutions using Continuous Access for HP EVA in a VI3 environment
  • Fibre Channel SAN Configuration Guide
  • VMFS Resignaturing by Chad Sakac


Copyright Yellow-Bricks.com © 2025