srm

VMware HA or VMware SRM, what should I use?

Duncan Epping · Feb 12, 2009 ·

I was just reading up on VMTN and noticed this great topic. For some reason a lot of people don’t see the difference between HA and SRM. I suggest reading the full topic, especially the replies by Jay Judkowitz and Smoggy. Both are subject matter experts on SRM, and they explained to the topic starter what the differences are and when to use each product. Here’s an excerpt from the discussion which, in my opinion, captures the essence of the answer:

With SRM, you get a much more well defined failover.
- The VMs start in a specified order
- You can set some VMs to be started serially with others starting in parallel
- You can designate VMs at the recovery site to suspend to make room for recovery VMs
- You can have callout scripts and predefined breakpoints to make sure that critical non-VMware activity is done at the right time and place
- You can set the resource pool at the remote site (with the same size or different as the source resource pool) so that you get a predictable and defined QOS on CPU and memory
Once you have that well defined failover plan, you can test it and audit the results
- Testing will automatically snap the recovery LUNs so you can power on the recovery VMs without interrupting replication
- You can specify a test network at the second site that SRM will automatically put the recovery VMs on during a test so that they do not interfere with the running VMs
- You can therefore do non-disruptive DR testing any time without warning. The recovery plan executes the same as for failover, but in a “test bubble” where storage and network IO are safely segregated away from production work.
- There is a test results page for the recovery plan which lists all test runs, how long they took and how successful they were. From this page, you can drill down to each test run and see exactly what steps succeeded and failed and how long they took to run.
- With the history page, you can grade your organization over time. With the detailed reports, you can troubleshoot specific runs.

I suggest that if you’re looking into Business Continuity / Disaster Recovery and you’ve got questions on the what, where, when and how of SRM, you visit the VMTN forums. These guys really know what they are talking about and can really help you understand what BC/DR is about.

SRM Failback?

Duncan Epping · Feb 4, 2009 ·

I get this question a lot: Does SRM have failback capabilities? The answer is short but not simple: yes, it does. Keep in mind that there’s no big red button labeled “Failback”, which is the “not simple” part of the answer. Luckily for us, the VMware Uptime Blog team wrote an extensive article on how to do a failback with the current version of Site Recovery Manager. In short, this is what one needs to do to fail back:

Reverse the replication direction in the storage layer to be from Site B to Site A
Clean up the shadow virtual machines and protection groups on Site A
Clean up the Recovery Plans configured on Site B
Configure the protection group(s) on Site B
Configure the Recovery Plans on Site A
Test recovery from Site B to Site A
Perform the recovery from Site B to Site A

Read the complete article on the Uptime Blog for all the details and show the article to your manager. It includes a table with the estimated amount of time a failback would normally take, manual versus SRM.

EMC SRDF Storage Replication Adapter

Duncan Epping · Jan 30, 2009 ·

I was delivering a Site Recovery Manager Jumpstart today. During the configuration of the Storage Replication Adapter (SRA) the task got stuck at 23%. I’ve seen the configuration of the SRA get stuck at 23% once before, so I knew it was the “DiscoverLuns” command that failed for one reason or another.

We ran the configuration of the SRA again and it stopped after exactly five minutes. We decided to run the DiscoverLuns task again, but this time manually, using command.pl with an XML file as input. If you read the previous article on DiscoverLuns, you know how to feed command.pl the XML file and what information the file should contain.

Running DiscoverLuns manually worked great, but it took a little over 15 minutes before the EMC DMX3 returned the complete results. During the configuration via the GUI the task failed after exactly 5 minutes, so it seemed to time out. Opening vmware-dr.xml, which can be found in the Site Recovery Manager installation folder, revealed a timeout of 300 seconds.

We changed the value to 1800, restarted the SRM service and reconfigured the SRA successfully.
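
To pick a sensible timeout it helps to know how long the array really needs. The following is a minimal Python sketch, not part of SRM or the SRA, that times the manual run and checks the result against a candidate timeout. The helper names time_command and fits_timeout are made up for this illustration; the only thing taken from the posts is the invocation perl command.pl < file.xml.

```python
import subprocess
import time


def time_command(argv, stdin_path=None):
    """Run argv, optionally feeding a file on stdin; return (exit_code, seconds)."""
    start = time.monotonic()
    if stdin_path is None:
        result = subprocess.run(argv)
    else:
        # Equivalent of the shell redirect "< file.xml".
        with open(stdin_path, "rb") as stdin_file:
            result = subprocess.run(argv, stdin=stdin_file)
    return result.returncode, time.monotonic() - start


def fits_timeout(elapsed_seconds, timeout_seconds):
    """True when a run of this length would finish before the timeout."""
    return elapsed_seconds < timeout_seconds


# Hypothetical usage, run from the folder that holds command.pl:
#   code, elapsed = time_command(["perl", "command.pl"], stdin_path="file.xml")
#   print(code, round(elapsed), fits_timeout(elapsed, 300), fits_timeout(elapsed, 1800))
```

With the run described above (a little over 15 minutes), the 300-second default is too short, while 1800 seconds leaves roughly a factor of two of headroom.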

Failover using SRM might be slow…

Duncan Epping · Jan 26, 2009 ·

I was just reading an excellent weekly technical digest by VMware’s Michael White and noticed the mention of a KB article on SRM. This KB article has the following description:

With VMware Site Recovery Manager 1.0 Update 1, recovery of a VM might take a long time. The recovery time during a test or real recovery will be longer when more VM’s are involved. The Change Network Settings task might time out during the test or real failover. This is due to the serial fashion in which Site Recovery Manager waits until a guest heartbeat is seen prior to customizing the VM.

This problem can be encountered when running the following ESX versions:

ESX 3.5 Update 2 and Update 3
ESX 3.0.2 and 3.0.3

In other words, the behaviour of ESX has changed, and SRM benefits from changing it again. We are talking about a 5-minute delay for each VM. You can imagine that running a recovery plan can and will take a long time when this setting isn’t changed. Here’s the solution, which is also outlined in the KB article.

Set hostd heartbeat delay to 40.
Disconnect the host from VC (right-click the host in the VI Client and select “Disconnect”)
Log in as root to the ESX Server with SSH.
Using a text editor such as nano or vi, edit the file /etc/vmware/hostd/config.xml
Set the “heartbeatDelayInSecs” tag under “vmsvc” to 40 seconds as shown here:

<vmsvc>
<heartbeatDelayInSecs>40</heartbeatDelayInSecs>
<enabled>true</enabled>
</vmsvc>

Restart the management agents for this change to take effect. See Restarting the Management agents on an ESX Server (1003490).
Reconnect the host in VC (right-click the host in the VI Client and select “Connect”)
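
When several hosts need the same change, the edit can be scripted instead of done by hand. The following is a minimal Python sketch and not something taken from the KB article. It assumes you work on a copy of config.xml, and the function name set_heartbeat_delay is invented for this example. Only the vmsvc and heartbeatDelayInSecs element names come from the steps above; where vmsvc sits in the real file is not assumed, so the code searches for it.

```python
import xml.etree.ElementTree as ET


def set_heartbeat_delay(config_xml, seconds=40):
    """Return config_xml with <vmsvc>/<heartbeatDelayInSecs> set to seconds."""
    root = ET.fromstring(config_xml)
    # <vmsvc> may be the root itself or nested somewhere below it.
    vmsvc = root if root.tag == "vmsvc" else root.find(".//vmsvc")
    if vmsvc is None:
        raise ValueError("no <vmsvc> element found")
    delay = vmsvc.find("heartbeatDelayInSecs")
    if delay is None:
        # Add the tag as the first child, matching the layout shown above.
        delay = ET.Element("heartbeatDelayInSecs")
        vmsvc.insert(0, delay)
    delay.text = str(seconds)
    return ET.tostring(root, encoding="unicode")


# Hypothetical usage on a local copy of the file:
#   with open("config.xml") as f:
#       updated = set_heartbeat_delay(f.read(), 40)
```

Note that ElementTree drops XML comments when it parses a file, so compare the result with the original before copying it back to the host. The management agents still need to be restarted afterwards.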

Storage Replication Adapter: discoverLuns…

Duncan Epping · Jan 20, 2009 ·

Today I was implementing Site Recovery Manager with a colleague (thanks Andy!). During the configuration of the HP EVA SRA (Storage Replication Adapter) we received the following error:

discoverLuns script failed to execute properly

The error indicates that the first part of the SRA configuration, “discoverArrays”, worked, but that it bailed out (at 23%) when discovering the LUNs and their replicas. So after checking the config files and log files we decided to run the script file that the SRA uses manually and see what happens.

First we created an XML file to feed the script. The XML file contained the following, which can be copied from the SRM log files:

<?xml version="1.0" encoding="ISO-8859-1"?>
<Command>
<Name>discoverLuns</Name>
<ConnectSpec>
<Name>HP StorageWorks EVA Virtualization Adapter</Name>
<Address>san.yellow-bricks.com</Address>
<Username>user</Username>
<Password>password</Password>
</ConnectSpec>
<ArrayId>YB-SAN-01</ArrayId>
<OutputFile>C:\TEMP\SAN.Log</OutputFile>
<LogLevel>trivia</LogLevel>
</Command>

Now we were able to run the script with the XML file as input:

perl command.pl < file.xml
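
If you need to repeat this for several arrays, the input file can be generated rather than copied from the logs each time. The following is a minimal Python sketch, assuming the element layout shown above is all the script needs. The function name build_discover_luns_xml is made up for this example and is not part of the SRA.

```python
import xml.etree.ElementTree as ET


def build_discover_luns_xml(adapter, address, username, password,
                            array_id, output_file, log_level="trivia"):
    """Build the discoverLuns command document in the layout shown above."""
    command = ET.Element("Command")
    ET.SubElement(command, "Name").text = "discoverLuns"
    connect = ET.SubElement(command, "ConnectSpec")
    ET.SubElement(connect, "Name").text = adapter
    ET.SubElement(connect, "Address").text = address
    ET.SubElement(connect, "Username").text = username
    ET.SubElement(connect, "Password").text = password
    ET.SubElement(command, "ArrayId").text = array_id
    ET.SubElement(command, "OutputFile").text = output_file
    ET.SubElement(command, "LogLevel").text = log_level
    body = ET.tostring(command, encoding="unicode")
    return '<?xml version="1.0" encoding="ISO-8859-1"?>\n' + body


# Hypothetical usage, writing the file that is fed to command.pl:
#   xml_text = build_discover_luns_xml(
#       "HP StorageWorks EVA Virtualization Adapter", "san.yellow-bricks.com",
#       "user", "password", "YB-SAN-01", r"C:\TEMP\SAN.Log")
#   with open("file.xml", "w", encoding="iso-8859-1") as f:
#       f.write(xml_text)
```

One benefit of building the file this way is that ElementTree escapes characters such as & and < in the values, which matters when a password contains them.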

In our case running the script manually with the XML file as input didn’t return an error. This gave us the idea that it might be account or permissions related. During the configuration of the SRA we entered domain credentials, which were the same as the account being used during the manual run of the script. So it wasn’t the SRA account that was causing these problems.

After diving into the configuration we stumbled upon the SRM service, which was started with the Local System account. We decided to change the account used for the service from Local System to a domain account, and indeed, the problem was solved.

One would expect this to be part of the SRA documentation, but it isn’t. We contacted VMware Support and they had the same configuration running in their test environment, except that they weren’t using AD authentication. In their case the Local System account worked just fine.

I’ve emailed Support all the log files and according to them our suspicion was correct. It seems to be related to the HP EVA SRA. The HP SRA seems to use the wrong account for authentication at one point during the script. Next up: Contact HP Support and let’s see if they can a) fix this or b) update their documentation.