
Yellow Bricks

by Duncan Epping


srm

VMware vCloud Director Infrastructure Resiliency Case Study paper published!

Duncan Epping · Mar 1, 2012 ·

Yesterday the paper that Chris Colotti and I were working on titled “VMware vCloud Director Infrastructure Resiliency Case Study” was finally published. This white paper is an expansion on the blog post I published a couple of weeks back.

Someone asked me at PEX where this solution suddenly came from. Well, it is based on a solution I came up with on a random Friday morning halfway through December, when I woke up at 05:00 in Palo Alto still jet-lagged. I diagrammed it on a napkin and started scribbling things down in Evernote. I explained the concept to Chris over breakfast and that is how it started. Over the last two months Chris (and his team) and I validated the solution and this is the outcome. I want to thank Chris and team for their hard work and dedication.

I hope that those architecting / implementing DR solutions for vCloud environments will benefit from this white paper. If there are any questions feel free to leave a comment.

Source – VMware vCloud Director Infrastructure Resiliency Case Study

Description: vCloud Director disaster recovery can be achieved through various scenarios and configurations. This case study focuses on a single scenario as a simple explanation of the concept, which can then easily be adapted and applied to other scenarios. In this case study it is shown how vSphere 5.0, vCloud Director 1.5 and Site Recovery Manager 5.0 can be implemented to enable recoverability after a disaster.

Download:
http://www.vmware.com/files/pdf/techpaper/vcloud-director-infrastructure-resiliency.pdf
http://www.vmware.com/files/pdf/techpaper/vcloud-director-infrastructure-resiliency.epub
http://www.vmware.com/files/pdf/techpaper/vcloud-director-infrastructure-resiliency.mobi

I am expecting that the MOBI and EPUB versions will also be available soon. When they are, I will let you know!

vCloud Director infrastructure resiliency solution

Duncan Epping · Feb 13, 2012 ·

By Chris Colotti (Consulting Architect, Center Of Excellence) and Duncan Epping (Principal Architect, Technical Marketing)

This article assumes the reader has knowledge of vCloud Director, Site Recovery Manager and vSphere. It will not go into depth on some topics; for more in-depth details around some of the concepts we would like to refer you to the Site Recovery Manager, vCloud Director and vSphere documentation.

Creating DR solutions for vCloud Director poses multiple challenges, and these challenges all have a common theme: the automatic creation of objects by VMware vCloud Director, such as resource pools, virtual machines, folders, and portgroups. vCloud Director and vCenter Server both rely heavily on managed object reference identifiers (MoRef IDs) for these objects. Any unplanned changes to these identifiers could, and often will, result in loss of functionality, as Chris has described in this article. vSphere Site Recovery Manager currently does not support protection of virtual machines managed by vCloud Director for these exact reasons.

The vCloud Director and vCenter objects that are referenced by both products and known to cause problems when their identifiers change are:

  • Folders
  • Virtual machines
  • Resource Pools
  • Portgroups

Besides these automatically created objects, the following pre-created static objects are also often used and referenced by vCloud Director:

  • Clusters
  • Datastores

Over the last few months we have worked on and validated a solution which avoids changes to any of these objects. This solution simplifies the recovery of a vCloud infrastructure and increases management infrastructure resiliency. The amazing thing is that it can be implemented today with current products.

In this blog post we will give an overview of the developed solution and the basic concepts. For more details, implementation guidance, or information about possible automation points, we recommend contacting your VMware representative and engaging VMware Professional Services.

Logical Architecture Overview

vCloud Director infrastructure resiliency can be achieved through various scenarios and configurations. This blog post focuses on a single scenario to allow for a simple explanation of the concept. A white paper explaining some of the basic concepts is also currently being developed and will be released soon. The concept can easily be adapted for other scenarios; however, you should inquire first to ensure supportability. This scenario uses a so-called "Active / Standby" approach where hosts in the recovery site are not in use for regular workloads.

In order to ensure all management components are restarted in the correct order, and in the least amount of time, vSphere Site Recovery Manager will be used to orchestrate the fail-over. As of this writing, vSphere Site Recovery Manager does not support the protection of VMware vCloud Director workloads. Due to this limitation these will be failed over through several manual steps. All of these steps can be automated using tools like vSphere PowerCLI or vCenter Orchestrator.

The following diagram depicts a logical overview of the management clusters for both the protected and the recovery site.

In this scenario Site Recovery Manager will be leveraged to fail over all vCloud Director management components. In each of the sites it is required to have a management vCenter Server and an SRM Server, which aligns with standard SRM design concepts.

Since SRM cannot be used for vCloud Director workloads, there is no requirement to have an SRM environment connecting to the vCloud resource cluster's vCenter Server. In order to facilitate a fail-over of the VMware vCloud Director workloads, a standard disaster recovery concept is used. This concept leverages common replication technology and vSphere features to allow for a fail-over, and is described below.

The below diagram depicts the VMware vCloud Director infrastructure architecture used for this case study.

Both the Protected and the Recovery Site have a management cluster. Each of these contains a vCenter Server and an SRM Server, which are used to facilitate the disaster recovery procedures. The vCloud Director management virtual machines are protected by SRM. Within SRM a protection group and recovery plan will be created to allow for a fail-over to the Recovery Site.

Please note that storage is not stretched in this environment and that hosts in the Recovery Site are unable to see storage in the Protected Site; as such, they are unable to run vCloud Director workloads in a normal situation. It is also important to note that the hosts are attached to the cluster's DVSwitch, to allow for quick access to the vCloud configured port groups, and are pre-prepared by vCloud Director.

In the diagram these hosts are depicted as hosts placed in maintenance mode. They could also be stand-alone hosts that are added to the vCloud Director resource cluster during the fail-over. For simplification and visualization purposes this scenario describes the situation where the hosts are part of the cluster and placed in maintenance mode.

Storage replication technology is used to replicate LUNs from the Protected Site to the Recovery Site. This can be done using asynchronous or synchronous replication; typically this depends on the Recovery Point Objective (RPO) defined in the service level agreement (SLA) as well as the distance between the two sites. In our scenario synchronous replication was used.

Fail-over Procedure

In this section the basic steps required for a successful fail-over of a VMware vCloud Director environment are described. These steps are pertinent to the described scenario.

It is essential that each component of the vCloud Director management stack be booted in the correct order. The order in which the components should be restarted is configured in an SRM recovery plan and can be initiated by SRM with a single button. The following order was used to power-on the vCloud Director management virtual machines:

  1. Database Server (providing vCloud Director, vCenter Server, vCenter Orchestrator, and Chargeback Databases)
  2. vCenter Server
  3. vShield Manager
  4. vCenter Chargeback (if in use)
  5. vCenter Orchestrator (if in use)
  6. vCloud Director Cell 1
  7. vCloud Director Cell 2
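
Purely as an illustration of what this ordered power-on looks like, the same sequence could be expressed in a few lines of PowerCLI. This is a hedged sketch only: the vCenter name and all VM names are placeholders, not the names used in the validated environment.

# Hypothetical PowerCLI sketch of the management power-on order (placeholder names).
Connect-VIServer -Server "mgmt-vc-recovery.local"

$bootOrder = @(
    "vcd-db01",      # 1. Database server (vCD, vCenter, vCO and Chargeback databases)
    "vcd-vc01",      # 2. vCenter Server
    "vcd-vsm01",     # 3. vShield Manager
    "vcd-cb01",      # 4. vCenter Chargeback (if in use)
    "vcd-vco01",     # 5. vCenter Orchestrator (if in use)
    "vcd-cell01",    # 6. vCloud Director Cell 1
    "vcd-cell02"     # 7. vCloud Director Cell 2
)

foreach ($name in $bootOrder) {
    $vm = Get-VM -Name $name
    Start-VM -VM $vm -Confirm:$false | Out-Null
    # Wait for VMware Tools so the next component is only started once this one is up
    Wait-Tools -VM $vm -TimeoutSeconds 600 | Out-Null
}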

When the fail-over of the vCloud Director management virtual machines in the management cluster has succeeded, multiple steps are required to recover the vCloud Director workloads. These are described in a manual fashion but can be automated using PowerCLI or vCenter Orchestrator; a rough sketch of this follows the list below.

  1. Validate all vCloud Director management virtual machines are powered on
  2. Using your storage management utility break replication for the datastores connected to the vCloud Director resource cluster and make the datastores read/write (if required by storage platform)
  3. Mask the datastores to the recovery site (if required by storage platform)
  4. Using ESXi command line tools mount the volumes of the vCloud Director resource cluster on each host of the cluster
    • esxcfg-volume -m <volume ID>
  5. Using vCenter Server, rescan the storage and validate that all volumes are available
  6. Take the hosts out of maintenance mode for the vCloud Director resource cluster (or add the hosts to your cluster, depending on the chosen strategy)
  7. In our tests the virtual machines were automatically powered on by vSphere HA. vSphere HA is aware of the situation before the fail-over and will power on the virtual machines according to the last known state
    • Alternatively, virtual machines can be powered on manually by leveraging the vCloud API, so they are booted in the correct order as defined in their vApp metadata. It should be noted that this could possibly result in vApps being powered on which were powered off before the fail-over, as there is currently no way of determining their state.
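
To give an idea of what automating steps 4 through 6 could look like, below is a rough PowerCLI sketch. The vCenter, cluster and volume names are assumptions, it assumes a reasonably recent PowerCLI (Get-EsxCli -V2), and it was not validated as part of this exercise.

# Hypothetical sketch of steps 4-6 for the vCloud Director resource cluster (placeholder names).
Connect-VIServer -Server "resource-vc-recovery.local"
$cluster = Get-Cluster -Name "vCD-Resource-Cluster"

foreach ($vmhost in ($cluster | Get-VMHost)) {
    # Rescan the HBAs first so the newly presented replica LUNs are visible to the host
    Get-VMHostStorage -VMHost $vmhost -RescanAllHba | Out-Null

    # Step 4: mount the replicated volume without resignaturing it
    # (the esxcli equivalent of "esxcfg-volume -m <volume ID>")
    $esxcli = Get-EsxCli -VMHost $vmhost -V2
    $esxcli.storage.vmfs.snapshot.mount.Invoke(@{ volumelabel = "vCD-Datastore-01" })

    # Step 5: rescan VMFS so vCenter sees the mounted volume
    Get-VMHostStorage -VMHost $vmhost -RescanVmfs | Out-Null

    # Step 6: take the host out of maintenance mode
    Set-VMHost -VMHost $vmhost -State Connected -Confirm:$false | Out-Null
}
# Step 7 is then handled by vSphere HA, or by powering on the vApps through the vCloud API.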

Using this vCloud Director infrastructure resiliency concept, a fail-over of a vCloud Director environment has been successfully completed and the “cloud” moved from one site to another.

As all vCloud Director management components are virtualized, the virtual machines are moved over to the Recovery Site while maintaining all current managed object reference identifiers (MoRef IDs). Re-signaturing the datastores (giving them a new unique ID) has also been avoided, to ensure the relationship between the virtual machines / vApps within vCloud Director and the datastores remained intact.

Is that cool and simple or what? For those wondering, although we have not specifically validated it, yes this solution/concept would also apply to VMware View. Yes it would also work with NFS if you follow my guidance in this article about using a CNAME to mount the NFS datastore.
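
To illustrate the CNAME idea for NFS, a minimal PowerCLI sketch is shown below. The alias, datastore name and export path are assumptions; the point is simply that the hosts mount the export through a DNS alias which can be repointed to the recovery array during a fail-over.

# Mount the NFS export via a DNS alias (CNAME) instead of the array's real
# hostname, so the datastore identity stays the same after a fail-over.
$nfsParams = @{
    Nfs     = $true
    Name    = "vcd-nfs-01"                # assumed datastore name
    NfsHost = "nfs-vcd.example.local"     # the CNAME, not the array's real name
    Path    = "/vol/vcd01"                # assumed export path
}
foreach ($vmhost in (Get-Cluster -Name "vCD-Resource-Cluster" | Get-VMHost)) {
    New-Datastore @nfsParams -VMHost $vmhost | Out-Null
}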

 

Avoid changing your VM's IP in a DR procedure…

Duncan Epping · Jan 19, 2012 ·

I was thinking about one of the most challenging aspects of DR procedures: IP changes. This is a very common problem. Although changing the IP address of a VM is usually straightforward, that doesn't mean the change is propagated to the application layer. Many applications use hardcoded IP addresses, and changing these is usually a huge challenge.

But what about using vShield Edge? If you look at how vShield Edge is used in a vCloud Director environment, mainly for NAT and firewall functionality, you could use it in exactly the same way for your VMs in a DR-enabled environment. I know there are many apps out there which don't use hardcoded IP addresses and which are simple to re-IP. But for those that are not, why not just leverage vShield Edge? NAT the VMs, and when there is a DR event simply swap out the NAT pool and update DNS. On the "inside" nothing changes, because the VM keeps its internal address, so the application will continue to work fine. On the outside things will change, but this is an "easy" fix with a lot less risk than re-IP'ing that whole multi-tier application.
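
As a purely illustrative sketch of the "swap out the NAT pool and update DNS" step, and not a validated procedure: the snippet below assumes a Microsoft DNS server with the DnsServer PowerShell module, and all names and addresses are placeholders. The vShield Edge NAT rule change itself would be done through vShield Manager.

# The VM keeps its internal address (192.168.10.10) behind vShield Edge.
# After the DR event only the external side changes: the recovery-site Edge
# NATs a new external address to the same internal address, and DNS is updated.
$zone        = "example.local"
$record      = "app01"
$oldExternal = "10.1.1.10"    # NAT address published by the protected-site Edge
$newExternal = "10.2.1.10"    # NAT address published by the recovery-site Edge

Remove-DnsServerResourceRecord -ZoneName $zone -Name $record -RRType "A" -RecordData $oldExternal -Force
Add-DnsServerResourceRecordA -ZoneName $zone -Name $record -IPv4Address $newExternal -TimeToLive 00:05:00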

I wonder how some of you out in the field do this today.

 

Fiddling around with SRM’s Storage Replication Adapter – Part II

Duncan Epping · Jan 12, 2012 ·

** Disclaimer: This is for educational purposes, please don’t implement this in your production environment as it is not supported! **

After my article this week about (ab)using the SRA provided through Site Recovery Manager to fail over any LUN, I expected some people to reach out to me with additional questions. One of the questions which came in more than once was "is it possible to do a test-failover of a LUN which is not managed by the SRM infra?" I guess the short answer is yes, it is. The long answer is: well, it depends on what your definition of a "test-failover" is. Of course, booting up a physical machine from SAN while keeping the same IP and so on would cause conflicts. I am also not going to show you how to re-IP your physical machines as I expect you to know this. From an SRM perspective, how exciting is this?

To be honest, not really. The same concept applies. For a test-failover, SRM calls the SRA through a script called "command.pl" and feeds it XML. The following lines of XML are relevant for this exercise, but the critical one is "TestFailoverStartParameters":

--> <TestFailoverStartParameters>
--> <ArrayId>BB005056AE32820000-server_2</ArrayId>
--> <AccessGroups>
--> <AccessGroup id="domain-c7">
--> <Initiator type="iSCSI" id="iqn.1998-01.com.vmware:localhost-11616041"/>
--> <Initiator type="iSCSI" id="iqn.1998-01.com.vmware:localhost-4a15366e"/>
--> <Initiator type="NFS" id="10.21.68.106"/>
--> <Initiator type="NFS" id="10.21.68.105"/>
--> </AccessGroup>
--> </AccessGroups>
--> <TargetDevices>
--> <TargetDevice key="fs14_T1_LUN1_BB005056AE32800000_fs10_T1_LUN1_BB005056AE32820000">
--> <AccessGroups>
--> <AccessGroup id="domain-c7"/>
--> </AccessGroups>
--> </TargetDevice>
--> </TargetDevices>
--> </TestFailoverStartParameters>
--> </Command>

Now in our case we want to fail over a random non-vSphere LUN. We will need the "initiator" (the server or servers that need to be able to see this LUN) and we will need the LUN identifier. All of this can be found either in the SRM log files (LUN identifiers) or on the physical server (initiator details). If you call command.pl and feed it the XML file, the SRA will request the array to create a snapshot and give the host access to that snapshot. Now it is up to you to take the next steps!
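
To make that a little more concrete, here is a hedged PowerShell sketch of how you could template the XML and feed it to the SRA. The physical server's initiator IQN, the file names and the perl path are all placeholders; the device keys are the ones from my lab.

# Build a custom TestFailoverStart XML from a template captured in the SRM logs,
# swapping in the physical server's initiator and the target device key we want.
$template = Get-Content -Path ".\testfailover-template.xml" -Raw

$custom = $template -replace "iqn\.1998-01\.com\.vmware:localhost-11616041", "iqn.1991-05.com.microsoft:physbox01"
$custom = $custom -replace "fs14_T1_LUN1_BB005056AE32800000_fs10_T1_LUN1_BB005056AE32820000", "fs16_T1_LUN2_BB005056AE32800000_fs12_T1_LUN3_BB005056AE32820000"

Set-Content -Path ".\testfailover-custom.xml" -Value $custom

# Feed the XML to the SRA exactly like SRM would (perl path is an assumption)
Get-Content ".\testfailover-custom.xml" | & "C:\Perl\bin\perl.exe" command.pl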

It is not rocket science. Anything SRM does with the SRA you can do from the command line using command.pl and a custom XML file. As mentioned in the comments on my previous article, I know people are interested in using this for physical hosts… I will discuss this internally, but for now don't come close, it is not supported!

 

“Hacking” Site Recovery Manager (SRM) / a Storage Array Adapter

Duncan Epping · Jan 10, 2012 ·

** Disclaimer: This is for educational purposes, please don’t implement this in your production environment as it is not supported! **

Last week I received a question and I figured I would dive into it this week. The question was whether it is possible to use VMware Site Recovery Manager (SRM) to fail over LUNs which are not part of the cluster that SRM "manages". In other words, can I fail over a LUN which is attached to a physical Windows server or to a completely separate VMware cluster? Before we continue: I did not hack SRM itself, nor did I make any changes to the SRA.

Let's briefly explain what SRM normally does when you go through the process of creating a DR plan. This is slimmed down, focusing only on the parts relevant to this question:

  • First it will discover the devices using the Storage Replication Adapter (SRA)
  • It then discovers all LUNs using the SRA
  • It then shows the replicated LUNs containing VMs to the admin
  • The admin can use these in the recovery plan and "protect" the VMs appropriately

I decided to install SRM in a nested environment using the Celerra Uber VSA. I installed and configured the VNX SRA, and went through some of the log files just to find evidence that my plan was even possible. For Windows 2008 you can find the SRM log files in this location, by the way:

%ALLUSERSPROFILE%\VMware\VMware vCenter Site Recovery Manager\Logs\

Other locations are documented in this KB. When I set up the environment I created multiple LUNs of different sizes to make them easily recognizable. The LUN which is replicated but not exposed to our vCenter/SRM environment is 25GB, and the LUN which is exposed is 30GB. This is what the log files showed me when I did a quick find on the size:

(Production) fsid=14 size=30000MB alloc=0MB dense  read-write
path=/srm01/fs14_T1_LUN1_BB005056AE32800000/fs14_T1_LUN1_BB005056AE32800000 (snapped)
(Production) fsid=16 size=25000MB alloc=0MB dense read-write
path=/vc01/fs16_T1_LUN2_BB005056AE32800000/fs16_T1_LUN2_BB005056AE32800000 (snapped)

As you can see, both my 25GB and my 30GB LUNs are listed. I also gave them names that allow me to quickly identify them, "srm01" and "vc01", where "vc01" is the one which is not managed by SRM.
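
For reference, that "quick find" can be done straight from PowerShell against the SRM log directory mentioned above; a minimal sketch (the log file pattern is an assumption):

# Grep the SRM logs for the LUN sizes to spot the replicated file systems
$logDir = Join-Path $env:ALLUSERSPROFILE "VMware\VMware vCenter Site Recovery Manager\Logs"
Select-String -Path (Join-Path $logDir "*.log") -Pattern "size=25000MB|size=30000MB" |
    Select-Object -ExpandProperty Line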

So how does SRM get this information? Well, it is actually pretty straightforward: SRM calls a script which is part of the SRA and feeds this script XML. This XML contains the commands and details required. I've written about this a long time ago when I was troubleshooting SRM, and it is still applicable:

perl command.pl < file.xml

Now the XML file is of course key here… How does it need to be structured, and can we use, or should I say abuse, it to do a fail-over of a LUN which is not "managed" by SRM/vCenter? Well, I started digging and it turns out to be fairly straightforward. Keep in mind the disclaimer at the top though: this is not what the SRAs were intended for… this is purely for educational purposes and far from supported. Again the log files exposed a lot of details here, but I stripped them down to make this readable. This is the response from the SRA when SRM asked for details on which devices are available:

2012-01-09T12:14:53.583-08:00 [05388 verbose 'SraCommand' opID=7D6C5634-00000023] discoverDevices responded with:
--> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
--> <SourceDevice state="read-write" id="1-1">
--> <Name>fs14_T1_LUN1_BB005056AE32800000</Name>
--> <Identity>
--> <Wwn>60:06:04:8c:ab:b2:88:c0:59:40:72:24:1b:5f:77:72</Wwn>
--> </Identity>
--> <TargetDevice key="fs14_T1_LUN1_BB005056AE32800000_fs10_T1_LUN1_BB005056AE32820000"/>
--> </SourceDevice>
--> <SourceDevice state="read-write" id="1-2">
--> <Name>fs16_T1_LUN2_BB005056AE32800000</Name>
--> <Identity>
--> <Wwn>60:06:04:8c:b8:50:22:96:0c:0b:bf:d8:59:0b:a1:75</Wwn>
--> </Identity>
--> <TargetDevice key="fs16_T1_LUN2_BB005056AE32800000_fs12_T1_LUN3_BB005056AE32820000"/>
--> </SourceDevice>
--> </SourceDevices>

Now if you look at SRM and try to create a Protection Group you will quickly discover that only those datastores which have a VM hosted on them can be added. This is shown in the screenshot below.

As mentioned, SRM filters out the "irrelevant" LUNs; to me this LUN wasn't irrelevant, however. So what's next? I decided to initiate a fail-over and look at the log files. When the fail-over is initiated the following is issued by SRM; again, I stripped some details to make it more readable:

--> <FailoverParameters>
--> <ArrayId>BB005056AE32820000-server_2</ArrayId>
--> <AccessGroups>
--> <AccessGroup id="domain-c7">
--> <Initiator id="iqn.1998-01.com.vmware:localhost-11616041" type="iSCSI"/>
--> <Initiator id="iqn.1998-01.com.vmware:localhost-4a15366e" type="iSCSI"/>
--> <Initiator id="10.21.68.106" type="NFS"/>
--> <Initiator id="10.21.68.105" type="NFS"/>
--> </AccessGroup>
--> </AccessGroups>
--> <TargetDevices>
--> <TargetDevice key="fs14_T1_LUN1_BB005056AE32800000_fs10_T1_LUN1_BB005056AE32820000">
--> <AccessGroups>
--> <AccessGroup id="domain-c7"/>
--> </AccessGroups>
--> </TargetDevice>
--> </TargetDevices>
--> </FailoverParameters>

I guess we should be able to work with this! Using the "discoverDevices" information and combining it with the "Failover" information, I should be able to construct my own custom XML file. After creating this XML file I should be able to fail over any LUN which is part of the selected device… What is my plan? I am planning to change the following:

  • Initiator id
  • TargetDevice key

I wasn’t sure if I needed to change the AccessGroup so I figured I would just test it like this. I called the script as follows:

<path to perl>\bin\perl.exe command.pl < file.xml

I watched a whole bunch of messages pass by, looked at the Celerra once the fail-over command had completed, and noticed the following:

And of course within the “unmanaged” vCenter you can see it:

Successful fail-over of a LUN which wasn't part of an SRM Protection Group! Yes, when you replace the Initiator ID, even the masking is correctly configured. The only thing left would be either resignaturing the volume or mounting it. This of course depends on the OS owning the volume and the desired end result. All in all a nice little experiment… Once again, don't try this in your own environment, it is far from supported!

