Automating vCloud Director Resiliency whitepaper released

About a year ago I wrote a whitepaper about vCloud Director resiliency, or better said, I developed a disaster recovery solution for vCloud Director. This solution allows you to fail over vCloud Director workloads between sites in the case of a failure. Immediately after it was published, various projects started to implement this solution. As part of our internal project, our PowerCLI gurus Aidan Dalgleish and Alan Renouf started looking into automating the solution. Those who read the initial case study have probably seen the manual steps required for a fail-over; those who haven’t should read that white paper first.

The manual steps in the vCloud Director Resiliency whitepaper are exactly what Alan and Aidan addressed. So if you are interested in implementing this solution, it is useful to read the new white paper about Automating vCloud Director Resiliency as well. Nice work Alan and Aidan!

VMware View Infrastructure Resiliency whitepaper published

One of the white papers I worked on in 2012, when I was part of Technical Marketing, has just been published. This white paper is about VMware View infrastructure resiliency. It addresses a common question from customers, and with this white paper you can explore the different options and understand the impact of each. Below is a link to the paper and the description it has on the VMware website.

VMware View Infrastructure Resiliency: VMware View 5 and VMware vCenter Site Recovery Manager
“This case study provides insight and information on how to increase availability and recoverability of a VMware View infrastructure using VMware vCenter Site Recovery Manager (SRM), common disaster recovery (DR) tools and methodologies, and vSphere High Availability.”

I want to thank Simon Richardson, Kris Boyd, Matt Coppinger and John Dodge for working with me on this paper. Glad it is finally available!

Demo time – vCloud Director 5.1 disaster recovery demo

When I was playing with the new vCloud Director 5.1 and Site Recovery Manager 5.1, I figured I would record a demo of the DR solution that Chris Colotti and I developed. The demo is fairly straightforward and will hopefully help you in the process of building a resilient cloud infrastructure. In this demo I have included:

  • vSphere 5.1
    • vSphere Replication
  • vCloud Director 5.1
  • Site Recovery Manager 5.1

Update: VMware vCloud Director DR paper available in Kindle / iBooks format!

I just received a note that the DR paper for vCloud Director is finally available in both epub and mobi format. So if you have an e-reader, make sure to download one of these formats as they will render a lot better than a generic PDF!

Description: vCloud Director disaster recovery can be achieved through various scenarios and configurations. This case study focuses on a single scenario as a simple explanation of the concept, which can then easily be adapted and applied to other scenarios. In this case study it is shown how vSphere 5.0, vCloud Director 1.5 and Site Recovery Manager 5.0 can be implemented to enable recoverability after a disaster.

Download:
http://www.vmware.com/files/pdf/techpaper/vcloud-director-infrastructure-resiliency.pdf
http://www.vmware.com/files/pdf/techpaper/vcloud-director-infrastructure-resiliency.epub
http://www.vmware.com/files/pdf/techpaper/vcloud-director-infrastructure-resiliency.mobi

DR of View persistent linked clone desktops…

I know some of you have been waiting for this, so I wanted to share some early results. I was in the UK last week and we managed to get an environment configured using persistent linked clone virtual desktops with View. We also managed to fail over and fail back desktops between two datacenters. The concept is really similar to the vCloud Director DR concept.

In this scenario Site Recovery Manager is leveraged to fail over all View management components. Each site is required to have a management vCenter Server and an SRM Server, which aligns with standard SRM design concepts. Since it is difficult to use SRM for View persistent desktops, there is no requirement to have an SRM environment connecting to the View desktop cluster’s vCenter Server. In order to facilitate a fail-over of the View desktops, a simple mount of the volume is done. This could be done using ‘esxcfg-volume -m’ for VMFS, or by using a DNS CNAME and mounting the NFS share after pointing the alias to the secondary NAS server.
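To make the NFS option a bit more concrete, below is a minimal sketch of what the remount could look like from the ESXi shell once the DNS alias has been repointed. The alias, share path and datastore name are made up for illustration only, so adjust them to your own environment.

# Hedged sketch of the NFS variant: the desktop datastore is mounted through
# a DNS alias (example: nas-views.corp.local) rather than the array hostname.
# After the CNAME has been repointed to the secondary NAS server, mount the
# share on each host in the View Desktop Cluster using that same alias so the
# datastore identity stays identical on both sites.
esxcli storage nfs add --host=nas-views.corp.local --share=/vol/view_desktops --volume-name=view_desktops_01

# Confirm the datastore is mounted and accessible:
esxcli storage nfs list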

What would the architecture look like? This is an oversimplified architecture, of course … but I just want to get the message across:

What would the steps be?

  1. Fail over the View management environment using SRM
  2. Validate that all View management virtual machines are powered on
  3. Using your storage management utility, break replication for the datastores connected to the View Desktop Cluster and make the datastores read/write (if required by the storage platform)
  4. Mask the datastores to the recovery site (if required by the storage platform)
  5. Using ESXi command line tools, mount the volumes of the View Desktop Cluster on each host of the cluster (see the sketch after this list)
    • esxcfg-volume -m <volume ID>
      or
    • point the DNS CNAME to the secondary NAS server and mount the NAS datastores
  6. Validate that all volumes are available and visible in vCenter; if not, rescan/refresh the storage
  7. Take the hosts of the View Desktop Cluster out of maintenance mode (or add the hosts to your cluster, depending on the chosen strategy)
  8. In our tests the virtual desktops were automatically powered on by vSphere HA. vSphere HA is aware of the state before the fail-over and will power on the virtual machines according to the last known state
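For the VMFS flavor, steps 5 through 7 could look roughly like the sketch below, run from the ESXi shell on each host in the View Desktop Cluster. Treat it as a sketch only: the volume label is an example and the exact flow depends on your storage platform.

# After replication has been broken and the LUNs are presented to the recovery
# site, rescan so the host detects the replicated devices:
esxcli storage core adapter rescan --all

# List VMFS volumes that are detected as snapshots/replicas:
esxcfg-volume -l

# Mount the replicated volume without resignaturing so it keeps its identity
# (use -M instead of -m if the mount should persist across reboots):
esxcfg-volume -m view_desktops_01

# Take the host out of maintenance mode so vSphere HA can power on the desktops:
vim-cmd hostsvc/maintenance_mode_exit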

These steps were validated this week and we managed to successfully fail over our desktops and fail them back. Keep in mind that we only ran these tests two or three times, so don’t consider this article to be a support statement. We used persistent linked clones as that was the request we had at that point, but we are certain this will work for various other scenarios as well. We will extend our testing to include them.

Cool right!?

VMware vCloud Director Infrastructure Resiliency Case Study paper published!

Yesterday the paper that Chris Colotti and I have been working on, titled “VMware vCloud Director Infrastructure Resiliency Case Study”, was finally published. This white paper is an expansion of the blog post I published a couple of weeks back.

Someone asked me at PEX where this solution suddenly came from. Well, it is based on a solution I came up with on a random Friday morning halfway through December, when I woke up at 05:00 in Palo Alto, still jet-lagged. I diagrammed it on a napkin and started scribbling things down in Evernote. I explained the concept to Chris over breakfast and that is how it started. Over the last two months Chris (and his team) and I validated the solution, and this is the outcome. I want to thank Chris and his team for their hard work and dedication.

I hope that those architecting / implementing DR solutions for vCloud environments will benefit from this white paper. If there are any questions feel free to leave a comment.

Source - VMware vCloud Director Infrastructure Resiliency Case Study

Description: vCloud Director disaster recovery can be achieved through various scenarios and configurations. This case study focuses on a single scenario as a simple explanation of the concept, which can then easily be adapted and applied to other scenarios. In this case study it is shown how vSphere 5.0, vCloud Director 1.5 and Site Recovery Manager 5.0 can be implemented to enable recoverability after a disaster.

Download:
http://www.vmware.com/files/pdf/techpaper/vcloud-director-infrastructure-resiliency.pdf
http://www.vmware.com/files/pdf/techpaper/vcloud-director-infrastructure-resiliency.epub
http://www.vmware.com/files/pdf/techpaper/vcloud-director-infrastructure-resiliency.mobi

I am expecting that the MOBI and EPUB versions will also be available soon. When they are, I will let you know!

Digging deeper into the VDS construct

The following comment was made on my VDS blog post, and I figured I would investigate this a bit further:

It seems like the ESXi host only tries to sync the vDS state with the storage at boot and never again afterward. You would think that it would keep trying, but it does not.

Now let’s look at the “basics” first. When an ESXi host boots, it gets the data required to recreate the VDS structure locally by reading /etc/vmware/dvsdata.db and esx.conf. You can view the dvsdata.db file yourself by running:

net-dvs -f /etc/vmware/dvsdata.db

But is that all that is used? If you check the output of that file you will see that all data required for a VDS configuration to work is actually stored in there, so what about those files stored on a VMFS volume?

Each VMFS volume that holds the working directory (the place where the .vmx is stored) of at least one virtual machine connected to a VDS will have the following folder:

drwxr-xr-x    1 root     root          420 Feb  8 12:33 .dvsData

If you go into this folder you will see another folder. The name of this folder appears to be some sort of unique identifier, and when comparing the string to the output of “net-dvs” it turns out to be the identifier of the dvSwitch that was created.

drwxr-xr-x    1 root     root         1.5k Feb  8 12:47 6d 8b 2e 50 3c d3 50 4a-ad dd b5 30 2f b1 0c aa
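If you want to check this yourself, a quick way, assuming default paths, is to compare that folder name with the switch identifier net-dvs reports; the datastore name below is just an example.

# Dump the VDS state and pick out the switch identifier line(s):
net-dvs | grep -i switch

# Compare that identifier with the folder name under .dvsData on a datastore
# that holds working directories of VDS-connected virtual machines:
ls /vmfs/volumes/datastore01/.dvsData/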

Within this folder you will find a collection of files:

-rw------- 1 root root 3.0k Feb 9 09:00 106
-rw------- 1 root root 3.0k Feb 9 09:02 136
-rw------- 1 root root 3.0k Feb 9 09:00 138
-rw------- 1 root root 3.0k Feb 9 09:05 152
-rw------- 1 root root 3.0k Feb 9 09:00 153
-rw------- 1 root root 3.0k Feb 9 09:05 156
-rw------- 1 root root 3.0k Feb 9 09:05 159
-rw------- 1 root root 3.0k Feb 9 09:00 160
-rw------- 1 root root 3.0k Feb 9 09:00 161

It is no coincidence that these files are “numbers” and that these numbers resemble the port IDs of the virtual machines stored on this volume. This is the port information of the virtual machines which have their working directory on this particular datastore. This port info is also what HA uses when it needs to restart a virtual machine that uses a dvPort. Let me emphasize that: this is what HA uses when it needs to restart a virtual machine! Is that all?
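One way to verify that mapping yourself, assuming the virtual machines’ working directories sit on this datastore and that their .vmx files contain the usual ethernetN.dvs.portId entries (an assumption on my side, not something documented here), is a simple grep:

# Hedged sketch: list the dvPort IDs claimed by the .vmx files on this datastore
# and compare them with the numbered files in the .dvsData folder.
# "datastore01" is an example name.
grep "dvs.portId" /vmfs/volumes/datastore01/*/*.vmx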

Well, I am not sure. When I tested the original question I powered on the host without access to the storage system and only powered on my storage system once the host was fully booted. I did not get this confirmed, but it seems to me that access to the datastore holding these files is somehow required during the boot process of your host, in the case of “static port bindings” that is. (Port bindings are described in more depth here.)

Does this imply that if your storage is not available during the boot process, virtual machines cannot connect to the network when they are powered on? Yes, that is correct. I tested it, and when you have a full power outage and your hosts come up before your storage, you will have a “challenge”. As soon as the storage is restored you will probably want to restart your virtual machines, but if you do, they will not get a network connection. I’ve tested this six or seven times in total and not once did I get a connection.

One workaround is to simply reboot your ESXi hosts: after a reboot the problem is solved and your virtual machines can be powered on and will get access to the network. Rebooting a host can be a painfully slow exercise though, as I noticed during my test runs in the lab. Fortunately there is a much simpler workaround: restarting the management agents! Before you power on your virtual machines, and after your storage connection has been restored, run the following from the ESXi shell:

services.sh restart

After the services have been restarted you can power on your virtual machines and the network connection will be restored!
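Putting it all together, the recovery sequence after a full power outage could look something like the sketch below. The VM ID (42) is just an example taken from the getallvms output; there are of course other ways to power the virtual machines back on.

# 1. Once the storage is back, make sure the host sees its datastores again:
esxcli storage core adapter rescan --all

# 2. Restart the management agents so the VDS port state is re-synced:
services.sh restart

# 3. Look up the VM IDs and power the virtual machines back on:
vim-cmd vmsvc/getallvms
vim-cmd vmsvc/power.on 42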

On a side note: there was one question on my article about the auto-expand property of static port groups, whether it is officially supported and where it is documented. Yes, it is fully supported. There is a KB article about how to enable it, and William Lam recently blogged about it here. That is it for now on VDS…