5.0

Navigating your application landscape…

Duncan Epping · May 9, 2012 ·

I was on holiday for the last two weeks and am slowly catching up on everything that happened. Some of you might think it wasn’t a lot, but in the world of cloud and virtualization it was. Not only was there a huge EUC launch event, but a new version of vCenter Infrastructure Navigator was also released. Somehow it has been amazingly quiet around this product, which I don’t really understand, especially after reading the release notes of version 1.1 of vCenter Infrastructure Navigator. Two things stood out:

vCloud Director support
Infrastructure Navigator discovers VMware services, such as Site Recovery Manager (SRM) Server, VMware View Server, VMware vCloud Director Server, and VMware vShield Manager Server.

For those who don’t know, Infrastructure Navigator is an application awareness plugin for vCenter Server. It enables you to get a better understanding of what is running on top of your virtual infrastructure. A lot of you may say, well, why would I care? Think about DR for a second. What is the most challenging part of creating a DR plan? Indeed, figuring out all dependencies. That is exactly where vCenter Infrastructure Navigator comes into play, as shown in the screenshot below, which I stole from Ben Scheerer. Ben wrote an excellent blog post about some of the cool new features in vCenter Infrastructure Navigator. I am not going to repeat those here; just read his post. It is worth it if you are serious about providing the best service to your (internal) customers!

With vSphere 5.0 and HA can I share datastores across clusters?

Duncan Epping · Apr 30, 2012 ·

I have had this question multiple times by now, so I figured I would write a short blog post about it. The question is whether you can share datastores across clusters with vSphere 5.0 and HA enabled. It comes from the fact that HA has a new feature called “datastore heartbeating” and uses the datastore as a communication mechanism.

The answer is short and sweet: Yes.

For each cluster a folder is created. The folder structure is as follows:

/<root of datastore>/.vSphere-HA/<cluster-specific-directory>/

The “cluster-specific directory” name is based on the UUID of the vCenter Server, the MoID of the cluster, a random 8-character string and the name of the host running vCenter Server. So even if you use dozens of vCenter Servers, there is no need to worry about name collisions.
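To make the naming scheme a bit more concrete, here is a minimal Python sketch. The four ingredients come from the description above, but the order in which they are joined, the "-" separator, the example datastore path and the function names are all assumptions made for illustration; this is not how HA itself generates the name.

```python
import secrets
import string
from pathlib import PurePosixPath


def cluster_directory(vc_uuid: str, cluster_moid: str, vc_hostname: str) -> str:
    """Build a hypothetical cluster-specific directory name.

    The ingredients follow the post; the ordering and the "-" separator
    are assumptions, not the documented on-disk format.
    """
    alphabet = string.ascii_lowercase + string.digits
    # Random 8-character string, as mentioned in the post.
    random_part = "".join(secrets.choice(alphabet) for _ in range(8))
    return "-".join([vc_uuid, cluster_moid, random_part, vc_hostname])


def ha_folder(datastore_root: str, cluster_dir: str) -> PurePosixPath:
    """Return /<root of datastore>/.vSphere-HA/<cluster-specific-directory>."""
    return PurePosixPath(datastore_root) / ".vSphere-HA" / cluster_dir


# Two clusters (different MoIDs) sharing one datastore; all values are made up.
folder_a = ha_folder("/vmfs/volumes/shared-ds",
                     cluster_directory("vc-uuid-1", "domain-c7", "vc01"))
folder_b = ha_folder("/vmfs/volumes/shared-ds",
                     cluster_directory("vc-uuid-1", "domain-c9", "vc01"))
print(folder_a)
print(folder_b)
```

Both folders live under the same .vSphere-HA directory, but because the cluster MoID (and the random string) differ, each cluster ends up with its own folder.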

Each folder contains the files HA needs and uses, as shown in the screenshot below. So there is no need to worry about sharing datastores across clusters. Frank also wrote an article about this from a Storage DRS perspective. Make sure you read it!

PS: all these details can be found in our Clustering Deepdive book… find it on Amazon.

Scripts release for Storage vMotion / HA problem

Duncan Epping · Apr 17, 2012 ·

Last week when the Storage vMotion / HA problem went public I asked both William Lam and Alan Renouf if they could write a script to detect the problem. I want to thank both of them for their quick response and turnaround; they cranked the scripts out in literally hours. The scripts were validated multiple times in a VDS environment and worked flawlessly. Note that these scripts can detect the problem in environments using either a regular Distributed vSwitch or a Nexus 1000v, but they can only mitigate the problem in a Distributed vSwitch environment. Here are the links to the scripts:

Perl: Identifying & Fixing Virtual Machines Affected By SvMotion / VDS Issue (William Lam)
PowerCLI – Identifying and fixing VMs Affected By SvMotion / VDS Issue (Alan Renouf)

Once again thanks guys!

Limiting stress on storage caused by HA restarts by lowering restart concurrency?

Duncan Epping · Apr 16, 2012 ·

I had a question last week, and it had me going for a while. The question was whether “das.perHostConcurrentFailoversLimit” could be used to lower the hit on storage during a boot storm. By default this advanced option is set to 32, meaning that a maximum of 32 VMs will be restarted concurrently by HA on a single host. The question was whether lowering this value to, for instance, 16 would help reduce the stress on storage when multiple hosts fail, or for instance in a blade environment when a chassis fails.

At first you would probably say “Yes of course it will”. Having only 16 restarts concurrently vs 32 should cut the stress in half… Well not exactly. The point here is that this setting is:

A per host setting and not cluster wide
Addressing power on attempts

So what is the problem with that exactly? Well, in the case of the per-host setting, if you have a 32-node cluster and 8 hosts fail, there would still be a maximum of 384 concurrent VM power-on attempts: (32 - 8 failed hosts) * 16 restarts max per host. Yes, it is a lot better than the 768 you would get with the default of 32, but it is still a lot of VMs hitting your storage.
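The arithmetic is simple enough to put in a few lines of Python. The function name is made up for this sketch; the numbers are the ones used above.

```python
def max_concurrent_power_ons(total_hosts: int, failed_hosts: int,
                             per_host_limit: int) -> int:
    """Upper bound on concurrent HA power-on attempts across the cluster.

    The limit applies per surviving host, so it scales with cluster size.
    """
    surviving_hosts = total_hosts - failed_hosts
    return surviving_hosts * per_host_limit


# 32-node cluster, 8 failed hosts
print(max_concurrent_power_ons(32, 8, 16))  # lowered limit -> 384
print(max_concurrent_power_ons(32, 8, 32))  # default limit -> 768
```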

But more importantly, we are talking about power-on attempts here! A power-on attempt does not equal the boot process of the virtual machine! It is just the initial process that flips the switch of the VM from “off” to “on”. Check vCenter when you power on a VM and you will see the task reported as completed while the VM is still booting. Reducing this number will reduce the stress on hostd, but that is about it. In other words, if you lower it to 16 you will have fewer concurrent power-on attempts, but they will be handled quickly by hostd, and before you know it 16 new power-on attempts will be made, nearly simultaneously!
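A toy model helps to show why throttling power-on attempts barely changes the number of VMs booting at the same time. This is only a sketch and not how HA or hostd is implemented: the attempt duration (2 seconds), the boot duration (300 seconds) and the 64 VMs per host are assumed values, and attempts are modelled as neat back-to-back batches.

```python
def peak_concurrent_boots(vms: int, limit: int,
                          attempt_s: int, boot_s: int) -> int:
    """Peak number of VMs booting at once on one host (toy model).

    Power-on attempts run in batches of `limit`; each batch takes
    `attempt_s` seconds, after which its VMs boot for `boot_s` seconds.
    """
    events = []
    for i in range(vms):
        batch = i // limit
        boot_start = (batch + 1) * attempt_s  # boot starts when the attempt completes
        events.append((boot_start, 1))
        events.append((boot_start + boot_s, -1))
    # At equal timestamps, process boot completions (-1) before boot starts (+1).
    events.sort()
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Assumed values: 64 VMs to restart on a host, 2 s per attempt, 300 s per boot.
print(peak_concurrent_boots(64, 16, 2, 300))
print(peak_concurrent_boots(64, 32, 2, 300))
```

With these assumptions both limits give the same peak, because every power-on attempt completes long before the first VM finishes booting. The limit only spreads the attempts over a few extra seconds.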

The only way you can really limit the hit on storage, and on the virtual machines sharing this storage, would be by enabling Storage IO Control. SIOC will ensure that all VMs that need storage resources get them in a fair manner. The other option is to ensure that you are not overloading your datastores with a massive number of VMs without the IOPS to back up a boot storm. I guess there is no real need to be overly concerned here though… How often does it happen that 50% of your environment fails? If it does, are you worried about that 15-minute performance hit, or about those 50% of the VMs being down?

Clarifying the SvMotion / VDS problem

Duncan Epping · Apr 14, 2012 ·

<Update>I asked William Lam if he could write a script to detect this problem and possibly even mitigate it. William worked on it over the weekend and just posted the result! Head over to his blog for the script! Thanks William for cranking it out this quickly! For those who prefer PowerCLI… Alan Renouf just posted his version of the script! Both scripts provide the same functionality!</Update>

I think there is some confusion around the SvMotion / VDS problem I described a couple of days back. Let me try to clarify it in a couple of simple steps.

First of all, this only applies to virtual machines that have been Storage vMotioned by vCenter 5.0 and are connected to a Distributed vSwitch. The Storage vMotion could have been initiated either manually or by Storage DRS. So what is the exact problem?

When a VM is attached to a dvPortgroup it is connected to a port. This information is stored locally on the host and on the VMFS volume this VM is stored on.
This volume will contain a file which is named after the port number of this VM.
When the VM is Storage vMotioned to a different datastore this file is not created on the destination datastore.
When the host on which the Storage vMotioned VM resides fails, HA will attempt to restart that VM.
In order for HA to restart it and connect it to the dvPortgroup this file is required.
As the file is not available the restart fails.
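The failure condition in the steps above boils down to a simple check: does the datastore a VM currently lives on contain a file named after its port number? The Python sketch below models only that logic on made-up, in-memory data. The class, the function and the sample values are hypothetical, and no vSphere API is called; for real environments use the scripts by William and Alan.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    datastore: str  # datastore the VM currently resides on
    dv_port: str    # port number on the dvPortgroup


def vms_missing_port_file(vms, port_files_by_datastore):
    """Return names of VMs whose current datastore lacks their port file."""
    return [
        vm.name
        for vm in vms
        if vm.dv_port not in port_files_by_datastore.get(vm.datastore, set())
    ]


# Hypothetical inventory: web01 was Storage vMotioned from ds1 to ds2,
# so its port file "1024" still only exists on ds1.
vms = [
    VM(name="web01", datastore="ds2", dv_port="1024"),
    VM(name="db01", datastore="ds1", dv_port="1025"),
]
port_files = {
    "ds1": {"1024", "1025"},
    "ds2": set(),
}
print(vms_missing_port_file(vms, port_files))  # ['web01']
```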

You can resolve this simply by temporarily connecting the impacted VMs to a different dvPortgroup and then reconnecting them to the original portgroup. As soon as you’ve done that the file will be created on the datastore. For now this is a manual task, but I am sure some of my team members are working on a scripted solution as we speak… right Alan / William? 🙂