Yellow Bricks

by Duncan Epping


vSphere HA in 5.0 constantly pinging my gateway?

Duncan Epping · Jun 26, 2012 ·

I had this question today and noticed someone also dropped it on the community forums. The question was whether vSphere HA constantly pings the default gateway or not. I knew HA would ping the gateway on a regular basis as of vSphere 5.0, and more frequently if a ping failed, but I wasn’t sure about the timing. I pointed Marc Sevigny from the HA engineering team to the thread on the community forums and he added some nice juicy details to it. I figured I would share them with you.

First of all, each ESXi host in a 5.x cluster will ping the isolation address every 5 minutes (300 seconds). Could this flood the isolation device?

There should be no “flood” of ICMP messages, and it should have little impact on network performance. The ICMP packet is 53 bytes long and is sent once every 5 seconds from each of the HA hosts until the address(es) become pingable once again, at which point it returns to pinging once every 5 minutes.

If your default gateway is never pingable because of your firewall, you should open up the ports needed by HA. It is also possible to disable the isolation address monitoring of the default gateway by using an advanced option (das.useDefaultIsolationAddress = false). It is recommended to specify a different isolation address (das.isolationaddress0) when the default gateway is a non-pingable device. Note that it is highly recommended to use a device as the default gateway which is as few hops removed from your hosts as possible!
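
If you manage these advanced options with PowerCLI, a minimal sketch could look like the following. This is just an illustration: “Cluster01” and the isolation address are placeholders, so substitute your own values.

# “Cluster01” and “192.168.1.1” are placeholders; assumes an active Connect-VIServer session
$cluster = Get-Cluster -Name "Cluster01"
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.useDefaultIsolationAddress" -Value "false" -Confirm:$false
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.isolationaddress0" -Value "192.168.1.1" -Confirm:$false

Keep in mind that HA typically needs to be reconfigured on the cluster before changed advanced options take effect.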

With vSphere 5.0 and HA can I share datastores across clusters?

Duncan Epping · Apr 30, 2012 ·

I have had this question multiple times by now, so I figured I would write a short blog post about it. The question is whether you can share datastores across clusters with vSphere 5.0 and HA enabled. This question stems from the fact that HA has a new feature called “datastore heartbeating” and uses the datastore as a communication mechanism.

The answer is short and sweet: Yes.

For each cluster a folder is created. The folder structure is as follows:

/<root of datastore>/.vSphere-HA/<cluster-specific-directory>/

The “cluster specific directory” is based on the UUID of the vCenter Server, the MoID of the cluster, a random 8-character string, and the name of the host running vCenter Server. So even if you use dozens of vCenter Servers, there is no need to worry about directory name collisions.

Each folder contains the files HA needs and uses. So there is no need to worry about sharing datastores across clusters. Frank also wrote an article about this from a Storage DRS perspective. Make sure you read it!
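
If you are curious which datastores HA actually picked for heartbeating in your cluster, a minimal PowerCLI sketch using the vSphere API could look like this (“Cluster01” is a placeholder):

# “Cluster01” is a placeholder; assumes an active Connect-VIServer session
$cluster = Get-Cluster -Name "Cluster01"
$runtime = $cluster.ExtensionData.RetrieveDasAdvancedRuntimeInfo()
$runtime.HeartbeatDatastoreInfo | ForEach-Object { (Get-View $_.Datastore).Name }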

PS: all these details can be found in our Clustering Deepdive book… find it on Amazon.

Scripts released for Storage vMotion / HA problem

Duncan Epping · Apr 17, 2012 ·

Last week, when the Storage vMotion / HA problem went public, I asked both William Lam and Alan Renouf if they could write a script to detect the problem. I want to thank both of them for their quick response and turnaround; they cranked the scripts out in literally hours. The scripts were validated multiple times in a VDS environment and worked flawlessly. Note that these scripts can detect the problem in an environment using either a regular Distributed vSwitch or a Nexus 1000v, but they can only mitigate the problem in a Distributed vSwitch environment. Here are the links to the scripts:

  • Perl: Identifying & Fixing Virtual Machines Affected By SvMotion / VDS Issue (William Lam)
  • PowerCLI – Identifying and fixing VMs Affected By SvMotion / VDS Issue (Alan Renouf)

Once again thanks guys!

Digging deeper into the VDS construct

Duncan Epping · Feb 23, 2012 ·

The following comment was made on my VDS blog post, and I figured I would investigate this a bit further:

It seems like the ESXi host only tries to sync the vDS state with the storage at boot and never again afterward. You would think that it would keep trying, but it does not.

Now let’s look at the “basics” first. When an ESXi host boots, it will get the data required to recreate the VDS structure locally by reading /etc/vmware/dvsdata.db and esx.conf. You can view the dvsdata.db file yourself by running:

net-dvs -f /etc/vmware/dvsdata.db

But is that all that is used? If you check the output of that command, you will see that all the data required for a VDS configuration to work is actually stored in there. So what about those files stored on a VMFS volume?

Each VMFS volume that holds the working directory (the place where the .vmx is stored) of at least one virtual machine connected to a VDS will have the following folder:

drwxr-xr-x    1 root     root          420 Feb  8 12:33 .dvsData

If you go to this folder you will see another folder. This folder’s name appears to be some sort of unique identifier, and when comparing the string to the output of “net-dvs” it appears to be the identifier of the dvSwitch that was created.

drwxr-xr-x    1 root     root         1.5k Feb  8 12:47 6d 8b 2e 50 3c d3 50 4a-ad dd b5 30 2f b1 0c aa

Within this folder you will find a collection of files:

-rw------- 1 root root 3.0k Feb 9 09:00 106
-rw------- 1 root root 3.0k Feb 9 09:02 136
-rw------- 1 root root 3.0k Feb 9 09:00 138
-rw------- 1 root root 3.0k Feb 9 09:05 152
-rw------- 1 root root 3.0k Feb 9 09:00 153
-rw------- 1 root root 3.0k Feb 9 09:05 156
-rw------- 1 root root 3.0k Feb 9 09:05 159
-rw------- 1 root root 3.0k Feb 9 09:00 160
-rw------- 1 root root 3.0k Feb 9 09:00 161

It is no coincidence that these files are “numbers” and that these numbers resemble the port IDs of the virtual machines stored on this volume. This is the port information of the virtual machines which have their working directory on this particular datastore. This port info is also what HA uses when it needs to restart a virtual machine that uses a dvPort. Let me emphasize that: this is what HA uses when it needs to restart a virtual machine!
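
You can verify this yourself: the file names should line up with the dvPort keys used by the vNICs of the virtual machines on that datastore. A minimal PowerCLI sketch (“vm01” is a placeholder) could look like this:

# “vm01” is a placeholder; prints the dvPort key(s) backing the VM’s network adapters
$vm = Get-VM -Name "vm01"
$vm.ExtensionData.Config.Hardware.Device |
    Where-Object { $_.Backing -is [VMware.Vim.VirtualEthernetCardDistributedVirtualPortBackingInfo] } |
    ForEach-Object { $_.Backing.Port.PortKey }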

Is that all? Well, I am not sure. When I tested the original question, I powered on the host without access to the storage system and only powered on my storage system once the host was fully booted. I did not get this confirmed, but it seems to me that access to the datastore holding these files is somehow required during the boot process of your host, in the case of “static port bindings” that is. (Port bindings are described in more depth here.)

Does this imply that if your storage is not available during the boot process, virtual machines cannot connect to the network when they are powered on? Yes, that is correct. I tested it, and when you have a full power outage and your hosts come up before your storage, you will have a “challenge”. As soon as the storage is restored you will probably want to restart your virtual machines, but if you do, you will not get a network connection. I’ve tested this 6 or 7 times in total and not once did I get a connection.

One workaround is to simply reboot your ESXi hosts: after a reboot the problem is solved, and your virtual machines can be powered on and will get access to the network. Rebooting a host can be a painfully slow exercise though, as I noticed during my test runs in my lab. Fortunately there is a much quicker workaround: restarting the management agents! After your storage connection has been restored, and before you power on your virtual machines, run the following from the ESXi shell:

services.sh restart

After the services have been restarted you can power-on your virtual machines and network connection will be restored!

Side note: on my article there was one question about the auto-expand property of static portgroups, whether it is officially supported, and where it is documented. Yes, it is fully supported. There’s a KB article about how to enable it, and William Lam recently blogged about it here; a sketch of what that could look like in PowerCLI is shown below. That is it for now on VDS…
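
For completeness, here is a rough sketch of enabling auto-expand from PowerCLI through the vSphere API; “dvPortgroup01” is a placeholder, and William’s post and the KB article remain the authoritative references:

# “dvPortgroup01” is a placeholder portgroup name; requires PowerCLI with the VDS cmdlets
$pg = Get-VDPortgroup -Name "dvPortgroup01"
$spec = New-Object VMware.Vim.DVPortgroupConfigSpec
$spec.ConfigVersion = $pg.ExtensionData.Config.ConfigVersion
$spec.AutoExpand = $true
$pg.ExtensionData.ReconfigureDVPortgroup($spec)   # apply the change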

No Jumbo frames on your Management Network! (Updated!)

Duncan Epping · Jan 20, 2012 ·

I was just reading some of the comments posted today and Marc Sevigny, one of the vSphere HA developers, pointed out something which I did not know. I figured this is probably something that many are not aware of so I copied and pasted his comment:

Another thing to check if you experience this error is to see if you have jumbo frames enabled on the management network, since this interferes with HA communication.

This is documented here in a note: http://kb.vmware.com/kb/2006729

To make it crystal clear: disable jumbo frames on your management network with vSphere 5.0 as there’s a problem with it! This problem is currently being investigated by the HA engineering team and will hopefully be resolved.

<Update> Just received an email that all the cases where we thought vSphere HA issues were caused by Jumbo Frames being enabled were actually caused by Jumbo Frames not being configured correctly end-to-end. Please validate the Jumbo Frame configuration at all levels when configuring (physical switches, vSwitch, portgroup, VMkernel, etc.).</Update>
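
A quick way to eyeball the vSwitch and VMkernel layers is one PowerCLI line per layer, as sketched below; note that the physical switch configuration still needs to be verified separately:

# List the MTU of every vSwitch and every VMkernel interface per host
Get-VMHost | Get-VirtualSwitch | Select-Object VMHost, Name, Mtu
Get-VMHost | Get-VMHostNetworkAdapter -VMKernel | Select-Object VMHost, Name, Mtu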

