ha

Testing your infrastructure!

Duncan Epping · Jul 16, 2013 ·

Last week I was helping someone on the VMTN community forums. They were hitting what appeared to be strange HA behavior. After some standard questions this person told me that all VMs were powered down after a network outage. Sounds like a familiar problem? Yes I can hear most of you think: Isolation response set to “power off” and no proper network redundancy?

Well yes and no. They had the isolation response indeed configured to “power off” all VMs when the host is isolated. They did however have proper network redundancy, so how on earth did this happen? With 2 physical NICs and 2 physical switches and only 1 being impacted this should not have happened right?!?

Wrong! In this case the fail-over from a “vmkernel” perspective worked fine. The first “path” went down, so the second was used for this management vmkernel. All VMs were up and running until this point, and they remained running until… network connection was restored and the vmnic returned to the original physical NIC. Meaning that the mac address that showed up on port 1 popped up on port 2 and then went back to 1 again. The switch was not impressed and went through the spanning tree process and traffic was blocked instantly as a result of it. Now when traffic is blocked bad things can happen, especially when you configure HA to “power off” VMs. Basically what caused this issue to happen was the fact the spanning tree was not set to the recommended “port fast”, more details here.

I knew instantly that this was the reason for this problem, not because I know stuff about HA but because I had seen this many times in the past while testing environments I configured and designed. Not just testing after implementing a new infrastructure, but also testing after making changes to an infrastructure or introducing a new version / feature. I guess this kind of comes back to the “disaster” scenario as well, test it if you want to know if it works as expected. Just a simple example, I want to introduce QoS for my vMotion network and make changes to my physical network. Now what? How do I test these changes? How many times do I run through my test scenarios? What kind of “problems” do I introduce during my tests?

So I guess by now some might wonder why on earth I brought this up… well the problem above could have been prevented by simply testing the infrastructure when implemented and after changes have been introduced, and maybe even on a regular basis. If HA / Networking was tested properly, those VMs would not have been powered off…

Unmounting datastore fails due to vSphere HA?

Duncan Epping · Jul 5, 2013 ·

On the VMware Community Forums someone reported he was having issues unmounting datastores when vSphere HA was enabled. Internally I contacted various folks to see what was going on. The error that this customer was hitting was the following:

The vSphere HA agent on host '<hostname>' failed to quiesce file activity on datastore '/vmfs/volumes/<volume id>'

After some emails back and forth with Support and Engineering (awesome to work with such a team by the way!) the issue was discovered and it seems that in two separate instances issues were resolved that had to do with unmounting of datastores. Keith Farkas explained on the forums how you can figure out if you are hitting those exact problems or not and in which release they are fixed, but at I realize those kind of threads are difficult to find I figured I would post it here for future reference:

You can determine if you are encountering this issue by searching the VC log files. Find the task corresponding to the unmount request, and see if the follow error message is logged during the task’s execution (Fixed in 5.1 U1a) :

2012-09-28T11:24:08.707Z [7F7728EC5700 error 'DAS'] [VpxdDas::SetDatastoreDisabledForHACallback] Failed to disable datastore /vmfs/volumes/505dc9ea-2f199983-764a-001b7858bddc on host [vim.HostSystem:host-30,10.112.28.11]: N3Csi5Fault16NotAuthenticated9ExceptionE(csi.fault.NotAuthenticated)

While we are on the subject, I’ll also mention that there is another know issue in VC 5.0 that was fixed in VC5.0U1 (the fix is in VC 5.1 too). This issue related to unmounting a force mounted VMFS datastore. You can determine whether you are hitting this error by again checking the VC log files. If you see an error message such as the following with VC 5.0, then you may be hitting this problem. A work around, like above, is to disable HA while you unmount the datastore.

2011-11-29T07:20:17.108-08:00 [04528 info 'Default' opID=19B77743-00000A40] [VpxLRO] -- ERROR task-396 -- host-384 -- vim.host.StorageSystem.unmountForceMountedVmfsVolume: vim.fault.PlatformConfigFault:

CPU Affinity and vSphere HA

Duncan Epping · Jun 27, 2013 ·

On the VMware Community Forums someone asked today if CPU Affinity and vSphere HA worked in conjunction and if it was supported. To be fair I never tested this scenario, but I was certain it was supported and would work… Never hurts to validate though before you answer a question like that. I connected to my lab and disabled a VM for DRS so I could enable CPU affinity. I pinned the CPUs down to core 0 and 1 as shown in the screenshot below:

After pinning the vCPUs to a set of logical CPUs I powered on the VM. The result was, as expected, a “Protected” virtual machine as shown in the screenshot below.

But would it get restarted if anything happened to the host? Yes it would, and I tested this of course. I switched the server off which was running this virtual machine and within a minute vSphere HA restarted the virtual machine on one of the other hosts in the cluster. So there you have it, CPU Affinity and vSphere HA work fine.

PS: Would I ever recommend using CPU Affinity? No I would not!

Reminder: Free Kindle copy of vSphere 5 and 4.1 Clustering Deepdive

Duncan Epping · Jun 5, 2013 ·

Just a reminder, if you want your free Kindle copy of the vSphere 5.0 Clustering Deepdive or the vSphere 4.1 HA and DRS Deepdive, make sure to check Amazon (US Kindle Store) today and tomorrow (Wednesday June the 5th and Thursday June the 6th)! You can download the Kindle copy of both these books for free, yes that is correct ZERO dollars.

So make sure you pick it up either before the promo expires…

** Note I have linked the US Kindle stores, it is also available for free in local Kindle stores, just do search! **

Free Kindle copy of vSphere 5.0 Clustering Deepdive?

Duncan Epping · May 28, 2013 ·

Do you want a free Kindle copy of the vSphere 5.0 Clustering Deepdive or the vSphere 4.1 HA and DRS Deepdive? Well make sure to check Amazon next week! I just put both of the books up for a promotional offer… For 48 hours, Wednesday June the 5th and Thursday June the 6th, you can download the Kindle (US Kindle Store) copy of both these books for free, yes that is correct ZERO dollars.

So make sure you pick it up either Wednesday June the 5th or Thursday June the 6th, it might be the only time this year it is on promo.