I had a question last week around HA’s datastore heartbeating, the question was if datastore heartbeating still worked if you only have 1 datastore in your environment. I can understand where the question comes from as HA throws this error that you need to have 2 datastores at a minimum for HA datastore heartbeating to function correctly. I want to point out that even though HA says that 2 datastores is the minimum, even when only one datastore is available it will be used for heartbeat purposes. Yes this error will be there on your cluster, and yes you can suppress it using “das.ignoreInsufficientHbDatastore“. I figured others might be hitting the same error and have the same question so why not document it?!
I have been digging for a long time now to figure out what the minimum bandwidth requirements are per concurrent vMotion. After a long time I finally managed to get a statement. In the past the statement was made that 622Mbps was the minimum required bandwidth for vMotion, it appears that this is incorrect for vSphere 5.0 and higher. With vSphere 5.0 a new feature called Stun During Page Send (SDPS) was introduced and this has decreased the bandwidth requirements from 622Mpbs down to 250Mbps per concurrent vMotion.
Always nice to know right?!
This question around adding different tiers of storage in a single Storage DRS datastore cluster keeps popping up every once in a while. I can understand where it is coming from as one would think that VM Storage Profiles combined with Storage DRS would allow you to have all types of tiers in one cluster, but then balance within that “tier” within that pool.
Truth is that that does not work with vSphere 5.1 and lower unfortunately. Storage DRS and VM Storage Profiles (Profile Driven Storage) are not tightly integrated. Meaning that when you provision a virtual machine in to a datastore cluster and Storage DRS needs to rebalance the cluster at one point, it will consider ANY datastore within that datastore cluster as a possible placement destination. Yes I agree, it is not what you hoped for… it is – what it is. (feature request filed) Frank visualized this nicely in his article a while back:
So when you architect your datastore clusters, there are a couple of things you will need to keep in mind. These are the design rules at a minimum, that is if you ask me:
- LUNs of the same storage tier
- See above
- More LUNs = more balancing options
- Do note size matters, a single LUN will need to be able to fit your largest VM!
- Preferably LUNs of the same array (so VAAI offload works properly)
- VAAI XCOPY (used by SvMotion for instance) doesn’t work when going from Array-A to Array-B
- When replication is used, LUNs that are part of the same consistency group
- You will want to make sure that VMs that need to be consistent from a replication perspective are not moved to a LUN that is outside of the consistency group
- Similar availability characteristics and performance characteristics
- You don’t want potential performance or availability to degrade when a VM is moved
Hope this helps,
Last week I was helping someone on the VMTN community forums. They were hitting what appeared to be strange HA behavior. After some standard questions this person told me that all VMs were powered down after a network outage. Sounds like a familiar problem? Yes I can hear most of you think: Isolation response set to “power off” and no proper network redundancy?
Well yes and no. They had the isolation response indeed configured to “power off” all VMs when the host is isolated. They did however have proper network redundancy, so how on earth did this happen? With 2 physical NICs and 2 physical switches and only 1 being impacted this should not have happened right?!?
Wrong! In this case the fail-over from a “vmkernel” perspective worked fine. The first “path” went down, so the second was used for this management vmkernel. All VMs were up and running until this point, and they remained running until… network connection was restored and the vmnic returned to the original physical NIC. Meaning that the mac address that showed up on port 1 popped up on port 2 and then went back to 1 again. The switch was not impressed and went through the spanning tree process and traffic was blocked instantly as a result of it. Now when traffic is blocked bad things can happen, especially when you configure HA to “power off” VMs. Basically what caused this issue to happen was the fact the spanning tree was not set to the recommended “port fast”, more details here.
I knew instantly that this was the reason for this problem, not because I know stuff about HA but because I had seen this many times in the past while testing environments I configured and designed. Not just testing after implementing a new infrastructure, but also testing after making changes to an infrastructure or introducing a new version / feature. I guess this kind of comes back to the “disaster” scenario as well, test it if you want to know if it works as expected. Just a simple example, I want to introduce QoS for my vMotion network and make changes to my physical network. Now what? How do I test these changes? How many times do I run through my test scenarios? What kind of “problems” do I introduce during my tests?
So I guess by now some might wonder why on earth I brought this up… well the problem above could have been prevented by simply testing the infrastructure when implemented and after changes have been introduced, and maybe even on a regular basis. If HA / Networking was tested properly, those VMs would not have been powered off…
On the VMware Community Forums someone reported he was having issues unmounting datastores when vSphere HA was enabled. Internally I contacted various folks to see what was going on. The error that this customer was hitting was the following:
The vSphere HA agent on host '<hostname>' failed to quiesce file activity on datastore '/vmfs/volumes/<volume id>'
After some emails back and forth with Support and Engineering (awesome to work with such a team by the way!) the issue was discovered and it seems that in two separate instances issues were resolved that had to do with unmounting of datastores. Keith Farkas explained on the forums how you can figure out if you are hitting those exact problems or not and in which release they are fixed, but at I realize those kind of threads are difficult to find I figured I would post it here for future reference:
You can determine if you are encountering this issue by searching the VC log files. Find the task corresponding to the unmount request, and see if the follow error message is logged during the task’s execution (Fixed in 5.1 U1a) :
2012-09-28T11:24:08.707Z [7F7728EC5700 error 'DAS'] [VpxdDas::SetDatastoreDisabledForHACallback] Failed to disable datastore /vmfs/volumes/505dc9ea-2f199983-764a-001b7858bddc on host [vim.HostSystem:host-30,10.112.28.11]: N3Csi5Fault16NotAuthenticated9ExceptionE(csi.fault.NotAuthenticated)
While we are on the subject, I’ll also mention that there is another know issue in VC 5.0 that was fixed in VC5.0U1 (the fix is in VC 5.1 too). This issue related to unmounting a force mounted VMFS datastore. You can determine whether you are hitting this error by again checking the VC log files. If you see an error message such as the following with VC 5.0, then you may be hitting this problem. A work around, like above, is to disable HA while you unmount the datastore.
2011-11-29T07:20:17.108-08:00 [04528 info 'Default' opID=19B77743-00000A40] [VpxLRO] -- ERROR task-396 -- host-384 -- vim.host.StorageSystem.unmountForceMountedVmfsVolume: vim.fault.PlatformConfigFault: