HA Admission Control the basics – Part 1/2

Last week I received three different questions about vSphere HA Admission control and I figured I would lay out the basics once more. What is admission control?

vCenter Server uses admission control to ensure that sufficient resources are available in a cluster to provide failover protection and to ensure that virtual machine resource reservations are respected.

Almost everything you need to know about admission control is in that single sentence. But let's break it down into more digestible bites:

  1. vCenter uses admission control to ensure that sufficient resources are available in a cluster to provide failover protection.
  2. vCenter uses admission control to ensure that virtual machine resource reservations are respected.

So first and foremost: admission control is not about resource management. I devoted a whole article to that, so I am not going into details here. HA admission control is all about reserving resources to allow for a failover.

Secondly, admission control ensures that the resource reservations of powered-on VMs can be respected. This is because a virtual machine's resource reservations must be available for a power-on to complete successfully! If you set a 5GB memory reservation, there needs to be 5GB of unreserved memory available (+ reserved memory overhead) on a single host in order for this virtual machine to power on. If that VM with its 5GB reservation is actually actively using 40GB, it might end up swapping / paging, as only those 5GB of reserved capacity are taken into account!

Note the “+ reserved memory overhead”! Every virtual machine has a memory overhead, usually in the range of a couple hundred MB. For a successful power-on attempt you need to be able to reserve this memory as well. If there is not enough “unreserved memory capacity” the power-on attempt will fail. So in reality that 5GB could actually be 5.15GB. It might seem irrelevant, but I will explain why it is relevant in a second. Did you spot the “powered-on”? Yes, admission control only takes the resource reservations of powered-on VMs into account. So if you have a powered-off VM with a large memory reservation, it will not impact your admission control calculations!
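To make the arithmetic concrete, here is a minimal sketch of the power-on check described above. It is a toy model, not how vCenter implements it: the `Host` and `VM` classes, the host names, and the 150MB overhead figure are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    unreserved_mb: int  # memory not yet claimed by any reservation


@dataclass
class VM:
    name: str
    reservation_mb: int
    overhead_mb: int  # hypothetical value; real overhead varies per VM
    powered_on: bool


def required_mb(vm):
    """Memory that must be unreserved on a single host to power on the VM."""
    return vm.reservation_mb + vm.overhead_mb


def find_host(hosts, vm):
    """Return the first host that can back reservation + overhead, else None."""
    for host in hosts:
        if host.unreserved_mb >= required_mb(vm):
            return host.name
    return None


def reserved_by_powered_on(vms):
    """Only powered-on VMs count towards reserved capacity."""
    return sum(required_mb(vm) for vm in vms if vm.powered_on)


# A 5GB reservation (5120MB) plus an assumed 150MB overhead needs 5270MB.
hosts = [Host("esxi-01", 5120), Host("esxi-02", 5400)]
new_vm = VM("db01", 5120, 150, powered_on=False)
print(find_host(hosts, new_vm))
```

In this example the first host has exactly 5GB unreserved, which is not enough once the overhead is added, so the sketch picks the second host.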

In summary:

  1. Admission Control is about reserving resources to allow for a failover.
  2. Admission Control is not a resource management tool; it only takes the reserved capacity of powered-on VMs into account.

So now you know what admission control is. There are three policies when it comes to admission control, and we will discuss these in Part 2 of this article.

Creating a nested lab

I was just building a nested lab to record some demo videos. I find myself googling for this every single time, so I figured I would write about it so I can easily get it off my own website. Many have written about this before and all credit goes to William Lam and Eric Gray, whose blogs are the two main ones I have used in the past to get this working.

After installing ESXi on my physical box I “ssh” into it. To allow “nested ESXi” to boot a 64-bit OS you will need to run the following:

echo 'vhv.allow = "TRUE"' >> /etc/vmware/config

After you have done that you will want to make sure you have a network connection. Go to your “VM Network” portgroup or, if you named it differently, the portgroup that the virtual ESXi hosts connect to. For each of the hosts do the following:

  1. Click on the host
  2. Go to “Configuration”
  3. Click on “Networking”
  4. Click “Properties” on the vSwitch
  5. Select the correct portgroup
  6. Click “Edit”
  7. Click “Security”
  8. Set “Promiscuous Mode” to “Accept”
  9. Click “Ok”
  10. Click “Close”

Now for each virtual ESXi host (note there is a “guest os” called ESXi 5 in there, use it!) that you have created do the following:

  1. Right click on the VM
  2. Click “Edit settings”
  3. Click the “Options” tab
  4. Click on “CPU/MMU virtualization”
  5. Select the 4th option “Use Intel VT-x / AMD-v…”

I am building this out to record a new demo of “DR of the Cloud”. In other words, 3 virtual clusters + vCloud Director + SRM + vSphere Replication + Virtual Storage Appliances… Cool stuff, right?

vSphere 5.0 HA restarting of VMs with no access to storage?

I had a question today about HA restarting VMs that have no access to storage. The question was whether HA would try to restart the VM and give up after 5 attempts. The follow-up question was whether HA would try again once the storage returned for duty.

By default HA will try to restart a VM up to 5 times in roughly 30 minutes. If the master does not succeed, it will stop trying. On top of that, HA manages a “compatibility list”. This list contains the details of which VM can be restarted and where. In other words, which hosts have access to the datastores and network portgroup required for this VM to successfully power on. Now if for whatever reason there are no compatible hosts available for the restart, then HA will not try to restart the VM.

But what if the problem is resolved? As soon as the problem is resolved, and reported as such, the compatibility list will be updated. When the list is updated HA will continue with the restarts again.

It might also be good to know that if for whatever reason the master fails, a new master will continue trying to restart the VM. It will start with 5 new attempts and will not take the restart attempts made by the previous master into account.
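The behaviour described in this post can be sketched as a small toy model. This is only an illustration under assumptions: the `Master` class, the `power_on` callback, and the choice to try only the first compatible host per attempt are all hypothetical, and the roughly 30-minute window between attempts is not modelled at all.

```python
MAX_ATTEMPTS = 5  # default number of restart attempts mentioned above


class Master:
    """Toy stand-in for an HA master; not the real FDM implementation."""

    def __init__(self):
        self.attempts = {}  # VM name -> attempts made by THIS master

    def try_restart(self, vm, compatible_hosts, power_on):
        """Return the host the VM was restarted on, or None."""
        if not compatible_hosts:
            # No compatible host: HA does not try at all
            # (assumption: so no attempt is counted either).
            return None
        used = self.attempts.get(vm, 0)
        if used >= MAX_ATTEMPTS:
            return None  # this master has given up on the VM
        self.attempts[vm] = used + 1
        host = compatible_hosts[0]  # assumption: one host per attempt
        return host if power_on(vm, host) else None


master = Master()
# Storage is gone: the compatibility list is empty, nothing is attempted.
print(master.try_restart("vm01", [], lambda vm, host: True))
# Storage is back: the list is updated and the restart goes ahead.
print(master.try_restart("vm01", ["esxi-02"], lambda vm, host: True))
```

A newly elected master would simply be a fresh `Master()` with an empty attempt counter, which mirrors the “5 new attempts” behaviour described above.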

VM Monitoring only using VMware Tools heartbeat?

I had this question twice this week. A quick search on my blog showed that I wrote an article about it a while back, but I figured it wouldn’t hurt to repeat some of that and expand on it. I copied / pasted this part from our book as I think it is spot on!

VM/App monitoring uses a heartbeat mechanism kind of similar to HA. If heartbeats, and, in this case, VMware Tools heartbeats, are not received for a specific (and configurable) amount of time, the virtual machine will be restarted. These heartbeats are monitored by the HA agent and are not sent over a network, but stay local to the host.

Although the heartbeat produced by VMware Tools is reliable, VMware added a further verification mechanism. To avoid false positives, VM Monitoring also monitors I/O activity of the virtual machine. When heartbeats are not received AND no disk or network activity has occurred over the last 120 seconds (the default), the virtual machine will be reset. You can modify this 120-second interval by changing the advanced setting “das.iostatsInterval”.
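The two-condition check above boils down to a simple AND. Here is a minimal sketch of that decision; the function name, its parameters, and the 30-second heartbeat failure interval used in the example are assumptions for illustration, and only the 120-second default comes from the text.

```python
IO_STATS_INTERVAL_S = 120  # default of das.iostatsInterval per the text


def should_reset(seconds_since_heartbeat, failure_interval_s,
                 seconds_since_io, io_stats_interval_s=IO_STATS_INTERVAL_S):
    """Reset only if heartbeats are missing AND the VM shows no recent I/O."""
    heartbeat_lost = seconds_since_heartbeat > failure_interval_s
    io_idle = seconds_since_io >= io_stats_interval_s
    return heartbeat_lost and io_idle


# Heartbeats stopped 45s ago (assumed 30s failure interval), but the VM
# did disk I/O 10s ago: treated as a false positive, no reset.
print(should_reset(45, 30, 10))
# No heartbeats and no disk or network I/O for 3 minutes: reset.
print(should_reset(45, 30, 180))
```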

Which isolation response should I use?

I wrote this article about split brain scenarios for the vSphere Blog. Based on this article I received some questions about which “isolation response” to use. This is not something that can be answered by a simple “recommended practice” and applied to all scenarios out there. Note that the right answer has everything to do with your infrastructure. Are you using IP-based storage? Do you have a converged network? All of these impact the decision around the isolation response.

The following table however could be used to make a decision:

Likelihood that host will retain access to VM datastores | Likelihood that host will retain access to VM network | Recommended isolation policy | Explanation
Likely | Likely | Leave Powered On | VM is running fine, so why power it off?
Likely | Unlikely | Either Leave Powered On or Shutdown | Choose Shutdown to allow HA to restart VMs on hosts that are not isolated and hence are likely to have access to storage
Unlikely | Likely | Power Off | Use Power Off to avoid having two instances of the same VM on the VM network
Unlikely | Unlikely | Leave Powered On or Power Off | Leave Powered On if the VM can recover from the network/datastore outage if it is not restarted because of the isolation, and Power Off if it likely can’t
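If you want to capture this table in a script, for example to document the choice per cluster, it can be expressed as a simple lookup. This is just a sketch: the `recommend` function and its boolean parameters are made up for illustration and the strings mirror the table above.

```python
# Key: (datastore access likely?, VM network access likely?)
RECOMMENDATION = {
    (True, True): "Leave Powered On",
    (True, False): "Leave Powered On or Shutdown",
    (False, True): "Power Off",
    (False, False): "Leave Powered On or Power Off",
}


def recommend(datastore_access_likely, network_access_likely):
    """Look up the suggested isolation response for an isolated host."""
    return RECOMMENDATION[(datastore_access_likely, network_access_likely)]


# Example: IP-based storage on a converged network, so an isolated host
# probably loses its datastores and the VM network together.
print(recommend(False, False))
```

Note that two of the four rows still leave you with a choice, so the lookup narrows the options rather than making the decision for you.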