I have been having these discussions with our engineering teams for the last year around guaranteed restarts of virtual machines in a cluster. In the current shape / form we use Admission Control to guarantee virtual machines are restarted. Today Admission Control is all about guaranteeing virtual machine restarts by keeping track of Memory and CPU resource reservations, but you can imagine that in the Software Defined Datacenter this could be expanded with for instance storage or networking reservation.
Now why am I having these discussions, what is the problem with Admission Control today? Well first of all it is the perception that many appear to have of Admission Control. Many believe the Admission Control algorithm uses “used” resources. Reality however is that Admission Control is not that flexible, it uses resource reservations and as you know this is static. So what is the result of using reservations?
By using reservations for “admission control” vSphere HA has a simple way of guaranteeing a restart is possible at all times. Simply because it checks if sufficient “unreserved resources” are available and if so it allows the virtual machine to be powered-on. If not, then it won’t allow the power-on just to ensure that all virtual machines can be restarted in case of a failure. But what is the problem? Although we guarantee a restart we do not guarantee any type of performance after the restart! Unless, unless of course you are setting your reservations equal to what you provisioned… but I don’t know anyone doing this as it eliminates any form of overcommitment and will result in an increase of cost and a decrease in flexibility.
So that is the problem. Question is – what should we do about it? We (the engineering teams and I) would like to hear from YOU.
- What would you like admission control to be?
- What guarantees do you want HA to provide?
- After a failure, what criteria should HA apply in deciding which VMs to restart?
One idea we have been discussing is to have Admission Control use something like “used” resources… or for instance an “average of resources used” per virtual machine. What if you could say: I want to ensure that my virtual machines always get at least 80% of what they use on average? If so, what should HA do when there are not enough resources to meet the 80% demand of all VMs? Power on some of the VMs? Power on all with reduced share values?
Also, something we have discussed is having vCenter show how many resources are used on average taking your high availability N-X setup in to account, which should at least provide an insight around how your VMs (and applications) will perform after a fail-over. Is that something you see value in?
What do you think? Be open and honest, tell us what you think… don’t be scared, we won’t be bite, we are open for all suggestions.