** for vSphere 5.0 check this update! **
Today I received a very valid question around a full cluster failure. What happens when all the hosts in a cluster go down and at some point return? Will the VMs be restarted and what do I need to have in place to ensure they will?
It seems to be an urban myth that you need to use “auto-start” for a full cluster failure. But as you might have noticed, that won’t work when HA is enabled. So what will?
VMware HA
Is it really that simple? Yes it is! When a full cluster fails and nodes start powering up, HA will restart the VMs. As you know, HA (or to be precise, the primary nodes) maintains the host states, which include the status of all VMs on those hosts. When one of the primary nodes returns to duty it will trigger the restarts based on the last known state. Make sure you set the restart priority correctly so that any VMs hosting “management apps” will be booted up first.
It can’t get any simpler than that, can it?
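To picture "restart based on the last known state and restart priority", here is a minimal Python sketch. It is only an illustration of the ordering idea, not how the HA agent is implemented; the VM names, the data layout and the `restart_order` helper are all made up for the example.

```python
# Illustrative model only: names and data layout are assumptions,
# not HA internals.
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}  # "disabled" is never restarted


def restart_order(last_known_state):
    """Return the names of the VMs to restart, highest priority first.

    VMs that were powered off before the failure, or whose restart
    priority is "disabled", are left alone.
    """
    candidates = [
        vm for vm in last_known_state
        if vm["powered_on"] and vm["priority"] in PRIORITY_RANK
    ]
    # sorted() is stable, so VMs with equal priority keep their input order.
    ordered = sorted(candidates, key=lambda vm: PRIORITY_RANK[vm["priority"]])
    return [vm["name"] for vm in ordered]


# Hypothetical inventory as it was just before the cluster went down.
vms = [
    {"name": "web01", "powered_on": True, "priority": "medium"},
    {"name": "dc01", "powered_on": True, "priority": "high"},
    {"name": "test01", "powered_on": False, "priority": "medium"},
    {"name": "vcenter", "powered_on": True, "priority": "high"},
    {"name": "scratch", "powered_on": True, "priority": "disabled"},
    {"name": "batch01", "powered_on": True, "priority": "low"},
]

order = restart_order(vms)
print(order)
```

The point of the sketch is simply that the domain controller and vCenter only come up first if someone gave them a higher priority beforehand.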






Hi Duncan, as you stated, once one of the primaries is started up, it will also take over the fail-over coordinator role so it can coordinate the restart of the VMs based on the priorities, if any…
However, if you don’t know which hosts are your primaries, you might start several secondaries before you finally start up a primary, which will then restart the VMs. Thus it might be a good idea to keep track of your primaries and update that list whenever you put a host in maintenance mode, disconnect it from the cluster, or reconfigure HA.
Rgds,
Didier
Well, I guess you would always want to start up some of the secondaries first in any case to make sure you have resources available to power on VMs. But indeed for situations like these it would be good to have a list of primary/secondary nodes and boot them up in the required order.
You could use this script to dump the info on a regular basis: http://www.virtu-al.net/2009/10/28/powercli-listing-cluster-primary-ha-nodes/
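Once you have such a list, working out a boot order is straightforward. The Python sketch below is only one possible reading of the advice above: bring up a couple of secondaries for capacity, then the primaries, then the rest. The host names, the `boot_order` helper and the `secondaries_first` parameter are hypothetical.

```python
def boot_order(hosts, primaries, secondaries_first=2):
    """Order hosts for power-on after a full cluster outage.

    hosts: all host names in the cluster.
    primaries: names recorded as HA primary nodes before the outage.
    secondaries_first: how many secondaries to start ahead of the
    primaries so there is capacity for the restarted VMs (an assumption
    made for this sketch, not an HA requirement).
    """
    primary_set = set(primaries)
    prim = [h for h in hosts if h in primary_set]
    sec = [h for h in hosts if h not in primary_set]
    return sec[:secondaries_first] + prim + sec[secondaries_first:]


# Hypothetical cluster and recorded primary list.
hosts = ["esx01", "esx02", "esx03", "esx04", "esx05", "esx06"]
primaries = ["esx02", "esx05"]

order = boot_order(hosts, primaries)
print(order)
```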
There is one very important exception to this: an HA incident where the isolation response powers off the VMs on every node in the cluster, as in a complete network failure. If HA powers everything off, it stays off, even as hosts reconnect to the cluster when the network comes back.
In a properly designed environment, this should never happen, but we’ve run across it in testing when all our hosts were hooked up to the same physical switch.
Yes, hence the reason there is also an “HA Maintenance Mode” these days
Thank god for HA maintenance mode. Was so annoying waiting for it to disable/enable back in the 3.5 days, especially with a lot of hosts.
Interesting scenario; I’ve never seen it happen. I’m trying to think of a situation in which this could happen. I guess if you lost power and the UPS and the backup generator all failed. Some places might not have that level of resiliency, but you should when you have all your eggs in the VM basket. Maybe an incompetent network guy could deploy a bad image to all of your network devices at the same time! That could do it too.
I guess the fact it could happen is all that matters, because that means it will.
Actually, this just happened to us. During a generator test the UPS decided it needed to step in, then promptly failed to maintain power, and the entire DC shut off.
I was surprised when the ESX hosts were coming online that all of the VMs powered up within minutes.
Had we known this would happen I would have had the vCenter DB server and vCenter server and a DC set up with priority… Still it saved the DC team about 2 hours when getting things up and running. Yay VMware.
What about DNS? If all hosts fail and take all your DNS VMs down with them, HA would not work, whether you had anti-affinity rules or not. I guess I would have to power on a DNS server manually for HA to kick in? (Yes, I know about the hosts file, but it gets cumbersome.)
Probably another reason why you want an off-site DNS server or a backup physical DNS server locally.
This is a neat tip. Sometimes I really wonder how many people really use VMware HA to the fullest. It is such a great piece of technology.
@Solgae: You should have an offsite DNS server anyway, since that is one of the pieces of infrastructure that ESX/vCenter itself depends on.
If you have a network issue that causes an HA isolation event on all hosts in your cluster and you have HA configured to power off your VMs, HA will power off the VMs and the HA agent will be disabled. The HA agents will remain disabled until they can talk to vCenter which will reconfigure / start the HA agent on the hosts. Once HA is running it will start powering on your VMs. If vCenter is also a VM in this cluster and is set to power down, your VMs won’t be powered back on by HA until vCenter is up and running so it can reconfigure / start the HA agent on the hosts. Depending on your environment, it may make sense to set the vCenter VM’s isolation response to ‘Leave powered On’.
I don’t believe that DNS should play a role. I’m not 100% sure but I think the hosts will use DNS to do the initial HA configuration and then store the results in the FT_HOSTS file. Or maybe the hosts will try to use DNS but fall back to the FT_HOSTS file when necessary. Also, the hosts won’t need name resolution to the vCenter server since the IP of the vCenter server is stored in the vpxa.cfg file.
This is based on my testing with 4.1
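The vCenter dependency described above boils down to one question: is vCenter still running when the network comes back? A small Python sketch of that reasoning follows. It only encodes the behaviour as described in this comment; the function name and the string values are made up for illustration.

```python
def ha_recovers_without_intervention(vcenter_is_cluster_vm, vcenter_isolation_response):
    """Model of the scenario above: after a cluster-wide isolation event
    the HA agents stay disabled until vCenter can reconfigure them.

    Returns True if vCenter is still running once the network returns,
    so nobody has to power it on by hand first.
    """
    if not vcenter_is_cluster_vm:
        # Assumes a vCenter outside this cluster survived the outage.
        return True
    # Inside the cluster, vCenter only keeps running if its own
    # isolation response leaves it powered on.
    return vcenter_isolation_response == "leave_powered_on"


print(ha_recovers_without_intervention(True, "power_off"))
print(ha_recovers_without_intervention(True, "leave_powered_on"))
```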
I had the same situation a couple of weeks ago.
The entire datacenter lost power, and the UPS did not work either…
The customer’s VI configuration and hardware infrastructure were definitely not prepared for this.
When the power came back on, shit started hitting the fan.
Here is what happened and some things to take into account:
IMPORTANT
Set the power-on options of the different hardware components so that they start in the right order. Some cannot be controlled, but the crucial ones often can.
The Switches:
The Ethernet and fiber switches usually come back online after power is restored; there is normally no reason to keep them powered off.
The Enclosures:
The enclosures (in this case HP c7000 enclosures) start automatically; there is nothing to configure there.
The Blades/Hosts:
The hosts (in this case HP blades) should be configured to NOT power on after power is restored. In this case that was NOT set, causing the ESX hosts to come online without a connection to the datastores. That was not really a problem; the fact that the ESX servers were missing their (virtual) DNS was.
The storage array:
The storage array (in this case an HP XP) is set by default to NOT power on after power is restored. This is best practice. Start it up and let it do all its checks before starting up your ESX hosts.
VI infra:
Once the storage array and the switches are back online you can begin starting up your ESX hosts. Make sure you know which hosts are the primary HA nodes.
DNS/AD:
Set the restart priority of your DNS and DC VMs; make sure they start first.
This is crucial when running these services virtually. HA still needs DNS (this is soon to change, I was told in Copenhagen; I think it was in some HA deep-dive session).
In our case the priority was not set and the blades were running long before the storage was.
The prolonged lack of DNS caused problems with HA on all the ESX hosts. When the storage array came back online HA did kick in and began starting VMs, no problems there. The problems came when DRS had to do its work.
vCenter/Database:
Make sure your vCenter server is the next VM to start.
If the vCenter database runs on a separate VM, bring that up first.
HA works without vCenter, no problem there, but DRS does need vCenter.
In our vSphere 4.0 version, HA and DRS still work independently of each other (this changed in 4.1, all the more reason to update).
Because of the HA problem mentioned above, DRS was unable to move VMs, causing HA to start up too many VMs on the primary host (over 30 VMs on a 16GB host). The customer did not configure a slot size; I think they forgot to read your HA deep dive.
The customer also had no startup priority set on the vCenter VM, and no higher shares either. That would have been nice in this case because the entire infra was very unresponsive.
I finally disabled HA, which allowed DRS to move VMs around.
We ended up rebooting all the VMs because of all kinds of problems, this time in the right order.
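The start-up order walked through above is really a dependency chain, and it can help to write it down as one. The following Python sketch uses the standard library's `graphlib` (Python 3.9+) to derive an order from "X needs Y first" statements. The component names and the exact dependencies are my reading of this comment, not an official procedure.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key lists what must already be up before it is started.
# The dependencies are assumptions taken from the write-up above.
depends_on = {
    "switches": [],
    "enclosures": [],
    "storage_array": ["switches"],
    "esx_hosts": ["switches", "enclosures", "storage_array"],
    "dns_ad_vms": ["esx_hosts"],
    "vcenter_db_vm": ["dns_ad_vms"],
    "vcenter_vm": ["vcenter_db_vm"],
    "other_vms": ["vcenter_vm"],
}

# static_order() yields every component after all of its dependencies.
order = list(TopologicalSorter(depends_on).static_order())

for step, component in enumerate(order, start=1):
    print(step, component)
```

Writing the dependencies down like this also makes it obvious what went wrong in our case: the hosts were started before the storage array they depend on.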
Duncan,
HA restarts VMs based on the last known state. What if all VMs were shut down gracefully by PowerChute or a UPS shutdown script… VMs won’t start!!!
Really? Of course they won’t. That is not what HA is supposed to do, is it? So no need to shout at me.
When you have a script that powers down all your VMs, make sure you also have a script that powers them up.
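As a rough idea of what such a pair of scripts needs to do: record which VMs were running before the shutdown, and power on exactly those afterwards. The Python sketch below shows only that bookkeeping. The inventory dictionary and the `power_on` callback are placeholders for whatever your tooling provides; no real vSphere API is called here.

```python
import json
import os
import tempfile


def save_running_vms(inventory, path):
    """Before the scripted shutdown: remember which VMs were powered on."""
    running = [name for name, state in inventory.items() if state == "poweredOn"]
    with open(path, "w") as fh:
        json.dump(running, fh)
    return running


def power_on_saved_vms(path, power_on):
    """After power returns: power on the VMs recorded earlier.

    power_on is a placeholder callable taking a VM name; plug in
    whatever your own tooling uses to start a VM.
    """
    with open(path) as fh:
        names = json.load(fh)
    for name in names:
        power_on(name)
    return names


# Hypothetical inventory, and a stand-in power_on that just records names.
inventory = {"dc01": "poweredOn", "test01": "poweredOff", "web01": "poweredOn"}
state_file = os.path.join(tempfile.mkdtemp(), "running_vms.json")
save_running_vms(inventory, state_file)

started = []
power_on_saved_vms(state_file, started.append)
print(started)
```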
vSphere already has a built-in script to start virtual machines, and nice interfaces to manage it:
/sbin/vmware-autostart.sh
Continued at
http://communities.vmware.com/message/1667819#1667819