ha

High Availability “Deepdive” page

Duncan Epping · Jan 26, 2009 ·

I’ve just created a new page, which also deals with VMware HA. I threw all my “deepdive” posts into one page, which makes them easier to find for you guys and for search engines and, most importantly, easier to maintain. When I’ve got more in-depth technical information I will add it to the page.

Check it out and let me know what you think.

HA: who decides where a VM will be restarted?

Duncan Epping · Dec 15, 2008 ·

During the Dutch VMUG someone walked up to me and asked a question about High Availability. He had read my article on primary and secondary nodes and was wondering who decides where and when a VM will be restarted.

Let’s start with a short recap of the “primary/secondary” article: “The first five servers that join the cluster will become a primary node, and the others that will join will become a secondary node. Secondary nodes send their state info to primary nodes and also contact the primary nodes for their heartbeat notification. Primary nodes replicate their data with the other primary nodes and also send their heartbeat to other primary nodes.”
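The join-order rule in that recap can be modelled in a few lines of Python. This is only an illustrative sketch: `assign_roles`, `MAX_PRIMARIES` and the host names are made up for the example and are not part of any VMware API.

```python
# Illustrative model only: none of these names come from a VMware API.
MAX_PRIMARIES = 5  # per the recap: the first five hosts to join become primaries


def assign_roles(join_order):
    """Map each host to 'primary' or 'secondary' based on the order it joined."""
    roles = {}
    for index, host in enumerate(join_order):
        roles[host] = "primary" if index < MAX_PRIMARIES else "secondary"
    return roles


# Hypothetical seven-host cluster, esx01 .. esx07, joining in that order.
hosts = [f"esx{n:02d}" for n in range(1, 8)]
roles = assign_roles(hosts)
primaries = [h for h, r in roles.items() if r == "primary"]
secondaries = [h for h, r in roles.items() if r == "secondary"]
print(primaries)
print(secondaries)
```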

The question was: when a fail-over needs to take place because an isolation occurred, who decides on which host a specific VM will be restarted? The obvious answer is one of the primaries. One of the primaries will be selected as the “fail-over coordinator”. The fail-over coordinator coordinates the restart of virtual machines on the remaining hosts and takes restart priorities into account. Keep in mind that when two hosts fail at the same time, it will handle the restarts sequentially. In other words, it restarts the VMs of the first failed host (taking restart priorities into account) and then restarts the VMs of the host that failed second (again taking restart priorities into account). If the fail-over coordinator fails, one of the other primaries will take over.
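To make the ordering concrete, here is a minimal Python sketch of the sequence described above: failed hosts are handled one after another, and within each host the VMs are ordered by restart priority. The function, the host and VM names, and the high/medium/low ranking are assumptions for illustration. It says nothing about which host a VM lands on, since that logic is internal to HA.

```python
# Illustrative model only; HA's real coordinator logic is internal to the product.
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}  # assumed priority levels


def plan_restarts(failed_hosts, vms_by_host):
    """Return VM names in the order a coordinator would restart them.

    failed_hosts: host names in the order in which they failed.
    vms_by_host:  host name -> list of (vm_name, restart_priority) tuples.
    """
    order = []
    for host in failed_hosts:  # one failed host at a time, sequentially
        vms = vms_by_host.get(host, [])
        # sorted() is stable, so VMs with equal priority keep their listed order.
        for name, _priority in sorted(vms, key=lambda vm: PRIORITY_RANK[vm[1]]):
            order.append(name)
    return order


# Hypothetical inventory: esx02 failed first, esx03 second.
vms_by_host = {
    "esx02": [("file01", "low"), ("vc01", "high"), ("web01", "medium")],
    "esx03": [("db01", "high"), ("test01", "low")],
}
restart_order = plan_restarts(["esx02", "esx03"], vms_by_host)
print(restart_order)
```

Note that the high-priority `db01` on the second host is still restarted after the low-priority `file01` on the first host, because the hosts are processed sequentially.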

By the way, this is another reason why you can only account for 4 host failures. You need at least 1 primary, because this primary will be the fail-over coordinator. When the last primary dies….
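The arithmetic behind that limit is simply five primaries minus the one that has to survive. A small Python sketch of the worst case, where every failure happens to hit a primary (host names are hypothetical):

```python
def surviving_primaries(primaries, failed_hosts):
    """Return the primaries that are still alive after the given hosts fail."""
    failed = set(failed_hosts)
    return [host for host in primaries if host not in failed]


# Hypothetical cluster with the five primaries from the recap.
primaries = ["esx01", "esx02", "esx03", "esx04", "esx05"]

# Worst case: every failed host is a primary.
after_four = surviving_primaries(primaries, primaries[:4])
after_five = surviving_primaries(primaries, primaries[:5])

print(after_four)        # one primary left, so a coordinator still exists
print(bool(after_five))  # no primary left, so nobody can coordinate restarts
```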

VM’s may unexpectedly reboot when using VMware HA with Virtual Machine Monitoring

Duncan Epping · Dec 12, 2008 ·

This KB article has just been published:

Virtual Machines may unexpectedly reboot after a VMotion migration to an ESX 3.5 Update 3 Host OR after a Power On operation on an ESX 3.5 Update 3 Host, when VMware HA feature with Virtual Machine Monitoring is active.

There’s a workaround for the problem, but I will not be posting it here because it might change over time. Just read the KB article for more info on how to fix this issue.

Leave powered on…

Duncan Epping · Dec 4, 2008 ·

On Friday I got a question about the “Leave powered on” setting for HA. This setting is used for the Isolation Response. In other words, what does HA need to do when a network isolation is detected? The question was pretty straightforward: “What happens when a host is isolated from the network or a host dies completely?”

This question was asked because the default setting changed in ESX 3.5 / VirtualCenter 2.5 from “Power off vm” to “Leave powered on”. The old value was simple: whatever happens, the VM will be powered off and restarted on a different host.

Make VirtualCenter highly available with VMware VI3

Duncan Epping · Nov 19, 2008 ·

Jason Boche is definitely going for that top ten blogs list by Eric Siebert. Jason has been releasing high-quality blog posts over the last couple of weeks, keep it up!

I just noticed his latest addition: “Make VirtualCenter highly available with VMware Virtual Infrastructure”, which is a great article on the advantages of virtualizing your VirtualCenter server. Jason wasn’t the only one who picked up this “trend”; so did “newcomer” and colleague Dave Lawrence, aka VMGuy. (Newcomer in the blogosphere.)

As always, I’ve got my own view on making VirtualCenter, or should I say vCenter, highly available. I fully agree with both gents on using the Virtual Infrastructure technology to achieve this. I’m no big fan of Microsoft Clustering Services for this purpose.

When virtualizing VirtualCenter remember a couple of things:

1. Disable DRS (change the automation level!) for your VirtualCenter VM and make sure to document where the VirtualCenter server is located (my suggestion would be the first ESX box).
2. Enable HA for the VirtualCenter server and set the startup priority to high.
3. Make sure the VirtualCenter server gets enough resources by setting the shares to “high”, and maybe even set reservations.
4. Make sure VirtualCenter starts up automatically after a power cut (Configuration, Virtual Machine Startup and Shutdown).
5. Make sure other services and servers that VirtualCenter relies on also start automatically, with a high priority and in the correct order, like:
   1. Active Directory
   2. DNS
   3. SQL
6. Give the VirtualCenter server at least 2GB of memory and 1 CPU.
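The dependency order from the list (Active Directory, then DNS, then SQL, and only then VirtualCenter) is the kind of thing worth writing down in a boot procedure. A minimal Python sketch of such a procedure, assuming a hand-maintained mapping of VM names to roles; the VM names and the `boot_sequence` helper are hypothetical:

```python
# Order taken from the list above, with VirtualCenter itself started last.
STARTUP_ORDER = ["Active Directory", "DNS", "SQL", "VirtualCenter"]


def boot_sequence(vm_roles):
    """Return VM names sorted so that dependencies start before VirtualCenter.

    vm_roles: VM name -> role, where each role is one of STARTUP_ORDER.
    """
    rank = {role: position for position, role in enumerate(STARTUP_ORDER)}
    return sorted(vm_roles, key=lambda name: rank[vm_roles[name]])


# Hypothetical VMs, deliberately listed out of order.
vm_roles = {
    "vc01": "VirtualCenter",
    "sql01": "SQL",
    "dc01": "Active Directory",
    "dns01": "DNS",
}
sequence = boot_sequence(vm_roles)
print(sequence)
```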

I know most of these best practices are documented here and are pretty obvious, but somehow they are often overlooked, especially numbers 1, 4 and 5. Why are these important?

Well, when a complete site fails, which will be a stressful situation, you don’t want to spend time checking every single ESX host to see whether the VirtualCenter VM was located there before the power cut. With DRS enabled it can, and probably will, be VMotioned around the environment, and when HA or the automatic startup fails for some reason you need to start it from the command line with “vmware-cmd”. This can only be done on the host where the VM was located before the power cut / isolation.
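As a rough illustration of why the documentation matters, the Python sketch below searches saved per-host VM listings for the VirtualCenter VM and builds the command to run on that host. It assumes you captured the output of `vmware-cmd -l` (which lists the registered .vmx paths) on each host before the outage; the host names, datastore paths and helper functions are hypothetical, and the sketch only builds the command without executing anything.

```python
def find_vm_host(listings, vmx_name):
    """Return (host, vmx_path) for the host whose saved listing contains the VM.

    listings: host name -> text previously captured from `vmware-cmd -l`.
    vmx_name: file name of the VM's configuration file, e.g. "vc01.vmx".
    """
    for host, output in listings.items():
        for line in output.splitlines():
            path = line.strip()
            if path.endswith("/" + vmx_name):
                return host, path
    return None


def start_command(vmx_path):
    """Build the vmware-cmd invocation that powers on the VM (not executed here)."""
    return ["vmware-cmd", vmx_path, "start"]


# Hypothetical listings saved before the power cut.
listings = {
    "esx01": "/vmfs/volumes/datastore1/dc01/dc01.vmx\n"
             "/vmfs/volumes/datastore1/vc01/vc01.vmx\n",
    "esx02": "/vmfs/volumes/datastore1/web01/web01.vmx\n",
}
location = find_vm_host(listings, "vc01.vmx")
command = start_command(location[1])
print(location[0], " ".join(command))
```

With DRS set to manual for the VirtualCenter VM, a listing like this stays accurate, which is exactly the point of the first best practice.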

So again, I do advise virtualizing your VirtualCenter Server, but make sure you know where the VirtualCenter Server resides at all times and write procedures for booting the environment manually!