
Yellow Bricks

by Duncan Epping


Future HA developments… (VMworld – BC3197)

Duncan Epping · Sep 15, 2009 ·

I was just listening to “BC3197 – High Availability – Internals and Best Practices” by Marc Sevigny. Marc is one of the HA engineers and is also my primary source of information when it comes to HA. Although most information can be found on the internet, it’s always good to verify your understanding with the people who actually wrote it.

During the session Marc explained, as I’ve also written about in this article, that when a dual host failure occurs the global startup order is not taken into account. With the current version the startup order is processed per host. In other words, “Host A” is processed first, taking its startup order into account, and then “Host B”, again taking its startup order into account.

During the session, however, Marc revealed that in a future version of HA the global startup settings (cluster based) will be taken into account for any number of host failures! Great stuff. Another thing worth mentioning is that they are also looking into an option that would enable you to pick your primary hosts. For blade environments this will be really useful. Thanks Marc for the insights!

vCenter Server 4.0 Patch 1

Duncan Epping · Aug 21, 2009 ·

I don’t think many people have noticed this KB article yet or even experienced this issue with HA, but nevertheless it’s worth mentioning. Apparently there’s an issue with HA in vCenter 4.0 when certain class A networks are being used. When a node fails, the failure will not be detected and the fail-over of VMs will not occur. Although not many customers are using these class A ranges, it is something I think you all should be aware of. This issue has been resolved and VMware released the following KB article, which contains a link to the patch:

http://kb.vmware.com/kb/1013013
A vSphere 4.0 VMware High Availability cluster may not failover virtual machines when ESX is configured with certain IP addresses

You experience these symptoms:

  • In vCenter 4.0, VMware HA might not failover virtual machines when a host failure occurs.
  • When the ESX host’s IP address in a VMware HA enabled cluster is configured with certain IP addresses, the node failure detection algorithm fails.
  • You are susceptible to this issue when all of your Service Console Port(s) or Management Network IP address(es) on your ESX host fall within the following ranges:
    3.x.x.x – 9.x.x.x
    26.x.x.x – 99.x.x.x

Note: You are not affected if one of the Service Console Port(s) or Management Network IP address(es) on your ESX host falls outside of these ranges.
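To make the affected ranges concrete, here’s a minimal Python sketch of the check (my own illustration, not from the KB article; the function names are hypothetical):

  def in_affected_range(ip):
      # KB 1013013: first octet 3-9 or 26-99 triggers the detection issue
      first_octet = int(ip.split(".")[0])
      return 3 <= first_octet <= 9 or 26 <= first_octet <= 99

  def host_susceptible(management_ips):
      # A host is only susceptible if ALL of its management IPs fall in range
      return all(in_affected_range(ip) for ip in management_ips)

  print(host_susceptible(["10.0.0.1", "26.1.1.1"]))  # False: one IP is outside
  print(host_susceptible(["3.1.1.1", "99.0.0.5"]))   # True: all IPs affected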

HA Admission Control and DPM

Duncan Epping · Aug 20, 2009 ·

A couple of days ago we had a discussion on Admission Control and DPM internally at VMware. One of our customers had enabled DPM on an HA cluster. During the evening 4 out of 5 hosts were placed into standby mode as a result.

This customer, like many of our customers these days, had vCenter running as a virtual machine. This of course led to the question: what happens if this one remaining host fails and our virtual vCenter server is running on it?
That’s an easy one: nothing. It might not be the answer you are looking for, but when the host that runs vCenter fails, there’s no host or service left to bring the other hosts out of standby mode or restart your VMs.

Now, maybe even more important: what causes this behavior?
This behavior is caused by the fact that admission control is disabled. If you disable admission control, DPM will put hosts into standby mode even if doing so violates the failover requirements. This means that if you have virtualized your vCenter server, this is definitely something to be aware of.

For more info/background: http://kb.vmware.com/kb/1007006

vSphere and slot sizes

Duncan Epping · Aug 20, 2009 ·

I discussed slot sizes a week ago but forgot to add a screenshot of a great new vSphere feature that reports the slot info of a cluster.

I just love vSphere!

HA and Slot sizes

Duncan Epping · Aug 12, 2009 ·

This has always been a hot topic: HA and slot sizes/admission control. One of the most extensive (non-VMware) articles on it is by Chad Sakac aka Virtual Geek, but of course a couple of things have changed since then. Chad asked in a comment on my HA Deepdive whether I could address this topic, so here you go Chad.

Slot sizes

Let’s start with the basics.

What is a slot?

A slot is a logical representation of the memory and CPU resources that satisfy the requirements for any powered-on virtual machine in the cluster.

In other words, the slot size is the worst-case CPU and memory reservation scenario in a cluster. This directly leads to the first “gotcha”:

HA uses the highest CPU reservation of any given VM and the highest memory reservation of any given VM.

If VM1 has 2GHz and 1024MB reserved and VM2 has 1GHz and 2048MB reserved, the slot size for memory will be 2048MB plus the memory overhead, and the slot size for CPU will be 2GHz.
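As a minimal sketch of that worst-case logic (my own illustration, not HA’s actual code; the overhead value is just an example, as the real memory overhead depends on the VM configuration):

  # Slot size: highest CPU reservation and highest memory reservation
  # across all powered-on VMs
  vms = [
      {"name": "VM1", "cpu_mhz": 2000, "mem_mb": 1024},
      {"name": "VM2", "cpu_mhz": 1000, "mem_mb": 2048},
  ]
  mem_overhead_mb = 79  # assumed example value, varies per VM

  cpu_slot_mhz = max(vm["cpu_mhz"] for vm in vms)                  # 2000 MHz
  mem_slot_mb = max(vm["mem_mb"] for vm in vms) + mem_overhead_mb  # 2048 MB + overhead
  print(cpu_slot_mhz, mem_slot_mb)                                 # 2000 2127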

Now how does HA calculate how many slots are available per host?

Of course we need to know what the slot size for memory and CPU is first. Then we divide the total available CPU resources of a host by the CPU slot size and the total available memory resources of a host by the memory slot size. This leaves us with a number of slots for both memory and CPU. The most restrictive of the two is the number of slots for this host. If you have 25 CPU slots but only 5 memory slots, the number of available slots for this host will be 5.
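Continuing the same sketch, the per-host slot count would look like this (the host capacities are made-up example numbers):

  # Slots per host: divide host capacity by the slot size in each
  # dimension, round down, and take the most restrictive result
  cpu_slot_mhz, mem_slot_mb = 2000, 2048    # slot size from the step above
  host_cpu_mhz, host_mem_mb = 50000, 10240  # example host capacity

  cpu_slots = host_cpu_mhz // cpu_slot_mhz  # 25
  mem_slots = host_mem_mb // mem_slot_mb    # 5
  print(min(cpu_slots, mem_slots))          # 5: memory is the constraint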

As you can see, this can lead to very conservative consolidation ratios. With vSphere this is configurable. If you have just one VM with a really high reservation, you can set the following advanced settings to lower the slot size used during these calculations: das.slotCpuInMHz or das.slotMemInMB. To avoid being unable to power on the VM with the high reservation, that VM will take up multiple slots. Keep in mind that when you are low on resources, this could mean that you are not able to power on this high-reservation VM, as the resources may be fragmented throughout the cluster instead of available on a single host.
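As a hypothetical illustration of how such a high-reservation VM would consume multiple smaller slots once das.slotCpuInMHz/das.slotMemInMB are set (the values below are made up):

  import math

  # das.slotCpuInMHz / das.slotMemInMB cap the slot size; a VM whose
  # reservation exceeds the slot then consumes multiple slots
  slot_cpu_mhz, slot_mem_mb = 500, 512  # made-up advanced-setting values
  vm_cpu_mhz, vm_mem_mb = 2000, 2048    # the one high-reservation VM

  slots_needed = max(math.ceil(vm_cpu_mhz / slot_cpu_mhz),
                     math.ceil(vm_mem_mb / slot_mem_mb))
  print(slots_needed)  # 4 slots, which must all be free on a single host

This also makes the fragmentation risk visible: those 4 slots have to be available on one host, not spread across the cluster.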

Host Failures?

Now what happens if you set the number of allowed host failures to 1?
The host with the most slots will be taken out of the equation. If you have 8 hosts with 90 slots in total, but 7 hosts each have 10 slots and one host has 20, this single host will not be taken into account. Worst-case scenario! In other words, the remaining 7 hosts should be able to provide enough resources for the cluster when a failure of the “20 slot” host occurs.

And of course, if you set it to 2, the next host taken out of the equation is the host with the second-most slots, and so on.
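A quick sketch of this worst-case calculation, using the 8-host example above (again my own illustration, not HA’s actual code):

  # Worst case: with "host failures allowed" = N, the N hosts with the
  # most slots are excluded from the cluster's usable capacity
  host_slots = [10, 10, 10, 10, 10, 10, 10, 20]  # the 8-host example above
  failures_to_tolerate = 1

  usable = sorted(host_slots)[:-failures_to_tolerate]  # drop the largest N
  print(sum(host_slots), sum(usable))                  # 90 total, 70 usable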

What more?

One thing worth mentioning: as Chad stated, with vCenter 2.5 the number of vCPUs for any given VM was also taken into account. This led to very conservative and restrictive admission control. This behavior was modified with vCenter 2.5 U2; the number of vCPUs is no longer taken into account.

