
Yellow Bricks

by Duncan Epping



New Academic/Tech Paper on FT

Duncan Epping · Jul 19, 2010 ·

I received this paper a while back and think it is an excellent read. I copied a random part of it below to give you an idea of what it covers. There’s not much more to say about it other than: just read it, it is as in-depth as it gets on FT. I have read it several times by now and still discover new things every time.

The Design and Evaluation of a Practical System for Fault-Tolerant Virtual Machines

There are many possible ways to attempt to detect failure of the primary and backup VMs. VMware FT uses UDP heartbeating between servers that are running fault-tolerant VMs to detect when a server may have crashed. In addition, VMware FT monitors the logging traffic that is sent from the primary to the backup VM and the acknowledgments sent from the backup VM to the primary VM.
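To make the detection logic in that excerpt a bit more concrete, here is a minimal Python sketch of the idea: a peer is only suspected failed when both the heartbeats and the logging/acknowledgment traffic have gone silent. The class, method names and the timeout value are my own illustration and not VMware’s implementation.

import time

# Toy illustration (not VMware source): the peer is only suspected failed when
# both the UDP heartbeats and the logging/acknowledgment traffic have gone
# silent for longer than a timeout.
HEARTBEAT_TIMEOUT = 5.0  # seconds, an assumed value purely for the example

class PeerMonitor:
    def __init__(self):
        now = time.monotonic()
        self.last_heartbeat = now
        self.last_log_traffic = now

    def on_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def on_log_traffic(self):
        # log entries (primary -> backup) or acknowledgments (backup -> primary)
        self.last_log_traffic = time.monotonic()

    def peer_suspected_failed(self):
        now = time.monotonic()
        return (now - self.last_heartbeat > HEARTBEAT_TIMEOUT
                and now - self.last_log_traffic > HEARTBEAT_TIMEOUT)

monitor = PeerMonitor()
monitor.on_heartbeat()
print(monitor.peer_suspected_failed())  # False, traffic was just seen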

vSphere 4.1 HA feature, totally unsupported but too cool

Duncan Epping · Jul 16, 2010 ·

Early 2009 I wrote an article on the impact of Primary Nodes and Secondary Nodes on your design. It was primarily focused on blade environments and basically discussed how to avoid having all your primary nodes in a single chassis. If that single chassis were to fail, no VMs would be restarted, as one of the primary nodes acts as the “failover coordinator” and without a primary node to assign this role to, a failover can’t be initiated.

With vSphere 4.1 a new advanced setting has been introduced. This setting is not even experimental, it is unsupported. I don’t recommend using this in a production environment; if you do want to play around with it, use your test environment. Here it is:

das.preferredPrimaries = hostname1 hostname2 hostname3
or
das.preferredPrimaries = 192.168.1.1,192.168.1.2,192.168.1.3

The list of hosts that are preferred as primary can be either space or comma separated. You don’t need to specify 5 hosts; you can specify any number. If you specify 5 and all 5 are available, they will be the primary nodes in your cluster. If you specify more than 5, the first 5 in your list will become primary.

Please note that I haven’t personally tried it and I can’t guarantee it will work.
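Just to show how that behaviour works out, here is a small Python sketch of how such a space- or comma-separated list could be interpreted, with at most the first five entries becoming primary. The function is purely illustrative and has nothing to do with how the setting is actually parsed by HA.

# Illustrative only, this is not the actual AAM/HA parsing code: a space- or
# comma-separated list of hosts, of which at most the first five become primary.
def parse_preferred_primaries(value, max_primaries=5):
    hosts = [h for h in value.replace(",", " ").split() if h]
    return hosts[:max_primaries]

print(parse_preferred_primaries("hostname1 hostname2 hostname3"))
print(parse_preferred_primaries("192.168.1.1,192.168.1.2,192.168.1.3"))
print(parse_preferred_primaries("h1 h2 h3 h4 h5 h6"))  # only the first five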

VMware View without HA?

Duncan Epping · Jul 15, 2010 ·

I was discussing something with one of my former colleagues a couple of days ago. He asked me what the impact was of running VMware View in an environment without HA.

To be honest I am not a View SME, but I do know a thing or two about HA/vSphere in general. So the first thing I mentioned was that it wasn’t a good idea. Although VDI in general is all about density, not running HA in these environments could lead to serious issues when a host fails.

Now, just imagine you have 80 desktop VMs running per host and roughly 8 hosts in a DRS-only cluster on NFS-based storage. One of those hosts becomes isolated from the network… what happens?

  1. User connection is dropped
  2. VMDK Lock times out
  3. User tries to reconnect
  4. Broker powers on the VM on a new host

Now that sounds great, doesn’t it? Well, in a way it does, but what happens when the host is not isolated anymore?

Indeed, the VMs were still running on it. So basically you have a split-brain scenario. In the past, the only way to avoid this was to make sure you had HA enabled and had set the isolation response to power off the VM.

But with vSphere 4 Update 2 a new mechanism has been introduced. I wanted to stress this, as some people have already made the assumption that it is part of AAM/HA. It actually isn’t… The question for powering off the VM to recover from the split-brain scenario is generated by “hostd” and answered by “vpxa”. In other words, with or without HA enabled, ESX(i) will recover from the split brain.

Again, I am most definitely not a Desktop/View guy, so I am wondering how the View experts out there feel about disabling HA on your View compute cluster. (Note that on the management layer this should remain enabled.)

vSphere 4.1, VMware HA New maximums and DRS integration will make our life easier

Duncan Epping · Jul 14, 2010 ·

I guess there are a couple of key points I will need to stress for those creating designs:

New HA maximums

  • 32 host clusters
  • 320 virtual machines per host
  • 3,000 virtual machines per cluster

In other words:

  • you can have 10 hosts with 300 VMs each
  • or 20 hosts with 150 VMs each
  • or 32 hosts with 93 VMs each…

as long as you don’t go beyond 320 VMs per host or 3,000 per cluster, you are fine!
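If you want to sanity-check a design against these numbers, a few lines of Python are enough; the helper below is just my own illustration of the arithmetic, not anything HA does for you.

# Quick check of a cluster design against the vSphere 4.1 HA maximums above.
MAX_HOSTS_PER_CLUSTER = 32
MAX_VMS_PER_HOST = 320
MAX_VMS_PER_CLUSTER = 3000

def within_ha_maximums(hosts, vms_per_host):
    return (hosts <= MAX_HOSTS_PER_CLUSTER
            and vms_per_host <= MAX_VMS_PER_HOST
            and hosts * vms_per_host <= MAX_VMS_PER_CLUSTER)

print(within_ha_maximums(10, 300))  # True: 3,000 VMs in total
print(within_ha_maximums(32, 93))   # True: 2,976 VMs in total
print(within_ha_maximums(32, 100))  # False: 3,200 VMs exceeds the cluster maximum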

DRS Integration

HA integrates with DRS on multiple levels as of vSphere 4.1. It is a huge improvement and, in my opinion, something everyone should know about.

Resource Fragmentation

As of vSphere 4.1 HA is closely integrated with DRS. When a failover occurs, HA will first check if there are resources available on that host for the failover. If resources are not available, HA will ask DRS to accommodate them where possible. Think of a VM with a huge reservation and fragmented resources throughout your cluster, as described in my HA Deepdive. HA, as of 4.1, will be able to request a defragmentation of resources to accommodate this VM’s resource requirements. How cool is that?! One thing to note though is that HA will request it, but a guarantee cannot be given, so you should still be cautious when it comes to resource fragmentation.
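To make the fragmentation problem concrete, here is a tiny Python illustration with made-up numbers; it only demonstrates the concept, not how HA or DRS actually evaluate capacity.

# Toy example of resource fragmentation: the cluster has enough unreserved
# memory in aggregate, but no single host can satisfy the VM's reservation,
# so a restart only fits after DRS defragments the resources.
unreserved_mb_per_host = [6000, 7000, 5000]   # hypothetical per-host values
vm_reservation_mb = 8000

fits_on_one_host = any(free >= vm_reservation_mb for free in unreserved_mb_per_host)
enough_in_aggregate = sum(unreserved_mb_per_host) >= vm_reservation_mb

print(fits_on_one_host)      # False: the restart would fail without defragmentation
print(enough_in_aggregate)   # True: DRS could migrate VMs to free up a single host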

DPM

In the past there was barely any integration between DRS/DPM and HA. Especially when DPM was enabled, this could lead to some weird behaviour when resources were scarce and an HA failover needed to happen. With vSphere 4.1 this has changed. In such cases, VMware HA will use DRS to try to adjust the cluster (for example, by bringing hosts out of standby mode or migrating virtual machines to defragment the cluster resources) so that HA can perform the failovers.

Shares

I didn’t even find out about this one until I read the Availability Guide again. Prior to vSphere 4.1, a virtual machine failed over by HA could be granted more resource shares than it should, causing resource starvation until DRS balanced the load. As of vSphere 4.1, HA calculates normalized shares for a virtual machine when it is powered on after an isolation event!
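To illustrate why normalized shares matter, here is a bit of back-of-the-envelope Python. The numbers are made up and this is not how ESX(i) computes entitlements internally; it only shows the effect of an overly high share value.

# Illustrative math only: under contention a VM's entitlement is roughly its
# shares divided by the total shares of the powered-on VMs on the host, so a
# VM restarted with an uncorrected, overly high share value starves the rest.
def entitlement_fraction(vm_shares, all_vm_shares):
    return vm_shares / sum(all_vm_shares)

# Hypothetical host with three 1000-share VMs plus one restarted VM.
print(entitlement_fraction(1000, [1000, 1000, 1000, 1000]))    # 0.25, normalized
print(entitlement_fraction(10000, [1000, 1000, 1000, 10000]))  # ~0.77, starvation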

vSphere 4 U2 and recovering from HA Split Brain

Duncan Epping · Jul 2, 2010 ·

A couple of months ago I wrote this article about a future feature that would enable HA to recover from a split-brain scenario. vSphere 4.0 Update 2 was recently released, but neither the release notes nor the documentation mentioned this new feature.

I had never noticed this until I was having a discussion about this feature with one of my colleagues. I asked our HA product manager and one of our developers, and it appears that this mysteriously slipped through the release notes. As I personally believe this is a very important feature of HA, I wanted to rehash some of the info stated in that article. I did rewrite it slightly though. Here we go:

One of the most common issues experienced in an iSCSI/NFS environment with VMware HA pre-vSphere 4.0 Update 2 is a split-brain situation.

First let me explain what a split-brain scenario is. Let’s start by describing the situation that is most commonly encountered:

  • 4 Hosts
  • iSCSI / NFS based storage
  • Isolation response: leave powered on

When one of the hosts is completely isolated, including the Storage Network, the following will happen:

  1. Host ESX001 is completely isolated, including the storage network (remember, iSCSI/NFS-based storage!), but the VMs will not be powered off because the isolation response is set to “leave powered on”.
  2. After 15 seconds the remaining, non-isolated, hosts will try to restart the VMs.
  3. Because the iSCSI/NFS network is also isolated, the lock on the VMDK will time out and the remaining hosts will be able to boot up the VMs.
  4. When ESX001 returns from isolation, it will still have the VMX processes running in memory, and this is when you will see a “ping-pong” effect within vCenter; in other words, VMs flipping back and forth between ESX001 and any of the other hosts.

As of version 4.0 Update 2, ESX(i) detects that the lock on the VMDK has been lost and issues a question which is automatically answered. The VM will be powered off to recover from the split-brain scenario and avoid the ping-pong effect. Please note that HA will generate an event for this auto-answer, which is viewable within vCenter.
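Conceptually the recovery boils down to something like the following Python sketch. The class and names are made up for illustration and do not reflect the actual hostd/vpxa implementation.

# Minimal sketch of the recovery idea, not the actual hostd/vpxa code: when a
# running VM notices it no longer holds the lock on its VMDK, assume another
# host has restarted it and power this stale instance off to end the split brain.
class RunningVm:
    def __init__(self, name, holds_vmdk_lock=True):
        self.name = name
        self.holds_vmdk_lock = holds_vmdk_lock
        self.powered_on = True

    def resolve_split_brain(self):
        if self.powered_on and not self.holds_vmdk_lock:
            # In ESX(i) 4.0 U2 this surfaces as a question that is auto-answered.
            print(f"{self.name}: VMDK lock lost, powering off the stale instance")
            self.powered_on = False

vm = RunningVm("desktop-042", holds_vmdk_lock=False)
vm.resolve_split_brain()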

Don’t you just love VMware HA!

