
Yellow Bricks

by Duncan Epping


ESX

Cool vSphere 4.1 Feature: Cluster Operational Status

Duncan Epping · Jul 20, 2010 ·

There’s a cool new feature added to vSphere 4.1 for HA. If an error occurs you can easily check what the issue is by going to your cluster and clicking the “Cluster Operational Issues” line on the Summary tab.

If there are no issues the screen will be completely gray. I forced an issue though so you can see what it shows. Note that it also shows the "Role" of the host, and in this case it is a Secondary Node!
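For those who prefer the API over the client, here is a minimal pyVmomi sketch that lists a cluster's configuration issues, which is, as far as I know, where these operational issues surface. The vCenter hostname, credentials and the unverified SSL context are lab-only placeholders.

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import ssl

# Lab-only placeholders for the connection details.
si = SmartConnect(host="vcenter.local", user="administrator",
                  pwd="password",
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Walk all clusters and print their current configuration/operational issues.
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
for cluster in view.view:
    print(cluster.name)
    for issue in cluster.configIssue:
        print(" -", issue.fullFormattedMessage)

view.DestroyView()
Disconnect(si)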

New Academic/Tech Paper on FT

Duncan Epping · Jul 19, 2010 ·

I received this paper a while back and think it is an excellent read. I just copied a random part of the paper to give you an idea of what it covers. There's not much more to say about it than: just read it, it is as in-depth as it can get on FT. I have read it several times by now and still discover new things every time I read it.

The Design and Evaluation of a Practical System for Fault-Tolerant Virtual Machines

There are many possible ways to attempt to detect failure of the primary and backup VMs. VMware FT uses UDP heartbeating between servers that are running fault-tolerant VMs to detect when a server may have crashed. In addition, VMware FT monitors the logging traffic that is sent from the primary to the backup VM and the acknowledgments sent from the backup VM to the primary VM.
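To make that mechanism a bit more concrete, here is a generic Python sketch of UDP heartbeating with a timeout. It is only an illustration of the technique the paper describes, not VMware's implementation, and the port, interval and timeout values are made up.

import socket
import time

HEARTBEAT_PORT = 9999   # made-up values for the illustration
INTERVAL = 1.0          # seconds between heartbeats
TIMEOUT = 5.0           # declare the peer failed after this much silence

def send_heartbeats(peer_ip):
    # Periodically send a small UDP datagram to the peer.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        sock.sendto(b"heartbeat", (peer_ip, HEARTBEAT_PORT))
        time.sleep(INTERVAL)

def monitor_peer():
    # Treat any received datagram as "peer is alive"; prolonged silence means failure.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", HEARTBEAT_PORT))
    sock.settimeout(TIMEOUT)
    while True:
        try:
            sock.recvfrom(64)
        except socket.timeout:
            print("no heartbeat for %.0f seconds, assuming peer failed" % TIMEOUT)
            break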

vSphere 4.1 HA feature, totally unsupported but too cool

Duncan Epping · Jul 16, 2010 ·

In early 2009 I wrote an article on the impact of Primary Nodes and Secondary Nodes on your design. It was primarily focused on blade environments and basically discussed how to avoid having all your primary nodes in a single chassis. If that single chassis were to fail, no VMs would be restarted, as one of the primary nodes is the "failover coordinator" and without a primary node to assign this role to, a failover can't be initiated.

With vSphere 4.1 a new advanced setting has been introduced. This setting is not even experimental, it is unsupported. I don't recommend anyone using this in a production environment; if you do want to play around with it, use your test environment. Here it is:

das.preferredPrimaries = hostname1 hostname2 hostname3
or
das.preferredPrimaries = 192.168.1.1,192.168.1.2,192.168.1.3

The list of hosts that are preferred as primary can be either space or comma separated. You don't need to specify 5 hosts, you can specify any number of hosts. If you specify 5 and all 5 are available, they will be the primary nodes in your cluster. If you specify more than 5, the first 5 on your list will become primary.

Please note that I haven’t personally tried it and I can’t guarantee it will work.
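If you do want to try it in a lab, the advanced option can also be set through the API. The following pyVmomi sketch is untested and simply uses the standard cluster reconfiguration call; the connection details, cluster name and host list are placeholders.

from pyVim.connect import SmartConnect
from pyVmomi import vim
import ssl

# Lab-only placeholders for the connection and the cluster name.
si = SmartConnect(host="vcenter.local", user="administrator",
                  pwd="password",
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Cluster01")

# Add the unsupported advanced option to the HA (das) configuration.
spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(
        option=[vim.option.OptionValue(
            key="das.preferredPrimaries",
            value="192.168.1.1,192.168.1.2,192.168.1.3")]))

cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)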

VMware View without HA?

Duncan Epping · Jul 15, 2010 ·

I was discussing something with one of my former colleagues a couple of days ago. He asked me what the impact was of running VMware View in an environment without HA.

To be honest I am not a View SME, but I do know a thing or two about HA/vSphere in general. So the first thing that I mentioned was that it wasn't a good idea. Although VDI in general is all about density, not running HA in these environments could lead to serious issues when a host fails.

Now, just imagine you have 80 desktop VMs running per host and roughly 8 hosts in a DRS-only cluster on NFS-based storage. One of those hosts is isolated from the network… what happens?

  1. User connection is dropped
  2. VMDK Lock times out
  3. User tries to reconnect
  4. Broker powers on the VM on a new host

Now that sounds great, doesn't it? Well yeah, in a way it does, but what happens when the host is not isolated anymore?

Indeed, the VMs were still running. So basically you have a split brain scenario. The only way in the past to avoid this was to make sure you had HA enabled and had set HA to power off the VM.

But with vSphere 4 Update 2 a new mechanism has been introduced. I wanted to stress this, as some people have already made the assumption that it is part of AAM/HA. It actually isn't… The question for powering off the VM to recover from the split brain scenario is generated by "hostd" and answered by "vpxa". In other words, with or without HA enabled, ESX(i) will recover from the split brain scenario.
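Conceptually the recovery boils down to something like the sketch below. The helper functions are hypothetical and only there to illustrate the flow; the real logic lives inside hostd and vpxa and is not exposed as an API like this.

def recover_from_split_brain(local_vms):
    # Runs, conceptually, on the host that just came out of isolation.
    for vm in local_vms:
        # The original instance has lost the lock on its VMDK because a
        # second instance was powered on elsewhere during the isolation.
        if not holds_vmdk_lock(vm):          # hypothetical check
            # hostd generates the "power off this VM?" question and vpxa
            # answers it, so the stale duplicate gets powered off.
            answer_power_off_question(vm)    # hypothetical action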

Again, I am most definitely not a Desktop/View guy, so I am wondering how the View experts out there feel about disabling HA on your View Compute Cluster. (Note that on the Management Layer this should be enabled.)

vSphere 4.1, VMware HA: New maximums and DRS integration will make our life easier

Duncan Epping · Jul 14, 2010 ·

I guess there are a couple of key points I will need to stress for those creating designs:

New HA maximums

  • 32 host clusters
  • 320 virtual machines per host
  • 3,000 virtual machines per cluster

In other words:

  • you can have 10 hosts with 300 VMs each
  • or 20 hosts with 150 VMs each
  • or 32 hosts with 93 VMs each…

as long as you don't go beyond 320 VMs per host or 3,000 per cluster, you are fine!
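A quick back-of-the-envelope check of those numbers (just plain arithmetic, nothing vSphere-specific):

MAX_VMS_PER_HOST = 320
MAX_VMS_PER_CLUSTER = 3000

def max_vms_per_host(host_count):
    # Largest per-host VM count that stays within both limits.
    return min(MAX_VMS_PER_HOST, MAX_VMS_PER_CLUSTER // host_count)

for hosts in (10, 20, 32):
    print(hosts, "hosts ->", max_vms_per_host(hosts), "VMs per host")
# 10 hosts -> 300, 20 hosts -> 150, 32 hosts -> 93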

DRS Integration

HA integrates on multiple levels with DRS as of vSphere 4.1. It is a huge improvement and, in my opinion, something everyone should know about.

Resource Fragmentation

As of vSphere 4.1 HA is closely integrated with DRS. When a failover occurs, HA will first check if there are resources available on that host for the failover. If resources are not available, HA will ask DRS to accommodate for these where possible. Think of a VM with a huge reservation and fragmented resources throughout your cluster, as described in my HA Deepdive. HA, as of 4.1, will be able to request a defragmentation of resources to accommodate this VM's resource requirements. How cool is that?! One thing to note though is that HA will request it, but a guarantee cannot be given, so you should still be cautious when it comes to resource fragmentation.
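A toy example of what resource fragmentation looks like; the numbers are made up, but they show a cluster that has enough free memory in total while no single host can satisfy the reservation, which is exactly the case where the defragmentation request helps:

# Made-up free memory per host (GB) and a VM with a large reservation.
free_mem_gb = {"esx01": 6, "esx02": 7, "esx03": 5}
vm_reservation_gb = 16

total_free = sum(free_mem_gb.values())                       # 18 GB in total
fits_on_one_host = any(free >= vm_reservation_gb for free in free_mem_gb.values())

print("cluster has enough capacity:", total_free >= vm_reservation_gb)  # True
print("a single host has enough capacity:", fits_on_one_host)           # False
# HA asks DRS to migrate VMs around so one host ends up with >= 16 GB free,
# but as noted above this is a request, not a guarantee.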

DPM

In the past there was barely any integration between DRS/DPM and HA. Especially when DPM was enabled, this could lead to some weird behaviour when resources were scarce and an HA failover would need to happen. With vSphere 4.1 this has changed. In such cases, VMware HA will use DRS to try to adjust the cluster (for example, by bringing hosts out of standby mode or migrating virtual machines to defragment the cluster resources) so that HA can perform the failovers.

Shares

I hadn't even found out about this one until I read the Availability Guide again. Prior to vSphere 4.1, a virtual machine failed over by HA could be granted more resource shares than it should, causing resource starvation until DRS balanced the load. As of vSphere 4.1, HA calculates normalized shares for a virtual machine when it is powered on after an isolation event!
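As a small worked example of what normalized shares mean in practice: a VM's slice of a contended resource is its share value divided by the sum of all share values competing for that resource. The share values below are made up for the illustration.

vm_shares = {"vm1": 1000, "vm2": 1000, "vm3": 2000}   # made-up share values

total = sum(vm_shares.values())
for name, shares in vm_shares.items():
    print(f"{name}: {shares / total:.0%} of the contended resource")
# vm1: 25%, vm2: 25%, vm3: 50%
# Prior to 4.1 a failed-over VM could end up with a disproportionately large
# slice until DRS rebalanced; as of 4.1 it is normalized at power-on.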

