
Yellow Bricks

by Duncan Epping


drs

HA and a Metrocluster

Duncan Epping · Aug 20, 2010 ·

I was reading an excellent article on NetApp MetroClusters and VM-Host Affinity Rules by Larry Touchette the other day. That article is based on the tech report TR-3788, which covers the full solution but does not include the 4.1 enhancements.

The main focus of the article is on VM-Host Affinity Rules. Great stuff, and it will “ensure” you keep your IO local. As explained, when a Fabric MetroCluster is used, the increased latency of going across, for instance, 80 km of fibre will be substantial. By using VM-Host Affinity Rules, where a group of VMs is linked to a group of hosts, this “overhead” can be avoided.

Now, the question of course is: what about HA? The example NetApp provided shows 4 hosts. With only four hosts we all know, hopefully at least, that all of these hosts will be primary nodes. So even if a set of hosts fails, one of the remaining hosts will be able to take over the failover coordinator role and restart the VMs. If you have a cluster of up to 8 hosts that is still very much true: with a maximum of 5 primaries and at most 4 hosts on each side, at least a single primary will exist in each site.
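
For those who like to see that claim spelled out, here is a quick brute-force check in plain Python (nothing vSphere specific, just a hypothetical 8-host cluster split 4/4): no possible choice of 5 primaries can leave either site without one.

from itertools import combinations

# Hypothetical 8-host cluster, split evenly across site A and site B.
hosts = ["A1", "A2", "A3", "A4", "B1", "B2", "B3", "B4"]

# Check every possible way HA could pick its 5 primary nodes.
always_covered = all(
    any(h.startswith("A") for h in primaries) and any(h.startswith("B") for h in primaries)
    for primaries in combinations(hosts, 5)
)

print(always_covered)  # True: 5 primaries cannot all fit into a 4-host site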

But what about more than 8 hosts? What will happen when the link between the sites fails? How do I ensure each site has primaries left to restart VMs if needed?

Take a look at the following diagram I created to visualize all of this:

We have two datacenters here, Datacenter A and Datacenter B. Both have their own FAS with two shelves and their own set of VMs, which run on that FAS. Although storage will be mirrored, there is still only one real active copy of the datastore. In this case VM-Host Affinity Rules have been created to keep the VMs local in order to avoid IO going across the wire. This is very similar to what NetApp described.

However, in my case there are 5 hosts in total, shown in a darker green. These hosts were specified as the preferred primary nodes, which means that each site will have at least 2 primary nodes.

Let’s assume the link between Datacenter A and B dies. Some might assume that this will trigger the HA Isolation Response, but it actually will not.

The reason for this is that an HA primary node still exists in each site. The Isolation Response is only triggered when no heartbeats are received. As a primary node sends heartbeats to both the primary and secondary nodes, a heartbeat will always be received. Again, I can’t emphasize this enough: an Isolation Response will not be triggered.

However, if the link between these datacenters dies, it will appear to Datacenter A as if Datacenter B is unreachable, and one of the primaries in Datacenter A will initiate restart tasks for the allegedly impacted VMs, and vice versa. But as the Isolation Response has not been triggered, a lock on the VMDKs will still exist and it will be impossible to restart the VMs.

These VMs will simply remain running within their own site. Although it might appear on both ends that the other datacenter has died, HA is “smart” enough to detect that it hasn’t, and it will be up to you to decide whether or not you want to fail over those VMs.

I am just so excited about these developments that I can’t get enough of them. Although the “das.preferredprimaries” setting is not supported as of writing, I thought this was cool enough to share with you guys. I also want to point out that in the diagram I show 2 isolation addresses. This is only needed when the specified gateway is not accessible at both ends while the network connection between the sites is dead. If the gateway is accessible at both sites even in case of a network failure, only 1 isolation address, which can be the default gateway, is required.
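
For completeness, below is a minimal pyVmomi sketch of how such advanced options could be pushed to a cluster. This is purely illustrative: the vCenter address, cluster name, host names and isolation address are made up, certificate handling and error handling are omitted, and das.preferredprimaries is, as said, not supported.

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.local", user="administrator", pwd="secret")
content = si.RetrieveContent()

# Walk the inventory and grab the cluster by name (first datacenter only).
datacenter = content.rootFolder.childEntity[0]
cluster = next(c for c in datacenter.hostFolder.childEntity
               if isinstance(c, vim.ClusterComputeResource) and c.name == "MetroCluster")

spec = vim.cluster.ConfigSpecEx()
spec.dasConfig = vim.cluster.DasConfigInfo()
spec.dasConfig.option = [
    # Hint which hosts should become HA primary nodes (comma separated list).
    vim.option.OptionValue(key="das.preferredprimaries",
                           value="esx01a,esx02a,esx01b,esx02b,esx03b"),
    # Additional isolation address, only needed when the gateway is not
    # reachable from both sites during an inter-site network failure.
    vim.option.OptionValue(key="das.isolationaddress1", value="192.168.1.254"),
]

cluster.ReconfigureComputeResource_Task(spec, modify=True)
Disconnect(si)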

HA/DRS and Flattened Shares

Duncan Epping · Jul 22, 2010 ·

A week ago I already touched on this topic, but I wanted to get a better understanding for myself of what could go wrong in these situations and how vSphere 4.1 solves this issue.

Pre-vSphere 4.1, an issue could arise when custom shares had been set on a virtual machine. When HA fails over a virtual machine, it powers on the virtual machine in the Root Resource Pool. However, the virtual machine’s shares were scaled for its appropriate place in the resource pool hierarchy, not for the Root Resource Pool. This could cause the virtual machine to receive either too many or too few resources relative to its entitlement.

A scenario in which this can occur would be the following:

VM1 has 1000 shares and Resource Pool A has 2000 shares. Resource Pool A contains 2 VMs, and both will receive 50% of those “2000” shares.

When the host fails, both VM2 and VM3 will end up at the same level as VM1. However, as a custom shares value of 10000 was specified on both VM2 and VM3, they will completely blow away VM1 in times of contention. This is depicted in the following diagram:

This situation would persist until the next invocation of DRS re-parents the virtual machine to its original Resource Pool. To address this issue, as of vSphere 4.1 DRS will flatten the virtual machine’s shares and limits before the failover. This flattening process ensures that the VM will get the resources it would have received if it had been failed over into the correct Resource Pool. This scenario is depicted in the following diagram. Note that both VM2 and VM3 are placed under the Root Resource Pool with a shares value of 1000.

Of course when DRS is invoked, both VM2 and VM3 will be re-parented under Resource Pool A and will again receive the shares they were originally assigned. I hope this makes it a bit clearer what this “flattened shares” mechanism actually does.
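
To make the numbers concrete, here is the same example worked out in a few lines of plain Python (no vSphere APIs, just the share math from the diagrams above):

def entitlement(shares):
    # Fraction of contended resources each sibling receives, based on its shares.
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}

# Pre-4.1: VM2 and VM3 land in the Root Resource Pool with their raw custom shares.
print(entitlement({"VM1": 1000, "VM2": 10000, "VM3": 10000}))
# VM1 ends up with roughly 5%, VM2 and VM3 with roughly 47.5% each.

# vSphere 4.1: shares are flattened first. Resource Pool A's 2000 shares are
# split 50/50 over VM2 and VM3, so each arrives at the root with 1000 shares.
print(entitlement({"VM1": 1000, "VM2": 1000, "VM3": 1000}))
# Every VM now receives a third, the same relative entitlement as before the failover.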

vSphere 4.1, VMware HA: New maximums and DRS integration will make our life easier

Duncan Epping · Jul 14, 2010 ·

I guess there are a couple of key points I will need to stress for those creating designs:

New HA maximums

  • 32 host clusters
  • 320 virtual machines per host
  • 3,000 virtual machines per cluster

In other words:

  • you can have 10 hosts with 300 VMs each
  • or 20 hosts with 150 VMs each
  • or 32 hosts with 93 VMs each…

as long as you don’t go beyond 320 per host or 3000 per cluster you are fine!
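
If you want to double check a design against these numbers, a trivial helper like the one below (my own sketch, not a VMware tool) does the trick:

HA_MAX_HOSTS_PER_CLUSTER = 32
HA_MAX_VMS_PER_HOST = 320
HA_MAX_VMS_PER_CLUSTER = 3000

def within_ha_maximums(hosts, vms_per_host):
    # True when the design stays within the vSphere 4.1 HA configuration maximums.
    return (hosts <= HA_MAX_HOSTS_PER_CLUSTER
            and vms_per_host <= HA_MAX_VMS_PER_HOST
            and hosts * vms_per_host <= HA_MAX_VMS_PER_CLUSTER)

print(within_ha_maximums(10, 300))  # True
print(within_ha_maximums(32, 93))   # True, 2976 VMs in total
print(within_ha_maximums(32, 100))  # False, 3200 exceeds the 3000 per cluster limit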

DRS Integration

HA integrates with DRS on multiple levels as of vSphere 4.1. It is a huge improvement and, in my opinion, something everyone should know about.

Resource Fragmentation

As of vSphere 4.1 HA is closely integrated with DRS. When a failover occurs, HA will first check if there are sufficient resources available for the failover. If resources are not available, HA will ask DRS to accommodate these where possible. Think about a VM with a huge reservation and fragmented resources throughout your cluster, as described in my HA Deepdive. HA, as of 4.1, will be able to request a defragmentation of resources to accommodate this VM’s resource requirements. How cool is that?! One thing to note though is that HA will request it, but a guarantee cannot be given, so you should still be cautious when it comes to resource fragmentation.
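
A simple way to picture resource fragmentation is the sketch below (plain Python, hypothetical numbers): the cluster as a whole has enough unreserved memory for the VM, but no single host does.

# Unreserved memory left on each host in a hypothetical 4-host cluster (GB).
free_memory_per_host_gb = [6, 5, 4, 5]
vm_reservation_gb = 12

total_free = sum(free_memory_per_host_gb)      # 20 GB across the cluster
largest_slot = max(free_memory_per_host_gb)    # 6 GB on the "best" host

print(total_free >= vm_reservation_gb)    # True: enough capacity in total
print(largest_slot >= vm_reservation_gb)  # False: no single host can satisfy the
                                          # reservation, so the restart would fail;
                                          # as of 4.1 HA asks DRS to defragment,
                                          # although success is not guaranteed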

DPM

In the past there was barely any integration between DRS/DPM and HA. Especially when DPM was enabled, this could lead to some weird behaviour when resources were scarce and an HA failover needed to happen. With vSphere 4.1 this has changed. In such cases, VMware HA will use DRS to try to adjust the cluster (for example, by bringing hosts out of standby mode or migrating virtual machines to defragment the cluster resources) so that HA can perform the failovers.

Shares

I didn’t even find out about this one until I read the Availability Guide again. Prior to vSphere 4.1, a virtual machine failed over by HA could be granted more resource shares than it should, causing resource starvation until DRS balanced the load. As of vSphere 4.1, HA calculates normalized shares for a virtual machine when it is powered on after an isolation event!

Reservations primer

Duncan Epping · Jul 8, 2010 ·

My colleague Craig Risinger wrote the below and was kind enough to share it with us. Thanks Craig!

A quick primer on VMware Reservations (not that anyone asked)…

A Reservation is a guarantee.

There’s a difference between reserving a resource and using it.  A VM can use more or less than it has reserved. Also, if a reservation-holder isn’t using all the reserved resource, it will share CPU but not RAM. In other words, CPU reservations are friendly but memory reservations are greedy.

Reservation admission control:

If a VM has a reservation defined, the ESX host must have at least that much resource unreserved (not just unused, but unreserved) or else it will refuse to power on the VM. Reservations cannot overlap. A chunk of resource can be reserved by only one entity at a time; there can’t be two reservations on it.

Scenario #1

Given:

An ESX host has 16 GHz (= 8 x 2 GHz cores) and 16 GB.
VM-1 and VM-2 each have 8 vCPUs and 16 GB of vRAM.
VM-1 has reserved 13 GHz of CPU resources and 10 GB of memory.
VM-1 is currently using 11 GHz of CPU resources and 9 GB of memory.  (Using != reserving.)

Consequently:

VM-2 can use up to 5 GHz. (Not 3 GHz, CPU reservations are friendly.)
VM-2 can reserve up to 3 GHz. (Using != reserving. Reservations don’t overlap.)
VM-2 can use up to 6 GB. (Not 7 GB. Memory reservations are greedy.)
VM-2 can reserve up to 6 GB. (Reservations don’t overlap.)

Please note that if VM-2 had a 7 GB reservation defined, it would not power on.  (Reservation admission control.)
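
For those who prefer to see the arithmetic, the numbers in the list above work out as follows (plain Python, just subtraction):

host_cpu_ghz, host_mem_gb = 16, 16

vm1_cpu_reserved, vm1_mem_reserved = 13, 10
vm1_cpu_used, vm1_mem_used = 11, 9

# CPU reservations are friendly: unused reserved cycles can be borrowed by VM-2.
vm2_cpu_usable = host_cpu_ghz - vm1_cpu_used          # 5 GHz
# Reservations never overlap, so only unreserved CPU can be reserved by VM-2.
vm2_cpu_reservable = host_cpu_ghz - vm1_cpu_reserved  # 3 GHz

# Memory reservations are greedy: reserved pRAM is off limits even while unused.
vm2_mem_usable = host_mem_gb - vm1_mem_reserved       # 6 GB
vm2_mem_reservable = host_mem_gb - vm1_mem_reserved   # 6 GB

print(vm2_cpu_usable, vm2_cpu_reservable, vm2_mem_usable, vm2_mem_reservable)
# A 7 GB memory reservation on VM-2 would therefore fail admission control (7 > 6).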

It’s also possible for VM-1 to use more resources than it has reserved. That makes the discussion a bit more complex. VM-1 is guaranteed whatever it’s reserved, and it also gets to fight VM-2 for more resources, assuming VM-2 hasn’t reserved the excess. I’ll come up with example scenarios for that too if you like.

There’s good reason why CPU reservations are friendly but memory reservations are greedy. Say a reservation holder is not using all of a resource and lets an interloper use that resource for a while; later, the reservation holder wants to use everything it has reserved. An interloper can be kicked off a pCPU quickly: CPU instructions are transient and quickly finished. But RAM holds data. If an interloper were holding pRAM, its data would have to be swapped to disk before the reservation holder could repurpose that pRAM to satisfy its reservation. That swapping would take significant time and delay the reservation holder unfairly. So ESX doesn’t allow reserved pRAM to be used by an interloper.

For a more detailed discussion that gets into Resource Pools, how memory reservations do or don’t prevent host-level swapping, and more, see the following post I wrote several months ago, http://www.yellow-bricks.com/2010/03/03/cpumem-reservation-behaviour/.

Author: Craig Risinger

Are you using DPM? We need you!

Duncan Epping · Jul 7, 2010 ·

I just received a request from our Product Management team. Gil Adato is working on the next generation of DPM and is seeking current and past DPM users, and he asked me to post the following message on my blog. I hope you can help Gil and VMware take DPM to the next level!

Dear VMware Customers,

Product management is starting planning efforts for the 2012 vSphere release, and we are considering numerous DPM improvements. In order to make the right decision, it’s critical that we get a better understanding of our customers’ experiences with the current product, and get customers’ feedback on what needs to be on the roadmap. We’re also interested in customers’ improvement suggestions and in sharing with them several improvement ideas that we have in mind. Customers’ input will be very helpful to us, and customers would see the benefit of communicating their comments, requirements and suggestions/wish-list directly to the product team.

If you’re a current or past DPM user and you’d like to be a part of the process and help shape the next generation of VMware’s products, please contact me directly.

My contact info is the following:

Gil Adato
gadato@vmware.com

When contacting me, please send me the following initial information:

  • Company Name
  • Customer contact name
  • Location (Country)
  • Customer’s email address

VMware’s product management team is looking forward to hearing from you and making you a part of the product development process.

Thank you for your cooperation,

Gil Adato
