
Yellow Bricks

by Duncan Epping



Isolation / Partition scenario with VSAN cluster, how is this handled?

Duncan Epping · Sep 19, 2013 ·

After explaining how a disk or host failure is handled in a VSAN cluster, it only made sense to take the next step… how are isolations or partitions in a Virtual SAN cluster handled? I guess let’s start at the beginning, and I am going to try to keep it simple; first a recap of what we learned in the disk/host failures article.

Virtual SAN (VSAN) has the ability to create mirrors of objects. This ability is defined within a policy (VM Storage Policy, aka Storage Policy Based Management). You can set an option called “failures to tolerate” to anywhere between 0 and 3 at the moment. By default this option is set to 1, which means you will have two copies of your data. On top of that, VSAN needs a witness / quorum to help figure out who takes ownership in the case of an event. So what does this look like? Note that in the diagram below I used the terms “vmdk” and “witness” to simplify things; in reality this could be any type of component of a VM.

So what did we learn from this (hopefully) simple diagram?

  • A VM does not necessarily have to run on the same host as where its storage objects are sitting
  • The witness lives on a different host than the components it is associated with, so that there is an odd number of hosts involved for tiebreaking when a network partition occurs (a small sketch of this follows right after this list)
  • The VSAN network is used for communication, IO and HA
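
For those who prefer to see this expressed in a few lines of code, here is a tiny Python sketch of the object layout such a policy produces. It only models the default “failures to tolerate = 1” case from the diagram (two data copies plus a single witness); real VSAN witness sizing and placement is more involved, and the component names are purely illustrative.

def object_layout(failures_to_tolerate=1):
    # "failures to tolerate" = n results in n + 1 copies of the data
    copies = failures_to_tolerate + 1
    layout = [f"vmdk-copy-{i + 1}" for i in range(copies)]
    # plus a witness/quorum component used for tiebreaking
    layout.append("witness")
    return layout

components = object_layout(1)
print(components)           # ['vmdk-copy-1', 'vmdk-copy-2', 'witness']
print(len(components) % 2)  # 1 -> an odd number of components, so ties can be broken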

Let’s recap some of the HA changes for a VSAN cluster first, before we dive into the details:

  • When HA is turned on in the cluster, FDM agent (HA) traffic uses the VSAN network and not the Management Network. However, when a potential isolation is detected HA will ping the default gateway (or specified isolation address) using the Management Network.
  • When enabling VSAN ensure vSphere HA is disabled. You cannot enable VSAN when HA is already configured. Either configure VSAN during the creation of the cluster or disable vSphere HA temporarily when configuring VSAN.
  • When there are only VSAN datastores available within a cluster, Datastore Heartbeating is disabled. HA will never use a VSAN datastore for heartbeating; as the VSAN network is already used for network heartbeating, using the datastore for heartbeating would not add anything.
  • When changes are made to the VSAN network it is required to re-configure vSphere HA! (A rough scripted sketch of this step follows right after this list.)
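
For completeness, here is a rough pyVmomi sketch of that last step: triggering “Reconfigure for vSphere HA” on every host in the cluster. Treat it as a sketch under assumptions, not a supported script: it assumes pyVmomi is installed, and the vCenter address, credentials and the cluster name “VSAN-Cluster” are placeholders.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only, skips certificate checks
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "VSAN-Cluster")  # placeholder name
    for host in cluster.host:
        # equivalent to right-clicking the host and selecting "Reconfigure for vSphere HA"
        host.ReconfigureHostForDAS_Task()
finally:
    Disconnect(si)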

As you can see, the VSAN network plays a big role here, an even bigger one than you might realize, as it is also used by HA for network heartbeating. So what if the host on which the VM is running gets isolated from the rest of the network? The following would happen:

  • HA will detect there are no network heartbeats received from “esxi-01”
  • HA master will try to ping the slave “esxi-01”
  • HA will declare the slave “esxi-01” is unavailable
  • The VM will be restarted on one of the other hosts… “esxi-02” in this case, but it could be any of them, as depicted in the diagram below (and roughly sketched in code right after this list)
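
To make that flow explicit, here is a toy Python sketch of the decision sequence above. It is not FDM code, just the bullets translated into a function; the callbacks, VM names and host names are made up.

def handle_missing_heartbeats(slave, ping, restart_vm, vms_on_slave, other_hosts):
    # Step 1 already happened: no network heartbeats were received from the slave.
    # Step 2: the master tries to ping the slave.
    if ping(slave):
        return "slave responded to ping, no restart"
    # Step 3: declare the slave unavailable.
    # Step 4: restart its VMs on one of the other hosts (placement is simplified here).
    for vm in vms_on_slave:
        restart_vm(vm, other_hosts[0])
    return f"{slave} declared unavailable, VMs restarted"

print(handle_missing_heartbeats(
    "esxi-01",
    ping=lambda host: False,
    restart_vm=lambda vm, host: print(f"restarting {vm} on {host}"),
    vms_on_slave=["vm-01"],
    other_hosts=["esxi-02", "esxi-03", "esxi-04"],
))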

Simple right? Before I forget, for these scenarios it is important to ensure that your isolation response is set to power-off. But I guess the question now arises… what if “esxi-01” and “esxi-02” were part of the same partition? What happens then? Well, that is where the witness comes into play. Let me show the diagram first, as that will make it a bit easier to understand!

Now this scenario is slightly more complex. There are two partitions: one of the partitions is running the VM with its VMDK, and the other partition has a VMDK and a witness. Guess what happens? Right, VSAN uses the witness to see which partition has quorum, and based on that one of the two will win. In this case Partition-2 has more than 50% of the components of this object and as such is the winner. This means that the VM will be restarted on either “esxi-03” or “esxi-04” by HA. Note that the VM in Partition-1 will not be powered off, even if you have configured the isolation response to do so, as the hosts in this partition can still see each other and would simply re-elect a master; they are partitioned, not isolated!

But what if “esxi-01” and “esxi-04” were isolated, what would happen then? This is what it would look like:

Remember that rule which I slipped into the previous paragraph? The winner is declared based on the percentage of components available within a partition: if a partition has access to more than 50% of the components, it has won. This means that when “esxi-01” and “esxi-04” are isolated, either “esxi-02” or “esxi-03” can restart the VM, because 66% of the components reside within this part of the cluster. Nice right?!
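
If you like to see the math spelled out, here is a small Python illustration of that more-than-50% rule. The placement below is an assumption on my part that is consistent with the scenarios in this post (the diagrams are not reproduced here): the VM runs on “esxi-01”, its two data copies sit on “esxi-02” and “esxi-03”, and the witness sits on “esxi-04”.

components = {
    "vmdk-copy-1": "esxi-02",
    "vmdk-copy-2": "esxi-03",
    "witness": "esxi-04",
}

def ownership_percentage(partition_hosts):
    # a partition "wins" when it sees more than 50% of the object's components
    visible = sum(1 for host in components.values() if host in partition_hosts)
    return 100.0 * visible / len(components)

# Partition scenario: {esxi-01, esxi-02} versus {esxi-03, esxi-04}
print(ownership_percentage({"esxi-01", "esxi-02"}))  # ~33.3 -> loses, VM is not restarted here
print(ownership_percentage({"esxi-03", "esxi-04"}))  # ~66.7 -> wins, HA restarts the VM here

# Isolation of esxi-01 and esxi-04: the remaining part of the cluster wins
print(ownership_percentage({"esxi-02", "esxi-03"}))  # ~66.7 -> esxi-02 or esxi-03 restarts the VM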

I hope this makes isolations / partitions a bit clearer. I realize these concepts will be tough for the first weeks/months… I will try to explore some more (complex) scenarios in the near future.

vSphere HA Futures: Restart Order

Duncan Epping · Sep 13, 2013 ·

At VMworld I hosted a group discussion together with Keith Farkas (HA Lead Engineer) on the topic of HA Futures. Based on this group discussion Keith and I decided to gather more feedback from the field; this post will hopefully help us with that. Please do not hesitate to comment. I will have a couple of articles following this one, but let’s get started with HA futures for the Restart Order first.

A topic that has come up at various sessions is HA restart ordering / priorities. Today HA provides four levels of restart priority: High, Medium, Low, and Disabled. The thing to note with the current restart priority, though, is that there is no guarantee VMs are actually restarted in that order when the VMs are started on more than one host. Even when HA restarts them in the right order, there is also no guarantee around when the boot cycle completes. Typically a large virtual machine running, for instance, a database will take longer to boot than a server just running DNS. So what do we propose? We propose restart orders instead of restart priority. What does this mean, and what would we like to know from you?

There are two complementary ways of implementing this and we would like your feedback including which one you think would be most useful.

  1. Global Restart Order aka Bucketing
  2. VM to VM dependency Chains

Let’s explain these two options and then I’ll let you guys chime in.

Global Restart Order aka Bucketing is basically what you have today with “restart priorities”, only it will actually enforce the restart order and it will allow for more flexibility. With this option you could, for instance, create 5 buckets and then add virtual machines to these buckets appropriately. These buckets could be: Priority 1, Priority 2, and so on. When a failure has occurred, vSphere HA would then restart all VMs in the bucket “Priority 1” first, and when that bucket has finished starting (e.g., wait for the VMware Tools heartbeat to report “alive” for each VM) vSphere HA would continue with the next bucket, and so on. Waiting for VMware Tools to report “alive” is one way to determine that a VM is “ready”. We are thinking of providing three other “wait” options: wait for an application heartbeat, wait a certain amount of time after the VM powers on, or today’s behavior, wait for the power-on task to complete.
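
To make the proposal a bit more concrete, here is a small Python sketch of how such bucketed restarts could behave. This is purely illustrative and not how vSphere HA is implemented; the wait condition is pluggable to mimic the “wait” options mentioned above, and the VM names are made up.

import time

def restart_by_bucket(buckets, power_on, is_ready, timeout=600, poll=5):
    # buckets: ordered list of lists of VM names (Priority 1 first)
    for bucket in buckets:
        for vm in bucket:
            power_on(vm)
        # only move to the next bucket once every VM in this one reports "ready"
        pending = set(bucket)
        deadline = time.time() + timeout
        while pending and time.time() < deadline:
            pending = {vm for vm in pending if not is_ready(vm)}
            if pending:
                time.sleep(poll)
        # open question from this post: block here, or continue after the timeout?

buckets = [["db-01"], ["app-01", "app-02"], ["web-01", "web-02"]]
restart_by_bucket(buckets,
                  power_on=lambda vm: print(f"powering on {vm}"),
                  is_ready=lambda vm: True)  # stand-in for a VMware Tools heartbeat check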

I guess a couple of questions we have:

  1. How many levels would you like to see?
  2. Which of the wait conditions (e.g., wait on VMtools) are most useful for you?
  3. Suppose HA could not power on a “Priority 1” VM. Do you want HA to stop powering on the “Priority 2” etc VMs until it can, move to the “Priority 2” group after a timeout, or something else?

The second option is VM to VM dependency Chains. These can be seen as an explicit restart order for a specific group of VMs which typically would form a service. Not unlike the vApp construct today, I guess, but without all the caveats and restrictions around it. (vApps are essentially resource pools, and we don’t want resource management in this case… just restart ordering.) In its simplest form, you could imagine specifying ordered lists of VMs, each list specifying the restart order for that set; the VMs in a list would be powered on sequentially. For example, something like the following:

Database VM –> Application Server –> Web Server
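
As a rough sketch (again, just to illustrate the idea, not an actual implementation), a dependency chain like the one above could be restarted sequentially like this:

def restart_chain(chain, power_on, wait_until_ready):
    # power on each VM only after its predecessor in the chain is ready
    for vm in chain:
        power_on(vm)
        wait_until_ready(vm)  # e.g. VMware Tools or application heartbeat

restart_chain(
    ["Database VM", "Application Server", "Web Server"],
    power_on=lambda vm: print(f"powering on {vm}"),
    wait_until_ready=lambda vm: print(f"{vm} is ready"),
)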

As you can see, that would offer a significant amount of granularity, but also potentially a lot of operational complexity. How far you would like to go is the question, I guess. Questions we have for you:

  1. Is an ordered list sufficient to express dependencies in a chain of VMs or do you need more sophistication?
  2. If a VM that other VMs depend on fails, do you expect HA to restart the dependent (child) VMs even though the VM they depend on has failed?
  3. What if HA is not able to restart a VM with dependents: should HA restart these dependent VMs after a delay, or only once the first VM has been restarted?

A final question: we think bucketing will be easier to manage operationally, but it introduces artificial dependencies between VMs and will make it take much longer to restart all VMs after a failure. How significant are these limitations?

That is it for now… Please chime in, as your response will help us define the future of vSphere HA.

vSphere 5.5 nuggets: High Availability Enhancement

Duncan Epping · Sep 4, 2013 ·

There aren’t a lot of changes in 5.5 when it comes to vSphere High Availability aka HA, but one is worth noting. As most of you are probably aware, vSphere HA in the past did nothing with VM to VM Affinity or Anti-Affinity rules. Typically for people using “affinity” rules this was not an issue, but those using “anti-affinity” rules did see it as an issue. They created these rules to ensure specific virtual machines would never be running on the same host, but vSphere HA would simply ignore the rules when a failure had occurred and just place the VMs “randomly”. With vSphere 5.5 this has changed! vSphere HA is now “anti-affinity” aware. In order to ensure anti-affinity rules are respected you will need to set an advanced setting:

das.respectVmVmAntiAffinityRules - Values: "false" (default) and "true"

Now note that this also means that when you have configured anti-affinity rules, have this advanced setting set to “true”, and somehow there aren’t sufficient hosts available to respect these rules… the rules will still be respected, which could result in HA not restarting a VM. Make sure to understand this potential impact when configuring this setting and these rules.
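
For those who prefer to set this from a script, here is a rough pyVmomi sketch of configuring the advanced option on a cluster. It assumes pyVmomi is installed; the vCenter address, credentials and the cluster name “Cluster-01” are placeholders, and setting the option through the vSphere Client’s HA advanced options works just as well.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only, skips certificate checks
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "Cluster-01")  # placeholder name

    option = vim.option.OptionValue(key="das.respectVmVmAntiAffinityRules", value="true")
    spec = vim.cluster.ConfigSpecEx(dasConfig=vim.cluster.DasConfigInfo(option=[option]))
    cluster.ReconfigureComputeResource_Task(spec, modify=True)  # merge this HA advanced option
finally:
    Disconnect(si)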

With a single Datastore can I still use HA’s Datastore heartbeating?

Duncan Epping · Aug 20, 2013 ·

I had a question last week around HA’s datastore heartbeating: does datastore heartbeating still work if you only have 1 datastore in your environment? I can understand where the question comes from, as HA throws an error stating that you need to have 2 datastores at a minimum for HA datastore heartbeating to function correctly. I want to point out that even though HA says that 2 datastores is the minimum, when only one datastore is available it will still be used for heartbeat purposes. Yes, the error will be there on your cluster, and yes, you can suppress it using “das.ignoreInsufficientHbDatastore”. I figured others might be hitting the same error and have the same question, so why not document it?!

ESXi “Management traffic” tickbox, what does it do?

Duncan Epping · Aug 14, 2013 ·

I have seen this popping up various times over the last few years. That little tickbox on your VMkernel NIC that says “Management traffic” (aka management network), what is it for? What if I untick it: will SSH to that VMkernel interface still work? Will the HA heartbeat still work? Can I still ping the VMkernel NIC? Those are all questions I have had in the past, and I can understand why… I would say that the term “Management traffic” is really, really poorly chosen. But why?

The feature described as “Management traffic” does nothing more than enable that VMkernel NIC for HA heartbeat traffic. Yes, that is it. Even if you disable this “management traffic” feature, you can still use the VMkernel’s associated IP address for adding the host to vCenter Server. You can still SSH to that VMkernel’s associated IP address if you have SSH enabled. So keep that in mind.

Yes, I fully agree, it is very confusing, but there you have it: the “Management traffic” tickbox enables the HA heartbeat network, nothing more and nothing less.

