I received a question today from someone who wanted to know the difference in isolation detection between vSphere 5.0 and 5.1. I described this in our book, but I figured I would share it here as well. Note that this is an excerpt from the book.
The isolation detection mechanism has changed substantially compared to previous versions of vSphere. The main difference is that HA now triggers a master election process before a host declares itself isolated. In the timelines below, “s” refers to seconds. This is the timeline for a vSphere 5.0 host:
- T0 – Isolation of the host (slave)
- T10s – Slave enters “election state”
- T25s – Slave elects itself as master
- T25s – Slave pings “isolation addresses”
- T30s – Slave declares itself isolated and “triggers” isolation response
For a vSphere 5.1 host this timeline differs slightly due to the insertion of a minimum 30s delay after the host declares itself isolated and before it applies the configured isolation response. This delay can be increased using the advanced option das.config.fdm.isolationPolicyDelaySec; see the example below the timeline.
- T0 – Isolation of the host (slave)
- T10s – Slave enters “election state”
- T25s – Slave elects itself as master
- T25s – Slave pings “isolation addresses”
- T30s – Slave declares itself isolated
- T60s – Slave “triggers” isolation response
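To illustrate: if you wanted to extend the delay from the default 30 seconds to, say, 60 seconds, you would add the following advanced option to the cluster’s vSphere HA settings (the value is in seconds, and 60 is just an example; since 30s is the minimum, lower values will not shorten the delay):

das.config.fdm.isolationPolicyDelaySec = 60

With this value the isolation response in the timeline above would be triggered at T90s instead of T60s.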
When the isolation response is triggered, in both 5.0 and 5.1, HA creates a “power-off” file for any virtual machine it powers off whose home datastore is accessible. Next it powers off (or shuts down) the virtual machine and updates the host’s poweron file. The power-off file records that HA powered off the virtual machine and that HA should therefore restart it. These power-off files are deleted when a virtual machine is powered back on or when HA is disabled.
After this sequence completes, the master learns that the slave was isolated through the poweron file, as mentioned earlier, and restarts the virtual machines based on the information the slave provided.
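If you want to take a look at these files yourself, they are stored in the hidden .vSphere-HA directory at the root of the datastore. The paths below are illustrative only; the datastore name, the fault domain ID and the host number will differ per environment:

# ls /vmfs/volumes/datastore1/.vSphere-HA/FDM-&lt;fault domain ID&gt;/
# cat /vmfs/volumes/datastore1/.vSphere-HA/FDM-&lt;fault domain ID&gt;/host-&lt;number&gt;-poweron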
** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **
Jackie says
Hi Duncan, what are the benefits of this change?
Duncan Epping says
The change was introduced to counter some corner-case scenarios that could lead to the isolation response being triggered unnecessarily. Also, some customers asked for the ability to define the detection time again.
Marko says
Hi Duncan,
could you describe these scenarios a little bit more? Is it possible to avoid them and to decrease the time until the isolation response is triggered? The new behaviour increases the time a VM is unavailable.
Duncan Epping says
No, it is not possible to decrease it below the values specified above.
Fred Peterson says
I assume that once isolation is declared, HA doesn’t completely give up, and the host could be brought back as non-isolated in the event the isolation was triggered by a network issue unrelated to the virtual infrastructure?
Duncan Epping says
Yes, it doesn’t give up at all. If the host returns for duty, it is back in the cluster and resumes normal operations.
Harry says
Hi Duncan
I need one clarification on virtual MAC pinning.
If virtual MAC pinning is a requirement, then I can only use the source MAC hash load balancing policy, which means I cannot use virtual port ID or Load Based Teaming.
Duncan Epping says
Even if you do load balancing based on MAC, it doesn’t guarantee a VM is always on a specific NIC, right? When a failure occurs it will still switch.
Also, why is this a requirement? Knowing the use case will help me understand the problem better.
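For reference, on a standard vSwitch the load balancing policy can be changed from the ESXi shell along these lines (vSwitch0 is just an example name; “mac” selects route based on source MAC hash, “portid” route based on the originating virtual port ID):

# esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --load-balancing=mac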
Fredi Yao says
Hi Duncan,
why does an election occur before isolation in both 5.0 and 5.1? As far as I know, an election only occurs when the master host fails.
Thanks!
Dov Brajtman says
Hi Duncan,
An excellent post. It literally answered all my questions. I am in the process of setting up a new 5.1 datacenter (coming directly from 4.1) and was wondering about the differences in HA. I did 5.0, but this is my first production experience with 5.1.
Thanks for this excellent info. I will be following your posts much more closely.
Dov
Sandro Tavella says
Hi Duncan,
I need one clarification. In an HA cluster (say, two hosts) I have these poweron files:
---- First HOST ----
# cat host-11-poweron
6283821198792
0
1
73 /vmfs/volumes/51220639-4269f9df-dfc9-e41f13cc1688/libreplan/libreplan.vmx
---- Second HOST ----
# cat host-87-poweron
6304590196472
0
0
If I run a vMotion from one host to the other, I find the VM registered in both poweron files:
---- First HOST ----
# cat host-11-poweron
6283821198792
0
1
73 /vmfs/volumes/51220639-4269f9df-dfc9-e41f13cc1688/libreplan/libreplan.vmx
---- Second HOST ----
# cat host-87-poweron
6304590196472
0
1
73 /vmfs/volumes/51220639-4269f9df-dfc9-e41f13cc1688/libreplan/libreplan.vmx
I would expect to find the VM registered in only one poweron file (even after 30 minutes).
Where am I wrong?
Thanks!
Sandro
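As a closing note for readers trying to decode these dumps, here is a rough reading of the poweron file layout based purely on the output above; the field meanings are an interpretation, not official documentation:
- Line 1 (6283821198792): an internal identifier or version number; its meaning is not clear from the dump alone
- Line 2 (0): presumably the isolation flag the master reads, as described in the post above (1 would mean the host has declared itself isolated)
- Line 3 (1): the number of powered-on virtual machines that follow
- Remaining lines: one entry per powered-on VM, the leading number (73) matching the character length of the .vmx path that follows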