Yellow Bricks

by Duncan Epping


high availability

Replaced certificates and get vSphere HA Agent unreachable?

Duncan Epping · May 24, 2013 ·

Replaced certificates and get vSphere HA Agent unreachable? I have heard this multiple times in the last couple of weeks. I started looking into it, and it seems that in many of these scenarios the common issue was the thumbprints. The log files typically give a lot of hints that look like this:

[29904B90 verbose 'Cluster' opID=SWI-d0de06e1] [ClusterManagerImpl::IsBadIP] <ip of the ha master> is bad ip

Also note that the UI will state “vSphere HA agent unreachable” in many of these cases. Yes, I know, these error messages can be improved for sure.

You can simply solve this by disconnecting and reconnecting the hosts. Yes, it really is as simple as that, and you can do this without any downtime. There is no need to even move the VMs off; just right-click the host and disconnect it. Then, when the disconnect task has finished, reconnect it.
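
If you prefer to script the disconnect/reconnect cycle, it can also be driven through the vSphere API. Below is a minimal sketch using pyVmomi; the vCenter address, credentials, and host name are placeholders, and outside of a lab you would of course want proper certificate verification instead of the unverified SSL context.

# Sketch: disconnect and reconnect a host through pyVmomi so the HA agent
# picks up the new thumbprint. Names and credentials below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only, skips certificate verification
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
content = si.RetrieveContent()

# Find the host that reports "vSphere HA agent unreachable"
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
host = next(h for h in view.view if h.name == "esx02.lab.local")
view.DestroyView()

# Disconnect, wait for the task to complete, then reconnect
WaitForTask(host.DisconnectHost_Task())
WaitForTask(host.ReconnectHost_Task())

Disconnect(si)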

 

** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **

Death to false myths: Admission Control lowers consolidation ratio

Duncan Epping · Dec 11, 2012 ·

“Death to false myths” probably sounds a bit, euuhm, well, Dutch, or “direct” as others would label it. Lately I have seen some statements floating around which are either false or misused. One of them is about Admission Control and how it impacts the consolidation ratio, even if you are not using reservations. I have had multiple questions about this in the last couple of weeks and noticed this thread on VMTN.

The thread referred to is all about which Admission Control policy to use, as the selected policy potentially impacts the number of virtual machines you can run on a cluster. Now let's take a look at the example in this VMTN thread; I have rounded up some of the numbers to simplify things:

  • 7 host cluster
  • 512 GB of memory
  • 132 GHz of CPU resources
  • 217 MB of Memory Overhead (no reservations used)

So let's do the quick math. According to Admission Control (the “host failures” policy in this example) you can power on roughly ~2500 virtual machines. That is without taking N-1 resiliency into account. When I take out the largest host, we are still talking about ~1800 virtual machines that can be powered on. Yes, that is roughly 700 slots/virtual machines fewer due to the N-1: admission control needs to be able to guarantee that even if the largest host fails, all virtual machines can be restarted.

Considering we have 512GB in total, that means that if those 1800 virtual machines on average actively use more than ~280MB (512GB / 1800 VMs) we will see TPS / swapping / ballooning / compression. Clearly you want to avoid most of these, swapping / ballooning / compression that is, especially considering most VMs are typically provisioned with 2GB of memory or more.
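
To make the math explicit, here is a quick back-of-the-envelope sketch of the numbers above. Only the cluster totals, the 217 MB overhead slot size, and the ~1800 powered-on VMs from the example are used; the post rounds the results slightly differently.

# Quick math for the example above (values taken from the post)
GB = 1024  # MB per GB

cluster_memory_mb = 512 * GB   # 512 GB of memory in the 7-host cluster
slot_size_mb = 217             # memory overhead per VM, no reservations set

# Slot-based admission control: how many VMs could be powered on in total
total_slots = cluster_memory_mb // slot_size_mb
print(f"Slots without N-1 taken into account: {total_slots}")   # ~2400, rounded up to ~2500 in the post

# With ~1800 VMs powered on (the N-1 number), this is the average amount of
# physically backed memory per VM before reclamation techniques kick in
vms_powered_on = 1800
print(f"Average memory per VM: {cluster_memory_mb / vms_powered_on:.0f} MB")  # ~290 MB, ~280 MB in the post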

So what does that mean, or what did we learn? Two things:

  • Admission Control is about guaranteeing virtual machine restarts
  • If you set no reservation you can power on an insane number of virtual machines

Let me re-emphasize the last bullet: you can power on an INSANE number of virtual machines on just a couple of hosts when no reservations are used. In this case HA would allow 1800 virtual machines to be powered on before it starts screaming that it is out of resources. Is that going to work in real life? Would your virtual machines be happy with the amount of resources they are getting? I don’t think so… I don’t believe that 280MB of physically backed memory is sufficient for most workloads. Yes, maybe TPS can help a bit, but the chances of hitting the swap file are substantial.

Let it be clear: admission control is not a resource management solution. It only guarantees that virtual machines can be restarted, and if you have no reservations set, then the numbers you will see are probably not realistic, at least not from a user experience perspective. I bet your users / customers would like to have a bit more resources available than just the bare minimum required to power on a virtual machine! So don’t let these numbers fool you.

vSphere HA fail-over in action – aka reading the log files

Duncan Epping · Oct 17, 2012 ·

I had a discussion with Benjamin Ulsamer at VMworld and he had a question about the state of a host when both the management network and the storage network were isolated. My answer was that in that case the host will be reported as “dead”, as there is no “network heartbeat” and no “datastore heartbeat”. (More info about heartbeating here.) The funny thing is that when you look at the log files you do see “isolated” instead of “dead”. Why is that? Before we answer that, let's go through the log files and paint the picture:

Two hosts (esx01 and esx02) with a management network and an iSCSI storage network. vSphere 5.0 is used and Datastore Heartbeating is configured. For whatever reason the network of esx02 is isolated (both storage and management, as it is a converged environment). So what can you see in the log files?

Let's look at “esx02” first:

  • 16:08:07.478Z [36C19B90 info ‘Election’ opID=SWI-6aace9e6] [ClusterElection::ChangeState] Slave => Startup : Lost master
    • At 16:08:07 the network is isolated
  • 16:08:07.479Z [FFFE0B90 verbose ‘Cluster’ opID=SWI-5185dec9] [ClusterManagerImpl::CheckElectionState] Transitioned from Slave to Startup
    • The host recognizes it is isolated and drops from Slave to “Startup” so that it can elect itself as master to take action
  • 16:08:22.480Z [36C19B90 info ‘Election’ opID=SWI-6aace9e6] [ClusterElection::ChangeState] Candidate => Master : Master selected
    • The host has elected itself as master
  • 16:08:22.485Z [FFFE0B90 verbose ‘Cluster’ opID=SWI-5185dec9] [ClusterManagerImpl::CheckHostNetworkIsolation] Waited 5 seconds for isolation icmp ping reply. Isolated
    • Can I ping the isolation address?
  • 16:08:22.488Z [FFFE0B90 info ‘Policy’ opID=SWI-5185dec9] [LocalIsolationPolicy::Handle(IsolationNotification)] host isolated is true
    • No I cannot, and as such I am isolated!
  • 16:08:22.488Z [FFFE0B90 info ‘Policy’ opID=SWI-5185dec9] [LocalIsolationPolicy::Handle(IsolationNotification)] Disabling execution of isolation policy by 30 seconds.
    • Hold off for 30 seconds as “das.config.fdm.isolationPolicyDelaySec” was configured
  • 16:08:52.489Z [36B15B90 verbose ‘Policy’] [LocalIsolationPolicy::GetIsolationResponseInfo] Isolation response for VM /vmfs/volumes/a67cdaa8-9a2fcd02/VMWareDataRecovery/VMWareDataRecovery.vmx is powerOff
    • There is a VM with an Isolation Response configured to “power off”
  • 16:10:17.507Z [36B15B90 verbose ‘Policy’] [LocalIsolationPolicy::DoVmTerminate] Terminating /vmfs/volumes/a67cdaa8-9a2fcd02/VMWareDataRecovery/VMWareDataRecovery.vmx
    • Let's kill that VM!
  • 16:10:17.508Z [36B15B90 info ‘Policy’] [LocalIsolationPolicy::HandleNetworkIsolation] Done with isolation handling
    • And it is gone, done with handling the isolation

Let's take a closer look at “esx01”. What does this host see with regard to the management network and storage network isolation of “esx02”?

  • 16:08:05.018Z [FFFA4B90 error ‘Cluster’ opID=SWI-e4e80530] [ClusterSlave::LiveCheck] Timeout for slave @ host-34
    • The host is not reporting itself any longer, the heartbeats are gone…
  • 16:08:05.018Z [FFFA4B90 verbose ‘Cluster’ opID=SWI-e4e80530] [ClusterSlave::UnreachableCheck] Beginning ICMP pings every 1000000 microseconds to host-34
    • Let's ping the host itself; it could be that just the FDM agent is dead.
  • 16:08:05.019Z [FFFA4B90 verbose ‘Cluster’ opID=SWI-e4e80530] Reporting Slave host-34 as FDMUnreachable
  • 16:08:05.019Z [FFD5BB90 verbose ‘Cluster’] ICMP reply for non-existent pinger 3 (id=isolationAddress)
    • As it is just a 2-node cluster, let's make sure I am not isolated myself. I got a reply, so I am not isolated!
  • 16:08:10.028Z [FFFA4B90 verbose ‘Cluster’ opID=SWI-e4e80530] [ClusterSlave::UnreachableCheck] Waited 5 seconds for icmp ping reply for host host-34
  • 16:08:14.035Z [FFFA4B90 verbose ‘Cluster’ opID=SWI-e4e80530] [ClusterSlave::PartitionCheck] Waited 15 seconds for disk heartbeat for host host-34 – declaring dead
    • There is also no datastore heartbeat so the host must be dead. (Note that it cannot see the difference between a fully isolated host and a dead host when using IP based storage on the same network.)
  • 16:08:14.035Z [FFFA4B90 verbose ‘Cluster’ opID=SWI-e4e80530] Reporting Slave host-34 as Dead
    • It is officially dead!
  • 16:08:14.036Z [FFE5FB90 verbose ‘Invt’ opID=SWI-42ca799] [InventoryManagerImpl::RemoveVmLocked] marking protected vm /vmfs/volumes/a67cdaa8-9a2fcd02/VMWareDataRecovery/VMWareDataRecovery.vmx as in unknown power state
    • We don’t know what is up with this VM, power state unknown…
  • 16:08:14.037Z [FFE5FB90 info ‘Policy’ opID=SWI-27099141] [VmOperationsManager::PerformPlacements] Sending a list of 1 VMs to the placement manager for placement.
    • We will need to restart one VM, so let's provide its details to the Placement Manager
  • 16:08:14.037Z [FFE5FB90 verbose ‘Placement’ opID=SWI-27099141] [PlacementManagerImpl::IssuePlacementStartCompleteEventLocked] Issue failover start event
    • Issue a failover event to the placement manager.
  • 16:08:14.042Z [FFE5FB90 verbose ‘Placement’ opID=SWI-e430b59a] [DrmPE::GenerateFailoverRecommendation] 1 Vms are to be powered on
    • Let's generate a recommendation on where to place the VM
  • 16:08:14.044Z [FFE5FB90 verbose ‘Execution’ opID=SWI-898d80c3] [ExecutionManagerImpl::ConstructAndDispatchCommands] Place /vmfs/volumes/a67cdaa8-9a2fcd02/VMWareDataRecovery/VMWareDataRecovery.vmx on __localhost__ (cmd ID host-28:0)
    • We know where to place it!
  • 16:08:14.687Z [FFFE5B90 verbose ‘Invt’] [HalVmMonitor::Notify] Adding new vm: vmPath=/vmfs/volumes/a67cdaa8-9a2fcd02/VMWareDataRecovery/VMWareDataRecovery.vmx, moId=12
    • Let's register the VM so we can power it on
  • 16:08:14.714Z [FFDDDB90 verbose ‘Execution’ opID=host-28:0-0] [FailoverAction::ReconfigureCompletionCallback] Powering on vm
    • Power on the impacted VM

That is it. Nice, right? And this is just a short version of what is actually in the log files; they contain a massive amount of detail! Anyway, back to the question, if not already answered: the remaining host in the cluster sees the isolated host as dead as there is no:

  • network heartbeat
  • response to a ping to the host
  • datastore heartbeat

The only thing the master can do at that point is to assume the “isolated” host is dead.
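
If you want to pull these state transitions out of the log files yourself, a simple filter over fdm.log does the trick. Below is a rough sketch; the patterns are the ones shown above, and the path is the usual fdm.log location on an ESXi 5.x host, so adjust it to wherever you copied the log.

# Sketch: filter an fdm.log for the election, isolation, and placement events
# discussed above. Adjust the path to your own copy of the log file.
import re

PATTERNS = [
    r"ClusterElection::ChangeState",       # Slave => Startup, Candidate => Master, ...
    r"CheckHostNetworkIsolation",          # isolation address pings
    r"LocalIsolationPolicy",               # isolation response handling
    r"ClusterSlave::(LiveCheck|UnreachableCheck|PartitionCheck)",  # the master's view of a slave
    r"Reporting Slave .* as",              # FDMUnreachable / Dead
    r"GenerateFailoverRecommendation|FailoverAction",              # restart placement and power-on
]
LINE_RE = re.compile("|".join(PATTERNS))

with open("/var/log/fdm.log") as fdm_log:
    for line in fdm_log:
        if LINE_RE.search(line):
            print(line.rstrip())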

 

** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **

What’s new in vSphere 5.1 for High Availability

Duncan Epping · Sep 12, 2012 ·

As vSphere High Availability was completely revamped in 5.0, not a lot of changes have been introduced in 5.1. There are some noteworthy changes, though, that I figured I would share with you. So what’s cool?

  • Ability to set slot size for “Host failures tolerated” through the vSphere Web Client
  • Ability to retrieve a list of the virtual machines that span multiple slots
  • Support for Guest OS Sleep mode
  • Including the Application Monitoring SDK in the Guest SDK (VMware Tools SDK)
  • vSphere HA (FDM) VIB is automatically added to Auto-Deploy image profile
  • Ability to delay the isolation response through the use of “das.config.fdm.isolationPolicyDelaySec”

Although many of these speak for themselves, I will elaborate on why these enhancements are useful and when to use them.

The ability to set the slot size for “Host failures tolerated” allows you to manually dictate how many virtual machines you can power on in your cluster. Many have used advanced settings to achieve more or less the same, but through the UI things are a lot easier, I guess.

Now if you do this, it could happen that a virtual machine needs multiple slots in order to successfully power on. That is where the second bullet point comes into play. In the vSphere Web Client you can now see a list of all the virtual machines that currently span multiple slots.

Support for Guest OS “Sleep Mode” in environments where VM Monitoring is used was added. This was reported by Sudharsan a while back, and I addressed it with the HA engineering team. As a result, they added logic that recognizes the “state” of the virtual machine to avoid unneeded restarts. Thanks Sudharsan for reporting! (I can’t find this in the release notes, however.)

With 5.0 the Application Monitoring SDK was opened up to the broader audience. It was still a separate installer though. As of vSphere 5.1 the App Monitoring SDK is part of the VMware Tools SDK. This will make your life easier when you use Application Monitoring.

Those running stateless will be happy about the fact that the FDM VIB is now part of the Auto-Deploy image profile. This will avoid the need to manually add it every time you create a new image.

Last but not least, in 5.1 we re-introduce “das.failuredetectiontime”… well, not exactly, but a similar concept with a different name. This new advanced setting, named “das.config.fdm.isolationPolicyDelaySec”, allows you to extend the time it takes before the isolation response is triggered. By default the isolation response is triggered after ~30 seconds with vSphere 5.x. If you have a requirement to increase this, then this new advanced setting can be used.
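
For those who would rather script this advanced setting than set it through the UI, something along the following lines should work with pyVmomi. Consider it a sketch: the vCenter address, credentials, cluster name, and the 60-second value are placeholders for your own environment.

# Sketch: set das.config.fdm.isolationPolicyDelaySec as an HA advanced option
# on a cluster via pyVmomi. Names, credentials, and the value are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Cluster-01")
view.DestroyView()

# Delay the isolation response to 60 seconds (the default is ~30 seconds in 5.x)
spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(
        option=[vim.option.OptionValue(key="das.config.fdm.isolationPolicyDelaySec", value="60")]))
WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))

Disconnect(si)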

Answering some admission control questions

Duncan Epping · Jul 3, 2012 ·

I received a bunch of questions on HA admission control in response to this blog post, and I figured I would answer them in a blog post so that everyone would be able to find and read them. This was the original set of questions:

There are 4 ESXi hosts in the network and 4 VMs (same CPU and RAM reservation for all VMs) on each host. The Admission Control policy is set to ‘Host failures cluster tolerates’ = 1. All the available 12 slots have been used by the powered-on VMs, except the 4 slots reserved for failover.
1) What happens if 2 ESXi hosts fail now? (2 * 4 VMs need to fail over.) Will HA restart only 4 VMs as it has only 4 slots available? And will the restart of the remaining 4 VMs fail?
Same scenario, but the policy is set to ‘% of cluster resources reserved’ = 25%. All the available 75% of resources have been utilized by the 16 VMs, except the 25% reserved for failover.
2) What happens if 2 ESXi hosts fail now? (2 * 4 VMs need to fail over.) Will HA restart only 4 VMs as that consumes 25% of the resources? And will the restart of the other 4 VMs fail?
3) Does HA check the VM reservation (or any other factor) at the time of restart?
4) Does HA only restart a VM if the host can guarantee the reserved resources, or does the restart fail?
5) What if no reservations are set at the VM level?
6) What does HA take into consideration when it has to restart VMs which have no reservation?
7) Will it guarantee the configured resources for each VM?
8) If not, how can HA restart 8 VMs (as per our example) when it only has reserved resources configured for just 4 VMs?
9) Will it share the reserved resources across the 8 VMs and not care about the resource crunch, or is it first come, first served?
10) Admission Control doesn’t have any role at all in the event of an HA failover?

Let me tackle these questions one by one:

  1. In this scenario 4 VMs will be restarted and 4 VMs might be restarted! Note that the “slot size” policy is used and that this is based on the worst-case scenario. So if your slot is 1GB and 2GHz but your VMs require far less than that to power on, it could be that all VMs are restarted. However, HA guarantees the restart of 4 VMs. Keep in mind that this scenario doesn’t happen too often, as you would be overcommitting to the extreme here. As said, HA will restart all the VMs it can; it just needs to be able to satisfy the resource reservations on memory and CPU! (The sketch after this list walks through the numbers.)
  2. Again, also in this scenario HA will do its best to restart VMs. It can restart VMs until all “unreserved capacity” is used. As HA only needs to guarantee reserved resources, the chances of hitting this limit are very slim; since most people don’t use reservations at a VM level, it would mean you are overcommitting extremely.
  3. Yes, it will validate whether there is a host which can back the resource reservations before it tries the restart.
  4. Yes, it will only restart the VM when this can be guaranteed. If it cannot be, then HA can call DRS to defragment resources for this VM.
  5. If there are no reservations, then HA will only look at the “memory overhead” in order to place this VM.
  6. HA ensures the portgroup and datastore are available on the host.
  7. It will not guarantee configured resources; HA is about restarting virtual machines, not about resource management. DRS is about resource management and guaranteeing access to resources.
  8. HA will only be able to restart the VM if there are unreserved resources available to satisfy the VM’s request.
  9. All resources required for a virtual machine need to be available on a single host! Yes, resources will be shared on a single host, as long as no reservations are defined.
  10. No, Admission Control doesn’t have any role in an HA failover. Admission Control happens at the vCenter level; HA failovers happen at the ESX(i) level.
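
To make the slot arithmetic behind questions 1 and 2 a bit more tangible, here is a small sketch of the scenario: 4 identical hosts, 4 slots per host, and “Host failures cluster tolerates” set to 1. The per-host slot count is taken from the questions; everything else is simple arithmetic.

# Slot math for the scenario in the questions above
hosts = 4
slots_per_host = 4
host_failures_tolerated = 1

total_slots = hosts * slots_per_host
reserved_slots = host_failures_tolerated * slots_per_host   # worst case: the largest host
usable_slots = total_slots - reserved_slots
print(f"Slots usable with admission control enabled: {usable_slots}")   # 12

# If 2 hosts fail, 8 VMs need a restart but only 2 hosts (8 slots worth of
# resources) remain. Admission control plays no part at this point: HA only
# checks whether a remaining host can back each VM's actual CPU/memory
# reservation, so with no (or small) reservations more VMs can be restarted
# than the slot count suggests.
failed_hosts = 2
remaining_slots = (hosts - failed_hosts) * slots_per_host
print(f"Slots remaining after {failed_hosts} host failures: {remaining_slots}")  # 8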
