5.0

vSphere HA compatibility list, how do I check it?

Duncan Epping · Nov 8, 2012 ·

Someone reported that VMs in their environment could not be restarted as there were no compatible hosts available. The relevant part of the error message was:

N3Vim5Fault16NoCompatibleHostE

I don’t know why it happened in this case, as the log files unfortunately don’t provide these details. This person had manually restarted all of his VMs and that actually worked okay. This could mean that somehow the “compatibility list” that vSphere HA maintains was incomplete or incorrect. So the question is: how do you validate that if you ever end up in a scenario like this?

First of all, before I forget, create a support dump. That way VMware Global Support Services can help pinpoint your problems and provide tips on how to prevent these from occurring.

On a host, and you will have to SSH in to one, you can actually run a script that provides you with some nice details around this. Let’s go through the options of the script and explain what you can get out of it. The script is called “prettyPrint.sh” and can be found in “/opt/vmware/fdm/fdm/”.

./prettyPrint.sh hostlist

The hostlist option provides all relevant details about the hosts which are part of this cluster including “hostId”, host name, ip address etc.

./prettyPrint.sh clusterconfig

The clusterconfig option provides all configuration info of your cluster like admission control and isolation response.

./prettyPrint.sh compatlist

The compatlist option provides the list of VMs and the hosts they are compatible with, only for vSphere 5.0.

./prettyPrint.sh vmmetadata

The vmmetadata option provides the list of VMs and the hosts they are compatible with, only for vSphere 5.1.

So in this case “vmmetadata” was important as it lists which VMs are compatible with which hosts. In this case “<index>0</index>” refers to a VM and “<compatMask>0,1,2,3</compatMask>” refers to the hosts it is compatible with. Nice right?!

   <compatMatrix>
      <restartCompat>
         <index>0</index>
         <compatMask>0,1,2,3</compatMask>
      </restartCompat>
      <restartCompat>
         <index>1</index>
         <compatMask>0,1,2,3</compatMask>
      </restartCompat>
      <restartCompat>
         <index>2</index>
         <compatMask>0,1,2,3</compatMask>
      </restartCompat>
   </compatMatrix>

** Update: Added Portgroup Test **

On VMTN someone asked if HA also takes networking into account when restarting VMs. If a given portgroup is not available on specific hosts, will HA smartly place VMs? In my test I removed the “VM Network” portgroup from one of my hosts (host with ID 2). When listing the compatibility list the following shows up:

<restartCompat>
       <index>0</index>
       <compatMask>0,1,3</compatMask>
</restartCompat>

As you can see host with ID 2 is missing.
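Eyeballing the matrix works for a handful of VMs, but on a larger cluster it may help to script the check. The following is a minimal Python sketch, not part of prettyPrint.sh. It assumes you have saved the <compatMatrix> fragment of the output as well-formed XML; the function names `parse_compat` and `missing_hosts` are made up for illustration, and the sample data is modeled on the fragments above.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the fragments shown in the post (an assumption about
# the exact shape of a real dump, which may contain many more elements).
SAMPLE = """
<compatMatrix>
   <restartCompat>
      <index>0</index>
      <compatMask>0,1,3</compatMask>
   </restartCompat>
   <restartCompat>
      <index>1</index>
      <compatMask>0,1,2,3</compatMask>
   </restartCompat>
</compatMatrix>
"""


def parse_compat(xml_text):
    """Return {vm_index: set of compatible host IDs}."""
    root = ET.fromstring(xml_text)
    compat = {}
    # iter() also finds <restartCompat> entries nested deeper in the tree.
    for entry in root.iter("restartCompat"):
        vm = int(entry.findtext("index"))
        mask = entry.findtext("compatMask") or ""
        compat[vm] = {int(h) for h in mask.split(",") if h.strip()}
    return compat


def missing_hosts(compat, all_hosts):
    """Return {vm_index: sorted host IDs absent from that VM's list}.

    VMs that are compatible with every host are left out of the result.
    """
    result = {}
    for vm, hosts in compat.items():
        absent = sorted(set(all_hosts) - hosts)
        if absent:
            result[vm] = absent
    return result


if __name__ == "__main__":
    compat = parse_compat(SAMPLE)
    for vm, absent in missing_hosts(compat, {0, 1, 2, 3}).items():
        print(f"VM index {vm} cannot be restarted on host IDs {absent}")
```

The host IDs to compare against would come from the “hostlist” output. A VM whose set of compatible hosts is empty would be a candidate for the “no compatible host” error described at the top of this post.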

How do I configure an HA vpxd.das advanced setting?

Duncan Epping · Nov 7, 2012 ·

On the community forums someone asked a question around how to set “config.vpxd.das.electionWaitTimeSec”. I was looking at the documentation and it is indeed not really clear on what / where / how to set an HA vpxd.das advanced setting. This KB article kind of explains it, but let me summarize and simplify it.

There are various sorts of advanced settings, but for HA three in particular:

das.* –> Cluster level advanced setting.
fdm.* –> FDM host level advanced setting (FDM = Fault Domain Manager = vSphere HA)
vpxd.* –> vCenter level advanced setting.

How do you configure these?

Cluster Level
- In the vSphere Client: Right click your cluster object, click “edit settings”, click “vSphere HA” and hit the “Advanced Options” button.
- In the Web Client: Click “Hosts and Clusters”, click your cluster object, click the “Manage” tab, click “Settings” and “vSphere HA”, hit the “Edit” button
FDM Host Level
- Open up an SSH session to your host and edit “/etc/opt/vmware/fdm/fdm.cfg”
vCenter Level
- In the vSphere Client: Click “Administration” and “vCenter Server Settings”, click “Advanced Settings”
- In the Web Client: Click “vCenter”, click “vCenter Servers”, select the appropriate vCenter Server and click the “Manage” tab, click “Settings” and “Advanced Settings”

By the way, this KB also lists all HA advanced settings that are relevant… might be worth reading as well. Hope this helps configuring your HA vpxd.das advanced setting.
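To make the prefix rule concrete, here is a small Python sketch that maps a setting name to the place where it is configured. It is only an illustration of the three categories listed above. The function name `setting_level` is hypothetical, and treating a leading “config.” as a namespace prefix to be skipped is my assumption, based on the “config.vpxd.das.electionWaitTimeSec” example.

```python
# Where each family of HA advanced settings is configured, per the list above.
LEVELS = {
    "das": "cluster level (vSphere HA Advanced Options on the cluster)",
    "fdm": "FDM host level (/etc/opt/vmware/fdm/fdm.cfg on the host)",
    "vpxd": "vCenter level (vCenter Server Advanced Settings)",
}


def setting_level(name):
    """Classify an advanced setting by its first meaningful name component."""
    parts = name.split(".")
    # Assumption: a leading "config." is only a namespace prefix.
    if parts[0] == "config" and len(parts) > 1:
        parts = parts[1:]
    return LEVELS.get(parts[0], "unknown")


if __name__ == "__main__":
    for name in (
        "config.vpxd.das.electionWaitTimeSec",
        "das.config.fdm.isolationPolicyDelaySec",
    ):
        print(f"{name}: {setting_level(name)}")
```

Note that only the first component decides the level: “das.config.fdm.isolationPolicyDelaySec” contains “fdm” but starts with “das.”, so it is a cluster level setting.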

Working with CA signed certificates in your vSphere environment?

Duncan Epping · Oct 30, 2012 ·

Are you working with CA signed certificates in your vSphere environment? You might want to check out these recently published KB articles. They will definitely help you understand the whole process around installing and configuring them. (Thanks Simon for pointing these out!)

Configuring CA signed certificates for VMware vCenter Server 5.0.x
http://kb.vmware.com/kb/2015421
Configuring certificates signed by a Certificate Authority (CA) for vCenter Server Appliance 5.1
http://kb.vmware.com/kb/2036744
Configuring CA signed SSL certificates for vSphere Update Manager in vCenter 5.1
http://kb.vmware.com/kb/2037581
Creating certificate requests and certificates for the vCenter 5.1 components
http://kb.vmware.com/kb/2037432
Configuring CA signed SSL certificates for vCenter SSO in vCenter 5.1
http://kb.vmware.com/kb/2035011
Configuring CA signed SSL certificates for the Web Client and Log Browser in vCenter 5.1
http://kb.vmware.com/kb/2035010
Configuring CA signed SSL certificates for the Inventory service in vCenter 5.1
http://kb.vmware.com/kb/2035009
Configuring OpenSSL for installation and configuration of CA signed certificates in the vSphere environment
http://kb.vmware.com/kb/2015387
Configuring CA signed certificates for ESXi 5.x hosts
http://kb.vmware.com/kb/2015499
Configuring CA signed certificates for vCenter 5.1
http://kb.vmware.com/kb/2035005
Implementing CA signed SSL certificates with vSphere 5.0
http://kb.vmware.com/kb/2015383
Implementing CA signed SSL certificates with vSphere 5.1
http://kb.vmware.com/kb/2034833

vSphere HA fail-over in action – aka reading the log files

Duncan Epping · Oct 17, 2012 ·

I had a discussion with Benjamin Ulsamer at VMworld and he had a question about the state of a host when both the management network and storage network were isolated. My answer was that in that case the host will be reported as “dead” as there is no “network heartbeat” and no “datastore heartbeat”. (more info about heartbeating here) The funny thing is that when you look at the log files you do see isolated instead of dead. Why is that? Before we answer it, let’s go through the log files and paint the picture:

Two hosts (esx01 and esx02) with a management network and an iSCSI storage network. vSphere 5.0 is used and Datastore Heartbeating is configured. For whatever reason, the network of esx02 is isolated (both storage and management, as it is a converged environment). So what can you see in the log files?

Let’s look at “esx02” first:

16:08:07.478Z [36C19B90 info ‘Election’ opID=SWI-6aace9e6] [ClusterElection::ChangeState] Slave => Startup : Lost master
- At 16:08:07 the network is isolated
16:08:07.479Z [FFFE0B90 verbose ‘Cluster’ opID=SWI-5185dec9] [ClusterManagerImpl::CheckElectionState] Transitioned from Slave to Startup
- The host recognizes it is isolated and drops from Slave to “Startup” so that it can elect itself as master to take action
16:08:22.480Z [36C19B90 info ‘Election’ opID=SWI-6aace9e6] [ClusterElection::ChangeState] Candidate => Master : Master selected
- The host has elected itself as master
16:08:22.485Z [FFFE0B90 verbose ‘Cluster’ opID=SWI-5185dec9] [ClusterManagerImpl::CheckHostNetworkIsolation] Waited 5 seconds for isolation icmp ping reply. Isolated
- Can I ping the isolation address?
16:08:22.488Z [FFFE0B90 info ‘Policy’ opID=SWI-5185dec9] [LocalIsolationPolicy::Handle(IsolationNotification)] host isolated is true
- No I cannot, and as such I am isolated!
16:08:22.488Z [FFFE0B90 info ‘Policy’ opID=SWI-5185dec9] [LocalIsolationPolicy::Handle(IsolationNotification)] Disabling execution of isolation policy by 30 seconds.
- Hold off for 30 seconds as “das.config.fdm.isolationPolicyDelaySec” was configured
16:08:52.489Z [36B15B90 verbose ‘Policy’] [LocalIsolationPolicy::GetIsolationResponseInfo] Isolation response for VM /vmfs/volumes/a67cdaa8-9a2fcd02/VMWareDataRecovery/VMWareDataRecovery.vmx is powerOff
- There is a VM with an Isolation Response configured to “power off”
16:10:17.507Z [36B15B90 verbose ‘Policy’] [LocalIsolationPolicy::DoVmTerminate] Terminating /vmfs/volumes/a67cdaa8-9a2fcd02/VMWareDataRecovery/VMWareDataRecovery.vmx
- Let’s kill that VM!
16:10:17.508Z [36B15B90 info ‘Policy’] [LocalIsolationPolicy::HandleNetworkIsolation] Done with isolation handling
- And it is gone, done with handling the isolation
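If you want to measure the gaps between these events rather than read them off by eye, the log lines can be parsed with a short script. This is a rough Python sketch, not a VMware tool. The regular expression is derived only from the lines shown above, the names `parse_events` and `seconds_between` are made up for illustration, and the sample uses straight quotes around the component name on the assumption that the curly quotes in this post are a formatting artifact (the pattern accepts both).

```python
import re
from datetime import datetime

# Pattern inferred from the excerpts in this post:
#   HH:MM:SS.mmmZ [thread level 'Component' optional-opID] message
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})Z\s+"
    r"\[(?P<thread>\S+)\s+(?P<level>\w+)\s+"
    r"['\u2018](?P<component>[^'\u2019]+)['\u2019][^\]]*\]\s+"
    r"(?P<message>.*)$"
)

SAMPLE_LINES = [
    "16:08:07.478Z [36C19B90 info 'Election' opID=SWI-6aace9e6] "
    "[ClusterElection::ChangeState] Slave => Startup : Lost master",
    "- At 16:08:07 the network is isolated",  # annotation, skipped by the parser
    "16:08:22.488Z [FFFE0B90 info 'Policy' opID=SWI-5185dec9] "
    "[LocalIsolationPolicy::Handle(IsolationNotification)] host isolated is true",
    "16:10:17.507Z [36B15B90 verbose 'Policy'] "
    "[LocalIsolationPolicy::DoVmTerminate] Terminating "
    "/vmfs/volumes/a67cdaa8-9a2fcd02/VMWareDataRecovery/VMWareDataRecovery.vmx",
]


def parse_events(lines):
    """Return (time, level, component, message) tuples for matching lines."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a log line, for example one of the annotations
        stamp = datetime.strptime(match.group("time"), "%H:%M:%S.%f")
        events.append(
            (stamp, match.group("level"), match.group("component"),
             match.group("message"))
        )
    return events


def seconds_between(events, first, second):
    """Seconds from the first message containing `first` to the first one
    containing `second`. The lines carry no date, so this does not handle
    a sequence that crosses midnight."""
    start = next(t for t, _, _, msg in events if first in msg)
    end = next(t for t, _, _, msg in events if second in msg)
    return (end - start).total_seconds()


if __name__ == "__main__":
    events = parse_events(SAMPLE_LINES)
    print(seconds_between(events, "Lost master", "host isolated is true"))
    print(seconds_between(events, "Lost master", "Terminating"))
```

For the excerpt above this gives roughly 15 seconds between losing the master and declaring isolation, and a little over two minutes until the VM is terminated, which matches the timestamps in the log.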

Let’s take a closer look at “esx01”. What does this host see with regards to the management network and storage network isolation of “esx02”?

16:08:05.018Z [FFFA4B90 error ‘Cluster’ opID=SWI-e4e80530] [ClusterSlave::LiveCheck] Timeout for slave @ host-34
- The host is not reporting itself any longer, the heartbeats are gone…
16:08:05.018Z [FFFA4B90 verbose ‘Cluster’ opID=SWI-e4e80530] [ClusterSlave::UnreachableCheck] Beginning ICMP pings every 1000000 microseconds to host-34
- Let’s ping the host itself, it could be that the FDM agent is dead.
16:08:05.019Z [FFFA4B90 verbose ‘Cluster’ opID=SWI-e4e80530] Reporting Slave host-34 as FDMUnreachable
16:08:05.019Z [FFD5BB90 verbose ‘Cluster’] ICMP reply for non-existent pinger 3 (id=isolationAddress)
- As it is just a 2 node cluster, let’s make sure I am not isolated myself. I got a reply so I am not isolated!
16:08:10.028Z [FFFA4B90 verbose ‘Cluster’ opID=SWI-e4e80530] [ClusterSlave::UnreachableCheck] Waited 5 seconds for icmp ping reply for host host-34
16:08:14.035Z [FFFA4B90 verbose ‘Cluster’ opID=SWI-e4e80530] [ClusterSlave::PartitionCheck] Waited 15 seconds for disk heartbeat for host host-34 – declaring dead
- There is also no datastore heartbeat so the host must be dead. (Note that it cannot see the difference between a fully isolated host and a dead host when using IP based storage on the same network.)
16:08:14.035Z [FFFA4B90 verbose ‘Cluster’ opID=SWI-e4e80530] Reporting Slave host-34 as Dead
- It is officially dead!
16:08:14.036Z [FFE5FB90 verbose ‘Invt’ opID=SWI-42ca799] [InventoryManagerImpl::RemoveVmLocked] marking protected vm /vmfs/volumes/a67cdaa8-9a2fcd02/VMWareDataRecovery/VMWareDataRecovery.vmx as in unknown power state
- We don’t know what is up with this VM, power state unknown…
16:08:14.037Z [FFE5FB90 info ‘Policy’ opID=SWI-27099141] [VmOperationsManager::PerformPlacements] Sending a list of 1 VMs to the placement manager for placement.
- We will need to restart one VM, let’s provide its details to the Placement Manager
16:08:14.037Z [FFE5FB90 verbose ‘Placement’ opID=SWI-27099141] [PlacementManagerImpl::IssuePlacementStartCompleteEventLocked] Issue failover start event
- Issue a failover event to the placement manager.
16:08:14.042Z [FFE5FB90 verbose ‘Placement’ opID=SWI-e430b59a] [DrmPE::GenerateFailoverRecommendation] 1 Vms are to be powered on
- Let’s generate a recommendation on where to place the VM
16:08:14.044Z [FFE5FB90 verbose ‘Execution’ opID=SWI-898d80c3] [ExecutionManagerImpl::ConstructAndDispatchCommands] Place /vmfs/volumes/a67cdaa8-9a2fcd02/VMWareDataRecovery/VMWareDataRecovery.vmx on __localhost__ (cmd ID host-28:0)
- We know where to place it!
16:08:14.687Z [FFFE5B90 verbose ‘Invt’] [HalVmMonitor::Notify] Adding new vm: vmPath=/vmfs/volumes/a67cdaa8-9a2fcd02/VMWareDataRecovery/VMWareDataRecovery.vmx, moId=12
- Let’s register the VM so we can power it on
16:08:14.714Z [FFDDDB90 verbose ‘Execution’ opID=host-28:0-0] [FailoverAction::ReconfigureCompletionCallback] Powering on vm
- Power on the impacted VM

That is it, nice right? And this is just a short version of what is actually in the log files, which contain a massive amount of detail! Anyway, back to the question: if not already answered, the remaining host in the cluster sees the isolated host as dead as there is no:

network heartbeat
response to a ping to the host
datastore heartbeat

The only thing the master can do at that point is to assume the “isolated” host is dead.
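The master’s reasoning can be summarized in a few lines of Python. This is a simplified sketch of the three checks listed above and not the actual FDM logic. The function name `classify_host` and the returned labels are made up for illustration, and the last branch is an assumption that lumps together every case this post does not walk through.

```python
def classify_host(network_heartbeat, ping_reply, datastore_heartbeat):
    """Rough model of how the master judges a host, per the three checks above."""
    if network_heartbeat:
        return "live"
    if not ping_reply and not datastore_heartbeat:
        # No sign of life on any channel: the master has to assume "dead",
        # even if the host is in reality fully isolated.
        return "dead"
    # Assumption: some sign of life remains, so the host is not declared dead.
    return "unreachable but not declared dead"


if __name__ == "__main__":
    # esx02 in this post: no network heartbeat, no ping reply, no datastore heartbeat.
    print(classify_host(False, False, False))
```

In the scenario from this post all three inputs are missing, so the function returns “dead”, which is exactly what “esx01” logged for host-34.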

** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **

Protecting vCenter Server – HA or Heartbeat?

Duncan Epping · Sep 19, 2012 ·

At VMworld during one of my group discussions there was a discussion around using vSphere HA or vCenter Heartbeat to protect the vCenter Server. Coincidentally it is something that we recently discussed internally on Socialcast and I figured I would give my thoughts on this topic. My answer was short and simple: It depends.

Yes I bet some of you saw that coming… But let me elaborate. vCenter availability is crucial in my opinion when it comes to operating your environment. However your environment is not about vSphere. Your environment is not really about virtual machines. Your environment is about the services that you offer!

Your service level agreement is typically based on the up-time of the service. Makes sense, right? No one really cares about the management platform. Well, I do and you do, but your customers probably do not. Your customers care about the availability of their service.

Will their service have an interruption when vCenter is down? That is the question you will need to ask yourself. In most cases the answer will probably be no, and in those cases you will need to ask yourself how much downtime you can afford from a management perspective. Is a minute or two okay? Then vSphere HA can help you and there is no need for Heartbeat or other complex clustering solutions. If a couple of minutes is not acceptable, then Heartbeat is an option.

If there is a service interruption for the customer when vCenter is down (for instance in a test / dev cloud where provisioning processes are key, vCloud Director, View) you should consider using vCenter Heartbeat. Again, it all depends on your service level agreement. In some cases vCenter availability is crucial, in other cases a downtime of minutes is within the defined boundaries. The answer remains, it depends… it depends on your use case and service level agreement.