
Yellow Bricks

by Duncan Epping



This host currently has no network management redundancy

Duncan Epping · May 21, 2015 ·

I've bumped into this a billion times by now. I wouldn't recommend applying this in production, but for your lab, when you need to take clean screenshots, it works great. I've mentioned this setting before, but as it was part of a larger article it doesn't stand out when searching, so I figured I would dedicate a short and simple article to it. Here is what you will need to do if you see the following message in the vSphere Web Client: this host currently has no network management redundancy.

  • Go to your Cluster object
  • Go to Settings
  • Go to “vSphere HA”
  • Click “Edit”
  • Add an advanced setting called “das.ignoreRedundantNetWarning”
  • Set the advanced setting to “true”
  • On each host right click and select “reconfigure for vSphere HA”

This is what it should look like in the UI:
[Screenshot: This host currently has no network management redundancy]

You can also do this with PowerCLI, by the way. Note that "Stretched-Bluefin-Frimley" is the name of my cluster.

New-AdvancedSetting -Entity Stretched-Bluefin-Frimley -type ClusterHA -Name "das.ignoreRedundantNetWarning" -Value "true" -force
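
If you want to verify the setting afterwards, or kick off the "reconfigure for vSphere HA" step from PowerCLI as well, something along these lines should do the trick. Consider it a quick lab sketch: Get-AdvancedSetting simply reads the value back, and ReconfigureHostForDAS_Task is, as far as I know, the API call behind the right-click action, so double check before using it in anger.

# Check the advanced setting on the cluster
Get-AdvancedSetting -Entity (Get-Cluster "Stretched-Bluefin-Frimley") -Name "das.ignoreRedundantNetWarning"

# Reconfigure every host in the cluster for vSphere HA
Get-Cluster "Stretched-Bluefin-Frimley" | Get-VMHost | ForEach-Object {
    $_.ExtensionData.ReconfigureHostForDAS_Task() | Out-Null
}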

 

vCenter Server Appliance watchdog

Duncan Epping · Apr 9, 2015 ·

I was reviewing a paper on vCenter availability for 6.0 and it listed a watchdog service which monitors "VPXD" (the vCenter Server service) on the vCenter Server Appliance. I had seen the service before but never really looked into it. With 5.5 the watchdog service (/usr/bin/vmware-watchdog) was only used to monitor vpxd and tomcat, but in 6.0 it seems to monitor more services. I did a "grep" for vmware-watchdog within the 6.0 appliance and the output below shows the services which are being watched:

ps -ef | grep vmware-watchdog
 root 7398 1 0 Mar27 ? 00:00:00 /bin/sh /usr/bin/vmware-watchdog -s rhttpproxy -u 30 -q 5 /usr/sbin/rhttpproxy -r /etc/vmware-rhttpproxy/config.xml -d /etc/vmware-rhttpproxy
 root 11187 1 0 Mar27 ? 00:00:00 /bin/sh /usr/bin/vmware-watchdog -s vws -u 30 -q 5 /usr/lib/vmware-vws/bin/vws.sh
 root 12041 1 0 Mar27 ? 00:09:58 /bin/sh /usr/bin/vmware-watchdog -s syslog -u 30 -q 5 -b /var/run/rsyslogd.pid /sbin/rsyslogd -c 5 -f /etc/vmware-rsyslog.conf
 root 12520 1 0 Mar27 ? 00:09:56 /bin/sh /usr/bin/vmware-watchdog -b /storage/db/vpostgres/postmaster.pid -u 300 -q 2 -s vmware-vpostgres su -s /bin/bash vpostgres
 root 29201 1 0 Mar27 ? 00:00:00 /bin/sh /usr/bin/vmware-watchdog -a -s vpxd -u 3600 -q 2 /usr/sbin/vpxd

As you can see, vmware-watchdog is run with a couple of parameters, which seem to differ per service. As it is the most important service, let's have a look at vpxd. It shows the following parameters:

-a
-s vpxd
-u 3600
-q 2

What the above parameters result in is the following: the service named vpxd (-s vpxd) is monitored for failures and will be restarted twice (-q 2) at most. If it fails a third time within 3600 seconds / one hour (-u 3600), the guest OS will be restarted (-a).
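
To make the semantics a bit more concrete, below is a purely illustrative PowerShell sketch of the restart logic described above. This is not how vmware-watchdog is actually implemented (it is a shell script), and "my-service" is just a hypothetical placeholder for the monitored binary.

# Illustrative only: restart a service at most $maxRestarts times within a window,
# and reboot the guest OS when the limit is exceeded.
$windowSeconds  = 3600   # -u 3600
$maxRestarts    = 2      # -q 2
$rebootOnGiveUp = $true  # -a

$failures = @()
while ($true) {
    # Hypothetical placeholder: run the monitored binary and wait for it to exit (= a failure)
    Start-Process -FilePath "my-service" -Wait
    $failures += Get-Date
    # Only count failures that happened inside the window
    $failures = @($failures | Where-Object { ((Get-Date) - $_).TotalSeconds -lt $windowSeconds })
    if ($failures.Count -gt $maxRestarts) {
        if ($rebootOnGiveUp) { Restart-Computer -Force }   # third failure within the hour
        break
    }
    # Otherwise the loop simply starts the service again
}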

Note that the guest OS will only be restarted when vpxd has failed multiple times. With other services this is not the case, as the "grep" above shows. There are some more watchdog-related processes, but I am not going to discuss those at this point, as the white paper being worked on by Technical Marketing will discuss them in a bit more depth and should be the authoritative resource.

** Please do not make changes to ANY of the above parameters, as this is totally unsupported. I am merely showing the details for educational purposes and to provide better insight into vCenter availability when it comes to the VCSA. **

DRS rules still active when DRS disabled?

Duncan Epping · Mar 30, 2015 ·

I just received a question about DRS rules and why they are still active when DRS is disabled. I was under the impression this was something I had already blogged about, but I cannot find it. I know some others did, but they reported this behaviour as a bug… which it actually isn't.

Below is a screenshot of the VM/Host Rules screen for vSphere 6.0, which allows you to create rules for clusters… Note that I said "clusters", not DRS specifically. In 6.0 the wording in the UI has changed to align with the functionality vSphere offers. These are not DRS rules but rather cluster rules. Whether you use HA or DRS, these rules can be used when either of the two is configured.

Note that not all types of rules will automatically be respected by vSphere HA. One thing you can now also do in the UI is specify whether HA should ignore or respect rules, which is very useful if you ask me and makes life a bit easier.
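
By the way, if you want to see which of these rules are defined on a cluster from PowerCLI, Get-DrsRule will list them (the cmdlet kept its old name even though the UI now talks about cluster rules). "MyCluster" below is just an example name.

# List the affinity / anti-affinity rules configured on a cluster
Get-DrsRule -Cluster (Get-Cluster -Name "MyCluster")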

vSphere HA respecting VM-Host should rules?

Duncan Epping · Mar 5, 2015 ·

A long time ago I authored this white paper around stretched clusters. During our testing, the one thing where we felt HA was lacking was the fact that it would not respect VM-Host should rules. So if you had these configured in a cluster and a host failed, VMs could be restarted on ANY given host in the cluster. The first time DRS would then run, it would move the VMs back to where they belong according to the configured VM-Host should rules.

I guess one of the reasons for this was the fact that originally the affinity and anti-affinity rules were designed to be DRS rules. Over time we realized that these are not DRS rules but rather cluster rules. Based on the findings from authoring the white paper we filed a bunch of feature requests, and one of them just made vSphere 6.0. As of vSphere 6.0 it is possible to have vSphere HA respect VM-Host should rules through the use of an advanced setting called "das.respectVmHostSoftAffinityRules".

When "das.respectVmHostSoftAffinityRules" is configured, vSphere HA will try to respect the rule when it can. So if there are any hosts in the cluster which belong to the same VM-Host group, HA will restart the respective VM on one of those hosts. Of course, as this is a "should rule", HA has the ability to ignore the rule when needed. You can imagine a scenario where none of the hosts in the VM-Host should rule are available; in that case HA will restart the VM on any other host in the cluster. Useful? Yes, I think so!
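
Configuring it follows the same pattern as other HA advanced settings, so from PowerCLI something like the below should work. Again, "MyCluster" is just an example name, and as always test it in a lab first.

# Have vSphere HA respect VM-Host "should" rules when restarting VMs (vSphere 6.0)
New-AdvancedSetting -Entity (Get-Cluster "MyCluster") -Type ClusterHA -Name "das.respectVmHostSoftAffinityRules" -Value "true" -Force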

What’s new for HA in vSphere 6.0?

Duncan Epping · Feb 4, 2015 ·

Instead of one generic post with a bunch of data, I picked a couple of features and dug a little bit deeper. Today I will be discussing what is new for HA in vSphere 6.0. Let's start with a list and then look at the features / enhancements individually:

  • Support for Virtual Volumes – With Virtual Volumes a new type of storage entity is introduced in vSphere 6.0.
  • VM Component Protection – This allows HA to respond to a scenario where the connection to the virtual machine’s datastore is impacted temporarily or permanently.
    • “Response for Datastore with All Paths Down”
    • “Response for Datastore with Permanent Device Loss”
  • Increased scale – Cluster limit has grown from 32 to 64 hosts and to a max of 8000 VMs per cluster
  • Registration of “HA Disabled” VMs on hosts after failure

Let's start with support for Virtual Volumes. It may sound like a given, but as the whole concept of a VMFS volume no longer exists with Virtual Volumes, and VMs have "virtual volumes" instead of VMDKs, you can imagine that some work was needed to allow HA to restart virtual machines stored on a VVol enabled storage system.

VM Component Protection (VMCP) is in my opinion THE big thing that got added to vSphere HA. What this feature basically allows you to do is protect yourself against storage failures. There are two types of failures VMCP will respond to: PDL and APD. Before we look at some of the details, I want to point out that configuring it is extremely simple… Just one tickbox to enable it.

[Screenshot: HA in vSphere 6.0]

In the case of a PDL (permanent device loss), a VM will be restarted instantly when a PDL signal is issued by the storage system; this is something HA was already capable of doing when configured through the command line. For an APD (all paths down) this is a bit different. A PDL more or less indicates that the storage system does not expect the device to return any time soon. An APD is more of an unknown situation: it may return… it may not… and there is no clue how long it will take. With vSphere 5.1 some changes were introduced to the way APD is handled by the hypervisor, and this mechanism is leveraged by HA to allow for a response. (Cormac wrote an excellent post about this APD handling here.)

When an APD occurs, a timer starts. After 140 seconds the APD timeout is declared and the device is marked as such. When the 140 seconds have passed, HA starts counting. The HA timeout is 3 minutes. When the 3 minutes have passed, HA can restart the virtual machine, but you can configure VMCP to respond differently if you want it to. You could for instance specify that events are issued when a PDL or APD has occurred. You can also specify how aggressively HA needs to try to restart VMs that are impacted by an APD. Note that aggressive / conservative refers to the likelihood of HA being able to restart VMs. When set to "conservative", HA will only restart the VM that is impacted by the APD if it knows another host can restart it. In the case of "aggressive", HA will try to restart the VM even if it doesn't know the state of the other hosts, which could lead to a situation where your VM is not restarted as there is no host that has access to the datastore the VM is located on.

It is also good to know that if the APD is lifted and access to the storage is restored during the total of roughly 5 minutes and 20 seconds it would take before the VM would be restarted, HA will not do anything unless you explicitly configure it to do so. This is where the "Response for APD recovery after APD timeout" setting comes into play.

[Screenshot: HA in vSphere 6.0]
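
For those who prefer the API over the tickbox, a rough PowerCLI sketch of enabling VMCP on a cluster could look like the below. Note that the "VmComponentProtecting" property name is based on my reading of the 6.0 ClusterDasConfigInfo object, so treat this as an illustration rather than a recipe and verify it before use.

# Rough sketch: enable VM Component Protection on a cluster through the vSphere API
$cluster = Get-Cluster -Name "MyCluster"
$spec = New-Object VMware.Vim.ClusterConfigSpecEx
$spec.DasConfig = New-Object VMware.Vim.ClusterDasConfigInfo
$spec.DasConfig.VmComponentProtecting = "enabled"   # assumption: the 6.0 property behind the VMCP tickbox
$cluster.ExtensionData.ReconfigureComputeResource_Task($spec, $true) | Out-Null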

Increased scale is pretty straightforward: from 32 to 64 hosts and a total of 8000 VMs per cluster. I don't know too many customers hitting these boundaries, but I do come across a request like this occasionally. So if you want to grow your cluster, you can now do so. Do note that you may hit other limits, like the LUN limit or the VM limit or…

Registration of HA disabled VMs after a failure is a feature I requested a long time ago, and I am glad to see it made it into the release. Basically, when you have HA disabled on a specific VM, this feature will make sure the VM gets registered on another host after a failure. This allows you to easily power on that VM when needed without needing to manually re-register it yourself. Note that HA will not power on the VM; it will just register it for you.

That was it for now…

