
Yellow Bricks

by Duncan Epping


ha

Primary and Secondary nodes, pick one!

Duncan Epping · Aug 7, 2009 ·

I get this question at least once every two weeks: is it possible to handpick a primary node? Until now the answer has always been: no, this is not possible. But is this really the case? Can't I manually promote a server to primary or demote a server to secondary? Those who have been paying attention might have noticed there's a way to list the current primaries and secondaries from the HA command-line interface.

/opt/vmware/aam/bin/Cli (ftcli on earlier versions)
AAM> ln

Now that makes you wonder what else is possible… Let's start with a warning: I don't know if this is supported, so let's assume it is not. Also keep in mind that the supported limit of primaries is 5, I repeat, 5. This is a soft limit, so you can manually add a sixth, but that is not supported. Now here's the magic…

To promote a node:

/opt/vmware/aam/bin/Cli (ftcli on earlier versions)
AAM> promotenode <nodename>

To demote a node:

/opt/vmware/aam/bin/Cli (ftcli on earlier versions)
AAM> demotenode <nodename>
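
If you want to apply this to more than one host, a small script can help. Below is a minimal sketch in Python, assuming SSH access to the service console and the paramiko library; the host names, credentials and the use of "exit" to leave the AAM CLI are my own assumptions, not something taken from VMware documentation, and the same support caveat as above applies.

# Sketch: run AAM CLI commands (ln, promotenode, demotenode) over SSH.
# Assumes SSH access to the service console and the paramiko library.
# Host names and credentials are placeholders; "exit" is assumed to end
# the AAM CLI session. Remember: more than 5 primaries is not supported.
import paramiko

AAM_CLI = "/opt/vmware/aam/bin/Cli"  # "ftcli" on earlier versions

def run_aam_command(host, user, password, aam_command):
    """Run a single AAM CLI command on an ESX host and return its output."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user, password=password)
    stdin, stdout, stderr = client.exec_command(AAM_CLI)
    stdin.write(aam_command + "\n")   # e.g. "ln" or "promotenode esx02"
    stdin.write("exit\n")             # assumed to terminate the CLI session
    stdin.flush()
    output = stdout.read().decode()
    client.close()
    return output

# List the current primaries/secondaries, then promote a node.
print(run_aam_command("esx01.example.local", "root", "password", "ln"))
print(run_aam_command("esx01.example.local", "root", "password", "promotenode esx02"))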

HA deepdive

Duncan Epping · Jul 27, 2009 ·

I just refreshed my HA Deepdive page. It had been on my "to do" list for a long time, but I never got to it. Well, it took me a couple of evenings, but it's finally done and I'm happy with it. I just hope you guys find the refresh useful and enjoy it. I also flushed all the comments on the page, so if you've got any questions, don't hesitate to ask them. I might even add an FAQ one day… who knows 🙂

Re: RTFM “What I learned today – HA Split Brain”

Duncan Epping · Jul 22, 2009 ·

I’m going to start with a quote from Mike’s article “What I learned today…“:

Split brain is an HA situation where an ESX host becomes "orphaned" from the rest of the cluster because its primary service console network has failed. As you might know, the COS network is used in the process of checking whether an ESX host has suffered an untimely demise. If you fail to protect the COS network by giving vSwitch0 two NICs or by adding a 2nd COS network to, say, your VMotion switch, undesired consequences can occur. Anyway, the time for detecting split brain used to be 15 seconds; for some reason this has changed to 12 seconds. I'm not 100% sure why, or if in fact the underlying value has changed – or whether VMware has merely corrected its own documentation. You see, it's possible to get split brain happening in VI3.5 if the network goes down for more than 12 seconds but comes back up on the 13th, 14th or 15th second. I guess I will have to do some research on this one. Of course, the duration can be changed – and split brain is a trivial matter if you take the necessary network redundancy steps…

I thought this was common knowledge, but if Mike doesn't know about it, my guess is that most of you don't either. Before we dive into Mike's article: technically this is not a split brain, it is an "orphaned VM" situation, not a scenario where the disk files and the in-memory VM are split between hosts.

Before we start, this setting is key in Mike's example:

das.failuredetectiontime = the period a host waits, after receiving no heartbeats from another host, before declaring that host dead.

The default value is 15 seconds. In other words, the host will be declared dead on the fifteenth second and a restart will be initiated by one of the primary hosts.

For now, let's assume the isolation response is "power off". The VMs can only be restarted elsewhere once the original VMs have been powered off. Here's the catch: the "power off" (isolation response) will be initiated by the isolated host 2 seconds before the das.failuredetectiontime expires.

Does this mean you can end up with your VMs being down and HA not restarting them?
Yes. When the heartbeat returns between the 13th and 15th second, the power off could already have been initiated, but the restart will not be, because the returning heartbeat indicates that the host is not isolated.

How can you avoid this?
Pick "Leave VM powered on" as the isolation response. Increasing das.failuredetectiontime will also decrease the chances of running into issues like these.

Did this change?
No, it's been like this since it was introduced.
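
To make the timing above concrete, here is a minimal Python sketch of the window described: the isolated host triggers the isolation response 2 seconds before das.failuredetectiontime expires, while a restart is only initiated if the host is still unreachable when das.failuredetectiontime is reached. The function and constant names are mine, purely for illustration.

# Sketch of the HA timing window described above, with the default
# das.failuredetectiontime of 15 seconds: the isolated host powers off its
# VMs at the 13th second, but a restart only happens if the host is still
# unreachable at the 15th second.

FAILURE_DETECTION_TIME_S = 15   # das.failuredetectiontime, default 15 seconds
ISOLATION_RESPONSE_LEAD_S = 2   # isolation response fires 2 seconds earlier

def outcome(heartbeat_returns_at_s, isolation_response="power off"):
    """What happens to the VMs when the heartbeat comes back at a given second."""
    power_off_at = FAILURE_DETECTION_TIME_S - ISOLATION_RESPONSE_LEAD_S
    if heartbeat_returns_at_s < power_off_at:
        return "VMs keep running: network recovered before the isolation response"
    if heartbeat_returns_at_s < FAILURE_DETECTION_TIME_S:
        if isolation_response == "power off":
            return "VMs powered off but NOT restarted: host is no longer isolated"
        return "VMs keep running: isolation response is 'Leave VM powered on'"
    return "Host declared dead: VMs are restarted on another host"

for t in (10, 13, 14, 16):
    print(f"heartbeat back at {t}s -> {outcome(t)}")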

Up to 80 virtual machines per host in an HA Cluster (3.5 vs vSphere)

Duncan Epping · Jul 16, 2009 ·

I was re-reading the KB article on how to improve HA scaling. Apparently, pre vCenter 2.5 U5 there was a problem when the number of VMs that needs to fail over exceeds 35. Keep in mind that this is a soft limit; you can run more than 35 VMs on a single host in an HA cluster if you want to.

To increase scalability to up to 80 VMs per host, vCenter needs to be upgraded to 2.5 U5 and the following configuration changes are recommended:

  1. Increase the maximum vCPU limit to 192.
  2. Increase the Service Console memory limit to 512 MB.
  3. Increase the memory resource reservation of the vim resource pool to 1024 MB.
  4. Include/edit the host agent memory configuration values (hostdStopMemInMB=380 and hostdWarnMemInMB=300).
A question I immediately had: what about vSphere? What are the values for vSphere, and do I need to increase them as well? Here are the vSphere default settings:
  1. Maximum vCPU limit: 512
  2. Service Console memory: 300 MB
  3. vim resource pool memory reservation: 0 MB
  4. hostdStopMemInMB=380 and hostdWarnMemInMB=300

As you can see, 1 and 4 are already covered by the vSphere defaults. I would always recommend setting the Service Console memory to 800 MB; with most hosts having 32 GB or more, the cost of assigning an extra 500 MB to the Service Console is minimal. That leaves the recommendation to increase the memory reservation for the vim resource pool; I would recommend leaving it set to the default value. vSphere scales up to 100 VMs per host in an HA cluster, and chances are this will be increased when U1 hits the streets. (These values usually change with every release.)
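
To summarize the numbers above, here is a small Python sketch that puts the vCenter 2.5 U5 recommendations next to the vSphere defaults and the advice from this post; the labels are descriptive, not literal configuration parameter names, except for the hostd memory settings.

# vCenter 2.5 U5 recommendations versus vSphere defaults, as listed above,
# plus the advice from this post where it differs. Labels are descriptive,
# not literal parameter names, except for the hostd memory settings.
settings = [
    # (label, 2.5 U5 recommendation, vSphere default, advice for vSphere)
    ("Maximum vCPU limit",                  192,  512, "leave at default (already higher)"),
    ("Service Console memory (MB)",         512,  300, "set to 800 MB"),
    ("vim resource pool reservation (MB)", 1024,    0, "leave at default"),
    ("hostdStopMemInMB",                    380,  380, "leave at default (already matches)"),
    ("hostdWarnMemInMB",                    300,  300, "leave at default (already matches)"),
]

for label, recommended, default, advice in settings:
    print(f"{label}: 2.5 U5 recommends {recommended}, vSphere default {default} -> {advice}")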

MSCS VM’s in a HA/DRS cluster

Duncan Epping · Jun 3, 2009 ·

We (VMware PSO) had a discussion yesterday about whether it is supported to have MSCS (Microsoft Cluster Service) VMs in an HA/DRS cluster with both HA and DRS set to disabled. I know many people struggle with this because, in a way, it doesn't make sense. In short: no, this is not supported. MSCS VMs can't be part of a VMware HA/DRS cluster, even if they are set to disabled.

I guess you would like to have proof:

For ESX 3.5:
http://www.vmware.com/pdf/vi3_35/esx_3/r35u2/vi3_35_25_u2_mscs.pdf

Page 16 – “Clustered virtual machines cannot be part of VMware clusters (DRS or HA).”

For vSphere:
http://www.vmware.com/pdf/vsphere4/r40/vsp_40_mscs.pdf

Page 11 – “The following environments and functionality are not supported for MSCS setups with this release of vSphere:
Clustered virtual machines as part of VMware clusters (DRS or HA).”

As you can see, certain restrictions apply; make sure to read the above documents for all the details.

