ha

High Availability change

Duncan Epping · Jul 29, 2008 ·

I just noticed the following: when creating a new(!) HA cluster on VirtualCenter 2.5 Update 2, the default isolation response is set to “Leave powered on”. In other words, when your ESX host loses its network connection, the VMs remain powered on. This is a huge change because the default used to be “Power Off”.

Besides “Power Off” and “Leave powered on” there is a new option, and it is one I was looking for: “Shutdown VM”. Shutdown VM doesn’t just pull the cord; it tries to shut the VM down gracefully via the guest OS.

ESX 3.5 U2 and HA error

Duncan Epping · Jul 28, 2008 ·

Erik Zandboer just posted a topic on the VMTN forum about an HA error he received after updating his machines to 3.5 U2. The error is one almost everyone has probably seen by now: “could not contact primary HA agent”. This is normally solved by clicking “reconfigure for ha” or by disabling and re-enabling HA, but that wasn’t the case this time. After some research Erik discovered that the hosts file entry for the ESX host did not match the DNS name: one of them started with a capital letter while the other did not. This caused HA to fail. After correcting the hostname/DNS name and rebooting, everything worked fine again.

I can imagine this happens because VirtualCenter now effectively acts as the DNS/hosts file for HA. Inconsistent naming has always been, and probably always will be, a problem. So before upgrading, check your hostname and /etc/hosts file!

Previously, enabling VMware High Availability required DNS resolution of all ESX Server hosts in a High Availability cluster. This was done using configuring DNS records or by adding all of the host names and IP addresses to the /etc/hosts file on each server.
Starting with the ESX Server 3.5 Update2 release, DNS resolution or /etc/hosts file entries are no longer required to configure High Availability. The host name and IP address information will now be provided by the managing VirtualCenter Server. the source
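The hostname check recommended above could be sketched roughly as follows. This is only an illustration: the host names and addresses in the sample are made up, and on a real host you would feed in the contents of /etc/hosts and the name VirtualCenter/DNS uses for that host. It flags entries that match the expected name only when case is ignored, which is the situation Erik ran into.

```python
def find_case_mismatches(hosts_text, expected_name):
    """Return hosts-file names that match expected_name only when case is ignored."""
    mismatches = []
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        # First field is the IP address, the rest are names and aliases.
        _ip, *names = entry.split()
        for name in names:
            if name.lower() == expected_name.lower() and name != expected_name:
                mismatches.append(name)
    return mismatches


# Hypothetical sample data, not taken from Erik's environment.
sample_hosts = """\
127.0.0.1   localhost
10.0.0.11   ESX01.example.local ESX01
"""
print(find_case_mismatches(sample_hosts, "esx01.example.local"))
```

An empty list means the spelling in the hosts file and the expected name agree, capitals included.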

Changing the IP-address of an ESX host and HA

Duncan Epping · Jun 4, 2008 ·

Monday evening a colleague changed the IP address of three VMware ESX hosts. He followed the standard VMware procedure, which usually works like a charm. In this case, HA stopped working after the IP address was changed. Disabling and re-enabling HA resulted in the following error: “Configuration of host IP address is inconsistent on host …”

After a close inspection the following error was found in /var/log/vmware/vpx-rupgrade.log:

VMwareerrortext=ft_gethostbyname and hostname -i return different addresses: 10.21.10.81, 10.21.5.12 and 10.21.1.21

The command “hostname -i” resulted in the following:

[root@bla-01 /var/log/vmware]# hostname -i
10.21.1.21

The command “ft_gethostbyname” returned the following:

[root@bla-01 /opt/vmware/aam/bin]# ./ft_gethostbyname
10.21.10.81 bla-01
10.21.5.12 bla-01

So for some reason ESX resolved the wrong address. The hosts file wasn’t the problem; the culprit was FT_HOSTS, which is automatically generated by the AAM client (High Availability):

[root@bla-01 /etc]# more FT_HOSTS
# Auto-generated FT_HOSTS file. Timestamp: Mon Jun 2 19:05:09 2008
10.21.10.81 bla-01
10.21.5.12 bla-01
10.21.10.82 bla-02
10.21.5.14 bla-02
10.21.10.83 bla-03
10.21.5.16 bla-03

So I moved the FT_HOSTS to FT_HOSTS.BAK:

[root@bla-01 /etc]# mv FT_HOSTS FT_HOSTS.BAK

I then reconfigured the cluster for HA and everything worked as expected again:

[root@bla-01 /etc]# more FT_HOSTS
# Auto-generated FT_HOSTS file. Timestamp: Wed Jun 4 10:39:52 2008
10.21.1.21 bla-01
10.21.5.12 bla-01
10.21.1.22 bla-02
10.21.5.14 bla-02
10.21.1.23 bla-03
10.21.5.16 bla-03

Deleting the cluster, removing the hosts from the cluster and/or reconfiguring HA never updated the existing FT_HOSTS file. I would expect every “reconfigure for HA” action to update, or at least check, the FT_HOSTS file.
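The manual comparison in this post (the output of “hostname -i” against the contents of FT_HOSTS) can be sketched as a small check. This is a minimal illustration, not a VMware tool: the function names are my own, and the file format is assumed to be exactly what the listings in this post show (comment lines starting with #, then “IP name” pairs).

```python
def parse_ft_hosts(text):
    """Map each host name to the list of IP addresses listed for it."""
    addresses = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the auto-generated header comment.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        ip, name = fields[0], fields[1]
        addresses.setdefault(name, []).append(ip)
    return addresses


def is_stale(ft_hosts_text, host_name, current_ip):
    """True when current_ip is missing from the entries for host_name."""
    return current_ip not in parse_ft_hosts(ft_hosts_text).get(host_name, [])


# The bla-01 entries from the old FT_HOSTS file shown in this post.
old_ft_hosts = """\
# Auto-generated FT_HOSTS file. Timestamp: Mon Jun 2 19:05:09 2008
10.21.10.81 bla-01
10.21.5.12 bla-01
"""
# 10.21.1.21 is what "hostname -i" returned after the IP change.
print(is_stale(old_ft_hosts, "bla-01", "10.21.1.21"))
```

If the check reports a stale file, moving FT_HOSTS aside and reconfiguring HA, as described in this post, is what regenerated it with the correct addresses.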

VC 2.5 HA constraints

Duncan Epping · May 20, 2008 ·

VMTN user “ian4563” recently posted a thread about problems with the HA constraints. The error that was pulled from the log files:

Das admission check failed. Configured failover: 2, Expected new failover: 0

And the solution according to VMTN user “eziskind”, who also is a VMware employee:

Looks like you have some 4-cpu vms in the clusters too. That will really skew things. You’re being hit by the combination of 2 new things in the HA admission control for VC 2.5:

1) If no reservation is set for a vm (or it is set to 0), use default of 256MHz, 256MB. (these values can be changed using HA advanced options: das.vmMemoryMinMB, das.vmCpuMinMHz)
2) For the cpu component of the slot, use (max MHz per virtual cpu) * (max number of vcpu’s per vm)

The HA admission control algorithm is overly conservative in non-homogenous clusters, ie. ones with vms which have different reservations and/or vcpu number. #2 above makes it more conservative. Given these limitations, its best to try to keep the cluster as homogenous as possible. Is it possible to put the 4-cpu vms in a separate cluster? If not, you can try setting the default vm resources to 0 (using the advanced options in #1). This is how things worked in VC 2.0.
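To see why a single 4-vCPU VM skews things, here is a rough sketch of the two rules quoted above. Several assumptions are mine rather than eziskind’s: I read “MHz per virtual cpu” as the VM’s CPU reservation (or the default) applied per vCPU, I take the memory component to be the largest memory reservation (or the default), and I ignore memory overhead. The VM class and slot_size function are hypothetical names for illustration only.

```python
from dataclasses import dataclass

DEFAULT_CPU_MHZ = 256  # default for das.vmCpuMinMHz according to the quote
DEFAULT_MEM_MB = 256   # default for das.vmMemoryMinMB according to the quote


@dataclass
class VM:
    vcpus: int
    cpu_reservation_mhz: int = 0
    mem_reservation_mb: int = 0


def slot_size(vms, default_cpu_mhz=DEFAULT_CPU_MHZ, default_mem_mb=DEFAULT_MEM_MB):
    """Return (cpu_mhz, mem_mb) for one slot under the assumptions stated above."""
    # Rule 1: a reservation of 0 falls back to the default.
    per_vcpu_mhz = max((vm.cpu_reservation_mhz or default_cpu_mhz) for vm in vms)
    mem_mb = max((vm.mem_reservation_mb or default_mem_mb) for vm in vms)
    # Rule 2: max MHz per vCPU times the max vCPU count across all VMs.
    max_vcpus = max(vm.vcpus for vm in vms)
    return per_vcpu_mhz * max_vcpus, mem_mb


cluster = [VM(vcpus=1), VM(vcpus=1), VM(vcpus=4)]
print(slot_size(cluster))        # (1024, 256): the 4-vCPU VM sets the slot
print(slot_size(cluster[:2]))    # (256, 256): without it the slot is 4x smaller
print(slot_size(cluster, 0, 0))  # (0, 0): defaults set to 0, as in VC 2.0
```

Under these assumptions every VM in the cluster is counted against the larger slot, which is why moving the 4-vCPU VMs to a separate cluster or lowering the defaults changes the outcome of the admission check.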

Thanks go out to my colleague Remco for pointing out this topic.

MS Virtualization blogs and VMotion

Duncan Epping · Apr 21, 2008 ·

Recently an article was published on the Microsoft Virtualization Blog comparing Hyper-V’s High Availability/Quick Migration capabilities to VMware’s VMotion (VMblog pointed me to the article). In a second article the writer responds to the large number of reactions he received claiming that VMotion is superior:

After my last blog I received almost two dozen email telling me that VMotion was far superior for unplanned host downtime and that it was a much better HA solution because it could live migrate virtual machines. I’ve heard this fallacy espoused for many years and, folks, this simply isn’t the case.

In the case of unplanned downtime, VMotion can’t live migrate because there is no warning. Instead you must have VMware HA configured and the best it can do is restart the affected virtual machines on other nodes which is the same as what is provided with Windows Server 2008 Hyper-V and Failover Clustering.

I can imagine why people reacted: in the first post the writer only mentioned VMotion. For unplanned downtime VMware doesn’t use VMotion, because when downtime is unplanned the VMs are cut off and restarted on another host by HA (VMware High Availability). There’s no need for a migration when a VM is powered off.

Indeed, Microsoft can do the same with Clustering. But can you live migrate virtual machines when a server needs maintenance? No, at this moment that’s not possible. In other words, you will have to wait for a suitable moment… planned downtime, probably after business hours. But in a 24×7 environment, will there ever be a suitable moment? And even when your business isn’t 24×7, would you want to wait if there’s a possible hardware failure? With an 8:1 consolidation ratio you probably will not be the most popular system engineer when “quick migrating” the file server or the mail server, especially when these VMs have a lot of RAM assigned.

Besides that, with the upcoming new product, Continuous Availability, even unplanned downtime will not crash your VM. CA will constantly mirror your VM to another host, like a continuous VMotion I guess, and when the active host fails the standby host will become active. In other words, no more unplanned downtime.