
Yellow Bricks

by Duncan Epping



vSphere HA "Waiting for cluster election to complete" operation timed out?

Duncan Epping · Jan 4, 2012 ·

I noticed this thread on the VMTN community which discusses a time-out during the cluster election process. The one thing all scenarios described in the thread have in common is that the hosts were upgraded, either from 4.1 to 5.0 or from the 5.0 base to a higher patch level. Marc Sevigny posted in the same thread that it is a known issue which the HA team is currently investigating:

After an upgrade, under conditions we’re still investigating, an error is occurring when issuing a start request of the HA service on the upgraded host.  When that fails, HA then tries to re-install HA, and the re-install does nothing because the service is already there (and the right version) but we’re left without an HA service running.

If you are experiencing this issue, this is the way to fix it. And if you do run into it, please report it to VMware and submit log files, as that will help the HA team fix the problem.

  1. Place the host into Maintenance Mode
  2. Take a copy of /opt/vmware/uninstallers/VMware-fdm-uninstall.sh (we copied it to /tmp)
  3. From the location where you made the copy, run the script (./VMware-fdm-uninstall.sh)
  4. You should see a short pause before it returns to the prompt (you'll see why I mention this below)
  5. Exit the host out of Maintenance Mode; in the "Recent Tasks" area you should see the HA agent being pushed from vCenter and installed

vSphere HA Isolation response when using IP Storage

Duncan Epping · Dec 15, 2011 ·

I had a question from one of my colleagues last week about the vSphere HA Isolation Response and IP storage. His customer had an iSCSI storage infrastructure (this applies to NFS as well) and had recently implemented a new vSphere environment. When one of the hosts was isolated, virtual machines were restarted and users started reporting strange problems.

What happened was that the vSphere HA Isolation Response was configured to "Leave Powered On", and as both the Management Network and the iSCSI network were isolated, there was no network heartbeating and no datastore heartbeating. Because the datastores were unavailable, the locks on the VMDK files (virtual disks) expired, and HA was able to restart the VMs on another host.

Now please note that HA/ESXi will power off (or rather kill) the "ghosted VM" (the copy still running on the host that lost network connectivity) when it detects that the locks cannot be re-acquired. It still means that, between the moment of the restart and the moment the isolation event is resolved, the IP address and MAC address of the VM can potentially pop up twice on the network. Of course this will only happen when your virtual machine network isn't isolated, and as you can imagine this is not desired.

When you are running IP-based storage, it is highly (!!) recommended to configure the isolation response to "Power off"! For more details on configuring the isolation response, please read this article, which lists the best practices and recommendations.
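To make the failure mode concrete, here is a toy Python sketch of what happens to a VM on an isolated host depending on the configured isolation response and whether the IP storage network is isolated along with the management network. This is entirely my own model for illustration, not VMware code:

```python
def isolation_outcome(isolation_response, ip_storage_isolated_too):
    """Toy model of vSphere HA behavior during a host isolation event.

    isolation_response: "power-off" or "leave-powered-on"
    ip_storage_isolated_too: True when iSCSI/NFS traffic shares the
    isolated network, so datastore heartbeats and locks are lost as well.
    """
    if isolation_response == "power-off":
        # VM is killed on the isolated host; HA restarts it cleanly elsewhere
        return "clean restart, no duplicate on the network"
    if ip_storage_isolated_too:
        # Locks on the VMDKs expire, HA restarts the VM, while the
        # "ghosted" original keeps running until lock re-acquisition fails
        return "duplicate IP/MAC on the network until ghost VM is killed"
    # Storage still reachable: locks are held, HA cannot restart the VM
    return "VM keeps running on the isolated host"

# The problematic combination from this post:
print(isolation_outcome("leave-powered-on", True))
```

With IP storage, only the "power-off" row avoids the window in which two copies of the VM answer on the same IP and MAC address.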

Multi NIC vMotion, how does it work?

Duncan Epping · Dec 14, 2011 ·

I had a question last week about multi NIC vMotion. The question was whether multi NIC vMotion is a multi initiator / multi target solution, meaning that, if available, multiple NICs are used on both the source and the destination for the vMotion migration of a VM. Yes, it is!

It is a complex process, as vMotion needs to be able to handle mixes of 10GbE and 1GbE NICs.

When we start the process, we check each host from the vCenter side and determine the total combined pool of bandwidth available for vMotion. In other words, if a host has 2 x 1GbE NICs and 1 x 10GbE NIC, then that host has a pool of 12Gb worth of bandwidth. We do the same for both the source and the destination host. Then we walk down each host's list of vMotion vmknics, pairing off NICs until we've exhausted the bandwidth pool.

There are many combinations possible, but let's discuss a few just to give a better idea of how this works:

  • If the source host has 1 x 1GbE NIC and the destination 1 x 1GbE NIC, we'll open one connection between these two hosts.
  • If the source has 3 x 1GbE NICs and the destination 1 x 10GbE NIC, then we'll open one connection from each source-side 1GbE NIC to the destination's 10GbE NIC – so a total of three socket connections, all to the destination's single 10GbE NIC.
  • If the source has 15 x 1GbE NICs and the destination 1 x 10GbE NIC and 5 x 1GbE NICs, then we'll direct the first 10 source-side 1GbE NICs to connect to the destination's 10GbE NIC, and the remaining five 1GbE vmknics on each side will pair up with each other – 15 connections in all.

Keep in mind that if the hosts are mismatched, we will create connections between vmknics until one of the sides is "depleted". In other words, if the source has 2 x 1GbE NICs and the destination only 1 x 1GbE NIC, only one connection will be opened.
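The pairing logic described above can be sketched in a few lines of Python. This is my own reconstruction of the behavior, not VMware code; NIC speeds are in Gb, and each connection consumes matching capacity on both ends until one side is depleted:

```python
def pair_vmotion_nics(src_speeds, dst_speeds):
    """Pair source and destination vMotion vmknics (speeds in Gb).

    Walks the source list in order; each connection consumes capacity
    on both ends, so a 10GbE NIC can serve up to ten 1GbE peers.
    Returns a list of (src_index, dst_index) connections.
    """
    src, dst = list(src_speeds), list(dst_speeds)
    connections = []
    for i in range(len(src)):
        while src[i] > 0:
            # first destination vmknic with spare capacity
            j = next((k for k, cap in enumerate(dst) if cap > 0), None)
            if j is None:  # destination side depleted
                return connections
            used = min(src[i], dst[j])
            src[i] -= used
            dst[j] -= used
            connections.append((i, j))
    return connections

# 3 x 1GbE source, 1 x 10GbE destination -> three connections to NIC 0
print(pair_vmotion_nics([1, 1, 1], [10]))  # [(0, 0), (1, 0), (2, 0)]
```

Running it with the mismatched example (2 x 1GbE source, 1 x 1GbE destination) yields a single connection, matching the "depleted" rule above.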


Using Storage IO Control and Network IO Control together?

Duncan Epping · Dec 7, 2011 ·

I had a question today from someone who asked if there is any point in enabling SIOC (Storage IO Control) when you have NIOC (Network IO Control) enabled and configured. Let's start with the answer: yes, there is! NIOC controls traffic at the level of a single NIC port. In other words, when vMotion, VMs and NFS (for instance) share the same 10GbE NIC port, it will prevent one of the streams from claiming all the bandwidth while the others need it. It basically is the police officer who quiets down a group of people getting too loud in a single room.

As not many people realize this, let's repeat it: NIOC controls traffic at the NIC port level. Not per NIC pair, not per host, and not cluster wide. At the NIC port level!

SIOC does IO control at the datastore/VM layer. Meaning that when a certain latency threshold is reached, it determines on a datastore-wide level which hosts, and essentially which VMs, get a specific chunk of the resources. SIOC prevents a single VM from claiming all IO resources for a datastore in a cluster. SIOC is cluster wide at the datastore level! It basically is the police officer who asks your neighbor to tone it down because he is bothering the rest of the street.
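Both mechanisms boil down to the same idea: during contention, capacity is divided among the active consumers in proportion to their shares. A toy Python sketch makes this concrete; the share values here are made up for illustration:

```python
def divide_by_shares(capacity, shares):
    """Split capacity among active consumers proportionally to their shares."""
    total = sum(shares.values())
    return {name: capacity * s / total for name, s in shares.items()}

# NIOC-style: one contended 10GbE NIC port, three traffic types
print(divide_by_shares(10, {"vMotion": 50, "VM": 100, "NFS": 50}))
# {'vMotion': 2.5, 'VM': 5.0, 'NFS': 2.5}
```

The same proportional split applies for SIOC, except the "capacity" is a datastore's IO throughput and the consumers are VMs across the whole cluster, not traffic types on one port.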

Yes, enabling SIOC and NIOC together makes a lot of sense!

Some more nuggets about handling VIB files

Duncan Epping · Nov 30, 2011 ·

After I posted my article yesterday, Jason Boche left a comment about the reboot that was required according to the screenshot. I looked into it and quickly realized that if I altered my "descriptor.xml" I would not get this message. In other words, whether a reboot is required depends on the package being installed. In my case I made the following simple changes to install the package without the need to reboot:

<live-install-allowed>true</live-install-allowed>
<live-remove-allowed>true</live-remove-allowed>

In other words, I am allowed to install it without a reboot and remove it without a reboot. Simple, huh? Of course I tested it, and this is the result:

~ # esxcli software vib install -v /test.vib
Installation Result
   Message: Operation finished successfully.
   Reboot Required: false
   VIBs Installed: Duncan_bootbank_firewallrule_1.0
   VIBs Removed:
   VIBs Skipped:
~ #

After clicking refresh in the vCenter client the firewall rule I created pops up as expected.

Now, if you would like to know what a package contains and what its requirements are before installing it, you can simply figure that out as follows:

~ # esxcli software sources vib get -v file:/test.vib
Duncan_bootbank_firewallrule_1.0
   Name: firewallrule
   Version: 1.0
   Type: bootbank
   Vendor: Duncan
   Acceptance Level: CommunitySupported
   Summary: Firewall rule
   Description: Firewall rule
   Release Date: 2011-06-01
   Depends:
   Conflicts:
   Replaces:
   Provides:
   Maintenance Mode Required: False
   Hardware Platforms Required:
   Live Install Allowed: True
   Live Remove Allowed: True
   Stateless Ready: False
   Overlay: False
   Tags: driver, module
   Payloads: test
~ #

As you can see, in this case "Live Install Allowed" is set to true. The vendor is "Duncan" and the Acceptance Level is "CommunitySupported"; these are important details in my opinion! Another one to keep an eye on is whether the package is "Stateless Ready" or not. In my case I defined it as "false".

Of course you can also remove a VIB after installing it. This is pretty straightforward: first, list all the installed VIBs:

~ # esxcli software vib list
Name                  Version                             Vendor  Acceptance Level  Install Date
--------------------  ----------------------------------  ------  ----------------  ------------
ata-pata-amd          0.3.10-3vmw.500.0.0.456551          VMware  VMwareCertified   2011-06-06
ata-pata-atiixp       0.4.6-3vmw.500.0.0.456551           VMware  VMwareCertified   2011-06-06
ata-pata-cmd64x       0.2.5-3vmw.500.0.0.456551           VMware  VMwareCertified   2011-06-06

After listing all installed VIBs, you can easily remove one using the following command:

~ # esxcli software vib remove -n ata-pata-amd

This would remove the VIB named “ata-pata-amd”. You could even do a “dry-run” to see what the result would be:

~ # esxcli software vib remove -n ata-pata-amd --dry-run
Removal Result
   Message: Dryrun only, host not changed. The following installers will be applied: [BootBankInstaller]
   Reboot Required: true
   VIBs Installed:
   VIBs Removed: VMware_bootbank_ata-pata-amd_0.3.10-3vmw.500.0.0.456551
   VIBs Skipped:
~ #

I hope this provides some more details around how to handle VIB files. Don’t hesitate to leave a comment if you have any questions at all.


About the Author

Duncan Epping is a Chief Technologist and Distinguished Engineering Architect at Broadcom. Besides writing on Yellow-Bricks, Duncan is the co-author of the vSAN Deep Dive and the vSphere Clustering Deep Dive book series. Duncan is also the host of the Unexplored Territory Podcast.
