vSphere HA Waiting for cluster election to complete Operation timed out?

Duncan Epping · Jan 4, 2012

I noticed this thread on the VMTN community which discussed a time-out during the cluster election process. The one thing all scenarios described in the thread have in common is an upgrade, either from 4.1 to 5.0 or from base 5.0 to a higher patch level. Marc Sevigny posted in the same thread that it is a known issue which the HA team is currently investigating…

After an upgrade, under conditions we’re still investigating, an error is occurring when issuing a start request of the HA service on the upgraded host. When that fails, HA then tries to re-install HA, and the re-install does nothing because the service is already there (and the right version) but we’re left without an HA service running.

This is the way to fix it if you are experiencing this issue. If you do experience it, please report it to VMware and submit log files, as that will help the HA team fix the problem.

Place host into Maintenance Mode
Take a copy of /opt/vmware/uninstallers/VMware-fdm-uninstall.sh (we copied it to /tmp)
From the location where you made the copy, run the command ./VMware-fdm-uninstall.sh
You should see a short pause before it gets back to the prompt (you’ll see why I mention this below)
Take the host out of Maintenance Mode; within the “Recent Tasks” area you should see the client being pulled from vCenter and installed

Comments

Kevin says

5 January, 2012 at 00:14

I am experiencing this same issue, but am very new to VMware. I’m assuming that I can do these steps using the PowerCLI but I do not know how. Could you please provide some quick guidance? I can connect to the host, but cannot figure out how to access the file system. The prompt is on my local computer. Do I use a cmdlet to do this?

Thanks for any help.
Kevin says

5 January, 2012 at 00:20

Got it. Used Putty instead of PowerCLI.
TBailey says

6 January, 2012 at 21:04

It has definitely been an odd occurrence for us…
MarcS says

19 January, 2012 at 16:11

If anyone is seeing this issue, it would be helpful to look at the ESXi host logs after the failure to see if the issue is the same as the known problem. The known problem has the following signature in the /var/run/log/hostd.log (or it may have been zipped into a hostd.[x].gz if too much time has elapsed):

2011-12-30T17:12:16.511Z [25B03B90 info 'TaskManager' opID=701D770F-000013A5-75-76] Task Completed : haTask-ha-host-vim.host.ServiceSystem.updatePolicy-19463 Status error
2011-12-30T17:12:16.512Z [25F10B90 info 'TaskManager' opID=SWI-ec15c64b] Task Completed : haTask-ha-host-vim.host.FirewallSystem.disableRuleset-19465 Status success
2011-12-30T17:12:16.512Z [25B03B90 info 'Vmomi' opID=701D770F-000013A5-75-76] Activation [N5Vmomi10ActivationE:0x70a3978] : Invoke done [updatePolicy] on [vim.host.ServiceSystem:serviceSystem]
2011-12-30T17:12:16.512Z [25B03B90 verbose 'Vmomi' opID=701D770F-000013A5-75-76] Arg id:
--> "vmware-fdm"
2011-12-30T17:12:16.512Z [25B03B90 verbose 'Vmomi' opID=701D770F-000013A5-75-76] Arg policy:
--> "off"
2011-12-30T17:12:16.512Z [25B03B90 info 'Vmomi' opID=701D770F-000013A5-75-76] Throw vmodl.fault.SystemError
2011-12-30T17:12:16.512Z [25B03B90 info 'Vmomi' opID=701D770F-000013A5-75-76] Result:
--> (vmodl.fault.SystemError) {
--> dynamicType = ,
--> faultCause = (vmodl.MethodFault) null,
--> reason = "",
--> msg = "",
--> }

If you don’t see anything like this in your host’s /var/run/log directory, then you may have a different issue and should open a support request so we can look into it.
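
For anyone who would rather check a copied log file with a script than by eye, here is a minimal sketch in Python. It is not a VMware tool: the function names find_fdm_start_failures and read_log are made up for illustration, and the matching rules are only one reading of the signature quoted above (an updatePolicy task on ServiceSystem ending in "Status error", an "Arg id" value of vmware-fdm, and a thrown vmodl.fault.SystemError, all under the same opID). Treat a match as a hint, not a diagnosis.

```python
import gzip
import re

# Captures the opID value from a hostd.log line such as "... opID=701D770F-...-76] ..."
OPID = re.compile(r"opID=([^\]\s]+)")


def find_fdm_start_failures(lines):
    """Return the sorted opIDs that show all three parts of the quoted signature.

    Assumption: the three parts share one opID, as they do in the quoted excerpt.
    """
    failed = set()   # opIDs whose ServiceSystem.updatePolicy task ended in "Status error"
    threw = set()    # opIDs that logged "Throw vmodl.fault.SystemError"
    fdm = set()      # opIDs whose "Arg id:" value was vmware-fdm
    pending = None   # opID of an "Arg id:" line whose value is expected on the next line

    for raw in lines:
        line = raw.rstrip()
        match = OPID.search(line)
        if match:
            opid = match.group(1)
            if "ServiceSystem.updatePolicy" in line and "Status error" in line:
                failed.add(opid)
            if "Throw vmodl.fault.SystemError" in line:
                threw.add(opid)
            # The argument value is printed on the following continuation line.
            pending = opid if line.endswith("Arg id:") else None
        elif pending is not None:
            if "vmware-fdm" in line:
                fdm.add(pending)
            pending = None

    return sorted(failed & threw & fdm)


def read_log(path):
    """Read a plain hostd.log or a rotated .gz copy into a list of lines."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as handle:
        return handle.readlines()
```

To use it, copy hostd.log (or one of the rotated .gz files) off the host and call find_fdm_start_failures(read_log("hostd.log")). An empty list means this particular pattern was not found in that file, which by itself does not rule out other causes.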
MarcS says

19 January, 2012 at 16:14

In the above log entry, I should have removed

2011-12-30T17:12:16.512Z [25F10B90 info 'TaskManager' opID=SWI-ec15c64b] Task Completed : haTask-ha-host-vim.host.FirewallSystem.disableRuleset-19465 Status success

That is another task that may or may not be in your log file around the time of the error, and has no relevance to the issue.
MarcS says

19 January, 2012 at 17:22

Another thing to check if you experience this error is to see if you have jumbo frames enabled on the management network, since this interferes with HA communication.
cwjking says

19 January, 2012 at 18:51

This is going to be something I am going to have to bookmark for sure. Moving to v5 over here soon..
Maximiliano says

23 January, 2012 at 15:27

Hi!! I had the same issue, but i can’t find the path /opt/vmware/uninstallers/VMware-fdm-uninstall.sh
In the /opt/vmware i only have the folder vpxa.

I reinstalled a new vsphere 5 in a test environment, searched the folder and wasn’t there.

Anyone know why???
txolson says

6 April, 2012 at 18:21

The “fix” did not work for me, but disabling HA at a cluster level, then re-enabling did work.
marcel says

21 May, 2013 at 05:38

+1 The “fix” did not work for me, but disabling HA at a cluster level, then re-enabling did work.
Ashley Smoot says

3 July, 2013 at 17:26

At different times I got this exact same issue with two ESX 5.0 U2 hosts that were nuke and pave new installs with all updates and patches. This procedure didn’t work in my case. Removing the host from the reconfiguring for HA, removing and re-adding to cluster, restarting mgmt services and a reboot did not work. Removing from vCenter, then re-adding to vCenter did fix the issue. All is well with both now.
Ashley Smoot says

3 July, 2013 at 20:06

Correcting typo — Above comment in sentence 2 should read: “Removing the host from cluster, reconfiguring for HA,…”
akın akalan says

25 December, 2013 at 00:37

for this error please disable anti-ddos protection from your switch or router.
hiney says

12 August, 2014 at 09:03

needed to remove the host from the VC and readd it before it worked, but it finally did. tried lots of things before this. thanks very much.

My issue appeared after recreating the rui.key and rui.crt, so i’m not sure if i caused it or not. HostReconnect.pl didn’t fix it, so maybe i didn’t.
