Today I witnessed something weird. For some reason VirtualCenter was totally lost. There were three ESX 3.5 hosts in a cluster. One of them failed, and it seemed that all the VMs failed over to the other two. This could be confirmed in VirtualCenter because all VMs were registered on either the first or the second host. I could not double-check it on the third host because it was impossible to run “vmware-cmd -l” or contact it via the VI Client.
This also meant that I did not have the opportunity to put the host in maintenance mode, because it was disconnected. Given all these symptoms one would expect the host to be completely empty, so I decided to reboot it. Well, I guess that was a big mistake, because around 15 VMs were shut down. Although according to VirtualCenter they were running on a different ESX host, the third host killed them anyway.
When I restarted the machine, VirtualCenter still showed me the wrong information. So I decided to kill the cluster and recreate it. When I added the ESX hosts back to the cluster, everything functioned like it should. Anyway, it’s really tough troubleshooting when you can’t rely on the management tools. I hope this is something VMware fixes soon, or at least creates a workaround for, like a “forced database update”…
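Before rebooting a host that VirtualCenter claims is empty, it’s worth asking the host itself. A minimal sketch of that cross-check on an ESX 3.x service console, guarded so it does nothing harmful on any other machine (the `getstate` loop assumes “vmware-cmd -l” actually responds, which it did not in my case):

```shell
# Hedged sketch: verify from the host's own service console whether any
# VMs are actually registered and running there, before trusting VC.
if command -v vmware-cmd >/dev/null 2>&1; then
  # List every VM registered on this host, then query its power state.
  vmware-cmd -l | while read -r vmx; do
    vmware-cmd "$vmx" getstate
  done
  host_state="esx"
else
  echo "vmware-cmd not found (not an ESX service console)"
  host_state="not-esx"
fi
```

If the states disagree with what VirtualCenter shows, believe the host, not the database.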
Daniel Hernandez says
I’d be curious to hear what you find the root cause to be.
Share a bit more if you don’t mind:
What was the DB size before it hosed up? SQL or Oracle?
Did anything happen in the DB box’s event logs?
What does /var/log/messages state? Did vpxa get unregistered?
Since it was a fresh cluster, it sounds like you lost data, including historic performance data?
tom miller says
I’m seeing similar behavior, but not quite the same. I work for a VAC, so I see a lot of environments, mostly performing upgrades from 3.0.x to 3.5. At one client, a VM would show up registered to a host, then disappear, then reappear on its own. One host would show gray in VC, then black. We issued “service mgmt-vmware restart” and “service vmware-vpxa restart” and things settled down. It seems like I have to issue these two commands more often than usual.
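The two restarts mentioned above can be sketched as one guarded loop for an ESX 3.x service console. The init-script check is an assumption added here so the snippet skips cleanly on anything that isn’t an ESX host:

```shell
# Hedged sketch: restart the ESX 3.x management agents (hostd and the
# VirtualCenter agent). Guarded so it no-ops where they don't exist.
for svc in mgmt-vmware vmware-vpxa; do
  if [ -x "/etc/init.d/$svc" ]; then
    service "$svc" restart   # mgmt-vmware = hostd, vmware-vpxa = VC agent
  else
    echo "skipping $svc (init script not present)"
  fi
done
```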
Duncan Epping says
I don’t know what the root cause is. All the historic cluster data is lost indeed. It just happened, and there are no logs that relate to this behavior.
Roderick Derks says
Hi,
Recently I experienced the same thing in our 3.5 environment: a cluster of six hosts, and one of them was not manageable via VC. I had to delete core dump files on the host because the filesystem was full. I rebooted the host because the VMs seemed to be running on other hosts, but then I noticed that this was not true. No fun at all. VC was not giving me the right info.
I restored everything and then installed the most recent patches, and now I hope it will never happen again. Still, it’s amazing to see that a host can be in big trouble while the VMs keep running smoothly, although you wish they would have been migrated.
Grtz
Joaquin Avellan says
ditto,
Same issue, but with VC 2.5 and ESX 3.02: in a cluster of 4, two hosts eventually showed ‘not responding’ while still running VMs. VMware had us restart vmware-vmkauthd, vmware-vpxa, mgmt-vmware and xinetd. That fixed it, but I’m not happy with VC 2.5, not at all. If it wasn’t for checking esxtop, we would have rebooted over 50 guests. ‘vmware-cmd -l’ did not work or show the registered guests.
VMware looked at our logs; everything is still inconclusive.
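When “vmware-cmd -l” itself is broken, as in the comment above, one last-resort sanity check is the process table: on ESX 3.x classic, each running VM has a vmware-vmx worker process visible from the service console. A minimal sketch (the process name is what ESX 3.x used; hedged accordingly):

```shell
# Hedged sketch: count vmware-vmx worker processes as a proxy for running
# VMs when the management tools won't answer. The [v] trick stops grep
# from matching its own command line.
running=$(ps -ef | grep '[v]mware-vmx' | wc -l)
echo "vmware-vmx processes on this host: $running"
if [ "$running" -gt 0 ]; then
  echo "Do NOT reboot: VMs appear to be running here."
fi
```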
kimono says
I could tell you half a dozen stories where an ESX host sends VirtualCenter a poisoned pill and crashes it. Most are a total nightmare to root-cause and fix.
Most recently, would you believe a template attached to a host in a cluster was crashing the VirtualCenter service every time it started? I had to dig into the hostd logs on the host to find that it was throwing exceptions at the same time as the crash.
Many other stories like this one.
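The hostd-log digging described above can be sketched as a simple filter. The log path is where ESX 3.x kept hostd logs, and the search terms are assumptions for illustration; in practice you’d narrow by the timestamp of the VirtualCenter crash:

```shell
# Hedged sketch: look for host-side faults in the hostd log that line up
# with a VirtualCenter service crash. Path and keywords are assumptions.
LOG=/var/log/vmware/hostd.log
if [ -f "$LOG" ]; then
  grep -iE 'exception|backtrace|panic' "$LOG" | tail -n 20
else
  echo "no hostd log at $LOG"
fi
```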
There **needs** to be something like a debug mode for VirtualCenter that lets you start it up isolated, with all hosts disconnected, and then selectively add the hosts back one host or cluster at a time.