I got a question about VM Monitoring (aka virtual machine level HA) this week. A customer wanted to test whether VM Monitoring worked, so they disabled the NIC of the virtual machine and waited 30 seconds for the VM Monitoring response to kick in. Nothing happened.
VM Monitoring restarts individual virtual machines when needed. It uses a concept similar to HA: heartbeats. If heartbeats, in this case VMware Tools heartbeats, are not received for a specific amount of time, the virtual machine will be rebooted. An example of when this happens is a Windows virtual machine showing a BSOD.
The big question of course was why didn’t this trigger a response?
The answer is simple: The VMware Tools heartbeat does not use the virtual machine NIC. This heartbeat is “caught” by hostd and passed on to vCenter. vCenter uses this to show those “green/yellow/red” alarm dots. The same heartbeat is used by VM Monitoring to detect the failure of a virtual machine. Even without any NIC attached to your virtual machine these heartbeats will still be received.
One thing to keep in mind though: heartbeats are sent out every second by default, and when they are no longer received, VM Monitoring will first check whether there is any network or storage I/O, to avoid false positives.
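To make that decision concrete, here is a minimal sketch of the logic as described above. It is not VMware's implementation; the names `VmSample` and `should_reset` are made up for illustration, and the 30 second default is simply the interval the customer waited for.

```python
from dataclasses import dataclass


@dataclass
class VmSample:
    """Hypothetical snapshot of what is known about one VM."""
    seconds_since_last_heartbeat: float
    io_bytes_in_window: int  # network + storage I/O seen in the check window


def should_reset(sample: VmSample, failure_interval_s: float = 30.0) -> bool:
    """Return True only if heartbeats stopped AND the VM shows no I/O."""
    # Heartbeats are still arriving within the failure interval: VM is fine.
    if sample.seconds_since_last_heartbeat < failure_interval_s:
        return False
    # Heartbeats stopped, but the VM still does network or storage I/O:
    # treat it as a false positive and leave the VM alone.
    if sample.io_bytes_in_window > 0:
        return False
    # No heartbeats and no I/O: the guest is most likely hung.
    return True
```

Note that a disabled NIC never shows up in this logic: the VMware Tools heartbeats keep arriving, so the first check already returns `False`.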
Question for you guys! One thing that I always wondered is how many people use VM Monitoring? And if you use it, do you use it on all VMs in every cluster?






We use it on every cluster with these settings (we encountered some resets during long reboot processes):
Failure interval = 300 sec
Min uptime = 300 sec
Max per-VM resets = 3
Max resets time window = 1 hour
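For readers wondering how these four values interact, here is a rough sketch. It assumes the usual reading of the settings (no monitoring until the VM has been up for the minimum uptime, and at most N resets inside the time window); the class and function names are hypothetical, not a VMware API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MonitoringSettings:
    """The values quoted above, all in seconds except max_resets."""
    failure_interval_s: int = 300
    min_uptime_s: int = 300
    max_resets: int = 3
    reset_window_s: int = 3600


def reset_allowed(settings: MonitoringSettings,
                  uptime_s: float,
                  seconds_without_heartbeat: float,
                  past_reset_times: Sequence[float],
                  now: float) -> bool:
    """Decide whether one more reset of this VM would be permitted."""
    # Still booting (for example a long reboot): do not monitor yet.
    if uptime_s < settings.min_uptime_s:
        return False
    # Heartbeats have not been missing long enough to count as a failure.
    if seconds_without_heartbeat < settings.failure_interval_s:
        return False
    # Count only the resets that fall inside the sliding time window.
    recent = [t for t in past_reset_times
              if now - t <= settings.reset_window_s]
    return len(recent) < settings.max_resets
```

With these values a VM that is 2 minutes into a slow reboot is left alone, and a VM that was already reset 3 times in the last hour is not reset a fourth time.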
We don’t use this for the most part. We try to use application aware HA methods (such as clustering) for most applications like Exchange, SQL, etc. Having never used it I never developed a comfort level with it, but maybe now is the time to start testing it..
I don’t enable it by default anymore; some customers had a lot of problems with the VM Monitoring bug in version 3.x. I only turn it on if the customer wants it.
Good you clarified the issue.
I noticed that I have far fewer BSODs since I virtualized Windows than when it was running on physical servers, doh!
I’ve also noticed that Windows servers that were P2V’ed are more ‘fragile’ and occasionally go into BSOD mode. That is where this HA feature pays off for those types of VMs.
Duncan, I had an issue with a client that has VM Monitoring enabled. Initially, they enabled VM Monitoring on all VMs related to a “mission critical” application. They ran into problems with Java apps running in the VM that consumed all the CPU cycles available to the VM, after which vCenter restarted the VM. In this case we were able to work with the app developers/admins to throttle the JVMs to leave CPU cycles for the guest OS, and to moderate the VM Monitoring sensitivity, thus preventing future “HA events” caused by starving the VMware Tools heartbeat.
It sounds like it’s safe to use VM Monitoring in SMB environments, where VMs are not as “heavily used” as in big companies.
Are there any kinds of apps/VMs/servers that are NOT a good fit for VM Monitoring?
Such as VMs used only infrequently?
Thank you, Tom
Slow down here… it’s Friday. My initial question is: are you honestly referring to HA? HA does not require vCenter (or just vCenter within an HA cluster?). I’m a bit confused here, as vCenter heartbeat and HA are two separate items… which one are we talking about? If we’re talking about a vCenter server in an HA cluster, you’d have to take the entire host ‘offline’, and that is related to pNICs… or am I just WAY off here? Did I mention it’s Friday?
TGIF
Maybe a little bit off-topic, but my 2 cents:
We at Veeam quite often see guest heartbeat flip-flops…
If this flip-flop is detected by nworks MP for SCOM or SPI for OpenView, it can generate 1000s of alerts a day in larger environments. To confirm the problem, we recommend going to the MOB of VC (https://vcenter.tld/mob/?moid=vm-%the-vm-id% ) and checking whether guestHeartbeatStatus flip-flops.
I have seen a frequency of one change every couple of seconds. Not sure if this would affect VM HA functionality.
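If you note down the guestHeartbeatStatus values you see in the MOB over time, counting the flip-flops is trivial. A small sketch, assuming the statuses are written down as plain strings such as "green" and "red" (the function name `count_flips` is made up for this example):

```python
from typing import Sequence


def count_flips(samples: Sequence[str]) -> int:
    """Count how often the status differs from the previous sample."""
    # Pair every sample with its successor and count the changes.
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


# Statuses noted down by hand at a fixed polling interval (made-up data).
observed = ["green", "green", "red", "green", "green"]
flips = count_flips(observed)
```

A steady VM gives 0 flips; a count that keeps climbing between polls is the flip-flop behaviour described above.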
On the PSO side of the house, I’ve seen one client use it with the advanced settings that Nitro has implemented. The client’s reasoning for using it was that it helps when an FC HBA failure occurs.
I am currently testing it at an NFS shop to combat an HA issue I have. But at the majority of clients I’ve been to, I normally see it disabled.
I explain VM monitoring as being like the old Compaq Automatic Server Restart (ASR) feature that would reset a failed OS if it hung for ten minutes.
I think people run into issues when they set VM monitoring too aggressively.
With conservative settings it could be a big help for flaky VMs. It can also be disabled for the heavy Java VM without turning it off for the whole cluster.
Was talking with a customer today about how we could test VM HA and referenced this article to back up my argument.
Thanks for making me look good. Everyone should read Yellow Bricks.
On a related topic I would be grateful if you could clarify something…
If a FT enabled VM encounters a blue screen, since they are in lock step, does the secondary also blue screen?
I’m guessing that it will, since FT is designed to protect against host failures?
But lots of VMware’s stuff is so close to actual magic that I’m not sure.
Wow, I would like an answer to Suttoi’s question. Anyone still following this thread?
Yes, it will.
Question: we have a virtual machine with no VMware Tools installed, but we see that this machine sometimes gets reset because of a VMware Tools heartbeat failure. Now I am wondering whether VM Monitoring also checks machines without VMware Tools!?
If you want to test this with a Windows VM you can try… http://www.wikihow.com/Force-a-Blue-Screen-in-Windows