I got a question about VM Monitoring (aka virtual machine level HA) this week. A customer wanted to test whether VM Monitoring worked, so they disabled the NIC of the virtual machine and waited 30 seconds for the VM Monitoring response to kick in… nothing happened.
VM Monitoring restarts individual virtual machines when needed. It uses a concept similar to HA: heartbeats. If heartbeats, in this case VMware Tools heartbeats, are not received for a specific amount of time, the virtual machine will be rebooted. This happens, for instance, when a Windows virtual machine shows a BSOD.
The big question of course was why didn’t this trigger a response?
The answer is simple: The VMware Tools heartbeat does not use the virtual machine NIC. This heartbeat is “caught” by hostd and passed on to vCenter. vCenter uses this to show those “green/yellow/red” alarm dots. The same heartbeat is used by VM Monitoring to detect the failure of a virtual machine. Even without any NIC attached to your virtual machine these heartbeats will still be received.
One thing to keep in mind though: heartbeats are sent out every second by default, and when they are no longer received, VM Monitoring will first check whether there is any network or storage I/O, to avoid false positives.
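The decision described above can be sketched in a few lines of Python. This is only an illustration of the logic, not how HA implements it: `VmSample`, `should_reset`, and both default values are hypothetical names and numbers chosen for this example (30 seconds mirrors the wait in the customer's test, and the I/O look-back window is an assumption).

```python
from dataclasses import dataclass


@dataclass
class VmSample:
    """One observation of a VM, as this sketch imagines it (hypothetical type)."""
    seconds_since_last_heartbeat: float  # time since the last VMware Tools heartbeat
    seconds_since_last_io: float         # time since any network or storage I/O


def should_reset(sample: VmSample,
                 failure_interval: float = 30.0,
                 io_window: float = 120.0) -> bool:
    """Return True only if heartbeats AND I/O have both gone quiet.

    failure_interval: how long heartbeats may be missing (assumed value).
    io_window:        how far back to look for I/O (assumed value).
    """
    heartbeat_lost = sample.seconds_since_last_heartbeat >= failure_interval
    if not heartbeat_lost:
        return False  # heartbeats still arriving, nothing to do
    io_seen_recently = sample.seconds_since_last_io < io_window
    return not io_seen_recently  # recent I/O means "probably a false positive"


# NIC disabled but VMware Tools still running: heartbeats keep arriving.
print(should_reset(VmSample(1.0, 600.0)))    # False
# Heartbeats lost, but the VM is still doing I/O.
print(should_reset(VmSample(45.0, 5.0)))     # False
# Heartbeats lost and no I/O at all.
print(should_reset(VmSample(45.0, 300.0)))   # True
```

The first case is the customer's test: disabling the NIC does not stop the heartbeat, so nothing is triggered.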
Question for you guys! One thing that I always wondered is how many people use VM Monitoring? And if you use it, do you use it on all VMs in every cluster?
NiTRo says
We use it on every cluster with these settings (we encountered some resets during long reboot processes):
Failure interval = 300 sec
Min uptime = 300 sec
Max per-VM resets = 3
Max reset time window = 1 hour
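As a rough illustration of how the four settings above could interact, here is a minimal Python sketch. `ResetPolicy` and `allow_reset` are hypothetical names, and the sliding-window bookkeeping is an assumption made for the example; it is not taken from the product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResetPolicy:
    failure_interval: float = 300.0   # seconds without heartbeats before acting
    min_uptime: float = 300.0         # grace period after power-on or reset
    max_resets: int = 3               # maximum resets per VM ...
    reset_window: float = 3600.0      # ... within this many seconds (1 hour)
    reset_times: List[float] = field(default_factory=list)

    def allow_reset(self, now: float, uptime: float, silent_for: float) -> bool:
        """Decide whether a reset is permitted at time `now` (in seconds)."""
        if uptime < self.min_uptime:
            return False  # VM may still be booting
        if silent_for < self.failure_interval:
            return False  # heartbeats not missing for long enough
        # Keep only the resets that still fall inside the window (assumed sliding).
        self.reset_times = [t for t in self.reset_times
                            if now - t < self.reset_window]
        if len(self.reset_times) >= self.max_resets:
            return False  # reset budget for this window is used up
        self.reset_times.append(now)
        return True


policy = ResetPolicy()
for now in (1000, 1400, 1800, 2200):
    print(now, policy.allow_reset(now, uptime=400, silent_for=400))
```

With these values the first three calls are allowed and the fourth is refused, because three resets have already happened within the hour.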
Matt Liebowitz says
We don’t use this for the most part. We try to use application-aware HA methods (such as clustering) for most applications like Exchange, SQL, etc. Having never used it, I never developed a comfort level with it, but maybe now is the time to start testing it.
ibeerens says
I don’t enable it by default anymore; some customers had a lot of problems with the VM Monitoring bug in version 3.x. I only turn it on if the customer wants it.
PiroNet says
Good you clarified the issue.
I noticed that I have far fewer BSODs since I virtualized Windows than when it was running on physical servers, doh!
I’ve also noticed that Windows servers that were P2V’ed are more ‘fragile’ and occasionally go into BSOD mode. That is where this HA feature pays off for those types of VMs.
Dennis Bray says
Duncan, I had an issue with a client that has VM Monitoring enabled. Initially, they enabled VM Monitoring on all VMs related to a “mission critical” application. They ran into problems with Java apps in the VM that consumed all the CPU cycles available to the VM, after which vCenter restarted the VM. In this case we were able to work with the app developers/admins to throttle the JVMs so that CPU cycles were left for the guest OS, and to moderate the VM Monitoring sensitivity, thus preventing future “HA events” caused by starving the VMware Tools heartbeat.
Tom says
It sounds like it’s safe to use VM Monitoring in SMB environments, where VMs are not as “heavily used” as in big companies.
Are there any kinds of apps/VMs/servers for which VM Monitoring is NOT a good fit?
Such as VMs that are used only infrequently?
Thank you, Tom
Sketch says
Slow down here… It’s Friday. My initial question is: are you honestly referring to HA? HA does not require vCenter (or just vCenter within an HA cluster?). I’m a bit confused here, as vCenter heartbeat and HA are two separate items… which one are we talking about? If we’re talking about a vCenter server in an HA cluster, you’d have to take the entire host ‘offline’, and that is related to pNICs… or am I just WAY off here?? Did I mention it’s Friday?
Tom says
TGIF 🙂
Arseny says
Maybe a little bit off-topic, but my 2 cents are:
We at Veeam quite often see a situation with guest heartbeat flip-flops…
If this flip-flop is detected by nworks MP for SCOM or SPI for OpenView, it can generate 1000s of alerts a day in larger environments. To confirm the problem we recommend going to the MOB of VC (https://vcenter.tld/mob/?moid=vm-%the-vm-id% ) and checking whether guestHeartbeatStatus flip-flops.
I have seen a frequency of one change every couple of seconds. Not sure if this would affect VM HA functionality.
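If the guestHeartbeatStatus values are noted down at regular intervals, counting the flip-flops is straightforward. The Python sketch below assumes the samples were already collected (by hand from the MOB page or by any other means, which is not shown); `count_flips` is a hypothetical helper and the sample values are made up for illustration.

```python
from typing import Iterable


def count_flips(statuses: Iterable[str]) -> int:
    """Count how often two consecutive samples differ."""
    flips = 0
    previous = None
    for status in statuses:
        if previous is not None and status != previous:
            flips += 1
        previous = status
    return flips


# Made-up samples, one per polling interval.
samples = ["green", "green", "red", "green", "red", "red", "green"]
print(count_flips(samples))  # 4
```

A high count over a short sampling period would point to the flip-flop behaviour described above.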
Daniel Hernandez says
On the PSO side of the house, I’ve seen one client use it before with the advanced settings that NiTRo has implemented. The client’s reasoning for using it was that it helps when an FC HBA failure occurs.
I am currently testing it at an NFS shop to combat an HA issue I have. But at the majority of clients I’ve been to, I normally see it disabled.
Alastair Cooke says
I explain VM monitoring as being like the old Compaq Automatic Server Restart (ASR) feature that would reset a failed OS if it hung for ten minutes.
I think people run into issues when they set VM Monitoring too aggressively.
With conservative settings it could be a big help for flaky VMs. It can also be disabled for the heavy Java VM without turning it off for the whole cluster.
Suttoi says
I was talking with a customer today about how we could test VM HA and referenced this article to back up my argument.
Thanks for making me look good. Everyone should read Yellow Bricks.
On a related topic I would be grateful if you could clarify something…
If an FT-enabled VM encounters a blue screen, does the secondary also blue screen, since they are in lockstep?
I’m guessing that it will, since FT is designed to protect against host failures?
But lots of VMware’s stuff is so close to actual magic that I’m not sure.
Kelley says
Wow, I would like an answer to Suttoi’s question. Anyone still following this thread?
Duncan Epping says
Yes it will.
david says
Question: we have a virtual machine with no VMware Tools installed, but we see that this machine sometimes gets reset because of a VMware Tools heartbeat failure. Now I am wondering whether VM Monitoring also checks machines with no VMware Tools!?
Todd Wright says
If you want to test this with a Windows VM you can try… http://www.wikihow.com/Force-a-Blue-Screen-in-Windows
Vikash Kumar Jha says
Hi Duncan
We found that there are some idle Windows 7 VMs which are showing high CPU utilization. When we logged in to the machines we found that different services are causing the high CPU utilization in different VMs. These services are:
1. Msiexec.exe
2. Spoolsv.exe
3. Savservice.exe
4. Dgagent.exe
5. Iexplorer.exe
6. System idle process
7. Screen agent
What are we supposed to do here to stop the alerts?
We have 12 hosts, with 164 GB of RAM and 12 CPUs in each host, and on average each host runs 85-90 Windows 7 VMs.
Duncan says
Not sure why this is happening. I haven’t experienced it. Were these VMs created from scratch or P2V’d?
Vikash Kumar Jha says
These VMs were created from scratch. I also want to know whether there is any way to look inside the idle VMs without logging in to them, to see what is increasing the load.
Vikash Kumar Jha says
Hi
Is there any way to see and kill the processes that are causing high CPU utilization in the idle VMs?
Chance Ballato says
Hello, you used to write fantastic, but the last few posts have been kinda boring… I miss your tremendous writings. Past several posts are just a little out of track! come on!
Justin McD says
I just enabled VM Monitoring on vSphere 5.1 and it is working great for me as a workaround for a 2008 R2 VM that keeps becoming unresponsive nearly once a day. VMware Tools stops running; you can still ping the VM and telnet to the RDP port, but RDP doesn’t actually work. I have tried changing the video drivers and reinstalling VMware Tools, and nothing works. But using this allows the VM reset to be automated instead of manual, which means less downtime for the application running in the VM. This is my first time using it, so hopefully I won’t run into any other issues later.
Suresh Siwach says
Hi Duncan,
The use case for VM heartbeat monitoring is when folks don’t have access to the VM, can’t install their monitoring tool on it or change its configuration, and still want to monitor the virtual machine.