VM Monitoring (aka VM HA) heartbeat

Duncan Epping · Jun 4, 2010

I got a question about VM Monitoring (aka virtual machine level HA) this week. A customer wanted to test whether VM Monitoring worked, so they disabled the NIC of the virtual machine and waited 30 seconds for the VM Monitoring response to kick in… nothing happened.

VM Monitoring restarts individual virtual machines when needed. It uses a concept similar to HA: heartbeats. If heartbeats, in this case VMware Tools heartbeats, are not received for a specific amount of time, the virtual machine will be rebooted. An example of when this happens is a Windows virtual machine showing a BSOD.

The big question, of course, was: why didn’t this trigger a response?

The answer is simple: The VMware Tools heartbeat does not use the virtual machine NIC. This heartbeat is “caught” by hostd and passed on to vCenter. vCenter uses this to show those “green/yellow/red” alarm dots. The same heartbeat is used by VM Monitoring to detect the failure of a virtual machine. Even without any NIC attached to your virtual machine these heartbeats will still be received.

One thing to keep in mind, though: when heartbeats (sent out every second by default) are no longer received, VM Monitoring will first check whether there is any network or storage I/O, to avoid false positives.
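The decision described above can be sketched roughly as follows. This is only an illustration of the logic, not VMware's implementation: the names VmSample and should_reset are made up for this example, and the 30-second failure interval is an assumption taken from the customer's test described earlier.

```python
from dataclasses import dataclass


@dataclass
class VmSample:
    """One observation of a VM (hypothetical structure for illustration)."""
    seconds_since_last_heartbeat: float  # time since the last VMware Tools heartbeat
    network_io_seen: bool                # any network I/O observed during the check
    storage_io_seen: bool                # any storage I/O observed during the check


def should_reset(sample: VmSample, failure_interval: float = 30.0) -> bool:
    """Return True only if heartbeats have stopped AND the VM shows no I/O.

    failure_interval is an assumed value, not a documented default.
    """
    heartbeats_missing = sample.seconds_since_last_heartbeat >= failure_interval
    if not heartbeats_missing:
        return False
    # Heartbeats are gone: look for network or storage I/O to avoid a false positive.
    return not (sample.network_io_seen or sample.storage_io_seen)


# Disabling the NIC alone changes nothing: heartbeats keep arriving.
nic_disabled = VmSample(seconds_since_last_heartbeat=1.0,
                        network_io_seen=False, storage_io_seen=True)
# A hung guest (for example a BSOD): no heartbeats and no I/O.
hung_guest = VmSample(seconds_since_last_heartbeat=45.0,
                      network_io_seen=False, storage_io_seen=False)
# VMware Tools stopped, but the guest is still doing storage I/O.
tools_stopped = VmSample(seconds_since_last_heartbeat=45.0,
                         network_io_seen=False, storage_io_seen=True)
```

In this sketch only hung_guest would be reset. The NIC test from the customer never gets past the first check, because the heartbeat does not travel over the virtual machine NIC.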

Question for you guys! One thing that I always wondered is how many people use VM Monitoring? And if you use it, do you use it on all VMs in every cluster?

Comments

NiTRo says

4 June, 2010 at 14:19

We use it on every cluster with these settings (we encountered some resets during long reboot processes):
Failure interval = 300 sec
Min uptime = 300 sec
Max per-VM resets = 3
Max reset time window = 1 hour
Matt Liebowitz says

4 June, 2010 at 14:24

We don’t use this for the most part. We try to use application-aware HA methods (such as clustering) for most applications like Exchange, SQL, etc. Having never used it, I never developed a comfort level with it, but maybe now is the time to start testing it.
ibeerens says

4 June, 2010 at 14:27

I don’t enable it by default anymore; some customers had a lot of problems with the VM Monitoring bug in version 3.x. I only turn it on if the customer wants it.
PiroNet says

4 June, 2010 at 14:32

Good you clarified the issue.

I noticed that I have far fewer BSODs since I virtualized Windows than when it was running on physical servers, doh!

I’ve also noticed that Windows servers that were P2V’ed are more ‘fragile’ and occasionally go into BSOD mode. That is where this HA feature pays off for those types of VMs.
Dennis Bray says

4 June, 2010 at 14:35

Duncan, I had an issue with a client that has VM Monitoring enabled. Initially, they enabled VM Monitoring on all VMs related to a “mission critical” application. They ran into problems with Java apps running in the VM that consumed all the CPU cycles available to the VM, after which vCenter restarted the VM. In this case we were able to work with the app developers/admins to throttle the JVMs so that CPU cycles were left for the guest OS, and to moderate the VM Monitoring sensitivity, thus preventing future “HA events” caused by starving the VMware Tools heartbeat.
Tom says

4 June, 2010 at 14:40

It sounds like it’s safe to use VM Monitoring in SMB environments, where VMs are not as “heavily used” as in big companies.

Are there any kinds of apps/VMs/servers for which VM Monitoring is NOT a good fit?

Such as VMs used only infrequently?

Thank you, Tom
Sketch says

4 June, 2010 at 14:46

Slow down here… it’s Friday. My initial question is: are you honestly referring to HA? HA does not require vCenter (or just vCenter within an HA cluster?). I’m a bit confused here, as vCenter heartbeat and HA are two separate items… which one are we talking about? If we’re talking about a vCenter server in an HA cluster, you’d have to take the entire host ‘offline’, and that is related to pNICs… or am I just WAY off here? Did I mention it’s Friday?
Tom says

4 June, 2010 at 14:47

TGIF 🙂
Arseny says

4 June, 2010 at 20:14

Maybe a little bit off-topic, but my 2 cents are:

We at Veeam quite often see guest heartbeat flip-flops…

If this flip-flop is detected by nworks MP for SCOM or SPI for OpenView, it can generate 1000s of alerts a day in larger environments. To confirm the problem, we recommend going to the MOB of vCenter (https://vcenter.tld/mob/?moid=vm-%the-vm-id% ) and checking whether guestHeartbeatStatus flip-flops.

I have seen a frequency of one change every couple of seconds. Not sure if this would affect VM HA functionality.
Daniel Hernandez says

4 June, 2010 at 20:42

On the PSO side of the house, I’ve seen one client use it with the advanced settings that NiTRo has implemented. The client’s reasoning for using it was that it helps when an FC HBA failure occurs.
I am currently testing it at an NFS shop to combat an HA issue I have. But at the majority of clients I’ve been to, I normally see it disabled.
Alastair Cooke says

9 June, 2010 at 06:53

I explain VM monitoring as being like the old Compaq Automatic Server Restart (ASR) feature that would reset a failed OS if it hung for ten minutes.

I think people run into issues when they set VM Monitoring too aggressively.

With conservative settings it could be a big help for flaky VMs. It can also be disabled for the heavy Java VM without turning it off for the whole cluster.
Suttoi says

10 June, 2010 at 19:30

I was talking with a customer today about how we could test VM HA and referenced this article to back up my argument.

Thanks for making me look good. Everyone should read Yellow Bricks.

On a related topic I would be grateful if you could clarify something…

If an FT-enabled VM encounters a blue screen, does the secondary also blue screen, since they are in lockstep?

I’m guessing that it will, since FT is designed to protect against host failures?
But lots of VMware’s stuff is so close to actual magic that I’m not sure.
Kelley says

19 November, 2010 at 20:22

Wow, I would like an answer to Suttoi’s question. Anyone still following this thread?
Duncan Epping says

19 November, 2010 at 21:44

Yes, it will.
david says

17 March, 2011 at 16:39

Question: we have a virtual machine with no VMware Tools installed, but we see that this machine sometimes gets reset because of a VMware Tools heartbeat failure. Now I am wondering whether VM Monitoring also checks machines without VMware Tools!?
Todd Wright says

11 April, 2011 at 19:42

If you want to test this with a Windows VM you can try… http://www.wikihow.com/Force-a-Blue-Screen-in-Windows
Vikash Kumar Jha says

4 May, 2012 at 20:47

Hi Duncan

We found that some idle Windows 7 VMs are showing high CPU utilization. When we logged in to the machines, we found that different services are causing the high CPU utilization in different VMs. These services are:
1. Msiexec.exe
2. Spoolsv.exe
3. Savservice.exe
4. Dgagent.exe
5. Iexplorer.exe
6. System idle process
7. Screen agent

What are we supposed to do here to stop the alerts?
We have 12 hosts, with 164 GB of RAM and 12 CPUs per host, and on average each host has 85-90 Windows 7 VMs.
Duncan says

4 May, 2012 at 21:46

Not sure why this is happening. I haven’t experienced it. Were these VMs created from scratch or P2V’d?
Vikash Kumar Jha says

22 May, 2012 at 14:54

These VMs were created from scratch. I would also like to know whether there is any way to look inside the idle VMs without logging in to them, to see what is increasing the load.
Vikash Kumar Jha says

27 May, 2012 at 17:40

Hi

Is there any way to see and kill the processes that are causing high CPU utilization in the idle VMs?
Chance Ballato says

23 June, 2012 at 13:36

Hello, you used to write fantastic, but the last few posts have been kinda boring… I miss your tremendous writings. Past several posts are just a little out of track! come on!
Justin McD says

14 December, 2012 at 22:42

I just enabled VM Monitoring on vSphere 5.1 and it is working great for me as a workaround for a 2008 R2 VM that becomes unresponsive nearly once a day. VMware Tools stops running; you can still ping the VM and telnet to the RDP port, but RDP doesn’t actually work. I have tried changing the video drivers and reinstalling VMware Tools, and nothing works. But using this allows the VM reset to be automated instead of manual, which means less downtime for the application running in the VM. This is my first time using it, so hopefully I won’t run into any other issues later.
Suresh Siwach says

29 July, 2014 at 18:14

Hi Duncan,

The use case for VM heartbeat monitoring is when folks don’t have access to the VM, so they can’t install their monitoring tool on it or change their configuration, and they still want to monitor the virtual machine.
