
Yellow Bricks

by Duncan Epping



RE: Maximum Hosts Per Cluster (Scott Drummonds)

Duncan Epping · Nov 29, 2010 ·

I love blogging because of the discussions you sometimes get into. One of the bloggers I highly respect and closely follow is EMC’s vSpecialist Scott Drummonds (former VMware Performance Guru). Scott posted a question on his blog about what the size of a cluster should be. Scott discussed this with Dave Korsunksy and Dan Anderson, both VMware employees, and more or less came to the conclusion that 10 is probably a good number.

So, have I given a recommendation? I am not sure. If anything I feel that Dave, Dan and I believe that a minimum cluster size should be set to guarantee that the CPU utilization target, and not the HA failover capacity, is what defines the number of wasted resources. This means a minimum cluster of something like four or five hosts. While none of us claims a specific problem that will occur with very large clusters, we cannot imagine the value of a 32-host cluster. So, we think the right cluster size is somewhere shy of 10.

And of course they have a whole bunch of arguments for both large (12+) and small (8 or fewer) clusters, which I summarized below for your convenience:

  • Pro Large: DRS efficiency.  This was my primary claim in favor of 32-host clusters.  My reasoning is simple: with more hosts in the cluster there are more CPU and memory resource holes into which DRS can place running virtual machines to optimize the cluster’s performance.  The more hosts, the more options to the scheduler.
  • Pro Small: DRS does not make scheduling decisions based on the performance characteristics of the server, so a new, powerful server in a cluster is just as likely to receive a mission-critical virtual machine as an older, slower host.  This would be unfortunate if a cluster contained servers with radically different–although EVC compatible–CPUs like the Intel Xeon 5400 and Xeon 5500 series.
  • Pro Small: By putting your mission-critical applications in a cluster of their own your “server huggers” will sleep better at night.  They will be able to keep one eye on the iron that can make or break their job.
  • Pro Small: The cumbersome nature of change control.  Clusters have to be managed to a consistent state, and the complexity of this process depends on the number of items being managed.  A very large cluster will present unique challenges when managing change.
  • Pro Small: To size a 4+1 cluster to 80% utilization after a host failure, you will want to restrict CPU usage in the five hosts to 64%.  Going to a 5+1 cluster results in a pre-failure CPU utilization target of 66%.  The targets slowly approach 80% as the clusters get larger and larger.  But, you can see that the incremental resource utilization improvement is never more than 2%.  So, growing a cluster slightly provides very little value in terms of resource utilization. (See the quick calculation sketch right after this list.)
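For those who want to see the arithmetic behind that last bullet, here is a minimal sketch (my own, not Scott's) assuming a simple N+1 admission-control model: the pre-failure utilization target is whatever leaves the surviving hosts at the 80% post-failure target.

```python
def pre_failure_target(hosts, post_failure_target=0.80, failures_tolerated=1):
    """Average utilization you can run at before a failure so that the
    surviving hosts end up at (or below) the post-failure target."""
    surviving = hosts - failures_tolerated
    return post_failure_target * surviving / hosts

previous = None
for hosts in range(5, 33):  # 4+1 up to 31+1
    target = pre_failure_target(hosts)
    gain = 0.0 if previous is None else target - previous
    print(f"{hosts - 1}+1 hosts: run at {target:.1%} (gain over previous size: {gain:+.1%})")
    previous = target
```

The diminishing returns are exactly what Scott describes: the step from 4+1 to 5+1 buys a couple of percentage points, and every step after that buys less.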

It is probably an endless debate, and the arguments for both “Pro Large” and “Pro Small” are all very valid, although I seriously disagree with their conclusion of not seeing the value of a 32-host cluster. As always, it fully depends. On what, you might say; why would you ever want a 32-host cluster? Well, for instance when you are deploying vCloud Director. Clusters are currently the boundary for your vDC, and who wants to give their customer 6 vDCs instead of just 1 because the cluster size was limited to 6 hosts instead of leaving the option open to go to the max? This might just be an exception and nowhere near reality for some of you, but I wanted to use it as an example to show that you will need to take many factors into account.
Now I am not saying you should, but at least leave the option open.

One of the arguments I do want to debate is the change control argument. Again, this used to be valid in a lot of Enterprise environments where ESX was used. Now I am deliberately using “ESX” and “Enterprise” here, as the reality is that many companies don’t even have a change control process in place. (I worked for a few large insurance companies which didn’t!) On top of that there is a large discrepancy when it comes to the amount of work associated with patching ESX vs ESXi. I have spent many weekends upgrading ESX, but today I literally spent minutes upgrading ESXi. The impact and risks associated with patching have most certainly decreased with ESXi in combination with VUM and the staging options. On top of that many organizations treat ESXi as an appliance, and with stateless ESXi and the Auto Deploy appliance being around the corner I guess that notion will only grow to become a best practice.

A couple of arguments that I have often seen being used to restrict the size of a cluster are the following:

  • HA limits (a different maximum number of VMs when the cluster has more than 8 hosts)
  • SCSI Reservation Conflicts
  • HA Primary nodes

Let me start with saying that for every new design you create, you should challenge your design considerations and best practices… are they still valid?

The first one is obvious, as most of you know by now that there is no such thing anymore as an 8-host boundary with HA. The second one needs some explanation. Around the VI3 time frame cluster sizes were often limited because of possible storage performance issues. These alleged issues were mainly blamed on SCSI reservation conflicts. The conflicts were caused by having many VMs on a single LUN in a large cluster. Whenever a metadata update was required the LUN would be locked by a host, and this would/could increase overall latency. To avoid this, people would keep the number of VMs per VMFS volume low (10 to 15) and keep the number of VMFS volumes per cluster low, also resulting in a fairly low consolidation factor. But hey, 10:1 beats physical.

Those arguments used to be valid, however things have changed. vSphere 4.1 brought us VAAI, which is a serious game changer in terms of SCSI reservations. I understand that for many storage platforms VAAI is currently not supported… However, the original mechanism used for SCSI reservations has also severely improved over time (optimistic locking), which in my opinion reduced the need to have many small LUNs, an approach which would eventually run into the maximum number of LUNs per host. So with VAAI or optimistic locking, and of course NFS, the argument to have small clusters is not really valid anymore. (Yes, there are exceptions.)

The one design consideration, which is crucial, that is missing in my opinion though is HA node placement. Many have limited their cluster sizes because of hardware and HA primary node constraints. As hopefully known (if not, be ashamed), HA has a maximum of 5 primary nodes in a cluster and a primary is required for restarts to take place. In large clusters the chances of losing all primaries also increase if and when the placement of the hosts is not taken into account. The general consensus usually is: keep your cluster limited to 8 and spread across two racks or chassis, so that each rack always has at least a single primary node to restart VMs. But why would you limit yourself to 8? Why, if you just bought 48 new blades, would you create 6 clusters of 8 hosts instead of 3 clusters of 16 hosts? By simply layering your design you can mitigate all risks associated with primary node placement while benefiting from additional DRS placement options. (Do note that if you “only” have two chassis, your options are limited.)

Which brings us to another thing I wanted to discuss… Scott’s argument against increased DRS placement options was that hundreds of VMs in an 8-host cluster already lead to many placement options. Indeed you will have many load balancing options in an 8-host cluster, but is it enough? In the field I also see a lot of DRS rules. DRS rules restrict the DRS load balancing algorithm when it is looking for suitable options, and as such more opportunities will more than likely result in a better balanced cluster. Heck, I have even seen cluster imbalances which could not be resolved due to DRS rules in a five-host cluster with 70 VMs.
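To put a rough number on the primary node placement risk mentioned above: assuming, purely for the sake of illustration, that the five primaries end up on random hosts (in reality HA picks the first hosts that join the cluster, so treat this as a ballpark), a few lines of code show how likely it is that a single chassis holds all of them for a given layout.

```python
from math import comb

def p_all_primaries_in_one_chassis(hosts_per_chassis, chassis, primaries=5):
    """Chance that all HA primary nodes land inside a single chassis,
    assuming a uniformly random placement of the primaries."""
    if hosts_per_chassis < primaries:
        return 0.0  # one chassis cannot even hold all five primaries
    total_hosts = hosts_per_chassis * chassis
    return chassis * comb(hosts_per_chassis, primaries) / comb(total_hosts, primaries)

for hosts_per_chassis, chassis in [(4, 2), (8, 2), (4, 4), (8, 4)]:
    p = p_all_primaries_in_one_chassis(hosts_per_chassis, chassis)
    print(f"{hosts_per_chassis * chassis} hosts across {chassis} chassis: {p:.2%}")
```

With four hosts per chassis a single chassis can never hold all five primaries, whether the cluster has 8 or 16 hosts; it is the number of failure domains, not the cluster size, that does the work here.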

Don’t get me wrong, I am not advocating going big… but neither am I advocating a limited cluster size for reasons that might not even apply to your environment. Write down the requirements of your customer or your environment and don’t limit yourself to design considerations around compute alone. Think about storage, networking, update management, configuration maximums, DRS & DPM, HA, resource and operational overhead.

vStorage APIs for Array Integration aka VAAI

Duncan Epping · Nov 23, 2010 ·

It seems that a lot of vendors are starting to update their firmware to enable virtualized workloads to benefit from the vStorage APIs for Array Integration, also known as VAAI. Not only are the vendors starting to show interest, the bloggers are picking up on it as well. Hence the reason I wanted to reiterate some of the excellent details out there and make sure everyone understands what VAAI brings. Although there are currently “only” three major improvements, they can and probably will make a huge difference:

  1. Hardware Offloaded Copy
    Up to 10x faster VM deployment, cloning, Storage vMotion etc. VAAI offloads the copy task to the array, enabling the use of native storage-based mechanisms, resulting in a decrease of deployment time but, equally important, a reduction of the amount of data flowing between the array and the server. Check this post by Bob Plankers and this one by Matt Liebowitz, which clearly demonstrate the power of hardware offloaded copies! (Reducing cloning from 19 minutes to 6 minutes!)
  2. Write Same/Zero
    10x less I/O for common tasks. Take for instance a zero-out process: it typically sends the same SCSI command many times. By enabling this option the repetition of the command is handled by the storage platform, resulting in reduced utilization of the server while decreasing the time span of the action.
  3. Hardware Offloaded Locking
    SCSI reservation conflicts… How many times have I heard about those during health checks, design reviews and while troubleshooting performance related issues. Well, VAAI solves those issues as well by offloading the locking mechanism to the array, also known as Atomic Test & Set, aka ATS. It will more than likely reduce latency in environments where thin-provisioned disks, linked clones, or even VMware-based snapshots are used. ATS removes the need to lock the full VMFS volume and instead locks a block when an update needs to occur. (See the simple analogy sketched after this list.)
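To make the granularity difference concrete, here is a loose analogy in code; it has nothing to do with the actual VMFS on-disk implementation, it just contrasts serializing every metadata update behind one volume-wide lock with only contending when two hosts touch the same block.

```python
import threading

class Datastore:
    """Conceptual analogy: SCSI-2 reservations behave like one volume-wide
    lock, while ATS behaves like a lock on the individual block."""
    def __init__(self, blocks=1024, per_block_locking=False):
        self.per_block_locking = per_block_locking
        self.volume_lock = threading.Lock()                           # "reserve the LUN"
        self.block_locks = [threading.Lock() for _ in range(blocks)]  # "ATS"

    def update_metadata(self, block):
        lock = self.block_locks[block] if self.per_block_locking else self.volume_lock
        with lock:
            pass  # metadata update happens here; other blocks stay untouched

# With per_block_locking=True, hosts updating different blocks no longer
# queue up behind a single volume-wide lock.
```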

One thing I wanted to point out here, which I haven’t seen mentioned yet, is that VAAI will actually allow you to have larger VMFS volumes. Now don’t get me wrong, I am not saying that you can go beyond 2TB minus 512 bytes by enabling VAAI… My point is that by having VAAI enabled you will reduce the “load” on the array and on the servers. I placed quotes around load as it will not reduce the load from a VM perspective. What I am trying to get at is that many people have limited the number of VMs per VMFS volume because of “SCSI reservation conflicts”. With VAAI this will change. Now you can keep your calculations “simple” and base your VMFS size on the number of eggs you are willing to have in a single basket and the sum of all VMs’ IOPS requirements.
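As a trivial illustration of that “eggs in one basket plus IOPS” way of sizing, here is a back-of-the-envelope sketch; the numbers, and the max_vms_per_volume and volume_iops_budget inputs, are hypothetical values you would replace with your own risk appetite and what your array can actually deliver.

```python
from math import ceil

def vmfs_volumes_needed(vm_iops, max_vms_per_volume, volume_iops_budget):
    """Datastore count bounded by how many eggs you accept in one basket
    and by the aggregate IOPS a single volume can serve."""
    by_count = ceil(len(vm_iops) / max_vms_per_volume)
    by_iops = ceil(sum(vm_iops) / volume_iops_budget)
    return max(by_count, by_iops)

# e.g. 200 VMs at roughly 75 IOPS each, at most 25 VMs per datastore,
# and a LUN good for about 2500 IOPS
print(vmfs_volumes_needed([75] * 200, max_vms_per_volume=25, volume_iops_budget=2500))  # -> 8
```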

After reading about all of this goodness I bet many of you want to use it straight away; of course, your array will need to support it first. Tomi Hakala created a nice list of all storage platforms that are currently supported and those that will be supported soon, including a time frame. If your array is supported, this KB article explains perfectly how to enable/disable it.

I started out by saying that there are currently only three major enhancements… that indeed means there is more coming up in the future. Some of which I can’t discuss and others that I can, as those were already mentioned at VMworld. (If you have access to TA7121, watch it!) I can’t say when they will be available or in which release, but I think it is great to know more enhancements are being worked on.

  • Dead Space Reclamation
    Dead space is previously written blocks that are no longer used by the VM. Currently, in order to reclaim disk space (for instance when you have deleted a lot of files) you will need to zero out these blocks with, for instance, sdelete and then Storage vMotion the VM. Dead Space Reclamation will enable the storage system to reclaim these dead blocks by giving it block liveness information.
  • Out-of-space conditions notifications
    This is very much an improvement for day-to-day operations. It will enable notification of possible “out-of-space” conditions both in the array vendor’s tools and within the vSphere Client!

Must reads:

Chad Sakac – What does VAAI mean to you?
Bob Plankers – If you ever needed convincing about VAAI
AndreTheGiant – VAAI
VMware KB – VAAI FAQ
VMware Support Blog – VAAI changes the way storage is handled
Matt Liebowitz – Exploring the performance benefits of VAAI
Bas Raayman – What is VAAI, and how does it add spice to my life as a VMware admin?

das.slotCpuInMHz and das.slotMemInMB

Duncan Epping · Nov 18, 2010 ·

I was reading some threads on the VMTN forum and noticed a question on an advanced HA setting called “das.slotMemInMB”. The setting is briefly mentioned in my deep-dive, but after re-reading the section I think I could have been more clear in describing what it does, how it works and when to use it. Of course anything that goes for das.slotMemInMB also goes for das.slotCpuInMHz.

This is what I added to the deep-dive but I also wanted to share it through a regular blog to give it a bit more attention:

The advanced settings das.slotCpuInMHz and das.slotMemInMB allow you to specify an upper boundary for your slot size. When one of your VMs has an 8GB reservation, this setting can be used to define for instance an upper boundary of 1GB to avoid resource wastage and an overly conservative slot size. However, when for instance das.slotMemInMB is configured to 2048MB and the highest reservation is 500MB, then the slot size for memory will be 500MB plus the memory overhead. If a lower boundary needs to be specified, the advanced setting “das.vmMemoryMinMB” or “das.vmCpuMinMHz” can be used.
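A minimal sketch of how I read that behaviour; this is my interpretation, not an official formula, and whether the memory overhead is added before or after the boundary is applied is an assumption here:

```python
def memory_slot_size_mb(reservations_mb, overhead_mb,
                        das_slot_mem_in_mb=None, das_vm_memory_min_mb=0):
    """Sketch of the HA memory slot size: driven by the largest reservation
    plus overhead, capped by das.slotMemInMB, floored by das.vmMemoryMinMB."""
    slot = max(reservations_mb) + overhead_mb
    if das_slot_mem_in_mb is not None:
        slot = min(slot, das_slot_mem_in_mb)  # upper boundary
    return max(slot, das_vm_memory_min_mb)    # lower boundary

# the 8GB-reservation example with a 1GB upper boundary
print(memory_slot_size_mb([8192, 512, 0], overhead_mb=100, das_slot_mem_in_mb=1024))  # 1024
# boundary not reached: the slot simply follows the highest reservation plus overhead
print(memory_slot_size_mb([500, 256], overhead_mb=100, das_slot_mem_in_mb=2048))      # 600
```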

VMotion, the story and confessions

Duncan Epping · Nov 11, 2010 ·

There is something I always wanted to know and that is how VMotion (yes I am using the old school name on purpose) came to life. After some research on the internet and even on the internal websites I noticed that there are hardly any details to be found.

Now this might be because the story isn’t as exciting as we hope it would be, or because no one took the time to document it. In my opinion, however, VMotion is still one of the key features VMware offers, but more importantly it is what revolutionized the IT world. I think it is a great part of VMware history and probably the turning point for the company. For me personally, VMotion literally is what made me decide, years ago, to adopt virtualization, and I am certain this goes for many others.

At VMworld I asked around who was mainly responsible for VMotion back in the day, but no one really had a clear answer until I bumped into Kit Colbert. Kit, who was still an intern back then, worked closely with the person who originally developed VMotion. I decided to contact the engineer and asked him if he was willing to share the story, as there are a million myths floating around.

Before I reveal the real story about how VMotion came to life I want to thank Mike Nelson for revolutionizing the world of IT and taking the time to share this with me and allowing me to share it with the rest of the world. Here is the true story of VMotion:

A bunch of us at VMware came from academia where process migration was popular but never worked in any mainstream OS because it had too many external dependencies to take care of.  The VMware platform on the other  hand provided the ability to encapsulate all of the state of a virtual machine.  This was proven with checkpointing; where we were able to checkpoint a virtual machine, copy the state to another host, and then resume it.  It was an obvious next step that if we could checkpoint to disk and resume on another machine that we should be able to checkpoint over the network to another machine and resume.

During the design phase for what would later become Virtual Center a couple of us discussed the notion of virtual machine migration.  I took the lead and wrote up some design notes. I managed to extract myself from the mainline development of ESX 2.0 and I decided to go off and build a virtual machine migration prototype.  I was able to build a prototype fairly quickly because we already had checkpointing support.  However, of course there was a lot more work done by myself and others to turn the prototype into a high quality product.

I needed something to demo it so I used the pinball application on Windows.  The only interactive app I had on my virtual machines was pinball.  I had two machines side by side each with a display. I would start pinball on a virtual machine on one physical machine.  Then I would start the migration and keep playing pinball. When the pre-copy of memory was done it would pause for a second and then resume on the other machine. I would then keep playing pinball on the other machine.

That’s the VMotion story. Basically VMware had built the underlying technology that made VMotion possible.  All it required was someone to take the time to exploit this technology and build VMotion.

-Mike

The funny thing is that although this might have been the obvious next step for VMware engineering, it is something that “shocked” many of us. Most of us will still remember the first time we heard about VMotion or remember it being demoed, and as I stated it is the feature that convinced me to adopt virtualization at large scale, or better said, it is responsible for me ending up here! In my case the demo was fairly “simple” as we VMotioned a Windows VM; however, we had an RDP session open to the VM and of course we were convinced the session would be dropped. I think we did the actual VMotion more than 10 times as we couldn’t believe it actually worked.

Now I am not the only one who was flabbergasted by this great piece of technology of course, hence the reason I reached out to a couple of well-known bloggers and asked if they could tell their VMotion story/confessions…

Cormac Hogan, cormachogan.com

So for me, it happened back around 2004 when I was still at EMC. I was part of a team that provided customer support for the Unix and Linux platforms. I had seen ESX (might have been 2.0) when someone said that I needed to see this new vMotion feature. I didn’t really get it when he said that he had just moved all the VMs from host A to host B. But when he then flicked the power button on host A, and I saw that all the VMs were still running, it sunk in. That was when I knew I needed to work for this company!

Truly eye opening experience, which resulted in a career change. How cool is that 🙂

Chad Sakac, virtualgeek.typepad.com

For me, while I remember being amazed from a generic, bland use case, the “this is going to change everything” moment occurred for me in 2007.

If vMotion is about non-disruptive workload mobility (an amazing concept), where things get crazy cool for me are scenarios where the definitions of “workload” and “mobility” are stretched.

In early 2007, I was in the basement of my house playing with early prototypes of the Celerra VSA running on ESX whiteboxes. It was one of those now-common “Russian doll” scenarios where the host powering the VSA was in turn being supported by an iSCSI LUN being presented by the VSA, which in turn supported other VMs. While intellectually obvious that VMotion **should** work, it was nevertheless amazing to see it in action, with no dropped connection under load.

At that moment, I realized that the workload could be as broad a definition as I wanted, including full blown stacks normally associated with “hardware” like arrays. It was also an “aha” that this could transform a million use cases not normally associated with a server workload.

Ironic side note – the next day, I was showing that concept in the boardroom during a discussion why all our stacks needed to be encapsulated and virtualized. Turns out they were already working on it 🙂

That all said – those “aha moments” happen constantly. Another example – this time more recently – was about stretching the definition of “mobility”. It was during the run-up to VMworld 2010, when we were doing the demo work for the VERY long distance vMotion scenarios with early prototypes of VPLEX Geo. As we dialed up the latency between the ESX hosts on the network and storage, I was very curious as to where it would blow up. When it made it past 44ms RTT (for math/physics folks – that’s the latency equivalent of 13,000km at the speed of light!), it was a “wow” moment (BTW, it blew up at 80ms :-)). I need to point out here that it completely violates the VMware support position (and for many, many good reasons – one “it worked in this narrow case in the lab” does not equal “works in the real world”), so don’t try this at home.

BUT it highlighted how, over time, the idea of non-disruptive workload mobility over what TODAY are considered crazy distances, network, and storage configs will tomorrow be considered normal.

vMotion and svMotion never cease to amaze me.

Nothing less than expected of course: a crazy scenario, and as Chad states it isn’t supported, but it definitely shows the potential of the technology!

Frank Denneman, frankdenneman.nl

During our VCDX sessions in Copenhagen we spoke about things in your life you will always remember. My reply was: seeing Return of the Jedi in the cinema, the fall of the Berlin Wall, 9/11, the murder of Pim Fortuyn, and witnessing vMotion in action for the first time.

I clearly remember my colleague screaming through the wall that separated our offices: “Frank, do you really want to see something cool?” As an MS Exchange admin/architect responsible for a globe-spanning Exchange infrastructure, nothing could really impress me in those days, but giving him the benefit of the doubt I walked over. Peter, sitting there grinning like a madman, offered me a seat, because he thought it was better to sit down. He opened a DOS prompt, triggered a continuous ping and showed the virtual infrastructure, explaining the current location of the virtual machine. As he started to migrate the virtual machine he instructed me to keep tracking the continuous ping; after the one lost ping he explained the virtual machine was up and running on the other host, and to prove it, he powered down the ESX host. I just leaped out of my seat, said some words I cannot repeat online and was basically sold. I think we migrated the virtual machine all day long, inviting anyone who passed by our office to see the best show on earth. No explanation needed of course, but from that point on I was hooked on virtualization and the rest is history.

I still enjoy explaining the technology of vMotion to people and it still ranks in my book as one of the most kick-ass technologies available today. As Mendel explained in the keynote of VMworld 2006, demonstrating the recording of an execution stream (now FT), we have the technology and the platform available to do anything we want; the problem is we still haven’t reached the boundaries of our creativity. I fully concur, and I think we still haven’t reached the full potential of vMotion. Heck, I’m off to my lab just to vMotion a bunch of virtual machines.

Can I thank Peter for introducing Frank to the wonderful world of virtualization?

Mike Laverick, rtfm-ed.co.uk

My first VMotion was a demo of a media server being moved from one ESX host to another – with the buffering switched off. I forget now what movie clip was being shown to the desktops – I think it might have been a Men In Black trailer. Anyway, nothing flickered and nothing stopped – the video just kept on playing without a hiccup.

At that point my mind began to race. I was thinking initially about hardware maintenance, but quickly (this was in the ESX 2 days) began to think of moving VMs around to improve performance, and the possibility of moving VMs across large distances. At the time I told my Microsoft chums all about this, and they were very skeptical. Virtualization, they (mis)informed me, was going to be a flash in the pan, and VMotion was some kind of toy – of course, in a Road to Damascus way, now that Hyper-V supports “Live Migration” it’s an integral part of virtualization. In truth, when I started to demo VMotion to my students I occasionally felt like I was show-boating. This was in the vCenter 1.x days. But in some respects there’s no harm in showboating. It allowed me to demonstrate to students how far ahead VMware was of the competition, and what a visionary company VMware is. It certainly added to my credibility to have a technology that was so easy to set up (so long as you met the basic pre-requisites), and the great thing about VMware and the courses is that you didn’t have to “hard sell” – the product sold itself.

On a more humorous note, I’ve seen all kinds of wacky VMotion setups. I once had two PIII servers with a shared DEC JBOD with SCSI personality cables (circa NT4 clustering configuration) just to get the shared storage running. I managed to get VMotion working with 1997-era equipment. I’ve also been asked by a student – who had laser line-of-sight connectivity between two buildings – if he could VMotion between them. I laughed and said that as long as he could meet the pre-requisites there would be no reason why not, although it would definitely be unsupported. Then I smiled and said that if he ever got VMotion working, I would come round in my Dr Evil outfit to explain – VMotion – with laser.

As I already stated, and as reinforced by Mike… VMotion changed the world, and the fact that both Microsoft and Citrix copied the feature definitely supports that claim… now I am wondering if the VMotion across “laser line-of-sight connectivity” actually worked or not!

Scott Lowe, blog.scottlowe.org

I remember when I first started testing vMotion (then VMotion, of course). I was absolutely sure that it had to be a trick–surely you can’t move a running workload from 1 physical server to another! I performed my first vMotion with just a standard Windows 2000 server build. It worked as expected. So I tried a Citrix Metaframe server with users logged in. It worked, too. Then I tried a file server while copying files to and from the server. Again, it worked. SSH? Worked. Telnet? Worked. Media server with clients streaming content? Web server while users were accessing pages and downloading files? Active Directory? Solaris? Linux? Everything worked. At this point, after days–even weeks–of unsuccessfully trying to make it fail, I was sold. I was officially hooked on virtualization with VMware.

Thanks for the invitation to share memories about vMotion!

It appears that all the top bloggers got hooked on virtualization when they witnessed a VMotion… As I stated at the beginning of this post, VMotion revolutionized the world of IT and I would like to thank VMware and especially Mike Nelson for this great gift! I would also like to thank Scott, Mike, Frank and Chad for sharing their stories, and I bet many of you are currently having flashbacks of when you first witnessed a VMotion.

How does Symantec ApplicationHA integrate/work with VMware HA?

Duncan Epping · Nov 3, 2010 ·

I briefly touched on this topic already, but I guess on Yellow-Bricks the primary focus should be technical. I downloaded Symantec’s ApplicationHA and decided to give it a spin. I had already started documenting the installation of the ApplicationHA Console and the Guest Component, but Mr Sloof beat me to it and documented his findings in the following two articles:

  • ApplicationHA – Install and Configure
  • The ApplicationHA Configuration Wizard – Monitoring SQL Server 2008

No need to repeat it in my opinion, as Eric did an excellent job. I do want to point out two things, and they are around the requirements/constraints of the product:

  • Symantec ApplicationHA Console requires Windows Server 2008 or Windows Server 2008 R2.
    • The Console provides the integration between vCenter and ApplicationHA from a UI perspective. An ApplicationHA tab will be added to the vSphere Client which can be used to configure and control ApplicationHA.
  • A 64-bit guest OS is required for the ApplicationHA Windows Guest Component.
    • This is the actual component that enables application-level protection.

This basically means that if you are primarily using 32-bit OSs, like I for instance do in my home lab, it won’t work, as the installer literally throws an error that the VM doesn’t meet the requirements. That is for now, as I bet that when demand grows Symantec will add support for 32-bit OSs.

So what is this ApplicationHA product?

Symantec ApplicationHA is an extension of VM / Application Monitoring. Symantec simplified Veritas Cluster Server (VCS) to enable application availability monitoring, including of course responding to issues. Just to be absolutely clear: the foundation is still VCS based; only the unnecessary bits and pieces were taken out. Note that it is not a multi-node clustering solution like VCS itself but a single-node solution.

The question remains how does it integrate with vCenter and more specifically with HA?

Let’s start with the integration with vCenter. This is where the “ApplicationHA Console” comes into play. It basically adds a tab at the VM level that gives you the details of that VM in terms of ApplicationHA; depending on whether or not you installed the Guest Component, the tab will display different info. Ultimately it should look as follows:

Now this example shows IIS being monitored, and not only IIS but also a disk mount. By the way, besides having a standard agent for IIS, ApplicationHA also includes agents for MS SQL and MS Exchange. My primary focus for this article is Windows, but of course Symantec also offers agents for Linux, which include Oracle, SAP and WebLogic. Just because your application is not included today doesn’t mean you are out of luck: Symantec regularly updates the ApplicationHA agent pack. There is also a standard Guest Component which enables you to monitor standard services, mount points and processes, and which is what Symantec calls an Infrastructure Agent.

That brings me to the next point. From a GUI perspective it seems like there is a single component that monitors the state of your application. That is not the case; as stated above it can monitor the availability of services, processes and much more. As stated, this is what Symantec calls the Infrastructure Agent, and yes, it has multiple components. I have listed the most crucial agents below with a short description:

  • Heartbeat Agent – The Heartbeat agent monitors the configured application service group.
  • MountMonitor Agent – The MountMonitor agent monitors the mount path of the configured storage. It is independent of how the underlying storage is managed (whether SFW disk groups or LDM disks or any other storage management software). The mount path can be a drive letter or a folder mount.
  • GenericService Agent – The GenericService agent brings services online, takes them offline, and monitors their status. A service is an application type that is supported by Windows, and conforms to the interface rules of the Service Control Manager (SCM).
  • Process Agent – The Process agent brings processes online, takes them offline, and monitors their status. You can specify different executables for each process routine.

Each of these agents has very specific configuration attributes. An example would be “DelayBeforeAppFault” for the Heartbeat Agent. This specifies the number of seconds the agent must wait, in case of an error, before communicating an application fault to VMware HA. I guess I just hit the nail on the head: the Heartbeat Agent is the agent that is responsible for communication with VMware HA. Basically it provides a means to trigger VMware HA to kick in when Symantec cannot handle the issue with a restart of the service.

So what happens if an Application that has been configured for protection fails?

First, Symantec ApplicationHA is triggered to try to get the application up and running after the failure occurred by restarting the application. It should be noted that Symantec’s ApplicationHA is aware of dependencies and knows in which order services should be started or stopped. If, however, for whatever reason this fails an “X” number of times, VMware HA will be asked to take action. (X is configurable by changing the config attribute “AppRestartAttempts”.) The action that HA takes will be a restart of the virtual machine.
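Purely to visualize that escalation order, here is a conceptual sketch. This is not Symantec’s actual implementation; check_app, restart_app and report_app_fault are hypothetical placeholders and the default values are only illustrative, but the attribute names match the ones mentioned above.

```python
import time

def applicationha_escalation(check_app, restart_app, report_app_fault,
                             app_restart_attempts=1, delay_before_app_fault=30,
                             poll_interval=10):
    """Conceptual sketch: restart the application in-guest first; only after
    AppRestartAttempts failed restarts wait DelayBeforeAppFault seconds and
    report an application fault so that VM-level HA restarts the virtual machine."""
    failed_restarts = 0
    while True:
        if check_app():
            failed_restarts = 0                 # healthy again, reset the counter
        elif failed_restarts < app_restart_attempts:
            restart_app()                       # the agent knows the service start order
            failed_restarts += 1
        else:
            time.sleep(delay_before_app_fault)  # grace period before escalating
            report_app_fault()                  # VMware HA then restarts the whole VM
            return
        time.sleep(poll_interval)
```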

But wait, is it really VMware HA? Meaning: is it the good old “AAM” agent that is responsible for this action? No, it isn’t. In this case both “vpxa” and “hostd” are responsible for the action taken. I guess visualizing the process will make it a bit easier to digest:

I hope this makes it a bit more clear on how the product works and how it integrates with VMware HA itself.

