BC-DR

VM Monitoring (aka VM HA) heartbeat

Duncan Epping · Jun 4, 2010 ·

I got a question about VM Monitoring (aka virtual machine level HA) this week. A customer wanted to test whether VM Monitoring worked, so they disabled the NIC of the virtual machine and waited 30 seconds for the VM Monitoring response to kick in. Nothing happened.

VM Monitoring restarts individual virtual machines when needed. It uses a concept similar to HA: heartbeats. If heartbeats, in this case VMware Tools heartbeats, are not received for a specific amount of time, the virtual machine will be rebooted. An example of when this happens is a Windows virtual machine showing a BSOD.

The big question, of course, was: why didn’t this trigger a response?

The answer is simple: The VMware Tools heartbeat does not use the virtual machine NIC. This heartbeat is “caught” by hostd and passed on to vCenter. vCenter uses this to show those “green/yellow/red” alarm dots. The same heartbeat is used by VM Monitoring to detect the failure of a virtual machine. Even without any NIC attached to your virtual machine these heartbeats will still be received.

One thing to keep in mind, though: when heartbeats, which by default are sent out every second, are no longer received, VM Monitoring will first check whether there is any network or storage I/O, to avoid false positives.
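To make the logic above a bit more concrete, here is a minimal sketch in Python. It is not how HA is actually implemented; the names `VmSample` and `should_reset` are made up for illustration, and the 30 second failure interval is an assumption borrowed from the customer's test.

```python
from dataclasses import dataclass


@dataclass
class VmSample:
    """One observation of a VM as a monitoring loop might see it (hypothetical)."""
    seconds_since_last_heartbeat: float  # age of the last VMware Tools heartbeat
    network_io_seen: bool                # any network I/O during the check window
    storage_io_seen: bool                # any storage I/O during the check window


def should_reset(sample: VmSample, failure_interval: float = 30.0) -> bool:
    """Return True only when heartbeats are missing AND the VM shows no I/O.

    failure_interval is an assumed value, not a documented default.
    """
    heartbeat_lost = sample.seconds_since_last_heartbeat >= failure_interval
    if not heartbeat_lost:
        return False
    # Heartbeats stopped: look for I/O before acting, to avoid a false positive.
    return not (sample.network_io_seen or sample.storage_io_seen)


# NIC disabled, but VMware Tools still sends heartbeats: no reset.
nic_disabled = VmSample(seconds_since_last_heartbeat=1.0,
                        network_io_seen=False, storage_io_seen=True)
# Guest has crashed (for instance a BSOD): no heartbeats and no I/O.
crashed_guest = VmSample(seconds_since_last_heartbeat=45.0,
                         network_io_seen=False, storage_io_seen=False)

print(should_reset(nic_disabled))
print(should_reset(crashed_guest))
```

The point of the sketch is that the NIC state never enters the heartbeat check itself, which is why disabling the NIC did not trigger a response.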

Question for you guys! One thing that I always wondered is how many people use VM Monitoring? And if you use it, do you use it on all VMs in every cluster?

VMware SRM Customer Survey!

Duncan Epping · May 28, 2010 ·

I just received an email from Hari Krishnan, who is a Senior Product Manager at VMware. Hari has created a survey and is looking for feedback from our customers. Not only will you be helping VMware out, you will also help out a charity organisation, which will receive $10 for every response from the first 1000 respondents. So please donate 15 minutes of your time!

Hello SRM users,

The VMware vCenter Site Recovery Manager (SRM) product team is looking for product feedback on SRM deployments. If you have purchased SRM, we would like to hear from you. Your participation will be very valuable to us and the information you provide will be used to improve the SRM product going forward.

You can provide your feedback by completing the survey

The survey should take no longer than 15 minutes and will expire on June 10, 2010. Please note that this survey is for SRM customers only.

Upon completion of the survey, if you are among the first 1000 respondents, VMware will donate $10 per response to charity. You will also receive a link to download the electronic copy of Mike Laverick’s book “Administering VMware Site Recovery Manager 4.0”.

We appreciate you taking the time to provide us with your valuable feedback.

Thank you,
The VMware SRM Team

Cool new HA feature coming up to prevent a split brain situation!

Duncan Epping · Mar 29, 2010 ·

I already knew this was coming up but wasn’t allowed to talk about it. As it is out in the open on the VMTN community I guess I can talk about it as well.

One of the most common issues experienced with VMware HA is a split brain situation. Although currently undocumented, vSphere has a detection mechanism for these situations. Even more important, the upcoming release, ESX 4.0 Update 2, will also automatically prevent it!

First let me explain what a split brain scenario is. Let’s start by describing the situation which is most commonly encountered:

4 Hosts – iSCSI / NFS based storage – Isolation response: leave powered on

When one of the hosts is completely isolated, including the Storage Network, the following will happen:

Host ESX001 is completely isolated, including the storage network (remember, iSCSI/NFS based storage!), but the VMs will not be powered off because the isolation response is set to “leave powered on”. After 15 seconds the remaining, non-isolated hosts will try to restart the VMs. Because the iSCSI/NFS network is also isolated, the lock on the VMDK will time out and the remaining hosts will be able to boot up the VMs. When ESX001 returns from isolation it will still have the VMX processes running in memory. This is when you will see a “ping-pong” effect within vCenter, in other words VMs flipping back and forth between ESX001 and any of the other hosts.

As of version 4.0, ESX(i) detects that the lock on the VMDK has been lost and asks whether the VM should be powered off or not. Please note that you will (currently) only see this question if you connect directly to the ESX host. Below you can find a screenshot of this question.

With ESX 4.0 Update 2, though, the question will be auto-answered and the VM will be powered off to avoid the ping-pong effect and a split brain scenario! How cool is that…
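The difference between the two behaviours can be sketched as a tiny decision function. This is only my own illustration in Python, not ESX code; `Action` and `on_return_from_isolation` are hypothetical names, and the `auto_answer` flag simply stands in for "running ESX 4.0 Update 2".

```python
from enum import Enum


class Action(Enum):
    KEEP_RUNNING = "keep running"
    ASK_OPERATOR = "ask whether to power off"
    POWER_OFF = "power off"


def on_return_from_isolation(vmdk_lock_lost: bool, auto_answer: bool) -> Action:
    """What the formerly isolated host does with a VM it still has in memory.

    auto_answer=False models ESX(i) 4.0 (question shown on the host only),
    auto_answer=True models the Update 2 behaviour described above.
    """
    if not vmdk_lock_lost:
        # No other host took over the VM, so there is nothing to resolve.
        return Action.KEEP_RUNNING
    return Action.POWER_OFF if auto_answer else Action.ASK_OPERATOR


print(on_return_from_isolation(vmdk_lock_lost=True, auto_answer=False).value)
print(on_return_from_isolation(vmdk_lock_lost=True, auto_answer=True).value)
```

With the auto-answer in place only one copy of the VM keeps running, which is exactly what stops the ping-pong effect.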

Impact of decisions…

Duncan Epping · Feb 15, 2010 ·

I’ve been conducting VCDX Defense Interviews for a while now. Last week in Las Vegas during PEX something struck me and I guess this post by Frank Denneman is a good example…

On a regular basis I come across NFS based environments where the decision is made to store the virtual machine swap files on local VMFS datastores. Using host-local swap can affect DRS load balancing and HA failover in certain situations. So when designing an environment using host-local swap, some areas must be focused on to guarantee HA and DRS functionality.

Every decision you make has an impact on your design/environment. What does a decision exactly impact? In most cases every decision impacts the following:

Cost
Availability
Performance

The example Frank wrote about (see quote) involves a decision which clearly had an impact on all three. Although at the time it might have been a best practice, the decision to go along with this best practice still had an impact on the environment. Because it was a best practice, this impact might not have been as obvious. But when listed as follows I hope you understand why I am writing this article:

Cost – Reduced costs by moving the .vswp file to local disks.
Performance – VMotion performance is affected because .vswp files need to be copied from HOST-A to HOST-B.
Availability – Possibly less availability when the amount of free disk space on local VMFS isn’t sufficient to restart VMs in case of disaster.
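The availability point is easy to check on paper. Below is a rough sketch in Python, assuming the .vswp file of a VM is as large as its configured memory minus its memory reservation. The function names and the example numbers are made up for illustration.

```python
def swap_file_gb(configured_mem_gb: float, reserved_mem_gb: float = 0.0) -> float:
    """Size of a VM's .vswp file: configured memory minus memory reservation."""
    return max(configured_mem_gb - reserved_mem_gb, 0.0)


def can_restart_all(local_free_gb: float, vms: list[tuple[float, float]]) -> bool:
    """True if the local VMFS has room for the swap files of all VMs to restart.

    vms is a list of (configured_mem_gb, reserved_mem_gb) pairs (hypothetical data).
    """
    needed = sum(swap_file_gb(mem, reserved) for mem, reserved in vms)
    return needed <= local_free_gb


# Three VMs that would need to restart on this host after a failure.
failed_over_vms = [(8.0, 0.0), (8.0, 2.0), (4.0, 0.0)]  # needs 8 + 6 + 4 = 18 GB

print(can_restart_all(local_free_gb=20.0, vms=failed_over_vms))
print(can_restart_all(local_free_gb=16.0, vms=failed_over_vms))
```

If the second case applies to the surviving hosts, some VMs cannot be restarted, even though CPU and memory would have been sufficient.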

As you can see, a simple decision has a major impact. Even though it might be a best practice, you will need to think about the possible impact it has, and whether this best practice fits your environment and meets your (customer) requirements. Another great example would be LUN sizing. So what if I randomly pick a LUN size? Let’s say 1TB:

Cost – As the average VM size is 35GB, I want a max of 20 VMs on a datastore and I need 20% of overhead for vswp files and snapshots, I end up with a max usage of 840GB. That leaves 160GB of the 1TB LUN unused!
Availability – Although the availability of the datastore will be unaffected, the uptime of your environment might change. When a single datastore fails you will lose 1TB worth of data. Not only will you lose more VMs, restoring will also take longer.
Performance – Normally I would restrict the LUN size to reduce the number of VMs on a single datastore. More VMs on a datastore means a higher chance of SCSI reservation conflicts.
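The numbers in the Cost bullet can be reproduced with a few lines of Python. This is just the arithmetic from the example, with 1TB counted as 1000GB; the function name is mine.

```python
def datastore_usage_gb(avg_vm_gb: int, max_vms: int, overhead_percent: int) -> float:
    """Space needed for the VMs plus a percentage of overhead (vswp, snapshots)."""
    vm_space_gb = avg_vm_gb * max_vms
    return vm_space_gb * (100 + overhead_percent) / 100


lun_size_gb = 1000  # the randomly picked 1TB LUN, counted as 1000GB
used_gb = datastore_usage_gb(avg_vm_gb=35, max_vms=20, overhead_percent=20)
unused_gb = lun_size_gb - used_gb

print(f"max usage: {used_gb:.0f}GB, unused: {unused_gb:.0f}GB")
```

Working it the other way around, starting from the VM size, the VM count and the overhead, gives you a LUN size you can actually justify.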

The VCDX certification is not about knowing all the technical details. Of course that is an essential part of it, but it’s about understanding the impact of a decision. It’s about justifying your decision based on the impact it has on the environment/design. Know the pros and cons. Even if it is a best practice, it might not necessarily apply to your situation.

Where should you get SRAs from?

Duncan Epping · Dec 18, 2009 ·

I received the weekly newsletter from Michael White (VMware BCDR Specialist SE) over the weekend, and the following is a question I also receive on a regular basis, so why not blog it?!

I had a disagreement with a friend about where to get SRAs from. He was under the impression that we didn’t have the arrays on our premises for all of the SRAs on the market and so it was OK to take an SRA from a vendor as they could test it. The fact is we do have most, or all, of the arrays for each SRA in-house, but that is actually not relevant. It is important to only take SRAs from the VMware web site for a different reason. When a vendor finishes updating or writing an SRA, it is run against a special program that produces a log. The SRA and log are sent to VMware and we check them out. Sometimes they are sent back for improvements or fixes. This continues until the SRA passes and then it is posted on our web site. If you took the SRA from the vendor you may accidentally get an SRA that in a week or a month we might decline and send back to be fixed. So please, make sure you get the only safe copy of an SRA available, and that is from our web site!