
Yellow Bricks

by Duncan Epping


clustering

iBook / Nook? What really?

Duncan Epping · Jan 24, 2012 ·

It took us a while to figure this one out, but finally we managed to create a proper epub version of our book. Converting from Kindle to iBooks / Nook is not easy, but after using multiple tools and manually editing the book in Sigil we managed to create a version, which was submitted to the iBooks store and Barnes & Noble a couple of weeks ago. This morning I noticed that the book has been made available. So for all those who have been asking about it, here are the links:

  • iBook – $ 9.99
  • Nook – $ 9.99

We hope you will enjoy it!

vSphere 5.0 HA and metro / stretched cluster solutions

Duncan Epping · Oct 5, 2011 ·

I had a discussion via email about metro clusters and HA last week, and it made me realize that HA’s new architecture (as part of vSphere 5.0) might be confusing to some. I started re-reading the article which I wrote a while back about HA and metro / stretched cluster configurations, and most of it actually still applies. Before you read this article I suggest reading the HA Deepdive Page, as I am going to assume you understand some of the basics. I also want to point you to this section in our HCL which lists the certified configurations for vMSC (vSphere Metro Storage Cluster). You can select the type of storage, for instance “FC Metro Cluster Storage”, in the “Array Test Configuration” section.

In this article I will take a single scenario and explain the different types of failures and how HA handles them under the hood. I guess the most important part in these scenarios is understanding why HA did or did not respond to a failure.

Before I explain the scenario I want to briefly explain the concept of a metro / stretched cluster, which can be divided into two different types of solutions. The first solution is one where a synchronous copy of your datastore is available on the other site; this mirror copy will be read-only. In other words, there is a read-write copy in Datacenter-A and a read-only copy in Datacenter-B. This means that your VMs in Datacenter-B located on this datastore will do I/O on Datacenter-A, since the read-write copy of the datastore is in Datacenter-A. The second solution is what EMC calls “write anywhere”. In this case VMs always write locally. The key point here is that each of the LUNs / datastores has a “preferred site” defined, also sometimes referred to as “site bias”. In other words, if anything happens to the link in between, the storage system on the preferred site for a given datastore will be the only one left that can access it read-write. This is of course to avoid any data corruption in the case of a failure scenario.
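To make the “site bias” behavior concrete, below is a toy Python sketch. It is not any vendor’s actual implementation; the datastore names, site names and dictionary layout are purely illustrative. It simply shows that, once the inter-site link is gone, only the preferred site keeps read-write access to a given datastore:

# Toy model of per-datastore "site bias" during an inter-site link failure.
# All names are made up; only the decision logic matters.
site_bias = {
    "datastore-01": "Datacenter-A",
    "datastore-02": "Datacenter-B",
}

def access_after_link_failure(datastore: str, site: str) -> str:
    # Only the preferred ("biased") site keeps read-write access;
    # the other site loses access to avoid data corruption.
    return "read-write" if site_bias[datastore] == site else "no access"

for ds in site_bias:
    for site in ("Datacenter-A", "Datacenter-B"):
        print(ds, "from", site, "->", access_after_link_failure(ds, site))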

In this article we will use the following scenario:

  • 2 sites
  • 18 hosts
  • 50 km between sites
  • Preferred (aka “Should”) VM-Host affinity rule to create “Datacenter Affinity – I/O Locality”
  • Designated heartbeat datastores
    (read this article for more details on heartbeat datastores)
  • Synchronous mirrored datastores
    (Please note that I have not depicted the “mirror” copy just to simplify the diagram)

This is what it will look like:

[Diagram: vSphere 5.0 HA and metro / stretched cluster solutions]

What do you see in this diagram? Each site has 9 hosts. The HA master is located in Datacenter-A. Each host uses the designated “heartbeat datastore” in each of the datacenters; note that I only drew the line for the lower-left ESXi host just to simplify the diagram.

There are many failures which can occur, but HA will be unaware of many of them. I will not discuss those, as they are explained in depth in the storage vendor’s documentation. I will however discuss the following “common” failures:

  • Host failure in Datacenter-A
  • Storage Failure in Datacenter-A
  • Loss of Datacenter-A
  • Datacenter Partition
  • Storage Partition

Host failure in Datacenter-A

When a host fails in Datacenter-A this is detected by the HA master node, as network heartbeats from it are no longer received. When the master has detected that network heartbeats are missing it will start monitoring for datastore heartbeats. As the host has failed, there will be no datastore heartbeats issued. During this time a third liveness check will be done, which is pinging the management address of the failed host. If all of these liveness checks are unsuccessful, the master will declare the host dead and will attempt to restart all the protected virtual machines that were running on the host before the master lost contact with it. The rules defined on a cluster level are “preferred rules” and as such the virtual machines can be restarted on the other site. If DRS is enabled then it will attempt to correct any sub-optimal placements HA made during the restart of your virtual machines. This same scenario also applies to a situation where all hosts fail in one site without storage being affected.
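To summarize the ordering of these liveness checks, here is a minimal Python sketch. The Host fields and function are my own illustrative placeholders, not the actual HA (FDM) implementation; only the order of the checks reflects what is described above:

from dataclasses import dataclass

# Illustrative ordering: network heartbeat -> datastore heartbeat -> ping.
# Only if all three are missing is the host declared dead and a restart attempted.

@dataclass
class Host:
    network_heartbeat: bool
    datastore_heartbeat: bool
    ping_reply: bool

def is_host_dead(host: Host) -> bool:
    if host.network_heartbeat:
        return False   # host is healthy, nothing to do
    if host.datastore_heartbeat:
        return False   # host is likely isolated/partitioned, not dead
    if host.ping_reply:
        return False   # some liveness remains, do not declare dead
    return True        # all liveness checks failed: declare dead, restart protected VMs

# A failed host answers nothing, so HA restarts its protected VMs:
print(is_host_dead(Host(False, False, False)))   # True
# An isolated host still issues datastore heartbeats, so it is not declared dead:
print(is_host_dead(Host(False, True, True)))     # False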

Storage Failure in Datacenter-A

In this scenario only the storage system in Datacenter-A fails. This failure does not result in downtime for your VMs. What will happen? Simply said, the mirror (read-only) copy of your datastore will become read-write and will be presented with the same identifier, and as such the hosts will be able to write to these volumes without the need to resignature. (This sounds very simple of course, but do note that I am describing it at an extremely high level, and in most solutions manual intervention is required to indicate a failure has occurred.) In most cases however, from a VM’s perspective this happens seamlessly. It should be noted that all I/O will now go across your link to the other site. Note that HA is not aware of this failure. Although the storage heartbeat might be lost for a second, HA will not take action, as the HA master agent only checks for the storage heartbeat when the network heartbeat has not been received for three seconds.

Loss of Datacenter-A

This is basically a combination of the first and the second failure we’ve described. In this scenario, the hosts in Datacenter-B will lose contact with the master and elect a new one. The new master will access the per-datastore files HA uses to record the set of protected VMs, and so determine which VMs are HA protected. The master will then attempt to restart any VMs that are not already running on its host and the other hosts in Datacenter-B. At the same time, the master will do the liveness checks noted above, and after 50 seconds report the hosts of Datacenter-A as dead. From a VM’s perspective the storage fail-over could occur seamlessly. The mirror copy of the datastore is promoted to read-write, and the hosts in Datacenter-B will be able to access the datastores which were local to Datacenter-A.

Datacenter Partition

This is where most people feel things will become tricky. What happens if your link between the two datacenters fails? Yes, I realize that the chances of this happening are slim, as you would typically have redundancy on this layer, but I do think it is an interesting one to explain. The main thing to realize here is that with these types of failures VM-Host affinity, or should I call it VM-Datacenter affinity, suddenly becomes very important, but we will come back to that in a second. Let’s explain the scenario first.

In this case a distinction can be made between the two types of metro cluster briefly explained before. There is a solution, which EMC likes to call “write anywhere”, which basically presents a virtual datastore across datacenters and allows writes on both sites. On the other hand there is the traditional stretched cluster solution where there is only one site actively handling I/O and a “passive” site which will be used in case a fail-over needs to occur. In the case of “write anywhere” a so-called site bias or preference is defined per datastore. Both scenarios however are similar in the sense that, in the case of a failure, a given datastore will only be accessible on one site.

If the link between the sites fails, the datastore becomes active on just one of the sites. What would happen to a VM that is running in Datacenter-B but has its files stored on a datastore which was configured with site bias for Datacenter-A?

In the case where a VM is running in Datacenter-B, the VM would have its storage “yanked” out from underneath it. The VM would more than likely keep running and keep retrying the I/O. However, as the link between the sites has been broken, HA in Datacenter-A will try to restart the workload. Why is that?

  • The network heartbeat is missing because the link dropped
  • The datastore heartbeat is missing because the link dropped and the datastore becomes inaccessible from Datacenter-B
  • A ping to the management address of the host fails because the link is missing
  • The master for Datacenter-A knows the VM was powered on before the failure, and since it can’t communicate with the VM’s host in Datacenter-B after the failure, it will attempt to restart the VM.

What happens in Datacenter-B? In Datacenter-B a master is elected. This master will determine the VMs that need to be protected. Next it will attempt to restart all those that it knows are not already running. Any VMs biased to this site that are not already running will be powered on. However, any VMs biased to the other site won’t be, as their datastore is inaccessible. HA will report a restart failure for the latter, since it does not know that these VMs are (still) running in the other site.

Now you might wonder what will happen if the link returns? This is the classic “VM split brain scenario”. For a short period of time you will have 2 active copies of the VM on your network, both with the same MAC address. However, only one copy will have access to the VM’s files and HA will recognize this. As soon as this is detected, the VM copy that has no access to the VMDK will be powered off.

I hope all of you understand why it is important to know what the preferred site is for your datastore, as it can and probably will impact your uptime. Also note that although we defined VM-Host affinity rules, these are preferential / “should” rules and can be violated by both DRS and HA.
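For reference, a “should” VM-Host affinity rule is simply a non-mandatory rule. Below is a hedged pyVmomi sketch of how such a rule could be created; the group and rule names are made up, and the function assumes you have already looked up the cluster object and the Datacenter-A hosts and VMs:

from pyVmomi import vim

def add_should_rule(cluster, dc_a_hosts, dc_a_vms):
    # cluster: vim.ClusterComputeResource; dc_a_hosts / dc_a_vms: lists of
    # HostSystem / VirtualMachine objects located in Datacenter-A (assumptions).
    spec = vim.cluster.ConfigSpecEx(
        groupSpec=[
            vim.cluster.GroupSpec(
                info=vim.cluster.HostGroup(name="dc-a-hosts", host=dc_a_hosts),
                operation="add"),
            vim.cluster.GroupSpec(
                info=vim.cluster.VmGroup(name="dc-a-vms", vm=dc_a_vms),
                operation="add"),
        ],
        rulesSpec=[
            vim.cluster.RuleSpec(
                info=vim.cluster.VmHostRuleInfo(
                    name="dc-a-vms-should-run-in-dc-a",
                    vmGroupName="dc-a-vms",
                    affineHostGroupName="dc-a-hosts",
                    mandatory=False,   # "should" rule: DRS and HA may violate it
                    enabled=True),
                operation="add"),
        ],
    )
    return cluster.ReconfigureComputeResource_Task(spec, True)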

Storage Partition

This is the final scenario. In this scenario only the storage connection between Datacenter-A and Datacenter-B fails. What happens in this case? This scenario is very straightforward. As the HA master is still receiving network heartbeats it will, unfortunately, currently not take any action. This is very important to realize. Also, HA is not aware of the VM-Host affinity rules. HA will restart virtual machines wherever it feels it should; DRS however should move the virtual machines to the correct location based on these rules. That is, if and when they are correctly defined of course!

Summarizing

Stretched clusters are in my opinion great solutions to increase resiliency in your environment. There is however always a lot of confusion around failure scenarios and the different types of responses from both the vSphere layer and the storage layer. In this article I have tried to explain how vSphere HA responds to certain failures in a stretched / metro cluster environment. I hope this will help everyone get a better understanding of vSphere HA. I also fully realize that things can be improved by a tighter integration between HA and your storage systems; for now all I can say is that this is being worked on. I want to finish with a quick summary of some vSphere 5.0 HA design considerations for stretched cluster environments:

  • Das.isolationaddress per Datacenter!
    Having an isolation address in each datacenter will help your hosts determine whether they themselves are isolated or whether the master has been isolated in the case of a network failure (see the configuration sketch after this list).
  • Designated heartbeat datastores per Datacenter!
    Each site will need a designated heartbeat datastore to ensure each site can at a minimum update the heartbeat region of the site-local storage.

    • If there are multiple storage systems on each site it is recommended to increase the number of heartbeat datastores to four, two for each site.
  • Define VM-Host affinity rules, as they can lower the impact during a failure and help keep I/O local
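As an illustration of the first two bullets, here is a hedged pyVmomi sketch that sets one das.isolationaddress per datacenter and designates the heartbeat datastores on the cluster. The addresses are placeholders for site-local gateway addresses, and the cluster / datastore objects are assumed to have been looked up already:

from pyVmomi import vim

def configure_stretched_cluster_ha(cluster, hb_datastores):
    # cluster: vim.ClusterComputeResource; hb_datastores: list of vim.Datastore
    # objects, e.g. two per site (assumptions). 10.1.0.1 / 10.2.0.1 stand in for
    # an isolation address in Datacenter-A and Datacenter-B respectively.
    das = vim.cluster.DasConfigInfo(
        option=[
            vim.option.OptionValue(key="das.isolationaddress0", value="10.1.0.1"),
            vim.option.OptionValue(key="das.isolationaddress1", value="10.2.0.1"),
        ],
        heartbeatDatastore=hb_datastores,
        # Use only the explicitly designated heartbeat datastores.
        hBDatastoreCandidatePolicy="userSelectedDs",
    )
    spec = vim.cluster.ConfigSpecEx(dasConfig=das)
    return cluster.ReconfigureComputeResource_Task(spec, True)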

Thanks for taking the time to read this far, and don’t hesitate to leave a comment if you have any questions or feedback / remarks.

Edit: Completely coincidentally, Chad Sakac posted an article about Stretched Clusters and the new HCL category. Read it!

 

** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **

Hot off the press: vSphere 5.0 Clustering Technical Deepdive

Duncan Epping · Jul 12, 2011 ·

** Update: Available now: paperback full color | paperback black & white **

After months of hard work the moment is finally there: the release of our new book, vSphere 5.0 Clustering Technical Deepdive! When we started working on, or better said planning, an update of the book we never realized the amount of work required. Be aware that this is not a minor update. This book covers HA (a full rewrite, as HA has been rewritten for 5.0), DRS (mostly rewritten to focus on resource management) and Storage DRS (new!). Besides these three major pillars we also decided to add what we call supporting deepdives. The supporting deepdives added are: vMotion, Storage vMotion, Storage I/O Control and EVC. This resulted in roughly 50% more content (totaling 348 pages) than the previous book. Also worth noting: every single diagram has been recreated, and are they cool or what?

Before I give you the full details I want to thank a couple of people who have helped us tremendously and without whom this publication would not have been possible. First of all I would like to thank my co-author Frank “Mr Visio” Denneman for all his hard work. Frank and I would also like to thank our VMware management team for supporting us on this project. Doug “VEEAM” Hazelman, thanks for writing the foreword! A special thanks goes out to our technical reviewers and editors: Doug Baer, Keith Farkas and Elisha Ziskind (HA Engineering), Anne Holler, Irfan Ahmad and Rajesekar Shanmugam (DRS and SDRS Engineering), Puneet Zaroo (VMkernel scheduling), Ali Mashtizadeh and Gabriel Tarasuk-Levin (vMotion and Storage vMotion Engineering), Doug Fawley and Divya Ranganathan (EVC Engineering). Thanks for keeping us honest and contributing to this book.

As promised in the multiple discussions we had around our 4.1 HA/DRS book, we wanted to make sure to offer multiple options straight away. While Frank finalized the printed copy I worked on formatting the ebook. Besides the black & white printed version we are also offering a full color version of the book and a Kindle version. The black & white sells for $ 29.95, the full color for $ 49.95 and the Kindle for an ultra cheap price: $ 9.95. Needless to say, we recommend the Kindle version. It is cheap, full color and portable, or should we say virtual… who doesn’t love virtual? On a side note, we weren’t planning on doing a black and white release, but due to the extremely high production costs of the full color print we decided to offer it as an extra service. Before I give the full description here are the direct links to where you can buy the book. (Please note that Amazon hasn’t listed our book yet; it seems like an indexing issue and should hopefully be resolved soon.) For those who cannot wait to order the printed copy, check out Createspace or Comcol.

Amazon:
eBook (Kindle) – $ 9.99
(price might vary based on location as amazon charges extra for delivery)
Black & White Paper – $ 29.95
Full Color Paper – $ 49.95

Createspace:
Black & White Paper – $ 29.95
Full Color Paper – $ 49.95

For the EMEA folks, comcol.nl offered to distribute it again; the paper black & white can be found here, and the full color here.

VMware vSphere 5.0 Clustering Technical Deepdive zooms in on three key components of every VMware based infrastructure and is by no means a “how to” guide. It covers the basic steps needed to create a vSphere HA and DRS cluster and to implement Storage DRS. Even more important, it explains the concepts and mechanisms behind HA, DRS and Storage DRS, which will enable you to make well educated decisions. This book will take you into the trenches of HA, DRS and Storage DRS and will give you the tools to understand and implement e.g. HA admission control policies, DRS resource pools, Datastore Clusters and resource allocation settings. On top of that, each section contains basic design principles that can be used for designing, implementing or improving VMware infrastructures, and fundamental supporting features like vMotion, Storage I/O Control and much more are described in detail for the very first time.

This book is also the ultimate guide to be prepared for any HA, DRS or Storage DRS related question or case study that might be presented during VMware VCDX, VCP and/or VCAP exams.

Coverage includes:
– HA node types
– HA isolation detection and response
– HA admission control
– VM Monitoring
– HA and DRS integration
– DRS imbalance algorithm
– Resource Pools
– Impact of reservations and limits
– CPU Resource Scheduling
– Memory Scheduler
– DPM
– Datastore Clusters
– Storage DRS algorithm
– Influencing SDRS recommendations

Be prepared to dive deep!

Pick it up, leave a comment and of course feel free to make those great mugshots again and ping them over via Facebook or our Twitter accounts! For those looking to buy in bulk (> 20) contact [email protected]

VMware HA Survey

Duncan Epping · Jul 5, 2011 ·

The Product Management Team reached out to me this week and asked me if I could help gather some real-world data around HA by posting some information about a survey. These surveys are generally used to prioritize features or even add functionality. So if you want to contribute, please take the survey. It will take roughly 4 minutes to complete.

We have observed some of our customers have not virtualized the clustered workloads and thus we want to understand their reasons for not doing so. Also, for the clustered workloads that have been virtualized we want to learn of some of the factors that are preventing our customers from replacing the existing clustering software with VMware HA. Thus, we seek some information about your clustered setup to better understand some of your challenges. This would help us improve VMware HA in areas important to you. Kindly take a few minutes to complete this survey.

 

http://tinyurl.com/3pca8wn
