drs

Now Available: vSphere 6.7 Clustering Deep Dive book!

Duncan Epping · Jul 30, 2018

Over the past couple of months Frank, Niels and I have worked ferociously to update the vSphere Clustering Deep Dive. Some of the material had already been brought up to date for vSphere 6.0 U2, but the majority had not been updated since vSphere 5.1. As you can imagine, this was a tremendous undertaking. Not only did we need to validate every sentence, all diagrams needed to be updated as well, and with the introduction of the HTML-5 Client all screenshots had to be retaken too.

Now, just a couple of weeks before VMworld, we are finally at the point where we can press “publish”.

What can you expect? As we have said with previous books, this is not a beginner’s guide! This is a deep dive, and we aimed to take you into the trenches of vSphere Clustering technologies. We cover a multitude of different features. For those who haven’t read the previous books, expect the following features to be covered:

vSphere HA
vSphere DRS
vSphere Storage DRS
vSphere Storage I/O Control
vSphere Network I/O Control

We also have a chapter on stretched clusters. In this chapter we describe how to design and implement a vSphere Metro Storage Cluster, leveraging all of the knowledge gained in the previous chapters.

For your convenience, I copied/pasted some of the Amazon info below.

—

Paperback: 566 pages
Publisher: CreateSpace Independent Publishing Platform; 1 edition (July 29, 2018)
Language: English
ISBN-10: 1722625325
ISBN-13: 978-1722625320
Product Dimensions: 5.5 x 1.3 x 8.5 inches
Shipping Weight: 1.8 pounds

—

I hope all of you will enjoy the book as much as we enjoyed writing it. And before I forget, I want to thank my co-authors for the late night discussions, the hard work, insights and fun/laughter at times.

Get it while it is hot! (Look on the right side column for the links to the book!)

Disable DRS for a VM

Duncan Epping · Mar 28, 2018

I have been having discussions with a customer who needs to disable DRS on a particular VM. I have written about disabling DRS for a host in the past, but not for a VM. Well, I probably have at some point, but that was years ago. The goal here is to ensure that DRS won’t move a VM around but HA can still restart it. Of course you can create VM to Host rules, and you can create “must rules”. When you create must rules, this could lead to an issue when the host on which the VM is running fails, as HA will not restart it. Why? Well, it is a “must rule”, which means that HA and DRS must comply with the rule specified. But there’s a solution, look at the screenshot below.

In the screenshot you see the “automation level” for the VM in the list; this is the DRS Automation Level. (Yes, the name will change in the H5 Client, making it more obvious what it is.) You add VMs by clicking the green plus sign. Next you select the desired “automation mode” for those VMs and click OK. You can of course completely disable DRS for the VMs which should never be vMotioned by DRS; in this case those “disabled VMs” are not considered at all during contention. You can also set the automation mode to Manual or Partially Automated for those VMs, as that gives you at least initial placement. The downside is that the VMs are considered for migration by DRS during contention. This could lead to a situation where DRS recommends migrating that particular VM without you being able to migrate it. This in turn could lead to VMs not getting the resources they require. So this is a choice you have to make: do I need initial placement or not?

If you prefer the VMs to stick to a certain host, I would highly recommend setting VM/Host Rules for those VMs. Use “should rules”, which define on which host the VM should run. Combined with the new Automation Level this will result in the VM being placed correctly, but not migrated by DRS when there’s contention. On top of that, it will allow HA to restart the VM anywhere in the cluster! Note that with the “manual” automation level DRS will ask you if it is okay to place the VM on a certain host, while with “partially automated” DRS will do the initial placement for you. In both cases balancing will not happen for those VMs automatically, but recommendations will be made, which you can ignore. (I deliberately do not say “safely” ignore, as it may not be safe.)
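The choice between the override levels can be summarized in a small lookup. This is only a sketch of the behaviour as described in this post, not a call against vCenter; the names OVERRIDE_BEHAVIOUR and pick_override are made up for illustration.

```python
# Per-VM DRS override behaviour as described in the post (illustrative model only).
# Each entry: (initial placement, does DRS make migration recommendations for the VM?)
OVERRIDE_BEHAVIOUR = {
    "disabled": ("none", False),
    "manual": ("asks for confirmation", True),
    "partially automated": ("automatic", True),
}


def pick_override(need_initial_placement, confirm_placement=False):
    """Hypothetical helper: map the 'initial placement or not?' choice to a level."""
    if not need_initial_placement:
        # The VM is not considered by DRS at all, not even during contention.
        return "disabled"
    # Both remaining levels still produce recommendations you may have to ignore.
    return "manual" if confirm_placement else "partially automated"


if __name__ == "__main__":
    level = pick_override(need_initial_placement=True)
    placement, recommends = OVERRIDE_BEHAVIOUR[level]
    print(level, "| placement:", placement, "| recommendations:", recommends)
```

Whichever level you pick, the “should rule” is what keeps the VM on its preferred host; the override only controls how much DRS is allowed to do with it.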

The difference between an isolation and a partition with vSphere

Duncan Epping · Oct 10, 2017

I have a lot of discussions with customers on the topic of stretched clusters, but also regular vSphere clusters. Something that often comes up is the discussion around what happens in an isolation or partition scenario. Fairly often customers (but also VMware employees) use those words interchangeably. However, a partition is not the same as an isolation. They are two different scenarios, and as a result each has a different type of response associated with it. Before I explain the difference between the two responses, what is a partition and what is an isolation?

An isolation event is a situation where a single host cannot communicate with the rest of the cluster. Note: single host!
A partition is a situation where two (or more) hosts can communicate with each other, but no longer can communicate with the remaining two (or more) hosts in the cluster. Note: two or more!
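To make the two definitions concrete, here is a minimal sketch that labels hosts based on which network segment they ended up in. It is a literal reading of the definitions above, not how vSphere HA detects these states internally; the function name classify_hosts and the “connected” label are assumptions for illustration.

```python
def classify_hosts(segments):
    """Label each host given the groups of hosts that can still talk to each other.

    segments: list of sets of host names; hosts in the same set can communicate.
    """
    total = sum(len(segment) for segment in segments)
    states = {}
    for segment in segments:
        outside = total - len(segment)  # hosts this segment can no longer reach
        if len(segment) == 1 and outside >= 1:
            state = "isolated"          # single host cut off from the rest
        elif len(segment) >= 2 and outside >= 2:
            state = "partitioned"       # two or more hosts cut off from two or more others
        else:
            state = "connected"         # label assumed here for everything else
        for host in segment:
            states[host] = state
    return states


if __name__ == "__main__":
    print(classify_hosts([{"esx1"}, {"esx2", "esx3", "esx4"}]))
    print(classify_hosts([{"esx1", "esx2"}, {"esx3", "esx4"}]))
```

In the first call only esx1 is labeled isolated; in the second call all four hosts are labeled partitioned.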

Why is that such a big deal? Well, the response in these two scenarios is different. And the response/result is also determined by the type of configuration you have. Let’s break down the scenarios one by one, including the type of infrastructure used (when it is relevant).

Isolation Event

When a host is isolated it will:

start an election process
declare itself primary
ping the isolation address
declare itself isolated
power off / shut down VMs (when this is configured)
communicate through the connected datastores that it is isolated
the VMs will be restarted on the remaining hosts in the cluster

And then of course vSphere HA will be able to restart the VMs. Note that in the case of vSAN, it isn’t possible to write to the datastore when a host is isolated, so it won’t do that. Yet the workloads will still have been powered off / shut down, so it is safe for vSphere HA to restart them.

Partition (traditional storage)

When two or more hosts are partitioned (they can still communicate with each other) and the vSphere HA primary is not part of the partition, the hosts in that partition will:

start an election process
declare a primary in the partition
figure out what has happened to the hosts and VMs in the other partition
restart any VMs that somehow were impacted, or appeared now to be powered off while the last known state was powered on
if all VMs are running, vSphere HA won’t try to restart any, this is the expected result!
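The restart decision in the last two steps boils down to comparing the last known power state with what the new primary observes. A minimal sketch, assuming both are available as plain dictionaries; the function name vms_to_restart and the state strings are illustrative, not vSphere HA internals.

```python
def vms_to_restart(last_known_state, observed_state):
    """Return VMs that were powered on before the partition but no longer appear to be."""
    return sorted(
        vm
        for vm, state in last_known_state.items()
        if state == "poweredOn" and observed_state.get(vm) != "poweredOn"
    )


if __name__ == "__main__":
    before = {"vm-a": "poweredOn", "vm-b": "poweredOn", "vm-c": "poweredOff"}
    after = {"vm-a": "poweredOn", "vm-b": "poweredOff", "vm-c": "poweredOff"}
    print(vms_to_restart(before, after))   # only vm-b needs a restart
    print(vms_to_restart(before, before))  # nothing to do when all VMs are still running
```

An empty result is the expected outcome when the partition did not impact any VM.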

Partition (vSAN stretched)

When the partition scenario happens in a stretched vSAN environment there’s an extra (potential) step. Along the way, vSAN will identify all VMs which have no accessible components and kill those VMs so they can be restarted in the partition which has quorum. In this scenario, you have three locations: two for data and one for the witness. If a data site loses access to the other locations, then the data site is partitioned (the hosts can still communicate with each other within the site), and as such the isolation response is not triggered. However, vSAN will still kill these VMs as they are rendered useless (lost access to disk).

I know it is just semantics, but nevertheless, I do feel it is important to understand the difference between an isolation and a partition, especially as the response (and who responds) is different in these situations. Hope it helps.

Don’t know why DRS is not balancing your cluster? DRS Dump Insight!

Duncan Epping · Aug 27, 2017

I was just reading up and noticed the DRS Dump Insight solution. It is a SaaS-based DRS dump analyzer which gives you details around why your cluster is not balanced, or why certain recommendations are not made. Especially the “what if” scenarios are cool if you ask me. You can take a dump and then, using the “what if” feature, check out what would happen to your cluster if for instance all affinity rules were dropped. Or what would happen if the DRS migration threshold is changed, or some advanced settings are used.

You can find some more info about it here, and the SaaS tool here. I hope this will make it into the product soon in the form of a “health check”… Very useful and insightful! Oh, if you can’t access the website, try it in “Incognito Mode”. It seems there are some issues with the certificate.

How to disable DRS for a single host in the cluster

Duncan Epping · Jan 17, 2017

I saw an interesting question today: how do I disable DRS for a single host in the cluster? I thought about it, and you cannot do this within the UI, at least… there is no “disable DRS” option on a host level. You can enable/disable it on a cluster level but that is it. But there are of course ways to ensure a host is not considered by DRS:

Place the host in maintenance mode
This will result in the host not being used by DRS. However it also means the host won’t be used by HA and you cannot run any workloads on it.
Create “VM/Host” affinity rules and exclude the host that needs to be DRS disabled
That way all current workloads will not run, or be considered to run, on that particular host. If you create “must” rules this is guaranteed. If you create “should” rules then at least HA can still use the host for restarts, but unless there is severe memory pressure or you hit 100% CPU utilization it will not be used by DRS either.
Disable the vMotion VMkernel interface
This will result in not being able to vMotion any VMs to the host (and not from the host either). However, HA will still consider it for restarts and you can run workloads on the host, and the host will be considered for “initial placement” during a power-on of a VM.
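The three workarounds differ mainly in whether HA can still use the host. The sketch below only restates the effects listed above as data so they can be compared side by side; the dictionary layout and the helper name options_keeping_ha are assumptions for illustration, not a vSphere API.

```python
# Effects of each workaround as described above (illustrative summary only).
OPTIONS = {
    "maintenance mode": {
        "ha_can_restart_on_host": False,
        "drs_use_of_host": "never",
    },
    "VM/Host must rules": {
        "ha_can_restart_on_host": False,
        "drs_use_of_host": "never",
    },
    "VM/Host should rules": {
        "ha_can_restart_on_host": True,
        "drs_use_of_host": "only under severe memory or CPU pressure",
    },
    "vMotion VMkernel disabled": {
        "ha_can_restart_on_host": True,
        "drs_use_of_host": "initial placement only",
    },
}


def options_keeping_ha(options=OPTIONS):
    """Return the workarounds that still let HA restart VMs on the host."""
    return [name for name, effect in options.items() if effect["ha_can_restart_on_host"]]


if __name__ == "__main__":
    for name in options_keeping_ha():
        print(name, "->", OPTIONS[name]["drs_use_of_host"])
```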

I will file a feature request for a “disable DRS” option on a particular host in the UI. I guess it could be useful for some in certain scenarios.