vSphere Metro Storage Cluster with vSphere 5.5

I had a couple of questions around the exact settings for vSphere Metro Storage Clusters (vMSC) with vSphere 5.5. It was the third time in two weeks that I shared the same info about vMSC with vSphere 5.5, so I figured I would write a quick blog post to make the information a bit easier to find through Google. Below you can find the settings required for a vSphere Metro Storage Cluster with vSphere 5.5. Note that in-depth details around operations / testing can be found in this white paper: version 5.x // version 6.0.

  1. VMkernel.Boot.terminateVMOnPDL = True
  2. Das.maskCleanShutdownEnabled = True 
  3. Disk.AutoremoveOnPDL = 0 

I want to point out that if you migrate from 5.0 or 5.1, the Host Advanced Setting “VMkernel.Boot.terminateVMOnPDL” replaces disk.terminateVMOnPDLDefault (/etc/vmware/settings). Das.maskCleanShutdownEnabled is actually set to “true” by default as of vSphere 5.1, but personally I prefer to set it anyway so that I know for sure it has been configured correctly. Then there is Disk.AutoremoveOnPDL, which is new in vSphere 5.5 as discussed here. Make sure to disable it: PDLs are likely to be temporary, so there is no point in removing the devices and then having to do a rescan to have them reappear. It only slows down your recovery process. (EMC also recommends this by the way, see page 21 of this PDF on vMSC/VPLEX.)
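If you want to verify these three values across a number of hosts, a small comparison helper can be handy. The following is only a minimal sketch in Python: the setting names and values are taken from the list above, while `find_deviations` and the `host_settings` dict are hypothetical names for illustration. How you collect the current values from a host (PowerCLI, esxcli, an SDK) is left out on purpose.

```python
# Recommended values as listed in this post; names are copied from the post.
RECOMMENDED = {
    "VMkernel.Boot.terminateVMOnPDL": True,
    "Das.maskCleanShutdownEnabled": True,
    "Disk.AutoremoveOnPDL": 0,
}


def find_deviations(current):
    """Return {setting: (current_value, recommended_value)} for every mismatch.

    `current` is a hypothetical dict of setting name -> value that you have
    collected yourself; a missing setting is reported with a value of None.
    """
    deviations = {}
    for name, wanted in RECOMMENDED.items():
        actual = current.get(name)
        # Compare the type as well, because in Python False == 0 and True == 1.
        if type(actual) is not type(wanted) or actual != wanted:
            deviations[name] = (actual, wanted)
    return deviations


# Example input (made up): one setting is missing, one has the wrong value.
host_settings = {
    "VMkernel.Boot.terminateVMOnPDL": True,
    "Disk.AutoremoveOnPDL": 1,
}
for name, (actual, wanted) in find_deviations(host_settings).items():
    print(f"{name}: is {actual!r}, recommended {wanted!r}")
```

A missing setting is reported rather than silently accepted, which matches my preference above of setting Das.maskCleanShutdownEnabled explicitly instead of relying on the default.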

What happens to VMs when a cluster is partitioned?

I had this question this week around what happens to VMs when a cluster is partitioned. Funny thing is that with questions like these it seems like everyone is thinking the same thing at the same time. I had the question on the same day from a customer running traditional storage who had a network failure across racks, and from a customer running Virtual SAN who just wanted to know how this situation was handled. The question boils down to this: what happens to the VM in “Partition 1” when the VM is restarted in Partition 2?

The same can be asked for a traditional environment, only difference being that you wouldn’t see those “disk groups” in the bottom but a single datastore. In that case a VM can be restarted when a disk lock is lost… What happens to the VM in Partition 1 that has lost access to its disk? Does the isolation response kick in? Well, if you have vSphere 6.0 then potentially VMCP can help, because if you have a single datastore and you’ve lost access to it (APD) then the APD response can be triggered. But if you don’t have vSphere 6.0 or don’t have VMCP configured, or if you have VSAN, what would happen? Well, first of all, it is a partition scenario and not an isolation scenario. On both sides of the partition HA will have a master and hosts will be able to ping each other, so there is absolutely no reason to invoke the “isolation response” as far as HA is concerned. The VM will be restarted in Partition 2 while it is also still running in Partition 1. You will either need to kill it manually in Partition 1, or you will need to wait until the partition is lifted. When the partition is lifted the kernel will realize it no longer holds the lock (as it has lost it to another host) and it will kill the impacted VMs instantly.
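To make the sequence of events a bit more tangible, here is a toy model of it in Python. This is purely an illustration of the behaviour described above and not how HA or the kernel is actually implemented. The `Datastore` and `Host` classes and the host / VM names are all made up for this sketch.

```python
class Datastore:
    """Toy stand-in for shared storage that tracks one lock owner per VM."""

    def __init__(self):
        self.lock_owner = {}

    def take_lock(self, vm, host):
        # The last host to take the lock wins; the previous owner loses it.
        self.lock_owner[vm] = host


class Host:
    def __init__(self, name):
        self.name = name
        self.running = set()

    def power_on(self, vm, datastore):
        datastore.take_lock(vm, self.name)
        self.running.add(vm)

    def on_partition_lifted(self, datastore):
        """Kill every VM whose lock is now held by another host."""
        lost = {vm for vm in self.running
                if datastore.lock_owner.get(vm) != self.name}
        self.running -= lost
        return lost


ds = Datastore()
host_a = Host("host-a")  # ends up in Partition 1
host_b = Host("host-b")  # ends up in Partition 2

host_a.power_on("vm-01", ds)

# Partition: host-a loses its lock and HA restarts the VM in Partition 2.
# No isolation response is triggered, so host-a keeps running its copy.
host_b.power_on("vm-01", ds)
running_on_both = "vm-01" in host_a.running and "vm-01" in host_b.running

# Partition lifted: host-a notices it no longer holds the lock.
killed = host_a.on_partition_lifted(ds)
print(running_on_both, killed, host_b.running)
```

The point of the model is the window in the middle: until the partition is lifted (or you kill the VM manually) both partitions have a running instance.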

Conservative vs Aggressive for VMCP APD response

I just finished writing the vMSC 6.0 Best Practices paper, which is about to be released, when a question came in. The question was around the APD scenario and whether the response to an APD should be set to aggressive or conservative. It’s a good question and my instinct immediately says: conservative… But should it be configured to that in all cases? If so, why on earth do we even have the aggressive method? That got me thinking. (By the way, make sure to read this article by Matt Meyer on VMCP on the vSphere blog, good post!) But before I spill the beans, what is aggressive / conservative in this case and what is this feature again?

VM Component Protection (VMCP) is new in 6.0 and it allows vSphere to respond to a scenario where a host has lost access to a storage device. (Both PDL and APD.) In previous releases vSphere was already capable of responding to PDL scenarios, but the settings weren’t really exposed in the UI. That has been done with 6.0, and the APD response has been added at the same time. Great feature if you ask me, especially in stretched environments as it will help during certain failure scenarios.

Rubrik follow up, GA and funding announcement

Two months ago I published an introduction post on Rubrik. Yesterday Rubrik announced that their platform went GA and they announced a funding round (series B) of 41 million dollars led by Greylock. I want to congratulate Rubrik on this new milestone, a major achievement, and I am sure we will hear much more from them in the months to come. For those who don’t recall, here is what Rubrik is all about:

Rubrik is building a hyperconverged backup solution and it will scale from 3 to 1000s of nodes. Note that this solution will be up and running in 15 minutes and includes the option to age out data to the public cloud. What impressed me most is that Rubrik can discover your datacenter without any agents, it scales out in a fully automated fashion and will be capable of deduplicating / compressing data but also offer the ability to mount data instantly. All of this through a slick UI, or you can leverage the REST APIs, fully programmable end-to-end.

When I published the article some people made comments that you can do the above with various other solutions and people asked why I was so excited about their solution. Well, first of all because you can do all of that from a single platform and don’t need a backup solution plus a storage solution and have multiple pieces to manage without scale-out capabilities. I like the model, the combination of what is being offered, the fact that it is a single package designed for this purpose and not glued together… But of course there is more, I just couldn’t talk about it yet. I am not gonna go into an extreme amount of detail as Cormac wrote an excellent piece here and there is this great blog from Chris, who is a user of the product, which explains the value of the solution. (Always nice to see by the way that people read your article and share their experience as well in return…)

I do want to touch on a couple of things which I feel set Rubrik apart. (And there may be others who do this / offer this, but I haven’t been briefed by them.)

  • Global search across all data
    • “Google-alike” search, which means you start typing the name of a file in the UI of any VM and while typing the UI already presents a list of potential files you are looking for. Then when it shows the right file you click it and it presents a list of options. The file with this name could of course be on one or many VMs, you can pick which one you want and select from which point in time. When I was an admin I was often challenged with this problem “I deleted a file, I know the name… but no clue where I stored it, can you recover it?”. Well that is no problem any longer with global search, just type the name and restore it.
  • True Scale Out
    • I’d already highlighted this, but I agree with Scott Lowe that there is “scale-out” and there is “Scale-Out”. In the case of Rubrik we are talking scale out with capital S and capital O. Not just from a capacity stance, but also when it comes to (as Scott points out) task management and the ability to run any task anywhere in the cluster. So with each node you add you aren’t just scaling capacity, but also performance on all fronts. No single choking point with Rubrik as far as I can tell.
  • Miscellaneous, stuff that people take for granted… but does matter
    • API-Driven – Not something you would expect I would get excited about. And it seems such an obvious thing, but Rubrik’s solution can be configured and managed through the API they expose. Note that every single thing you see in the UI can be done through the API, the UI is simply an API client.
    • Well-performing instant mount through the use of flash and serving the cluster up as a scale-out NFS solution to any vSphere host in your environment. Want to access a VM that was backed up? Mount it!
    • Cloud archiving… Yes others offer this functionality I know. I still feel it is valuable enough to mention that Rubrik does offer the option to archive data to S3 for instance.

Of course there is more to Rubrik than what I just listed, read the articles by Scott, Cormac and Chris to get a good overview… Or just contact Rubrik and ask for a demo.

DRS rules still active when DRS disabled?

I just received a question around DRS rules and why they are still active when DRS is disabled. I was under the impression this was something I already blogged about, but I cannot find it. I know some others did, but they reported this behaviour as a bug… which it isn’t actually.

Below is a screenshot of the VM/Host Rules screen for vSphere 6.0, it allows you to create rules for clusters… Now note I said “clusters”, not DRS specifically. In 6.0 the wording in the UI has changed to align with the functionality vSphere offers. These are not DRS rules, but rather cluster rules. Whether you use HA or DRS, these rules can be used when either of the two is configured.

Note that not all types of rules will automatically be respected by vSphere HA. One thing which you can now also do in the UI is specify if HA should ignore or respect rules, very useful if you ask me and makes life a bit easier: