vMSC for 6.0, any new recommendations?

I am currently updating the vSphere Metro Storage Cluster best practices white paper. Over the last two weeks I received various questions about whether there are any new recommendations for vMSC with vSphere 6.0. I have summarized the recommendations below for your convenience; the white paper is being reviewed and I am updating the screenshots, so hopefully it will be done soon.

  • In order to allow vSphere HA to respond to both an APD and a PDL condition, vSphere HA needs to be configured in a specific way: VMware recommends enabling VM Component Protection (VMCP) after the cluster has been created. (A scripted sketch of the settings below follows this list.)
  • The configuration for PDL is basic. In the “Failure conditions and VM response” section you can configure the response to a detected PDL condition. VMware recommends setting this to “Power off and restart VMs”. When this condition is detected, a VM will be restarted instantly on a healthy host within the vSphere HA cluster.
  • When an APD condition is detected, a timer is started. After 140 seconds the APD condition is officially declared and the device is marked as having hit the APD timeout. Once those 140 seconds have passed, HA starts counting; the default HA timeout is 3 minutes. When the 3 minutes have passed, HA will restart the impacted virtual machines, but you can configure VMCP to respond differently if desired. VMware recommends configuring it to “Power off and restart VMs (conservative)”.
    • Conservative refers to the likelihood of HA being able to restart VMs. When set to “conservative”, HA will only restart the VM impacted by the APD if it knows another host can restart it. With “aggressive”, HA will try to restart the VM even if it doesn’t know the state of the other hosts, which could lead to a situation where your VM is not restarted because no host has access to the datastore the VM is located on.
  • It is also good to know that if the APD is lifted and access to the storage is restored before the timeout has passed, HA will not unnecessarily restart the virtual machine, unless you explicitly configure it to do so. If a response is desired even when the environment has recovered from the APD condition, then “Response for APD recovery after APD timeout” should be configured to “Reset VMs”. VMware recommends leaving this setting disabled.
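For those who want to script this, below is a minimal pyVmomi sketch of the above recommendations as I understand the 6.0 API; the vCenter address, credentials and cluster name are placeholders, so adjust them for your environment:

import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

# Connect to vCenter (placeholder address and credentials)
si = SmartConnect(host='vcenter.local', user='administrator@vsphere.local',
                  pwd='VMware1!', sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Find the cluster by name ('StretchedCluster' is a placeholder)
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == 'StretchedCluster')
view.Destroy()

# The recommended VMCP responses: "Power off and restart VMs" for PDL,
# "Power off and restart VMs (conservative)" for APD, no reset on APD recovery
vmcp = vim.cluster.VmComponentProtectionSettings(
    vmStorageProtectionForPDL='restartAggressive',
    vmStorageProtectionForAPD='restartConservative',
    vmReactionOnAPDCleared='none')

# Enable VM Component Protection on the cluster and apply the defaults
spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(
        vmComponentProtecting='enabled',
        defaultVmSettings=vim.cluster.DasVmSettings(
            vmComponentProtectionSettings=vmcp)))
cluster.ReconfigureComputeResource_Task(spec, modify=True)

Note that in the API the PDL response “Power off and restart VMs” maps to “restartAggressive”, as PDL has no conservative option.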

ForceAffinePowerOn, what is it?

I’ve seen a lot of confusion around the ForceAffinePowerOn setting, and even the VMware documentation is incorrect about what this feature is and does. First and foremost: ForceAffinePowerOn is an advanced DRS setting (yes, I filed a doc bug for it). I’ve seen many people stating it is an HA setting, but it is not. You need to configure it in the advanced settings section of your DRS configuration.

Secondly, ForceAffinePowerOn can be used to ensure VM to VM affinity rules are respected when powering on a VM. ForceAffinePowerOn has absolutely nothing to do with VM to VM anti-affinity rules, it only applies to “affinity”.

Let’s be crystal clear (a configuration sketch follows the list):

  • When ForceAffinePowerOn is set to 0, it means that VM to VM affinity can be dropped if necessary to power on a VM.
  • When ForceAffinePowerOn is set to 1, it means that VM to VM affinity should not be dropped and power-on should fail if the rule cannot be respected.
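For completeness, here is a minimal pyVmomi sketch of setting this option on a cluster, assuming a “cluster” object looked up as in the earlier sketch; the key and values come from the explanation above, the rest is my assumption:

from pyVmomi import vim

# Set ForceAffinePowerOn in the DRS (not HA!) advanced options
spec = vim.cluster.ConfigSpecEx(
    drsConfig=vim.cluster.DrsConfigInfo(
        option=[vim.option.OptionValue(key='ForceAffinePowerOn', value='1')]))
cluster.ReconfigureComputeResource_Task(spec, modify=True)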

I hope that helps!

DRS rules still active when DRS disabled?

I just received a question about DRS rules and why they are still active when DRS is disabled. I was under the impression this was something I had already blogged about, but I cannot find it. I know some others did, but they reported this behaviour as a bug… which it actually isn’t.

Below is a screenshot of the VM/Host Rules screen for vSphere 6.0, which allows you to create rules for clusters… Note that I said “clusters”, not DRS specifically. In 6.0 the wording in the UI has changed to align with the functionality vSphere offers. These are not DRS rules but cluster rules: they can be used when either HA or DRS, or both, is configured.

Note that not all types of rules will automatically be respected by vSphere HA. One thing you can now also do in the UI is specify whether HA should ignore or respect rules, which is very useful if you ask me and makes life a bit easier.
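As an illustration, here is a minimal pyVmomi sketch that creates such a cluster rule (a VM-VM anti-affinity rule in this case) and asks HA to respect VM-VM anti-affinity through the das.respectVmVmAntiAffinityRules advanced option that 6.0 introduced; the “cluster”, “vm1” and “vm2” objects are assumed to have been looked up beforehand:

from pyVmomi import vim

# A VM-VM anti-affinity rule keeping two VMs on different hosts
rule = vim.cluster.AntiAffinityRuleSpec(
    name='keep-apart', enabled=True, vm=[vm1, vm2])

# Add the rule and ask HA to respect VM-VM anti-affinity on restart
spec = vim.cluster.ConfigSpecEx(
    rulesSpec=[vim.cluster.RuleSpec(operation='add', info=rule)],
    dasConfig=vim.cluster.DasConfigInfo(
        option=[vim.option.OptionValue(
            key='das.respectVmVmAntiAffinityRules', value='true')]))
cluster.ReconfigureComputeResource_Task(spec, modify=True)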

What does support for vMotion with active/active (a)sync mean?

Having seen so many cool features being released over the last 10 years by VMware, you sometimes wonder what more they can do. It is amazing to see what level of integration we’ve seen between the different datacenter components. Many of you have seen the announcements around Long Distance vMotion support by now.

When I saw this slide, something stood out to me instantly, and that is this part:

  • Replication Support
    • Active/Active only
      • Synchronous
      • Asynchronous

What does this mean? Well, first of all, “active/active” refers to “stretched storage”, aka vSphere Metro Storage Cluster. So when it comes to long distance vMotion, some changes have been introduced for synchronously replicated stretched storage. (Note that “active/active” storage is not required for long distance vMotion.) With stretched storage, writes to a volume can come from both sides at any time and will be replicated synchronously. Some optimizations have been made to the vMotion process to avoid writes during the switchover, so that replication traffic does not delay the process.

For active/active asynchronous the story is a bit different. Here again we are talking about “stretched storage”, but in this case the asynchronous flavour. One important aspect which was not mentioned in the deck is that async requires Virtual Volumes. Now, at the time of writing, there is no vendor yet with a VVol-capable solution that offers active/active async. More importantly, is this process any different from the sync process? Yes it is!

During the migration of a virtual machine backed by virtual volumes in an “active/active async” configuration, the array is informed that a migration of the virtual machine is taking place and is requested to switch from asynchronous to synchronous replication. This is to ensure that the destination is in sync with the source when the VM is switched over from side A to side B. Besides switching from async to sync, the array is also informed when the migration has completed. This allows the array to switch the “bias” of the VM, for instance; especially in a stretched environment this is important to ensure availability.

I can’t wait for the first vendor to announce support for this awesome feature!

Another way to fix your non-compliant host profile

I found out there is another way to fix your non-compliant host profile problems with vSphere 6.0 when you have SAS drives which are detected as shared storage while they are not. This method is a bit more complicated, though, and there is a command-line script you will need to use: /bin/sharedStorageHostProfile.sh. It works as follows:

  • Run the following on your first host to dump its storage details into a folder you created:
    /bin/sharedStorageHostProfile.sh local /folder/youcreated1/
  • Run the following to dump the storage details of your second host into another folder; you can run this from your first host if you have SSH enabled on the second:
    /bin/sharedStorageHostProfile.sh remote /folder/youcreated2/ <name or ip of remote host>
  • Copy the output of the second host to the folder where the output of your first host is stored. You will need to copy the file “remote-shared-profile.txt”.
  • Now you can compare the outcomes by running:
    /bin/sharedStorageHostProfile.sh compare /folder/youcreated1/
  • After comparing, you can apply the configuration as follows:
    /bin/sharedStorageHostProfile.sh configure /folder/youcreated1/
  • Now the disks which are listed as clusterwide resources but are not shared between the hosts will be configured as non-shared resources. If you want to check what will be changed before running the command, you can simply “more” the file the commands are stored in:
    more esxcli-sharing-reconfiguration-commands.txt
    esxcli storage core device setconfig -d naa.600508b1001c2ee9a6446e708105054b --shared-clusterwide=false
    esxcli storage core device setconfig -d naa.600508b1001c3ea7838c0436dbe6d7a2 --shared-clusterwide=false

You may wonder by now if there isn’t an easier way; well, yes, there is. You can do all of the above by running the following single command. I preferred to go over the steps first so that at least you know what is happening.

/bin/sharedStorageHostProfile.sh automatic <name-or-ip-of-remote-host>
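If you need to repeat this across many hosts, something like the following Python sketch, which simply runs the automatic mode over SSH using paramiko, could help; the host names and credentials are of course placeholders:

import paramiko

# Connect to the first host (placeholder name and credentials)
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect('esxi01.local', username='root', password='VMware1!')

# Run the one-shot automatic mode against the second host
stdin, stdout, stderr = ssh.exec_command(
    '/bin/sharedStorageHostProfile.sh automatic esxi02.local')
print(stdout.read().decode())
ssh.close()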

After you have done this (using either the first or the second method), you can create your host profile from your first host. Although the other methods I described in yesterday’s post are a bit simpler, I figured I would share this one as well, as you never know when it may come in handy!