
Yellow Bricks

by Duncan Epping



Trigger APD on iSCSI LUN on vSphere

Duncan Epping · Jun 21, 2018 ·

I was testing various failure scenarios in my lab today for the vSphere Clustering Deepdive session I have scheduled for VMworld. I needed some screenshots and log files of when a datastore hits an APD scenario. For those who don’t know, APD stands for All Paths Down. In other words: the storage is inaccessible and ESXi doesn’t know what has happened or why. vSphere HA has the ability to respond to that kind of failure. I wanted to test this, but my setup was fairly simple and virtual, so I couldn’t unplug any cables. I also couldn’t make configuration changes to the iSCSI array, as that would trigger a PDL (permanent device loss) instead. So how do you test an APD scenario?

After trying various things, like killing the iSCSI daemon (it gets restarted automatically with no impact on the workload), I bumped into the following command, which triggered the APD:

  • SSH into the host on which you want to trigger the APD and run the following command:
    esxcli iscsi session remove -A vmhba65
  • Make sure, of course, to replace “vmhba65” with the name of your iSCSI adapter (see the sketch below)
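For reference, below is a minimal sketch of the full sequence on the ESXi shell. The adapter name vmhba65 is just the example from my lab, so substitute your own:

# List the iSCSI adapters to find the vmhba name of your iSCSI adapter
esxcli iscsi adapter list

# Optionally, list the active sessions on that adapter first
esxcli iscsi session list -A vmhba65

# Remove all iSCSI sessions on the adapter, which takes all paths to the iSCSI LUNs down (APD)
esxcli iscsi session remove -A vmhba65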

This triggered the APD, as witnessed in the fdm.log and vmkernel.log, and ultimately resulted in vSphere HA killing the impacted VM and restarting it on a healthy host. Anyway, I just wanted to share this as I am sure there are others who would like to test APD responses in their labs or before their environment goes into production.
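If you want to verify the APD state yourself, a quick way to do so (assuming the default log locations on the host) is the following:

# The VMkernel log shows the APD start and timeout events for the device
grep -i apd /var/log/vmkernel.log

# The FDM (vSphere HA agent) log shows the HA response to the APD
grep -i apd /var/log/fdm.log

# The device list also shows the state of the affected device
esxcli storage core device list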

There may be other easy ways as well; if you know any, please share them in the comments section.

Module MonitorLoop power on failed error when powering on VM on vSphere

Duncan Epping · Jun 12, 2018 ·

I was playing in the lab for our upcoming vSphere Clustering Deepdive book and I ran into this error when powering on a VM. I had never seen it before myself, so I was kind of surprised when I figured out what it was referring to. The error message is the following:

Module MonitorLoop power on failed when powering on VM

Think about that for a second: if you have never seen it, I bet you don’t know what it is about. Not strange, as the message doesn’t give a clue.

If you go to the event, however, there’s a big clue right there: the swap file can’t be extended from 0 KB to whatever size it needs to be. In other words, you are probably running out of disk space on the device the VM is stored on. In this case I removed some obsolete VMs and then powered on the VM that had the issue without any problems. So if you see this “Module MonitorLoop power on failed when powering on VM” error, check the free capacity on the datastore the VM sits on!
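If you prefer checking the free capacity from the command line instead of the UI, a simple sketch on the ESXi host would be:

# List all mounted datastores with their size and free space
esxcli storage filesystem list

# df works on ESXi as well and shows the same in a more compact form
df -h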


Strange error message for a simple problem. Yes, I will file a request to get it changed.



Customer Experience Improvement Program: where, when and what?

Duncan Epping · May 28, 2018 ·

I got a question on my post about the Customer Experience Improvement Program (CEIP) demo. The questions boiled down to the following:

  • What is being sent to VMware?
  • Where is the data stored by VMware?
  • When is the data sent to VMware (how often)?

The “what” question was easy to answer, as this was documented by John Nicholson on Storagehub.vmware.com for vSAN specifically. Realizing that it isn’t easy to find anywhere what CEIP data is collected, I figured I would add a link here and also repeat the summary of that article, assuming by now everyone uses a VCSA (if not, go to the link):

  1. SSH into the VCSA
  2. Run the command: cd /var/log/vmware/vsan-health/
  3. Data collected by the online health checks is written, gzipped, to the files "<uuid>cloud_health_check_data.json.gz" and "<uuid>vsan_perf_data.json.gz"
  4. You can extract the JSON content by calling "gunzip -k <gzipped-filename>" or view the contents by calling "zcat <gzipped-filename>" (see the consolidated sketch below)
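As a consolidated sketch of the steps above (the file names contain a UUID, hence the wildcards, and python is assumed to be available on the appliance for pretty-printing):

# Go to the directory where the online health check data is staged
cd /var/log/vmware/vsan-health/

# List the gzipped JSON files that get sent to VMware
ls -lh *cloud_health_check_data.json.gz *vsan_perf_data.json.gz

# Peek at the content without extracting it
zcat *cloud_health_check_data.json.gz | python -m json.tool | head -n 50

# Or extract a copy while keeping the original gzip file
gunzip -k *cloud_health_check_data.json.gz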

So that is how you view what is being stored. John also posted an example of the dataset on GitHub for those who just want to have a quick peek. Note that you need an “obfuscation map” (aka key) to make sense of the data in terms of host names / VM names / IP addresses etc. Without that, you can stare at the dataset all you want, but you won’t be able to relate it back to a customer. I would also add that we are not storing any VM/workload data; it is configuration data / feature usage / performance data. Hopefully that answers the “what” question for you.

Where is the data stored? The data is sent to “https://vcsa.vmware.com” and it ends up in VMware’s analytics cloud, which is hosted in secure data centers in the US. The frequency is a bit difficult to answer, as it fully depends on which products are in use, but to my knowledge with vSAN/vSphere it is on an hourly basis. I have asked the VMware team who owns this to create a single page/document with all of the required details, so that security teams can simply be pointed to it.
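If your security team wants to verify the outbound connectivity for this, a simple way to check whether the VCSA can reach that endpoint is something like the following from the appliance shell (curl ships with the VCSA):

# Expect an HTTP response code back rather than a timeout if the endpoint is reachable
curl -s -o /dev/null -w "%{http_code}\n" https://vcsa.vmware.com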

Hopefully I will have a follow-up soon.

vSphere HA Restart Priority

Duncan Epping · Apr 4, 2018 ·

I’ve seen more and more questions popping up about vSphere HA Restart Priority lately, so I figured I would write something about it. I already did in this post about what’s new in vSphere 6.5, and I did so in the Stretched Cluster guide. It has always been possible to set a restart priority for VMs, but pre-vSphere 6.5 this priority simply referred to the scheduling of the restart of the VM after a failure. Each host in a cluster can restart 32 VMs at the same time, so you can imagine that if the restart priority is only about scheduling VM restarts, it doesn’t really add a lot of value. (Simply because many restarts can be scheduled at the same time, the priority as such has little effect.)

As of vSphere 6.5 we have the ability to specify the priority and also specify when HA should continue with the next batch. Especially this last part is important, as it allows you to specify that HA starts with the next priority level when:

  1. Resources are allocated (default)
  2. VMs are powered on
  3. Guest heartbeat is detected
  4. App heartbeat is detected

I think these are mostly self-explanatory. Note though that “resources are allocated” means that a target host for the restart has been found by the master, so this happens within milliseconds. It is very similar for “VMs are powered on”: this also says nothing about when a VM is available, it literally is the power-on. In some cases it could take 10-20 seconds for a VM to be fully booted and the apps to be available, in other cases it may take minutes… It all depends on the services that need to be started within the VM. So if it is important for the service provided by the VM to be available before starting the next batch, then option 3 or 4 would be your best pick. Note that with option 4 you will need to have VM/Application Monitoring enabled and an application heartbeat defined within the VM. Once you have made your choice about when to start the next batch, you can simply start adding VMs to a specific level.

Instead of the 3 standard restart “buckets” you now have 5: Highest, High, Medium, Low, Lowest. Why these funny names? Well that was done in order to stay backwards compatible with vSphere 6 / 5 etc. By default all VMs will have the “medium” restart priority, and no it won’t make any difference if you change all of them to high. Simply because the restart priority is about the priority between VMs, it doesn’t change the host response times etc. In other words, changing the restart priority only makes sense when you have VMs at different levels, and usually will only make a big difference when you also change the option “Start next priority VMs when”.

So where do you change this? Well, that is pretty straightforward:

  • Click on your HA cluster and then the “Configure” tab
  • Click on “VM Overrides” and then click “Add”
  • Click on the green plus sign and select the VMs you would like to give a higher or lower priority
  • Then select the new priority and specify when the next batch should start

And if you are wondering: yes, the restart priority also applies when vCenter is not available, so you can even use it to ensure vCenter, AD and DNS are booted up first. All of this info is stored in the cluster configuration data. You can examine this on the command line, by the way, by typing the following on one of the hosts:

/opt/vmware/fdm/fdm/prettyPrint.sh clusterconfig

Note that the output is usually pretty big, so you will have to scroll through it to find what you need. If you do a search on “restartPriority”, you should be able to find the VMs for which you changed the priority. Pretty cool, right?!
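To avoid all the scrolling, you can of course pipe the output straight into a search; a simple sketch, assuming the standard busybox tools on the host:

# Show only the lines that contain the restart priority settings
/opt/vmware/fdm/fdm/prettyPrint.sh clusterconfig | grep restartPriority

# Or page through the full cluster configuration
/opt/vmware/fdm/fdm/prettyPrint.sh clusterconfig | more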

Oh, if you didn’t know yet… Frank, Niels and I are actively updating the vSphere Clustering Deep Dive. Hopefully we will have something out “soon”, as in around VMworld.

Disable DRS for a VM

Duncan Epping · Mar 28, 2018 ·

I have been having discussions with a customer who needs to disable DRS for a particular VM. I have written about disabling DRS for a host in the past, but not for a VM (well, I probably have at some point, but that was years ago). The goal here is to ensure that DRS won’t move the VM around, while HA can still restart it. Of course you can create VM to Host rules, and you can create “must rules”. When you create must rules, however, this could lead to an issue when the host on which the VM is running fails, as HA will not restart it. Why? Well, it is a “must rule”, which means that HA and DRS must comply with the rule specified. But there is a solution; look at the screenshot below.

In the screenshot you see the “automation level” for the VM in the list; this is the DRS automation level. (Yes, the name will change in the H5 client, making it more obvious what it is.) You add VMs by clicking the green plus sign. Next you select the desired automation mode for those VMs and click OK. You can of course completely disable DRS for the VMs which should never be vMotioned by DRS; in that case those “disabled VMs” are not considered at all during contention. You can also set the automation mode to Manual or Partially Automated for those VMs, which at least gives you initial placement, but has the downside that the VMs are considered for migration by DRS during contention. This could lead to a situation where DRS recommends that particular VM to be migrated, without you being able to migrate it, which in turn could lead to VMs not getting the resources they require. So this is a choice you have to make: do I need initial placement or not?

If you prefer the VMs to stick to a certain host, I would highly recommend setting VM/Host Rules for those VMs using “should rules”, which define on which host the VM should run. Combined with the new automation level this will result in the VM being placed correctly, but not migrated by DRS when there is contention. On top of that, it will allow HA to restart the VM anywhere in the cluster! Note that with the “manual” automation level DRS will ask you if it is okay to place the VM on a certain host, while with “partially automated” DRS will do the initial placement for you. In both cases balancing will not happen automatically for those VMs, but recommendations will be made, which you can ignore. (I don’t say “safely ignore”, as that may not always be safe.)


