Yellow Bricks

by Duncan Epping

Module MonitorLoop power on failed error when powering on VM on vSphere

Duncan Epping · Jun 12, 2018 ·

I was playing in the lab for our upcoming vSphere Clustering Deepdive book and I ran in to this error when powering on a VM. I had never seen it before myself, so I was kind of surprised when I figured out what it was referring to. The error message is the following:

Module MonitorLoop power on failed when powering on VM

Think about that for a second: if you have never seen it before, I bet you have no idea what it is about. That is not strange, as the message doesn’t give you a single clue.

If you go to the event, however, there’s a big clue right there: the swap file can’t be extended from 0KB to whatever size it needs to be. In other words, you are probably running out of disk space on the datastore the VM is stored on. In this case I removed some obsolete VMs and then powered on the VM that had the issue without any problems. So if you see this “Module MonitorLoop power on failed when powering on VM” error, check the free capacity on the datastore the VM sits on!
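As a rough illustration, a minimal pyVmomi sketch along these lines could list capacity and free space per datastore and flag anything that is nearly full; the vCenter address, credentials and the 5% threshold below are just placeholders:

    # Minimal sketch: list datastore capacity/free space and flag anything nearly full.
    # Assumptions: pyVmomi installed, placeholder vCenter address and credentials.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim
    ctx = ssl._create_unverified_context()          # lab only, skips cert validation
    si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                      pwd="changeme", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.Datastore], True)
        for ds in view.view:
            cap_gb = ds.summary.capacity / (1024 ** 3)
            free_gb = ds.summary.freeSpace / (1024 ** 3)
            flag = "  <-- low on free space" if cap_gb and free_gb / cap_gb < 0.05 else ""
            print(f"{ds.name}: {free_gb:.1f} GB free of {cap_gb:.1f} GB{flag}")
    finally:
        Disconnect(si)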

Strange error message, for a simple problem. Yes, I will file a request to get this changed.



Disable DRS for a VM

Duncan Epping · Mar 28, 2018 ·

I have been having discussions with a customer who needs to disable DRS for a particular VM. I have written about disabling DRS for a host in the past, but not for a VM; well, I probably have at some point, but that was years ago. The goal here is to ensure that DRS won’t move the VM around, while HA can still restart it. Of course you can create VM-to-Host rules, and you can create “must rules”. When you create must rules, this can lead to an issue when the host on which the VM is running fails, as HA will not restart it. Why? Well, it is a “must rule”, which means that HA and DRS must comply with the rule specified. But there’s a solution: look at the screenshot below.

In the screenshot you see the “Automation Level” for each VM in the list; this is the DRS automation level. (Yes, the name will change in the H5 client, making it more obvious what it is.) You add VMs by clicking the green plus sign, then select the desired automation mode for those VMs and click OK. You can of course completely disable DRS for the VMs which should never be vMotioned by DRS; in that case those “disabled” VMs are not considered at all during contention. You can also set the automation mode to Manual or Partially Automated for those VMs, which at least gives you initial placement, but has the downside that the VMs are still considered for migration by DRS during contention. This could lead to a situation where DRS recommends that a particular VM be migrated without you actually migrating it, which in turn could lead to VMs not getting the resources they require. So this is a choice you have to make: do I need initial placement or not?

If you prefer the VMs to stick to a certain host, I would highly recommend setting VM/Host rules for those VMs using “should rules”, which define on which host the VM should run. Combined with the new per-VM automation level this results in the VM being placed correctly, but not migrated by DRS when there’s contention. On top of that, it allows HA to restart the VM anywhere in the cluster! Note that with the Manual automation level DRS will ask you whether it is okay to place the VM on a certain host, while with Partially Automated DRS will do the initial placement for you. In both cases balancing will not happen automatically for those VMs, but recommendations will be made, which you can ignore. (Note that I did not say “safely ignore”, as it may not be safe.)
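If you prefer the API over the client, adding such a per-VM override could look roughly like the pyVmomi sketch below; the vCenter details, cluster name and VM name are placeholders, and enabled=False corresponds to disabling DRS for that VM only.

    # Rough sketch: add a per-VM DRS override so DRS leaves this VM alone
    # while HA can still restart it. Assumptions: pyVmomi, placeholder
    # vCenter address/credentials, cluster name and VM name.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim
    def find_by_name(content, vimtype, name):
        # Walk the inventory and return the first object of this type with this name.
        view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
        return next(obj for obj in view.view if obj.name == name)
    ctx = ssl._create_unverified_context()
    si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                      pwd="changeme", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        cluster = find_by_name(content, vim.ClusterComputeResource, "Cluster-01")
        vm = find_by_name(content, vim.VirtualMachine, "important-vm")
        # enabled=False disables DRS entirely for this VM; use
        # behavior=vim.cluster.DrsConfigInfo.DrsBehavior.manual (or partiallyAutomated)
        # instead if you still want initial placement.
        override = vim.cluster.DrsVmConfigInfo(key=vm, enabled=False)
        spec = vim.cluster.ConfigSpecEx(
            drsVmConfigSpec=[vim.cluster.DrsVmConfigSpec(operation="add", info=override)])
        cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
    finally:
        Disconnect(si)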

Changing advanced vSphere FT related settings, is that supported?

Duncan Epping · Feb 1, 2018 ·

This week I received a question about changing the values of vSphere FT-related advanced settings. This customer is working on an environment where uptime is key. Of course the application layer is one side of it, but they also want additional availability from an infrastructure perspective, which means vSphere HA and vSphere FT are key.

They have various VMs they need to enable FT on; these are vSMP VMs (in this case dual vCPU). Right now each host is limited to 4 FT VMs and at most 8 FT vCPUs, which is controlled by two advanced settings called “das.maxftvmsperhost” and “das.maxFtVCpusPerHost”. The default values for these are, obviously, 4 and 8. The question was: can I edit these and still have a supported configuration? And why 4 and 8 in the first place?

I spoke to the product team about this and the answer is: yes, you can safely edit these. The defaults were set based on the typical bandwidth and resource constraints customers have. An FT VM easily consumes between 1 and 3Gbps of bandwidth, meaning that if you dedicate a 10Gbps link to FT you will fit roughly 4 VMs. I say roughly because, of course, the workload matters: CPU, memory and I/O pattern.

If you have a 40Gbps NIC, and you have plenty of cores and memory, you could increase those maximums for FT VMs and FT vCPUs per host. Note, however, that if you run into problems, VMware GSS may ask you to revert to the defaults, simply to rule out that the issues are caused by this change, as VMware tests with the default values.
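For reference, pushing these advanced options into the cluster configuration through the API could look roughly like the pyVmomi sketch below; the vCenter details, cluster name and the example values of 8 and 16 are placeholders, not recommendations.

    # Rough sketch: set the two HA advanced options via the cluster's DAS config.
    # Assumptions: pyVmomi, placeholder vCenter address/credentials and cluster name,
    # example values of 8 FT VMs and 16 FT vCPUs per host.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim
    ctx = ssl._create_unverified_context()
    si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                      pwd="changeme", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.ClusterComputeResource], True)
        cluster = next(c for c in view.view if c.name == "Cluster-01")
        das = vim.cluster.DasConfigInfo(option=[
            vim.option.OptionValue(key="das.maxftvmsperhost", value="8"),
            vim.option.OptionValue(key="das.maxFtVCpusPerHost", value="16"),
        ])
        cluster.ReconfigureComputeResource_Task(
            spec=vim.cluster.ConfigSpecEx(dasConfig=das), modify=True)
    finally:
        Disconnect(si)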

UPDATE to this content can be found here: https://www.yellow-bricks.com/2022/11/18/can-you-exceed-the-number-of-ft-enabled-vcpus-per-host-or-number-of-ft-enabled-vcpus-per-vm/

vSphere HA heartbeat datastores, the isolation address and vSAN

Duncan Epping · Nov 8, 2017 ·

I’ve written about vSAN and vSphere HA various times, but I don’t think this has been explicitly called out before. Cormac and I were doing some tests this week and noticed something. When we were looking at the results I realized I had described it in my HA book a long time ago, but it is hidden away so deep that probably no one has noticed it.

In a traditional environment, when you enable HA, heartbeat datastores are automatically selected for you. These heartbeat datastores are used by the HA primary host to determine what has happened to a host which is no longer reachable over the management network. In other words, when a host is isolated it will communicate this to the HA primary using the heartbeat datastores. It will also inform the HA primary which VMs were powered off as a result of this isolation event (or not powered off, when the isolation response is not configured).

Now, with vSAN, the management network is not used for communication between the hosts; the vSAN network is used instead. Typically in a vSAN environment there’s only vSAN storage, so there are no heartbeat datastores. As such, when a host is isolated it cannot communicate this to the HA primary. Remember, the network is down and there is no access to the vSAN datastore either, so the host cannot communicate through that. HA will still function as expected, though: if you set the isolation response to power off, the VMs will be killed and restarted. That is, if isolation is declared.

So when is isolation declared? A host declares itself isolated when:

  1. It is not receiving any communication from the primary
  2. It cannot ping the isolation address

Now, if you have not set any advanced settings, the default gateway of the management network will be the isolation address. Just imagine the vSAN network being isolated on a given host while, for whatever reason, the management network is not. In that scenario isolation is not declared: the host can still ping the isolation address using the management network VMkernel interface. HOWEVER… vSphere HA will restart the VMs anyway. The VMs have lost access to disk, so the locks on the VMDKs are lost. HA notices the host is gone, which must mean the VMs are dead since the locks are lost, so let’s restart them.

That is when you can end up in the situation where the VMs are running on the isolated host and also somewhere else in the cluster, both with the same MAC address and the same name / IP address. Not a good situation. Now, if you had datastore heartbeats enabled, this would be prevented: the isolated host would inform the primary that it is isolated, but it would also inform the primary about the state of the VMs, which would be powered on. The primary would then decide not to restart the VMs. However, the VMs which are running on the isolated host are more or less useless, as they cannot write to disk anymore.

Let’s describe what we tested and what the outcome was in a form that is a bit easier to consume: a table.

Isolation Address     | Datastore Heartbeats | Observed behavior
IP on vSAN network    | Not configured       | Isolated host cannot ping the isolation address, isolation is declared, VMs are killed and restarted
IP on management netw.| Not configured       | Isolated host can ping the isolation address, isolation is not declared, yet the rest of the cluster restarts the VMs even though they are still running on the isolated host
IP on vSAN network    | Configured           | Isolated host cannot ping the isolation address, isolation is declared, VMs are killed and restarted
IP on management netw.| Configured           | VMs are not powered off and not restarted, as the “isolated host” can still ping the management network and the datastore heartbeat mechanism is used to inform the primary about the VM state; so the primary knows the HA network is not working, but does not power off or restart the VMs

So what did we learn, and what should you do when you have vSAN? Always use an isolation address that is in the same network as vSAN! This way, during an isolation event, the isolation is validated using the vSAN VMkernel interface. Always set the isolation response to power off (my personal opinion, based on testing). This avoids the scenario of duplicate MAC addresses / IP addresses / names on the network when a single network is isolated for a specific host! And if you also have traditional storage, you can enable heartbeat datastores. It doesn’t add much in terms of availability, but it does allow the HA hosts to communicate state through the datastore.

PS1: For those who don’t know: HA is configured to automatically select heartbeat datastores. In a vSAN-only environment you can disable this by selecting “Use datastores only from the specified list” in the HA interface and then setting “das.ignoreInsufficientHbDatastore = true” in the advanced HA settings.

PS2: In a non-routable vSAN network environment you could create a Switch Virtual Interface (SVI) on the physical switch. This gives you an IP on the vSAN segment that can be used as the isolation address, configured through the advanced setting das.isolationaddress0.
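Tying PS1 and PS2 together, setting these advanced options through the API could look roughly like the pyVmomi sketch below; the vCenter details, cluster name and IP address are placeholders, and das.usedefaultisolationaddress (which stops HA from also using the management network default gateway as an isolation address) is a standard HA advanced option that is commonly set alongside das.isolationaddress0.

    # Rough sketch: push the isolation-address and heartbeat-datastore advanced
    # options as cluster DAS options. Assumptions: pyVmomi, placeholder vCenter
    # address/credentials, cluster name and vSAN-segment IP.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim
    ctx = ssl._create_unverified_context()
    si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                      pwd="changeme", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.ClusterComputeResource], True)
        cluster = next(c for c in view.view if c.name == "Cluster-01")
        das = vim.cluster.DasConfigInfo(option=[
            # IP on the vSAN segment (e.g. the SVI from PS2) as isolation address
            vim.option.OptionValue(key="das.isolationaddress0", value="172.16.10.1"),
            # don't also use the management network default gateway for the check
            vim.option.OptionValue(key="das.usedefaultisolationaddress", value="false"),
            # suppress the warning about having too few heartbeat datastores
            vim.option.OptionValue(key="das.ignoreInsufficientHbDatastore", value="true"),
        ])
        cluster.ReconfigureComputeResource_Task(
            spec=vim.cluster.ConfigSpecEx(dasConfig=das), modify=True)
    finally:
        Disconnect(si)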

vSAN 6.x customer? vSphere 6.0 Update 3 is out

Duncan Epping · Feb 26, 2017 ·

Are you a vSAN 6.x customer? vSphere 6.0 Update 3 is out! There are a bunch of important fixes and improvements in Update 3 (checksumming performance, for instance), so I would highly recommend looking into it and testing it out.

  • vSAN Details: https://kb.vmware.com/kb/2149127
  • vCenter Server download: https://my.vmware.com/web/vmware/details?downloadGroup=VC60U3&productId=491&rPId=14487
  • ESXi download: https://my.vmware.com/web/vmware/details?downloadGroup=ESXI60U3&productId=491&rPId=14487