
Yellow Bricks

by Duncan Epping


vSphere

DQLEN changes, what is going on?

Duncan Epping · Mar 5, 2019 ·

I had a question this week on Twitter about the fact that DQLEN in esxtop changes to values well below what it was expected to be (30) for a host. There was latency seen and experienced by the VMs, so the question was: why is this happening, and wouldn’t a lower DQLEN make things worse?

My first question: do you have SIOC enabled? The answer was “yes”, and this is (most likely) what is causing the DQLEN changes. (What else could it be? Adaptive Queueing, for instance.) When SIOC is enabled, it automatically changes DQLEN whenever the configured latency threshold is exceeded, based on the number of VMs per host and the number of shares. DQLEN is changed to ensure a noisy-neighbor VM cannot claim all I/O resources. I described how that works in this post from 2010 on Storage IO Fairness.
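To make that mechanism a bit more tangible, below is a minimal Python sketch of share-based queue depth throttling. This is purely conceptual and not the actual SIOC implementation; the maximum DQLEN, latency threshold, and share values are made-up examples.

```python
# Conceptual sketch of share-based queue depth throttling (not the real SIOC code).
# MAX_DQLEN, the threshold, and the share numbers below are hypothetical examples.

MAX_DQLEN = 64
LATENCY_THRESHOLD_MS = 30  # configured congestion threshold


def throttled_dqlen(measured_latency_ms, host_shares, all_hosts_shares):
    """Return the device queue depth a host would end up with.

    Below the latency threshold the host keeps the full queue depth. Once the
    threshold is exceeded, the queue 'budget' is divided across hosts
    proportionally to the shares of the VMs they run.
    """
    if measured_latency_ms <= LATENCY_THRESHOLD_MS:
        return MAX_DQLEN
    fair_fraction = host_shares / sum(all_hosts_shares)
    # Never throttle below a small minimum so I/O keeps flowing.
    return max(4, int(MAX_DQLEN * fair_fraction))


# Example: two hosts, one running high-share VMs and one running low-share VMs.
shares_per_host = [2000, 500]
for shares in shares_per_host:
    print(shares, "shares ->", throttled_dqlen(35, shares, shares_per_host))
```

In other words, once the threshold is exceeded, a host running VMs with more shares keeps a larger slice of the queue, which is exactly why you can see DQLEN drop well below the value you would normally expect.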

How do you solve this problem? First of all, try to identify the source of the problem. This could be a single VM (or multiple VMs), but it could also be that the storage array is constantly running at its peak, or that backend services like replication are causing a slowdown. Typically it is one or a few VMs causing the load; try to find out which VMs are pushing the storage system and look for alternatives. Of course, that is easier said than done, as you may not have any expansion possibilities in the current solution. Offloading some of the I/O to a caching solution (Infinio, for instance) could be an option, and replacing the current solution with a more capable system is another.

Changed advanced setting VSAN.ClomRepairDelay and upgrading to 6.7 u1? Read this…

Duncan Epping · Feb 6, 2019 ·

If you changed the advanced setting VSAN.ClomRepairDelay to anything other than the default 60 minutes, there’s a caveat during the upgrade to 6.7 U1 you need to be aware of. When you do this upgrade the setting is reset, meaning the value is configured once again to 60 minutes. It was reported on Twitter by “Justin Bias” this week, and I tested it in the lab and indeed experienced the same behavior: I set my value to 90, and after an upgrade from 6.7 to 6.7 U1 it was back at the default of 60.

Why did this happen? Well, in vSAN 6.7 U1 we introduced a new cluster-wide setting. At the cluster level, under “Configure >> vSAN >> Services”, you now have the option to set the “Object Repair Time” for the full cluster, instead of doing this on a host-by-host basis. Hopefully this will make your life a bit easier.

Note that when you make the change globally, it appears that the Advanced Settings UI is not updated automatically. The change is, however, committed to the host; this is just a UI bug at the moment and will be fixed in a future release.
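If you want to verify what the hosts are actually configured with after the upgrade (and work around the UI quirk mentioned above), you can query the advanced setting directly through the API. Below is a minimal pyVmomi sketch; the vCenter address, credentials, and the unverified SSL context are placeholders for a lab environment.

```python
# Sketch: report the VSAN.ClomRepairDelay advanced setting for every host.
# The vCenter address and credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; use proper certificates in production
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        for opt in host.configManager.advancedOption.QueryOptions("VSAN.ClomRepairDelay"):
            print(host.name, opt.key, "=", opt.value)
    view.Destroy()
finally:
    Disconnect(si)
```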

HA Admission Control Policy: Dedicated Failover Hosts

Duncan Epping · Feb 5, 2019 ·

This week I received some questions on the topic of HA Admission Control. A customer had a cluster configured with the Dedicated Failover Hosts admission control policy, and they had no clue why. The cluster had been around for a while and had been configured by a different admin, who had since left the company. As they upgraded the environment they noticed it was configured with an admission control policy they never used anywhere else, but why? Well, of course, the design was documented, but no one documented the design decision, so that didn’t really help. So they came to me and asked what exactly it did and why you would use it.

Let’s start with that last question: why would you use it? Normally you would not, and you should not. Forget about it, unless you have a specific use case, which I will discuss later. So what does it do?

When you designate hosts as failover hosts, they will not participate in DRS and you will not be able to run VMs on these hosts! Not even in a two-host cluster when placing one of the two in maintenance mode. These hosts are literally reserved for failover situations. HA will attempt to use these hosts first to fail over the VMs. If, for whatever reason, this is unsuccessful, it will attempt a failover on any of the other hosts in the cluster. For example, when two hosts fail, including the host designated as failover host, HA will still try to restart the impacted VMs on the host that is left. Although this host was not a designated failover host, HA will use it to limit downtime.

In the UI, you simply select this admission control policy and then add the hosts. As mentioned earlier, the hosts added to this list will not be considered by DRS at all. This means that those resources go to waste unless there’s a failure. So why would you use it?

  • If you need to know where a VM runs all the time, this admission control policy dictates where the restart happens.
  • There is no resource fragmentation, as a full host’s worth of resources (or multiple hosts’ worth) is available on a single host to restart VMs on, instead of one host’s worth of resources fragmented across multiple hosts.

In some cases the above may be very useful: knowing where a VM runs at all times could be required for regulatory compliance, or could be needed for licensing reasons when you run Oracle, for instance.
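If you inherit an environment like the customer above and want to double-check which clusters use this policy and which hosts are designated as failover hosts, a quick API query will tell you. Below is a minimal pyVmomi sketch; the vCenter address and credentials are placeholders.

```python
# Sketch: list the designated failover hosts for every cluster in vCenter.
# The vCenter address and credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    for cluster in view.view:
        policy = cluster.configurationEx.dasConfig.admissionControlPolicy
        if isinstance(policy, vim.cluster.FailoverHostAdmissionControlPolicy):
            names = [h.name for h in policy.failoverHosts]
            print(cluster.name, "-> dedicated failover hosts:", names)
        else:
            print(cluster.name, "-> uses a different admission control policy")
    view.Destroy()
finally:
    Disconnect(si)
```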

Black Friday Gift: Free copy of the vSphere 6.7 Clustering Deep Dive, thanks Rubrik (ebook)

Duncan Epping · Nov 23, 2018 ·

Many asked us if the ebook would be made available for free again. Today I have the pleasure of announcing that Frank, Niels and I have worked once again with Rubrik and the VMUG organization to make the vSphere 6.7 Clustering Deep Dive book available for free! Yes, that is 0 USD / EURO, or whatever your currency is. The book signing at VMworld was wildly popular, which resulted in the follow-up discussion about the ebook.

Ready to up your vSphere game? Join us at #VMworld booth #P305 for a complimentary copy of @ClusterDeepDive + the chance to meet authors @DuncanYB @FrankDenneman @NHagoort! More info: https://t.co/0DQ7nI1wzX pic.twitter.com/7nIGEvjdBF

— Rubrik (@rubrikInc) November 2, 2018

You want a copy? All that we expect you to do is register on Rubrik’s website using your own email address. Anyway, register, start your download engines, and pick up a fresh copy of the vSphere Clustering Deep Dive here!

HA Futures: VMCP for Networking – Part 3 of 4 – (Please comment!)

Duncan Epping · Oct 30, 2018 ·

VMCP, or VM Component Protection, has been around for a while. Many of you are probably using it to mitigate storage issues. However, what if the VM network fails? Well, that is a problem right now… if the VM network fails, there’s no response from HA. Many customers consider this to be a problem. So what would we like to propose? VM Component Protection for Networking!

How would this work? Well, the plan would be to allow you to enable VM Component Protection for Networking for any network on your host. This could be the vMotion network, different VM networks, etc. On this network HA would need an IP address it could check “liveness” against, of course, very similar to how it uses the default gateway to verify “host isolation”.

On top of validating liveness through an IP address, HA should of course also monitor the physical NIC. If either of the two fails, HA should take action immediately. What this action will be depends on the type of failure that has occurred. We are considering the following two types of responses to a failure:

  1. If vMotion still works, migrate the VM from impacted host to a healthy host
  2. If vMotion doesn’t work, restart the impacted VM on a healthy host

In addition to monitoring the health of the physical NIC, HA could also use in-guest/VM monitoring techniques to monitor the network route from within the VM to a certain address/gateway. Would this technique be useful?
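To illustrate the proposal, here is a purely conceptual Python sketch of the decision flow described above. None of this exists in vSphere today; the function and the checks are made up solely to show how the two responses would be chosen.

```python
# Purely conceptual sketch of the proposed VMCP-for-networking response logic.
# Nothing here is an existing vSphere API; it only illustrates the idea.

def respond_to_vm_network_failure(vm_name, vmotion_available):
    """Pick the proposed HA response when a VM network is declared dead."""
    if vmotion_available:
        # Preferred response: live-migrate the VM, no downtime.
        return f"vMotion {vm_name} to a healthy host"
    # Fallback: restart the VM on a healthy host, accepting brief downtime.
    return f"Restart {vm_name} on a healthy host"


print(respond_to_vm_network_failure("web01", vmotion_available=True))
print(respond_to_vm_network_failure("web01", vmotion_available=False))
```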

What do you think? Please provide feedback/comments below, even if it is just a “yes, please!” Please help shape the future of HA!

