There has been a lot of discussion in the past around Disk.SchedNumReqOutstanding, what its value should be, and how it relates to the queue depth. Jason Boche wrote a whole article about when Disk.SchedNumReqOutstanding (DSNRO) is used and when it is not, and I would explain it as follows:
When two or more virtual machines are issuing I/Os to the same datastore, Disk.SchedNumReqOutstanding will limit the number of I/Os that are issued to the LUN.
So what does that mean? It took me a while before I fully got it, so let's try to explain it with an example. This is basically how the VMware I/O scheduler (Start-Time Fair Queueing, aka SFQ) works.
You have set the queue depth for your HBA to 64 and a single virtual machine is issuing I/Os to a datastore. As it is just a single VM, up to 64 I/Os will end up in the device driver immediately. In most environments, however, LUNs are shared by many virtual machines, and in most cases these virtual machines should be treated equally. When two or more virtual machines issue I/O to the same datastore, DSNRO kicks in. However, it will only throttle the queue depth when the VMkernel has detected that the threshold of a certain counter is reached. That counter is Disk.SchedQControlVMSwitches and by default it is set to 6, meaning that the VMkernel needs to detect 6 VM switches when handling I/O before it will throttle the queue down to the value of Disk.SchedNumReqOutstanding, by default 32. (A VM switch means that the selected I/O does not come from the same VM as the previous I/O, so the VMkernel needs to see this happen 6 times.)
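To make the switch-counting behavior concrete, here is a minimal Python sketch of the logic described above. This is purely illustrative, not VMware's actual code; the constants mirror the default values of the advanced settings.

```python
# Illustrative sketch (not VMware code) of the VM-switch counting logic:
# the device queue depth is only throttled once 6 switches between
# different VMs have been observed in the I/O stream.

QUEUE_DEPTH = 64            # HBA device queue depth (as in the example)
DSNRO = 32                  # Disk.SchedNumReqOutstanding default
VM_SWITCH_THRESHOLD = 6     # Disk.SchedQControlVMSwitches default

def effective_queue_depth(io_stream):
    """Return the queue depth after observing a stream of VM IDs."""
    switches = 0
    prev_vm = None
    for vm in io_stream:
        if prev_vm is not None and vm != prev_vm:
            switches += 1
        prev_vm = vm
        if switches >= VM_SWITCH_THRESHOLD:
            return DSNRO          # throttling kicks in
    return QUEUE_DEPTH            # single-VM stream: full queue depth

# A single VM issuing I/O keeps the full queue depth:
effective_queue_depth(["A"] * 20)       # -> 64
# Two VMs interleaving trip the switch counter and throttle to DSNRO:
effective_queue_depth(["A", "B"] * 10)  # -> 32
```

Note how a handful of switches is enough: fairness is enforced as soon as the VMkernel sees the datastore is genuinely shared, not on the first interleaved I/O.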
The reason the throttling happens is that the VMkernel cannot control the order of the I/Os that have already been issued to the driver. Just imagine you have VM A issuing a lot of I/Os and another, VM B, issuing just a few. VM A would end up using most of the full queue depth all the time. Every time VM B issues an I/O it will be picked up quickly by the VMkernel scheduler (which is a different topic) and sent to the driver as soon as a slot frees up there, but it will still end up behind the 64 I/Os already in the driver, which adds significantly to its I/O latency. By limiting the number of outstanding requests we allow the VMkernel to schedule VM B's I/O sooner into the I/O stream from VM A, and thus we reduce the latency penalty for VM B.
That brings us to the second part of all the statements out there: should you really set Disk.SchedNumReqOutstanding to the same value as your queue depth? Well, if you want your I/Os processed as quickly as possible without any fairness, you probably should. But if you have mixed workloads on a single datastore and don't want virtual machines to incur excessive latency just because a single virtual machine issues a lot of I/Os, you probably shouldn't.
Is that it? No, not really; there are several questions that remain unanswered:
- What about sequential I/O in the case of Disk.SchedNumReqOutstanding?
- How does the VMkernel know when to stop using Disk.SchedNumReqOutstanding?
Let's tackle the sequential I/O question first. By default the VMkernel will allow a VM to issue up to 8 sequential commands in a row (controlled by Disk.SchedQuantum), even when it would normally seem fairer to take an I/O from another VM. This is done so as not to destroy the sequential nature of VM workloads, because I/Os to sectors near the previous I/O are handled an order of magnitude faster than I/Os to sectors far away (10x is not unusual when excluding cache effects, or when caches are small compared to the disk size). But what is considered sequential? If the next I/O is less than 2000 sectors away from the current one, it is considered sequential (controlled by Disk.SectorMaxDiff).
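The sequential-I/O check described above can be sketched as follows. Again, this is an illustration of the rule, not VMware's implementation; the constants match the defaults of Disk.SectorMaxDiff and Disk.SchedQuantum.

```python
# Illustrative sketch (not VMware code) of the sequential-I/O rule:
# an I/O counts as sequential when it lands within Disk.SectorMaxDiff
# (2000) sectors of the previous I/O, and a VM may issue up to
# Disk.SchedQuantum (8) such I/Os in a row before the scheduler
# switches to another VM.

SECTOR_MAX_DIFF = 2000   # Disk.SectorMaxDiff default
SCHED_QUANTUM = 8        # Disk.SchedQuantum default

def is_sequential(prev_sector, next_sector):
    """True when the next I/O counts as sequential to the previous one."""
    return abs(next_sector - prev_sector) < SECTOR_MAX_DIFF

def may_keep_turn(prev_sector, next_sector, ios_in_a_row):
    """May the same VM issue this I/O before the scheduler switches VMs?"""
    return ios_in_a_row < SCHED_QUANTUM and is_sequential(prev_sector, next_sector)

is_sequential(10_000, 10_512)     # -> True: only 512 sectors apart
is_sequential(10_000, 50_000)     # -> False: a long seek
may_keep_turn(10_000, 10_512, 8)  # -> False: the quantum of 8 is used up
```

So a VM streaming through nearby sectors keeps its turn for up to 8 I/Os, after which fairness takes over again.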
Now, if for whatever reason one of the VMs becomes idle, you would more than likely prefer the active VM to be able to use the full queue depth again. This is what Disk.SchedQControlSeqReqs is for. By default it is set to 128, meaning that when a VM has been able to issue 128 commands without any VM switches, Disk.SchedQControlVMSwitches is reset to 0 and the active VM can use the full queue depth of 64 again. With our example above in mind, the idea is that if VM B issues I/Os only very rarely (less than 1 in every 128), we still let VM B pay the higher latency penalty, because presumably it is not disk-bound anyway.
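The reset behavior can be sketched the same way. This is a simplified illustration under my own reading of the mechanism, not VMware's code: a long run of commands from one VM without a switch zeroes the switch counter, so throttling stops.

```python
# Illustrative sketch (not VMware code) of the reset behaviour: once a VM
# issues Disk.SchedQControlSeqReqs (128) commands without a VM switch,
# Disk.SchedQControlVMSwitches is reset and the full queue depth is
# available again.

SEQ_REQS_RESET = 128   # Disk.SchedQControlSeqReqs default

class SwitchCounter:
    """Tracks VM switches and resets after a long single-VM run."""

    def __init__(self):
        self.switches = 0      # Disk.SchedQControlVMSwitches counter
        self.same_vm_run = 0   # consecutive commands from the same VM
        self.prev_vm = None

    def observe(self, vm):
        if self.prev_vm is not None and vm != self.prev_vm:
            self.switches += 1
            self.same_vm_run = 0
        else:
            self.same_vm_run += 1
            if self.same_vm_run >= SEQ_REQS_RESET:
                self.switches = 0      # the other VMs went idle: stop throttling
                self.same_vm_run = 0
        self.prev_vm = vm

c = SwitchCounter()
for vm in ["A", "B"] * 4:   # interleaved I/O pushes switches past 6
    c.observe(vm)
for _ in range(129):        # VM B goes idle; a long run from VM A follows
    c.observe("A")
# c.switches is now 0 again: VM A gets the full queue depth back
```

In other words, the throttle is not permanent; it lasts only as long as the datastore is actually being contended for.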
To conclude, now that the penny has finally dropped on Disk.SchedNumReqOutstanding, I strongly feel that these advanced settings should not be changed unless specifically requested by VMware GSS. Changing these values can impact fairness within your environment and could lead to unexpected behavior from a performance perspective.
I would like to thank Thor for all the help he provided.
Ron Singler says
How does SIOC play into these scenarios? Does SIOC use these values under the covers to accomplish its mission of share fairness?
Duncan Epping says
When SIOC is enabled, DSNRO is disabled as SIOC takes care of the fairness.
Thanks for the explanation around Disk.SchedQuantum, Disk.SchedQControlSeqReqs and Disk.SchedQControlVMSwitches. It must be nice having an inside line on these details.
Do you know if VMware will publish more details around the advanced parameters?
Duncan Epping says
No, VMware will not do that. And yes, having direct access to people like Thor is very helpful, but posts like these still require a lot of time and effort.
I absolutely agree with you about the time and effort.
Thank you for taking the time to share this with the community.
A question regarding your consideration not to change Disk.Sched unless requested by VMware GSS: What is your stance on changing Disk.Sched per array vendor recommendation?
I linked a Hitachi VSP PDF in a previous post on the subject where they recommend changing Disk.Sched based on the number of disks, LUNs and HBAs.
Would you concede that the statement not to change Disk.Sched should include both VMware GSS and the array vendor?
I’d love to see more advanced parameters explained – maybe VMware Press will offer an e-book down the road.
Thanks for the insight.
Duncan Epping says
I cannot imagine they will.
You can get a brief description of each parameter in the vSphere client beside each parameter or by running esxcfg-advcfg -l using the COS/TSM.
Duncan Epping says
and you can see it in the UI as well.
Dennis Agterberg says
Great article Duncan, very interesting.
Askar Kopbayev says
Duncan, is there any setting that regulates queue depth for Storage vMotion? I could increase the HBA queue depth and DSNRO to 64 and could see it was working properly using Iometer. However, whenever I tried to Storage vMotion a VM between two datastores, the ACTV value in esxtop never went higher than 32.
I see the value of the LUN queue throttle go from 128 to 32 for a period of time.
Disk.SchedQControlVMSwitches = 6
Disk.SchedNumReqOutstanding = 32
LUN queue depth = 128 (toggles to 32 for a period of time)
LUN queue depth throttling is disabled.
The reason I am surprised that this is happening is that there is only one VM on the datastore!
Any other adapter / device level setting may cause this?
How are you able to view this?