Disk.SchedNumReqOutstanding, the story

There has been a lot of discussion in the past around Disk.SchedNumReqOutstanding: what the value should be and how it relates to the queue depth. Jason Boche wrote a whole article about when Disk.SchedNumReqOutstanding (DSNRO) is used and when it is not, and I would explain it as follows:

When two or more virtual machines are issuing I/Os to the same datastore, Disk.SchedNumReqOutstanding will limit the number of I/Os that are issued to the LUN.

So what does that mean? It took me a while before I fully got it, so let's try to explain it with an example. This is basically how the VMware I/O scheduler (Start-time Fair Queuing, aka SFQ) works.

You have set the queue depth for your HBA to 64 and a single virtual machine is issuing I/Os to a datastore. As it is just a single VM, up to 64 I/Os will end up in the device driver immediately. In most environments, however, LUNs are shared by many virtual machines and in most cases these virtual machines should be treated equally. When two or more virtual machines issue I/O to the same datastore, DSNRO kicks in. However, it will only throttle the queue depth once the VMkernel has detected that the threshold of a certain counter has been reached. That counter is Disk.SchedQControlVMSwitches and by default it is set to 6, meaning that the VMkernel needs to detect 6 VM switches while handling I/O before it throttles the queue down to the value of Disk.SchedNumReqOutstanding, which defaults to 32. (A VM switch means that the selected I/O does not come from the same VM as the previous I/O.)
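To make that trigger a bit more tangible, here is a small Python sketch of the mechanism as described above. It is purely illustrative: the class, variable names and simplifications are mine, this is not VMkernel code, but the defaults (64, 32 and 6) are the ones mentioned in the paragraph above.

```python
# Illustrative sketch only -- NOT VMkernel code. It mimics the trigger
# described above: once the scheduler has seen 6 "VM switches"
# (Disk.SchedQControlVMSwitches), the effective queue depth for the
# datastore is clamped from the HBA queue depth (64 here) down to
# Disk.SchedNumReqOutstanding (32 by default).

HBA_QUEUE_DEPTH = 64            # queue depth set on the HBA
SCHED_NUM_REQ_OUTSTANDING = 32  # Disk.SchedNumReqOutstanding (default)
SCHED_QCONTROL_VM_SWITCHES = 6  # Disk.SchedQControlVMSwitches (default)


class DatastoreScheduler:
    def __init__(self):
        self.vm_switches = 0    # how often the issuing VM changed
        self.last_vm = None
        self.queue_depth = HBA_QUEUE_DEPTH

    def issue_io(self, vm_id):
        # A "VM switch" is counted whenever the I/O comes from a
        # different VM than the previous I/O.
        if self.last_vm is not None and vm_id != self.last_vm:
            self.vm_switches += 1
        self.last_vm = vm_id

        # Once the switch threshold is reached the datastore no longer
        # runs at the full HBA queue depth but at DSNRO.
        if self.vm_switches >= SCHED_QCONTROL_VM_SWITCHES:
            self.queue_depth = SCHED_NUM_REQ_OUTSTANDING
        return self.queue_depth


# A single VM keeps the full queue depth; interleaved I/O from two VMs
# trips the threshold and drops it to 32.
scheduler = DatastoreScheduler()
for vm in ["A", "B", "A", "B", "A", "B", "A", "B"]:
    depth = scheduler.issue_io(vm)
print(depth)  # 32 once enough switches between VM A and VM B were seen
```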

The reason the throttling happens is that the VMkernel cannot control the order of the I/Os once they have been issued to the driver. Imagine VM A is issuing a lot of I/Os and another VM, VM B, is issuing just a few. VM A would end up using most of the full queue depth all the time. Every time VM B issues an I/O it will be picked up quickly by the VMkernel scheduler (which is a different topic) and sent to the driver as soon as another I/O completes there, but it will still end up behind the 64 I/Os already in the driver, which adds significantly to its I/O latency. By limiting the number of outstanding requests we allow the VMkernel to schedule VM B's I/O sooner in between the I/O stream from VM A, and thus we reduce the latency penalty for VM B.

Now that brings us to the second part of all the statements out there: should we really set Disk.SchedNumReqOutstanding to the same value as your queue depth? Well, if you want your I/Os processed as quickly as possible without any fairness, you probably should. But if you have mixed workloads on a single datastore and don't want virtual machines to incur excessive latency just because a single virtual machine issues a lot of I/Os, you probably shouldn't.

Is that it? No, not really; a couple of questions remain unanswered.

  • What about sequential I/O when Disk.SchedNumReqOutstanding kicks in?
  • How does the VMkernel know when to stop using Disk.SchedNumReqOutstanding?

Let's tackle the sequential I/O question first. By default the VMkernel will allow a VM to issue up to 8 sequential commands in a row (controlled by Disk.SchedQuantum), even when it would normally seem more fair to take an I/O from another VM. This is done in order not to destroy the sequentiality of VM workloads, because I/Os to sectors near the previous I/O are handled an order of magnitude faster than I/Os to sectors far away (10x is not unusual when excluding cache effects or when caches are small compared to the disk size). But what is considered to be sequential? If the next I/O is less than 2000 sectors away from the current I/O, it is considered sequential (controlled by Disk.SectorMaxDiff).
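Again, purely as an illustration (the function names are mine and do not exist anywhere in the VMkernel), the two knobs from the paragraph above could be sketched like this in Python:

```python
# Sketch of the sequential I/O exception described above, using the two
# defaults mentioned: Disk.SectorMaxDiff (2000 sectors) decides whether
# an I/O still counts as sequential, Disk.SchedQuantum (8) decides how
# many sequential commands a VM may issue in a row before fairness
# kicks back in.

SECTOR_MAX_DIFF = 2000  # Disk.SectorMaxDiff (default)
SCHED_QUANTUM = 8       # Disk.SchedQuantum (default)


def is_sequential(prev_sector, next_sector):
    # Sequential = the next I/O lands within 2000 sectors of the
    # previous I/O from the same VM.
    return abs(next_sector - prev_sector) < SECTOR_MAX_DIFF


def may_keep_issuing(run_length, prev_sector, next_sector):
    # The same VM may continue, even if another VM is waiting, as long
    # as the I/O is sequential and its quantum of 8 is not used up yet.
    return is_sequential(prev_sector, next_sector) and run_length < SCHED_QUANTUM


print(may_keep_issuing(3, 10_000, 10_500))  # True: nearby sectors, quantum left
print(may_keep_issuing(8, 10_000, 10_500))  # False: quantum of 8 used up
print(may_keep_issuing(3, 10_000, 50_000))  # False: not sequential anymore
```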

Now if for whatever reason one of the VMs becomes idle, you would more than likely prefer your active VM to be able to use the full queue depth again. This is what Disk.SchedQControlSeqReqs is for. By default Disk.SchedQControlSeqReqs is set to 128, meaning that when a VM has been able to issue 128 commands without any VM switches, Disk.SchedQControlVMSwitches is reset to 0 and the active VM can use the full queue depth of 64 again. With our example above in mind, the idea is that if VM B only issues the occasional I/O (less than 1 in every 128), we still let VM B pay the higher latency penalty, because presumably it is not disk bound anyway.
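And to complete the picture, here is the same kind of hypothetical sketch for the reset path; the function is mine, but the values (128 and 64) are the defaults mentioned above.

```python
# Sketch of the reset described above: when a VM manages to issue
# Disk.SchedQControlSeqReqs (128) I/Os without a single VM switch, the
# Disk.SchedQControlVMSwitches counter is reset to 0 and the full HBA
# queue depth is available again.

SCHED_QCONTROL_SEQ_REQS = 128  # Disk.SchedQControlSeqReqs (default)
HBA_QUEUE_DEPTH = 64


def update_on_io(same_vm_streak, vm_switches, queue_depth):
    # same_vm_streak = consecutive I/Os from one VM without any switch
    if same_vm_streak >= SCHED_QCONTROL_SEQ_REQS:
        vm_switches = 0                # reset the switch counter
        queue_depth = HBA_QUEUE_DEPTH  # back to the full queue depth
    return vm_switches, queue_depth


print(update_on_io(128, 6, 32))  # (0, 64): VM B went idle, VM A gets 64 again
print(update_on_io(50, 6, 32))   # (6, 32): still throttled to DSNRO
```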

To conclude, now that the coin has finally dropped on Disk.SchedNumReqOutstanding, I strongly feel that these advanced settings should not be changed unless specifically requested by VMware GSS. Changing these values can impact fairness within your environment and could lead to unexpected behavior from a performance perspective.

I would like to thank Thor for all the help he provided.

Order of storage tiers… (via twitter @mike_laverick)

@Mike_Laverick asked a question on Twitter today about something that is stated in the Cloud Computing with vCloud Director book. His question was as follows; and no, he is not dyslexic, he only had 140 characters :-)

pg65. Order of storage tiers. Doesn’t that infer FC/SDD+VMFS is “race horse” and NFS “donkey”…???

Mike was referring to the following section in the book:

SLA    | Service      | Cost  | RTO     | Storage    | RAID    | Applications
Tier 0 | Premium      | $$$$$ | 20 min  | SSD, FC    | 1+0     | Exchange, SQL
Tier 1 | Enterprise   | $$$$  | 1 hour  | FC         | 1+0, 5  | Web servers, SharePoint
Tier 2 | Professional | $$$   | 2 hours | iSCSI, NFS | 3, 5, X | Custom apps, QA
Tier 3 | Basic        | $     | 2 days  | NFS        | 3, 5, X | Dev/Test

This basically states, as Mike elegantly translated, that FC/SSD is top-performing storage while NFS is slow, or should I say “donkey”. Mike's comment is completely fair. I don't agree with this table and actually did recommend changing it; somehow that got lost during the editing phase. First of all, we shouldn't have mixed protocols and disk types in a single column. Even an FC array will perform like crap if you have SATA spindles backing your VMFS volumes. Secondly, there is no way you can really compare these, as there are so many factors to take into account, ranging from cache to RAID level to wire speed. It is still just an example, as clearly mentioned on page 64, but nevertheless it is misleading. I would personally prefer to have listed it as follows:

SLA    | Service      | Cost | RTO    | Protocol   | Disk    | RAID | BC/DR
Tier 1 | Enterprise   | $$$  | 20 min | FC 8Gb     | SSD     | 10   | Sync replication
Tier 2 | Professional | $$   | 1 hour | NFS 10GbE  | FC 15k  | 6    | Async replication
Tier 3 | Basic        | $    | 1 day  | iSCSI 1GbE | SATA 7k | 5    | Backup

Of course with the side note that performance is not solely dictated by the transport mechanism used; there is no reason why NFS couldn't or shouldn't be Tier 1, to be honest. Once again, this is just an example. Thanks Mike for pointing it out!

List of VAAI capable storage arrays?

I was browsing the VMTN community and noticed a great tip from my colleague Mostafa Khalil which I believe is worth sharing with you. The original question was: “Does anybody have a list of which arrays support VAAI (or a certain subset of the VAAI features)?”. Mostafa updated the post a couple of days back with the following response, which also shows the capabilities of the 2.0 version of the VMware HCL:

A new version of the Web HCL will provide search criteria specific to VAAI.

As of this date, the new interface is still in the “preview” stage. You can access it by clicking the “2.0 preview” button at the top of the page at http://www.vmware.com/go/hcl/

  • The criteria are grouped under Features Category, Features and Plugins.
  • Features Category: choice of “All” or “VAAI-Block”.
  • Features: choice of “All”, “Block Zero”, “Full Copy”, “HW Assisted Locking” and more.
  • Plugins: choice of “All” and any of the listed plugins.

[Image: the HCL 2.0 preview interface showing the new VAAI search criteria]

Unfortunately there appear to be some glitches when it comes to listing all the arrays correctly, but I am confident that it will be fixed soon… Thanks Mostafa for the great tip.

Surprising results: FC/NFS/iSCSI/FCoE…

I received a preliminary copy of this report a couple of weeks ago, but since then nothing has changed. NetApp took the time to compare FC against FCoE, iSCSI and NFS. Like most of us, probably, I still had the VI3 mindset and expected that FC would come out on top. Fact of the matter is that everything is so close that the differences are negligible; TR-3916 shows that regardless of the type of data access protocol used you can get the same mileage. I am glad NetApp took the time to test these various scenarios. It is no longer about which protocol works best or which drives the most performance… no, it is about what is easiest for you to manage! Are you an NFS shop? No need to switch to FC anymore. Do you like the simplicity of iSCSI? Go for it…

Thanks NetApp for this valuable report. Although it of course talks about NetApp, it is useful material to read for all of you!

Disk.UseDeviceReset, do I really need to set it?

I noticed a discussion on an internal mailing list about the advanced setting “Disk.UseDeviceReset”, as it is mentioned in the FC SAN guide. The myth that you need to set it to “0” in order for Disk.UseLunReset to function properly has been floating around for too long. Let's first discuss what this option does. In short, when an error occurs or a SCSI reservation needs to be cleared, a SCSI reset will be sent. We can do this either at the device level or at the LUN level, where device level means that the reset is sent to all disks / targets on the bus. As you can imagine this can be disruptive, and when there is no need to reset the whole SCSI bus it should be avoided. With regards to the settings, here is what will happen with the different combinations:

  • Disk.UseDeviceReset = 0  &  Disk.UseLunReset = 1  --> LUN Reset
  • Disk.UseDeviceReset = 1  &  Disk.UseLunReset = 1  --> LUN Reset
  • Disk.UseDeviceReset = 1  &  Disk.UseLunReset = 0  --> Device Reset

I hope that this makes it clear that there is no point in changing the Disk.UseDeviceReset setting as Disk.UseLunReset overrules it.
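For completeness, the three combinations from the list can be captured in a tiny, purely illustrative Python helper (the function name is mine, of course):

```python
# Illustration of the list above: Disk.UseLunReset overrules
# Disk.UseDeviceReset, so as long as UseLunReset is 1 a LUN reset is
# issued regardless of the UseDeviceReset value. Only the three
# combinations shown in the list are covered here.

def reset_type(use_device_reset, use_lun_reset):
    return "LUN reset" if use_lun_reset else "Device reset"


print(reset_type(0, 1))  # LUN reset
print(reset_type(1, 1))  # LUN reset (UseDeviceReset makes no difference)
print(reset_type(1, 0))  # Device reset
```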

PS: I filed a documentation bug and hope that it will be reflected in the doc soon.