No one likes queues

Well depending on what type of queues we are talking about of course, but in general no one likes queues. We are however confronted with queues on a daily basis, especially in compute environments. I was having a discussing with an engineer around storage queues and he sent me the following which I thought was worth sharing as it gives a good overview of how traffic flows from queue to queue with the default limits on the VMware side:

From top to bottom:

  • Guest device driver queue depth (LSI=32, PVSCSI=64)
  • vHBA (Hard coded limit: LSI=128, PVSCSI=255)
  • disk.schedNumOutstanding=32 (VMKernel),
  • VMkernel Device Driver (FC=32, iSCSI=128, NFS=256,  local disk=32)
  • Multiple SAN/Array Queues (Check Chad’s article for more details but it includes port buffers, port queues, disk queues etc (might be different for other storage vendors))

The following is probably worth repeating or clarifying:

The PVSCSI default queue depth is 64. You can increase it to 255 if required, please note that it is a per-device queue depth and keep in mind that this would only be truly useful when it is increased all the way down the stack and the Array Controller supports it. There is no point in increasing the queuedepth on a single layer when the other layers are not able to handle it as it would only push down the delay one layer. As explained in an article a year or three ago, disk.schednumreqoutstanding is enforced when multiple VMs issue I/Os on the same physical LUN, when it is a single VM it doesn’t apply and it will be the Device Driver queue that limits it.

I hope this provides a bit more insight to how the traffic flows. And by the way, if you are worried a single VM floods one of those queues there is an answer for that, it is called Storage IO Control!

RE: VMFS 3 versions – maybe you should upgrade your vmfs?

I was just answering some questions on the VMTN forum when someone asked the following question:

Should I upgrade our VMFS luns from 3.21 (some in 3.31) to 3.46 ? What benefits will we get?

This person was referred to an article by Frank Brix Pedersen who states the following:

Ever since ESX3.0 we have used the VMFS3 filesystem and we are still using it on vSphere. What most people don’t know is that there actually is sub versions of the VMFS.

  • ESX 3.0 VMFS 3.21
  • ESX 3.5 VMFS 3.31 key new feature: optimistic locking
  • ESX 4.0 VMFS 3.33 key new feature: optimistic IO

The good thing about it is that you can use all features on all versions. In ESX4 thin provisioning was introduced but it does need the VMFS to be 3.33. It will still work on 3.21. The changes in the VMFS is primarily regarding the handling of SCSI reservations. SCSI reservations happens a lot of times. Creation of new vm, growing a snapshot delta file or growing thin provisioned disk etc.

I want to make sure everyone realizes that this is actually not true. All the enhancements made in 3.5, 4.0 and even 4.1 are not implemented on a filesystem level but rather on a VMFS Driver level or through the addition of specific filters or even a new datamover.

Just to give an extreme example: You can leverage VAAI capabilities on a VMFS volume with VMFS filesystem version 3.21, however in order to invoke VAAI you will need the VMFS 3.46 driver. In other words, a migration to a new datastore is not required to leverage new features!

Storage vMotion performance difference?

Last week I wrote about the different datamovers being used when a Storage vMotion is initiated and the destination VMFS volume has a different blocksize as the source VMFS volume. Not only will it make a difference in terms of reclaiming zero space, but as mentioned it also makes a different in performance. The question that always arises is how much difference does it make? Well this week there was a question on the VMTN community regarding a SvMotion from FC to FATA and the slow performance. Of course within a second FATA was blamed, but that wasn’t actually the cause of this problem. The FATA disks were formatted with a different blocksize and that cause the legacy datamover to be used. I asked Paul, who started the thread, if he could check what the difference would be when equal blocksizes were used. Today Paul did his tests and he blogged about it here and but I copied the table which contains the details that shows you what performance improvement the fs3dm (please note, that VAAI is not used… this is purely a different datamover) brought:

From To Duration in minutes
FC datastore 1MB blocksize FATA datastore 4MB blocksize 08:01
FATA datastore 4MB blocksize FC datastore 1MB blocksize 12:49
FC datastore 4MB blocksize FATA datastore 4MB blocksize 02:36
FATA datastore 4MB blocksize FC datastore 4MB blocksize 02:24

As I explained in my article about the datamover, the difference is caused by the fact that the data doesn’t travel all the way up the stack… and yes the difference is huge!

Storage Performance

This is just a post to make it easier finding these excellent articles/threads on VMTN about measuring storage performance:

All these have one “requirement”  and that is that Iometer is used.

Another one that I wanted to point out are these excellent scripts that Clinton Kitson created which collects and processes vscsistats data. That by itself is cool, but what is planned for the next update is even cooler. Live plotted 3d graphs. Can’t wait for that one to be released!

Enable Storage IO Control on all Datastores!

This week I received an email from one of my readers about some weird Storage IO Control behavior in their environment.  On a regular basis he would receive an error stating that an “external I/O workload has been detected on shared datastore running Storage I/O Control (SIOC) for congestion management”. He did a quick scan of his complete environment and couldn’t find any hosts connecting to those volumes. After exchanging a couple of emails about the environment I managed to figure out what triggered this alert.

Now this all sounds very logical but probably is one of the most common made mistakes… sharing spindles. Some storage platforms carve out a volume from a specific set of spindles. This means that these spindles are solely dedicated to that particular volume. Other storage platforms however group spindles and layer volumes across these. Simply said, they are sharing spindles to increase performance. NetApp’s “aggregates” and HP’s “disk groups” would be a good example.

This can and probably will cause the alarm to be triggered as essentially an unknown workload is impacting your datastore performance. If you are designing your environment from the ground-up, make sure that all spindles that are backing your VMFS volumes have SIOC enabled.

However, in an existing environment this will be difficult, don’t worry that SIOC will be overly conservative and unnecessarily throttle your virtual workload. If and when SIOC detects an external workload it will stop throttling the virtual workload to avoid giving the external more bandwidth while negatively impact the virtual workload. From a throttling perspective that will look as follows:

32 29 28 27 25 24 22 20 (detect nonVI –> Max Qdepth )
32 31 29 28 26 25 (detect nonVI –> Max Qdepth)
32 30 29 27 25 24 (detect nonVI –> Max Qdepth)

Please note that the above example depicts a scenario where SIOC notices that the latency threshold is still exceeded and the cycle will start again, SIOC checks latency values every 4 seconds. The question of course remains how SIOC knows that there is an external workload accessing the datastore. SIOC uses a what we call a “self-learning algorithm”. It keeps track of historical observed latency, outstanding IOs and window sizes. Based on that info it can identify anomalies and that is what triggers the alarm.

To summarize:

  • Enable SIOC on all datastores that are backed by the same set of spindles
  • If you are designing a green field implementation try to avoid sharing spindles between non VMware and VMware workloads

More details about when this event could be triggered can be found in this KB article.