Thin provisioned disks and VMFS fragmentation, do I really need to worry?

I’ve seen this myth floating around from time to time, and as I have never publicly written about it I figured it was time to write an article to debunk it. The question that is often posed is whether thin disks will hurt performance due to fragmentation of the blocks allocated on the VMFS volume. I guess we need to rehash some basics first around thin disks and VMFS volumes (do a search on VMFS for more info)…

When you format a VMFS volume you can select the blocksize (1MB, 2MB, 4MB or 8MB). This blocksize is used when the hypervisor allocates storage for the VMDKs. So when you create a VMDK on an 8MB formatted VMFS volume it will create that VMDK out of 8MB blocks, and indeed in the case of a 1MB formatted VMFS volume it will use 1MB. Now, this blocksize also happens to be the size of the extent that is used for thin disks. In other words, every time your thin disk needs to expand it will grow in extents of the blocksize, so 1MB in that example. (Related to that, with a lazy-zeroed thick disk the zero-out also uses the blocksize. So when something needs to be written to an untouched part of the VMDK, it will zero out using the blocksize of the VMFS volume.)
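To make that allocation behavior a bit more concrete, here is a minimal sketch in plain Python (the ThinDisk class is made up purely for illustration, this is not how the VMkernel is implemented) of a thin VMDK that only allocates backing space in blocksize-sized extents when a write touches an untouched part of the disk:

  # Illustrative only: a thin disk grows in extents equal to the VMFS blocksize.
  BLOCKSIZE = 1 * 1024 * 1024  # 1MB blocksize; could just as well be 2, 4 or 8MB

  class ThinDisk:
      def __init__(self, blocksize=BLOCKSIZE):
          self.blocksize = blocksize
          self.allocated = set()  # extent numbers that have been allocated so far

      def write(self, offset, length):
          # Allocate every blocksize extent this write touches that doesn't exist yet.
          first = offset // self.blocksize
          last = (offset + length - 1) // self.blocksize
          for extent in range(first, last + 1):
              self.allocated.add(extent)

      def allocated_bytes(self):
          return len(self.allocated) * self.blocksize

  disk = ThinDisk()
  disk.write(offset=0, length=8 * 1024)   # a single 8KB guest write...
  print(disk.allocated_bytes())           # ...still grows the thin disk by a full 1MB extent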

So does using a thin disk in combination with a small blocksize cause more fragmentation? Yes, quite possibly it will. However, the real question is whether it will hurt your performance. The answer to that is: no, it won’t. The reason is that the VMFS blocksize is totally irrelevant when it comes to Guest OS I/O. So let’s assume you have a regular Windows VM and this VM is issuing 8KB writes and reads to a VMFS volume formatted with a 1MB blocksize. The hypervisor won’t fetch 1MB, as that could cause substantial overhead… no, it will request from the array exactly what was requested by the Guest OS, and the array will serve up whatever it is configured to serve. I guess what people are worried about most is sequential I/O, but think about that for a second or two. How sequential is your I/O when you are looking at it from the array’s perspective? You have multiple hosts running dozens of VMs accessing who knows how many volumes and subsequently who knows how many spindles. That sequential I/O suddenly isn’t as sequential anymore, is it?!

<edit> As pointed out, many arrays recognize sequential I/O and prefetch, which is correct. This doesn’t mean that contiguous blocks are automatically faster though, as fragmented blocks can also mean more spindles are involved, etc. </edit>
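For those who like to see it spelled out, here is a tiny illustrative sketch (plain Python, with a completely made-up extent_map; not actual VMkernel behavior) of the point above: fragmentation changes where a guest I/O lands on the volume, not how large that I/O is:

  # The VMFS blocksize determines allocation granularity, not the I/O size.
  BLOCKSIZE = 1 * 1024 * 1024

  # Hypothetical, fragmented mapping of the VMDK's 1MB extents to
  # non-contiguous offsets on the VMFS volume.
  extent_map = {0: 42 * BLOCKSIZE, 1: 7 * BLOCKSIZE, 2: 911 * BLOCKSIZE}

  def guest_read(vmdk_offset, size):
      # Translate a guest read into the request that travels towards the array
      # (assuming, for simplicity, that the read doesn't cross an extent boundary).
      extent = vmdk_offset // BLOCKSIZE
      physical = extent_map[extent] + (vmdk_offset % BLOCKSIZE)
      return {"offset": physical, "size": size}  # the size stays whatever the guest asked for

  req = guest_read(vmdk_offset=1 * BLOCKSIZE + 4096, size=8 * 1024)
  print(req)  # an 8KB request; fragmentation moved it around, it did not grow it to 1MB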

I guess the main takeaway here is: stop worrying about VMFS, it is rock solid and it will get the job done.

No one likes queues

Well, depending on what type of queues we are talking about of course, but in general no one likes queues. We are, however, confronted with queues on a daily basis, especially in compute environments. I was having a discussion with an engineer about storage queues and he sent me the following, which I thought was worth sharing as it gives a good overview of how traffic flows from queue to queue with the default limits on the VMware side:

From top to bottom:

  • Guest device driver queue depth (LSI=32, PVSCSI=64)
  • vHBA (hard-coded limit: LSI=128, PVSCSI=255)
  • Disk.SchedNumReqOutstanding=32 (VMkernel)
  • VMkernel device driver (FC=32, iSCSI=128, NFS=256, local disk=32)
  • Multiple SAN/array queues (check Chad’s article for more details, but this includes port buffers, port queues, disk queues, etc.; it might be different for other storage vendors)

The following is probably worth repeating or clarifying:

The PVSCSI default queue depth is 64. You can increase it to 255 if required, but please note that it is a per-device queue depth, and keep in mind that this is only truly useful when it is increased all the way down the stack and the array controller supports it. There is no point in increasing the queue depth at a single layer when the other layers cannot handle it, as that would only push the delay down one layer. As explained in an article a year or three ago, Disk.SchedNumReqOutstanding is enforced when multiple VMs issue I/Os to the same physical LUN; when it is a single VM it doesn’t apply and it is the device driver queue that limits it.
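To illustrate that “weakest link” point, here is a rough back-of-the-envelope sketch in plain Python (defaults taken from the list above; the array-side value is just a placeholder as it is vendor specific) showing why raising the queue depth at a single layer doesn’t buy you anything:

  # The effective number of outstanding I/Os per device is bounded by the
  # smallest queue in the path.
  stack = {
      "guest device driver (PVSCSI)": 64,
      "vHBA (PVSCSI)": 255,
      "Disk.SchedNumReqOutstanding": 32,
      "VMkernel device driver (FC)": 32,
      "array port/LUN queue": 32,   # placeholder, check your array documentation
  }

  bottleneck = min(stack, key=stack.get)
  print(f"effective outstanding I/Os: {stack[bottleneck]} (limited by {bottleneck})")

  # Raising only the guest queue depth to 255 changes nothing:
  stack["guest device driver (PVSCSI)"] = 255
  print(f"after raising the guest queue depth: {min(stack.values())}")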

I hope this provides a bit more insight into how the traffic flows. And by the way, if you are worried that a single VM will flood one of those queues, there is an answer for that: it is called Storage I/O Control!

RE: VMFS 3 versions – maybe you should upgrade your VMFS?

I was just answering some questions on the VMTN forum when someone asked the following question:

Should I upgrade our VMFS LUNs from 3.21 (some on 3.31) to 3.46? What benefits will we get?

This person was referred to an article by Frank Brix Pedersen, who states the following:

Ever since ESX 3.0 we have used the VMFS3 filesystem and we are still using it on vSphere. What most people don’t know is that there actually are sub-versions of VMFS.

  • ESX 3.0 VMFS 3.21
  • ESX 3.5 VMFS 3.31 key new feature: optimistic locking
  • ESX 4.0 VMFS 3.33 key new feature: optimistic IO

The good thing about it is that you can use all features on all versions. In ESX 4 thin provisioning was introduced, but it does not need the VMFS to be 3.33; it will still work on 3.21. The changes in VMFS are primarily regarding the handling of SCSI reservations. SCSI reservations happen a lot: creation of a new VM, growing a snapshot delta file or growing a thin-provisioned disk, etc.

I want to make sure everyone realizes that this is actually not true. All the enhancements made in 3.5, 4.0 and even 4.1 are not implemented at the filesystem level but rather at the VMFS driver level, or through the addition of specific filters or even a new datamover.

Just to give an extreme example: you can leverage VAAI capabilities on a VMFS volume with VMFS filesystem version 3.21; however, in order to invoke VAAI you will need the VMFS 3.46 driver. In other words, a migration to a new datastore is not required to leverage new features!

Storage vMotion performance difference?

Last week I wrote about the different datamovers being used when a Storage vMotion is initiated and the destination VMFS volume has a different blocksize than the source VMFS volume. Not only does it make a difference in terms of reclaiming zero space, but as mentioned it also makes a difference in performance. The question that always arises is: how much difference does it make? Well, this week there was a question on the VMTN community regarding a Storage vMotion from FC to FATA and its slow performance. Of course, within a second FATA was blamed, but that wasn’t actually the cause of the problem. The FATA disks were formatted with a different blocksize, and that caused the legacy datamover to be used. I asked Paul, who started the thread, if he could check what the difference would be when equal blocksizes were used. Today Paul did his tests and blogged about it here, but I copied the table below, which contains the details and shows you what performance improvement the fs3dm datamover (please note that VAAI is not used… this is purely a different datamover) brought:

From                           To                             Duration (mm:ss)
FC datastore 1MB blocksize     FATA datastore 4MB blocksize   08:01
FATA datastore 4MB blocksize   FC datastore 1MB blocksize     12:49
FC datastore 4MB blocksize     FATA datastore 4MB blocksize   02:36
FATA datastore 4MB blocksize   FC datastore 4MB blocksize     02:24

As I explained in my article about the datamover, the difference is caused by the fact that the data doesn’t travel all the way up the stack… and yes, the difference is huge!
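Just to put a number on “huge”: a quick back-of-the-envelope calculation based on the table above (plain Python, nothing VMware specific) shows the equal-blocksize migrations completing roughly three to five times faster in these runs:

  # Durations from the table above (mm:ss), converted to seconds.
  def seconds(mm_ss):
      minutes, secs = mm_ss.split(":")
      return int(minutes) * 60 + int(secs)

  legacy = [seconds("08:01"), seconds("12:49")]  # mismatched blocksizes, legacy datamover
  fs3dm  = [seconds("02:36"), seconds("02:24")]  # equal blocksizes, fs3dm

  for slow, fast in zip(legacy, fs3dm):
      print(f"{slow / fast:.1f}x faster with equal blocksizes")
  # prints roughly 3.1x and 5.3x for the two directions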

Storage Performance

This is just a post to make it easier to find these excellent articles/threads on VMTN about measuring storage performance:

All these have one “requirement” and that is that Iometer is used.

Another one that I wanted to point out is the excellent set of scripts that Clinton Kitson created, which collect and process vscsiStats data. That by itself is cool, but what is planned for the next update is even cooler: live-plotted 3D graphs. Can’t wait for that one to be released!