load balancing active/active SANs part II

About a year ago I wrote about a script that would load balance your Active/Active SAN by evenly dividing LUNs on all available paths. A week ago I provided Kees van Vloten with this script so that it could be incorporated into a scripted install solution. Kees has enhanced the script and emailed it so that I could share it with you guys:

for N_PATHS in 2 4 6 8; do
  # Start at the first path for each group of LUNs
  N=1
  # These are the LUNs with N_PATHS paths:
  LUN_LIST=`esxcfg-mpath -l | egrep "^Disk.+has $N_PATHS paths" | awk '{print $2}'`
  for LUN in $LUN_LIST; do
    echo "LUN: $LUN, Counter: $N, Possible paths:"
    esxcfg-mpath -q --lun=$LUN | grep "FC" | awk '{print $4}'
    # Take the Nth path for this LUN
    LUN_NEWPATH=`esxcfg-mpath -q --lun=$LUN | \
      grep "FC" | awk '{print $4}' | head -n $N | tail -n 1`
    # Make the Nth path the preferred path
    esxcfg-mpath --lun=$LUN --path=$LUN_NEWPATH --preferred
    echo ""
    # Increase N (wrap around once the limit is reached)
    if [ $N -ge $N_PATHS ]; then
      N=1
    else
      N=`expr $N + 1`
    fi
  done
done

Thanks for sharing,

VMFS/LUN size?

A question that pops up on the VMTN Community almost every day is: what size VMFS datastore should I create? The answer always varies; one person says “500GB”, another says “1TB”. The real answer should be: it depends.

Most companies can use a simple formula in my opinion. First you should answer these questions:

  • What’s the maximum number of VMs you’ve set for a VMFS volume?
  • What’s the average size of a VM in your environment? (First exclude the really large VMs that typically get an RDM.)

If you don’t know what the maximum number of VMs should be, just use a safe number, anywhere between 10 and 15. Here’s the formula I always use:

round((maxVMs * avgSize) + 20% )

I usually use increments of 25GB, which is where the rounding comes into play. If you end up with 380GB round it up to 400GB, and if you end up with 321GB round it up to 325GB. Let’s assume your average VM size is 30GB and your maximum number of VMs per VMFS volume is 10:

(10 * 30) + 60 (20%) = 360GB
360GB rounded up -> 375GB
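The same calculation can be scripted. The sketch below is my own illustration of the formula above (the variable names are mine, and it rounds up to the next 25GB step):

```shell
# Datastore sizing sketch: round((maxVMs * avgSize) + 20%) up to 25GB steps
MAX_VMS=10
AVG_SIZE=30   # average VM size in GB
RAW=`expr $MAX_VMS \* $AVG_SIZE`
# Add 20% headroom
WITH_HEADROOM=`expr $RAW + $RAW \* 20 / 100`
# Round up to the next multiple of 25GB
SIZE=`expr \( $WITH_HEADROOM + 24 \) / 25 \* 25`
echo "Recommended datastore size: ${SIZE}GB"
```

For the example above this prints 375GB, matching the manual calculation.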

The file is too big…err, no it’s not

I run VMware Workstation at home because ESX doesn’t have drivers for my SATA controller. After a motherboard failure I had to reconstruct my software RAID, and this morning I tried to recreate a virtual disk I use for saving images of my laptop. Previously it was approximately 140GB in size; after a rearrange the partition is 237GB, so I will make the disk 236GB. This is on an ext4 filesystem. So, off I go:

igibbs@host:/images$ vmware-vdiskmanager -c -t 0 -a ide -s 236GB fs1-images.vmdk
Creating disk '/images/fs1-images.vmdk'
Failed to create disk: The file is too big for the filesystem (0xc00000015).

Err, no it’s not. The maximum file size is 2TB on ext3 and 16TB on ext4, and the maximum ext4 filesystem size will eventually be 1,048,576TB (or 1EB). To my knowledge that’s not block size-dependent the way it is on VMFS. Eventually it turned out that I could create a pre-allocated disk (-t 2) of 236GB but not a growable sparse disk (-t 0) of 236GB:

igibbs@host:/images$ vmware-vdiskmanager -c -t 2 -a ide -s 236GB fs1-images.vmdk
Creating disk 'fs1-images.vmdk'
Create: 0% done.

Hope this helps someone. I presume it’s caused by Workstation not recognising ext4 properly.

That’s why I love blogging…

I’m an outspoken person, as most of you have noticed by now, but I’m also open to discussion, and that’s why I particularly like blogging. Every now and then a good discussion starts based on one of my blog articles. (Or a blog article of any of the other bloggers, for that matter.) These usually start in the form of a comment on an article, but also via email or Twitter; even within VMware some of my articles have been discussed extensively.

A couple of weeks ago I voiced my opinion about VMFS block sizes and growing your VMFS. Growing your VMFS is a new feature introduced with vSphere. In the article I stated that a large block size, 8MB, would be preferable because you would have less locking when using thin provisioned disks.

If you create a thin provisioned disk on a datastore with a 1MB block size, the thin provisioned disk will grow in increments of 1MB. Hopefully you can see where I’m going: a thin provisioned disk on a datastore with an 8MB block size will grow in 8MB increments. Each time the thin provisioned disk grows, a SCSI reservation takes place because of metadata changes. As you can imagine, an 8MB block size will decrease the number of metadata changes needed, which means fewer SCSI reservations. Fewer SCSI reservations equals better performance in my book.
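A quick back-of-the-envelope illustration of that reasoning (my own numbers, purely to show the arithmetic): the number of allocations, and thus potential metadata updates, needed to fully grow a 10GB thin disk at each supported block size:

```shell
# Allocations needed to fully grow a 10GB (10240MB) thin disk,
# one allocation per VMFS block
DISK_MB=10240
for BLOCK_MB in 1 2 4 8; do
  echo "${BLOCK_MB}MB blocks: `expr $DISK_MB / $BLOCK_MB` allocations"
done
```

With 1MB blocks that is 10,240 allocations versus 1,280 with 8MB blocks, which is the intuition behind the argument above.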

As a consultant I get a lot of questions on VMFS locking, and I assumed, with the understanding I had at the time, that a larger block size would be beneficial in terms of performance. I’m no scientist or developer; I rely on the information I find on the internet, in manuals, course material and the occasional internal mailing list… In this case that information wasn’t correct, or rather, not yet updated for the changes that vSphere introduced. Luckily for me, and you guys, one of my colleagues jumped in to give us some good insights:

I am a VMware employee and I wrote VMFS with a few cronies, but the following is a personal opinion:

Forget about locking. Period. Yes, SCSI reservations do happen (and I am not trying to defend that here) and there will be some minor differences in performance, but the suggestion on the (very well written) blog post goes against the mission of VMFS, which is to simplify storage virtualization.

Here’s a counter-example: if you have a nearly full 8MB VMFS volume and a less full 1MB VMFS volume, you’ll still encounter less IO overheads allocating blocks on a 1MB VMFS volume compared to the 8MB volume because the resource allocator will sweat more trying to find a free block in the nearly full volume. This is just one scenario, but my point is that there are tons of things to consider if one wants to account for overheads in a holistic manner and the VMFS engineers don’t want you to bother with these “tons” of things. Let us handle all that for you.

So in summary, blocksizes and thin provisioning should be treated orthogonally. Since thin provisioning is an official feature, the thing for users to know is that it will work “well” on all VMFS blocksize configurations that we support. Thinking about reservations or # IOs the resource manager does, queue sizes on a host vs the blocksize, etc will confuse the user with assertions that are not valid all the time.

I like the post in that it explains blocks vs sub-blocks. It also appeals to power users, so that’s great too. But reservation vs. thin provisioning considerations should be academic only. I can tell you about things like non-blocking retries, optimistic IO (not optimistic locking) and tons of other things that we have done under the covers to make sure reservations and thin provisioning don’t belong in the same sentence with vSphere 4. But conversely, I challenge any user to prove that 1MB incurs a significant overhead compared to 8MB with thin provisioning :)

Satyam Vaghani

Does this mean that I would not pick an 8MB block size over a 1MB block size any more?

Not exactly, but it will depend on the customer’s specific situation. My other reason for picking an 8MB block size was growing a VMFS volume. If you grow a VMFS volume, the reason probably is that you need to grow a VMDK. If the VMDK needs to grow beyond the maximum file size, which is dictated by the chosen block size, you would need to move (Storage VMotion or cold migration) the VMDK to a different datastore. But if you had selected an 8MB block size when you created the VMFS volume, you would not be in this position. In other words, I would still prefer a larger block size, but this is based on flexibility in terms of administration, not on performance or possible locking issues.

I want to thank Satyam for his very useful comment, thanks for chipping in!

How to change the SRM power state change time-out value

One of my customers recently asked if it was possible to change the time-out for a power state change; around the same time this question was asked and answered on an internal mailing list, so I thought it would be nice to document it. An example of a power state change task is the shutdown that SRM initiates when you run a recovery plan. The default value is 120 seconds, which might not be long enough and could lead to issues when a power off is forced. You can increase or decrease this value by editing the SRM configuration file (vmware-dr.xml). Look for the following section:

<powerStateChangeTimeout>120</powerStateChangeTimeout>

As stated above, the time-out value is in seconds. The default value is 120 and it can be changed according to your requirements. The change takes effect once the SRM service has been restarted. (If you can’t find this section in the XML file, just add it…)

Partitioning your ESX host – part II

A while back I published an article on partitioning your ESX host. That article was based on ESX 3.5, and of course with vSphere this has changed slightly. Let me start by quoting a section from the install and configure guide.

You cannot define the sizes of the /boot, vmkcore, and /vmfs partitions when you use the graphical or text installation modes. You can define these partition sizes when you do a scripted installation.

The ESX boot disk requires 1.25GB of free space and includes the /boot and vmkcore partitions. The /boot partition alone requires 1100MB.

The reason for this is that the service console is now a VMDK. This VMDK is stored on the local VMFS volume by default, in the following location: esxconsole-<system-uuid>/esxconsole.vmdk. By the way, “/boot” has been increased as a “safety net” for future upgrades to ESX(i).

So for manual installations there are three partitions fewer to worry about. I would advise using the following sizes for the rest of the partitions, and I would also recommend renaming the local VMFS partition during installation. The default name is “Storage1”; my recommendation would be “<hostname>-localstorage”.

/     - 5120MB
Swap  - 1600MB
Extended Partition:
/var  - 4096MB
/home - 2048MB
/opt  - 2048MB
/tmp  - 2048MB

With the disk sizes available these days you should have more than enough space; all in all ESX needs roughly 18GB in total.
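Since the install guide notes that these sizes can only be set during a scripted installation, here is a sketch of what the partitioning section of an ESX 4.x kickstart file could look like for the layout above. Treat it as an outline under assumptions: the directive names follow the vSphere scripted-install format, but “cos” and the datastore name are placeholders to adapt, and you should verify the syntax against the installation guide for your exact build.

```text
# Partitioning sketch for an ESX 4.x kickstart file (adapt before use)
part /boot --fstype=ext3 --size=1100 --onfirstdisk
part none --fstype=vmkcore --size=110 --onfirstdisk
# Rename "esx01-localstorage" to match your host
part esx01-localstorage --fstype=vmfs3 --size=16960 --grow --onfirstdisk
# The service console VMDK lives on the local VMFS volume
virtualdisk cos --size=16960 --onvmfs=esx01-localstorage
part /     --fstype=ext3 --size=5120 --onvirtualdisk=cos
part swap  --fstype=swap --size=1600 --onvirtualdisk=cos
part /var  --fstype=ext3 --size=4096 --onvirtualdisk=cos
part /home --fstype=ext3 --size=2048 --onvirtualdisk=cos
part /opt  --fstype=ext3 --size=2048 --onvirtualdisk=cos
part /tmp  --fstype=ext3 --size=2048 --onvirtualdisk=cos
```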

vSphere performance

Over the last couple of weeks I’ve seen all these performance numbers for vSphere (most not publicly available though), each one more impressive than the last. I think everyone will agree that the latest one is really impressive: 364,000 IOPS is just insane. There’s no load vSphere can’t handle, when correctly sized of course.

But something that made an even bigger impression on me, as a consolidation fanatic, is the following line from the latest performance study:

VMware’s new paravirtualized SCSI adapter (pvSCSI) offered 12% improvement in throughput at 18% less CPU cost compared to LSI virtual adapter

Now this may not sound like much, but when you are running 50 hosts it will make a difference. It will save you on cooling, rack space, power, hardware and maintenance; in other words, it will have its effect on your ROI and TCO. This is the kind of info I would love to see more of: where did we cut down on “overhead”, and which improvements will make our consolidation numbers go up?!
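To make the 50-host remark concrete, here is a rough back-of-the-envelope sketch. The numbers are entirely hypothetical and assume, very naively, that an 18% CPU saving per IO translates one-to-one into host capacity on an IO-bound workload; it is an illustration, not a sizing method:

```shell
# Hypothetical: if pvSCSI spends 18% less CPU per IO and CPU is the
# bottleneck, roughly 18% fewer hosts could carry the same load.
HOSTS=50
CPU_SAVING=18   # percent, from the performance study
# Hosts still needed, rounded up to a whole host
NEEDED=`expr \( $HOSTS \* \( 100 - $CPU_SAVING \) + 99 \) / 100`
echo "Hosts needed: $NEEDED, hosts saved: `expr $HOSTS - $NEEDED`"
```

Under those (generous) assumptions you would come out around 41 hosts instead of 50, and every host you don’t run saves cooling, rack space, power and maintenance.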