Definition of the advanced NFS options

A question that often comes up when implementing NFS-based storage is: what do these advanced settings you are recommending we change actually represent?

VMware published a great KB article which describes these. For instance:

NFS.HeartbeatMaxFailures
The number of consecutive heartbeat requests that must fail before the server is marked as unavailable.

The KB article not only explains the separate NFS settings but also shows how you can calculate how long it takes before ESX marks an NFS share as unavailable. Good stuff, definitely highly recommended!
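As a rough illustration of the kind of calculation the KB walks you through: the time before a datastore is marked unavailable is roughly the heartbeat frequency multiplied by the number of allowed failures, plus the heartbeat timeout. Here is a minimal Python sketch; the values below are the ones commonly recommended for NFS storage and are assumptions here, so always check the KB for the defaults of your ESX version.

  # Rough sketch of the "time to unavailable" calculation described in the KB.
  # Treat these values as assumptions; check the KB for your ESX version's defaults.
  NFS_HEARTBEAT_FREQUENCY = 12     # seconds between heartbeat requests
  NFS_HEARTBEAT_MAX_FAILURES = 10  # consecutive failures before the share is marked unavailable
  NFS_HEARTBEAT_TIMEOUT = 5        # seconds to wait for a single heartbeat reply

  time_to_unavailable = (NFS_HEARTBEAT_FREQUENCY * NFS_HEARTBEAT_MAX_FAILURES) + NFS_HEARTBEAT_TIMEOUT
  print("Datastore marked unavailable after roughly %d seconds" % time_to_unavailable)  # ~125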

Storage Masking?

I received a bunch of questions around storage masking over the last couple of weeks. One of them was about VMware’s best practice to mask LUNs on a per-cluster basis. This best practice has been around for years and is basically there to reduce conflicts. More hosts accessing the same LUNs means more overhead; to give you an example, every 5 minutes a rescan of both HBAs takes place automatically to check for dead storage paths. You can imagine there is a difference between 64 hosts accessing your storage and limiting it to, for instance, 16 hosts. Also think about the failure domain you are introducing: if an APD (All Paths Down) condition occurs, it no longer impacts just one cluster, it could impact all of them.

For vSphere 5.1 read this revision

The obvious next question is: won’t I lose a lot of flexibility? In a way you do, as a simple VMotion to another cluster will no longer work. But of course there is always a way to move a VM to a different cluster. In my designs I usually propose a so-called “Transfer Volume”. This volume (NFS or VMFS) can be used to transfer VMs to a different cluster. Yes, there is a slight operational overhead here, but it also reduces the traffic to each LUN and decreases the chance of SCSI reservation conflicts.

Here’s the process:

  1. Storage VMotion the VM from LUN on Array 1 to Transfer LUN
  2. VMotion VM from Cluster A to Cluster B
  3. Storage VMotion the VM from Transfer LUN to LUN on Array 2
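If you prefer to script these three steps, here is a minimal pyVmomi sketch of the process. The vCenter address, VM, datastore and host names are hypothetical, and error handling is left out for readability; it is a sketch of the idea, not a hardened implementation.

  import ssl
  from pyVim.connect import SmartConnect, Disconnect
  from pyVim.task import WaitForTask
  from pyVmomi import vim

  def find_obj(content, vimtype, name):
      # Return the first managed object of the given type with the given name.
      view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
      try:
          return next(o for o in view.view if o.name == name)
      finally:
          view.Destroy()

  ctx = ssl._create_unverified_context()  # lab use only
  si = SmartConnect(host="vcenter.lab.local", user="administrator", pwd="secret", sslContext=ctx)
  content = si.RetrieveContent()

  vm = find_obj(content, vim.VirtualMachine, "myvm")
  transfer_ds = find_obj(content, vim.Datastore, "transfer-volume")     # presented to both clusters
  target_ds = find_obj(content, vim.Datastore, "clusterB-lun01")        # masked to Cluster B only
  target_host = find_obj(content, vim.HostSystem, "esx-b-01.lab.local")

  # 1. Storage VMotion the VM to the Transfer LUN
  WaitForTask(vm.RelocateVM_Task(vim.vm.RelocateSpec(datastore=transfer_ds)))
  # 2. VMotion the VM to a host in Cluster B (the transfer volume is visible there too)
  WaitForTask(vm.RelocateVM_Task(vim.vm.RelocateSpec(host=target_host,
                                                     pool=target_host.parent.resourcePool)))
  # 3. Storage VMotion the VM from the Transfer LUN to a LUN masked to Cluster B
  WaitForTask(vm.RelocateVM_Task(vim.vm.RelocateSpec(datastore=target_ds)))

  Disconnect(si)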

Of course these don’t necessarily need to be two separate arrays, it could just as easily be a single array with a group of LUNs masked to a particular cluster. For the people who have a hard time visualizing it:

Real life RAID penalty example added to the IOps article

I just added a real-life RAID penalty example to the IOps article. I know sys admins are lazy, so here’s the info I just added:

I have two IX4-200Ds at home which are capable of doing RAID-0, RAID-10 and RAID-5. As I was rebuilding my home lab I thought I would see what changing RAID levels would do on these home lab / s(m)b devices. Keep in mind this is by no means an extensive test. I used IOmeter with 100% Write (Sequential) and 100% Read (Sequential). Read throughput was consistent at 111MB/s for every single RAID level. For write I/O, however, the results were clearly different, as expected. I ran all tests 4 times to get an average and used a block size of 64KB, as Gabe’s testing showed this was the optimal setting for the IX4.

In other words, we are seeing what we expected to see. RAID-0 had an average throughput of 44MB/s, RAID-10 still managed to reach 39MB/s, but RAID-5 dropped to 31MB/s, which is roughly 21% less than RAID-10.

I hope I can do the “same” tests on one of the arrays or preferably both (EMC NS20 or NetApp FAS2050) we have in our lab in Frimley!

FC vs NFS vs iSCSI

I was just reading the excellent white paper that NetApp published, titled “VMware vSphere multiprotocol performance comparison using FC, iSCSI and NFS”. I guess the title says enough and I don’t need to explain why it is important to read this one.

I read the paper twice so far. Something that stood out for me is the following graph:

I would have expected better performance from iSCSI with Jumbo Frames, and most certainly not lower performance than iSCSI without Jumbo Frames. Although it is a minimal decrease, it is something you will need to be aware of. I do feel, however, that the decrease in CPU overhead is more than enough to justify the small decrease in performance.

Read the report, it is worth your time.

IOps?

Just something I wanted to document for myself, as it is information I need on a regular basis and always have trouble finding, or at least I have trouble finding the correct bits and pieces. I was more or less triggered by this excellent white paper that Herco van Brug wrote. I do want to invite everyone out there to comment. I will roll up every single useful comment into this article to make it a reference point for designing your storage layout based on performance indicators.

The basics are simple: RAID introduces a write penalty. The question of course is how many IOps you need per volume and how many disks that volume should contain to meet the requirements. First, the disk types and the number of IOps they deliver. Keep in mind I’ve tried to keep the values on the safe side:


(I’ve added SSD with 6000 IOps, as suggested by Chad Sakac in the comments)

So how did I come up with these numbers? I bought a bunch of disks, measured the IOps several times, used several brands and calculated the average… well sort of. I looked it up on the internet and took 5 articles and calculated the average and rounded the outcome.

[edit]
Many asked where these numbers came from. Like I said, it’s an average of theoretical numbers. In the comments there’s a link to a ZDNet article which I used as one of the sources. ZDNet explains what the theoretical maximum number of IOps for a disk is. In short: it is based on the average seek time plus half of the time a single rotation takes (the rotational latency). These two values added together give you the time an average IO takes. There are 1000 milliseconds in every second, so divide 1000 by this value and you have the theoretical maximum number of IOps. Keep in mind, though, that this is based on random IO. With sequential IO these numbers will of course be different on a single drive.
[/edit]
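To make that calculation concrete, here is a minimal Python sketch of the theoretical maximum described above; the seek times and rotational speeds are illustrative assumptions, not measured values.

  # Theoretical maximum random IOps for a single disk, as described above:
  # average IO time = average seek time + half a rotation (rotational latency).
  def theoretical_iops(avg_seek_ms, rpm):
      rotational_latency_ms = (60000.0 / rpm) / 2   # half a rotation, in milliseconds
      avg_io_time_ms = avg_seek_ms + rotational_latency_ms
      return 1000.0 / avg_io_time_ms                # IOs per second

  # Illustrative (assumed) seek times:
  print(round(theoretical_iops(avg_seek_ms=3.5, rpm=15000)))  # 15k drive  -> ~182
  print(round(theoretical_iops(avg_seek_ms=4.5, rpm=10000)))  # 10k drive  -> ~133
  print(round(theoretical_iops(avg_seek_ms=9.0, rpm=7200)))   # 7.2k drive -> ~76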

So what happens if I add these disks to a RAID group?

For “read” IOps it’s simple: RAID read IOps = the sum of all single-disk IOps.

For “write” IOps it is slightly more complicated, as a penalty is introduced:

So how do we factor this penalty in? Well, it’s simple: for RAID-5, for instance, every single write requires 4 IOs on the backend. That is the penalty introduced when selecting a specific RAID type. It also means that although you may think you have enough spindles in a single RAID set, you might not, due to the introduced penalty and the ratio of writes versus reads.

I found a formula and tweaked it a bit so that it fits our needs:

(TOTAL IOps × %READ) + ((TOTAL IOps × %WRITE) × RAID Penalty)

So take RAID-5 and, for instance, a VM which produces 1000 IOps with 40% reads and 60% writes:

(1000 × 0.4) + ((1000 × 0.6) × 4) = 400 + 2400 = 2800 IOs

The 1000 IOps this VM produces actually result in 2800 IOs on the backend of the array. That makes you think, doesn’t it?
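If you want to play with the numbers yourself, below is a minimal Python sketch of the formula above. The write penalties for RAID-0, RAID-10 and RAID-6 are the commonly quoted values and are included here as assumptions; the RAID-5 penalty of 4 is the one used in the example. The 180 IOps per disk in the usage line is an assumed value as well.

  import math

  # Commonly quoted write penalties per RAID type (assumptions, except RAID-5 from the example above)
  RAID_WRITE_PENALTY = {"RAID-0": 1, "RAID-10": 2, "RAID-5": 4, "RAID-6": 6}

  def backend_iops(frontend_iops, read_pct, raid_level):
      # (TOTAL IOps x %READ) + ((TOTAL IOps x %WRITE) x RAID penalty)
      write_pct = 1.0 - read_pct
      return frontend_iops * read_pct + frontend_iops * write_pct * RAID_WRITE_PENALTY[raid_level]

  def spindles_needed(frontend_iops, read_pct, raid_level, iops_per_disk):
      # How many disks are needed to deliver the backend IOps
      return math.ceil(backend_iops(frontend_iops, read_pct, raid_level) / iops_per_disk)

  print(backend_iops(1000, 0.4, "RAID-5"))          # the example above -> 2800.0
  print(spindles_needed(1000, 0.4, "RAID-5", 180))  # with an assumed 180 IOps per disk -> 16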

Real life examples

I have two IX4-200Ds at home which are capable of doing RAID-0, RAID-10 and RAID-5. As I was rebuilding my home lab I thought I would see what changing RAID levels would do on these home lab / s(m)b devices. Keep in mind this is by no means an extensive test. I used IOmeter with 100% Write (Sequential) and 100% Read (Sequential). Read throughput was consistent at 111MB/s for every single RAID level. For write I/O, however, the results were clearly different, as expected. I ran all tests 4 times to get an average and used a block size of 64KB, as Gabe’s testing showed this was the optimal setting for the IX4.

In other words, we are seeing what we expected to see. RAID-0 had an average throughput of 44MB/s, RAID-10 still managed to reach 39MB/s, but RAID-5 dropped to 31MB/s, which is roughly 21% less than RAID-10.

I hope I can do the “same” tests on one of the arrays or preferably both (EMC NS20 or NetApp FAS2050) we have in our lab in Frimley!

<update: December 2012>
More info about storage / VMFS volume sizing can be found in the following articles:

</update>

 

Changed block tracking?

I was reading Eric Siebert’s excellent article on Changed Block Tracking (CBT) and the article on Punching Cloud about this new feature, which is part of vSphere. CBT enables incremental backups of full VMDKs. Something that isn’t covered, though, is what the “block” part of Changed Block Tracking actually stands for.

Someone asked me about it on the VMTN Communities and it was something I had not looked into yet. The question was about VMFS block sizes and how they could potentially affect the size of a backup which uses CBT. The assumption was that CBT uses 1MB blocks on a VMFS volume with a 1MB block size and 8MB blocks on a VMFS volume with an 8MB block size. This is not the case.

So what is the size of the block that CBT refers to? Good question. I asked around, and the answer is that it is not a fixed size; the block size is variable. It starts at 64KB, and the bigger the VMDK becomes, the bigger the blocks become.
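To make that a bit more tangible: backup products consume CBT through the vSphere API call QueryChangedDiskAreas, which returns a list of changed extents, each with a start offset and a length in bytes, rather than fixed VMFS-sized blocks. Below is a minimal pyVmomi sketch; the snapshot, device key and change ID arguments are hypothetical placeholders.

  from pyVmomi import vim

  def print_changed_extents(vm, snapshot, disk_device_key, change_id="*"):
      # Walk the disk and print every extent CBT reports as changed since change_id.
      # "*" is the special change ID that returns all allocated areas of the disk.
      capacity = disk_capacity_bytes(vm, disk_device_key)
      offset = 0
      while offset < capacity:
          info = vm.QueryChangedDiskAreas(snapshot=snapshot,
                                          deviceKey=disk_device_key,
                                          startOffset=offset,
                                          changeId=change_id)
          for extent in info.changedArea:
              # Extents come back as (start, length) in bytes; the lengths vary
              # and are not tied to the VMFS block size.
              print("changed area: start=%d length=%d" % (extent.start, extent.length))
          if info.length == 0:
              break
          offset = info.startOffset + info.length

  def disk_capacity_bytes(vm, disk_device_key):
      for dev in vm.config.hardware.device:
          if isinstance(dev, vim.vm.device.VirtualDisk) and dev.key == disk_device_key:
              return dev.capacityInKB * 1024
      raise ValueError("disk with key %d not found" % disk_device_key)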

Just for the sake of it:

  • CBT works at the VMDK level, not at the VMFS level.
  • CBT uses variable block sizes, which are dictated by the size of the VMDK.
  • CBT is a feature that lives within the VMkernel and not within VMFS.
  • CBT is an FS filter, as shown in the VMworld slide below.

vscsiStats output in esxtop format?

This week we(Frank Denneman and I) played around with vscsiStats, it’s a weird command and hard to get used to when you normally dive into esxtop when there are performance issues. While asking around for more info on the metrics and values someone emailed us nfstop. I assumed it was NDA or at least not suitable for publication yet  but William Lam pointed me to a topic on the VMTN Communities which contains this great script. Definitely worth checking out. This tool parses the vscsiStats output into an esxtop format. Below a screenshot of what that looks like: