
Yellow Bricks

by Duncan Epping



Virtually Speaking Podcast – vSAN Customer Use Cases

Duncan Epping · Feb 22, 2017 ·

As John Nicholson was traveling in and around New Zealand, I was asked by Pete if I could co-host the Virtually Speaking Podcast again. It is always entertaining to join; Pete is such a natural when it comes to these things. I euuh, well, I do my best to keep up with him :). Below you can find the latest episode on the topic of vSAN Customer Use Cases. It includes a lot of soundbites recorded at the VMware World Wide Kick Off / Tech Summit, a VMware-internal event for all Sales, Pre-Sales, and Post-Sales field-facing people.

You can of course also subscribe on iTunes!

Two host stretched vSAN cluster with Standard license?

Duncan Epping · Jan 24, 2017 ·

I was asked today whether it is possible to create a 2-host stretched cluster using a vSAN Standard license or a ROBO Standard license. First of all, from a licensing point of view the EULA states that you are allowed to do this with a Standard license:

A Cluster containing exactly two Servers, commonly referred to as a 2-node Cluster, can be deployed as a Stretched Cluster. Clusters with three or more Servers are not allowed to be deployed as a Stretched Cluster, and the use of the Software in these Clusters is limited to using only a physical Server or a group of physical Servers as Fault Domains.

I figured I would give it a go in my lab. Exec summary: worked like a charm!

Loaded up the ROBO license:

Go to the Fault Domains & Stretched Cluster section under “Virtual SAN” and click Configure. Add one host to the “preferred” and one to the “secondary” fault domain:

Select the Witness host:

Select the witness disks for the vSAN cluster:

Click Finish:

And then the 2-node stretched cluster is formed using a Standard or ROBO license:

Of course I tried the same with 3 hosts, which failed as my license does not allow me to create a stretched cluster larger than 1+1+1. And even if it had succeeded, the EULA clearly states that you are not allowed to do so; you need Enterprise licenses for that.
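To make that licensing rule concrete, here is a minimal sketch that simply encodes the EULA constraint quoted above. This is a hypothetical helper, not a VMware API; the function name and the edition strings are my own illustration.

```python
# Hypothetical illustration of the EULA rule quoted above -- not a VMware API.
# Standard/ROBO licensing allows a stretched cluster with exactly two data hosts
# (plus the witness, i.e. 1+1+1); anything larger requires Enterprise licensing.

def stretched_cluster_allowed(data_hosts: int, license_edition: str) -> bool:
    """Return True if a stretched cluster with this many data hosts is permitted."""
    edition = license_edition.lower()
    if edition == "enterprise":
        return data_hosts >= 2                    # larger stretched clusters allowed
    if edition in ("standard", "robo standard"):
        return data_hosts == 2                    # two data hosts + witness only
    return False                                  # other editions: no stretched cluster

assert stretched_cluster_allowed(2, "Standard")           # the setup in this post
assert not stretched_cluster_allowed(3, "ROBO Standard")  # fails, just like the lab test
```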

There you have it: a two-host stretched cluster using vSAN Standard, nice right?!

Disk controllers for vSAN with or without cache?

Duncan Epping · Dec 13, 2016 ·

I got this question today and thought I had already written something on the topic, but as I cannot find anything, I figured I would write up something quick. The question was whether a disk controller for vSAN should have cache or not. It is a fair question, as many disk controllers these days come with 1GB, 2GB or 4GB of cache.

Let it be clear that with vSAN you are required to disable the write cache at all times. The reason for this is simple: vSAN is in control of data consistency and does not expect a write cache (battery-backed or not) in its data path. Make sure to disable it. From a read perspective you can have caching enabled. In some cases we see controllers where people simply set the write cache to 0% and the rest then automatically becomes read cache. This is fully supported; however, our tests have shown there is little added benefit in terms of performance, especially as reads typically come from SSD anyway. Theoretically there could be a performance gain, but personally I would rather spend my money on flash for vSAN.

My recommendation is fairly straightforward: use a plain pass-through controller without any fancy features. You don’t need RAID on the disk controller with vSAN, and you don’t need caching on the disk controller with vSAN; keep it simple, that works best. So if you have the option to dumb it down, go for it.

Benchmarking an HCI solution with legacy tools

Duncan Epping · Nov 17, 2016 ·

I was driving back home from Germany on the autobahn this week, thinking about the 5-6 conversations I have had over the past couple of weeks about performance tests for HCI systems. (Hence the pic on the right side being very appropriate ;-)) What stood out during these conversations is that many folks are repeating the tests they once conducted on their legacy array and then comparing the results 1:1 to their HCI system. Fairly often people even use a legacy tool like Atto disk benchmark. Atto is a great tool for testing the speed of the drive in your laptop, or maybe even a RAID configuration, but the name already more or less reveals its limitation: “disk benchmark”. It wasn’t designed to show the capabilities and strengths of a distributed / hyper-converged platform.

Now I am not trying to pick on Atto, as similar problems exist with tools like IOMeter for instance. I see people doing a single-VM IOMeter test with a single disk. In most hyper-converged offerings that doesn’t result in a spectacular outcome. Why? Well, simply because that is not what the solution is designed for. Sure, there are ways to demonstrate what your system is capable of with legacy tools: simply create multiple VMs with multiple disks. Or, even with a single VM, you can produce better results by picking the right policy, as vSAN allows you to stripe data across 12 devices for instance (which can be across hosts, disk groups, etc.). Without selecting the right policy or having multiple VMs, you may not be hitting the limits of your system, but simply the limits of your VM’s virtual disk controller, the host disk controller, a single device’s capabilities, and so on.
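To illustrate that point, here is a toy model with made-up numbers (purely hypothetical limits, not vSAN specifications): a single VM with a single disk is capped by the slowest component in its own path, while spreading the load across more VMs and disks approaches the aggregate capability of the cluster.

```python
# Toy model with made-up numbers -- not vSAN specifications, just the bottleneck idea.

PER_VDISK_IOPS  = 15_000   # hypothetical limit of one virtual disk / vSCSI path
PER_DEVICE_IOPS = 40_000   # hypothetical limit of a single capacity device
PER_HOST_IOPS   = 120_000  # hypothetical limit of one host's disk controller path
CLUSTER_HOSTS   = 4

def single_vm_single_disk() -> int:
    # One I/O path: the result reflects the weakest link, not the cluster's capability.
    return min(PER_VDISK_IOPS, PER_DEVICE_IOPS, PER_HOST_IOPS)

def many_vms_many_disks(vms_per_host: int, disks_per_vm: int) -> int:
    # Load spread across hosts, devices and virtual disks: bounded by the hosts.
    demand = CLUSTER_HOSTS * vms_per_host * disks_per_vm * PER_VDISK_IOPS
    return min(demand, CLUSTER_HOSTS * PER_HOST_IOPS)

print(single_vm_single_disk())    # 15000  -> you measured one virtual disk
print(many_vms_many_disks(4, 2))  # 480000 -> you measured the cluster
```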

But there is an even better option: pick the right toolset and select the right workload (surely doing only 4K blocks isn’t representative of your production environment). VMware has developed a benchmarking solution that works with both traditional and hyper-converged offerings, called HCIBench. HCIBench can be downloaded and used for free through the VMware Flings website. Instead of that single VM, single disk test, you will now be able to test many VMs with multiple disks to show how a scale-out storage system behaves. It will give you great insight into the capabilities of your storage system, whether that is vSAN or any other HCI solution, or even a legacy storage system for that matter. Just like the world of storage has evolved, so has the world of benchmarking.

600GB write buffer limit for VSAN?

Duncan Epping · May 17, 2016 ·

I get this question on a regular basis and, even though it has been explained many, many times, I figured I would dedicate a blog to it. Now, Cormac has written a very lengthy blog on the topic and I am not going to repeat it; I will simply point you to the math he has provided around it. I do however want to provide a quick summary:

When you have an all-flash VSAN configuration, the current write buffer limit is 600GB (this applies to all-flash only). As a result, many seem to think that when an 800GB device is used for the write buffer, 200GB will go unused. This simply is not the case. We have a rule of thumb of a 10% cache-to-capacity ratio. This rule of thumb has been developed with both performance and endurance in mind, as described by Cormac in the link above. The 200GB above the 600GB write buffer limit is actively used by the flash device for endurance. Note that an SSD usually is over-provisioned by default; most of them have extra cells for endurance and write performance, which makes the experience more predictable and at the same time more reliable. The same applies in this case to the Virtual SAN write buffer.

The image at the top right shows how this works. This SSD has 800GB of advertised capacity. The “write buffer” is limited to 600GB; however, the white space is considered “dynamic over-provisioning” capacity, as it will be actively used by the SSD automatically (SSDs do this by default). Then there is an additional percentage of over-provisioning by default on all SSDs, which in the example is 28% (typical for enterprise grade), and even after that there usually is an extra 7% for garbage collection and other SSD internals. If you want to know more about why this is and how this works, Seagate has a nice blog.
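As a quick sanity check of the numbers above, here is a small sketch with illustrative arithmetic only. How exactly a vendor counts the 28% and 7% against advertised versus raw capacity varies, so the figures are there purely to show the relative magnitudes and the 10% rule of thumb.

```python
# Illustrative arithmetic only, using the example numbers from the post above.

ADVERTISED_GB   = 800    # advertised capacity of the cache device
WRITE_BUFFER_GB = 600    # all-flash vSAN write buffer limit
FACTORY_OP_PCT  = 0.28   # typical enterprise-grade factory over-provisioning
GC_RESERVE_PCT  = 0.07   # extra reserve for garbage collection / SSD internals

dynamic_op_gb = ADVERTISED_GB - WRITE_BUFFER_GB   # 200GB: used for endurance, not wasted
factory_op_gb = ADVERTISED_GB * FACTORY_OP_PCT    # ~224GB of hidden factory spare area
gc_reserve_gb = ADVERTISED_GB * GC_RESERVE_PCT    # ~56GB for garbage collection, etc.

print(f"dynamic over-provisioning:  {dynamic_op_gb:.0f}GB")
print(f"factory over-provisioning:  {factory_op_gb:.0f}GB")
print(f"garbage collection reserve: {gc_reserve_gb:.0f}GB")

# The 10% cache-to-capacity rule of thumb: a 600GB write buffer lines up with
# roughly 6TB of capacity behind that cache device.
print(f"capacity covered at 10%: {WRITE_BUFFER_GB / 0.10 / 1000:.1f}TB")
```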

So let’s recap: as a consumer/admin, the 600GB write buffer limit should not be a concern. Although the write buffer is limited in terms of buffer capacity, the flash cells will not go unused, and the rule of thumb as such remains unchanged: a 10% cache-to-capacity ratio. Let’s hope this puts this (non) discussion finally to rest.

