Sharing VMUG presentation “vSphere futures”

Last week I presented at the UK VMUG, the Nordic VMUG and VMUG Belgium. My topic was vSphere futures… I figured I would share the deck publicly. The deck is based on this blog post and is essentially a collection of what was revealed at the last VMworld. Considering the number of announcements, I think this deck is a nice summary of what is coming. Feel free to use it, share it, comment on it, etc.

Once again, I would like to thank the folks of the VMUG organizations throughout EMEA for inviting me; three great events last week with very passionate people. One thing that struck me last week and that I want to call out in particular: Erik from the VMUG in Belgium has created a charity program where he asks sponsors (and attendees) to contribute to charity. At the last event he collected over 8000 euros, which went to a local charity. It was the biggest donation that this particular charity had received in a long time, and you can imagine they were very thankful… all of this while keeping the event free for attendees. Great work Erik, and thanks for giving back to the community in various ways.

See you next time.

Validate your hardware configuration when it arrives!

In the last couple of weeks I had three different instances of people encountering weird behaviour in their environment. In two cases it was a VSAN environment, and the other one was an environment using a flash caching solution. What all three had in common is that when they were driving a lot of IO, the SSD device would become unavailable; one of them even had challenges enabling VSAN in the first place, before any IO load was placed on it.

With the first customer it took me a while to figure out what was going on. I asked him the standard questions:

  • Which disk controller are you using?
  • Which flash device are you using?
  • Which disks are you using?
  • Do you have 10GbE networking?
  • Are they on the HCL?
  • What is the queue depth of the devices?

All the answers were “positive”, meaning that the full environment was supported… The queue depth was 600 so that was fine, enterprise grade MLC devices were used, and even the HDDs were on the HCL. So what was causing their problems? I asked them to show me the disk devices and flash devices in the Web Client, and then I noticed that the flash devices were connected to a different disk controller. The HDDs (SAS drives) were connected to the disk controller which was on the HCL, a highly performant and reliable device… The flash device, however, was connected to the on-board, shallow queue depth, non-certified controller. Yes indeed, the infamous AHCI disk controller. When I pointed it out the customers were shocked: “why on earth would the vendor do that…” Well, to be honest, if you look at it: the SAS drives were connected to the SAS controller and the SATA flash device was connected to the SATA disk controller, so from that perspective it makes sense, right? And in the end, the OEM doesn’t really know what your plans are when you configure your own box, right? So before you install anything, open the box up and make sure that everything is connected properly and in a fully supported way! (PS: or simply go VSAN Ready Node / EVO:RAIL :-))
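To make that “check before you install” advice a bit more concrete, here is a minimal sketch of the kind of sanity check you could script once you have pulled the device-to-controller mapping from a host. Everything in it (controller names, device names, the queue depth threshold) is made up for illustration; in practice you would fill the list in from the host itself or, even better, open the box and look:

```python
# Hypothetical sanity check: map each device to its controller and flag
# anything suspicious. The device list below is made up for illustration.

APPROVED_CONTROLLERS = {"LSI SAS 9207-8i"}  # controllers you verified on the HCL
MIN_QUEUE_DEPTH = 256                       # your own lower bound, adjust as needed

devices = [
    {"name": "naa.hdd01", "type": "HDD",   "controller": "LSI SAS 9207-8i", "queue_depth": 600},
    {"name": "naa.ssd01", "type": "Flash", "controller": "Onboard AHCI",    "queue_depth": 31},
]

for dev in devices:
    issues = []
    if dev["controller"] not in APPROVED_CONTROLLERS:
        issues.append("controller not on the HCL / approved list")
    if dev["queue_depth"] < MIN_QUEUE_DEPTH:
        issues.append(f"queue depth {dev['queue_depth']} is shallow")
    if issues:
        print(f"WARNING: {dev['name']} ({dev['type']}) behind {dev['controller']}: "
              + "; ".join(issues))
```

In the scenario above this would immediately flag the flash device sitting behind the on-board AHCI controller, which is exactly what a quick look in the Web Client revealed.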

Slow backup of VM on VSAN Datastore

Someone at our internal field conference asked me why doing a full backup of a virtual machine on a VSAN datastore is slower than doing the same exercise for that virtual machine on a traditional storage array. Note that the test conducted here was done with a single virtual machine. The best way to explain why this is, is by taking a look at the architecture of VSAN. First, let me mention that the full backup of the VM on a traditional array was done on a storage system that had many disks backing the datastore on which the virtual machine was located.

Virtual SAN, as hopefully all of you know, creates a shared datastore out of host local resources. This datastore is formed out of disks and flash. Another thing to understand is that Virtual SAN is an object store. Each object is typically stored in a resilient fashion and as such on two hosts, hence 3 hosts is the minimum. Now, by default the components of an object are not striped, which means that components are in most cases stored on a single physical spindle. As you can see in the diagram below, this means that the disk (object) has two components and, without stripes, is stored on 2 physical disks.
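To put some rough numbers on that layout, the sketch below is not VSAN code, just a toy model of the placement logic described above, assuming the default policy of one failure to tolerate and a stripe width of 1:

```python
# Toy model of the placement described above: "failures to tolerate"
# determines the number of mirror copies (plus a witness, which is why
# 3 hosts is the minimum), and the stripe width determines how many
# spindles each copy is spread across.

def data_components(failures_to_tolerate=1, stripe_width=1):
    mirror_copies = failures_to_tolerate + 1
    return mirror_copies * stripe_width  # one component per spindle per copy

# Default policy: 2 mirror copies, no striping -> the object lives on 2 disks.
print("Default policy:", data_components(1, 1), "components / physical disks")

# Increasing the stripe width spreads each copy across more spindles.
print("Stripe width 4:", data_components(1, 4), "components / physical disks")
```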

Now let’s get back to the original question: why did the backup on VSAN take longer than with a traditional storage system? It is fairly simple to explain looking at the above info. In the case of the traditional storage array you are reading from multiple disks (10+), but with VSAN you are only reading from 2 disks. As you can imagine, read performance / throughput will differ depending on the resources that the total number of disks being read from can provide. In this test, as there is just a single virtual machine being backed up, the VSAN result will be different as it has a lower number of disks (resources) at its disposal, and on top of that, as the VM is new, there is no data cached, so the flash layer is not used. Now, depending on your workload you can of course decide to stripe the components, but when it comes to backup you can also decide to increase the number of concurrent backups… if you increase the number of concurrent backups, the results will get closer as more disks are being leveraged across all VMs. I hope that helps explain why results can be different, but hopefully everyone understands that when you test things like this, parallelism is important, or you need to provide the right level of stripe width.
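As a back-of-the-envelope illustration of why parallelism (or stripe width) matters here, assume for the sake of argument that a single spindle delivers roughly 100 MB/s of sequential reads; the exact figure is an assumption, only the ratio matters:

```python
# Rough throughput comparison for a single full backup stream. The
# 100 MB/s per-spindle figure is purely an assumption for illustration.

PER_DISK_MBPS = 100

def aggregate_read_mbps(disks_serving_reads):
    return disks_serving_reads * PER_DISK_MBPS

print("Traditional array, 10+ disks behind the datastore:",
      aggregate_read_mbps(10), "MB/s for one backup stream")
print("VSAN default policy, 2 components on 2 disks:",
      aggregate_read_mbps(2), "MB/s for one backup stream")

# Two ways to close the gap, as described above:
# 1) a higher stripe width, so each copy spans more spindles, or
# 2) more concurrent backups, so reads land on more disks across VMs.
concurrent_backups = 5
print("5 concurrent backups (different VMs, different disks):",
      aggregate_read_mbps(2 * concurrent_backups), "MB/s in aggregate")
```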

VMware / ecosystem / industry news flash… part 4

VMware / ecosystem / industry news flash time again. It took me a while to gather a bunch of them, so some of the news is a bit older than normal.

  • Dell and SuperMicro to offer an EVO:RAIL bundle with Nexenta for file services on top of VSAN!
    Smart move by Nexenta: they are the first 3rd party vendor to add value to the EVO:RAIL package, and straight away they partner with both Dell and SuperMicro. I expect we will start seeing more of these types of partnerships. There are various other vendors who have shown interest in layering services on top of EVO:RAIL, so it is going to be interesting to see what is next!
  • Tintri just announced a new storage system called the T800. This device can hold up to 3500 VMs in just 4U and provides 100TB of effective capacity. With up to 140K IOPS this device also delivers good performance, at a starting price of 74K USD. But more than the hardware, I love the simplicity that Tintri brings. It is probably one of the most user/admin friendly systems I have seen so far, and coincidentally they also announced Tintri OS 3.1 this week, which brings:
    • Long awaited integration with Site Recovery Manager. Great to see that they pulled this one off; it is something which I know people have been waiting for.
    • Encryption for the T800 series
    • Tintri Automation Toolkit which allows for end-to-end automation from the VM directly to storage through both PowerShell and REST APIs!
  • Dell releases the PowerEdge FX. I was briefed on these systems a long time ago and I liked them a lot, as they provide a great modular mini datacenter solution. I can see people using these for Virtual SAN deployments as they allow for a lot of flexibility and capacity in just 2U. What I love about these systems is that they have networking included; that sounds like true hyper-converged to me! There is a great review here by StorageReview.com which I recommend reading. Definitely something I’ll be looking into for my lab; how nice would it be: 4 x FC430 for compute + 2 x FD332 for storage capacity!

That is it for now…

Non-Uniform configurations for VSAN clusters

I have been receiving various questions around support for non-uniform configurations in VSAN environments (sometimes also referred to as “unbalanced” configurations). I was a bit surprised by this to be honest, as personally I am not a big fan of non-uniform configurations to begin with. First, with “non-uniform” I am referring to different hardware configurations. In other words, you have four hosts with a 400GB Intel S3700 flash device and one host with a 200GB Intel S3500. The question was whether this is an acceptable configuration if the overall flash capacity still meets the recommended practice of 10% of used capacity.

Although technically speaking this configuration will work and is supported, from an operational and user experience perspective you need to ask yourself if this is a desired scenario. I have seen people doing these types of configurations out in the field with “flash caching” solutions as well, and believe me when I say that the results were very mixed. The problem is that with a non-uniform configuration the predictability of performance will be impacted. As you can imagine, cutting your flash capacity in half on a host could impact the cache hit ratio for that particular host. Using a different type of flash will also, more than likely, change your results / experience. On top of that, imagine you need to do maintenance on your hosts; it could be that the “non-uniform” host requires different procedures for whatever maintenance you are doing… it just complicates things unnecessarily.
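One way to reason about this is to evaluate the flash-to-used-capacity ratio per host rather than for the cluster as a whole. The numbers in the sketch below are made up, but they show how a cluster can satisfy the 10% guideline in aggregate while the non-uniform host falls short:

```python
# Illustrative per-host check of the "flash = 10% of used capacity"
# guideline. All capacities below are made-up numbers.

FLASH_RATIO_TARGET = 0.10

hosts = {
    "esx01": {"flash_gb": 400, "used_gb": 3000},
    "esx02": {"flash_gb": 400, "used_gb": 3000},
    "esx03": {"flash_gb": 400, "used_gb": 3000},
    "esx04": {"flash_gb": 400, "used_gb": 3000},
    "esx05": {"flash_gb": 200, "used_gb": 3000},  # the non-uniform host
}

total_flash = sum(h["flash_gb"] for h in hosts.values())
total_used = sum(h["used_gb"] for h in hosts.values())
print(f"Cluster wide: {total_flash / total_used:.1%} flash vs used capacity")

for name, h in hosts.items():
    ratio = h["flash_gb"] / h["used_gb"]
    note = "" if ratio >= FLASH_RATIO_TARGET else "  <-- below the 10% guideline"
    print(f"{name}: {ratio:.1%}{note}")
```

In this example the cluster as a whole sits at 12%, yet the host with the smaller flash device only has 6.7% of its used capacity in flash, which is exactly where the unpredictable cache hit ratio comes from.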

So again, although this is supported and will work from a technical perspective it is not something I would recommend from an operational and user experience point of view.