This week I had the pleasure of having a chat with Marc Huppert (VCDX181). Marc works for Computacenter in Germany as a Senior Consultant and Category Leader VMware Solutions, and he primarily focuses on datacenter technology. I noticed a tweet from Marc saying he was working on a project where they will be implementing a 60-site ROBO deployment. It is one of those use cases I don’t get to see too often in Europe, so I figured I would drop him a note. We had a conversation of about an hour, and below is the story of this project along with some of the other projects Marc has worked on.
Marc mentioned that he has been involved with VSAN for about 1.5 years now. At first they did intensive internal testing to see what VSAN could do for their customers and looked at the various use cases for their customer base. They quickly discovered that combining VSAN with Fusion-IO resulted in a very powerful combination: not only extremely reliable (Marc mentioned he has never seen a Fusion-IO card fail), but also extremely well performing. They did comparisons between Fusion-IO and regular SATA-connected SSDs and performance literally doubled, not just for reads but also for writes. One of the other reasons for considering PCIe-based flash is to keep the maximum number of disk slots available for the capacity tier. It all makes sense to me. For current projects, NVMe-based flash by Intel is being explored, and I am very curious to see what Marc’s experience will be like in terms of performance, reliability and the operational aspects compared to Fusion-IO.
Which brings us to the ROBO project, as, surprisingly enough, this is the project where NVMe will be used. Marc mentioned that this customer, a large company, has over 60 locations, all connected to a central (main) datacenter. Each location will be equipped with 2 hosts. Depending on the size of the location and the number of VMs needed, a particular VSAN configuration can be selected:
- Small – 1 NVMe device + 5 disks, 128GB RAM
- Medium – 2 NVMe devices + 10 disks, 256GB RAM
- Large – 3 NVMe devices + 15 disks, 384GB RAM
Yes, that also leaves room to grow when desired, as every disk group can go up to 7 disks. From a compute point of view the configurations do not differ much besides the memory configuration and disk capacity; the CPU is actually the same across all three, to keep operations simple. In terms of licensing, the vSphere ROBO and VSAN ROBO editions are being leveraged, which provides a scalable and affordable ROBO infrastructure, especially when coming from a two-node configuration with a traditional storage system per location. Not just because of the price point, but primarily because of the day 2 management.
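For illustration only, here is a minimal sketch of how these three building blocks could be captured and selected programmatically. The hardware specs follow the small/medium/large list above, but the VM-count thresholds and the pick_config helper are hypothetical and not part of Marc’s actual sizing process.

```python
# Hypothetical sketch: the three ROBO building blocks described above.
# Hardware specs follow the small/medium/large list; the VM-count
# thresholds in pick_config() are made-up illustration values.
ROBO_CONFIGS = {
    "small":  {"nvme_devices": 1, "capacity_disks": 5,  "ram_gb": 128},
    "medium": {"nvme_devices": 2, "capacity_disks": 10, "ram_gb": 256},
    "large":  {"nvme_devices": 3, "capacity_disks": 15, "ram_gb": 384},
}

def pick_config(expected_vms: int) -> str:
    """Return a building-block name for a site (thresholds are assumptions)."""
    if expected_vms <= 25:
        return "small"
    if expected_vms <= 60:
        return "medium"
    return "large"

if __name__ == "__main__":
    for site, vms in {"branch-01": 15, "branch-02": 45, "branch-03": 90}.items():
        name = pick_config(vms)
        print(site, vms, "VMs ->", name, ROBO_CONFIGS[name])
```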
When demonstrating VSAN to this customer, that simplified day 2 management is what impressed them the most. They have two people managing the entire virtual and physical estate: 60 locations (120 nodes) plus the main datacenter, which houses roughly 5000 VMs and many physical machines. You can imagine that they spend a lot of time in vCenter, and they prefer to manage things end to end from that same spot; they definitely do not want to be switching between different management interfaces. Today they manage many small storage systems for their ROBO locations, and they immediately realised that VSAN in a ROBO configuration would significantly reduce the time they spend managing those locations.
And that is just the first step; next up would be the DMZ. They have a separate compute cluster as it stands right now, but unfortunately it connects back to the same shared storage system their production runs on. They fully understand the risk, but never wanted to incur the large cost associated with a storage system dedicated to their DMZ, not just from a capex but also from an opex point of view. With VSAN the economics change, making a fully isolated and self-contained DMZ compute and storage cluster dead simple to justify, especially when combining it with NSX.
One awesome customer if you ask me, and I am hoping they will become a public VSAN reference at some point in the future, as it is a great testimony to what VSAN can do. We briefly discussed other use cases Marc had seen out in the field, and Horizon View, management clusters and production came up, which is very similar to what I see. Marc also mentioned that there is a growing interest in all-flash, which is not surprising considering the dollar-per-GB cost of SAS is very close to that of flash these days.
Before we wrapped up, I asked Marc whether he had any challenges with VSAN itself and what he felt was most complex. Marc mentioned that sizing is a critical aspect and that they have spent a lot of time in the past figuring out which configurations to offer to customers. Today the process they use is fairly straightforward: select a Ready Node configuration, swap the SATA SSD for PCIe-based flash or NVMe, and increase or decrease the number of disks. Fully supported, yet still flexible enough to meet all the demands of his customers.
Thanks Marc for the great conversation, and looking forward to meeting up with you at VMworld. (PS: Marc has visited all VMworld events so far in both the US and EMEA, a proper old-timer you could say :-))
You can find him on Twitter or on his blog: http://www.vcdx181.com
Emanuele Roserba says
Hello Duncan, reading the article raised some questions:
How do they manage the environment? Initially I thought about Linked Mode, but I suppose it is ruled out because of the 10 linked vCenter Servers maximum.
From an HA and resource optimization point of view, does using 2-node clusters make sense? It could mean having 50% of the resources wasted, instead of roughly 33% with a 3-node configuration, for example.
Comparing the VSAN + Fusion-IO solution to a traditional storage array with flash cache or flash pools, did the east-west traffic generated by VSAN itself pose particular issues when sizing the network environment?
Thank you! Emanuele.
Thiago Hickmann says
Hi Duncan,
For a VSAN node using an NVMe SSD in the cache tier and read-intensive SATA SSDs in the capacity tier, do you think the all-flash behavior of using the caching tier just for writes would be appropriate? Or would a change at the caching layer to a behavior similar to a hybrid configuration make more sense?
My reasoning is that, while write-intensive (WI) and read-intensive (RI) SATA/SAS SSDs have similar read performance, an NVMe device packs a far bigger punch.
As an example:
The Intel P3700 NVMe delivers 450k read IOPS
The Intel S3510 SATA delivers 65k read IOPS
http://ark.intel.com/compare/86192,79624
Maybe this is an engineering-level debate and I’m overthinking it =D
Duncan Epping says
We have customers using NVMe for everything, but keep in mind that you have 1 caching device with maybe 5-7 capacity disks sitting behind it, which means that it is not 65k IOPS but 5 x 65k or even more. So reads are typically not an issue. We also have an in-memory read cache on every host locally (data locality, as some refer to it), which also helps speed up reads.
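To make that arithmetic concrete, here is a quick back-of-the-envelope sketch using the vendor spec-sheet figures quoted in the question; these are assumptions, and real-world numbers will of course vary with workload and queue depth.

```python
# Back-of-the-envelope read IOPS of a single disk group, using the
# vendor spec-sheet figures quoted above (assumptions, not measurements).
CACHE_READ_IOPS = 450_000      # Intel P3700 NVMe (cache tier)
CAPACITY_READ_IOPS = 65_000    # Intel S3510 SATA (capacity tier)

def capacity_tier_read_iops(capacity_disks: int) -> int:
    """Aggregate read IOPS of the capacity disks behind one cache device."""
    return capacity_disks * CAPACITY_READ_IOPS

for disks in (5, 6, 7):
    print(f"{disks} capacity disks -> ~{capacity_tier_read_iops(disks):,} read IOPS "
          f"(single cache device: ~{CACHE_READ_IOPS:,})")
```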