I attended VMware PEX a couple of weeks back, and during some of the sessions and the discussions I had afterwards I realized that many customers out there still design using legacy concepts. Funny thing is that this mainly applies to server virtualization projects and, to a certain extent, to cloud environments. It appears that designing in building blocks is something the EUC side of this world embraced a long time ago.
I want to use this post to get feedback about your environments: how you scale up / scale out. I discussed a concept with one of the PEX attendees which I want to share. (This is no rocket science or anything revolutionary, let that be clear.) This attendee worked for one of our partners, a service provider in the US, and was responsible for creating a scalable architecture for an Infrastructure as a Service (IaaS) offering.
The original plan they had was to build an environment that would allow for 10,000 virtual machines. Storage, networking and compute sizing and scaling were all done with these 10k VMs in mind. However, it was expected that in the first 12 months only 1000 virtual machines would be deployed. You can imagine that internally there was a lot of debate around the upfront investment. Especially the storage and compute platforms were a huge discussion. What if the projections were incorrect? What if 10k virtual machines was not realistic within three years? What if the estimated compute and IOps requirements were wrong? This could lead to substantial underutilization of the environment, and especially in IaaS, where it is difficult to predict how the workload will behave, that could mean a significant loss. On top of that, they were already floor space constrained… which made it impossible to scale / size for 10k virtual machines straight from the start.
During the discussion I threw the building block (pod, stack, block… all the same) method on the table; as mentioned, this is not unlike what the VDI/EUC folks have been doing for years and not unlike what some of you have been preaching. Kris Boyd mentioned this in his session at Partner Exchange and let me quote him on this as I fully agree with his statement: “If you know what works well at a certain scale, why not just repeat that?!” The advantage would be that the costs are predictable, but even more important for the customers and the ops team, the result of the implementation would be predictable. So what was discussed, and what will be the approach for this particular environment, or at least be proposed as a possible architecture?
First of all, a management cluster would be created. This is the mothership of the environment. It will host all vCenter virtual machines, vCloud Director, Chargeback, databases etc. This environment does not have high IOps or compute requirements, so it would be implemented on a small storage device, NFS based storage that is. The reason it was decided to use NFS is that the vCloud Director cells require NFS to transfer files. Chris Colotti wrote an article about when this NFS share is used, which might be useful reading for those interested. This “management cluster” approach is discussed in depth in the vCloud Architecture Toolkit.
For the vCloud Director resources the following was discussed. The expectation was 1000 VMs in the first 12 months, and the architecture would need to cater for this. It was decided to use averages to calculate the requirements for this environment, as the workload was unknown and could literally be anything. How did they come up with a formula in this case? Well, what I suggested was looking at their current “hosted environment” and simply averaging things out: do a dump of all the data and try to come up with some common numbers. This is what it resulted in:
- 1000 VMs (4:1 VM-to-core ratio, average of 6GB memory per VM)
- Required cores = 250 (for example 21 x dual-socket 6-core hosts)
- Required memory = 6TB (for example 24 x 256GB hosts)
This did not take any savings due to TPS into account, and the current hardware platform used wasn’t as powerful as the new one. In my opinion it is safe to say that 24 hosts would cater for these 1000 VMs, and that would include N+2. Even if it did not, they agreed that this would be their starting point and their max cluster size. They wanted to avoid any risks and did not want to push the boundaries too much with regards to cluster sizes. Although I believe 32 hosts in a cluster is no problem at all, I can understand where they were coming from.
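The back-of-the-envelope math behind these numbers can be sketched out. The 4:1 VM-to-core ratio, 6GB average, and example host configurations are from the discussion above; the function name and the way N+2 spares are added are just illustrative. Note that the strict math lands slightly above 24 hosts, which is exactly why TPS savings and the more powerful new hardware were part of the argument:

```python
import math

def size_compute(vms, vms_per_core=4, mem_per_vm_gb=6,
                 cores_per_host=12, mem_per_host_gb=256, ha_spare_hosts=2):
    """Return (required cores, required memory in GB, hosts incl. N+2)."""
    cores = math.ceil(vms / vms_per_core)                # 1000 / 4 = 250
    mem_gb = vms * mem_per_vm_gb                         # 1000 * 6GB = 6TB
    hosts_for_cpu = math.ceil(cores / cores_per_host)    # 250 / 12 -> 21 hosts
    hosts_for_mem = math.ceil(mem_gb / mem_per_host_gb)  # 6000 / 256 -> 24 hosts
    # Memory, not CPU, drives the host count in this example.
    hosts = max(hosts_for_cpu, hosts_for_mem) + ha_spare_hosts
    return cores, mem_gb, hosts

print(size_compute(1000))  # (250, 6000, 26) before any TPS savings
```

Without memory overcommit / TPS this says 26 hosts for a strict N+2, which makes the agreed 24-host cap a reasonable, slightly optimistic starting point.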
The storage part is where it got more interesting. They had a huge debate around upfront costs and did not want to invest at this point in a huge enterprise-level storage solution. As I said, they wanted to make sure the environment would scale, but also wanted to make sure the costs made sense. On average, in their current environment the disk size was 60GB. Multiply that by 1000 and you know you will need at least 60TB of storage. That is a lot of spindles. Datacenter floor space was definitely a constraint, so this would be a huge challenge… unless you use techniques like deduplication / compression and have a proper amount of SSD to maintain a certain service level / guarantee performance.
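The same quick math applies to storage. The 60GB average disk and 1000 VMs are from the post; the data-reduction factor is a hypothetical illustration of why dedupe/compression mattered for the floor space problem, not a number the partner quoted:

```python
def size_storage_tb(vms, avg_disk_gb=60, data_reduction=1.0):
    """Usable TB needed, after an assumed dedupe/compression factor."""
    raw_tb = vms * avg_disk_gb / 1000   # 1000 VMs x 60GB = 60TB raw
    return raw_tb / data_reduction

print(size_storage_tb(1000))                     # 60TB with no reduction
print(size_storage_tb(1000, data_reduction=3))   # 20TB at an assumed 3:1
```

At an assumed 3:1 reduction the spindle/SSD footprint drops to a third, which is where the rack space, power and cooling argument for the newer storage platforms comes from.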
During the discussion it was mentioned several times that they would be looking at up-and-coming storage vendors like Tintri, Nimble and Pure Storage. These were the three specifically mentioned by this partner, but I realize there are many others out there. I have to agree that the solutions offered by these vendors are really compelling and each of them has something unique. It is difficult to compare them on paper though, as Tintri does NFS, Nimble iSCSI, and Pure Storage FC (and iSCSI soon) but is also SSD only. Especially Pure Storage intrigued them due to the power/cooling/rackspace savings. The great thing about all of these solutions is, again, that they are predictable from a cost / performance perspective, which allows for an easily repeatable architecture. They haven’t made a decision yet and are planning on doing an eval with each of the solutions to see how it integrates, scales, performs and, most importantly, what the operational impact is.
Something we did not discuss, unfortunately, was networking. These guys, coming from a traditional networking background, did not have much control over what would be deployed, as their network department was in charge of this. In order to keep things simple they were aiming for a 10Gbit infrastructure; the cost of networking ports was significant and they wanted to reduce the number of cables coming out of the rack for simplicity reasons.
All in all it was a great discussion which I thought was worth sharing. Although the post is anonymized, I did ask their permission before I wrote this up :-). I realize that this is far from a complete picture, but I hope it gives an idea of the approach; if I can find the time I will expand on this with some more examples. I hope that those working on similar architectures are willing to share their stories.
Chris Dearden says
Great article Duncan – my only question around a “Pod” based scale-out is around tool selection for managing multiple pods: what’s the best way to drive economies of scale with the multiple management points a horizontal model will give you? Given that such a roll-up exists, at what point would you purchase it?
John Walsh says
We have been doing this sort of architecture for the last two years and it has served us very well! We did include networking in our stack, and plan to do a DC Network refresh. We did bake Tintri into the vCD POD with amazing results!
Duncan Epping says
@John: In their case networking will of course be included, but this part is designed by a different team which is (overly) protective of their domain. Which in a way is totally understandable.
@Chris: Good question, but difficult for me to answer as that is something that needs to be determined on a per-customer basis. One thing to note though is that if they had scaled up to that size they would have also ended up with multiple vCenter Servers etc, which means they would have had multiple management points anyway. In many of these environments, however, common tasks are fully automated using orchestration tools, which lowers the impact of having multiple mgmt points.
Dwayne Lessner says
Pure Storage for the moment is only FC; iSCSI will come soon when they build in replication.
First thoughts would be toward a Nutanix implementation. Node by node, block by block, scale to the needed capacity. While they do not handle large storage requirements well (multiple TBs), the data provided suggests large data was not a concern.
I would also look into fabric designs like Xsigo. You could deploy a Xsigo solution with a couple of chassis, and have the physical footprint to add compute nodes as needed going forward.
Storage is difficult. The choice of flash-based storage seems unusual given the reference to ROI and utilization concerns. I have been looking for a “LeftHand”-like solution that can scale without Isilon’s price tag.
Duncan Epping says
At first that surprised me as well, Donny. But if you look at some of the flash-based storage providers and compare their price and upfront cost to those of a traditional provider, you would be surprised how close they all are, especially if you take rack space / cooling / power into account. Note that I did not do these calculations myself; this is what I was told by this partner.
Kelly O says
Agree with Donny. Sounds like a good Nutanix fit. That would drop cable usage too down to 4/host.
I agree with you Duncan that many times people who have been doing virt for a while don’t necessarily realize when that one “best practice” is no longer a best practice because a certain feature has changed it. Sometimes even someone who reads a lot of blogs and new doco when it comes out doesn’t always hear when something changes. I remember thinking VMware should put out some sort of best-practice change doco, or at least mention in the newest one when a particular best practice is now moot.
Really, modular infrastructures built from standard building blocks are a pretty cool concept when you start thinking about scale. Couple that with orchestration of the individual components and you have a really powerful story. Glad to see you writing this up.
I agree with Donny on the Xsigo part. Scaling to 10,000 guests to me would put the emphasis on your connectivity fabric and your storage sub-system. I have always been a fan of IB, and virtualization increases its vitality. For the storage part I would think that the HDS VSP would give you the best scalability and performance to meet the needs. The sub-LUN level tiering would allow you to populate the box with a majority of SATA disk and heavily utilize the large cache and SSD of the array. With that part in place, compute nodes are a dime a dozen; you can put whatever you want in front of it. Today we are reaching the 1,000 guest mark on a 5-node cluster, keeping Duncan’s 4:1 VM/core ratio.
Charles Williams says
Great post Duncan!
I had a quick question about the 4:1 VM/core ratio. Does VMware have any best practice documentation concerning the VM/core ratio? I ask because we have been trying to set a target and I haven’t been able to find anything.
We run a very mixed workload, with the majority of VMs having two vCPUs and ~6GB RAM. I keep pushing for a higher ratio than we have right now (2 VMs/core) since we aren’t seeing any performance issues, but I haven’t been able to find any documentation to support it.
I don’t think that there would be much of a best practice for VM/core ratio (I could be very wrong). Every workload is different, and it will affect the number of VMs you will be able to get per core, not to mention the type of processor that you are using. As mentioned in my post above, we have about 1,000 guests on a 5-node cluster, which is actually a 5:1 ratio (mis-stated above). These are a majority of 1 vCPU boxes with light utilization on Westmere EX class processors, and I would think that we could go to almost 8:1. Some of our other clusters have larger guests with 4 – 8 vCPUs and heavier utilization, and we will only be able to get a 2:1 ratio on older Harpertown quad-cores. My opinion would be that it varies greatly depending on your environment. I would keep piling them on while keeping a close watch to make sure you aren’t seeing any excessive CPU wait times. We typically stop when we see fairly consistent spikes to 3%. Would like to hear others’ opinions also.