
Yellow Bricks

by Duncan Epping



Scale Up/Out and impact of vRAM?!? (part 2)

Duncan Epping · Jul 21, 2011 ·

** Disclaimer: I am a VMware employee **
** I am not affiliated with Dell, just picked them as their website is straight forward **

About a year ago I wrote an article about scaling up. I have been receiving multiple requests to update that article, as with vRAM many seem to be under the impression that the world has changed. But did it really? Yes, I know I am about to burn myself, but then again I am Dutch and we are known for our bluntness, so let me be that Dutch guy again. Before this turns into a “burn the witch who dares to speak about vRAM” thread, let me be clear: this article is not about vRAM per se. Of course I will touch upon it and explain why I don’t think there is a problem in the scenario I am describing, but that is not what this article is about.

In my previous article I discussed the benefits of both scaling up and scaling out. As I stated, I had that discussion with customers when hosts were moving towards 32GB per host; now we are moving towards 32GB DIMMs instead, easily cramming 256GB into a host. The world is changing and so is your datacenter, with or without vRAM (there is that word again). Once again, I am not going to discuss vRAM by itself, as I am not an analyst or responsible for pricing and packaging within VMware. What I do want to discuss is whether vRAM has an impact on the scale-out vs. scale-up discussion, as some are under the impression it does.

Let’s assume the following:

  • To virtualize: 300 servers
  • Average Mem configured: 3GB
  • Average vCPU configured: 1.3

That would be a total of 900GB of memory and 390 vCPUs. From a CPU perspective, the recommended best practice that VMware PSO has used for the last few years has been 5-8 vCPUs per core; we’ll come back to why this is important in a second. Let’s assume we will use 2U servers for now, with different configurations. (When you do the math, fiddle around with the RAM/core/server ratio; 96 vs 192 vs 256 could make a nice difference!)

  • Config 1:
    • Dell R710
    • 2 x 4 Core – Intel
    • 96GB of memory
    • $5,500 per host
  • Config 2:
    • Dell R810
    • 2 x 8 Core – Intel
    • 192GB of memory
    • $13,000 per host

If we do the quick math from a memory perspective, assuming a roughly 20% TPS benefit (which is very conservative) and rounding up, we need 10 x R710s or 5 x R810s. I have noticed multiple people making statements about not recommending memory overcommitment because of vRAM; that doesn’t make any sense to me, as memory techniques like TPS only lower the overall costs. As mentioned, it is recommended to have 5-8 vCPUs per core, so let’s go for 6 vCPUs per core. That means from a vCPU perspective we will need 9 x R710s or 5 x R810s. We will take the worst-case scenario into account and go with the larger number for either RAM or CPU; the sketch below the list walks through the arithmetic. So that results in:

  • 10 x Dell R710 = 55k
  • 5 x Dell R810 = 65k
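For those who want to verify the math, here is a minimal Python sketch of the sizing calculation. The workload figures, the 6 vCPUs per core ratio, and the host specs come straight from the scenario above; sizing memory on the full 900GB (with the conservative ~20% TPS benefit kept as headroom) is my reading of the rounding used here.

```python
import math

# Workload from the scenario: 300 servers to virtualize
total_mem_gb = 300 * 3     # 900 GB of configured memory
total_vcpus = 300 * 1.3    # 390 vCPUs

vcpus_per_core = 6         # middle of the 5-8 vCPUs/core PSO guideline

# (name, total cores, memory in GB, price per host)
configs = [("Dell R710", 2 * 4, 96, 5_500),
           ("Dell R810", 2 * 8, 192, 13_000)]

for name, cores, mem_gb, price in configs:
    # Size on the full 900GB; the ~20% TPS benefit is treated as
    # headroom, which matches the rounded host counts above.
    hosts_mem = math.ceil(total_mem_gb / mem_gb)
    hosts_cpu = math.ceil(total_vcpus / (vcpus_per_core * cores))
    hosts = max(hosts_mem, hosts_cpu)  # worst case of the two constraints
    print(f"{name}: memory -> {hosts_mem}, CPU -> {hosts_cpu}, "
          f"need {hosts} hosts, hardware ${hosts * price:,}")
```

This prints 10 hosts ($55,000) for the R710 and 5 hosts ($65,000) for the R810, matching the numbers above.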

Before anyone asks: I also looked at AMD 12-core systems with 256GB, which come in around 16.5k, and you would need roughly 4 hosts to accomplish the same. Looking at the cost of those boxes and comparing them with Intel, I would honestly expect a broader adoption of AMD, but let’s focus on the Intel comparison for now. So that is only a 10k difference when looking at hardware, while the cost of managing the R810s is lower (fewer hosts), not even talking about I/O ports, cooling and power. (I am trying to keep things simple, but when adding these costs the difference will be even bigger.)

So what about that vRAM thingie? What about that, huh? Well, as I said, this is not about vRAM, but will it matter when buying large hosts? It might, but only when you buy more capacity than you need in this example and want to license all of it beforehand. In this case, does it matter? 300 VMs x 3GB vRAM is 900GB of vRAM, which works out to 18.75 Enterprise Plus licenses at 48GB of vRAM entitlement each; the type of host will not change this, or will it? Well, actually it will. With the R710 you will need 20 (10 x 2) socket licenses, assuming Enterprise Plus is used. With the Dell R810 we will need 10 licenses from a socket perspective, but 19 from a vRAM perspective using Enterprise Plus. (The sketch after the list below walks through this calculation.)

Let’s put it in perspective:

  • Scale out
    • 20 Enterprise+ licenses required
    • 10 hosts required
    • Estimated cost for hosts + licenses: 105k
  • Scale up
    • 19 Enterprise+ licenses required
    • 5 hosts required
    • Estimated cost for hosts + licenses: 112.5k
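A similar sketch for the licensing math. The 48GB of vRAM entitlement per Enterprise Plus license follows from the 18.75 figure above (900 / 48 = 18.75), and the roughly $2,500 per license is what the 105k and 112.5k totals imply; treat both as assumptions of this scenario, not official pricing.

```python
import math

total_vram_gb = 300 * 3    # 900 GB of configured vRAM
vram_per_license = 48      # GB entitlement per Enterprise+ license
license_price = 2_500      # implied by the totals above; an assumption

# (name, hosts, sockets per host, hardware price per host)
scenarios = [("Scale out (R710)", 10, 2, 5_500),
             ("Scale up (R810)", 5, 2, 13_000)]

for name, hosts, sockets, host_price in scenarios:
    socket_licenses = hosts * sockets                            # 20 or 10
    vram_licenses = math.ceil(total_vram_gb / vram_per_license)  # 19
    licenses = max(socket_licenses, vram_licenses)  # must satisfy both
    total = hosts * host_price + licenses * license_price
    print(f"{name}: {licenses} licenses, {hosts} hosts, total ${total:,}")
```

This reproduces the $105,000 vs $112,500 totals above.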

Looking at the total cost of acquisition in terms of hardware and vSphere licenses only, scale-up is indeed slightly more expensive (7.5k) in this scenario. So should you go big?

As mentioned in my other posts, there are a couple of things to keep in mind when making this decision. I cannot make it for you, unfortunately, but there are of course things to factor in. Many of these also carry a substantial cost, and I can guarantee that those costs will more than make up for that 7.5k! (See the toy example after the list.)

  • Cost of guest operating systems and applications (licensed per socket in some cases)
  • Cost of I/O ports (storage + network)
  • Cost of KVM / rack space
  • Cost of power / cooling
  • Cost of operating each host (think firmware updates etc.)
  • Cost of support (hardware + software)
  • Total number of VMs
  • Total number of vCPUs
  • Total amount of vRAM
  • vCPU-per-core ratio
  • Redundancy, taking N+1 into account
  • Impact of failure
  • Impact on DRS (fewer hosts means fewer balancing options)
  • Impact on TPS (fewer hosts means more memory sharing, which means less physical RAM needed)
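As a toy illustration of how quickly those per-host costs eat up the 7.5k difference, assume a purely hypothetical $2,000 per host per year for power, cooling, I/O ports and patching:

```python
# Hypothetical operational cost per host per year; the $2,000 figure
# is an assumption for illustration only.
annual_cost_per_host = 2_000
years = 3

scale_out = 105_000 + 10 * annual_cost_per_host * years  # 10 hosts
scale_up = 112_500 + 5 * annual_cost_per_host * years    # 5 hosts

print(f"Scale out over {years} years: ${scale_out:,}")   # $165,000
print(f"Scale up over {years} years: ${scale_up:,}")     # $142,500
```

With this (made-up) figure, the scale-up option already comes out roughly $22,500 ahead over three years.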

Once again, I cannot make the call for you; it will depend on what you feel is most important. If you are concerned about placing all your eggs in one basket, you should probably go for scale-out, but if your primary concern is cost and you trust your hardware platform, scale-up would be the way to go. One thing to consider before you make your decision: how often does a server fail due to a hardware defect vs. a human error? Would fewer servers also imply fewer chances of human error? But would it also imply a larger impact when a human error does occur?

For those looking for more exact details, I would recommend reading this excellent post by Bob Plankers! Bob and I exchanged a lot of DMs and emails on this topic over the last couple of days, and I want to thank him for validating my logic and for the tremendous amount of effort he has put into his article and spreadsheet. I also want to thank Massimo Re Ferre for proofreading. This article by Aaron Delp is also worth reading; Aaron released it just after I finished this article. Talking about useful articles, I would also like to refer to Massimo’s article, which was published in 2009 but is still very relevant! Scale-up vs. scale-out is a hot topic, I guess.

Now I am looking at you guys to chip in, and please keep the fight nice and clean; no clinching, spitting or cursing…


VMware HA Deployment Best Practices

Duncan Epping · Dec 13, 2010 ·

Last week VMware released an official paper on deployment best practices for HA, and I was one of the authors of the document. Together with several people from the Technical Marketing team, we gathered all the best practices we could find, then validated and simplified them to make the paper rock solid. I think it is a good read; it is short and sweet, and I hope you will enjoy it.

Latest Revision:
Dec 9, 2010

Download:
http://www.vmware.com/files/pdf/techpaper/VMW-Server-WP-BestPractices.pdf

Description

This paper describes best practices and guidance for properly deploying VMware HA in VMware vSphere 4.1. These include discussions of proper network and storage design, and recommendations on settings for host isolation response and admission control.

5 Tips for preparing your VCDX Defense

Duncan Epping · Nov 15, 2010 ·

After the VCDX defenses in Boston I had a chat with Craig Risinger, also known as 006 ;-). We discussed some of the things we’d seen on the panels and came to the conclusion that it wouldn’t hurt to reiterate some of the tips we’ve given in the past.

  1. It’s OK to change your actual project documents. See the following points for examples. This isn’t really about what you actually happened to do on a particular project with its own unique set of circumstances; it’s about showing what you can do. This is your portfolio to convince potential customers you can do their design, whatever they might need. It’s about proving you could work with a customer to establish requirements and design an architecture that meets them.
  2. Include everything the Application says is mandatory. Don’t be surprised if you have to write some new documents or sections. For example, maybe a Disaster Recovery plan wasn’t important in your project, but it will be to another customer or in another project, so you should show you know how to create one.
  3. Explain any bad or debatable decisions. Did your customer insist on doing something that’s against best practices? Did you explain what was wrong with it? Say how you would have preferred to do things and why. Even if you just made a mistake back then, that’s OK if you can show that you’ve learned and understand the error you made. If you are using VMware’s best practices make sure you know why it is a best practice and why it met your customer’s requirements.
  4. Show you can design for large scale. It’s OK if your actual project was for a small environment, but show that you can think big too. What would you have done for a bigger customer, or for a customer who wanted to start small but be able to scale up easily? What would you need to do to add more VMs, more hosts, more storage, more networking, more vCenter servers, more roles and division of duties, a stronger BC/DR plan in the future? How would that change your design, if at all?
  5. Architect = Knowledge + Reasoning. The VCDX certification isn’t just about knowing technical facts; it’s about being able to apply that knowledge to meet goals. In the defense session itself, be prepared to discuss hypothetical scenarios and alternative approaches, to decide on a design, and to explain the reasons for your choices. Show you know how to consider the pros and cons of different approaches.

There are also many other useful collections of advice for pursuing a VCDX certification; we highly recommend reading them, as they will give you an idea of the process. Here’s just a sample:

  • John Arrasjid’s VCDX Tips
  • VCDX Workshop Presentation
  • Duncan Epping’s VCDX Defense Experience
  • Jason Boche’s VCDX Defense Experience
  • Maish’s VCDX Defense Experience
  • Frank Denneman’s VCDX Defense Experience
  • Kenneth van Ditmarsch’s VCDX Defense Experience
  • Scott Lowe’s VCDX Defense Experience
  • Rick Scherer’s VCDX Defense Experience
  • Fabio Rapposelli’s VCDX Defense Experience
  • Jason Nash’s VCDX Defense Experience
  • Harley Stagner’s VCDX Defense Experience
  • Andrea Mauro’s VCDX Defense Experience
  • Chris Kranz’s VCDX Defense Experience

Craig Risinger (VCDX006) & Duncan Epping (VCDX007)

Last chance to become a VCDX3!

Duncan Epping · Oct 4, 2010 ·

For those who are currently in the process of obtaining a VCDX certification, please note that VCDX3 has come to an end. If you still want to become a VCDX3 and have already passed the Design and Enterprise exams, Partner Exchange in Orlando (the week of February 7, 2011) is your last chance.

If you have passed both the VCE310 and VCD310 exams and wish to apply for a VCDX3 certification:

  • Download the VCDX3 Application & Handbook and prepare your defense for the week of February 7, 2011.
  • The application will be due on November 22, 2010, at 5:00 PM PST.
  • The final opportunity to deliver a defense in pursuit of the VMware Certified Design Expert on VI3 (VCDX3) certification will be at VMware Partner Exchange in Orlando, Florida the week of February 7, 2011.
  • Please note that defense slots are limited and will be reserved for candidates who submit completed applications in the order received.

You have a month and a half to submit your design, but I would definitely recommend getting it in as soon as possible. Make sure, however, that your Application Form is completely filled out. There are a bunch of tips to be found here.

For those who think they can still quickly register for both exams to become a VCDX3:

  • Registrations for the VMware Enterprise Administration on VMware Infrastructure 3 Exam (VCE310) have been closed. No new registrations for this exam will be accepted.
  • Registrations for the VMware Design on VMware Infrastructure 3 Exam (VCD310) have been closed. No new registrations for this exam will be accepted.

Layer 2 Adjacency for vMotion (vmkernel)

Duncan Epping · Aug 19, 2010 ·

Recently I had a discussion around Layer 2 adjacency for the vMotion (vmkernel interface) network, meaning that all vMotion interfaces, aka vmkernel interfaces, would need to be on the same subnet, as otherwise vMotion would not function correctly.

I remember when this used to be part of the VMware documentation, but that requirement is nowhere to be found today. I even have a memory of documentation for previous versions stating that layer-2 adjacency was “recommended”, but even that is nowhere to be found. The only reference I could find was an article by Scott Lowe where Paul Pindell from F5 chips in and debunks the myth, but as Paul is not a VMware spokesperson, it is not definitive in my opinion. Scott also just published a rectification of his article after we discussed this myth a couple of times over the last week.

So what are the current Networking Requirements around vMotion according to VMware’s documentation?

  • On each host, configure a VMkernel port group for vMotion
  • Ensure that virtual machines have access to the same subnets on source and destination hosts
  • Ensure that the network labels used for virtual machine port groups are consistent across hosts

Now that got me thinking: why would it even be a requirement? As far as I know vMotion is all layer 3 today, and besides that, the vmkernel interface even has the option to specify a gateway. On top of that, vMotion does not check whether the source vmkernel interface is on the same subnet as the destination interface, so why would we care?
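As a quick sanity check of what “on the same subnet” actually means for two vmkernel interfaces, here is a small Python sketch; the addresses and prefix length are made up for the example, and this only illustrates the adjacency check itself, not anything vMotion performs internally.

```python
import ipaddress

def same_subnet(ip_a: str, ip_b: str, prefix: int) -> bool:
    """True if both vmkernel IPs fall within the same /prefix network."""
    net_a = ipaddress.ip_network(f"{ip_a}/{prefix}", strict=False)
    net_b = ipaddress.ip_network(f"{ip_b}/{prefix}", strict=False)
    return net_a == net_b

# Hypothetical vMotion vmkernel interfaces on two hosts
src, dst = "10.10.1.11", "10.10.2.11"

if same_subnet(src, dst, 24):
    print("Layer-2 adjacent: vMotion traffic stays on the local subnet.")
else:
    print("Different subnets: traffic is routed via the vmkernel gateway; "
          "this works, but had not been through VMware QA at the time.")
```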

Now that makes me wonder where this myth is coming from… Have we all assumed L2 adjacency was a requirement? Have the requirements changed over time? Has the best practice changed?

Well, one of those is easy to answer: no, the best practice hasn’t changed. Minimizing the number of hops to reduce latency is, and always will be, a best practice. Will vMotion work when your vmkernel interfaces are in two different subnets? Yes, it will. Is it supported? No, it is not, as it has not explicitly gone through VMware’s QA process. However, I have had several discussions with engineering, and they promised me a more conclusive statement will be added to our documentation and the KB in order to avoid any misunderstanding.

Hopefully this will debunk, once and for all, a myth that has been floating around for long enough. As stated, it will work; it just hasn’t gone through QA and as such cannot be supported by VMware at this point in time. I am confident, though, that over time this statement will change to increase flexibility.

References:

  • Integrating VMs into the Cisco Datacenter Architecture (ESX 2.5)
  • Deploying BIG-IP to enable LD vMotion
  • vMotion Practicality
  • http://kb.vmware.com/kb/1002662
