Lately I have been having a lot of discussions with customers about the sizing of their hosts. Especially Cisco UCS (with the 384GB option) and the upcoming Intel Xeon 5600 series with six cores per CPU take the “Scale Up” discussion to a new level.
I guess we had this discussion in the past as well, when 32GB became a commodity. The question I always have is: how many eggs do you want to have in one basket? Basically, do you want to scale up (larger hosts) or scale out (more hosts)?
I guess it’s a common discussion, and a lot of people don’t see the impact of sizing their hosts correctly. Think about this environment: 250 VMs in total, needing roughly 480GB of memory:
- 10 Hosts, each having 48GB and 8 Cores, 25 VMs each.
- 5 Hosts, each having 96GB and 16 Cores, 50 VMs each.
If you look at it from an uptime perspective: should a host fail in scenario 1, you will lose 10% of your environment; in scenario 2 that is 20%. Clearly the cost associated with downtime for 20% of your estate is higher than for 10% of your estate.
Now it’s not only the cost associated with the impact of a host failure, it is also, for instance, the ability of DRS to load balance the environment. The fewer hosts you have, the smaller the chance that DRS will be able to balance the load. Keep in mind that DRS uses a deviation to calculate the imbalance and simulates moves to see if they result in a balanced cluster.
Another thing to keep in mind is HA. When you design for N+1 redundancy and need to buy an extra host, the costs associated with that redundancy are high in a scale-up scenario. And it is not only the costs: the load increase when a failover needs to occur is also immense. If you only have 4 hosts and 1 host fails, the added load on the remaining 3 hosts will have a much higher impact than it would have on, for instance, 9 remaining hosts in a scale-out scenario.
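Purely as a back-of-the-envelope illustration, here is a small Python sketch using the numbers from the example above. The even spread of VMs and the assumption that HA restarts every VM from the failed host on the survivors are simplifications:

```python
# Back-of-the-envelope comparison of the two scenarios above (250 VMs total).
# Assumes VMs are spread evenly and that HA restarts all VMs from the failed
# host on the surviving hosts -- a simplification for illustration only.

def failure_impact(total_vms, hosts):
    vms_per_host = total_vms / hosts
    failure_domain = vms_per_host / total_vms      # share of the estate lost
    extra_vms = vms_per_host / (hosts - 1)         # extra VMs per surviving host
    added_load = 1 / (hosts - 1)                   # relative load increase per survivor
    return vms_per_host, failure_domain, extra_vms, added_load

for hosts in (10, 5):
    vms, domain, extra, added = failure_impact(250, hosts)
    print(f"{hosts} hosts: {vms:.0f} VMs/host, {domain:.0%} of the estate down "
          f"on a host failure, {extra:.1f} extra VMs (+{added:.0%} load) per survivor")
```

The 20% failure domain and the +25% failover load in the 5-host scenario are exactly the numbers to weigh against the lower per-host cost.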
Licensing is another often-used argument for buying larger hosts, but for VMware licensing it usually will not make a difference. I’m not a “capacity management” or “capacity planning” guru, to be honest, but I can recommend VMware Capacity Planner, as it can help you easily create several scenarios. (Or PlateSpin Recon, for that matter.) If you have never tried it and are a VMware partner, check it out, run the scenarios based on scale-up and scale-out principles, and do the math.
Now, don’t get me wrong, I am not saying you should not buy hosts with 96GB, but think before you make this decision. Decide what an acceptable risk is and discuss the impact of that risk with your customer(s). As you can imagine, for any company there’s a cost associated with downtime. Downtime for 20% of your estate will have a different financial impact than downtime for 10% of your estate, and this needs to be weighed against all the pros and cons of scale out vs scale up.
Paul Valentino says
Also, I like the idea of the increased DIMM count in the newer systems/blades, not because I can build a bigger box with tons of RAM, but because I can fill the surplus of slots with much more cost-effective 4GB DIMMs. Even the 8GB DIMMs are getting more reasonable, so if the base price + RAM breaks even or better, I’m going for memory capacity.
Mike Laverick says
Taking this further – if the 5 servers were 4U boxes and the 10 were 2U, they would be taking up precisely the same rack space….
With that said, the 10-host model requires double the power supplies and HBAs… and that doesn’t come cheap.
There’s a lot of talk about scaling up and the eggs-in-one-basket scenario. I’ve written quite a bit about it myself.
http://searchvirtualdatacentre.techtarget.co.uk/news/column/0,294698,sid203_gci1380058,00.html
I think the biggest barriers to scale up are not the quality of the availability tech – but the cost of 4GB/8GB/16GB RAM. Whilst the hypervisors and chassis can take the RAM, not everyone can afford the bigger densities. There’s a lot of anxiety surrounding the failure of a host – but be honest, how often does it actually happen??? Are we building a level of redundancy into the model when it might not be needed… With that said, I would rather have HA/DRS/FT than not…
Yvo Wiskerke says
Good explanation.
My two cents:
I think the example of 96GB versus 48GB hosts is good, but when you stretch the example to hosts with even more memory, the impact becomes even bigger. For example, hosts with 384GB.
It also strongly depends on what you run (desktops or servers?) and what kind of workload runs inside the VMs.
With desktops, the need for failover drops when using brokers (i.e. VMware View). With newer or future releases of products by Cisco, HP, IBM, etc. getting up to 32 pCores per host, this design consideration has to be taken into account. A 32-pCore host with 384GB of memory can run a large number of desktop VMs, but the impact of a host failure can be anywhere from something like 190 to over 300 VMs per host. Now that is a lot of VMs that will fail when the host dies.
Larger workload VMs with multiple vCPUs and large memory requirements are a better fit for these kinds of setups.
Now, making the ROI/TCO calculation, it’s up to the customer whether they want to scale out or scale up for these workloads.
Ronald says
More DIMMs in a system might have an impact on the speed of the memory bus. In the Xeon 5500-based systems, the best way to fill your system is in multiples of 6 DIMMs, but depending on the CPU, 6 DIMMs can run at maximum speed, 12 DIMMs may slow down a bit, and 18 DIMMs slow down even more. And having (for instance) 8 DIMMs is really inefficient from a speed point of view.
Excellent information about this is here (thanks to Scott Lowe): http://blog.scottlowe.org/2009/05/11/introduction-to-nehalem-memory/
and here: http://www.delltechcenter.com/page/04-08-2009+-+Nehalem+and+Memory+Configurations
Craig Risinger says
All good points. To raise a few others:
1. The more hosts you have, the more likely it is that one will fail. Run the scenario to extremes to see the principle: What are the odds that you will have no host failures in a year if you have 5 hosts? What if you have 2,000? So would you prefer the bigger risk of a small failure (few VMs), or a smaller risk of a big failure? (A quick numeric sketch follows below.)
2. Power consumption. Though one assumes the big boxes use more juice each than each of the small ones, I’d guess it’s not 2x.
3. Number of hosts to manage.
As always, apply the Goldilocks Principle: Not too big, not too small, but just right.
Overall I don’t think scale up vs. scale out has a universal answer. Each org needs to figure out what’s “just right” for their case.
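A quick numeric illustration of point 1, assuming (purely for the sake of the example) a 2% chance per host per year of a hardware failure, with failures independent of each other:

```python
# The trade-off: more hosts means a higher chance of *some* failure in a year,
# but each failure takes out a smaller slice of the estate. The 2% annual
# per-host failure rate is an assumption for illustration only.
p_fail = 0.02

for hosts in (5, 10, 2000):
    p_failure_free_year = (1 - p_fail) ** hosts
    print(f"{hosts:>5} hosts: {p_failure_free_year:.1%} chance of a failure-free year, "
          f"each failure affects {1 / hosts:.1%} of the estate")
```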
Craig Risinger says
P.S.
4. Transparent Page Sharing. The more VMs per host, the greater the TPS savings in pRAM.
Duncan Epping says
@paul : the new hosts with many slots are cool indeed!
@mike: it’s not the risk but the impact associated with the risk. Although chances are slim, it does happen, and when it does the impact is higher. But I agree there’s more to it.
@yvo: The use case is important indeed, VDI vs server.
@Craig: the operational aspect is indeed important; that’s an overhead which will affect the costs.
Nicolai Sandager says
As with a lot of other design issues there is really no final answer, except “It depends” (or 42, if you are into that).
I have customers moving away from blade infrastructures due to the cost of expanding the platform. It is either buy more blades or pay through the nose for more RAM.
If we take “The Idealist” approach to this, we are trying to minimize the number of boxes, and thereby cooling/power/CO2 and the like. Taking that approach would point towards the “fewer, bigger boxes” approach.
That being said, the issue of a bigger disruption at failure when running a higher VM density is very real. But are we really still facing dodgy hardware? Are hardware breakdowns something we face on a daily/weekly/yearly basis?
As you might have guessed, I am leaning towards the “Big Box” idea. Compared to the competition, VMware has the capability to achieve a wicked VM density. Remember the Taneja report from last year, showing a factor of 1.8 in vSphere’s advantage? That might have been just a “fluffy marketing report”, but consider just a 1.4 ratio. It is still a big difference; I say let us use that when we have it.
But at the end of the day, it is a “Cost/Benefit” decision that can only be truly answered by analysing your customer’s environment and business approach.
So I agree with the previous honourable speaker: “It depends”…
PS: On an interesting note, the VMware vSphere Architectural Design template uses an example installation with a consolidation ratio of 58.82.
Steve Chambers says
Scale up is good for vSphere customers because you need fewer of those vSphere licenses and therefore reduce your TCO. Thanks to vSphere’s bulletproof ESX and the HA/FT features, resilience and recovery are improved, reducing the impact of any outage.
The biggest cause of outages is administrators, so with fewer compute nodes and more standardized compute nodes you reduce this probability too.
Bouke Groenescheij says
@Craig. Point 4: I was thinking about this also, BUT since Nehalem’s architecture ties memory to each CPU (NUMA), it’s not really true. Taking the above scenario we would have:
8 cores (so 2x pCPUs), 48GB mem = 16GB Mem/cpu where TPS is active
16 cores (so 4x pCPUs), 96GB mem = 16GB Mem/cpu where TPS is active
TPS is NUMA-aware and does not share between CPUs. So in a way: yes, you have more VMs per host, but that does not necessarily mean you get more TPS savings and need ‘less’ memory…
But this is architecture-dependent as well, since Cisco UCS uses a different memory architecture.
@Duncan. Thanks mate for writing this excellent article!
AFidel says
It’s not just about the vSphere licenses either; it’s about the MS Datacenter licenses, the DoubleTake licenses, the performance monitoring tool licenses, etc. Since almost everything in the VMware world is licensed per socket or per server, it can make a lot of sense to maximize your consolidation ratio.
TPS is per NUMA domain AFAIK so it’s memory per domain not memory per core that’s important for TPS.
As to availability, as always, a single machine is never going to give you bulletproof uptime, so you must cluster if uptime past ~99.99% is important to you. At that point a single node no longer matters, so consolidate away.
Matthew Marlowe says
I try to keep clusters sized so that max capacity is 2-3x the minimum required to keep production running. This gives us legroom for failed hosts or for high-utilization periods. Nearly all of the clusters I’ve dealt with can meet this requirement easily with 6-9 hosts. Also, given that so many enterprise software licenses are moving towards per-socket pricing, it’s usually more economical to upgrade older hosts with more cores and faster RAM than to add new ones.
duncan says
@Steve: what’s the difference between 10 x 2 or 5 x 4? In both cases you will end up with 20 licenses, won’t you? Microsoft licensing might be different.
nate says
Scale up! Damn straight, how about 3.3THz of CPU and 9TB of RAM in a rack (32 servers), all with 40Gbps of non-blocking Ethernet and 8Gbps of non-blocking Fibre Channel per server? That is what I have been dreaming of recently.
My vmware dream machine
http://www.techopsguys.com/2010/02/28/fourty-eight-all-round/
Horst Mundt says
Good discussion going on here. One thing to look at is your average number of vCPUs per VM. If you’ve got a lot of VMs running with 4-8 vCPUs, you’ll probably tend to scale up.
Your overall CPU capacity in the cluster is the same with scale up or scale out, but with scale out it will be fragmented across more hosts. With scale up the scheduler has more choice.
(Also, if you are running large VMs, you’ll have fewer VMs per host, reducing the impact of a host failure.)
But this does of course not change your point. Think before you buy 🙂
that1guynick says
This is something we went through when we first started diving into virtualization for production use.
After all the eggs were hatched, we ended up going with HP’s DL580 line, due to the massive amount of memory you can pile in, the 4-socket systems, and the relatively small rack/power footprint.
Today, we have three of them running just over 100 VMs in production use.
Now, you can make the pizza-box argument all day long if you’re running a web farm or VDI, consisting of VMs using 1 vCPU and a gig or so of RAM each, but when you start reaching into the “mission critical” arena, this type of environment just won’t cut it, no matter how many hosts you pile into the cluster. Start throwing 8GB+ of RAM into Oracle/Exchange/SQL VMs, and you’ll see what I mean.
My personal suggestion is to run BOTH environments, and be smart about which VM’s you provision to which environment.
-Nick
Russ says
The UCS B250 with 384GB and just 2x quad-core processors seems far too top-heavy in RAM to me – resulting in massively overcommitted vCPU-to-pCPU ratios.
I could see this fitting either a niche “large VM” model (where just a few VMs have large amounts of RAM) or perhaps VDI deployments – but for general purpose server virtualization, I just can’t see it working out.
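To put a rough number on that (assuming, purely for illustration, an average VM of around 4GB and one vCPU):

```python
# Rough vCPU:pCore ratio if a B250's 384GB is filled with ~4GB / 1 vCPU VMs.
# The per-VM profile is an assumption for illustration only.
ram_gb, cores = 384, 8            # 2x quad-core, as discussed above
vm_ram_gb, vm_vcpus = 4, 1

vms = ram_gb // vm_ram_gb
print(f"~{vms} VMs, roughly {vms * vm_vcpus / cores:.0f} vCPUs per physical core")
```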
duncan says
So what can you run on a DL585 that you can’t run on a DL385?
nate says
Russ – I completely agree with you, and have been trying to tell people that myself ever since they announced that memory-extender technology: the core:memory ratio is way out of whack. I’m sure there are certain workloads that can benefit from it, but if you’re going to load up on 48x 8GB DIMMs, I certainly would want to spend the extra $$ and get more CPU cores. Even 192GB of RAM with 8 cores seems excessive. But again, it all depends on what you’re doing with them.
Which is one reason I am so excited about the possibility of a 48-core blade with 48 DIMM slots on it (each socket supports 12 slots). My current VM setup for my back-end gear is running on really old repurposed stuff; I’ve been trying to hobble along until this next-gen hardware was available (hobbling quite well, too).
Duncan – one thing you can do is run (many) more multiprocessor VMs, since there are a lot more cores to work with and more scheduling opportunities. The link I posted above shows more in-depth information and the math behind this, from VMware documentation and a “tier 1” performance presentation I was at last year. It’s really amazing, to me at least, how the math adds up as far as scheduling options go.
that1guynick says
DL58x and DL38x have 4 sockets and 2 sockets, respectively. If you somewhat standardize around the 32GB/socket ideology, that means you’re putting 128GB in a 4x quad-core and 64GB in a 2x quad-core. (Let’s leave Nehalem and HT out of it to avoid overcomplicating things.)
If you start running large VMs with 8+ GB of RAM (SQL/ORCL/EXCH) within the same cluster, you’re not going to leave yourself much headroom. That presents problems with HA and DRS deviation. If a host dies and you’ve got large SQL/EXCH/ORCL VMs across (3) DL38x’s, you run the risk of not having the headroom available to restart all of them on the other two hosts, or of being overcommitted (which no one will EVER like to do for any of those three apps mentioned).
If you want to compare apples to apples, you have to run twice as many DL38x’s, which presents problems for rack real estate, power consumption, cabling up power/LAN/SAN/MGMT, twice as much configuration, twice as much bugging of the network/storage teams… for what, at the end of the day? A 10% vs. a 20% failure domain?
“Dear boss, please spend this undisclosed 100’s of 1000’s of dollars so that we can eliminate half of a 20% outage on the off-chance that a host dies.”
And honestly, for these apps I’m talking about, aren’t we all using FT or some other form of clustering ANYway to cover us from those sorts of outages?
duncan says
Sorry, but you guys are misreading me, or I might not have been 100% crystal clear. I am not saying that there isn’t a use case. Of course there are multiple; I am only saying that you SHOULD have one of these use cases as a justification before making your decision.
I understand how the scheduling works. And of course I understand you can run more vSMP VMs on a host with 16 cores vs 8 cores. But isn’t this what we should avoid? Why use it if you don’t need it? How many VMs today genuinely need vSMP? Yes, Oracle will use it, as will Exchange and SQL and a whole bunch of other apps, but a hell of a lot don’t.
Again I clearly state: do the math. Use Capacity Planner, figure out what fits your environment and budget. Most important: think before you buy.
duncan says
By the way, the link you are referring to contains outdated info on the scheduling. This whole concept changed with vSphere!
http://www.yellow-bricks.com/2009/08/13/vsphere-cpu-scheduler-whitepaper-this-is-it/
that1guynick says
It has nothing, check that, LITTLE to do with vSMP and everything to do with the inflated amounts of memory that today’s servers require to run tier-1 apps in enterprise environments.
1500-user Exchange environments call for 8-12GB of RAM just for the Mailbox server role!
I still run Exch, SQL, and Oracle boxes with 1 vCPU; I just make sure I’ve got lots of extra cores always sitting around available for that 1 vCPU to take as often as it wants. THAT’s the difference, really.
You’re not going to get the best of both worlds in the pizza box 1U/2U scenario, no matter how many hosts you pile into the cluster.
Brad M says
Reading all the great comments on this post shows that, as Duncan points out, there is no correct answer. But the one thing we inevitably will have to deal with is the fact that technology has been advancing at an ever-increasing rate. In the last 2 years we have gone from dual cores being the standard to quad cores, and now we have 6 cores and very soon 8- and 12-core CPUs.
As an industry we are being forced into the scale-up situation because our 2U servers and half-height blades will very shortly have 24 cores to go along with 384GB of memory. When this happens, quad cores will be phased out. Consolidation is the nature of the industry. It is happening with storage (larger FC and SATA drives) and faster interconnects (10GbE is replacing 1GbE and even FC), as well as in many other areas.
This discussion has to be brought up to the level that virtualization is now the standard platform for a majority of organizations. We have surpassed the discussions around the “low hanging fruit” and are now moving into Exchange 2010, SAP, Oracle, etc. Due to these Tier 1 platforms, the capacity will at some point be needed in these 2U and half-height blade servers, because we will be mixing the Tier 1 and low-hanging-fruit servers together.
I am not sure we have much choice in the long run, unless Intel, AMD and the server vendors decide not to keep marching forward and replacing the current CPUs. Scale Up vs Scale Out will soon merge into a discussion not about CPU and memory, but about “Why do we still need 4-socket servers if we have 24 cores with 384GB of RAM in a single blade?”
Brian says
Great discussion; we have been covering these points while planning a large VDI implementation, discussing the number of blade chassis, how many VMs per blade, and the RAM configuration per blade. Using 8GB DIMMs was half the cost of the total quote.
duncan says
Let’s agree to disagree, that1guynick, as in my opinion you just gave an example of why there’s nothing wrong with a 2U server. (But again, I am trying to get a discussion going, so thanks for that, Nick!) Even the 1U servers take 96GB easily today. When you are running 10 96GB hosts and don’t have enough headroom to restart that 8GB / 12GB VM, that’s just poor capacity management.
@BradM keep in mind that besides enterprise environments there’s also a huge SMB market; they have critical apps as well, but only 100 – 500 users for those apps. They also require a different approach.
that1guynick says
You’re assuming I would put 96+ GB of RAM into a 2U 2-socket host, and I don’t think I would. OK, for the sake of argument, let’s throw Nehalem into the mix, and talk about the memory alignment requirements in order to take advantage of the triple-channel DDR3 thing, where you have to have the same memory DIMMs in only certain slots. At that point, you’re not going to be able to ram a 2U box full of memory, because once you start getting to 8GB DIMMs (and larger) it starts becoming cost-prohibitive. These servers (I’ll use DL380 G6’s because that’s what I’m most familiar with) are shipping with 12GB of RAM out of the box. All those DIMMs are waste, because you have to remove them in order to re-populate with larger DIMMs and align those across all of the triple-channel RDIMM slots.
I base everything I’ve said so far around the 32GB-per-socket sweet spot of cost vs. performance. And I have to say that I would much rather have a 4-socket box with piles of memory in it to host all of the LHF servers someone else mentioned above, and then break out your tier1 apps into individual clusters of smaller boxes, and only if absolutely necessary.
One of the pain points that has also failed to come up is the “growing pains” of upgrading an existing cluster. Are you using HA? Are you using FT? Guess what? You have to upgrade ALL hosts. Which means you have to perform upgrades and maintenance on (not to mention purchase) twice as many servers, twice as many RAM sets, twice as many NIC adapters, twice as many HBAs, etc. etc.
At the end of the day, yes, we’ll agree to disagree. And I’ll leave it with a sense of “to each their own”. It’s 100% about your company, what you can afford, how bleeding edge you wish to be, and how far out your growth plans are.
duncan says
I think the misunderstanding comes from your focus on Tier-1 apps only. You are talking about a specific use case, whereas I try to make it more generic and describe the overall pros and cons. This use case might be valid for your specific situation; however, and you might be one of us, for a lot of consultants this is not the case. We need to take multiple use cases into account and come up with a customized solution for a customer based on requirements and constraints, while justifying costs / risks / impact.
By the way, I spoke about 96GB / Nehalem in my blog article; that’s why I refer to it. You spoke about large memory requirements; that’s why I said you can easily go up to 96GB. I am not saying you need to.
Gabrie van Zanten says
Two very important factors in scale-up are often overlooked: the stability of ESX and the HA limits. ESX is very stable, in that I very seldom see purple screens due to errors in ESX, but I do get the occasional VM that keeps hanging, or some other “small” hiccup that makes me put a host in maintenance mode to solve the issue. With only a few hosts I would too often be using my ‘spare’ HA capacity and not be able to fail over when I have one host in maintenance mode.
Another factor is the current HA limits: 100 VMs per host, or 40 VMs per host in a cluster with more than 8 hosts.
Rudolf Kleijwegt says
Microsoft licensing is definitely something to take into account. Although Microsoft altered their licensing policy to better suit virtual environments, a Windows Server license is still bound to a physical box. In your example this would break down to either 2500 (250 VMs * 10 hosts) or 1250 (250 VMs * 5 hosts) Windows Server licenses when using the Standard edition. Depending on the number of running VMs, you might be better off using the Enterprise or Datacenter edition, because these editions allow you to run multiple VMs on a physical host. But of course these editions cost more than the Standard edition. Recommended reading on this subject is a document called ‘Licensing Microsoft Windows Server 2008 to Run with Virtualization Technologies’ that you can locate on the Microsoft website.
http://www.microsoft.com/licensing/about-licensing/volume-licensing-briefs.aspx
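A rough sketch of that comparison, using the worst-case assumption above that a Standard license is bound to every host a VM can run on, and treating Datacenter as per-socket with unlimited VMs. The socket counts (2 for the 8-core hosts, 4 for the 16-core hosts) are assumptions; the Microsoft brief is the authoritative source:

```python
# Rough Windows Server licensing comparison for the article's example.
# Standard: worst case, every VM licensed on every host it can move to.
# Datacenter: licensed per physical socket, unlimited VMs per host.
# Socket counts per host are assumptions for illustration only.
vms = 250

for hosts, sockets_per_host in ((10, 2), (5, 4)):
    standard = vms * hosts
    datacenter = hosts * sockets_per_host
    print(f"{hosts} hosts: {standard} Standard licenses vs "
          f"{datacenter} Datacenter (per-socket) licenses")
```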
John van der Sluis says
There’s a simple term for this and it’s called “failure domain”.
What could the impact on availability be when making a certain design choice?
John.
Rudolf Kleijwegt says
One addition to my reply… if you don’t use vMotion and VMs remain on a box you could use a different licensing strategy.
CanadianChris says
Great conversation! Hardware is irrelevant. Make sure you can support your bandwidth requirements and strategically standardize on a hardware platform. There are always exceptions, but the needs of the many outweigh the needs of the few.
PiroNet says
The ‘putting all your eggs in the same basket’ debate is nonsense when embracing virtualization, IMO. The whole purpose of it IS to put as many guests as possible on a physical server, isn’t it!? And when you ‘scale up’ at the physical layer, you can still ‘scale out’ at the virtual layer.
My belief is that it depends more on the trust the admins have in their hardware/hypervisor. If you had to design a virtual infrastructure for a customer, one of the first questions you would ask is: what hardware is it going to run on?
Hardware reliability: 10 years ago a slide from Compaq showed that 70% of all hardware outages came from memory modules. Unfortunately that hasn’t changed much since, and memory is still ‘less’ reliable than the rest of the hardware. Room for improvement, definitely, but it looks like a catch-up race; you never catch the rabbit 🙂
I know that VMware’s DRS/HA and now FT help you mitigate the higher outage risk introduced by less reliable (aka commodity) hardware, especially the FT technology, which is just taking off. When that feature reaches maturity, it will be a big OMG!
So today hardware reliability matters, hence your article, but tomorrow?
Google has shown us that it doesn’t matter; hardware is known to be unreliable, so why bother? VMware is probably following that idea with cloud computing: basically, let the hypervisor deal with it…
Andrew Mitchell says
Horses for courses.
As Duncan said, you make your hardware selection based on the intended workloads and the customer’s appetite for risk.
A relatively high number of smaller servers might make sense for smaller workloads, while a relatively small number of larger servers might work well for extremely large workloads.
Somewhere between those extremes you will find your answer.
duncan says
@pironet: I disagree. Maximizing the consolidation ratio is part of it; separation of the OS from the hardware is another part, and offering “high availability” features and flexibility yet another. Virtualization is more than consolidation only.
Justin says
You design for what you need, and always make sure you have adequate resources to handle a host failure or two. Our current configuration is 72GB hosts. Even with progressing from 32GB to 72GB, I have not reduced the number of hosts. In development, I have upwards of 59 VMs running on a server, and the memory usage is roughly 60% on that host. In the corporate cluster, I’m running far fewer VMs per host, but some of these VMs also have larger memory and CPU requirements. In the production cluster, there will be some hosts with only two or three VMs due to the high memory requirements (around 32GB per VM).
Igor.Nemilostivy says
But what if we consider not only CPU and memory? What about the disk and network subsystems? If the focus is on distributing the IOps load across hosts, 10 hosts are most likely better than 5.
Carl Skow says
Having more VMs on a host is inevitable. The same issue that created such a large market for virtualization (Moore’s law appears to hold, while service/program/OS requirements are not doubling every year) is going to hit virtualization environments as well.
Most of us feel that we hit a “sweet spot” with 50-60 VMs per node, but I think it’s important we understand that those numbers are very likely to keep on going up and up due to the difference in processing power and memory available versus the actual compute requirements for a server running most tasks. That resource vs utilization disparity has driven our field, so we should be comfortable with it when it comes knocking on our door.
I predict many of us will hit over 100 VMs per node in the next few years, and having more nodes will become more important than having tall nodes. As long as we can have clusters that are well utilized, I think management will be happy.
vTrooper says
I tried a Capacity Planner exercise that somewhat talks about the up-and-out model and licensing here:
http://www.virtualinsanity.com/index.php/2009/09/27/capacity-conundrum-part-1/
Everyone has their feelings about what is ‘just right’, but I always have a problem with a general calculation of what the impact is on a per-host basis. VMs per host is always “meh” to me, because VMs are so different in size and purpose. 1 VM down could be 50% of your operational downtime on the bottom line; 25 development VMs could be down for hours on end. It unfortunately does depend on the business, not on the logical boundaries of the current ‘right-size’ VM, be that 1 vCPU or 4 vCPUs, etc.
@vTrooper
Richpo says
Greetings,
I know we are looking at how many VMs on a box, hardware failures, and the cost of downtime, but I think there’s a point being overlooked: the cost of everything else. Of course it “depends” on what your environment is, but in the reality I’m dealing with I have one server hardware vendor I can use, and I have to make the best choice for here and now.
For each box I rack I have to pay a cost just for the rack and stack. Bigger boxes mean less cost for rack/power circuits/cooling/network/fibre, plus imaging and patching (plus firmware). The other costs are the add-ons: Nworks, NetBackup (host-based licenses for Tier IIIs are NOT that much more than Tier IIs) and whatever else you want to throw in here. Plus the time you spend keeping all the maintenance contracts under control. The argument that all your “eggs are in one basket” doesn’t really hold much water, because the biggest Easter egg basket there is, is shared storage, and very few people (okay, I do, but I have a Hitachi so I have good reason) spend much time worrying about the storage failing.
I’ve had hardware failures, and yes, we lost 20-30 VMs at a shot, but guess what: most of the users didn’t notice because the VMs were back up in about 1-2 minutes.
More little boxes = more IT work. Of course, if you don’t need big iron, don’t buy big iron. I think everyone (or was it just me?) was looking at the 2Us because Nehalem processors didn’t scale beyond 2 sockets. Now that Nehalem-EX scales beyond 2 sockets (yes, I know the clock speed is lower on the 4-socket Nehalem-EX, but the memory speed scales out with RDIMMs), larger boxes work better for me. The other thing is red tape. At my place of work anything under 5K is an expense and everything over 5K is a capital cost. Expense costs are much easier to get approved. If I could get a fully configured 2U priced under 5K then I could change my mind.
There are a lot of valid points everyone is making, and there is no one-size-fits-all solution. I’m just putting forth some of the reasons I’m buying x3850 X5s instead of x3650 M2s or M3s.
My position has an expiration date of 3/18/2010.
-Richpo
Ian K says
High density and server consolidation are one side of the eggs-in-a-basket question. How many eggs do you want in that basket to break all at one time?
As the architect or the systems engineer designing and maintaining all of this, something to take into account is that things fail. Period.
It isn’t a matter of “if”, it is a matter of when. Hardware has become extremely stable compared to even 5 years ago, and yet in my environment we have to put a host into maintenance mode at least once a week to deal with some hardware issue (bad DIMMs being the most frequent, though HBAs do go sometimes).
The question boils down to: how far are you, your boss, or your company willing to stick their neck out and put 75 business-critical apps onto a single piece of hardware? Are you willing to take that hit in reputation and business impact?
For our most recent purchase I blogged about some of the thought process that went on and what we decided; the “eggs in a basket” question was a major point this go-around, even more than with the last purchase.
http://itsjustanotherlayer.com/2010/03/scale-up-or-scale-out%E2%84%A2/
RJ says
Good discussion.
Since I’m redesigning and scaling up our environment, I ran into some interesting servers from the Dell PowerEdge C series (brand new). I’m looking into the C6100; you can get a lot of CPU and memory in a 2U server (actually 4 servers within 2U), kinda like a mini blade.
I’m also considering whether it’s a good idea to put a single six-core in a server instead of a dual quad-core (it will cost fewer licenses :)).
Massimo Re Ferre' says
Oh my god this scale up vs scale out discussion is still on …. and it’s been for years…. I wish Ken Cline was here …. 😀
For those interested in additional resources / points of view:
http://it20.info/files/3/documentation/entry186.aspx
http://it20.info/blogs/main/archive/2009/12/10/1427.aspx
http://www.it20.info/misc/virtualizationplatformofchoice.htm
Massimo.
Tom Howarth says
Duncan,
What do you think about your position now, bearing in mind the recent change in licensing announced by VMware at the vSphere 5 launch?
Personally I feel that those who have chosen to scale up and increase density through the use of TPS and overcommitment will be significantly penalised. This position is only exacerbated if the licensee is utilising a lower-level edition like Standard or Enterprise.
Duncan Epping says
I am not sure I understand what you are saying. Regardless of the type of hosts, you will need to pay for vRAM. In other words, 400 VMs with a total of 900GB of vRAM will require 19 Enterprise+ licenses. Now you can use 5 hosts with 192GB to cover that, or 20 hosts with 48GB… In the case of the latter you would even require an extra license.
Or am I missing something here?
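A minimal sketch of the arithmetic behind those numbers; the socket counts (two sockets per 192GB host, one per 48GB host) are assumptions for illustration only:

```python
import math

# 900GB of assigned vRAM under the original vSphere 5 model, with the 48GB
# Enterprise+ vRAM entitlement. vSphere 5 still requires at least one license
# per physical socket, so the socket totals below are illustrative assumptions.
vram_gb, entitlement_gb = 900, 48
for_vram = math.ceil(vram_gb / entitlement_gb)   # licenses to cover the vRAM pool

for label, total_sockets in (("5 hosts x 192GB", 10), ("20 hosts x 48GB", 20)):
    needed = max(total_sockets, for_vram)
    print(f"{label}: {needed} Enterprise+ licenses (vRAM alone needs {for_vram})")
```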
Tom Howarth says
Currently I have 64 CPUs of Enterprise Plus, and say my hosts have 256GB of pRAM in them; that means a total of 4TB across all machines.
Now the situation currently is that I can utilise all 4TB of memory as assigned (note I say assigned, not utilised) and I am not penalised; I can even over-provision my physical RAM, say to 5TB across the hosts, and I am not penalised.
Now under the new vSphere 5 licensing model I get a vRAM entitlement. With hosts that are licensed with Enterprise Plus that entitlement is 48GB x 64, which is 3TB; with Enterprise the situation is worse, as I only have 2TB of pooled vRAM, and with Standard 1.5TB.
Today my costs for those licenses are 64 x license cost + recurring SnS
Tomorrow my choices are either to reduce my VM count to keep my costs the same, by removing 1 to 2TB of assigned vRAM, or to pay for between 22 and 43 extra CPU licenses plus the SnS charges.
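(To show where the 22 and 43 come from, assuming the 48GB Enterprise+ entitlement and the 3TB pool above:)

```python
import math

# 64 Enterprise+ licenses at the original 48GB vRAM entitlement, measured
# against 4TB to 5TB of assigned vRAM.
licenses, entitlement_gb = 64, 48
pool_gb = licenses * entitlement_gb              # 3072GB, i.e. the 3TB pool

for assigned_tb in (4, 5):
    shortfall_gb = assigned_tb * 1024 - pool_gb
    extra = math.ceil(shortfall_gb / entitlement_gb)
    print(f"{assigned_tb}TB of assigned vRAM -> {extra} extra CPU licenses")
```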
Basically I have been penalised for utilising one of VMware’s perceived main advantages over the competition.
Also bear in mind that those numbers are for Ent+; the licensing uplift would be greater with Enterprise.
Now I understand that I can pool all memory by linking vCenters, and any non-utilised vRAM entitlement is spread across the environment. However, statements that “most” establishments have a pool of vRAM just sitting there pending a DR event do not help. Most of the environments I have been involved with utilise their DR environment as either test and dev or a pre-production staging area, therefore that vRAM is in use.
The fact remains that those who have been aggressive and pushed the limits of the product will be penalised, and that tends to be the smaller enterprise customers who do not have the advantage of ELA programmes to reduce their costs.
Duncan Epping says
I don’t see how your response relates at all to this topic.
Tom Howarth says
I am not trying to get into a debate about the merits of the new licensing model, I was just asking whether the licensing change will mean any rethinking on your part regarding this post.
The point is basically that, due to the change in licensing, memory now becomes a major design issue; if I choose to over-commit I will incur greater licensing costs. So now I ask again: is it now better to scale out rather than scale up due to the changes in the licensing model?
By scaling up I am increasing memory density over fewer hosts, but incurring a perceived waste and extra cost by having “unassigned” CPU licenses.
By scaling out my vRAM pool may be the same in total, but I do not have that “interesting” conversation about why I have to purchase extra “CPU” licenses just to utilise my RAM.
Duncan Epping says
Expect a follow up article around 14:00 my time today discussing this exact topic…
Steven Bryen says
Tom,
I don’t think vSphere 5 licenses should be looked at as CPU licenses in a scale-up scenario. You are always going to have more memory, so I do not see this as an awkward conversation to have. You are buying more vRAM entitlement; forget that that one extra license would also allow you an extra CPU socket (because that is irrelevant in scale-up with today’s hardware).
Also take into account costs such as rack space, power, cooling, guest OS licenses, and hardware.
Scale up is cheaper every time. Check out Aaron Delp’s blog post on this:
http://blog.aarondelp.com/2011/07/scale-up-with-vmware-vsphere-5-im-not.html
Steve