** PLEASE NOTE: This article was written in 2011 and discusses how to monitor memory usage, which is different than memory / capacity sizing. For more info on “active memory” read this article by Mark A. **
This question has come up several times over the last couple of weeks so I figured it was time to dedicate an article to it. People have always been used to monitoring memory usage in a specific way, mainly by looking at the “consumed memory” stats. This always worked fine until ESX(i) 3.5 introduced the aggressive use of Large Pages. In the 3.5 timeframe that only worked on AMD processors that supported RVI, and with vSphere 4.0 support for Intel’s EPT was added. Every architectural change has an impact. The impact here is that TPS (transparent page sharing) does not collapse these so-called large pages. (Discussed in-depth here.) This unfortunately left many people with the feeling that there was no real benefit to these large pages, or even worse, the perception that large pages are the root of all evil.
After several discussions with customers, fellow consultants and engineers we managed to figure out why this perception was floating around. The answer was actually fairly simple: metrics. When monitoring memory, most people look at the following section of the host’s Summary tab:
However, in the case of large pages this metric isn’t actually that relevant. That doesn’t only apply to large pages but to memory monitoring in general, although as explained it used to be an indication. The metric to monitor is “active memory“. Active memory is what the VMkernel believes is currently being actively used by the VM. It is an estimate calculated by a form of statistical sampling, and this statistical sampling will most definitely come in handy when doing capacity planning. Active memory is, in our opinion, what should be used to analyze trends. Kit Colbert has also hammered on this during his Memory Virtualization sessions at VMworld. The following screenshot is an excellent example of the difference between “consumed” and “active”. Do we need to be worried about “consumed”? Well, I don’t think so; monitoring “active” is probably more relevant at this point! However, it should be noted that “active” represents a 5-minute time slot. It could easily be that the first 5-minute value observed is the same as the second, yet they are different blocks of memory that were touched. So it is an indication of how active the VM is, nothing more than that.
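To make the “analyze trends, not single samples” advice concrete, here is a minimal Python sketch. The sample values and the percentile/margin choices are invented for illustration; the point is simply that a trend over many 5-minute active-memory samples, plus headroom, is the actionable number, not any one reading:

```python
# Hypothetical 5-minute "active memory" samples for one VM, in MB.
# Any single sample is only an indication of activity in that window;
# trending many samples is what makes the metric useful.
samples_mb = [812, 790, 1024, 950, 880, 1310, 905, 870, 990, 1100]

def trend_estimate(samples, percentile=0.95, safety_margin=0.25):
    """Sizing hint: the Nth-percentile sample plus a safety margin."""
    ordered = sorted(samples)
    idx = min(int(len(ordered) * percentile), len(ordered) - 1)
    return ordered[idx] * (1 + safety_margin)

print(round(trend_estimate(samples_mb)))  # 1638: percentile plus margin, not one sample
```

The exact percentile and safety margin are judgment calls per workload; what matters is that the estimate is derived from many data points rather than a single 5-minute slot.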
Dennis Agterberg says
Hi Duncan,
Great article. I’ve been looking into this for some time now in regards to capacity monitoring and planning.
I’ve got clusters with 500 GB of memory resources that are actively using only 1/10 of that, like 50 GB, but still a few ESXi hosts are sometimes at 90-92% memory use (due to VMs with large configured sizes and high consumed memory). The consumed memory of our VMs is really high compared to active memory.
Currently I have host memory usage alarms at 90 (warning) and 95 (alert). But when they are triggered and active memory use is only 10%, there is actually no real issue. This is confusing for my operations team.
I’m also trying to educate the VM requestors to request only what they need; really hard though.
Regards,
Dennis
Duncan says
It is very confusing indeed and we are working on getting this solved at some point. I would also recommend looking at VC Ops as it gives you better insights / logic.
Joshua Biggley says
Brilliantly concise! That question has been floating around as I watch our infrastructure grow and we try to maximize investment in gear. Knowing how to manage memory was key to saving over $200,000 in unneeded hardware acquisitions for a customer recently.
Thanks for nailing this critical topic!
Joshua Biggley says
I’ve been thinking a bit more about this and wonder about a couple of things:
1) Is there a ‘best practices’ ratio of consumed to active memory?
2) If active memory is the crucial statistic, what role does ‘consumed’ memory play in projecting virtualization ratios? (I currently run about 7:1 with CapacityIQ telling me that I can grow to 12:1)
Thanks
Josh
Duncan says
I am aiming to do an article with some sort of formula / recommendations around resizing.
Chris Nakagaki says
I think JVMs in VMs skew this somewhat. I don’t believe the Active metric takes this into account, or am I incorrect in this thinking?
Duncan says
JVM should always be treated differently
David Davia says
Great post Duncan!
Couple of questions I am curious about…
1) Would you say that there is memory contention only if active memory begins to exceed total memory available?
2) When do vSphere’s memory management techniques (ballooning, swapping, & compression) take effect? When active hits total? Or when consumed hits total?
Very interesting stuff! Keep up the great posts!
-David
Andy says
Those are important questions David, because we could be actively using only 1/10th of machine memory and still be in a hard memory state, with VM swapping due to high consumed usage.
Active is important, but don’t ignore consumed.
Andy
Brandon says
I totally agree. At one point (because where I worked management was cheap), we had spun all the information vCenter was dumping at the 20-second realtime intervals into our own tables, and we were using Crystal Reports to trend our own data. While it became obvious to us that it was a waste of time (as nobody could agree on the statistical sampling method), it did help us learn an incredible amount. The devil in the details can be quite staggering when you start thinking about something like SQL, which caches so much. The less memory it has, the less it can cache, so removing memory based on the active memory trend could be harmful overall. It is helpful to understand the workload when looking at right-sizing as well.
VMware’s product for right-sizing virtual machines is CapacityIQ, and it can do other things, like what-if scenarios. It isn’t cheap and neither are the competing products, but even the 60 days of the trial are incredibly useful. The larger the environment, the more worthwhile it is. It paid for itself on our first run-through, as we removed so much spare capacity from our VMs (so many people did direct P2Vs… so annoying) that we were able to delay buying more hosts. Almost all the workloads were unaffected and the environment overall ran much better. Again, some workloads might not respond well to the changes, but the overall result was great.
To answer the above post’s question: ballooning and swapping come into play once the host is in contention. Whether the memory of the guests is active or not, if it is backed by physical resources and those become scarce (in the high 90s), ballooning, compression, and swapping will be used progressively. You can see that stat on the host’s Summary page; it shows the memory percentage, which includes the host’s own memory usage along with the consumed memory of the VMs.
Dennis Agterberg says
@Andy,
True, but the host memory usage alarm in vCenter has to be set at a really high threshold or you will get alarms when nothing is wrong.
I’ve got an ESXi 4.1 host with 93% memory use, still in the high state and actively using only 7 GB (yep, oversized VMs).
Regards,
Dennis
Andy says
I think we all have the same idea here. A host won’t begin the process of ballooning, compressing or swapping until it leaves the high memory state, when free memory gets down to 6%. Duncan has some really good posts on that behavior.
Another thing to question would be: what would you expect to be a safe balloon metric? If ballooning can try to reclaim up to 65% of a VM’s memory, that would probably mean we would want to size our VMs to ensure that their “active” amount is less than 35% of the provisioned virtual memory, and that scenarios like in-VM memory caching fit within that amount. Otherwise, should we set the ballooning max to, say, 25% and then over-commit our cluster by 25%, not including TPS?
Oh so many strategies..
Andy
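Andy’s 65%/35% reasoning above can be sketched as a trivial calculation. The 65% maximum balloon figure is taken from his comment and should be treated as illustrative, not a documented constant:

```python
# If ballooning may reclaim up to 65% of a VM's configured memory,
# keeping the active working set below the remaining 35% leaves room
# for worst-case reclamation without touching hot pages.
def safe_active_mb(configured_mb, max_balloon_fraction=0.65):
    """Largest active working set that survives maximum ballooning."""
    return configured_mb * (1 - max_balloon_fraction)

print(safe_active_mb(4096))  # headroom for a 4 GB VM: roughly 1.4 GB
```

Dialing the balloon maximum down (Andy’s 25% suggestion) simply shifts the same trade-off: less reclaimable memory per VM, so less safe over-commit at the cluster level.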
Duncan says
Hi David,
1) Would you say that there is memory contention only if active memory begins to exceed total memory available?
A) No this is based on “host consumed memory”
2) When do vSphere’s memory management techniques (ballooning, swapping, & compression) take effect? When active hits total? Or when consumed hits total?
A) based on “host consumed memory” as mentioned above. However “Active” will indicate where most of the “ballooned” memory will come from.
Sean D says
By monitoring on active and not consumed, aren’t you saying that you expect VMs to balloon/compress/swap memory as the normal course of events? Won’t that cause serious performance problems? Especially for software that stashes data in memory for performance, but doesn’t access it enough to trigger the active memory counter.
Duncan says
It could incur performance degradation. But keep in mind that many OSes today optimize their caching based on available memory and then might end up doing nothing with it. So lowering memory for a VM might not lead to any performance degradation at all.
Ryan says
From our experience the Active Memory metric hasn’t been very useful. For example, we have a Win2k8 server running SQL 2k8. This server has 8GB RAM. The Active Memory is currently 25%. This server hosts 2 SharePoint 2010 DBs among others. When I run the Performance Diagnostic report it tells me the server has insufficient RAM for its current workload. In Windows, 99% of the RAM is currently in use. The server is paging memory and runs slow. In vCenter this server looks happy. Why is this?
It is this kind of discrepancy which makes me uncertain of the whole “Active Memory” usefulness and validity. If anyone can shed more light on my experience that would be very helpful.
kennyd says
@duncan
Great post – It made me feel good knowing that when I am monitoring and looking at guests memory active is the way to go in majority of cases.
@Ryan
It also depends on how you have the OS as well as SQL setup in regards to a lot of things.
First – The OS, how do you have the page file setup? This can impact memory usage right away with the guest.
Second – For SQL – do you have it set up per best practices and so forth? I am not a database guy by any means, but typically you have multiple drives and/or partitions set up for T-logs, the DBs, etc. I also know that VMware has best practices as well.
Ryan says
@kennyd
This SQL server is letting Windows manage the page file. It’s on drive C with 8GB allocated to swap.
Keep in mind that the paging is all within Windows. There is no paging on the VMware side.
Does separating server activities on different logical drives even matter? They all go to the same SAN anyway.
Duncan Epping says
Yes, multiple channels = multiple queues to the array.
Ryan says
Thanks. Good to know.
Mihai says
You should limit SQL memory usage so that the OS has enough free memory and swapping does not occur (for 8GB we usually limit SQL to 6.5GB).
You should watch the performance counters SQL Server: Buffer Manager – Buffer cache hit ratio and Page life expectancy to decide whether more RAM is useful.
Ryan says
@Mihai
Thanks for the suggestions. I have mentioned this to our DBA as SQL memory is currently not limited. He does expect performance to improve once our old Sharepoint is brought offline.
JT says
Not sure if it’s relevant in your scenario, and most vm admins are already aware of it, but don’t let your SQL or sysadmins enable “lock pages in memory” for SQL servers either. Or if you do, eliminate any chance of memory contention on the host (capacity manage or set reservations).
In this case, relatively idle large-memory VMs will be targeted by the idle memory tax and they will balloon first and to the greatest extent. The balloon driver can’t reclaim memory from the SQL process when lock pages in memory is set, so it will pull all the memory from the OS memory space and kill the performance of the OS completely (free memory will go down to as little as 0-10MB).
Ryan says
@JT
Thanks JT. To be honest I don’t know where you configure “Lock pages in memory.” Our DBA wasn’t familiar either, so I doubt it is set. 🙂 Sounds like something to be aware of though.
Eoin says
If you wanted to prevent swapping shouldn’t you be setting the reservations on VMs to approximately 35% of the VMs allocated memory?
Andy says
To prevent VM swapping you would have to set the memory reservation to 100% because if compression, ballooning and TPS aren’t enough to satisfy host demand, a VM could be forced to swap. I think at that point you are pretty far gone though.
Andy says
Duncan
This is a really good topic to highlight the fact that the guest OS can provide us with better detail than what vCenter can. Let me give an example.
I have a Server 2008 R2 VM in my lab running vCenter/SRM that has been allocated 3GB RAM. vCenter reports 3GB consumed, 1.3GB active.
Using Sysinternals Process Explorer I can see that I have a 1.7GB commit charge. Using RAMMap I can see that I have 1.5GB active (the MS active metric).
See Mark Russinovich’s “Mysteries of Windows Memory Management Revealed” presentation for MS definitions (http://blogs.technet.com/b/sysinternals/archive/2010/11/02/3-of-marks-presentations-from-the-2010-professional-developer-s-conference-and-and-interview-with-mary-jo-foley.aspx) Highly recommend these.
This tells me that I would want to ensure that 1.7GB of machine memory is available to this VM; anything less would see performance degradation due to memory starvation.
If I went with vCenter’s active metric, I would probably run into that situation, and the vCenter consumed metric of 3GB is high since 1.3GB is in the standby (MS) and zeroed (MS) page lists.
This helps reinforce to me that neither the vCenter active nor consumed metric provides a foolproof method for accurately performing sizing exercises.
Duncan says
I would agree that inside guest metrics are better. However some memory is also used for caching etc, and there is a cost associated with that as well.
Roey Azroel says
Hi Duncan,
Thanks for the article first of all.
But I disagree with you about the line “I guess the following screenshot is an excellent example of the difference between “consumed” and “active”. Do we need to be worried about “consumed” well I don’t think so, monitor “active”!”.
After a few tests and situations that I saw in my VMware environment, I believe you should monitor the consumed memory, but compare it to the granted memory rather than to the active memory. When consumed memory equals granted memory, there is a chance that the guest OS will start using the page file (in the case of Microsoft Windows) and this will increase CPU usage.
There are good articles and labs about this situation, and more information about the memory metrics, in the “VMware Performance Manager” course (I did it 3 months ago, and it is the best course I have done at VMware).
Hope I was understandable,
Roey
Duncan says
Roey,
What you are stating is incorrect:
Consumed = Granted – TPS – overhead
Granted = all physically mapped pages
When TPS is used, Consumed will usually not get close to Granted, as Consumed is granted minus all overhead and all collapsed pages.
In the scenario that I am talking about, Large Pages, Consumed also gives a very skewed view as there is no collapsing of pages and zeroed pages don’t seem to be subtracted.
Anyway, in-guest monitoring is of course always preferred but many like to use vCenter as a single pane of glass… hence this article.
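Duncan’s relationship between the counters lends itself to a quick sanity check. The numbers below are invented and the real ESXi accounting is more involved than this one-liner, but it shows why Consumed stays well below Granted while TPS is collapsing pages:

```python
# Hypothetical per-VM figures in MB, following Consumed = Granted - TPS - overhead.
granted_mb = 4096     # all physically mapped pages
tps_saved_mb = 1024   # pages collapsed by transparent page sharing
overhead_mb = 96      # virtualization overhead, subtracted as well

consumed_mb = granted_mb - tps_saved_mb - overhead_mb
print(consumed_mb)  # 2976: far below granted while TPS is effective
```

With large pages, the `tps_saved_mb` term is effectively zero, which is exactly why Consumed suddenly looks so much higher in those environments.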
Duncan says
I am aiming to also do an article with a formula that gives you an idea of how to resize your virtual machine based on the vCenter Metrics… But I will need to run it by our engineering team as I don’t want to take the risk of sending out false info here.
Mihai says
Your article is totally wrong; active memory is a flawed measurement because it only gives you the memory that is active at a certain point in time. So yes, maybe only 10% of memory is active at any one time, but not the SAME 10%!
If you had super fast SSDs for the VM OS swap file and VMware swap file, maybe you could get away with overcommitting memory based on the active measurement, but you would still have performance degradation from this swap/unswap activity.
The way VMware does TPS with large memory pages activated is not optimal from my experience: it can’t TPS fast enough when there is a sudden increase in memory demand from a large VM being created or the vMotion of many VMs due to maintenance mode on another host. The effect is sudden VM swapping, which kills performance; after some time the situation stabilizes, but you still have the temporary degradation.
Maybe it should start doing TPS at 80% memory usage, way before soft state, I don’t know the best solution.
JT says
Am inclined to agree with Mihai and other comments in this post.
“Memory consumed” as the threshold for ballooning (and its associated impact on disk I/O and CPU utilisation) is a touchy subject with “in the trenches” VMware admins, and I think the responses to this article highlight this.
Sizing Windows 2008 servers (esp. SQL servers and the like) correctly is very tricky and requires an in-depth understanding of memory allocation within the OS, the operations within a particular instance of the OS (e.g. file copy jobs, backups etc), and the applications’ in-process and out-of-process memory requirements.
SSD aside, we all want to avoid memory overcommit and unnecessary paging.
JT says
…. so if there is a formula within vSphere/stats that assists with this (beyond simply “Mem Active”), it would be very gratefully received …
Duncan Epping says
There is no “simple” formula for this. I will need to develop something myself, and it will take time and validation for me to make a recommendation around it.
Doug Youd says
Whilst I don’t agree with the needlessly hostile tone of this reply… I think Mihai has a point about “not the same 10%”.
Is there any information available on exactly what the “statistical sampling” algorithm is based on?
And in reply to JT’s ‘in the trenches’ comment: personally, the environments I’ve worked in tend to aim for an overcommit ratio that just borders on ballooning starting to take effect… but swapping (active or otherwise) is generally considered pushing a bit too far.
Just my 2c.
Duncan Epping says
Mihai,
Thanks for your response. Before you start replying with “your article is totally wrong” you might want to read the article. Nowhere do I state that people should resize their virtual machines based on a single data point. It is not too difficult to gather various datapoints and create an average + a safety margin based on that though. Of course people should be careful when resizing virtual machines.
If you have a better suggestion than using Active I would like to hear it. Looking forward to your response.
Mihai says
Hello Duncan,
Sorry for being a bit harsh; I meant that your conclusion about Active Memory is wrong.
Actually I think that, purely from the VMware perspective, you can use only consumed memory as a reliable counter, and active memory only as a hint as to which VMs to investigate using guest OS tools.
As I said in my original post, relying on Transparent Page Sharing when you hit the 93% limit is not a good option (unless performance is not important) because you will have swapping/ballooning at least short term.
To go around this issue I disabled large page support on most of our hosts and pinned VMs that use large pages to the hosts that have it enabled using DRS options.
I tried to measure the performance impact of disabling large pages and couldn’t find any significant difference (my hosts are memory bound, not CPU bound).
Duncan Epping says
Sorry, but how would you be able to use Memory Consumed while using Large Pages? Did you read the article? That is what initiated my post. Both Granted and Consumed tell you absolutely nothing.
Mihai says
No, you see, that memory is actually consumed! VMware can apply TPS to it when you need more memory, but it does so poorly, in a way that degrades performance – at least from my experience (we have about 350 VMs).
What I am saying is that when you have Large Pages enabled – which means you want absolutely the highest performance achievable – you should forget about TPS and install more memory in the host/new host.
Again, yes, you can use active memory per VM, but only as a HINT as to which VMs MIGHT have too much memory allocated. You then log in to those VMs and use Task Manager in Windows or top in Linux and check out what’s happening.
You cannot take the active memory per host and infer that you can add more VMs and expect high performance.
If you need to put more VMs but don’t want the best absolute performance you disable Large Pages.
Duncan Epping says
Mihai, Did you actually read my article?
I specifically state: “This is an estimate calculated by a form of statistical sampling and this statistical sampling will most definitely come in handy when doing capacity planning. Active memory is in our opinion what should be used to analyze trends, monitor capacity etc.”
This is my last response, as I don’t like your “tone”. I am merely trying to provide tips to better manage your environment.
Ranjit says
But then, it was never said that memory sizing should simply equal active memory, or that VM density should be a simple sum of the active memories.
The point being made is that designing resource allocation based on “Consumed Mem” is a luxury, if not lazy, on the part of VM admins.
Meanwhile any time-tested formula is welcome.
Mihai says
Please forgive my “tone”, you are right that it was uncalled for and I was far too brutal, I got carried away, I am deeply sorry. Thank you for your wonderful site and tips that helped us understand and use VMware products better!
“This is an estimate calculated by a form of statistical sampling and this statistical sampling will most definitely come in handy when doing capacity planning.”
Yes, I agree, it is clearly useful but only as a hint to possibly downsize some VMs in my opinion.
“Active memory is in our opinion what should be used to analyze trends, monitor capacity etc.”
It is here that I disagree. Can you please explain why you say this is the case?
Because from my observation active memory is not really actionable, i.e. I have a host with 96GB of RAM that “consumes” 77GB of RAM but only 14GB maximum are active. Does this mean I can triple the memory load on this host to reach 42GB active memory (44% of physical) and have good performance (assuming CPU load is not the limiting factor)? Would you recommend something like this on a production system (2.5 memory overcommitment)?
From my experience this would lead to horrible swapping activity when accessing memory that was unused such as change of workload, application installation, etc.
Brandon says
Active memory at any one given time is not really all that useful for right-sizing. I’ve seen it be 0. Obviously you can’t right-size a VM to 0; something tells me Windows wouldn’t boot. Using the consumed stat is really flawed, because the memory pages that are being used but were deduped via TPS are removed from that stat… well, it’s actually more complicated than that, but it’s close. I think the shared amount is divided by the number of VMs sharing and then that amount is charged back to each VM as consumed — but don’t quote me, it’s been a while since I read that whitepaper :P
However, you can, depending on the workload, take the active memory trended over a period of time and get “more” useful results. For instance, pick a time frame that captures the server at both its busiest and its quietest. Then trend the ‘maximum’ active memory stat over that period, take maybe the 80th percentile, and round up by a specific margin. The problem is that if your timeframe is large enough (say a month), the stat collection intervals have been rolled up to 2-hour intervals, which blinds you to a lot of activity, and I’ve found it very difficult to get consensus (politically) on what is acceptable for removing resources. Ironically, nobody cares about adding. I think using vCenter’s built-in monitoring for this purpose is a flawed endeavor; it really wasn’t intended for this. That is why VMware has CapacityIQ, and other vendors have filled this gap as well. I haven’t looked into their methodology, but I’m sure it is refined and time-tested compared to anything we might come up with by playing with vCenter’s stats alone.
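Brandon’s point about roll-up intervals hiding activity can be illustrated with made-up data: averaging over long intervals smooths away a short burst that trending the per-interval maximum preserves.

```python
# 360 samples of "active memory" in MB: mostly idle, one short burst.
samples = [500] * 358 + [3000, 3000]

def rollup(values, size, agg):
    """Collapse raw samples into coarser intervals with an aggregate function."""
    return [agg(values[i:i + size]) for i in range(0, len(values), size)]

mean_rollup = rollup(samples, 120, lambda v: sum(v) / len(v))
max_rollup = rollup(samples, 120, max)

print(max(mean_rollup))  # burst nearly averaged away
print(max(max_rollup))   # 3000: burst preserved by trending the maximum
```

This is why trending the maximum (and then taking a percentile with a margin, as the comment suggests) is safer than trending the average once the stats have been rolled up.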
Sean Clark says
Hi Duncan,
First off, thank you for your service to the VMware community. I love your blog and it is a priceless resource for me and helping educate customers.
I don’t think it was your intent, but I feel that many folks might draw (already are drawing) the wrong conclusions from your advice about monitoring active only and not worrying about consumed. That conclusion might be to overcommit memory according to the active memory stat only. I think your intent was more to share that active memory is a very meaningful stat and a better indicator of consumption than consumed. If anything, I would hope readers take as an action step to resize VM memory based on %active stats rather than overcommit too aggressively.
Your blog is good, but none of us should probably redesign our production architectures based on one post. Only after careful meditation on your total body of work, plus VMware guides, books and experience, should such decisions be made. 😉
I’m looking forward to your follow-up post on memory sizing in relation to active memory. I think it will probably help clear up the topic a bit.
Thanks,
Sean Clark
Duncan says
Thanks for your comment Sean.
I think many people rushed to a wrong conclusion. The only thing I am stating here is that memory consumed and memory granted are useless to monitor in an environment where large pages are used. It is often more beneficial to use %active. I am not even saying that people should use Active as an indication to overcommit or to resize in my article. I am merely stating that it should be used to analyze trends / capacity as it is a better indication. Note “indication”. For true capacity management and full-blown monitoring I would advise using tools like VC Ops as they will give far better insight than vCenter can or is intended to provide.
And yes, NO ONE should apply any of my recommendations without 24 hours of meditation. Think your decisions through, ensure you understand why you are making this change and ensure the recommendation applies to your scenario.
-d
Rob Bergin says
@All – great discussion, and we have proved the point: memory and memory monitoring is hard.
There are Memory Metrics and Duncan’s calling out two of them.
Memory Consumed (i.e. taken from the host by the VM)
vs.
Memory Active (i.e. memory used by the VM)
The weirdness is when a VM has high Memory Consumed and low Memory Active. I have run into a couple of scenarios (Linux filesystem cache, JVM memory requirements) where a low “Memory Active” may be a false positive because the VM is holding on to memory but not using it, so it doesn’t trigger VMware’s definition of active.
If you lower the VM’s memory allocation without examining the VM, you may get smaller JVM caches or a smaller Linux filesystem cache, which can impact performance, albeit minutely.
Thanks,
Rob
Rob Bergin says
This is a great session on memory management from VMware (and it’s from 2007), so I think it’s still accurate, but it’s pre-vSphere.
http://www.vmworld.com/docs/DOC-2116
Vijay says
Hi Duncan,
Assuming what you are saying is correct, i.e. we should monitor Active memory and not Consumed, I have a couple of questions:
Summary of our setup: Five ESXi 4.1 U1 servers, 64GB RAM each with Intel Xeon X5650 (Nehalem cpu).
When I go to the Cluster –> Hosts tab in vCenter it shows all hosts’ memory utilization percentage in the 85-90 range. The moment any host crosses the 90% mark we receive an alert from vCenter saying the host’s memory usage is beyond 90%, and the same applies for the critical alert at 95% usage.
Now when I check in vCenter, it shows that the % usage, which is calculated based on “consumed” memory, has gone high. All hosts’ consumed memory ranges between 55-60GB (whereas active memory is between 5-10GB).
Q1) As you mentioned, if the consumed memory is useless/meaningless, then why is vCenter giving alerts based on the hosts’ consumed memory? Can we ignore these alerts?
Q2) When should we think of adding a new server to the cluster to increase memory resources? How long should we wait?
Q3) Which performance counter (active/consumed) does VMware HA refer to?
Our cluster is configured for 1 host failure with no memory/cpu reservation on any of the VM and below is HA Advance runtime info:
Slot size: 256 MHz, 4 vCPU, 296MB (+ max virtual machine overhead in the cluster of 296MB (max OVHDMAX in the cluster), which makes the memory slot size 592MB)
Total slots in cluster: 550 (5 hosts x 64GB = 320GB x 1024 = 327680MB / 592MB memory slot size ≈ 550)
i.e. 110 slots per host. Now if any one of these 5 servers goes down, its 110 slots should get distributed across the remaining 4 servers.
i.e. 110 slots / 4 servers = 27.5 slots per server should be free at any point in time to satisfy HA failover.
27.5 slots x 592MB (memory slot size) ≈ 15.9GB of memory per server should be free at any point in time to satisfy HA failover.
But as I mentioned earlier, all servers are showing usage between 55-60GB, with only 5-10GB free memory (whereas around 15.9GB should be free as per the above calculations).
Q4) Why didn’t HA give a resource crunch error?
Q5) Someone in a previous post mentioned that TPS, ballooning and compression are seen in action when a host sees memory congestion, i.e. beyond 90% memory usage/consumed. But if the cluster’s overall memory usage has gone beyond 90%, then this should violate the HA condition and HA should give a warning/error. Don’t these two conflict with each other?
Q6) The VMs’ total configured memory on all hosts is more than the hosts’ total physical memory. That means overcommitment is in action and this should kick in small-page TPS. We are mostly using Win2k3, Win2k8 and RHEL VMs, but the total sharing values shown are quite low, i.e. around 2-3GB per host. Why is that? Will this grow when a host crosses the 90-95% mark?
Q7) Is there any way to confirm which type of pages is being used by TPS, i.e. large or small memory pages?
I understand this is a hell of a lot of questions, but I don’t see a better place to get answers than this one.
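Vijay’s slot arithmetic above can be re-run in a few lines (figures taken from his comment; minor rounding differences from the quoted numbers are expected, and real HA admission control is more involved than this sketch):

```python
# HA slot arithmetic from the comment above: 5 hosts x 64GB, 592MB memory slots.
hosts = 5
host_mem_mb = 64 * 1024
slot_mb = 296 + 296  # 296MB slot plus worst-case per-VM overhead

slots_per_host = host_mem_mb // slot_mb        # floor: 110 slots per host
failover_slots = slots_per_host / (hosts - 1)  # slots each survivor must absorb
reserve_gb = failover_slots * slot_mb / 1024   # free memory needed per host

print(slots_per_host, round(reserve_gb, 1))    # 110 slots, ~15.9 GB reserved
```

Note that slot-based admission control counts slots, not consumed memory, which is part of the answer to Q4: HA can be satisfied on a slot basis even while consumed memory looks alarmingly high.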
Ranjit says
Vijay,
All modern OSes are designed to run on a single physical server. The OS will consume all the memory that is thrown its way. This memory is used as data cache by filesystems, DBs, JVMs and a score of apps which tend to manage their own cache (worst case, even lock it). Try increasing the memory of all your VMs by 50% (say your resource pool is 2 times over-provisioned); you’ll see that the gap between consumed and active memory has increased.
Now let me attempt the first 2 queries.
“Q1) As you mentioned if the consumed memory is useless/meaning less then why vCenter is giving the alerts based on hosts consumed memory? Can we ignore these alerts?”
It’s not useless and you shouldn’t ignore them.
“Q2) When should we think of adding a new server to the cluster to upgrade the memory resources? Till what time we should wait? ”
Treat the above vCenter alerts as an indication to add a new server. But this approach is suited only if there are NO experts in your org who can optimize the deployment, OR if the cluster is critical production and a few $/GB of memory is a trivial investment.
The second approach is to look at the GAP between consumed vs. active and take that as an opportunity to right-size the VM resources. Most of the time it comes from a simple P2V exercise leading to over-provisioned VMs. Or it could be political reasons, where a BU has paid for the server and is not willing to downgrade the VM memory. One may have to do guest OS and application-level analysis. Once the VM farm is right-sized, the host memory consumption will reduce and the vCenter alerts will be pushed one level down. You can redo this activity until the yield is no longer worth the effort. Once optimized, the vCenter alerts will make far more sense and you can recommend a procurement with much more confidence.
vijay says
Hi Ranjit,
As you rightly mentioned, all modern OSes consume all the memory that is thrown their way by caching it; they do this because they were designed to run on physical infrastructure, where consuming/caching all available memory makes sense.
In our scenario we are running these OSes in a virtual infrastructure, and if the guest OSes keep consuming memory the way they do on physical hardware, the VMs' consumed memory on the host side is bound to increase.
Continuing to troubleshoot the caching issue, I found that all the VMs in our infrastructure (mostly Win2k3, Win2k8 and a few Linux) are trying to cache all available memory.
To resolve the caching issue in Windows, I think we can set the standard file-system cache size instead of the large system cache, which is the default.
http://technet.microsoft.com/en-us/library/cc784562(WS.10).aspx
As mentioned in the article, by changing the way Windows handles caching we can reduce the overall memory usage of the guest. This should bring the VM's consumed memory down considerably. (I tested this in a test environment and consumed memory did come down considerably, from 8GB to around 3GB.)
The system sets the "LargeSystemCache" entry to 1 when you install Windows Server 2003, but many applications, such as SQL Server and Microsoft Exchange, change the value of this entry to 0.
Do you see any (performance or other) issues with implementing this change in all the Windows guests?
The only issue I see is that since we have reduced caching, overall IOPS on the storage side might increase, but I don't see a major change there either.
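For reference, the registry change discussed above can be scripted rather than made by hand. This is a hedged, Windows-only sketch using Python's standard winreg module; it assumes administrative rights and the standard Memory Management key location, and a reboot is still required for the setting to take effect. Test it in a lab before rolling it out.

```python
# Hedged sketch: set LargeSystemCache to 0 so Windows favors the standard
# file-system cache size instead of the large system cache (see the TechNet
# article linked above). Windows-only; run elevated and reboot afterwards.
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                    winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "LargeSystemCache", 0, winreg.REG_DWORD, 0)
```

As noted in the comment above, some applications (SQL Server, Exchange) already set this value to 0 themselves, so check the current value first.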
Jack says
You can't chart active memory in anything other than the realtime view; it's missing from the other chart options (daily, weekly, etc.). That makes this metric useless for planning, because you can't see any history beyond one hour.
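One workaround for the missing history is to sample the counter yourself and build your own archive. The sketch below is a hypothetical illustration: `get_active_memory_mb()` is a made-up stand-in for however you actually read the "active" counter (the vSphere API, esxtop batch mode, etc.), and here it just returns a fixed number so the example is self-contained.

```python
# Hedged sketch: append timestamped active-memory samples to a CSV so you have
# history beyond the one-hour realtime window. Run it on a schedule (e.g. cron).
import csv
import time

def get_active_memory_mb(vm_name):
    # Hypothetical stub; replace with a real query against vCenter/ESXi.
    return 1024

def log_sample(path, vm_name):
    """Append one (unix_timestamp, vm_name, active_mb) row to the CSV at path."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([int(time.time()), vm_name,
                                get_active_memory_mb(vm_name)])

log_sample("active_memory.csv", "db01")
```

Because "active" is a 5-minute statistical sample, collecting at that granularity or coarser is enough; the value is the trend, not any individual reading.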
Joern says
Is it possible to tune the "maximum idle time before declared inactive" that ESX uses for the active memory statistics? I have several SAP servers where active memory suggests reducing their memory configuration, but a SAP expert told me that results cached for several hours are still of interest, so SAP has a different idea of "active" than ESX does (which is essentially what has already been mentioned in this thread).
JT says
My 2c:
Without a detailed understanding of the memory requirements of an application workload within a given guest OS, it is dangerous to size VMs based on peak active memory.
Doing so can result in:
1. Sub-optimal application performance (e.g. inability to cache sufficient data, or to keep data in cache for optimal timeframes), as per the point made about SAP above
2. Increased overhead on host and disk resources due to excessive in-guest paging, with flow-on impact to host CPU and disk IOPS capacity
That said, it is obviously vital to reduce memory allocation where possible in order to avoid getting close to host memory contention. Ballooning and swapping will likely impact performance far more than a few VMs with a less-than-optimal memory allocation, and ballooning will adversely impact the performance of largely inactive VMs with big memory footprints more than that of correctly sized VMs.
In more important production environments it might make more sense to err on the side of caution and go to the extra effort of understanding and monitoring in-guest memory requirements. In less critical, higher-consolidation environments where the highest possible level of performance is not so important, it might be acceptable and cost-effective to do more aggressive memory "right-sizing", and there may not be time to fully understand the memory requirements of every application, in which case "peak active memory" plus some overhead might provide a reasonable starting point for VM sizing.
Roll on VMware Tools with built-in application memory utilisation awareness.
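The "peak active plus some overhead" starting point described above is easy to express as a formula. In this sketch the 25% headroom figure and the sample values are assumptions chosen purely for illustration; the right headroom depends on the workload, as the comment stresses.

```python
# Hedged sketch of "peak active memory plus some overhead" as a sizing start.
# headroom=0.25 (25%) is an assumed example value, not a recommendation.

def starting_allocation_mb(active_samples_mb, headroom=0.25):
    """Suggest an initial VM memory size from observed active-memory samples."""
    return int(max(active_samples_mb) * (1 + headroom))

samples = [900, 1400, 1250, 1100]        # made-up 5-minute "active" samples in MB
print(starting_allocation_mb(samples))   # 1400 * 1.25 = 1750
```

Treat the result as a starting point to validate against in-guest metrics, not a final answer, for exactly the cache-heavy reasons (SAP, databases) raised earlier in the thread.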
Tim Wise says
I just ran across this discussion. Interesting.
Duncan, did you ever post a follow-up on a memory usage equation based on vCenter metrics?
Zavanna says
This article is nothing but misleading. I would recommend people rather look at the official VMware doc, Understanding Memory Resource Management in VMware® ESX™ Server:
http://www.vmware.com/files/pdf/perf-vsphere-memory_management.pdf
The consumed memory always needs to be monitored!!
HW says
Duncan, thanks for helping us with sharing your knowledge with us. Don’t let some of the comments and suggestions derail you. Everybody seems to be an expert but not everybody is a Principal Architect at VMware R&D.
TopCat says
Amazing article (and follow-up discussion), Duncan. Your work and blogs have helped me for years now. Thank you!