When talking about Transparent Page Sharing (TPS), one thing that comes up regularly is the use of large pages and how that impacts TPS. As most of you hopefully know, TPS does not collapse large pages. However, when there is memory pressure you will see that large pages are broken up into small pages, and those small pages can then be collapsed by TPS. ESXi does this to prevent other memory reclamation techniques, which have a far greater impact on performance, from kicking in. You can imagine that fetching a memory page from a swap file on a spindle takes significantly longer than fetching a page from memory. (A nice white paper on the topic of memory reclamation can be found here…)
Something I have personally run into a couple of times is the situation where memory pressure rises so fast that the different states at which certain memory reclamation techniques kick in are crossed in a matter of seconds. This usually results in swapping to disk, even though large pages should have been broken up and collapsed where possible by TPS, or memory should have been compressed, or VMs ballooned. This is something I've discussed with the respective developers, and they came up with a solution. In order to understand what was implemented, let's look at how memory states were defined in vSphere 5. There were four memory states, namely High (100% of minFree), Soft (64% of minFree), Hard (32% of minFree) and Low (16% of minFree). What does "% of minFree" mean? Well, if minFree is roughly 10GB for your configuration, then Soft, for instance, is reached when there is less than 64% of minFree available, which is 6.4GB of memory. For Hard this is 3.2GB, and so on. It should be noted that the change in state, and the action it triggers, does not happen exactly at the percentage mentioned; there is a lower and an upper boundary where the transition happens, and this was done to avoid oscillation.
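To make the arithmetic concrete, here is a small illustrative Python sketch (my own, not VMkernel code) that turns the vSphere 5 percentages into absolute thresholds for a given minFree:

```python
# Illustrative only: the vSphere 5 memory-state thresholds expressed
# as percentages of minFree (this is not actual VMkernel logic).

VSPHERE5_STATES = {
    "High": 1.00,   # 100% of minFree
    "Soft": 0.64,   #  64% of minFree
    "Hard": 0.32,   #  32% of minFree
    "Low":  0.16,   #  16% of minFree
}

def thresholds_gb(min_free_gb):
    """Return the free-memory level (GB) at which each state is reached."""
    return {state: round(min_free_gb * pct, 2)
            for state, pct in VSPHERE5_STATES.items()}

# With minFree of roughly 10GB, as in the example above:
print(thresholds_gb(10))
# {'High': 10.0, 'Soft': 6.4, 'Hard': 3.2, 'Low': 1.6}
```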
With vSphere 6.0 a fifth memory state is introduced, and this state is called Clear. Clear is 100% of minFree, and High has been redefined as 400% of minFree. When there is less than High (400% of minFree) but more than Clear (100% of minFree) available, ESXi will start pre-emptively breaking up large pages so that TPS (when enabled!) can collapse them at the next run. Let's take that 10GB of minFree as an example again: when you have between 40GB (High) and 10GB (Clear) of free memory available, large pages will be broken up. This should provide the leeway needed to safely collapse pages (TPS) and avoid the potential performance decrease which the other memory states could introduce. Very useful if you ask me, and I am very happy that this change in behaviour, which I requested a long time ago, has finally made it into the product.
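Using the same numbers, here is a minimal sketch (my own illustration, ignoring the upper/lower transition boundaries mentioned earlier) of the window in which vSphere 6.0 pre-emptively breaks up large pages:

```python
# Simplified illustration: large pages are pre-emptively broken up when
# free memory sits between the Clear (100% of minFree) and High (400%
# of minFree) thresholds. Real state transitions use hysteresis.

def breaking_large_pages(free_gb, min_free_gb):
    """True when free memory is in the Clear-to-High window."""
    ratio = free_gb / min_free_gb
    return 1.0 <= ratio < 4.0

# With minFree of 10GB:
print(breaking_large_pages(30, 10))  # True: between 10GB and 40GB free
print(breaking_large_pages(45, 10))  # False: comfortably above High
print(breaking_large_pages(5, 10))   # False: below Clear, other states apply
```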
Those of you who have been paying attention the last couple of months will know that by default inter-VM transparent page sharing is disabled. If you do want to reap the benefits of TPS and would like to leverage it in times of contention, then enabling it in 6.0 is pretty straightforward. Just go to the advanced settings and set "Mem.ShareForceSalting" to 0. Do note that there are potential security risks when doing this, and I recommend reading the above article to get a better understanding of those risks.
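If you prefer the command line over the Web Client, the same setting can be changed from the ESXi shell with esxcli (a sketch; verify the behaviour and the default value on your build before relying on it):

```shell
# Inspect the current salting setting
# (in 6.0 the default disables inter-VM page sharing)
esxcli system settings advanced list -o /Mem/ShareForceSalting

# Re-enable inter-VM page sharing by setting the salt to 0
# (be aware of the security considerations discussed above)
esxcli system settings advanced set -o /Mem/ShareForceSalting -i 0
```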
** update – originally I was told that High was 300% of minFree, looking at my hosts today it seems that High is actually 400% **
Carlos says
I thought that you start doing, e.g., ballooning when you go below 64% and you stop when you get back over 100% minfree. I.e. your lower and upper boundaries are exactly those, but crossed to get the “hysteresis” effect.
On another note, do you know if there is any kind of defragmentation after pages are broken up? I.e., once you break a 2M page into 512x4K, will those stay as 4K pages forever?
Sreejesh says
Thanks Duncan for the article, hope you are well. Can you please confirm, with this change, which reclamation mechanism triggers at each memory state (High, Clear, Soft, Hard and Low)? It would be great if you could give a table like the one in Frank's blog, like…
Mem State Threshold Reclamation mechanism
—————- ————- ————————————
High 300%
Clear 100%
Soft 64%
Hard 32%
Low 16%
Duncan Epping says
Let me see what I can do…
Sreejesh says
Thanks Duncan. It's a bit confusing at what state which reclamation technique starts.
For example, in the following output the FreeMem state is High. Though the state is high, it is observed that all memory reclamation processes (ballooning, compression, TPS…) have kicked in.
~ # vmware -lv
VMware ESXi 5.5.0 build-1892794
VMware ESXi 5.5.0 Update 1
~ # esxtop
6:58:33pm up 204 days 4:20, 662 worlds, 11 VMs, 35 vCPUs; MEM overcommit avg: 0.06, 0.06, 0.06
PMEM /MB: 65490 total: 1838 vmk, 61342 other, 2309 free
VMKMEM/MB: 65166 managed: 1266 minfree, 6210 rsvd, 58955 ursvd, high state
PSHARE/MB: 7011 shared, 441 common: 6570 saving
SWAP /MB: 179 curr, 14 rclmtgt: 0.00 r/s, 0.00 w/s
ZIP /MB: 15 zipped, 9 saved
MEMCTL/MB: 1603 curr, 1609 target, 42637 max
I hope a table and esxtop results will be helpful to understand the concept easily.
Carlos says
Why do you say that all reclamation processes are kicked in ?
What you see is that they did at some time in the past, so some pages were swapped out, some compressed, etc., but they are not running "now".
Duncan Epping says
Exactly. The ESXTOP results as you show them display pages swapped and pages ballooned, not what is happening right now; this could be weeks old.
Sreejesh says
Thanks Duncan and Carlos,
I am not sure; the current value (curr) for memctl, pshare and swap in esxtop on the host I am referring to changes frequently (every 1 or 2 secs), though the freemem state is High. I observed the same behaviour on a different hypervisor as well. I also noticed in the 'vSphere Monitoring and Performance guide' that the value for 'curr' indicates the current usage.
If it is a historical value, will it automatically reset to zero, or do we have to reset it manually?
Mitko says
Can I comment on this statement:
“When there is less then High (300% of minFree) but more then Clear (100% of minFree) available then ESXi will start pre-emptively breaking up large pages so that TPS (when enabled!)”
I don't observe this, and I don't think it is completely accurate.
First of all, High is a state, not a threshold (every state normally has two thresholds, upper and lower, let's say). In some VMware materials which I have read this is not described in the best way.
So let me ask you a simple question: what is the state above 300%? Isn't it actually the High state? And between 300% and 100% it is Clear (vSphere 6). If so, breaking up large pages and scheduled TPS (when enabled!) must always run, no matter how much free memory there is (because it works above 300% in the High state and also below 300% in the other states).
I personally have observed this kind of behaviour in vSphere 5.5 and 6.
And when there is a lot of free memory (much more than 300%, let's say 2000% or 3000%) the state is always High (in esxtop/resxtop), so the breaking up of large pages plus scheduled TPS always works, which is visible on such hosts as considerably high values of Shared and Shared Common in esxtop and the vSphere Client's performance charts.
Duncan Epping says
Yes, I simplified the explanation; each state indeed has a lower and an upper threshold. Just looking at it today, it appears that a last-minute change was made and the threshold for High is actually not 300% but 400% of minFree. If free memory drops below High, down to Clear, then pages will be broken up as described.
Duncan Epping says
I describe the upper and lower boundary here: http://vmwa.re/1pj
Mitko says
Sreejesh, you are right that these are current values, and I agree with the other guys that they were accumulated (at least in part) in the past when the corresponding thresholds and states were hit.
But when the free memory has increased, I suspect the memory management will slowly try to bring the pages back from swap, balloon and zip, and this may be the reason for the constant slow change while in the High state.
Duncan Epping says
Here you go, the table you asked for: http://vmwa.re/1pj
sig says
good work
Tim Curless (@timcurless) says
Thanks for the post Duncan – great stuff. One thing comes to mind regarding Tier 1 critical applications such as a particularly important MS SQL workload. Obviously in these cases we always want to leverage large pages from the app through the OS down to the hypervisor for increased performance.
The problem I envision with the new Clear state is that as pages are preemptively broken up (when clear < x < high) we could adversely affect performance before a need arises. If my understanding is correct I think we avoid this by reserving memory (obviously another best practice in these cases).
My question is: does a memory reservation in any way guarantee large pages will be maintained? Is there an advanced setting in vSphere 6 to accomplish this? I know we can force small pages, but what about forcing large pages, or at least limit breaking them up to a Hard or Low state?
Santo says
Hi Duncan,
Thanks for the great article. I would like to clarify something regarding the broken large pages.
Am I right to say that the breaking of large pages only applies to VM(s) whose memory is backed by large pages via EPT or RVI?
And if large pages have been set at the guest OS level, then TPS will not be able to collapse them?
I'd appreciate your answer, thanks
Duncan Epping says
For VMs which have memory pages backed by large pages by vSphere.