
Yellow Bricks

by Duncan Epping


memory

vSphere 6.0: Breaking Large Pages…

Duncan Epping · Feb 17, 2015 ·

When talking about Transparent Page Sharing (TPS), one thing that comes up regularly is the use of Large Pages and how that impacts TPS. As most of you hopefully know, TPS does not collapse large pages. However, when there is memory pressure you will see that large pages are broken up into small pages, and those small pages can then be collapsed by TPS. ESXi does this to prevent other memory reclamation techniques, which have a far bigger impact on performance, from kicking in. You can imagine that fetching a memory page from a swap file on a spindle will take significantly longer than fetching a page from memory. (A nice white paper on the topic of memory reclamation can be found here…)

Something that I have personally run into a couple of times is the situation where memory pressure goes up so fast that the different states at which certain memory reclamation techniques kick in are crossed in a matter of seconds. This usually results in swapping to disk, even though large pages should have been broken up and collapsed where possible by TPS, or memory should have been compressed or VMs ballooned. This is something that I discussed with the respective developers, and they came up with a solution. In order to understand what was implemented, let's look at how memory states were defined in vSphere 5. There were 4 memory states, namely High (100% of minFree), Soft (64% of minFree), Hard (32% of minFree) and Low (16% of minFree). What does % of minFree mean? Well, if minFree is roughly 10GB for your configuration, then Soft, for instance, is reached when there is less than 64% of minFree available, which is 6.4GB of memory. For Hard this is 3.2GB, and so on. It should be noted that the change in state, and the action it triggers, does not happen exactly at the percentage mentioned; there is a lower and upper boundary where the transition happens, and this was done to avoid oscillation.

With vSphere 6.0 a fifth memory state is introduced, and this state is called Clear. Clear is 100% of minFree, and High has been redefined as 400% of minFree. When there is less than High (400% of minFree) but more than Clear (100% of minFree) available, ESXi will start pre-emptively breaking up large pages so that TPS (when enabled!) can collapse them at the next run. Let's take that 10GB of minFree as an example again: when you have between 40GB (High) and 10GB (Clear) of free memory available, large pages will be broken up. This should provide the leeway needed to safely collapse pages (TPS) and avoid the potential performance decrease which the other memory states could introduce. Very useful if you ask me, and I am very happy that this change in behaviour, which I requested a long time ago, has finally made it into the product.
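To make the thresholds a bit more tangible, here is a quick Python sketch. This is purely illustrative and not ESXi code; it just computes the vSphere 6.0 thresholds for a given minFree value and shows the window in which large pages are broken up:

# Illustrative only, not ESXi code: compute the vSphere 6.0 memory state
# thresholds as a percentage of minFree, as described above.

def state_thresholds_mb(min_free_mb):
    return {
        "high": 4.00 * min_free_mb,   # 400% of minFree (redefined in 6.0)
        "clear": 1.00 * min_free_mb,  # 100% of minFree (new state in 6.0)
        "soft": 0.64 * min_free_mb,
        "hard": 0.32 * min_free_mb,
        "low": 0.16 * min_free_mb,
    }

min_free = 10 * 1024  # the ~10GB minFree example used above
for state, mb in state_thresholds_mb(min_free).items():
    print(f"{state:>5}: {mb / 1024:5.1f} GB")

# Large pages are pre-emptively broken up while free memory sits between
# the clear and high thresholds, i.e. between 10GB and 40GB in this example.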

Those of you who have been paying attention the last couple of months will know that by default inter-VM transparent page sharing is disabled. If you do want to reap the benefits of TPS and would like to leverage it in times of contention, then enabling it in 6.0 is pretty straightforward. Just go to the advanced settings and set “Mem.ShareForceSalting” to 0. Do note that there are potential security risks when doing this, and I recommend reading the above article to get a better understanding of those risks.

** update – originally I was told that High was 300% of minFree, looking at my hosts today it seems that High is actually 400% **

How does Mem.MinFreePct work with vSphere 5.0 and up?

Duncan Epping · Jun 14, 2013 ·

With vSphere 5.0 VMware changed the way Mem.MinFreePct works. I briefly explained Mem.MinFreePct in a blog post a long time ago. Basically, Mem.MinFreePct, pre vSphere 5.0, was the percentage of memory set aside by the VMkernel to ensure there are always sufficient system resources available. I received a question on Twitter yesterday based on the explanation in the vSphere 5.1 Clustering Deepdive, and after exchanging more than 10 tweets I figured it made sense to just write an article.

https://twitter.com/vmcutlip/status/345289952684290048

Mem.MinFreePct used to be 6% with vSphere 4.1 and earlier. Now you can imagine that on a host with 10GB of memory you wouldn't worry about 600MB being kept free. That is slightly different for a host with 100GB, where it would result in 6GB being kept free, though still not an extreme amount. But what would happen when you have a host with 512GB of memory? Yes, that would result in roughly 30GB of memory being kept free. I am guessing you can see the point now. So what changed with vSphere 5.0?

In vSphere 5.0 a “sliding scale” principle was introduced instead of a single Mem.MinFreePct value. Let me call it “Mem.MinFree”, as I wouldn't view this as a percentage but rather do the math and view it as a number instead. Let's borrow Frank's table for this sliding scale concept:

Percentage kept free   Memory range
6%                     0-4GB
4%                     4-12GB
2%                     12-28GB
1%                     Remaining memory

What does this mean if you have 100GB of memory in your host? It means that from the first 4GB of memory we set aside 6%, which equates to ~245MB. For the next 8GB (the 4-12GB range) we set aside another 4%, which equates to ~327MB. For the next 16GB (the 12-28GB range) we set aside 2%, which also equates to ~327MB. From the remaining 72GB (100GB host minus 28GB) we set aside 1%, which equates to ~720MB. In total the value of Mem.MinFree is ~1619MB, and this amount is being kept free for the system.
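If you want to do the math for your own hosts, here is a minimal Python sketch of the sliding scale. It is my own approximation of the logic described above, not actual VMkernel code:

# Approximation of the Mem.MinFree sliding scale (values in MB).
# The real VMkernel calculation may round slightly differently.

def min_free_mb(host_memory_gb):
    brackets = [
        (4, 0.06),             # first 4GB  -> 6%
        (8, 0.04),             # 4-12GB     -> 4%
        (16, 0.02),            # 12-28GB    -> 2%
        (float("inf"), 0.01),  # remainder  -> 1%
    ]
    remaining = host_memory_gb
    total_mb = 0.0
    for size_gb, pct in brackets:
        portion = min(remaining, size_gb)
        total_mb += portion * 1024 * pct
        remaining -= portion
        if remaining <= 0:
            break
    return total_mb

print(round(min_free_mb(100)))  # ~1638MB; the example above rounds down to ~1619MB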

Now, what happens when the host has less than 1619MB of free memory? That is when the various memory reclamation techniques come into play. We all know the famous “high, soft, hard, and low” memory states; these used to be explained as 6% (High), 4% (Soft), 2% (Hard) and 1% (Low). FORGET THAT! Yes, I mean that… forget those numbers, as that is what we used in the “old world” (pre 5.0). With vSphere 5.0 and up these water marks should be viewed as a percentage of Mem.MinFree. I used the example from above to clarify what that results in.

Free memory state   Threshold in percentage               Threshold in MB
High water mark     Higher than or equal to Mem.MinFree   1619MB
Soft water mark     64% of Mem.MinFree                    1036MB
Hard water mark     32% of Mem.MinFree                    518MB
Low water mark      16% of Mem.MinFree                    259MB
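Continuing the sketch above, the water marks then simply follow as a percentage of the Mem.MinFree value:

# Water marks as a percentage of Mem.MinFree (vSphere 5.0 and up),
# using the ~1619MB example value from the table above.
mem_min_free_mb = 1619
for state, pct in (("high", 1.00), ("soft", 0.64), ("hard", 0.32), ("low", 0.16)):
    print(f"{state:>4}: {mem_min_free_mb * pct:7.0f} MB")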

I hope this clarifies a bit how vSphere 5.0 (and up) ensures there is sufficient memory available for the VMkernel to handle system tasks…

What is static overhead memory?

Duncan Epping · May 6, 2013 ·

We had a discussion internally on static overhead memory. Coincidentally, I spoke with Aashish Parikh from the DRS team on this topic a couple of weeks ago when I was in Palo Alto. Aashish is working on improving the overhead memory estimation calculation so that both HA and DRS can be even more efficient when it comes to placing virtual machines. The question was around what determines the static overhead memory, and this is the answer that Aashish provided. I found it very useful, hence I asked Aashish if it was okay to share it with the world. I added some bits and pieces where I felt additional details were needed.

First of all, what is static overhead and what is dynamic overhead:

  • When a VM is powered-off, the amount of overhead memory required to power it on is called static overhead memory.
  • Once a VM is powered-on, the amount of overhead memory required to keep it running is called dynamic or runtime overhead memory.

Static overhead memory of a VM depends upon various factors:

  1. Several virtual machine configuration parameters, like the number of vCPUs, the amount of vRAM, the number of devices, etc.
  2. The enabling/disabling of various VMware features (FT, CBRC, etc.)
  3. The ESXi build number

Note that the static overhead memory estimation is calculated fairly conservatively and takes a worst-case scenario into account. This is the reason why engineering is exploring ways of improving it. One of the areas that can be improved is, for instance, including host configuration parameters. These parameters are things like CPU model, family & stepping, various CPUID bits, etc. This means that, as a result, two similar VMs residing on different hosts could have different overhead values.

But what about dynamic overhead? Dynamic overhead seems to be more accurate today, right? Well, there is a good reason for that: with dynamic overhead it is “known” where the VM is running, and the cost of running the VM on that host can easily be calculated. It is no longer a matter of estimating, but a matter of doing the math. That is the big difference: Dynamic = the VM is running and we know where, versus Static = the VM is powered off and we don't know where it might be powered on!

The same applies, for instance, to vMotion scenarios. Although the platform knows what the target destination will be, it still doesn't know how the target will treat that virtual machine. As such, the vMotion process aims to be conservative and uses static overhead memory instead of dynamic. One of the things, for instance, that changes the amount of overhead memory needed is the “monitor mode” used (BT, HV or HWMMU).

So what is being explored to improve it? First of all, including the additional host-side parameters mentioned above. But secondly, and equally important, calculating the overhead memory based on the VM -> “target host” combination. Or, as engineering calls it, calculating the “static overhead of VM v on Host h”.

Now why is this important? When is static overhead memory used? Static overhead memory is used by both HA and DRS. HA, for instance, uses it with Admission Control when doing the calculations around how many VMs can be powered on before unreserved resources are depleted. When you power on a virtual machine, the host-side “admission control” will validate whether it has sufficient unreserved resources available for the “static memory overhead” to be guaranteed… But DRS and vMotion also use the static memory overhead metric, for instance to ensure a virtual machine can be placed on a target host during a vMotion process, as the static memory overhead needs to be guaranteed.
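To illustrate, here is a simplified Python sketch of what that host-side admission check conceptually boils down to. The function and the numbers are made up for illustration; this is not the actual HA/DRS code:

# Conceptual sketch only: a power-on (or vMotion placement) succeeds when
# the host has enough unreserved memory to guarantee the VM's memory
# reservation plus its static overhead estimate.

def can_admit(host_unreserved_mb, vm_reservation_mb, static_overhead_mb):
    return host_unreserved_mb >= vm_reservation_mb + static_overhead_mb

# Hypothetical example: 2GB unreserved on the host, a VM with a 1.5GB
# reservation and a 150MB static overhead estimate.
print(can_admit(2048, 1536, 150))  # True: 1686MB needed, 2048MB available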

As you can see, a fairly lengthy chunk of info on just a single simple metric in vCenter / ESXTOP… but very nice to know!

Enabling Hot-Add by default? /cc @gabvirtualworld

Duncan Epping · Jan 16, 2012 ·

Gabe asked on one of my recent posts whether it made sense to enable Hot-Add by default and whether there is an impact/overhead.

Let's answer the impact/overhead portion first: yes, there is an overhead. It is in the range of a few percent. You might ask yourself where this overhead comes from and whether it is vSphere overhead or… When CPU and memory Hot-Add is enabled, the guest OS, especially Windows, will accommodate all possible memory and CPU changes. For CPU it will take the maximum number of vCPUs into account, so with vSphere 5 that would be 32. For memory it will take 16x the power-on memory into account, as that is the maximum you can provision. Does it have an impact? Again, a matter of a few percent. It could also lead to problems, however, when you don't have sufficient memory provisioned, as described in this KB by Microsoft: http://support.microsoft.com/kb/913568.
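To put some numbers on that worst case, here is a quick back-of-the-envelope Python snippet using the vSphere 5 limits mentioned above; the configured values are just an example:

# What a guest OS prepares for when Hot-Add is enabled (vSphere 5 era limits):
# up to 32 vCPUs and up to 16x the powered-on memory.
configured_vcpus = 2      # example VM configuration
configured_mem_gb = 4

max_hot_add_vcpus = 32                     # vSphere 5 maximum vCPU count
max_hot_add_mem_gb = 16 * configured_mem_gb

print(f"Guest sizes its structures for {max_hot_add_vcpus} vCPUs "
      f"and {max_hot_add_mem_gb}GB of memory instead of "
      f"{configured_vcpus} vCPUs and {configured_mem_gb}GB")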

Another impact, mentioned by Valentin (VMware), is the fact that on ESXi 5.0 vNUMA would not be used if you had the HotAdd feature enabled for that VM.

What is our recommendation? Enable it only when you need it. Yes, the impact might be small, but if you don't need it, why would you incur it?!

vNUMA and vMotion

Duncan Epping · Oct 28, 2011 ·

I was listening to some VMworld talks during the weekend and something caught my attention which I hadn't realized before. The talk I was listening to was VSP2122, “VMware vMotion in vSphere 5.0, Architecture and Performance”. Now, this probably doesn't apply to most of the people reading this, so let me set the scenario first:

  • Different hosts from a CPU/Memory perspective in a single cluster (different NUMA topology)
  • VMs with more than 8 vCPUs

Now, the thing is that the vNUMA topology for a given VM is set during power-on. It is based on the NUMA topology of the physical host that received the power-on request. When you move a VM to a host which has a different NUMA topology, this could result in reduced performance. This is also described in the Performance Best Practices whitepaper for vSphere 5.0. A nice example of how you can benefit from vNUMA is explained in the recently released academic paper “Performance Evaluation of HPC Benchmarks on VMware's ESXi Server”.

I've never been a huge fan of mixed clusters due to the complications they add around resource management and availability, but this is definitely another argument to avoid them where and when possible.

