
Yellow Bricks

by Duncan Epping


swap

List all “thick” swap files on vSAN

Duncan Epping · Sep 6, 2017 ·

As some may know, on vSAN the swap file is by default a fully reserved file. This means that if you have a VM with 8GB of memory, vSAN will reserve 16GB of capacity in total for it. 16GB? Yes, 16GB, as the FTT=1 policy is also applied to the swap file. In vSAN 6.2 we introduced the ability to have swap files created “thin”, or “unreserved” I should probably say. You can do this simply by setting an advanced setting on each host in your cluster (SwapThickProvisionDisabled). Once you have set this and power your VMs off and back on, the swap file is recreated and will be thin. Jase McCarty wrote a script that sets this setting on each host in your cluster, but the problem of course is: how do you know which VM has the “new” unreserved swap file and which VM still has the fully reserved swap file? This is what a customer asked me last week.
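For those who prefer to check or set this by hand rather than via the script, a minimal sketch from the ESXi shell could look like the below. I am assuming here that the advanced option is exposed under the path /VSAN/SwapThickProvisionDisabled, so verify the exact option name against your build before relying on it:

    # Assumed option path; confirm it on your build first
    esxcli system settings advanced set -o /VSAN/SwapThickProvisionDisabled -i 1
    # Check the current value afterwards (Int Value should now be 1)
    esxcli system settings advanced list -o /VSAN/SwapThickProvisionDisabled

That still leaves the question of which VMs actually have the new unreserved swap file, of course.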

I was sitting next to William at a session and I asked him this question. William went at it and knocked out a Python script which lists all VMs in a cluster which have a fully reserved swap file. Very useful for those who are moving to “unreserved / sparse” swap. This way you can figure out which VMs still need a reboot and reclaim that (unused) disk capacity.

Note, the “sparse” / “unreserved” swap files are only intended for environments which do not overcommit on memory. If you do overcommit on memory please ensure you have disk capacity available, as you will need the disk capacity as soon as the hypervisor wants to place memory pages in the swap file. If there’s no disk capacity available it will result in the VM failing.

Thanks William for knocking out this script so fast…

Storage capacity for swap files and TPS disabled

Duncan Epping · Dec 8, 2016 ·

A while ago (2014) I wrote an article on TPS being disabled by default in future releases. (Read KB 2080735 and 2097593 for more info.) I described why VMware made this change from a security perspective and what the impact could be. Even today, two years later, I am still getting questions about this, for instance about the impact on swap files. With vSAN you have the ability to thin provision swap files; with TPS disabled, is this something that brings a risk?
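As a side note, the behaviour described in those KB articles is driven by the Mem.ShareForceSalting advanced setting. If you want to see how a host is currently configured, a quick check from the ESXi shell could look like this (a value of 2, the default in recent releases, restricts page sharing to within a VM):

    # Show the current TPS salting configuration on this host
    esxcli system settings advanced list -o /Mem/ShareForceSalting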

Let’s break it down: first of all, what is the risk of having TPS enabled, and where does TPS come into play?

With large pages enabled by default, most customers aren’t actually using TPS to the level they think they are. Unless you are using old CPUs without EPT or RVI capabilities, which I doubt at this point, TPS usually only kicks in under memory pressure: large pages first get broken into small pages and only then can they be shared. And if you have severe memory pressure, that usually means you will go straight to ballooning or swapping.

Having said that, let’s assume a hacker has managed to find his way into your virtual machine’s guest operating system. Only when memory pages are collapsed, which as described above only happens under memory pressure, will the hacker be able to attack the system. Note that the VM/data he wants to attack will need to be located on the same host (actually, even on the same NUMA node), and the memory pages/data he needs to breach the system will need to be collapsed. Many would argue that if a hacker gets that far, all the way into your VM and capable of exploiting this gap, you have far bigger problems. On top of that, what is the likelihood of pulling this off? Personally, and I know the VMware security team probably doesn’t agree, I think it is unlikely. I understand why VMware changed the default, but there are a lot of “ifs” in play here.

Anyway, let’s assume you have assessed the risk, feel you need to protect yourself against it, and keep the default setting (intra-VM TPS only). What is the impact on your swap file capacity allocation? As stated, when there is memory pressure, ballooning cannot free up sufficient memory, and intra-VM TPS is not providing the needed memory space either, the next step after compressing memory pages is swapping! And in order for ESXi to swap memory to disk you will need disk capacity. If the swap file is thin provisioned (vSAN Sparse Swap), then before swapping out, those blocks on vSAN will need to be allocated. (This also applies to NFS, by the way, where files are thin provisioned by default.)

What does that mean in terms of design? Well, in your design you will need to ensure you allocate capacity on vSAN (or any other storage platform) for your swap files. This doesn’t need to be 100% of the swap capacity, but it should be more than the expected level of overcommitment. If you expect that during maintenance, for instance (or during an HA event), you will have memory overcommitment of about 25%, then you should ensure at least 25% of the capacity needed for swap files is available, to avoid a VM being stunned because new blocks for the swap file cannot be allocated when you run out of vSAN datastore space.
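To make that a bit more concrete, here is a back-of-the-envelope sketch with made-up numbers: 100 VMs with 16GB of memory each and no reservations means up to 1600GB worth of swap files if they were ever fully written, and an expected 25% overcommitment then translates into roughly 400GB of capacity you want to keep free for swap growth:

    # Illustrative numbers only; substitute your own environment's values
    vms=100; vm_mem_gb=16; overcommit_pct=25
    total_swap_gb=$(( vms * vm_mem_gb ))   # 1600 GB of swap files when fully written
    echo "Keep at least $(( total_swap_gb * overcommit_pct / 100 )) GB free for swap growth"

And keep in mind that on vSAN the blocks that do get allocated are still subject to your FTT policy, so the space actually consumed on the datastore will be a multiple of that.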

Let it be clear, I don’t know many customers running their storage systems at 95% capacity or more, but if you are, and you have thin swap files, are overcommitting, and have TPS disabled, you may want to re-think your strategy.

I have memory pages swapped, can vSphere unswap them?

Duncan Epping · Jun 2, 2016 ·

“I have memory pages swapped out to disk, can vSphere swap them back into memory again?” is one of those questions that comes up occasionally. A while back I asked the engineering team why we don’t “swap in” pages when memory contention is lifted. There was no real good answer for it, other than that it was difficult to predict from a behavioural point of view. So I asked: what about doing it manually? Unfortunately the answer was: well, we will look into it, but it has no real priority at this point.

I was very surprised to receive an email this week from one of our support engineers, Valentin Bondzio, letting me know that you can actually do this in vSphere 6.0. Although not widely exposed, the feature is actually in there, and typically (as it stands today) it is used by VMware support when requested by a customer. Valentin was kind enough to provide me with this excellent write-up. Before you read it, do note that this feature was intended for VMware Support. While it is internally supported, you’d be using it at your own risk; consider this write-up to be purely for educational purposes. Support for this feature, and exposure through the UI, may or may not change in the future.

By Valentin Bondzio

Did you ever receive an alarm due to a hanging or simply underperforming application or VM? If yes, was it ever due to prolonged hypervisor swap wait? That might be somewhat expected in an acute overcommit or limited VM / RP scenario, but very often the actual contention happened days, weeks or even months ago. In those scenarios you were just unlucky enough that the guest or application decided to touch a lot of the memory that happened to be swapped out, all around the same time. Until that exact moment you either didn’t notice it, or if you did, it didn’t pose any visible threat. It was just idle data that resided on disk instead of in memory.

The notable distinction is that the data is on disk while the guest fully expects it to be in memory, meaning a (hard) page fault will suspend the execution of the VM until that very page is read from disk and back into memory. If that happens to be a fairly large and contiguous range, even with generous pre-fetching by ESXi, you might experience some sort of service unavailability.

How do you prevent this from happening in scenarios where you actually have ample free memory and the cause of the contention has long been resolved? Up until today the answer would be to power cycle your VM, or to use vMotion with a local swap store to asynchronously page in the swapped-out data. For everyone running ESXi 6.0, that answer just got a lot simpler.

Introducing unswap

As the name implies, it will page in memory that has been swapped out by the hypervisor, whether that was caused by actual contention during an outage or just by an ill-placed virtual machine or resource pool limit. Let’s play through an example:

A VM experienced a non-specified event (hint, it was a 2GB limit) and now about 14GB of its 16GB of allocated memory are swapped out to the default swap location.

# memstats -r vm-stats -u mb -s name:memSize:max:consumed:swapped | sed -n '/  \+name/,/ \+Total/p'
           name    memSize        max   consumed    swapped
-----------------------------------------------------------
      vm.449922      16384       2000       2000      14146


What happens at which vSphere memory state?

Duncan Epping · Mar 2, 2015 ·

After writing the article about breaking up large pages and the new memory state introduced in vSphere 6.0, I’ve received a bunch of questions from people about what happens at each vSphere memory state. Note that the table below applies to vSphere 6.0 only; the “Clear” memory state does not exist before 6.0. Also note that there is an upper and lower boundary when transitioning between states, which means you will not see actions triggered at the exact specified threshold, but slightly before or after passing it.

I created a simple table that shows what happens when. Note that minFree itself is not a fixed number but rather a sliding scale; its value depends on the host memory configuration.

Memory state   Threshold         Actions performed
High           400% of minFree   Break Large Pages when below Threshold (wait for next TPS run)
Clear          100% of minFree   Break Large Pages and actively call TPS to collapse pages
Soft           64% of minFree    TPS + Balloon
Hard           32% of minFree    TPS + Compress + Swap
Low            16% of minFree    Compress + Swap + Block
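To make the table a bit more tangible, here is a quick calculation against an assumed minFree of 1638MB; the real value is a sliding scale as mentioned above, and you can see it for your own host in esxtop’s memory view:

    # Assumed minFree value in MB, purely for illustration
    minfree=1638
    for pct in 400 100 64 32 16; do
      echo "${pct}% of minFree = $(( minfree * pct / 100 )) MB free"
    done

On such a host, large pages would start being broken up once free memory drops below roughly 6552MB, and blocking would only come into play below roughly 262MB of free memory.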

First of all, note that when you enter the “Hard” state the balloon driver stops and “Swap” and “Compress” take over. This is something I never really realized, but it is important to know, as it means that when memory fills up fast you will see a short period of ballooning and then an immediate jump to compressing and swapping. You may ask yourself what this “block” thing is. Well, this is the worst situation you can find yourself in, and it is the last resort, as my colleague Ishan described it:

The low state is similar to the hard state. In addition to compressing and swapping memory pages, ESX may block certain VMs from allocating memory in this state. It aggressively reclaims memory from VMs, until ESX moves into the hard state.

I hope this makes it clear which action is triggered at which state, and also why the “Clear” state was introduced and the “High” state changed. It provides more time for the other actions to do what they need to do: free up memory to avoid blocking VMs from allocating new memory pages.

Swap to host cache aka swap to SSD?

Duncan Epping · Aug 18, 2011 ·

Before we dive into it, let’s spell out the actual name of the feature: “Swap to host cache”. Remember that, swap to host cache!

I’ve seen multiple people mentioning this feature, and saw William posting a hack on how to fool vSphere (the feature is part of vSphere 5, to be clear) into thinking it has access to SSD disks when this might not be the case. One thing I noticed is that there seems to be a misunderstanding about what swap to host cache actually is and does, probably because some tend to call it “swap to SSD”. Yes, it is true that ultimately your VM would be swapping to SSD, but it is not just a swap file on SSD, or better said, it is NOT a regular virtual machine swap file on SSD.

When I logged in to my environment, the first thing I noticed was that my SSD-backed datastore was not tagged as SSD. So the first thing I wanted to do was tag it as SSD; as mentioned, William already described this in his article, and it is well documented in our own documentation as well, so I followed it. This is what I did to get it working:

  • Check the NAA ID in the vSphere UI
  • Open an SSH session to the ESXi host
  • Validate which SATP claimed the device:
    esxcli storage nmp device list
    In my case: VMW_SATP_ALUA_CX
  • Verify it is currently not recognized as SSD by typing the following command:
    esxcli storage core device list -d naa.60060160916128003edc4c4e4654e011
    It should say: “Is SSD : False”
  • Set “Is SSD” to true:
    esxcli storage nmp satp rule add -s VMW_SATP_ALUA_CX --device naa.60060160916128003edc4c4e4654e011 --option=enable_ssd
  • Reload the claim rules and run them using the following commands:
    esxcli storage core claimrule load
    esxcli storage core claimrule run
  • Validate it is now set to true:
    esxcli storage core device list -d naa.60060160916128003edc4c4e4654e011
    The device should now be listed as SSD

Next would be to enable the feature… When you go to your host and click on the “Configuration” tab, there should be a section called “Host Cache Configuration” on the left. When you’ve correctly tagged your SSD, the datastore will show up there.

Please note that I already had a VM running on the device, which is why some of the space shows as being in use; normally I would recommend using a drive dedicated to swap. The next step is enabling the feature, which you can do by opening the pop-up window (right click your datastore and select “Properties”). This is what I did:

  • Tick “Allocate space for host cache”
  • Select “Custom size”
  • Set the size to 25GB
  • Click “OK”

Now there is no science to this value, as I just wanted to enable and test the feature. What happened when we enabled it? We allocated space on this LUN, so something must have been done with it. I opened up the datastore browser and noticed that a new folder had been created on this particular VMFS volume.

Not only did it create a folder structure, it also created 25 x 1GB .vswp files. Now before we go any further, please note that this is a per-host setting. Each host will need to have its own host cache assigned, so it probably makes more sense to use a local SSD drive instead of a SAN volume. Some of you might say: but what about resiliency? Well, if your host fails the VMs will need to restart anyway, so that data is no longer relevant; in terms of disk resiliency you should definitely consider a RAID-1 configuration. Generally speaking, SAN volumes are much more expensive than local volumes, and using local volumes also removes the latency caused by the storage network. Compared to the latency of an SSD (less than 100 μs), network latency can be significant. So let’s recap that in a nice design principle:

Basic design principle
Using “Swap to host cache” will severely reduce the performance impact of VMkernel swapping. It is recommended to use a local SSD drive to eliminate any network latency and to optimize for performance.

How does it work? Well, fairly straightforward actually. When there is severe memory pressure and the hypervisor needs to swap memory pages to disk, it will swap to the .vswp files on the SSD drive instead. Each of these files, 25 in my case, is shared amongst the VMs running on this host. Now you will probably wonder how you know whether the host is using this host cache or not; that can simply be validated by looking at the performance statistics within vCenter. It contains a couple of new metrics, of which “Swap in from host cache” and “Swap out to host cache” (and the “rate” variants) are the most important to monitor. (Yes, esxtop has metrics to monitor it as well, namely LLSWR/s and LLSWW/s.)

What if you want to resize your host cache and it is already in use? Well, simply said, the host cache is optimized to allow for this scenario. If the host cache is completely filled, memory pages will need to be copied to the regular .vswp file. This could mean the process takes longer than expected, and of course it is not a recommended practice, as it will decrease performance for your VMs since these pages more than likely will need to be swapped in again at some point. Resizing, however, can be done on the fly; there is no need to vMotion away your VMs. Just adjust the slider and wait for the process to complete. If you decide to completely remove all host cache for whatever reason, then all relevant data will be migrated to the regular .vswp file.

What if the host cache is full? Normally it shouldn’t even reach that state, but when you run out of space in the host cache, pages will be migrated from the host cache to the regular .vswp file, first in, first out, which should be the right policy for most workloads. Now the chances of having memory pressure to the extent where you fill up a local SSD are of course small, but it is good to realize what the impact is. If you are going down the path of local SSD drives with host cache enabled and will be overcommitting, it might be good to do the math and ensure you have enough cache available to keep these pages in cache rather than on rotating media. I prefer to keep it simple though, and would probably recommend matching the size of your host’s memory. In the case of a host with 128GB RAM that would be a 128GB SSD. Yes, this might be overkill, but the price difference between 64GB and 128GB is probably negligible.

Basic design principle
Monitor swap usage. Although “Swap to host cache” will reduce the impact of VMkernel swapping, it will not eliminate it. Take your expected consolidation ratio into account, including your HA (N-X) strategy, and size accordingly. Or keep it simple and just use the same size as physical memory.

One interesting use case could be to place all regular swap files on very cheap shared storage (RAID-5 of SATA drives) or even local SATA storage using the “VM swapfile location” (aka host local swap) feature, and then configure a host cache on any host these VMs can be migrated to. This should give you the performance of an SSD while maintaining most of the cost savings of the cheap storage. Please note that the host cache is a per-host feature; hence, during a vMotion all data from the cache will need to be transferred to the destination host. This will impact the time a vMotion takes. Unless your vMotions are time critical, this should not be an issue though. I have been told that VMware will publish a KB article with advice on how to buy the right SSDs for this feature.

Summarizing: “swap to SSD” is what people have been calling this feature, but that is not what it is. This is a mechanism that caches memory pages on SSD and should be referred to as “swap to host cache”. Depending on how you do the math, all memory pages can be swapped to and from SSD. If there is insufficient space available, memory pages will move over to the regular .vswp file. Use local SSD drives to avoid any latency associated with your storage network and to minimize costs.
