I was just reading Scott Drummond’s article on Memory Compression. Scott explains where Memory Compression comes into play. I guess the part I want to respond to is the following:
VMware’s long-term prioritization for managing the most aggressively over-committed memory looks like this:
- Do not swap if possible. We will continue to leverage transparent page sharing and ballooning to make swapping a last resort.
- Use ODMC to a predefined cache to decrease memory utilization.*
- Swap to persistent memory (SSD) installed locally in the server.**
- Swap to the array, which may benefit from installed SSDs.
(*) Demonstrated in the lab and coming in a future product.
(**) Part of our vision and not yet demonstrated.
I just love it when we give insights into upcoming features, but I am not sure I agree with the prioritization. I think there are several things that one needs to keep in mind. In other words, there is a cost associated with these decisions/features, and your design needs to be adjusted for these associated effects.
- TPS -> Although TPS is an excellent way of reducing the memory footprint, you will need to figure out what the deduplication ratio actually is. Especially when you are using Nehalem processors there is a serious decrease in effectiveness, for the following reasons:
- NUMA – By default there is no inter-node transparent page sharing (read Frank’s article for more info on this topic).
- Large Pages – By default TPS does not share large (2 MB) pages; it only shares small (4 KB) pages. It will break large pages down into small pages when memory is scarce, but it is definitely something you need to be aware of (for more info read my article on this topic). The page-sharing sketch after this list illustrates the difference.
- Use ODMC -> I haven’t tested ODMC yet, so I don’t know what the associated cost is at the moment.
- Swap on local SSD -> Swapping to a local SSD will most definitely improve the speed when swapping occurs. However, as Frank already described in his article, there is an associated cost:
- Disk space – You will need to make sure enough disk space is available to power on or migrate VMs, as these swap files are created at power-on or at migration (see the swap-file sizing sketch after this list).
- Defaults – By default .vswp files are stored in the same folder as the .vmx. Changing this needs to be documented and taken into account during upgrades and design changes.
- Swap to array (SSD) -> This is the option most customers use, for the simple reason that it doesn’t require a local SSD disk. No changes are needed to enable it, and it’s easier to increase a SAN volume than a local disk when needed. The associated costs, however, are:
- Costs – Shared storage is relatively expensive compared to local disks
- Defaults – If .vswp files need to be SSD-based, you will need to separate the .vswp files from the rest of the VM files and create dedicated shared SSD volumes.
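To get a feel for the large-page effect, here is a quick sketch (a toy model with made-up synthetic pages, nothing taken from ESX internals) that hashes the same chunk of “guest memory” once at 4 KB and once at 2 MB granularity and reports how many pages could be collapsed onto an identical copy:

```python
# Toy illustration of content-based page sharing (synthetic data only, nothing
# taken from ESX internals). The same block of "guest memory" is hashed once in
# 4 KB pages and once in 2 MB pages to show how sharing opportunities collapse
# when large pages are used. Requires Python 3.9+ (random.randbytes).
import hashlib
import random

PAGE_SMALL = 4 * 1024         # 4 KB small page
PAGE_LARGE = 2 * 1024 * 1024  # 2 MB large page

def build_guest_memory(num_small_pages: int) -> bytes:
    """Synthetic guest memory: zero pages, repeated OS/library pages, unique data."""
    random.seed(42)
    common = [bytes([random.randrange(1, 256)]) * PAGE_SMALL for _ in range(8)]
    pages = []
    for _ in range(num_small_pages):
        roll = random.random()
        if roll < 0.30:
            pages.append(b"\x00" * PAGE_SMALL)          # zero page
        elif roll < 0.55:
            pages.append(random.choice(common))         # shared OS/library content
        else:
            pages.append(random.randbytes(PAGE_SMALL))  # unique workload data
    return b"".join(pages)

def shareable_fraction(memory: bytes, page_size: int) -> float:
    """Fraction of pages that could be collapsed onto an identical copy."""
    hashes = [hashlib.sha1(memory[i:i + page_size]).digest()
              for i in range(0, len(memory), page_size)]
    return 1 - len(set(hashes)) / len(hashes)

mem = build_guest_memory(num_small_pages=2048)  # 8 MB of toy "guest memory"
print(f"4 KB granularity, shareable pages: {shareable_fraction(mem, PAGE_SMALL):.0%}")
print(f"2 MB granularity, shareable pages: {shareable_fraction(mem, PAGE_LARGE):.0%}")
```

With real workloads the numbers will differ, but the pattern holds: identical 2 MB regions are far rarer than identical 4 KB pages, which is why the TPS gains largely disappear until the large pages are broken down.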
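And to put a number on the disk-space point: the swap file created at power-on is roughly the VM’s configured memory minus its reservation. A back-of-the-envelope helper (my own sketch, not a VMware tool) to estimate the headroom a host or datastore needs:

```python
# Back-of-the-envelope .vswp sizing (my own helper, not a VMware tool).
# The swap file ESX creates at power-on is roughly: configured memory - reservation.

def vswp_size_gb(configured_gb: float, reservation_gb: float = 0.0) -> float:
    """Approximate size of the swap file created when the VM powers on."""
    return max(configured_gb - reservation_gb, 0.0)

# (configured memory in GB, memory reservation in GB) for the VMs on one host/datastore
vms = [(8, 0), (16, 4), (32, 32), (4, 0)]

total = sum(vswp_size_gb(c, r) for c, r in vms)
print(f"Headroom needed for swap files: {total} GB")
# -> 8 + 12 + 0 + 4 = 24 GB must be free before these VMs can power on or migrate in
```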
I fully agree with Scott that it’s an exciting feature and I can’t wait for it to be available. Keep in mind, though, that there is a trade-off for every decision you make, and that the result of a decision might not always be what you expected. Even though Scott’s list makes total sense, there is more to it than meets the eye.
Bouke Groenescheij says
I just can’t resist commenting on that too: I think you’re both forgetting the ‘ballooning’ part…
In 2.x, 3.x and 4.x the order has always been:
1: TPS (transparent page sharing)
2: Ballooning (VMMemCtl) – essentially swapping inside the guest OS of the VM
3: Swapping
And there is something called MemIdleTax, which allows the hypervisor to grab memory back from the VM if it isn’t being used.
I’m missing step 2 here… Ballooning. Then I read:
“Do not swap if possible. We will continue to leverage transparent page sharing and ballooning to make swapping a last resort.”
But I sure hope the order would be:
1: TPS
2: ODMC
3: Ballooning (and if you’ve got SSD, great, just make sure you’ve got the OS pagefile/swapfile on a .vmdk on a VMFS volume on SSD. This will, of course, not work on local SSD if you still want cluster capabilities like vMotion, HA, DRS, FT)
4: Swapping (local or remote, I don’t really care as long as you keep the correct design described above by Duncan).
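To make that fallback order concrete, a toy sketch (with completely made-up yields per technique, not how the VMkernel actually decides): try the cheapest reclamation technique first and only fall through to the next one if more memory is still needed.

```python
# Toy sketch of an ordered reclamation fallback (illustrative only; the per-technique
# yields below are invented, not how ESX actually behaves).
def tps(needed_mb):     return min(needed_mb, 512)   # pages we happen to be able to share
def odmc(needed_mb):    return min(needed_mb, 256)   # pages that compress well
def balloon(needed_mb): return min(needed_mb, 1024)  # memory the guest can give back
def swap(needed_mb):    return needed_mb             # last resort, always "succeeds"

def reclaim(needed_mb: int) -> None:
    """Walk the techniques in priority order until enough memory is freed."""
    for name, technique in [("TPS", tps), ("ODMC", odmc),
                            ("Ballooning", balloon), ("Swapping", swap)]:
        if needed_mb <= 0:
            break
        freed = technique(needed_mb)
        needed_mb -= freed
        print(f"{name:<10} freed {freed:>4} MB, still short {needed_mb:>4} MB")

reclaim(2048)  # TPS frees 512, ODMC 256, ballooning 1024, swapping the remaining 256
```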
I hope “it’s like that, and that’s the way it is, whua” – (run)-o-DMC 😉
Bouke Groenescheij says
Oh and by the way, I would rather spend my money on buying more memory instead of those expensive SSDs… And I would not recommend over-committing that much anyway. Either buy more memory, or slower CPUs to get the CPU/memory balance better :-).
Chad Sakac says
Bouke – I don’t disagree that “add more memory” is a good starting point, but understand that these all increase together (i.e. as memory density increases, so does VM count and mem per VM – ergo memory pressure).
Also, SSD in servers will likely become more predominant. If a 256GB SSD can be added to a server for several hundred dollars – it will be. It will be slower than DRAM by a bit, and SRAM by a lot, but it’s still orders of magnitude cheaper and larger.
Also, Duncan, if the storage subsystem can tier by block, separate datastores aren’t necessarily needed to leverage shared SSDs.
Duncan says
Chad, maybe I don’t fully understand the tiering concept EMC uses, but how does the storage array, when it does tiering on a block level, know which blocks belong to .vswp files? These files are usually only used sporadically, so with any normal algorithm they would end up on slow storage instead of fast storage?!
Sean Clark says
Duncan,
The array that tiers at block level doesn’t care what the block is. Therefore it will keep active blocks on tier 0/1 storage and only migrate blocks to lower tiers after they have been inactive for a set (configurable) period of time.
So the benefit Chad is hinting at would come during the initial swap-out-to-disk period. The auto-tiering-enabled SAN would keep the .vswp on tier 0/1 and greatly ease the pain of swapping to disk.
I think where the problem with an auto-tiered SAN would come in is after the swapping ceases: as days/weeks go by, the SAN auto-tiers the .vswp blocks down to SATA drives. My understanding is that ESX will leave that swapped memory in the .vswp file indefinitely until it is needed, EVEN if sufficient physical memory now exists in the ESX server. This downside would need to be mitigated somehow… probably a reboot of the affected VMs after the (hopefully rare) ESX swapping event.
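To make that failure mode concrete, here is a toy model (purely illustrative, not any vendor’s actual algorithm) of blocks being demoted after sitting idle longer than the relocation window:

```python
# Toy model of block-level auto-tiering (purely illustrative, not any vendor's
# actual algorithm): blocks that have not been touched within the demotion window
# are moved from the SSD tier to the SATA tier -- which is exactly where long-idle
# .vswp blocks end up.
from dataclasses import dataclass

DEMOTE_AFTER_DAYS = 7  # assumed relocation window; configurable on a real array

@dataclass
class Block:
    name: str
    last_access_day: int
    tier: str = "SSD"

def run_tiering_job(blocks: list, today: int) -> None:
    """Demote any SSD block that has been idle longer than the window."""
    for b in blocks:
        if b.tier == "SSD" and today - b.last_access_day > DEMOTE_AFTER_DAYS:
            b.tier = "SATA"  # the next read of this block will be slow

blocks = [
    Block("active VMDK block", last_access_day=29),
    Block(".vswp block written during a swap event", last_access_day=2),
]
run_tiering_job(blocks, today=30)
for b in blocks:
    print(f"{b.name}: {b.tier}")
# active VMDK block: SSD
# .vswp block written during a swap event: SATA  <- the eventual page-in hits the slowest tier
```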
my $.02.
Sean Clark
Duncan Epping says
I might not have been clear, Sean, but when memory gets paged out it will only be paged in when the VM requests the page. The question is: how long does it take before it is moved to a lower tier? And what are the chances that the page has been moved to a lower tier before it is paged in? Because if that happens, you would have even worse performance than without auto-tiered storage.
Sean Clark says
You are correct. The initial paging event will be handled fairly well by the auto-tiered storage, but if you let days/weeks go by you would probably end up with worse performance than if you had just swapped to the VM’s default location.
It might still be desirable to swap to an auto-tiered SAN with an SSD tier 0/1, *BUT* you would want to automate the power-down and restart of the VM to bring all of the VM’s memory back into physical memory and out of the .vswp file. The power-down and restart would have to be coordinated for a maintenance window, but you would definitely want that to happen prior to the next scheduled migration of blocks between tiers on the SAN.
Am I correct in saying that a power off/on or restart of a VM is the only way to bring a VM’s memory back into physical memory from the .vswp?
Duncan Epping says
No,
When there is no memory pressure and the guest requests the page it will also be paged in again!
Sean Clark says
“When there is no memory pressure and the guest requests the page it will also be paged in again!”
Right, I know that. But in the case where no memory pressure exists, and you reboot the VM, the VM would then have 100% of its consumed memory in physical memory and therefore would experience consistent performance from then on.