
Yellow Bricks

by Duncan Epping


memory

Memory Tiering… Say what?!

Duncan Epping · Jun 14, 2024 · 1 Comment

Recently I presented a keynote at the Belgium VMUG. The topic was Innovation at VMware by Broadcom, though I guess I should say Innovation at Broadcom to be more accurate. During the keynote I briefly went over the process, the various types of innovation, and what this can lead to. During the session I discussed three projects: vSAN ESA, the Distributed Services Engine, and something which is being worked on called Memory Tiering.

Memory Tiering is a very interesting concept that was first publicly discussed a few years ago at Explore (or VMworld, as I guess it was still called back then) as a potential future feature. You may ask yourself why anyone would want to tier memory, as the impact from a performance standpoint can be significant. There are various reasons to do so, one of them being the cost of memory. Another problem the industry is facing is that memory capacity (and performance) has not grown at the same rate as CPU capacity, which has resulted in many environments being memory-bound; put differently, the imbalance between CPU and memory has increased substantially. That’s why VMware started Project Capitola.

When Project Capitola was discussed, most of the focus was on Intel Optane, and most of us know what happened to that. I guess some assumed that this would also mean Project Capitola, or memory tiering and memory pooling technology in general, would be scrapped. That is most definitely not the case: VMware has gone full steam ahead and has been discussing the progress in public, although you need to know where to look. If you listen to that session, it is clear that there are various efforts underway that would allow customers to tier memory in different ways, one of them of course being the various CXL-based solutions that are coming to market now or soon.

One of those is memory tiering via a CXL accelerator card: basically an FPGA whose sole purpose is to increase memory capacity, offload memory tiering, and accelerate certain functionality where memory is crucial, such as vMotion. As mentioned in the SNIA session, using an accelerator card can lead to a 30% reduction in migration times. An accelerator card like this also opens up other opportunities, like pooling memory, which is something customers have been asking for since we created the concept of a cluster: being able to share compute resources across hosts. Just imagine, your VM could use memory capacity available on another host without having to move the VM. Yes, before anyone comments on this, I do realize that this could potentially have a significant performance impact.

That is of course where the VMware logic comes into play. At VMworld in 2021, when Project Capitola was presented, the team also shared the performance results of recent tests, which showed that the performance degradation was around 10% with a 50% DRAM / 50% Optane split. I was watching the SNIA session, and the demo shows the true power of VMware vSphere, memory tiering, and acceleration (Project Peaberry, as it is called). On average the performance degradation was around 10%, even though roughly 40% of virtual memory was accessed via the Peaberry accelerator. Do note that the tiering is completely transparent to the application; this works for all the different types of workloads out there. The crucial part to understand is that because the hypervisor is already responsible for memory management, it knows which pages are hot and which pages are cold, which also means it can determine which pages it can move to a different tier while maintaining performance.
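To make that last point a bit more concrete, below is a minimal sketch in Python of the general idea, purely illustrative and not how ESXi actually implements this: track how often each page is touched per sampling interval, demote cold pages to a slower tier, and promote them back to DRAM when they become hot again. The tier names and the threshold are made up for the example.

# Toy illustration of hot/cold page classification and tier placement.
# This is NOT the ESXi implementation; names and thresholds are invented.
from dataclasses import dataclass

HOT_THRESHOLD = 8  # accesses per sampling interval to count as "hot" (made up)

@dataclass
class Page:
    number: int
    tier: str = "dram"        # "dram" (fast) or "cxl" (slower, cheaper tier)
    recent_accesses: int = 0  # how often the page was touched this interval

def retier(pages):
    """Demote cold pages to the slow tier, promote hot ones back to DRAM."""
    for page in pages:
        if page.recent_accesses < HOT_THRESHOLD and page.tier == "dram":
            page.tier = "cxl"
        elif page.recent_accesses >= HOT_THRESHOLD and page.tier == "cxl":
            page.tier = "dram"
        page.recent_accesses = 0  # reset the counter for the next interval

# Example: four pages, only page 0 is touched frequently this interval.
pages = [Page(number=i) for i in range(4)]
pages[0].recent_accesses = 50
retier(pages)
print([(p.number, p.tier) for p in pages])
# -> [(0, 'dram'), (1, 'cxl'), (2, 'cxl'), (3, 'cxl')]

In a real hypervisor the access information comes from the hardware and the placement decision is obviously far more sophisticated, but the principle is the same: the component that already manages all memory is in the best position to decide what can safely live in a slower tier.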

Anyway, I cannot reveal too much about what may, or may not, be coming in the future. What I can promise, though, is that I will write a blog as soon as I am allowed to share more details publicly, and I will probably also record a podcast with the product manager(s) when the time comes, so stay tuned!

VMworld Reveals: VMware Cluster Memory (OCTO2746BU)

Duncan Epping · Sep 2, 2019 ·

At VMworld, various cool new technologies were previewed. In this series of articles, I will write about some of those previewed technologies; unfortunately, I can’t cover them all as there are simply too many. This article is about VMware Cluster Memory, which was session OCTO2746BU. For those who want to see the session, you can find it here. I first learned about VMware Cluster Memory at our internal VMware R&D conference in May this year and immediately got excited about it. Please note that this is a summary of a session discussing a Technical Preview: this feature/product may never be released, the preview does not represent a commitment of any kind, and the feature (or its functionality) is subject to change. Now let’s dive into it: what is Cluster Memory?

Well, it is exactly what you would expect it to be: the ability to create a pool of memory resources across hosts. In order to do this, the first problem that needs to be looked at is the network. As mentioned in the session, the ratio of network to memory latency has dropped significantly: in 1997 the ratio was roughly 1000, while today it is below 10, meaning that network latency has come down from milliseconds to low microseconds. To reach these low-microsecond latencies today, technologies like RDMA need to be considered. This change is very important for the Cluster Memory feature being discussed. Also very important is the fact that RDMA is affordable, which means it will be coming to a data center near you soon. A huge difference compared to years ago.
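To put some rough numbers on that ratio (my own back-of-the-envelope assumptions, not figures from the session): with local DRAM access in the order of 100 nanoseconds, a late-nineties network round trip of about 0.1 milliseconds gives a ratio of roughly 1000, while a modern RDMA fabric at around a microsecond brings it below 10.

# Back-of-the-envelope network-to-memory latency ratios.
# The latency figures are illustrative assumptions, not measurements.
DRAM_LATENCY_NS = 100  # typical local memory access

scenarios = {
    "1997 (TCP/IP, ~0.1 ms round trip)": 100_000,  # network latency in ns
    "today (RDMA, ~1 us round trip)": 1_000,
}

for label, network_ns in scenarios.items():
    ratio = network_ns / DRAM_LATENCY_NS
    print(f"{label}: network/memory ratio ~ {ratio:.0f}")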

[Read more…] about VMworld Reveals: VMware Cluster Memory (OCTO2746BU)

Must read white paper: Persistent Memory performance with vSphere 6.7

Duncan Epping · Aug 14, 2018 ·

Today I noticed this whitepaper titled Persistent Memory Performance on vSphere 6.7. An intriguing topic for sure, as it is something relatively new and something I haven’t encountered too much in the field. Yes, I usually talk about Persistent Memory, aka NVDIMMs, in my talks, but then it typically relates to vSAN. I have not seen too many publications from VMware on this topic, so I figured I would share this publication with you:

  • Persistent Memory Performance in vSphere 6.7 – https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/performance/pmem-vsphere67-perf.pdf
    Persistent memory (PMEM) is a new technology that has the characteristics of memory but retains data through power cycles. PMEM bridges the gap between DRAM and flash storage. PMEM offers several advantages over current technologies like:

    • DRAM-like latency and bandwidth
    • CPU can use regular load/store byte-addressable instructions
    • Persistence of data across reboots and crashes

The paper starts with a brief intro and then explains the different modes in which PMEM can be used: either as a “disk” (vPMEMDisk) or surfaced up to the guest OS as an NVDIMM (vPMEM). With the latter option there is also the ability to have some form of application awareness, which is referred to as the third mode (vPMEM-aware).
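To get a feel for what byte-addressable load/store access means in the vPMEM and vPMEM-aware modes, here is a minimal Python sketch that memory-maps a file and stores a few bytes directly into the mapping; on a real DAX-mounted PMEM filesystem the same pattern, combined with proper flushing, persists data without going through the block layer. The path below is a hypothetical mount point, not something the paper prescribes.

# Minimal sketch of byte-addressable access to a memory-mapped region.
# An ordinary file stands in for persistent memory here; the path is a
# hypothetical DAX mount point used purely for illustration.
import mmap
import os

PATH = "/mnt/pmem0/example.dat"  # hypothetical PMEM-backed location

fd = os.open(PATH, os.O_CREAT | os.O_RDWR, 0o600)
os.ftruncate(fd, 4096)           # reserve one 4 KiB page

with mmap.mmap(fd, 4096) as region:
    region[0:5] = b"hello"       # byte-granularity store, no read/write syscall
    region.flush()               # on real PMEM this maps to cache-line flushes
os.close(fd)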

I am not going to copy and paste the findings, as the paper has a lot of interesting data and you should go through it yourself. One thing I found most interesting is the huge decrease in latency. Anyway, read the paper and get familiar with persistent memory / NVDIMMs, as this technology will start changing the way we design HCI platforms in the future and cater for low-latency / high-throughput applications in traditional environments.

I have memory pages swapped, can vSphere unswap them?

Duncan Epping · Jun 2, 2016 ·

“I have memory pages swapped out to disk, can vSphere swap them back into memory again?” is one of those questions that comes up occasionally. A while back I asked the engineering team why we don’t “swap in” pages when memory contention is lifted. There was no really good answer other than that it was difficult to predict from a behavioural point of view. So I asked: what about doing it manually? Unfortunately the answer was: well, we will look into it, but it has no real priority at this point.

I was very surprised to receive an email this week from one of our support engineers, Valentin Bondzio, telling me that you can actually do this in vSphere 6.0. Although not widely exposed, the feature is actually in there and today is typically used by VMware support when requested by a customer. Valentin was kind enough to provide me with this excellent write-up. Before you read it, do note that this feature was intended for VMware Support. While it is internally supported, you would be using it at your own risk, so consider this write-up purely educational. Support for this feature, and exposure through the UI, may or may not change in the future.

By Valentin Bondzio

Did you ever receive an alarm due to a hanging or simply underperforming application or VM? If yes, was it ever due to prolonged hypervisor swap wait? That might be somewhat expected in an acute overcommit or limited VM / Resource Pool scenario, but very often the actual contention happened days, weeks or even months ago. In those scenarios you were just unlucky enough that the guest or application decided to touch a lot of the memory that happened to be swapped out around the same time, which up until that exact moment you either didn’t notice or, if you did, it didn’t pose any visible threat. It was just idle data that resided on disk instead of in memory.

The notable distinction is that the data is on disk while everything expects it to be in memory, meaning a (hard) page fault will suspend the execution of the VM until that very page is read from disk back into memory. If that happens to be a fairly large and contiguous range, even with generous pre-fetching by ESXi, you might experience some sort of service unavailability.
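To put a rough number on that stall (illustrative assumptions on my part, not measurements from this write-up): faulting a few gigabytes back in page by page adds up very quickly.

# Rough estimate of the cumulative stall caused by hard page faults.
# All figures are illustrative assumptions, not measured ESXi behaviour.
PAGE_SIZE_KB = 4
SWAPPED_GB = 2          # amount of swapped memory the guest suddenly touches
FAULT_COST_US = 100     # assumed average cost per fault (SSD-class swap storage)

faults = SWAPPED_GB * 1024 * 1024 // PAGE_SIZE_KB
stall_seconds = faults * FAULT_COST_US / 1_000_000
print(f"{faults} faults, roughly {stall_seconds:.0f} s of cumulative VM stall")
# -> 524288 faults, roughly 52 s of cumulative VM stall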

How do you prevent this from happening in scenarios where you actually have ample free memory and the cause of contention has long been resolved? Up until today the answer would be to power cycle your VM or to use vMotion with a local swap store to asynchronously page in the swapped-out data. For everyone running ESXi 6.0, that answer just got a lot simpler.

Introducing unswap

As the name implies, it will page in memory that has been swapped out by the hypervisor, whether that was due to actual contention during an outage or just an ill-placed Virtual Machine or Resource Pool limit. Let’s play through an example:

A VM experienced a non-specified event (hint: it was a 2GB limit) and now about 14GB of its 16GB of allocated memory is swapped out to the default swap location.

# memstats -r vm-stats -u mb -s name:memSize:max:consumed:swapped | sed -n '/  \+name/,/ \+Total/p'
           name    memSize        max   consumed    swapped
-----------------------------------------------------------
      vm.449922      16384       2000       2000      14146

[Read more…] about I have memory pages swapped, can vSphere unswap them?

What happens at which vSphere memory state?

Duncan Epping · Mar 2, 2015 ·

I’ve received a bunch of questions from people about what happens at each vSphere memory state, after writing the article on breaking up large pages and introducing a new memory state in vSphere 6.0. Note that the below applies to vSphere 6.0 only; the “Clear” memory state does not exist before 6.0. Also note that there is an upper and lower boundary when transitioning between states, which means you will not see actions triggered at the exact specified threshold, but slightly before or after passing it.

I created a simple table that shows what happens when. Note that minFree itself is not a fixed number but rather a sliding scale; its value depends on the host memory configuration.

Memory state   Threshold          Actions performed
High           400% of minFree    Break large pages when below threshold (wait for next TPS run)
Clear          100% of minFree    Break large pages and actively call TPS to collapse pages
Soft           64% of minFree     TPS + Balloon
Hard           32% of minFree     TPS + Compress + Swap
Low            16% of minFree     Compress + Swap + Block

First of all, note that when you enter the “Hard” state the balloon driver stops and “Swap” and “Compress” take over. This is something I never really realized, but it is important to know, as it means that when memory fills up fast you will see a short period of ballooning and then an immediate jump to compressing and swapping. You may ask yourself what this “block” thing is. Well, this is the worst situation you can find yourself in, and it is the last resort. As my colleague Ishan described it:

The low state is similar to the hard state. In addition to compressing and swapping memory pages, ESX may block certain VMs from allocating memory in this state. It aggressively reclaims memory from VMs, until ESX moves into the hard state.

I hope this makes it clear which action is triggered at which state, and also why the “Clear” state was introduced and the “High” state changed: it provides more time for the other actions to do what they need to do, namely free up memory to avoid blocking VMs from allocating new memory pages.
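For those who want to translate the percentages above into absolute numbers, here is a minimal Python sketch. It assumes the commonly documented sliding-scale formula for minFree (899 MB covering the first 28 GB of host memory plus 1% of the remainder), so do verify against your own release before relying on the exact values.

# Compute approximate vSphere 6.0 memory state thresholds for a host.
# Assumes minFree = 899 MB for the first 28 GB of host RAM + 1% of the rest;
# this is the commonly documented formula and may differ per release.
def min_free_mb(host_mem_gb):
    extra_gb = max(host_mem_gb - 28, 0)
    return 899 + extra_gb * 1024 * 0.01

STATES = {"High": 4.00, "Clear": 1.00, "Soft": 0.64, "Hard": 0.32, "Low": 0.16}

host_mem_gb = 256
mf = min_free_mb(host_mem_gb)
print(f"minFree for a {host_mem_gb} GB host: ~{mf:.0f} MB")
for state, factor in STATES.items():
    print(f"{state:>5}: below ~{mf * factor:.0f} MB free")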

