
Yellow Bricks

by Duncan Epping



Memory Tiering… Say what?!

Duncan Epping · Jun 14, 2024 ·

Recently I presented a keynote at the Belgium VMUG; the topic was Innovation at VMware by Broadcom, although I guess I should say Innovation at Broadcom to be more accurate. During the keynote I briefly went over the innovation process, the various types of innovation, and what these can lead to. I discussed three projects: vSAN ESA, the Distributed Services Engine, and something being worked on called Memory Tiering.

Memory Tiering is a very interesting concept that was first publicly discussed at Explore (or VMworld, I guess, as it was still called) a few years ago as a potential future feature. You may ask yourself why anyone would want to tier memory, as the performance impact can be significant. There are various reasons to do so, one of them being the cost of memory. Another problem the industry is facing is that memory capacity (and performance) has not grown at the same rate as CPU capacity, which has resulted in many environments being memory-bound; put differently, the imbalance between CPU and memory has increased substantially. That is why VMware started Project Capitola.

When Project Capitola was discussed, most of the focus was on Intel Optane, and most of us know what happened to that. I guess some assumed that this would also result in Project Capitola, or memory tiering and memory pooling technology in general, being scrapped. That is most definitely not the case: VMware has gone full steam ahead and has been discussing the progress in public, although you need to know where to look. If you listen to that session, it is clear that there are various efforts that would allow customers to tier memory in various ways, one of them of course being the various CXL-based solutions that are coming to market now or soon.

One of those is memory tiering via a CXL accelerator card: basically an FPGA whose sole purpose is to increase memory capacity, offload memory tiering, and accelerate certain functionality where memory is crucial, such as vMotion. As mentioned in the SNIA session, using an accelerator card can lead to a 30% reduction in migration times. An accelerator card like this also opens up other opportunities, such as pooling memory, which is something customers have been asking for since we created the concept of a cluster: being able to share compute resources across hosts. Just imagine, your VM could use memory capacity available on another host without having to move the VM. Yes, before anyone comments on this, I do realize that this could potentially have a significant performance impact.

That is of course where the VMware logic comes into play. At VMworld in 2021, when Project Capitola was presented, the team also shared the performance results of recent tests, which showed that the performance degradation was around 10% when memory was split 50% DRAM and 50% Optane. I was watching the SNIA session, and its demo shows the true power of VMware vSphere, memory tiering, and acceleration (Project Peaberry, as it is called). On average the performance degradation was around 10%, even though roughly 40% of virtual memory was accessed via the Peaberry accelerator. Do note that the tiering is completely transparent to the application, so this works for all the different types of workloads out there. The crucial part to understand is that because the hypervisor is already responsible for memory management, it knows which pages are hot and which pages are cold, which also means it can determine which pages it can move to a different tier while maintaining performance.
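To make the hot/cold page idea concrete, here is a minimal, purely illustrative sketch of an epoch-based tiering policy in Python. The class names, the access-count heuristic, and the threshold are my own assumptions for illustration only, and do not reflect VMware's actual implementation.

```python
# Toy hot/cold page tiering: demote rarely accessed pages to a slower
# tier, promote frequently accessed ones back to DRAM. Illustrative only.
from dataclasses import dataclass


@dataclass
class Page:
    page_id: int
    tier: str = "dram"     # "dram" (fast) or "tiered" (slow, e.g. CXL-attached)
    access_count: int = 0  # accesses tracked during the current epoch


class TieringPolicy:
    def __init__(self, hot_threshold: int = 4):
        # Pages accessed at least this often per epoch are considered hot.
        self.hot_threshold = hot_threshold

    def rebalance(self, pages):
        """Place each page on the tier matching its observed hotness."""
        for page in pages:
            page.tier = "dram" if page.access_count >= self.hot_threshold else "tiered"
            page.access_count = 0  # reset counters for the next epoch
```

A real hypervisor would of course track accesses via page-table accessed bits and weigh the cost of migrating a page, but the core idea of demoting cold pages to a slower tier while keeping hot pages in DRAM is the same.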

Anyway, I cannot reveal too much about what may, or may not, be coming in the future. What I can promise though is that I will make sure to write a blog as soon as I am allowed to talk about more details publicly, and I will probably also record a podcast with the product manager(s) when the time is there, so stay tuned!

VMworld Reveals: vMotion innovations

Duncan Epping · Sep 3, 2019 ·

At VMworld, various cool new technologies were previewed. In this series of articles, I will write about some of those previewed technologies. Unfortunately, I can't cover them all as there are simply too many. This article is about enhancements that will be introduced to vMotion in the future, which was session HBI1421BU. For those who want to see the session, you can find it here. The session was presented by Arunachalam Ramanathan and Sreekanth Setty. Please note that this is a summary of a session discussing a Technical Preview; this feature/product may never be released, this preview does not represent a commitment of any kind, and this feature (or its functionality) is subject to change. Now let's dive into it: what can you expect for vMotion in the future?

The session starts with a brief history of vMotion and how today we are able to vMotion VMs with 128 vCPUs and 6 TB of memory. The expectation, though, is that vSphere will in the future support 768 vCPUs and 24 TB of memory. A crazy configuration if you ask me; that is a proper Monster VM.

[Read more…] about VMworld Reveals: vMotion innovations

VMworld Reveals: DRS 2.0 (#HBI2880BU)

Duncan Epping · Sep 3, 2019 ·

At VMworld, various cool new technologies were previewed. In this series of articles, I will write about some of those previewed technologies. Unfortunately, I can't cover them all as there are simply too many. This article is about DRS 2.0, which was session HBI2880BU. For those who want to see the session, you can find it here. The session was presented by Adarsh Jagadeeshwaran and Sai Inabattini. Please note that this is a summary of a session discussing a Technical Preview; this feature/product may never be released, this preview does not represent a commitment of any kind, and this feature (or its functionality) is subject to change. Now let's dive into it: what is DRS 2.0 all about?

The session started with an intro: DRS was first introduced in 2006. Since then, datacenters and workloads (cloud-native architectures) have changed a lot. DRS, however, has remained largely the same over the past 10 years. What we need is a resource management engine that is more workload-centric than cluster-centric, and that is why we are planning on introducing DRS 2.0.

What has changed? In general, the changes can be placed in 3 categories:

  • New cost-benefit model
  • Support for new resources and devices
  • Faster and scalable

[Read more…] about VMworld Reveals: DRS 2.0 (#HBI2880BU)

VMworld Reveals: VMware Cluster Memory (OCTO2746BU)

Duncan Epping · Sep 2, 2019 ·

At VMworld, various cool new technologies were previewed. In this series of articles, I will write about some of those previewed technologies. Unfortunately, I can't cover them all as there are simply too many. This article is about VMware Cluster Memory, which was session OCTO2746BU. For those who want to see the session, you can find it here. I first learned about VMware Cluster Memory at our VMware internal R&D conference in May this year, and immediately got excited about it. Please note that this is a summary of a session discussing a Technical Preview; this feature/product may never be released, this preview does not represent a commitment of any kind, and this feature (or its functionality) is subject to change. Now let's dive into it: what is Cluster Memory?

Well, it is exactly what you would expect it to be: the ability to create a pool of cross-host memory resources. In order to do this, the first problem that needs to be looked at is the network. As mentioned in the session, the ratio of network to memory latency has dropped significantly: in 1997 the ratio was roughly 1000, and right now it is below 10, meaning that network latency has come down from milliseconds to low microseconds. To reach these low-microsecond latencies today, technologies like RDMA need to be considered. This change is very important for the Cluster Memory feature being discussed. Also very important is the fact that RDMA is affordable, which means it will be coming to a data center near you soon; a huge difference compared to years ago.
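The latency ratio mentioned in the session is easy to sanity-check with some back-of-the-envelope arithmetic. The absolute latency figures below are rough order-of-magnitude assumptions on my part, not numbers from the talk.

```python
# Back-of-the-envelope check of the network-to-memory latency ratio.
# All latency figures are rough assumptions for illustration.
dram_latency_ns = 100        # typical DRAM access, order of magnitude

network_1997_ns = 100_000    # ~100 microseconds, pre-RDMA LAN round trip
network_rdma_ns = 1_000      # ~1 microsecond, one-sided RDMA read

ratio_1997 = network_1997_ns / dram_latency_ns  # roughly 1000
ratio_now = network_rdma_ns / dram_latency_ns   # roughly 10

print(ratio_1997, ratio_now)  # 1000.0 10.0
```

With remote memory only about an order of magnitude slower than local DRAM, borrowing memory from another host starts to look feasible, which is exactly the premise of Cluster Memory.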

[Read more…] about VMworld Reveals: VMware Cluster Memory (OCTO2746BU)

No one ever got fired for buying IBM/HP/DELL/EMC etc

Duncan Epping · May 26, 2015 ·

Last week on Twitter there was a discussion about hyper-converged solutions and how these were not what someone who works in an enterprise environment would buy for their tier 1 workloads. I asked the question: well, what about buying Pure Storage, Tintri, Nimble, or SolidFire systems? All non-hyper-converged solutions, but relatively new. The answer was straightforward: not buying those either, too big a risk. Then the classic comment came:

No one ever got fired for buying IBM (Dell, HP, NetApp, EMC… pick one)

Brilliant marketing slogan by the way (IBM's), which has stuck around since the 70s and is now being used by many others. I wondered though… did anyone ever get fired for buying Pure Storage? Or for buying Tintri? What about Nutanix? Or VMware Virtual SAN? Hold on, maybe someone got fired for buying Nimble; yeah, probably Nimble then. No, of course not; even after a dozen Google searches nothing shows up. Why, you may ask yourself? Well, because people typically don't get fired for buying a certain solution. People get fired for being incompetent, lazy, or stupid. In the case of infrastructure and workloads, that translates into managing and placing workloads incorrectly or misconfiguring infrastructure. Fatal mistakes which result in data loss or long periods of downtime, that is what gets you fired.

Sure, buying from a startup may impose some risks. But I would hope that everyone reading this weighs those risks against the benefits; that is what you do as an architect, in my opinion. You assess risks and determine how to mitigate them within your budget. (Yes, of course taking requirements and constraints into account as well.)

Now when it comes to these newer storage solutions, and "new" is relative in this case as some have been around for over 5 years, I would argue that the risk is in most cases negligible. Will those newer storage systems be free of bugs? No, but neither will your legacy storage system. Some of those legacy systems have been around for over a decade and are now used in scenarios they were never designed for, which means that new problems may be exposed. I am not saying that legacy storage systems will break under your workload, but are you taking that risk into account? Probably not. Why not? Because hardly anyone talks about that risk.

If you (still) don't feel comfortable with that "new" storage system (yet), but it does appear to give you that edge or a bigger bang for the buck, simply ask the sales rep a couple of questions which will help build trust:

  • How many systems similar to what you are looking to buy have been sold worldwide, and for similar platforms?
    • If they sold thousands, but none of them is running vSphere for instance, then what are the chances of you hitting that driver problem first? If they sold thousands, it will be useful to know…
  • How many customers are there for that particular model?
    • It wouldn't be the first time a vendor sells thousands of boxes to a single customer for a very specific use case, and it works great for them, just not in your particular use case.
    • But if they have many customers, maybe ask…
  • Whether you can talk to a couple of customers
    • The best thing you can ask for, in my opinion: a reference call or visit. This is when you find out whether what is promised is actually reality.

I do believe that the majority of infrastructure-related startups are great companies with great technology. Personally I see a bigger threat in terms of sustainability rather than technology. Not every startup is going to be around 10 years from now. But if you look at all the different storage (or infra) startups out there today, and then look at how they are doing in the market, it shouldn't be too difficult to figure out who is in it for the long run. Whether you buy from a well-established vendor or from a relatively new storage company, it is all about your workload: what are the requirements, and how can those requirements be satisfied by that platform? Assess the risks, weigh them against the benefits, and make a decision based on that. Don't make decisions based on a marketing slogan that has been around since the 70s. The world looks different now; technology is moving faster than ever before, and being stuck in the 70s is not going to help you or your company compete in this day and age.

About the Author

Duncan Epping is a Chief Technologist and Distinguished Engineering Architect at Broadcom. Besides writing on Yellow-Bricks, Duncan is the co-author of the vSAN Deep Dive and the vSphere Clustering Deep Dive book series. Duncan is also the host of the Unexplored Territory Podcast.

Follow Us

  • X
  • Spotify
  • RSS Feed
  • LinkedIn

Recommended Book(s)

Also visit!

For the Dutch-speaking audience, make sure to visit RunNerd.nl to follow my running adventure, read shoe/gear/race reviews, and more!

Do you like Hardcore-Punk music? Follow my Spotify Playlist!

Do you like 80s music? I got you covered!

Copyright Yellow-Bricks.com © 2026 · Log in