
Yellow Bricks

by Duncan Epping


VMware Cloud Foundation

VCF-9 announcements at Explore Barcelona – vSAN Site Takeover and vSAN Site Maintenance

Duncan Epping · Nov 15, 2024

At Explore in Barcelona we had several announcements and showed several roadmap items which we did not reveal in Las Vegas. As the sessions were not recorded in Barcelona, I wanted to share with you the features I spoke about at Explore which are currently planned for VCF 9. Please note, I don’t know when these features will be generally available, and there’s always a chance they are not released at all.

I created a video of the features we discussed, as I also wanted to share the demos with you. For those who don’t watch videos, below is a brief description of the functionality we are working on for VCF-9. I am keeping it brief, as we have not made a full public announcement about this, and I don’t want to get into trouble.

vSAN Site Maintenance

In a vSAN stretched cluster environment, when you want to do site maintenance, today you need to place every host into maintenance mode one by one. Not only is this an administrative/operational burden, it also increases the chance of placing the wrong host into maintenance mode. On top of that, as you need to do this sequentially, it could also be that the data stored on host-1 in site A differs from the data on host-2 in site A, meaning that there’s an inconsistent set of data within the site. Normally this is not a problem, as the environment will resync when it comes back online, but if the other data site fails, that existing (inconsistent) data set cannot be used to recover. With Site Maintenance we not only make it easier to place a full site into maintenance mode, we also remove that risk of data inconsistency, as vSAN coordinates the maintenance and ensures that the data set is consistent within the site. Fantastic right?!
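For context, the sketch below shows the sequential per-host workflow that Site Maintenance is meant to replace, using today's public pyVmomi API. It is a minimal, illustrative example only, not the new Site Maintenance functionality; the vCenter address, credentials, and host names are placeholders.

# Minimal sketch of today's per-host workflow using pyVmomi.
# The vCenter address, credentials, and host names below are placeholders.
import ssl
import time
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

SITE_A_HOSTS = {"esx01.sitea.local", "esx02.sitea.local"}  # hypothetical names

si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                  pwd="***", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.HostSystem], True)

def wait_for(task):
    """Block until a vCenter task completes, raising on failure."""
    while task.info.state not in (vim.TaskInfo.State.success,
                                  vim.TaskInfo.State.error):
        time.sleep(2)
    if task.info.state == vim.TaskInfo.State.error:
        raise task.info.error

# vSAN decommission mode: keep objects accessible while the host is down.
spec = vim.host.MaintenanceSpec(
    vsanMode=vim.vsan.host.DecommissionMode(objectAction="ensureObjectAccessibility"))

# One host at a time, which is exactly the operational burden described above.
for host in view.view:
    if host.name in SITE_A_HOSTS:
        print(f"Entering maintenance mode: {host.name}")
        wait_for(host.EnterMaintenanceMode_Task(timeout=0, maintenanceSpec=spec))

Disconnect(si)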

vSAN Site Takeover

One of the features I felt we were lacking for the longest time was the ability to promote a site when 2 out of the 3 sites have failed simultaneously. This is where Site Takeover comes into play. If you end up in a situation where both the Witness Site and a data site go down at the same time, you want to be able to still recover, especially as it is very likely that you will have healthy objects for each VM in that second site. This is what vSAN Site Takeover will help you with. It will allow you to manually (through the UI or a script) inform vSAN that, even though quorum is lost, it should make the local RAID set for each of the impacted VMs accessible again. After which, of course, vSphere HA would instruct the hosts to power on those VMs.
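To make the quorum reasoning concrete, here is a tiny, purely illustrative Python model of the three voting sites (not a vSAN API, and real per-object vote counts can differ). It only shows why losing the witness and one data site at the same time leaves the surviving site without a majority, which is exactly the situation Site Takeover is designed to override.

# Illustrative only: a toy vote count for a stretched-cluster object with
# one component in each data site plus a witness component.
votes = {"site_a": 1, "site_b": 1, "witness": 1}

def has_quorum(alive):
    """An object stays accessible only while more than half of its votes are reachable."""
    return sum(votes[s] for s in alive) > sum(votes.values()) / 2

print(has_quorum({"site_a", "site_b", "witness"}))  # True:  healthy cluster
print(has_quorum({"site_b", "witness"}))            # True:  one data site lost
print(has_quorum({"site_b"}))                       # False: witness AND Site A lost,
                                                    #        this is where Site Takeover steps in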

If you have any feedback on the demos, and the planned functionality, feel free to leave a comment!

vSAN Data Protection Demo!

Duncan Epping · Oct 16, 2024

I created a vSAN Data Protection demo for VMware Explore in Las Vegas and Barcelona, and as I got some questions from people about vSAN Data Protection, I figured I would share the demo on YouTube and post it on my blog. I have written a few articles on vSAN Data Protection already, and I am noticing that some customers are starting to test it and even use it in production where possible. As mentioned before, vSAN Data Protection is only available with vSAN ESA, as it leverages the new snapshotting capabilities. The brand new UI is introduced through the Snap Manager Appliance, so make sure to download, deploy, and configure that first.

Why vSAN Max aka disaggregated storage?

Duncan Epping · Sep 6, 2024

At VMware Explore 2024 in Las Vegas I had many customer meetings. Last year my calendar was also swamped, and one of the things we spent a lot of time on was explaining to customers where vSAN Max would fit into the picture. vSAN Max was originally positioned as a “storage only vSAN platform for petabyte scale use cases”. I guess this still somewhat applies, but since then a lot has changed.

First, the vSAN Max ReadyNode configurations have changed substantially, and you can start at a much smaller capacity scale than when originally launched. We start at 20TB for an XS ReadyNode configuration, which means that with a 4 node minimum you have 80TB. That is something completely different than the petabytes we originally discussed. The other big difference is the networking requirements; depending on the capacity needs, those have also come down substantially.

As said, the platform was originally positioned as a solution for customers running at petabyte scale. The reason I wanted to write a quick blog is that this is no longer the main argument customers have today for adopting vSAN Max or considering vSAN Max in their environment. The reason is something I personally did not expect to hear, but it is all about operations and sometimes politics.

Why vSAN Max aka disaggregated storage?

In a traditional environment, of course, depending on the size, you typically see a separation of duties. You have virtualization admins, networking admins, and storage admins. We have worked hard over the past decade to try to create these full-stack engineers, but the reality is that many companies still have these silos, and they will likely still exist 20 years from now.

This is where vSAN Max can help. The HCI model typically means that the virtualization administrator takes on the storage responsibilities when they implement vSAN, but with vSAN Max this doesn’t necessarily need to be the case. As various customers mentioned last week, with vSAN Max you could have a fully separated environment that is managed by a different team. Funny how often this was brought up as a great use case for vSAN. Especially with the amount of vSAN capacity included in VCF this makes more and more sense!

You could even have a different authentication service connected to the vCenter Server, which manages your vSAN Max clusters! You could have other types of hosts, cluster sizes, best practices, naming schemes, etc. This will all be up to the team managing that particular euuh silo. I know, sometimes a silo is seen as something negative, but for a lot of organizations, this is how they operate, and prefer to operate for the foreseeable future. If so, vSAN Max can cater for that use case as well!

vSAN Stretched: Why is the witness not part of the cluster when the link between a data site and the witness fails?

Duncan Epping · Jun 25, 2024

Last week I received a question about vSAN Stretched which had me wondering for a while what on earth was going on. The person who asked this question was running through several failure scenarios, some of which I have also documented in the past here. The question I got was: what is supposed to happen in the scenario shown in the diagram, where the link between the preferred site (Site A) and the witness fails?

The answer, at least that is what I thought, was simple: all VMs will remain running, or said differently, there’s no impact on vSAN. When running the test, the outcome was indeed the same as what I had documented, and what is documented in the Stretched Clustering Guide and the PoC Guide: the VMs remain running. However, one of the things I noticed is that when this situation occurs and the connection between Site A and the Witness is lost, the witness is somehow no longer part of the cluster, which is not what I would expect. The reason I would not expect this to happen is that if a second failure were to occur, for instance the ISL between Site A and Site B going down, it would directly impact all VMs. At least, that is what I assumed.

However, when I triggered that second failure and disconnected the ISL between Site A and Site B, I saw the witness re-appear immediately, I saw the witness objects go from “absent” to “active”, and more importantly, all VMs remained running. The reason this happens is fairly straightforward: when running a configuration like this, vSAN has a “leader” and a “backup”, and they each run in a separate fault domain. Both the leader and the backup need to be able to communicate with the Witness for it to function correctly. If the connection between Site A and the Witness is gone, then either the leader or the backup can no longer communicate with the Witness, and the Witness is taken out of the cluster.

So why does the Witness return for duty when the second failure is triggered? Well, when the second failure is triggered, the leader is restarted in Site B (as Site A is deemed lost), and the backup is already running in Site B. As both the leader and the backup can communicate with the witness again, the witness returns for duty, and so do all of the witness components, automatically and instantly. This means that even though the ISL between Site A and B failed after the witness was taken out of the cluster, all VMs remain accessible, as the witness is reintroduced instantly to ensure availability of the workload. Pretty cool! (Thanks to vSAN engineering for providing these insights on why this happens!)
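The sketch below is a toy Python model of that membership rule as I understand it, not actual vSAN code: the witness only stays in the cluster while both the leader and the backup can reach it, and it rejoins as soon as the leader restarts in Site B after the second failure.

# Toy model of the behaviour described above (not vSAN source code).
# "links" holds which site-to-site connections are currently up.
links = {("site_a", "witness"), ("site_b", "witness"), ("site_a", "site_b")}

def connected(a, b):
    return (a, b) in links or (b, a) in links

def witness_in_cluster(leader_site, backup_site):
    """The witness participates only if BOTH the leader and the backup can reach it."""
    return connected(leader_site, "witness") and connected(backup_site, "witness")

leader, backup = "site_a", "site_b"
print(witness_in_cluster(leader, backup))        # True:  everything healthy

links.discard(("site_a", "witness"))             # failure 1: Site A <-> witness link down
print(witness_in_cluster(leader, backup))        # False: witness leaves the cluster

links.discard(("site_a", "site_b"))              # failure 2: ISL down, Site A isolated
leader = "site_b"                                # leader restarts in Site B
print(witness_in_cluster(leader, backup))        # True:  witness returns, VMs stay up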

Memory Tiering… Say what?!

Duncan Epping · Jun 14, 2024

Recently I presented a keynote at the Belgium VMUG, the topic was Innovation at VMware by Broadcom, but I guess I should say Innovation at Broadcom to be more accurate. During the keynote I briefly went over the process and the various types of innovation and what this can lead to. During the session, I discussed three projects, namely vSAN ESA, the Distributed Services Engine, and something which is being worked on called: Memory Tiering.

Memory Tiering is a very interesting concept that was first publicly discussed at Explore (or VMworld, as I guess it was still called) a few years ago as a potential future feature. You may ask yourself why anyone would want to tier memory, as the impact from a performance standpoint can be significant. There are various reasons to do so, one of them being the cost of memory. Another problem the industry is facing is the fact that memory capacity (and performance) has not grown at the same rate as CPU capacity, which has resulted in many environments being memory-bound; said differently, the imbalance between CPU and memory has increased substantially. That’s why VMware started Project Capitola.

When Project Capitola was discussed, most of the focus was on Intel Optane, and most of us know what happened to that. I guess some assumed that this would also result in Project Capitola, or memory tiering and memory pooling technology, being scrapped. This is most definitely not the case; VMware has gone full steam ahead and has been discussing the progress in public, although you need to know where to look. If you listen to that session, it is clear that there are various efforts that would allow customers to tier memory in various ways, one of them of course being the various CXL-based solutions that are coming to market now or soon.

One of these is memory tiering via a CXL accelerator card, basically an FPGA whose sole purpose is to increase memory capacity, offload memory tiering, and accelerate certain functionality where memory is crucial, like vMotion for instance. As mentioned in the SNIA session, using an accelerator card can lead to a 30% reduction in migration times. An accelerator card like this will also open up other opportunities, like pooling memory, which is something customers have been asking for since we created the concept of a cluster: being able to share compute resources across hosts. Just imagine, your VM can use memory capacity available on another host without having to move the VM. Yes, before anyone comments on this, I do realize that this could potentially have a significant performance impact.

That is of course where the VMware logic comes into play. At VMworld in 2021, when Project Capitola was presented, the team also shared the performance results of recent tests, and these showed that the performance degradation was around 10% when memory was split 50% DRAM and 50% Optane. I was watching the SNIA session, and its demo shows the true power of VMware vSphere, memory tiering, and acceleration (Project Peaberry as it is called). On average the performance degradation was around 10%, yet roughly 40% of virtual memory was accessed via the Peaberry accelerator. Do note that the tiering is completely transparent to the application; this works for all different types of workloads out there. The crucial part to understand is that because the hypervisor is already responsible for memory management, it knows which pages are hot and which pages are cold, which means it can determine which pages it can move to a different tier while maintaining performance.
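As a purely conceptual illustration of that hot/cold reasoning (this is not how ESXi implements it, and all names and numbers below are made up), the sketch keeps per-page access counts and demotes the coldest pages to a slower tier, which is the basic idea behind keeping hot pages in DRAM while tiered capacity absorbs the cold ones.

# Conceptual sketch of access-count based page tiering; not ESXi internals.
from collections import Counter

DRAM_CAPACITY_PAGES = 4          # tiny numbers, just to show the mechanism
access_counts = Counter()
dram, slow_tier = set(), set()

def touch(page):
    """Record an access; the hypervisor analogue is page-access tracking."""
    access_counts[page] += 1

def rebalance():
    """Keep the hottest pages in DRAM and demote the rest to the slow tier."""
    all_pages = dram | slow_tier
    ranked = sorted(all_pages, key=lambda p: access_counts[p], reverse=True)
    hottest = set(ranked[:DRAM_CAPACITY_PAGES])
    dram.clear(); dram.update(hottest)
    slow_tier.clear(); slow_tier.update(all_pages - hottest)

# Six pages but only four DRAM slots; pages 0-3 are touched far more often.
for page in range(6):
    (dram if len(dram) < DRAM_CAPACITY_PAGES else slow_tier).add(page)
for _ in range(10):
    for page in range(4):
        touch(page)
touch(4); touch(5)

rebalance()
print(sorted(dram), sorted(slow_tier))   # hot pages stay in DRAM, cold ones are demoted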

Anyway, I cannot reveal too much about what may, or may not, be coming in the future. What I can promise is that I will write a blog as soon as I am allowed to share more details publicly, and I will probably also record a podcast with the product manager(s) when the time comes, so stay tuned!


