Software Defined

Playing around with Memory Tiering, are my memory pages tiered?

Duncan Epping · Dec 18, 2025 · 2 Comments

There was a question on VMTN about Memory Tiering performance, and how you can check if pages were tiered. I haven’t played around with Memory Tiering too much, so I noted down for myself what I needed to do on every host in order to enable it. Note, if the command contains a path and you want to do this in your own environment you need to change the path and device name accordingly. The question was if memory pages were tiered or not, so I dug up the command that allows you to check this on a per host level. It is at the bottom of this article for those who just want to skip to that part.

Now, before I forget, probably worth mentioning as this is something many people don’t seem to understand, memory tiering only tiers cold memory pages. Active pages are not being moved to NVMe, on top of that, it only tiers memory when there’s memory pressure! So if you don’t see any tiering, it could simply be that you are not under any memory capacity pressure. (Why move pages to a lower tier when there’s no need?)

List all storage devices via the CLI:

esxcli storage core device list

Create memory tiering partition on an NVMe device:

esxcli system tierdevice create -d=/vmfs/devices/disks/eui.1ea506b32a7f4454000c296a4884dc68

Enable Memory Tiering on a host level, note this requires a reboot:

esxcli system settings kernel set -s MemoryTiering -v TRUE

How is Memory Tiering configured in terms of DRAM to NVMe ratio? A 4:1 DRAM to NVMe ratio would be 25%, 1:1 would be 100%. So if you have it set at 4:1, with 512GB of DRAM you would only use 128GB of the NVMe at most, regardless of the size of the device.

esxcli system settings advanced list -o /Mem/TierNvmePct

Is memory tiered or not? Find out all about it via memstats!

memstats -r vmtier-stats -u mb

Want to show a select number of metrics?

memstats -r vmtier-stats -u mb -s name:memSize:active:tier1Target:tier1Consumed:tier1ConsumedPeak:comnsumed

So what would the outcome look like when there is memory tiering happening? I removed a bunch of the metrics, just to keep it readable, “tier1” is the NVMe device, and as you can see each VM has several MBs worth of memory pages on NVMe right now.

 VIRTUAL MACHINE MEMORY TIER STATS: Wed Dec 17 15:29:43 2025
 -----------------------------------------------
   Start Group ID   : 0
   No. of levels    : 12
   Unit             : MB
   Selected columns : name:memSize:tier1Consumed

----------------------------------------
           name    memSize tier1Consumed
----------------------------------------
      vm.533611       4096            12
      vm.533612       4096            34
      vm.533613       4096            24
      vm.533614       4096            11
      vm.533615       4096            25
----------------------------------------
          Total      20480           106
----------------------------------------

What do I do after a vSAN Stretched Cluster Site Takeover?

Duncan Epping · Nov 10, 2025 · 4 Comments

Over the last couple of months, various new vSAN features were announced. Two of those features are around the Stretched Cluster configuration, and have probably been the number 1 feature request for a few years. Now that we have Site Takeover and Site Maintenance functionality available, I am starting to get some questions about the impact of them, and in particular, the Site Takeover functionality is raising some questions.

For those who don’t know what these features are, let me describe them briefly:

Site Maintenance = The ability to place a full vSAN stretched cluster Fault Domain into maintenance mode at once. This ensures that all hosts within the fault domain have consistently stored the data, and all hosts will go into maintenance mode at the same time.

Site Takeover = This provides the ability when a Witness and a Data Site has failed to bring back the remaining site through a command line interface. This will reconstruct the remaining “site local” RAID configuration, making the objects available again, which will then allow vSphere HA to restart the VMs.

Now, the question that the above typically raises is what happens to the Witness and the Data Site that failed when you do the Site Takeover? If you look at the VMs RAID configuration, you will notice that both the Witness and the Data Site components of the sites that failed will completely disappear from the RAID configuration.

But what do you do next, because even after you run the Site Takeover, you still see your hosts and the witness in vCenter Server, and you still see a stretched cluster configuration in the UI. Now at first I thought that if the environment was completely up and running again, you had to go through some manual effort to reconstruct the stretched cluster. Basically, remove the failed hosts, wipe the disks, and recreate the stretched cluster. This is, however, not the case.

In the example above, if the Preferred site and the Witness site return for duty, vSAN will automatically discard the stale components in those previously failed sites. It will recreate new components for all objects, and it will do a full resync of the data.

If you end up in a situation where your hosts are completely gone (let’s say as a result of a fire), then you will have to do some kind of manual cleanup as follows, before you rebuild and add hosts back:

Remove the failed hosts from the vCenter inventory
Remove the witness from the vCenter inventory
- Delete the witness from the vCenter Server it is running, a real delete!
Delete the surviving Fault Domain, this should be the only Fault Domain still listed in the vCenter interface
You now have a normal cluster again
Rebuild hosts and recreate the stretched cluster

I hope that helps,

vSAN Component vote recalculation with Witness Resilience, the follow up!

Duncan Epping · Mar 21, 2025 · Leave a Comment

I wrote about the Witness Resilience feature a few years ago and had a question on this topic today. I did some tests and then realized I already had an article describing how it works, but as I also tested a different scenario I figured I would write a follow up. In this case we are particularly talking about a 2-node configuration, but this would also apply to stretched cluster.

In a stretched cluster, or a 2-node, configuration when a data site goes down (or is placed into maintenance mode) a vote recalculation will automatically be done on each object/component. This is to ensure that if now the witness ends up failing, the objects/VMs will remain accessible. How that works I’ve explained here, and demonstrated for a 2-node cluster here.

But what if the Witness fails first? Well, I can explain it fairly easily, then the VMs will be inaccessible if the Witness goes down. Why is that? Well because the votes will not be recalculated in this scenario. Of course, I tested this and the screenshots below demonstrate it.

This screenshot shows the witness as Absent and both the “data” components have 1 vote. This means that if we fail one of those hosts the component will become inaccessible. Let’s do that next and then check the UI for more details.

As you can see below, the VM is now inaccessible. This is the result of the fact that there’s no longer a quorum, as 2 out of 3 votes are dead.

I hope that explains how this works.

vSphere HA restart times, how long does it actually take?

Duncan Epping · Mar 13, 2025 · Leave a Comment

I had a question today, and it was based on material I wrote years ago for the Clustering Deepdive. (read it here) The material talks about the sequence HA goes through when a failure has occurred. If you look at the sequence for instance where a “secondary” host has failed, it looks as follows:

T0 – Secondary host failure.
T3s – Primary host begins monitoring datastore heartbeats for 15 seconds.
T10s – The secondary host is declared unreachable and the primary will ping the management network of the failed secondary host. This is a continuous ping for 5 seconds.
T15s – If no heartbeat datastores are configured, the secondary host will be declared dead if there is no reply to the ping.
T18s – If heartbeat datastores are configured, the secondary host will be declared dead if there’s no reply to the ping and the heartbeat file has not been updated or the lock was lost.

So, depending on whether you have heartbeat datastores or not, this sequence takes either 15 or 18 seconds. Does that mean the VMs are then instantly restarted, and if so, how long does that take? Well no, they won’t instantly restart, because when this sequence has ended, the secondary host which has failed is actually declared dead. Now the potentially impacted VMs will need to be verified if they have actually failed, a list of “to be restarted” VMs will need to be created, and a placement request will need to be done.

The placement request will either go to DRS, or will be handled by HA itself, depending on whether DRS is enabled and if vCenter Server is available. After placement has been determined, the primary host will then request the individual hosts to restart the VMs which should be restarted. After the host(s) has received the list of VMs it needs to restart it will do this in batches of 32, and of course restart priority / order, will be applied. The whole aforementioned process can easily take 10-15 seconds (if not longer), which means that in a perfect world, the restart of the VM occurs after about 30 seconds. Now, this is when the restart of the VM is initiated, that does not mean that the VM, or the services it is hosting, will be available after 30 seconds. The power-on sequence of the VM can take anywhere from seconds, to minutes, depending of course on the size of the VM and the services that need to be started during the power-on sequence.

So, although it only takes 15 to 18 seconds for vSphere HA to determine and declare a failure, there’s much more to it, hopefully, this post provides a better understanding of all that is involved.

Can I disable the vSAN service if the cluster is running production workloads?

Duncan Epping · Feb 7, 2025 · Leave a Comment

I just had a discussion with someone who had to disable the vSAN service, while the cluster was running a production workload. They had all their VMs running on 3rd party storage, so vSAN was empty, but when they went to the vSAN Configuration UI the “Turn Off” option was grayed out. The reason this option is grayed out is that vSphere HA was enabled. This is usually the case for most customers. (Probably 99.9%.) If you need to turn off vSAN, make sure to temporarily disable vSphere HA first, and of course enable it again after you turned off vSAN! This ensures that HA is reconfigured to use the Management Network instead of the vSAN Network.

Another thing to consider, it could be that you manually configured the “HA Isolation Address” for the vSAN Network, make sure to also change that to an IP address on the Management Network again. Lastly, if there’s still anything stored on vSAN, this will be inaccessible when you disable the vSAN service. Of course, if nothing is running on vSAN, then there will be no impact to the workload.