What happens at which vSphere memory state?

I’ve received a bunch of questions from people about what happens at each vSphere memory state after writing the article about breaking up large pages and the new memory state introduced in vSphere 6.0. Note that the below applies to vSphere 6.0 only; the “Clear” memory state does not exist before 6.0. Also note that there is an upper and a lower boundary when transitioning between states. This means that you will not see actions triggered at the exact threshold specified, but slightly before or after passing it.

I created a simple table that shows what happens when. Note that minFree itself is not a fixed number but rather a sliding scale, and its value depends on the host memory configuration.

Memory state | Threshold | Actions performed
High | 300% of minFree | Break Large Pages (wait for next TPS run)
Clear | 100% of minFree | Break Large Pages and actively call TPS to collapse pages
Soft | 64% of minFree | TPS + Balloon
Hard | 32% of minFree | TPS + Compress + Swap
Low | 16% of minFree | Compress + Swap + Block

First of all, note that when you enter the “Hard” state the balloon driver stops and “Swap” and “Compress” take over. This is something I never really realized, but it is important to know, as it means that when memory fills up fast you will see a short period of ballooning and then a jump straight to compressing and swapping. You may ask yourself what this “block” thing is. Well, this is the worst situation you can find yourself in and it is the last resort. As my colleague Ishan described it:

The low state is similar to the hard state. In addition to compressing and swapping memory pages, ESX may block certain VMs from allocating memory in this state. It aggressively reclaims memory from VMs until ESX moves into the hard state.

I hope this makes it clear which action is triggered at which state, and also why the “Clear” state was introduced and the “High” state changed. It provides more time for the other actions to do what they need to do: free up memory to avoid blocking VMs from allocating new memory pages.
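To make the table a bit more tangible, here is a minimal Python sketch of the lookup it describes. This is only an illustration, not how ESXi implements it: the function name memory_state and the STATES list are made up for this example, and the upper and lower transition boundaries mentioned earlier are ignored, so every state flips exactly at its threshold.

```python
from typing import List, Optional, Tuple

# (state, threshold as % of minFree, actions), ordered from most to least severe.
# Values are copied from the vSphere 6.0 table; the structure itself is hypothetical.
STATES = [
    ("Low", 16, ["Compress", "Swap", "Block"]),
    ("Hard", 32, ["TPS", "Compress", "Swap"]),
    ("Soft", 64, ["TPS", "Balloon"]),
    ("Clear", 100, ["Break large pages", "Actively call TPS"]),
    ("High", 300, ["Break large pages (wait for next TPS run)"]),
]


def memory_state(free_mb: int, min_free_mb: int) -> Tuple[Optional[str], List[str]]:
    """Return the table row that applies for a given amount of free memory.

    Simplification: no transition boundaries, the state changes exactly
    at the threshold. Returns (None, []) when free memory is at or above
    300% of minFree, meaning none of the rows in the table apply.
    """
    for state, pct, actions in STATES:
        # Integer comparison of free/minFree against pct/100.
        if free_mb * 100 < pct * min_free_mb:
            return state, actions
    return None, []


if __name__ == "__main__":
    min_free = 10 * 1024  # assume minFree is roughly 10GB
    for free_gb in (40, 20, 8, 5, 2, 1):
        state, actions = memory_state(free_gb * 1024, min_free)
        print(f"{free_gb:>2} GB free -> {state}: {', '.join(actions) or 'nothing'}")
```

Note how “Balloon” only shows up in the Soft row, which is exactly why a host that fills up fast appears to skip ballooning and go straight to compressing and swapping.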

Virtual Volumes and queueing

I was reading an article last week by Ray Lucchesi on Virtual Volumes and queueing. In that article (and podcast) Ray (and friends on the podcast) describe Virtual Volumes and the benefits they bring, but also a potential danger. I have written about Virtual Volumes before, and if you don’t know what it is or does then I recommend reading those articles. I had been wondering how all of this works as well, as I also felt that there could easily be a bottleneck. I had some conversations over the last couple of weeks and I figured I would share the outcome with you instead of just leaving a comment on Ray’s blog. Let’s look at an architectural diagram first:

In the diagram above (which I borrowed from the vSphere Storage blog, thanks Rolo) you see two important constructs which are part of the overall VVOL architecture, namely the Storage Container, aka Virtual Datastore, and the Protocol Endpoint (PE). The Storage Container is where the VVOLs will be stored. The IO though is proxied through the Protocol Endpoint. You can imagine that if we did not do this and exposed every single VVOL directly to vSphere, you would have 1000s of devices connected to vSphere, and as you know vSphere has a 256 device limit at the moment. This would never scale, and as such the Protocol Endpoint is used as an access point to a VVOL capable storage system.

Now think about a VMFS volume and look at the VVOL architectural diagram again. Yes, there is a potential bottleneck indeed. However, what the diagram does not show is that you can have multiple Protocol Endpoints. Ray mentions the following in his post: “I am also not aware of any VASA 2.0 requirement that restricts the number of PEs for a storage system’s support of a single vSphere cluster”. And I can confirm that VMware did not limit the number of Protocol Endpoints in any shape or form. I read the specifications and it literally states 1 PE at a minimum and preferably more. Note that vendor implementations of VVOL may differ: I have seen implementations that describe many PEs per storage system, but also implementations which have 1 PE per storage system. And in the case of 1 PE per storage system, can that be a bottleneck?

The queue depth of the Protocol Endpoint isn’t limited to 32, as it is for a regular LUN when multiple VMs are contending for IO (“disk.schednumreqoutstanding”), or to 64 (the typical device queue depth); it is set to 128 by default. This can be increased when required, but before you do, please consult your storage vendor. There are a couple of variables that need to be taken into account, such as the max device queue depth and the HBA max queue depth. (For NFS, queue depth is typically no concern.) The potential constraint in the (uncommon) case where there is only a single PE can therefore be mitigated. What is important here is that VVOL itself does not impose any constraints.
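Why do those other variables matter? As a rough mental model (my simplification, not something taken from the VVOL specification), the number of IOs that can be outstanding against a PE is capped by the smallest queue in the path. The function name and parameters below are hypothetical and only serve to illustrate that point in Python:

```python
def effective_queue_depth(pe_queue_depth: int, hba_queue_depth: int) -> int:
    """Simplified model: the smallest queue in the IO path is the real limit.

    Both parameter names are made up for this example; real environments
    have more layers (array ports, for instance) that a vendor can advise on.
    """
    return min(pe_queue_depth, hba_queue_depth)


if __name__ == "__main__":
    # Default PE queue depth of 128 behind an HBA limited to 64 outstanding IOs:
    print(effective_queue_depth(128, 64))
    # Raising only the PE queue depth changes nothing in this model:
    print(effective_queue_depth(256, 64))
```

In other words, raising the PE queue depth on its own does not buy you anything if a lower limit sits elsewhere in the path, which is one more reason to talk to your storage vendor first.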

I am hoping that clears up some of the misunderstandings out there.

vSphere 6.0: Breaking Large Pages…

When talking about Transparent Page Sharing (TPS), one thing that comes up regularly is the use of Large Pages and how that impacts TPS. As most of you hopefully know, TPS does not collapse large pages. However, when there is memory pressure you will see that large pages are broken up into small pages, and those small pages can then be collapsed by TPS. ESXi does this to prevent other memory reclaiming techniques, which have way more impact on performance, from kicking in. You can imagine that fetching a memory page from a swap file on a spindle will take significantly longer than fetching a page from memory. (Nice white paper on the topic of memory reclamation can be found here…)

Something that I have personally run into a couple of times is the situation where memory pressure goes up so fast that the different states at which certain memory reclaiming techniques are used are crossed in a matter of seconds. This usually results in swapping to disk, even though large pages should have been broken up and collapsed where possible by TPS, or memory should have been compressed or VMs ballooned. This is something that I’ve discussed with the respective developers and they came up with a solution. In order to understand what was implemented, let’s look at how memory states were defined in vSphere 5. There were 4 memory states, namely High (100% of minFree), Soft (64% of minFree), Hard (32% of minFree) and Low (16% of minFree). What does “% of minFree” mean? Well, if minFree is roughly 10GB for your configuration then Soft, for instance, is reached when there is less than 64% of minFree available, which is 6.4GB of memory. For Hard this is 3.2GB and so on. It should be noted that the change in state and the action it triggers do not happen exactly at the percentage mentioned; there is a lower and an upper boundary where the transition happens, and this was done to avoid oscillation.

With vSphere 6.0 a fifth memory state is introduced and this state is called Clear. Clear is 100% of minFree and High has been redefined as 300% of minFree. When there is less than High (300% of minFree) but more than Clear (100% of minFree) available, ESXi will start pre-emptively breaking up large pages so that TPS (when enabled!) can collapse them at the next run. Let’s take that 10GB minFree as an example again: when you have between 30GB (High) and 10GB (Clear) of free memory available, large pages will be broken up. This should provide the leeway needed to safely collapse pages (TPS) and avoid the potential performance decrease which the other memory states could introduce. Very useful if you ask me, and I am very happy that this change in behaviour, which I requested a long time ago, has finally made it into the product.
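For those who like to see the arithmetic spelled out, below is a small Python sketch that converts the percentages into absolute numbers for the 10GB example. The dictionary and function names are just something I made up for illustration, and remember that minFree is a sliding scale, so 10GB is an assumption and not a value you should expect on your own host.

```python
# Thresholds as a percentage of minFree, as listed in the text above.
THRESHOLDS_PCT = {
    "vSphere 5": {"High": 100, "Soft": 64, "Hard": 32, "Low": 16},
    "vSphere 6.0": {"High": 300, "Clear": 100, "Soft": 64, "Hard": 32, "Low": 16},
}


def thresholds_gb(min_free_gb: float, version: str) -> dict:
    """Convert the percentage thresholds into GB of free memory."""
    return {
        state: min_free_gb * pct / 100
        for state, pct in THRESHOLDS_PCT[version].items()
    }


if __name__ == "__main__":
    for version in THRESHOLDS_PCT:
        print(version)
        for state, gb in thresholds_gb(10, version).items():
            print(f"  {state:<5} below {gb:g} GB free")
```

With minFree at 10GB the only differences between the two versions are the extra Clear state at 10GB and High moving up from 10GB to 30GB, which is exactly the additional window in which large pages are broken up.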

Those of you who have been paying attention over the last months will know that inter-VM transparent page sharing is disabled by default. If you do want to reap the benefits of TPS and would like to leverage it in times of contention, then enabling it in 6.0 is pretty straightforward. Just go to the advanced settings and set “Mem.ShareForceSalting” to 0. Do note that there are potential security risks when doing this, and I recommend reading the above article to get a better understanding of those risks.

HP ConvergedSystem 200–HC EVO:RAIL available now!

Yesterday I was informed by the EVO:RAIL team that the HP ConvergedSystem 200–HC EVO:RAIL is available (shipping) as of this week. I haven’t seen much about additional pieces HP is including, but I was told that they are planning to integrate HP OneView. HP OneView is a management/monitoring solution that gives you a great high-level overview of the state of your systems but at the same time enables you to dive deep when required. Depending on the version included, HP OneView can also do things like firmware management, which is very useful in a Virtual SAN environment if you ask me. I know that many people have been waiting for HP to start shipping, as it appears to be a preferred vendor for many customers. In terms of configuration, the HP solution is very similar to what we have already seen out there:

  • 4 nodes in 2U each containing:
    • 2 x Intel® E5-2620 v2 six-core CPUs
    • 192 GB memory
    • 1 x SAS 300 GB 10k rpm drive (ESXi boot device)
    • 3 x SAS 1.2 TB 10k rpm drive (VSAN capacity tier)
    • 1 x 400 GB MLC enterprise-grade SSD (VSAN performance tier)
    • 1 x H220 host bus adapter (HBA) pass-through controller
    • 2 x 10GbE NIC ports
    • 1 x 1GbE IPMI port for remote (out-of-band) management

As soon as I find out more around integration of other components I will let you folks know.

What is new for Storage DRS in vSphere 6.0?

Storage DRS must be one of the most under-appreciated features in vSphere. For whatever reason it doesn’t get the airtime it deserves, not even from VMware folks, which is a shame if you ask me. I was reading the What’s New material for vSphere 6.0 and I noticed that the “What is new for Storage DRS in vSphere 6.0” part was completely missing. I figured I would do a quick write-up of what has been improved and introduced for SDRS in 6.0, as some of the enhancements are quite significant! Let’s start with a list and then look at these enhancements in more detail:

  • Deep integration with vSphere APIs for Storage Awareness (VASA)
  • Site Recovery Manager integration
  • vSphere Replication integration
  • Integration with Storage Policy Based Management

Let’s start with the top one, deep integration with vSphere APIs for Storage Awareness (VASA), as that is the biggest improvement if you ask me. What the integration with VASA results in is fairly straightforward: when the VASA plugin for your storage system is configured, Storage DRS will understand which capabilities are enabled on your storage system and, more specifically, on your datastores. For example: when using Storage DRS previously on a deduplicated datastore, it could happen that a migration initiated by Storage DRS had a negative effect on the total available capacity of your storage system. This would be caused by the fact that the deduplication ratio was lower on the destination than it was on the source. Not a very pleasant surprise, you can imagine. The same applies when, for instance, VMs are snapshotted at the storage system level or datastores are replicated; you can imagine that there would be an impact when moving a VM around in that scenario. With 6.0 Storage DRS is capable of understanding:

  • Array-based thin-provisioning
  • Array-based deduplication
  • Array-based auto-tiering
  • Array-based snapshot
  • Array-based replication

I guess you get the drift: SDRS is now fully capable of understanding the array capabilities and will make balancing decisions taking these capabilities into consideration. For instance, in the case of replication, when replication is enabled and your datastore is part of a consistency group then SDRS will ensure that the VM is only migrated to a datastore which belongs to the same consistency group!

For deduplication this is the opposite by the way. In this case SDRS will be informed about which datastores belong to which deduplication domains, and when datastores belong to the same domain it will know that moving between those datastores will have little to no effect on capacity. Depending on the level of detail the storage vendor provides through VASA, SDRS will even be aware of how efficient the deduplication process is for a given datastore. (This is not a VASA requirement but rather a recommendation, so results may vary per vendor implementation.)

Auto-tiering is also an interesting one, as this is something that comes up regularly. In this scenario, with previous versions of SDRS, it could happen that SDRS was moving VMs while the auto-tiering array was just promoting or demoting blocks to a higher or lower tier. As you can imagine, this is not a desired scenario, and with the VASA integration it can be prevented from happening.

The second big thing is Site Recovery Manager and vSphere Replication integration. I already mentioned the consistency group awareness. Of course this is also part of the SRM integration, and when VMs are protected by SRM then SDRS will make sure that those VMs are only moved within their consistency group. If for whatever reason there is no way to move within a consistency group, then SDRS can, as a second option, move VMs between datastores which are part of the same SRM Protection Group. Note that this could have an impact on your workloads though! SDRS of course will never automatically move a VM from a replicated to a non-replicated datastore. In fact, there is a strict hierarchy of what type of moves can be recommended:

  1. Moves within the same consistency group
  2. Moves across consistency groups, but within the same protection group
  3. Moves across protection groups
  4. Moves from a replicated datastore to non-replicated

Note that SDRS will try option 1 first; if that fails it will try option 2, if that fails it will try option 3, and so on. Under no circumstances is a recommendation in category 2, 3 or 4 executed automatically. You will receive a warning, after which you can manually apply the recommendation. This is done to ensure the administrator has full control over and full awareness of the migration, and can apply it during maintenance or during non-peak hours.
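The hierarchy above can be sketched in a few lines of Python. To be clear, this is my own illustration of the decision order and not SDRS code: the Datastore class, its fields and both function names are hypothetical, and the sketch assumes the source datastore is replicated and that every replicated datastore has both a consistency group and a protection group.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Datastore:
    """Hypothetical model of a datastore; None means 'not replicated'."""
    name: str
    consistency_group: Optional[str] = None
    protection_group: Optional[str] = None

    @property
    def replicated(self) -> bool:
        return self.consistency_group is not None


def move_category(src: Datastore, dst: Datastore) -> int:
    """Return 1-4 following the hierarchy in the list above.

    Assumes src is a replicated datastore.
    """
    if not dst.replicated:
        return 4  # replicated -> non-replicated
    if src.consistency_group == dst.consistency_group:
        return 1  # same consistency group
    if src.protection_group == dst.protection_group:
        return 2  # other consistency group, same protection group
    return 3      # other protection group


def recommend(src: Datastore,
              candidates: List[Datastore]) -> Optional[Tuple[Datastore, int, bool]]:
    """Pick the candidate in the lowest category.

    The boolean says whether the move may run automatically, which
    is only the case for category 1; anything else needs a manual apply.
    """
    if not candidates:
        return None
    best = min(candidates, key=lambda dst: move_category(src, dst))
    category = move_category(src, best)
    return best, category, category == 1


if __name__ == "__main__":
    src = Datastore("ds1", "cg-a", "pg-1")
    candidates = [
        Datastore("ds5"),                  # not replicated
        Datastore("ds4", "cg-c", "pg-2"),  # other protection group
        Datastore("ds3", "cg-b", "pg-1"),  # same protection group
    ]
    target, category, automatic = recommend(src, candidates)
    print(target.name, category, "automatic" if automatic else "manual apply")
```

In this example no datastore in the same consistency group is available, so the best the sketch can come up with is a category 2 move, and that one is flagged as a manual apply.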

With regard to vSphere Replication, a lot has changed as well. So far there was no support for vSphere Replication enabled VMs to be part of an SDRS datastore cluster, but with 6.0 this is fully supported. As of 6.0 Storage DRS will recognize replica VMs (which are replicated using vSphere Replication), and when thresholds have been exceeded SDRS will query vSphere Replication and will be able to migrate replicas to solve the resource constraint.

Up next: the integration with Storage Policy Based Management. In the past, when you had different tiers of datastores as part of the same Datastore Cluster, SDRS could potentially move a VM which was assigned policy “gold” to a datastore which was associated with a “silver” policy. With vSphere 6.0, SDRS is aware of storage policies in SPBM and will only move or place VMs on a datastore that can satisfy that VM’s storage policy.
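Conceptually this boils down to filtering the candidate datastores before making a placement or balancing decision. Here is a minimal Python sketch of that idea; the function, the datastore names and the way policies are modelled (a set of policy names per datastore) are all assumptions on my part and do not reflect how SPBM represents compliance internally.

```python
from typing import Dict, List, Set


def compatible_datastores(vm_policy: str,
                          datastore_policies: Dict[str, Set[str]]) -> List[str]:
    """Return the datastores that can satisfy the VM's storage policy.

    datastore_policies maps a (hypothetical) datastore name to the set
    of policy names that datastore is able to satisfy.
    """
    return [
        name
        for name, policies in datastore_policies.items()
        if vm_policy in policies
    ]


if __name__ == "__main__":
    cluster = {
        "ds-gold-01": {"gold", "silver"},
        "ds-silver-01": {"silver"},
        "ds-silver-02": {"silver"},
    }
    # A "gold" VM is only ever placed on or moved to ds-gold-01.
    print(compatible_datastores("gold", cluster))
```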

Oh, and before I forget, there is also the introduction of IOPS reservations on a per virtual disk level. This isn’t really part of Storage DRS but a function of the mClock scheduler, integrated with Storage IO Control and SDRS where needed. It isn’t even available in the UI in this release, only exposed through the VIM API, so I doubt many of you will use it… I figured I would mention it already though, and I will probably do a deeper write-up later this week.