Yellow Bricks

by Duncan Epping

CTO2860BU & VIN2183BU: It is all about Persistent Memory

Duncan Epping · Sep 6, 2018 ·

I was going through the list of sessions when I spotted a session on Persistent Memory by Rich Brunner and Rajesh V. Shortly after that I noticed that there was also a PMEM session by the performance team available. I would highly recommend watching both CTO2860BU and VIN2183BU. I would recommend starting with CTO2860BU though, as it gives a great introduction to what PMEM brings to the table. I scribbled down some notes, and they may appear somewhat random, considering I am covering two sessions in one article, but hopefully the main idea is clear.

I think the sub-title of the sessions makes clear what PMEM is about: Storage at Memory Speed. This is what Richard talks about in CTO2860BU during the introduction. I thought this slide explained the difference pretty well; it is all about the access times:

  • 10,000,000 ns – HDD
  • 100,000 ns – SAS SSD
  • 10,000 ns – NVMe
  • 50-300 ns – PMEM
  • 30-100 ns – DRAM

So that is 10 million nanoseconds vs 50 to 300 nanoseconds. Just to give you an idea, that is roughly the speed difference between the space shuttle and a starfish. But that isn’t the only major benefit of persistent memory. Another huge advantage is that PMEM devices, depending on how they are used, are byte addressable. Compare this to the 512-byte or 4KB/8KB reads many storage systems require: when you only have to change a byte, you no longer incur that overhead.

As of vSphere 6.7, we have PMEM support. A PMEM device can be accessed as a block device (a regular disk), or it can be accessed as “PMEM”, meaning we serve a virtual PMEM device to the VM and the Guest OS sees it as PMEM. What was also briefly discussed in Richard’s talk were the different types of PMEM. In general, there are four different types, but two are most commonly talked about: NVDIMM-N and Intel Optane. The difference is that NVDIMM-N is DRAM backed by NAND, where persistence is achieved by writing to the NAND only during shutdown or power failure, whereas with Intel Optane the 3D XPoint memory on the DIMM is directly addressable. The other two mentioned were “DRAM backed by NVMe”, an effort by HPE which has been discontinued, and NVDIMM-P, which appears to be under development and is expected roughly in 2019.

When discussing the vSphere features that support PMEM, what I found most interesting was the fact that DRS is fully aware of VMs using PMEM during load balancing. It takes this into account, and as the cost of migrating a PMEM-enabled VM is higher, it will most likely select a VM backed by shared storage instead. Of course, when doing maintenance DRS will move the VMs with PMEM to a host which has sufficient capacity. Also, FT is fully supported.

In the second session, VIN2183BU, Praveen and Qasim discussed performance details. After a short introduction, they dive deep into performance and how you can take advantage of the technology. First they discuss the different modes in which persistent memory can be exposed to the VM/Guest OS; I am listing these out as they are useful to know.

  • vPMEMDisk = exposed to the guest as a regular SCSI/NVMe device; VMDKs are stored on a PMEM datastore
  • vPMEM = exposes the NVDIMM device in a “passthrough” manner; the guest can use it as a block device or as a byte-addressable direct access device (DAX). This is the fastest mode and most modern OSes support it (see the example after this list)
  • vPMEM-aware = similar to the mode above, but with the difference that the application itself understands how to take advantage of vPMEM
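
To make the vPMEM mode a bit more concrete, here is a minimal sketch of what consuming a virtual NVDIMM could look like inside a Linux guest. The device name and mount point are assumptions on my part (not taken from the sessions), and the guest needs the ndctl package and a DAX-capable filesystem:

# list the NVDIMM namespaces the guest sees
ndctl list
# create a filesystem on the pmem device and mount it with DAX,
# so applications can mmap files and get byte-addressable access
mkfs.ext4 /dev/pmem0
mkdir -p /mnt/pmem
mount -o dax /dev/pmem0 /mnt/pmem

Once mounted with DAX, an application can map a file and access persistent memory with regular load/store instructions, which is the kind of application-level awareness the vPMEM-aware mode refers to.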

Next they discussed the various performance tests and comparisons they have done. They tested the various modes and compared them to the performance of an NVMe SSD. What stood out most to me is that both the vPMEM and vPMEM-aware modes provide great performance, up to an 8x performance increase. In the case of vPMEMDisk that is different, and that has to do with its overhead: because it is presented as a block device there is significant IO amplification, which in the case of 4KB random writes even leads to throughput that is lower for the NVDIMM than for NVMe. During the session it was mentioned that both VMware and Intel are looking to optimize their part of the solution to address this. What was most impressive though wasn’t the throughput but the latency: a 225x improvement was measured going from NVMe to vPMEM and vPMEM-aware. Although the latency for vPMEMDisk was higher than for vPMEM and vPMEM-aware, it was still significantly lower than NVMe and very consistent across reads and writes.

This was just the FIO example; it is followed by examples for various applications, both scale-out and scale-up solutions. What I found interesting were the Redis tests: nice performance gains at a much lower latency, but more importantly, the cost will probably go down when leveraging persistent memory instead of pure DRAM.
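
For those who want to reproduce something along these lines, a 4KB random-write run of the kind described could look roughly like the fio command below. The parameters and file path are purely illustrative on my part and not the exact settings used in the session:

# 4KB random writes against a file on a (DAX-mounted) pmem filesystem,
# using direct I/O to bypass the guest page cache
fio --name=pmem-randwrite --filename=/mnt/pmem/testfile --size=4g \
    --rw=randwrite --bs=4k --direct=1 --iodepth=1 --numjobs=1 \
    --runtime=60 --time_based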

Last but not least, tests were conducted around application performance during vMotion and the performance of the vMotion process itself. In both cases using vPMEM or vPMEM-aware can be very beneficial for both the application and the vMotion process.

Both are great sessions; again, I highly recommend watching both.

UI Confusion: VM Dependency Restart Condition Timeout

Duncan Epping · Sep 3, 2018 ·

Various people have asked me about this, and I have written about it before, but only as part of longer articles, which makes it difficult to find. When specifying the restart priority or restart dependency you can specify when the next batch of VMs should be powered on: is that when the VMs are scheduled to be powered on, when they are powered on, when VMware Tools reports them as running, or when the application heartbeat is detected?

In most cases, customers appear to go for either “powered on” or the “VMware Tools” heartbeat. But what happens when one of the VMs in the batch is not successfully restarted? Well, HA waits… For how long? That depends:

In the UI you can specify how long HA needs to wait by using the option called “VM Dependency Restart Condition Timeout”. This is the time-out in seconds used when one (or more) of the VMs can’t be restarted. So we initiate the restart of the group, and we will start the next batch when the first batch is successfully restarted or when the time-out has been exceeded. By default, the time-out is 600 seconds, and you can override this in the UI.

What is confusing about this setting is the name: it states “VM Dependency Restart Condition Timeout”. So does this time-out apply to “Restart Priority”, to “Restart Dependency”, or maybe both? The answer is simple: it only applies to “Restart Priority”. Restart Dependency is a rule, a hard rule, a must rule, which means there is no time-out; we wait until all VMs are restarted when you use restart dependency. Yes, the UI is confusing, as the option mentions “dependency” where it should really talk about “priority”. I have reported this to engineering and PM, and hopefully it will be fixed in one of the upcoming releases.

VMworld – VMware vSAN Announcements: vSAN 6.7 U1 and beta announced!

Duncan Epping · Aug 27, 2018 ·

VMworld is the time for announcements, and of course for vSAN that is no different. This year we have three major announcements:

  • VMware vSAN 6.7 U1
  • VMware vSAN Beta
  • VMware Cloud on AWS new features

So let’s look at each of these. First of all, VMware vSAN 6.7 U1. We are adding a bunch of new features, which I am sure you will appreciate. The first is a set of VUM updates, of which I feel the inclusion of firmware updates through VUM is the most significant. For now, this is for the Dell HBA330 only, but other controllers will follow soon. On top of that, there is now also support for custom ISOs. VUM will recognize the vendor type, validate compliance, and update accordingly when/if needed.

The other big thing we are adding is the “Cluster Quickstart” wizard. I have shown this at various sessions already, so some of you may be familiar with it. It is basically a single wizard that allows you to select the required services, add the hosts, and configure the cluster. This includes the configuration of HA, DRS, vSAN, and the network components needed to leverage these services. I recorded a quick demo that shows you what this looks like.

One of the major features introduced, in my opinion, is UNMAP. Yes, unmap for vSAN. As of 6.7 U1 we are now capable of unmapping blocks when the Guest OS sends an unmap/trim command. This is great, as it will greatly improve space efficiency, especially in environments where, for instance, large files or many files are deleted. For now, you need to enable it through “rvc”, and you can do this as follows:

/localhost/VSAN-DC/computers/6.7 u1> vsan.unmap_support -e .

When you run the above command you should see the below response.

Unmap support is already disabled
6.7 u1: success
VMs need to be power cycled to apply the unmap setting
/localhost/VSAN-DC/computers/6.7 u1>

Pretty simple, right? Does it really require the VM to be power cycled? Yes, it does: during power-on the Guest OS queries for the unmap capability, and unfortunately there is no way for VMware to force that query without power cycling the VM. So power it off and power it on again if you want to take advantage of unmap immediately.
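
Whether space is actually reclaimed after the power cycle still depends on the Guest OS sending trim/unmap commands. As a hedged example, in a Linux guest you can trigger this manually for a mounted filesystem (the mount point is just an example); many distributions also ship an fstrim timer that does this on a schedule:

# report and trim all free blocks on the root filesystem,
# which sends unmap for the freed space down to the virtual disk
fstrim -v /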

There are a couple of smaller enhancements that I wanted to sum up for those who have been waiting for them:

  • UI option to change the “Object Repair Timer” value cluster-wide. This is the option which determines when vSAN starts repairing an object that has an absent component (see the example after this list).
  • Mixed MTU support for vSAN Stretched Clusters (a different MTU for witness traffic than for vSAN traffic)
  • Historical capacity reporting
  • vROps dashboards with vSAN stretched cluster awareness
  • Additional PowerCLI cmdlets
  • Enhanced support experience (network diagnostic mode, specialized dashboards), which you can find under Monitor/vSAN/Support
  • Additional health checks (storage controller firmware, unicast network performance test, etc.)
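
As a point of reference for the Object Repair Timer mentioned above: before this cluster-wide UI option existed, the value was typically changed per host through the VSAN.ClomRepairDelay advanced setting. A rough sketch of what that looks like from the ESXi shell, with 90 minutes purely as an example value:

# show the current repair delay (in minutes) on this host
esxcli system settings advanced list -o /VSAN/ClomRepairDelay
# change it, for example to 90 minutes
esxcli system settings advanced set -o /VSAN/ClomRepairDelay -i 90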

And last but not least: with vSAN Stretched Clusters we have the capability to protect data within a site. As of vSAN 6.7 U1 we now also have the ability to protect data within racks; it is, however, only available through an RPQ request. So if you need protection within a rack, contact GSS and file an RPQ.

Another announcement was around an upcoming vSAN Beta. This vSAN Beta will have some great features, three of which have been revealed:

  • Data Protection (Snapshot based)
  • File Services
  • Persistent Storage for Containers

I am not going to reveal anything about this, simply to avoid violating the NDA around this. Sign up for the Beta so you can find out more.

And then the last set of announcements was around functionality introduced for vSAN in VMware Cloud on AWS. Here there were two major announcements, if you ask me. The first is the ability to use Elastic Block Store (EBS) volumes for vSAN, meaning that in VMware Cloud on AWS you are no longer limited to the storage capacity physically available in the server: you can now extend your cluster with capacity delivered through EBS. The second is the availability of vSAN Encryption in VMware Cloud on AWS, which, from a security perspective, will be welcomed by many customers.

That was it, well… almost. This whole week many sessions will reveal various new potential features and futures. I aim to report on those when sitting in on those presentations, or potentially after VMworld.

What happens if all hosts in a vSphere HA cluster are isolated?

Duncan Epping · Aug 15, 2018 ·

I received this question through Twitter today from Markus, who was going through the vSphere 6.7 Clustering Deep Dive. It is fairly straightforward: what happens when all hosts in a cluster are isolated, will the isolation response be triggered?

https://twitter.com/RealRockaut/status/1029652167735631874

I wrote about this a long, long time ago, but it doesn’t hurt to reiterate it. Before triggering the isolation response, HA will actually verify the state of the rest of the cluster: does any host own the datastore on which the VMs impacted by this isolation run? If the answer is no, because ownership of a datastore is dropped during the election, then HA will not trigger the isolation response. I will try to update the book to include this when I have time; hopefully that means a new version of the ebook will be pushed out to all owners automatically.

Must read white paper: Persistent Memory performance with vSphere 6.7

Duncan Epping · Aug 14, 2018 ·

Today I noticed this whitepaper titled Persistent Memory Performance on vSphere 6.7. An intriguing topic for sure, as it is something relatively new and something I haven’t encountered too much in the field. Yes, I usually talk about Persistent Memory, aka NVDIMMs, in my talks, but then it typically relates to vSAN. I have not seen too many publications from VMware on this topic, so I figured I would share this one with you:

  • Persistent Memory Performance in vSphere 6.7 – https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/performance/pmem-vsphere67-perf.pdf
    Persistent memory (PMEM) is a new technology that has the characteristics of memory but retains data through power cycles. PMEM bridges the gap between DRAM and flash storage. PMEM offers several advantages over current technologies like:

    • DRAM-like latency and bandwidth
    • CPU can use regular load/store byte-addressable instructions
    • Persistence of data across reboots and crashes

The paper starts with a brief intro and then explains the different modes in which PMEM can be used, either as a “disk” (vPMEMDisk) or surfaced up to the guest OS as an NVDIMM (vPMEM). With the latter option, there’s also the ability to have some form of application awareness, which is referred to as the 3rd mode (vPMEM-aware).

I am not going to copy and paste the findings, as the paper has a lot of interesting data and you should go through it yourself. One thing I found most interesting is the huge decrease in latency. Anyway, read the paper and get familiar with persistent memory / NVDIMMs, as this technology will start changing the way we design HCI platforms in the future and cater to low-latency / high-throughput applications in traditional environments.
