Yellow Bricks

by Duncan Epping


intel

Impact of adding Persistent Memory / Optane Memory devices to your VM

Duncan Epping · May 22, 2019 ·

I had some questions around this in the past month, so I figured I would share some details. As persistent memory (Intel Optane Memory devices, for instance) is getting more affordable and readily available, more and more customers are looking to use it. Some are already using it for very specific use cases, usually in situations where the OS and the app actually understand the type of device being presented. What does that mean? At VMworld 2018 there was a great session on this topic, and I captured the session in a post. Let me copy/paste the important bit for you, which discusses the different modes in which a Persistent Memory device can be presented to a VM.

  • vPMEMDisk = exposed to the guest as a regular SCSI/NVMe device, VMDKs are stored on a PMEM datastore
  • vPMEM = exposes the NVDIMM device in a “passthrough” manner, the guest can use it as a block device or as a byte-addressable direct access device (DAX); this is the fastest mode and most modern OSes support it (see the sketch after this list)
  • vPMEM-aware = this is similar to the mode above, but the difference is that the application understands how to take advantage of vPMEM
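
As a side note, this is roughly what byte-addressable (DAX) access can look like from inside a Linux guest that sees a vPMEM device. It is a minimal sketch, not a recipe: it assumes the guest has already created a filesystem on /dev/pmem0 and mounted it with the dax option at /mnt/pmem (device name, mount point and file name are just examples), and it skips the MAP_SYNC / cache-flush details that real persistence guarantees depend on.

```python
# Minimal sketch: byte-addressable access to a file on a DAX-mounted filesystem
# inside the guest. Paths are examples; persistence details (MAP_SYNC, flushing
# CPU caches) are intentionally left out.
import mmap

PMEM_FILE = "/mnt/pmem/demo.bin"   # file on the DAX-mounted vPMEM filesystem
SIZE = 4096

with open(PMEM_FILE, "w+b") as f:
    f.truncate(SIZE)                        # size the file before mapping it
    with mmap.mmap(f.fileno(), SIZE) as m:  # loads/stores go straight to PMEM via DAX
        m[0:5] = b"hello"                   # write directly into the mapping
        m.flush()                           # msync to make the write durable
```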

But what is the problem with this? What is the impact? Well, when you expose a Persistent Memory device to a VM, it is not currently protected by vSphere HA, even though HA may be enabled on your cluster. Say what? Yes indeed, the VM which has the PMEM device presented to it will be disabled for vSphere HA! I had to dig deep to find this documented anywhere, and it is documented in this paper. (Page 47, at the bottom.) So what works and what does not? Well, if I understand it correctly:

  • vSphere HA >> Not supported on vPMEM enabled VMs, regardless of the mode
  • vSphere DRS >> Does not consider vPMEM enabled VMs, regardless of the mode
  • Migration of VM with vPMEM / vPMEM-aware >> Only possible when migrating to a host which has PMEM
  • Migration of VM with vPMEMDISK >> Possible to a host without PMEM

Also note that, as a result (the data is not replicated/mirrored), a failure could potentially lead to loss of data. Although Persistent Memory is a great mechanism to increase performance, this is something that should be taken into consideration when you are thinking about introducing it into your environment.
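
If you want a quick way to spot the VMs in your environment that carry an NVDIMM (vPMEM) device, and are therefore not protected by vSphere HA, something along these lines with pyVmomi could work. Treat it as a sketch: the vCenter name and credentials are placeholders, and the vim.vm.device.VirtualNVDIMM device type (NVDIMMs were introduced with vSphere 6.7) should be verified against your vSphere / pyVmomi version.

```python
# Sketch: list VMs that have an NVDIMM (vPMEM) device configured.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()           # lab only; validate certs in production
si = SmartConnect(host="vcenter.example.local",  # placeholder vCenter and credentials
                  user="administrator@vsphere.local",
                  pwd="********",
                  sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        devices = vm.config.hardware.device if vm.config else []
        if any(isinstance(d, vim.vm.device.VirtualNVDIMM) for d in devices):
            print(f"{vm.name}: has an NVDIMM device, so it is not protected by vSphere HA")
    view.Destroy()
finally:
    Disconnect(si)
```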

Oh, if you are wondering why people are taking these risks in terms of availability, Niels Hagoort just posted a blog with a pointer to a new PMEM Perf paper which is worth reading.


Re: Re: The Rack Endgame: A New Storage Architecture For the Data Center

Duncan Epping · Sep 5, 2014 ·

I was reading Frank Denneman’s article with regards to new datacenter architectures, which in its turn was a response to Stephen Foskett’s article about how the physical architecture of datacenter hardware should change. I recommend reading both articles as that will give a bit more background, plus they are excellent reads by themselves. (Gotta love these blogging debates.) Let’s start with an excerpt from both articles which summarizes the posts for those who don’t want to read them in full.

Stephen:
Top-of-rack flash and bottom-of-rack disk makes a ton of sense in a world of virtualized, distributed storage. It fits with enterprise paradigms yet delivers real architectural change that could “move the needle” in a way that no centralized shared storage system ever will. SAN and NAS aren’t going away immediately, but this new storage architecture will be an attractive next-generation direction!

If you look at what Stephen describes, I think it is more or less in line with what Intel is working towards. The Intel Rack Scale Architecture aims to disaggregate traditional server components and then aggregate them by type of resource, backed by a high-performance, optimized rack fabric; a fabric enabled by the new photonic architecture Intel is currently working on. This is not long-term future: this is what Intel showcased last year and said would be available in 2015 / 2016.

Frank:
The hypervisor is rich with information, including a collection of tightly knit resource schedulers. It is the perfect place to introduce policy-based management engines. The hypervisor becomes a single control plane that manages both the resource as well as the demand. A single construct to automate instructions in a single language providing a correct Quality of Service model at application granularity levels. You can control resource demand and distribution from one single pane of management. No need to wait on the completion of the development cycles from each vendor.

There’s a bit in Frank’s article as well where he talks about Virtual Volumes and VAAI, how long it took for all storage vendors to adopt VAAI, and how he believes that the same may apply to Virtual Volumes. Frank aims more towards the hypervisor being the aggregator instead of doing it through changes in the physical space.

So what about Frank’s arguments? Well, Frank has a point with regards to VAAI adoption and the fact that some vendors took a long time to implement it. The reality, though, is that Virtual Volumes is going full steam ahead. With many storage vendors demoing it at VMworld in San Francisco last week, I have the distinct feeling that things will be different this time. Maybe timing is part of it, as it seems that many customers are at a crossroads and want to optimize their datacenter operations / architecture by adopting SDDC, of which policy-based storage management happens to be a big chunk.

I agree with Frank that the hypervisor is perfectly positioned to be that control plane. However, in order to be that control plane for the future there needs to be a way to connect “things” to it which allows for far better scale and more flexibility. VMware, if you ask me, has done that for many parts of the datacenter, but one aspect that still needs to be overhauled for sure is storage. VAAI was a great start, but with VMFS there simply are too many constraints and it doesn’t cater for granular controls.

I feel that the datacenter will need to change on both ends in order to take that next step in the evolution to the SDDC. The Intel Rack Scale Architecture will allow for far greater scale and efficiency than ever seen before. But it will only be successful when the layer that sits on top has the ability to take all of these disaggregated resources, turn them into large shared pools, and assign resources in a policy-driven (and programmable) manner. Not just assign resources, but also allow you to specify what the level of availability (HA, DR, but also QoS) should be for whatever consumes those resources. Granularity is important here, and of course it shouldn’t stop with availability but should apply to any other (data) service that one may require.

So where does what fit in? If you look at some of the initiatives that were revealed at VMworld, like Virtual Volumes, Virtual SAN and the vSphere APIs for IO Filters, you can see where the world is moving towards, fast. You can see how vSphere is truly becoming that control plane for all resources and how it will be able to provide you with end-to-end policy-driven management. In order to make all of this a reality, the current platform will need to change: changes that allow for more granularity / flexibility and higher scalability, and that is where all these (new) initiatives come into play. Some partners may take longer to adopt than others, especially those that require fundamental changes to the architecture of underlying platforms (storage systems for instance), but just like with VAAI I am certain that over time this will happen, as customers will drive this change by making decisions based on the availability of functionality.

Exciting times ahead if you ask me.

Scale UP!

Duncan Epping · Mar 17, 2010 ·

Lately I am having a lot of discussions with customers around the sizing of their hosts. Especially Cisco UCS (with the 384GB option) and the upcoming Intel Xeon 5600 series with six cores per CPU take the “Scale Up” discussion to a new level.

I guess we had this discussion in the past as well, when 32GB became a commodity. The question I always have is: how many eggs do you want to have in one basket? Basically, do you want to scale up (larger hosts) or scale out (more hosts)?

I guess it’s a common discussion, and a lot of people don’t see the impact of sizing their hosts correctly. Think about this environment, 250 VMs in total with a need for roughly 480GB of memory:

  • 10 Hosts, each having 48GB and 8 Cores, 25 VMs each.
  • 5 Hosts, each having 96GB and 16 Cores, 50 VMs each.

If you look at it from an uptime perspective: were a failure to occur in scenario 1, you would lose 10% of your environment; in scenario 2 this is 20%. Clearly the cost associated with the downtime for 20% of your estate is higher than for 10% of your estate.

Now it’s not only the cost associated with the impact of a host failure, it is also, for instance, the ability of DRS to load balance the environment. The fewer hosts you have, the smaller the chance that DRS will be able to balance the load. Keep in mind that DRS uses a deviation to calculate the imbalance and simulates a move to see if it results in a balanced cluster.
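
Just to illustrate the idea (this is not the actual DRS algorithm, and the host load numbers are made up), a deviation-based imbalance metric and a simulated move could look something like this:

```python
# Toy illustration of a deviation-based imbalance metric plus a simulated move.
# This is NOT the actual DRS algorithm; the per-host load percentages are made up.
from statistics import pstdev

def imbalance(host_loads):
    """Standard deviation of per-host load: lower means better balanced."""
    return pstdev(host_loads)

loads = [85, 40, 55, 60]                  # hypothetical per-host load (%)
print(f"current imbalance: {imbalance(loads):.1f}")

# Simulate moving a VM worth ~20% load from the busiest host to the least busy one
simulated = loads[:]
simulated[simulated.index(max(simulated))] -= 20
simulated[simulated.index(min(simulated))] += 20
print(f"after simulated move: {imbalance(simulated):.1f}")
```

With more hosts there are simply more candidate moves to simulate, which makes it easier to end up with a low deviation.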

Another thing to keep in mind is HA. When you design for N+1 redundancy and need to buy an extra host, the cost associated with redundancy is high in a scale-up scenario. Not only is the cost high, the load when the fail-over occurs will also increase immensely. If you only have 4 hosts and 1 host fails, the added load on the remaining 3 hosts will have a higher impact than it would have on, for instance, 9 hosts in a scale-out scenario.
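
To put some rough numbers on this, assuming VMs and load are spread evenly across the hosts, a quick back-of-the-envelope calculation for the scenarios above looks like this:

```python
# Back-of-the-envelope numbers for a single host failure, assuming VMs and load
# are spread evenly across the hosts in the cluster.

def single_host_failure(total_hosts):
    """Return (% of the estate impacted, % load increase per surviving host)."""
    impacted = 100.0 / total_hosts                  # one host = 1/N of the estate
    extra_per_survivor = 100.0 / (total_hosts - 1)  # each survivor grows by 1/(N-1)
    return impacted, extra_per_survivor

for hosts in (10, 5, 9, 4):
    impacted, extra = single_host_failure(hosts)
    print(f"{hosts:>2} hosts: {impacted:4.1f}% of the estate down, "
          f"each surviving host takes on ~{extra:4.1f}% more load")
```

For 10 hosts you lose 10% of the estate and each surviving host picks up roughly 11% extra load; for 5 hosts it is 20% and 25%; and in the 4-host example each of the 3 remaining hosts takes on roughly 33% more.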

Licensing is another often-used argument for buying larger hosts, but for VMware it usually will not make a difference. I’m not the “capacity management” or “capacity planning” guru to be honest, but I can recommend VMware Capacity Planner as it can help you to easily create several scenarios. (Or PlateSpin Recon for that matter.) If you have never tried it and are a VMware partner, check it out, run the scenarios based on scale-up and scale-out principles, and do the math.

Now, don’t get me wrong, I am not saying you should not buy hosts with 96GB, but think before you make this decision. Decide what an acceptable risk is and discuss the impact of that risk with your customer(s). As you can imagine, for any company there’s a cost associated with downtime. Downtime for 20% of your estate will have a different financial impact than downtime for 10% of your estate, and this needs to be weighed against all the pros and cons of scale out vs scale up.

Nehalem and memory config

Duncan Epping · Jan 1, 2010 ·

Just a short article for today, or should I call it a tip: take your memory configuration into account for Nehalem processors. There’s a sweet spot in terms of performance which might just make a difference. Read this article on Scott’s blog or this article on AnandTech, where they measured the difference in performance. Again, it is not a huge difference, but when combining workloads it might just be that little extra you were looking for.
