At VMworld, various cool new technologies were previewed. In this series of articles, I will write about some of those previewed technologies. Unfortunately, I can’t cover them all as there are simply too many. This article is about enhancements that will be introduced to vMotion in the future, which was session HBI1421BU. For those who want to see the session, you can find it here. This session was presented by Arunachalam Ramanathan and Sreekanth Setty. Please note that this is a summary of a session which discusses a Technical Preview. This feature/product may never be released, this preview does not represent a commitment of any kind, and this feature (or its functionality) is subject to change. Now let’s dive into it: what can you expect for vMotion in the future?
The session starts with a brief history of vMotion and how we are capable today of migrating VMs with 128 vCPUs and 6 TB of memory. The expectation, though, is that vSphere will support 768 vCPUs and 24 TB of memory in the future. Crazy configuration if you ask me, that is a proper Monster VM.
You can imagine that this also introduces a lot of challenges for vMotion, as a VM like this will need to move without any kind of interruption, especially as apps like SAP HANA or Oracle typically run inside of these VMs. Of course, customers expect the behavior of this VM to be the same as that of a 1 vCPU / 1GB VM. But you can imagine this is very difficult, as things like iterative copies can take a lot longer, and the same goes for memory pre-copy and the switch-over. It is then explained how things like page tracing work today, but more importantly what the impact is on VMs when using this technique. For a VM with 72 vCPUs and 512GB of memory with page tracing enabled, the VM is not running for a total of 53 seconds while the total vMotion time is 102 seconds, meaning that for more than half of the time the VM is not actively running. And although the user does not typically notice it, as the stops are very granular, there definitely is an impact for larger VMs. (The bigger the VM, the bigger the impact.) How can we reduce the cost?
Well, what if we could enable traces without having to stop all the vCPUs? This is where “loose page traces” are introduced. The solution is rather simple: only a single vCPU is stopped, that single vCPU takes care of enabling the trace (install, as they call it), and then the flush happens on all vCPUs individually. Since only a single vCPU now has to be stopped, there’s a huge performance benefit. Another optimization that is introduced is Large Trace Installs, which refers to the size of the memory page being traced. Instead of tracing at a 4KB level, 1GB pages are now traced. This reduces the number of pages that need to be set to read-only and again improves the performance of the guest and the vMotion process. The 53 seconds in the previous example is now reduced to 3 seconds, and the vMotion time is reduced from 102 seconds down to 67 seconds. This is HUGE. On top of that, the performance hit decreases as a result from 46% to only 8%. I can’t wait for this to ship!
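To get a feel for why the trace granularity matters, here is a quick back-of-the-envelope calculation. It is just arithmetic on the numbers mentioned above (a 512GB VM, 4KB versus 1GB pages) and assumes binary units; the function name `pages_to_trace` is made up for this illustration and says nothing about how vSphere implements it.

```python
KIB = 1024
GIB = 1024 ** 3


def pages_to_trace(memory_bytes: int, page_bytes: int) -> int:
    """Number of pages needed to cover memory_bytes, rounded up."""
    return -(-memory_bytes // page_bytes)


vm_memory = 512 * GIB  # the 72 vCPU / 512GB example VM

small_pages = pages_to_trace(vm_memory, 4 * KIB)  # tracing at 4KB granularity
large_pages = pages_to_trace(vm_memory, 1 * GIB)  # tracing at 1GB granularity

print(f"4KB granularity: {small_pages:,} pages to set read-only")
print(f"1GB granularity: {large_pages:,} pages to set read-only")
```

So instead of roughly 134 million entries, only 512 need to be touched during the install for this VM.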
The second major change is around what we call Trace Fires: what happens when a guest tries to write to a memory page which is traced? What is the cost currently and how can this be optimized? Today, when a page fault occurs for dirty page tracking, the vCPU again temporarily stops executing guest instructions. It takes the actions required to inform vMotion that the page is dirtied, which means vMotion now knows it needs to resend the page to the destination. All of this costs a few thousand CPU cycles. The fact that the vCPU is temporarily unable to execute guest instructions especially hurts performance, and with a larger VM this is particularly painful. This whole process, primarily the cost of locking, has been optimized. This also resulted in a huge performance benefit: for a 72 vCPU VM with 512GB it results in a 35% improvement of vMotion time, a 90% reduction of vCPU tracing time and a guest performance improvement of 70% compared to the “old” mechanism. Again, huge improvements.
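For those who want to picture the trace fire mechanism, below is a toy model of generic write-protect based dirty page tracking. It is a sketch under my own assumptions, not the vSphere implementation: the class `DirtyTracker` and its methods are hypothetical, and I assume (as is common in pre-copy live migration) that a page only fires once per copy iteration, after which its trace is re-installed.

```python
class DirtyTracker:
    """Toy model: every page starts traced (write-protected)."""

    def __init__(self, num_pages: int) -> None:
        self.traced = set(range(num_pages))  # pages with a trace installed
        self.dirty = set()                   # pages that must be resent
        self.fires = 0                       # how often the guest was interrupted

    def guest_write(self, page: int) -> None:
        # A write to a traced page "fires": the vCPU is interrupted,
        # the page is recorded as dirty and the trace is dropped so that
        # later writes to the same page are free in this iteration.
        if page in self.traced:
            self.fires += 1
            self.traced.discard(page)
            self.dirty.add(page)

    def next_iteration(self) -> list:
        # Hand the dirty pages to the copy loop and re-install their traces.
        to_send = sorted(self.dirty)
        self.traced.update(self.dirty)
        self.dirty.clear()
        return to_send


tracker = DirtyTracker(num_pages=8)
for page in (3, 3, 5):          # the second write to page 3 does not fire
    tracker.guest_write(page)

resend = tracker.next_iteration()
print(resend, tracker.fires)
```

Every fire in this model stands for the few thousand CPU cycles mentioned in the session, which is why making each fire cheaper pays off so much on a busy Monster VM.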
In the second half of this section, all the different performance tests are shared. What is clear is that the improvements mainly apply to Monster VMs, except for the page trace (install) changes, which make a big difference for VMs of all sizes. As shown in the screenshot below, not only does vMotion benefit from it, but the guest benefits from it as well.
Next, the Switch-Over process was discussed, and what has been optimized in this space to improve vMotion. Typically the switch-over should happen within a second. The challenge with the switch-over typically is the Transfer Memory Changed Bitmap and the Transfer Swap Bitmap that need to be sent. These bitmaps are used to track the changes during the switch-over, which could be a significant number of pages. The larger (and more active) the VM, the larger the bitmap: a 1GB VM would have a 32KB bitmap where a 24TB VM would have a 768MB bitmap, which works out to one bit per 4KB page. Huge difference, and it could be a problem as you can imagine. The optimization was simple: as the majority of the bitmap was sparse, they simply compacted the bitmap and transmitted that. For a 24TB VM, the transfer went from 2 seconds down to 175 milliseconds, which is huge. You can imagine that when you go to a VM with 100s of TBs of memory, this would make an even bigger difference.
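The bitmap sizes can be checked with some simple math, and the idea of compacting a sparse bitmap is easy to sketch. Note that the session summary does not describe how the compaction is done, so the chunk-based scheme below (only send the chunks that contain a set bit) is purely my assumption, and the names `bitmap_bytes`, `compact` and `expand` are hypothetical.

```python
PAGE_BYTES = 4 * 1024
GIB = 1024 ** 3
TIB = 1024 ** 4


def bitmap_bytes(memory_bytes: int, page_bytes: int = PAGE_BYTES) -> int:
    """Size of a bitmap with one bit per page."""
    return (memory_bytes // page_bytes) // 8


def compact(bitmap: bytes, chunk: int = 4096) -> list:
    """Keep only the chunks containing at least one set bit, as (offset, data)."""
    kept = []
    for offset in range(0, len(bitmap), chunk):
        piece = bytes(bitmap[offset:offset + chunk])
        if any(piece):
            kept.append((offset, piece))
    return kept


def expand(chunks: list, total_len: int) -> bytes:
    """Rebuild the full bitmap on the receiving side."""
    buf = bytearray(total_len)
    for offset, piece in chunks:
        buf[offset:offset + len(piece)] = piece
    return bytes(buf)


print(bitmap_bytes(1 * GIB) // 1024, "KB")          # bitmap for a 1GB VM
print(bitmap_bytes(24 * TIB) // (1024 ** 2), "MB")  # bitmap for a 24TB VM

# A mostly empty bitmap for a 1GB VM with two dirty pages.
bitmap = bytearray(bitmap_bytes(1 * GIB))
for page in (5, 200_000):
    bitmap[page // 8] |= 1 << (page % 8)

chunks = compact(bitmap)
payload = sum(len(piece) for _, piece in chunks)
print(f"sending {payload} bytes instead of {len(bitmap)}")
```

The sparser the bitmap, the more there is to gain, which fits the observation that the benefit grows with the memory size of the VM.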
Then, last but not least, Fast Suspend and Resume was discussed. This is one of the key features being used by Storage vMotion, for instance (hot-adding virtual devices also uses it). What is FSR? Well, in short, it is the mechanism which allows the SvMotion to occur on the same host. Basically, it transfers the memory metadata to the new (shadow) VM so that the SvMotion can be completed, as we end up doing a compute migration on the same host. You would expect this process to not impact the workload or SvMotion process too much, as it is happening on the same host. Unfortunately, it does, as the VM’s vCPU 0 is used to transfer the metadata, and of course, depending on the size of the VM, the impact can be significant, especially as there is no parallelization. This is what will be changed in the future. All the vCPUs will help with copying the metadata, greatly reducing the switch-over time during an SvMotion. For a 1TB VM with 48 vCPUs, the switch-over time went from 7.7 seconds to 0.5 seconds. Of course, various performance experiments were discussed next and demonstrated.
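The idea of having all vCPUs help out can be illustrated with a small sketch that splits a copy job into one contiguous range per worker. This is only meant to show the partitioning principle. The functions `split_ranges` and `parallel_copy` are hypothetical, a plain list stands in for the memory metadata, and Python threads will not show a real speedup because of the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total: int, workers: int) -> list:
    """Divide [0, total) into `workers` contiguous, near-equal ranges."""
    base, extra = divmod(total, workers)
    ranges, start = [], 0
    for i in range(workers):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def parallel_copy(src: list, workers: int) -> list:
    """Copy src into a new list, each worker handling its own range."""
    dst = [None] * len(src)

    def copy_slice(bounds):
        lo, hi = bounds
        dst[lo:hi] = src[lo:hi]  # ranges are disjoint, so no overlap between workers

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(copy_slice, split_ranges(len(src), workers)))
    return dst


metadata = list(range(1000))                # stand-in for the memory metadata
copied = parallel_copy(metadata, workers=48)
print(split_ranges(10, 3))
print(copied == metadata)
```

With 48 workers each range is roughly 1/48th of the total, compared to a single vCPU 0 walking through all of it on its own.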
I was surprised to see how much was shared during this session; it goes deep fast. Very good session, which I would highly recommend watching. You can find it here.
Davoud Teimouri says
Monster VMs are like a nightmare for us, as we have to schedule a maintenance plan for them like we would for a physical server. Thank you for sharing. This new mechanism helps to have bigger VMs and reduce the number of physical servers which are used as physical database servers.