**disclaimer: this article is an out-take of our book: vSphere 5 Clustering Technical Deepdive**
There are some fundamental changes when it comes to vMotion scalability and performance in vSphere 5.0. Most of these changes share one common goal: being able to vMotion ANY type of workload. With the following enhancements, it no longer matters if you have a virtual machine with 32GB of memory that is rapidly changing its memory pages:
- Multi-NIC vMotion support
- Stun During Page Send (SDPS)
**Multi-NIC vMotion Support**
One of the most substantial and visible changes is the multi-NIC vMotion capability. vMotion is now capable of using multiple NICs concurrently to decrease the time a vMotion takes. That means that even a single vMotion can leverage all of the configured vMotion NICs. Prior to vSphere 5.0, only a single NIC was used by a vMotion-enabled VMkernel interface. Enabling multiple NICs for your vMotion-enabled VMkernel interfaces removes some of the bandwidth/throughput constraints associated with large and memory-active virtual machines. The following list shows the currently supported maximum number of NICs for multi-NIC vMotion:
- 1GbE – 16 NICs supported
- 10GbE – 4 NICs supported
It is important to realize that in the case of 10GbE interfaces, it is only possible to use the full bandwidth when the server is equipped with the latest PCI Express busses. Ensure that your server hardware is capable of taking full advantage of these capabilities when this is a requirement.
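As a rough back-of-the-envelope illustration (my own sketch, not VMware code), the benefit of spreading a single vMotion across multiple NICs can be modeled as an idealized transfer-time estimate. The function name and the assumption of perfect load balancing across NICs are mine:

```python
def vmotion_time_estimate(memory_gb, nic_gbps, nic_count):
    """Idealized pre-copy time for a single vMotion that is load-balanced
    across all configured vMotion NICs (ignores page dirtying, protocol
    overhead, and PCIe bus limits)."""
    aggregate_gbps = nic_gbps * nic_count
    return memory_gb * 8 / aggregate_gbps  # GB -> Gb, then seconds

# A 32GB VM over one 1GbE NIC vs. four 1GbE NICs:
print(vmotion_time_estimate(32, 1, 1))  # 256.0 seconds
print(vmotion_time_estimate(32, 1, 4))  # 64.0 seconds
```

Even this crude model shows why multi-NIC vMotion matters for large VMs: the pre-copy window shrinks roughly linearly with the number of NICs, which also gives the dirtying workload less time to outrun the copy.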
**Stun During Page Send**
A couple of months back I described the cool vSphere 4.1 vMotion enhancement called Quick Resume. It has now been replaced with Stun During Page Send (SDPS), also often referred to as “Slowdown During Page Send”, a feature that slows down the vCPUs of the virtual machine being vMotioned. Simply said, vMotion tracks the rate at which guest pages are changed, or as the engineers prefer to call it, “dirtied”, and compares that rate to the vMotion transmission rate. If the rate at which pages are dirtied exceeds the transmission rate, the source vCPUs are placed in a sleep state to decrease the page dirtying rate and allow the vMotion process to complete. It is good to know that the vCPUs are only put to sleep for a few milliseconds at a time at most. SDPS injects frequent, tiny sleeps, disrupting the virtual machine’s workload just enough to guarantee that vMotion can keep up with the memory page change rate and complete successfully and non-disruptively. You could say that, thanks to SDPS, you can vMotion any type of workload regardless of how aggressive it is.
It is important to realize that SDPS only slows down a virtual machine in the cases where the memory page change rate would have previously caused a vMotion to fail.
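The throttling logic described above can be pictured with a toy model (my own sketch, not VMware’s implementation): the per-interval vCPU sleep grows until the assumed throttled dirty rate drops to or below the transmit rate. The step size and slowdown factor are illustrative assumptions:

```python
def sdps_sleep_us(dirty_mbps, xmit_mbps, step_us=50, slowdown=0.8):
    """Toy SDPS model: lengthen the vCPU stun (microseconds per scheduling
    interval) until the guest no longer dirties pages faster than vMotion
    can transmit them. The 20%-per-step slowdown factor is an assumption."""
    sleep_us = 0
    while dirty_mbps > xmit_mbps:
        sleep_us += step_us     # inject a slightly longer tiny sleep
        dirty_mbps *= slowdown  # sleeping vCPUs dirty fewer pages
    return sleep_us

print(sdps_sleep_us(2000, 1000))  # 200 -> aggressive workload gets throttled
print(sdps_sleep_us(500, 1000))   # 0   -> no throttling needed
```

Note how a workload that already dirties pages slower than the transmit rate is never throttled at all, which matches the point above: SDPS only kicks in where the vMotion would previously have failed.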
This technology is also what enables the increase in accepted latency for long-distance vMotion. Pre-vSphere 5.0, the maximum supported latency for vMotion was 5ms. As you can imagine, this restricted many customers from enabling cross-site clusters. As of vSphere 5.0, the maximum supported latency has been doubled to 10ms for environments using Enterprise Plus. This should allow more customers to enable DRS between sites when all the required infrastructure components, such as shared storage, are available.
Kelly O says
Curious as to how multi-NIC vMotion support will change design. Will the standard practice be less physical segregation and more VLANs on the same vSwitch?
Rickard Nobel says
“As of vSphere 5.0, the maximum supported latency has been doubled to 10ms for environments using Enterprise Plus.”
Does this mean that the “Stun During Page Send” feature you described only works in Enterprise Plus?
Duncan Epping says
No this refers to Metro vMotion.
Bilal Hashmi says
Just out of curiosity: if one is lucky enough to RDP to a server while it is being stunned by SDPS and the vCPU is put to sleep, might they not be able to connect? I would assume the ones that are already connected will probably be fine as RDP will try to reconnect them; besides, the vCPU is only put to sleep for a few milliseconds at a time. This is really good information.
So unlike Quick Resume, no buffer file will be created here at all?
Duncan Epping says
SDPS slows down the vCPU in terms of scheduling; that doesn’t mean the VM is sleeping… Connecting will still work as expected, processes might just take longer to complete.
DK says
“Processes might just take longer to complete”… Could this cause the OS to page out more if the workload is queuing up? Could it be a tradeoff of workload from vCPU to disk I/O to increase the vMotion capabilities? Don’t get me wrong, I think it’s a great improvement, but I just want to understand the potential repercussions.
cwjking says
That makes a lot of sense. However, as you may recall, in vSphere 4.1 a 10GbE FCoE vMotion would hog the bandwidth with just one vMotion (a single vMotion would use 8Gbps of bandwidth unless you used some sort of traffic control mechanism).
Was this changed in vSphere 5? Also, could you explain how that works? We currently use 2×10GbE FCoE (Cisco UCS).
Warren says
Cwjking; have you used Network IO Control?
Jon says
Thanks Duncan, good info. Doing some last minute prep before I take the VCP tomorrow.
Stunner Hawk says
Thanks, it was a good one.
In 4.x, during a vMotion, if the page copy rate is lower than the dirty page creation rate, vMotion decides whether to fail the vMotion or proceed with the switchover based on the setting vmotion.maxSwitchoverSeconds.
(http://blogs.vmware.com/vsphere/2011/02/vmotion-whats-going-on-under-the-covers.html)
So in 5.x, with SDPS in place, is there no longer a situation where the page copy rate stays lower than the dirty page creation rate, and does vMotion therefore no longer consider vmotion.maxSwitchoverSeconds?