I was troubleshooting an issue where vMotion would time-out constantly, I had no clue where it was coming from so I started digging. In this case the environment was using a regular vSwitch and 10GbE networking. When I took a closer look I noticed that some form of traffic shaping was applied, as unfortunately the Distributed vSwitch was not an option for this environment. Now traffic shaping was enabled and the peak value was specified and the rest was left to the default value… and unfortunately this is exactly what cause the problem.
So when it comes to vSwitch Traffic Shaping, what is what? There are 3 settings you can set per portgroup:
- Average Bandwidth – specified in Kbps
- Peak Bandwidth – specified in Kbps
- Burst Size – specified in KB
So if you have a 10Gbps NIC port for your traffic this means you have a total of 10,485,760 Kbps. When you enable vSwitch Traffic Shaping by default it is set to have “Average Bandwidth” to 100,000 Kbps , Peak Bandwidth to 100,000 Kbps and Burst Size to 1024,00 KB. So what does that mean? Well it means that if you enable it and do not change the values that the traffic is limited to 100,000 Kbps. 100,000 Kbps is… yes roughly 100Mbps, even less to be more precise: 97.6Mbps. Which is not a lot indeed, and not even a supported configuration for vMotion.
So what if I simply bump up the Peak Bandwidth to lets say 5Gbps, as I do not want vMotion to ever consume more than half of the NIC port (note, vSwitch traffic shaping is only for egress aka outbound traffic). Well setting the peak bandwidth sounds like it may do something, but probably not what you would hope for as this is how the settings are applied:
By default the traffic stream will get what is specified by “Average Bandwidth”. However, it is possible to exceed this when needed by specifying a higher “Peak Bandwidth” value. Your traffic will be allowed to burst until the value of “Burst Size” has been exceeded. In other words, in the above example when only Peak Bandwidth is increased this would lead to the following: By default the traffic is limited to 100Mbps, however it can peak to 5Gbps but only for 100MB worth of data traffic. As you can imagine in the case of vMotion when the full memory content of a VM is transferred that 100MB is hit within a second, after which the vMotion process is throttled back to 100Mbps and the remainder of the VM memory takes ages to copy and eventually times out.
So if you apply traffic shaping using your vSwitch, make sure to think through the numbers. In the above scenario for instance, specifying a 5Gbps Average and Peak would be what was desired.