
Yellow Bricks

by Duncan Epping



Reminder: Free Kindle copy of vSphere 5 and 4.1 Clustering Deepdive

Duncan Epping · Jun 5, 2013 ·

Just a reminder: if you want your free Kindle copy of the vSphere 5.0 Clustering Deepdive or the vSphere 4.1 HA and DRS Deepdive, make sure to check Amazon (US Kindle Store) today and tomorrow (Wednesday June 5th and Thursday June 6th)! You can download the Kindle copy of both books for free, yes, that is correct: ZERO dollars.

So make sure you pick it up before the promo expires…

** Note: I have linked the US Kindle store; the books are also available for free in local Kindle stores, just do a search! **

Free Kindle copy of vSphere 5.0 Clustering Deepdive?

Duncan Epping · May 28, 2013 ·

Do you want a free Kindle copy of the vSphere 5.0 Clustering Deepdive or the vSphere 4.1 HA and DRS Deepdive? Make sure to check Amazon next week! I just put both books up for a promotional offer… For 48 hours, Wednesday June 5th and Thursday June 6th, you can download the Kindle copy (US Kindle Store) of both books for free, yes, that is correct: ZERO dollars.

So make sure you pick it up on either Wednesday June 5th or Thursday June 6th; it might be the only time this year it is on promo.

What is static overhead memory?

Duncan Epping · May 6, 2013 ·

We had an internal discussion on static overhead memory. Coincidentally, I spoke with Aashish Parikh from the DRS team about this topic a couple of weeks ago when I was in Palo Alto. Aashish is working on improving the overhead memory estimation so that both HA and DRS can be even more efficient when placing virtual machines. The question was what determines static overhead memory, and this is the answer Aashish provided. I found it very useful, hence I asked Aashish if it was okay to share it with the world. I added some bits and pieces where I felt additional details were needed.

First of all, what is static overhead and what is dynamic overhead:

  • When a VM is powered off, the amount of overhead memory required to power it on is called static overhead memory.
  • Once a VM is powered on, the amount of overhead memory required to keep it running is called dynamic, or runtime, overhead memory.

Static overhead memory of a VM depends on various factors (a toy sketch follows the list below):

  1. Several virtual machine configuration parameters, such as the number of vCPUs, the amount of vRAM, the number of devices, etc.
  2. The enabling/disabling of various VMware features (FT, CBRC, etc.)
  3. The ESXi build number
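
To make this concrete, below is a toy sketch of what a worst-case estimate could look like. Every constant and term here is made up purely for illustration; the real VMkernel calculation is far more involved and changes between ESXi builds.

```python
# Toy illustration of a worst-case static overhead estimate.
# Every constant and term is made up for illustration; the real
# VMkernel calculation is far more involved and changes per ESXi build.

def estimate_static_overhead_mb(vcpus: int, vram_mb: int, num_devices: int,
                                ft_enabled: bool = False,
                                cbrc_enabled: bool = False) -> float:
    overhead = 20.0                # hypothetical fixed cost per VM
    overhead += vcpus * 8.0        # hypothetical per-vCPU cost
    overhead += vram_mb * 0.01     # hypothetical cost per MB of vRAM (page tables, etc.)
    overhead += num_devices * 2.0  # hypothetical per-device cost
    if ft_enabled:
        overhead *= 2.0            # assume FT doubles the worst case
    if cbrc_enabled:
        overhead += 32.0           # hypothetical CBRC digest cache cost
    return overhead

# A 2 vCPU / 4 GB VM with 5 virtual devices:
print(estimate_static_overhead_mb(vcpus=2, vram_mb=4096, num_devices=5))  # 86.96
```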

Note that the static overhead memory estimate is fairly conservative: a worst-case scenario is taken into account. This is the reason why engineering is exploring ways to improve it. One of the areas that can be improved is, for instance, the inclusion of host configuration parameters: things like the CPU model, family & stepping, various CPUID bits, etc. As a result, two identical VMs residing on different hosts could have different overhead values.

But what about dynamic overhead? It seems to be more accurate today, right? Well, there is a good reason for that: with dynamic overhead it is known on which host the VM is running, and the cost of running the VM on that host can easily be calculated. It is no longer a matter of estimating, but a matter of doing the math. That is the big difference: dynamic = the VM is running and we know where, versus static = the VM is powered off and we don't know where it might be powered on!

The same applies, for instance, to vMotion scenarios. Although the platform knows what the target destination will be, it still doesn't know how the target will treat that virtual machine. As such, the vMotion process aims to be conservative and uses static overhead memory instead of dynamic. One of the things that changes the amount of overhead memory needed, for instance, is the "monitor mode" used (BT, HV or HWMMU).

So what is being explored to improve it? First of all, including the additional host-side parameters mentioned above. But secondly, and equally important, the overhead memory should be calculated based on the VM -> "target host" combination. Or, as engineering calls it, calculating the "static overhead of VM v on host h".
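
In terms of the toy sketch above, the proposed change essentially makes the estimate a function of the (VM, host) pair instead of the VM alone. Again, the host-side correction below is purely my own illustration:

```python
# Purely illustrative: the proposed improvement makes the estimate a
# function of the (VM, host) pair instead of the VM alone.

def estimate_static_overhead_on_host_mb(vcpus: int, vram_mb: int,
                                        host_cpu_family: str) -> float:
    overhead = 20.0 + vcpus * 8.0 + vram_mb * 0.01  # same toy VM-side terms as above
    # Hypothetical host-side correction: pretend some CPU families need
    # larger monitor data structures than others.
    family_factor = {"sandy-bridge": 1.0, "westmere": 1.1}.get(host_cpu_family, 1.2)
    return overhead * family_factor

# The same VM gets a different estimate depending on the target host:
print(estimate_static_overhead_on_host_mb(2, 4096, "sandy-bridge"))  # 76.96
print(estimate_static_overhead_on_host_mb(2, 4096, "westmere"))      # ~84.66
```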

Now why is this important, and when is static overhead memory used? Static overhead memory is used by both HA and DRS. HA, for instance, uses it with Admission Control when calculating how many VMs can be powered on before unreserved resources are depleted. When you power on a virtual machine, the host-side "admission control" validates whether it has sufficient unreserved resources available for the static memory overhead to be guaranteed… But DRS and vMotion also use the static memory overhead metric, for instance to ensure a virtual machine can be placed on a target host during a vMotion process, as the static memory overhead needs to be guaranteed.
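
Conceptually, that host-side check at power-on boils down to something like the minimal sketch below. The names and structure are illustrative, not the actual VMkernel logic:

```python
# Minimal sketch of the host-side admission check at power-on.
# Names and structure are illustrative, not the actual VMkernel logic.

def can_power_on(host_unreserved_mb: float,
                 vm_reservation_mb: float,
                 vm_static_overhead_mb: float) -> bool:
    # Both the VM's memory reservation and its static overhead must be
    # guaranteed out of the host's unreserved capacity.
    return host_unreserved_mb >= vm_reservation_mb + vm_static_overhead_mb

# With no reservation set, only the static overhead has to fit:
print(can_power_on(host_unreserved_mb=1024, vm_reservation_mb=0,
                   vm_static_overhead_mb=217))  # True
```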

As you can see, that is a fairly lengthy chunk of info on just a single, simple metric in vCenter / ESXTOP… but very nice to know!

Increase Storage IO Control logging level

Duncan Epping · May 2, 2013 ·

I received a question today about how to increase the Storage IO Control (SIOC) logging level. I knew either Frank or I had written about this in the past, but I couldn't find it… which made sense, as it was actually documented in our book. I figured I would dump the blurb into an article so that everyone who needs it, for whatever reason, can use it.

Sometimes it is necessary to troubleshoot your environment, and having logs to review is helpful in determining what is actually happening. By default, SIOC logging is disabled; it should be enabled before collecting logs. To enable logging:

  1. Click Host Advanced Settings.
  2. In the Misc section, select the Misc.SIOControlLogLevel parameter. Set the value to 7 for complete logging (min value: 0 (no logging), max value: 7).
  3. SIOC needs to be restarted for the log level change to take effect. To stop and start SIOC manually, use: /etc/init.d/storageRM {start|stop|status|restart}
  4. After changing the log level, you will see the log level change logged in /var/log/vmkernel.

Please note that SIOC log files are saved in /var/log/vmkernel.
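
If you just want the SIOC-related entries out of that log, a quick filter like the sketch below can help. Note that the "storageRM" / "SIOC" tags are my assumption; adjust the pattern to whatever your ESXi build actually writes.

```python
# Quick-and-dirty filter for SIOC-related entries in the vmkernel log.
# The "storageRM" / "SIOC" tags are an assumption; adjust the pattern
# to whatever your ESXi build actually writes.

import re

LOG_FILE = "/var/log/vmkernel"
PATTERN = re.compile(r"storageRM|SIOC", re.IGNORECASE)

with open(LOG_FILE, errors="replace") as log:
    for line in log:
        if PATTERN.search(line):
            print(line.rstrip())
```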

Death to false myths: Admission Control lowers consolidation ratio

Duncan Epping · Dec 11, 2012 ·

"Death to false myths" probably sounds a bit, euuhm, well, Dutch, or "direct" as others would label it. Lately I have seen some statements floating around that are either false or misused. One of them is about Admission Control and how it impacts consolidation ratio, even if you are not using reservations. I have had multiple questions about this in the last couple of weeks and noticed this thread on VMTN.

The thread referred to is all about which Admission Control policy to use, as the selected policy potentially impacts the number of virtual machines you can run on a cluster. Now let's take a look at the example in this VMTN thread; I have rounded some of the numbers to simplify things:

  • 7 host cluster
  • 512 GB of memory
  • 132 GHz of CPU resources
  • 217 MB of memory overhead per VM (no reservations used)

So if you do the quick math, according to Admission Control (the "host failures" policy in this example) you can power on roughly ~2500 virtual machines. That is without taking N-1 resiliency into account. When I take out the largest host, we are still talking about ~1800 virtual machines that can be powered on. Yes, that is 700 slots/virtual machines fewer due to N-1: admission control needs to be able to guarantee that even if the largest host fails, all virtual machines can be restarted.

Considering we have 512 GB in total, that means that if those 1800 virtual machines actively use more than roughly 280 MB each on average (512 GB / 1800 VMs), we will see TPS / swapping / ballooning / compression kick in. Clearly you want to avoid most of these, swapping / ballooning / compression that is, especially considering most VMs are typically provisioned with 2 GB of memory or more.
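
Here is the back-of-the-napkin math in code form. The size of the largest host is not given in the thread, so the ~150 GB below is purely my assumption to make the N-1 step concrete:

```python
# Back-of-the-napkin version of the slot math above. The size of the
# largest host is not given in the thread, so ~150 GB is purely an
# assumption to make the N-1 step concrete.

total_memory_mb = 512 * 1024    # 512 GB cluster
slot_size_mb = 217              # per-VM memory overhead, no reservations
largest_host_mb = 150 * 1024    # assumed size of the biggest host

slots = total_memory_mb // slot_size_mb
slots_n1 = (total_memory_mb - largest_host_mb) // slot_size_mb
print(f"slots without N-1: {slots}")     # 2416, the post's ~2500
print(f"slots with N-1:    {slots_n1}")  # 1708, the post's ~1800

# Average physically backed memory per VM at that density:
print(f"per-VM share at 1800 VMs: {total_memory_mb / 1800:.0f} MB")  # ~291 MB
```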

So what does that mean, and what did we learn? Two things:

  • Admission Control is about guaranteeing virtual machine restarts
  • If you set no reservation you can power-on an insane amount of virtual machines

Let me re-emphasize the last bullet: you can power on an INSANE number of virtual machines on just a couple of hosts when no reservations are used. In this case, HA would allow 1800 virtual machines to be powered on before it starts screaming that it is out of resources. Is that going to work in real life? Would your virtual machines be happy with the amount of resources they are getting? I don't think so… I don't believe 280 MB of physically backed memory is sufficient for most workloads. Yes, maybe TPS can help a bit, but the chances of hitting the swap file are substantial.

Let it be clear: admission control is not a resource management solution. It only guarantees that virtual machines can be restarted, and if you have no reservations set, then the numbers you see are probably not realistic, at least not from a user experience perspective. I bet your users / customers would like a bit more resources than just the bare minimum required to power on a virtual machine! So don't let these numbers fool you.

