4.1

Which metric to use for monitoring memory?

Duncan Epping · Apr 29, 2011 ·

** PLEASE NOTE: This article was written in 2011 and discussed how to monitor memory usage, which is different than memory / capacity sizing. For more info on “active memory” read this article by Mark A. **

This question has come up several times over the last couple of weeks, so I figured it was time to dedicate an article to it. People have always been used to monitoring memory usage in a specific way, mainly by looking at the “consumed memory” stats. This always worked fine until ESX(i) 3.5 introduced the aggressive usage of large pages. In the 3.5 timeframe that only worked for AMD processors that supported RVI, and with vSphere 4.0 support for Intel’s EPT was added. Every architectural change has an impact, and the impact here is that TPS (transparent page sharing) does not collapse these so-called large pages. (Discussed in-depth here.) This unfortunately resulted in many people having the feeling that there was no real benefit to these large pages or, even worse, the perception that large pages are the root of all evil.

After having several discussions with customers, fellow consultants and engineers we managed to figure out why this perception was floating around. The answer was actually fairly simple: metrics. When monitoring memory most people look at the following section of the host – summary tab:

However, in the case of large pages this metric isn’t actually that relevant. I guess that doesn’t only apply to large pages but to memory monitoring in general, although as explained it used to be an indication. The metric to monitor is “active memory”. Active memory is what the VMkernel believes is currently being actively used by the VM. This is an estimate calculated by a form of statistical sampling, and this statistical sampling will most definitely come in handy when doing capacity planning. Active memory is in our opinion what should be used to analyze trends. Kit Colbert has also hammered on this during his Memory Virtualization sessions at VMworld. I guess the following screenshot is an excellent example of the difference between “consumed” and “active”. Do we need to be worried about “consumed”? Well, I don’t think so; monitoring “active” is probably more relevant at this point! However, it should be noted that “active” represents a 5-minute time slot. It could easily be that the first 5-minute value observed is the same as the second, yet they are different blocks of memory that were touched. So it is an indication of how active the VM is. Nothing more than that.
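To make the sampling idea a bit more concrete, here is a toy sketch in Python. It is not the VMkernel's actual algorithm; the page counts, the sample size and the function name are all assumptions made up for illustration. It estimates the number of touched pages by inspecting a random sample, and it shows that two intervals can report roughly the same “active” value while touching completely different pages.

```python
import random

def estimate_active_pages(total_pages, touched, sample_size, rng):
    """Estimate how many pages were touched by inspecting a random sample.

    Hypothetical helper for illustration only, not a VMware API.
    """
    sample = rng.sample(range(total_pages), sample_size)
    hits = sum(1 for page in sample if page in touched)
    # Extrapolate the hit ratio of the sample to the whole VM
    return round(total_pages * hits / sample_size)

rng = random.Random(42)
total_pages = 100_000  # assumed VM size in pages

# Two intervals that touch the same number of pages, but disjoint sets
interval_1 = set(range(0, 20_000))
interval_2 = set(range(50_000, 70_000))

est_1 = estimate_active_pages(total_pages, interval_1, 1_000, rng)
est_2 = estimate_active_pages(total_pages, interval_2, 1_000, rng)

print("estimated active pages, interval 1:", est_1)
print("estimated active pages, interval 2:", est_2)
print("pages touched in both intervals:", len(interval_1 & interval_2))
```

Both estimates should land in the neighborhood of 20,000 pages even though not a single page is shared between the two intervals, which is why “active” says something about how busy the VM is and not about which memory it needs.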

What have you been up to – part 2

Duncan Epping · Apr 28, 2011 ·

As I have been posting more regularly on the ESXi Chronicles blog I figured it made sense to make people aware of the series of articles I produced, like I did last time. These are the articles I recently published; check them out, as I feel they are worth reading. Also note that many of the “Ops Changes” articles will be rolled up into an official whitepaper that will be published on the Tech Resources section of the VMware website.

I hope my efforts with regards to smoothing the transition to ESXi are helpful so far. If there are any specific areas which you feel need to be covered, feel free to leave a comment and I will try to cover them ASAP.

Fling: PXE Manager for vCenter

Duncan Epping · Apr 22, 2011 ·

It is finally released… PXE Manager for vCenter. My former Cloud colleague Max Daneri of VMTS fame has worked very very hard on this and actually demoed it at VMworld in 2009. I know Max is already working on the next release which of course will work with the upcoming vSphere version as well. So if you’ve tested it and have feedback don’t forget to leave a comment on labs.vmware.com.

PXE Manager for vCenter enables ESXi host state (firmware) management and provisioning. Specifically, it allows:

Automated provisioning of new ESXi hosts, stateless and stateful (no ESX)
ESXi host state (firmware) backup, restore, and archiving with retention
ESXi builds repository management (stateless and stateful)
ESXi Patch management
Multi vCenter support
Multi network support with agents (Linux CentOS virtual appliance will be available later)
Wake on LAN
Hosts memtest
vCenter plugin
Deploy directly to VMware Cloud Director
Deploy to Cisco UCS blades

Free training and book on ESXi?

Duncan Epping · Apr 20, 2011 ·

I just wanted to point to this article on the ESXi Chronicles blog about a free training and free book on ESXi. I actually wrote an article about the book a while back and it is most definitely worth the effort of doing the training and survey!

You might wonder how much a person can write about ESXi, but Dave managed to fill up roughly 490 pages. These pages contain solid and detailed information about the ESXi architecture but also, for instance, on how to implement or secure an ESXi environment. The book has 8 chapters which will take you through the installation and migration step by step but also will prepare you for operating ESXi based environments. On top of that Dave not only discusses the rCLI/vMA but also PowerCLI, and the book contains some really useful sample scripts.

You can find more details about the training and the book in this ESXi Chronicles article.

vMotion and Quick Resume

Duncan Epping · Apr 13, 2011 ·

I was reading up on vMotion today and stumbled on this excellent article by my colleague Kyle Gleed and noticed something that hardly anyone has blogged about: Quick Resume. Quick Resume is a feature that allows you to vMotion a virtual machine that has a high memory page change rate. Basically, when the change rate of your memory pages exceeds the capabilities of your network infrastructure you could end up in a scenario where vMotioning a virtual machine would fail, as the change rate would make a switch-over impossible. With Quick Resume this has changed.
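Some back-of-the-envelope arithmetic shows why a high change rate can make a switch-over impossible. The sketch below assumes a simple iterative pre-copy model in which every round re-sends the pages dirtied during the previous round; that model, the function name and all numbers are assumptions for illustration and not a description of the actual vMotion implementation.

```python
def precopy_rounds(memory_mb, bandwidth_mb_s, dirty_rate_mb_s,
                   switchover_mb, max_rounds=50):
    """Count copy rounds until the remaining dirty memory is small enough
    to switch over, or return None if that never happens.

    Hypothetical model: each round copies everything that is still dirty,
    and the guest keeps dirtying memory while that copy is in flight.
    """
    remaining = memory_mb
    for round_no in range(1, max_rounds + 1):
        copy_time_s = remaining / bandwidth_mb_s
        # Memory dirtied during this round has to be sent again next round
        remaining = min(memory_mb, dirty_rate_mb_s * copy_time_s)
        if remaining <= switchover_mb:
            return round_no
    return None

# Change rate well below the network bandwidth: the copy converges
print(precopy_rounds(16_384, 1_000, 200, 50))    # 4

# Change rate above the network bandwidth: it never converges
print(precopy_rounds(16_384, 1_000, 1_200, 50))  # None
```

In this model the amount left to copy shrinks by the ratio of change rate to bandwidth every round, so once that ratio reaches 1 or more no number of extra rounds will help.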

Quick Resume enables the source virtual machine to be stunned while starting the destination virtual machine before all pages have been copied. However, as the virtual machine is already running at the destination it could possibly attempt to touch (read or write) a page which hasn’t been copied yet. In that case Quick Resume requests the page from the source to allow the guest to complete the action, while continuously copying the remaining memory pages until all pages are migrated. But what if the network would fail at that point? Wouldn’t you end up with a destination virtual machine which cannot access certain memory pages anymore as they are “living” remotely? Just like Storage IO Control, vMotion leverages shared storage. A special file is created when Quick Resume is used, and this file is basically used as a backup buffer. If the network were to fail, this file would allow the migration to complete. This file is typically in the order of just a couple of MBs. Besides being used as a buffer for transferring the memory pages it also enables bi-directional communication between the two hosts, allowing the vMotion to complete as though the network hadn’t failed. Is that cool or what?
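The demand-fetch behavior described above can be sketched as a small toy model. Everything here (the class, its methods, the page contents) is hypothetical and only meant to illustrate the idea of running at the destination while missing pages are pulled over on first touch; the shared-storage backup file is not modeled at all.

```python
class ToyQuickResume:
    """Toy model: the destination runs before all pages have arrived."""

    def __init__(self, source_pages, copied):
        self.source = dict(source_pages)  # page number -> contents on source
        self.dest = {p: self.source[p] for p in copied}
        self.demand_fetches = 0

    def access(self, page):
        """Guest touches a page; fetch it from the source if it is missing."""
        if page not in self.dest:
            self.dest[page] = self.source[page]
            self.demand_fetches += 1
        return self.dest[page]

    def background_copy(self, batch):
        """Keep copying the remaining pages, `batch` pages at a time."""
        missing = [p for p in self.source if p not in self.dest][:batch]
        for p in missing:
            self.dest[p] = self.source[p]
        return len(missing)

    def done(self):
        return len(self.dest) == len(self.source)


source = {n: f"data-{n}" for n in range(10)}
vm = ToyQuickResume(source, copied=range(6))  # pages 6-9 not copied yet

value = vm.access(8)  # touched before it was copied: pulled on demand
while not vm.done():
    vm.background_copy(batch=2)

print(value, vm.demand_fetches, vm.done())
```

Only the access to a page that has not arrived yet pays the extra cost of a remote fetch; every other access is served locally.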

The typical question that arises immediately is whether this will impact performance. It is good to realize that without Quick Resume, vMotioning large, memory-active virtual machines would be difficult. The switch-over time could potentially be too long and lead to temporary loss of connection with the virtual machine. Although Quick Resume will impact performance when pages that are not copied yet are accessed, the benefits of being able to vMotion very large virtual machines with minimal impact by far outweigh this temporary increase of memory access time.

There are so many cool features and enhancements in vSphere that I just keep being amazed.