** PLEASE NOTE: This article was written in 2011 and discusses how to monitor memory usage, which is different than memory / capacity sizing. For more info on “active memory” read this article by Mark A. **
This question has come up several times over the last couple of weeks, so I figured it was time to dedicate an article to it. People have always been used to monitoring memory usage in a specific way, mainly by looking at the “consumed memory” stats. This always worked fine until ESX(i) 3.5 introduced the aggressive usage of Large Pages. In the 3.5 timeframe that only worked for AMD processors that supported RVI, and with vSphere 4.0 support for Intel’s EPT was added. Every architectural change has an impact, and in this case the impact is that TPS (transparent page sharing) does not collapse these so-called large pages. (Discussed in-depth here.) This unfortunately resulted in many people having the feeling that there was no real benefit to these large pages, or even worse, the perception that large pages are the root of all evil.
After having several discussions with customers, fellow consultants and engineers we managed to figure out why this perception was floating around. The answer is actually fairly simple: metrics. When monitoring memory most people look at the following section of the host – summary tab:
However, in the case of large pages this metric isn’t actually that relevant. I guess that doesn’t only apply to large pages but to memory monitoring in general, although as explained it used to be an indication. The metric to monitor is “active memory”. Active memory is what the VMkernel believes is currently being actively used by the VM. This is an estimate calculated by a form of statistical sampling, and that estimate will most definitely come in handy when doing capacity planning. Active memory is in our opinion what should be used to analyze trends. Kit Colbert has also hammered on this during his Memory Virtualization sessions at VMworld. I guess the following screenshot is an excellent example of the difference between “consumed” and “active”. Do we need to be worried about “consumed”? Well, I don’t think so; monitoring “active” is probably more relevant at this point! However, it should be noted that “active” represents a 5-minute time slot. It could easily be that the first 5-minute value observed is the same as the second, yet different blocks of memory were touched in each. So it is an indication of how active the VM is. Nothing more than that.
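To make the sampling idea and the 5-minute caveat a bit more tangible, here is a minimal Python sketch. It is a toy simulation and not the VMkernel's actual algorithm: the function name, the page counts, the sample size and the seed are all assumptions picked purely for illustration. It estimates "active" pages by checking a random sample of pages, and it shows two intervals that report roughly the same value even though completely different pages were touched.

```python
import random


def estimate_active_pages(touched, total_pages, sample_size, rng):
    """Estimate the number of active pages from a random sample.

    Hypothetical helper for illustration only: pick `sample_size` pages at
    random, count how many were touched, and scale that fraction up to the
    total number of pages.
    """
    sample = rng.sample(range(total_pages), sample_size)
    hits = sum(1 for page in sample if page in touched)
    return hits * total_pages // sample_size


# All values below are made up for the example.
rng = random.Random(42)      # fixed seed so the run is repeatable
total_pages = 100_000        # pages assigned to the imaginary VM
sample_size = 5_000          # arbitrary; not a real VMkernel setting

# Two consecutive intervals touch the same NUMBER of pages,
# but not the same pages.
interval_1 = set(range(0, 20_000))
interval_2 = set(range(50_000, 70_000))

est_1 = estimate_active_pages(interval_1, total_pages, sample_size, rng)
est_2 = estimate_active_pages(interval_2, total_pages, sample_size, rng)

print("interval 1 estimated active pages:", est_1)
print("interval 2 estimated active pages:", est_2)
print("pages touched in both intervals:", len(interval_1 & interval_2))
```

Both estimates land close to the 20,000 pages that were really touched, while the overlap between the two intervals is zero. That is exactly why an identical “active” value in two consecutive samples only tells you how active the VM is, not which memory it is using.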