performance

VMworld esxtop advanced session

Duncan Epping · Nov 8, 2010 ·

During my flight from Boston back to the Netherlands I listened to the VMworld esxtop session “Troubleshooting using ESXTOP for Advanced Users, TA6720”. As always it was an excellent session with a lot of in-depth info. Most of it was already documented, but there were a couple of key points that I hadn’t documented yet. I have just added those to my esxtop page and want to stress them here, as I personally believe this is very useful info. The list below may seem pretty random, but in my opinion it rolled up nicely into the esxtop page.

%SYS should be less than 20. %SYS is the percentage of time spent by system services on behalf of the world. The possible system services are interrupt handlers, bottom halves, and system worlds.
-b = batch mode, adding “-a” will force all metrics to be gathered
Limit display to a single group (l)
- enables you to focus on a specific VM
Limiting the number of entities (#)
- this enables you for instance to watch the top 5 worlds for
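
As a rough illustration of how the %SYS threshold could be checked against a batch mode capture, here is a minimal Python sketch. It assumes the capture was made with “-b” and saved as CSV, and that the %SYS counters show up as columns whose name ends in “\% System”. That column naming, the host name and the sample values are assumptions for the example, so verify them against your own capture.

```python
import csv
import io

SYS_THRESHOLD = 20.0  # percent, the %SYS threshold mentioned above


def worlds_over_sys_threshold(csv_text, threshold=SYS_THRESHOLD):
    """Return {column name: peak value} for every %SYS column whose
    highest sample exceeds the threshold."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Assumption: %SYS columns end in "\% System" in the batch output.
    sys_columns = [i for i, name in enumerate(header)
                   if name.endswith("\\% System")]
    peaks = {}
    for row in reader:
        for i in sys_columns:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip empty or malformed samples
            name = header[i]
            if value > peaks.get(name, float("-inf")):
                peaks[name] = value
    return {name: peak for name, peak in peaks.items() if peak > threshold}


# Made-up sample data, two groups and two samples each.
sample = (
    '"(PDH-CSV 4.0) (UTC)(0)",'
    '"\\\\host1\\Group Cpu(100:vm-a)\\% System",'
    '"\\\\host1\\Group Cpu(200:vm-b)\\% System"\n'
    '"11/08/2010 10:00:00","3.5","25.1"\n'
    '"11/08/2010 10:00:05","4.0","22.0"\n'
)

for name, peak in worlds_over_sys_threshold(sample).items():
    print(f"{name}: peak %SYS {peak}")
```

The sketch reports the peak rather than the average, so a single spike is enough to flag a group. Whether that is what you want depends on the issue you are chasing.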

I have also added thresholds for ZIP/s, UNZIP/s and CACHEUSD. These should of course be 0 from a performance perspective as anything larger than 0 means the host was overcommitted on memory and had to resort to memory compression.

If anyone has more metrics/thresholds to contribute which they used in the past to troubleshoot issues let me know!

How many pages can be shared if Large Pages are broken up?

Duncan Epping · Nov 7, 2010 ·

I have written multiple articles (1, 2, 3, 4) on this topic, so hopefully by now everyone knows that Large Pages are not shared by TPS. However, when there is contention the large pages will be broken up into small pages and those will be shared based on the outcome of the TPS algorithm. Something I have always wondered about, and discussed with the developers a while back, is whether it would be possible to have an indication of how many pages could be shared if Large Pages were broken down. (Please note that we are talking about Small Pages backed by Large Pages here.) Unfortunately there was no option to reveal this back then.

While watching the VMworld esxtop session “Troubleshooting using ESXTOP for Advanced Users, TA6720” I noticed something really cool, which is shown in the quote and the screenshot below, both taken from the session. Other new metrics that were revealed in this session and shown in this screenshot are around Memory Compression. I guess the screenshot speaks for itself.

COWH : Copy on Write Pages hints – amount of memory in MB that are potentially shareable,

Potentially shareable which are not shared. For instance when large pages are used, this is a good hint!!

There was more cool stuff in this session that I will be posting about this week, or at least adding to my esxtop page for completeness.

Did you know? SCSI Reservations…

Duncan Epping · Oct 26, 2010 ·

Today we had an interesting discussion on the VCDX mailing list. One thing I noticed a while back when I was randomly looking around in “esxtop” was a new field. The field is called “RESVSTATS” and can be enabled in all disk related displays (d, u, v).

This will make troubleshooting storage related performance issues a bit easier: when the field is enabled, the SCSI Reservations (RESV/S) are shown as a column, and even more specifically the SCSI Reservation Conflicts (CONS) are shown next to it.

Memory Limits

Duncan Epping · Jul 6, 2010 ·

We had a discussion internally around memory limits and what the use case would be for using them. I got some great feedback on my reply and comments so I decided to turn the whole thing into a blog article.

A comment made by one of our developers, whom I highly respect, is what triggered my reply. Please note that this is not VMware’s view or use case but what some of our customers feed back to our development team.

“An admin may impose a limit on VMs executing on an unloaded host to better reflect the actual service a VM will likely get once the system is loaded (I’ve heard this use case from several admins).”

From a memory performance perspective that is probably the worst thing an Admin can do in my humble opinion. If you are seriously overcommitting your hosts up to the point where swapping or ballooning will occur you need to think about the way you are provisioning. I can understand, well not really, people doing it on a CPU level as the impact is much smaller.

Andrew Mitchell commented on the same email and his reply is key to understanding the impact of memory limits.

“When modern OS’s boot, one of the first things they do is check to see how much RAM they have available then tune their caching algorithms and memory management accordingly. Applications such as SQL, Oracle and JVMs do much the same thing.”

I guess the best way to explain in one line is: The limit is not exposed to the OS itself and as such the App will suffer and so will the service provided to the user.

The funny thing about this is that although the App might request everything it can get, it might not even need it. In that case, which is more common than we think, it is better to decrease provisioned memory than to create an artificial boundary by applying a memory limit. The limit will more than likely impose an unneeded and unwanted performance impact. Simply lowering the amount of provisioned memory might impact performance, but most likely will not, as the OS will tune its caching algorithms and memory management accordingly.
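
To make the difference between a limit and a smaller VM a bit more tangible, here is a small back-of-the-envelope sketch in Python. All numbers and names are hypothetical. It only compares how much memory the guest touches with how much physical memory the host is allowed to back it with. Anything above that has to be reclaimed by the hypervisor through ballooning or swapping.

```python
def memory_beyond_backing(guest_touched_mb, backing_mb):
    """MB the guest is using that the host may not back with physical RAM."""
    return max(0, guest_touched_mb - backing_mb)


# Scenario 1: 8192 MB provisioned with a 4096 MB limit (hypothetical numbers).
# The guest sees 8192 MB, tunes its caches for that and touches 6144 MB.
provisioned_mb = 8192
limit_mb = 4096
guest_touched_mb = 6144
with_limit = memory_beyond_backing(guest_touched_mb, limit_mb)

# Scenario 2: the same VM right-sized to 4096 MB without a limit.
# The guest sees 4096 MB and cannot touch more than that.
right_sized_mb = 4096
touched_when_right_sized = min(guest_touched_mb, right_sized_mb)
without_limit = memory_beyond_backing(touched_when_right_sized, right_sized_mb)

print(f"With limit:  {with_limit} MB has to be ballooned or swapped")
print(f"Right-sized: {without_limit} MB has to be ballooned or swapped")
```

This deliberately ignores things like TPS, memory overhead and host contention. The point is only that the limit creates a gap the guest does not know about, while right-sizing does not.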

CMDS/s vs IOPS?

Duncan Epping · Jun 24, 2010 ·

Today I received a question around the difference between IOPS and CMDS/s. The reason for this was the high value of CMDS/s in “esxtop”, which exceeded the expected amount of IOPS the disks could actually digest. I thought it would be useful for everyone to know what the difference is:

IOPS = Input/Output Operations Per Second
- Within esxtop this would be the outcome of “Number of Read commands (READS/s) + Number of Write commands (WRITES/s)”
CMDS/s = Total commands per second
- Within esxtop this includes any command (for instance SCSI reservations) and not necessarily only read/write IOs

One thing to stress though is that in any case the CMDS/s should be relatively close to IOPS, but when there are a lot of metadata changes due to snapshots for instance the difference can be significant. Where this significant difference came from is something we are still investigating and we are hoping to solve pretty soon. If we manage to solve it you can expect an update here.
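
The relationship between the counters can be sketched in a few lines of Python. The function name and the sample values are made up for the example. The only thing taken from the definitions above is that IOPS is READS/s plus WRITES/s and that CMDS/s also counts other commands.

```python
def split_commands(cmds_per_s, reads_per_s, writes_per_s):
    """Return (iops, other commands per second, share of other commands)."""
    iops = reads_per_s + writes_per_s
    # Whatever is left over is not a read or write IO,
    # for instance SCSI reservations.
    other = max(0.0, cmds_per_s - iops)
    share = other / cmds_per_s if cmds_per_s else 0.0
    return iops, other, share


# Hypothetical sample taken from a single esxtop refresh.
iops, other, share = split_commands(cmds_per_s=1200.0,
                                    reads_per_s=600.0,
                                    writes_per_s=300.0)
print(f"IOPS: {iops:.0f}, non read/write commands: {other:.0f} ({share:.0%})")
```

A large share of non read/write commands is the kind of difference described above and is worth a closer look.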