ESXTOP
Intro
Thresholds
Howto – Run
Howto – Capture
Howto – Analyze
References
Changelog
This page is dedicated solely to one of the best tools in the world for ESX: esxtop.
I am a huge fan of esxtop! I read a couple of pages of the esxtop bible every day before I go to bed. However, something I always struggle with is the “thresholds” of specific metrics. I fully understand that it is not black and white; in the end, performance is the perception of the user.
There must be a certain threshold, however. For instance, it must be safe to say that when %RDY constantly exceeds 20, it is very likely that the VM responds sluggishly. I want to use this article to “define” these thresholds, but I need your help. Many people read these articles, and together we must know at least a dozen metrics, so let’s collect and document them along with their possible causes, where known.
Please keep in mind that these should only be used as a guideline when doing performance troubleshooting! Also be aware that some metrics are not part of the default view; for your convenience I’ve added, in parentheses, the character you need to press to add each of them to the view. You can add fields to an esxtop view by pressing “f” followed by the corresponding character.
I used VMworld presentations, VMware whitepapers, VMware documentation, VMTN topics and of course my own experience as sources, and these are the metrics and thresholds I have come up with so far. Please comment and help build the main source for esxtop thresholds.
| Display | Metric | Threshold | Explanation |
| CPU | %RDY | 10 | Overprovisioning of vCPUs, excessive usage of vSMP or a limit (check %MLMTD) has been set. See Jason’s explanation for vSMP VMs. |
| CPU | %CSTP | 3 | Excessive usage of vSMP. Decrease the number of vCPUs for this particular VM. This should lead to increased scheduling opportunities. |
| CPU | %MLMTD | 0 | If larger than 0, the world is being throttled. Possible cause: limit on CPU. |
| CPU | %SWPWT | 5 | VM waiting on swapped pages to be read from disk. Possible cause: memory overcommitment. |
| MEM | MCTLSZ (I) | 1 | If larger than 0, the host is forcing VMs to inflate the balloon driver to reclaim memory because the host is overcommitted. |
| MEM | SWCUR (J) | 1 | If larger than 0, the host has swapped memory pages in the past. Possible cause: overcommitment. |
| MEM | SWR/s (J) | 1 | If larger than 0, the host is actively reading from swap (vswp). Possible cause: excessive memory overcommitment. |
| MEM | SWW/s (J) | 1 | If larger than 0, the host is actively writing to swap (vswp). Possible cause: excessive memory overcommitment. |
| MEM | N%L (F) | 80 | If less than 80, the VM experiences poor NUMA locality. If a VM has a memory size greater than the amount of memory local to each processor, the ESX scheduler does not attempt to use NUMA optimizations for that VM and “remotely” uses memory via the “interconnect”. |
| NETWORK | %DRPTX | 1 | Dropped packets transmitted, hardware overworked. Possible cause: very high network utilization. |
| NETWORK | %DRPRX | 1 | Dropped packets received, hardware overworked. Possible cause: very high network utilization. |
| DISK | GAVG (H) | 25 | GAVG is the sum of DAVG and KAVG; look at both to see where the latency comes from. |
| DISK | DAVG (H) | 25 | Disk latency most likely caused by the array. |
| DISK | KAVG (H) | 2 | Disk latency caused by the VMkernel; high KAVG usually means queuing. Check “QUED”. |
| DISK | QUED (F) | 1 | Queue maxed out. Possibly the queue depth is set too low. Check with the array vendor for the optimal queue depth value. |
| DISK | ABRTS/s (K) | 1 | Aborts issued by the guest (VM) because storage is not responding. For Windows VMs this happens after 60 seconds by default. Can be caused, for instance, by failed paths or by an array that is not accepting any IO for whatever reason. |
| DISK | RESETS/s (K) | 1 | The number of commands reset per second. |
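To make the table easier to apply to a set of readings, here is a minimal Python sketch that encodes it as data. It assumes you have already collected the values yourself, keyed by the esxtop column names used in the table (the sample values are hypothetical), and it treats each Threshold as the point at which a metric becomes suspicious once reached. The exceptions are %MLMTD, whose threshold of 0 has to be exceeded, and N%L, where lower is worse. The hints are shortened from the Explanation column, and `check` and `RULES` are illustrative names; like the table itself, the result is only a guideline.

```python
import operator
from typing import Callable, Dict, List, NamedTuple


class Rule(NamedTuple):
    """One row of the thresholds table: a sample is suspicious when op(value, limit) is True."""
    op: Callable[[float, float], bool]
    limit: float
    hint: str


# Limits copied from the table. Most metrics are suspicious once they reach the
# threshold; %MLMTD must exceed 0 and N%L is suspicious when it drops below 80.
RULES: Dict[str, Rule] = {
    "%RDY": Rule(operator.ge, 10, "vCPU overprovisioning, vSMP or a CPU limit"),
    "%CSTP": Rule(operator.ge, 3, "excessive vSMP; reduce the number of vCPUs"),
    "%MLMTD": Rule(operator.gt, 0, "world is throttled, probably by a CPU limit"),
    "%SWPWT": Rule(operator.ge, 5, "waiting on swapped pages; memory overcommitment"),
    "MCTLSZ": Rule(operator.ge, 1, "ballooning; host is overcommitted"),
    "SWCUR": Rule(operator.ge, 1, "host has swapped memory pages"),
    "SWR/s": Rule(operator.ge, 1, "actively reading from swap"),
    "SWW/s": Rule(operator.ge, 1, "actively writing to swap"),
    "N%L": Rule(operator.lt, 80, "poor NUMA locality"),
    "%DRPTX": Rule(operator.ge, 1, "dropped transmit packets"),
    "%DRPRX": Rule(operator.ge, 1, "dropped receive packets"),
    "GAVG": Rule(operator.ge, 25, "check DAVG and KAVG"),
    "DAVG": Rule(operator.ge, 25, "latency most likely caused by the array"),
    "KAVG": Rule(operator.ge, 2, "VMkernel queuing; check QUED"),
    "QUED": Rule(operator.ge, 1, "queue maxed out; check the queue depth"),
    "ABRTS/s": Rule(operator.ge, 1, "guest aborts; storage is not responding"),
    "RESETS/s": Rule(operator.ge, 1, "commands are being reset"),
}


def check(sample: Dict[str, float]) -> List[str]:
    """Return one warning per metric in `sample` that crosses its threshold."""
    values = dict(sample)
    # GAVG is the sum of DAVG and KAVG, so derive it when only the parts are known.
    if "GAVG" not in values and "DAVG" in values and "KAVG" in values:
        values["GAVG"] = values["DAVG"] + values["KAVG"]
    warnings = []
    for metric, value in values.items():
        rule = RULES.get(metric)
        if rule is not None and rule.op(value, rule.limit):
            warnings.append(f"{metric}={value:g}: {rule.hint}")
    return warnings


if __name__ == "__main__":
    # Hypothetical values read off a single esxtop screen.
    for warning in check({"%RDY": 12.5, "%CSTP": 0.4, "DAVG": 21.0, "KAVG": 6.0, "N%L": 95}):
        print(warning)
```

In this example DAVG alone stays under 25, but the derived GAVG (21 + 6 = 27) does not, and the KAVG warning points at queuing in the VMkernel rather than at the array.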
Although understanding all the metrics esxtop provides may seem impossible, using esxtop is fairly simple. Once you get the hang of it, you will find yourself staring at the metrics and thresholds more often than ever. The following keys are the ones I use the most.
Open a console session or SSH to the ESX(i) host and type:
esxtop
By default the screen is refreshed every 5 seconds; change this (to 2 seconds, for example) by typing:
s 2
Changing views is easy; type the following keys for the associated views:
- c = CPU
- m = memory
- n = network
- i = interrupts
- d = disk adapter
- u = disk device (includes NFS as of 4.0 Update 2)
- v = disk VM
- y = power states
- V = only show virtual machine worlds
- 2 = highlight a row, moving down
- 8 = highlight a row, moving up
- 4 = remove selected row from view
- e = statistics broken down per world
- 6 = statistics broken down per world
Add/remove fields:
f <type appropriate character>
Changing the order:
o <type the character of the field you want to move: uppercase = move left, lowercase = move right>
Saving all the settings you’ve changed:
W
Keep in mind that if you don’t change the filename, the settings will be saved to the default configuration file and used as the default settings from then on.
Help:
?
In very large environments esxtop can cause high CPU utilization due to the amount of data that needs to be gathered and the calculations that need to be done. If CPU utilization appears high because of the number of entities (VMs, LUNs, etc.), a command-line option can be used that locks specific entities and keeps esxtop from gathering specific info, limiting the amount of CPU power needed:
esxtop -l
More info about this command line option can be found here.
First things first: make sure you capture only relevant info and ditch the metrics you don’t need. In other words, run esxtop and use “f” to remove the fields you don’t need and add the ones you do. When you are finished, make sure to write (W) the configuration to disk. You can either write it to the default config file (.esxtop4rc) or write the configuration to a new file.
Now that you have configured esxtop as needed, run it in batch mode and save the results to a .csv file:
esxtop -b -d 2 -n 100 > esxtopcapture.csv
Here “-b” stands for batch mode, “-d 2” is a delay of 2 seconds and “-n 100” is 100 iterations. In this specific case esxtop will log all metrics for 200 seconds (2 seconds × 100 iterations).
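The capture length is simply the delay multiplied by the number of iterations. If you would rather start from how long you want to capture, a small Python sketch like the following works out the iteration count for you. `batch_args` is a hypothetical helper, not part of esxtop; it only builds the same “-b”, “-d” and “-n” options described above and rounds the iteration count up so the whole window is covered.

```python
import math
from typing import List


def batch_args(duration_s: int, delay_s: int = 2) -> List[str]:
    """Build an esxtop batch-mode command line covering at least `duration_s` seconds."""
    if duration_s <= 0 or delay_s <= 0:
        raise ValueError("duration and delay must be positive")
    # Total capture time = delay x iterations, so round the iteration count up.
    iterations = math.ceil(duration_s / delay_s)
    return ["esxtop", "-b", "-d", str(delay_s), "-n", str(iterations)]


if __name__ == "__main__":
    # One hour at a 5-second delay needs 720 iterations.
    print(" ".join(batch_args(3600, delay_s=5)) + " > esxtopcapture.csv")
```

Keep in mind that a longer delay means fewer samples and a smaller file, at the cost of averaging out short spikes.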
You can use multiple tools to analyze the captured data.
- perfmon
- excel
- esxplot
Let’s start with perfmon (part of Windows, also known as “Performance Monitor”), as I’ve used it multiple times and it’s probably the easiest option, since many people are already familiar with it. You can import a CSV as follows:
- Run: perfmon
- Right click on the graph and select “Properties”.
- Select the “Source” tab.
- Select the “Log files:” radio button from the “Data source” section.
- Click the “Add” button.
- Select the CSV file created by esxtop and click “OK”.
- Click the “Apply” button.
- Optionally: reduce the range of time over which the data will be displayed by using the sliders under the “Time Range” button.
- Select the “Data” tab.
- Remove all Counters.
- Click “Add” and select appropriate counters.
- Click “OK”.
- Click “OK”.
The result of the above would be:
[Screenshot: perfmon graph of the imported esxtop counters]
With MS Excel it is also possible to import the data as a CSV. Keep in mind, though, that the amount of captured data is huge, so you might want to limit it first: import it into perfmon, select the correct time frame and counters, and export this selection to a CSV. When you have done so, you can import the CSV as follows:
- Run: excel
- Click on “Data”
- Click “Import External Data” and click “Import Data”
- Select “Text files” as “Files of Type”
- Select file and click “Open”
- Make sure “Delimited” is selected and click “Next”
- Deselect “Tab” and select “Comma”
- Click “Next” and “Finish”
All data should be imported and can be shaped / modelled / diagrammed as needed.
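If you prefer scripting to spreadsheets, the same CSV can also be filtered with a few lines of Python before (or instead of) importing it into Excel. This is a minimal sketch using only the standard csv module. It assumes the batch-mode file has a header row whose first column is the timestamp and whose other columns are perfmon-style counter paths; the host name, VM name and counter names in the sample are hypothetical, so check the header of your own capture for the exact strings. `column_peaks` is an illustrative helper that returns the highest value seen in each matching column.

```python
import csv
import io
from typing import Dict, TextIO


def column_peaks(f: TextIO, needle: str) -> Dict[str, float]:
    """Return the highest value seen in every column whose header contains `needle`."""
    reader = csv.reader(f)
    header = next(reader)
    # Column 0 holds the timestamp; keep only the counter columns that match.
    wanted = [i for i, name in enumerate(header) if i > 0 and needle in name]
    peaks: Dict[str, float] = {}
    for row in reader:  # rows are read one at a time, so large captures stay cheap
        for i in wanted:
            try:
                value = float(row[i])
            except (IndexError, ValueError):
                continue  # skip missing or non-numeric cells
            name = header[i]
            peaks[name] = max(value, peaks.get(name, value))
    return peaks


# Hypothetical two-sample capture; real files have one column per counter and entity.
HEADER = r'"(PDH-CSV 4.0) (UTC)(0)","\\esx01\Group Cpu(1:vm01)\% Ready","\\esx01\Group Cpu(1:vm01)\% Used"'
SAMPLE = "\n".join([
    HEADER,
    '"01/07/2010 10:00:00","3.5","40.1"',
    '"01/07/2010 10:00:02","12.0","55.0"',
])

if __name__ == "__main__":
    for counter, peak in column_peaks(io.StringIO(SAMPLE), "% Ready").items():
        print(f"{counter}: peak {peak}")
```

For a real capture, open the file with `open("esxtopcapture.csv", newline="")` and pass the file object instead of the `io.StringIO` sample; the peaks can then be compared against the thresholds table.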
Another option is to use a tool called “esxplot”. You can download version 1.0 here.
- Run: esxplot
- Click File -> Import -> Dataset
- Select file and click “Open”
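- Double-click the host name and click on a metric
[Screenshot: esxplot graph of the selected metric, with the legend to the right of the graph]
As you can clearly see in the screenshot above, the legend (to the right of the graph) is too long. You can modify that as follows: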
- Click “File” -> “Preferences”
- Select “Abbreviated legends”
- Enter an appropriate value
The following documents / articles have been used as a reference:
- Interpreting esxtop Statistics
- Performance Troubleshooting for VMware vSphere 4 and ESX 4.0
- Hypervisor.fr – Easter Eggs esxtop
- Performance Training
- Using Perfmon for esxtop
- esxplot
- remove vertical lines from perfmon
07-01-2010 | decreased %RDY from 20 to a value of 10
22-01-2010 | added CPU –> TIMER/S
22-01-2010 | added MEM –> N%L
24-01-2010 | added sections (howto)
02-02-2010 | expanded analyze section and included screenshots
10-02-2010 | decreased %CSTP from 100 to 5
10-02-2010 | decreased KAVG from 5 to 2
23-03-2010 | increased %SWPWT from 1 to 5
23-03-2010 | added “e”, “V”, “i”, “2”, “4”, “6”, “8” in the “views” section
16-06-2010 | added “-l” functionality and highlighted that the disk device view now includes NFS
Hi Duncan, first off great article, it’s taking shape nicely.
Quick question for you.
When adding / removing fields with the f switch and then saving the changes using the w key does this affect every future ESXTOP session or is it just the active session?
Thanks
Craig
Craig,
Yes, when saving with w it will save the settings for esxtop for the current user. The settings are saved in a file (.esxtop3rc, .esxtop310rc or .esxtop4rc depending on your esx version) stored in the home directory of the user. Since the file starts with a ‘.’ it’s hidden from ls unless you use the ‘-a’ parameter.
Lars
Yes, if you save it with the default file name. The best option would be to pick a custom name and refer to it from the command line.
Great article. Had no idea we could leverage perfmon, so in a MS-centric environment this is great news.
I wonder then if we could leverage PAL to create some compelling reports for management.
I’ll be coming back to this site. Keep up the good work!