Sudharsan says
Another great one from Duncan to prove why you always top the list!! A good, handy reference whenever we troubleshoot performance issues.
Forbes Guthrie says
Duncan, really great idea. (Forbes is now frantically looking to see what he can dump from his reference card so he can squeeze these in :))
PiroNet says
WOW, a great idea to put all that on a score card.
Personally I use a threshold of 10 for %RDY. I use TOP as well to compare that info (IOWAIT).
For me KAVG/cmd above 3ms requires immediate attention!
By the way, GAVG is the sum of DAVG and KAVG, so you should put 30 there…
Cheers,
Didier
Wade H. says
Hi Duncan,
For GAVG, did you mean to use 30ms (the sum of KAVG + DAVG)? Of course the impact of latency depends on the application, but 30ms at the guest is the number I usually look for to denote a possible performance impact. However, the following whitepaper states 50ms as the threshold for definite storage-latency-based performance degradation. http://www.vmware.com/files/pdf/scalable_storage_performance.pdf.
Wade
Duncan says
No, I did not mean to use the sum of KAVG and DAVG. I think a 25ms latency is just bad overall; whether that is 20 on disk and 5 in the kernel or 25 on disk and 0 in the kernel, it's still latency. I do understand what you are saying and will try to mention this explicitly.
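To make the arithmetic in this exchange concrete, here is a minimal Python sketch (the function name, threshold and sample values are illustrative assumptions, not figures from the post) that treats the guest-observed GAVG as DAVG (device) plus KAVG (kernel) and flags the total, whichever way it is split:

GAVG_THRESHOLD_MS = 25  # Duncan's "just bad overall" number; the whitepaper Wade cites uses 50

def check_guest_latency(davg_ms, kavg_ms, threshold_ms=GAVG_THRESHOLD_MS):
    # GAVG is roughly DAVG + KAVG, so judge the total regardless of the split.
    gavg_ms = davg_ms + kavg_ms
    if gavg_ms >= threshold_ms:
        return f"WARN: GAVG {gavg_ms}ms (DAVG {davg_ms}ms + KAVG {kavg_ms}ms)"
    return f"OK: GAVG {gavg_ms}ms"

# Both splits below trip the same warning, which is Duncan's point:
print(check_guest_latency(davg_ms=20, kavg_ms=5))   # 25ms total -> WARN
print(check_guest_latency(davg_ms=25, kavg_ms=0))   # 25ms total -> WARN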
Jason Boche says
You will be hard-pressed to come up with one %RDY value because the value carries a different weighting depending on the number of vCPUs in the VM. The %RDY value is the sum of all vCPU %RDY values for the VM. Some examples:
The max %RDY value of a 1vCPU VM is 100%
The max %RDY value of a 4vCPU VM is 400%
%RDY 20 for a 1vCPU VM is bad. It means 1vCPU is waiting 20% of the time to be scheduled by the VMkernel.
%RDY 20 for a 2vCPU VM is moderately bad. It means 2vCPUs are each waiting 10% of the time to be co-scheduled by the VMkernel.
%RDY 20 for a 4vCPU VM is borderline reasonable. It means 4vCPUs are each waiting 5% of the time to be co-scheduled by the VMkernel.
%RDY 20 for a 1vCPU VM is roughly equivalent to %RDY 80 for a 4vCPU VM.
In the end, the best judge of %RDY severity is end user perception and the threshold is going to vary depending on application characteristics and end user tolerance.
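A tiny Python sketch of the per-vCPU weighting Jason describes (the function name and numbers are illustrative only): since esxtop reports %RDY summed across a VM's vCPUs, divide by the vCPU count to compare VMs of different sizes.

def rdy_per_vcpu(total_rdy_pct, num_vcpus):
    # %RDY is the sum over all vCPUs, so normalize to an average per vCPU.
    return total_rdy_pct / num_vcpus

print(rdy_per_vcpu(20, 1))   # 20.0 -> bad
print(rdy_per_vcpu(20, 2))   # 10.0 -> moderately bad
print(rdy_per_vcpu(20, 4))   #  5.0 -> borderline reasonable
print(rdy_per_vcpu(80, 4))   # 20.0 -> roughly equivalent to 20% on a 1vCPU VM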
Jas
Cody Bunch says
I agree with Jason about %RDY varying based on the number of vCPUs in a VM. A "threshold" for this based on the number of vCPUs could work, however:
%RDY 10 for 1 vCPU
%RDY 20 for 2 vCPU
%RDY 40 for 4 vCPU
Or so… User perception is key, but it's good to have some thresholds set up for troubleshooting/alerting. Better to be a bit ahead of the phone calls.
-Cody
http://professionalvmware.com
Duncan says
Agreed Jason, I will try to add that to the explanation! Would you also say that 10% is the threshold?
Lars Troen says
10% RDY could be a good threshold per vCPU; IMHO 20% per vCPU is a bit too much, even though it was mentioned in a VMTN doc somewhere. It depends on the workload running. For some workloads 10% could be too much, while it would be fine for others.
Lars
Jason Boche says
I treat %RDY like I do context switching – in a green/yellow/red stoplight fashion.
My threshold preference for %RDY is something like:
0-4 per vCPU = green
5-9 per vCPU = yellow
10+ per vCPU = red
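As a rough illustration of Jason's stoplight bands (a sketch only; the banding is his stated preference, the code and names are mine):

def rdy_stoplight(total_rdy_pct, num_vcpus):
    # Normalize to per-vCPU %RDY, then map onto the green/yellow/red bands above.
    per_vcpu = total_rdy_pct / num_vcpus
    if per_vcpu < 5:
        return "green"
    if per_vcpu < 10:
        return "yellow"
    return "red"

print(rdy_stoplight(20, 4))   # 5% per vCPU  -> yellow
print(rdy_stoplight(20, 1))   # 20% per vCPU -> red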
Doug Baer says
Thanks Duncan. This seems to be a global issue with performance monitoring — everyone agrees that performance is generally a perception issue, but it is difficult to find ‘recommended’ or ‘guideline’ threshold values. Part of the problem is that, if they’re published by the vendor, admins tend to treat them as absolute; the other part is that vendors just don’t publish the numbers.
Forbes Guthrie says
I’ve taken these thresholds and made myself a little performance card.
http://www.vreference.com/downloads/vReference-esxtop0.2.pdf
It's a credit-card-sized reminder that I can keep in my wallet. Question to the gurus here: anything I should add/remove?
Horst Mundt says
Could a memory limit that's smaller than the VM's configured memory also cause ballooning/swapping without the host being overcommitted on memory?
bjorn bats says
Really great post.
You are not saying whether the DAVG, KAVG and GAVG values are for reads, writes or commands.
Duncan says
That’s because these are the sum of read+write. For both read and write there are separate views which can be enabled if needed.
Duncan Epping says
@Horst, yes it could/can/will!
bjorn bats says
I know that GAVG/cmd is the sum of GAVG/rd + GAVG/wr, but maybe it's better to mention it in the table.
If I look at the GAVG, DAVG and KAVG values in the knowledge base article
at http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1008205
then I find other (lower) values.
Duncan says
Maybe it is me, but if you read the table it says "Look at 'DAVG' and 'KAVG' as the sum of both is GAVG."
I disagree with the values in the article. They mention 10 and 100 as thresholds? Weird.
Fred Peterson says
Over how many intervals would you consider these thresholds an issue?
That was kind of a rhetorical question, but as VM admins we've all seen one or more of these values exceed the thresholds defined above… but we also recognize that once every 100 intervals isn't necessarily a big deal 🙂
Duncan Epping says
An extended period; peaks for 30-60 seconds, I would say. But again, it also depends on the metric.
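A rough sketch of what "peaks for 30-60 seconds" could look like as a check (assuming esxtop's default 5-second refresh, so roughly 6-12 consecutive samples; the window, threshold and sample data are assumptions for illustration):

def sustained_breach(samples, threshold, window=6):
    # Flag only when the threshold is exceeded for `window` consecutive samples.
    run = 0
    for value in samples:
        run = run + 1 if value >= threshold else 0
        if run >= window:
            return True
    return False

spiky     = [5, 8, 30, 6, 7, 28, 9, 5, 31, 6]       # isolated spikes only
sustained = [6, 27, 28, 30, 29, 31, 26, 30, 7, 5]   # a 30+ second run above 25
print(sustained_breach(spiky, threshold=25))        # False
print(sustained_breach(sustained, threshold=25))    # True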
Craig Risinger says
Just to complicate matters with SMP VMs, you can’t assume the sum is evenly split among all vCPUs.
Where possible, look at %Ready for each vCPU. If any is high, there’s probably a problem.
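A small sketch of Craig's point (the per-vCPU readings and threshold below are made up): check each vCPU's %RDY individually rather than assuming the VM total is evenly split.

def uneven_rdy_problem(per_vcpu_rdy, red_threshold=10):
    # Flag the VM if any single vCPU exceeds the per-vCPU threshold.
    return any(rdy >= red_threshold for rdy in per_vcpu_rdy)

# Total %RDY is 20 on a 4vCPU VM (a 5% average looks fine), but one vCPU sits at 17%:
print(uneven_rdy_problem([17, 1, 1, 1]))   # True  -> worth investigating
print(uneven_rdy_problem([5, 5, 5, 5]))    # False -> evenly spread and low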
Ochoa says
In regard to disk performance:
Is System Center Operations Manager (MS) a good tool to use for gathering disk performance data for VMs? Which metrics should one use to gather this type of information? My goal is to use SCOM to gather disk performance statistics for VMs, but I'm not sure how to go about it at this point and whether this tool will do the job for the VM environment. Also, are the same metrics used to gather stats for the VM and the host?
dharmesh says
Is there any relation between SPLTCMD/s and GAVG/rd or GAVG/wr latencies?
I am confused, as I see high SPLTCMD/s where latency is high for guest reads or writes.
Basically, is SPLTCMD/s about multipathing, or IO sizes, or partition boundary conditions?
Does this affect latency per command? E.g. if GAVG/cmd = 300 ms and SPLTCMD/s = 30,
does this mean that latency per read is now 10 ms (for whatever MB/s throughput)?