
Yellow Bricks

by Duncan Epping


vSphere 6.5 and limits, do they still apply to SvMotion?

Duncan Epping · Jul 11, 2018 ·

I had a question this week and I thought I had written about this before, but apparently I had not. Hopefully by now most of you know that the I/O scheduler has changed over the past couple of years. Within ESXi, we moved to a new scheduler, which is often referred to as mClock. This new scheduler, together with a new version of SIOC (Storage I/O Control), also resulted in some behavioral changes. Some of these may be obvious, others are not. I want to explicitly point out two things I have discussed in the past which changed with vSphere 6.5 (and 6.7 as such), both of which are around the use of limits.

First of all: limits and SvMotion. In the past, when a limit was applied to a VM, this would also artificially limit SvMotion. As of vSphere 6.5 (this may apply to 6.0 as well, but I have not verified it) this is no longer the case. Starting with vSphere 6.0, the I/O scheduler creates a queue for every file on the file system (VMFS), for instance to avoid one VM stalling other types of I/O, such as metadata I/O. These queues are called SchedQ and are briefly described by Cormac Hogan here. Of course, there's a lot more to it than Cormac or I discuss here, but I am not sure how much I can share, so I am not going to go there. Either way, if you were used to limits being applied to SvMotion as well, be warned… this is no longer the case.
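
For reference, such a limit is set per VMDK through the disk's Storage I/O allocation. Below is a minimal pyVmomi sketch of my own for applying an IOPS limit to the first disk of a VM; the function name is hypothetical and it assumes you already have a connection and a VM object.

    from pyVim.task import WaitForTask
    from pyVmomi import vim

    def set_iops_limit(vm, iops):
        # grab the first virtual disk of the VM
        disk = next(d for d in vm.config.hardware.device
                    if isinstance(d, vim.vm.device.VirtualDisk))
        alloc = disk.storageIOAllocation or vim.StorageResourceManager.IOAllocationInfo()
        alloc.limit = iops  # IOPS limit, -1 means unlimited
        disk.storageIOAllocation = alloc
        change = vim.vm.device.VirtualDeviceSpec(
            operation=vim.vm.device.VirtualDeviceSpec.Operation.edit,
            device=disk)
        WaitForTask(vm.ReconfigVM_Task(vim.vm.ConfigSpec(deviceChange=[change])))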

Secondly, the normalization of I/Os changed with limits. In the past, when a limit was applied, I/Os were normalized at 32KB, meaning that a 64KB I/O would count as 2 I/Os and a 4KB I/O would count as 1. This was confusing for a lot of people, and as of vSphere 6.5 it is no longer the case. When you place a limit of 100 IOPS, the VMDKs will be limited to 100 IOPS, regardless of the I/O size. This, by the way, was documented here on storagehub; I am not sure how many people realized it.
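
To see what this change means in practice: under the old behavior a 100 IOPS limit on a workload issuing 64KB I/Os effectively allowed only around 50 of those I/Os per second, while as of vSphere 6.5 it allows 100. A small illustrative sketch of my own, just to show the accounting (no API involved):

    def ios_charged(io_size_kb, pre_65):
        # pre-6.5: I/Os were normalized at 32KB when enforcing a limit
        if pre_65:
            return max(1, -(-io_size_kb // 32))  # ceiling division
        # 6.5 and later: one I/O counts as one I/O, regardless of size
        return 1

    print(ios_charged(64, pre_65=True))    # 2 -> a 100 IOPS limit allows ~50 x 64KB I/Os
    print(ios_charged(4, pre_65=True))     # 1
    print(ios_charged(64, pre_65=False))   # 1 -> a 100 IOPS limit allows 100 x 64KB I/Os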

Opvizor Performance Analyzer for vSAN

Duncan Epping · Jul 10, 2018 ·

At a VMUG a couple of months ago I bumped into my old friend Dennis Zimmer. Dennis told me he was working on something cool for vSAN but couldn't reveal what it was just yet. Last week I had a call with Dennis about what that something was. Dennis is the CEO of Opvizor, and some of you may recall the different tooling Opvizor has produced over the years, of which the Health Analyzer was probably the most famous back then. I've used it in the past on various occasions and I had various customers using it. During the briefing, Dennis explained to me that Opvizor started focusing on performance monitoring and analytics a while ago, as the health analyzer market was overly crowded and had the issue that it was a one-off business (checks once in a while instead of daily use). On top of that, many products now come with some form of health analysis included (see vSAN for instance). I have to agree with Dennis, so this pivot towards performance monitoring makes much sense to me.

Dennis explained to me how they are seeing more and more customer demand for vSAN performance monitoring, especially combined with VMware ESXi, VM and application data. Although vCenter has various metrics, and there's vROps, he told me that Opvizor has many customers who need more than vCenter or vROps Standard has to offer today and who don't own vROps Advanced. This is where Opvizor Performance Analyzer comes into play, and that is why today Opvizor announced they are including vSAN-specific dashboards. This isn't just for vSAN, of course. Opvizor Performance Analyzer covers not just vSAN but also vSphere and various other parts of the stack. When talking with Dennis one thing became clear: Opvizor is taking a different approach than most other solutions. Where most focus on simplifying, hiding, and aggregating, the focus for Opvizor is on providing as much relevant detail as possible to fulfill the needs of both beginners and professionals.

So how does it work? Opvizor provides a virtual appliance. You simply deploy it in your environment, connect it to vCenter, and you are ready to go. The appliance collects data every 5 minutes (at a 20-second granularity within those 5 minutes) and has a retention of up to 5 years. As I said, the focus is on infrastructure statistics and performance analytics, and as such Opvizor delivers all the data you will ever need.

It doesn't just provide you with all the info you will ever need. It also allows you to overlay different metrics, which makes performance troubleshooting a lot easier and allows you to correlate and pinpoint particular problems. Opvizor comes with dashboards for various aspects; here are the ones included in the upcoming release for vSAN:

  • Capacity and Balance
  • Storage Diskgroup Stats
  • VM View
  • Physical disk latency breakdown
  • Cache Diskgroup stats
  • vSAN Monitor

Now, I said this is the expert's troubleshooting tool, but Opvizor Performance Analyzer also provides in-depth information about what each metric is and means, and it provides starter dashboards for beginners. You can simply click on the "i" in the top left corner of a widget and you get all the info about that particular widget.

When you do know what you are looking for, you can click, hover, and zoom as needed. Hover over a specific section in the graph and the point-in-time values of the metrics will pop up. In the case below I was drilling down on a VM in the vSAN cluster, looking specifically at write latency. As you can see, we have 3 objects, in particular 2 disks and a "vm name space".

And this is just a random example, there are many metrics to look at and many different widgets and overviews. Just to give you an idea, here are some of the metrics you can find in the UI:

  • Latency (for all different components of the stack)
  • IOPs (for all different components of the stack)
  • Bandwidth (for all different components of the stack)
  • Congestion (for all different components of the stack)
  • Outstanding I/O (for all different components of the stack)
  • Read Cache Hit rate (for all different components of the stack)
  • ESXi vSAN host disk usage
  • ESXi vSAN host cpu usage
  • Number of Components
  • Disk Usage
  • Cache Usage

And there's much more; too much to list in this blog. And again, it is not just vSAN, there are many dashboards to choose from. If you don't have a performance monitoring solution yet and you are evaluating solutions like SolarWinds, Turbonomic and others, make sure to add Opvizor to that list. One thing I have to say: I spotted a couple of things that I wanted to see changed, and I think within 24 hours the Opvizor guys managed to incorporate the feedback. That was a crazy fast turnaround, good to see how receptive they are.

Oh, one more thing I found in the interface: there are dashboards that deal with things like NUMA, but also things like the top 10 VMs in terms of IOPS. Both are very useful, especially when doing deep performance troubleshooting and optimizing.

I hope that gives you a sense of what they can do. There's a fully functional 30-day trial; check it out if you want to find out more about Performance Analyzer or simply want to play around with it. Opvizor announced this brand new version on their own blog here, make sure to give that a read as well!

Adding a fifth (virtual) ESXi host to vCenter Foundation

Duncan Epping · Jul 6, 2018 ·

When running a 4-node stretched cluster environment, it should be possible to use "cheaper" vCenter Server licenses, namely vCenter Foundation. One of the limitations of vCenter Foundation is that you can only manage 4 hosts with it. This is where some customers who wanted to manage a stretched cluster hit some issues. The issue occurs at the point where you want to add the Witness VM to the inventory. Deploying the VM, of course, works fine, but it becomes problematic when you add the virtual ESXi host (the Witness Appliance) to the vCenter Foundation instance, as vCenter simply will not allow you to add a 5th host. Yes, this 5th host would be a witness, will not be running any VMs, and even has a special license. Yet the "add host" wizard does not differentiate between a regular host and a virtual witness appliance.

Fortunately, there's a workaround. It is fairly straightforward and has to do with the order in which you add hosts to vCenter Foundation. If you add the witness VM before the physical hosts, the appliance is not counted against the license. The license count (and allocation) apparently happens after the host has been added, but vCenter does validate the host count beforehand. That is why a witness added as the fifth host is blocked, presumably because at that point it has not received its special witness license yet. I guess we do this to avoid abuse.

So if you have vCenter Foundation and want to build a stretched cluster leveraging a 2+2+1 configuration, meaning 4 physical hosts and 1 witness VM, then simply add the Witness VM to the inventory as a host first and then add the rest. For those wondering: yes, this is documented in the release notes of vSphere 6.5 Update, all the way at the bottom.
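
For those who prefer to script this, the same ordering can be applied with pyVmomi. The sketch below is my own and makes a number of assumptions (host names, credentials, a single datacenter, and that certificate thumbprints are handled separately); it only illustrates the "witness first, physical hosts second" order described above.

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()  # lab only
    si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                      pwd="VMware1!", sslContext=ctx)
    content = si.RetrieveContent()
    datacenter = content.rootFolder.childEntity[0]   # assumes a single datacenter
    cluster = datacenter.hostFolder.childEntity[0]   # assumes the stretched cluster already exists

    def connect_spec(hostname):
        # in practice you may also need to set spec.sslThumbprint
        return vim.host.ConnectSpec(hostName=hostname, userName="root",
                                    password="VMware1!", force=True)

    # 1. Add the witness appliance to the inventory first (as a standalone host)
    WaitForTask(datacenter.hostFolder.AddStandaloneHost_Task(
        spec=connect_spec("witness.lab.local"), addConnected=True))

    # 2. Only then add the four physical hosts to the cluster
    for host in ["esxi01", "esxi02", "esxi03", "esxi04"]:
        WaitForTask(cluster.AddHost_Task(spec=connect_spec(host + ".lab.local"),
                                         asConnected=True, resourcePool=None))

    Disconnect(si)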

Trigger APD on iSCSI LUN on vSphere

Duncan Epping · Jun 21, 2018 ·

I was testing various failure scenarios in my lab today for the vSphere Clustering Deepdive session I have scheduled for VMworld. I needed some screenshots and log files of a datastore hitting an APD scenario; for those who don't know, APD stands for All Paths Down. In other words: the storage is inaccessible and ESXi doesn't know what has happened or why. vSphere HA has the ability to respond to that kind of failure. I wanted to test this, but my setup was fairly simple and virtual, so I couldn't unplug any cables. I also couldn't make configuration changes to the iSCSI array, as that would rather trigger a PDL (Permanent Device Loss). So how do you test an APD scenario?

After trying various things, like killing the iSCSI daemon (it gets restarted automatically with no impact on the workload), I bumped into this command which triggered the APD:

  • SSH into the host you want to trigger the APD on and run the following command:
    esxcli iscsi session remove -A vmhba65
  • Make sure, of course, to replace "vmhba65" with the name of your iSCSI adapter

This triggered the APD, as witnessed in the fdm.log and vmkernel.log, and ultimately resulted in vSphere HA killing the impacted VM and restarting it on a healthy host. Anyway, I just wanted to share this as I am sure there are others who would like to test APD responses in their labs or before their environment goes into production.
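
If you want to repeat this in your lab without manually SSH'ing into each host, the same command can of course be scripted. A small sketch of my own using paramiko; the host name, credentials and adapter name are placeholders for your own environment:

    import paramiko

    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # lab only
    ssh.connect("esxi01.lab.local", username="root", password="VMware1!")

    # remove all iSCSI sessions on the adapter to trigger the APD condition
    stdin, stdout, stderr = ssh.exec_command("esxcli iscsi session remove -A vmhba65")
    print(stdout.read().decode(), stderr.read().decode())
    ssh.close()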

There may be other easy ways as well, if you know any, please share in the comments section.

Module MonitorLoop power on failed error when powering on VM on vSphere

Duncan Epping · Jun 12, 2018 ·

I was playing in the lab for our upcoming vSphere Clustering Deepdive book and I ran into this error when powering on a VM. I had never seen it before myself, so I was kind of surprised when I figured out what it was referring to. The error message is the following:

Module MonitorLoop power on failed when powering on VM

Think about that for a second: if you have never seen it before, I bet you don't know what it is about. That's not strange, as the message doesn't give a clue.

If you go to the event, however, there's a big clue right there: the swap file can't be extended from 0KB to whatever size it needs to be. In other words, you are probably running out of disk space on the device the VM is stored on. In this case I removed some obsolete VMs and then powered on the VM that had the issue without any problems. So if you see this "Module MonitorLoop power on failed when powering on VM" error, check the free capacity on the datastore the VM sits on!
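
Since the root cause is the swap file, which at power-on needs roughly the configured memory minus the memory reservation, you can do a quick pre-flight check. Below is a small pyVmomi sketch of my own; it assumes you already have a connection and a VM object, and that the swap file lives on the same datastore as the VM:

    from pyVmomi import vim

    def can_power_on(vm):
        # swap file size at power-on is roughly configured memory minus reservation
        mem_mb = vm.config.hardware.memoryMB
        reservation_mb = vm.config.memoryAllocation.reservation or 0
        swap_needed = (mem_mb - reservation_mb) * 1024 * 1024  # bytes
        for ds in vm.datastore:
            print("%s: %.1f GB free, swap needs ~%.1f GB" % (
                ds.summary.name, ds.summary.freeSpace / 2**30, swap_needed / 2**30))
            if ds.summary.freeSpace < swap_needed:
                return False
        return True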


A strange error message for a simple problem. Yes, I will file a request to get this changed.


