troubleshooting

Testing your infrastructure!

Duncan Epping · Jul 16, 2013 ·

Last week I was helping someone on the VMTN community forums who was hitting what appeared to be strange HA behavior. After some standard questions, this person told me that all VMs were powered off after a network outage. Sound like a familiar problem? Yes, I can hear most of you thinking: isolation response set to “power off” and no proper network redundancy?

Well, yes and no. The isolation response was indeed configured to “power off” all VMs when the host is isolated. They did, however, have proper network redundancy, so how on earth did this happen? With 2 physical NICs and 2 physical switches, and only 1 of them impacted, this should not have happened, right?

Wrong! In this case the fail-over from a “vmkernel” perspective worked fine. The first “path” went down, so the second was used for the management vmkernel interface. All VMs were up and running until this point, and they remained running until… the network connection was restored and the management vmkernel interface failed back to the original physical NIC. That meant the MAC address that had shown up on port 1 popped up on port 2 and then went back to port 1 again. The switch was not impressed: it went through the spanning tree process, and traffic was blocked instantly as a result. When traffic is blocked bad things can happen, especially when you configure HA to “power off” VMs. Basically, what caused this issue was that spanning tree was not set to the recommended “PortFast”; more details here.

I knew instantly that this was the cause, not because I know stuff about HA but because I had seen it many times in the past while testing environments I configured and designed. Not just testing after implementing a new infrastructure, but also testing after making changes to an infrastructure or introducing a new version or feature. I guess this comes back to the “disaster” scenario as well: test it if you want to know whether it works as expected. A simple example: I want to introduce QoS for my vMotion network and make changes to my physical network. Now what? How do I test these changes? How many times do I run through my test scenarios? What kind of “problems” do I introduce during my tests?

So I guess by now some might wonder why on earth I brought this up… Well, the problem above could have been prevented by simply testing the infrastructure when it was implemented and after changes were introduced, and maybe even on a regular basis. If HA and networking had been tested properly, those VMs would not have been powered off…

Why is %WAIT so high in esxtop?

Duncan Epping · Jul 17, 2012 ·

I got a question today about %WAIT and why it was so high for all these VMs. I grabbed a screenshot from our test environment. It shows %WAIT next to %VMWAIT.

First of all, I suggest looking at %VMWAIT. This one is more relevant in my opinion than %WAIT. %VMWAIT is derived from %WAIT: it excludes %IDLE time but still includes %SWPWT and the time the VM is blocked while a device is unavailable. That immediately reveals why %WAIT seems extremely high: it includes %IDLE! Another thing to note is that the %WAIT for a VM is the %WAIT of multiple worlds rolled up into a single metric. Let me show you what I mean:

As you can see, there are 5 worlds, which explains why %WAIT is constantly around 500% when the VM is not doing much. Hope that helps…

<edit> I just got pointed to this great KB article by one of my colleagues. It explains various CPU metrics in depth. The key takeaway from that article for me is the following: %WAIT + %RDY + %CSTP + %RUN = 100%. Note that this is per world! Thanks Daniel for pointing this out!</edit>
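To make the arithmetic concrete, here is a minimal sketch of the per-world relationship quoted above and of why a mostly idle VM with 5 worlds sits around 500%. The function names are made up for illustration, and the sample numbers in the usage note are invented, not taken from the screenshot.

```shell
#!/bin/sh
# Per-world identity from the KB article:
#   %WAIT + %RDY + %CSTP + %RUN = 100
# Given the other three values for ONE world, derive that world's %WAIT.
# (world_wait is a hypothetical helper name.)
world_wait() {
    # $1 = %RUN, $2 = %RDY, $3 = %CSTP
    awk -v run="$1" -v rdy="$2" -v cstp="$3" \
        'BEGIN { printf "%.2f\n", 100 - run - rdy - cstp }'
}

# Upper bound for the VM-level %WAIT, which adds up the worlds:
# every world can contribute at most 100%.
# (max_group_wait is a hypothetical helper name.)
max_group_wait() {
    # $1 = number of worlds that make up the VM
    echo $(( $1 * 100 ))
}
```

For example, `world_wait 1.50 0.25 0.00` prints 98.25 for a single, nearly idle world, and `max_group_wait 5` prints 500, which is the ceiling a VM with 5 worlds approaches when it is not doing much.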

Problems using the vCenter Web Client

Duncan Epping · May 7, 2012 ·

I was doing some upgrades in my lab and ran into an issue. Whenever I started the vCenter Web Client I got a message that the vCenter Inventory Service wasn’t running. I looked at the Services section in Windows 2008 and found that it wasn’t started. Starting it gave me a new error: 1067. This is very generic, but I figured I would Google it anyway. That actually brought me to our own documentation (yes, I should check that first next time), which mentioned I could reset the Inventory Service as follows:

Stop the service (was already stopped)
Delete the entire contents of the Inventory_Service_Directory/data directory
Change directory to Inventory_Service_Directory/scripts
Run the createDB.bat command, with no arguments, to reset the vCenter Inventory Service database
Run the register.bat command to update the stored configuration information of the Inventory Service
register.bat vcenter-tm01.testlab.local 443
Restart the vCenter Inventory Service

I also had to re-register the Web Client to vCenter Server. This is what I had to do:

admin-cmd.bat register https://vcenter-tm01.testlab.local:9443/vsphere-client https://vcenter-tm01.testlab.local administrator password

Hope it helps.

vSphere HA Waiting for cluster election to complete Operation timed out?

Duncan Epping · Jan 4, 2012 ·

I noticed this thread on the VMTN community which discussed a time-out during the cluster election process. The one thing all scenarios described in the topic have in common is that they involved an upgrade from 4.1 to 5.0, or from 5.0 base to a higher patch level. Marc Sevigny posted in the same thread that it is a known issue which the HA team is currently investigating…

After an upgrade, under conditions we’re still investigating, an error is occurring when issuing a start request of the HA service on the upgraded host. When that fails, HA then tries to re-install HA, and the re-install does nothing because the service is already there (and the right version) but we’re left without an HA service running.

This is the way to fix it if you are experiencing this issue. If you do experience it, please report it to VMware and submit log files, as that will help the HA team fix the problem.

Place host into Maintenance Mode
Take a copy of /opt/vmware/uninstallers/VMware-fdm-uninstall.sh (we copied to /tmp)
From the location you made a copy of the file, run the command (./VMware-fdm-uninstall.sh)
You should see a short pause before it gets back to the prompt (you’ll see why I mention this below)
Exit the host from Maintenance Mode; within the “Recent Tasks” area you should see the client being pulled from vCenter and installed
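The copy-and-run part of the steps above can be sketched as a small shell helper. This is only an illustration under assumptions: the function name copy_and_run is made up, it presumes the host is already in Maintenance Mode, and entering and exiting Maintenance Mode is still done separately.

```shell
#!/bin/sh
# Copy a script to a working directory and run the copy from there.
# (copy_and_run is a hypothetical helper name, not a VMware tool.)
copy_and_run() {
    # $1 = path to the original script, $2 = directory to copy it into
    src="$1"
    dest_dir="$2"
    name=$(basename "$src")

    # Take a copy of the script instead of running it in place
    cp "$src" "$dest_dir/" || return 1

    # Run the copy from the directory it was copied to; the subshell
    # keeps the caller's working directory unchanged
    ( cd "$dest_dir" && "./$name" )
}

# Intended use on the affected host, with the paths from the steps above:
#   copy_and_run /opt/vmware/uninstallers/VMware-fdm-uninstall.sh /tmp
```

Running the copy in a subshell is a design choice so the helper can be pasted into an interactive session without moving you to another directory.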

ESXi commandline work….

Duncan Epping · Nov 16, 2011 ·

I am just playing around in my lab and needed to do a couple of common ESXi command-line tasks, which I figured I would document as they will come in handy at some point.

List all VMs registered to this host (This reveals the Vmid needed for other commands)
vim-cmd /vmsvc/getallvms
Unregister a VM
vim-cmd /vmsvc/unregister <Vmid>
Register a VM
vim-cmd /solo/registervm /path/to/file.vmx
Get power state of a VM
vim-cmd /vmsvc/power.getstate <Vmid>
Power off a VM
vim-cmd /vmsvc/power.off <Vmid>
Power on a VM
vim-cmd /vmsvc/power.on <Vmid>
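Since most of these commands need the Vmid, here is a minimal sketch of looking it up by VM name from the getallvms listing. The helper name vmid_by_name is made up, and the sketch assumes that the listing starts with a header line, that Vmid is the first column and Name the second, that the VM name contains no spaces, and that awk is available in the shell.

```shell
#!/bin/sh
# Print the Vmid of the VM whose name matches $1.
# Reads the getallvms listing on stdin.
# (vmid_by_name is a hypothetical helper name.)
vmid_by_name() {
    # NR > 1 skips the header line; $1 is the Vmid column, $2 the Name column
    awk -v name="$1" 'NR > 1 && $2 == name { print $1 }'
}

# Intended use on the host (web01 is an example VM name):
#   id=$(vim-cmd /vmsvc/getallvms | vmid_by_name web01)
#   vim-cmd /vmsvc/power.getstate "$id"
```

If a VM has a multi-line annotation, the listing contains extra lines that this simple match does not account for, so check the result before feeding it to power.off or unregister.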