ESX

Increasing the time-out within vCenter for remote ESX hosts

Duncan Epping · Feb 11, 2009 ·

One of my colleagues is deploying an enormous VI3 environment. The customer wanted one central management console for all ESX hosts, most of which are located in satellite offices (one central management system for more than 200 remote hosts). With a 1Gb or faster link this shouldn’t be a problem, but this customer had 64Kb links between the satellite offices and headquarters. As a result, most ESX hosts were displayed as “disconnected” most of the time. To avoid this, a time-out value for vCenter was increased:

The ESX host sends a heartbeat every 10 seconds, and the VirtualCenter server has a window of 20 seconds to receive it. If the UDP heartbeat message is not received within that window, the VirtualCenter server treats the ESX host as not responding.

With a higher timeout limit, VirtualCenter will show the ESX host as continuously “connected”.
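To make the effect of the window concrete, here is a minimal Python sketch of the idea. It is a simplified model and not how VirtualCenter is actually implemented: the function name `not_responding_intervals` and the sample arrival times are made up for illustration, and the only rule it applies is "a host is flagged when no heartbeat arrives within the timeout after the previous one".

```python
def not_responding_intervals(arrivals, timeout):
    """Return (start, end) periods during which a host would be flagged.

    arrivals: sorted heartbeat arrival times in seconds (hypothetical data).
    timeout:  seconds the server waits after the last heartbeat before
              flagging the host as not responding.
    """
    flagged = []
    for previous, current in zip(arrivals, arrivals[1:]):
        if current - previous > timeout:
            # The host is flagged from the moment the window expires
            # until the next heartbeat finally arrives.
            flagged.append((previous + timeout, current))
    return flagged


# Heartbeats are sent every 10 seconds, but a slow link delays one of them.
arrivals = [0, 10, 45, 55, 65]

print(not_responding_intervals(arrivals, 20))  # [(30, 45)]
print(not_responding_intervals(arrivals, 60))  # []
```

With the 20 second window the delayed heartbeat causes a flagged period. With a 60 second window the same delay goes unnoticed, which is the behaviour the change aims for on slow links.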

Edit C:\Documents and Settings\All Users\Application Data\VMware\VMware VirtualCenter\vpxd.cfg
Add the following inside the <vpxd> tags:
<heartbeat> <notRespondingTimeout>60</notRespondingTimeout> </heartbeat>
Restart the VirtualCenter Server service.
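For those who would rather script the edit than change the file by hand, the following is a rough Python sketch. It works on an XML string and leaves reading and writing vpxd.cfg to you. The helper name `set_not_responding_timeout` and the sample document (a `<config>` root containing `<vpxd>`) are assumptions for illustration, so check them against your own file and back it up first.

```python
import xml.etree.ElementTree as ET


def set_not_responding_timeout(cfg_xml, seconds):
    """Return cfg_xml with <heartbeat><notRespondingTimeout> set under <vpxd>."""
    root = ET.fromstring(cfg_xml)
    # Accept <vpxd> as the root itself or anywhere below it.
    vpxd = root if root.tag == "vpxd" else root.find(".//vpxd")
    if vpxd is None:
        raise ValueError("no <vpxd> element found")

    # Reuse existing elements so running this twice does not duplicate them.
    heartbeat = vpxd.find("heartbeat")
    if heartbeat is None:
        heartbeat = ET.SubElement(vpxd, "heartbeat")
    timeout = heartbeat.find("notRespondingTimeout")
    if timeout is None:
        timeout = ET.SubElement(heartbeat, "notRespondingTimeout")
    timeout.text = str(seconds)

    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed stand-in for a real vpxd.cfg.
sample = "<config><vpxd><das/></vpxd></config>"
updated = set_not_responding_timeout(sample, 60)
print(updated)
```

The VirtualCenter Server service still needs a restart afterwards for the new value to take effect.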

Blades and HA / Cluster design

Duncan Epping · Feb 9, 2009 ·

After reading Aaron’s excellent articles (1, 2) on Scott Lowe’s blog I remembered a discussion I had with a couple of co-workers. The discussion was about VMware HA cluster design in blade environments.

The thing that started this discussion was an HA “problem” that occurred at a customer site. This specific customer had 2 blade chassis to avoid a single point of failure in his virtual environment. All blade servers were joined in one big cluster to get the most out of the environment in terms of Distributed Resource Scheduling.

Unfortunately for this customer, at one point in time one of his blade chassis failed. In other words, power off on the chassis, all blades gone at the same time. The first thing that comes to mind is: HA will kick in and the VMs will be up and running in no time.

Patches for 3.5

Duncan Epping · Jan 31, 2009 ·

VMware just released 9 new patches for ESX 3.5:

ESX350-200901401-SG – PATCH – Security – KB 1006651 – Updates VMkernel, VMX and hostd. The most important fix this patch contains is definitely: VMware ESX and ESXi 3.5 U3 I/O failure on SAN LUNs, and LUN queue is blocked indefinitely. For a full description of this issue, see http://kb.vmware.com/kb/1008130.
ESX350-200901402-SG – PATCH – Security – KB 1006652 – Security Update to ESX Scripts
ESX350-200901404-BG – PATCH – Critical – KB 1006654 – Updates VMware Tools
ESX350-200901405-BG – PATCH – General – KB 1006655 – Updates VMware-esx-lnxcfg
ESX350-200901406-BG – PATCH – General – KB 1006656 – Updates Kernel Source and VMNIX
ESX350-200901407-BG – PATCH – General – KB 1006657 – Updates Pegasus
ESX350-200901408-BG – PATCH – Critical – KB 1006658 – Updates SATA Drivers
ESX350-200901409-SG – PATCH – Security – KB 1006659 – SNMP Security Update
ESX350-200901410-SG – PATCH – Security – KB 1006660 – Security Update for libxml2

There’s one “patch” released for ESXi (installable and embedded) 3.5:

ESXe350-200901401-O-SG – PATCH – Security – KB 1006661 – KB 1006662 – KB 1007058 – Firmware Update
If you are using SRM in combination with ESX, be sure to read KB 1006661, because this patch contains several fixes related to SRM.

A whole bunch of 3.0.x patches were also released, so if you’re still running 3.0.x, be sure to look into them.

RE: ESXTOP Drilldown (Jason Boche)

Duncan Epping · Jan 29, 2009 ·

I was working on an ESXTOP post when Jason Boche published his blog post “ESXTOP Drilldown”. My post was similar, so I decided to dump it and start over again in a few weeks or so.

Yesterday I encountered a performance issue at a customer site. One thing I’ve learned over the last couple of years is that “ESXTOP” can be very useful in pinpointing performance issues, so writing this article happened sooner than I expected. The customer measured all sorts of counters within the VM, and all the symptoms made the customer conclude that the problem was related to the virtual SCSI controller and/or the virtual hard disks (VMDKs). The symptoms were high “Physical Disk\Avg. Disk sec/Transfer” and peak “Physical Disk\Avg. Disk Writes/Sec” behaviour. In other words, transferring data to and from the disk took too long and there wasn’t a constant stream of I/O.

Replicate Datacenter Analyzer 1.2

Duncan Epping · Jan 29, 2009 ·

I was just testing Replicate Datacenter Analyzer (RDA) 1.2 at a customer site. Well, “testing” might not be the correct word in this case. RDA 1.2 discovered several things which are impossible to discover manually when you’ve got 5+ hosts. In this case there were over 50 hosts and RDA exposed the following:

inconsistent portgroup names
inconsistent portgroup provisioning on hosts
multiple VMs with disk files on more than one datastore
multiple VMs with more than one connected NIC

RDA can do a lot more of course, so I suggest you head over to their website, download the demo and see whether your VI3 environment is healthy. For those who already tested the previous release, 1.2 offers the following new capabilities:

New IP Knowledge Module – including the ability to detect and resolve configuration issues across a broader range of network issues. RDA can now identify routing and subnet misconfiguration and can determine if a guest VM network stack is operating correctly, as well as check for duplicate IP address usage in a common subnet.

Expanded drill down diagnostics – providing data to explain issues and guide IT towards a quick resolution – going beyond the basic identification of errors to save IT time and money.

Advanced item level notification – offering email notifications which now include full details on the exact changes that RDA has detected. The detailed notifications provide IT administrators with the latest information, delivered directly to their inbox.

Broader platform support – including support for VMware ESXi.

Increased scalability – offering significant performance improvements, including enhanced support for large scale datacenters of 100+ hosts.