VMware just released a whole bunch of patches for 3.0.x, so if you’re still using 3.0.x then you should take a look here. At least three of them are related to Linux guests! There are six patches for 3.0.2 and four for 3.0.3. Have fun!
Bugs
Storage VMotion fails after Service Console IP change
As some of you know I spend a lot of time on the VMTN forums, helping people out and learning from other experts. Today someone posted about Storage VMotion not working after a Service Console IP change. It reminded me of a problem I faced a while ago. The solution wasn’t obvious, but it is fairly easy. Because the problem was solved with a slightly different approach this time, I wrote it down:
Disconnect the ESX host from VirtualCenter
Stop the VMware VirtualCenter Server service
Remove the /etc/opt/vmware/vpxa/vpxa.cfg file from the ESX host that’s affected
Run this script on the database:
UPDATE [VCDB].[dbo].[VPX_HOST]
SET [IP_ADDRESS] = 'w.x.y.z'
WHERE [DNS_NAME] = 'name of esx host as it is listed in the table'
“w.x.y.z” above is the new IP address
Start the VMware VirtualCenter Server service
Add the host to the cluster again
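The ESX-side part of the steps above (removing vpxa.cfg) can be sketched from the Service Console like this. This is a sketch assuming the default ESX 3.x path; the file is regenerated when the host reconnects to VirtualCenter, but taking a backup first costs nothing:

```shell
#!/bin/sh
# Default vpxa config location on ESX 3.x (assumption: standard install path)
CFG=/etc/opt/vmware/vpxa/vpxa.cfg

if [ -f "$CFG" ]; then
    # Keep a backup in case you need to roll back
    cp "$CFG" "$CFG.bak"
    rm "$CFG"
    echo "removed $CFG (backup: $CFG.bak)"
else
    echo "no vpxa.cfg found at $CFG"
fi
```

Once the host is added back to the cluster, vpxa is reinstalled and a fresh vpxa.cfg is written with the correct IP.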
Thanks goes out to BigRolTide for pointing me to this solution. (In my case updating the database wasn’t necessary.)
Update: http://kb.vmware.com/kb/1006768
Back to business again, patches!!
So it’s back to business again after a couple of crazy days during VMworld. All the blogs can return to their normal “technical” writing.
VMware just released a whole bunch of patches for ESX(i) 3.5, so for you SysAdmins it’s also back to work again. You can find them here.
There’s also a new release for VMware’s Lab Manager available, version 3.0 Patch 1. There’s a KB article about this patch release; read it before you apply it.
Weird problems with enabling HA on ESXi
A couple of days ago an ex-colleague phoned me about a weird problem with enabling HA in an ESXi cluster. The following errors occurred:
- Configuration of host IP address is inconsistent on host : address resolved to Host misconfigured. IP address of not found on local interfaces
- cmd addnode failed for primary node: Internal AAM Error – agent could not start
The first error was reported by esxhost01 and the second by esxhost02.
Let’s start with esxhost01.
This customer had a VMotion and a Management portgroup on two separate vSwitches. The error seems to indicate that HA is using the VMotion portgroup during configuration, even though the hosts were added to VirtualCenter with the management portgroup IP (both IP and name are also in DNS). So how do you make sure HA isn’t using the VMotion network? It’s easy: go to your cluster, open up the advanced options for HA and add the following key with the value false:
- das.allowVmotionNetworks=false
In other words, don’t use the VMotion network for the HA heartbeat. The weird thing in this case is that it shouldn’t use the VMotion network by default so there seems to be a glitch here…
So now for the second problem.
The HA (AAM) agent could not start. Just to make sure the USB key wasn’t corrupt, it was recreated, but the error still occurred. As some of you might know, if you want to use HA on a diskless server you will need to create a userworld swap on the SAN. (Read this KB for more info on that one…) So, to make sure the swap wasn’t causing the problem, the swap directory was cleaned out and HA was reconfigured. With the directory emptied, the HA agent installed without any problem at all…
- When reinstalling ESXi or when strange HA errors occur clean up the userworld swap!
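Cleaning out the userworld swap might look like this from the console. A sketch only: the datastore name below is a made-up example, substitute the volume you configured for the swap, and double-check you’re in the right directory before deleting anything:

```shell
#!/bin/sh
# Hypothetical userworld swap location; "datastore1" is an example name,
# use the datastore you actually pointed the uwswap setting at.
SWAPDIR=/vmfs/volumes/datastore1/uwswap

if [ -d "$SWAPDIR" ]; then
    # Report how many stale swap files are about to be removed
    echo "removing $(ls "$SWAPDIR" | wc -l) file(s) from $SWAPDIR"
    rm -f "$SWAPDIR"/*
else
    echo "no userworld swap directory at $SWAPDIR"
fi
```

After that, reconfigure HA on the host and the agent should install cleanly.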
Thanks goes out to Remco for providing me with the additional details!
Why I dislike agents in my Service Console
I’ve never been a huge fan of agents in the Service Console. Too many times I’ve seen hosts fail because of an agent with a memory leak or similar. Now it seems that running the HP IM agents causes ESX 3.5 U2 hosts to become unavailable after a certain amount of time.
The symptom: the process list slowly fills up with defunct cimservera processes:
0 Z root 8536 3673 0 79 0 – 0 nct> Aug05 ? 00:00:00 cimservera
0 Z root 8537 3673 0 79 0 – 0 nct> Aug05 ? 00:00:00 cimservera
0 Z root 8543 3673 0 78 0 – 0 nct> Aug05 ? 00:00:00 cimservera
0 Z root 32350 3673 0 79 0 – 0 nct> Aug06 ? 00:00:00 cimservera
0 Z root 32351 3673 0 79 0 – 0 nct> Aug06 ? 00:00:00 cimservera
0 Z root 32352 3673 0 79 0 – 0 nct> Aug06 ? 00:00:00 cimservera
0 Z root 32353 3673 0 78 0 – 0 nct> Aug06 ? 00:00:00 cimservera
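A quick way to see whether a host is heading for trouble is to count the zombie (“Z” state) processes, which is what the listing above shows piling up. This sketch should run on any Linux-like console, including the ESX Service Console:

```shell
#!/bin/sh
# Count defunct (zombie) processes. On an affected ESX 3.5 U2 host running
# the HP agents this number keeps climbing toward the PID limit.
ZOMBIES=$(ps -e -o stat= -o comm= | awk '$1 ~ /^Z/ { n++ } END { print n+0 }')
echo "zombie processes: $ZOMBIES"
```

On a healthy host this should stay at or near zero; a steadily growing count means the PIDs will eventually run out.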
HStrydom on the VMTN forum posted the following:
I am having the same issue. What happens after 17 days is that there are about 32000 of these processes. ESX has a max value of +- 32000 PID’s. Thus when all have been used up, one cannot SSH into the server, log in from the console or the ESX server disconnects from VC.
Also we have HP servers with the HP agents loaded. Our Dell servers does not have this problem.
Have a look at your cron log, /var/log/cron & cron.1. you might see that some of the job have not run. Also look in your /var/log/messages. There is a lot of login failures.
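The log checks HStrydom mentions can be scripted from the Service Console. A sketch, assuming the standard Linux log paths; adjust them if your install differs:

```shell
#!/bin/sh
# Look for cron jobs that failed to run (current and rotated cron logs)
for LOG in /var/log/cron /var/log/cron.1; do
    [ -f "$LOG" ] && grep -i "fail" "$LOG" | tail -n 3
done

# Count login failures recorded in the messages log
FAILS=$(grep -ci "authentication failure" /var/log/messages 2>/dev/null)
FAILS=${FAILS:-0}
echo "authentication failures in /var/log/messages: $FAILS"
```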
In other words, if you see the same thing happening, call HP and let’s hope they release a fix soon! In the meantime, start thinking about ESXi; it’s problems like these that make you wonder why you need a Service Console in the first place.