
Yellow Bricks

by Duncan Epping



Leave powered on…

Duncan Epping · Dec 4, 2008 ·

On Friday I got a question about the “Leave powered on” setting for HA. This setting is used for the Isolation Response; in other words, what does HA need to do when a network isolation is detected? The question was pretty straightforward: “What happens when a host is isolated from the network, or when a host dies completely?”

This question was asked because the default setting changed in ESX 3.5 / VirtualCenter 2.5 from “Power off vm” to “Leave powered on”. The old value was simple: whatever happened, the VM would be powered off and restarted on a different host.
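To make the behavior concrete, here is a minimal Python sketch of the isolation-response decision. This is purely illustrative (the function and return strings are my own, not VMware code): a host considers itself isolated when it receives no HA heartbeats and cannot reach its isolation address (by default the gateway), and only then does the configured response come into play.

```python
# Illustrative sketch of the HA isolation-response decision (not VMware code).

POWER_OFF = "Power off"                # old default (pre-ESX 3.5 / VC 2.5)
LEAVE_POWERED_ON = "Leave powered on"  # new default (ESX 3.5 / VC 2.5)

def isolation_response(heartbeats_received: bool,
                       isolation_address_reachable: bool,
                       configured_response: str) -> str:
    """Return what happens to the VMs running on this host."""
    if heartbeats_received or isolation_address_reachable:
        # Either HA heartbeats still arrive, or the network itself is fine
        # (the other hosts probably failed): the host is not isolated.
        return "host is not isolated, VMs keep running"
    # No heartbeats and the isolation address is unreachable: the host
    # declares itself isolated and applies the configured response.
    if configured_response == POWER_OFF:
        return "VMs powered off, restarted on another host"
    return "VMs left running on the isolated host"
```

Note the trade-off the new default makes: with “Leave powered on”, a network isolation no longer causes an unnecessary outage, but a genuinely dead host still gets its VMs restarted elsewhere because its VMFS locks expire.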

Patches, also for ESX 3.0.x

Duncan Epping · Dec 3, 2008 ·

I didn’t even notice it this morning, but I woke up at 05:45, so that should be an excuse. It’s not only ESX 3.5 that needs to be patched; the same goes for 3.0.x hosts. So be sure to check out the patches section of the VMware website if you’ve still got 3.0.x running.

New patches for ESX 3.5

Duncan Epping · Dec 3, 2008 ·

VMware just released a bunch of patches for ESX 3.5, four security updates and one general update:

  • VMware ESX 3.5, Patch ESX350-200811408-BG: Updates QLogic Software Driver
  • VMware ESX 3.5, Patch ESX350-200811406-SG: Security Update to bzip2 in Service Console
  • VMware ESX 3.5, Patch ESX350-200811405-SG: Security Update to libxml2 in Service Console
  • VMware ESX 3.5, Patch ESX350-200811402-SG: Updates ESX Script
  • VMware ESX 3.5, Patch ESX350-200811401-SG: Updates VMkernel, hostd, and Other RPM

And I know for a fact that some of you were waiting for ESX350-200811401 to drop:

  • A memory corruption condition may occur in the virtual machine hardware. A malicious request sent from the guest operating system to the virtual hardware may cause the virtual hardware to write to uncontrolled physical memory. The Common Vulnerabilities and Exposures project (cve.mitre.org) has assigned the name CVE-2008-4917 to this issue.
  • VMotion might trigger VMware Tools to automatically upgrade. This issue occurs on virtual machines that have the Check and upgrade Tools before each power-on setting enabled and that are moved, using VMotion, to a host with a newer version of VMware-esx-tools. Symptoms seen without this patch:
    • Virtual machines unexpectedly restart during a VMotion migration
    • The guest operating systems might stall (reported on forums).

    Note: After patching the ESX host, you need to upgrade VMware Tools in the affected guests that reside on the host.

  • Swapping active and standby NICs results in a loss of connectivity to the virtual machine.
  • A race issue caused an ASSERT_BUG to run unnecessarily and crash the ESX host. This change removes the invalid ASSERT_BUG. Symptoms seen without this patch: the ESX host crashes with an ASSERT message that includes fs3DiskLock.c:1423. Example:

    ASSERT /build/mts/release/bora-77234/bora/modules/vmkernel/vmfs3/fs3DiskLock.c:1423 bugNr=147983
  • A virtual machine can become registered on multiple hosts due to a .vmdk file locking issue. This issue occurs when network errors cause HA to power on the same virtual machine on multiple hosts, and when SAN errors cause the host on which the virtual machine was originally running to lose its heartbeat. The original virtual machine becomes unresponsive. With this patch, the VI Client displays a dialog box warning you that a .vmdk lock is lost. The virtual machine is powered off after you click OK.
  • This change fixes confusing VMkernel log messages in cases where one of the storage processors (SPs) of an EMC CLARiiON CX storage array is hung. The messages now correctly identify which SP is hung. Example of the confusing messages:

    vmkernel: 1:23:09:57.886 cpu3:1056)WARNING: SCSI: 2667: CX SP B is hung.
    vmkernel: 1:23:09:57.886 cpu3:1056)SCSI: 2715: CX SP A for path vmhba1:2:2 is hung.
    vmkernel: 1:23:09:57.886 cpu3:1056)WARNING: SCSI: 4282: SP of path vmhba1:2:2 is
    hung. Mark all paths using this SP as dead. Causing full path failover.

    In this case, research revealed that SP A was hung, but SP B was not.

  • This patch allows the VMkernel to successfully boot on unbalanced NUMA configurations (that is, those with some nodes having no CPU or memory). When such an unbalanced configuration is detected, the VMkernel shows an alert and continues booting. Previously, the VMkernel failed to load on such NUMA configurations. Sample alert message when memory is missing from one of the nodes (here, node 2):

    No memory detected in SRAT node 2. This can cause very bad performance.

  • When the zpool create command from a Solaris 10 virtual machine is run on a LUN that is exported as a raw device mapping (RDM) to that virtual machine, the command creates a partition table of type GPT (GUID partition table) on that LUN as part of creating the ZFS filesystem. Later when a LUN rescan is run on the ESX server through VirtualCenter or through the command line, the rescan takes a significantly long amount of time to complete because the VMkernel fails to read the GUID partition table. This patch fixes this problem.
    Symptoms seen without this patch: rescanning HBAs takes a long time and an error message similar to the following is logged in /var/log/vmkernel:

    Oct 31 18:10:38 vmkernel: 0:00:45:17.728 cpu0:8293)WARNING: SCSI: 255: status Timeout for vml.02006500006006016033d119005c8ef7b7f6a0dd11524149442030. residual R 800, CR 80, ER 3
  • A race in LVM resignaturing code can cause volumes to disappear on a host when a snapshot is presented to multiple ESX hosts, such as in SRM environments.
    Symptoms: After rescanning, VMFS volumes are not visible.
  • This change resolves a rare VMotion instability. Symptoms: during a VMotion migration, certain 32-bit applications running in 64-bit guests might crash due to access violations.
  • Solaris 10 Update 4, 64-bit graphical installation fails with the default virtual machine RAM size of 512MB.
  • DRS development and performance improvement. This change prevents unexpected migration behavior.
  • In a DRS cluster environment, the hostd service reaches a hard limit for memory usage, which causes hostd to restart itself. Symptoms: the hostd service restarts and temporarily disconnects from VirtualCenter. The ESX host stops responding before hostd reconnects.
  • Fixes for supporting Site Recovery Manager (upcoming December 2008 release) on ESX 3.5 Update 2 and Update 3.
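If you want to check whether these bundles are already on a host, you could capture the host’s installed-patch listing (for example, the output of `esxupdate query` on the ESX 3.5 service console) and compare it against the list above. A hedged sketch in Python; the inventory text here is a made-up example, and the parsing is deliberately loose since the exact output format varies:

```python
# Sketch: check a captured patch-inventory listing for the bundles above.
# The inventory string is illustrative; adapt the parsing to your output.

REQUIRED = {
    "ESX350-200811408-BG",
    "ESX350-200811406-SG",
    "ESX350-200811405-SG",
    "ESX350-200811402-SG",
    "ESX350-200811401-SG",
}

def missing_bundles(inventory_text: str) -> set:
    """Return the required bundle IDs not mentioned in the inventory text."""
    return {bundle for bundle in REQUIRED if bundle not in inventory_text}

# Example: a host that only has the bzip2 and libxml2 security updates.
inventory = "ESX350-200811406-SG installed\nESX350-200811405-SG installed\n"
print(sorted(missing_bundles(inventory)))
# -> ['ESX350-200811401-SG', 'ESX350-200811402-SG', 'ESX350-200811408-BG']
```

A simple substring check is enough here because the bundle IDs are unique and unambiguous; anything fancier would just be parsing for its own sake.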

Gabes Virtual World coverage of Hyper-V vs ESX

Duncan Epping · Dec 2, 2008 ·

Gabe from “Gabes Virtual World” is on a roll lately. He wrote three great articles describing why he wouldn’t want to use Hyper-V in his datacenter at this moment. I’ve linked to his blog articles and grabbed a quote from each that, in my opinion, sets the tone:

Part 1 Hardware:

In fact, I think you will come up with a smaller number of supported nics for Hyper-V, because ESX does the VLAN trunking and teaming independent of any drivers. In ESX you can easily create a virtual switch that has an HP, Intel, Broadcom and whatever nic combined and still do VLAN trunking and teaming. Have a look at the VMware I/O HCL and learn which nics are supported. Please try to find as many nics for Hyper-V.

Part 2 Guest OS and memory Overcommit:

Looking at this environment, there is quite a number of systems that would not be supported on Hyper-V. Big deal you say? Actually, yes it is a big deal. Wasn’t cost saving by getting rid of a lot of physical hardware one of the main drivers to go virtual? Leaving a bunch of servers (around 130 in our case) run physical is quite a lot, especially because these will probably be somewhat older servers that are getting more expensive in their support contract and more vulnerable to failure.

Part 3 Motions and Storage:

While searching for more info on Hyper-V QuickMigration, I found a rather disturbing video by VMware. Now, it is fair to say that VMware is the biggest competitor of Hyper-V, but somehow I do believe the video is 100% accurate. Should it not be, please let me know. The video shows that when a QuickMigration is performed between hosts with different CPU features, Hyper-V does not check the compatibility between these features. The implication here is that you could have applications crash because the application is using CPU features on host-A that are not available when running on host-B.

I think Gabe did an excellent job on these blog posts. Not only did he take the time to actually discover what is and isn’t possible, he also questions his own objectivity.

For me personally, Hyper-V isn’t an enterprise solution like ESX / Virtual Infrastructure is. But I also have the feeling that Microsoft isn’t after the enterprise market at this point in time. They are trying to win the SMB market and then climb their way up, just like they tried with, for instance, MS SQL vs Oracle. I say tried because I still don’t think they succeeded, and probably never will.

If you want to know why I don’t think Hyper-V is an enterprise solution, just read Gabrie’s blogs. Especially the sections on storage, motions, hardware (NIC teaming), and memory overcommitment should give you an idea why.

Great job Gabe, and keep them coming!

Replicate Technologies and RDA 1.1

Duncan Epping · Dec 2, 2008 ·

Just received an email that RDA has been updated: version 1.1 has been released, incorporating the following changes:

  • Easy wizard-based setup – get up and running even faster!
  • Larger scale support – RDA now supports much larger deployments.
  • Highly detailed diagnostic analysis, helping you find and fix problems more quickly.
  • Full support for ESXi

I’m downloading RDA 1.1 right now and will try to write a follow-up to my original article “Replicate Technologies Datacenter Analyzer” as soon as I can find some spare time.


About the Author

Duncan Epping is a Chief Technologist and Distinguished Engineering Architect at Broadcom. Besides writing on Yellow-Bricks, Duncan is the co-author of the vSAN Deep Dive and the vSphere Clustering Deep Dive book series. Duncan is also the host of the Unexplored Territory Podcast.
