ESXi – lessons learned part 1

I am working on a large ESXi deployment and thought I would start writing down some of the lessons learned. I will try to write a post every week, if I can find the time, that is.

Scratch!

Two things stood out over the last couple of days, on a technical level, when I was reading the ESXi Installable documentation:

One of the things that used to be a requirement was the Scratch Partition. It appears that with vSphere this requirement has been removed:

During the autoconfiguration phase, a 4GB VFAT scratch partition is created if the partition is not present on another disk. When ESXi boots, the system tries to find a suitable partition on a local disk to create a scratch partition. The scratch partition is not required.

Of course, this does not necessarily mean that you do not need one, as explained in the second part of the paragraph:

It is used to store vm-support output, which you need when you create a support bundle. If the scratch partition is not present, vm-support output is stored in a ramdisk. This might be problematic in low-memory situations, but is not critical.

So the question remains: what would my recommendation be? The answer is “it depends”; yes, I know, the easy way out. If you have enough RAM in a host and know from experience that you usually only create support dumps on hosts that are in maintenance mode, then don’t worry about it and don’t create one. However, if you feel there is a need to create vm-support dumps while running production, make sure there is a scratch partition with enough free space available.

Support

Yes, ESXi is fully supported, but there are some restrictions:

  • Boot from FC SAN – Experimental Support
  • Stateless PXE Boot – Experimental Support

Now what does “experimental support” mean? According to the VMware website it means the following:

VMware includes certain “experimental features” in some of our product releases. These features are there for you to test and experiment with. VMware does not expect these features to be used in a production environment. However, if you do encounter any issues with an “experimental feature”, VMware is interested in any feedback you are willing to share. Please submit a support request through the normal access methods. VMware cannot, however, commit to troubleshoot, provide workarounds or provide fixes for these “experimental features”.

So does that mean that in the case of stateless the booting process is experimental? Or the installation process in the case of boot from FC SAN?

No, it does not. In such a configuration everything related to ESXi is “experimental”. So what does this mean? Imagine you are facing serious storage issues and you call VMware. If VMware analyzes your environment and notices that it is a PXE-booted environment, they will more than likely give your support call a lower priority. On top of that, the support is “best effort”, with no guarantees.

Using the VSS Driver for backups?

I received an email from one of my readers, Kevin, about using the VSS driver for backups. This feature was introduced in ESX 3.5 Update 2. I knew that there was a catch to using the VSS driver, but it seems that many people, Kevin among them, have overlooked a little detail on page 50 of the documentation.

NOTE The VSS component gets installed by default when you do a fresh installation of VMware Tools shipped with ESX Server 3.5 Update 2. If you upgrade from an earlier version, you need to install the VSS component manually.

Please be aware that when you want to use VSS in your existing environment you will need to manually upgrade your version of VMware Tools and select the VSS driver. For Windows 2008 and Vista the VSS driver is not installed by default, even when doing a clean install. This means that you will need to add the driver manually when VSS-based backups are a requirement.

You don’t need any brains to listen to music pt II

Mike Laverick just posted an article about music and that triggered this one. I already did a couple of posts with YouTube clips, so let’s continue the tradition. Unfortunately I could not find the official videos of the latter two… but who cares anyway.

Editors – Papillon
Probably one of the best concerts I’ve seen over the last couple of years. There’s not much more I can say, they played an amazing show and this is one of the tracks of their latest album. Each album they’ve produced has a different sound but it doesn’t sound forced, it’s an evolution….

Besides music that’s almost depressing, I also love music which sounds energetic and aggressive but has a positive touch. One of the bands that most definitely falls into that category is Minor Threat. Minor Threat is more or less responsible for the existence of a subgenre called straight edge hardcore. For those who don’t know what “straight edge” is, check Wikipedia.

One of the bands that I haven’t seen live but that blew me away with a new album is Them Crooked Vultures. Although not all reviews were positive, I think people reviewing music should take a couple of steps back. Expectations around Them Crooked Vultures were so high that it was impossible to meet them. That was also never the goal of Them Crooked Vultures. These guys wanted to produce a solid rock album and they did.

Killing in the name of

Since vSphere was introduced, more and more of my customers have been migrating to ESXi. It makes sense, as the thin hypervisor is the way of the future according to VMware.

One argument commonly used by admins against ESXi is that you cannot kill a rogue VM. Normally an SSH session would be opened to the Service Console and the VM would be killed with “kill -9” when a “power off” did not work. Because ESXi is COS-less this is not an option. However, it is still possible to kill these VMs by using the following procedure:

Log in using unsupported mode:
Press <alt> + <f1> and type “unsupported” <enter>
List all running VMs:

vim-cmd vmsvc/getallvms

Kill VM with vm id:

vim-cmd vmsvc/power.off <vm id>
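
Looking up the VM ID by hand is error-prone on a host with many VMs. The following is a minimal sketch of a helper that picks the ID out of the VM list; the function name and the VM name “web01” are made up for illustration, and it assumes the list starts with a single header line and that VM names contain no spaces.

```shell
#!/bin/sh
# Print the Vmid of the VM whose Name column matches exactly.
# Reads "vim-cmd vmsvc/getallvms"-style output on stdin.
# Assumptions: the first line is a header row, and VM names
# contain no spaces (a name with spaces would shift the columns).
vmid_for_name() {
    awk -v name="$1" 'NR > 1 && $2 == name { print $1 }'
}

# On the host (hypothetical VM name "web01"):
#   vim-cmd vmsvc/getallvms | vmid_for_name web01
```

The ID it prints is the value to pass to the power-off command above. If nothing is printed, no VM with that exact name was found.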

Issue with Update 1 and COS Agents

I wrote this article after the KB was released but forgot to click “publish”. Those planning to upgrade their vSphere environment to Update 1, please read the following:

KB Article 1016070

Issue
Upgrading ESX 4.0 to 4.0 U1 fails or times out and rebooting the host results in a purple diagnostic screen

Who is affected
Customers using VMware vSphere 4 upgrading to Update 1 with 3rd party management agents running.

To identify whether you have a 3rd party management agent installed:

  1. Log into the ESX host as root.
  2. Run the esxupdate query command.
  3. Anything listed under New packages: may be a 3rd party management agent in the service console.

Note: Consult your hardware vendor documentation for specific package names that are installed in the service console.

Solution

To avoid this issue, prior to the update, disable all 3rd party management agents running on the ESX 4.0.0 server before applying the update.

Note: The 3rd party management agents can be enabled after the upgrade is completed.

If you have already updated the ESX host, do not reboot the ESX host. Open a support request with VMware support. For more information, see How to Submit a Support Request.

WARNING: Rebooting the host means the host has to be reinstalled because it is not recoverable after a reboot.

WARNING: If you have virtual machines running on local storage, they may not be retained if you reinstall ESX 4.0 as a result of this issue. Contact VMware Support for assistance in recovering those virtual machines.

New whitepapers

VMware just published two whitepapers. I hadn’t noticed them yet and especially the second one is a very good read!

  1. VMCI Socket Performance
    The VMCI (Virtual Machine Communication Interface) device allows fast, efficient communication between virtual machines running on the same host, without using the guest networking stack. This paper presents VM-VM performance results using VMCI Sockets and compares these results to the VM-VM performance achieved using regular TCP/IP sockets.
  2. VMware vCenter Site Recovery Manager 4.0 Performance and Best Practices for Performance
    The goal of this white paper is to provide you with Site Recovery Manager performance data and recommendations so that you can architect an efficient recovery plan that minimizes the downtime for your environment.
    This white paper addresses various dimensions on which the recovery time depends:

    • Recoveries with iSCSI, FC, and NFS storage
    • Number of virtual machines and protection groups associated with a recovery plan
    • Virtual machine to protection group relation
    • Recovery site performance in a cluster with DPM and DRS
    • Configuration of various recovery plan parameters
    • Priority assignment of virtual machines in the recovery plan
    • High latency network between protected and recovery sites

    Furthermore, best practices are suggested in applicable areas so that you can optimize the recovery time using Site Recovery Manager.

vSphere and Service Console Memory

Today I read something I have not seen anywhere else before. I have always been under the impression that the memory reserved for the Service Console was simply increased from 272MB to 300MB. Although the bare minimum is indeed 300MB, there’s another side to this story, something I did not expect but that actually does make sense. As of ESX 4.0 the allocated Service Console memory automatically scales up when more memory is available in the host during installation. Let me try to make that crystal clear:

  • ESX Host – 8GB RAM -> Default allocated Service Console RAM = 300MB
  • ESX Host – 16GB RAM -> Default allocated Service Console RAM = 400MB
  • ESX Host – 32GB RAM -> Default allocated Service Console RAM = 500MB
  • ESX Host – 64GB RAM -> Default allocated Service Console RAM = 602MB
  • ESX Host – 96GB RAM -> Default allocated Service Console RAM = 661MB
  • ESX Host – 128GB RAM -> Default allocated Service Console RAM = 703MB
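
For scripted checks the data points above can be captured in a small lookup. This is only a sketch: the function name is made up, and because the exact formula is not known it deliberately returns nothing for host sizes that are not in the list rather than guessing.

```shell
#!/bin/sh
# Print the default Service Console memory (MB) observed for a host RAM size (GB).
# Only the data points listed above are covered; the underlying formula is
# unknown, so any other size returns a non-zero status instead of an estimate.
cos_default_mb() {
    case "$1" in
        8)   echo 300 ;;
        16)  echo 400 ;;
        32)  echo 500 ;;
        64)  echo 602 ;;
        96)  echo 661 ;;
        128) echo 703 ;;
        *)   return 1 ;;
    esac
}

# Example: look up a 64GB host, with a fallback message for unlisted sizes.
cos_default_mb 64 || echo "no observed value for that size"
```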

Lessons learned:

  1. Allocated Service Console memory is based on a formula which takes available RAM into account. (Haven’t found the exact formula yet, if I do I will of course add it to this article.)
  2. Always make your swap partition 1600MB. An increase in host RAM automatically increases the allocated Service Console memory, which might otherwise leave you with a swap partition that is too small.