
Yellow Bricks

by Duncan Epping


Lessons Learned

ESXi – lessons learned part 1

Duncan Epping · Dec 3, 2009 ·

I am working on a large ESXi deployment and thought I would start writing down some of the lessons learned. I will try to create a single post every week, if I can find the time that is.

Scratch!

Two things stood out, on a technical level, over the past couple of days while I was reading the ESXi Installable documentation:

One of the things that used to be a requirement was the Scratch Partition. It appears that with vSphere this requirement has been removed:

During the autoconfiguration phase, a 4GB VFAT scratch partition is created if the partition is not present on another disk. When ESXi boots, the system tries to find a suitable partition on a local disk to create a scratch partition. The scratch partition is not required.

Of course this does not necessarily mean that you do not need one, as explained in the second part of the paragraph:

It is used to store vm-support output, which you need when you create a support bundle. If the scratch partition is not present, vm-support output is stored in a ramdisk. This might be problematic in low-memory situations, but is not critical.

So the question remains: what would my recommendation be? The answer is "it depends", yes, I know, the easy way out. If you have enough RAM in a host and know from experience that you usually only create support dumps on hosts that are in maintenance mode, then don't worry about it and don't create one. However, if you feel there is a need to create vm-support dumps while running production, make sure there is a scratch partition with enough free space available.
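If you do decide to configure a persistent scratch location, it boils down to creating a folder on a VMFS datastore and pointing the advanced setting ScratchConfig.ConfiguredScratchLocation at it, followed by a reboot. A minimal sketch from the ESXi console, where the datastore name "datastore1" and the hostname "esx01" are just placeholders for your own environment:

# create a dedicated scratch folder for this host
mkdir /vmfs/volumes/datastore1/.locker-esx01

# point the scratch location at it; a reboot is needed before it is used
vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/datastore1/.locker-esx01

The same setting can also be changed through the vSphere Client under Configuration -> Advanced Settings.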

Support

Yes, ESXi is fully supported, but there are some restrictions:

  • Boot from FC SAN – Experimental Support
  • Stateless PXE Boot – Experimental Support

Now what does “experimental support” mean? According to the VMware website it means the following:

VMware includes certain “experimental features” in some of our product releases. These features are there for you to test and experiment with. VMware does not expect these features to be used in a production environment. However, if you do encounter any issues with an “experimental feature”, VMware is interested in any feedback you are willing to share. Please submit a support request through the normal access methods. VMware cannot, however, commit to troubleshoot, provide workarounds or provide fixes for these “experimental features”.

So does that mean that in the case of stateless the booting process is experimental? Or the installation process in the case of boot from FC SAN?

No, it does not. In these scenarios everything related to the ESXi host is "experimental". So what does this mean? Imagine you are facing serious storage issues and you just called VMware. When VMware analyzes your environment and notices that it is a PXE-booted environment, they will more than likely give your support call a lower priority. And not only a lower priority: the support is "best effort", with no guarantees.

EMC SRDF Storage Replication Adapter

Duncan Epping · Jan 30, 2009 ·

I was delivering a Site Recovery Manager Jumpstart today. During the configuration of the Storage Replication Adapter (SRA) the task got stuck at 23%. I've seen the configuration of the SRA get stuck at 23% once before, so I knew it was the "DiscoverLuns" command that failed for one reason or another.

We ran the configuration of the SRA again and it stopped after exactly five minutes. We decided to run the DiscoverLuns task again, but this time manually with command.pl and an XML file as input. If you read the previous article on DiscoverLuns, you know how to feed command.pl the XML file and what info the file should contain.

Running DiscoverLuns manually worked great, but it took a little over 15 minutes before the complete results were returned by the EMC DMX3. During the configuration via the GUI the task failed after exactly 5 minutes, so it seemed to time out. Opening up vmware-dr.xml, which can be found in the Site Recovery Manager installation folder, revealed a timeout of 300 seconds:

<CommandTimeout>300</CommandTimeout>

We changed the value to 1800 (30 minutes), restarted the SRM service, and reconfigured the SRA successfully.
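For completeness, restarting the SRM service can also be done from the command line on the SRM server. A minimal sketch, assuming the Windows service is registered under the short name "vmware-dr" (verify the actual service name in the Services console first):

net stop vmware-dr
net start vmware-dr

The new CommandTimeout value is only picked up after this restart.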

Failover using SRM might be slow…

Duncan Epping · Jan 26, 2009 ·

I was just reading an excellent weekly technical digest by VMware's Michael White and noticed the mention of a KB article on SRM. This KB article has the following description:

With VMware Site Recovery Manager 1.0 Update 1, recovery of a VM might take a long time.  The recovery time during a test or real recovery will be longer when more VM’s are involved.  The Change Network Settings task might time out during the test or real failover.  This is due to the serial fashion in which Site Recovery Manager waits until a guest heartbeat is seen prior to customizing the VM.

This problem can be encountered when running the following ESX versions:

  • ESX 3.5 Update 2 and Update 3
  • ESX 3.0.2 and 3.0.3

In other words, the behaviour of ESX has changed, and it might be useful and beneficial for SRM to change this behaviour again. We are talking about a 5-minute delay, and that's 5 minutes for each VM. You can imagine that running a recovery plan can and will take a long time when this setting isn't changed. Here's the solution, which has also been outlined in the KB article:

Set the hostd heartbeat delay to 40 seconds:
Disconnect the host from VC (right-click the host in the VI Client and select "Disconnect").
Log in as root to the ESX Server with SSH.
Using a text editor such as nano or vi, edit the file /etc/vmware/hostd/config.xml.
Set the "heartbeatDelayInSecs" tag under "vmsvc" to 40 seconds as shown here:

<vmsvc>
<heartbeatDelayInSecs>40</heartbeatDelayInSecs>
<enabled>true</enabled>
</vmsvc>

Restart the management agents for this change to take effect (see "Restarting the Management agents on an ESX Server", KB 1003490); example commands are shown below.
Reconnect the host in VC (right-click the host in the VI Client and select "Connect").
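Restarting the management agents on classic ESX can be done from the service console. A minimal sketch, assuming ESX 3.x and root access to the console:

# restart hostd so the new heartbeatDelayInSecs value is read
service mgmt-vmware restart

# restart the VirtualCenter agent as well before reconnecting the host
service vmware-vpxa restart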

Storage Replication Adapter: discoverLuns…

Duncan Epping · Jan 20, 2009 ·

Today I was implementing Site Recovery Manager with a colleague (thanks Andy!). During the configuration of the HP EVA SRA (Storage Replication Adapter) we received the following error:

discoverLuns script failed to execute properly

The error indicates that the first part of the SRA configuration, "discoverArrays", worked, but that it bailed out (at 23%) when discovering the LUNs and their replicas. So after checking the config files and log files, we decided to manually run the script that the SRA uses and see what happened.

First we created an XML file to feed the script. The XML file contained the following, which can be copied from the SRM log files:

<?xml version="1.0" encoding="ISO-8859-1"?>
<Command>
<Name>discoverLuns</Name>
<ConnectSpec>
<Name>HP StorageWorks EVA Virtualization Adapter</Name>
<Address>san.yellow-bricks.com</Address>
<Username>user</Username>
<Password>password</Password>
</ConnectSpec>
<ArrayId>YB-SAN-01</ArrayId>
<OutputFile>C:\TEMP\SAN.Log</OutputFile>
<LogLevel>trivia</LogLevel>
</Command>

Now we were able to run the script with the XML file as input:

perl command.pl < file.xml

In our case running the script manually with the XML file as input didn’t return an error. This gave us the idea that it might be account or permissions related. During the configuration of the SRA we entered domain credentials, which were the same as the account being used during the manual run of the script. So it wasn’t the SRA account that was causing these problems.

After diving into the configuration we stumbled upon the SRM service. The SRM service was started with the Local System account. We decided to change the account used for the service from Local System to a domain account… and indeed, problem solved.
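Changing the service account can be done through services.msc or from the command line with sc. A minimal sketch, assuming the SRM service short name is "vmware-dr" and using a hypothetical domain account DOMAIN\svc-srm (replace both with your own values):

sc config vmware-dr obj= "DOMAIN\svc-srm" password= "YourPasswordHere"
net stop vmware-dr
net start vmware-dr

Keep in mind that the account also needs the "Log on as a service" right on the SRM server.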

One would expect this to be part of the SRA documentation, but it isn’t. We contacted VMware Support and they had the same configuration running in their test environment except for the fact that they weren’t using AD authentication. In their case the Local System account just worked fine.

I've emailed Support all the log files and according to them our suspicion was correct. It seems to be related to the HP EVA SRA, which appears to use the wrong account for authentication at one point during the script. Next up: contact HP Support and see if they can a) fix this or b) update their documentation.

Site Recovery Manager is not about installing… Part II

Duncan Epping · Jan 12, 2009 ·

I've been playing around with Site Recovery Manager these last couple of days. Installing it was really easy, and the same goes for the basic configuration. I already wrote a blog about this topic a month or so ago, but now I've experienced it myself. Most of the time during a Site Recovery Manager project will be spent on the Plan & Design phase and on writing documentation. I will give you just one example why. The following was taken from the SRM course material:

Datastore Group
Replicated datastores containing the complete set of virtual machines that you want to protect with SRM

Protection Group
A group of virtual machines that are failed over together during test and recovery

For those who don't know, there's a one-to-one mapping between Datastore Groups and Protection Groups. In other words, once you've mapped a Datastore Group to a Protection Group, there's no way of changing it without recreating the Protection Group.

I think a picture says more than a thousand words, so I stole this one from the Evaluator Guide to clarify the relationship between datastores, Datastore Groups and Protection Groups:

Notice that there are multiple datastores in Datastore Group 2 because VM4 has disks on both datastores, so these datastores are joined into one Datastore Group. This Datastore Group will have a one-to-one relationship with a Protection Group. Keep in mind, this is really important: a Protection Group contains VMs that are failed over together during test and recovery.

If you've got VMs with multiple disks on multiple datastores, with no logic to which disk is placed on which datastore, you could and probably will end up with all datastores being members of the same Datastore Group. Being a member of the same Datastore Group means being part of the same Protection Group, and being part of the same Protection Group results in a less granular failover. It's all or nothing in this case, and I can imagine most companies would like to have some sort of tiering model in place, or even better, fail over services one at a time. (This doesn't mean, by the way, that if you create multiple Protection Groups you can't fail over everything at the same time; they can all be joined in a Recovery Plan.)

Some might think that you would be able to randomly add disks to datastores after you've finished configuring. This clearly isn't the case. If you add a disk to a protected(!) VM, the Datastore Group will be recomputed. In our situation this meant that all VMs in the "Medium Priority" Protection Group were moved over to the "High Priority" Protection Group, because we added a disk to a "Medium Priority" VM and placed it on a "High Priority" datastore. As you can imagine, this also causes your Recovery Plans to end up with a warning: you will need to reconfigure the moved VMs before you can fail them over as part of your "High Priority" Protection Group. (Which probably wasn't the desired strategy…)

When I was searching the internet for information on SRM I stumbled upon this article on the VMware Uptime blog by Lee Dilworth. I’ve taken the following from the “What we’ve learnt” post, which confirms what we’ve seen the last couple of days:

Datastore Group computation is triggered by the following events:

  • Existing VM is deleted or unregistered
  • VM is storage vmotioned to a different datastore
  • New disk is attached to VM on a datastore previously not used by the VM
  • New datastore is created
  • Existing datastore is expanded

So in other words, moving VMs from one datastore to another or creating a new disk on a different datastore can cause problems because the Datastore Group computation will be re-run. Not only do you need to take virtual disk placement into consideration when configuring SRM, you will also need to be really careful when moving virtual disks. Documentation, design and planning are key here.

I would suggest documenting current disk placement before you even start implementing SRM; depending on the results, you might need to move disks around before you start with SRM. Make sure to check your documentation and design before randomly adding virtual disks once SRM has been implemented. By the way, documenting your current disk placement can be done easily with the script that Hugo created this week, and I would suggest creating these reports regularly and saving them.
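If you just want a quick overview rather than a full report, something along these lines in the VI Toolkit (PowerShell) lists every virtual disk together with the datastore it lives on. This is only a minimal sketch, not Hugo's script, and the output path is a placeholder:

# report which datastore every VMDK of every VM is placed on
Get-VM | ForEach-Object {
  $vm = $_
  Get-HardDisk -VM $vm | Select-Object @{N="VM";E={$vm.Name}}, Filename
} | Export-Csv C:\Reports\disk-placement.csv -NoTypeInformation

The Filename column contains the datastore name between square brackets, which makes it easy to spot VMs that span multiple datastores.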

Expect some more SRM stuff coming up over the next couple of weeks.
