
Yellow Bricks

by Duncan Epping


BC-DR

Health Check tools I use

Duncan Epping · Dec 18, 2008 ·

A few days ago Scott Lowe asked me which tools I use to deliver a health check engagement. A health check is a standard VMware PSO engagement: a VMware consultant comes on site to check the status of your environment and draws up a report.

I personally use the following tools:

  • Health Check script by A.Mikkelsen → for a quick overview of the current situation and setup; small files that are easy to carry around, runs from the Service Console.
  • VMware Health Analyzer Appliance → A Linux appliance that can connect to your VC/ESX and analyze log files. At this point in time it’s only available to VMware employees or partners with access to Partner Central.
  • Powershell: Report into MS Word → Alan Renouf created this great reporting PowerShell script. It dumps info into a Word document. (And I’ve heard he’s also working on a Visio export.)
  • Powershell: Health Check Script → Creates an HTML report with datastore, CPU, memory and snapshot info… and more.
  • RVTools → Gives a quick overview of the current VM setup: snapshots, memory, CPU, etc.
  • Common sense → I hardly ever encounter really huge problems, mainly decreased availability because of choices made during the implementation/design phase without following VMware’s guidelines. Using common sense is the best advice here: read the best practice documents and VMware’s collection of PDFs!
  • And when there are disturbing errors in one of the various log files, you have the option to run them through one of the many toolkits we have internally.

I’m not using the following tools actively during engagements because of licensing, but they can be very useful in your environment:

  • Replicate Datacenter Analyzer → Analyze your VI3 environment. I wrote an article on RDA a few weeks ago; click here.
  • Veeam Monitor → Monitor your VI3 environment including performance graphs etc.
  • Veeam Reporter → A reporting tool, which will come in handy when documenting environments and comparing the current config to an old config.
  • Vizioncore vFoglight → Might come in handy when doing analyses of trends and pinpointing resource contention.
  • Tripwire Configcheck → Analyze the security of your VMware ESX environment. Check my blog post on Configcheck here.

Scripts, scripts, scripts… come and get them!

Duncan Epping · Dec 18, 2008 ·

I was just pointed to this amazing topic on the VMTN Forums by William Lam and Tuan Duong. These guys created a whole bunch of scripts and decided to share them with the rest of the world:

My colleague (Tuan Duong) and I (William Lam) have been working on a virtualization/VDI deployment project over the last six months. The result of this work is a set of scripts that assist in provisioning and managing the server and lab environment for the Residential Networking Services (ResNet) at the University of California, Santa Barbara.

We took the approach of developing scripts that would be free in nature to support a variety of offerings that currently exist in the enterprise space. One such tool that we would like to share with the VMware community is our Linked Clones script that was developed at the beginning of the summer of 2008. This script functions similarly to the View Composer component in the recent release of VMware View 3 but with relatively relaxed requirements.

A description and more details of the Linked Clones script can be found at:

http://communities.vmware.com/docs/DOC-9020

Another script that complements the Linked Clones script is our custom management script “*my-vmware-cmd*”, which can be found at:

http://communities.vmware.com/docs/DOC-9061

An example of our implementation of these scripts can be found at:

http://communities.vmware.com/docs/DOC-9201

We also have other scripts and resources that have been consolidated onto a webpage and would like to share it:

http://www.engineering.ucsb.edu/~duonglt/vmware/

We hope that the community finds some of these scripts to be useful in aiding VI administrators to manage their virtual infrastructure and look forward to any feedback that is provided.

Thanks
William lamw and Tuan tlduong

Check these scripts out if you’re looking for a linked-clone solution in a non-View Composer environment. Their website also contains a bunch of scripts, tips and tricks. One that really stands out is the RDM script; believe me, this is a must-have for your toolkit:

Download: rdm.sh – 11/03/08
Compatible with: ESX 3.5+ and ESXi

This script locates all virtual machines that have an RDM mapping and provides the VM’s name, the hard disk label shown in the VIC/VC, datastore, LUN UUID, HBA/LUN, compatibility mode (phys/virt), disk mode and capacity.
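For a rough feel for how such a script can work (this is not the actual rdm.sh; the function name and the reliance on descriptor contents are my assumptions), locating RDMs boils down to scanning vmdk descriptor files for a raw-device-map createType:

```shell
# Hypothetical sketch, not the actual rdm.sh: list RDM pointer descriptors
# under a datastore root. On ESX 3.x, RDM pointer vmdk descriptors carry
# createType "vmfsRawDeviceMap" (virtual compatibility mode) or
# "vmfsPassthroughRawDeviceMap" (physical compatibility mode).
find_rdms() {
    root="$1"
    # descriptor files are small text files; skip flat/delta extents
    find "$root" -name "*.vmdk" ! -name "*-flat.vmdk" ! -name "*-delta.vmdk" 2>/dev/null |
    while read -r vmdk; do
        if grep -q "RawDeviceMap" "$vmdk" 2>/dev/null; then
            echo "RDM descriptor: $vmdk"
        fi
    done
}
```

On an ESX host you would call `find_rdms /vmfs/volumes`; the real script additionally resolves the VM name, LUN UUID, compatibility mode and capacity.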

HA: who decides where a VM will be restarted?

Duncan Epping · Dec 15, 2008 ·

During the Dutch VMUG someone walked up to me and asked a question about High Availability. He had read my article on primary and secondary nodes and was wondering who decides where and when a VM will be restarted.

Let’s start with a short recap of the  “primary/secondary” article: “The first five servers that join the cluster will become a primary node, and the others that will join will become a secondary node. Secondary nodes send their state info to primary nodes and also contact the primary nodes for their heartbeat notification. Primary nodes replicate their data with the other primary nodes and also send their heartbeat to other primary nodes.”

The question was: when a fail-over needs to take place because an isolation event occurred, who decides on which host a specific VM will be restarted? The obvious answer is one of the primaries. One of the primaries is selected as the “fail-over coordinator”. The fail-over coordinator coordinates the restart of virtual machines on the remaining hosts and takes restart priorities into account. Keep in mind that when two hosts fail at the same time, it handles the restarts sequentially: it first restarts the VMs of the first failed host (taking restart priorities into account) and then restarts the VMs of the host that failed second (again taking restart priorities into account). If the fail-over coordinator fails, one of the other primaries takes over.
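Purely as an illustration of that sequencing (this is not VMware code; the input format and the numeric priority scheme are made up for the example), the coordinator’s ordering looks like this: hosts are handled one after another, and within each host the VMs are ordered by restart priority.

```shell
# Illustrative only: sequential per-host restart, priority-ordered per host.
# Input: one file per failed host, in failure order, with lines of
# "<priority> <vm-name>"; lower number restarts first (an assumption).
plan_restarts() {
    for hostfile in "$@"; do           # hosts are handled sequentially
        sort -n "$hostfile" | while read -r prio vm; do
            echo "restart $vm (priority $prio)"
        done
    done
}
```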

By the way, this is another reason why you can only account for 4 host failures. You need at least 1 primary, and this primary will be the fail-over coordinator. When the last primary dies…

EnableResignature and/or DisallowSnapshotLUN

Duncan Epping · Dec 11, 2008 ·

I’ve spent a lot of time in the past trying to understand the EnableResignature and DisallowSnapshotLUN settings. They have had me confused and dazzled a couple of times, and every now and then I still seem to have trouble actually understanding them. After a quick scan through the VCDX Enterprise Study Guide by Peter, I decided to write this post and took the time to get to the bottom of it. I needed this settled once and for all, especially now that I’m starting to focus more on BC/DR.

According to the SAN Config Guide (vi3_35_25_san_cfg.pdf) there are three states:

  1. EnableResignature=0, DisallowSnapshotLUN=1 (default)
    In this state, you cannot bring snapshots or replicas of VMFS volumes made by the array into the ESX Server host, regardless of whether or not the ESX Server has access to the original LUN. LUNs formatted with VMFS must have the same ID for each ESX Server host.
  2. EnableResignature=1, (DisallowSnapshotLUN is not relevant)
    In this state, you can safely bring snapshots or replicas of VMFS volumes into the same servers as the original and they are automatically resignatured.
  3. EnableResignature=0, DisallowSnapshotLUN=0 (This is similar to ESX Server 2.x behavior.)
    In this state, the ESX Server assumes that it sees only one replica or snapshot of a given LUN and never tries to resignature. This is ideal in a DR scenario where you are bringing a replica of a LUN to a new cluster of ESX Servers, possibly on another site that does not have access to the source LUN. In such a case, the ESX Server uses the replica as if it is the original.

The advanced LVM setting EnableResignature is used for resignaturing a VMFS volume that has been detected with a different LUN ID. So what does the LUN ID have to do with the VMFS volume? The LUN ID is stored in the LVM header of the volume and is used to check whether it’s the same LUN being (re)discovered or a copy of the LUN being presented with a different ID. If it’s a copy, the VMFS volume needs to be resignatured; in other words, the UUID is renewed and the LUN ID is updated in the LVM header.
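The interaction of the two settings can be summarized as a simple decision matching the three states above (an illustrative sketch, not actual ESX code; the function name and argument layout are mine):

```shell
# Illustrative decision table for the three documented states (not ESX source).
# Args: LUN ID stored in the LVM header, LUN ID the volume is discovered on,
#       EnableResignature value, DisallowSnapshotLUN value.
handle_volume() {
    stored_id="$1"; seen_id="$2"; enable_resig="$3"; disallow_snap="$4"
    if [ "$stored_id" = "$seen_id" ]; then
        echo "mount"             # same LUN ID: treated as the original volume
    elif [ "$enable_resig" = "1" ]; then
        echo "resignature"       # state 2: new UUID, LVM header updated
    elif [ "$disallow_snap" = "0" ]; then
        echo "mount-as-snapshot" # state 3: volume used as-is (ESX 2.x behavior)
    else
        echo "ignore"            # state 1 (default): snapshot LUN not brought in
    fi
}
```

For example, `handle_volume 5 9 0 1` prints `ignore`, matching the default state.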

UUID, what’s that? Chad Sakac from EMC described it as follows in his post on VMFS resignaturing:

It’s a VMware generated number – the LVM signature aka the UUID (it’s a long hexadecimal number designed to be unique). The signature itself has little to do with anything presented by the storage subsystem (Host LUN ID, SCSI device type), but a change in either will cause a VMFS volume to get resigned (the ESX server says “hey, I used to have a LUN with this signature, but its parameters were different, so I’d better resign this”).

As Chad says, the UUID has little to do with anything presented by the storage subsystem. A VMFS volume ID, aka UUID, looks like this:

42263200-74382e04-b9bf-009c06010000

1st part – the COS time when the file system was created or resignatured
2nd part – the TSC time, an internal timestamp counter kept by the CPU
3rd part – a random number
4th part – the MAC address of the COS NIC
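Splitting the example UUID on its dashes shows the four fields (a trivial sketch; the variable names are mine):

```shell
# Split the example VMFS UUID from the post into its four fields.
uuid="42263200-74382e04-b9bf-009c06010000"
IFS=- read -r cos_time tsc_time rand mac <<EOF
$uuid
EOF
echo "COS creation time : $cos_time"
echo "TSC time          : $tsc_time"
echo "random number     : $rand"
echo "COS NIC MAC       : $mac"
```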

Like I said before, and this is a common misconception so I will say it again: it is the LUN ID and the storage system product ID that are stored in the LVM header, not the actual UUID itself. Not that it really matters for the way the process works, though.

I think that makes it clear when to use EnableResignature and when not to: use it when you want to access VMFS volumes whose LUN ID changed for whatever reason, for instance after a fail-over to a DR site with different LUN numbering, or after SAN upgrades that caused changes in LUN numbering.

That leaves DisallowSnapshotLun. I had a hard time figuring out when to set it to “0” and when to leave it at the default setting, “1”, but I found the following in a VMworld Europe 2008 presentation:

DisallowSnapshotLun: Should be set to “0” if SCSI Inquiry string differs between the two Array’s in order to allow access to datastores.

I googled “SCSI Inquiry” and found the following in a document by HP:

The storage system product ID retrieved from the SCSI Inquiry string (Example: HSV210)

In other words, when you’ve got an HP EVA 4000 and an HP EVA 8000 that are mirrored, you need to set DisallowSnapshotLun to 0 when a fail-over has occurred. The SCSI Inquiry string would differ because the controllers are of a different model. (The SCSI Inquiry string also contains the LUN ID, by the way.)

When both sites are exactly the same, including LUN IDs, you don’t need to change this setting; leave it set to 1. Be absolutely sure, when you set DisallowSnapshotLun to 0, that only one “version” of the VMFS volume is presented to the host. If for some reason both are presented, data corruption can and probably will occur. If you need to present both LUNs at the same time, use EnableResignature instead of DisallowSnapshotLun.

Depending on the way your environment is set up and the method you chose to re-enable a set of LUNs, you may need to re-register your VMs. The only way to avoid this is to use DisallowSnapshotLun and pre-register all VMs on the secondary VirtualCenter server, or to use just one VirtualCenter server.

Re-registering can be done with a couple of lines of script on just one ESX box:

find /vmfs/volumes/ -name "*.vmx" | while read -r vmx
do
    echo "Registering VM $vmx"
    vmware-cmd -s register "$vmx"
done

You can change the EnableResignature or DisallowSnapshotLun setting as follows:

Open vCenter
Click on a host
Click on the “Configuration” tab
Click on “Advanced Settings”
Go to “LVM”
Change the appropriate setting
Click “OK”
Rescan your HBAs (Storage Adapters, Rescan)

It’s also possible to use the command line to enable DisallowSnapshotLun or EnableResignature:

echo 0 > /proc/vmware/config/LVM/DisallowSnapshotLUN
echo 1 > /proc/vmware/config/LVM/EnableResignature
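If I remember correctly, ESX 3.x also exposes these settings through esxcfg-advcfg; treat the exact option paths below as an assumption and verify them with a read (-g) before setting anything:

```shell
# Assumed esxcfg-advcfg equivalents on ESX 3.x -- verify on your build:
esxcfg-advcfg -g /LVM/EnableResignature       # read the current value
esxcfg-advcfg -s 1 /LVM/EnableResignature     # enable resignaturing
esxcfg-advcfg -s 0 /LVM/DisallowSnapshotLUN   # allow snapshot LUNs
```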

I do want to stress that these options should only be set temporarily, considering the impact the changes can have! After you set either option, reset it to the default.

The big question still remains: would I prefer resignaturing my VMFS volumes, or setting “DisallowSnapshotLun” to “0”, to be able to access the volumes? The answer is: it depends. It heavily depends on the type of setup you have, and I can’t answer this question without knowing the background of an environment. The safest method is definitely resignaturing.

Before you decide, read this post again and read the articles/PDFs in the links below that I used as references:

Updates for the VMFS volume resignaturing discussion
HP disaster tolerant solutions using Continuous Access for HP EVA in a VI 3 environment

Fibre Channel SAN Configuration Guide

VMFS Resignaturing by Chad Sakac

vCenter Site Recovery Manager 1.0 Update 1

Duncan Epping · Dec 5, 2008 ·

VMware just released vCenter Site Recovery Manager 1.0 Update 1. Scott already dropped the news that it doesn’t contain NFS support, which has always kept me away from NFS, and iSCSI for that matter. But with 10Gb Ethernet becoming more mainstream, this focus might change.

Anyway, the following features have been added:

  • New Permission Required to Run a Recovery Plan
    SRM now distinguishes between permission to test a recovery plan and permission to run a recovery plan. After an SRM server is updated to this release, existing users of that server who had permission to run a recovery plan no longer have that permission. You must grant Run permission to these users after the update is complete. Until you do, no user can run a recovery plan. (Permission to test a recovery plan is unaffected by the update.)
  • Full Support for RDM devices
    SRM now provides full support for virtual machines that use raw disk mapping (RDM) devices. This enables support of several new configurations, including Microsoft Cluster Server. (Virtual machine templates cannot use RDM devices.)
  • Batch IP Property Customization
    This release of SRM includes a tool that allows you to specify IP properties (network settings) for any or all of the virtual machines in a recovery plan by editing a comma-separated-value (csv) file that the tool generates.
  • Limits Checking and Enforcement
    A single SRM server can support up to 500 protected virtual machines and 150 protection groups. This release of SRM prevents you from exceeding those limits when you create a new protection group. If a configuration created in an earlier release of SRM exceeds these limits, SRM displays a warning, but allows the configuration to operate.
  • Improved Support for Virtual Machines that Span Multiple Datastores.
    This release provides improved support for virtual machines whose disks reside on multiple datastores.
  • Single Action to Reconfigure Protection for Multiple Virtual Machines
    This release introduces a Configure All button that applies existing inventory mappings to all virtual machines that have a status of Not Configured.
  • Simplified Log Collection
    This release introduces new utilities that retrieve log and configuration files from the server and collect them in a compressed (zipped) folder on your desktop.
  • Improved Acceptance of Non-ASCII Characters
    Non-ASCII characters are now allowed in many fields during installation and operation.

Be sure to read the “known issues” section before you start implementing.



About the Author

Duncan Epping is a Chief Technologist and Distinguished Engineering Architect at Broadcom. Besides writing on Yellow-Bricks, Duncan is the co-author of the vSAN Deep Dive and the vSphere Clustering Deep Dive book series. Duncan is also the host of the Unexplored Territory Podcast.
