BC-DR

SRM 4.0 released

Duncan Epping · Oct 5, 2009 ·

VMware just released version 4.0 of SRM. To be clear, this is not the fourth version of SRM; the version number has been aligned with vCenter and ESX. I’ve highlighted the new features that I think are really useful or exciting.

Site Recovery Manager 4.0 | 05-October-2009 | Build 192921

Release Note
New Features:

Full compatibility with vCenter 4.

Full support for NFS-based arrays.

Support for shared recovery sites.
Enables many-to-one pairings of protected sites with a recovery site. For more information, see the technical note Installing, Configuring, and Using Shared Recovery Site Support, which is available at http://www.vmware.com/support/pubs/srm_pubs.html.

Resilience in the face of vCenter unavailability during a test recovery.
Placeholder virtual machines can be quickly repaired after the protected site vCenter becomes available again.

New repair-mode installation features.
You can run the SRM installer in repair mode if you need to change configuration parameters such as vCenter credentials, database connection information or credentials, and certificate details.

Graphical interface to advanced settings.
Eliminates most requirements to edit the XML configuration file.

Support for DB2 as an SRM database server.

New licensing options.

Improved scalability.
A single protection group can now include up to 1000 virtual machines.

Full Compatibility With DPM (Distributed Power Management)
SRM recovery plans can now power-on or power-off a host that is in standby mode.

New Option to dr-ip-customizer Utility
The dr-ip-customizer utility now logs less verbose diagnostic output by default. To force dr-ip-customizer to log the same level of diagnostic output that it produced in earlier releases, use the -verbose option.

Change in Certificate Validation
When you select certificate authentication, the SRM installation validates the certificate you supply before continuing. Certificates signed with an MD5 key are no longer allowed.

Support for Protecting Fault-Tolerant Virtual Machines.
SRM can now protect virtual machines that have been configured for fault-tolerant operation. When recovered, these virtual machines lose their fault tolerance, and must be manually reconfigured after recovery to restore fault tolerance.

Improved context-sensitive Help.

PDF documents available on release media
Current versions of the PDF documents for this release are available in the docs folder at the root of the SRM 4.0 CD. Updated versions of these documents may be available at http://www.vmware.com/support/pubs/srm_pubs.html.

Keep in mind that if you want to do an upgrade you need to use a specific method to be successful. It’s described here. Now go ahead, download it and try it out!

HA: Did you know?

Duncan Epping · Sep 20, 2009 ·

Did you know that…

the best practice of increasing the isolation response time (das.failuredetectiontime) from 15000 to 60000 for an Active/Standby service console setup has been deprecated as of vSphere.
(In other words, for Active/Standby leave it at the default of 15000 for vSphere.)
the limit of 100 VMs per host is actually “100 powered-on and HA-enabled VMs”. Of course this also goes for the 40 VM limit for clusters with more than 8 hosts.
the limit of 100 VMs per host in an HA cluster with fewer than 9 hosts is a soft limit.
das.isolationaddress[0-9] is one of the most underrated advanced settings.
It should be used as an additional safety net to rule out false positives.

Just four little things most people don’t seem to realize or know…
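
To make the counting rule above concrete, here is a minimal Python sketch. It is not VMware code: the `VM` class and the function names are hypothetical, and the numbers are simply the limits quoted in this post (100 per host for clusters of up to 8 hosts, 40 per host beyond that).

```python
from dataclasses import dataclass
from typing import List

# Limits as quoted in the post; illustrative only, not taken from any API.
SMALL_CLUSTER_MAX_HOSTS = 8
LIMIT_SMALL_CLUSTER = 100
LIMIT_LARGE_CLUSTER = 40


@dataclass
class VM:
    """Hypothetical VM record used only for this illustration."""
    name: str
    powered_on: bool
    ha_enabled: bool


def vm_limit_per_host(hosts_in_cluster: int) -> int:
    """Return the per-host VM limit for a cluster of the given size."""
    if hosts_in_cluster <= SMALL_CLUSTER_MAX_HOSTS:
        return LIMIT_SMALL_CLUSTER
    return LIMIT_LARGE_CLUSTER


def counted_vms(vms: List[VM]) -> int:
    """Count only VMs that are both powered on and HA enabled."""
    return sum(1 for vm in vms if vm.powered_on and vm.ha_enabled)


def within_limit(vms_on_host: List[VM], hosts_in_cluster: int) -> bool:
    """Check one host's VMs against the limit for its cluster size."""
    return counted_vms(vms_on_host) <= vm_limit_per_host(hosts_in_cluster)
```

Powered-off VMs and VMs with HA disabled drop out of the count, so a host can carry more registered VMs than the limit suggests.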

VMware Availability Solutions and Futures (BC3425 – Banjot Chanana)

Duncan Epping · Sep 16, 2009 ·

I was just replaying Banjot Chanana’s session “VMware Availability Solutions and Futures“. Banjot is the product manager for the availability solutions HA and FT. I met Banjot in Palo Alto the week before VMworld and we spoke about HA, present and futures. Unfortunately I can’t elaborate on anything that was discussed, but I can repeat what Banjot spoke about during his session at VMworld.

The most exciting part of the presentation, for me at least, starts at roughly 35:40. Banjot starts to elaborate on futures, and it gets especially interesting when the 3D model is expanded with “Stretched Clusters with FT” and “Stretched HA Clusters”. Some bullet points on future developments:

VM Component Protection -> loss of storage / loss of VM network -> fail-over / alert
Drives higher availability against granular outages
Stretched HA Clusters -> Carving up Clusters in “sub-clusters” by tagging VMs -> fail-over to other “sub-cluster” based on affinity
Drives higher availability against site failures
Application Monitoring -> Application awareness / correlation between infrastructure and application events -> SLA awareness also performance by using DRS
Drives higher availability against application / service failure
Host Retirement -> Host health scores would also indicate “VM readiness” of a host -> VMotion based on host health scores
Drives higher availability by monitoring host health and taking action when thresholds are exceeded
Integrated Availability -> Availability Policies vs per VM settings -> Defining tiers and applying them to sets of VMs -> Based on SLA
Decreases operational efforts and increases availability by reducing “human errors”

Although some people were disappointed by the lack of new product announcements, I think there are more than enough exciting features coming up if you know where to find them. Thanks Banjot for these insights.

Future HA developments… (VMworld – BC3197)

Duncan Epping · Sep 15, 2009 ·

I was just listening to “BC3197 – High Availability – Internals and Best Practices” by Marc Sevigny. Marc is one of the HA engineers and is also my primary source of information when it comes to HA. Although most information can be found on the internet it’s always good to verify your understanding with the people who actually wrote it.

During the session Marc explains, and I’ve written about this in this article, that when a dual host failure occurs the global startup order is not taken into account. With the current version the startup order is processed per host. In other words, “Host A” is handled first, taking its startup order into account, and then “Host B”, taking its startup order into account.

During the session, however, Marc revealed that in a future version of HA the global (cluster-based) startup settings will be taken into account for any number of host failures! Great stuff. Another thing worth mentioning is that they are also looking into an option that would enable you to pick your primary hosts. For blade environments this will be really useful. Thanks Marc for the insights.
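
The difference between per-host and global startup ordering is easiest to see with a toy example. The Python sketch below is only a model of the behaviour described in the session, not HA's actual implementation; the numeric priorities (lower restarts first), VM names, and function names are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical model: each VM is (name, priority); lower numbers restart first.
VMList = List[Tuple[str, int]]


def per_host_order(failed_hosts: Dict[str, VMList]) -> List[str]:
    """Current behaviour as described: process one failed host at a time,
    applying the startup order only within that host's VMs."""
    order: List[str] = []
    for vms in failed_hosts.values():
        order.extend(name for name, _ in sorted(vms, key=lambda vm: vm[1]))
    return order


def global_order(failed_hosts: Dict[str, VMList]) -> List[str]:
    """Future behaviour as described: one startup order across all failed hosts."""
    all_vms = [vm for vms in failed_hosts.values() for vm in vms]
    return [name for name, _ in sorted(all_vms, key=lambda vm: vm[1])]


failed = {
    "Host A": [("app-a", 2), ("db-a", 1)],
    "Host B": [("db-b", 1), ("web-b", 3)],
}

print(per_host_order(failed))  # ['db-a', 'app-a', 'db-b', 'web-b']
print(global_order(failed))    # ['db-a', 'db-b', 'app-a', 'web-b']
```

In the per-host case `app-a` is restarted before `db-b` even though `db-b` has the higher priority, which is exactly the gap a cluster-wide startup order would close.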

Site Recovery Manager 1.0 Update 1 Patch 4

Duncan Epping · Sep 14, 2009 ·

One of my colleagues, Michael White, just pointed out that VMware released a patch for Site Recovery Manager:

Site Recovery Manager 1.0 Update 1 Patch 4
File size: 7.9 MB
File type: .msi

Here are the most important fixes:

a problem that could cause a recovery plan to fail and log the message
Panic: Assert Failed: “_pausing” @ d:/build/ob/bora-172907/santorini/src/recovery/secondary/recoveryTaskBase.cpp:328

a problem that caused the SRM SOAP API method getFinalStatus() to write all XML output on a single line

full session keys are no longer logged (partial keys are used in the log instead)

a problem that could cause SRM to crash during a test recovery and log the message
Exception: Assert Failed: “!IsNull()” @ d:/build/ob/bora-128004/srm101-stage/santorini/public\common/typedMoRef.h:168

a problem that could cause a recovery plan test to fail to create test bubble network when recovering virtual machines that had certain types of virtual NICs

a problem that could cause incorrect virtual machine start-up order on recovery hosts that enable DRS

a problem that could cause the SRM server to crash while testing a recovery plan

a problem that could cause SRM to fail and log a “Cannot execute scripts” error when customizing Windows virtual machines on ESX 3.5 U1 hosts.

support for customizing Windows 2008 has been added

a problem that could prevent network settings from being updated during test recovery for guests other than Windows 2003 Std 32-bit

a problem that prevents protected virtual machines from following recommended Distributed Resource Scheduler (DRS) settings when recovering to more than one DRS cluster.

a problem observed at sites that support more than seven ESX hosts. If you refresh inventory mappings when connected to such a site, the display becomes unresponsive for up to ten minutes.

a problem that could prevent SRM from computing LUN consistency groups correctly when one or more of the LUNs in the consistency group did not host any virtual machines.

a problem that could cause the client user interface to become unresponsive when creating protection groups with over 300 members

several problems that could cause SRM to log an error message vim.fault.AlreadyExists when recomputing datastore groups

a problem that could cause SRM to log an Assert Failed: “ok” @ src/san/consistencyGroupValidator.cpp:64 error when two different datastores match a single replicated device returned by the SRA

a problem that could cause SRM to remove static iSCSI targets with non-test LUNs during test recovery

several problems that degrade the performance of inventory mapping