
Yellow Bricks

by Duncan Epping


BC-DR

HA: Did you know?

Duncan Epping · Sep 20, 2009 ·

Did you know that…

  • the best practice of increasing the failure detection time (das.failuredetectiontime) from 15000 to 60000 for an Active/Standby service console setup has been deprecated as of vSphere.
    (In other words, for Active/Standby leave it set to the default of 15000 in vSphere.)
  • the limit of 100 VMs per host is actually “100 powered-on and HA-enabled VMs”. Of course this also goes for the 40 VM limit for clusters with more than 8 hosts.
  • the limit of 100 VMs per host in an HA cluster with fewer than 9 hosts is a soft limit.
  • das.isolationaddress[0-9] is one of the most underrated advanced settings.
    It should be used as an additional safety net to rule out false positives.

Just four little things most people don’t seem to realize or know…
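The value of the das.isolationaddress[0-9] safety net is easy to see in a toy model: a host only declares itself isolated when it fails to reach *every* configured isolation address, so one extra address can rule out a false positive from a flaky gateway. A minimal Python sketch, where `reachable` is a hypothetical stand-in for the real ping the HA agent performs:

```python
# Sketch of HA isolation detection with additional isolation addresses.
# Assumption: reachable() stands in for the actual ICMP ping done by the
# HA agent; it is injected here so the logic can be demonstrated.

def is_isolated(isolation_addresses, reachable):
    """A host declares itself isolated only if every configured
    isolation address is unreachable. Extra das.isolationaddress[0-9]
    entries therefore act as a safety net against false positives."""
    return not any(reachable(addr) for addr in isolation_addresses)

# Default: only the gateway is checked, so an unreachable gateway
# means a (possibly false) isolation verdict.
addresses = ["192.168.1.1"]                      # default gateway only
print(is_isolated(addresses, lambda a: False))   # True: declared isolated

# With one extra das.isolationaddress entry, a single reachable
# address is enough to rule out isolation.
addresses = ["192.168.1.1", "192.168.1.254"]     # gateway + extra address
print(is_isolated(addresses, lambda a: a == "192.168.1.254"))  # False
```

The addresses above are invented; in practice you would point the extra entries at reliable pingable devices on the service console network.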

VMware Availability Solutions and Futures (BC3425 – Banjot Chanana)

Duncan Epping · Sep 16, 2009 ·

I was just replaying Banjot Chanana’s session “VMware Availability Solutions and Futures”. Banjot is the product manager for the availability solutions HA and FT. I met Banjot in Palo Alto the week before VMworld, and we spoke about HA, present and future. Unfortunately I can’t elaborate on anything that was discussed, but I can repeat what Banjot spoke about during his session at VMworld.

The most exciting part of the presentation, for me at least, starts at roughly 35:40, where Banjot begins to elaborate on futures. Especially when the 3D model gets expanded with “Stretched Clusters with FT” and “Stretched HA Clusters”, things get interesting. Some bullet points on future developments:

  • VM Component Protection -> loss of storage / loss of VM network -> fail-over / alert
    Drives higher availability against granular outages
  • Stretched HA Clusters -> Carving up Clusters in “sub-clusters” by tagging VMs -> fail-over to other “sub-cluster” based on affinity
    Drives higher availability against site failures
  • Application Monitoring -> Application awareness / correlation between infrastructure and application events -> SLA awareness also performance by using DRS
    Drives higher availability against application / service failure
  • Host Retirement -> Host health scores would also indicate the “VM readiness” of a host -> VMotion based on host health scores
    Drives higher availability by monitoring host health and taking action when thresholds are exceeded
  • Integrated Availability -> Availability Policies vs per VM settings -> Defining tiers and applying them to sets of VMs -> Based on SLA
    Decreases operational efforts and increases availability by reducing “human errors”
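The “sub-cluster” idea can be illustrated with a toy model: hosts are grouped per site, VMs are tagged with a home site, and on a site failure the affected VMs restart on the other site's hosts. All names and the grouping mechanism below are my own illustration, not actual product settings:

```python
# Toy model of "stretched HA clusters": the cluster is carved up into
# sub-clusters (sites), and a VM fails over to the other sub-cluster
# when its own site goes down. All names here are illustrative.

SUB_CLUSTERS = {
    "site-a": ["esx01", "esx02"],
    "site-b": ["esx03", "esx04"],
}

def failover_targets(vm_site, failed_site):
    """Restart a VM in its home sub-cluster if that is still up,
    otherwise on the hosts of the surviving sub-cluster."""
    if vm_site != failed_site:
        return SUB_CLUSTERS[vm_site]          # home site unaffected
    other = next(s for s in SUB_CLUSTERS if s != failed_site)
    return SUB_CLUSTERS[other]

print(failover_targets("site-a", "site-a"))   # ['esx03', 'esx04']
print(failover_targets("site-a", "site-b"))   # ['esx01', 'esx02']
```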

Although some people were disappointed by the lack of new product announcements, I think there are more than enough exciting features coming up if you know where to find them. Thanks Banjot for these insights!

Future HA developments… (VMworld – BC3197)

Duncan Epping · Sep 15, 2009 ·

I was just listening to “BC3197 – High Availability – Internals and Best Practices” by Marc Sevigny. Marc is one of the HA engineers and is also my primary source of information when it comes to HA. Although most information can be found on the internet it’s always good to verify your understanding with the people who actually wrote it.

During the session Marc explains, as I’ve also written about in this article, that when a dual host failure occurs the global startup order is not taken into account. With the current version the startup order is processed per host. In other words, “Host A” is processed first, taking its startup order into account, and then “Host B”, taking its startup order into account.

During the session, however, Marc revealed that in a future version of HA the global startup settings (cluster-based) will be taken into account for any number of host failures! Great stuff. They are also looking into an option that would enable you to pick your primary hosts, which will be really useful for blade environments. Thanks Marc for the insights!
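The difference between the two behaviors is easy to show with a toy example. With per-host processing, each failed host's VMs restart in priority order independently, so a low-priority VM on the first host can come up before a high-priority VM on the second; a global order honors priorities across all failed hosts. The VM names and priorities below are made up:

```python
# Toy illustration of per-host vs. global restart ordering after a
# dual host failure. Lower number = higher startup priority.
# All VM names and priorities are invented for the example.

failed_hosts = {
    "host-a": [("db01", 1), ("web03", 3)],
    "host-b": [("app02", 2), ("web04", 3)],
}

def per_host_order(hosts):
    """Current behavior: process one failed host at a time, honoring
    the startup order only within that host."""
    order = []
    for host in hosts:
        order += [vm for vm, _ in sorted(hosts[host], key=lambda v: v[1])]
    return order

def global_order(hosts):
    """Future behavior: one cluster-wide ordering across all failed hosts."""
    all_vms = [vm for vms in hosts.values() for vm in vms]
    return [vm for vm, _ in sorted(all_vms, key=lambda v: v[1])]

print(per_host_order(failed_hosts))  # ['db01', 'web03', 'app02', 'web04']
print(global_order(failed_hosts))    # ['db01', 'app02', 'web03', 'web04']
```

Note how web03 (priority 3) restarts before app02 (priority 2) in the per-host case, which is exactly the behavior the future version is meant to fix.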

Site Recovery Manager 1.0 Update 1 Patch 4

Duncan Epping · Sep 14, 2009 ·

One of my colleagues, Michael White, just pointed out that VMware released a patch for Site Recovery Manager:

Site Recovery Manager 1.0 Update 1 Patch 4
File size: 7.9 MB
File type: .msi

Here are the most important fixes:

  • a problem that could cause a recovery plan to fail and log the message
    Panic: Assert Failed: “_pausing” @ d:/build/ob/bora-172907/santorini/src/recovery/secondary/recoveryTaskBase.cpp:328
  • a problem that caused the SRM SOAP API method getFinalStatus() to write all XML output on a single line
  • full session keys are no longer logged (partial keys are used in the log instead)
  • a problem that could cause SRM to crash during a test recovery and log the message
    Exception: Assert Failed: “!IsNull()” @ d:/build/ob/bora-128004/srm101-stage/santorini/public\common/typedMoRef.h:168
  • a problem that could cause a recovery plan test to fail to create test bubble network when recovering virtual machines that had certain types of virtual NICs
  • a problem that could cause incorrect virtual machine start-up order on recovery hosts that enable DRS
  • a problem that could cause the SRM server to crash while testing a recovery plan
  • a problem that could cause SRM to fail and log a “Cannot execute scripts” error when customizing Windows virtual machines on ESX 3.5 U1 hosts.
  • support for customizing Windows 2008 has been added
  • a problem that could prevent network settings from being updated during test recovery for guests other than Windows 2003 Std 32-bit
  • a problem that prevents protected virtual machines from following recommended Distributed Resource Scheduler (DRS) settings when recovering to more than one DRS cluster.
  • a problem observed at sites that support more than seven ESX hosts. If you refresh inventory mappings when connected to such a site, the display becomes unresponsive for up to ten minutes.
  • a problem that could prevent SRM from computing LUN consistency groups correctly when one or more of the LUNs in the consistency group did not host any virtual machines.
  • a problem that could cause the client user interface to become unresponsive when creating protection groups with over 300 members
  • several problems that could cause SRM to log a vim.fault.AlreadyExists error message when recomputing datastore groups
  • a problem that could cause SRM to log an Assert Failed: “ok” @ src/san/consistencyGroupValidator.cpp:64 error when two different datastores match a single replicated device returned by the SRA
  • a problem that could cause SRM to remove static iSCSI targets with non-test LUNs during test recovery
  • several problems that degrade the performance of inventory mapping

VMware Data Recovery 1.0.2

Duncan Epping · Sep 10, 2009 ·

VMware just released a brand new version of VMware Data Recovery.

Version 1.0.2
Build Number 188925
Release Date 2009/09/09

This release fixes a couple of known issues:

  • Various Integrity Check Issues
    Under certain circumstances, integrity checks reported damaged restore points and “cannot load session” errors. For example, such problems might be reported if:

    • A combination of simultaneous overlapping backups and integrity checks are started.
    • A backup is stopped before completion because the backup window closes. In such a case, the deduplication store records transactions, but the closing of the backup window prevents recording the transaction to the catalog.

    When integrity checks failed in such cases, Data Recovery would mark restore points as damaged or report that the backup session could not be found. Data Recovery integrity check now handles these conditions properly, so these problems no longer occur.

  • Connections Using Alternate Ports not Supported
    By default, connections to vCenter Server use port 443. If vCenter Server is configured to use an alternate port, Data Recovery continued to attempt to connect using the default port. This caused the Data Recovery plug-in to report authentication failures when attempting to connect to the Data Recovery appliance. Alternate vCenter Server port configurations are now supported.
  • Multiple VMDKs with the Same Name not Handled Properly
    A virtual machine can have multiple VMDK files with the same name that are stored on different LUNs. In such a case, Data Recovery would only restore one of the disks. Data Recovery now restores all disks.

You can find the full release notes here.
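The duplicate-VMDK fix essentially boils down to identifying disks by their full datastore path rather than by bare filename, so two identically named VMDKs on different LUNs no longer collide. A minimal sketch of the difference (the paths are invented for illustration):

```python
# Sketch of why identically named VMDKs on different LUNs clash when
# keyed by filename only, and how keying by full datastore path avoids
# it. The paths below are invented for illustration.

disks = [
    "[lun01] vm01/vm01.vmdk",
    "[lun02] vm01/vm01.vmdk",   # same filename, different LUN
]

# Keyed by filename: the second entry silently overwrites the first,
# so only one disk would be restored.
by_name = {path.split("] ")[1]: path for path in disks}

# Keyed by full path: both disks are kept and restored.
by_path = {path: path for path in disks}

print(len(by_name))  # 1
print(len(by_path))  # 2
```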

