I receive this great newsletter via email every week from Michael White, one of our Specialist SEs. Michael created a great VMware SRM document, and this FAQ is part of it. I want to thank Michael for sharing it with the rest of the world.
Generic
I want to install SRM, what do I need to do?
It is important to understand the SRM installation overview. You must install using the order of operation as shown in the lab section of this document. You must do this on the protected site first, followed by the recovery side. Here is the outline:
- SRM application installed at Protected Site
- SRM application plug in installed in VI clients that connect with the Protected Site
- SRA installed at the Protected Site
- SRM application installed at Recovery Site
- SRM application plug in installed in VI clients that connect with the Recovery Site
- SRA installed at the Recovery Site
- SRM configured at the Protected Site
- SRM server pairing
- Array Configured – both Protected Site and Recovery Site
- Inventory Mapping
- Protection Group
- SRM configured at the Recovery Site
- Recovery Plan created
You should now test and tweak SRM. Remember the goal is to have the required VM’s running at the recovery site in the least amount of time.
What does an SRM lab require?
The ideal SRM lab requires the following:
- Two VirtualCenter servers
- Each VirtualCenter server requires at least one ESX server, and the Recovery site should have two to show the integration with DRS as part of a recovery plan.
- Each of the two sites requires shared storage that replicates, and it needs to be on the compatibility list. This shared storage can be the NetApp simulator, HP / LeftHand VSA, or the EMC Simulator. It can also be actual hardware-based shared storage that can replicate.
Some of the activities that can be shown would include:
- Test failover
- Actual failover
- Failover with IP customization
- Failover where multiple VM’s start on various ESX servers
- The use of a virtual switch that can connect VM’s on different ESX servers to a private network. This is very useful for testing. By using a VLAN, you can test without impacting the public network. Remember that the test bubble network that can be used in SRM only provides communication on a per ESX host basis.
When does SRM raise VC events?
SRM will raise VC events for the following conditions:
- Disk space low
- CPU use exceeded limit
- Memory low
- Remote Site not responding
- Remote Site heartbeat failed
- Recovery Plan Test started, ended, succeeded, failed, or cancelled
- Virtual Machine Recovery started, ended, succeeded, failed, or reports a warning
What are the recommended minimum alarm notifications?
We suggest the following alarm notifications. You can set them on the Alarm tab of the SRM status summary page. Most organizations will use email notifications, but there are other choices as well. Remember to set these suggested alarm notifications at both sites.
- Remote Site Down
- Remote Site Ping Failed
- Replication Group Removed
- Recovery Plan Destroyed
- License Server Unreachable
How do I plan for disk utilization due to SRM database?
Recently we brought out the database sizing tool. Find it at http://www.vmware.com/files/pdf/Site_Recovery_Manager_1.0U1_Database_Sizing_Calculator.xls.
Where can I find help for installing different array products?
The obvious place to start is the array vendor, who will have documents to help. You can also find information elsewhere. A VMware support person has written how-to guides for a variety of different arrays. They can be found at http://viops.vmware.com/home/people/chogan?view=overview . He has done an excellent job and I hope that his guides help you out.
How can I capture the log and configuration information for support to work with?
This is most easily done after Update 1 by using the “Generate Site Recovery Manager Log Bundle” command in the VMware\VMware Site Recovery Manager Start Menu folder. Run this command on the SRM server. It will produce a zipped file on your desktop, named in MM-DD-YYYY-HH-MM.zip format, that is, Month-Day-Year-Hours-Minutes. Always provide the logs with your request for help!
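If you collect several bundles while working a case, it can help to sort them by the time they were generated. Here is a minimal Python sketch that reads the timestamp back out of a bundle name, assuming the file name follows the MM-DD-YYYY-HH-MM.zip format described above. The function name and the sample file name are made up for illustration.

```python
import os
from datetime import datetime


def bundle_timestamp(filename: str) -> datetime:
    """Parse the creation time out of a log bundle name (MM-DD-YYYY-HH-MM.zip)."""
    stem, ext = os.path.splitext(os.path.basename(filename))
    if ext.lower() != ".zip":
        raise ValueError(f"not a zip bundle: {filename}")
    # Month-Day-Year-Hours-Minutes, as described in the FAQ answer.
    return datetime.strptime(stem, "%m-%d-%Y-%H-%M")


# Hypothetical bundle name, used only as an example.
stamp = bundle_timestamp("03-15-2009-14-30.zip")
print(stamp.isoformat())
```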
What is the account that is asked for during install used for?
The 1.0 installer prompted for a username during installation. This is the account SRM will use to communicate with the local VC server. Since SRM constantly monitors the local VC inventory, this user will be constantly logged into the local VC server. Changing the password for this account will make it impossible to use SRM. Please note that this should be an account in the Administrators group. By default, when you install SRM 1.0 or SRM 1.0 U1, all accounts in the Administrators group have complete access to SRM managed objects. Again, this has not changed with U1. Please try to use AD accounts when you install SRM, and when you log into SRM. Using local accounts can work, but it is a little tricky. If you need some guidance on using local accounts I can help. This account is NOT the account used by the system – the SRM service uses the Local System Account.
Can I change the IP information for the SRM server?
I would like to change the IP info for the SRM server once it is installed. Is this safe, or is there a specific way to do this without issues? If you need to change the IP info for the SRM server, or the credentials (account or password), you will need to use a special utility. Once the change is done you will also need to pair the two sites again. You can find detailed info on how to do this on page 85, in Appendix C of the SRM Admin Guide.
How do I add a script to a Recovery Plan in a call out?
When you add a script to a call out in a recovery plan, it is an empty dialog. Use the information below to add a script that will work as expected. It is important to understand that the scripts or commands must be in the path of the VirtualCenter.
- Use full paths to all executables, for example “c:\windows\system32\cmd.exe” instead of “cmd.exe”.
- You can use .exe or .com files only! Command line scripts can only call executables.
- To run a batch file you should start the shell command with “c:\windows\system32\cmd.exe”. So it would look like “c:\windows\system32\cmd.exe /c c:\scripts\alarmscript.bat”.
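The rules above can be sketched as a small helper that turns a script path into a call-out command line. This is only an illustration in Python, not something SRM ships. The function name is my own, and the cmd.exe location is the one from the example above, so adjust it if your Windows directory differs.

```python
import ntpath

# Full path to the shell, as in the example above (assumed default location).
CMD_EXE = r"c:\windows\system32\cmd.exe"


def callout_command(script_path: str) -> str:
    """Build a call-out command line that follows the three rules above."""
    # Rule 1: always use a full path.
    if not ntpath.isabs(script_path):
        raise ValueError(f"use a full path, got: {script_path}")
    extension = ntpath.splitext(script_path)[1].lower()
    # Rule 2: .exe and .com files can be called directly.
    if extension in (".exe", ".com"):
        return script_path
    # Rule 3: batch files must be started through cmd.exe /c.
    if extension == ".bat":
        return f"{CMD_EXE} /c {script_path}"
    raise ValueError(f"unsupported call-out type: {extension}")


print(callout_command(r"c:\scripts\alarmscript.bat"))
```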
How do I change the value for a script timeout?
You can increase or decrease this value by editing the SRM configuration file (vmware-dr.xml). Look for the following section:
<calloutCommandLineTimeout>600</calloutCommandLineTimeout>
Change value to the appropriate value.
During the configuration of SRM I receive a timeout after 300 seconds, how do I change the value for this timeout?
You can increase or decrease this value by editing the SRM configuration file (vmware-dr.xml). Look for the following section:
<CommandTimeout>300</CommandTimeout>
Change value to the appropriate value.
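If you have to make the same timeout change on several SRM servers, the edit can be scripted. Below is a minimal Python sketch that sets either timeout element wherever it appears in the XML. The `<Config>` root element in the sample is an assumption for illustration, since I am not reproducing the real layout of vmware-dr.xml here. Note that ElementTree drops XML comments when it parses a file, so keep a backup of the original before writing anything back.

```python
import xml.etree.ElementTree as ET


def set_timeout(xml_text: str, tag: str, seconds: int) -> str:
    """Set every <tag> element in the document to the given number of seconds."""
    root = ET.fromstring(xml_text)
    matches = list(root.iter(tag))
    if not matches:
        raise KeyError(f"<{tag}> not found in configuration")
    for element in matches:
        element.text = str(seconds)
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily shortened stand-in for vmware-dr.xml.
sample = (
    "<Config>"
    "<calloutCommandLineTimeout>600</calloutCommandLineTimeout>"
    "<CommandTimeout>300</CommandTimeout>"
    "</Config>"
)
print(set_timeout(sample, "calloutCommandLineTimeout", 900))
```

As with the other vmware-dr.xml changes in this FAQ, restart the SRM service afterwards so the new value is read.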
I would like to use trusted certificates with SRM – help!
You can use your own trusted certificates with SRM but it is more complicated than you might expect. There is some excellent information to help you be successful at http://viops.vmware.com/home/docs/DOC-1261 .
What happens if you move one of the protected VM’s to a datastore that is not part of the VM’s current protected group?
Protection will be revoked for the VM. It will have a small yellow triangle associated with it in its protection group. This is true even if you move the VM (for example with Storage VMotion) to another datastore that is replicated to the recovery site.
Can network customization work for operating systems other than Windows?
Yes. This includes operating systems from Novell, and Red Hat. The specific version information can be found in the SRM Compatibility Matrix document.
Understanding order of operation for bringing VM’s back online.
During the recovery period, the order in which VM’s are recovered is not as obvious as you might expect. Normal and Low priority protection groups (VMs) will be started one VM per ESX host at the same time. So you could have a number of Normal priority VM’s starting at the same time, spread across various ESX servers. However, High priority starts VM’s serially regardless of how many hosts are involved. Misconfiguration of the security for storage arrays may impact the start order of VM’s. For example, if the security of the array means it cannot talk to a particular ESX host, then that host will not be used to start VM’s during a recovery plan. It is possible to see this without any obvious error messages!
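To make the start order easier to picture, here is a small Python model of it. This is my own simplification, not SRM's scheduler: it treats the start-up as a series of "waves", where each High priority VM gets a wave of its own and Normal and Low priority VM's are started in waves of at most one VM per usable ESX host. The VM names and the host count are invented.

```python
def start_waves(vms, usable_hosts):
    """Return a list of waves; VM's in the same wave start at the same time.

    vms is a list of (name, priority) tuples, priority being
    "high", "normal" or "low". usable_hosts is the number of recovery
    hosts that can actually reach the storage.
    """
    if usable_hosts < 1:
        raise ValueError("at least one usable host is required")
    waves = []
    # High priority: strictly one VM at a time, regardless of host count.
    for name, priority in vms:
        if priority == "high":
            waves.append([name])
    # Normal, then Low: one VM per host at the same time.
    for level in ("normal", "low"):
        batch = [name for name, priority in vms if priority == level]
        for index in range(0, len(batch), usable_hosts):
            waves.append(batch[index:index + usable_hosts])
    return waves


plan = [("db", "high"), ("app", "high"), ("web1", "normal"),
        ("web2", "normal"), ("web3", "normal"), ("util", "low")]
for wave in start_waves(plan, usable_hosts=2):
    print(wave)
```

Dropping `usable_hosts` from 2 to 1 in this model shows why a host that cannot reach the array stretches the recovery time without producing any error.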
Can I fail-over VMs which have disks on two different arrays, for instance NetApp and EMC?
No. Although you can install SRA’s from multiple vendors, failing over a VM that has disks on both arrays will not work.
What does the Repair button do?
The Repair button is used when the protected site is not available and some array reconfiguration is required. Normally this would be done at the protected site, but if it is not available then the Repair button can be used.
Is it all over when the recovery plan fails?
You can have a recovery plan fail with some sort of error, but it will complete anything that it can complete. You can then address and solve the error and run the recovery plan again. If you have correctly addressed the error, it may complete correctly this time. It will not redo things that it has already done correctly. Once I had a problem with a VM starting, so I let the replication finish, did a manual HBA refresh, and tried again. The two VM’s that had already started were not touched, but the third VM, which had now finished replicating, was started.
Troubleshooting
Where are the new Run and Test privileges?
After you update to Update 1 you should see a Run and a Test privilege in the roles and privileges area, but you may not. Restart VC and you will see them.
Where are the SRM server logs stored?
They can be found in:
C:\Documents and Settings\All Users\Application Data\VMware\VMware Site Recovery Manager\Logs
You will need to check the vmware-dr-index file to see what is the current log file.
I see a lot of recomputed datastore failures in my mixed 2.5 / 3.5 environment, what’s happening?
If you have ESX 2.5 hosts accessing a protected datastore you will see recomputed datastore failures. Remove the ESX 2.5 host from the datastore.
I’m having pairing issues and it fails at specific %, why?
If you have an issue at approximately 24% it could be related to the license file not being live or installed. Reread the license file or restart the license service.
If you have an issue at approximately 82 or 84% you should make sure that the account you used to connect to the Recovery site has both VC and SRM admin rights. The specific role for SRM is Protected Site Administrator and on the Recovery Site it is called Recovery Site Administrator. This issue occurs most in a Microsoft domain world. The Administrator role includes both the Protected and Recovery site admin roles.
Other things to check when troubleshooting pairing issues include firewalls between the sites and whether VC is running successfully at the recovery site.
I’m configuring the SRDF SRA and although we replicated storage and it contains VMs I still don’t see “replicated LUNs”.
After checking all the configuration settings on the SRA side, the SRM side and the SAN, we noticed that the SPC-2 bit was not enabled. This setting is mandatory according to the FC SAN Config Guide (page 57), and enabling it solved our issues.
“Failed to connect to the management system address when executing the discoverArrays command.”
You should not often see this, but it can be addressed by making sure the SRA is in fact installed on the recovery side. You may also need to check routing between the sites (in particular to the Recovery side SRA / storage management interface).
How do I change the SRM change of power state time out values?
The default value is 120 seconds, which might not be long enough and could lead to issues when a power off of a VM is forced. You can increase or decrease this value by editing the SRM configuration file (vmware-dr.xml). Look for the following section:
<Recovery>
<powerStateChangeTimeout>120</powerStateChangeTimeout>
</Recovery>
If this section is not in the .xml file add it. Don’t forget to restart the SRM Service.
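Because this section may or may not exist yet, a script has to handle both cases. The Python sketch below updates the value if the section is there and adds the section if it is not. It assumes that `<Recovery>` sits directly under the root element of vmware-dr.xml, and the `<Config>` root in the sample is a placeholder, so check both against your own file first. ElementTree does not keep XML comments, so work on a copy.

```python
import xml.etree.ElementTree as ET


def ensure_power_state_timeout(xml_text: str, seconds: int) -> str:
    """Set <Recovery>/<powerStateChangeTimeout>, creating either if missing."""
    root = ET.fromstring(xml_text)
    recovery = root.find("Recovery")
    if recovery is None:
        # Section is not in the file yet, so add it (assumed to live under the root).
        recovery = ET.SubElement(root, "Recovery")
    timeout = recovery.find("powerStateChangeTimeout")
    if timeout is None:
        timeout = ET.SubElement(recovery, "powerStateChangeTimeout")
    timeout.text = str(seconds)
    return ET.tostring(root, encoding="unicode")


# Hypothetical minimal configuration without a <Recovery> section.
print(ensure_power_state_timeout("<Config></Config>", 300))
```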
Error: Failed to recover datastore:
This error usually indicates that the recovery side cannot communicate with the array on the recovery side. In the SRM logs on the recovery side you can see one or more Mapped LUN lines that show what the protected side is mapped to on the recovery side. This will sometimes help you fix this error.
We noticed a “SRM unlicensed error” in the logs but we have a good license installed.
If you change the SRM license file(s) you may have a small issue, as it is not the same process as changing an ESX or VC license. You would follow the normal steps of dropping the file in the license folder and rereading the license folder in the license tool. This would be enough for VC or ESX but is not enough for SRM. You could after these steps see the license in the VC Admin License view, but would still see the unlicensed errors in the SRM log. You need to restart the SRM service for the new license change to occur.
I cannot uninstall SRM successfully – what can I do?
Uninstalling SRM will normally require access to the VC that it is paired with. If you do not have that VC running it is hard to uninstall SRM. If you don’t cleanly uninstall SRM you cannot install it again. It is possible to uninstall with no VC if you read the screens carefully and answer appropriately, but I have seen where that doesn’t work. Use one of the ideas below to help if you need it. It is always best to use the Add Remove programs method to uninstall but if that doesn’t work the ideas below should.
msiexec.exe /qn /x {35A202EA-1549-4592-97A5-65F5E4CCDEC9}
Microsoft’s uninstall utility: http://support.microsoft.com/kb/29031
Only three Recovery Plans can run at the same time.
Not sure what the error message is if you try to do more than 3 but at least you now know that only 3 should be executed at the same time. This is due to the QA level of testing and will be significantly improved in the future.
Can I automatically rename my datastore back to its original name?
Edit the vmware-dr.xml file in the C:\Program Files\Site Recovery Manager\Config directory and look for a line that reads:
<fixRecoveredDatastoreNames>false</fixRecoveredDatastoreNames>
Change it to:
<fixRecoveredDatastoreNames>true</fixRecoveredDatastoreNames>
Can I change the administrator’s email address after the installation?
Extension.xml is the configuration xml file where you can change the Administrator Email:
<adminEmail>[email protected]</adminEmail>
Why is Port 80 used in the install but port 443 later?
During the install of SRM, port 80 is specified and you cannot type in 443, but after the install is complete SRM talks to VC on 443, so why is 80 specified in the install? Even though SRM uses SSL when it communicates with VC, it does not use port 443. SRM establishes a TCP connection to port 80, then uses an HTTP CONNECT request to establish a tunnel to the VC server, then does an SSL handshake with the VC over that tunneled connection. The SRM installation enforces these semantics.
I need to rescan my storage twice before I actually see my LUNs. Can SRM also do this?
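The sequence described above (plain TCP to port 80, an HTTP CONNECT request, then SSL inside the tunnel) is a generic pattern, and a rough Python sketch may make it clearer. This is not SRM's code. The `tunnel_target` value that goes into the CONNECT line is a placeholder, because the FAQ does not say which target SRM actually requests, and the reply handling is simplified to a single read.

```python
import socket
import ssl


def build_connect_request(tunnel_target: str) -> bytes:
    """Return the bytes of an HTTP CONNECT request for the given target."""
    return (
        f"CONNECT {tunnel_target} HTTP/1.1\r\n"
        f"Host: {tunnel_target}\r\n"
        "\r\n"
    ).encode("ascii")


def open_tunneled_ssl(vc_host: str, tunnel_target: str, timeout: float = 10.0):
    """Connect to port 80, ask for a tunnel, then do the SSL handshake over it."""
    sock = socket.create_connection((vc_host, 80), timeout=timeout)
    sock.sendall(build_connect_request(tunnel_target))
    # Simplification: assume the whole reply header arrives in one read.
    status_line = sock.recv(4096).split(b"\r\n", 1)[0]
    parts = status_line.split()
    if len(parts) < 2 or parts[1] != b"200":
        sock.close()
        raise ConnectionError(f"tunnel refused: {status_line!r}")
    context = ssl.create_default_context()
    return context.wrap_socket(sock, server_hostname=vc_host)


# Only the request is built here; vc.example.com:443 is a made-up target.
print(build_connect_request("vc.example.com:443").decode("ascii"))
```

The point of the sketch is that the traffic is still encrypted end to end even though the TCP connection itself is made to port 80.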
To enable the additional rescan, edit the vmware-dr.xml file at both the protected and recovery sites to add a <hostRescanRepeatCnt> element within the <SanProvider> element. Set the value of <hostRescanRepeatCnt> to 2, as shown in the following example:
<SanProvider>
.
.
.
<hostRescanRepeatCnt>2</hostRescanRepeatCnt>
</SanProvider>
For SQL server use, does the SRM DB user need the DB_OWNER permission?
For SQL Server, the SRM DB user does not need the DB_OWNER permission. As long as the schema has the same name as the username, is the default schema for that user, and is owned by that user, you are OK.
Unexpected MethodFault (dr.san.fault.ManagementSystemNotFound)
This error occurs after you upgrade the EqualLogic PS Series Interface SRA adapter to the Dell EqualLogic PS Series Interface. You can uninstall the new SRA and install the old one as a work around, but there is another option. You can locate the manifest.xml file in the SRA installation directory, modify the SRA name in it, and restart the SRM service and you would be good to go.
The password of my SRM account has changed how do I change the password for SRM?
You can have some issues with changing account passwords after everything is working. In theory you can use the installcreds.exe file, but it has been reported to not always work. In the near future there will be an update to make this process easier, but for now you must use the srm-config.exe command. When it is complete you will be able to restart the SRM service and have communication between the SRM servers (you will need to repair the communication by doing the pairing again). The format is complex for this command. You must run it twice: the first time to obtain a thumbprint, and the second time to actually make the change. Below is the syntax followed by a sample command line. This utility is found in the bin directory of the c:\program files\VMware\VMware Site Recovery Manager\config folder. You can find parameter values (such as the value for -sitename) in the vmware-dr.xml file found in the config folder.
srm-config.exe -cmd confuserbased -sitename <local site name> -cfg <SRM configuration file> -u <username> -vc <host[:port]> [-thumbprint <sha-1 server certificate thumbprint>]
srm-config.exe -cmd confuserbased -sitename srm-primary -cfg vmware-dr-primary.xml -u administrator -vc 10.10.10.10 -thumbprint 96:E0:E8:F5:59:1C:BF:6D:81:6C:A2:AB:51:76:24:DE:31:D1:E8
Without the password you will need to use the thumbprint. So run this command the first time without the thumbprint parameter and you will be shown the thumbprint, then run it again with the thumbprint.
If your site name contains spaces enclose the name in quotes.
My recovery site is only using x number of hosts to start VM’s but it should be using y number.
When I experienced this, it was because the host that was not starting VM’s did not have access to the storage array. It was missing a VMkernel port that LHN required. I have seen this with other vendors too, where the security between the ESX host in question and the storage array was not set up. There are no error messages associated with this situation, so make sure you test for it. I have seen a similar error where the single host at the recovery site didn’t have an IP entered for the iSCSI array.
Priority Levels in Recovery Plan don’t reflect my changes.
You have made changes in the Protection Group to the priority level of some of your protected VM’s. But when you refresh the Recovery Steps you see your VM’s with the original priority and not the new one you set in the Protection Group. This is correct behavior. It may be improved in the future. It is due to the difference in security permissions on both sides: otherwise someone on the Protected side could make changes that affect VM’s on the recovery side, which may or may not be appropriate. Until there is a good solution, just right click on the VM in question and use the Move Up or Move Down options to change its execution order priority.
Error:Expected virtual machine file path ….. vm-vmname/vm-vmname.vmx cannot be found
This can occur during test or recovery, and it means quite simply that the VM referenced in the error is not in the replicated SAN datastore where it is expected. This most often occurs when you add another VM to the protected datastore and start a test recovery before it has had time to replicate. The solution is to wait until the replication catches up and try the test again.
Database access issues
Use Windows Authentication if the DB server is local to the SRM server, and SQL Authentication if the DB server is remote to the SRM server.
How can I tell the SRM version from the log files?
The first line of the SRM log files will hold the release info. The version=1.0.0 tells the version and build=build-97878 tells the build.
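If you need to check the version of a pile of logs, the two fields can be pulled out with a short script. This Python sketch only relies on the `version=` and `build=` tokens mentioned above. The rest of the sample line is invented, since I am not quoting a real log header here.

```python
import re

VERSION_PATTERN = re.compile(r"version=([\w.\-]+)")
BUILD_PATTERN = re.compile(r"build=([\w.\-]+)")


def release_info(first_line: str):
    """Return (version, build) found in the first line of an SRM log."""
    version = VERSION_PATTERN.search(first_line)
    build = BUILD_PATTERN.search(first_line)
    if version is None or build is None:
        raise ValueError("no release info found in this line")
    return version.group(1), build.group(1)


# Hypothetical first line; only the version= and build= tokens come from the FAQ.
sample_line = "Log for VMware Site Recovery Manager, version=1.0.0, build=build-97878"
print(release_info(sample_line))
```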
Installation logs
You can create an installation log using the command line parameters /s /V"lve installlog.txt". The command line will look like:
VMware-srm-1.0.0=.exe /s /V"lve installlog.txt"
How do I change the log level?
You can easily change the log level by editing a configuration file. However, to have that change read by SRM you will need to restart the SRM service. The file name is vmware-dr.xml and it is found by default in C:\Program Files\VMware\VMware Site Recovery Manager\config . Remember that when you restart the service you will interrupt anyone working with SRM.
Look for the line that looks like:
C:\Documents and Settings\All Users\Application Data\VMware\VMware Site Recovery Manager\Logs
Below it you will find a line that looks like:
<level>verbose</level>
You can change verbose to trivia, which will generate more entries, or to info, which generates fewer. In the RC builds it didn’t seem to make much of a difference what the setting was.
No available Customization specifications found.
You can create customizations using the View Edit Customization command in the VI client. This is how you can change a network setting in a recovery. This is like sysprep, and you are required to fill in all of the necessary information, but only the network info will be used. You will need to create your customization specification on the recovery site. Remember that you can export and import customizations so if necessary it doesn’t take much to move them between your protected and recovery sites.
Net::SSLeay::load_error_strings
This comes from the Perl module for OpenSSL, which is required by some SRA’s (such as NetApp), and means that Perl is not installed on the recovery SRM server.
Is there a limitation of DR failover LUNs for some iSCSI arrays and some Hosts?
There is a hard limit of 64 iSCSI arrays per host. However, when using SRM there is a limit of approximately 23 recovery LUNs on the recovery side only. For more information about this please visit http://kb.vmware.com/kb/1005867 . This is not specific to SRM but to any DR setup you might test.
A general system error occurred – unable to get configuration information for the recovery VM
This error will occur when a VM has been added to a protected datastore and is part of a recovery plan, but at the time of the test failover it has not been replicated, so it is not available to the recovery side. This can happen during a non-test failover as well. It can also happen with LHN, but there the error message points more obviously at the problem.
Failed to launch SAN integration scripts
If you are using SRDF and get the error below when configuring your array, you have a path issue. The error is “Failed to launch SAN integration scripts to execute discoverArrays command.” The issue is that the SYMCLI folder is missing from the path. The solution is to add the SYMCLI bin folder to the PATH system environment variable. The default path is C:\Program Files\EMC\SYMCLI\bin and you will need to restart the SRM server service after the PATH change. This exact error comes from an issue with SRDF, but it may occur with other SRA’s from the same or other vendors.
No visible LUN’s during configuration of the array
This will occur if there are no VM’s in the protected datastore. Add a VM to the protected datastore and the LUN will be visible in the array configuration.
Null parameter name:key error
If you are adding a protection group and you get an error with a value of null parameter name:key in it, the solution at this time is to restart the SRM service on both the protected and recovery sites.
Missing testbubble switch on recovery host.
When you check your test recovery VM’s for network connectivity, you find that the VM’s on one ESX host can talk to each other, but on other ESX hosts there is no connectivity. Further checking shows that only one recovery ESX host has the testbubble switch; the other hosts do not have that switch even though the recovery VM’s are configured to use it. VM’s configured to use a test bubble switch that doesn’t exist will not be able to communicate.
Review Replicate Datastores window of Array Manager is blank.
When you are configuring your SRA, the last step shows you the replicated LUN’s. If you see nothing there, you have a problem, and using the Rescan button doesn’t cause the LUN(s) to be displayed. To work around this issue, use the following steps:
- In the VI Client,
- Go to the ESX host configuration area
- Now select Storage
- In the upper right area select the Refresh option.
- Now return to the SRM Array Manager configuration,
- Select Rescan,
- Then select Back,
- Now select Next
- You should now see your LUN information displayed.
Smasher777 says
Hi, where is the SRM FAQ?!
Mikael Sennerholm says
Hi!
I want to rescan the storage twice, but I couldn’t find in vmware-dr.xml in SRM5. Do you know if it’s moved to another file?
Sincerely
Mikael
Mikael Sennerholm says
I found it, it’s moved to advanced settings in Sites.
Sincerely
Mikael
JimmyChien says
Hi Duncan ! My Customer has Production SRM Host & DR SRM Host in the same location with the same subnet at the first time, After the testing they move the DR host to the real DR Site, and change IP address of the DR Host IP / vCenter with SRM IP ;After Changed, the DR vCenter Server could not start, so we re-install the DR vCenter, the vcenter works, but then we find the SRM could not work, and when we try to break the pair,it says connect to the DR Site first. when I try to connect DR Site, it says it is not the original pair ,and I’ve shutdown the original DR vCenter. Do you have any idea to build up the SRM pairing with either the old or the new DR vCenter?
Hope to hear from you soon.
JimmyChien
Chad Owens says
Duncan, I have to say everything I read from you is awesome. Thanks for all the advice.
I work for VMware and was wondering if you have ever seen this error on the vSphere Replication Appliance when trying to start the service.
“Unable to obtain SSL Certificate: Bad server response: is a vCenter server listening on the given host and port.
Daniel Bedard says
I get the same error 🙁
Marcin says
I’m using RecoverPoint spliter with SRM (primary is VPLEX and recovery site is VNX) and getting an error when i fail back. It seems to do it only with VPLEX volumes. Error below, i appreciate any comments
STEP 8: Change Recovery Site Storage to Writeable Error – Failed to recover datastore ‘srm-VPLEX-RP-DS02’. VMFS volume residing on recovered devices ‘”xx:xx:xx:xx:xx:xx:F8:74:9B:xx:36″‘ cannot be found. Recovered device ‘xx:xx:xx:xx:xx:xx: 10:F0:5D:55:F8:74:xx:B1:36’ not found after HBA rescan.
When i look at the hosts after this fails, the LUNs are there but are not attached. If i manually attach them and restart SRM recovery it works then
Marcin says
I have already tried to increase a number of rescans and timeout value. Did not help.