Just a second ago the GSS/KB team republished the KB article that explain the vSphere 5.0 problem around SvMotion / vDS / HA. I wrote about this problem various times and would like to refer to that for more details. What I want to point out here though is that the KB article now has a script attached which will help preventing problems until a full fixed is released. This script is basically the script that William Lam wrote, but it has been fully tested and vetted by VMware. For those running vSphere 5.0 and using SvMotion on VMs attached to distributed switches I urge you to use the script. I expect that the PowerCLI version will also be attached soon.
Search Results for: ha vds svmotion
HA fails to initiate restart when a VM is SvMotioned and on a VDS!
<Update>I asked William Lam if he could write a script to detect this problem and possibly even mitigate it. William worked on it over the weekend and just posted the result! Head over to his blog for the script! Thanks William for cranking it out this quick! For those who prefer PowerCLI… Alan Renouf just posted his version of the script! Both scripts provide the same functionality though!</Update>
A couple of weeks back Craig S. commented on my blog about an issue he ran in to in his environment. He was using a Distributed vSwitch and testing certain failure scenarios. One of those scenarios was failing a host in the middle of a Storage vMotion process of a virtual machine. After he had failed the host he expected HA to restart the virtual machine but this did not happen unfortunately. He also could not get the virtual machine up and running again himself. Unfortunately in this case it was the vCenter Server that was used to test this scenario with, which brought him in to a difficult position. This was the exact error:
Operation failed, diagnostics report: Failed to open file /vmfs/volumes/4f64a5db-b539e3b0-afed-001b214558a5/.dvsData/71 9e 0d 50 c8 40 d1 c3-87 03 7b ac f8 0b 6a 2d/1241 Status (bad0003)= Not found
Today I spotted a KB article which describes this scenario, the error mentioned in this KB articles reveals a bit more what is going wrong I guess:
2012-01-18T16:23:17.827Z [FFE3BB90 error 'Execution' opID=host-6627:6-0] [FailoverAction::ReconfigureCompletionCallback] Failed to load Dv ports for /vmfs/volumes/UUID/VM/VM.vmx: N3Vim5Fault19PlatformConfigFault9ExceptionE(vim.fault.PlatformConfigFault) 2012-01-18T16:23:17.827Z [FFE3BB90 verbose 'Execution' opID=host-6627:6-0] [FailoverAction::ErrorHandler] Got fault while failing over vm. /vmfs/volumes/UUID/VM/VM.vmx: [N3Vim5Fault19PlatformConfigFaultE:0xecba148] (state = reconfiguring)
It seems that at the time of fail-over the “dvport” information cannot be loaded by HA as after the Storage vMotion process the dvport file is not created on the destination datastore. Now please note that this applies to all VMs attached to a VDS which have been Storage vMotioned using vCenter 5.0. However the problem will only be witnessed during time of HA fail-over.
This dvport info is what I mentioned in my “digging deeper into the VDS construct” article. I already mentioned there that this is what HA uses to reconnect the virtual machine to the Distributed vSwitch… And when files are moving around you can imagine it is difficult to power-on a virtual machine.
I reproduced the problem as shown in the following screenshot. The VM has port 139 assigned by the VDS, but on the datastore there is only a dvport file for 106. This is what happened after I simply Storage vMotioned the VM from Datastore-A to Datastore-B.
For now, if you are using a Distributed vSwitch and running a virtual vCenter Server and have Storage DRS enabled… I would recommend disable Storage DRS for vCenter specifically, just to avoid getting in these scenarios.
Go to Datastore & Datastore Clusters view, Edit properties on Datastore Cluster and change automation level:
The problem itself can be mitigated, as described by Michael Webster here, by simply selecting a different dvPortgroup or vSwitch for the virtual machine. After the reconfiguration has completed you can select the original portgroup again, this will recreate the dvport info on the datastore.
vSphere 5.0 U1a was just released, vDS/SvMotion bug fixed!
Many of you who hit the SvMotion / VDS / HA problem requested the hotpatch that was available for it. Now that Update 1a has been released with a permanent fix how do you go about installing it? This is the recommended procedure:
- Backup your vCenter Database
- Uninstall the vCenter hot-patch
- Install the new version by pointing it to the database
The reason for this is that the hot-patch increased the build number, and this could possibly conflict with later versions.
And for those who have been waiting on it, the vCenter Appliance has also been update to Update 1 and now includes a vPostgress database by default instead of DB2!
Scripts release for Storage vMotion / HA problem
Last week when the Storage vMotion / HA problem went public I asked both William Lam and Alan Renouf if they could write a script to detect the problem. I want to thank both of them for their quick response and turnaround, they cranked the script out in literally hours. The scripts were validated multiple times in a VDS environment and worked flawless. Note that these scripts can detect the problem in an environment using a regular Distributed vSwitch and a Nexus 1000v, the script can only mitigate the problem though in a Distributed vSwitch environment. Here are the links to the scripts:
- Perl: Identifying & Fixing Virtual Machines Affected By SvMotion / VDS Issue (William Lam)
- PowerCLI – Identifying and fixing VMs Affected By SvMotion / VDS Issue (Alan Renouf)
Once again thanks guys!
Clarifying the SvMotion / VDS problem
<Update>I asked William Lam if he could write a script to detect this problem and possibly even mitigate it. William worked on it over the weekend and just posted the result! Head over to his blog for the script! Thanks William for cranking it out this quick! For those who prefer PowerCLI… Alan Renouf just posted his version of the script! Both scripts provide the same functionality though!</Update>
I think there is some confusion around the SvMotion / VDS problem I described a couple of days back. Let me try to clarify it in a couple of simple steps.
First of all, this only applies to virtual machines that have been Storage vMotioned by vCenter 5.0 and are connected to a Distributed vSwitch. This could be either manually or using Storage DRS. So what is the exact problem?
- When a VM is attached to a dvPortgroup it is connected to a port. This information is stored locally on the host and on the VMFS volume this VM is stored on.
- This volume will contain a file which is named equal to the port number of this VM.
- When the VM is Storage vMotioned to a different datastore this file is not created on the destination datastore
- When the host fails on which the Storage vMotioned VM resides HA will attempt to restart that VM.
- In order for HA to restart it and connect it to the dvPortgroup this file is required.
- As the file is not available the restart fails.
You can simply resolve this by connecting the impacted VMs to a different dvPortgroup temporarily and then reconnect them back to the original portgroup. As soon as you’ve done that the file will be created on the datastore. For now this is a manual task, but I am sure some of my teammembers are working on a scripted solution as we speak… right Alan / William? 🙂