<Update>I asked William Lam if he could write a script to detect this problem and possibly even mitigate it. William worked on it over the weekend and just posted the result! Head over to his blog for the script! Thanks, William, for cranking it out so quickly! For those who prefer PowerCLI… Alan Renouf just posted his version of the script! Both scripts provide the same functionality though!</Update>
A couple of weeks back Craig S. commented on my blog about an issue he ran into in his environment. He was using a Distributed vSwitch and testing certain failure scenarios. One of those scenarios was failing a host in the middle of a Storage vMotion of a virtual machine. After he had failed the host he expected HA to restart the virtual machine, but unfortunately this did not happen. He also could not get the virtual machine up and running again himself. Unfortunately, the virtual machine used to test this scenario was the vCenter Server itself, which put him in a difficult position. This was the exact error:
Operation failed, diagnostics report: Failed to open file /vmfs/volumes/4f64a5db-b539e3b0-afed-001b214558a5/.dvsData/71 9e 0d 50 c8 40 d1 c3-87 03 7b ac f8 0b 6a 2d/1241 Status (bad0003)= Not found
Today I spotted a KB article which describes this scenario. The error mentioned in the KB article reveals a bit more about what is going wrong, I guess:
2012-01-18T16:23:17.827Z [FFE3BB90 error 'Execution' opID=host-6627:6-0] [FailoverAction::ReconfigureCompletionCallback] Failed to load Dv ports for /vmfs/volumes/UUID/VM/VM.vmx: N3Vim5Fault19PlatformConfigFault9ExceptionE(vim.fault.PlatformConfigFault)
2012-01-18T16:23:17.827Z [FFE3BB90 verbose 'Execution' opID=host-6627:6-0] [FailoverAction::ErrorHandler] Got fault while failing over vm. /vmfs/volumes/UUID/VM/VM.vmx: [N3Vim5Fault19PlatformConfigFaultE:0xecba148] (state = reconfiguring)
It seems that at the time of failover HA cannot load the “dvport” information, because the Storage vMotion process does not create the dvport file on the destination datastore. Please note that this applies to all VMs attached to a VDS which have been Storage vMotioned using vCenter 5.0; however, the problem will only surface at the time of an HA failover.
This dvport info is what I described in my “digging deeper into the VDS construct” article; it is what HA uses to reconnect the virtual machine to the Distributed vSwitch… And when files go missing while being moved around, you can imagine it is difficult to power on a virtual machine.
I reproduced the problem as shown in the following screenshot. The VM has port 139 assigned by the VDS, but on the datastore there is only a dvport file for 106. This is what happened after I simply Storage vMotioned the VM from Datastore-A to Datastore-B.
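If you want to check a VM in your own environment, something along the following lines should do the trick with PowerCLI. This is just a rough sketch, not a supported tool: “TestVM” is a placeholder name, and I am assuming the network adapter is attached to a VDS and that there is an active Connect-VIServer session against vCenter.

# Sketch: compare the dvport assigned to a VM with the dvport files on its datastore
$vm  = Get-VM -Name "TestVM"
$nic = Get-NetworkAdapter -VM $vm | Select-Object -First 1

# For an adapter on a VDS the backing contains the switch UUID and the assigned port key
$portKey    = $nic.ExtensionData.Backing.Port.PortKey
$switchUuid = $nic.ExtensionData.Backing.Port.SwitchUuid
Write-Host "VM is assigned dvport $portKey on switch $switchUuid"

# List the dvport files under .dvsData on the datastore that holds the .vmx file
$dsName = $vm.ExtensionData.Config.Files.VmPathName.Split(']')[0].TrimStart('[')
$ds     = Get-Datastore -Name $dsName
New-PSDrive -Name tmpds -PSProvider VimDatastore -Root "\" -Datastore $ds | Out-Null
Get-ChildItem "tmpds:\.dvsData\$switchUuid" | Select-Object Name
Remove-PSDrive -Name tmpds

The file names in the .dvsData folder are the port numbers; if the port key reported for the VM is not among them, the VM is affected.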
For now, if you are using a Distributed vSwitch, running a virtual vCenter Server, and have Storage DRS enabled… I would recommend disabling Storage DRS for the vCenter VM specifically, just to avoid ending up in this scenario.
Go to the Datastores & Datastore Clusters view, edit the properties of the datastore cluster, and change the automation level:
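If you would rather script this override than click through the UI, the same per-VM setting can be configured through the Storage DRS API. A minimal sketch, assuming an active PowerCLI session against vCenter; “vCenterVM” and “DatastoreCluster01” are placeholder names:

# Sketch: disable Storage DRS for a single VM by adding a per-VM override
$vm  = Get-VM -Name "vCenterVM"
$pod = Get-View -ViewType StoragePod -Filter @{"Name" = "^DatastoreCluster01$"}

$spec                = New-Object VMware.Vim.StorageDrsConfigSpec
$vmSpec              = New-Object VMware.Vim.StorageDrsVmConfigSpec
$vmSpec.Operation    = "add"     # use "edit" if an override already exists for this VM
$vmSpec.Info         = New-Object VMware.Vim.StorageDrsVmConfigInfo
$vmSpec.Info.Vm      = $vm.ExtensionData.MoRef
$vmSpec.Info.Enabled = $false    # Storage DRS off for this VM only
$spec.VmConfigSpec   = @($vmSpec)

$srm = Get-View (Get-View ServiceInstance).Content.StorageResourceManager
$srm.ConfigureStorageDrsForPod($pod.MoRef, $spec, $true)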
The problem itself can be mitigated, as described by Michael Webster here, by simply selecting a different dvPortgroup or vSwitch for the virtual machine. After the reconfiguration has completed you can select the original portgroup again; this will recreate the dvport info on the datastore.
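For VMs other than vCenter this workaround is easy to script as well. A quick sketch with PowerCLI; “AffectedVM” and “TempPortgroup” are placeholder names, and the assumption is that TempPortgroup is some other portgroup available on the host:

# Sketch: bounce the VM to another portgroup and back to recreate the dvport file
$vm  = Get-VM -Name "AffectedVM"
$nic = Get-NetworkAdapter -VM $vm
$originalPg = $nic.NetworkName   # remember the original dvPortgroup

Set-NetworkAdapter -NetworkAdapter $nic -NetworkName "TempPortgroup" -Confirm:$false
Set-NetworkAdapter -NetworkAdapter $nic -NetworkName $originalPg -Confirm:$false

If bouncing via another dvPortgroup does not do the trick in your environment, bouncing via a standard vSwitch portgroup, as discussed in the comments below, should.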
Allan Christiansen says
I have seen this happen even after a successful svmotion. It seems that the VM’s .dvsData part is not always moved with the VM.
We currently have a case open for this exact issue.
Bilal Hashmi says
Thanks for sharing, and great post indeed. I am assuming one workaround, if a VM is already in such a state, would be to connect the VM to the vSwitch, power it on, and then migrate it back to the vDS? Of course, disabling SDRS would help to avoid getting into the situation to begin with, like you recommended. I wonder if this issue also exists on the 1000v.
karlochacon says
Interesting. I think I have to disable svmotion for my VC then.
Brad Clarke says
I don’t think it has much to do with svmotion/Storage DRS. I’ve seen multiple instances of HA failing to restart a VM, and it always seems to leave the VM unable to reconnect to the dvSwitch. The only fix I’ve found (and the only option I was given by VMware support) is to move the VM to a new port on the dvSwitch.
Craig Seidler says
So my testing matches Allan’s response: the .dvsData port reference file does not appear to move with the SVmotioned machine. The issue then occurs if the VC is unavailable when a moved VM attempts to power on after an HA event. We first ran into it when testing a VC failover: the VC had been SVmotioned prior and didn’t come up, but we also saw some other machines on the same failed host come up while others didn’t. We were able to trace it back to the ones that had been SVmotioned after they had been added to the DVS. I also ran net-dvs -f /etc/vmware/dvsdata.db and could find no reference to the port listed on the hosts the VMs tried to boot on.
Although I don’t know definitively, it appears that when booting on the new host the VM will check the .dvsData folder on its local datastore, possibly before or after checking the host’s local dvsdata.db, and then resorts to looking to the VC. I imagine this is listed somewhere in the VCDB, but I haven’t plowed through that yet. Anyway, if the VC isn’t available for whatever reason when that VM tries to fire up (after an HA event, of course), we run into the issue.
The last word I have is that it appears to be a known issue that will be remedied in a future release. What I have suggested as a rather ugly workaround is: if you SVMotion a DVS-attached VM, pop it off the DVS to a VSS and then add it back to the DVS, which creates a new port and a new reference file in the datastore the VM is now in. Ideal? No, but it works, and it’s a proactive workaround as opposed to having to deal with the issue post-HA event.
Duncan says
It is not the VM but HA that needs the file, or the port reference in dvsdata.db, to be available. I am guessing this is a combination of SvMotion + vMotion, as the dvsdata.db should normally contain this info when the VM stays local. I will test again this afternoon.
Craig Seidler says
Interesting. I didn’t think that HA itself would play any part in a DVS config; I thought it would instead be initiated from the VM as it tries to boot, namely referencing the .vmx file first as it tries to register the VM on the new host. The VMX has what appears to be a datastore-local reference.
I would assume that if HA were turned off and a host with the VC went down, manually registering the VM on a host that is still up would still hit this issue, although I haven’t tested that.
BryanMcc says
We have this exact issue, and I can say that it is easy to replicate. Since we use svmotion as part of our migration strategy from VMFS3 to VMFS5, we saw this issue on almost every VM we migrated. I wrote a quick PowerCLI script to identify VMs with port keys that did not exist in the dvs cache where they resided and fixed them one by one by changing the port ID. If you are using svmotion as part of your migration strategy from VMFS3 to VMFS5, I would recommend only putting the VM on the dvSwitch post-svmotion and not svmotioning the VM afterward.
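A rough sketch of that kind of check (not the exact script mentioned above, and it takes a different route: instead of reading the host’s dvs cache, it flags VMs whose dvport file is missing from the .dvsData folder on their datastore):

# Sketch: flag VDS-attached VMs whose dvport file is missing from their datastore
foreach ($vm in Get-VM) {
    foreach ($nic in Get-NetworkAdapter -VM $vm) {
        $port = $nic.ExtensionData.Backing.Port
        if (-not $port) { continue }   # skip adapters on standard vSwitches

        # The dvport file lives on the datastore that holds the .vmx file
        $dsName = $vm.ExtensionData.Config.Files.VmPathName.Split(']')[0].TrimStart('[')
        $ds     = Get-Datastore -Name $dsName
        New-PSDrive -Name tmpds -PSProvider VimDatastore -Root "\" -Datastore $ds | Out-Null
        $file = Get-ChildItem "tmpds:\.dvsData\$($port.SwitchUuid)" -ErrorAction SilentlyContinue |
                Where-Object { $_.Name -eq $port.PortKey }
        if (-not $file) {
            Write-Host "$($vm.Name): no dvport file for port $($port.PortKey)"
        }
        Remove-PSDrive -Name tmpds
    }
}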
There is a hot patch for this issue, but you will first have to validate the issue with VMware engineering. The GA fix is scheduled to be released with Update 2.
Alpaca says
>There is a hot patch for this issue, but you will first have to validate the issue with VMware engineering. The GA fix is scheduled to be released with Update 2.
What is this I don’t even
SVMotioning and using DVswitches are, as VMware themselves keep touting, crucial parts of today’s virtual infrastructures and so-called “clouds”, happening on a daily basis.
Now your even more crucial (probably yellow) brick, HA, which you trust to bring your stuff back properly once shit hits the fan, chokes because of that, and systems stay down because some file isn’t where it’s supposed to be.
This does not seem like some rare, weird issue, and it should have been known for quite some time, so why the hell is VMware not fixing an essential, critical bug like this in a timely fashion? Seeing how U1 was only recently released, it will probably take at least half a year until U2.
Duncan says
Believe me when I say that we are looking at the issue and trying to address it appropriately.
Michael Webster says
Duncan, have you tried editing the VM settings and just re-selecting the same vDS port group for the VM and then clicking OK? This has helped me out in a number of situations in the past.
Duncan says
No, I have not, as in my test scenario it is the vCenter Server itself 🙂
Also, I am not actively testing this at the moment; I am working on other stuff.
Michael Webster says
I reproduced the problem in my lab, and just reconnecting to the same port group on the vDS doesn’t work. Connecting to a vSS and then connecting back to the vDS does work, however. I hope VMware releases a patch for this shortly.
Florian says
Why disable SDRS only for the vCenter Server? The vCenter is in that case much more critical, but wouldn’t it be better to avoid the problem for all VMs?
Duncan Epping says
Yes, of course it would be, Florian, but all other VMs can easily be fixed; vCenter itself would be much more difficult to fix. Hence the recommendation to disable it completely for vCenter by default, so that you cannot run into this.
Ralf says
“Do not perform Storage vMotion.” That is funny advice in the KB article. I’m in the process of migrating several clusters from 4.1 to 5; now I have to think about that again.
Duncan says
Or just use it and then run the scripts to fix the problem.