<Update>I asked William Lam if he could write a script to detect this problem and possibly even mitigate it. William worked on it over the weekend and just posted the result! Head over to his blog for the script! Thanks William for cranking it out this quick! For those who prefer PowerCLI… Alan Renouf just posted his version of the script! Both scripts provide the same functionality though!</Update>
I think there is some confusion around the SvMotion / VDS problem I described a couple of days back. Let me try to clarify it in a couple of simple steps.
First of all, this only applies to virtual machines that have been Storage vMotioned by vCenter 5.0, either manually or through Storage DRS, and that are connected to a Distributed vSwitch. So what is the exact problem?
- When a VM is attached to a dvPortgroup it is connected to a port. This information is stored locally on the host and on the VMFS volume the VM resides on.
- This volume contains a file whose name matches the port number of the VM.
- When the VM is Storage vMotioned to a different datastore, this file is not created on the destination datastore.
- When the host on which the Storage vMotioned VM resides fails, HA will attempt to restart that VM.
- In order for HA to restart the VM and connect it to the dvPortgroup, this file is required.
- As the file is not available, the restart fails.
You can resolve this simply by connecting the impacted VMs to a different dvPortgroup temporarily and then reconnecting them to the original portgroup. As soon as you’ve done that, the file will be created on the datastore. For now this is a manual task, but I am sure some of my team members are working on a scripted solution as we speak… right Alan / William? 🙂
Interesting. So for now, if you’ve Storage vMotioned a VM or SDRS has moved it and you’re using a vDS, an HA restart will fail. I don’t mind making the change as you’ve outlined when I perform a Storage vMotion myself, but this would be pretty much impossible in automated SDRS situations, as these events could happen at any time during the day.
Sounds more like a bug fix than something a script would handle.
Kristian Vilmann says
Duncan, I believe you are right. Unless of course your vCenter is the VM that can’t connect to the vDS, in which case you’re screwed unless you have a fall back vSS to connect the vCenter VM to.
With vCenter not running/not available on the network you cannot connect a VM to a different portgroup on the vDS.
is vCenter 5.0 Update 1 affected?
I have a 6-node cluster with automated SDRS; kinda worried now.
Yes it is.
Could you please explain how this process in vCenter 5.0 differs from vCenter 4.1, taking a manual SvMotion into consideration. Why are vCenter 4.0/4.1 not affected by this issue while 5.0 is?
Thank you very much for your help,
Duncan Epping says
Apparently the SvMotion process was changed/impacted with vCenter 5.0.
Since the Nexus 1000v is a vDS of sorts, will it have the same issue?
Yes, same problem.
Greg Herzog says
Does this happen regardless of whether the VM was powered off during the svmotion?
Brad Clarke says
It looks like moving it to another port on the same dvPortgroup is enough to get the file recreated.
Also, if you have a large number of VMs on one dvPortgroup then making a temp dvPortgroup and using the “Migrate Virtual Machine Networking” wizard to move them over and back worked pretty well for me.
I haven’t tested this in a lab yet, but is there a reason we can’t just move the dvsdata file? Would I be incorrect in saying the problem is during an svMotion while on the same host? What happens if you migrate to another host? Would a simple vMotion fix the issue as well?
I actually tested that a while ago Jake, I moved the file and it worked. But this was just a single test and I don’t think this would be supported. Both Alan and William created scripts which provide a better mechanism in my opinion to “solve” the problem.
This issue also seems to occur when a VM is cloned to a new datastore.
Duncan Epping says
Thanks Kevin, will test/validate it and report it.
Duncan Epping says
I haven’t been able to reproduce the problem so far Kevin.
Erik Bussink says
I just ran into the situation at a large customer, that has just activated SDRS. Out of the 600 VMs I have missing dvPort settings on about 20% of the VMs.
I installed Alan’s PowerCLI test-VDSVMIssue function, but it takes about 10 seconds to migrate the VM network card to the temporary dvPort and back. This means about 3 ping timeouts. On critical VMs, like production SAP or Exchange, this might not be the best idea.
Alan and William’s scripts are ‘temporary fixes’ for the moment. They do not ‘solve’ an automatic-SDRS/vDS situation; they just mitigate a potential issue.
Thanks anyway to Craig, you, Alan & William for tackling this issue.
Duncan Epping says
We know it doesn’t solve the issue Erik. A fix is underway, but until it is released this is the best work around for now.
Alan and William are fine tuning the scripts to prevent downtime.
We have a 2-VM cluster on an ESX server, which was working fine.
Once we updated it to ESX 5, the quorum or any group failover does not succeed (not consistent; it fails after a few tries).
Does this sound related to the ESX version upgrade?
Juraj Ziegler says
Connecting the Storage-vMotion-ed VMs to a dVS with ephemeral ports should avoid this issue, am I right?
I can’t access the KB article anymore, I always get Access Denied.
Duncan, can you confirm if this issue has been resolved in vSphere 5.1?
We were bitten by this last night, 5.0 U1 with latest fixes. I guess that’s what we get for not reading this blog more attentively!
Yes, the problem is fixed. You will still need to run the script to fix the VMs themselves though!!