The following comment was made on my VDS blog and I figured I would investigate this a bit further:
It seems like the ESXi host only tries to sync the vDS state with the storage at boot and never again afterward. You would think that it would keep trying, but it does not.
Now let's look at the “basics” first. When an ESXi host boots, it gets the data required to recreate the VDS structure locally by reading /etc/vmware/dvsdata.db and esx.conf. You can view the dvsdata.db file yourself by running:
net-dvs -f /etc/vmware/dvsdata.db
But is that all that is used? If you check the output of that command you will see that all the data required for a VDS configuration to work is actually stored in there, so what about those files stored on a VMFS volume?
Each VMFS volume that holds the working directory (the location where the .vmx is stored) of at least one virtual machine connected to a VDS will contain the following folder:
drwxr-xr-x 1 root root 420 Feb 8 12:33 .dvsData
If you go into this folder you will see another folder. The name of this folder appears to be some sort of unique identifier, and when comparing the string to the output of “net-dvs” it turns out to be the identifier of the dvSwitch that was created.
drwxr-xr-x 1 root root 1.5k Feb 8 12:47 6d 8b 2e 50 3c d3 50 4a-ad dd b5 30 2f b1 0c aa
Within this folder you will find a collection of files:
-rw------- 1 root root 3.0k Feb 9 09:00 106
-rw------- 1 root root 3.0k Feb 9 09:02 136
-rw------- 1 root root 3.0k Feb 9 09:00 138
-rw------- 1 root root 3.0k Feb 9 09:05 152
-rw------- 1 root root 3.0k Feb 9 09:00 153
-rw------- 1 root root 3.0k Feb 9 09:05 156
-rw------- 1 root root 3.0k Feb 9 09:05 159
-rw------- 1 root root 3.0k Feb 9 09:00 160
-rw------- 1 root root 3.0k Feb 9 09:00 161
It is no coincidence that these file names are numbers, and that the numbers match the dvPort IDs of the virtual machines stored on this volume. These files hold the port information of the virtual machines which have their working directory on this particular datastore. This port info is also what HA uses when it needs to restart a virtual machine that uses a dvPort. Let me emphasize that: this is what HA uses when it needs to restart a virtual machine! Is that all?
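To see this mapping for yourself, here is a minimal sketch that resolves a VM's dvPort file path from its .vmx. The ethernet0.dvs.* keys and the two-levels-up path layout are assumptions based on what I observed in my lab; verify them on your own build before relying on this.

```shell
#!/bin/sh
# Sketch: resolve the .dvsData port file for a VM's first vNIC.
# Assumes the .vmx contains ethernet0.dvs.switchId / ethernet0.dvs.portId keys
# and that ports live under <datastore>/.dvsData/<switch id>/<port id>.
dvport_file_for_vm() {
    vmx="$1"                                  # full path to the .vmx file
    ds_dir=$(dirname "$(dirname "$vmx")")     # datastore root, two levels up
    switch_id=$(sed -n 's/^ethernet0.dvs.switchId = "\(.*\)"$/\1/p' "$vmx")
    port_id=$(sed -n 's/^ethernet0.dvs.portId = "\(.*\)"$/\1/p' "$vmx")
    echo "$ds_dir/.dvsData/$switch_id/$port_id"
}
```

From the ESXi shell you could then `ls -l` the resulting path to confirm the port file actually exists on the datastore.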
Well, I am not sure. When I tested the original question I powered on the host without access to the storage system and powered on my storage system only when the host was fully booted. I did not get this confirmed, but it seems to me that access to the datastore holding these files is somehow required during the boot process of your host, at least in the case of static port bindings. (Port bindings are described in more depth here.)
Does this imply that if your storage is not available during the boot process, virtual machines cannot connect to the network when they are powered on? Yes, that is correct. I tested it: when you have a full power outage and your hosts come up before your storage, you will have a “challenge”. As soon as the storage is restored you will probably want to restart your virtual machines, but if you do you will not get a network connection. I tested this 6 or 7 times in total and not once did I get a connection.
One workaround is to simply reboot your ESXi hosts: after a reboot the problem is solved, and your virtual machines can be powered on and will get access to the network. Rebooting a host can be a painfully slow exercise though, as I noticed during my test runs in my lab. Fortunately there is a much simpler workaround: restarting the management agents! After your storage connection has been restored, and before you power on your virtual machines, run the following from the ESXi shell:
services.sh restart
After the services have been restarted you can power on your virtual machines and their network connections will be restored!
As a side note, on my article there was one question about the auto-expand property of static portgroups: is it officially supported, and where is it documented? Yes, it is fully supported. There’s a KB article about how to enable it, and William Lam recently blogged about it here. That is it for now on VDS…
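If you want to script this recovery, the idea can be sketched as a small poll-then-act helper: wait until the datastore mount is visible again, then restart the agents. The datastore path below is an assumption for illustration, and services.sh is ESXi-specific, so it is shown in a usage comment rather than executed.

```shell
#!/bin/sh
# Sketch: poll until a path (e.g. a restored datastore mount) is visible.
# /vmfs/volumes/datastore1 and services.sh are assumptions for your host.
wait_for_path() {
    path="$1"
    tries="${2:-30}"                 # how many one-second attempts to make
    while [ "$tries" -gt 0 ]; do
        [ -d "$path" ] && return 0   # storage is back
        tries=$((tries - 1))
        sleep 1
    done
    return 1                         # gave up waiting
}

# On the ESXi shell you would then run something like:
# wait_for_path /vmfs/volumes/datastore1 120 && services.sh restart
```

Power on the virtual machines only after both steps succeed, matching the order described above.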
Duco Jaspars says
Great post
This does explain an issue I have seen in the past, and the services.sh restart is a nice “fix”
Michael Webster says
Hi Duncan, Great article. Thanks for your effort and research into this. Especially good news that you found a very simple fix instead of a reboot.
Daniel Rhoden says
@Duncan, great article. Say the storage came up before the host, as you stated, but VMs were automatically started (HA – in my head this could happen). Could you just restart the management agents and all would be well for the powered on VMs? …or would you have to either reboot the VMs or vmotion them off to a host that had the management agents restarted already?
My theory above is driven by a ROBO. Imagining nobody is there over the weekends when the power went out and the hosts boot faster than the storage array.
As always, thanks Duncan.
Daniel Rhoden says
My question was a bit confusing and contained a typo so let me try one more time:
You state the following process based off of a host coming online before storage:
1) Ensure storage is online
2) Restart Management Agents
3) Power-up VMs
If storage came online and VMs were powered on before the restart of the management agents, what options would you have? Could you vMotion to a host which already had its management agents restarted after storage came back online? Or are you forced to reboot the VMs?
As stated in my original comment, my assumption is HA would start the VMs once storage was available.
Thanks
Craig Seidler says
Hi Duncan,
So we’re experiencing a virtual VC on a DVS and HA issue with a client, and I was able to reproduce it in the lab. VI 5U1, virtual VC & SQL, static port binding to start. Perform a Storage vMotion of the VC (and other VMs for that matter), then yank whichever host has the VC on it… and boom:
Operation failed, diagnostics report: Failed to open file /vmfs/volumes/4f64a5db-b539e3b0-afed-001b214558a5/.dvsData/71 9e 0d 50 c8 40 d1 c3-87 03 7b ac f8 0b 6a 2d/1241 Status (bad0003)= Not found
So what have I noticed? That file does not exist in the destination datastore; it did not move with the Storage vMotion and still exists in the original datastore.
A manual move of the file failed to help
Restarting the services failed to help
Reconfiguring HA failed as well, both on a host and by removing and re-enabling it at the cluster level.
I ran net-dvs -f /etc/vmware/dvsdata.db on both the host the VC was on as well as the host it failed over to, and couldn’t find a reference to the port there either.
So we have an open ticket with support; they suggested ephemeral binding, which we tested both with the client and in the lab. Both failed.
Found this: http://kb.vmware.com/kb/2013639, but sadly no actual answer yet.
Any ideas? Thanks Duncan
Randy Robertson says
Craig, can you share the SR number? I’d like to look into this issue internally.
–
Randy Robertson
Senior Member of Technical Staff
vSphere Networking
Craig Seidler says
Absolutely. Is there a way I can PM it to you? It’s a client SR, so for discretion’s sake I’d rather not post it on the web :)
Craig
Vaseem says
Duncan, great article. I have a question though with respect to your article. Does this apply to NFS datastores as well? Let’s say the infrastructure consists of the VC running as a VM, a dVS is being used for all networking (including storage), and NFS datastores are used.
Now, if everything goes down, I think it will be a chicken-and-egg situation when we try to bring things back up, as the dVS info (.dvsData) will not be available until the host can read NFS storage, and NFS storage will not be available until the networking is up.
I am thinking that to avoid this situation it would be better to use a Standard vSwitch for NFS storage (VMkernel) instead of the dVS.
Let me know if there is any other workaround, or please confirm that this doesn’t apply to NFS.
Thanks and keep posting.
Randy Robertson says
Vaseem, in that case there is no problem. Only VM DVS info is persisted on the datastore; host configuration for DVS ports used by vmknics is persisted locally in /etc/vmware/dvsdata.db, which is available after boot. So the DVS comes up and the vmknic can use the DVS to connect to NFS. As always though, go ahead and test this scenario to be sure.
–
Randy Robertson
vSphere Networking
Mitchell Clissold says
Hi All
Has there been a resolution to Craig’s issue by VMware? We have recently deployed vSphere 5.x and it would be good to know the resolution or workaround.
Great blog
– mitchell