On April 19th I wrote about an issue with vSphere 5.5 U1 where NFS-based datastores were going into an All Paths Down (APD) state. People internally at VMware have worked very hard to root cause the issue and fix it. The log entries witnessed look like this:
YYYY-04-01T14:35:08.075Z: [APDCorrelator] 9414268686us: [esx.problem.storage.apd.start] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down state.
YYYY-04-01T14:36:55.274Z: No correlator for vob.vmfs.nfs.server.disconnect
YYYY-04-01T14:36:55.274Z: [vmfsCorrelator] 9521467867us: [esx.problem.vmfs.nfs.server.disconnect] 192.168.1.1/NFS-DS1 12345678-abcdefg0-0000-000000000000 NFS-DS1
YYYY-04-01T14:37:28.081Z: [APDCorrelator] 9553899639us: [vob.storage.apd.timeout] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
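If you want to check your own logs for the same pattern, here is a minimal Python sketch that pulls the correlator events out of lines like the ones above and measures how long a datastore sat in APD before the timeout fired. It assumes the line layout shown in the excerpt; the function names (`parse_events`, `apd_durations`) are made up for illustration and are not part of any VMware tooling. It uses the microsecond counter in each entry rather than the wall-clock timestamp.

```python
import re

# Matches correlator lines such as:
#   "...Z: [APDCorrelator] 9414268686us: [esx.problem.storage.apd.start] Device ..."
EVENT_RE = re.compile(
    r"\[(?P<correlator>\w+)\]\s+(?P<uptime_us>\d+)us:\s+"
    r"\[(?P<event>[\w.]+)\]\s+(?P<message>.*)"
)
# Pulls the device/filesystem identifier out of an APD message.
IDENT_RE = re.compile(r"identifier \[([^\]]+)\]")


def parse_events(lines):
    """Return one dict per correlator line; lines in any other shape are skipped."""
    events = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            event = match.groupdict()
            event["uptime_us"] = int(event["uptime_us"])
            events.append(event)
    return events


def apd_durations(events):
    """Map identifier -> seconds between apd.start and the next apd.timeout."""
    started = {}
    durations = {}
    for event in events:
        ident = IDENT_RE.search(event["message"])
        if ident is None:
            continue  # not an APD message (e.g. the NFS server disconnect)
        key = ident.group(1)
        if event["event"].endswith("storage.apd.start"):
            started[key] = event["uptime_us"]
        elif event["event"].endswith("storage.apd.timeout") and key in started:
            durations[key] = (event["uptime_us"] - started.pop(key)) / 1_000_000
    return durations


# The four entries from the excerpt above, copied verbatim.
SAMPLE = [
    "YYYY-04-01T14:35:08.075Z: [APDCorrelator] 9414268686us: "
    "[esx.problem.storage.apd.start] Device or filesystem with identifier "
    "[12345678-abcdefg0] has entered the All Paths Down state.",
    "YYYY-04-01T14:36:55.274Z: No correlator for vob.vmfs.nfs.server.disconnect",
    "YYYY-04-01T14:36:55.274Z: [vmfsCorrelator] 9521467867us: "
    "[esx.problem.vmfs.nfs.server.disconnect] 192.168.1.1/NFS-DS1 "
    "12345678-abcdefg0-0000-000000000000 NFS-DS1",
    "YYYY-04-01T14:37:28.081Z: [APDCorrelator] 9553899639us: "
    "[vob.storage.apd.timeout] Device or filesystem with identifier "
    "[12345678-abcdefg0] has entered the All Paths Down Timeout state after "
    "being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.",
]

if __name__ == "__main__":
    for ident, seconds in apd_durations(parse_events(SAMPLE)).items():
        print(f"{ident}: {seconds:.1f}s from APD start to APD timeout")
```

For the sample entries the two counters are roughly 139.6 seconds apart, which lines up with the "140 seconds" reported in the timeout message.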
More details on the fix can be found here: http://kb.vmware.com/kb/2077360
Awesome work from VMware engineers as usual! Anxious to deploy this over the weekend.
Thanks for releasing the patch for an important bug after almost two months. Come on VMware, you can do better.
Keeping enterprise customers in the dark about the release details will not make them happy.
Duncan Epping says
I recommend that you provide this feedback directly to your VMware pre-sales or sales contact. That way the people responsible will hear directly from customers how things like these are experienced. Thanks,
Anthony Spiteri (@anthonyspiteri) says
Transparency on the root cause would be a favourable outcome given the time it took to resolve and the general hush-hush nature of the problem.
It otherwise causes unnecessary speculation.
Glad to have the fix though…just in time for a platform upgrade.
I’ve patched our hosts and am still getting the APDCorrelator errors, with NFS datastores going offline for a minute or two in the vSphere Client.
We are not getting BSODs or issues with Linux file systems, just performance issues to the point where any applications that have to connect to databases are crashing. I have a ticket open with VMware and am waiting to hear something.
That is worrying! We are planning to upgrade and we use NFS for almost everything. Am I better off staying on ESXi 5.5 GA, I wonder?
My company also applied the patch and had APD occur again. We are using NetApp and found that NetApp still recommends setting the NFS max queue depth to 64.
KB ID: 1014696 Version: 5.0 Published date: 07/11/2014
VMware has published KB 2016122: NFS connectivity issues on NetApp NFS filers on ESXi 5.x and KB 2077360: VMware ESXi 5.5, Patch ESXi550-201406401-SG: Updates esx-base.
VMware's claim is that there is a version of Data ONTAP that ‘resolves’ this issue.
The Data ONTAP upgrade referenced will ONLY prevent the TCP window size from dropping to 0; it will NOT resolve all APD issues.
Additionally, enabling SIOC will only ‘resolve’ the issue ‘after’ it begins occurring; it will not ‘prevent’ the issue from occurring.
The only recommended way to resolve this is to limit the NFS max queue depth to 64.
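For anyone who wants to script that queue depth change across several hosts, here is a rough Python sketch that only builds the `esxcli` command lines; it does not run them. The option path `/NFS/MaxQueueDepth` and the `esxcli system settings advanced` syntax are how I recall them from KB 2016122, so treat both as assumptions and check them against the KB for your build. The helper name `build_queue_depth_commands` is hypothetical. As far as I recall, the KB also calls for a host reboot before the new value takes effect.

```python
# Assumed advanced option path; verify against VMware KB 2016122 before use.
NFS_MAX_QUEUE_DEPTH_OPTION = "/NFS/MaxQueueDepth"


def build_queue_depth_commands(depth=64):
    """Return esxcli argument lists to set, then read back, the NFS queue depth."""
    if not isinstance(depth, int) or depth < 1:
        raise ValueError("depth must be a positive integer")
    set_cmd = [
        "esxcli", "system", "settings", "advanced", "set",
        "-o", NFS_MAX_QUEUE_DEPTH_OPTION, "-i", str(depth),
    ]
    list_cmd = [
        "esxcli", "system", "settings", "advanced", "list",
        "-o", NFS_MAX_QUEUE_DEPTH_OPTION,
    ]
    return [set_cmd, list_cmd]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before running them on a host.
    for argv in build_queue_depth_commands(64):
        print(" ".join(argv))
```

Printing the commands first, rather than executing them, leaves room to review the value and schedule the change host by host.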