Some had already reported on this on Twitter and in various blog posts, but I had to wait until I received the green light from our KB/GSS team. An issue has been discovered in vSphere 5.5 Update 1 that can result in a loss of connection to NFS-based datastores. (NFS volumes include VSA datastores.)
*** Patch released, read more about it here ***
This is a serious issue, as it results in an APD (All Paths Down) condition on the datastore, meaning that virtual machines cannot perform any I/O against the datastore for the duration of the APD. This by itself can result in BSODs for Windows guests and filesystems being remounted read-only for Linux guests.
Witnessed log entries can include:
2014-04-01T14:35:08.074Z: [APDCorrelator] 9413898746us: [vob.storage.apd.start] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down state.
2014-04-01T14:35:08.075Z: [APDCorrelator] 9414268686us: [esx.problem.storage.apd.start] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down state.
2014-04-01T14:36:55.274Z: No correlator for vob.vmfs.nfs.server.disconnect
2014-04-01T14:36:55.274Z: [vmfsCorrelator] 9521467867us: [esx.problem.vmfs.nfs.server.disconnect] 192.168.1.1/NFS-DS1 12345678-abcdefg0-0000-000000000000 NFS-DS1
2014-04-01T14:37:28.081Z: [APDCorrelator] 9553899639us: [vob.storage.apd.timeout] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
2014-04-01T14:37:28.081Z: [APDCorrelator] 9554275221us: [esx.problem.storage.apd.timeout] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
If you are hitting this issue, VMware recommends reverting to vSphere 5.5 GA. Please monitor the following KB article closely for more details and, hopefully, a fix in the near future: http://kb.vmware.com/kb/2076392
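As a first step, it helps to confirm which build a host is actually running before deciding whether to revert. The sketch below assumes ESXi 5.5 Update 1 shipped as build 1623387 (verify that number against the KB); a sample version string stands in for the output of `vmware -v` so the check is self-contained:

```shell
# Sketch: flag hosts that are on the (assumed) affected build or later.
# On a real host, replace the sample string with: $(vmware -v)
sample_version="VMware ESXi 5.5.0 build-1623387"

# Strip everything up to and including "build-" to isolate the build number.
build="${sample_version##*build-}"

if [ "$build" -ge 1623387 ]; then
  echo "build $build: potentially affected, monitor KB 2076392"
else
  echo "build $build: pre-U1, not affected by this NFS APD issue"
fi
```

On an actual host you would substitute the live `vmware -v` output for the sample string and run the check across the cluster.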
Fletcher Cocquyt (@Cocquyt) says
Duncan, thanks for the heads up. We are on 5.5, below the build referenced in KB 2076392. Are we safe from this issue?
Also when is the Heartbleed update available? Today?
Reminiscent of a bug from the past that affected the deleted-block-recovery process and wrecked thin provisioning. This one reads as even worse.
Ever since the vCenter 5.1 release I have noticed a gradual shift in VMware's release cycle. They are so eager to ship products quickly that they are failing to perform even simple tests like these in house. I know they have strong R&D and product-development teams that are meant to test the stability of the product, but they are taking customers' environments for granted, releasing unfinished products, and essentially performing QA in customers' production environments. This has become more and more evident over the last two years.
Please fix this.
X vmware employee
Anonymous Virtual Coward from a big company says
Yeah, 5.1 was so buggy (with its first attempt at SSO and the storage heap memory issues) that we just stuck with 5.0. Now I feel like we're doing their QA for them when it comes to vSphere 5.5 (and the vCAC 5.x to 6.x migration). I know they have a lot of new products to support and integrate, but it's bad when the core of the platform has these issues.
This bug is pretty bad as more and more people embrace network-based storage.
Virtualizacion en Español says
Is there a recommended downgrade procedure? I've been upgrading since the ESX 4.0 days, but I have never had to roll back several dozen hosts to a previous version.
I also feel there should be more testing before releases, even though that takes a lot of time with such a big product portfolio. In our case, rolling back to 5.5.0 GA would bring back an issue we had with HP Gen8 servers and a driver with memory leaks. Luckily I had planned OpenSSL patch maintenance for next week.
Thanks Duncan for the heads up!
P. Cruiser says
Interesting, I’m not seeing this issue at all in the environment that I manage. I wonder why.
We have been seeing this issue manifest for months. But I hate to tell you, we're not on Update 1; we're on plain vanilla 5.5. Something doesn't match up with the line being fed from VMware.
James Hess says
@Jon: The error messages are pretty generic. APD messages can also appear when there is a storage array performance problem or network congestion, such as high read/write wait times. So you may be experiencing a different problem in your environment with similar apparent results, such as network congestion on a host.
Indeed… the nature of the vSphere bug has not been thoroughly explained so far.
I sure hope they will provide more information.
VMware also has not shown any way to definitively separate "legitimate" All Paths Down events, where getattr/read/write operations fail because of a storage array or networking issue, from the same APD messages caused by the vSphere bug on a host.
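One rough way to get an overview is to count the APD and NFS-disconnect events in the host logs and see whether they cluster together; this cannot prove which APDs come from the U1 bug, but timeouts with no matching network-side cause are a starting point. A sketch, with sample lines standing in for `/var/log/vobd.log` so it runs anywhere:

```shell
# Sample log lines (format as shown in the post) stand in for vobd.log.
log=$(cat <<'EOF'
2014-04-01T14:35:08.074Z: [APDCorrelator] [vob.storage.apd.start] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down state.
2014-04-01T14:36:55.274Z: [vmfsCorrelator] [esx.problem.vmfs.nfs.server.disconnect] 192.168.1.1/NFS-DS1 12345678-abcdefg0-0000-000000000000 NFS-DS1
2014-04-01T14:37:28.081Z: [APDCorrelator] [vob.storage.apd.timeout] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down Timeout state after 140 seconds.
EOF
)

# Count APD timeouts and NFS server disconnects; on a host you would
# point grep at /var/log/vobd.log instead of the embedded sample.
timeouts=$(printf '%s\n' "$log" | grep -c 'vob.storage.apd.timeout')
disconnects=$(printf '%s\n' "$log" | grep -c 'nfs.server.disconnect')
echo "APD timeouts: $timeouts, NFS disconnects: $disconnects"
```

Correlating the timestamps of the two event types per device identifier is then a manual (or scripted) next step.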
This begs the question: when will VMware implement NFSv4, pNFS, proper multipathing for NFS (such as interface binding), and/or SMB 3 for more resilient network-based storage connections?
Looks like VMware is doing a lousy job in QA. No wonder many are looking for alternatives. I hope they will focus on their core business.
Agreed. I'm getting really tired of all the 5.x problems. I'm trying to roll out a new 5.5 infrastructure and am running into far too many VMware-caused problems. Stay away from vFlash until the next major release. Clean up your house, VMware!
I am having a similar issue with 5.5 U1 and iSCSI. Same behavior. I have a VMware support ticket open; no resolution yet.
What kind of NICs are you using?
I have 8 total (4 onboard on a Dell R610 and the rest on a separate card):
Broadcom NetXtreme II BCM5709 (onboard)
Intel 82576 (separate card)
The support engineer was looking at upgrading the NIC drivers but hit some errors and was still researching the last time I talked to him.
We had some issues with Broadcom adapters and were able to fix them by disabling NetQueue:
Advanced settings > Boot > VMkernel.Boot.netNetqueueEnabled.
Hope this helps!
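For what it's worth, the same boot-time setting can presumably also be changed from the CLI; the `esxcli system settings kernel` invocation below is an assumption to verify against your ESXi build, and boot options require a reboot to take effect. The sketch only prints the commands (dry-run) rather than executing them:

```shell
# Dry-run sketch: print the commands instead of running them, since they
# only make sense on an ESXi host. Swap the echo for real execution there.
run() { echo "+ $*"; }

# Hypothetical esxcli invocation to disable NetQueue at boot; verify the
# namespace and setting name (netNetqueueEnabled) on your ESXi version first.
run esxcli system settings kernel set --setting=netNetqueueEnabled --value=FALSE
run reboot   # boot-time kernel settings only apply after a restart
```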
Oddly enough, I never had a problem on 5.5, but after moving to 5.5 U1 the problems started. In fact, I just downgraded a host to 5.5 and am no longer seeing the issue.
That's strange. We are very careful these days when it comes to upgrading to the latest build; we are going to wait for the next update and then wait a few more days 🙂
I am surprised there is still no patch for this.
We're surprised as well. We're waiting for this fix to U1, but we don't see any progress from VMware's side.
P. Cruiser says
Well, it does surprise me that there isn't at least more information about the issue, but I'm not too surprised that an NFS-related fix is taking this long. I mean, maybe someday we'll be able to use that fancy 'new' pNFS feature of our storage array, but I'm not holding my breath.
P. Letreulle says
Any idea when a fix will be released? Or a new VMware KB on how to downgrade easily to ESXi 5.5 GA?
Duncan Epping says
The VMware team is working on it, more news soon!
Have you got any inside information on this one? We have 7 clients waiting for this patch before installing the update, but we keep delaying it week after week and we're not seeing any movement at all from VMware... When you said "more news soon" I was really hoping you meant 1 or 2 days. Have you got any updates?
Duncan Epping says
I can’t share any info unfortunately other than available on the VMware website.
James Hess says
Did the VMware team forget about it? It seems there has been no update for a very long time... 🙁
I really have to say I am surprised VMware is taking more than a month to address a situation where datastores go APD daily. Very surprised.
P. Cruiser says
Datastores APD'ing daily? I'd sure hate to be suffering through that over the Memorial Day weekend...
Are you stuck with Update 1? There’s absolutely no way that my environment could go back to vanilla 5.5 due to fixes needed from Update 1. I’m guessing you’re in the same boat.
Not that frequently; I was exaggerating a bit. But it is very unnerving having this hanging over my head all the time 🙂
It is a shame when reliability gets sacrificed in pursuit of new features. The v5 series has been problematic for us also. VMware needs to understand that their hypervisor needs to be bullet-proof above all else.
More than a month and still no patch? Still waiting on the patch for a major deployment. I was hoping that, along with today's 5.0 patch, there would be one for 5.5. But no. Come on, VMware.
Duncan Epping says