Alert: vSphere 5.5 U1 and NFS issue!

Some had already reported on this on Twitter and in various blog posts, but I had to wait until I received the green light from our KB/GSS team. An issue has been discovered with vSphere 5.5 Update 1 that is related to loss of connectivity to NFS-based datastores. (NFS volumes include VSA datastores.)

*** Patch released, read more about it here ***

This is a serious issue, as it results in an APD (All Paths Down) of the datastore, meaning that the virtual machines will not be able to do any I/O to the datastore for the duration of the APD. This by itself can result in BSODs for Windows guests and filesystems becoming read-only for Linux guests.

Witnessed log entries can include:

2014-04-01T14:35:08.074Z: [APDCorrelator] 9413898746us: [vob.storage.apd.start] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down state.
2014-04-01T14:35:08.075Z: [APDCorrelator] 9414268686us: [esx.problem.storage.apd.start] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down state.
2014-04-01T14:36:55.274Z: No correlator for vob.vmfs.nfs.server.disconnect
2014-04-01T14:36:55.274Z: [vmfsCorrelator] 9521467867us: [esx.problem.vmfs.nfs.server.disconnect] 192.168.1.1/NFS-DS1 12345678-abcdefg0-0000-000000000000 NFS-DS1
2014-04-01T14:37:28.081Z: [APDCorrelator] 9553899639us: [vob.storage.apd.timeout] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
2014-04-01T14:37:28.081Z: [APDCorrelator] 9554275221us: [esx.problem.storage.apd.timeout] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
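
If you want to quickly check whether a host is logging these events, one option is to filter the log for the APD and NFS-disconnect identifiers shown above. A minimal sketch (the helper name and log path are my own, not from VMware; adjust for your environment):

```shell
# filter_apd: print only APD and NFS-server-disconnect events from
# vmkernel/vobd log input piped in on stdin. The function name is
# hypothetical; the patterns match the log identifiers quoted above.
filter_apd() {
  grep -E 'apd\.(start|timeout)|nfs\.server\.disconnect'
}

# Typical use on an ESXi host (log path may differ per build):
# filter_apd < /var/log/vobd.log
```

Seeing only apd.start entries that later clear is different from hitting the apd.timeout state, where I/Os are fast-failed; the latter is what hurts the guests.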

If you are hitting this issue, VMware recommends reverting to vSphere 5.5. Please monitor the following KB article closely for more details and, hopefully, a fix in the near future: http://kb.vmware.com/kb/2076392

 


    Comments

    1. says

      Reminiscent of a bug I remember from the past that affected the deleted-block-recovery process and wrecked thin provisioning. This reads as even worse.

    2. testbedvm says

      Ever since the vCenter 5.1 release I have noticed a gradual shift in VMware’s product release cycle. They are so eager to release products quickly that they are failing to perform some simple tests like these in house… I know that they have strong R&D and CPD teams that are meant to test the stability of the product, but they are taking customers’ environments for granted, releasing these unfinished products, and basically performing QA in customers’ production environments… This has become more and more evident in the last 2 years…

      Please fix this.

      Regards,
      Ex-VMware employee

      • Anonymous Virtual Coward from a big company says

        Yeah.. 5.1 was so buggy (with its first attempt with SSO and the storage heap memory issues) that we just stuck with 5.0. Now I feel like we’re doing their QA for them when it comes to vSphere 5.5 (and vCAC 5.x to 6.x migration). I know they have a lot of new products to support and integrate, but I think it’s bad when the core of the platform has these issues.

        This bug is pretty bad as more and more people embrace network-based storage.

    3. says

      Is there a recommended downgrade procedure? I’ve been upgrading since ESX 4.0 times, but never had to roll back several dozen hosts to a previous version.

      I also feel there has to be some more testing before releases, even though it takes a lot of time with such a big product portfolio. In our case, rolling back to 5.5.0GA will bring back an issue we had with HP Gen8 servers and a driver with memory leaks. Luckily I had planned OpenSSL patch maintenance for next week.

      Thanks Duncan for the heads up!

    4. P. Cruiser says

      Interesting, I’m not seeing this issue at all in the environment that I manage. I wonder why.

    5. says

      We have been seeing this issue manifest for months. But I hate to tell you, we’re not on Update 1; we’re on plain vanilla 5.5. Something doesn’t match up with the line being fed from VMware.

    6. James Hess says

      @Jon: The error messages are pretty generic. APD messages can be seen if there is a storage array performance issue or network congestion, such as high read/write wait times. So maybe you are experiencing a different problem in your environment with similar apparent results, such as network congestion on a host.

      Indeed… the nature of the vSphere bug has not been thoroughly explained so far.
      I sure hope they will provide more information.

      VMware also did not appear to provide any way to definitively separate “legitimate” All Paths Down events (where the getattr/read/write operations fail because of a storage array or networking issue) from the same APD messages caused by the vSphere bug on a host.

      This begs the question: when will VMware implement NFSv4, pNFS, or proper multipathing for NFS (like interface binding), and/or SMB 3, for more resilient network-based storage connections?

    7. Admin says

      Looks like VMware is doing a lousy job in QA. No wonder many are looking for alternatives. I hope they will focus on the core business.

      • Ron says

        Agreed. I’m getting really tired of all the 5.x problems. I’m trying to roll out a new 5.5 infrastructure and am running into far too many VMware-caused problems. Stay away from vFlash until the next major release. Clean up your house, VMware!

    8. Tom says

      I am having a similar issue with 5.5 U1 and iSCSI. Same behavior. I have a VMware support ticket open; no resolution yet.

        • Tom says

          I have 8 in total (4 onboard on a Dell R610, plus a separate card):
          Broadcom NetXtreme II BCM5709 (onboard)
          Intel 82576 (separate card)

          The support person was looking at upgrading the NIC drivers but got some errors and was researching last I talked to him.

          • Remco says

            We had some issues with broadcom adapters and were able to fix it by disabling NetQueue:
            Advanced settings > Boot > VMkernel.Boot.netNetqueueEnabled.
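
            For anyone who prefers the command line over the client UI, the same kernel setting can, to my knowledge, also be toggled with esxcli on ESXi 5.x. Treat this as a sketch and try it on a non-production host first (the setting only takes effect after a reboot):

            ```shell
            # Disable NetQueue at the VMkernel level (requires a host reboot).
            esxcli system settings kernel set --setting=netNetqueueEnabled --value=FALSE

            # Verify the configured value afterwards:
            esxcli system settings kernel list | grep -i netqueue
            ```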

            Hope this helps!

            • Tom says

              Oddly enough I never had a problem on 5.5, but after moving to 5.5 U1 it started having problems. In fact I just downgraded a host to 5.5 and I am not seeing the same issue.

    9. Remco says

      That’s strange; we are very careful these days when it comes to upgrading to the latest build. We are going to wait for the next update, and then wait a few more days :)

      • MarkV says

        We’re surprised as well. We’re waiting for this patch for U1, but I don’t see any progress from VMware’s side.

        • P. Cruiser says

          Well, it does surprise me that there isn’t at least more information about the issue, but I’m not too surprised that a NFS-related fix is taking this long. I mean, maybe someday we’ll be able to use that fancy ‘new’ pNFS feature of our storage array, but I’m not holding my breath..

        • MarkV says

          Have you got any inside information on this one? We have 7 clients waiting for this patch before installing the update, but we keep delaying it week after week and we’re not seeing any movement at all from VMware… When you said “more news soon”, I was really hoping you meant 1 or 2 days. Have you got any updates?

          • duncan@yellow-bricks says

            Unfortunately I can’t share any info other than what is available on the VMware website.

            • James Hess says

              Did the VMware team forget about it? It seems there has been no update for a very long time… :(

    10. Jeff says

      I really have to say I am surprised VMware is taking more than a month to address a situation where datastores go into APD daily. Very surprised.

      • P. Cruiser says

        Datastores APD’ing daily? I’d sure hate to be suffering through that over the Memorial Day weekend..

        Are you stuck with Update 1? There’s absolutely no way that my environment could go back to vanilla 5.5 due to fixes needed from Update 1. I’m guessing you’re in the same boat.

        • jeff says

          Not that frequently; I was exaggerating a bit. But it is very unnerving having this hang over my head all the time :)

    11. PaulB says

      It is a shame when reliability gets sacrificed in pursuit of new features. The v5 series has been problematic for us also. VMware needs to understand that their hypervisor needs to be bullet-proof above all else.

      • Admin says

        More than a month and still no patch? We are still waiting for the patch for a major deployment. I was hoping that, along with today’s 5.0 patch, there would be one for 5.5. But no. Come on, VMware.
