Alert: vSphere 5.5 U1 and NFS issue!

Duncan Epping · Apr 19, 2014 ·

Some have already reported on this on Twitter and in various blog posts, but I had to wait until I received the green light from our KB/GSS team. An issue has been discovered with vSphere 5.5 Update 1 that causes loss of connection to NFS-based datastores. (NFS volumes include VSA datastores.)

*** Patch released, read more about it here: http://www.yellow-bricks.com/2014/06/11/vsphere-5-5-u1-patch-released-nfs-apd-problem/ ***

This is a serious issue, as it results in an All Paths Down (APD) condition for the datastore, meaning that the virtual machines will not be able to do any IO to the datastore for the duration of the APD. This by itself can result in BSODs for Windows guests and filesystems becoming read-only for Linux guests.

Witnessed log entries can include:

2014-04-01T14:35:08.074Z: [APDCorrelator] 9413898746us: [vob.storage.apd.start] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down state.
2014-04-01T14:35:08.075Z: [APDCorrelator] 9414268686us: [esx.problem.storage.apd.start] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down state.
2014-04-01T14:36:55.274Z: No correlator for vob.vmfs.nfs.server.disconnect
2014-04-01T14:36:55.274Z: [vmfsCorrelator] 9521467867us: [esx.problem.vmfs.nfs.server.disconnect] 192.168.1.1/NFS-DS1 12345678-abcdefg0-0000-000000000000 NFS-DS1
2014-04-01T14:37:28.081Z: [APDCorrelator] 9553899639us: [vob.storage.apd.timeout] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
2014-04-01T14:37:28.081Z: [APDCorrelator] 9554275221us: [esx.problem.storage.apd.timeout] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
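
To check whether a host has logged the same sequence, the entries above can be scanned for APD start and timeout events. The following is only a minimal sketch in Python, assuming log lines in exactly the format shown above; the function names (parse_events, apd_windows) are hypothetical and not part of any VMware tool, and the sketch cannot tell whether an APD was caused by this bug or by a genuine storage or network problem.

```python
import re
from datetime import datetime

# Matches event lines of the form shown above:
#   <timestamp>Z: [<correlator>] <n>us: [<event.id>] <message>
# Lines without that shape (e.g. "No correlator for ...") do not match.
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z: "
    r"\[(?P<correlator>\w+)\] \d+us: "
    r"\[(?P<event>[\w.]+)\] (?P<message>.*)$"
)
# Extracts the device/filesystem identifier from APD messages.
ID_RE = re.compile(r"identifier \[(?P<id>[^\]]+)\]")


def parse_events(lines):
    """Yield one dict per recognised event line; skip everything else."""
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if not m:
            continue
        ident = ID_RE.search(m.group("message"))
        yield {
            "time": datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f"),
            "event": m.group("event"),
            "device": ident.group("id") if ident else None,
        }


def apd_windows(lines):
    """Return {identifier: seconds from first apd.start to first apd.timeout}."""
    started, windows = {}, {}
    for ev in parse_events(lines):
        dev = ev["device"]
        if dev is None:
            continue
        if ev["event"].endswith("storage.apd.start"):
            # Keep only the earliest start seen for this identifier.
            started.setdefault(dev, ev["time"])
        elif ev["event"].endswith("storage.apd.timeout") and dev in started:
            windows.setdefault(dev, (ev["time"] - started[dev]).total_seconds())
    return windows


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python apd_scan.py < copied_log_lines.txt
    for device, seconds in apd_windows(sys.stdin).items():
        print(f"{device}: APD timeout reached after {seconds:.3f}s")
```

An identifier that appears in the result reached the APD timeout state, which is the point at which I/Os are fast failed according to the log messages above.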

If you are hitting these issues, then VMware recommends reverting to vSphere 5.5. Please monitor the following KB closely for more details and hopefully a fix in the near future: http://kb.vmware.com/kb/2076392

Comments

Fletcher Cocquyt (@Cocquyt) says

19 April, 2014 at 11:56

Duncan, thanks for the heads up. We are on 5.5 < 2076392 are we safe from this issue?
Also when is the Heartbleed update available? Today?

thanks
Morianos says

19 April, 2014 at 17:03

Reminiscent of a bug I remember from the past that affected the deleted-block-recovery process, wrecking thin provisioning. This one reads even worse.
testbedvm says

20 April, 2014 at 08:47

Ever since the VC 5.1 release I have noticed a gradual shift in VMware’s product release cycle: they are so eager to release products quickly that they are failing to perform some simple tests like these in house… I do know that they have strong R&D and CPD teams that are meant to test the stability of the product, yet they are taking customers’ environments for granted, releasing these unfinished products and basically performing QA on customers’ production environments… This has become more and more evident in the last 2 years…

Please fix this.

Regards,
X vmware employee
- Anonymous Virtual Coward from a big company says
  
  20 April, 2014 at 16:21
  
  Yeah.. 5.1 was so buggy (with its first attempt with SSO and the storage heap memory issues) that we just stuck with 5.0. Now I feel like we’re doing their QA for them when it comes to vSphere 5.5 (and vCAC 5.x to 6.x migration). I know they have a lot of new products to support and integrate, but I think it’s bad when the core of the platform has these issues.
  
  This bug is pretty bad as more and more people embrace network-based storage.
Virtualizacion en Español says

21 April, 2014 at 13:55

Is there a recommended downgrade procedure? I’ve been upgrading since ESX 4.0 times, but never had to roll back several dozen hosts to a previous version.

I also feel there has to be some more testing before releases, even though it takes a lot of time with such a big product portfolio. In our case, rolling back to 5.5.0GA will bring back an issue we had with HP Gen8 servers and a driver with memory leaks. Luckily I had planned OpenSSL patch maintenance for next week.

Thanks Duncan for the heads up!
P. Cruiser says

21 April, 2014 at 16:30

Interesting, I’m not seeing this issue at all in the environment that I manage. I wonder why.
Jon says

21 April, 2014 at 18:06

We have been seeing this issue manifest for months. But I hate to tell you, we’re not on Update 1. We’re on plain vanilla 5.5. Something doesn’t match up with the line being fed from VMware.
James Hess says

21 April, 2014 at 23:57

@Jon; The error messages are pretty generic. APD messages can be seen if there is a storage array performance or a network congestion issue, such as high read/write wait time. So maybe you are experiencing a different problem in your environment with similar apparent results; such as network congestion on a host.

Indeed… the nature of the vSphere bug has not been thoroughly explained so far.
I sure hope they will provide more information.

VMware also did not appear to show any way to definitively separate “legitimate” all path down events; where the getattr/read/write operations are failing because of a storage array/networking issue from the same APD messages caused by the vSphere bug on a host.

This prompts one to ask: when will VMware bother to implement NFSv4 and pNFS, or proper multipathing for NFS (like interface binding), and/or SMB version 3 protocols for more resilient network-based storage connections?
Admin says

22 April, 2014 at 06:10

Looks like VMware is doing a lousy job in QA. No wonder many are looking for alternative. Hope they will focus on the core business.
- Ron says
  
  22 April, 2014 at 23:40
  
  Agreed. I’m getting really tired of all the 5.x problems. I’m trying to roll out a new 5.5 infrastructure and am running into far too many VMware-caused problems. Stay away from vFlash until the next major release. Clean up your house, VMware!
Tom says

22 April, 2014 at 21:08

I am having a similar issue with 5.5 U1 and iSCSI. Same behavior. I have a VMware support ticket open, no resolution yet.
- Remco says
  
  23 April, 2014 at 13:13
  
  What kind of nics are you using?
  - Tom says
    
    23 April, 2014 at 14:54
    
    I have 8 total (4 onboard a Dell R610 and a separate card)
    Broadcom NetXtreme II BCM5709 (onboard)
    Intel 82576 (separate card)
    
    The support person was looking at upgrading the NIC drivers but got some errors and was researching last I talked to him.
    - Remco says
      
      23 April, 2014 at 15:00
      
      We had some issues with broadcom adapters and were able to fix it by disabling NetQueue:
      Advanced settings > Boot > VMkernel.Boot.netNetqueueEnabled.
      
      Hope this helps!
      - Tom says
        
        23 April, 2014 at 15:47
        
        Oddly enough I never had a problem on 5.5 but moving to 5.5 U1 it started having problems. In fact I just downgraded a host to 5.5 and I am not getting the same issue.
Remco says

23 April, 2014 at 15:51

That’s strange. We are very careful these days when it comes to upgrading to the latest build; we are going to wait for the next update and then wait a few more days 🙂
Jeff says

11 May, 2014 at 13:28

I am surprised there is still no patch for this.
- MarkV says
  
  12 May, 2014 at 13:10
  
  We’re surprised as well. We’re waiting for this update for U1, but I don’t see any progress from VMware side?
  - P. Cruiser says
    
    12 May, 2014 at 16:27
    
    Well, it does surprise me that there isn’t at least more information about the issue, but I’m not too surprised that an NFS-related fix is taking this long. I mean, maybe someday we’ll be able to use that fancy ‘new’ pNFS feature of our storage array, but I’m not holding my breath..
P. Letreulle says

15 May, 2014 at 15:45

Hello Duncan,

Any idea when a fix will be released? Or a new VMware KB on how to downgrade easily to ESX 5.5 GA?
- Duncan Epping says
  
  15 May, 2014 at 16:29
  
  The VMware team is working on it, more news soon!
  - MarkV says
    
    20 May, 2014 at 15:04
    
    Have you got any inside information on this one? We have 7 clients waiting for this patch before installing the update, but we keep delaying it week after week and we’re not seeing any movement at all from VMware… When you said “more news soon”, I was really hoping you meant 1 or 2 days. Have you got any updates?
    - duncan@yellow-bricks says
      
      20 May, 2014 at 16:24
      
      I can’t share any info unfortunately other than available on the VMware website.
      - James Hess says
        
        10 June, 2014 at 21:28
        
        Did the VMware team forget about it? It seems there’s no update for a very long time….:(
Jeff says

21 May, 2014 at 15:42

I really have to say I am surprised vmware is taking more than a month to address a situation where datastores APD daily. Very surprised.
- P. Cruiser says
  
  23 May, 2014 at 17:48
  
  Datastores APD’ing daily? I’d sure hate to be suffering through that over the Memorial Day weekend..
  
  Are you stuck with Update 1? There’s absolutely no way that my environment could go back to vanilla 5.5 due to fixes needed from Update 1. I’m guessing you’re in the same boat.
  - jeff says
    
    28 May, 2014 at 21:38
    
    Not that frequently, I was exaggerating a bit. But it is very un-nerving having this hang over my head all the time 🙂
PaulB says

21 May, 2014 at 22:02

It is a shame when reliability gets sacrificed in pursuit of new features. The v5 series has been problematic for us also. VMware needs to understand that their hypervisor needs to be bullet-proof above all else.
- Admin says
  
  30 May, 2014 at 12:51
  
  More than a month and still no patch? Still waiting for the patch for a major deployment. I was hoping, along with today’s 5.0 patch there will be one for 5.5. But no. Come on VMware.
Duncan Epping says

11 June, 2014 at 08:11

PATCH RELEASED:

http://www.yellow-bricks.com/2014/06/11/vsphere-5-5-u1-patch-released-nfs-apd-problem/
