vSphere 5.5 U1 patch released for NFS APD problem!

On April 19th I wrote about an issue with vSphere 5.5 U1 and NFS-based datastores going into APD. People internally at VMware have worked very hard to root cause the issue and fix it. The log entries witnessed are:

YYYY-04-01T14:35:08.075Z: [APDCorrelator] 9414268686us: [esx.problem.storage.apd.start] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down state.
YYYY-04-01T14:36:55.274Z: No correlator for vob.vmfs.nfs.server.disconnect
YYYY-04-01T14:36:55.274Z: [vmfsCorrelator] 9521467867us: [esx.problem.vmfs.nfs.server.disconnect] 192.168.1.1/NFS-DS1 12345678-abcdefg0-0000-000000000000 NFS-DS1
YYYY-04-01T14:37:28.081Z: [APDCorrelator] 9553899639us: [vob.storage.apd.timeout] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed. 

More details on the fix can be found here: http://kb.vmware.com/kb/2077360
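Once the patch has been applied, you can verify which build a host is running and compare it against the build number listed in the KB article. A minimal sketch (both commands are standard on ESXi 5.x):

    # ESXi Shell / SSH on the host
    vmware -vl
    esxcli system version get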

Alert: vSphere 5.5 U1 and NFS issue!

Some had already reported on this on Twitter and in various blog posts, but I had to wait until I received the green light from our KB/GSS team. An issue has been discovered with vSphere 5.5 Update 1 that is related to loss of connection of NFS-based datastores. (Note that NFS volumes include VSA datastores.)

*** Patch released, read more about it here ***

This is a serious issue, as it results in an APD of the datastore, meaning that the virtual machines will not be able to do any I/O to the datastore for the duration of the APD. This by itself can result in BSODs for Windows guests and filesystems becoming read-only for Linux guests.

Witnessed log entries can include:

2014-04-01T14:35:08.074Z: [APDCorrelator] 9413898746us: [vob.storage.apd.start] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down state.
2014-04-01T14:35:08.075Z: [APDCorrelator] 9414268686us: [esx.problem.storage.apd.start] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down state.
2014-04-01T14:36:55.274Z: No correlator for vob.vmfs.nfs.server.disconnect
2014-04-01T14:36:55.274Z: [vmfsCorrelator] 9521467867us: [esx.problem.vmfs.nfs.server.disconnect] 192.168.1.1/NFS-DS1 12345678-abcdefg0-0000-000000000000 NFS-DS1
2014-04-01T14:37:28.081Z: [APDCorrelator] 9553899639us: [vob.storage.apd.timeout] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
2014-04-01T14:37:28.081Z: [APDCorrelator] 9554275221us: [esx.problem.storage.apd.timeout] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
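If you want to verify whether a host has already hit this condition, a quick check is to search the host logs for the events above. A minimal sketch, assuming the events are written to /var/log/vobd.log (adjust the path to your syslog/scratch configuration):

    # Run in the ESXi Shell or over SSH on the host
    grep -E "apd\.(start|timeout)|nfs\.server\.disconnect" /var/log/vobd.log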

If you are hitting these issues, VMware recommends reverting to vSphere 5.5. Please monitor the following KB closely for more details and, hopefully, a fix in the near future: http://kb.vmware.com/kb/2076392


How does vSphere recognize an NFS Datastore?

This question has popped up various times now: how does vSphere recognize an NFS datastore? The mechanism has changed over time, which is why many people are confused, so I am going to try to clarify it. Do note that this article is based on vSphere 5.0 and up. I wrote a similar article a while back, but figured writing it in a more explicit way might help answer these questions (and gives me the option to just send people a link :-)).

When an NFS share is mounted, a unique identifier (UUID) is created so that the volume can be correctly identified. Now here comes the part where you need to pay attention: the UUID is created by calculating a hash, and that hash is calculated over the “server name” and the folder name you specify in the “add NFS datastore” workflow.

[Screenshot: Add NFS Datastore wizard]

This means that if you use “mynfsserver.local” on Host A, you will need to use the exact same server name on Host B. The same applies to the folder: even “/vols/vol0/datastore-001” is not considered to be the same as “/vols/vol0/datastore-001/”. In short, when you mount an NFS datastore, make absolutely sure you use the exact same Server and Folder name on all hosts in your cluster!
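To make this a bit more concrete, here is a minimal sketch of how the mismatch can occur. The server name, IP address and folder below are just the placeholders from the example above; esxcfg-nas and ls are the same commands used later in this article, and esxcli storage nfs list is simply another way to list the mounts:

    # Host A: mounted using the FQDN, no trailing slash
    esxcfg-nas -a -o mynfsserver.local -s /vols/vol0/datastore-001 NFS-DS1

    # Host B: mounted by IP address and with a trailing slash; the hash input differs,
    # so vSphere generates a different identifier and treats it as a different datastore
    esxcfg-nas -a -o 192.168.1.1 -s /vols/vol0/datastore-001/ NFS-DS1

    # Compare the generated identifiers on both hosts
    ls -lah /vmfs/volumes
    esxcli storage nfs list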

By the way, there is a nice blogpost by NetApp on this topic.

Back to Basics: Using the vSphere 5.1 Web Client to add an NFS share to all hosts

If you look at the following workflow you will see why I am starting to love NFS more and more… Adding an NFS datastore was already easy with 5.0 (and prior), but with 5.1 it is even easier. It takes just a couple of steps to add an NFS datastore to your cluster (a CLI equivalent follows below the list):

  • Open the Web Client
  • Go to your host under “vCenter” —> “Hosts and Clusters”.
  • Click “New Datastore”.
  • Provide a name for the datastore and click “Next”.
  • Select “NFS” and click “Next”.
  • Fill out the NFS “Server” and “Folder” details and click “Next”.
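If you prefer to script this instead of clicking through the wizard, the same can be done from the command line. A minimal sketch, assuming the server and folder from the earlier example; run it against every host in the cluster so the Server and Folder strings stay identical everywhere:

    # ESXi Shell / SSH, repeat on each host in the cluster
    esxcli storage nfs add -H mynfsserver.local -s /vols/vol0/datastore-001 -v NFS-DS1

    # Verify the datastore is mounted with the same identifier on every host
    esxcli storage nfs list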

Using a CNAME (DNS alias) to mount an NFS datastore

I was playing around with NFS datastores in my lab today. I wanted to fail over a replicated NFS datastore without the need to re-register the virtual machines running on it. I had mounted the NFS datastore using the IP address, and as that address is used to create the UUID, it was obvious that this wouldn’t work. I figured there should be a way around it, but a quick search on the internet didn’t turn up anything.

I figured it should be possible to achieve this using a CNAME, but I also recalled something about vCenter screwing this up. I tested it anyway, and with success. This is what I did:

  • Added both NFS servers to DNS
  • Created a CNAME (DNS alias) and pointed it to the “active” NFS server
    • I used the name “nasdr” to make it obvious what it is used for
  • Created an NFS share (drtest) on the NFS server
  • Mounted the NFS export using vCenter or through the CLI
    • esxcfg-nas -a -o nasdr -s /drtest drtest
  • Checked the UUID using vCenter or through the CLI
    • ls -lah /vmfs/volumes
    • example output:
      lrwxr-xr-x    1 root     root           17 Feb  6 10:56 drtest -> e9f77a89-7b01e9fd
  • Created a virtual machine on the NFS datastore
  • Enabled replication to my “standby” NFS server
  • I killed my “active” NFS server environment (after validating it had completed replication)
  • Changed the CNAME to point to the secondary NFS server
  • Unmounted the old volume
    • esxcfg-nas -d drtest
  • I did a vmkping to “nasdr” just to validate the destination IP had changed
  • Rescanned my storage using “esxcfg-rescan -A”
  • Mounted the new volume
    • esxcfg-nas -a -o nasdr -s /drtest drtest
  • Checked the UUID using the CLI
    • ls -lah /vmfs/volumes
    • example output:
      lrwxr-xr-x    1 root     root           17 Feb  6 13:09 drtest -> e9f77a89-7b01e9fd
  • Powered on the virtual machine now running on the secondary NFS server

As you can see, both volumes had the exact same UUID. After the fail-over I could power on the virtual machine without needing to re-register it within vCenter first. Before sharing this with the world I reached out to my friends at NetApp. Vaughn Stewart connected me with Peter Learmonth, who validated my findings and pointed me to a blog article he wrote on this topic. I suggest heading over to Peter’s article for more details.
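For completeness, here are the fail-over steps from the list above condensed into the commands as I ran them on the host. This is a sketch of my lab setup, so “nasdr” and “drtest” are simply the alias and share names I picked; adjust them to your environment:

    # After replication has completed and the CNAME "nasdr" points to the standby NFS server
    esxcfg-nas -d drtest                        # unmount the old volume
    vmkping nasdr                               # validate the alias now resolves to the standby server
    esxcfg-rescan -A                            # rescan all adapters
    esxcfg-nas -a -o nasdr -s /drtest drtest    # remount using the same alias and share
    ls -lah /vmfs/volumes                       # the UUID should match the original mount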