
Yellow Bricks

by Duncan Epping


vMSC and Disk.AutoremoveOnPDL on vSphere 6.x and higher

Duncan Epping · Mar 21, 2016 ·

I have discussed this topic a couple of times, and want to inform people about a recent change in recommendation. In the past, when deploying a stretched cluster (vMSC), most storage vendors and VMware recommended setting Disk.AutoremoveOnPDL to 0. This disabled the feature that automatically removes LUNs which are in a PDL (permanent device loss) state; upon return of the device, a rescan would then allow you to use the device again. With vSphere 6.0, however, there has been a change to how vSphere responds to a PDL scenario: vSphere does not expect the device to return. To be clear, the PDL behaviour in vSphere was designed around the removal of devices; they are not supposed to stay in the PDL state and return for duty. This did work in previous versions, but only due to a bug.

With vSphere 6.0 and higher, VMware recommends setting Disk.AutoremoveOnPDL to 1, which is the default setting. If you are a vMSC / stretched cluster customer, please change your environment and design accordingly. Before you do, however, consult your storage vendor and discuss the change. I would also recommend testing the change and the resulting behaviour, to validate that the environment returns for duty correctly after a PDL! Sorry about the confusion.


The KB article backing my recommendation has just been posted: https://kb.vmware.com/kb/2059622. The documentation (the vMSC whitepaper) is also being updated.
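
If you want to verify or change the setting on a host, the below should do the trick from the ESXi command line. Consider it a quick sketch: validate it in your own environment (and against your storage vendor's guidance) first.

    # Check the current value of Disk.AutoremoveOnPDL
    esxcli system settings advanced list -o /Disk/AutoremoveOnPDL
    # Set it to the recommended default of 1
    esxcli system settings advanced set -o /Disk/AutoremoveOnPDL -i 1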

VMworld 2015: vSphere APIs for IO Filtering update

Duncan Epping · Aug 31, 2015 ·

I suspect that the majority of blogs this week will all be about Virtual SAN, Cloud Native Apps and EVO. If you ask me, the vSphere APIs for IO Filtering announcements are just as important. I’ve written about VAIO before, in a way; it was first released in vSphere 6.0 and opened up to a select group of partners. For those who don’t know what it is, let’s recap: the vSphere APIs for IO Filtering is a framework which enables VMware partners to develop data services for vSphere in a fully supported fashion. VMware worked closely with EMC and Sandisk during the design and development phase to ensure that VAIO would deliver what partners required it to deliver.

These data services can be applied at a VM or VMDK level of granularity, simply by attaching a policy to your VM or VMDK, and can be literally anything. In this first official release, however, you will see two key use cases for VAIO:

  1. Caching
  2. Replication

The great thing about VAIO, if you ask me, is that it is an ESXi user space level API, which over time will make it possible for the various data services providers (like Atlantis, Infinio etc.) who now have a “virtual appliance” based solution to move into ESXi and simplify their customers’ environments by removing that additional layer. (To be technically accurate, the VAIO APIs are all user level APIs; the filters all run in user space, and only a part of the VAIO framework runs inside the kernel itself.) On top of that, as it is implemented at the “right” layer, it is supported for VMFS (FC/iSCSI/FCoE etc.), NFS, VVols and VSAN based infrastructures. The below diagram shows where it sits.

VAIO software services are inserted before the IO is directed to any physical device and do not interfere with normal disk IO. In order to use VAIO you will need vSphere 6.0 Update 1. On top of that, of course, you will need to procure a solution from one of the VMware partners who are certified for it; VMware provides the framework, partners provide the data services!
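
If you are wondering which IO filters are installed on a given host, something along these lines should work from the ESXi command line on 6.0 Update 1 and later. A rough sketch: the exact output depends on which partner solution (if any) you have installed.

    # List the IO filters (VAIO data services) registered on this host
    esxcli storage iofilter list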

As far as I know, the first two to market will be EMC and Sandisk. Other partners who are working on VAIO based solutions, and from whom you can expect to see something released, are Actifio, Primaryio, Samsung, HGST and more. I am hoping to catch up with one or two of them this week, or over the course of the next week, so I can discuss it a bit more in detail.

Quick pointer to new Virtual SAN Ready Node configs

Duncan Epping · Jun 23, 2014 ·

Just a quick pointer to the new document that holds all Virtual SAN Ready Node configurations: Virtual SAN Ready Node.pdf. In this document various new configurations are described, and a couple of old Ready Node configurations appear to have been removed. I expect additional configurations to be added in the upcoming weeks.

Another very useful document, recently released on the topic of Virtual SAN hardware, is the Virtual SAN Hardware Quick Reference Guide. It describes different profiles for both server and VDI workloads and gives examples of how you should configure your hardware to meet certain requirements.

Disconnect a host from VSAN cluster doesn’t change capacity?

Duncan Epping · Jun 13, 2014 ·

Someone asked this question on VMTN this week, and I received a similar question from another user… If you disconnect a host from a VSAN cluster, it doesn’t change the total amount of available capacity, and the customer was wondering why. Well, the answer is simple: you are not disconnecting the host from your VSAN cluster, you are disconnecting it from vCenter Server (in contrast to HA and DRS, by the way). In other words: your VSAN host is still providing storage to the VSAN datastore when it is disconnected.

If you want a host to leave a VSAN cluster you have two options in my opinion:

  • Place it in maintenance mode with full data migration and remove it from the cluster
  • Run the following command from the ESXi command line:
    esxcli vsan cluster leave

Please keep that in mind when you do maintenance… Do not use “disconnect” but actually remove the host from the cluster if you do not want it to participate in VSAN any longer.
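
If you want to double check whether a host is still participating, a quick way to verify (run on the host itself, just a sketch) is:

    # Show VSAN cluster membership for this host; "Enabled: false" means it is not part of a cluster
    esxcli vsan cluster get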

vSphere 5.5 U1 patch released for NFS APD problem!

Duncan Epping · Jun 11, 2014 ·

On April 19th I wrote about an issue with vSphere 5.5 U1 and NFS based datastores going into APD. People internally at VMware have worked very hard to root cause the issue and fix it. The log entries witnessed are:

YYYY-04-01T14:35:08.075Z: [APDCorrelator] 9414268686us: [esx.problem.storage.apd.start] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down state.
YYYY-04-01T14:36:55.274Z: No correlator for vob.vmfs.nfs.server.disconnect
YYYY-04-01T14:36:55.274Z: [vmfsCorrelator] 9521467867us: [esx.problem.vmfs.nfs.server.disconnect] 192.168.1.1/NFS-DS1 12345678-abcdefg0-0000-000000000000 NFS-DS1
YYYY-04-01T14:37:28.081Z: [APDCorrelator] 9553899639us: [vob.storage.apd.timeout] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed. 

More details on the fix can be found here: http://kb.vmware.com/kb/2077360
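
If you want to check whether your hosts have been hitting this, grepping for the APD events should help. A rough example below; I am assuming the events land in /var/log/vobd.log as in the snippet above, but do check vmkernel.log in your environment as well.

    # Look for APD start/timeout events and NFS server disconnects in the VOB daemon log
    grep -E "apd.start|apd.timeout|nfs.server.disconnect" /var/log/vobd.log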

