
Yellow Bricks

by Duncan Epping



VSAN 6.2, checksumming where you should

Duncan Epping · Mar 24, 2016 ·

Today I was talking to a customer about the checksum functionality that is part of VSAN 6.2. They asked whether VSAN was still prone to bit-rot scenarios, and they mentioned other potential concerns such as a lack of data locality. It was fairly straightforward to set the record straight: as of VSAN 6.2 we do have a "host local read cache", we checksum all data by default on write and on read, and yes, we also scrub the disks proactively to detect potential issues. I have already written about these features a couple of times, but today, when I explained to this customer how VSAN's checksumming functionality is implemented, they immediately realized the benefits of our hypervisor-based implementation. Note that the diagram below shows the VM running on a different host than where the actual data is located; the VM could just as easily be running on the same host as one of the replicas.

[Diagram: VSAN 6.2 checksumming]

When it comes to checksums, these are calculated on the host where the VM resides. Why? You want to protect your data against all types of potential corruption and issues, not just at rest: the checksum should be calculated before the data leaves the host, before it is replicated or distributed, before it hits the disk controller, and before it goes to persistent media. Even if a bit flips while the data travels across the network on its way to persistent media, this will be detected and corrected. That is exactly what VSAN does, which is unique. As the title says, checksumming where you should: at the source.
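The principle can be sketched in a few lines of Python. This is a conceptual illustration only, not VSAN code: VSAN itself uses CRC-32C, while the sketch below uses Python's stdlib CRC-32, and the function names are made up. The point is simply that checksumming at the source lets a bit flip anywhere downstream be caught on verification.

```python
import zlib

def write_with_checksum(block: bytes) -> tuple[bytes, int]:
    """On the source host: checksum the data *before* it leaves."""
    return block, zlib.crc32(block)

def verify_on_read(block: bytes, expected: int) -> bool:
    """Downstream (network, controller, disk): recompute and compare."""
    return zlib.crc32(block) == expected

data, crc = write_with_checksum(b"guest write payload")

# Simulate a single bit flipping while the block travels across the network.
corrupted = bytes([data[0] ^ 0x01]) + data[1:]

print(verify_on_read(data, crc))       # True  -> data intact
print(verify_on_read(corrupted, crc))  # False -> corruption detected
```

On a detected mismatch VSAN can then repair the read from a healthy replica, which is why calculating the checksum at the source, rather than at the disk, covers the whole path.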

Migrating from Hybrid to All-Flash VSAN

Duncan Epping · Mar 22, 2016 ·

I had this question twice last week, and I went through the exercise in the lab, so I figured I would share the experience. Migrating from hybrid to all-flash VSAN is pretty straightforward and is essentially a rolling migration. One thing I want to point out: you need to complete the full migration before you enable any data services (dedupe/compression, RAID-5/6). This is how you do it:

  1. Open the vSphere Web Client.
  2. Click the Hosts and Clusters tab.
  3. Select the cluster which you want to migrate to all-flash Virtual SAN.
  4. Click the Manage tab.
  5. Click Settings.
  6. Click Disk Management.
  7. Select the first disk group and click the Remove Disk Group icon.
  8. Select "Full data migration" and click Yes.
  9. Remove the physical HDDs from the host.
  10. Add the new flash devices to the host.
  11. Ensure there are no partitions on the flash devices.
  12. Ensure they are marked as flash devices.
  13. Create a new disk group on this host by clicking the "Create a new disk group" button.
  14. Select the caching device.
  15. Select the capacity devices.
  16. Click OK.

Repeat the above steps for each host in the cluster. When all hosts in the cluster have been migrated, you can enable your data services and/or change policies.
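The rolling flow above can be modeled as a simple loop. This is a conceptual sketch only, not the VSAN API or PowerCLI; the cluster layout, host names, and device names are all made up. What it captures is the ordering constraint: each host's disk group is fully evacuated and rebuilt before the next host starts, and data services are only enabled once every host is all-flash.

```python
# Toy model of the rolling hybrid -> all-flash migration (illustrative only).
cluster = {
    "esx01": {"disk_group": {"cache": "ssd-1", "capacity": ["hdd-1", "hdd-2"]}},
    "esx02": {"disk_group": {"cache": "ssd-2", "capacity": ["hdd-3", "hdd-4"]}},
}

def migrate_host(host: dict, new_flash: list[str]) -> None:
    # 1) Remove the disk group with "Full data migration", so all data is
    #    evacuated to the remaining hosts before the devices go away.
    host["disk_group"] = None
    # 2) Swap the HDDs for flash, then create a new all-flash disk group.
    host["disk_group"] = {"cache": new_flash[0], "capacity": new_flash[1:]}

# One host at a time -- the cluster keeps serving I/O throughout.
for name, host in cluster.items():
    migrate_host(host, [f"flash-cache-{name}", f"flash-cap-{name}"])

# Only now, with every host all-flash, enable dedupe/compression or RAID-5/6.
all_flash = all(h["disk_group"]["capacity"][0].startswith("flash")
                for h in cluster.values())
print(all_flash)  # True
```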

vMSC and Disk.AutoremoveOnPDL on vSphere 6.x and higher

Duncan Epping · Mar 21, 2016 ·

I have discussed this topic a couple of times, and want to inform people about a recent change in recommendation. In the past, when deploying a stretched cluster (vMSC), most storage vendors and VMware recommended setting Disk.AutoremoveOnPDL to 0. This disabled the feature that automatically removes LUNs which are in a PDL (permanent device loss) state; upon return of the device, a rescan would then allow you to use the device again. With vSphere 6.0, however, there has been a change in how vSphere responds to a PDL scenario: vSphere does not expect the device to return. To be clear, the PDL behaviour in vSphere was designed around the removal of devices; they are not supposed to stay in the PDL state and return for duty. This did work in previous versions, but only due to a bug.

With vSphere 6.0 and higher, VMware recommends setting Disk.AutoremoveOnPDL to 1, which is the default setting. If you are a vMSC / stretched cluster customer, please change your environment and design accordingly. Before you do, though, consult your storage vendor and discuss the change. I would also recommend testing the change and behaviour, to validate that the environment returns for duty correctly after a PDL! Sorry about the confusion.

[Screenshot: Disk.AutoremoveOnPDL advanced setting]

The KB article backing my recommendation was just posted: https://kb.vmware.com/kb/2059622. The documentation (the vMSC whitepaper) is also being updated.
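For those who prefer the command line over the Web Client, the setting can be inspected and restored to the default per host with esxcli. A quick sketch, assuming ESXi 6.x (the KB above describes the same setting):

```shell
# Show the current value (the 6.x default is 1 = enabled)
esxcli system settings advanced list -o /Disk/AutoremoveOnPDL

# Set it back to the recommended default of 1
esxcli system settings advanced set -o /Disk/AutoremoveOnPDL -i 1
```

Remember to apply this consistently across all hosts in the stretched cluster, and to validate the PDL behaviour afterwards as noted above.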

Want to hear all about VSAN 6.2? Watch the #SFD9 recordings!

Duncan Epping · Mar 20, 2016 ·

Last week my colleagues had the pleasure of presenting at Storage Field Day 9 (SFD9). The topic was Virtual SAN: more specifically, what we released with 6.2, how we are doing from a business point of view, and where we stand in the industry. The team gave a great explanation of all the different aspects of this release, and even covered some of the basic constructs like disk groups. One thing that stood out to me was Yanbing's explanation of the customer count. I have seen one of our competitors scream "shelfware" every time they get the chance, which I guess is a great validation to begin with (why worry about something that isn't a problem?), but still, I would like to share this quote:

VSAN has well over 3000 customers. So how did we count these 3000 customers? VSAN's business is largely built on transactional deals: three quarters (75%) of the customers come from those transactions. Another quarter (25%) comes from ELAs (enterprise license agreements), and most of those customers went through a proof of concept with VSAN before deciding to make that choice. We also have another customer segment, the Horizon / VDI use case, where we track actual deployments.

The intro is by Yanbing Li, General Manager of the Storage and Availability business unit at VMware. It is followed by the basics, what's new, and deeper dives by Christos Karamanolis, CTO of Storage and Availability at VMware, and finally a short demo showing the operational simplicity by Rawlinson Rivera, Principal Architect, Storage and Availability at VMware. I created a simple playlist; if you want to skip videos, click the lines in the top left and select the one you want to view. Thanks Stephen Foskett for capturing and sharing these sessions, awesome job once again!

VSAN Health checks disabled after upgrade to vCenter 6.0 U2

Duncan Epping · Mar 18, 2016 ·

Yesterday at the Dutch VMUG I was talking to my friend @GabVirtualWorld. Gabe mentioned that he had just upgraded the vCenter Server in his VSAN environment to 6.0 U2, but hadn't upgraded the hosts yet. Funnily enough, someone else later mentioned the same scenario, and both of them noticed that the VSAN Health Checks were disabled after upgrading vCenter Server. Below is a screenshot of the issue Gabe saw in his environment. (Thanks Gabe!)

[Screenshot: VSAN health checks disabled]

So does that mean there is no backwards compatibility for the Health Check? Well, yes and no. In this release we made our APIs public (William Lam wrote a couple of great articles on this), and in order to deliver a high-quality SDK, backwards compatibility had to be broken. So if you received the "health checks disabled" message after upgrading to vCenter Server 6.0 U2, you can simply solve this by also upgrading the hosts to ESXi 6.0 U2. I hope this helps.

** Update March 23rd **

Please note that ESXi 6.0 Update 2 is also a requirement for enabling the "Performance Service", which was newly introduced in Virtual SAN 6.2. Although the Performance Service capability is exposed in vCenter Server 6.0 Update 2, without ESXi 6.0 U2 you will not be able to enable it. When trying to enable it on any ESXi version lower than 6.0 U2, the following error is thrown:

Task Details:

Status: General Virtual SAN error.
Start Time: Mar 23, 2016 10:55:35 AM
Completed Time: Mar 23, 2016 10:55:38 AM
State: Error

Error Stack: The performance service on host is not accessible. The host may be unreachable, or the host version may not be supported

This is what the error looks like in the UI:

[Screenshot: Performance Service error in the vSphere Web Client]

About the Author

Duncan Epping is a Chief Technologist and Distinguished Engineering Architect at Broadcom. Besides writing on Yellow-Bricks, Duncan is the co-author of the vSAN Deep Dive and the vSphere Clustering Deep Dive book series. Duncan is also the host of the Unexplored Territory Podcast.
