
Yellow Bricks

by Duncan Epping


Startup News Flash part 9

Duncan Epping · Nov 18, 2013 ·

There we are, part 9 of the Startup News Flash. As mentioned last time, last week was “Storage Field Day”, so there is a bit more news than normal this time. I would highly recommend watching the videos; the Coho video in particular is very entertaining.

The original founders of Fusion-io (David Flynn and Rick White) just received $50 million in funding for their new startup called Primary Data. I mentioned them briefly in Startup News Flash part 3 when they announced they had started a new company, and it seems they have something on their hands! They haven’t revealed what they are working on; they are aiming to come out of stealth around the second quarter of 2014. The WSJ mentioned the following in terms of the space these guys will be playing in: “The company is developing software–though it actually will come bundled on standard server hardware–that essentially connects all those pools of data together, offering what Flynn calls a “unified file directory namespace” visible to all servers in company computer rooms–as well as those “in the cloud” that might be operated by external service companies.” Indeed, something with storage / caching / software defined / scale-out…

I guess scale-out hypervisor-based storage solutions are hot… Maxta just officially announced their new product called MxSP. Some rumors had already been floating around, but now the details are out there. Marcel v/d Berg did a nice article on them which I recommend reading if you would like some more details. Basically Maxta created a Virtual Storage Appliance which pools all local storage and presents it as NFS to your hypervisor. Today VMware vSphere is fully supported, and KVM / Hyper-V in a limited fashion. It offers functionality like VM-level snapshots and zero-copy clones, thin provisioning, inline deduplication and more. It looks like licensing is capacity based, but no prices have been mentioned.

When I first looked at Avere I was intrigued by their solution, but somehow it didn’t really click. Their primary focus was a caching layer in between your NFS storage and your hosts… but I wondered why I would want an extra box for that and not just use something host-local. Last week Avere made an announcement around a solution that allows you to pool local and cloud storage resources, present them via a common namespace, and move data between these tiers. FlashCloud is what Avere calls it. Their paper describes it best, so a shameless copy of that: “FlashCloud software running on Avere FXT Edge filers addresses this challenge by storing cold data on cost-effective cloud storage at the core of the network and automatically and efficiently moving active data to the edge near the users.” I like the concept… If you are interested, check out their site here.

Far from a startup, but cool enough to be listed here… The release of the X-Brick aka XtremIO by EMC. The XtremIO solution is a brand new all-flash array which delivers screaming performance in a scale-out fashion. Although there are limitations from a scaling point of view today, it is expected that these will be lifted soon. One of the articles I enjoyed reading is this one by Jason Nash. What is most interesting about the product is the following, and I am going to quote Jason here as he is spot on: “There is no setup and tuning of XtremIO.  No LUNs.  No RAID Groups.  No pools.  No stripe sizes.  No tiering.  Nothing.  You have a pool of very fast storage.  How big do you want that LUN to be?  That’s all you really need to do!”

Another round of funding for SimpliVity, Series C… $58 million, led by Kleiner Perkins Growth Fund and DFJ Growth, with contributions from Meritech Capital Partners and Swisscom Ventures. I guess this GigaOM quote says it all: “CEO Doron Kempel, an EMC veteran, said the cash infusion will enable the company to execute on plans to triple its staff and boost sales growth five fold in 2014”.

VMworld session on vSphere Metro Storage Cluster on youtube!

Duncan Epping · Nov 16, 2013 ·

I didn’t even realize this, but I just found out that the session Lee Dilworth and I did at VMworld on the subject of vSphere Metro Storage Clusters can actually be viewed for free on YouTube!

There are some more sessions up on YouTube, so make sure you have a look around!

VSAN performance: many SAS low capacity VS some SATA high capacity?

Duncan Epping · Nov 14, 2013 ·

Something that I have seen popping up multiple times now is the discussion around VSAN and spindles for performance. Someone on the community forums mentioned they were going to buy 20 x 600GB SAS drives for each of the 3 hosts in their VSAN environment. These were 10K SAS disks, which obviously outperform 7200 RPM SATA drives. I figured I would do some math first:

  • Server with 20 x 600GB 10K SAS = $9,369.99 per host
  • Server with 3 x 4TB Nearline SAS = $4,026.91 per host

So that is about a $4,300 difference. Note that I did not spec out the full server; it was a base model without any additional memory etc., just to illustrate the perf vs capacity point. Now, as mentioned, the 20 spindles would of course deliver additional performance, because after all you have more spindles and better performing spindles. So let’s do the math on that one, taking some average numbers into account:

  • 20 x 10K RPM SAS with 140 IOps each = 2800 IOps
  • 3 x 7200 RPM NL-SAS with 80 IOps each = 240 IOps

That is a whopping 2,560 IOps difference in total. That sounds like an awful lot, doesn’t it? To a certain extent it is a lot, but will it really matter in the end? Well, the only correct answer here is: it depends.
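To make the comparison easy to reproduce, here is a minimal Python sketch of the math above. The per-disk IOps figures and prices are simply the ballpark numbers from this post, not vendor specifications; note that both configurations happen to land at the same 12TB of raw capacity per host.

```python
# Rough perf vs. capacity math for the two example configurations above.
# Prices and per-disk IOps are the ballpark figures used in this post.
configs = {
    "20 x 600GB 10K SAS":  {"disks": 20, "size_gb": 600,  "iops": 140, "price": 9369.99},
    "3 x 4TB 7200 NL-SAS": {"disks": 3,  "size_gb": 4000, "iops": 80,  "price": 4026.91},
}

for name, c in configs.items():
    raw_tb = c["disks"] * c["size_gb"] / 1000   # raw capacity per host
    total_iops = c["disks"] * c["iops"]         # aggregate spindle IOps per host
    print(f"{name}: {raw_tb:.0f}TB raw, ~{total_iops} IOps, ${c['price']:,.2f} per host")
```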

I mean, if we were talking about a regular RAID-based storage system it would be clear straight away… the 20 disks would win for sure. However, we are talking VSAN here, and VSAN heavily leans on SSD for performance: each disk group is fronted by an SSD, and that SSD is used for both read caching (70% of capacity) and write buffering (30% of capacity). Illustrated in the diagram below.

The real question is: what is your expected IO pattern? Will most IO come from read cache? Do you expect a high data change rate, and as such could de-staging be problematic when backed by just 3 spindles? Then on top of that, how and when will data be de-staged? If data sits in the write buffer for a while, it could be that the data changes 3 or 4 times before being de-staged, avoiding the need to hit the slow spindles at all. It all depends on your workload, your IO pattern, your particular use case. Looking at the difference in price, I guess it makes sense to ask yourself what $4,300 could buy you.

Well, for instance, 3 x 400GB Intel S3700, capable of delivering 75k read IOps and 35k write IOps (~$800 per SSD). That is on top of the savings, as with the 20-disk server you would also still need to buy SSD; with the rule of thumb being roughly 10% of your disk capacity, you can see what either the savings or the performance benefits could be. In other words, you can double up on the cache without any additional cost compared to the 20-disk server. Personally I would try to balance it a bit: I would go for higher capacity drives, but probably not all the way up to 4TB. It also depends on the server type you are buying: will it have 2.5″ or 3.5″ drive slots? How many drive slots will you have, and how many disks will you need to hit the capacity requirements? Are there any other requirements? This particular user, for instance, mentioned he expected extremely high sustained IO and potentially full backups daily; as you can imagine, that could impact the number of spindles desired/required to meet performance expectations.
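As a rough sketch of that trade-off, again using only the ballpark numbers quoted in this post (~$4,300 difference, ~$800 per 400GB S3700, and flash at roughly 10% of disk capacity); treat this as an illustration, not a sizing recommendation.

```python
# What could the money saved on spindles buy in flash? Ballpark figures from this post.
savings         = 4300.0      # approximate price difference between the two configs
ssd_price       = 800.0       # ~$ per 400GB Intel S3700
ssd_capacity_gb = 400
raw_capacity_gb = 3 * 4000    # 3 x 4TB NL-SAS per host
flash_rule      = 0.10        # rule of thumb: flash is roughly 10% of disk capacity

target_cache_gb = raw_capacity_gb * flash_rule            # 1200GB of flash per host
ssds_needed     = int(target_cache_gb / ssd_capacity_gb)  # 3 SSDs
base_cost       = ssds_needed * ssd_price                 # ~$2,400, needed for either config

print(f"10% rule of thumb: {target_cache_gb:.0f}GB flash -> {ssds_needed} x {ssd_capacity_gb}GB SSD (${base_cost:,.0f})")
print(f"Doubling the cache costs another ${base_cost:,.0f}, "
      f"still within the ~${savings:,.0f} saved by going with 3 large spindles")
```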

The question remains, what should you do? To be fair, I cannot answer that question for you… I just wanted to show that these are all things one should think about before buying hardware.

Just a nice little fact: today a VSAN host can have 5 disk groups with 7 disks each, so 35 disks in total. With 32 hosts in a cluster that is 1,120 disks… That is some nice capacity with the 4TB disks that are available today, right?
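For what those maximums add up to, a quick back-of-the-envelope calculation (raw capacity only, so ignoring the flash tier and any replication overhead):

```python
# Back-of-the-envelope maximums mentioned above (raw capacity, no FTT overhead).
disk_groups_per_host = 5
disks_per_group      = 7
hosts_per_cluster    = 32
disk_size_tb         = 4

disks_per_host  = disk_groups_per_host * disks_per_group   # 35
disks_total     = disks_per_host * hosts_per_cluster        # 1120
raw_capacity_pb = disks_total * disk_size_tb / 1000         # ~4.5 PB

print(f"{disks_per_host} disks/host, {disks_total} disks/cluster, ~{raw_capacity_pb:.1f}PB raw")
```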

I also want to point out that a tool is being developed as we speak which will help you make certain decisions around hardware, cache sizing etc. Hopefully more news on that soon.

** Update, as of 26/11/2013 the VSAN Beta Refresh allows for 7 disks in a disk group… **

 

VSAN and Network IO Control / VDS part 2

Duncan Epping · Nov 12, 2013 ·

About a week ago I wrote this article about VSAN and Network IO Control. I originally wrote a longer article that contained more options for configuring the network part, but decided to leave a section out for simplicity’s sake. I figured that as more questions came in I would publish the rest of the content I had developed. I guess now is the time to do so.

In the configuration described below we will have two 10GbE uplinks teamed (often referred to as “etherchannel” or “link aggregation”). Due to the physical switch capabilities, the configuration of the virtual layer will be extremely simple. We will take the following recommended minimum bandwidth requirements into consideration for this scenario:

  • Management Network –> 1GbE
  • vMotion VMkernel –> 5GbE
  • Virtual Machine PG –> 2GbE
  • Virtual SAN VMkernel interface –> 10GbE

When the physical uplinks are teamed (Multi-Chassis Link Aggregation), the Distributed Switch load balancing mechanism is required to be configured as either:

  1. IP-Hash, or
  2. LACP

It is required to configure all portgroups and VMkernel interfaces on the same Distributed Switch with either LACP or IP-Hash, depending on the type of physical switch used. Please note that all uplinks should be part of the same etherchannel / LAG. Do not try to create anything fancy here, as a physically and virtually incorrectly configured team can and probably will lead to more downtime!

  • Management Network VMkernel interface = LACP / IP-Hash
  • vMotion VMkernel interface = LACP / IP-Hash
  • Virtual Machine Portgroup = LACP / IP-Hash
  • Virtual SAN VMkernel interface = LACP / IP-Hash

As various traffic types will share the same uplinks, we also want to make sure that no traffic type can push out other types of traffic during times of contention; for that we will use the Network IO Control shares mechanism.

For this exercise we will work under the assumption that only 1 physical port is available and that all traffic types share that port. Taking a worst-case scenario into consideration will guarantee performance even in a failure scenario. By taking this approach we can ensure that Virtual SAN always has 50% of the bandwidth at its disposal, while leaving the remaining traffic types with sufficient bandwidth to avoid a potential self-inflicted DoS. When both uplinks are available this equates to 10GbE; when only one uplink is available, the bandwidth is cut in half to 5GbE. It is recommended to configure shares for the traffic types as follows:

 

  • Management Network: 20 shares, no limit
  • vMotion VMkernel Interface: 50 shares, no limit
  • Virtual Machine Portgroup: 30 shares, no limit
  • Virtual SAN VMkernel Interface: 100 shares, no limit

 

The following diagram depicts this configuration scenario.
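To make the share mechanics a bit more concrete, below is a small Python sketch of how those share values carve up the available bandwidth when every traffic type is actively contending; remember that shares only come into play under contention, and the two scenarios simply reflect whether both uplinks or only one uplink of the LAG is available.

```python
# How the NIOC shares above divide bandwidth when all traffic types
# are actively contending on the same uplink(s).
shares = {
    "Management Network":             20,
    "vMotion VMkernel Interface":     50,
    "Virtual Machine Portgroup":      30,
    "Virtual SAN VMkernel Interface": 100,
}

def allocation(available_gbe):
    total = sum(shares.values())   # 200 shares in total
    return {name: available_gbe * value / total for name, value in shares.items()}

for available in (20, 10):         # both 10GbE uplinks available vs. a single surviving uplink
    print(f"\n{available}GbE available:")
    for name, gbe in allocation(available).items():
        print(f"  {name}: {gbe:.1f}GbE")
```

With both uplinks up, Virtual SAN ends up with 10GbE; with a single surviving uplink it still gets 5GbE, which is exactly the worst-case guarantee described above.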

Disable “Disk.AutoremoveOnPDL” in a vMSC environment!

Duncan Epping · Nov 8, 2013 ·

** UPDATE 20-March 2016 **

When using vSphere 6.0 or higher, please be advised that Disk.AutoremoveOnPDL needs to be set to 1 (the default value) in order for “PDL scenarios” to be handled correctly in vMSC-based infrastructures. Please do not change the default value; when upgrading to vSphere 6.x, set this value back to 1 if it was changed in a previous version.

** UPDATE 20-March 2016 **

Last week I tweeted the recommendation to disable the advanced setting Disk.AutoremoveOnPDL in a vSphere 5.5 vMSC environment:

https://twitter.com/DuncanYB/status/394740133079298048

Based on this tweet I received a whole bunch of questions. Before I explain why, I want to point out that I have contacted the folks in charge of the vMSC program and have requested that they publish a KB article on this subject asap.

With vSphere 5.5 a new setting was introduced called “Disk.AutoremoveOnPDL”. When you install 5.5 this setting is set to 1, which means it is enabled. What it does is the following:

The host automatically removes the PDL device and all paths to the device if no open connections to the device exist, or after the last connection closes. If the device returns from the PDL condition, the host can discover it, but treats it as a new device. Data consistency for virtual machines on the recovered device is not guaranteed.

(Source: http://pubs.vmware.com/vsphere-55/index.jsp?topic=%2Fcom.vmware.vsphere.storage.doc%2FGUID-45CF28F0-87B1-403B-B012-25E7097E6BDF.html)

In a vMSC environment you can understand that removing devices which are in a PDL state is not desired, as when the issue that caused the PDL has been resolved (from a networking or array perspective) customers would expect the LUNs to automatically appear again. However, as they have been removed, a rescan is needed to show these devices again instantly, or you will need to wait for the periodic vSphere path evaluation to occur. As you can imagine, in a vSphere Metro Storage Cluster environment (stretched storage) you expect devices to be there instantly on recovery… even when they have been in a PDL or APD state, they should be available as soon as the situation has been resolved.

For now, I recommend setting Disk.AutoremoveOnPDL to 0 instead of the default of 1.

Hopefully soon this KB on the topic of Disk.AutoremoveOnPDL will be updated to reflect this.

