It is all about choice

Over the last couple of years we’ve seen a major shift in the market towards the software-defined datacenter. This has resulted in many new products, features and solutions being brought to market. What struck me over the last couple of days, though, is that many of the articles I have read in the past 6 months (and written as well) were about hardware, and in many cases about the form factor or how it has changed. Then there are the posts around hyper-converged vs traditional, or all-flash storage solutions vs server side caching. Although we are moving towards a software-defined world, it seems that administrators / consultants / architects still very much live in the physical world. In many of these cases there even seems to be a certain prejudice when it comes to the various types of products and the form factor they come in; whether that is 2U vs blade or software vs hardware is beside the point.

When I look at discussions being held around whether a server side caching solution is preferred over an all-flash array, which is just another form factor discussion if you ask me, the only right answer that comes to mind is “it depends”. It depends on what your business requirements are, what your budget is, whether there are any constraints from an environmental perspective, the hardware life cycle, what your staff’s expertise / knowledge is, and so on. It is impossible to provide a single answer and solution to all the problems out there. What I realized is that what the software-defined movement actually brought us is choice, and in many of these cases the form factor is just a tiny aspect of the total story. It seems to be important to many people though, maybe still an inheritance from the “server hugger” days when hardware was still king? Those times are long gone if you ask me.

In some cases a server side caching solution will be the perfect fit, for instance when ultra low latency and use of the existing storage infrastructure is a requirement. In other cases bringing in an all-flash array may make more sense, or a hyper-converged appliance could be the perfect fit for that particular use case. What is more important though is how these components will enable you to optimize your operations, how they will enable you to build that software-defined datacenter and help you meet the demands of the business. This is what you will need to ask yourself when looking at these various solutions, and if there is no clear answer… there is plenty of choice out there, stay open minded and go explore.

Why Queue Depth matters!

A while ago I wrote an article about the queue depth of certain disk controllers, tried to harvest some of the values and posted those up. William Lam did a “one up” this week and posted a script that gathers the info, which can then be submitted to a Google Docs spreadsheet, brilliant if you ask me. (PLEASE run the script and let’s fill up the spreadsheet!!) But some of you may still wonder why this matters… (For those who missed it: read about the troubles one customer had with a low-end, shallow queue depth disk controller, and Chuck’s take on it, here.) Considering the different layers of queuing involved, it probably makes most sense to show the picture from the virtual machine down to the device.

queue depth

In this picture there are at least 6 different layers at which some form of queuing is done. Within the guest there is the vSCSI adaptor that has a queue. The next layer is the VMkernel/VSAN, which of course has its own queue and manages the IO that is pushed through the MPP, aka the multi-pathing layer, to the various devices on a host. At the next level the disk controller has a queue and, depending on the controller used, each disk controller port potentially has a queue as well. Last but not least, each device (i.e. a disk) will have a queue of its own. Note that this is even a simplified diagram.

If you look closely at the picture you see that the IO of many virtual machines will all flow through the same disk controller and that this IO will go to or come from one or multiple devices. (Typically multiple devices.) Realistically, what are my potential choking points?

  1. Disk Controller queue
  2. Port queue
  3. Device queue

Let’s assume you have 4 disks; these are SATA disks and each has a queue depth of 32. Combined, this means that 128 IOs can be handled in parallel. Now what if your disk controller can only handle 64? This will result in 64 IOs being held back by the VMkernel / VSAN. As you can see, it would be beneficial in this scenario to ensure that your disk controller queue can hold the same number of IOs (or more) as your device queues combined can hold.
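To make that math explicit, here is a minimal Python sketch of the scenario above (illustrative only, the function and the numbers are my own): the number of IOs that can be in flight is capped by the smallest queue in the path.

  # Illustrative sketch: effective parallelism is capped by the smallest queue in the path.
  def effective_parallel_ios(controller_queue_depth, device_queue_depths):
      """Return how many IOs can be in flight and where the choking point sits."""
      combined_device_depth = sum(device_queue_depths)   # e.g. 4 SATA disks x 32 = 128
      in_flight = min(controller_queue_depth, combined_device_depth)
      bottleneck = ("disk controller" if controller_queue_depth < combined_device_depth
                    else "devices")
      return in_flight, bottleneck

  # The example from above: 4 SATA disks (queue depth 32 each) behind a controller queue of 64.
  print(effective_parallel_ios(64, [32, 32, 32, 32]))   # -> (64, 'disk controller')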

When it comes to disk controllers there is a huge difference in maximum queue depth value between vendors, and even between models from the same vendor. Let’s look at some extreme examples:

  • HP Smart Array P420i - 1020
  • Intel C602 AHCI (Patsburg) - 31 (per port)
  • LSI 2008 - 25
  • LSI 2308 - 600

For VSAN it is recommended to ensure that the disk controller has a queue depth of at least 256, but go higher if possible. As you can see in the examples there is quite a range, but for most LSI controllers the queue depth is 600 or higher. Now the disk controller is just one part of the equation, as there is also the device queue. As I listed in my other post, a RAID device for LSI for instance has a default queue depth of 128, while a SAS device has 254 and a SATA device has 32. The one that stands out the most is the SATA device: with a queue depth of only 32 you can imagine this can once again become a “choking point”. Fortunately, the shallow queue depth of SATA can easily be overcome by using NL-SAS drives (nearline serially attached SCSI) instead. NL-SAS drives are essentially SATA drives with a SAS connector and come with the following benefits:

  • Dual ports allowing redundant paths
  • Full SCSI command set
  • Faster interface compared to SATA, up to 20%
  • Larger (deeper) command queue depth

So what about the cost then? From a cost perspective the difference between NL-SAS and SATA is negligible for most vendors. For a 4TB drive the difference, at the time of writing and across different websites, was on average $30. I think it is safe to say that for ANY environment NL-SAS is the way to go and SATA should be avoided when possible.

In other words, when it comes to queue depth: spend a couple of extra bucks and go big… you don’t want to choke your own environment to death!

How do you know where an object is located with Virtual SAN?

You must have been wondering the same thing after reading the introduction to Virtual SAN. Last week at VMworld I received many questions on this topic, so I figured it was time for a quick blog post. How do you know where a storage object resides with Virtual SAN when you are striping across multiple disks and using multiple hosts for availability purposes? Yes, I know this is difficult to grasp; even with just multiple hosts for resiliency, where are things placed? The diagram gives an idea, but that is just from an availability perspective (in this example “failures to tolerate” is set to 1). If you have a stripe width of 2 disks configured, then imagine what would happen to that picture. (Before I published this article, I spotted this excellent primer by Cormac on this exact topic…)

Luckily you can use the vSphere Web Client to figure out where objects are placed:

  • Go to your cluster object in the Web Client
  • Click “Monitor” and then “Virtual SAN”
  • Click “Virtual Disks”
  • Click your VM and select the object

The below screenshot depicts what you could potentially see. In this case the policy was configured with “1 host failure to tolerate” and “disk striping set to 2”. I think the screenshot explains it pretty well, but let’s go over it.

The “Type” column shows what it is: a “witness” (no data) or a “component” (data). The “Component state” column shows whether it is available (active) or not at the moment. The “Host” column shows on which host it currently resides, and the “SSD Disk Name” column shows which SSD is used for read caching and write buffering. If you go to the right you can also see on which magnetic disk the data is stored, in the column called “Non-SSD Disk Name”.

Now in our example below you can see that “Hard disk 2” is configured as RAID 1 with RAID 0 immediately underneath. The “RAID 1” refers to “availability” in this case, aka “component failures”, and the “RAID 0” is all about disk striping. As we configured “component failures” to 1, we see two copies of the data, and as we said we would like to stripe across two disks for performance, you see a “RAID 0” underneath each copy. Note that this is just an example to illustrate the concept; it is not a best practice or recommendation, as that should be based on your requirements! Last but not least we see the “witness”, which is used in case of a failure of a host. If host 10.20.177.19 were to fail or be isolated from the network somehow, then the witness would be used by host 10.20.177.17 to claim ownership. Makes sense right?

Virtual SAN object location
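To make the layout a bit more tangible, here is a small conceptual Python sketch of the RAID tree described above: a RAID 1 set of two RAID 0 stripes plus a witness. The disk names and the witness host are made up for illustration, and in this made-up example each copy happens to be striped across two disks in the same host; VSAN may just as well place stripe components on different hosts.

  # Conceptual model only; disk names and the witness host are fictitious.
  hard_disk_2 = {
      "raid_1": [                           # availability: failures to tolerate = 1
          {"raid_0": [                      # performance: stripe width = 2
              {"type": "component", "host": "10.20.177.17", "disk": "naa.disk-a"},
              {"type": "component", "host": "10.20.177.17", "disk": "naa.disk-b"},
          ]},
          {"raid_0": [
              {"type": "component", "host": "10.20.177.19", "disk": "naa.disk-c"},
              {"type": "component", "host": "10.20.177.19", "disk": "naa.disk-d"},
          ]},
      ],
      "witness": {"type": "witness", "host": "10.20.177.18"},  # no data, only used to break ties
  }

  # Each full copy under the RAID 1 is striped across two magnetic disks.
  for copy in hard_disk_2["raid_1"]:
      print([component["disk"] for component in copy["raid_0"]])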

Hope this helps you understand Virtual SAN object location a bit better… When I have the time available, I will try to dive a bit more into the details of Storage Policy Based Management.

Introduction to VMware vSphere Virtual SAN

Many of you have seen the announcements by now, and I am guessing that you are as excited as I am about the public beta of Virtual SAN with vSphere 5.5. What is Virtual SAN, formerly known as “VSAN” or “vCloud Distributed Storage”, all about?

Virtual SAN (VSAN from now on in this article) is a software-based distributed storage solution that is built directly into the hypervisor. No, this is not a virtual appliance like many of the other solutions out there; it sits right inside your ESXi layer. VSAN is about simplicity, and when I say simple I do mean simple. Want to play around with VSAN? Create a VMkernel NIC for VSAN and enable it at the cluster level. Yes, that is it!

vSphere Virtual SAN
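For those who prefer scripting over clicking, a rough sketch of those two steps with pyVmomi could look like the snippet below. Treat it as an illustration rather than a recipe: the vCenter address, credentials, cluster name and vmk device are placeholders, and you should verify the property names against the vSphere 5.5 API reference for your environment.

  # Rough pyVmomi sketch; address, credentials, cluster and vmk names are placeholders.
  # Disabling certificate verification is for lab use only.
  import ssl
  from pyVim.connect import SmartConnect, Disconnect
  from pyVmomi import vim

  si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                    pwd="password", sslContext=ssl._create_unverified_context())
  content = si.RetrieveContent()

  # Simplified cluster lookup; error handling omitted.
  view = content.viewManager.CreateContainerView(content.rootFolder,
                                                 [vim.ClusterComputeResource], True)
  cluster = next(c for c in view.view if c.name == "VSAN-Cluster")

  # Tag a VMkernel NIC for VSAN traffic on every host in the cluster.
  for host in cluster.host:
      host.configManager.virtualNicManager.SelectVnicForNicType("vsan", "vmk1")

  # Enable VSAN on the cluster and let it auto-claim local disks.
  vsan_config = vim.vsan.cluster.ConfigInfo(
      enabled=True,
      defaultConfig=vim.vsan.cluster.ConfigInfo.HostDefaultInfo(autoClaimStorage=True))
  cluster.ReconfigureComputeResource_Task(vim.cluster.ConfigSpecEx(vsanConfig=vsan_config), True)

  Disconnect(si)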

Before we get a bit more into the weeds, what are the benefits of a solution like VSAN? What are the key selling points?

  • Software defined – Use industry standard hardware, as long as it is on the HCL you are good to go!
  • Flexible – Scale as needed and when needed. Just add more disks or add more hosts, yes both scale-up and scale-out are possible.
  • Simple – Ridiculously easy to manage! Ever tried implementing or managing some of the storage solutions out there? If you did, you know what I am getting at!
  • Automated – Per virtual machine policy based management. Yes, virtual machine level granularity. No more policies defined on a per LUN/Datastore level, but at the level where you want it to be!
  • Converged – It allows you to create dense / building block style solutions!

Okay, that sounds great, right? But where does it fit in? What are the use cases for VSAN when it is released?

  • Virtual desktops
    • Scale-out model; using predictable, repeatable infrastructure blocks (performance etc.) lowers costs and simplifies operations
  • Test & Dev
    • Avoids acquisition of expensive storage (lowers TCO), fast time to provision
  • Big Data
    • Scale out model with high bandwidth capabilities
  • Disaster recovery target
    • Cheap DR solution, enabled through a feature like vSphere Replication that allows you to replicate to any storage platform

So let’s get a bit more technical; just a bit, as this is an introduction, right…

When VSAN is enabled, a single shared datastore is presented to all hosts which are part of the VSAN enabled cluster. Typically all hosts will contribute performance (SSD) and capacity (magnetic disks) to this shared datastore. This means that when your cluster grows, your datastore will grow with it. (This is not a requirement; there can be hosts in the cluster which just consume the datastore!) Note that there are some requirements for hosts which want to contribute storage: each host will require at least one SSD and one magnetic disk. Also good to know is that with this beta release the limit on a VSAN enabled cluster is 8 hosts. (Total cluster size 8 hosts, including hosts not contributing storage to your VSAN datastore.)
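As a back-of-the-napkin illustration of how the datastore grows with the cluster (the host and disk sizes below are made up), the raw size of the VSAN datastore is simply the sum of the magnetic disks of all contributing hosts; the SSDs act as cache and do not add capacity:

  # Made-up example: 3 contributing hosts, each with 1 x 200GB SSD and 5 x 1TB magnetic disks.
  hosts = [
      {"ssd_gb": 200, "magnetic_disks_gb": [1000] * 5},
      {"ssd_gb": 200, "magnetic_disks_gb": [1000] * 5},
      {"ssd_gb": 200, "magnetic_disks_gb": [1000] * 5},
  ]

  # SSDs are used for read caching / write buffering only; capacity comes from the magnetic disks.
  raw_capacity_gb = sum(sum(h["magnetic_disks_gb"]) for h in hosts)
  print(raw_capacity_gb)  # 15000 GB of raw datastore capacity, before replication overhead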

As expected, VSAN relies heavily on SSD for performance. Every write I/O will go to SSD first, and eventually it will be destaged to the magnetic disks (SATA). As mentioned, you can set policies at a per virtual machine level. These will also dictate, for instance, what percentage of your read I/O you can expect to come from SSD. On top of that you can use these policies to define the availability of your virtual machines. Yes, you read that right: you can have different availability policies for virtual machines sitting on the same datastore. For resiliency, “objects” will be replicated across multiple hosts; how many hosts/disks are used thus depends on the profile.

VSAN does not require a local RAID set, just a bunch of local disks. Whether you defined 1 host failure to tolerate, or for instance 3 host failures to tolerate, VSAN will ensure enough replicas of your objects are created. Is this awesome or what? So let’s take a simple example to illustrate that. We have configured 1 host failure to tolerate and create a new virtual disk. This means that VSAN will create 2 identical objects and a witness. The witness is there just in case something happens to your cluster, to help decide who will take control in case of a failure; the witness is not a copy of your object, let that be clear! Note that the number of hosts in your cluster could potentially limit the number of “host failures to tolerate”. In other words, in a 3 node cluster you cannot create an object that is configured with 2 “host failures to tolerate”. Difficult to visualize? Well, this is what it would look like on a high level for a virtual disk which tolerates 1 host failure:
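Alongside that diagram, a minimal Python sketch of the placement math may help. This is my own simplification: the exact number of witnesses per object can vary in practice, but the rule that n host failures to tolerate requires n+1 copies and at least 2n+1 hosts holds.

  # Simplified placement math; the real witness count per object can vary.
  def vsan_placement(failures_to_tolerate, stripe_width=1):
      replicas = failures_to_tolerate + 1        # identical copies of the data
      components_per_replica = stripe_width      # RAID 0 stripes within each copy
      witnesses = failures_to_tolerate           # at least this many tie-breakers
      min_hosts = 2 * failures_to_tolerate + 1   # why a 3-node cluster caps you at 1
      return replicas, components_per_replica, witnesses, min_hosts

  print(vsan_placement(1))  # (2, 1, 1, 3): 2 copies + 1 witness, needs at least 3 hosts
  print(vsan_placement(2))  # (3, 1, 2, 5): not possible in a 3-node cluster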

With all this replication going on, are there requirements for networking? At a minimum VSAN will require a dedicated 1Gbps NIC port. Needless to say, 10Gbps would be preferred with solutions like these, and you should always have an additional NIC port available for resiliency purposes. There is no requirement from a virtual switch perspective; you can use either the Distributed Switch or the plain old vSwitch, both will work fine.

To conclude, vSphere Virtual SAN aka VSAN is a brand new hypervisor-based distributed storage platform that enables convergence of compute and storage resources. It provides virtual machine level granularity through policy based management. It allows you to control availability and performance in a way I have never seen before: simple and efficient. I am hoping that everyone will be pounding away on the public beta, sign up today: http://www.vmware.com/vsan-beta-register!

Startup intro: SolidFire

This seems to be becoming a true series, introducing startups… In the case of SolidFire I am not really sure if I should use the word startup, as they have been around since 2010. But then again, it is not a consumer solution that they’ve created, and enterprise storage platforms typically take a lot longer to develop and mature. SolidFire was founded in 2010 by Dave Wright, who discovered a gap in the storage market while working for Rackspace. The opportunity Dave saw was in the Quality of Service area. Not many storage solutions out there could provide predictable performance in almost every scenario, were designed for multi-tenancy, and offered a rich API. Back then the term Software Defined Storage hadn’t been coined yet, but I guess it is fair to say that is how we would describe it today. This is actually how I got in touch with SolidFire: I wrote various articles on the topic of Software Defined Storage, and tweeted about the topic many times, and SolidFire was one of the companies that consistently joined the conversation. So what is SolidFire about?

SolidFire is a storage company; they sell storage systems and today they offer two models, namely the SF3010 and the SF6010. What is the difference between the two? Cache and capacity! With the SF3010 you get 72GB of cache per node and it uses 300GB SSDs, whereas the SF6010 gives you 144GB of cache per node and uses 600GB SSDs. Interesting? Well, to a certain point I would say; SolidFire isn’t really about the hardware if you ask me. It is about what is inside the box, or boxes I should say, as the starting point is always 5 nodes. So what is inside?

Architecture

SolidFire’s architecture is based on a scale-out model and of course flash, in the form of SSDs. You start out with 5 nodes and can go up to 100 nodes, all connected to your hosts via iSCSI. Those 100 nodes would be able to provide you with 5 million IOps and about 2.1 petabytes of capacity. Each node that is added linearly scales performance and of course adds capacity. Of course SolidFire offers deduplication, compression and thin provisioning. Considering it is a scale-out model it is probably not needed to point this out, but dedupe and compression are cluster wide. Now, the nice thing about the SolidFire architecture is that they don’t use traditional RAID, which means that the long rebuild times when a disk or node fails do not apply to SolidFire. Rather, SolidFire evenly distributes data across all disks and nodes, so when a single disk or even a whole node fails, the rebuild is not constrained by a limited set of resources; many components can help in parallel to get back to a normal state. What I liked most about their architecture is that it already closely aligns with VMware’s Virtual Volume (VVOL) concept; SolidFire is prepared for VVOLs when that is released.

Quality of Service

I already briefly mentioned this, but Quality of Service (QoS) is one of the key drivers of the SolidFire solution. It revolves around the ability to provide a given amount of capacity with a given amount of performance (IOps). What does this mean? SolidFire allows you to specify a minimum and maximum number of IOps for a volume, as well as a burst value. Let’s quote the SolidFire website, as I think they explain it in a clear way:

  • Min IOPS – The minimum number of I/O operations per-second that are always available to the volume, ensuring a guaranteed level of performance even in failure conditions.
  • Max IOPS – The maximum number of sustained I/O operations per-second that a volume can process over an extended period of time.
  • Burst IOPS – The maximum number of I/O operations per-second that a volume will be allowed to process during a spike in demand, particularly effective for data migration, large file transfers, database checkpoints, and other uneven latency sensitive workloads.

Now I do want to point out that SolidFire storage systems have no form of “admission control” when it comes to QoS. Although a guaranteed level of performance is mentioned, this is up to the administrator: you as the admin will need to do the math and not overprovision from a performance point of view if you truly want to guarantee a specific performance level. And if you do, you will also need to take failure scenarios into account!
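Since the system will not do that math for you, a simple sanity check could look like the sketch below. All numbers are made up; the per-node IOps capability depends on the model and the workload.

  # Made-up numbers; per-node IOps capability depends on model and workload.
  node_count = 5
  iops_per_node = 50000

  volume_min_iops = [2000, 5000, 10000, 1500, 7500]   # the guaranteed minimums you handed out

  cluster_iops = node_count * iops_per_node
  cluster_iops_after_failure = (node_count - 1) * iops_per_node   # survive one node failure

  committed = sum(volume_min_iops)
  print("committed min IOps:", committed)
  print("ok under normal operations:", committed <= cluster_iops)
  print("ok with one node down:", committed <= cluster_iops_after_failure)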

One thing that my automation friends William Lam and Alan Renouf will like is that you can manage all these settings using their REST-based API.
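To give an idea of what that automation could look like, below is a rough Python sketch of setting the QoS values on a volume through the API. The endpoint path, the “ModifyVolume” method and the qos parameter names reflect my understanding of SolidFire’s Element API and should be verified against their documentation; the cluster address, credentials, API version and volume ID are placeholders.

  # Rough sketch; verify method and parameter names against SolidFire's API documentation.
  import requests

  payload = {
      "method": "ModifyVolume",
      "params": {
          "volumeID": 42,
          "qos": {"minIOPS": 1000, "maxIOPS": 5000, "burstIOPS": 8000},
      },
      "id": 1,
  }

  response = requests.post("https://cluster-mvip/json-rpc/6.0",
                           json=payload,
                           auth=("admin", "password"),
                           verify=False)   # lab sketch only; use proper certificates in production
  print(response.json())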

(VMware) Integration

Of course, during the conversation integration came up. SolidFire is all about enabling their customers to automate as much as they possibly can and have implemented a REST-based API. They are heavily investing in integration with, for instance, OpenStack, but also with VMware. They offer full support for the vSphere Storage APIs – Storage Awareness (VASA) and are also working towards full support for the vSphere Storage APIs – Array Integration (VAAI). Currently not all VAAI primitives are supported, but they promised me that this is a matter of time. (They support Block Zero’ing, Space Reclamation and Thin Provisioning; see the HCL for more details.) On top of that they are also looking at the future and going full steam ahead when it comes to Virtual Volumes. Obvious question from my side: what about replication / SRM? This is being worked on, hopefully more news about this soon!

Now with all this integration did they forget about what is sitting in between their storage system and the compute resources? In other words what are they doing with the network?

Software Defined Networking?

I can be short: no, they did not forget about the network. SolidFire is partnering with Plexxi and Arista to provide a great end-to-end experience when it comes to building a storage environment. Where with Arista the focus is currently more on monitoring the different layers, Plexxi seems to focus more on the configuration and optimization-for-performance aspect. No end-to-end QoS yet, but a great step forward if you ask me! I can see this being expanded in the future.

Wrapping up

I had already briefly looked at SolidFire after the various tweets we exchanged, but this proper introduction has really opened my eyes. I am impressed by what SolidFire has achieved in a relatively short amount of time. Their solution is all about the customer experience, whether that is performance related or the ability to automate the full storage provisioning process… their architecture / concept caters for this. I have definitely added them to my list of storage vendors to visit at VMworld, and I am hoping that those who are looking into Software Defined Storage solutions will do the same, as SolidFire belongs on that list.