VMworld 2015: Site Recovery Manager 6.1 announced

This week Site Recovery Manager 6.1 was announced. There are many enhancements in SRM 6.1, such as integration with NSX and policy-driven protection, but personally I feel that support for stretched storage is huge. When I say stretched storage I am referring to solutions like EMC VPLEX, Hitachi Virtual Storage Platform, IBM SAN Volume Controller, etc. In the past (and you still can today), when you had one of these solutions deployed you would have a single vCenter Server with a single cluster, and you would move VMs around manually when needed or let HA take care of restarts in failure scenarios.

As of SRM 6.1, running these types of stretched configurations is also supported. So how does that work, what does it allow you to do, and what does it look like? Well, contrary to a vSphere Metro Storage Cluster solution, with SRM 6.1 you will be using two vCenter Server instances. Each vCenter Server instance will have an SRM server attached to it, which uses a storage replication adapter to communicate with the array.

But why would you want this? Why not just stretch the compute cluster as well? Many have deployed these stretched configurations for disaster avoidance purposes. The problem, however, is that there is no form of orchestration whatsoever, which means that workloads will typically come up in a random order. In some cases the application knows how to recover from situations like that; in most cases it does not… leaving you with a lot of work, as after a failure you will need to restart services, or VMs, in the right order. This is where SRM comes in; orchestration is its strength.

Besides orchestrating a full failover, SRM 6.1 can also evacuate a datacenter using vMotion in an orchestrated / automated way. If a disaster is about to happen, you can now use the SRM interface to move virtual machines from one datacenter to another with just a couple of clicks; this is what is called a planned migration.

Personally I think this is a great step forward for stretched storage and SRM, very excited about this release!

Startup intro: ZeroStack

A couple of months back, one of the people I used to work with a lot in the DRS team reached out to me. He told me that he had started a company with some other people I knew, and we spoke about the state of the industry and some of the challenges customers faced. Fast forward to today: ZeroStack just came out of stealth and announced to the world what they are building, along with a Series A funding round of roughly $5.6m.

At the head of the company as CEO we have Ajay Gulati, former VMware employee and best known for Storage IO Control, Storage DRS and DRS. Kiran Bondalapati is the CTO, and some may recognize that name as he was a lead architect at Bromium. The DNA of the company is a mix of VMware, Nutanix, Bromium, Cisco, Google and more. Not a bad list, I must say.

So what are they selling? ZeroStack has developed a private cloud solution which is delivered in two parts:

  1. A physical 2U / 4-node appliance, named the ZS1000, which comes with KVM preinstalled.
  2. A management / monitoring solution, delivered in a SaaS model.

ZeroStack showed me a demo, and getting their appliance up and running took about 15 minutes; the configuration wizard wasn't unlike EVO:RAIL and looked very easy to run through. The magic, if you ask me, isn't in their configuration section though, it is in the SaaS-based management solution. I stole a diagram from their website which immediately shows the potential.


The SaaS management layer provides you with a single pane of glass for all deployed appliances, whether they are in a single site or in multiple sites. You can imagine that this is very useful, especially for ROBO deployments, but also in larger environments. And it doesn't just show you the physical aspect; it also shows you all the logical constructs that have been created, like “projects”.

During this part of the demo, by the way, I was reminded of vCloud Director a bunch of times, and of AWS for that matter. ZeroStack allows you to create “tenants” and designate resources to them in the form of projects. These can even have lease times, which is similar to what vCloud Director offers.

When looking at the networking aspects of ZeroStack's solution, it also has the familiar constructs like private networks, public networks, etc. On top of that, networking services like routing and firewalling are implemented in a distributed fashion as well. And before I forget: everything you see in the UI can also be automated through the APIs, which are fully OpenStack compatible.
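
To give an idea of what that looks like, below is a minimal sketch using the openstacksdk Python library against an OpenStack-compatible endpoint. The endpoint URL, credentials and resource names are hypothetical placeholders; I have not verified this against ZeroStack's implementation specifically.

```python
# Minimal sketch: creating a tenant network through an OpenStack-compatible
# API using openstacksdk (pip install openstacksdk). Endpoint, credentials
# and names below are hypothetical placeholders.
import openstack

conn = openstack.connect(
    auth_url="https://zerostack.example.com:5000/v3",  # hypothetical endpoint
    project_name="demo-project",
    username="demo-user",
    password="secret",
    user_domain_name="Default",
    project_domain_name="Default",
)

# Create a private network plus subnet, just like the UI would.
net = conn.network.create_network(name="demo-private-net")
subnet = conn.network.create_subnet(
    network_id=net.id,
    name="demo-subnet",
    ip_version=4,
    cidr="10.0.0.0/24",
)

# Attach it to a router that uplinks to the public network.
public = conn.network.find_network("public")  # hypothetical external network
router = conn.network.create_router(
    name="demo-router",
    external_gateway_info={"network_id": public.id},
)
conn.network.add_interface_to_router(router, subnet_id=subnet.id)
```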

Last but not least, we had a discussion about patching and updating. With most systems this is usually the most complicated part, and ZeroStack took a very customer-friendly approach. The SaaS layer is updated by them, and this can happen as frequently as once every ten days. The team said they are very receptive to feedback and have a short turnaround time for implementing new functionality, as their goal is to provide most functionality through the SaaS layer. The appliance will be on a different patch/update scheme, probably once every 3 or 6 months, depending of course on the problems fixed and features introduced. The updates are done in a rolling fashion and are non-disruptive to your workloads, as expected.

That sounds pretty cool, right? Well, as always with a 1.0 version, some functionality is still missing. For instance, there is no “high availability” feature for your workloads yet: if a host fails, then you as an admin will need to restart those VMs. There is also no “DRS-like” load balancing functionality today. Considering the background of the team, though, I can imagine both showing up at some point in the near future. It does mean that for some workloads the 1.0 version may not be the right solution for now; nevertheless, test/dev and things like cloud-native apps could land on it.

All in all, a nice set of announcements and some cool functionality coming. These guys are going to be at VMworld, so make sure to stop by their booth if you want to see what they are working on.

Platform9 announcements / funding

Clearly VMworld is around the corner, as many new products, releases and company announcements are being made this week and next. Last week I had the opportunity to catch up with Sirish Raghuram, Platform9's CEO. For those who don't know who/what/where, I recommend reading the two articles I wrote earlier this year. In short, Platform9 is a SaaS-based private cloud management solution which leverages OpenStack; Platform9 itself describes it as “OpenStack-as-a-Service”.

Over the last months Platform9 has grown to 27 people and is now actively focussing on scaling marketing and sales. They have already hired some very strong people from companies like Rackspace, EMC, Metacloud and VMware. Their Series A funding was $4.5m from Redpoint Ventures, and now they have announced a $10m Series B round, led by Menlo Ventures with participation from Redpoint Ventures. Considering the state of the OpenStack startup community, that is a big achievement if you ask me. The company has seen good revenue momentum in its first two quarters of sales, with QoQ growth of 200%, multiple site-wide license agreements for 400+ servers in each quarter, and customer deployments in 17 countries.

So what is being announced? The GA of support for vSphere, which has been in beta since early this year. Basically this means that as of this release you can manage local KVM and vSphere hosts using Platform9's solution. What I like about their solution is that it is very easy to configure, and because it is SaaS-based there are no worries about installing/configuring/upgrading/updating or maintenance of the management solution itself. Install / configure takes less than 5 minutes: basically you point it at your vCenter Server, a proxy VM is deployed, and resources are then pulled in.

The cool thing is that it integrates with existing vSphere deployments: if people are managing vSphere through vCenter and make changes there, Platform9 is smart enough to recognize that and reconcile. On top of that, all vSphere templates are automatically pulled in, so you can use those immediately when provisioning new VMs through Platform9. Managing VMs through Platform9 is very easy, and if you are familiar with the OpenStack APIs then automating any aspect of Platform9 is a breeze, as it is fully compatible. When it comes to managing resources and workloads, I think the UI speaks for itself: very straightforward, very easy to use. Adding hosts, deploying new workloads or monitoring capacity is typically all done within a few clicks. On the vSphere side they also support things like the Distributed Switch, and NSX support is around the corner for those who need advanced networking / isolation / security.
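
Because the APIs are standard OpenStack, existing OpenStack tooling should work as-is. As a rough illustration, here is what provisioning a VM from one of those pulled-in vSphere templates could look like with the openstacksdk Python library; the cloud entry, template, flavor and network names are made-up placeholders.

```python
# Rough sketch: provisioning a VM from a vSphere template that was pulled
# in and exposed as an image. All names below are made-up placeholders.
import openstack

# Assumes a "platform9" entry in clouds.yaml with your endpoint/credentials.
conn = openstack.connect(cloud="platform9")

# Pulled-in vSphere templates show up in the image list.
for image in conn.image.images():
    print(image.name)

# Boot a VM from one of them, exactly as the UI would.
image = conn.compute.find_image("centos7-template")  # hypothetical template
flavor = conn.compute.find_flavor("m1.medium")
net = conn.network.find_network("vm-network")        # hypothetical network
server = conn.compute.create_server(
    name="app01", image_id=image.id, flavor_id=flavor.id,
    networks=[{"uuid": net.id}])
conn.compute.wait_for_server(server)  # block until the VM reports ACTIVE
```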

Platform9 also introduces auto-scaling capabilities based on resource alarms and application templates. Both scaling up and scaling down your workloads when needed is supported, which is something that comes up on a regular basis with customers I talk to. Platform9 can take care of the infrastructure side of scaling out; you worry about creating that scale-out application architecture, which is difficult enough as it is.

When it comes to their SaaS-based platform, it is good to know that the platform is not shared between customers, which means there is no risk of one customer hijacking the environment of another. Also, the platform scales independently, and automatically, as your local environment grows. No need to worry about any of those aspects any longer, and of course, because it is SaaS-based, Platform9 takes care of patching/updating/upgrading etc.

Personally, I would love to see a couple of things added. I would find it useful if Platform9 could take care of network isolation, just like Lab Manager was capable of doing in the past. It would also be great if Platform9 could manage “standalone” ESXi hosts instead of having to be pointed at a vCenter Server. I do understand that brings some constraints etc., but it could be a nice feature… Either way, I like the single pane of glass they offer today, and it can only get better. Nice job Platform9, keep those updates coming!

How Virtual SAN enables IndonesianCloud to remain competitive!

Last week I had the chance to catch up with one of our Virtual SAN customers. I connected with Neil Cresswell through Twitter, and after going back and forth we got on a conference call. Neil showed me what they had created at the company he works for, a public cloud provider called IndonesianCloud; no need to tell you where they are located, as the name kind of reveals it. Neil is the CEO of IndonesianCloud, by the way, and very passionate about IT / technology and VMware. It was great talking to him, and before I forget: thanks for taking time out of your busy schedule, Neil, I very much appreciate it!

IndonesianCloud is a 3-year-old cloud service provider, part of the vCloud Air Network, which focuses on the delivery of enterprise-class hosting services. Their customers primarily run mission-critical workloads in IndonesianCloud's three-datacenter environment, which means that stability, reliability and predictability are really important.

Having operated a “traditional” environment (servers plus legacy storage) for a long time, Neil and his team felt it was time for a change. They needed something that was much more fit for purpose, robust and reliable, and capable of providing capacity as well as great performance. On top of that, from a cost perspective it needed to be significantly cheaper, as the traditional environment they were maintaining just wasn't allowing them to remain competitive in their dynamic and price-sensitive market. Several different hyperconverged and software-based offerings were considered, but they finally settled on Virtual SAN.

Since the Virtual SAN platform was placed into production two months ago, they have deployed over 450 new virtual machines onto their initial 12-node cluster. In addition, the migration of another 600 virtual machines from one of their legacy storage platforms to their Virtual SAN environment is underway. While talking to Neil I was mostly interested in the design considerations and the benefits, but also in potential challenges.

From a design standpoint, Neil explained how they decided to go with SuperMicro Fat Twin hardware with 5 x 4TB NL-SAS drives and 800GB Intel S3700 SSDs per host. Unfortunately no affordable bigger SSDs were available, and as such the environment has a lower cache-to-capacity ratio than preferred. Still, the cache hit rate for reads is more or less steady at around 98-99%. PCIe flash was also looked at, but didn't fit within the budget. These SuperMicro systems were on the VSAN Ready Node list, and this was one of the main reasons for Neil and the team to pick them: having a pre-validated configuration, which is guaranteed to be supported by all parties, was seen as a much lower risk than building their own nodes. Then there is the network: IndonesianCloud decided to go with HP networking gear after having tested various products. Among the reasons were better overall throughput, better multicast performance, and a lower price per port. The network is 10GbE end to end, of course.
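
To put that cache-to-capacity remark in numbers, here is a quick back-of-the-envelope sketch, assuming one 800GB SSD fronting the five 4TB drives in a disk group (the ~10% figure being the rule of thumb VMware generally quoted for flash sizing):

```python
# Back-of-the-envelope cache-to-capacity ratio per disk group, assuming one
# 800GB SSD in front of five 4TB NL-SAS drives (as described above).
ssd_cache_tb = 0.8
raw_capacity_tb = 5 * 4.0

ratio = ssd_cache_tb / raw_capacity_tb
print(f"cache:capacity = {ratio:.1%}")  # 4.0%, below the ~10% rule of thumb
```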

Key take away: There can be substantial performance difference between the various 10GbE switches, do your homework!

The choice to deploy 4TB NL-SAS drives was a little risky, as IndonesianCloud needed to balance the performance, capacity and price ratios. Luckily, having already run their existing cloud platform for 3 years, they had a history of IO information readily available. Using this historical GB/IOPS information meant that IndonesianCloud was able to make a calculated decision that 4TB drives with 800GB SSDs would provide the perfect combination of performance and capacity. With very good cache hit rates, Neil would like to deploy larger SSDs when they become available, as he believes that cache is a great way to minimise the impact of the slower drives. The write performance of the 4TB drives was a concern as well: using the default VSAN stripe width of 1 meant that at most 2 drives were able to service write de-stage requests for a given VM, and due to the slow speed of the 4TB drives this could impact performance. To mitigate this, IndonesianCloud performed a series of internal tests that baselined different stripe sizes to find a good balance of performance. In the end a stripe width of 5 was selected and is now being used for all workloads. As a nice side effect, this also helps in situations where reads are coming from disk. By the way, the best way to think about stripe width and failures to tolerate is as RAID 1E (mirrored stripes).
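
A quick back-of-the-envelope makes the effect visible: with FTT=1 Virtual SAN keeps two mirror copies of an object, and each copy is striped across the configured number of capacity drives, so roughly stripeWidth x (FTT + 1) drives can absorb de-staged writes:

```python
# Rough estimate of how many capacity drives can service write de-staging
# for a VSAN object: each of the (ftt + 1) mirror copies is striped across
# `stripe_width` drives (hence the RAID 1E comparison above).
def drives_servicing_writes(stripe_width: int, ftt: int = 1) -> int:
    return stripe_width * (ftt + 1)

print(drives_servicing_writes(1))  # default policy: only 2 drives
print(drives_servicing_writes(5))  # stripe width of 5: up to 10 drives
```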

Key take away: Write performance of large NL-SAS drives is low, striping can help improving performance.

IndonesianCloud has standardised on a 12-node Virtual SAN cluster, and I asked why, given that Virtual SAN 5.5 U1 supports up to 32 nodes (64 even with 6.0). Neil's response was that 12 nodes is what comprises an internal “zone”, and that customers can balance their workloads across zones to achieve higher levels of availability. Having all nodes in a single cluster, whilst possible, was not considered the best fit for a service provider that is all about containing risk. 12 nodes also maps to approximately 1000 VMs, which is what they modelled the financial costs against; 1000 VMs deployed on the 12-node cluster consume CPU/memory/disk at the same ratio, effectively ensuring maximum utilisation of the asset.

If you look at the workloads IndonesianCloud's customers run, they range from large databases and time-sensitive ERP systems to webservers and streaming TV CDN services; they are even running airline ERP operations for a local carrier… All of these VMs belong to external paying customers, by the way, and all of them are mission critical for those customers. Some customers even run other storage services on top of Virtual SAN; one of them, for instance, is running SoftNAS on top of Virtual SAN to offer shared file services to other VMs. A vast range of different applications, with different IO profiles and different needs, but all satisfied by Virtual SAN. One thing that Neil stressed was that the ability to change the characteristics (such as failures to tolerate) specified in a profile was key for them, as it allows for a lot of flexibility / agility.

I did wonder, with VSAN being relatively new to the market, whether they had concerns in terms of stability and recoverability. Neil actually showed me their comprehensive UAT testing plan and the results, and they were very impressed by how VSAN handled these tests without any problems. Tests ranged from pulling drives and failing network interfaces and switches, through to removing full nodes from the cluster, all performed whilst simultaneously running various burn-in benchmarks. No problems whatsoever were experienced, and as a matter of fact the environment has been running great in production (don't curse it!!).

Key take away: Testing, Testing, Testing… Until you feel comfortable with what you designed and implemented!

When it comes to monitoring, though, the team did want to see more detail than what is provided out of the box; especially because it is a new platform, they felt this gave them a bit more assurance that things were indeed going well and it wasn't just their perception. They worked with one of VMware's rock stars when it comes to VR Ops, Iwan Rahabok, on creating custom dashboards with all sorts of data, ranging from cache hit ratio to latency per spindle to any type of detail you want at a per-VM level. Of course these start with a generic dashboard which then allows you to drill down: any outlier is noted immediately, and leveraging VR Ops and these custom dashboards they can drill deep whenever they need to. What I loved most is how relatively easy it is for them to extend their monitoring capabilities; during our WebEx, Iwan felt he needed some more specifics on a per-VM basis and added these details to VR Ops literally within minutes. IndonesianCloud has been kind enough to share a custom dashboard they created with which they can catch a rogue VM easily: when a single VM, and it can be any VM, generates excessive IOPS, it immediately triggers a spike in the overall dashboard.

I know I am heavily biased, but I was impressed. Not just with Virtual SAN, but even more so with how IndonesianCloud has implemented it. How it is changing the way IndonesianCloud manages their virtual estate and how it enables them to compete in today’s global market.

Rubrik follow up, GA and funding announcement

Two months ago I published an introduction post on Rubrik. Yesterday Rubrik announced that their platform has gone GA, along with a Series B funding round of 41 million dollars led by Greylock. I want to congratulate Rubrik on this new milestone; it is a major achievement, and I am sure we will hear much more from them in the months to come. For those who don't recall, here is what Rubrik is all about:

Rubrik is building a hyperconverged backup solution which scales from 3 to 1000s of nodes. Note that this solution can be up and running in 15 minutes and includes the option to age out data to the public cloud. What impressed me most is that Rubrik can discover your datacenter without any agents, scales out in a fully automated fashion, and is capable of deduplicating / compressing data while also offering the ability to mount data instantly. All of this through a slick UI, or through the REST APIs; it is fully programmable end-to-end.

When I published the article, some people commented that you can do the above with various other solutions and asked why I was so excited about this one. Well, first of all because you can do all of that from a single platform: you don't need a backup solution plus a storage solution, with multiple pieces to manage and without scale-out capabilities. I like the model, the combination of what is being offered, and the fact that it is a single package designed for this purpose and not glued together… But of course there is more; I just couldn't talk about it yet. I am not going to go into an extreme amount of detail, as Cormac wrote an excellent piece here and there is this great blog from Chris, who is a user of the product, which explains the value of the solution. (It is always nice to see people read your article and share their experience in return…)

I do want to touch on a couple of things which I feel set Rubrik apart. (There may be others who do / offer this, but I haven't been briefed by them.)

  • Global search across all data
    • “Google-like” search: you start typing the name of a file, of any VM, in the UI, and while you type, the UI already presents a list of potential files you are looking for. When it shows the right file, you click it and it presents a list of options. A file with this name could of course exist on one or many VMs; you can pick which one you want and select from which point in time to restore. When I was an admin I was often challenged with this problem: “I deleted a file, I know the name… but I have no clue where I stored it, can you recover it?” Well, that is no problem any longer with global search: just type the name and restore it.
  • True Scale Out
    • I already highlighted this, but I agree with Scott Lowe that there is “scale-out” and there is “Scale-Out”. In the case of Rubrik we are talking scale-out with a capital S and a capital O; not just from a capacity stance, but also (as Scott points out) when it comes to task management and the ability to run any task anywhere in the cluster. So with each node you add, you aren't just adding capacity, but also performance on all fronts. No single choking point with Rubrik as far as I can tell.
  • Miscellaneous, stuff that people take for granted… but does matter
    • API-Driven – Not something you would expect me to get excited about, and it seems such an obvious thing, but Rubrik's solution can be configured and managed through the API they expose. Note that every single thing you see in the UI can be done through the API; the UI is simply an API client (see the sketch after this list).
    • Well-performing instant mounts through the use of flash, serving the cluster up as a scale-out NFS solution to any vSphere host in your environment. Want to access a VM that was backed up? Mount it!
    • Cloud archiving… Yes, I know others offer this functionality too, but I still feel it is valuable enough to mention that Rubrik offers the option to archive data to S3, for instance.
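
To make the “the UI is simply an API client” point concrete, here is a hypothetical sketch of what driving such a REST API from Python could look like; the endpoint paths and field names are illustrative guesses, not Rubrik's documented API.

```python
# Hypothetical sketch of driving a Rubrik-style REST API with Python's
# requests library. Endpoint paths and field names are illustrative
# placeholders, not Rubrik's documented API.
import requests

BASE = "https://rubrik.example.com/api/v1"  # hypothetical cluster address
session = requests.Session()
session.auth = ("admin", "secret")
session.verify = False  # lab only; use proper certificates in production

# Anything the UI shows is backed by a call like this, e.g. listing VMs...
vms = session.get(f"{BASE}/vmware/vm").json().get("data", [])
for vm in vms:
    print(vm["name"], vm["slaDomain"])

# ...or triggering an on-demand snapshot of the first VM in the list.
resp = session.post(f"{BASE}/vmware/vm/{vms[0]['id']}/snapshot")
print(resp.status_code)
```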

Of course there is more to Rubrik than what I just listed; read the articles by Scott, Cormac and Chris to get a good overview… Or just contact Rubrik and ask for a demo.