
Yellow Bricks

by Duncan Epping


Tintri announces all-flash storage device and Tintri OS 4.0

Duncan Epping · Aug 20, 2015 ·


Last week I had the pleasure of catching up with Tintri. It has been a while since I spoke with them, but I have been following them from the very start. I met up with them in Mountain View a couple of times when it was just a couple of guys on a rather empty floor with a solution that sounded really promising. Tintri’s big thing, if you ask me, is simplicity: super simple to set up, really easy to manage, and providing VM-granular controls for about everything you can imagine. The solution comes in the form of a hybrid storage device (disks and flash) which is served up to the hypervisor as an NFS mount.

Today Tintri announces that they will be offering an all-flash system next to their hybrid systems. When talking to Kieran he made it clear that the all-flash system would probably be only for a subset of their customers, the key reason being that the hybrid solution already brings great performance, and at a much lower cost of course. The new all-flash model is named VMstore T5000 and comes in two variants: the T5060 and the T5080. The T5060 can hold up to 2500 VMs and around 36TB with dedupe and compression; for the T5080 that is 5000 VMs and around 73TB. Both are delivered in a 2U form factor, by the way. The expected use cases for the all-flash systems are large persistent desktops and multi-TB high-performance databases. The key thing here is of course not just the number of IOPS it can drive, but the consistent low latency it can deliver.

Besides the hardware, there is also a software refresh: Tintri OS 4.0 and Global Center 2.1 are being announced. Tintri OS 4.0 is what sits on the VMstore storage systems, and Global Center is their central management solution. With the 2.1 release Global Center now supports up to 100,000 VMs. It allows you to centrally manage both Tintri’s hybrid and all-flash systems from one UI and does smart things like informing you when a VM is provisioned to the wrong storage system (hybrid, for instance, when performance-wise it requires all-flash). It doesn't just inform you; it also has the ability to migrate the VM from one storage system to another. Note that during the migration all aspects associated with the VM (QoS, replication etc.) are kept. (Not unlike Storage DRS, but in this case the solution is aware of everything that happens on the storage system.) What I personally liked about Global Center is the performance and health views: it is very easy to see what the state of your environment is, where latency is coming from, and so on. Also, if you need to configure things like QoS, replication or snapshotting for multiple VMs, you can do this from the Global Center console by simply grouping them, as shown in the screenshot below.

Tintri QoS was demoed during the call, and I found this particularly interesting as it allows you to define QoS at a VM (or VMDK) granular level. When you specify things like an IOPS limit, it is good to know that Tintri normalizes the IOPS based on the size of the IO. Simply said, any IO of 8KB or smaller counts as 1 normalized IOPS, an IO of 16KB counts as 2 normalized IOPS, and so on. This ensures fairness in environments where IO sizes vary greatly (which will be almost every environment). Those who have ever tried to profile their workloads will know why this is important. What I’ve always liked about Tintri is their monitoring: the way they split latency up into hypervisor, network and storage, for instance, is very useful, and they have done an excellent job again for QoS management.
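To make that normalization concrete, here is a minimal Python sketch of the scheme as described during the briefing. The function name is mine, and rounding up for IO sizes that fall between multiples of 8KB is my assumption:

```python
import math

def normalized_iops(io_size_kb: float) -> int:
    """Convert a single IO into normalized IOPS, using 8KB as the unit.

    Any IO of 8KB or smaller counts as 1; larger IOs count one unit
    per 8KB (rounding up is an assumption on my part).
    """
    return max(1, math.ceil(io_size_kb / 8))

# A fixed IOPS limit therefore admits fewer large IOs than small ones:
for size_kb in (4, 8, 16, 24, 64):
    print(f"{size_kb}KB IO -> {normalized_iops(size_kb)} normalized IOPS")
```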

Last but not least, Tintri introduces Tintri VMstack: basically their converged offering, where compute + storage + hypervisor are bundled and delivered as a single stack to customers. It provides you a choice of storage platform (well, it needs to be Tintri of course), hypervisor, compute and network infrastructure. It can also include things like OpenStack or the vRealize Suite. Personally I think this is a smart move, although it is something I would have preferred to have seen launched 12-18 months ago. Nevertheless, it is a good move.

Using VM-Host rules without DRS enabled

Duncan Epping · Aug 20, 2015 ·

This week I was playing with the VM-Host rules in my environment. In this particular environment I had DRS disabled, and I noticed some strange things when I created the VM-Host rules. I figured it should all work normally, as I was always told that VM-Host rules can be configured without DRS being enabled. And from a “configuration” perspective that is correct. However, there is a big caveat here, so let's look at the two options you have when creating a rule, namely “should” and “must”.

When using a VM-Host “must” rule with DRS disabled, it all works as expected. When you have the rule defined, you cannot place the VM on a host which is not within the VM-Host group: you cannot power it on on those hosts, vMotion to them is blocked, and HA will not place the VM there after a failure either. Everything works as expected.

In the case of a VM-Host “should” rule with DRS disabled, this is different! When you have a should rule defined and DRS is disabled, vCenter will allow you to power on a VM on a host which is not part of the rule. HA will also restart VMs on hosts which are not part of the rule, and you can migrate a VM to one of those hosts. All of this happens without a warning that the host is not in the rule and that you are violating the rule. Even after explicitly defining an alarm I don’t see anything triggered. The alarm, by the way, is called “VM is violating a DRS VM-Host affinity rule”.

I reached out to the HA/DRS engineering team and asked them why that is. It appears the logic for the “should” rule, contrary to the “must” rule, is handled by DRS. This includes the alerting. It makes sense to a certain extent, but it wasn’t what I expected. So be warned: if you don’t have DRS enabled, VM-Host “should” rules will not work. “Must” rules, however, will work perfectly fine. (Yes, I’ve asked them to look into this and fix it so it behaves as you would expect, and comes with a warning when you try anything that violates a should rule.)
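To summarize the observed behavior, here is a small and purely illustrative Python model of how placements are admitted. This is not how vCenter is implemented; it just states the caveat precisely:

```python
def placement_allowed(host: str, rule_hosts: set[str],
                      rule_type: str, drs_enabled: bool) -> tuple[bool, str]:
    """Toy model of the observed placement behavior for VM-Host rules.

    "must" rules are enforced by vCenter itself, so they hold even
    without DRS; "should" rules are handled by DRS, so with DRS
    disabled nothing blocks or even flags a violating placement.
    """
    if rule_type == "must":
        return host in rule_hosts, "enforced by vCenter"
    if rule_type == "should":
        if not drs_enabled:
            return True, "allowed: no warning, no alarm"
        return True, "allowed, but DRS alarms and remediates"
    raise ValueError(f"unknown rule type: {rule_type!r}")

group = {"esx-01", "esx-02"}
print(placement_allowed("esx-04", group, "must", drs_enabled=False))
print(placement_allowed("esx-04", group, "should", drs_enabled=False))
```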


Rubrik 2.0 release announced today

Duncan Epping · Aug 19, 2015 ·

Today the Rubrik 2.0 release was announced. I’ve written about who they are and what they do twice now, so I am not going to repeat that; if you haven’t read those articles, please read those first (article 1 and article 2). Chris Wahl took the time to brief me, and the first thing that stood out to me was the new term that was coined: Converged Data Management. Considering what Rubrik does and has planned for the future, I think that term is spot on.

When it comes to 2.0 there are a bunch of features being introduced. I will list them out and then discuss some of them in a bit more detail:

  • New Rubrik appliance model R348
    • Same 2U/4-node platform, but leveraging 8TB disks instead of 4TB disks
  • Replication
  • Auto Protect
  • WAN Efficient (global deduplication)
  • AD Authentication – No need to explain
  • OpenStack Swift support
  • Application aware backups
  • Detailed reporting
  • Capacity planning

Let's start at the top: a new model is introduced next to the two existing models. The two other models are also both 2U/4-node solutions, but use 4TB drives instead of the 8TB drives the R348 will be using. This boosts capacity for a single Brik to roughly 300TB, which in 2U is not bad at all I would say.

Of course the hardware isn’t the most exciting part; the software changes fortunately are. In the 2.0 release Rubrik introduces replication between sites / appliances, with global dedupe ensuring that replication is as efficient as it can be. The great thing here is that you back up data and replicate it to other sites straight after it has been deduplicated. All of this is again policy driven, by the way, so you can define when you want to replicate, how often, and for how long data needs to be retained on the destination.
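To illustrate why deduplicating before replicating makes replication WAN efficient, here is a generic content-fingerprint sketch in Python. It shows the general technique, not Rubrik's actual implementation:

```python
import hashlib
import os

def replicate(data: bytes, remote_chunks: set[str],
              chunk_size: int = 4096) -> int:
    """Ship only chunks the destination has not stored yet.

    Generic fixed-size chunking with SHA-256 fingerprints; real
    products typically use variable-size chunks, but the WAN-saving
    principle is the same. Returns the number of bytes actually sent.
    """
    sent = 0
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in remote_chunks:    # destination lacks this chunk
            remote_chunks.add(digest)      # ...so it crosses the WAN once
            sent += len(chunk)
    return sent

remote: set[str] = set()
payload = os.urandom(1_000_000)            # ~1MB of unique data
print(replicate(payload, remote))          # first pass: ~1,000,000 bytes
print(replicate(payload, remote))          # second pass: 0 bytes
```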

Auto-protect is one of those features you will quickly take for granted, but it is very valuable. Basically it allows you to set a default SLA at the vCenter level, or at the cluster, resource pool or folder level, you get the drift. Set and forget is basically what this means: no more risk of newly provisioned VMs that have not been added to the backup schedule. Something really simple, but very useful.
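Conceptually, auto-protect is an inheritance lookup: a VM without its own SLA falls back to the nearest ancestor (folder, resource pool, cluster, vCenter) that has one. A minimal Python sketch with made-up object names follows; Rubrik's actual resolution order may differ:

```python
from typing import Optional

def effective_sla(obj: str, parent: dict[str, str],
                  sla: dict[str, str]) -> Optional[str]:
    """Walk up the inventory tree to the nearest ancestor with an SLA."""
    node: Optional[str] = obj
    while node is not None:
        if node in sla:
            return sla[node]
        node = parent.get(node)
    return None  # nothing in the chain is protected

# vm-7 sits in folder-a, in cluster-1, in vcenter; the SLA is set at
# the cluster level, so the newly provisioned VM inherits it.
parent = {"vm-7": "folder-a", "folder-a": "cluster-1", "cluster-1": "vcenter"}
sla = {"cluster-1": "gold"}
print(effective_sla("vm-7", parent, sla))  # gold
```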

When it comes to application awareness, Rubrik in version 2.0 will also leverage a VSS provider to allow for transactionally consistent backups. Today this applies to Microsoft Exchange, SQL, SharePoint and Active Directory; more can be expected in the near future. Note that this applies to backups: for restores there is no option (yet) to restore a specific mailbox for instance, but Chris assured me that this is on their radar.

When it comes to usability, a lot of improvements have been made, starting with things like reporting and capacity planning. One of the reports I found very useful is the SLA compliance report, which simply shows you whether VMs are meeting the defined SLA or not. Capacity planning is also very helpful, as it informs you what the growth rate is locally and in the cloud, and when you will be running out of space. A nice trigger to buy an additional appliance, right, or to change your retention period or archival policy, etc. On top of that, things like object deletion, task cancellation, progress bars and many more usability improvements have made it into the 2.0 release.
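The "when will I run out of space" question boils down to a simple linear projection. Here is a back-of-the-envelope Python version; this is my own arithmetic, not Rubrik's forecasting model, which may well be more sophisticated:

```python
def days_until_full(capacity_tb: float, used_tb: float,
                    daily_growth_tb: float) -> float:
    """Naive linear runway estimate: free capacity / daily growth."""
    if daily_growth_tb <= 0:
        return float("inf")   # flat or shrinking usage never fills up
    return (capacity_tb - used_tb) / daily_growth_tb

# A ~300TB Brik at 220TB used, growing 0.5TB/day: ~160 days left.
print(f"{days_until_full(300, 220, 0.5):.0f} days of runway")
```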

All in all an impressive release, especially considering 1.0 was released less than six months ago. It is great to see a high release cadence in an industry which has been moving extremely slowly for the past decades. Thanks Rubrik for stirring things up!

Platform9 announcements / funding

Duncan Epping · Aug 18, 2015 ·

Clearly VMworld is around the corner, as many new products, releases and company announcements are being made this week and next. Last week I had the opportunity to catch up with Sirish Raghuram, Platform9’s CEO. For those who don’t know who/what/where, I recommend reading the two articles I wrote earlier this year. In short, Platform9 is a SaaS-based private cloud management solution which leverages OpenStack; Platform9 themselves also describe it as “OpenStack-as-a-Service”.

Over the last months Platform9 has grown to 27 people and is now actively focusing on scaling marketing and sales. They have already hired some very strong people from companies like Rackspace, EMC, Metacloud and VMware. Their Series A funding was $4.5M from Redpoint Ventures, and now they have announced a $10M Series B round which was led by Menlo Ventures and included Redpoint Ventures. Considering the state of the OpenStack startup community, that is a big achievement if you ask me. The company has seen good revenue momentum in its first two quarters of sales, with QoQ growth of 200%, multiple site-wide license agreements for 400+ servers in each quarter, and customer deployments in 17 countries.

So what is being announced? The GA of support for vSphere, which has been in beta since early this year. Basically this means that as of this release you can manage local KVM and vSphere hosts using Platform9’s solution. What I like about their solution is that it is very easy to configure, and it is SaaS based, so there are no worries about installing/configuring/upgrading/updating or maintenance of the management solution itself. Install / configure takes less than 5 minutes: basically you point it at your vCenter Server, a proxy VM is deployed, and then resources are sucked in. The architecture for vSphere looks like this:

The cool thing is that it integrates with existing vSphere deployments: if you have people managing vSphere with vCenter and they make changes, Platform9 is smart enough to recognize that and reconcile. On top of that, all vSphere templates are automatically pulled in, so you can use those immediately when provisioning new VMs through Platform9. Managing VMs through Platform9 is very easy, and if you are familiar with the OpenStack APIs, automating any aspect of Platform9 is a breeze as it is fully compatible. When it comes to managing resources and workloads, I think the UI speaks for itself: very straightforward, very easy to use. Adding hosts, deploying new workloads or monitoring capacity is typically all done within a few clicks. When it comes to vSphere they also support things like the Distributed Switch, and support for NSX is around the corner, for those who need advanced networking / isolation / security, etc.
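Because the API is OpenStack compatible, standard OpenStack client code should work against it. Here is a minimal sketch using the OpenStack Compute (Nova) API via Python requests; the endpoint URL and token below are placeholders, not real Platform9 values:

```python
import requests

# Placeholders: take the compute endpoint and Keystone token from
# your own account; these values are made up for illustration.
COMPUTE_URL = "https://example.platform9.example/nova/v2.1"
TOKEN = "<keystone-token>"

# Standard OpenStack Compute (Nova) call: list servers, i.e. VMs.
resp = requests.get(f"{COMPUTE_URL}/servers",
                    headers={"X-Auth-Token": TOKEN},
                    timeout=30)
resp.raise_for_status()

for server in resp.json()["servers"]:
    print(server["id"], server["name"])
```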

Platform9 also introduces auto-scaling capabilities based on resource alarms and application templates. Both scaling up and scaling down your workloads when needed is supported, which is something that comes up on a regular basis with the customers I talk to. Platform9 can take care of the infrastructure side of scaling out; you worry about creating that scale-out application architecture, which is difficult enough as it is.
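In essence, alarm-driven auto-scaling is a control loop: watch a metric, compare it against thresholds, and adjust the instance count. A generic Python sketch of that loop; the thresholds and names are illustrative, not Platform9's actual configuration:

```python
def desired_instances(current: int, cpu_util: float,
                      high: float = 0.80, low: float = 0.30,
                      min_n: int = 2, max_n: int = 10) -> int:
    """Alarm-driven scaling decision: one step up or down per check."""
    if cpu_util > high and current < max_n:
        return current + 1   # scale up: sustained high utilization
    if cpu_util < low and current > min_n:
        return current - 1   # scale down: capacity sitting idle
    return current           # within band: leave it alone

print(desired_instances(3, 0.92))  # 4 -> scale up
print(desired_instances(3, 0.10))  # 2 -> scale down
```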

When it comes to their SaaS-based platform, it is good to know that the platform is not shared between customers, which means there is no risk of one customer hijacking the environment of another customer. Also, the platform scales independently and will scale automatically as your local environment grows. No need to worry about any of those aspects any longer, and of course, because it is SaaS based, Platform9 takes care of patching/updating/upgrading etc.

Personally, I would love to see a couple of things added. I would find it useful if Platform9 could take care of network isolation, just like Lab Manager was capable of doing in the past. It would also be great if Platform9 could manage “standalone” ESXi hosts instead of having to be pointed at a vCenter Server. I do understand that brings some constraints, but it could be a nice feature. Either way, I like the single pane of glass they offer today, and it can only get better. Nice job Platform9, keep those updates coming!

Virtual SAN going offshore

Duncan Epping · Aug 17, 2015 ·

Over the last couple of months I have been talking to many Virtual SAN customers. After having spoken to so many customers and having heard many special use cases and configurations, I’m not easily impressed. I must say that halfway through the conversation with Steffan Hafnor Røstvig from TeleComputing I was seriously impressed. Before we get to that, let's first look at the background of Steffan Hafnor Røstvig and TeleComputing.

TeleComputing is one of the oldest service providers in Norway. They started out as an ASP with a lot of Citrix expertise, and in recent years they have evolved into being a service provider rather than an application provider. TeleComputing’s customer base consists of more than 800 companies and in excess of 80,000 IT users. Customers typically have between 200-2000 employees, so significant companies. In the Stavanger region a significant portion of the customer base is in the oil business or delivers services to the oil business. Besides managed services, TeleComputing also has their own datacenter in which they manage and host services for customers.

Steffan is a solutions architect but started out as a technician. He told me he still does a lot of hands-on work, but besides that also supports sales / pre-sales when needed. The office he is in has about 60 employees, and Steffan’s core responsibility is virtualization, mostly VMware based! Note that TeleComputing is much larger than those 60 employees; they have about 700 employees worldwide, with offices in Norway, Sweden and Russia.

Steffan told me he was first introduced to Virtual SAN when it had just launched. Many of their offshore installations used what they call a “datacenter in a box” solution, which was based on IBM BladeCenter. A great solution for its time, but there were some challenges with it: cost was a factor, as were rack size and reliability. Swapping parts isn’t always easy either, and that is one of the reasons they started exploring Virtual SAN.

For Virtual SAN they are no longer using blades; instead they switched to rack-mounted servers. Considering the low number of VMs typically running in these offshore environments, a fairly “basic” 1U server can be used. With 4 hosts you now only take up 4U, instead of the 8 or 10U a typical blade system requires. Before I forget: the hosts themselves are Lenovo x3550 M4’s, each with one 200GB Intel S3700 SSD, six IBM 900GB 10K RPM drives, 64GB of memory, two Intel E5-2630 6-core CPUs, and an M5110 SAS controller. Especially in the type of environments they support this compactness is very important, and on top of that the cost is significantly lower for 4 rack mounts vs a full BladeCenter. What do I mean by type of environments? Well, as I said, offshore, but more specifically oil platforms! Yes, you are reading that right: Virtual SAN is being used on oil platforms.

For these environments 3 hosts are actively used and a 4th host is there just to serve as a “spare”. If anything fails in one of the hosts, the components can easily be swapped, and if needed even the whole host can be swapped out. Even with a spare host, the environment is still much cheaper than the original blade architecture. I asked Steffan if these deployments were used by staff on the platform or remotely. Steffan explained that staff “locally” can only access the VMs, while TeleComputing manages the hosts; rent-an-infrastructure or infrastructure-as-a-service is the best way to describe it.

So how does that work? Well, they use a central vCenter Server in their datacenter, with the remote Virtual SAN clusters connected via a satellite connection. The virtual infrastructure as such is completely managed from a central location, and not just the virtual infrastructure: the hardware is monitored as well. Steffan told me they use the vendor ESXi image and as a result get all of the hardware notifications within vCenter Server. A single pane of glass is key when you are managing many environments like these, plus it eliminates the need for a 3rd-party hardware monitoring platform.

Another thing I was interested in was how the hosts were connected; considering the special location of the deployment I figured there would be constraints here. Steffan mentioned that 10GbE is very rare in these environments and that they have standardized on 1GbE. The number of connections is limited too: today they have 4 x 1GbE per server, of which 2 are dedicated to Virtual SAN. The use of 1GbE wasn’t really a concern; the number of VMs is typically relatively low, so the expectation was (and testing and production have confirmed) that 2 x 1GbE would suffice.

As we were wrapping up our conversation I asked Steffan what he learned during the design/implementation, besides all the great benefits already mentioned. Steffan said that they quickly learned how critical the disk controller is, and that you need to pay attention to which driver you are using in combination with a certain version of the firmware. The HCL is authoritative and should be strictly adhered to. When Steffan started with VSAN the Health Check plugin hadn't been released yet, unfortunately, as that could have helped with some of the challenges. Another caveat Steffan mentioned was that when single-device RAID-0 sets are used instead of passthrough, you need to make sure to disable write caching. Lastly, Steffan mentioned the importance of separating traffic streams when 1GbE is used: do not combine VSAN with vMotion and Management, for instance. vMotion by itself can easily saturate a 1GbE link, which could mean it pushes out VSAN or Management traffic.

It is fair to say that this is by far the most exciting and special use case I have heard for Virtual SAN. I know there are some other really interesting use cases out there, though, as I have heard about installations on cruise ships and trains as well. Hopefully I will be able to track those down and share those stories with you. Thanks Steffan and TeleComputing for your time and great story, much appreciated!

