
Yellow Bricks

by Duncan Epping



VSAN everywhere with Computacenter

Duncan Epping · Jun 1, 2016 ·

This week I had the pleasure of having a chat with Marc Huppert (VCDX #181). Marc works for Computacenter in Germany as a Senior Consultant and Category Leader VMware Solutions, and he primarily focuses on datacenter technology. I noticed a tweet from Marc saying he was working on a project where they will be implementing a 60-site ROBO deployment. It is one of those use cases that I don’t get to see too often in Europe, so I figured I would drop him a note. We spoke for about an hour, and below is the story of this project, along with some of the other projects Marc has worked on.

Marc mentioned that he has been involved with VSAN for about 1.5 years now. They first did intensive internal testing to see what VSAN could do for their customers and looked at the various use cases for their customer base. They quickly discovered that combining VSAN with Fusion-IO made for a very powerful combination: not only extremely reliable (Marc mentioned he has never seen a Fusion-IO card fail), but also an extremely well performing solution. They compared Fusion-IO with regular SATA-connected SSDs and performance literally doubled, not just for reads but also for writes. One of the other reasons for considering PCIe-based flash is to keep the maximum number of disk slots available for the capacity tier. It all makes sense to me. For current projects, NVMe-based flash from Intel is being explored, and I am very curious to hear what Marc’s experience will be like in terms of performance, reliability and the operational aspects compared to Fusion-IO.

Which brings us to the ROBO project, as this, surprisingly enough, is the project where NVMe will be used. Marc mentioned that this customer, a large company, has over 60 locations, all connected to a central (main) datacenter. Each location will be equipped with 2 hosts. Depending on the size of the location and the number of VMs needed, a particular VSAN configuration can be selected:

  • Small – 1 NVMe device + 5 disks, 128GB RAM
  • Medium – 2 NVMe devices + 10 disks, 256GB RAM
  • Large – 3 NVMe devices + 15 disks, 384GB RAM

Yes, that also leaves room to grow when desired, as every disk group can hold up to 7 capacity disks. From a compute point of view the configurations do not differ much besides memory and disk capacity; the CPU is actually the same across all three, to keep operations simple. In terms of licensing, the vSphere ROBO and VSAN ROBO editions are being leveraged, which provides a scalable and affordable ROBO infrastructure, especially when coming from a two-node configuration with a traditional storage system per location. It is not just the price point, but primarily the day-2 management.
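To put some rough numbers on that growth headroom, below is a minimal sketch (my own illustration, assuming one disk group per NVMe cache device and vSAN's limit of 7 capacity disks per disk group; the exact layout was not part of our conversation) of how many capacity slots each configuration still has available.

# Hypothetical sizing helper based on my reading of the Small/Medium/Large
# configurations above: one disk group per NVMe cache device, and a vSAN
# limit of 7 capacity disks per disk group.
MAX_DISKS_PER_DISK_GROUP = 7

configs = {
    "Small":  {"disk_groups": 1, "capacity_disks": 5,  "ram_gb": 128},
    "Medium": {"disk_groups": 2, "capacity_disks": 10, "ram_gb": 256},
    "Large":  {"disk_groups": 3, "capacity_disks": 15, "ram_gb": 384},
}

for name, cfg in configs.items():
    max_disks = cfg["disk_groups"] * MAX_DISKS_PER_DISK_GROUP
    headroom = max_disks - cfg["capacity_disks"]
    print(f"{name}: {cfg['capacity_disks']} of {max_disks} capacity slots in use, "
          f"room to add {headroom} more disks, {cfg['ram_gb']}GB RAM")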

When demonstrating VSAN to the customer, this is what impressed them the most. They have two people managing the entire virtual and physical estate: that is 60 locations (120 nodes) plus the main datacenter, which houses roughly 5,000 VMs and many physical machines. You can imagine that they spend a lot of time in vCenter, and they prefer to manage things end to end from that same spot; they definitely do not want to be switching between different management interfaces. Today they manage many small storage systems for their ROBO locations, and they immediately realised that VSAN in a ROBO configuration would significantly reduce the time they spend managing those locations.

And that is just the first step; next up is the DMZ. They have a separate compute cluster as it stands right now, but unfortunately it connects back to the same shared storage system their production runs on. They fully understand the risk, but never wanted to incur the large cost associated with a storage system dedicated to the DMZ, not just capex but also opex. With VSAN the economics change, making a fully isolated and self-contained DMZ compute and storage cluster dead simple to justify, especially when combining it with NSX.

One awesome customer if you ask me, and I am hoping they will become a public VSAN reference at some point in the future, as it is a great testimony to what VSAN can do. We briefly discussed other use cases Marc has seen out in the field, and Horizon View, management clusters and production came up, which is very similar to what I see. Marc also mentioned that there is growing interest in all-flash, which is not surprising considering the dollar-per-GB cost of SAS is very close to flash these days.

Before we wrapped it up, I asked Marc if he had any challenges with VSAN itself and what he felt was most complex. Marc mentioned that sizing is a critical aspect and that they have spent a lot of time in the past figuring out which configurations to offer to customers. Today the process they use is fairly straightforward: select a Ready Node configuration, swap the SATA SSD for PCIe-based flash or NVMe, and increase or decrease the number of disks. Fully supported, yet still flexible enough to meet all the demands of his customers.

Thanks Marc for the great conversation, and looking forward to meeting up with you at VMworld. (PS: Marc has visited all VMworld events so far in both the US and EMEA, a proper old-timer you could say :-))

You can find him on Twitter or on his blog: http://www.vcdx181.com

VSAN Success Story: Zettagrid and VSAN the perfect match for a reliable cloud infrastructure

Duncan Epping · Apr 5, 2016 ·

Two weeks ago I spoke with Anthony Spiteri about Virtual SAN, how he uses it and why he uses it. For those who don’t know Anthony, he is an architect at a service provider called Zettagrid, an avid blogger, and he spends some time on Twitter now and then. Make sure to bookmark his blog and follow him on Twitter, he is a smart guy. I wanted to chat with him to understand why they selected VSAN as the storage solution for their management environment.

Anthony mentioned that when he joined Zettagrid they weren’t using dedicated management clusters. As most of you who manage larger infrastructures know, separating production workloads from the management stack can be very useful. You don’t want your management solution contending for CPU/memory resources, and you surely don’t want any production outage impacting your management cluster… like, for instance, a storage outage. Which is exactly what happened in Anthony’s case: a storage outage took out (some of) the management components, which in turn made it impossible to figure out what was going on, a situation you never want to encounter as a service provider. Luckily they managed to figure it out relatively quickly, but it did make them see that a change was needed.

What better time to introduce a new concept like hyper-converged and create a self-contained management environment? Anthony mentioned that he had looked at two different platforms but decided to go for VSAN. The reason was straightforward: they did a large amount of testing and they simply couldn’t break it. It just worked, and it worked in a dead easy way, which also meant that when this was taken into production the learning curve would be tiny for the operations team.

As a hardware platform the Dell FX2 is used. I am a big fan of this platform and fully understand why they picked it: 4 nodes in 2U, which even includes switching, so for VSAN this means you can keep the traffic within the chassis with these smaller “4 node management” pods. Zettagrid decided to deploy 3 of these pods, and each of them will run services like vCenter Server, vCloud Director, SQL, AD, Veeam Backup etc. Nice solution if you ask me.

We also spoke about pricing; although not part of my responsibilities, it is always interesting to see how a solution works out from a TCO/ROI standpoint. I still recall exchanging some messages with Anthony about the VSPP pricing, and he mentioned it was on the high side. Needless to say, the recent pricing changes definitely make VSAN a no-brainer for service providers. The points were cut in half and the billing is now based on what is “used” versus what is “allocated”, and believe me (actually, believe Anthony), that makes a huge difference! Such a big difference that Anthony said they will definitely be looking at VSAN for their cloud resources as well.

Thanks Anthony for taking the time. Always good to hear back from customers.

PS: Coincidentally, there is an official VSAN reference story coming out soon as well; I will link to it as soon as I have received it.

VSAN enabling Sky to be fast / responsive / agile…

Duncan Epping · Nov 30, 2015 ·

Over the last couple of months I’ve been talking to a lot of VSAN customers. A while ago I had a very interesting use case with a deployment on an Oil Platform. This time it is a more traditional deployment: I had the pleasure of talking to James Cruickshank who works for Sky. Sky is Europe’s leading entertainment company, serving 21 million customers across five countries: Italy, Germany, Austria, the UK and Ireland.

James is part of Sky’s virtualisation group, which primarily focuses on new technologies. In short, the team figures out if a technology will benefit Sky, how it works, how to implement it and how to support it. He documents all of his findings, then develops and delivers the solution to the operations team.

One of the new products that James is working with is Virtual SAN. The project started in March and Sky has a handful of VSAN-ready clusters in each of its strategic data centres. These clusters currently have ESXi 5.5 hosts with one 400GB SSD and 4 x 4TB NL-SAS drives, all connected over 10GbE, a significant amount of capacity per host. The main reason for that is that Sky has a requirement to run with FTT=2 (for those who don’t know, this means that a 1TB disk will consume ~3TB). James anticipates VSAN 6 will be deployed with a view to delivering production workloads in Q1 2016.
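To make that FTT overhead concrete, here is a minimal sketch of the mirroring math behind the quoted number: with FTT=n, vSAN keeps n+1 full copies of a mirrored object, so the raw capacity consumed is roughly the usable capacity multiplied by (FTT + 1), ignoring witness components and metadata.

def raw_capacity_consumed(usable_tb: float, ftt: int) -> float:
    """Raw capacity consumed by a mirrored (RAID-1) vSAN object.

    With failures to tolerate (FTT) = n, vSAN stores n + 1 full copies,
    so a 1 TB disk with FTT=2 consumes roughly 3 TB of raw capacity
    (witness components and metadata overhead are ignored here).
    """
    return usable_tb * (ftt + 1)

print(raw_capacity_consumed(1.0, ftt=2))  # -> 3.0, the ~3TB mentioned above
print(raw_capacity_consumed(1.0, ftt=1))  # -> 2.0, the more common FTT=1 case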

We started talking about the workloads Sky had running and what some of the challenges were for James. I figured that, considering the size of the organisation and the number of workloads it has, getting all the details must not have been easy. James confirmed that it was difficult to get an understanding of the IO profile and that he spent a lot of time developing representative workloads. James mentioned that when he started his trial the VSAN Assessment Tool wasn’t available yet, and that it would have saved him a lot of time.

So what is Sky running? For now mainly test/dev workloads. These clusters are used by developers for short term usage, to test what they are building and trash the environment, all of which is enabled through vRealize Automation. Request a VM or multiple, deploy on VSAN cluster and done. So far in Sky’s deployment all key stakeholders are pleased with the technology as it is fast and responsive, and for the ops team in particular it is very easy to manage.

James mentioned that recently he has been testing both VSAN 5.5 and 6.0. He was so surprised by the performance increase that he re-ran his test multiple times, then had his colleagues do the same, while others reviewed the maths and the testing methodology. Each time they came to the same conclusion: there was a performance increase in excess of 60% between 5.5 and 6.0 (using a “real-world” IO profile), an amazing result.

My last question was around some of the challenges James faced. The first thing he said was that he felt the technology was fantastic. There were new considerations around the design/sizing of their VSAN hosts, the increased dependency on TCP/IP networks and the additional responsibilities for storage placed within the virtualisation operations team. There were also some minor technical challenges, but these were primarily from an operational perspective, and with vSphere / VSAN 5.5. In some cases he had to use RVC, which is a great tool, but as it is CLI based it does have a steep learning curve. The Health Check plugin in 6.0 has definitely helped a lot to improve this.

Another thing James wanted to call out is that in the current VSAN host design Sky uses an SSD to boot ESXi, as VSAN hosts with more than 512GB RAM cannot boot from SD card. This means the company is sacrificing a disk slot which could have been used for capacity, when it would prefer to use SD for boot if possible to optimise hardware config.

I guess it is safe to say that Sky is pleased with VSAN and in future the company is planning on adopting a “VSAN first” policy for a proportion of their virtual estate. I want to thank Sky, and James in particular, for taking the time to talk to me about his experience with VSAN. It is great to get direct feedback and hear the positive stories from such a large company, and such an experienced engineer.


Virtual SAN going offshore

Duncan Epping · Aug 17, 2015 ·

Over the last couple of months I have been talking to many Virtual SAN customers. After having spoken to so many customers and having heard many special use cases and configurations, I’m not easily impressed. I must say that halfway through the conversation with Steffan Hafnor Røstvig from TeleComputing I was seriously impressed. Before we get to that, let’s first look at the background of Steffan Hafnor Røstvig and TeleComputing.

TeleComputing is one of the oldest service providers in Norway. They started out as an ASP with a lot of Citrix expertise, and in recent years they’ve evolved more towards being a service provider rather than an application provider. TeleComputing’s customer base consists of more than 800 companies and in excess of 80,000 IT users. Customers typically have between 200 and 2,000 employees, so significant companies. In the Stavanger region a significant portion of the customer base is in the oil business or delivers services to the oil business. Besides managed services, TeleComputing also has their own datacenter in which they manage and host services for customers.

Steffan is a solutions architect but started out as a technician. He told me he still does a lot of hands-on work, but besides that also supports sales / pre-sales when needed. The office he is in has about 60 employees, and Steffan’s core responsibility is virtualization, mostly VMware based! Note that TeleComputing is much larger than those 60 employees; they have about 700 employees worldwide with offices in Norway, Sweden and Russia.

Steffan told me he was first introduced to Virtual SAN when it had just launched. Many of their offshore installations used what they call a “datacenter in a box” solution, which was based on IBM BladeCenter. It was a great solution for its time, but there were some challenges with it: cost was a factor, as were rack size and reliability. Swapping parts isn’t always easy either, and that is one of the reasons they started exploring Virtual SAN.

For Virtual SAN they are no longer using blades but instead switched to rack-mount servers. Considering the low number of VMs typically running in these offshore environments, a fairly “basic” 1U server can be used. With 4 hosts you now only take up 4U, instead of the 8 or 10U a typical blade system requires. Before I forget, the hosts themselves are Lenovo x3550 M4s with one 200GB Intel S3700 SSD and 6 IBM 900GB 10K RPM drives. Each host has 64GB of memory, two Intel E5-2630 6-core CPUs and an M5110 SAS controller. Especially in the type of environments they support this is very important; on top of that, the cost of 4 rack mounts is significantly lower than a full BladeCenter. What do I mean by type of environments? Well, as I said, offshore, but more specifically oil platforms! Yes, you are reading that right, Virtual SAN is being used on oil platforms.

For these environments 3 hosts are actively used and a 4th host is there just to serve as a spare. If anything fails in one of the hosts the components can easily be swapped, and if needed even the whole host could be swapped out. Even with a spare host the environment is still much cheaper than the original blade architecture. I asked Steffan if these deployments were used by staff on the platform or remotely. Steffan explained that local staff can only access the VMs, while TeleComputing manages the hosts; rent-an-infrastructure or infrastructure-as-a-service is the best way to describe it.

So how does that work? They use a central vCenter Server in their datacenter, to which the remote Virtual SAN clusters are added, connected via a satellite link. The virtual infrastructure as such is completely managed from a central location. And not just the virtual infrastructure; the hardware is also being monitored. Steffan told me they use the vendor ESXi image and as a result get all of the hardware notifications within vCenter Server; a single pane of glass is key when you are managing many environments like these. Plus it also eliminates the need for a 3rd-party hardware monitoring platform.

Another thing I was interested in was how the hosts were connected; considering the special location of the deployment, I figured there would be constraints here. Steffan mentioned that 10GbE is very rare in these environments and that they have standardized on 1GbE. The number of connections is limited as well; today they have 4 x 1GbE per server, of which 2 are dedicated to Virtual SAN. The use of 1GbE wasn’t really a concern: the number of VMs is typically relatively low, so the expectation was (and testing and production have confirmed) that 2 x 1GbE would suffice.

As we were wrapping up our conversation I asked Steffan what he learned during the design/implementation, besides all the great benefits already mentioned. Steffan said they quickly learned how critical the disk controller is, and that you need to pay attention to which driver you are using in combination with a certain version of the firmware. The HCL is leading, and should be strictly adhered to. Unfortunately, when Steffan started with VSAN the Health Check plugin wasn’t released yet, as that could have helped with some of these challenges. Another caveat Steffan mentioned is that when single-device RAID-0 sets are used instead of passthrough, you need to make sure to disable write caching. Lastly, Steffan mentioned the importance of separating traffic streams when 1GbE is used: do not combine VSAN with vMotion and management traffic, for instance. vMotion by itself can easily saturate a 1GbE link, which could mean it pushes out VSAN or management traffic.

It is fair to say that this is by far the most exciting and special use case I have heard for Virtual SAN. I know though there are some other really interesting use cases out there as I have heard about installations on cruise ships and trains as well. Hopefully I will be able to track those down and share those stories with you. Thanks Steffan and TeleComputing for your time and great story, much appreciated!

Virtual SAN enabling PeaSoup to simplify cloud

Duncan Epping · Jun 25, 2015 ·

This week I had the pleasure of talking to fellow Dutchman Harold Buter. Harold is the CTO of PeaSoup, and we had a lively discussion about Virtual SAN and why PeaSoup decided to incorporate Virtual SAN in their architecture. What struck me was the fact that PeaSoup Hosting was brought to life partly as a result of the Virtual SAN release. When we introduced Virtual SAN, Harold and his co-founder realized that this was a unique opportunity to build something from the ground up while avoiding the big upfront costs typically associated with legacy arrays. How awesome is that: a new product that leads to new ideas and, in the end, a new company and product offering.

The conversation of course didn’t end there, so let’s get into some more details. We discussed the use case first. PeaSoup is a hosting / cloud provider. Today they have two clusters running based on Virtual SAN: a management cluster which hosts all components needed for a vCloud Director environment, and a resource cluster. The great thing for PeaSoup was that they could start out with a relatively low investment in hardware and scale fast when new customers come on board or existing customers require new hardware.

Talking about hardware: PeaSoup looked at many different configurations and vendors, and for their compute platform decided to go with Fujitsu RX300 rack-mount servers. Harold mentioned that these were by far the best choice for them in terms of price, build quality and service. Personally it surprised me that Fujitsu came out as the cheapest option; it didn’t surprise me that Fujitsu’s service and build quality were excellent though. Spec-wise, the servers have 800GB SSDs, 7200 RPM NL-SAS disks, 256GB of memory and of course two CPUs (Intel 2620 v2, 6 cores).

Harold pointed out that the only downside of this particular Fujitsu configuration was that it came with a disk controller that is limited to RAID-0 only, with no passthrough. I asked him if they had experienced any issues around that, and he mentioned that they have had 1 disk failure so far, which resulted in having to reboot the server in order to recreate a RAID-0 set for the new disk. Not too big a deal for PeaSoup, but of course if possible he would prefer to avoid the need for this reboot. The disk controller, by the way, is based on the LSI 2208 chipset, and it is one of the things PeaSoup was very thorough about: making sure it was supported and that it had a high queue depth. The HCL came up multiple times during the conversation, and Harold felt that although doing a lot of research up front and creating a scalable and repeatable architecture takes time, it also results in a very reliable environment with predictable performance. For a cloud provider, reliability and user experience are literally your bread and butter; they couldn’t afford to “guess”. That was also one of the reasons they selected a VSAN Ready Node configuration as a foundation and tweaked it where their environment and anticipated workload required it.

Key takeaway: RAID-0 works perfectly fine during normal usage; only when disks need to be replaced is a slightly different operational process required.

“Anticipated” is a keyword once again, as it has been in many of the conversations I’ve had before: it is often unknown what kind of workloads will run on top of these infrastructures, which means you need to be flexible in terms of scaling up versus scaling out. Virtual SAN provides just that to PeaSoup. We also spoke about the networking aspect. For a cloud provider running vCloud Director and Virtual SAN, networking is a big aspect of the overall architecture, and I was interested in knowing what kind of switching hardware was being used. PeaSoup uses Huawei 10GbE switches (CE6850), and each server is connected to these switches with at least 4 x 10GbE ports. PeaSoup dedicated 2 of these ports to Virtual SAN, which wasn’t a requirement from a load perspective (or from VMware’s point of view), but they preferred this level of redundancy and performance while having a lot of room to grow. Resiliency and future-proofing are key for PeaSoup. Price versus quality was also a big factor in the decision to go with Huawei switches; in this case Huawei had the best price/quality ratio.

Key takeaway: It is worth exploring different network vendors and switch models. Prices vary greatly between vendors and models, which could lead to substantial cost savings without impacting service or quality.

Their host and networking configuration is well documented and can easily be repeated when more resources are needed. They even have discounts and pricing documented with their suppliers, so they can quickly assess what is needed, when it is needed, and what it will cost. I also asked Harold if they were offering different storage profiles to give their customers a choice in terms of performance and resiliency. So far they offer two policies to their customers:

  • Failures to tolerate = 1  //  Stripe Width = 2
  • Failures to tolerate = 1  //  Stripe Width = 4

So far it appears that not too many customers are asking about higher availability; they recently had their first request, and it looks like the offering will include “FTT=2” alongside “SW=2 / 4” in the near future. On the topic of customers, they mentioned they have a variety of different customers using the platform, ranging from companies in the media-conversion business and law firms to a company which sells “virtual private servers” on their platform.
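As a rough illustration of what those policies mean on disk (my own simplification, not something Harold described), the sketch below shows how mirroring and striping multiply the number of components vSAN creates for a single object.

def components_per_object(ftt: int, stripe_width: int) -> dict:
    """Rough component count for a mirrored (RAID-1) vSAN object.

    Each of the (ftt + 1) replicas is striped across `stripe_width`
    capacity devices, and at least one witness component is added to
    maintain quorum. Real placements can differ (extra witnesses, or
    components split for size), so treat this as an approximation.
    """
    data_components = (ftt + 1) * stripe_width
    witnesses = 1  # at minimum
    return {"data": data_components, "witness": witnesses,
            "total": data_components + witnesses}

# PeaSoup's two policies:
print(components_per_object(ftt=1, stripe_width=2))  # {'data': 4, 'witness': 1, 'total': 5}
print(components_per_object(ftt=1, stripe_width=4))  # {'data': 8, 'witness': 1, 'total': 9}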

Before we wrapped up I asked Harold what the biggest challenge with Virtual SAN was for them. Harold mentioned that although they were a very early adopter and use it in combination with vCloud Director, they have had no substantial problems. What may have been the most challenging in the first months was figuring out the operational processes around monitoring. PeaSoup is a happy Veeam customer and they decided to use Veeam ONE to monitor Virtual SAN for now, but in the future they will also be looking at the vR Ops Virtual SAN management pack, and potentially creating some custom dashboards in combination with Log Insight.

Key takeaway: Virtual SAN is not like a traditional SAN; new operational processes and tooling may be required.

PeaSoup is an official reference customer for Virtual SAN by the way, you can find the official video below and the slide deck of their PEX session here.


