
Yellow Bricks

by Duncan Epping


vSphere

Virtual SAN going offshore

Duncan Epping · Aug 17, 2015 ·

Over the last couple of months I have been talking to many Virtual SAN customers. After having spoken to so many customers and heard so many special use cases and configurations, I’m not easily impressed. I must say that halfway through the conversation with Steffan Hafnor Røstvig from TeleComputing I was seriously impressed. Before we get to that, let’s first look at the background of Steffan Hafnor Røstvig and TeleComputing.

TeleComputing is one of the oldest service providers in Norway. They started out as an ASP with a lot of Citrix expertise, and over the last years they have evolved into being a service provider rather than an application provider. TeleComputing’s customer base consists of more than 800 companies and in excess of 80,000 IT users. Customers typically have between 200 and 2,000 employees, so these are significant companies. In the Stavanger region a significant portion of the customer base is in the oil business or delivers services to the oil business. Besides managed services, TeleComputing also has their own datacenter in which they manage and host services for customers.

Steffan is a solutions architect but started out as a technician. He told me he still does a lot of hands-on work, but besides that also supports sales / pre-sales when needed. The office he is in has about 60 employees, and Steffan’s core responsibility is virtualization, mostly VMware based! Note that TeleComputing is much larger than those 60 employees; they have about 700 employees worldwide, with offices in Norway, Sweden and Russia.

Steffan told me he was first introduced to Virtual SAN when it had just launched. Many of their offshore installations used what they call a “datacenter in a box” solution, which was based on IBM BladeCenter. It was a great solution for its time, but there were some challenges with it: cost was a factor, as were rack size and reliability. Swapping parts isn’t always easy either, and that is one of the reasons they started exploring Virtual SAN.

For Virtual SAN they are no longer using blades but instead switched to rack-mounted servers. Considering the low number of VMs that typically run in these offshore environments, a fairly “basic” 1U server can be used. With 4 hosts you now only take up 4U, instead of the 8 or 10U a typical blade system requires. Before I forget, the hosts themselves are Lenovo x3550 M4’s with one 200GB Intel S3700 SSD and 6 IBM 900GB 10K RPM drives. Each host has 64GB of memory and two Intel E5-2630 6-core CPUs, and uses an M5110 SAS controller. Especially in the type of environments they support this small footprint is very important, and on top of that the cost is significantly lower for 4 rack mounts versus a full bladecenter. What do I mean by type of environments? Well, as I said, offshore, but more specifically Oil Platforms! Yes, you are reading that right, Virtual SAN is being used on Oil Platforms.

For these environments 3 hosts are actively used and a 4th host is there just to serve as a “spare”. If anything fails in one of the hosts the components can easily be swapped, and if needed even the whole host could be swapped out. Even with a spare host the environment is still much cheaper than the original blade architecture. I asked Steffan whether these deployments were used by staff on the platform or remotely. Steffan explained that staff “locally” can only access the VMs, while TeleComputing manages the hosts; rent-an-infrastructure or infrastructure as a service is the best way to describe it.

So how does that work? Well, they use a central vCenter Server in their datacenter and have added the remote Virtual SAN clusters, which are connected via a satellite link. The virtual infrastructure as such is completely managed from a central location. Not just the virtual infrastructure, the hardware is being monitored as well. Steffan told me they use the vendor ESXi image and as a result get all of the hardware notifications within vCenter Server; a single pane of glass is key when you are managing many environments like these. It also eliminates the need for a 3rd party hardware monitoring platform.
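For those wondering what that looks like in practice, here is a minimal sketch (Python with pyVmomi; the vCenter address and credentials are placeholders, and none of this comes from TeleComputing’s actual tooling) of how you could pull the connection state and any triggered alarms, including vendor hardware alerts, for every host from one central vCenter Server:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()           # lab shortcut; use valid certs in production
si = SmartConnect(host="vcenter.example.local",  # hypothetical central vCenter
                  user="administrator@vsphere.local",
                  pwd="********",
                  sslContext=ctx)
try:
    content = si.RetrieveContent()
    hosts = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in hosts.view:
        # Connection state and overall health as reported by vCenter
        print(host.name, host.runtime.connectionState, host.overallStatus)
        # Triggered alarms, which with a vendor ESXi image include hardware alerts
        for state in host.triggeredAlarmState:
            print("  ALARM:", state.alarm.info.name, state.overallStatus)
    hosts.Destroy()
finally:
    Disconnect(si)
```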

Another thing I was interested in was how the hosts were connected; considering the special location of the deployment I figured there would be constraints here. Steffan mentioned that 10GbE is very rare in these environments and that they have standardized on 1GbE. The number of connections is even limited, and today they have 4 x 1GbE per server, of which 2 are dedicated to Virtual SAN. The use of 1GbE wasn’t really a concern; the number of VMs is typically relatively low, so the expectation was (and testing and production have confirmed) that 2 x 1GbE would suffice.

As we were wrapping up our conversation I asked Steffan what he learned during the design/implementation, besides all the great benefits already mentioned. Steffan said that they learned quickly how critical the disk controller is and that you need to pay attention to which driver you are using in combination with a certain version of the firmware. The HCL is leading and should be strictly adhered to. Unfortunately, when Steffan started with VSAN the Health Check plugin wasn’t released yet, as that could have helped with some of the challenges. Another caveat Steffan mentioned was that when single-device RAID-0 sets are used instead of passthrough, you need to make sure to disable write caching. Lastly, Steffan mentioned the importance of separating traffic streams when 1GbE is used: do not combine VSAN with vMotion and Management for instance. vMotion by itself can easily saturate a 1GbE link, which could mean it pushes out VSAN or Management traffic.
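To give an idea of how you could keep an eye on these caveats centrally, here is a hedged sketch (again Python with pyVmomi; vCenter address and credentials are placeholders, not TeleComputing’s environment) that lists each host’s disk controllers with their driver, so the driver/firmware combination can be cross-checked against the HCL, and shows which vmkernel interfaces are selected for VSAN, vMotion and Management traffic so you can confirm the streams are separated. Note that firmware versions are not exposed through this particular API and still need to be verified with vendor tooling:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.local",  # hypothetical vCenter
                  user="administrator@vsphere.local",
                  pwd="********",
                  sslContext=ctx)
try:
    content = si.RetrieveContent()
    hosts = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in hosts.view:
        # Storage adapters: device, model and driver, to cross-check against the HCL
        for hba in host.config.storageDevice.hostBusAdapter:
            print(host.name, hba.device, hba.model, hba.driver)
        # Which vmkernel interfaces are selected for each traffic type
        nic_mgr = host.configManager.virtualNicManager
        for traffic in ("vsan", "vmotion", "management"):
            cfg = nic_mgr.QueryNetConfig(traffic)
            vmks = [v.device for v in cfg.candidateVnic
                    if v.key in (cfg.selectedVnic or [])] if cfg else []
            print(host.name, traffic, vmks)
    hosts.Destroy()
finally:
    Disconnect(si)
```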

It is fair to say that this is by far the most exciting and special use case I have heard of for Virtual SAN. I know there are some other really interesting use cases out there though, as I have heard about installations on cruise ships and trains as well. Hopefully I will be able to track those down and share those stories with you. Thanks Steffan and TeleComputing for your time and great story, much appreciated!

Awesome fling: ESXi Embedded Host Client

Duncan Epping · Aug 13, 2015 ·

A long, long time ago I stumbled across a project within VMware which allowed you to manage ESXi through a client running on ESXi itself. Basically it presented an HTML interface for ESXi, not unlike the MUI we had in the old days. It was one of those pet projects done in spare time by a couple of engineers, which for various reasons was never completed at the time. Fortunately, the concept/idea did not die. Some very clever engineers felt it was time to have that “embedded host client” for ESXi, started developing something in their spare time, and this is the result.

I am not going to describe it in detail as William Lam already has an excellent post on this great fling. The installation is fairly straightforward: basically a VIB you need to install. No rocket science. Once installed you can manage various aspects of your hosts and VMs, including:

  • VM operations (power on, off, reset, suspend, etc.)
  • Creating a new VM, from scratch or from OVF/OVA (limited OVA support)
  • Configuring NTP on a host
  • Displaying summaries, events, tasks and notifications/alerts
  • Providing a console to VMs
  • Configuring host networking
  • Configuring host advanced settings
  • Configuring host services

Is that cool or what? Head over to the Fling website and test it. Make sure to provide feedback when you have some, as the engineers are very receptive and always looking to improve their fling. Personally I hope that this fling will graduate and be added to ESXi by default, or at a minimum be fully supported! Excellent work Etienne Le Sueur and George Estebe!

High latency VPLEX configuration and vMotion optimization

Duncan Epping · Jul 10, 2015 ·

This week someone asked me about an advanced setting to optimize vMotion for VPLEX configurations. This person referred to the vSphere 5.5 Performance Best Practices paper, more specifically the following section:

Add the VMX option (extension.converttonew = “FALSE”) to virtual machine’s .vmx files. This option optimizes the opening of virtual disks during virtual machine power-on and thereby reduces switch-over time during vMotion. While this option can also be used in other situations, it is particularly helpful on VPLEX Metro deployments.

I had personally never heard of this advanced setting, so I did some searches both internally and externally and couldn’t find any references other than in the vSphere 5.5 Performance paper. Strange, as with a generic recommendation like the above you would expect it to be mentioned in at least one or two other spots. I reached out to one of the vMotion engineers and after going back and forth I figured out what the setting is for and when it should be used.

During testing with VPLEX and VMs using dozens of VMDKs in a “high latency” situation, it could take longer than expected before the switchover between hosts happened. First of all, when I say “high latency” we are talking about close to the maximum tolerated for VPLEX, which is around 10ms RTT. When “extension.converttonew” is used the amount of IO needed during the switchover is limited, and when each IO takes 10ms you can imagine that has a direct impact on the time it takes to switch over. Of course these enhancements were also tested in scenarios where there wasn’t high latency or only a low number of disks was used, and in those cases the benefits were negligible and the operational overhead of configuring this setting did not outweigh the benefits.
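For completeness, this is roughly how the option could be applied through the API instead of editing the .vmx file by hand. It is just a sketch using pyVmomi; the vCenter address, credentials and the VM name are placeholders, and the change takes effect the next time the virtual disks are opened (power-on):

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.local",  # hypothetical vCenter
                  user="administrator@vsphere.local",
                  pwd="********",
                  sslContext=ctx)
try:
    content = si.RetrieveContent()
    vms = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in vms.view if v.name == "vplex-vm01")  # hypothetical VM name
    # Add the advanced VMX option from the performance paper to the VM's extraConfig
    spec = vim.vm.ConfigSpec(extraConfig=[
        vim.option.OptionValue(key="extension.converttonew", value="FALSE")])
    vm.ReconfigVM_Task(spec)
    vms.Destroy()
finally:
    Disconnect(si)
```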

So to be clear, this setting should only be used in scenarios where high latency and a high number of virtual disks result in a long switchover time during migrations of VMs between hosts in a vMSC/VPLEX configuration. I hope that helps.

vSphere Metro Storage Cluster with vSphere 6.0 paper released

Duncan Epping · Jul 8, 2015 ·

I’d already blogged about this on the VMware blog, but I figured I would share it here as well. The vSphere Metro Storage Cluster with vSphere 6.0 white paper has been released. I worked on this paper together with my friend Lee Dilworth; it is an updated version of the paper we did in 2012. It contains all of the new best practices for vSphere 6.0 when it comes to vSphere Metro Storage Cluster implementations, so if you are looking to implement one or upgrade an existing environment, make sure to read it!

VMware vSphere Metro Storage Cluster Recommended Practices

VMware vSphere Metro Storage Cluster (vMSC) is a specific configuration within the VMware Hardware Compatibility List (HCL). These configurations are commonly referred to as stretched storage clusters or metro storage clusters and are implemented in environments where disaster and downtime avoidance is a key requirement. This best practices document was developed to provide additional insight and information for operation of a vMSC infrastructure in conjunction with VMware vSphere. This paper explains how vSphere handles specific failure scenarios, and it discusses various design considerations and operational procedures. For detailed information about storage implementations, refer to documentation provided by the appropriate VMware storage partner.

Conservative vs Aggressive for VMCP APD response

Duncan Epping · Jun 19, 2015 ·

I had just finished writing the vMSC 6.0 Best Practices paper, which is about to be released, when a question came in. The question was around the APD scenario and whether the response to an APD should be set to aggressive or conservative. It’s a good question and my instinct immediately says: conservative… But should it be configured that way in all cases? If so, why on earth do we even have an aggressive method? That got me thinking. (By the way, make sure to read this article by Matt Meyer on VMCP on the vSphere blog, good post!) But before I spill the beans, what is aggressive/conservative in this case, and what is this feature again?

VM Component Protection (VMCP) is new in 6.0 and allows vSphere to respond to a scenario where the host has lost access to a storage device (both PDL and APD). In previous releases vSphere was already capable of responding to PDL scenarios, but the settings weren’t really exposed in the UI; that has been addressed in 6.0, and the APD response has been added at the same time. Great feature if you ask me, especially in stretched environments as it will help during certain failure scenarios.
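As a side note, the APD and PDL responses can also be set programmatically. Below is a sketch using pyVmomi against the vSphere 6.0 API, assuming VMCP is exposed through the cluster’s HA (DAS) configuration as I recall it; the vCenter address, credentials and cluster name are placeholders, so treat it as illustrative rather than definitive:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.local",  # hypothetical vCenter
                  user="administrator@vsphere.local",
                  pwd="********",
                  sslContext=ctx)
try:
    content = si.RetrieveContent()
    clusters = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in clusters.view if c.name == "Stretched-Cluster")  # hypothetical
    # Cluster-wide default: conservative restart on APD, restart VMs on PDL
    vmcp = vim.cluster.VmComponentProtectionSettings(
        vmStorageProtectionForAPD="restartConservative",   # or "restartAggressive"
        vmStorageProtectionForPDL="restartAggressive")
    spec = vim.cluster.ConfigSpecEx(
        dasConfig=vim.cluster.DasConfigInfo(
            vmComponentProtecting="enabled",
            defaultVmSettings=vim.cluster.DasVmSettings(
                vmComponentProtectionSettings=vmcp)))
    cluster.ReconfigureComputeResource_Task(spec, modify=True)
    clusters.Destroy()
finally:
    Disconnect(si)
```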

