
Yellow Bricks

by Duncan Epping


VSAN enabling Sky to be fast / responsive / agile…

Duncan Epping · Nov 30, 2015 ·

Over the last couple of months I’ve been talking to a lot of VSAN customers. A while ago I had a very interesting use case with a deployment on an Oil Platform. This time it is a more traditional deployment: I had the pleasure of talking to James Cruickshank who works for Sky. Sky is Europe’s leading entertainment company, serving 21 million customers across five countries: Italy, Germany, Austria, the UK and Ireland.

James is part of Sky's virtualisation group, which primarily focuses on new technologies. In short, the team figures out whether a technology will benefit Sky, how it works, how to implement it and how to support it. He documents all of his findings, then develops and delivers the solution to the operations team.

One of the new products James is working with is Virtual SAN. The project started in March, and Sky has a handful of VSAN-ready clusters in each of its strategic data centres. These clusters currently run ESXi 5.5 hosts with one 400GB SSD and 4 x 4TB NL-SAS drives, all connected over 10GbE, a significant amount of capacity per host. The main reason for that capacity is Sky's requirement to run with FTT=2 (for those who don't know, this means that a 1TB disk will consume ~3TB). James anticipates VSAN 6 will be deployed with a view to delivering production workloads in Q1 2016.
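The FTT=2 capacity math above can be sketched with a quick back-of-the-envelope calculation. Note that `raw_capacity_needed` is a hypothetical helper, not a VMware tool, and it only models mirroring (witness components are ignored):

```python
def raw_capacity_needed(vm_disk_tb, ftt):
    """Raw VSAN capacity consumed by a mirrored object:
    FTT + 1 full replicas (witness components ignored)."""
    return vm_disk_tb * (ftt + 1)

# FTT=2 keeps three replicas, so a 1TB disk consumes ~3TB raw:
print(raw_capacity_needed(1, 2))  # 3
```

This is why 4 x 4TB of capacity per host is not as generous as it first sounds: only about a third of the raw space turns into usable capacity at FTT=2.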

We started talking about the workloads Sky had running and what some of the challenges were for James. I figured that, considering the size of the organisation and the number of workloads it has, getting all the details cannot have been easy. James confirmed that it was difficult to get an understanding of the IO profile and that he spent a lot of time developing representative workloads. He mentioned that when he started his trial the VSAN Assessment Tool wasn't available yet, and that it would have saved him a lot of time.

So what is Sky running? For now, mainly test/dev workloads. These clusters are used by developers for short-term usage, to test what they are building and then trash the environment, all of which is enabled through vRealize Automation: request one or more VMs, deploy on a VSAN cluster, done. So far in Sky's deployment all key stakeholders are pleased with the technology, as it is fast and responsive, and for the ops team in particular it is very easy to manage.

James mentioned that he has recently been testing both VSAN 5.5 and 6.0. He was so surprised by the performance increase that he re-ran his test multiple times, then had his colleagues do the same, while others reviewed the maths and the testing methodology. Each time they came to the same conclusion: there was a performance increase in excess of 60% between 5.5 and 6.0 (using a "real-world" IO profile), an amazing result.

My last question was around some of the challenges James faced. The first thing he said was that he felt the technology was fantastic. There were new considerations around the design/sizing of their VSAN hosts, the increased dependency on TCP/IP networks and the additional responsibilities for storage placed within the virtualisation operations team. There were also some minor technical challenges, but these were primarily from an operational perspective, and with vSphere / VSAN 5.5. In some cases he had to use RVC, which is a great tool, but as it is CLI-based it does have a steep learning curve. The Health Check plugin in 6.0 has definitely helped a lot to improve this.

Another thing James wanted to call out is that in the current VSAN host design Sky uses an SSD to boot ESXi, as VSAN hosts with more than 512GB RAM cannot boot from SD card. This means the company is sacrificing a disk slot which could otherwise have been used for capacity; it would prefer to boot from SD, if possible, to optimise the hardware configuration.

I guess it is safe to say that Sky is pleased with VSAN, and in the future the company plans to adopt a "VSAN first" policy for a proportion of its virtual estate. I want to thank Sky, and James in particular, for taking the time to talk to me about his experience with VSAN. It is great to get direct feedback and hear the positive stories from such a large company, and such an experienced engineer.

 

High latency VPLEX configuration and vMotion optimization

Duncan Epping · Jul 10, 2015 ·

This week someone asked me about an advanced setting to optimize vMotion for VPLEX configurations. This person referred to the vSphere 5.5 Performance Best Practices paper, and specifically the following section:

Add the VMX option (extension.converttonew = “FALSE”) to virtual machine’s .vmx files. This option optimizes the opening of virtual disks during virtual machine power-on and thereby reduces switch-over time during vMotion. While this option can also be used in other situations, it is particularly helpful on VPLEX Metro deployments.

I had personally never heard of this advanced setting, so I did some searches both internally and externally and couldn't find any references other than in the vSphere 5.5 Performance paper. Strange, as with a generic recommendation like the above you would expect it to be mentioned in at least one or two other spots. I reached out to one of the vMotion engineers and, after going back and forth, figured out what the setting is for and when it should be used.

During testing with VPLEX and VMs using dozens of VMDKs in a "high latency" situation, the switchover between hosts could take longer than expected. First of all, when I say "high latency" we are talking about close to the maximum tolerated for VPLEX, which is around 10ms RTT. When "extension.converttonew" is used, the amount of IO needed during the switchover is limited, and when each IO takes 10ms you can imagine that has a direct impact on the time it takes to switch over. Of course these enhancements were also tested in scenarios where there wasn't high latency, or where a low number of disks was used, and in those cases the benefits were negligible and the operational overhead of configuring this setting did not outweigh the benefits.
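To see why a 10ms RTT hurts, consider a rough model in which each disk-open IO during the switchover is serialized. The numbers and the `switchover_overhead_ms` helper below are illustrative assumptions, not measured VPLEX behaviour:

```python
def switchover_overhead_ms(num_ios, rtt_ms):
    # If each IO waits for the previous one to complete, every round
    # trip adds its full latency to the vMotion switch-over time.
    return num_ios * rtt_ms

# e.g. 50 serialized IOs at the ~10ms VPLEX Metro maximum RTT:
print(switchover_overhead_ms(50, 10))  # 500 ms of added switch-over time
```

At sub-millisecond local latency the same IO count adds almost nothing, which matches why the setting only pays off in the high-latency, many-VMDK case.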

So to be clear, this setting should only be used in scenarios where high latency and a high number of virtual disks result in a long switchover time during migrations of VMs between hosts in a vMSC/VPLEX configuration. I hope that helps.

Virtual SAN enabling PeaSoup to simplify cloud

Duncan Epping · Jun 25, 2015 ·

This week I had the pleasure of talking to fellow Dutchman Harold Buter, the CTO of PeaSoup. We had a lively discussion about Virtual SAN and why PeaSoup decided to incorporate it in their architecture. What struck me was that PeaSoup Hosting was brought to life partly as a result of the Virtual SAN release: when we introduced Virtual SAN, Harold and his co-founder realized this was a unique opportunity to build something from the ground up while avoiding the big upfront costs typically associated with legacy arrays. How awesome is that, a new product that leads to new ideas and, in the end, a new company and product offering.

The conversation of course didn't end there, so let's get into some more details. We discussed the use case first. PeaSoup is a hosting / cloud provider. Today they have two clusters running based on Virtual SAN: a management cluster which hosts all components needed for a vCloud Director environment, and a resource cluster. The great thing for PeaSoup was that they could start out with a relatively low investment in hardware and scale fast when new customers come on board or existing customers require new hardware.

Talking about hardware: PeaSoup looked at many different configurations and vendors and for their compute platform decided to go with Fujitsu RX300 rack-mount servers. Harold mentioned that these were by far the best choice for them in terms of price, build quality and service. Personally it surprised me that Fujitsu came out as the cheapest option; it didn't surprise me that Fujitsu's service and build quality were excellent, though. Spec-wise, the servers have 800GB SSDs, 7200 RPM NL-SAS disks, 256GB of memory and of course two CPUs (Intel 2620 v2 – 6 core).

Harold pointed out that the only downside of this particular Fujitsu configuration was that it came with a disk controller limited to RAID-0 only, with no passthrough. I asked him if they had experienced any issues around that; he mentioned that they have had one disk failure so far, and that it resulted in having to reboot the server in order to recreate a RAID-0 set for the new disk. Not too big a deal for PeaSoup, but of course he would prefer to avoid the need for this reboot if possible. The disk controller, by the way, is based on the LSI 2208 chipset, and it is one of the things PeaSoup was very thorough about, making sure it was supported and that it had a high queue depth. The HCL came up multiple times during the conversation, and Harold felt that although doing a lot of research up front and creating a scalable and repeatable architecture takes time, it also results in a very reliable environment with predictable performance. For a cloud provider, reliability and user experience are literally your bread and butter; they couldn't afford to "guess". That was also one of the reasons they selected a VSAN Ready Node configuration as a foundation and tweaked it where their environment and anticipated workload required.

Key take away: RAID-0 works perfectly fine during normal usage; only when disks need to be replaced is a slightly different operational process required.

Anticipated is a keyword once again, as it has been in many of the conversations I've had before: it is often unknown what kind of workloads will run on top of these infrastructures, which means you need to be flexible in terms of scaling up versus scaling out. Virtual SAN provides just that to PeaSoup. We also spoke about the networking aspect. For a cloud provider running vCloud Director and Virtual SAN, networking is a big part of the overall architecture, so I was interested to know what kind of switching hardware was being used. PeaSoup uses Huawei 10GbE switches (CE6850), and each server is connected to them with at least 4 x 10GbE ports. PeaSoup dedicates two of these ports to Virtual SAN, which wasn't a requirement from a load perspective (or from VMware's point of view), but they preferred this level of redundancy and performance while leaving plenty of room to grow. Resiliency and future-proofing are key for PeaSoup. Price versus quality was also a big factor in the decision to go with Huawei; in this case Huawei had the best price/quality ratio.

Key take away: It is worth exploring different network vendors and switch models. Prices vary greatly between vendors and models, which could lead to substantial cost savings without impacting service or quality.

Their host and networking configuration is well documented and can easily be repeated when more resources are needed. They even have discounts / pricing documented with their suppliers, so they can quickly assess what is needed, when, and what it will cost. I also asked Harold if they were offering different storage profiles to give their customers a choice in terms of performance and resiliency. So far they offer two policies to their customers:

  • Failures to tolerate = 1  //  Stripe Width = 2
  • Failures to tolerate = 1  //  Stripe Width = 4
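As a rough sketch of what these two policies consume, note that the stripe width splits each replica into more components but does not change the capacity footprint. The `consumed` helper below is a simplified illustration (witness components and metadata overhead are ignored):

```python
def consumed(vmdk_gb, ftt, stripe_width):
    # Mirroring keeps FTT + 1 copies; striping splits each copy
    # into stripe_width components without adding capacity.
    copies = ftt + 1
    total_gb = vmdk_gb * copies
    total_components = copies * stripe_width
    return total_gb, total_components

# A 100GB VMDK under PeaSoup's two policies:
print(consumed(100, 1, 2))  # (200, 4)
print(consumed(100, 1, 4))  # (200, 8)
```

So both offerings cost PeaSoup the same raw capacity; the wider stripe simply spreads the IO over more disks.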

So far it appears that not many customers are asking about higher availability; they recently had their first request, and it looks like the offering will include "FTT=2" alongside "SW=2 / 4" in the near future. On the topic of customers, they mentioned they have a variety of customers using the platform, ranging from companies in the business of media conversion and law firms to a company which sells "virtual private servers" on their platform.

Before we wrapped up I asked Harold what their biggest challenge with Virtual SAN was. Harold mentioned that although they were a very early adopter and use it in combination with vCloud Director, they have had no substantial problems. What may have been most challenging in the first months was figuring out the operational processes around monitoring. PeaSoup is a happy Veeam customer and decided to use Veeam ONE to monitor Virtual SAN for now, but in the future they will also look at the vR Ops Virtual SAN management pack, and potentially create some custom dashboards in combination with Log Insight.

Key take away: Virtual SAN is not like a traditional SAN, new operational processes and tooling may be required.

PeaSoup is an official reference customer for Virtual SAN, by the way; you can find the official video below and the slide deck of their PEX session here.

Synchronet leverages Virtual SAN to provide scale, agility and reduced costs to their customers

Duncan Epping · Jun 11, 2015 ·

This week I had the pleasure of talking to John Nicholson, who works for one of our partners (Synchronet, out of Houston). John has been involved with various Virtual SAN implementations and designs, and I felt it would make for an interesting conversation. John, in my opinion, is a true datacenter architect: he has a good understanding of all aspects and definitely has a lot of experience with different storage platforms (both traditional and hyper-converged). Something I did not mention during our conversation, but the answers John gave to some of my questions were most definitely VCDX-level. (If you can find the time, do it John :-)) Below is John's bio; make sure to follow him on Twitter:

John Nicholson, vExpert (2013-2015), is the manager of client services for Synchronet. He oversees the professional services team, which deploys cutting-edge virtualization, VDI and storage solutions for customers, as well as the managed services team, which keeps these environments running smoothly. He enjoys a deep dive into the syslog, and can telepathically sense slow and undersized storage.

The first customer / project we discussed was a Virtual SAN environment for a construction company. The environment was built on top of Dell R720s, each node with 400GB of flash capacity and 7 x 1.2TB 10K RPM drives. Both MS SQL and Exchange run on top of Virtual SAN in this environment. The SQL database is used by ~1000 customers as part of a real-time bidding and tracking solution; as you can imagine, reliability and predictable performance are key. Also hosted on Virtual SAN are their ERP system and the development environment for their end-customer applications.

What was interesting about this particular project is that there were some strange performance anomalies. As you can imagine, Virtual SAN, being a new product, was a suspect, but after troubleshooting the environment they found a driver/firmware mismatch for the 10GbE Intel NICs they were using. Further investigation revealed that all types of traffic were impacted. John wrote about it on their corporate blog here; worth reading if you are using the Intel X540 10GbE NICs.

Key take away: Always verify driver / firmware combination and compatibility as it can have an impact.

What pleased John and the customer the most is probably the performance Virtual SAN is providing, especially when it comes to latency, or should I say the lack of latency, as they are hitting sub-millisecond numbers. They've been so happy running Virtual SAN that they've just purchased new hosts, and a DR site with VSAN is being implemented this week. The DR site will be used at first to test VSAN 6.0; once it has proven stable and reliable, the production environment will be upgraded to 6.0 and the DR site will be configured for DR purposes leveraging vSphere Replication. I asked John how they went about advising the customer to adopt an asynchronous replication technology, and he mentioned that as part of their advisory/consultancy services they have business analysts on board who assess the cost of downtime, map out the cost of availability, and decide on a solution based on that outcome. The same applies to deduplication, by the way: what is the price of disk, what is the dedupe ratio, does it make sense in your environment?

While discussing this project John mentioned that he has worked with customers in the past which had two or three IT folks, one of whom was a dedicated storage admin, primarily because of the complexity of the environment and the storage system. In today's world, with solutions like Virtual SAN, that isn't needed any longer, and the focus of IT people should be enabling the business.

During our discussion about networking, John mentioned that Synchronet has a long history with IP-based storage solutions (primarily iSCSI), and based on their experience top-grade switches were an absolute must when deploying these types of storage. While talking to some of the Virtual SAN engineers, John asked how Virtual SAN would handle switches with a lower PPS (packets per second) rating. The Virtual SAN team mentioned that VSAN was less prone to the common issues faced in iSCSI/NFS environments; John, being the techie that he is, was of course skeptical and wanted to test this for himself. The results were published in this white paper; fair to say that John and his team were impressed with how Virtual SAN handled itself, even with relatively cheap switches. For the majority of Virtual SAN deployments, their typical customer setup leverages two VMkernel interfaces, each connected to a different switch, so that traffic isn't going outside of the switch. This is what it looks like, for those interested:

Host 1 / NIC-A —> Switch-A
Host 2 / NIC-A —> Switch-A
Host 3 / NIC-A —> Switch-A
Host 1 / NIC-B —> Switch-B
Host 2 / NIC-B —> Switch-B
Host 3 / NIC-B —> Switch-B

The second project John mentioned was for a software startup in the healthcare space which has been doing a lot of mergers and acquisitions. Initially they wanted to get 6 VMs up and running, but with the ability to scale up and scale out when needed. I found the "scale-up" comment interesting, so I asked what John was referring to. In this scenario the server configuration used was a SuperMicro FatTwin, initially deployed with 3 hosts, each using a single socket, 800GB of flash capacity and two NL-SAS drives. As the company grew, the number of virtual machines of course increased, and they have over 70 VMs running currently, achieved simply by adding an additional CPU to each box and adding some drives. The question I had was: what about flash capacity compared to disk capacity? John said they started out with flash overprovisioned, simply to allow them to scale up when required. Especially in the merger and acquisition space, where the growth pattern is unknown, this is a huge advantage of a solution like Virtual SAN, which allows you to both scale out and scale up when required. Compared to traditional storage systems this model worked very well, as they avoided the huge upfront cost (50K–100K USD is not uncommon). On top of that, John said that with the majority of storage systems a big discount is given during the initial purchase, but when it is time to add a disk shelf that discount has magically disappeared. Also, with traditional storage systems you can fairly easily reach the limits of a storage controller and be stuck with a system which can't scale to the size you need. Another problem that disappears when leveraging VSAN.
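The flash-overprovisioning approach can be sketched using VMware's commonly cited rule of thumb of sizing the cache tier at roughly 10% of anticipated consumed capacity; sizing against the projected rather than the current footprint is what leaves room to scale up later. The helper below is an illustration, not an official sizing tool:

```python
def flash_target_gb(projected_consumed_gb, cache_ratio=0.10):
    # Rule of thumb: flash cache ~10% of the *anticipated* consumed
    # capacity, so a later scale-up doesn't leave the cache undersized.
    return projected_consumed_gb * cache_ratio

# Sizing for a projected 8TB consumed footprint from day one:
print(flash_target_gb(8000))  # 800.0 GB of flash per host group
```

Buying the flash for the projected footprint up front is what let the startup later add CPUs and capacity drives without touching the cache tier.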

Key take away: Large upfront costs can be avoided while offering flexibility in terms of scaling and sizing

Synchronet isn’t just an implementation partner by the way, they also do managed services and one of the things they are doing for instance is monitor customer environments leveraging Log Insight. This includes monitoring Virtual SAN, and they’ve created custom dashboard so that they can respond to  issues like for instance when a snapshot removal has failed and solve the problem before an issue arises as a result of it. They can go as far as monitoring the raw syslog feeds if needed, but each time a problem occurs in any environment this is recorded and custom dashboards and warnings are created so that every customer immediately benefits from it. For some customers they even do full management of the vSphere environment.

We had some small talk about VDI. John mentioned that VSAN is great for PoCs and small test environments because it is easy to get in there, use it, and then grow it as soon as the PoC / test has completed. The price-per-desktop licensing is especially handy as it keeps the cost down initially, and at the same time the customer knows what they are paying and getting. From an architectural point of view, John mentioned that the majority of their customers use non-persistent desktops, and as such the Virtual SAN environment looks different from the traditional server VM environments: typically less disk capacity and more flash capacity to ensure performance.

Before we wrapped up there was one thing I was interested in knowing: whether they tweaked any of the Virtual SAN related settings (within the storage policy or, for instance, advanced settings). John mentioned that they would tweak the number of stripes per VM from 1 to 3 by default. This is primarily to speed up backups with Virtual SAN 5.5; preliminary tests are showing, though, that with the new snapshotting mechanism in 6.0 this isn't needed any longer. While talking about striping, John also mentioned that for their hosting services the one thing that stood out to him is that Virtual SAN was performing so well that customers paying for a lower tier of storage were actually getting a lot more storage performance than they paid for, and the storage policies were used to ensure that a tier 2 VM wouldn't receive more resources than a tier 1 VM. A pretty neat problem to have, I guess.

Key take away: Increasing stripe-width with Virtual SAN 5.5 can have a positive impact on performance. With 6.0 this appears no longer needed.

The last thing John wanted to mention was the VIP Tool (https://vip.vmware.com/). He said it helped them immensely in figuring out how much data was active and in designing / sizing Virtual SAN environments for customers. I think it is fair to say that John (and Synchronet) have had huge success introducing Virtual SAN to their customers and deploying it where applicable. Thanks John for taking the time, and thanks for being a great VMware and Virtual SAN advocate!

VSAN and large VMDKs on relatively small disks?

Duncan Epping · Jun 4, 2015 ·

Last week and this week I received the same question, and as it was the second time in a short period I figured I would share it. The question was around how VSAN places a VMDK which is larger than the physical disks. Let's look at a diagram first, as that makes it obvious instantly.

If you look at the diagram you see the stripes. You can define the number of stripes in a policy if you want; in the example above, the stripe width is 2. This is not the only time you can see objects being striped, though. If an object (a VMDK, for instance) is larger than 256GB, multiple stripes will be created for it. Also, if a physical disk is smaller than the VMDK, multiple stripes will be created for that VMDK. These stripes can be located on the same host, as you can see in the diagram, but can also be spread across hosts. Pretty cool, right?
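The splitting logic described above can be sketched as follows. `components_per_mirror` is a simplified model using the 256GB figure from the text; it ignores witness components and the further splitting that small physical disks can force:

```python
import math

MAX_COMPONENT_GB = 256  # per-component size cap referenced above

def components_per_mirror(vmdk_gb, stripe_width):
    # An object is split by the policy's stripe width, and further
    # whenever a single component would exceed the size cap.
    size_splits = math.ceil(vmdk_gb / MAX_COMPONENT_GB)
    return max(stripe_width, size_splits)

print(components_per_mirror(1024, 2))  # 4 -> a 1TB VMDK yields 4 stripes
print(components_per_mirror(100, 2))   # 2 -> the policy stripe width dominates
```

This is why a VMDK larger than any single physical disk is never a problem: VSAN simply creates enough components to fit the pieces wherever there is room.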



About the author

Duncan Epping is a Chief Technologist in the Office of CTO of the HCI BU at VMware. He is a VCDX (# 007) and the author of multiple books including "vSAN Deep Dive" and the “vSphere Clustering Technical Deep Dive” series.

