
Yellow Bricks

by Duncan Epping


Server

Designing a Virtual SAN stretched cluster

Duncan Epping · Sep 23, 2015 ·

There is a lot of material on stretched clusters already out there, but somehow it seems that it hasn’t reached everyone yet. Over the last couple of weeks I have spent a lot of time on the phone with customers in various countries/regions talking about designing a Virtual SAN stretched cluster. In this post I want to collect some design considerations for your Virtual SAN stretched clusters and provide pointers to articles and white papers that can help you get a better understanding of the solution. If any additional considerations come up in the conversations I still have planned, I will add them to this article, so it will be very much a “work in progress”.

First and foremost, a stretched cluster isn’t something you implement “just because you can”. It is a solution which is usually implemented when there is a strong desire to avoid a disaster or to recover from one extremely fast. Customers I talk with are usually mid-size and up, and typically provide some form of 24×7 service. As an example, I have one customer who runs (mission) critical workloads and another who hosts websites with high uptime requirements (government services) on their stretched clusters. In both cases downtime is not acceptable from a business perspective, and unfortunately not all applications provide the level of availability required, which means it needs to be solved at a different layer.

First thing that needs to be looked at is the network. From a Virtual SAN perspective there are clear requirements:

  • 5ms RTT latency max between data sites
  • 200ms RTT latency max between data and witness site
  • Both L3 and L2 are supported between the data sites
    • 10Gbps bandwidth is recommended; depending on the number of VMs this could be lower or higher. More guidance will be provided soon around this!
  • L3 is expected between data and the witness sites
    • 100Mbps bandwidth is recommended; depending on the number of VMs this could be lower or higher. More guidance will be provided soon around this!

The first thing I need to call out: if you do L3 between the data sites, note that you will need some form of multicast routing. L3 from the data sites to the witness site doesn’t need this; witness traffic doesn’t use multicast, as that requirement has been removed, which simplifies the design. Latency requirements are strict: 5ms (data/data) and 200ms (data/witness) maximum. I think it is important to say that network latency will impact your storage performance. Each write needs to be replicated to the other site, which means that each write could take 5ms if that is your RTT. Yes, this can be a killer, but keep in mind that all writes always go to SSD first, so the network is going to be the challenge. The lower the latency the better. Also note that this only applies to writes; reads are served locally, because along with the Stretched Cluster functionality VSAN also introduced “site locality” to avoid those network hops for reads. Some may say: who in their right mind is going to incur a 5ms or 3ms network latency for every IO? Well, I guess that fully depends on what your business requirements are.
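To make the requirements above a bit more tangible, here is a minimal Python sketch that checks a set of measured link values against the limits mentioned in this post. The class and function names are hypothetical and purely illustrative (this is not a VMware API), and the write latency model is a deliberately rough assumption: one round trip plus the SSD write.

```python
from dataclasses import dataclass
from typing import List

# Limits quoted in the post (round-trip time, in milliseconds).
MAX_DATA_RTT_MS = 5.0
MAX_WITNESS_RTT_MS = 200.0


@dataclass
class StretchedLinks:
    """Hypothetical description of the inter-site links (not a VMware API)."""
    data_rtt_ms: float       # measured RTT between the two data sites
    witness_rtt_ms: float    # measured RTT between a data site and the witness site
    data_link_is_l3: bool    # True if the data sites are connected over L3
    multicast_routed: bool   # True if multicast routing exists between the data sites


def check_links(links: StretchedLinks) -> List[str]:
    """Return the violated requirements; an empty list means all checks pass."""
    problems = []
    if links.data_rtt_ms > MAX_DATA_RTT_MS:
        problems.append("data/data RTT exceeds 5ms")
    if links.witness_rtt_ms > MAX_WITNESS_RTT_MS:
        problems.append("data/witness RTT exceeds 200ms")
    # Multicast routing only matters when the data sites are connected over L3.
    if links.data_link_is_l3 and not links.multicast_routed:
        problems.append("L3 between data sites needs multicast routing")
    return problems


def write_latency_floor_ms(rtt_ms: float, ssd_write_ms: float) -> float:
    """Rough model (assumption): a replicated write costs one RTT plus the SSD write."""
    return rtt_ms + ssd_write_ms


if __name__ == "__main__":
    links = StretchedLinks(data_rtt_ms=3.0, witness_rtt_ms=150.0,
                           data_link_is_l3=True, multicast_routed=True)
    print(check_links(links))
    print(write_latency_floor_ms(links.data_rtt_ms, ssd_write_ms=0.5))
```

The point of the second function is simply that the RTT dominates: whatever your SSDs do, the inter-site round trip is added on top of every write.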

I already mentioned witness requirements, but what about the witness itself? I was talking to a customer last week who was planning on deploying a brand new physical host to serve as the witness. There is no need for that; you can just deploy the VSAN witness appliance on an existing host. That witness comes with all the licenses included, so there is no vSphere/VSAN cost. If you have no third site, then I think it is good to know that we are working on certifying the use of the witness in vCloud Air. That would be an easy and cost-effective way of having a witness in a third site without needing, and having to manage, a third site of your own.

Then there are compute and storage, of course. What do you need from that point of view? Well, first of all you will have to buy a lot more hardware than you would normally need when running in a single location. You will need extra CPU and memory resources to ensure VMs can be restarted when a full site has failed. Yes, HA Admission Control will help with that, but you also need to plan for it, which is something not everyone always realizes. I guess it is a discussion that you will need to have with the business: does performance need to be the same before and after a failure? If yes, then make sure you have sufficient capacity to tolerate a 50% loss.
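As a quick sanity check for that conversation, the sketch below tests whether the smaller of the two sites could carry the full workload on its own. It is a simplified illustration with hypothetical function names: it treats capacity and demand as a single number (for example GHz of CPU or GB of memory) and ignores reservations, overhead and overcommitment.

```python
def survives_site_failure(site_a_capacity: float,
                          site_b_capacity: float,
                          total_demand: float) -> bool:
    """True if the smaller site alone can carry the whole workload."""
    return min(site_a_capacity, site_b_capacity) >= total_demand


def utilisation_after_failure(site_a_capacity: float,
                              site_b_capacity: float,
                              total_demand: float) -> float:
    """Worst-case utilisation (1.0 == 100%) of the surviving site."""
    return total_demand / min(site_a_capacity, site_b_capacity)


if __name__ == "__main__":
    # Example numbers are made up: two sites with 200 GHz each, 180 GHz of demand.
    print(survives_site_failure(200.0, 200.0, 180.0))
    print(utilisation_after_failure(200.0, 200.0, 180.0))
```

Run it once for CPU and once for memory; if either comes back above 1.0, performance will not be the same after a full site failure.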

From a storage perspective the VSAN Stretched Cluster is based on FTT=1. This means that if you have a 10GB VMDK, 10GB is stored in the first site and another 10GB is stored in the second site, for a total of 20GB. Of course there is also the swap file for a VM and some overhead, but that is relatively simple to calculate. Just remember, per VM: (Average VM disk capacity + Swap) * 2. I usually add 10% slack space and another 10 to 20% for snapshots depending on the usage, and I would recommend adding room for growth. Another thing to remember is the limit of 200 VMs per host with this version of VSAN. Keep in mind that you want to tolerate a full site failure, so you will want to make sure that all VMs can run on the remaining site in a supported manner.
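The rule of thumb above can be sketched in a few lines of Python. The function names and example numbers are hypothetical, and I am assuming here that the slack, snapshot and growth percentages are all added on top of the mirrored base capacity; adjust this to however you normally size.

```python
import math

MAX_VMS_PER_HOST = 200  # limit mentioned in the post for this version of VSAN


def required_capacity_gb(vm_count: int, avg_disk_gb: float, avg_swap_gb: float,
                         slack_pct: int = 10, snapshot_pct: int = 20,
                         growth_pct: int = 0) -> float:
    """(Average VM disk capacity + Swap) * 2 per VM, plus percentage-based extras."""
    base = vm_count * (avg_disk_gb + avg_swap_gb) * 2  # FTT=1: one copy per site
    return base * (100 + slack_pct + snapshot_pct + growth_pct) / 100


def min_hosts_per_site(vm_count: int) -> int:
    """Hosts each site needs so that all VMs fit after a full site failure."""
    return math.ceil(vm_count / MAX_VMS_PER_HOST)


if __name__ == "__main__":
    # Made-up example: 100 VMs with 50GB of disk and 4GB of swap each.
    print(required_capacity_gb(100, 50, 4))
    print(min_hosts_per_site(1000))
```

Note that the host count only covers the per-host VM limit; CPU and memory may well require more hosts than this.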

When it comes to HA and DRS the configuration is pretty straightforward and has been described in-depth by both Cormac and myself. There are a couple of things I want to point out in this article, as they are configuration details which are easy to forget about.

  • Make sure to specify additional isolation addresses, one in each site (das.isolationAddress0 and das.isolationAddress1).
  • Disable the default isolation address if it can’t be used to validate the state of the environment during a partition (if the gateway isn’t available in both sites).
  • Disable datastore heartbeating; without traditional external storage there is no reason to have this.
  • Enable HA Admission Control and make sure it is set to 50% for CPU and Memory.
  • Keep VMs local by creating “VM/Host” should rules.
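For the isolation address settings in the list above, here is a small sketch that builds the HA advanced options as a plain Python dictionary. The function name is hypothetical and the IP addresses are placeholders from the documentation ranges; das.usedefaultisolationaddress is the option name for disabling the default isolation address as I recall it from the vSphere HA documentation, so verify it against the documentation for your version before applying anything.

```python
import ipaddress
from typing import Dict


def ha_isolation_options(site_a_address: str, site_b_address: str,
                         default_gateway_usable: bool) -> Dict[str, str]:
    """Build HA advanced options with one isolation address per site."""
    # ip_address() raises ValueError when a string is not a valid IP address.
    for address in (site_a_address, site_b_address):
        ipaddress.ip_address(address)

    options = {
        "das.isolationAddress0": site_a_address,
        "das.isolationAddress1": site_b_address,
    }
    if not default_gateway_usable:
        # Assumed option name; check the vSphere HA documentation for your version.
        options["das.usedefaultisolationaddress"] = "false"
    return options


if __name__ == "__main__":
    # Placeholder addresses, one pingable address in each data site.
    for key, value in ha_isolation_options("192.0.2.10", "198.51.100.10",
                                           default_gateway_usable=False).items():
        print(f"{key} = {value}")
```

Admission control, datastore heartbeating and the VM/Host rules are cluster settings rather than advanced options, so they are left out of this sketch.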

And I think that covers most of it. It is summarized relatively briefly compared to the excellent document Cormac developed, which has all the details you could wish for. Make sure to read that if you want to know every aspect.

Awesome VMworld 2015 sessions available for watching…

Duncan Epping · Sep 19, 2015 ·

I was watching a couple of VMworld sessions and I noticed the relatively low view count on them. Some of these are real gems, and these are the sessions which I have already watched multiple times. You can find more sessions on YouTube in this playlist.

Ken Werneburg and Patrick Dirks talking VVols. This is a deepdive, and I have watched it twice so far… Great talk!

Richard McDougall talking about the Future of Software Defined Storage. I have seen this talk evolve over the last 6 months, great insights!

You like listening to deep tech chats? Fei Guo and Seong Beom Kim will take it to a whole new level.

vSAN licensing / packaging

Duncan Epping · Sep 14, 2015 ·

I’ve seen many questions on vSAN packaging over the last months so I figured I would share a table that shows what is possible with which license. A lot of the confusion is around the “ROBO” use case, and I want to make it crystal clear that you can deploy a 2-node ROBO configuration using Standard, Advanced or the special “vSAN for ROBO” 25VM pack that will be made available. Anyway, when it comes to functionality the table below should make it crystal clear what is included with what.

Before anyone asks, “stretched clusters” refers to the vSAN stretched cluster workflow / feature. Two data center rooms in the same building leveraging external witness capabilities through the stretched cluster workflow requires “Advanced”. Three datacenters stretched across campus distance using “fault domains” does not require Advanced, but can use Standard.

Also note that “vSAN Advanced” is included in the “Horizon Advanced” and the “Horizon Enterprise” Suites. If you have either of those, I highly recommend testing vSAN. I am seeing more and more customers taking advantage of it: a great storage platform which performs extremely well and is really simple to manage is included in your suite, so why not use it?!

The below table shows what the current licensing/packaging looks like for vSAN 6.6. Note that for vSAN 6.5 “all-flash” is now available in all licensing levels. In vSAN 6.6 “QoS” has been dropped down to Standard, and “Local Site Protection for Stretched Clusters” and “vSAN Encryption” have been added to Enterprise. For pricing, please contact your partner or a VMware sales rep.

Feature                                       vSAN Standard  vSAN Advanced  vSAN Enterprise  vSAN for ROBO Standard  vSAN for ROBO Advanced
SPBM                                          X              X              X                X                       X
Read/Write SSD Caching                        X              X              X                X                       X
Distributed RAID                              X              X              X                X                       X
Distributed Switch                            X              X              X                X                       X
Snapshots / Clones                            X              X              X                X                       X
Rack Awareness                                X              X              X                X                       X
Health Monitoring                             X              X              X                X                       X
vSphere Replication *                         X              X              X                X                       X
Two Node ROBO Configuration                   X              X              X                X                       X
Two Node Direct Connect                       X              X              X                X                       X
All-Flash                                     X              X              X                X                       X
Quality of Service                            X              X              X                X                       X
Dedupe and Compression                        -              X              X                -                       X
RAID-5/6                                      -              X              X                -                       X
Stretched Cluster                             -              -              X                -                       -
Local Site Protection for Stretched Clusters  -              -              X                -                       -
vSAN Encryption                               -              -              X                -                       -

* vSphere Replication is new with a 5 minute RPO; this was exclusively certified for vSAN. In some material you will see this being referred to as vSAN Replication.

Full licensing white paper can be found here.

Interview at the VMware theatre during VMworld

Duncan Epping · Sep 12, 2015 ·

I had the honour and pleasure of being asked for a “rock star” interview at the VMware Theatre at VMworld in the Solutions Exchange. I am not a big fan of the term “rock star” as I am not Dave Grohl or Eddie Vedder, just one of the geeks. Nevertheless it was a lot of fun to participate in this. The interview is more about me, how I got started, where I am today etc. If you are interested, it is a short 15-minute video…

There are two more I want to share, as I very much enjoyed watching them. Yanbing who is also part of our BU and was part of the VMworld keynote, and William “the automation robot/guru” Lam (no need to introduce him)… Enjoy watching, I think it is always nice to learn more about the person behind the character 🙂

Virtual SAN 6.1 available today!

Duncan Epping · Sep 10, 2015 ·

What more do I need to say? vSphere 6.0 U1 was released today and it ships with Virtual SAN 6.1. By now you’ve all seen my posts on what’s new for VSAN 6.1 and you’ve hopefully seen the demo we created for stretched clustering. If you want to play with 6.1 yourself then you can find it here:

  • VSAN 6.1 Product download page
  • VSAN 6.1 Release Notes
  • VSAN 6.1 Administration Guide


About the Author

Duncan Epping is a Chief Technologist and Distinguished Engineering Architect at Broadcom. Besides writing on Yellow-Bricks, Duncan is the co-author of the vSAN Deep Dive and the vSphere Clustering Deep Dive book series. Duncan is also the host of the Unexplored Territory Podcast.


Copyright Yellow-Bricks.com © 2026 · Log in