Be careful when defining a VM storage policy for VSAN

I was defining a VM storage policy for VSAN and it resulted in something unexpected. You might have read that when no policy is defined within vCenter, VSAN defaults to the following for availability reasons:

  • Failures to tolerate = 1

So I figured I would define a new policy and include “stripe width” in this policy. I wanted to have a stripe width of 2 and “failures to tolerate” set to the default of 1. I figured that as “failures to tolerate” is set to 1 by default anyway, I would not specify it and would just specify stripe width. Why add rules which already have the correct value, right?

VM storage policy for VSAN

Well that is what I figured, no point in adding it… and this was the result:

Do you notice something in the above screenshot? I do… I see no “RAID 1” mentioned and all components reside on the same host, esx014 in this case. So what does that mean? It means that when you create a profile and do not specify “failures to tolerate”, it defaults to 0 and no mirror copies are created. This is not the situation you want to find yourself in! So when you define stripe width, make sure you also define “failures to tolerate”. Even better, when you create a VM Storage Policy always include “failures to tolerate”. Below is an example of what my policy should have looked like.

VM storage policy for VSAN

So remember this: When defining a new VSAN VM Storage Policy always include “Number of failures to tolerate”! If you did forget to specify it, the nice thing here is that you can change VM Storage Policies on the fly and apply them directly to your VMs. Cormac has a nice article on this subject!
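To make the pitfall concrete, here is a minimal Python sketch of the behavior I observed. It is not VSAN code: the rule names `failures_to_tolerate` and `stripe_width` and both helper functions are hypothetical, and the fallback values simply mirror what is described above (a custom policy without “failures to tolerate” ends up with 0, and the default stripe width is 1).

```python
# Hypothetical rule names, used only for this illustration.
FTT = "failures_to_tolerate"
STRIPE = "stripe_width"


def effective_policy(rules):
    """Return the values a user-defined policy ends up with.

    Models the behavior observed in this post: a rule that is left out of
    a custom policy does NOT inherit the cluster default of FTT=1.
    """
    return {
        FTT: rules.get(FTT, 0),       # missing FTT behaves as 0: no mirror copies
        STRIPE: rules.get(STRIPE, 1),  # missing stripe width behaves as 1
    }


def policy_warnings(rules):
    """List reasons why a policy would leave a VM without mirror copies."""
    warnings = []
    if FTT not in rules:
        warnings.append("failures to tolerate not specified: defaults to 0")
    elif rules[FTT] == 0:
        warnings.append("failures to tolerate explicitly set to 0")
    return warnings


my_first_attempt = {STRIPE: 2}
what_i_wanted = {STRIPE: 2, FTT: 1}

print(effective_policy(my_first_attempt), policy_warnings(my_first_attempt))
print(effective_policy(what_i_wanted), policy_warnings(what_i_wanted))
```

The first policy is the one I created by accident, the second is the one I should have created.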

Isolation / Partition scenario with VSAN cluster, how is this handled?

After explaining how a disk or host failure works in a VSAN cluster, it only made sense to take the next step… How are isolations or partitions in a Virtual SAN cluster handled? Let's start at the beginning, and I am going to try to keep it simple. First, a recap of what we learned in the disk/host failures article.

Virtual SAN (VSAN) has the ability to create mirrors of objects. This ability is defined within a policy (VM Storage Policy aka Storage Policy Based Management). You can set an option called “failures to tolerate” anywhere between 0 and 3 at the moment. By default this option is set to 1. This means you will have two copies of your data. On top of that VSAN will need a witness / quorum to help figure out who takes ownership in the case of an event. So what does this look like? Note that in the below diagram I used the terms “vmdk” and “witness” to simplify things; in reality this could be any type of component of a VM.

So what did we learn from this (hopefully) simple diagram?

  • A VM does not necessarily have to run on the same host as where its storage objects are sitting
  • The witness lives on a different host than the components it is associated with in order to create an odd number of hosts involved for tiebreaking under a network partition
  • The VSAN network is used for communication, IO and HA

Let's recap some of the HA changes for a VSAN cluster first before we dive into the details:

  • When HA is turned on in the cluster, FDM agent (HA) traffic uses the VSAN network and not the Management Network. However, when a potential isolation is detected HA will ping the default gateway (or specified isolation address) using the Management Network.
  • When enabling VSAN ensure vSphere HA is disabled. You cannot enable VSAN when HA is already configured. Either configure VSAN during the creation of the cluster or disable vSphere HA temporarily when configuring VSAN.
  • When there are only VSAN datastores available within a cluster then Datastore Heartbeating is disabled. HA will never use a VSAN datastore for heartbeating. As the VSAN network is already used for network heartbeating, using the datastore for heartbeating would not add anything.
  • When changes are made to the VSAN network it is required to re-configure vSphere HA!

As you can see the VSAN network plays a big role here, and an even bigger one than you might realize as it is also used by HA for network heartbeating. So what if the host on which the VM is running gets isolated from the rest of the network? The following would happen:

  • HA will detect there are no network heartbeats received from “esxi-01”
  • HA master will try to ping the slave “esxi-01”
  • HA will declare the slave “esxi-01” unavailable
  • VM will be restarted on one of the other hosts… “esxi-02” in this case, but that could be any host, as depicted in the diagram below

Simple, right? Before I forget, for these scenarios it is important to ensure that your isolation response is set to power-off. But I guess the question now arises… what if “esxi-01” and “esxi-02” were part of the same partition? What happens then? Well, that is where the witness comes into play. Let me show the diagram first, as that will make it a bit easier to understand!

Now this scenario is slightly more complex. There are two partitions: one of the partitions is running the VM with its VMDK and the other partition has a VMDK and a witness. Guess what happens? Right, VSAN uses the witness to see which partition has quorum and based on that fact one of the two will win. In this case Partition-2 has more than 50% of the components of this object and as such is the winner. This means that the VM will be restarted on either “esxi-03” or “esxi-04” by HA. Note that the VM in Partition-1 will not be powered off, even if you have configured the isolation response to do so, as the hosts in this partition would re-elect a master and would still be able to see each other!

But what if “esxi-01” and “esxi-04” were isolated, what would happen then? This is what it would look like:

Remember that rule which I slipped into the previous paragraph? The winner is declared based on the % of components available within that partition. If the partition has access to more than 50% it has won. Meaning that when “esxi-01” and “esxi-04” are isolated, either “esxi-02” or “esxi-03” can restart the VM because 66% of the components reside within this part of the cluster. Nice, right?!
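The “more than 50%” rule is easy to express in a few lines. The Python below is only a sketch of that rule as described in this post, not how VSAN implements it. The function names are made up, and the placement of components (a VMDK on “esxi-02”, a VMDK on “esxi-03” and the witness on “esxi-04”) is my reading of the diagrams, so treat it as an assumption.

```python
def has_quorum(components_in_partition, total_components):
    """A partition wins only with strictly more than 50% of the components."""
    return 2 * components_in_partition > total_components


def winning_partition(partitions):
    """Return the name of the partition holding quorum for one object.

    partitions maps a partition name to the list of components of that
    object which are reachable inside the partition. Returns None when no
    partition holds more than 50% of the components.
    """
    total = sum(len(components) for components in partitions.values())
    for name, components in partitions.items():
        if has_quorum(len(components), total):
            return name
    return None


# Partition scenario: esxi-01/esxi-02 versus esxi-03/esxi-04.
two_way_split = {
    "Partition-1": ["vmdk on esxi-02"],
    "Partition-2": ["vmdk on esxi-03", "witness on esxi-04"],
}

# Isolation scenario: esxi-01 and esxi-04 are each on their own.
two_isolated_hosts = {
    "esxi-01": [],
    "esxi-04": ["witness on esxi-04"],
    "esxi-02 + esxi-03": ["vmdk on esxi-02", "vmdk on esxi-03"],
}

print(winning_partition(two_way_split))
print(winning_partition(two_isolated_hosts))
```

Note that an even split (for instance 1 out of 2 components) is not enough, which is exactly why the witness exists.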

I hope this makes isolations / partitions a bit clearer. I realize these concepts will be tough to grasp for the first weeks/months… I will try to explore some more (complex) scenarios in the near future.

How VSAN handles a disk or host failure

I have had this question multiple times by now. I wanted to answer it in the Virtual SAN FAQ, but I figured I would need some diagrams and probably more than 2 or 3 sentences to explain this. How are host or disk failures in a Virtual SAN cluster handled? Let's start at the beginning, and I am going to try to keep it simple.

I explained some of the basics in my VSAN intro post a couple of weeks back, but it never hurts to repeat this. I think it is good to explain the IO path first before talking about the failures. Let's look at a 4 host cluster with a single VM deployed. This VM is deployed with the default policy, meaning a “stripe width” of 1 and “failures to tolerate” set to 1 as well. When deployed in this fashion the following is the result:

In this case you can see: 2 mirrors of the VMDKs and a witness. These VMDKs by the way are the same, they are an exact copy. What else did we learn from this (hopefully) simple diagram?

  • A VM does not necessarily have to run on the same host as where its storage objects are sitting
  • The witness lives on a different host than the components it is associated with in order to create an odd number of hosts involved for tiebreaking under a network partition
  • The VSAN network is used for communication / IO etc

Okay, so now that we know these facts it is also worth knowing that VSAN will never place the mirror on the same host, for availability reasons. When a VM writes, the IO is mirrored by VSAN and will not be acknowledged back to the VM until all mirror writes have completed. Meaning that in the example above the acknowledgements from both “esxi-02” and “esxi-03” will need to have been received before the write is acknowledged to the VM. The great thing here is that all writes will go to flash/SSD; this is where the write buffer comes into play. At some point in time VSAN will then destage the data to your magnetic disks, but this will happen without the guest VM knowing about it…
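The write path described above can be sketched in a few lines of Python. This is a toy model under my own assumptions, not VSAN internals: the `Replica` class, its two lists and the `mirrored_write` function are invented for illustration, and every write is assumed to succeed.

```python
class Replica:
    """Toy stand-in for one mirror copy living on a host."""

    def __init__(self, host):
        self.host = host
        self.write_buffer = []   # stands in for the SSD write buffer
        self.magnetic_disk = []  # stands in for the HDDs behind the SSD

    def write(self, block):
        """Land the block in the write buffer and acknowledge it."""
        self.write_buffer.append(block)
        return True

    def destage(self):
        """Move buffered blocks to magnetic disk; the guest never sees this."""
        self.magnetic_disk.extend(self.write_buffer)
        self.write_buffer.clear()


def mirrored_write(replicas, block):
    """Acknowledge a write to the VM only when every mirror has acknowledged."""
    acks = [replica.write(block) for replica in replicas]
    return bool(acks) and all(acks)


mirrors = [Replica("esxi-02"), Replica("esxi-03")]
acknowledged = mirrored_write(mirrors, b"some data")
print(acknowledged, [m.write_buffer for m in mirrors])

# Later, independent of the guest, the buffers are destaged.
for mirror in mirrors:
    mirror.destage()
```

The point of the sketch is the ordering: the acknowledgement depends on both write buffers, while destaging happens later and is invisible to the VM.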

Frequently asked questions about Virtual SAN / VSAN

After I published the vSphere Flash Read Cache FAQ many asked if I would also do a blog post for frequently asked questions about Virtual SAN / VSAN. I guess it makes sense considering Virtual SAN / VSAN being such a hot topic. So here are the questions I have received so far, followed by the answers of course. If you have a question do not hesitate to leave a comment.

** updated to reflect VSAN GA **

  • Can I add a host to a VSAN cluster which does not have local disks?
    • Yes, a VSAN cluster can consist of hosts which are not contributing to VSAN storage. You will need to create a VSAN VMkernel interface on the host and simply add the host to the cluster. Note that you will need a minimum of 3 hosts which contribute storage to VSAN.
  • VSAN requires an SSD, what is it used for?
    • The SSD is used for read caching (70%) and write buffering (30%). Every write will go to SSD first and will be destaged to HDD later.
  • When creating my VSAN VM Storage Policy, when do I use “failures to tolerate” and when do I use “stripe width”?
    • Failures to tolerate is all about availability; this is what you define when your virtual machine needs to remain available when a host or disk group has failed. So if you want to take 1 host failure into account, you set the policy to 1. This will then create 2 data objects and 1 witness in your cluster. Stripe width is about performance (read performance when not in cache, and write destaging). Setting it to two or higher will result in data being striped across multiple disks. When used in conjunction with “failures to tolerate” this could potentially result in data of a single VM being stored on multiple disks on multiple hosts.
  • Is there a default storage policy for VSAN?
    • Yes, there is a policy applied by default to all VMs on a VSAN datastore, but you cannot see this policy within the vSphere UI. You can see that a default policy is defined for various classes using the following command: esxcli vsan policy getdefault. By default an N+1 failures to tolerate policy is applied, so that even in the case where a user forgets to create and set a policy, objects are made resilient. It is not recommended to change the default policy.
  • How is data striped across multiple disks on a host when stripe width is set to 2?
    • When stripe width is set to 2, first of all there is no guarantee that the data is striped across disks within a host. VSAN has its own algorithm to determine where data should be placed, and as such it could happen that although you have sufficient disks in all hosts your data is striped across multiple hosts instead of disks within a host. When data is striped this is done in chunks of 1MB.
  • What is the purpose of “disk groups” since VSAN will create one datastore anyway?
    • A disk group defines the SSD that is used for caching/buffering in front of a set of HDDs. Basically a disk group is a way of mapping HDDs to an SSD. Each disk group will have 1 SSD and a maximum of 7 HDDs.
  • How many disks can a single host contribute to VSAN?
    • Max 5 disk groups per host
    • Each disk group needs 1 SSD and 1 HDD at a minimum and 7 HDDs at a maximum
    • HDD count max per host = 5 x 7 = 35
    • SSD count max per host = 5 x 1 = 5
  • Are both SSD and PCIe Flash cards supported?
    • Yes both are supported but check the HCL for more details around this as there are guidelines and requirements
  • Is 10GbE a hard requirement for VSAN?
    • 10GbE is not a hard requirement for VSAN. VSAN works perfectly fine in smaller environments, including labs, with 1GbE. Do note that 10GbE is a recommendation.
  • Why is it recommended for HA’s isolation response to be configured to “powered-off”?
    • When VSAN is enabled vSphere HA uses the VSAN VMkernel network for heartbeating. When a host does not receive any heartbeats, it is most likely that the host is also isolated/partitioned from a VSAN perspective from the rest of the cluster. In this state it is recommended to power-off the virtual machine as a new copy will be powered-on by HA on the remaining hosts in the cluster automatically. This way when the host comes out of isolation the situation where 2 VMs with the same identity are on the network does not occur.
  • Can I partition my SSD or disks so that I can use them for other (install ESXi / vFlash) purposes?
    • No you cannot partition your SSD or HDD(s). Virtual SAN will only, and always, claim entire disks. With VSAN it probably makes most sense to install ESXi on an internal USB/SD card, this to maximize the capacity for VSAN.
  • Does VSAN support deduplication or compression?
    • In the current version VSAN does not support deduplication or compression. The most expensive resource in your VSAN cluster is SSD/Flash, hence duplication of data is most relevant on that layer. While having multiple copies of your data results in two copies on HDDs, and two temporary copies in the distributed write buffer (30% of the SSDs), the distributed read cache portion of the Flash (70%) will only contain a single copy of any cached data.
  • Can VSAN leverage SAN/NAS datastores?
    • VSAN currently does not support the use of SAN/NAS datastores. Disks will need to be “local” and directly passed to the host.
  • I was told VSAN does thin disks by default, if I set Object Space Reservation to 100% does that mean the VMDK will be eager zero thick provisioned?
    • No, defining Object Space Reservation does not mean the VM, or a portion of it for that matter, will be thick provisioned. Object Space Reservation is all about the numbers used by VSAN when calculating used disk space / available disk space etc. When Object Space Reservation is set to 100% on a disk of 25GB then this disk will be a thin provisioned disk, but VSAN will do its math with 100% of 25GB used. I guess you can compare it to a memory reservation.
  • Does VSAN use iSCSI or NFS to connect hosts to the datastore?
    • VSAN does not use either of these two to connect hosts to a datastore. It uses a proprietary mechanism.
  • What is the impact of maintenance mode in a VSAN enabled cluster?
    • There are three ways of placing a host which is providing storage to your VSAN datastore in maintenance mode:
      1) Full Data Migration – All data residing on the host will be migrated. Impact: Could take a long time to complete.
      2) Ensure accessibility – VSAN ensures that all VMs will remain accessible by migrating the required data to other hosts. Impact: Potentially availability policies are violated.
      3) No Data Migration – No data will be migrated. Impact: Depending on the “failures to tolerate” policy defined some VMs might become unusable.
      The safest option is option 1, with option 2 being the preferred and default as it is faster to complete. I guess the question is why you are placing the host in maintenance mode and how fast it will become available again. Option 3 is a fallback, in case you really need to get into maintenance mode fast and don’t care about potential data loss.
  • Are there any features of vSphere which aren’t supported/compatible with VSAN?
    • Currently vSphere Distributed Power Management, Storage DRS and Storage IO Control are not supported with VSAN.
  • How do I add a Virtual SAN / VSAN license?
    • VSAN licenses are applied on a cluster level. Open the Web Client, click on your VSAN enabled cluster, then click the “Manage” tab followed by “Settings”. Under “Configuration” click “Virtual SAN Licensing” and then click “Assign License Key”.
  • How will Virtual SAN be priced / licensed?
    • VSAN is licensed per socket, the price is $ 2495 per socket or $ 50,- per VDI user. Note that the license includes the Distributed Switch and VM Storage Policies, even when using a vSphere license lower than Enterprise Plus!
  • If a host has failed and as such data is lost and all VMs were protected N+1, how long will it take before VSAN starts rebuilding the lost data?
    • VSAN will identify which objects are out of compliance (those which had N+1 and were stored on that host) and starts a time-out period of 60 minutes. It has a time-out period to avoid an unnecessary and costly full sync of data. If the host returns within those 60 minutes then only the differences will be copied to that host. When a VM has multiple mirrors it doesn’t notice the failure; this 60 minute period is all about going back to full policy compliance, i.e. being able to withstand additional failures should they occur.
  • When a virtual machine moves around in a cluster will its objects follow to keep IO local?
    • No, objects (virtual disks for instance) do not follow the virtual machine. Just imagine what the cost/overhead of moving virtual disks between hosts would be each time DRS suggests a migration. Instead IO can be done remotely. Meaning that although your virtual machine might run on host-1 from a CPU/Mem perspective, its virtual disks could be physically located on host-2 and host-3.
  • When a Virtual Machine is migrated to another host, is the SSD cache lost after the vMotion (temporary performance hit) and does it need to be rebuilt over time?
    • No, the cache will not be lost and there is no need to rebuild/warm up the cache again. Cache will be accessed remotely when needed.
  • Does VSAN support Fault Tolerance aka FT?
    • No, VSAN does not support Fault Tolerance in this release.
  • The SSD in my host is being reported in vSphere as “non-SSD”. According to support this is a known issue with the generation of server I am using. Will this “mis-reporting” of the disk type affect my ability to configure a VSAN?
    • Yes it will. You will need to tag the SSD as local using the command below (this example is what I use in my lab; your device identifier will be different). In this case I claim it as being “local” and as “SSD”.
      esxcli storage nmp satp rule add --satp VMW_SATP_LOCAL --device mpx.vmhba2:C0:T0:L0 --option "enable_local enable_ssd"
  • It was mentioned that it will take 60 minutes after a failure before VSAN starts the automatic repair. Is it possible to shorten this time-out value?
    • **disclaimer: Although I do not recommend changing this value, I was told it is supported**
      Yes it is possible to shorten this time-out value by configuring the advanced setting named “VSAN.ClomRepairDelay” on every host in your VSAN cluster.
  • Why can’t I use datastore heartbeat functionality in a VSAN-only cluster?
    • There is no requirement for heartbeat datastores. The reason you do not have this functionality when you only have a VSAN datastore is because HA will use the VSAN network for heartbeats. So if a host is isolated from the VSAN network and cannot send heartbeats, it is safe to say that it will also not be able to update a heartbeat region remotely as such making it pointless to enable this feature in a VSAN only environment.
  • Are there specific Best Practices around deploying View on VSAN?
  • Can the VSAN VMkernel of hosts in a cluster be part of a different subnet?
    • VSAN VMkernel interfaces need to be part of the same subnet. A different subnet for one (or multiple) hosts within a VSAN cluster is not supported. When using multiple VMkernel interfaces per host, each interface needs to be part of a different subnet!
  • Does VSAN support being stretched across multiple geographical locations?
    • In the current version VSAN will not support “metro” clustering.
  • Is there a difference between a host failing and a disk gradually failing?
    • Yes, there is a difference. There are various failure states, and the state determines how fast VSAN will spin up a new mirror. The two failure states are “absent” and “degraded”. Degraded is where a disk has failed and the system has recognized this as such and knows it isn’t coming back. In this case VSAN recognizes this “degraded” state and will create a new mirror of the impacted objects immediately, as there is no point in waiting for 60 minutes when you know it isn’t coming back soon. The “absent” state means that VSAN doesn’t know if it is coming back any time soon. This could be a host that has failed, or for instance a disk that you yank; in this case the 60 minute time-out starts.
  • Is there any explanation around how VSAN handles disk failures or host failures?
  • What happens when an SSD fails in a VSAN cluster?
    • An SSD sits in front of a Disk Group as the read cache / write buffer. When the SSD fails, the disk group and all the components stored on it are marked as degraded. VSAN will then instantiate new mirror copies where applicable and when sufficient disk capacity is available. For more details read this post.
  • Does vSphere support TRIM for SSDs?
    • No, TRIM is currently not supported/leveraged.
  • What are the Maximum Numbers for Virtual SAN GA?
    • 32 hosts per cluster
    • 100 VMs per host maximum
    • 3200 VMs per cluster maximum
    • 2048 VMs HA protected per cluster maximum
    • 2 million IOPS tested
  • How do I size a VSAN datastore / cluster?
  • How do I monitor VSAN performance?
    • Performance can easily be monitored using the VSAN Observer tool. This has been discussed by various people: here, here, here and here.
  • What’s likely to affect VSAN performance?
    • Performance is most likely affected by leveraging cheap flash devices or incorrectly configured policies. In the case a workload is highly random and has a large “working set” it could be that many of the IOs will need to come from disk, this can also impact performance depending on the disk type used and the number of disk stripes.
  • Why is Storage DRS not supported in VSAN?
    • VSAN only provides a single datastore and has its own placement and balancing algorithms.
  • What will happen when the whole environment goes down and power back on again? Do we run some sort of integrity check?
  • Is VSAN dependent on vCenter? Can I configure VSAN if vCenter is down?
    • VSAN is not dependent on vCenter. It can be configured from the console using “esxcli” and can even be configured and used before vCenter is up and running. William Lam wrote two articles around how to bootstrap vCenter on a single host running VSAN. (here and here)
  • Could you have locality in VSAN? Does locality make sense at all compared to other solutions?
    • By default VSAN does not have a “data locality” concept as I explained here. However, for View environments CBRC is fully supported and that provides a local read cache for desktops.
  • Is vCops aware of VSAN datastore?
    • The current release of VC Ops has limited functionality for VSAN. The upcoming version of VC Ops will include more statistics and ways of monitoring a VSAN datastore.
  • How do you back up your VMs in VSAN? Just the usual existing backup procedures?
    • VDP supports VSAN and various storage vendors are going through testing/releasing a new version of their product as we speak. VMs stored on a VSAN datastore should not be treated differently than regular VMs.
  • Does VSAN support any data reduction mechanisms like deduplication or compression?
    • In the current version deduplication or compression is not included.
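To illustrate the difference between the “absent” and “degraded” states from the FAQ above, here is a small Python sketch of when a rebuild would start. It only models what the answers describe (immediately for “degraded”, after the time-out for “absent”). The function and constant names are hypothetical, and the 60 minute value is the default mentioned above, which can be changed through the advanced setting named in the FAQ.

```python
from datetime import datetime, timedelta

# Default time-out from the FAQ; configurable per host via an advanced setting.
DEFAULT_REPAIR_DELAY = timedelta(minutes=60)


def rebuild_starts_at(state, failed_at, repair_delay=DEFAULT_REPAIR_DELAY):
    """Return the moment a new mirror would start being built.

    "degraded": the component is known to be gone, so rebuild right away.
    "absent":   it might come back, so wait for the time-out first.
    """
    if state == "degraded":
        return failed_at
    if state == "absent":
        return failed_at + repair_delay
    raise ValueError(f"unknown failure state: {state!r}")


failure_time = datetime(2014, 3, 12, 10, 0)
print(rebuild_starts_at("degraded", failure_time))  # disk declared failed
print(rebuild_starts_at("absent", failure_time))    # host down or disk pulled
```

If the absent host or disk returns before that moment, only the differences need to be copied, which is the whole reason for the delay.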

If you have a question, please don’t hesitate to ask… Over time I will add more and more to this list so come back regularly.

vSphere HA Futures: Restart Order

At VMworld I hosted a group discussion together with Keith Farkas (HA Lead Engineer) on the topic of HA Futures. Based on this group discussion Keith and I decided to gather more feedback from the field, and this post will hopefully help us with that. Please do not hesitate to comment. I will have a couple of articles following this one, but let's get started with HA futures for the Restart Order first.

A topic that has come up at various sessions is HA restart ordering / priorities. Today HA provides four levels of restart priority: High, Medium, Low, Disabled. The thing to note with the current restart priority though is that there is no guarantee VMs are actually restarted in that order when the VMs are started on more than one host. Even when HA restarts them in the right order there is also no guarantee around when the boot cycle completes. Typically a large virtual machine running, for instance, a database will take longer to boot than a server just running DNS. So what do we propose? We propose restart orders instead of restart priority. What does this mean, and what would we like to know from you?

There are two complementary ways of implementing this and we would like your feedback including which one you think would be most useful.

  1. Global Restart Order aka Bucketing
  2. VM to VM dependency Chains

Let me explain these two options and then I will let you guys chime in.

Global Restart Order aka Bucketing is basically what you have today with “restart priorities”, only it will actually enforce the restart order and it will allow for more flexibility. So with this option you could for instance create 5 buckets, and then add virtual machines to these buckets appropriately. These buckets could be: Priority 1, Priority 2 and so on. When a failure has occurred vSphere HA would then restart all VMs in the bucket “Priority 1” first, and when that bucket has finished starting (e.g., wait for VMware Tools Heartbeat to report “alive” for each VM) vSphere HA would continue with the next bucket and so on. Waiting for VMtools to report “alive” is one way to determine that a VM is “ready”. We are thinking of providing three other “wait” options: wait for an application heartbeat, wait a certain amount of time after the VM powers on, or today’s behavior, wait for the power-on task to complete.
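To make the bucketing idea a bit more tangible, here is a minimal Python sketch. It is purely illustrative and not how vSphere HA is or will be implemented: `restart_in_buckets`, `power_on` and `is_ready` are hypothetical names, `is_ready` stands in for whichever “wait” condition is chosen, and raising an error on a time-out is just a placeholder since what should happen in that case is still an open question.

```python
import time


def restart_in_buckets(buckets, power_on, is_ready, poll_interval=0.0, max_polls=100):
    """Power on VMs bucket by bucket and return the power-on order.

    buckets:  list of lists of VM names, highest priority first
    power_on: callable that starts one VM
    is_ready: callable returning True once a VM meets the wait condition
    """
    order = []
    for bucket in buckets:
        # All VMs within a bucket are started without waiting on each other.
        for vm in bucket:
            power_on(vm)
            order.append(vm)

        # The next bucket only starts once every VM in this one is ready.
        pending = set(bucket)
        for _ in range(max_polls):
            pending = {vm for vm in pending if not is_ready(vm)}
            if not pending:
                break
            time.sleep(poll_interval)
        else:
            # Placeholder behavior: stop instead of moving on to the next bucket.
            raise TimeoutError(f"VMs never became ready: {sorted(pending)}")
    return order


powered_on = []
order = restart_in_buckets(
    [["db01"], ["app01", "app02"], ["web01"]],
    power_on=powered_on.append,
    is_ready=lambda vm: vm in powered_on,  # stand-in for a VMtools heartbeat
)
print(order)
```

Swapping the `is_ready` callable is all it would take in this sketch to switch between the different wait options.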

I guess a couple of questions we have:

  1. How many levels would you like to see?
  2. Which of the wait conditions (e.g., wait on VMtools) are most useful for you?
  3. Suppose HA could not power on a “Priority 1” VM. Do you want HA to stop powering on the “Priority 2” etc. VMs until it can, move to the “Priority 2” group after a timeout, or something else?

The second option is VM to VM dependency Chains. These can be seen as an explicit restart order for a specific group of VMs which typically would form a service. I guess not unlike the vApp construct today, but then without all the caveats and restrictions around this. (vApps are essentially resource pools, and we don’t want resource management in this case… just restart ordering.) In the simplest form, you could imagine specifying ordered lists of VMs, each list specifying the restart order for that set; the VMs in a list would be powered on sequentially. For example, something like the following:

Database VM --> Application Server --> Web Server
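In its simplest form such a chain is nothing more than an ordered list that is walked from front to back. The Python sketch below shows that idea only. The `restart_chain` function is hypothetical, and stopping at the first VM that fails to power on is an assumption made for the example, not a decision that has been taken.

```python
def restart_chain(chain, power_on):
    """Power on the VMs of one chain sequentially.

    chain:    list of VM names, dependencies first
    power_on: callable returning True when the VM was started successfully

    Returns (started, not_started). The walk stops at the first failure,
    which is an assumption made for this sketch.
    """
    started = []
    for index, vm in enumerate(chain):
        if not power_on(vm):
            return started, chain[index:]
        started.append(vm)
    return started, []


service = ["Database VM", "Application Server", "Web Server"]

# Everything powers on fine.
print(restart_chain(service, lambda vm: True))

# The application server cannot be restarted, so the web server is held back.
print(restart_chain(service, lambda vm: vm != "Application Server"))
```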

As you can see that would offer a significant amount of granularity, but also potentially a lot of operational complexity. How far would you like to go I guess is the question? Questions we have for you:

  1. Is an ordered list sufficient to express dependencies in a chain of VMs or do you need more sophistication?
  2. If a VM with a dependent fails, do you expect HA to restart the dependent VM even though the VM it depends on has failed?
  3. What if HA is not able to restart a VM with dependents: should HA restart these dependent VMs after a delay, or only after the first VM is restarted?

A final question. We think bucketing will be easier to manage operationally but it introduces artificial dependencies between VMs and will make it take much longer to restart all VMs after a failure. How significant are these limitations?

That is it for now… Please chime in, as your response will help us define the future of vSphere HA.