I was talking to someone before the start of the holiday season about running the Vendor Provider (VP) for vVols as a VM and what the best practices are around that. I was thinking about the implications of the VP not being available and came to the conclusion that when the VP is unavailable a number of things stop working, of which “bind” is probably the most important.
The “bind” operation is what allows vSphere to access a given Virtual Volume (vVol), and this operation is issued during a power-on of a VM. This is how the vVols FAQ describes it:
When a vVol is created, it is not immediately accessible for IO. To access a vVol, vSphere needs to issue a “Bind” operation to a VASA Provider (VP), which creates an IO access point for the vVol on a Protocol Endpoint (PE) chosen by the VP. A single PE can be the IO access point for multiple vVols. An “Unbind” operation will remove this IO access point for a given vVol.
This means that when the VP is unavailable, you can’t power on VMs at that time. For many storage systems that problem is mitigated by having the VP be part of the storage system itself, and of course there is the option to have multiple VPs as part of your solution, either in an active/active or an active/standby configuration. In the case of vSAN, for instance, each host runs a VASA provider; one is active and the others are standby, and if the active provider fails a standby takes over automatically. So to be clear, it is up to the vendor to decide what type of availability to provide for the VP: some have decided to go for a single instance and rely on vSphere HA to restart the appliance, others have built active/standby, and so on.
But back to vVols: what if you own a storage system that requires an external VP running as a VM?
- Run your VP VMs in a management cluster; if the hosts in the “production” cluster are impacted and VMs need to be restarted, at least the VP VMs should still be up and running in your management cluster
- Use multiple VP VMs if and when possible; if active/active or active/standby is supported, make sure to run your VPs in that configuration
- Do not use vVols for the VP itself; you don’t want any (circular) dependency between the availability of the VP and being able to power on the VP itself (see the sketch after this list for a quick way to check this)
- If there is no availability story for the VP, then depending on the configuration of the appliance, vSphere FT should be considered.
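If you want to double-check that third bullet, a quick script does the job. Below is a minimal sketch using pyVmomi; the vCenter address, credentials, and the VM name “vp-appliance” are placeholders for your own environment, and the “VVOL” datastore type string is an assumption worth verifying against your vSphere version.

```python
# A minimal sketch, assuming pyVmomi and network access to vCenter.
# "vp-appliance", the vCenter address, and the credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def find_vm_by_name(content, name):
    """Walk the inventory and return the first VM with a matching name."""
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    try:
        return next((vm for vm in view.view if vm.name == name), None)
    finally:
        view.DestroyView()

def check_vp_datastores(vm):
    """Print the type of every datastore backing the VP VM and flag vVol ones."""
    for ds in vm.datastore:
        ds_type = ds.summary.type  # typically "VMFS", "NFS", "vsan" or "VVOL"
        flag = "  <-- circular dependency!" if ds_type.upper() == "VVOL" else ""
        print(f"{ds.name}: {ds_type}{flag}")

ctx = ssl._create_unverified_context()  # lab only; validate certificates in production
si = SmartConnect(host="vcenter.example.local",
                  user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
try:
    vp_vm = find_vm_by_name(si.RetrieveContent(), "vp-appliance")
    if vp_vm:
        check_vp_datastores(vp_vm)
    else:
        print("VP VM not found")
finally:
    Disconnect(si)
```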
One more thing: if you are considering buying new storage, I think one question you definitely need to ask your vendor is what their story is around the VP. Is it a VM, or is it part of the storage system itself? Is there an availability story for the VP, and if so, is it “active/active” or “active/standby”? If not, what do they have on their roadmap around this? You are probably also asking yourself what VMware has planned to solve this problem; well, there are a couple of things cooking and I can’t say too much about it. One important effort, though, is the inclusion of bind/unbind in the T10 SCSI standard, which would allow us to power on new VMs even when the VP is unavailable, as bind would then be a SCSI command; but as you can imagine, those things take time. Until then, when you design a vVol environment, take the above into account when it comes to your Vendor Provider aka VP!
Howard Marks says
I was testing a beta vVols system in the lab; this vendor had their VASA provider running as a single VM without HA. Lo and behold, the VASA provider stopped working: it didn’t crash, it just stopped doing its job.
It can take a while to figure out that your VMs won’t power up because the VASA VM is having a bad day, and then to reboot it. VMware FT/HA doesn’t help because the VM is still running, just not processing the requests we need.
Some sort of application-level HA, while not required by VMware, is in my not-so-humble opinion a must for production environments. I prefer it to be built into the storage system, using that system’s HA, but active/active or active/passive VMs would be fine.
– Howard
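Howard’s scenario, where the VP VM is powered on but no longer servicing requests, is exactly the kind of failure that VM-level HA and FT will not catch. Below is a minimal application-level probe sketch in Python; the hostname and port are placeholders for whatever URL the provider was registered with in vCenter, and since a successful TLS handshake does not prove the VASA service is actually processing requests, treat it only as a starting point for vendor-specific monitoring.

```python
# A minimal liveness probe sketch. "vp.example.local" and port 8443 are
# placeholders; use whatever URL the provider was registered with in vCenter.
# A successful TLS handshake does not guarantee the VASA service is actually
# processing requests, so pair this with vendor-specific health checks.
import socket
import ssl
import sys

def probe_vasa_endpoint(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TLS session can be established with the VP endpoint."""
    context = ssl.create_default_context()
    context.check_hostname = False          # many VPs use self-signed certificates
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except (OSError, ssl.SSLError):
        return False

if __name__ == "__main__":
    ok = probe_vasa_endpoint("vp.example.local", 8443)
    print("VASA endpoint reachable" if ok else "VASA endpoint NOT responding")
    sys.exit(0 if ok else 1)
```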
Ben Meadowcroft (@BenMeadowcroft) says
> Do not use VVols for the VP itself, you don’t want to have any (circular) dependency between the availability of the VP and being able to power-on the VP itself
One mechanism that could help avoid this particular issue is using Storage Policy-Based Management (SPBM) to create a storage policy for the VP VM, ensuring it is provisioned onto appropriate storage and that the administrator is warned if they try to storage vMotion it away from that storage. This can be achieved by tagging the datastores where the VP should live (ideally these are mounted to hosts in the management cluster, as you suggest), creating a policy using that tag, and then applying the policy to the VP VM itself. As Storage DRS and storage vMotion are now policy aware, the administrator will receive a warning if they attempt to migrate the VP VM onto unsuitable storage.
For reference, the steps I followed to create this policy and validate it in my environment were:
1. Created a new vSphere tag
1.a) Name – “VASA Vendor Provider Appliance Storage”
1.b) Applies to datastores
2. Assigned the tag to the suitable (non-VVol) datastores where I wanted the VP to reside
3. Created a new SPBM policy
3.a) Name – “VASA Vendor Provider Appliance Storage”
3.b) Added a tag-based rule for the “VASA Vendor Provider Appliance Storage” tag
3.c) Validated that compatible storage did not include the VVol datastore
4. Assigned the “VASA Vendor Provider Appliance Storage” policy to the VASA Vendor Provider appliance
5. Attempted to migrate the VP to the VVol datastore (VM storage policy was the default of “Keep existing VM storage policies”)
6. Validated that storage vMotion warned me about this move with the following message:
> Datastore does not satisfy compatibility since it does not support one or more required properties.
> Tags with name “VASA Provider Appliance Storage” not found on datastore.
While this is not a foolproof mechanism, it does at least express the desired intent via policy and provide warnings that may help prevent an avoidable error.
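For anyone who prefers to script the tagging portion of Ben’s steps (1 and 2), here is a minimal sketch using the vSphere Automation SDK for Python. The vCenter address, credentials, category name, and datastore names are placeholders, it assumes the SDK exposes the tagging services under client.tagging, and the SPBM policy itself (step 3 onward) is still created in the vSphere Client as described above.

```python
# A sketch of steps 1 and 2 above, assuming the vsphere-automation-sdk-python
# package. Names, credentials, and the datastore whitelist are placeholders;
# the SPBM policy (step 3 onward) is still created in the vSphere Client.
import requests
import urllib3
from com.vmware.cis.tagging_client import Category, CategoryModel, Tag
from com.vmware.vapi.std_client import DynamicID
from com.vmware.vcenter_client import Datastore
from vmware.vapi.vsphere.client import create_vsphere_client

urllib3.disable_warnings()           # lab only; validate certificates in production
session = requests.session()
session.verify = False
client = create_vsphere_client(server="vcenter.example.local",
                               username="administrator@vsphere.local",
                               password="***", session=session)

# Step 1: create a tag category (name is a placeholder) and the tag itself.
category_id = client.tagging.Category.create(Category.CreateSpec(
    name="VP Appliance Placement",
    description="Datastores suitable for the Vendor Provider appliance",
    cardinality=CategoryModel.Cardinality.SINGLE,
    associable_types={"Datastore"}))
tag_id = client.tagging.Tag.create(Tag.CreateSpec(
    name="VASA Vendor Provider Appliance Storage",
    description="Non-vVol storage for the VP appliance",
    category_id=category_id))

# Step 2: attach the tag to the suitable (non-vVol) datastores.
suitable = {"mgmt-datastore-01", "mgmt-datastore-02"}   # placeholder names
for ds in client.vcenter.Datastore.list():
    if ds.type != Datastore.Type.VVOL and ds.name in suitable:
        client.tagging.TagAssociation.attach(
            tag_id=tag_id, object_id=DynamicID(type="Datastore", id=ds.datastore))
        print(f"Tagged {ds.name}")
```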