
Yellow Bricks

by Duncan Epping


vSAN File Services fails to create file share with error Failed to create the VDFS File System.

Duncan Epping · Oct 4, 2022

Last week on our internal Slack channel one of the field folks had a question. He was hitting a situation where vSAN File Services failed when creating a file share with the error “Failed to create the VDFS File System”. We went back and forth a bit, and after a while I jumped on Zoom to look at the issue and troubleshoot the environment. After testing various combinations of policies I noticed that a particular policy worked, while another policy did not. At first it looked like stretched cluster policies would not work, but after creating a new policy with a different name it did work. That left one thing: the name of the policy. It appears that the use of special characters in the VM Storage Policy name results in the error “Failed to create the VDFS File System”. In this particular case the VM Storage Policy that was used was “stretched – mirrored FTT=1 RAID-1”, and the character that was causing the issue was the “=” character.

How do you resolve it? Simply change the name of the policy. For instance, the following would work: “stretched – mirrored FTT1 RAID-1”.
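
If you want to hunt down policies like this proactively, a quick PowerCLI sketch along these lines should do it. Note the assumptions: only the “=” character was confirmed to cause this error (the other characters in the pattern are merely likely suspects), and I believe Set-SpbmStoragePolicy -Name is the way to rename a policy, so test this in a lab before running it against production:

# Find VM Storage Policies with potentially problematic characters in the name.
# Only "=" was confirmed to break VDFS share creation; the others are guesses.
Get-SpbmStoragePolicy | Where-Object { $_.Name -match '[=+%&]' } | ForEach-Object {
    $newName = $_.Name -replace '[=+%&]', ''   # e.g. "FTT=1" becomes "FTT1"
    Set-SpbmStoragePolicy -StoragePolicy $_ -Name $newName
}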

vSAN 7.0 U3 enhanced stretched cluster resiliency, what is it?

Duncan Epping · Oct 4, 2021

I briefly discussed the enhanced stretched cluster resiliency capability in my vSAN 7.0 U3 overview blog. Of course, questions immediately started popping up. I didn’t want to go too deep in that post, as I figured I would do a separate post on the topic sooner or later. So what does this functionality add, and in which particular scenario does it help?

In short, this enhancement to stretched clusters prevents downtime for workloads in a particular failure scenario. So the question then is: what failure scenario? Let’s first take a look at this diagram of a typical stretched vSAN cluster deployment.

If you look at the diagram you see the following: Datacenter A, Datacenter B, and the Witness. One of the situations customers have found themselves in is that Datacenter A would go down (unplanned). This of course would lead to the VMs in Datacenter A being restarted in Datacenter B. Unfortunately, sometimes when things go wrong, they go wrong badly: in some cases, the Witness would fail or disappear next. Why? Bad luck, networking issues, etc. Bad things just happen. If and when this happens, there is only one location left, which is Datacenter B.

Now you may think that because Datacenter B typically holds a full RAID set of the VMs, those VMs will remain running, but that is not true. vSAN looks at quorum at the top layer, so if 2 out of 3 datacenters disappear, all impacted objects become inaccessible simply because quorum is lost! Makes sense, right? And we are not just talking about failures: it could also be that Datacenter A has to go offline for maintenance (planned downtime) and at some point the Witness fails for whatever reason. That would result in the exact same situation: objects inaccessible.

Starting with 7.0 U3 this behavior has changed. If Datacenter A fails, and a few (let’s say 5) minutes later the witness disappears, all replicated objects would still be available! So why is this? Well, in this scenario, when Datacenter A fails, vSAN creates a new votes layout for each of the impacted objects. It basically assumes that the witness can fail next and gives all components on the witness 0 votes; on top of that, it gives the components in the surviving site additional votes so that the objects can survive that second failure. If the witness then fails, the objects are not rendered inaccessible, as quorum is not lost.
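
To make the vote arithmetic a bit more concrete, here is a deliberately simplified sketch in PowerShell. The numbers are purely illustrative; they are not the exact vote counts vSAN assigns:

# Quorum requires more than half of an object's total votes to be accessible.
# Normal stretched layout: both data sites and the witness hold votes.
$votes = @{ SiteA = 1; SiteB = 1; Witness = 1 }   # total 3, any single failure survives

# Datacenter A fails. Before 7.0 U3 the layout stays as-is, so losing the witness
# next leaves Site B with 1 of 3 votes: quorum lost, object inaccessible.
# With 7.0 U3 vSAN recalculates: witness components drop to 0 votes and the
# surviving site is boosted, so Site B alone holds the majority.
$votes = @{ SiteA = 0; SiteB = 3; Witness = 0 }   # total 3, Site B > 50%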

Now, do note: when a failure occurs and Datacenter A is gone, vSAN has to create a new votes layout for each object, and it does this object by object at a rate of, typically, a few seconds per object. So if you have a lot of VMs (and a VM consists of various objects), this adds up; it could take five minutes in total. If anything happens in between, not all objects may have been processed yet, and for those unprocessed objects quorum would be lost when the witness goes down, resulting in downtime for those VMs.

What happens if Datacenter A (and the Witness) return for duty? Well, at that point the votes are restored for the objects across both locations and the witness.

Pretty cool right?!

Which vSAN Maintenance Mode option was used on a host?

Duncan Epping · Jun 7, 2021

Last week there was a question on VMTN around maintenance mode: this customer wanted to find out which vSAN Maintenance Mode option was used when a host was placed in maintenance mode. The person who placed the host into maintenance mode wasn’t around, and I guess the customer wanted to know whether data was moved off the host or not. They looked at the events and tasks, but unfortunately it wasn’t obvious from that section of vCenter, so next up were the logs. I also looked at the logs, and you would expect this info to be stored in hostd.log, but it isn’t. So where is it? It actually is a configuration item, and you can dig it up as follows:

Pre ESXi 7.0, the info is stored in the “esx.conf” file; just grep for “hostDecommission”. Let me show you:

$ grep "/vsan/hostDecommission" /etc/vmware/esx.conf
/vsan/hostDecommissionVersion = "10"
/vsan/hostDecommissionState = "decom-state-decommissioned"
/vsan/hostDecommissionMode = "decom-mode-ensure-object-accessibility"

If the host is at ESXi 7.0 or later, just run the configstorecli command below:

$ configstorecli config current get -c vsan -g system -k host_state
{
   "decom_mode": "ENSUREOBJECT_ACCESSIBILITY",
   "decom_state": "DECOMMISSIONED",
   "decom_version": 0
}
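
As a side note, if you place hosts into maintenance mode through PowerCLI, the vSAN data migration option is an explicit parameter, which also makes it easy to audit afterwards. A minimal sketch, with a made-up host name:

# Enter maintenance mode with an explicit vSAN data evacuation mode.
# Valid values for -VsanDataMigrationMode: Full, EnsureAccessibility, NoDataMigration.
Get-VMHost -Name "esx01.lab.local" |
    Set-VMHost -State Maintenance -VsanDataMigrationMode EnsureAccessibility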

I hope that helps others who have a similar question!

vGPUs and vMotion, why the long stun times?

Duncan Epping · Feb 7, 2020

Last week one of our engineers shared something which I found very interesting. I have been playing with Virtual Reality technology and NVIDIA vGPUs for 2 months now. One thing I noticed is that we (VMware) introduced support for vMotion of vGPU-enabled VMs in vSphere 6.7, and support for vMotion of multi-vGPU VMs in vSphere 6.7 U3. In order to enable this, you need to set an advanced setting first. William Lam described in his blog how to set this via PowerShell or the UI (a quick PowerCLI sketch follows the list below). Now, when you read the documentation, there’s one thing that stands out, and that is the relatively high stun times for vGPU-enabled VMs. Just as an example, here are a few potential stun times for various vGPU frame buffer sizes:

  • 2GB – 16.5 seconds
  • 8GB – 61.3 seconds
  • 16GB – 100+ seconds (time out!)
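
As a quick aside before digging into those numbers: enabling vGPU vMotion boils down to flipping the vgpu.hotmigrate.enabled advanced setting on vCenter. In PowerCLI that looks roughly like this; it is a sketch based on William’s post, so verify against it:

# Enable vMotion for vGPU-enabled VMs via the vCenter advanced setting.
Get-AdvancedSetting -Entity $global:DefaultVIServer -Name "vgpu.hotmigrate.enabled" |
    Set-AdvancedSetting -Value $true -Confirm:$false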

These stun times are all documented here for the various frame buffer sizes. Now there are a couple of things to know about this. First of all, the times mentioned were measured with an NVIDIA P40; this could be different for an RTX6000 or RTX8000, for instance. Secondly, they used a 10GbE NIC; if you use multi-NIC vMotion or, for instance, a 25GbE NIC, then results may be different (times should be lower). But more importantly, the times mentioned assume the full frame buffer memory is consumed. If you have a 16GB frame buffer and only 2GB is consumed then, of course, the stun time will be lower than the above-mentioned 100+ seconds.

Now, this doesn’t answer the question yet: why? Why on earth are these stun times this long? The vMotion process is described in-depth in this blog post by Niels, so I am not going to repeat it. It is also described in our Clustering Deep Dive book, which you can download here for free. The key reason the “downtime” (stun time) of a vMotion can be kept low is that vMotion uses a pre-copy process and tracks which memory pages change. In other words, when a vMotion is initiated, we copy memory pages to the destination host, and if a page changes during that copy process, we mark it as changed and copy it again. vMotion does this until the amount of memory that still needs to be copied is extremely low, which results in a seamless migration. Now here is the problem: it does this for VM memory. Unfortunately, this isn’t possible for vGPU frame buffer memory today.

Okay, so what does that mean? Well, if you have a 16GB frame buffer and it is 100% consumed, the vMotion process will need to copy 16GB of frame buffer memory from the source to the destination host while the VM is stunned. Why while the VM is stunned? Simply because that is the point in time where the frame buffer memory will not change! Hence, this can unfortunately take a significant number of seconds today. Definitely something to consider when planning on using vMotion for (multi) vGPU-enabled VMs!
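
You can sanity-check those documented numbers with simple arithmetic. Even at theoretical line rate, copying a full frame buffer during the stun takes seconds, and the measured times above are well beyond that, so treat the line-rate figure as a floor rather than an estimate:

# Back-of-envelope floor for the stun time: the consumed frame buffer has to
# cross the vMotion network while the VM is stunned.
$frameBufferGB = 16   # consumed vGPU frame buffer
$linkGbps      = 10   # vMotion NIC speed
$floorSeconds  = ($frameBufferGB * 8) / $linkGbps
"Line-rate floor: $floorSeconds seconds"   # 12.8 seconds; the measured stun times are far higher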

vCenter tasks time-out or ESX disconnects?

Duncan Epping · Feb 5, 2009

I just received an email from a fellow consultant about a customer who had vCenter tasks time out every once in a while. At times, ESX hosts also got disconnected for no apparent reason at all. He discovered the following article by Richard Blythe aka VMware Wolf: “ESX disconnects randomly or when doing VI Client tasks from VC, tasks randomly timeout after a long idle time”. Richard created a list of issues/errors that might be related to this issue:

  • ESX disconnects randomly from VirtualCenter
  • ESX disconnects when performing VI Client tasks from VirtualCenter
  • Tasks randomly time out after a long idle time
  • “An error occurred communicating to the remote host” pops up

The article refers to an issue with vCenter 2.5 Update 3 in combination with firewalls using stateful inspection. The problem occurs because of SOAP timeouts; this behavior did not exist in VC 2.0.x or 2.5 GA, as those versions used a different mechanism to communicate with ESX. The official KB article hasn’t been released yet, but a temporary workaround has been published by Richard. If you run into any of the aforementioned issues, head over to Richard’s website and try out the workaround until the fix or official KB article is released.
