
Yellow Bricks

by Duncan Epping



Changing advanced vSphere FT related settings, is that supported?

Duncan Epping · Feb 1, 2018 ·

This week I received a question about changing the values of some vSphere FT related advanced settings. This customer is working on an environment where uptime is key. Of course the application layer is one side of it, but they also want additional availability from an infrastructure perspective, which means vSphere HA and vSphere FT are key.

They have various VMs they need to enable FT on; these are vSMP VMs (meaning in this case dual vCPU). Right now each host is limited to 4 FT VMs and at most 8 FT vCPUs, which is controlled by two advanced settings called “das.maxftvmsperhost” and “das.maxFtVCpusPerHost”. The values for these are, obviously, 4 and 8. The question was: can I edit these and still have a supported configuration? And also, why 4 and 8?

I spoke to the product team about this and the answer is: yes, you can safely edit these. The default values were set based on the typical bandwidth and resource constraints customers have. An FT VM easily consumes between 1 and 3 Gbps of bandwidth, meaning that if you dedicate a 10Gbps link to FT you will fit roughly 4 VMs. I say roughly because the workload matters, of course: CPU, memory and I/O pattern.

If you have a 40Gbps NIC and plenty of cores and memory, you could increase those maximums for FT VMs and FT vCPUs per host. However, it must be noted that if you run into problems, VMware GSS may request you to revert to the defaults, just to ensure the issues that occur aren’t caused by this change, as VMware tests with the default values.
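For reference, here is a minimal pyVmomi sketch of how these two HA advanced options could be raised on a cluster. The cluster lookup, the connection handling, and the example values (6 FT VMs and 12 FT vCPUs per host) are assumptions for illustration only, not recommendations:

# Minimal pyVmomi sketch: raise the per-host FT limits via HA advanced options.
# Assumes "cluster" is a vim.ClusterComputeResource already retrieved from vCenter;
# the values 6 and 12 are examples, not recommendations.
from pyVmomi import vim

def set_ft_limits(cluster, max_ft_vms="6", max_ft_vcpus="12"):
    """Reconfigure a cluster with new FT-related HA advanced options."""
    spec = vim.cluster.ConfigSpecEx()
    spec.dasConfig = vim.cluster.DasConfigInfo()
    spec.dasConfig.option = [
        vim.option.OptionValue(key="das.maxftvmsperhost", value=max_ft_vms),
        vim.option.OptionValue(key="das.maxFtVCpusPerHost", value=max_ft_vcpus),
    ]
    # modify=True merges these options into the existing cluster configuration.
    return cluster.ReconfigureComputeResource_Task(spec, modify=True)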

UPDATE to this content can be found here: https://www.yellow-bricks.com/2022/11/18/can-you-exceed-the-number-of-ft-enabled-vcpus-per-host-or-number-of-ft-enabled-vcpus-per-vm/

vSAN Adaptive Resync, what does it do?

Duncan Epping · Jan 18, 2018 ·

I am starting to get more questions about vSAN Adaptive Resync lately. This was introduced a while back, but it is also available in the latest versions of vSAN through vSphere 6.5 Patch 02. As a result, various folks have started to look at it and are wondering what it is. Hopefully by now everyone understands what resync traffic is and when you see it. The easiest example of course is a host failure: if a host has failed, and there is sufficient disk space and there are additional hosts available to make the impacted VMs compliant with their policy again, then vSAN will resync the data.

Resync aims to finish the creation of these new components as soon as possible; the simple reason for this is availability. The longer the resync takes, the longer you are at risk. I think that makes sense, right? In some cases, however, when VMs are very busy and a resync is happening, VM observed latency can go through the roof. We already had a manual throttling mechanism for when this situation occurs, but preferably vSAN should of course throttle resync traffic properly for you. This is what vSAN Adaptive Resync does.

So how does that work? Well, when the high watermark for VM latency is reached, vSAN cuts the bandwidth of resync traffic in half. Next, vSAN checks whether the VM latency is below the low watermark; if not, it cuts resync traffic in half again. It does this until the latency is below the low watermark. Once the latency is below the low watermark, vSAN gradually increases the bandwidth of resync traffic until the low watermark is reached, and stays at that level. (Some official info can be found in this kb, and this Virtual Blocks blog.)
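To make the behaviour a bit more tangible, here is a toy Python sketch of the control loop as described above. It is purely illustrative; the watermark handling, the step size and the state tracking are assumptions, not vSAN’s actual implementation:

# Toy model of the described adaptive resync behaviour; NOT vSAN code.
def next_resync_bandwidth(bandwidth, latency, low_wm, high_wm,
                          throttling=False, step=0.05, max_bw=1.0):
    """Return (new_bandwidth, throttling_state) for one control interval.
    Bandwidth is expressed as a fraction of the configured resync limit."""
    if throttling:
        # Keep halving until VM latency drops below the low watermark.
        if latency >= low_wm:
            return bandwidth / 2, True
        return bandwidth, False
    if latency >= high_wm:
        # High watermark hit: start throttling by cutting resync bandwidth in half.
        return bandwidth / 2, True
    if latency < low_wm:
        # Healthy latency: grow resync bandwidth gradually up to the limit.
        return min(max_bw, bandwidth + step), False
    # Between the watermarks (and not throttling): hold at the current level.
    return bandwidth, False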

Hope that helps,

Using HA VM Component Protection in a mixed environment

Duncan Epping · Nov 29, 2017 ·

I have some customers who are running both traditional storage and vSAN in the same environment. As most of you are aware, vSAN and VMCP do not go together at this point. So what does that mean for traditional storage, where for certain storage failure scenarios you can benefit from VMCP?

Well, the statement around vSAN and VMCP is actually a bit more delicate. vSAN does not propagate PDL or APD in a way that VMCP understands. So you can enable VMCP in your environment without it having an impact on VMs running on top of vSAN. The VMs that are running on traditional storage will be able to use the VMCP functionality, and if an APD or PDL is declared on the LUN they are running on, vSphere HA will take action. For vSAN, well, we don’t propagate the state of a disk that way, and we have other mechanisms to provide availability / resiliency.

In summary: Yes, you can enable HA VMCP in a mixed storage environment (vSAN + Traditional Storage). It is fully supported.
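For those who want to script it, here is a minimal pyVmomi sketch that enables VMCP on a cluster. The chosen PDL/APD responses are just examples, and the cluster object is assumed to have been retrieved already:

# Minimal pyVmomi sketch: enable VM Component Protection on a cluster.
# Assumes "cluster" is a vim.ClusterComputeResource; the PDL/APD responses
# below are example choices, not recommendations.
from pyVmomi import vim

def enable_vmcp(cluster):
    vmcp = vim.cluster.VmComponentProtectionSettings()
    vmcp.vmStorageProtectionForPDL = "restartAggressive"    # restart VMs on PDL
    vmcp.vmStorageProtectionForAPD = "restartConservative"  # restart VMs after APD timeout
    vmcp.enableAPDTimeoutForHosts = True                     # start the APD timer on the hosts

    spec = vim.cluster.ConfigSpecEx()
    spec.dasConfig = vim.cluster.DasConfigInfo()
    spec.dasConfig.vmComponentProtecting = "enabled"         # turn VMCP on for the cluster
    spec.dasConfig.defaultVmSettings = vim.cluster.DasVmSettings(
        vmComponentProtectionSettings=vmcp)
    return cluster.ReconfigureComputeResource_Task(spec, modify=True)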

Isolation Address in a 2-node direct connect vSAN environment?

Duncan Epping · Nov 22, 2017 ·

As most of you know by now, when vSAN is enabled vSphere HA uses the vSAN network for heartbeating. I recently wrote an article about the isolation address and its relationship with heartbeat datastores. In the comment section, Johann asked what the settings should be for 2-node direct connect with vSAN. A very valid question, as an isolation is still possible, although not as likely as with a stretched cluster, considering you do not have a network switch for vSAN in this configuration. Anyway, you would still like the VMs that are impacted by the isolation to be powered off, and you would like the other remaining host to power them back on.

So the question remains, which IP address do you select? Well, there’s no IP address to select in this particular case. As it is “direct connect”, there are probably only 2 IP addresses on that segment (one for host 1 and another for host 2). You cannot use the default gateway either, as that is the gateway for the management interface, which is the wrong network. So this is what I recommend:

  • Disable the Isolation Response: set it to “Leave powered on” or “Disabled” (depending on the version used)
  • Disable the use of the default gateway by setting the following HA advanced setting (see the sketch below):
    • das.usedefaultisolationaddress = false
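The same two settings can also be applied programmatically. Below is a minimal pyVmomi sketch; “none” is the API value that corresponds to the “Leave powered on” isolation response, and the cluster object is assumed to have been retrieved already:

# Minimal pyVmomi sketch of the two recommendations above; illustration only.
# Assumes "cluster" is a vim.ClusterComputeResource already retrieved from vCenter.
from pyVmomi import vim

def configure_2node_ha_isolation(cluster):
    spec = vim.cluster.ConfigSpecEx()
    spec.dasConfig = vim.cluster.DasConfigInfo()
    # Do not use the default (management) gateway as the isolation address.
    spec.dasConfig.option = [
        vim.option.OptionValue(key="das.usedefaultisolationaddress", value="false"),
    ]
    # "none" is the API value for the "Leave powered on" isolation response.
    spec.dasConfig.defaultVmSettings = vim.cluster.DasVmSettings(
        isolationResponse="none")
    return cluster.ReconfigureComputeResource_Task(spec, modify=True)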

That probably makes you wonder what will happen when a host is isolated from the rest of the cluster (the other host and the witness). Well, when this happens the VMs are still killed, not as a result of the isolation response kicking in, but as a result of vSAN kicking in. Here’s the process:

  • Heartbeats are not received
  • Host elects itself primary
  • Host pings the isolation address
    • If the host can’t ping the gateway of the management interface then the host declares itself isolated
    • If the host can ping the gateway of the management interface then the host doesn’t declare itself isolated
  • Either way, the isolation response is not triggered as it is set to “Leave powered on”
  • vSAN will now automatically kill all VMs which have lost access to their components
    • The isolated host will lose quorum
    • vSAN objects will become isolated
    • The advanced setting “VSAN.AutoTerminateGhostVm=1” allows vSAN to kill the “ghosted” VMs (with all components inaccessible).

In other words, don’t worry about the isolation address in a 2-node configuration; vSAN has this situation covered! Note that “VSAN.AutoTerminateGhostVm=1” only works for 2-node and stretched vSAN configurations at this time.
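If you want to verify the setting on a host, a minimal pyVmomi sketch along the following lines should work; the host object is assumed to have been retrieved already, and the option key is the one mentioned above:

# Minimal pyVmomi sketch: read VSAN.AutoTerminateGhostVm on an ESXi host.
# Assumes "host" is a vim.HostSystem already retrieved from vCenter.
def get_auto_terminate_ghost_vm(host):
    opts = host.configManager.advancedOption.QueryOptions("VSAN.AutoTerminateGhostVm")
    # QueryOptions returns a list of OptionValue objects; a value of 1 means enabled.
    return {o.key: o.value for o in opts}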

UPDATE:

I triggered a failure in my lab (which is 2-node, but not direct connect), and for those who are wondering, this is what you should be seeing in your syslog.log:

syslog.log:2017-11-29T13:45:28Z killInaccessibleVms.py [INFO]: Following VMs are powered on and HA protected in this host.
syslog.log:2017-11-29T13:45:28Z killInaccessibleVms.py [INFO]: * ['vm-01', 'vm-03', 'vm-04']
syslog.log:2017-11-29T13:45:32Z killInaccessibleVms.py [INFO]: List inaccessible VMs at round 1
syslog.log:2017-11-29T13:45:32Z killInaccessibleVms.py [INFO]: * ['vim.VirtualMachine:1', 'vim.VirtualMachine:2', 'vim.VirtualMachine:3']
syslog.log:2017-11-29T13:46:06Z killInaccessibleVms.py [INFO]: List inaccessible VMs at round 2
syslog.log:2017-11-29T13:46:06Z killInaccessibleVms.py [INFO]: * ['vim.VirtualMachine:1', 'vim.VirtualMachine:2', 'vim.VirtualMachine:3']
syslog.log:2017-11-29T13:46:06Z killInaccessibleVms.py [INFO]: Following VMs are found to have all objects inaccessible, and will be terminated.
syslog.log:2017-11-29T13:46:06Z killInaccessibleVms.py [INFO]: * ['vim.VirtualMachine:1', 'vim.VirtualMachine:2', 'vim.VirtualMachine:3']
syslog.log:2017-11-29T13:46:06Z killInaccessibleVms.py [INFO]: Start terminating VMs.
syslog.log:2017-11-29T13:46:06Z killInaccessibleVms.py [INFO]: Successfully terminated inaccessible VM: vm-01
syslog.log:2017-11-29T13:46:06Z killInaccessibleVms.py [INFO]: Successfully terminated inaccessible VM: vm-03
syslog.log:2017-11-29T13:46:06Z killInaccessibleVms.py [INFO]: Successfully terminated inaccessible VM: vm-04
syslog.log:2017-11-29T13:46:06Z killInaccessibleVms.py [INFO]: Finished killing the ghost vms

vSphere and iSCSI storage best practices

Duncan Epping · Nov 1, 2017 ·

And here’s the next paper I updated. This time it is the iSCSI storage best practices for vSphere. It seems that we have now overhauled most of the core storage white papers. You can find them all under core storage on storagehub.vmware.com, but for your convenience I will post the iSCSI links below as well:

  • Best Practices For Running VMware vSphere On iSCSI (web based reading)
  • Best Practices For Running VMware vSphere On iSCSI (pdf download)

One thing I want to point out, as it is a significant change compared to the last version of the paper, is the following: in the past vSphere did not support IPsec, so for the longest time it was not supported for iSCSI either. When reviewing all available material I noticed that, although vSphere now supports IPsec (with IPv6 only), there was no statement around iSCSI.

So what does that mean for iSCSI? Well, as of vSphere 6.0 it is supported to have IPsec enabled on the vSphere software iSCSI implementation, but only for IPv6 implementations and not for IPv4! Note, however, that there’s no data on the potential performance impact, and enabling IPsec could (I should probably say “will” instead of “could”) introduce latency / overhead. In other words, if you want to enable this, make sure to test the impact it has on your workload.

