Yellow Bricks

by Duncan Epping



Device X is not listed on the vSAN Compatibility Guide, can I still use it?

Duncan Epping · Jan 8, 2019 ·

I get this question almost daily, and I am pretty sure I have answered it various times, but just in case it wasn’t clear I figured I would share the answer to the question of whether a device should be used in a vSAN cluster when it is not listed on the vSAN Compatibility Guide. If you have not looked at the components variant of the VCG for vSAN, please take a look here: http://vmwa.re/vsanhclc. Of course, we also have an easier route, which is the ReadyNode VCG. But some may want to tweak based on performance, cost, etc. I get that, and so does VMware; that is why we have listed all supported and tested components. Can you use a device which is not listed? Sure you can. Will VMware support the environment? Maybe they will, maybe they won’t! Should you use a device which is not listed if the previous answer is maybe? No!

So let’s be clear and let’s answer the two most asked questions:

  • Device X is not listed on the vSAN Compatibility Guide, can I still use it?
    • No, you should not. If any problem arises, chances are you will not get the support you need as a result of an unsupported configuration. Sure, VMware Support will usually do their best to help, but if it appears the unsupported device is causing the problem then it becomes difficult. Please do not use devices which are not listed.
  • Device X is listed with Firmware version Y, but the OEM says I should use Z, what to do?
    • Ask the OEM why the version is not listed on VMware’s VCG website. Vendors are responsible for certifying components and the software (drivers / firmware) associated with them. If a version is not listed, then it has either not been submitted yet, it has not been tested, or it has not passed the tests. Please only use tested and listed versions; the only exception is when both VMware GSS and the OEM point you to a new version. A quick sketch of such a check follows below.
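
To make that version check concrete, here is a minimal Python sketch. The allowlist is a hypothetical hand-maintained extract for illustration only; the authoritative source is the VCG at http://vmwa.re/vsanhclc.

```python
# Hypothetical extract of VCG entries: device model -> set of listed
# (firmware, driver) combinations. Illustration only, not real VCG data.
CERTIFIED = {
    "ExampleHBA-330": {("16.17.00.03", "example_driver 17.00.02.00")},
}

def is_listed(device: str, firmware: str, driver: str) -> bool:
    """Treat anything not explicitly listed as unsupported."""
    return (firmware, driver) in CERTIFIED.get(device, set())

print(is_listed("ExampleHBA-330", "16.17.00.03", "example_driver 17.00.02.00"))  # True
print(is_listed("ExampleHBA-330", "99.00.00.00", "example_driver 17.00.02.00"))  # False, do not use
```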

Hope that helps,

Doing maintenance on a Two-Node (Direct Connect) vSAN configuration

Duncan Epping · Mar 13, 2018 ·

I was talking to a partner and a customer last week at a VMUG. They were running a two-node (direct connect) vSAN configuration and had some issues during maintenance which were, to them, not easy to explain. What they did is they placed the host which was in the “preferred fault domain” into maintenance mode. After they placed that host into maintenance mode, the link between the two hosts failed for whatever reason. After they rebooted the host in the preferred fault domain, it reconnected to the witness, but at that point the connection between the two hosts had not returned yet. This confused vSAN, and that resulted in the scenario where the VMs in the secondary fault domain were powered off. As you can imagine, an undesired effect.

This issue will be solved in the near future in a new version of vSAN, but for those who need to do maintenance on a two-node (direct connect) configuration (or a full site maintenance in a stretched environment), I would highly recommend the following simple procedure. This needs to be done when doing maintenance on the host which is in the “preferred fault domain”:

  • Change the preferred fault domain
    • Under vSAN, click Fault Domains and Stretched Cluster.
    • Select the secondary fault domain and click the “Mark Fault Domain as preferred for Stretched Cluster” icon.
  • Place the host into maintenance mode
  • Do your maintenance

Fairly straightforward, but important to remember…
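
For those who want to script the fault domain flip, below is a rough sketch using the vSAN Management SDK for Python (vsanapiutils / vsanmgmtObjects from VMware’s vsan-sdk-python) together with pyVmomi. The method and object names follow the SDK samples as I remember them, and the vCenter address, credentials, cluster name, fault domain name, and host name are placeholders; verify everything against your SDK version before relying on it.

```python
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim
import vsanmgmtObjects  # registers the vSAN managed object types with pyVmomi
import vsanapiutils     # helper shipped with the vSAN Management SDK for Python

ctx = ssl._create_unverified_context()  # lab only; validate certificates in production
si = SmartConnect(host='vcenter.lab.local', user='administrator@vsphere.local',
                  pwd='VMware1!', sslContext=ctx)

# Find the two-node cluster (cluster name is a placeholder).
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == 'TwoNodeCluster')

# Step 1: mark the *other* fault domain as preferred before maintenance.
vc_mos = vsanapiutils.GetVsanVcMos(si._stub, context=ctx)
stretched_system = vc_mos['vsan-stretched-cluster-system']
stretched_system.VSANVcSetPreferredFaultDomain(
    cluster=cluster, preferredFd='Secondary')  # returns a vSAN task; wait on it in real code

# Step 2: place the (formerly preferred) host into maintenance mode with
# the default vSAN data handling, "ensure accessibility".
host = next(h for h in cluster.host if h.name == 'esxi-01.lab.local')
spec = vim.host.MaintenanceSpec(
    vsanMode=vim.vsan.host.DecommissionMode(objectAction='ensureObjectAccessibility'))
host.EnterMaintenanceMode_Task(timeout=0, maintenanceSpec=spec)

# Step 3: do your maintenance, exit maintenance mode, and flip the
# preferred fault domain back if desired.
```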

Using HA VM Component Protection in a mixed environment

Duncan Epping · Nov 29, 2017 ·

I have some customers who are running both traditional storage and vSAN in the same environment. As most of you are aware, vSAN and VMCP do not go together at this point. So what does that mean for traditional storage, where VMCP can protect you against certain storage failure scenarios?

Well, the statement around vSAN and VMCP is actually a bit more delicate. vSAN does not propagate PDL or APD in a way which VMCP understands. So you can enable VMCP in your environment without it having an impact on VMs running on top of vSAN. The VMs which are running on the traditional storage will be able to use the VMCP functionality, and if an APD or PDL is declared on the LUN they are running on, vSphere HA will take action. For vSAN, well, we don’t propagate the state of a disk that way, and we have other mechanisms to provide availability / resiliency.

In summary: Yes, you can enable HA VMCP in a mixed storage environment (vSAN + Traditional Storage). It is fully supported.
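
For completeness, here is a minimal pyVmomi sketch of enabling VMCP on such a mixed cluster. It assumes you already have a connected ServiceInstance and have looked up the cluster object; the PDL/APD response values shown are just one reasonable choice for illustration, not a recommendation from this post.

```python
from pyVmomi import vim

# 'cluster' is assumed to be a vim.ClusterComputeResource obtained elsewhere.
vmcp_settings = vim.cluster.VmComponentProtectionSettings(
    vmStorageProtectionForPDL='restartAggressive',    # restart VMs whose LUN reports PDL
    vmStorageProtectionForAPD='restartConservative',  # restart VMs after the APD timeout
    vmTerminateDelayForAPDSec=180,
    vmReactionOnAPDCleared='reset',
)
spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(
        vmComponentProtecting='enabled',  # turn VMCP on for the cluster
        defaultVmSettings=vim.cluster.DasVmSettings(
            vmComponentProtectionSettings=vmcp_settings),
    )
)
# Only VMs on traditional (VMFS/NFS) datastores are affected;
# vSAN does not raise PDL/APD this way, as described above.
cluster.ReconfigureComputeResource_Task(spec, modify=True)
```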

vSphere HA heartbeat datastores, the isolation address and vSAN

Duncan Epping · Nov 8, 2017 ·

I’ve written about vSAN and vSphere HA various times, but I don’t think this has been explicitly called out before. Cormac and I were doing some tests this week and noticed something. When we were looking at the results, I realized I had described it in my HA book a long time ago, but it is hidden away so deep that probably no one has noticed.

In a traditional environment, when you enable HA you will automatically have HA heartbeat datastores selected. These heartbeat datastores are used by the HA master host to determine what has happened to a host which is no longer reachable over the management network. In other words, when a host is isolated it will communicate this to the HA master using the heartbeat datastores. It will also inform the HA master which VMs were powered off as a result of this isolation event (or not powered off when the isolation response is not configured).

Now, with vSAN the management network is not used for communication between the hosts; the vSAN network is used instead. Typically in a vSAN environment there’s only vSAN storage, so there are no heartbeat datastores. As such, when a host is isolated it is not possible to communicate this to the HA master. Remember, the network is down and there is no access to the vSAN datastore, so the host cannot communicate through that either. HA will still function as expected though. You can set the isolation response to power-off and then the VMs will be killed and restarted. That is, if isolation is declared.

So when is isolation declared? A host declares itself isolated when both of the following are true (modeled in the snippet after the list):

  1. It is not receiving any communication from the master
  2. It cannot ping the isolation address
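
As a toy model (this is an illustration of the logic, not FDM’s actual code), the decision comes down to both checks failing:

```python
def declares_isolation(hears_master: bool, pings_isolation_address: bool) -> bool:
    """Toy model: a host only declares itself isolated when it neither
    hears from the master nor can ping the isolation address."""
    return not hears_master and not pings_isolation_address

# vSAN network down, but the default isolation address (the management
# gateway) still answers: isolation is NOT declared, which is exactly the
# problem scenario described below.
print(declares_isolation(hears_master=False, pings_isolation_address=True))   # False
print(declares_isolation(hears_master=False, pings_isolation_address=False))  # True
```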

Now, if you have not set any advanced settings, the default gateway of the management network will be the isolation address. Just imagine your vSAN network being isolated on a given host, but for whatever reason the management network is not. In that scenario isolation is not declared: the host can still ping the isolation address using the management network vmkernel interface. HOWEVER… vSphere HA will restart the VMs. The VMs have lost access to disk, and as such the locks on the VMDKs are lost. HA notices the host is gone, which must mean that the VMs are dead as the locks are lost, so let’s restart them.

That is when you could end up in the situation where the VMs are running on the isolated host and also somewhere else in the cluster, both with the same MAC address and the same name / IP address. Not a good situation. Now, if you had datastore heartbeats enabled, this would be prevented: the isolated host would inform the master it is isolated, but it would also inform the master about the state of the VMs, which would be powered on. The master would then decide not to restart the VMs. However, the VMs which are running on the isolated host are more or less useless as they cannot write to disk anymore.

Let’s describe what we tested and what the outcome was in a way that is a bit easier to consume, a table:

Isolation Address  | Datastore Heartbeats | Observed behavior
IP on vSAN Network | Not configured       | Isolated host cannot ping the isolation address; isolation declared; VMs killed and restarted
Management Network | Not configured       | Host can ping the isolation address, so isolation is not declared, yet the rest of the cluster restarts the VMs even though they are still running on the isolated host
IP on vSAN Network | Configured           | Isolated host cannot ping the isolation address; isolation declared; VMs killed and restarted
Management Network | Configured           | VMs are not powered off and not restarted: the “isolated” host can still ping the isolation address, and the datastore heartbeat mechanism informs the master about the state. The master knows the HA network is not working, but the VMs are not powered off.

So what did we learn, and what should you do when you have vSAN? Always use an isolation address which is in the same network as vSAN! This way, during an isolation event the isolation is validated using the vSAN vmkernel interface. Always set the isolation response to power-off (my personal opinion, based on testing). This avoids the scenario of duplicate MAC addresses / IPs / names on the network when a single network is isolated for a specific host! And if you have traditional storage, you can enable heartbeat datastores. It doesn’t add much in terms of availability, but it will allow the HA hosts to communicate state through the datastore.

PS1: For those who don’t know, HA is configured to automatically select a heartbeat datastore. In a vSAN-only environment you can disable this by selecting “Use datastore from only the specified list” in the HA interface and then setting “das.ignoreInsufficientHbDatastore = true” in the advanced HA settings.

PS2: In a non-routable vSAN network environment you could create a Switch Virtual Interface (SVI) on the physical switch. This will give you an IP on the vSAN segment to use as the isolation address, leveraging the advanced setting das.isolationaddress0.
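
Putting those recommendations together, a pyVmomi sketch for a vSAN cluster could look like this. The das.* keys are the ones named above; the vSAN-segment IP, the heartbeat datastore policy value, and the cluster lookup are assumptions to verify for your environment.

```python
from pyVmomi import vim

# 'cluster' is assumed to be a vim.ClusterComputeResource obtained elsewhere.
ha_options = [
    # Stop using the management network default gateway as the isolation address...
    vim.option.OptionValue(key='das.usedefaultisolationaddress', value='false'),
    # ...and validate isolation on the vSAN segment instead
    # (placeholder IP; e.g. an SVI on the physical switch, see PS2).
    vim.option.OptionValue(key='das.isolationaddress0', value='192.168.99.1'),
    # Suppress the warning about missing heartbeat datastores (see PS1).
    vim.option.OptionValue(key='das.ignoreInsufficientHbDatastore', value='true'),
]
spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(
        option=ha_options,
        # Only use explicitly selected heartbeat datastores (none selected here).
        hBDatastoreCandidatePolicy='userSelectedDs',
        defaultVmSettings=vim.cluster.DasVmSettings(
            isolationResponse='powerOff'),  # the isolation response recommended above
    )
)
cluster.ReconfigureComputeResource_Task(spec, modify=True)
```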


** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **

Which disk controller to use for vSAN

Duncan Epping · Sep 28, 2017 ·

I have many customers going through the plan and design phase for implementing a vSAN based infrastructure. Many of them have conversations with OEMs, and this typically results in a set of recommendations in terms of which hardware to purchase. One thing that seems to be a recurring theme is the question of which disk controller a customer should buy. The typical recommendation seems to be the most beefy disk controller on the list. I wrote about this a while ago as well, and want to re-emphasize my thinking. Before I do, I understand why these recommendations are being made. Traditionally, with local storage devices, selecting the high-end disk controller made sense: it provided a lot of the functionality you needed for decent performance and for availability of your data. With vSAN, however, this is not needed; it is all provided by our software layer.

When it comes to disk controllers my recommendation is simple: go for the simplest device on the list that has a good queue depth. Just to give an example, the Dell H730 disk controller is often recommended, but if you look at the vSAN Compatibility Guide you will also see the HBA330. The big difference between these two is the RAID functionality and the onboard cache offered by the H730. Again, this functionality is not needed for vSAN; by going for the HBA330 you will save money. (For HP I would recommend the H240 disk controller.)

Having said that, I would at the same time recommend customers to consider NVMe for the caching tier instead of SAS or SATA connected flash. Why? Well, for the caching layer it makes sense to avoid the disk controller. Place the flash as close to the CPU as you can get for low latency and high throughput. In other words, invest the money you save by not buying the more expensive disk controller in NVMe connected flash for the caching layer.
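
To put that heuristic into code form: prefer the simplest controller (a plain pass-through HBA, no RAID logic or cache) that still has a healthy queue depth. The controller names, queue depths, and the threshold below are illustrative placeholders, not VCG data; real queue depths come from the VCG entry for each controller.

```python
# Illustrative data only; look up the real values on the vSAN VCG
# (on a host, "esxcli storage core device list" also reports queue depths).
candidates = [
    {"model": "BeefyRAID-9000", "queue_depth": 895, "raid_and_cache": True},
    {"model": "PlainHBA-330",   "queue_depth": 600, "raid_and_cache": False},
    {"model": "BudgetHBA-100",  "queue_depth": 64,  "raid_and_cache": False},
]

MIN_QUEUE_DEPTH = 256  # arbitrary illustrative bar for a "good" queue depth

def pick_controller(candidates):
    """Filter out shallow queues, then prefer the simplest device:
    pass-through HBAs before RAID controllers, deeper queues first."""
    deep_enough = [c for c in candidates if c["queue_depth"] >= MIN_QUEUE_DEPTH]
    deep_enough.sort(key=lambda c: (c["raid_and_cache"], -c["queue_depth"]))
    return deep_enough[0] if deep_enough else None

print(pick_controller(candidates))  # -> the PlainHBA-330 entry
```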

