It used to be a best practice to increase “das.failuredetectiontime” to 30000 (the value is in milliseconds) for an active/standby setup. That way, when a failover to another NIC occurs, you have at least 30 seconds to switch over before HA starts shutting down VMs. The default value is 15000, by the way.
In case it’s not clear, I’m talking about a setup like this:
- vSwitch0 – 2 Physical nics(vmnic0 & vmnic2) – 2 Portgroups (Service Console & VMkernel)
Service Console active on vmnic0 and standby on vmnic2
VMkernel active on vmnic2 and standby on vmnic0
Each portgroup has a VLAN assigned and runs dedicated on its own NIC. Only in the case of a fault is it switched over to the standby NIC, and it returns to the original NIC when the connection is up again.
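To make the active/standby behaviour concrete, here is a minimal Python sketch of the setup above. It is only a model for illustration, not VMware code: the Portgroup class and the current_nic function are hypothetical names, and the link states are passed in by hand.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Portgroup:
    """Hypothetical model of a portgroup with one active and one standby NIC."""
    name: str
    active: str
    standby: str


def current_nic(pg: Portgroup, link_up: Dict[str, bool]) -> Optional[str]:
    """Return the NIC carrying the portgroup's traffic, or None if both links are down.

    The active NIC is always preferred, which also models the failback:
    as soon as its link is up again, traffic returns to it.
    """
    if link_up.get(pg.active, False):
        return pg.active
    if link_up.get(pg.standby, False):
        return pg.standby
    return None


# The two portgroups on vSwitch0 as described in the post.
SERVICE_CONSOLE = Portgroup("Service Console", active="vmnic0", standby="vmnic2")
VMKERNEL = Portgroup("VMkernel", active="vmnic2", standby="vmnic0")

if __name__ == "__main__":
    # vmnic0 fails: both portgroups now share vmnic2.
    links = {"vmnic0": False, "vmnic2": True}
    for pg in (SERVICE_CONSOLE, VMKERNEL):
        print(pg.name, "->", current_nic(pg, links))
```

With both links up, each portgroup stays on its own NIC; when one link drops, both portgroups share the surviving NIC until the failed link comes back.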
I just noticed in the Resource Management Guide PDF that the best practice is now to increase it to 60000. In other words, it can take up to 60 seconds before HA starts restarting machines. With a secondary service console you only need to increase the value by 5 seconds, because an additional isolation address needs to be checked. In other words, a secondary service console saves you 30 seconds when isolation occurs, which can be a lot in a 7×24 environment.
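The numbers above can be written down as a small Python helper. This is a sketch under one explicit assumption: that the 5 second increase for a secondary service console is applied on top of the 15000 default, which the post does not spell out. The function and constant names are hypothetical and not part of any VMware tool.

```python
DEFAULT_FAILURE_DETECTION_MS = 15000        # default value mentioned in the post
ACTIVE_STANDBY_RECOMMENDED_MS = 60000       # value the post cites for active/standby
SECONDARY_CONSOLE_EXTRA_MS = 5000           # extra time to check one more isolation address


def recommended_failure_detection_ms(secondary_service_console: bool) -> int:
    """Return a das.failuredetectiontime value (ms) based on the figures in the post.

    Assumption: with a secondary service console, the 5 second increase
    is added to the default of 15000 ms.
    """
    if secondary_service_console:
        return DEFAULT_FAILURE_DETECTION_MS + SECONDARY_CONSOLE_EXTRA_MS
    return ACTIVE_STANDBY_RECOMMENDED_MS


if __name__ == "__main__":
    for has_secondary in (False, True):
        ms = recommended_failure_detection_ms(has_secondary)
        print(f"secondary service console={has_secondary}: {ms} ms ({ms // 1000} s)")
```

The resulting value would then be entered as the das.failuredetectiontime advanced option on the HA cluster.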
So, as I blogged three months ago, going for a secondary service console is definitely the best option you have for service console redundancy today! Keep in mind, though, that your secondary service console needs to be in a different subnet than the primary!
John Troyer says
When did this change? How can we highlight changes in best practices for people in a way that they’ll actually notice, read, and digest?