das.failuredetectiontime for active/standby COS vswitch

It used to be a best practice to increase “das.failuredetectiontime” to 30000 (the value is in milliseconds) for an active/standby setup. That way, when a failover to another NIC occurs, you have at least 30 seconds for the switchover before HA starts shutting down VMs. The default value is 15000, by the way.

In case it’s not clear, I’m talking about a setup like this:

  • vSwitch0 - 2 physical NICs (vmnic0 & vmnic2) - 2 portgroups (Service Console & VMkernel)
    Service Console active on vmnic0 and standby on vmnic2
    VMkernel active on vmnic2 and standby on vmnic0
    Each portgroup has a VLAN assigned and runs on its own dedicated NIC. Only when a fault occurs does it switch over to the standby NIC, and it returns to the original NIC once the connection is up again.
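
To make the failover and failback behaviour concrete, here is a minimal Python sketch of the layout above. It is only a model for illustration, not anything ESX or vCenter runs; the `Portgroup` class and the `placement` helper are hypothetical names I made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Portgroup:
    """Illustrative model of one portgroup with an active and a standby uplink."""
    name: str
    active: str   # preferred uplink
    standby: str  # used only while the active uplink is down

    def current_nic(self, failed_nics):
        """Return the uplink carrying traffic, or None if both uplinks are down."""
        if self.active not in failed_nics:
            return self.active   # normal case; also covers failback after recovery
        if self.standby not in failed_nics:
            return self.standby  # failover to the standby NIC
        return None


# The vSwitch0 layout described in the list above.
PORTGROUPS = [
    Portgroup("Service Console", active="vmnic0", standby="vmnic2"),
    Portgroup("VMkernel", active="vmnic2", standby="vmnic0"),
]


def placement(failed_nics=frozenset()):
    """Map each portgroup name to the NIC it would currently use."""
    return {pg.name: pg.current_nic(failed_nics) for pg in PORTGROUPS}


if __name__ == "__main__":
    print(placement())            # both NICs healthy: each portgroup on its own NIC
    print(placement({"vmnic0"}))  # vmnic0 down: both portgroups share vmnic2
```

Calling `placement()` again with an empty set models vmnic0 coming back: the Service Console returns to its original NIC.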

I just noticed in the Resource Management Guide PDF that the best practice is now to increase it to 60000. In other words, it can take up to 60 seconds before HA starts restarting machines. With a secondary service console you only need to increase the default by 5 seconds, to 20000, because one additional isolation address needs to be checked. In other words, a secondary service console saves you about 40 seconds (20 instead of 60) when isolation occurs, which can be a lot in a 24×7 environment.
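
The numbers are easy to mix up, so here is a small Python sketch of the arithmetic using the values mentioned in this post. The function names are my own, and the `option = value` string is only an illustrative rendering of the advanced option, not an official file format.

```python
DEFAULT_MS = 15_000                 # HA default for das.failuredetectiontime
ACTIVE_STANDBY_MS = 60_000          # value quoted above for an active/standby vSwitch
EXTRA_ISOLATION_ADDRESS_MS = 5_000  # extra time for one additional isolation address


def secondary_console_ms(default_ms=DEFAULT_MS):
    """Detection time when a secondary service console is used instead."""
    return default_ms + EXTRA_ISOLATION_ADDRESS_MS


def advanced_option(value_ms):
    """Render the setting as an option/value pair (illustrative format only)."""
    return f"das.failuredetectiontime = {value_ms}"


if __name__ == "__main__":
    for label, ms in [
        ("active/standby NICs", ACTIVE_STANDBY_MS),
        ("secondary service console", secondary_console_ms()),
    ]:
        print(f"{label}: {advanced_option(ms)} ({ms / 1000:.0f} s)")
```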

So, as I blogged three months ago, going for a secondary service console is definitely the best option you have for service console redundancy today! Keep in mind, though, that your secondary service console needs to be in a different subnet than the primary!
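
If you want to sanity-check the subnet requirement before configuring anything, a quick sketch with Python's standard `ipaddress` module could look like this. The addresses are made-up examples and `in_different_subnets` is a hypothetical helper, not part of any VMware tooling.

```python
import ipaddress


def in_different_subnets(primary, secondary):
    """True if the two interface addresses (CIDR notation) are on non-overlapping networks."""
    net_a = ipaddress.ip_interface(primary).network
    net_b = ipaddress.ip_interface(secondary).network
    return not net_a.overlaps(net_b)


if __name__ == "__main__":
    # Example addresses only; substitute your own service console IPs.
    print(in_different_subnets("10.0.0.11/24", "10.0.1.11/24"))
    print(in_different_subnets("10.0.0.11/24", "10.0.0.12/24"))
```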





One Response to “das.failuredetectiontime for active/standby COS vswitch”

  1. John Troyer says:

    When did this change? How can we highlight changes in best practices for people in a way that they’ll actually notice, read, and digest?
