We all know (at least I hope so) what HA is responsible for within a vSphere Cluster. Although it is great that vSphere HA responds to a failure of a host / VM / application and even in some cases your storage device; wouldn’t it be nice if vSphere HA could pro-actively respond to conditions which might lead to a failure? That is what we want to discuss in this article.
What we are exploring right now is the ability for HA to avoid unplanned downtime. HA would detect specific (health) conditions that could lead to catastrophic failures and pro-actively move virtual machines of that host. You could for instance think of a situation where 1 out of 2 storage paths goes down. Although not directly impacting the machines from an availability perspective, it could be catastrophic if that second path goes down. So in order to avoid ending up in this situation vSphere HA would vMotion all the virtual machines to a host which does not have a failure.
This could of course also apply to other components like networking or even memory or CPU. You could potentially have a memory dimm which is reporting specific issues that could impact availability, this in its turn could then trigger HA to pro-actively move all potentially impacted VMs to a different host.
A couple of questions we have for you:
- When such partial host failures occur today, how do you address these conditions? When do you bring the host back online?
- What level of integration do you expect with management tools? In other words, should we expose an API that your management solution can consume, or do you prefer this to be a stand-alone solution using a CIM provider for instance?
- Should HA treat all health conditions the same? I.e., always evacuate all VMs from an “unhealthy” host?
- How would you like HA to compare two conditions? E.g., H1 fan failure, H2 network path failure?
Please chime in,