HA Futures: Pro-active response

We all know (at least I hope so) what HA is responsible for within a vSphere Cluster. Although it is great that vSphere HA responds to a failure of a host / VM / application, and even in some cases your storage device, wouldn’t it be nice if vSphere HA could pro-actively respond to conditions which might lead to a failure? That is what we want to discuss in this article.

What we are exploring right now is the ability for HA to avoid unplanned downtime. HA would detect specific (health) conditions that could lead to catastrophic failures and pro-actively move virtual machines off that host. You could for instance think of a situation where 1 out of 2 storage paths goes down. Although this does not directly impact the virtual machines from an availability perspective, it could be catastrophic if that second path goes down as well. So in order to avoid ending up in this situation, vSphere HA would vMotion all the virtual machines to a host which does not have a failure.

This could of course also apply to other components like networking or even memory or CPU. You could potentially have a memory DIMM which is reporting specific issues that could impact availability; this in turn could then trigger HA to pro-actively move all potentially impacted VMs to a different host.
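
As a rough illustration of the idea above, the decision could look something like the sketch below. Everything in it is hypothetical: the `Host` class, its fields and `plan_evacuation` are made-up names for this example and not part of any vSphere API. It treats a host as degraded when it has lost at least one, but not all, of its storage paths, and picks a target host which does not have a failure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Host:
    """Hypothetical view of a host; not a vSphere object."""
    name: str
    total_paths: int
    active_paths: int

    @property
    def healthy(self) -> bool:
        # No path has been lost.
        return self.active_paths == self.total_paths

    @property
    def degraded(self) -> bool:
        # Still running, but redundancy is reduced (e.g. 1 of 2 paths down).
        return 0 < self.active_paths < self.total_paths


def plan_evacuation(hosts: List[Host]) -> Dict[str, Optional[str]]:
    """Map each degraded host to a healthy target host, or None if there is none."""
    healthy = [h.name for h in hosts if h.healthy]
    target = healthy[0] if healthy else None
    return {h.name: target for h in hosts if h.degraded}


hosts = [Host("esx01", 2, 1), Host("esx02", 2, 2), Host("esx03", 2, 2)]
plan = plan_evacuation(hosts)
print(plan)
```

When every host has lost a path at the same time, there is no healthy target and the sketch plans no move at all.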

A couple of questions we have for you:

  1. When such partial host failures occur today, how do you address these conditions? When do you bring the host back online?
  2. What level of integration do you expect with management tools? In other words, should we expose an API that your management solution can consume, or do you prefer this to be a stand-alone solution using a CIM provider for instance?
  3. Should HA treat all health conditions the same? I.e., always evacuate all VMs from an “unhealthy” host?
  4. How would you like HA to compare two conditions? E.g., H1 fan failure, H2 network path failure?

Please chime in,


    Comments

    1. says

      This has been a dream of mine for a while because it does crop up. DRS does wonders, but this could definitely help. So here we go:

      1. When a partial failure occurs, I see the alert via e-mail or vCenter itself. At my earliest opportunity I begin to vMotion machines to appropriate hosts. After the issue is resolved, I’ll do some testing (be it a memory test on the host or testing paths by browsing datastores). The host is brought back online typically a few hours after the issue is resolved.
      2. Can’t speak to this – we use only vCenter/email alerts.
      3. Imagine a DRS-like configuration where you have classifications – warning, critical, and by component (for example, give a slider for memory, CPU, disk, storage paths etc.). Based on the aggregate combo you specify, perform the evacuation.
      4. It depends on what it is. If a single NIC in a 6 NIC team fails I’m not going to rearrange my day to fix the host. However, if a storage path or power supply fails, I’ll get on it quickly. (Hot swap components like power supply I would not necessarily evacuate immediately…so long as my tech doesn’t pull the wrong one!)

    2. says

      Handling different types of health tests could be interesting: storage is a good example, but other networking tests could also be great.
      At this point it becomes more of a DRS matter (and maybe also SDRS for some storage health conditions) and is not so related to HA.
      Or maybe there is some plan to join DRS with HA in the future?

      • says

        About the questions:

        1. When such partial host failures occur today, how do you address these conditions? When do you bring the host back online?
        This could be handled with a DRS-like setting such as aggressive vs. conservative.
        IMHO a risk level meter could be an interesting threshold.

        2. What level of integration do you expect with management tools? In other words, should we expose an API that your management solution can consume, or do you prefer this to be a stand-alone solution using a CIM provider for instance?
        An API is flexible, but maybe limited without something that provides basic functionality.

        3. Should HA treat all health conditions the same? I.e., always evacuate all VMs from an “unhealthy” host?
        No… see point 1.

        4. How would you like HA to compare two conditions? E.g., H1 fan failure, H2 network path failure?
        AND and OR with custom rules… And maybe some predefined rules.

        A final aspect… could this be integrated with FT or something like a scheduled partial vMotion (one that is not finalized, but that pre-copies the VM state to another host)?
        Example: you lose a path on your host and you start preparing the “state” on the other host. If you lose more paths (going to an APD state) you need only a little time to finish the vMotion and the failover.

    3. says

      Hi Duncan – This sounds like a good idea in theory, however I do have concerns about how well this would work in the real world.

      For example, I’m currently dealing with a bug that makes false hardware (storage) alerts appear for hosts in vCenter. These alerts can occur on multiple hosts, and I would not want vCenter to start vMotioning VMs around in this case. Also, there have been events impacting the storage fabric (Fibre Channel) causing the number of paths for LUNs to be temporarily reduced from four to two for all impacted clusters. I would not want vCenter to vMotion VMs in this case either. These types of issues often impact whole clusters, and not just one host. When this type of storage issue occurs, we usually work with the appropriate teams (i.e. Storage and/or Network) to address the root cause and resolve it.

      There have been times where a DIMM or HBA fails, and we usually just place the host in maintenance mode, fix it, and take it out of maintenance mode. But I’m not sure how a proactive system would have the intelligence to know when to vMotion VMs and when not to.

    4. Josh De Jong says

      1. Being able to categorize these partial failures would be key to automatic or manual intervention. One downed network link isn’t as big of an issue as failing RAM. Perhaps the failed component alone can’t determine the action, but a classification of VM importance (high, medium, low) can? If I lost 1 of my 10GbE cables on a host, I may only be concerned about a handful of VMs being up and not being restarted by HA.

      2. I think an API would be ideal. This would potentially allow for more advanced configurations and choosing the response on a per-VM basis.

      3. High priority VMs can be moved while lower priority VMs can stay. This would prevent any issues with running an over-subscribed environment in the event of any hardware or path failure. Perhaps certain events vacate all VMs while other events just prevent new workloads from moving to the host and wait for manual intervention to allow normal cluster operations.

      4. Are you asking about multiple events on the same host or different events affecting different hosts in the same cluster? Multiple events on the same host would increase the event severity and cause evacuation of VMs. As for multiple events across multiple hosts, not sure what would be the best idea.

    5. Jeff says

      While things like paths down, etc. are easy to detect, I suppose the real value would be if the host could expose things like DIMM error counts, etc. through CIM that vCenter can act on. Still, giving us anything proactive and tunable is pretty cool. I have had IBM hosts PSOD and, looking at the logs, have seen MCEs happening for days, such as:

      2013-09-17T07:18:10.609Z cpu32:913868)MCE: 215: CMCI on cpu32 bank9: Status:0x900000400009008f Misc:0x0 Addr:0x0: Valid.Err enabled.
      2013-09-17T07:18:10.609Z cpu32:913868)MCE: 220: Status bits: “Memory Controller Error.”
      2013-09-20T23:41:14.344Z cpu1:21479)MCE: 215: CMCI on cpu1 bank9: Status:0x900000400009008f Misc:0x0 Addr:0x0: Valid.Err enabled.
      2013-09-20T23:41:14.344Z cpu1:21479)MCE: 220: Status bits: “Memory Controller Error.”
      2013-09-20T23:41:14.431Z cpu16:23386)MCE: 215: CMCI on cpu16 bank9: Status:0x900000400009008f Misc:0x0 Addr:0x0: Valid.Err enabled.
      2013-09-20T23:41:14.431Z cpu16:23386)MCE: 220: Status bits: “Memory Controller Error.”
      2013-09-21T00:14:35.666Z cpu32:16416)MCE: 215: CMCI on cpu32 bank9: Status:0x900000400009008f Misc:0x0 Addr:0x0: Valid.Err enabled.
      2013-09-21T00:14:35.666Z cpu32:16416)MCE: 220: Status bits: “Memory Controller Error.”
      2013-09-21T00:15:10.035Z cpu16:21955)MCE: 215: CMCI on cpu16 bank9: Status:0x900000400009008f Misc:0x0 Addr:0x0: Valid.Err enabled.
      2013-09-21T00:15:10.035Z cpu16:21955)MCE: 220: Status bits: “Memory Controller Error.”
      2013-09-21T00:34:27.647Z cpu32:22017)MCE: 215: CMCI on cpu32 bank8: Status:0x900000400009008f Misc:0x0 Addr:0x0: Valid.Err enabled.
      2013-09-21T00:34:27.647Z cpu32:22017)MCE: 220: Status bits: “Memory Controller Error.”

      If vCenter could see those and move stuff off proactively, now that is cool!!

      • says

        @Jeff
        This is already available on high-end Intel processors (E7 series). It is called Machine Check Architecture or MCA.

        With the MCA Recovery feature, when an uncorrectable data error is detected, the system can isolate the error to only the affected VM. The hardware notifies the VMM (supported in VMware vSphere 5.x), which then attempts to retire the failing memory page(s) and notify affected VMs and components.

        Here is a video that explains it all -> https://www.youtube.com/watch?v=HY1jDOd59P8

    6. says

      First off I think this would be a great addition!

      What I have seen with other technologies, and have enjoyed, is a points-based system for deciding when to declare something failed.

      Obviously there could be defaults out of the box. Let’s say I need a points value of 255 to declare a device “failed”. So one could say that if X (255) fails the host is considered failed, or not until both Y (150) and Z (150) have failed, yet Y (150) triggering on its own would be enough to consider the host degraded.

      So: an easily usable default, but with advanced settings to get as granular as required. This can cause some complexity, but if it lives only in the advanced settings it is nice to have the option.

      Let me know if this makes sense,
      Sean
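
The points idea in the comment above could be sketched roughly as follows. This is only one reading of it: the 255 threshold and the X/Y/Z point values come from the comment, while the separate "degraded" threshold of 150 and all names are assumptions made for this example.

```python
# Hypothetical point values and thresholds, loosely following the comment above.
FAILED_THRESHOLD = 255    # points needed to declare "failed" (from the comment)
DEGRADED_THRESHOLD = 150  # assumed: the comment gives no number for "degraded"

POINTS = {"X": 255, "Y": 150, "Z": 150}


def classify(triggered):
    """Sum the points of the triggered conditions and map the total to a state."""
    total = sum(POINTS[name] for name in triggered)
    if total >= FAILED_THRESHOLD:
        return "failed"
    if total >= DEGRADED_THRESHOLD:
        return "degraded"
    return "healthy"


for case in ([], ["Y"], ["Y", "Z"], ["X"]):
    print(case, "->", classify(case))
```

With these numbers X alone, or Y and Z together, reach "failed", while Y on its own only reaches "degraded".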

    7. Angelo says

      When such partial host failures occur today, how do you address these conditions?… Based on tuning metrics. When do you bring the host back online?… Based on condition and failure type. Failover vs Failback. Most want failback manually after they have ensured the error has been cleared.
      What level of integration do you expect with management tools? …Full exposure and integration. Being able to get at it with PowerCLI and use tools like Nagios would be great! Should HA treat all health conditions the same? …NO! I.e., always evacuate all VMs from an “unhealthy” host?…. Depends. If I have a CMOS battery warning I don’t want to evacuate a host :)
      How would you like HA to compare two conditions? E.g., H1 fan failure, H2 network path failure? I think the user should be able to set that. Based on different environments what we may see as nuts to run on others may be “ok” as long as the host is still up or just can’t afford to do better, so setting anything predefined is putting you in a corner. However maybe a VMware suggested config file could be loaded with KB articles as to what a host “should” look like based on certain factors.

      Hope that helps.

      • Angelo says

        Actually after giving this some more thought…why not just expand the actions of the alarms to allow them to handle some of these conditions vs messing with HA, something that works fairly well already…..because wouldn’t we have to worry about DRS also and maybe SDRS depending on the failure type and the redistribution of resources based on the possible failure condition?

    8. says

      1) Depends on the issue and physical hardware. A PSU failure for instance would simply trigger the operations team to swap out the failed PSU (assuming hot-swappable and redundant PSUs). A memory fault on the other hand would result in a host evacuation and maintenance mode.

      2) I would expect some sort of integration with systems like vCOPS. In fact perhaps vCOPS (and perhaps Hyperic?) data could potentially be used (if available) in the HA fault-avoidance decision process.

      3) No. I would want this to be user-customizable. It depends on the type of fault and physical hardware (as described above).

      4) In most cases I would expect that multiple fault conditions of any type would be indicative of a major configuration or hardware issue (especially in converged/hyper-converged solutions). Again, it should be user-customizable, but I would expect in most cases that should trigger a host evacuation/maintenance mode. Individually however, a fan failure is likely less serious than a network redundancy fault. The administrator should be able to customize the HA response in these instances.

    9. says

      It would be great if HA could detect hosts in trouble (e.g. link down or memory error). But there should be an option to manually override. Just like DRS (automatic, semi-automatic, manual). I don’t want HA to move away my VMs when there is a planned reboot of an FC switch. That’s why we have redundant links.
      Another thing would be a big HA improvement:
      in case of an APD to a LUN, an HA cluster does very nasty things (from my experience). Although VMs on live LUNs keep on running, hosts become unmanageable and you can’t interact with the remaining VMs (shutdown, migrate, etc.).
      * in case of an APD, kill all VMs on that LUN and do not try to restart or migrate them elsewhere – they are dead (at least for some time). Paint them grey in the client and leave the rest alone.

    10. Chris N says

      1. When such partial host failures occur today, how do you address these conditions? When do you bring the host back online?

      Host issue raises vCenter alarm. Manual intervention. Host into MM and then resolve issue. Host only hosts VMs after issue is resolved. Note: for high-capacity clusters DRS would be set to manual and the host would remain in the cluster but not host any VMs. The difference is that the host would be available in case of an HA event.

      2. What level of integration do you expect with management tools? In other words, should we expose an API that your management solution can consume, or do you prefer this to be a stand-alone solution using a CIM provider for instance?

      Third-party management solutions, such as vCOPs, should be aware of the event occurring, but having an alarm raised in vCenter (one for the root cause of the event and another for the event occurring) and reflected in management tools is about all that’s required.

      3. Should HA treat all health conditions the same? I.e., always evacuate all VMs from an “unhealthy” host?
      Nope. On some occasions I would want the host to be put into MM until resolved. Other times I would prefer the host to simply have no VMs. Rules stating things such as:
      IF cluster can tolerate 2 failed hosts THEN put host in MM OTHERWISE empty host and only use as last resort. IF memory/CPU utilisation greater than alert threshold (90%), THEN use host for VMs regardless.

      4. How would you like HA to compare two conditions? E.g., H1 fan failure, H2 network path failure?
      In the alarm for an event have an editable ‘impact score’. Then when an event occurs, a decision is made based on the score to complete the required actions. This way if a fan failure occurs the impact score may go to 20, which results in no action. But when a network path then fails the score goes to 40, which is above the set threshold of 30, and the predictive HA kicks in. By themselves each event wouldn’t cause predictive HA to kick in, but together they do.

      Hope that makes sense. Just my 2 cents and I’m sure others will find issues.
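
A minimal sketch of the ‘impact score’ idea from the comment above, assuming each alarm simply adds its score to a running total for the host. The threshold of 30 and the 20 then 40 progression come from the comment; the alarm names, the per-alarm value of 20 for the network path, and the function names are hypothetical.

```python
# Hypothetical per-alarm impact scores; 20 each reproduces the cumulative
# 20 -> 40 progression described in the comment above.
IMPACT_SCORES = {"fan_failure": 20, "network_path_failure": 20}
THRESHOLD = 30  # threshold from the comment


def total_impact(active_alarms):
    """Add up the impact scores of all alarms currently active on a host."""
    return sum(IMPACT_SCORES[alarm] for alarm in active_alarms)


def should_act(active_alarms, threshold=THRESHOLD):
    """Act only when the combined score is above the threshold."""
    return total_impact(active_alarms) > threshold


print(should_act(["fan_failure"]))
print(should_act(["fan_failure", "network_path_failure"]))
```

Making the scores editable per alarm, as the comment suggests, would just mean letting the administrator change the values in `IMPACT_SCORES`.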

      • Chris N says

        I quite like the idea of each alarm being given a simple ‘impact score’. Then predictive HA just needs to have configurable rules in the format IF XXX THEN XX EXCEPT WHEN XXX.

        If in 2 years I see an ‘impact score’ in vCenter alarms… that would be sweet to know I added to the thought process on the topic.
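
The IF / THEN / EXCEPT WHEN rule format could be sketched like this, using the example rule from the comment above (tolerate 2 failed hosts, 90% alert threshold). The function name, the action labels and the reading of "can tolerate 2 failed hosts" as "2 or more" are assumptions for illustration only.

```python
def host_action(tolerable_host_failures, cluster_utilisation, alert_threshold=0.90):
    """Return a hypothetical action label for a host with a health alarm."""
    # EXCEPT WHEN: under resource pressure, keep using the host regardless.
    if cluster_utilisation > alert_threshold:
        return "use_host"
    # IF the cluster has enough spare capacity THEN use maintenance mode.
    if tolerable_host_failures >= 2:
        return "maintenance_mode"
    # OTHERWISE empty the host, but keep it available as a last resort.
    return "evacuate_keep_as_last_resort"


print(host_action(tolerable_host_failures=2, cluster_utilisation=0.60))
print(host_action(tolerable_host_failures=1, cluster_utilisation=0.60))
print(host_action(tolerable_host_failures=2, cluster_utilisation=0.95))
```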

    11. Anders O says

      During my years working with VI/vSphere, I think probably 95 % of VM uptime trouble has been due to ESXi software problems, and only around 5 % due to hardware problems. It’s _much_ more common that an ESX(i) host semi-hangs due to weird memory-leak-like symptoms, APDs or such issues. That makes the vC lose the host, which makes vMotioning VMs impossible, so that’s where I think the HA development effort should be put.

      You spoke on the panel in Barcelona about future HA perhaps being able to detect whether a _VM_ still has network and storage contact, and basing HA shutdown/failover actions on that rather than only _host_ network heartbeat being able to trigger isolation/failover. That sounds like a really good future feature. :)

    12. fvanrooyen says

      1. This will be a ticket to DCOps and once the host is marked as healthy in CMDB it will be powered up and added back into the cluster.

      2. API access will be great; being able to drive configuration of this from a location where SLAs and OLAs are defined (SKMS) will enable the appropriate response for the situation.

      3. Personally, and for the environment I work in, it would be much better to have a DRS idea here, “manual, partial and fully automatic”, and then have a conservative to aggressive threshold. This can be tuned by cluster; this will help with making sure certain SLAs are met. For example a service that requires high uptime might be set to fully automatic with aggressive; in contrast, a service that can stand to lose nodes (Hadoop) but needs all the processing power it can use right now might be tuned to something much lower.

      4. It would be great if we could specify this and put weights on different conditions. A fan failure would be weighted much lower than a network failure, where both performance and reliability are impacted.

    13. Andy N says

      It would be great if this could take action after a certain time. For example, a single PSU/fan/HBA fails and if it is not rectified after 24 hours then guests migrate to a healthy node.
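
The time-based trigger from the comment above might look something like this. It is a sketch only: the 24 hours come from the comment, while the function name, the `resolved` flag and the example timestamps are made up.

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(hours=24)  # the 24 hours from the comment above


def should_migrate(failed_at, now, resolved=False, grace=GRACE_PERIOD):
    """True once a failure has stayed unresolved for the grace period."""
    # "after 24 hours" is read here as "24 hours or more" (an assumption).
    return (not resolved) and (now - failed_at) >= grace


failed_at = datetime(2013, 9, 17, 7, 0)  # arbitrary example timestamp
print(should_migrate(failed_at, failed_at + timedelta(hours=6)))
print(should_migrate(failed_at, failed_at + timedelta(hours=25)))
```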

    14. Morten Werner Forsbring says

      1. Today we make an educated guess of the consequences and plan the manual action according to the criticality. It depends on the “failure”, but the host might run as HA failover capacity in the cluster if the “failure” is not that severe.

      2. As the result of these proactive actions can be disruptive in terms of the capacity in the cluster, I would prefer this to be a stand-alone solution with few dependencies on other management solutions.

      3. I would like to be able to tune the “action” on the different kinds of health conditions. It would also be very nice if one option could be to evacuate the VMs from that host, but let the host stay in the cluster as HA failover capacity.

      4. I guess this depends on the customer’s situation. In some cases a fan failure can be more critical than a network path failure and the other way around. It would be nice if the customer was able to decide the priority of the failures in a list if the decent default doesn’t fit.

      It would be easier to enable these proactive HA features if one could also enable them in a semi-automatic mode which only delivers the suggested actions (in a pre-production phase).
