vSphere HA Futures: Restart Order

At VMworld I hosted a group discussion together with Keith Farkas (HA Lead Engineer) on the topic of HA Futures. Based on that session, Keith and I decided to gather more feedback from the field, and this post will hopefully help us with that. Please do not hesitate to comment. I will have a couple of articles following this one, but let's get started with HA futures for the Restart Order first.

A topic that has come up at various sessions is HA restart ordering / priorities. Today HA provides four levels of restart priority: High, Medium, Low, Disabled. The thing to note with the current restart priority, though, is that there is no guarantee VMs are actually restarted in that order when the VMs are started on more than one host. Even when HA would restart them in the right order, there is also no guarantee around when the boot cycle completes. Typically a large virtual machine running, for instance, a database will take longer to boot than a server just running DNS. So what do we propose? We propose restart orders instead of restart priority. What does this mean, and what would we like to know from you?

There are two complementary ways of implementing this, and we would like your feedback, including which one you think would be most useful.

  1. Global Restart Order aka Bucketing
  2. VM to VM dependency Chains

Let's explain these two options and then I'll let you guys chime in.

Global Restart Order aka Bucketing is basically what you have today with “restart priorities”, only it will actually enforce the restart order and it will allow for more flexibility. So with this option you could for instance create 5 buckets, and then add virtual machines to these buckets appropriately. These buckets could be: Priority 1, Priority 2 and so on. When a failure has occurred vSphere HA would then restart all VMs in the bucket “Priority 1” first, and when that bucket has finished starting (e.g., wait for VMware Tools Heartbeat to report “alive” for each VM) vSphere HA would continue with the next bucket and so on. Waiting for VMtools to report “alive” is one way to determine that a VM is “ready”. We are thinking of providing three other “wait” options: wait for an application heartbeat, wait a certain amount of time after the VM powers on, or today’s behavior, wait for the power-on task to complete.
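
To make the bucketing idea a bit more concrete, here is a minimal Python sketch. It is only an illustration of the behavior described above, not how vSphere HA is or will be implemented. The names `restart_by_bucket`, `power_on`, `is_ready` and `max_polls` are hypothetical, and the sketch assumes one particular answer to the timeout question below: after a limited number of polls it simply moves on to the next bucket.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# A wait condition takes a VM name and returns True once the VM counts as "ready".
# It could stand for a VMware Tools heartbeat, an application heartbeat,
# a fixed delay, or just "power-on task completed" (all hypothetical here).
WaitCondition = Callable[[str], bool]


def restart_by_bucket(
    bucket_of: Dict[str, int],
    power_on: Callable[[str], None],
    is_ready: WaitCondition,
    max_polls: int = 10,
) -> List[Tuple[int, List[str]]]:
    """Power on VMs bucket by bucket.

    Returns one (bucket, not_ready_vms) entry per bucket, lowest bucket first.
    """
    buckets: Dict[int, List[str]] = defaultdict(list)
    for vm, bucket in bucket_of.items():
        buckets[bucket].append(vm)

    report: List[Tuple[int, List[str]]] = []
    for bucket in sorted(buckets):
        members = buckets[bucket]
        for vm in members:
            power_on(vm)  # every VM in the bucket is started before we wait
        pending = set(members)
        for _ in range(max_polls):  # assumed policy: give up after max_polls
            pending = {vm for vm in pending if not is_ready(vm)}
            if not pending:
                break
        report.append((bucket, sorted(pending)))
    return report
```

Swapping the `is_ready` callable is what would correspond to choosing a different wait condition; the ordering logic itself stays the same.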

I guess a couple of questions we have:

  1. How many levels would you like to see?
  2. Which of the wait conditions (e.g., wait on VMtools) are most useful for you?
  3. Suppose HA could not power on a “Priority 1” VM. Do you want HA to stop powering on the “Priority 2” etc. VMs until it can, move to the “Priority 2” group after a timeout, or something else?

The second option is VM to VM dependency Chains. These can be seen as an explicit restart order for a specific group of VMs which typically would form a service. I guess not unlike the vApp construct today, but then without all the caveats and restrictions around this. (vApps are essentially resource pools, and we don’t want resource management in this case… just restart ordering.) In the simplest form, you could imagine specifying ordered lists of VMs, each list specifying the restart order for that set; the VMs in a list would be powered on sequentially. For example, something like the following:

Database VM –> Application Server –> Web Server
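
Expressed as a rough Python sketch, an ordered list like this could be processed as follows. This is purely illustrative: `restart_chain`, `start` and `continue_on_failure` are made-up names, and `start` stands in for "power on the VM and wait until it is ready". Whether dependents should still be started when an earlier VM fails is exactly one of the open questions, so the sketch exposes it as a flag rather than picking an answer.

```python
from typing import Callable, Dict, Sequence


def restart_chain(
    chain: Sequence[str],
    start: Callable[[str], bool],
    continue_on_failure: bool = False,
) -> Dict[str, str]:
    """Start the VMs of one chain sequentially, in list order.

    `start(vm)` is a hypothetical callable that returns True when the VM
    was powered on and became ready. Returns a status per VM:
    "started", "failed" or "skipped".
    """
    status: Dict[str, str] = {}
    blocked = False
    for vm in chain:
        if blocked:
            status[vm] = "skipped"  # an earlier VM in the chain failed
            continue
        if start(vm):
            status[vm] = "started"
        else:
            status[vm] = "failed"
            blocked = not continue_on_failure
    return status


# Example: the application server cannot be started.
chain = ["Database VM", "Application Server", "Web Server"]
result = restart_chain(chain, start=lambda vm: vm != "Application Server")
# result == {"Database VM": "started",
#            "Application Server": "failed",
#            "Web Server": "skipped"}
```

Separate chains do not depend on each other, so in principle several of them could be processed in parallel.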

As you can see, that would offer a significant amount of granularity, but also potentially a lot of operational complexity. I guess the question is how far you would like to go. Questions we have for you:

  1. Is an ordered list sufficient to express dependencies in a chain of VMs or do you need more sophistication?
  2. If a VM with a dependent fails, do you expect HA to restart that child VM even though the previous VM in the chain has failed?
  3. What if HA is unable to restart a VM with dependents? Should HA restart these dependent VMs after a delay, or only after the first VM is restarted?

A final question. We think bucketing will be easier to manage operationally, but it introduces artificial dependencies between VMs and will make it take much longer to restart all VMs after a failure. How significant are these limitations?
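
The trade-off in this last question can be illustrated with a small back-of-the-envelope calculation in Python. Everything here is an assumption for the sake of the example: the VM names, the boot times in seconds, unlimited capacity to start VMs in parallel, and the helper names `bucket_finish_times` and `chain_finish_times`. It is not a model of the real HA restart engine.

```python
from typing import Dict, List


def bucket_finish_times(buckets: List[List[str]], boot_s: Dict[str, float]) -> Dict[str, float]:
    """Time at which each VM is ready when buckets run strictly one after another."""
    finish: Dict[str, float] = {}
    clock = 0.0
    for bucket in buckets:  # each bucket is assumed to be non-empty
        for vm in bucket:
            finish[vm] = clock + boot_s[vm]
        clock += max(boot_s[vm] for vm in bucket)  # next bucket waits for the slowest VM
    return finish


def chain_finish_times(chains: List[List[str]], boot_s: Dict[str, float]) -> Dict[str, float]:
    """Time at which each VM is ready when independent chains run in parallel."""
    finish: Dict[str, float] = {}
    for chain in chains:
        clock = 0.0
        for vm in chain:  # within a chain, VMs start sequentially
            clock += boot_s[vm]
            finish[vm] = clock
    return finish


# Two hypothetical services, A and B, with made-up boot times in seconds.
boot_s = {"db_a": 300, "app_a": 60, "web_a": 30,
          "db_b": 60, "app_b": 120, "web_b": 30}

by_bucket = bucket_finish_times(
    [["db_a", "db_b"], ["app_a", "app_b"], ["web_a", "web_b"]], boot_s)
by_chain = chain_finish_times(
    [["db_a", "app_a", "web_a"], ["db_b", "app_b", "web_b"]], boot_s)

print(by_bucket["web_b"], by_chain["web_b"])  # 450.0 210.0
print(by_bucket["web_a"], by_chain["web_a"])  # 450.0 390.0
```

In this made-up example service B is fully up after 210 seconds with chains, but only after 450 seconds with buckets, because its application and web servers wait for the slow database of service A, which they do not actually depend on.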

That is it for now… Please chime in, as your response will help us define the future of vSphere HA.

    Comments

    1. says

      I’d lean more towards the dependency chain. Slightly more complex, but I like the granularity and would be willing to sacrifice some simplicity for that. ‘Wait on VMware Tools’ seems best for me in either case.

      Is an ordered list sufficient to express dependencies in a chain of VMs or do you need more sophistication? < That would be sufficient

      A VM with a dependent fails, do you expect HA to restart that child VM even though the previous has failed? < Perhaps an option when it's configured to continue on failure or not.

      What if HA is unable to restart a VM with dependents? Should HA restart these dependent VMs after a delay, or only after the first VM is restarted? < Again, the ability to choose would be awesome.

    2. says

      Bucketing would be a big leap forward and sufficient for many use cases. If we could fine-tune buckets with VM chains that’ll be great too.
      There’s one more thing I’d like to add: a cluster wide startup/shutdown policy. So far you can only define it on host level, but once a VM migrates the startup order is broken. This would be very helpful for small clusters, that rely on emergency shutdown scripts triggered by UPS.

      • RJones says

        I would LOVE to second this. I have a growing handful of small clusters that I am “responsible” for and this feature would have saved me some considerable stress.
        –QUOTE–
        “There’s one more thing I’d like to add: a cluster wide startup/shutdown policy. So far you can only define it on host level, but once a VM migrates the startup order is broken. This would be very helpful for small clusters, that rely on emergency shutdown scripts triggered by UPS.”

        Randy.

        • Keith Farkas says

          @Michael and @Rjones, regarding a cluster wide startup/shutdown policy, please help me understand the scenario better. Is your goal to extend the amount of time you can run on UPS, or to have an ordered shutdown when the UPS is close to running out of power? Or both options?

          If the former, how would you like to select the VMs that should be powered off? For example, do you want to select a host and have HA power off all the VMs on it? Or, do you want to select VMs by, say, restart level — power off all P2 VMs?

          Regarding start up, what event should trigger HA to begin restarting VMs again? For example, suppose we offered a HA agent API that could be invoked when the restart process should begin? You could then use a script or the FDM MOB (Managed Object Browser) to initiate the startup. Other ideas?

          • says

            Hello Keith

            The idea is to have an ordered shutdown procedure in case of prolonged power loss. For example it’s important to keep Domain Controllers and vCenter alive as long as possible and to go down last. Once you do a cold boot of your cluster, it’s important to have DCs start first, followed by vCenter and then all remaining VMs.
            My goal is not to extend the runtime on battery. Once the decision for shutdown is triggered, it’s just a matter of how the cluster is going down.
            I can imagine something like restart groups. Priority 1,2,3,4. The VMs that are in group 1 will go down last and start up first.
            Regarding your question about the startup trigger: As far as I know the HA election is independent from vCenter. Wouldn’t it be a good point in time to begin with the startup procedure once the master election between hosts has finished?

            • Keith Farkas says

              Thanks for clarifying.

              Regarding restart groups (P1, P2, etc), we have been discussing them in several scenarios– restarting VMs after a host failure/isolation, shutting down VMs after a power loss, and starting up VMs after the power is resumed. Should the same grouping be used in all scenarios? Or, would you want to specify one grouping for host failure, and another for power loss?

              Regarding the question of a trigger, I’m concerned about triggering a restart when a new master election occurs because it is possible for a master election to occur as the hosts are being powered off during the shutdown. We could allow you to specify a startup delay, but this has its own problems (e.g., we can’t assume hosts are all time sync’d or have the right time). That’s why I mentioned using an external trigger. What do you think?

          • says

            Hi Keith
            (sorry, can’t reply to your last posting)
            “I’m concerned about triggering a restart when a new master election occurs because it is possible for a master election to occur as the hosts are being powered off during the shutdown.”
            That’s just an idea. I don’t mind when the startup procedure is being triggered. How does it happen today with the host based startup lists? I can see the problem with an ideal point in time and not triggering any false restarts. Additionally we need to make sure that all hosts are booted and online. At least enough hosts as defined in the HA admission control policy.

            Grouping: From my point of view (and in the name of many customers) one group for HA and startup/shutdown should be enough. Keep it simple! :-)

    3. David Cain says

      I would also lean towards dependencies, simply for the granularity. If it’s more complex to manage, but provides greater control, then that’s an acceptable trade-off. Buckets might be too broad to allow for specific scenarios where one VM can affect multiple dependent VMs.

      I think an ordered list will probably be sufficient. I use dependencies in SRM to add extra control to VM recovery startup, and it seems to work well for us.

      I agree with James on the third and fourth points. If a VM failed it would be good to be able to choose whether dependents should restart or not; we probably have situations where each option is applicable. Similarly, if a VM with dependents could not be started, a configurable restart/don’t restart for dependents would be the most flexible solution.

      • Keith Farkas says

        Regarding whether to restart dependent VMs, you mention if a VM with dependents could not be started, you’d like to have the option as to whether HA should restart the dependents in any case.

        I’m wondering about the sequencing. Suppose VM APP depends on a VM DB, and VM DB fails. If there is enough capacity available to restart VM DB, should HA

        1) power down VM APP, then restart VM DB, then power up VM APP, OR
        2) restart VM DB, then power down VM APP, then power back up VM APP, OR
        3) restart VM DB, then reboot the OS of VM APP by resetting the VM?

        Reset has an advantage — since the VM is never powered off, there is no risk that there will be insufficient resources available when we go to power it back on.

        If reset is okay, now suppose HA cannot power VM DB back on? What options should we offer? Power off VM APP? Reset VM APP? Leave it alone?

    4. says

      Why not have a mix of both options? Have the global restart order and within which one could define VM dependency chain. Or better yet have cross global bucket VM dependency.

      By that I mean, we have the first global bucket with a bunch of infra type machines. Once that comes up, bucket two gets engaged, which has a lot of DB servers. The third bucket has a bunch of app servers, but because they have VM dependencies in bucket 2 they will first confirm that the corresponding DB server is up before they are started. This way if a server in bucket 2 doesn’t start, it will not translate into not starting anything in bucket 3.

      Basically the impact of a few servers misbehaving wouldn’t impact the whole environment. Unless of course something goes wrong in bucket 1 that holds the infra servers. But one would hope there would be good redundancy built-in there… thoughts?

      • Keith Farkas says

        We were thinking of keeping chains local to a single restart bucket to make the system easier to understand. E.g., if chains could cross buckets, it would be easy to create chains in which the priority group assignment of VMs in the chain does not increase as one moves from the root to the head of the chain. Also, we thought users would likely assign all VMs of a given application to the same restart bucket. But, you raise a use case where cross bucket chains would help. Would you/others be OK with chains not crossing buckets, or is it important that we allow cross bucket chains?

    5. says

      I think the dependency chain assumes that all tiers are on the same host to start with (before the HA event). Would that be a requirement? Otherwise I don’t know how the dependency chain would work if say the middle tier in a 3 tier was on a host that PSOD’ed.

      So I would opt for the global bucket with some sort of heartbeating (vmware-tools responding). — This still doesn’t help my scenario of a tiered service spanning physical hosts, but it seems to be the more flexible option of the two (as described).

    6. Russell says

      Had an interesting request come in yesterday from a customer. They wanted to trigger the suspend of a group of vms in the event of a host failure. I.e. I lost a host and I want to suspend/power off staging and QA vms. For this customer it’s less about resource contention but retaining support for a specific application.

      How I’d want to see this is basically a triggered action on HA events. Tried doing this a couple years ago with alarms but it wasn’t consistent enough to trust. Maybe I should retest on 5.X and see if it’s better.

      On dependency mapping I like it but I don’t think a vm should ever be held up from starting. It should time out at least and come up in the “last bucket” so that admins can log in and troubleshoot the upstream service and then quickly restart downstream services without having to wait for a boot up.

      The last item on my wish list is the ability to use regular expressions and/or take my vcenter folder structure into account in both HA and DRS cluster settings.

      I.e. VMs that are called appserver*.foo.local should have a DRS rule to keep them separate, and the pattern could also be used as a match string for part of a dependency chain. That, or recreate the hierarchy with folders and tie those folders to policies to give the same behavior.

      • says

        What do you mean with “For this customer it’s less about resource contention but retaining support for a specific application.” –> Because if there is no resource contention why would you bother powering off other VMs?

        • Russell says

          Some vendors (Cisco, I’m looking squarely at your UCM division) have absurd requirements for supporting their applications in a virtual environment. Personally I’d be happy if I could just hook a vCenter Orchestrator workflow or PowerCLI script to a failover event.

          If not for that bizarro requirement I’d say “just toss those lesser VMs in a resource pool and call it a day.”

    7. Russell says

      To Andrea;

      If I lose a server with just the app server on it then I would imagine the upstream dependency would be resolved so it could just restart the VM elsewhere. From there it’s a matter of dealing with downstream stuff. This is where a triggered action would come in handy. I.e. execute a PowerShell script to restart any services on systems that may have failed/lost connection due to a lost and now restored dependency.

    8. Charles says

      We have used a “Bucketing” type methodology for planned datacenter shutdowns and it works quite well. The way we do it is by adding a note to all VMs stating what their shutdown priority is. We also have folders named according to those shutdown priorities: “shutdownpriority1” “shutdownpriority2”…“shutdownpriority9”. It is a pretty simple PowerCLI script that we run that moves VMs to their proper shutdown priority folder, if they are not already in the correct folder, then shuts down all VMs in each folder in order 1-9. When it is time for start up it is a manual process to start up virtualized DCs, DHCP and DNS servers as well as vCenter, but then it is back to the script to start the VM buckets in opposite order from shutdown, 9-1, with built-in pauses long enough to ensure all services have started on all VMs before moving on to the next bucket. I probably should have put in a check to ensure Tools was running before moving on, but what I did works for its purpose.

      My point though is that it is essentially the bucket method as described above and it works for planned shutdowns so I don’t know why it could not be implemented into HA and work just as well.

    9. Andrew Fidel says

      My vote is for dependency chains, though they probably need a concept of groups where if any of these servers are up you can go to the next dependency. I’m specifically thinking of DNS servers, domain controllers, and clustered database/middleware servers here where having at least one of those group members up is sufficient to start the next level.

    10. Blomart says

      Automating DR via scripts, we arrived at the same questions…
      How do we deal with business priorities?
      How do we deal with technical dependencies?

      We finally implemented custom vmx params to represent both.
      A global process parses all vmx files to get the global view: machines are classed by business priority and then by technical priority (aka dependencies).
      We then build batches of machines to start. An extra step is necessary to remove empty batches. Today we wait a bit after machines restart but have thought about waiting for VMware Tools.

      Application owners still don’t consider it a safe way… many prefer having a “hard” way to check dependencies (a kind of app heartbeat), and prefer organizing VM restart with an external scheduling tool.

    11. Marko says

      My vote is nearly the same as Andrew Fidel already wrote. Dependency chains combined with a group system which allows you to proceed if one/several/all VMs are restarted.
      What I miss:
      - a way to use this system to migrate VMs from one physical datacenter (not VMware DC!) to another physical DC (e.g. vMSC which should migrate due to a power outage)
      - integration between several vCenter systems (e.g. vCenter A contains VMs for DNS, Mail, etc. vCenter B contains Production systems, vCenter C Logistic systems -> restart order goes across the vCenter systems)

    12. Harsha Hosur says

      I would also agree with the dependency chains as an “upgrade” to the next level of HA. What I would also like to see is a restart group priority similar to what we currently have in SRM. Interlinked with dependency chains, restart group priority would make it easier. If certain VMs have an HA event because of a host failure, then the dependency chain would “shut down” all the others and bring them up in the restart group priority order, even working across multiple hosts (leveraging vCenter).

    13. says

      Nice idea but to be honest I don’t understand how cluster wide HA restart will work and help with deterministic start of VMs. Let’s assume I have 8 nodes in a cluster and one node fails. That’s a typical situation, right? Now I have a 3 tier application where the DB server, APP server (dependent on DB) and WEB server (dependent on APP) are on different ESX hosts protected by the HA cluster. If the ESX host running the APP server fails, the dependency startup is IMHO useless because the only failed component of the application group is the APP server. That’s the difference between HA and SRM. In SRM we start everything from scratch. I can imagine that bucketing and VM Tools dependency can give us more deterministic starts of failed components, but it will always be non-deterministic from the cluster point of view. Or am I missing something?

      • says

        As mentioned in one of the questions at the bottom:
        A VM with a dependent fails, do you expect HA to restart that child VM even though the previous has failed?

        In your scenario that would be the App VM which has the Web Server as a “child”. Would you expect HA to also restart the child?

        Another thing to keep in mind is that this functionality will most likely be used in stretched cluster deployments, as there people typically design for full site failures etc.

    14. says

      I don’t know if the child should be restarted. It really depends on application/business logic, but I guess usually no. This would be more application aware clustering and I was always thinking about VMware HA as a hardware cluster. That’s why Global Restart Order (bucketing) sounds more logical to me for local failovers. I understand that innovations must be considered, but I think it should follow the initial purpose, otherwise it can become a big mess.

      If stretched cluster is the main use case for dependency chains then it makes sense, because there is a high probability the app group will be started from scratch. However, the child doesn’t need to be restarted in this scenario.

      So if I understand correctly, then the described “bucketing” would be just a slight improvement of the current HA restart algorithm, which allows user defined priorities and VM hardware/OS aware conditions. That’s probably a good idea and it can help to improve current HA behavior to match customers’ requirements for local failover.

      Dependency chains probably make sense for stretched clusters, and they can help geo-cluster failover to start a whole application group on another site in a particular order and with dependency checks.

      So I agree with others who vote for both methods and here are my answers to your questions

      Proposed Global Restart Order
      Q1/ How many levels would you like to see?
      A1/ I’ve been always happy with current HA levels (3 + disabled) but up to 10 levels must be enough for any design

      Q2/ Which of the wait conditions (e.g., wait on VMtools) are most useful for you?
      A2/ SELF VM CONDITIONS: wait on PowerOn Status, wait on VMtools, wait X seconds, DEPENDENCY CONDITIONS: wait for another VM Power On Status, wait for another VM VMtools

      Q3/ Suppose HA could not power on a “Priority 1” VM. Do you want HA to stop powering on the “Priority 2” etc VMs until it can, move to the “Priority 2” group after a timeout, or something else?
      A3/ Move to the Priority 2 group after a timeout but trigger the alert to notify administrator about the issue.

      Proposed VM to VM dependency chains
      Q1/ Is an ordered list sufficient to express dependencies in a chain of VMs or do you need more sophistication?
      A1/ An ordered list can be OK but I’ll propose another solution later

      Q2/ A VM with a dependent fails, do you expect HA to restart that child VM even though the previous has failed?
      A2/ It depends on business logic and architect/admin must be able to choose

      Q3/ What if HA could not be able to restart a VM with dependents — should HA restart these dependent VMs after a delay or only after the first VM is restarted?
      A3/ It also depends on business logic and architect/admin must be able to choose

      So the proposed VM to VM dependency is very complex and significantly increases the complexity of architecture design and operations.

      What about merging these methods into a single, more universal method? I already did it above in my comment by adding special wait conditions (DEPENDENCY CONDITIONS) to the Global Restart Order method. Let’s call it “Global Restart Order with inter VM dependency wait conditions”. If I want to make a particular VM dependent on another VM, it can simply be done as a special VM wait condition waiting for another VM’s “power On” or “VMtools status”.

      Does it make sense, or did I forget or oversimplify something?

    15. Warren Legg says

      I would definitely support the dependency chain option, to provide a cluster-wide restart ordering priority. Whilst it is not attractive to use the actual VApp object, because it inherits from a resource pool, I think it would make a lot of sense to use the same properties (e.g. timeout / wait for VMware tools) and process when performing a VApp (or chained VApp) start. This would support both single VMs and VApps, by taking the work already done in this area for VApp start ordering and applying it to HA.

      As an additional benefit the properties would also then be carried with the VApp, because they are persisted to the .OVF as part of the standard.

      The startup / shutdown properties could also be imported into the HA priority structure from the VApp. The solution provides the flexibility of creating a hierarchical n-level construction.

      Regarding the restart of dependent objects in the case of failure of a VM, I feel this should be made an optional flag, so that both behaviours could be supported and the user defines which case is preferable for the context.

    16. Warren Legg says

      P.S. From a user’s perspective, I think a GUI treeview solution (something like the VMs and Templates view) would be the best solution. It would show the inventory objects (VMs and top-level VApps), and then allow these to be moved into a dependency hierarchy. The VApps would already contain their start ordering defined internally, but it would also be possible to define the VM-VM, and VM-VApp relationships.

    17. Blomart says

      With a large environment a system tree could be difficult to manage.
      You need to be able to deal with parallelism and with multiple application teams being able to define restart order.
      One key functionality (at least for our environment) is business priorities.
      To keep things simple:
      -the technical team could define services and their dependencies
      * internal: vm to vm
      * external: service to service (dns, dhcp first for example)
      -business could “tag” each service with a priority.
      => a service will receive its own priority or the highest priority of its dependent services.

      The interfaces to set each should be split:
      -technical
      -business: set the tag and see effective priorities (technical service dependencies)
      ==> give control back to the business (if they are schizophrenic it’s their fault)

      As explained, with this hierarchy, constructing batches of VMs to stop/start allows us to handle parallelism.

    18. Jesús Rodríguez Núñez says

      First of all, apologies for my English. Answering the question:
      I think the better solution of the 2 raised is the use of chains (it directly represents the relationships that you want to manage). Although I think the ideal design would be a tree where a machine can be a requirement of several others and in turn have several requirements of its own; each connection would then establish certain properties:
      - What is the condition in the upper machine: power on, VMware Tools up, a user script that should return true, etc.
      - Time to wait before rebooting the bottom machine: time in seconds (if you put 0 it will be instant) the bottom machine can wait for the condition when the machine goes up / reset.
      Example: A –(Required: active MySQL service, time 300 secs)–> B. If B is prepared to handle the situation of losing its DB, you can give it a margin for the restoration; however, if the application does not handle exceptions well or is not prepared to resume operations when the DB fails, you can put a 0 and force an instant shutdown.

      This design could also be used for simple scenarios in string form without any difficulty and provides total flexibility for advanced scenarios (with better uptimes).

      The same idea could have another representation for the user or internally by adding to each machine some prerequisites (other machines conditions, type of relation: strict/not, etc)

    19. Jesús Rodríguez Núñez says

      To help explain my approach I will discuss another question that is just as important:
      Each machine that does not belong to any chain, and each chain, would have a property called “Priority” (a system in the style of shares, with ‘level’ values, would be ideal). At the failure of a host the system would turn off the low levels and ensure the high ones. The same calculations would apply in Admission Control, and Host Failure would offer the option to guarantee a certain priority level (very useful when you have multiple environments: pre, pro, dev).

      With this system you have the best granularity and easy construction of power on/off actions through cli/gui (with upper/lower priority level limits)
