HA Futures: Orchestrated Restart and Restart Priority – Part 1 of 4 – (Please comment!)

Duncan Epping · Oct 15, 2018

Last week I visited Palo Alto. I had a long conversation with my friends from the vSphere HA team. I have done a series of articles in the past asking for feedback/comments on future features/functions of HA that the team is looking to implement. One of the first features I would like to discuss is Orchestrated Restart and Restart Priority. Funny enough, this feature is one I discussed in the previous series as well.

For those who don’t know, today you can specify in the vSphere Client what the dependency is between VMs. If a host fails, or multiple hosts fail, and the VMs in a dependency chain are impacted, then HA ensures that these VMs are powered on in a particular order. vSphere HA also lets you specify the restart priority of VMs. I described both of these features here. The difference between the two is fairly straightforward: restart priority is considered a “soft rule” and restart orchestration is considered a “hard rule”. In other words: if one VM can’t be restarted when restart orchestration is used, then the next batch will not start.
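The soft versus hard distinction can be illustrated with a small simulation. This is only a sketch in Python and not how vSphere HA is implemented: the function `restart_in_batches`, the `can_restart` callback and the VM names are all hypothetical, and the real feature takes far more into account than a single yes/no per VM.

```python
from typing import Callable, List


def restart_in_batches(
    batches: List[List[str]],
    can_restart: Callable[[str], bool],
    hard_rule: bool,
) -> List[str]:
    """Return the names of the VMs that were restarted, in order.

    batches     -- VMs grouped by restart order (hypothetical model)
    can_restart -- hypothetical callback: True if the VM can be restarted
    hard_rule   -- True models orchestration (a failure blocks later batches),
                   False models priority (later batches still start)
    """
    restarted: List[str] = []
    for batch in batches:
        batch_failed = False
        for vm in batch:
            if can_restart(vm):
                restarted.append(vm)
            else:
                batch_failed = True
        # Hard rule: one failed VM in this batch stops everything after it.
        if batch_failed and hard_rule:
            break
    return restarted


if __name__ == "__main__":
    chain = [["db"], ["app"], ["web"]]
    app_is_broken = lambda vm: vm != "app"  # assume "app" cannot be restarted

    print(restart_in_batches(chain, app_is_broken, hard_rule=True))   # ['db']
    print(restart_in_batches(chain, app_is_broken, hard_rule=False))  # ['db', 'web']
```

With the hard rule the web tier never starts because the app tier failed; with the soft rule it starts anyway.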

The UI, to be honest, is confusing, and having two similar concepts that more or less do the same thing is also confusing. We have discussed various things we would like to have your opinion on. Please leave your feedback in the comments using a valid email address, so that we can follow up when needed.

Would you like to have the ability to restart a full chain of VMs, when Orchestration or Priority is enabled and only a few VMs in the chain are impacted by a failure? In other words, would you like to have an option that allows you to restart running VMs that are part of a chain which is impacted by a failure?
Would you like to have Orchestrated Restarts / Restart Priority for APD impacted VMs and VM & App Monitoring as well?
Would you like to have Orchestrated Restarts and Restart Priority combined in a single feature?
- Potentially have an option to have multi-level for Orchestration like Restart Priority has
- Define if it is a “hard” or “soft” rule

And of course, if you feel anything else needs to change about this feature, then please also leave that in a comment. The HA team will be reading this, and are happy to take all feedback they can get. Please help shape the future of HA!

Comments

Lewis Bowman says

15 October, 2018 at 16:59

The ability to restart ‘the whole chain’ if only a subset is affected by an outage is a great idea, and I know that some of my environments most certainly would have a use case. I appreciate that this setting would need to be configurable, maybe at the cluster level?

The simplification of these settings would be appreciated. I think restart priority is implicitly a derived value from the orchestration order (VM-VM dependency), so deprecating restart priority is probably the best approach. Maybe having low / medium / high VM-VM dependency groups ‘pre-baked’ into the cluster settings would ease adoption of these settings?

The ability for the orchestrated restart to respond to VM monitoring and APD failures would make it consistent with the rest of HA, and consistency is always appreciated.
Suresh Siwach says

16 October, 2018 at 12:30

Hello Duncan,

First of all, thanks for inviting us to give feedback.
The improvements you mentioned in your post will take VMware HA to the next level, and I agree with the changes you suggested.

We are learning from the books/posts which you have published or are about to publish, so I don’t have much to say about the new features/improvements.

As of now, what I can suggest is an improvement in HA for clustered VMs (MS SQL DB, Oracle RAC, standalone DB, application clusters, etc.). In a VMware cluster there should be an option to mark/flag two or more VMs as part of an application cluster. The situations and actions are mentioned in the attached image.

By doing it this way we will be able to keep all, or the maximum number of, services available and utilize resources optimally.

In the future, marking/flagging could be automated by the VMware development team. I trust that the VMware team can achieve almost anything.

I am not able to upload the image, so here it is as text:

Situation: Host failure in a cluster where enough compute is available.
Action: Restart all VMs as per their priority and sequence.

Situation: One or multiple host failures in a cluster where enough compute is not available.
Action:
- HA should check and only restart N-1 VMs in the case of a two/three-node cluster, if the VMs are not running.
- HA should check and only restart N/2 VMs for a cluster of four or more nodes, if the VMs are not running.
- A custom option should be given to advanced users so they can decide how many nodes they want to bring up.

I am looking forward to your comments on my feedback.

Thanks
Suresh Siwach
Dan Krestin says

18 October, 2018 at 12:36

Hi Duncan,

As an SP we would love to have the ability to configure HA restart priority on VMs based on their start / stop policy defined in a vCD vApp.

Yes, we would like to see an option that allows us to restart running VMs that are part of a chain which is impacted by a failure.

Definitely for APD events.
Thanks
Dan
Dan Kahl says

18 October, 2018 at 15:30

Hey Duncan –

I like the concept of being able to restart a chain of VMs with HA; it can certainly provide more value.

I also realize your question is around vCenter cluster HA, but I think this might be a great opportunity to drive it with some of the self automation features in vROps?

I’m just thinking about apps which may reside in different datacenters, clusters or even vCenters, but are managed in a single vROps.

Thanks,

Dan K.
barry says

19 October, 2018 at 11:42

I would welcome the above suggested changes! Love the concepts and think all are very valid features!
nick says

19 October, 2018 at 12:56

I am not a fan of the current UI; it is very difficult to use. I hope there will be a major overhaul as a result of this.
Michael says

19 October, 2018 at 23:09

I like all these ideas.
I have some use cases especially for the option that allows you to restart running VMs that are part of a chain which is impacted by a failure.

I’m also looking for the possibility to use tags or folders/resource pools in VM rules and overrides. This would make it easy to assign a low restart priority to a resource pool of test systems. That way you don’t have to think about configuring the priority when creating a new VM, and you don’t need the permissions to do so (at least not on cluster level).
Adam Berus says

12 November, 2018 at 17:10

Hi Duncan,

I really like the changes you guys are making to HA and the overall direction it is moving in. Currently we don’t use restart priority because we found it to be essentially useless with today’s server capacity/density. So the introduction of restart dependency is a welcome addition, and I’m looking forward to playing with it when we upgrade from 6.01 to 6.7 in the next couple of months.

As is the case with the majority of VMware users we have virtualized the vast majority of our SQL and Oracle database servers, and due to licensing this requires them to reside on physically different hardware meaning different HA clusters but managed by the same vCenter instance. The primary use case I can see in using HA dependency groups would be between the application, web and their database servers (we are a huge SAP shop). Will HA dependency groups support using resources that are managed by a different cluster? From what I can tell so far this is cluster dependent, which technically makes sense, but it does limit our use case to things like domain controllers and such.
Babak says

14 November, 2018 at 11:13

Dear Duncan,
Hi

I have been reading your book (vSphere Cluster Deep Dive 6.7), and I have some questions about the admission control part:

1- Is it best practice to choose “Host failures cluster tolerates” when all of my ESXi hosts have the same resources?
2- Is it best practice to choose the Cluster Resource Percentage algorithm when my ESXi hosts have different resources?

3- If I want to use the Cluster Resource Percentage algorithm, I have to know the resources of all VMs, because you use this method:

((total amount of available resources – total reserved VM resources) / total amount of available resources) <= (percentage HA should reserve as spare capacity)

so I have to know the total reserved VM resources.

4- Overall, can I say that we use the Cluster Resource Percentage algorithm when we know the number of VMs and their resources, or when the number of VMs is more or less fixed, and use “Host failures cluster tolerates” when I don't know the number of VMs for certain?

Is that correct?
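
For readers following along, the formula quoted in this comment can be worked through with a few numbers. The snippet below is a minimal Python sketch that evaluates the expression exactly as it is written in the comment. The function names and the sample values are made up for this example, and it makes no claim about how vSphere HA acts on the result.

```python
def remaining_fraction(total_available: float, total_reserved: float) -> float:
    """Left-hand side of the quoted formula:
    (total available - total reserved by VMs) / total available."""
    if total_available <= 0:
        raise ValueError("total_available must be positive")
    return (total_available - total_reserved) / total_available


def inequality_as_quoted(
    total_available: float, total_reserved: float, spare_fraction: float
) -> bool:
    """Evaluate the inequality as quoted in the comment:
    remaining fraction <= fraction HA should reserve as spare capacity."""
    return remaining_fraction(total_available, total_reserved) <= spare_fraction


if __name__ == "__main__":
    # Hypothetical cluster: 100 GHz in total, 25% configured as spare capacity.
    print(remaining_fraction(100, 30))          # 30 GHz reserved by VMs
    print(inequality_as_quoted(100, 30, 0.25))  # False
    print(inequality_as_quoted(100, 80, 0.25))  # True
```

As the sketch shows, the only VM-specific input is the sum of the VM reservations, not the size of each individual VM.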
Jay Hallsworth says

25 January, 2019 at 02:09

Duncan,

I deal with lots of small clusters, 2 – 5 hosts with up to 100 or so VMs per cluster… (this must surely be a more common scenario than the large 10/15+ host clusters)

In this case, the most useful interface is the host startup whereby you can set a group of VMs to start in sequence, then a group of VMs to start in any order, and another group of VMs to remain powered off.
Cat Mucius says

31 December, 2019 at 02:48

>> “If a host fails, or multiple hosts fail, and the VMs in a dependency chain are impacted, then HA ensures that these VMs are powered on in a particular order.”

Sounds great, but will this work if _all_ hosts fail and then (some of them) are restarted?
