Yellow Bricks

by Duncan Epping



Don’t create a Frankencluster just because you can…

Duncan Epping · Feb 19, 2014 ·

In the last couple of weeks I have had various discussions around creating imbalanced clusters: imbalanced from a CPU, memory, or even a storage point of view. This typically comes up when someone wants to bring more scale to their cluster and plans to add hosts with more resources of any of the aforementioned types, or when licensing costs need to be limited and people want to restrict certain VMs to run on a specific set of hosts. The latter comes up often when people are starting to look at virtualizing Oracle. (Andrew Mitchell published this excellent article on the topic of Oracle licensing and soft vs hard partitioning which is worth reading!)

Why am I not a fan of imbalanced clusters when it comes to compute or storage resources? Why am I not a fan of purposely crippling your environment to ensure your VMs will only run on a subset of vSphere hosts? The reason is simple: the problems I have seen and experienced, and the inefficiency in certain scenarios. Let's look at some examples:

Let's assume I have 4 hosts, each with 128GB of memory. I need more memory in my cluster, so I add a host with 256GB of memory. Now you just went from 512GB to 768GB, which is a huge increase. However, this is only true when you don't do any form of admission control or resource management. When you do proper resource management and admission control, you need to make sure that all of your virtual machines can still run in the case of a failure, and preferably with the same performance before and after the failure has occurred. If that 256GB of memory is being used and the host containing it goes down, your virtual machines could be impacted: they might not restart, and if they do restart they may not get the same amount of resources as they received before the failure. This scenario also applies to CPU if you create an imbalance there.
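To make the math concrete, here is a minimal back-of-the-envelope sketch in plain Python (the numbers are the hypothetical ones from the example above): it computes how much memory you can safely consume if every VM must be able to restart with the same resources after the worst-case single host failure.

```python
# Back-of-the-envelope capacity check for an imbalanced cluster.
# Numbers are hypothetical, matching the example above; overhead,
# reservations and specific admission control policies are ignored.

def usable_after_worst_failure(host_memory_gb):
    """Memory left when the single largest host fails."""
    return sum(host_memory_gb) - max(host_memory_gb)

original     = [128, 128, 128, 128]   # 512GB raw
add_128_host = original + [128]       # 640GB raw
add_256_host = original + [256]       # 768GB raw, the "Frankencluster"

for name, hosts in [("original", original),
                    ("add 128GB host", add_128_host),
                    ("add 256GB host", add_256_host)]:
    print(f"{name:>15}: raw {sum(hosts)}GB, "
          f"safe to use {usable_after_worst_failure(hosts)}GB")

# original       : raw 512GB, safe to use 384GB
# add 128GB host : raw 640GB, safe to use 512GB
# add 256GB host : raw 768GB, safe to use 512GB  <- the extra 128GB buys you nothing
```

In other words, if you want every VM to survive the failure of the biggest host with equal performance, the 256GB host gives you no more usable capacity than another 128GB host would have.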

Another one I encountered recently was a LUN presented to a limited set of hosts; in this case it was only presented to 2 of the 20 hosts in the cluster… Guess what, when those two hosts die, so do your VMs. Not optimal when they are running an Oracle database, for instance. On top of that I have seen people pitching a VSAN cluster of 16 nodes with only 3 hosts contributing storage. Yes, you can do that, but again… when things go bad, they will go horribly bad. Just imagine 1 host fails: how will you rebuild the components that were impacted? What is the performance impact? It is very difficult to predict how it will impact your workload, so just keep it simple. Sure, there is a cost overhead associated with separating workloads and creating dedicated clusters, but it will be easier to manage and more predictable in failure scenarios.
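If you want to verify that a datastore is actually presented to every host in a cluster, rather than just the two it happened to be zoned to, something along the following lines would do it. This is a rough pyVmomi sketch; the vCenter address, credentials, cluster name and datastore name are placeholders.

```python
# Rough pyVmomi sketch: report which hosts in a cluster can see a datastore.
# vCenter address, credentials, cluster and datastore names are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

def find_by_name(vimtype, name):
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next(obj for obj in view.view if obj.name == name)

cluster = find_by_name(vim.ClusterComputeResource, "Oracle-Cluster")  # placeholder
datastore = find_by_name(vim.Datastore, "Oracle-LUN-01")              # placeholder

cluster_hosts = {h.name for h in cluster.host}
mounted_hosts = {mount.key.name for mount in datastore.host}  # hosts that mount the datastore

missing = sorted(cluster_hosts - mounted_hosts)
print(f"{len(cluster_hosts) - len(missing)}/{len(cluster_hosts)} hosts see {datastore.name}")
if missing:
    print("Not presented to:", ", ".join(missing))

Disconnect(si)
```

If the first number is anything other than the total host count, an HA restart of those VMs is limited to the hosts that can actually see the datastore.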

I guess in summary: if you want predictability in terms of availability and recoverability of your virtual machines, go for a balanced environment. Don't create a Frankencluster!

vSphere Availability Survey, please help out!

Duncan Epping · Feb 17, 2014 ·

Just received the below from the vSphere Availability team. It takes a couple of minutes to fill out, and it helps the vSphere Availability team set priorities correctly for upcoming releases, yes indeed, based on your answers!

— copy / paste —

The Availability team (the team that brings you products such as vSphere HA, FT, etc.) would like to get your input on how you use our products today and what your projected needs are. The survey consists mainly of multiple-choice questions and will take 10-15 minutes to complete. Your feedback is invaluable in helping us tailor our development efforts towards valuable enhancements. So, thank you!

Here’s the link to the survey: http://tinyurl.com/vmwavailability

vSphere HA and VMs per Datastore limit!

Duncan Epping · Feb 5, 2014 ·

I felt I needed to get this out there, as it is not something many seem to be aware of. More and more people are starting to use storage solutions which offer one large shared datastore; examples are solutions like Virtual SAN, Tintri and Nutanix. I have seen various folks saying: unlimited number of VMs per datastore, but of course there are limits to everything! If you are planning to build a big (HA-enabled) cluster, keep in mind that per cluster the limit for a datastore is 2048 powered-on virtual machines! Say what? Yes, that is right: per cluster you are limited to 2048 powered-on VMs on a single datastore. This is documented in the Max Config Guide of both vSphere 5.5 and vSphere 5.1. Please note it says datastore, not VMFS or NFS explicitly; this applies to both!

The reason for this today is the vSphere HA power-on list. I described that list in this article; in short, this list keeps track of the power state of your virtual machines. If you need more than 2048 VMs in your cluster, you will need to create multiple datastores for now. (More details in the blog post.) Do note that this is a known limitation, and I have been told that the engineering team is researching a solution to this problem. Hopefully it will be in one of the upcoming releases.
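If you want to check how close a cluster is to that ceiling, a rough pyVmomi sketch like the one below counts powered-on VMs per datastore for a given cluster. The vCenter address, credentials and cluster name are placeholders.

```python
# Rough pyVmomi sketch: count powered-on VMs per datastore for one cluster,
# to compare against the 2048 powered-on VMs per datastore (per cluster) limit.
# vCenter address, credentials and the cluster name are placeholders.
import ssl
from collections import Counter
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Big-Cluster")  # placeholder

counts = Counter()
for host in cluster.host:
    for vm in host.vm:
        if vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOn:
            for ds in vm.datastore:
                counts[ds.name] += 1  # a VM is counted once per datastore it uses

for ds_name, n in counts.most_common():
    print(f"{ds_name}: {n} powered-on VMs (per-cluster limit: 2048)")

Disconnect(si)
```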

HA Futures: Pro-active response

Duncan Epping · Oct 4, 2013 ·

We all know (at least I hope so) what HA is responsible for within a vSphere cluster. Although it is great that vSphere HA responds to the failure of a host / VM / application, and in some cases even your storage device, wouldn't it be nice if vSphere HA could pro-actively respond to conditions which might lead to a failure? That is what we want to discuss in this article.

What we are exploring right now is the ability for HA to avoid unplanned downtime. HA would detect specific (health) conditions that could lead to catastrophic failures and pro-actively move virtual machines off that host. You could for instance think of a situation where 1 out of 2 storage paths goes down. Although this does not directly impact the virtual machines from an availability perspective, it could be catastrophic if that second path also goes down. So in order to avoid ending up in this situation, vSphere HA would vMotion all the virtual machines to a host which does not have this failure.

This could of course also apply to other components like networking, or even memory or CPU. You could for instance have a memory DIMM which is reporting issues that could impact availability; this in turn could trigger HA to pro-actively move all potentially impacted VMs to a different host.
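Nothing like this exists in vSphere HA today, but as a thought experiment you could approximate the behaviour by hand: when your monitoring flags a host as degraded, evacuate it and let DRS (in fully automated mode) vMotion the VMs away. A hypothetical pyVmomi sketch, with the host name and connection details as placeholders:

```python
# Hypothetical stand-in for the pro-active behaviour discussed above, not an
# existing HA feature: when external monitoring flags a host as degraded
# (e.g. one of two storage paths lost), enter maintenance mode so that
# fully automated DRS vMotions the running VMs off that host.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
host = next(h for h in view.view if h.name == "esxi-03.example.com")  # placeholder

# timeout=0 means no timeout; with DRS in fully automated mode the powered-on
# VMs are migrated off before the host actually enters maintenance mode.
WaitForTask(host.EnterMaintenanceMode_Task(timeout=0))
print(f"{host.name} evacuated and in maintenance mode")

Disconnect(si)
```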

A couple of questions we have for you:

  1. When such partial host failures occur today, how do you address these conditions? When do you bring the host back online?
  2. What level of integration do you expect with management tools? In other words, should we expose an API that your management solution can consume, or do you prefer this to be a stand-alone solution using a CIM provider for instance?
  3. Should HA treat all health conditions the same? I.e., always evacuate all VMs from an “unhealthy” host?
  4. How would you like HA to compare two conditions? E.g., a fan failure on host 1 versus a network path failure on host 2?

Please chime in,

vSphere HA advanced settings, the KB

Duncan Epping · Sep 26, 2013 ·

I’ve posted about vSphere HA advanced settings various times in the past, and let me start by saying that you shouldn’t play around with them unless you have a requirement to do so. But if you do, there is a KB article which I can highly recommend as it lists all the known and lesser known advanced settings. I had the KB article updated with vSphere 5.5 advanced settings yesterday (Thanks KB team for being so responsive!) but it also applies to vSphere 5.0 and 5.1. Recommended read for those who want to get in to the nitty gritty details of vSphere HA.

http://kb.vmware.com/kb/2033250
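If, after reading the KB, you do need to set one of these options programmatically, the reconfigure path looks roughly like the sketch below with pyVmomi. das.isolationaddress0 is used purely as an example, and the cluster name and connection details are placeholders.

```python
# Rough pyVmomi sketch: set a vSphere HA advanced option on a cluster.
# das.isolationaddress0 is only an example; change advanced settings only
# when you have an actual requirement to do so (see the KB above).
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Management-Cluster")  # placeholder

spec = vim.cluster.ConfigSpecEx()
spec.dasConfig = vim.cluster.DasConfigInfo()
# Note: the option array you pass may replace the existing advanced options,
# so include any settings you want to keep.
spec.dasConfig.option = [
    vim.option.OptionValue(key="das.isolationaddress0", value="192.168.1.1"),
]

# modify=True merges this partial spec with the existing cluster configuration
WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))
print("HA advanced option applied")

Disconnect(si)
```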

