Lately I have been thinking about what it would take to deploy a stretched vCloud Director (vCD) infrastructure. “The problem” with a vCloud Director infrastructure is that it has so many moving components that it is difficult to figure out how to protect each one. Let me point out that I do not have all the definitive answers yet; I am writing this article to get a better understanding of the problem myself. If you do not agree with my reasoning please feel free to comment, as I need YOUR help defining the recommended practices around vCD on a stretched infrastructure.
I listed the components I used in my lab:
- vCenter Server Management
- vCenter Server Cloud Resources
- vCloud Director Cells
- vShield Manager
- Database Server
That would be 5 moving components, but in reality we are talking about something closer to 8, because vCenter Server itself consists of multiple components:
- Single Sign On
- Inventory Service
- Web Client
- vCenter Server
How do I protect these 8 components? The first 5 listed will be individual VMs, and vCloud Director itself will even consist of multiple cells. What would this look like?
As you can see, there are multiple vCenter Servers: one manages the Management Cluster and its components, while the other manages the “Cloud Resource Cluster”. Let's start listing all the components and discuss what the options are and whether we can protect them in a special way or not.
vCenter Server (cloud resources and management)
vCenter Server can be protected through various methods. There is vCenter Heartbeat, and of course we have vSphere HA (including VM Monitoring). First of all, it is key to realize that neither of these solutions is fully “non-disruptive”: both vSphere HA and vCenter Heartbeat will cause a slight disruption. vSphere HA will simply restart your VM when a host has failed, and vSphere HA – VM Monitoring can restart the guest OS when the VM has failed. vCenter Heartbeat is a more intelligent solution: it can detect outages using a heartbeat mechanism and respond to them.
I guess the question is availability versus operational simplicity. How important is vCenter Server availability in your environment? Setting up vSphere HA and VM Monitoring is a matter of seconds. Installing and configuring vCenter Heartbeat probably takes hours, and then there are upgrade processes to think about. I personally prefer not to use vCenter Heartbeat and to go for vSphere HA and VM Monitoring in this scenario. How about you?
What about vCenter services like SSO, Inventory Service and the Web Client? From a scalability/performance perspective it might make sense to split them out, but doing so also makes your environment more vulnerable to failures. What if one VM in your “vCenter service chain” is down? That might render your whole solution unusable. I would personally prefer to have vCenter Server, Inventory Service and the Web Client installed in a single VM. I can imagine that you would like to split out SSO, so that when you have multiple vCenter Server instances you can link them to the same SSO instance.
SSO could potentially be deployed in an HA fashion. HA with regards to SSO is an active/standby solution, but I have been told there are other ways of deploying it and that more info will be released soon.
Recommended Practice: I am a big fan of keeping things simple. Keep vCenter Server and at a minimum the Inventory Service together, and potentially the Web Client as well. Although Heartbeat has the potential to decrease vCenter Server downtime, in many cloud environments SLAs are about vCloud workload availability and not about vCenter itself. One component that I would recommend configuring in an HA fashion is SSO. Without SSO you cannot log in, and that is critical for operations.
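To put a rough number on the “service chain” argument, here is a minimal sketch in Python. It assumes that every VM in the chain must be up for vCenter to be usable and that VM failures are independent. The 99.9% per-VM figure is purely hypothetical and not a measured value, so treat the outcome as an illustration rather than a prediction.

```python
from math import prod

def chain_availability(availabilities):
    """Availability of a chain in which every VM must be up (series dependency)."""
    return prod(availabilities)

# Hypothetical figure: each VM is up 99.9% of the time, failures are independent.
PER_VM = 0.999

combined = chain_availability([PER_VM])      # vCenter Server, Inventory Service, Web Client in one VM
split = chain_availability([PER_VM] * 3)     # one VM per service

HOURS_PER_YEAR = 24 * 365
for label, a in (("single VM", combined), ("three VMs", split)):
    print(f"{label}: {a:.4%} available, ~{(1 - a) * HOURS_PER_YEAR:.1f} h/year down")
```

Under these assumptions the three-VM chain ends up with roughly three times the expected downtime of the single VM, which is the reasoning behind keeping the services together.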
vCloud Director Cells
Hopefully all of you are aware that vCloud Director can easily scale by deploying new “cells”, as we call them. Simply said, a cell is a virtual machine running the vCD software. These cells are all connected to the same database and can handle load. Not only can they handle load, they can also continue where another cell stopped, so from an availability perspective this is ideal. I already depicted this in the diagram above, by the way.
Recommended Practice: Deploy multiple vCloud Director cells in your management cluster. Ensure that at a minimum two cells reside in each of the “sites” of your stretched cluster. To achieve this, vSphere DRS VM-Host affinity groups should be used!
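The “two cells per site” rule is easy to check once you know which host group each cell is tied to. The sketch below is plain Python and does not talk to vCenter. The cell names, the site names and the `sites_short_of_cells` helper are all hypothetical, and the mapping stands in for whatever your DRS VM-Host affinity groups actually contain.

```python
from collections import Counter

# Hypothetical inventory: the site (DRS host group) each cell VM is tied to.
cell_to_site = {
    "vcd-cell-01": "site-a",
    "vcd-cell-02": "site-a",
    "vcd-cell-03": "site-b",
    "vcd-cell-04": "site-b",
}

def sites_short_of_cells(placement, sites, minimum=2):
    """Return the sites that have fewer than `minimum` cells assigned."""
    counts = Counter(placement.values())  # a missing site counts as 0
    return sorted(site for site in sites if counts[site] < minimum)

# An empty list means every site meets the minimum.
print(sites_short_of_cells(cell_to_site, ["site-a", "site-b"]))
```

A check like this could run after every change to the affinity groups, so that a cell added to the wrong group does not go unnoticed.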
vShield Manager
vShield Manager is one of the difficult components. It is a single virtual machine. You can protect it using vSphere HA, but that is about it, as the VM has multiple vCPUs, which rules out FT. So what would make sense in this case? I would try to ensure that the vShield Manager is in the same site as vCenter Server. If there is a network failure between sites, at least the vShield Manager and vCenter Server can still communicate when needed.
Recommended Practice: Keep the vShield Manager virtual appliance in the same site as the vCenter Server; in other words, make both part of the same vSphere DRS VM-Host affinity group. It is also recommended to leverage vSphere HA – VM Monitoring to allow automatic restarts in the case of a host or guest failure.
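The reasoning behind co-locating the two can be shown with a deliberately simplified model: two VMs can talk if they sit in the same site, or if the link between the sites is up. The VM names, the placement and the `can_talk` helper below are hypothetical, and real networks obviously fail in more ways than this.

```python
# Hypothetical placement of management VMs across the two sites.
placement = {
    "vcenter-cloud": "site-a",
    "vshield-manager": "site-a",
    "vcd-cell-01": "site-a",
    "vcd-cell-03": "site-b",
}

def can_talk(vm_a, vm_b, placement, inter_site_link_up):
    """Two VMs can communicate if they share a site or the inter-site link is up."""
    return inter_site_link_up or placement[vm_a] == placement[vm_b]

# With the inter-site link down, only co-located pairs keep talking.
print(can_talk("vshield-manager", "vcenter-cloud", placement, inter_site_link_up=False))
print(can_talk("vcd-cell-03", "vcenter-cloud", placement, inter_site_link_up=False))
```

In this model the vShield Manager still reaches vCenter Server during a site partition because both are pinned to the same site, while the cell in the other site does not.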
Database Server
This is the challenging one… As of vCloud Director 5.1, clustering your database is supported, so you could potentially cluster the vCD database. However, this Database Server will host more than just vCD: it will probably also host the vCenter Server database and potentially other bits and pieces like Chargeback, Orchestrator etc. Unfortunately, not all of these support a clustered database solution today. That makes it difficult to define a recommended practice in this case. Although database clustering will theoretically increase availability, it will also complicate operations. From an operational perspective the difficult part is how to manage site isolations. Just imagine the network between Site-A and Site-B is down but all components are still running. What will you do with the database?
This is definitely one I am not sure what to do with…
As you can see, this is not a fully worked-out set of recommended practices yet. There is still stuff to be figured out, and I am going through the exercise as we speak. If you have an opinion about this, and I am sure many do, don’t hesitate to leave a comment!