Thinking about a stretched vCloud Director deployment

Lately I have been thinking about what it would take to deploy a stretched vCloud Director (vCD) infrastructure. “The problem” with a vCloud Director infrastructure is that there are so many moving components, which makes it difficult to figure out how to protect each of them. Let me point out that I do not have all the definitive answers to this yet; I am writing this article to get a better understanding of the problem myself. If you do not agree with my reasoning, please feel free to comment, as I need YOUR help defining the recommended practices around vCD on a stretched infrastructure.

I listed the components I used in my lab:

  • vCenter Server Management
  • vCenter Server Cloud Resources
  • vCloud Director Cells
  • vShield Manager
  • Database Server

That would be 5 moving components, but in reality we are talking more like 8. The thing here is that vCenter Server itself also consists of multiple components:

  • Single Sign On
  • Inventory Service
  • Web Client
  • vCenter Server

How do I protect these 8 components? The first 5 listed will be individual VMs, and vCloud Director itself will even consist of multiple cells. What would this look like?

As you can see there are multiple vCenter Servers: one manages the Management Cluster and its components, while the other manages the “Cloud Resource Cluster”. Let’s start listing all the components and discuss what the options are, and whether we can protect them in a special way or not.

vCenter Server (cloud resources and management)

vCenter Server can be protected through various methods. There is vCenter Heartbeat, and of course we have vSphere HA (including VM Monitoring). First of all it is key to realize that neither of these solutions is fully “non-disruptive”. Both vSphere HA and vCenter Heartbeat will cause a slight disruption. vSphere HA will simply restart your VM when a host has failed, and vSphere HA – VM Monitoring can restart the Guest OS when the VM has failed. vCenter Heartbeat is a more intelligent solution; it can detect outages using a heartbeat mechanism and respond to them.

I guess the question is availability versus operational simplicity. How important is vCenter Server availability in your environment? Setting up vSphere HA and VM Monitoring is a matter of seconds, while installing and configuring vCenter Heartbeat probably takes hours… And think about the upgrade process as well. I personally prefer not using vCenter Heartbeat but going for vSphere HA and VM Monitoring in this scenario; how about you?
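To give an impression of how little effort the vSphere HA / VM Monitoring route takes, here is a rough PowerCLI sketch. The cluster name “Management-Cluster” is a placeholder for your own environment, and the VM Monitoring setting is applied through the vSphere API since not every PowerCLI release exposes it via Set-Cluster:

```powershell
# Placeholder cluster name - adjust to your environment
$cluster = Get-Cluster -Name "Management-Cluster"

# Enable vSphere HA on the cluster
Set-Cluster -Cluster $cluster -HAEnabled:$true -Confirm:$false

# Enable VM Monitoring: restart the guest OS when VMware Tools
# heartbeats stop while the VM is still running
$spec = New-Object VMware.Vim.ClusterConfigSpecEx
$spec.DasConfig = New-Object VMware.Vim.ClusterDasConfigInfo
$spec.DasConfig.VmMonitoring = "vmMonitoringOnly"
($cluster | Get-View).ReconfigureComputeResource_Task($spec, $true)
```

Two cmdlet invocations and a small API call versus a full Heartbeat install, which illustrates the operational-simplicity argument.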

What about the vCenter services like SSO, the Inventory Service and the Web Client? Although splitting things out might make sense from a scalability/performance perspective, it also makes your environment more vulnerable to failures. What if one VM in your “vCenter service chain” is down? That might render your whole solution unusable. I would personally prefer to have vCenter Server, the Inventory Service and the Web Client installed in a single VM. I can imagine that you would like to split out SSO, so that when you have multiple vCenter Server instances you can link them to the same SSO instance.

As mentioned, SSO could potentially be deployed in an HA fashion. HA with regards to SSO is an active/standby solution, but I have been told there are other ways of deploying it and that more info will be released soon.

Recommended Practice: I am a big fan of keeping things simple. Keep vCenter Server and, at a minimum, the Inventory Service together, and potentially the Web Client as well. Although Heartbeat has the potential to decrease vCenter Server downtime, in many cloud environments SLAs revolve around vCloud workload availability and not around vCenter itself. One component that I would recommend configuring in an HA fashion is SSO. Without SSO you cannot log in, and that is critical for operations.

vCloud Director

Hopefully all of you are aware that vCloud Director can easily scale by deploying new “cells”, as we call them. A cell is, simply put, a virtual machine running the vCD software. These cells are all connected to the same database and can share the load. Not only can they share the load, they can also continue where another cell stopped. From an availability perspective this is ideal. I already depicted this in the diagram above, by the way.

Recommended Practice: Deploy multiple vCloud Director cells in your management cluster. Ensure that at a minimum two cells reside in each of the “sites” of your stretched cluster. In order to achieve this, vSphere DRS VM-Host affinity groups should be used!
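To sketch what such an affinity setup could look like in PowerCLI: the cluster, host and VM names below are made up, and the DRS group cmdlets shown here were introduced in later PowerCLI releases (on older versions the same can be achieved through the vSphere API):

```powershell
# Placeholder names - adjust to your environment
$cluster = Get-Cluster -Name "Management-Cluster"

# A host group containing the Site A hosts, and a VM group
# for the vCD cells that should run in Site A
New-DrsClusterGroup -Name "SiteA-Hosts" -Cluster $cluster `
    -VMHost (Get-VMHost -Name "esx-a1.lab.local","esx-a2.lab.local")
New-DrsClusterGroup -Name "SiteA-Cells" -Cluster $cluster `
    -VM (Get-VM -Name "vcd-cell-01","vcd-cell-02")

# A "should run on" rule pins the cells to Site A during normal
# operations, while still allowing vSphere HA to restart them on
# Site B hosts after a full site failure
New-DrsVMHostRule -Name "SiteA-Cells-on-SiteA-Hosts" -Cluster $cluster `
    -VMGroup "SiteA-Cells" -VMHostGroup "SiteA-Hosts" -Type "ShouldRunOn"
```

You would create a mirrored pair of groups and a rule for the Site B cells. The “ShouldRunOn” type (rather than “MustRunOn”) is deliberate: a must-rule would prevent HA from restarting the cells in the surviving site.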

vShield Manager

vShield Manager is one of the difficult components. It is a single virtual machine. You can protect it using vSphere HA, but that is about it, as the VM has multiple vCPUs, which rules out FT. So what would make sense in this case? I would try to ensure that the vShield Manager is in the same site as vCenter Server. In the case of a network failure between sites, at least the vShield Manager and vCenter Server can still communicate when needed.

Recommended Practice: The vShield Manager virtual appliance should reside in the same site as the vCenter Server; in other words, it is a recommended practice to have both be part of the same vSphere DRS VM-Host affinity group. It is also recommended to leverage vSphere HA – VM Monitoring to allow for automatic restarts in the case of a host or guest failure.
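In PowerCLI terms, keeping the two together is just a matter of adding the vShield Manager appliance to the DRS VM group that already contains the vCenter Server VM. The cluster, group and VM names below are placeholders for whatever your environment uses:

```powershell
# Placeholder names - adjust to your environment
$cluster = Get-Cluster -Name "Management-Cluster"

# Fetch the existing DRS VM group that holds the vCenter Server VM
$group = Get-DrsClusterGroup -Cluster $cluster -Name "SiteA-Mgmt-VMs"

# Add the vShield Manager appliance to that same group, so the
# VM-Host affinity rule keeps both VMs in the same site
Set-DrsClusterGroup -DrsClusterGroup $group `
    -VM (Get-VM -Name "vShield-Manager") -Add
```

Since group membership is the only thing that changes, the existing “should run on” rule for that group automatically applies to the vShield Manager as well.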

Database

This is the challenging one… As of vCloud Director 5.1 it is supported to cluster your database, so you could potentially cluster the vCD database. However, this database server will host more than just vCD; it will probably also host the vCenter Server database and potentially other bits and pieces like Chargeback, Orchestrator, etc. Unfortunately, not all of these support a clustered database solution today. That makes it difficult to define a recommended practice in this case. Although database clustering will theoretically increase availability, it will also complicate operations. From an operational perspective the difficult part is how to handle site isolations. Just imagine the network between Site-A and Site-B is down but all components are still running. What will you do with the database?

This is definitely one I am not sure about what to do with…

Summary

As you can see this is not a fully worked-out set of recommended practices yet; there is still stuff to be figured out and I am going through the exercise as we speak. If you have an opinion about this, and I am sure many do, don’t hesitate to leave a comment!


    Comments

    1. Lee Christie says

      Hi Duncan

      We recently built a new platform based on 5.1 and vCloud Director, and tried to implement as much HA goodness as possible.

      SSO – we tried to build an SSO cluster but this is no easy task due to a lack of (and conflicting) information. At one point I had an SSO cluster behind a load balancer but ran into trouble with vCenter only talking to one of the SSO nodes – it wouldn’t accept the certificates from the other SSO nodes. We felt the benefit was outweighed by the complexity/newness of this setup (and also the lack of database HA, see below) so abandoned that path.

      An SSO cluster would only be as good as its database, right? And we found no supported way of achieving HA here. My memory is sketchy, but I don’t believe a SQL 2008 mirror is supported in the JDBC URL, a true SQL cluster won’t work, and a SQL 2012 Availability Group worked a treat (with some hacking of SQL scripts) – but of course SQL 2012 isn’t on the supported list, so we abandoned that.

      In our implementation we have vCenter talking to a SQL2008R2 mirror just fine. Hurray for good old ODBC and the native client which achieves this.

      Ultimately we deemed SSO to be on a par with vCenter in terms of importance, so decided to implement these together on the same VM (protected by “just” HA). Our database server is separate; vCenter has a mirrored db whereas SSO has a single db.

      As well as vcloud cells we also have our active directory infrastructure to think about.

      We are running a two-site strategy and have the luxury of a 20Gb/s resilient link (two 10Gb channels). However, I cannot see any way of achieving seamless failover if the primary site fails, and the concept of a vSphere/vCloud split-brain situation would be a nightmare. For now we have configured replication of our important stuff from Site A to Site B, so with a few clicks we can have the entire operation up and running again.

      Customer VMs are protected by HA pairs running cross site inside VXLAN segments (with HA VSE pairs).

      We also implemented some other safety ideas like running a management LAN across standard vSwitches, so in the event of a vCenter/vDS meltdown we can still access our important infrastructure.

      When SQL2012 appears on the supported list the database side of things will be much easier since applications are not aware they are talking to a cluster as such.

      A big learning curve for sure!

    2. says

      From the networking perspective, the major vulnerability is the DCI link. Always consider what happens when it fails (because eventually it will).

      If the individual cells/HA clusters are not stretched across both data centers, they’ll do just fine, but the cells in the “orphan” DC might be impacted by the vCD database connectivity issues, and the users deployed in the “orphan” DC will lose orchestration capabilities. Survivable, but not for long.

      Anyhow, always consider every possible failure scenario – server, storage and network failure … and don’t expect that a redundant system will never fail; you’ve just reduced the probability of a total failure.

        • Lee Christie says

          Since vCenter uses an ODBC connection, surely this means you can use anything that the underlying driver supports?

          Or are we not talking about “what works” but instead “what VMware themselves will support”?

          • says

            I am not talking about what works; a lot of stuff works but it doesn’t mean you get full support on your solution. In this case database clustering is not supported for vCenter Server at this point in time. This might change in the future, who knows.

    3. says

      I think we should first define what a stretched cluster vCloud Director deployment is and what the business driver for it is. Then you can discuss each HA scenario and how it will affect service availability based on failures of compute, storage and network components. I personally do not see much sense (or demand) in stretching the workloads. It gets too expensive, as you need N+N redundancy and stretched L2 networks with large bandwidth. A more realistic alternative (IMHO) is to provide the customer something like Amazon availability zones. They would get two org VDCs, each from a PVDC in a different site, and it would be up to them to deploy the workloads with application clustering. Each PVDC would be managed by its local vCenter+vSM. Stretched cluster protection might then be used only for the rest of the management components: the vCD DB, Chargeback, etc.

      • Duncan says

        I have had many questions around this lately from SPs with DCs within a 50km distance. Hence the reason for the discussion. Whether I see the value or not doesn’t change the fact that people are looking to implement it.

        • says

          Then we should mention to them that stretching the storage is the easy bit. As none of the components are site aware, the networking gets complicated. On top of what Ivan said above, we have no easy mechanism to keep all VMs of one org VDC together. The same goes for their gateway Edges (traffic tromboning). The transfers between VCD cells and hosts will also not be site optimized (console and OVF traffic). Catalogs are not site aware, and neither are Edge deployments from vSM to hosts. So they need a really thick DCI link. It also means that in terms of availability you won’t gain much (if anything) with a 2-2 VCD cell setup over 1-1 or 4-0.
          Other considerations: redundant geographical load balancing for external cloud workload networks and for external VCD cell networks.
          So back to my original question: why do it in the first place? What is the business driver?

          • Duncan says

            I know VCD is not site optimized, and neither are most components. (See my vMSC white paper, which addresses these issues, or my preso at VMworld, where we mention that stretching the storage is the easy part.)

            Nevertheless these solutions are still being sold. So what the business driver is doesn’t really matter at that point. (Probably sold as a DR or Disaster Avoidance solution; whether this is valid is beside the point.)

    4. Udubplate says

      I think it is absolutely unacceptable that vCenter doesn’t officially support a clustered database instance. This is supposed to be an enterprise and service provider grade platform, and this is a basic requirement for many people. Of course we know it works and people have been doing it since day one, but the lack of official support is disappointing to say the least. As was mentioned, there is no reason this shouldn’t be supported given the underlying ODBC driver supports it. There must be a disconnect within the organization about the importance of this being officially supported. Frank, it would be great if you could help the cause by getting attention in the right group to get this resolved.

      • Lee Christie says

        One could play devil’s advocate here and argue there is no point in vCenter supporting a clustered database instance when SSO cannot. Our testing with SSO clustering and HA concluded that SSO has been released a bit prematurely for production use.

        • Udubplate says

          Agreed, but I don’t think that’s so much playing devil’s advocate; rather, it’s reinforcing the point… that a simple enterprise requirement is not being accounted for, as part of day 1 support in the case of SSO, or after many, many years in the case of vCenter. It emphasizes that this isn’t recognized within VMware if new products are being released with the same limitations.

    5. Joseph Griffiths says

      Thanks for posting this blog entry. This is something I have been trying to figure out for a while now. I had a lot of the same thoughts that you had. The problem is it really all breaks down at the database layer. No matter what you do, vCenter does not support any type of real clustered database solution. I am wondering if this is a product of the JDBC connection. Does the vCenter appliance support clustered solutions? We are an Oracle RAC shop, so support for the Oracle client natively in Linux would solve this issue 100%… I understand that the vCenter appliance does not scale past 100 VMs (this is the answer I got from VMware support, not personal experience), but it sure would be nice to see vCenter gain some redundancy. I have worked with Heartbeat between a physical and a virtual system to provide redundancy, but this solution is less than elegant.

      What are the future plans to solve what I would consider to be a major hole in VMware redundancy?

      Our dependency on vCenter only increases over time.

    6. says

      As regards the management cluster, my thinking is that if you’re going to stretch it and protect each individual component (as opposed to using a holistic solution such as SRM, which is supported for the management cluster and what we use for our vCD environment), then you’ll need to replicate much if not all of the functionality that SRM provides. In a complete site failure, probably the most important part is scheduling recovery of all the moving components in the correct order. If your vCD cells come up before their database servers are ready, then they won’t start correctly and you won’t have achieved a successful recovery. And as clustering (some of) the database servers isn’t supported, this becomes quite complex in a stretched cluster, where your database servers and their application servers might be on different sites. Probably you want to make sure that they aren’t.

      As you pointed out, there are certain key pieces of infrastructure – some database servers, vShield Manager – that today can pretty much only be protected (in a supported way) with HA or SRM. In our case, to complete the full recovery of the management cluster, we needed more granularity than HA could give us (we have four tiers of recovery priority – authentication servers/DNS/other supporting infrastructure first, then vCenter/vCD database servers, then vCenter/vCD cells, finally other stuff like Chargeback – whereas HA only gives us three: High, Medium and Low). Also, we wanted to be able to pause at various points in the recovery process to ensure everything is as it should be – that DNS server may look like it’s up, but is the DNS service actually started?

      All in all, we found SRM to be a good fit for the management cluster.

      The resource clusters are a different story, of course, because you can’t use SRM to recover them. We do deploy those in a stretched cluster. But even then the hosts at the recovery site are in maintenance mode, so our workloads for a particular Provider vDC are always at one site. If that site fails, we use custom PowerShell scripts to bring customer workloads up at the recovery site, ordered by priority (vShield Edge devices are brought up first, for example).
