
Yellow Bricks

by Duncan Epping


VMware vCloud Director Infrastructure Resiliency Case Study paper published!

Duncan Epping · Mar 1, 2012 ·

Yesterday the paper that Chris Colotti and I were working on, titled “VMware vCloud Director Infrastructure Resiliency Case Study”, was finally published. This white paper is an expansion of the blog post I published a couple of weeks back.

Someone asked me at PEX where this solution came from all of a sudden. Well, this is based on a solution I came up with on a random Friday morning halfway through December, when I woke up at 05:00 in Palo Alto still jet-lagged. I diagrammed it on a napkin and started scribbling things down in Evernote. I explained the concept to Chris over breakfast and that is how it started. Over the last two months Chris (+ his team) and I validated the solution and this is the outcome. I want to thank Chris and his team for their hard work and dedication.

I hope that those architecting / implementing DR solutions for vCloud environments will benefit from this white paper. If there are any questions feel free to leave a comment.

Source – VMware vCloud Director Infrastructure Resiliency Case Study

Description: vCloud Director disaster recovery can be achieved through various scenarios and configurations. This case study focuses on a single scenario as a simple explanation of the concept, which can then easily be adapted and applied to other scenarios. In this case study it is shown how vSphere 5.0, vCloud Director 1.5 and Site Recovery Manager 5.0 can be implemented to enable recoverability after a disaster.

Download:
http://www.vmware.com/files/pdf/techpaper/vcloud-director-infrastructure-resiliency.pdf
http://www.vmware.com/files/pdf/techpaper/vcloud-director-infrastructure-resiliency.epub
http://www.vmware.com/files/pdf/techpaper/vcloud-director-infrastructure-resiliency.mobi

I am expecting that the MOBI and EPUB versions will also be available soon. When they are I will let you know!

Cool tool: vBenchmark fling

Duncan Epping · Feb 29, 2012 ·

Today I decided to start testing the vBenchmark fling. It sounded like a cool tool so I installed it in my lab. You can find the fling here for those wanting to test it themselves. So what does the tool do? The VMware Labs website summarizes it well:

Have you ever wondered how to quantify the benefits of virtualization to your management? If so, please consider using vBenchmark. vBenchmark measures the performance of a VMware virtualized infrastructure across three categories:

  • Efficiency: for example, how much physical RAM are you saving by using virtualization?
  • Operational Agility: for example, how much time do you take on average to provision a VM?
  • Quality of Service: for example, how much downtime do you avoid by using availability features?

vBenchmark provides a succinct set of metrics in these categories for your VMware virtualized private cloud. Additionally, if you choose to contribute your metrics to the community repository, vBenchmark also allows you to compare your metrics against those of comparable companies in your peer group. The data you submit is anonymized and encrypted for secure transmission.

The appliance can be deployed in a fairly simple way:

  • Download OVA –> unzip
  • Open vCenter client –> File –> Deploy OVF Template
  • Select the vBenchmark OVA as a source
  • Give it a name, I used the default (vBenchmark)
  • Select a resource pool
  • Select a datastore or datastore cluster
  • Select the disk format
  • Select the appropriate (dv)portgroup
  • Fill out the network details
  • Finish

After it has been deployed you can power it on. Once it is powered on, check the Summary tab and note the IP address (for those using DHCP). You can access the web interface at “http://<ip-address>:8080/”.
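
If you want to script a quick check that the appliance web interface is actually up before opening a browser, a few lines of Python will do; the IP address below is just a placeholder for whatever address your appliance received.

    import urllib.request

    # Placeholder address: use the IP shown on the appliance's Summary tab.
    url = "http://192.168.1.50:8080/"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(f"vBenchmark web interface reachable, HTTP {resp.status}")
    except OSError as exc:
        print(f"Appliance not reachable yet: {exc}")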

Now you will see a config screen. You can simply enter the details of the vCenter Server of the vSphere environment you want to “analyze” and hit “Initiate Query & Proceed to Dashboard”.

Now comes the cool part. vBenchmark will analyze your environment and provide you with a nice, clean-looking dashboard… but that is not it. You can decide to upload your dataset to VMware and compare it with “peers”. I tried it and noticed there wasn’t enough data for the peer group I selected, so I decided to select “All / All” to make sure I saw something.

I can understand that many of you don’t want to send data to an “unknown” destination. The good thing is though that you can inspect what is being sent. Before you configure the upload just hit “Preview all data to be sent” and you will get a CSV file of the data set. This data is transported over SSL, just in case you were wondering.
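
As a rough sketch, you could also poke at that exported CSV with a couple of lines of Python before deciding to enable the upload; the file name below is simply whatever you saved the preview as.

    import csv

    # "Preview all data to be sent" export, saved locally (file name is an example).
    with open("vbenchmark_preview.csv", newline="") as f:
        rows = list(csv.reader(f))

    header, data = rows[0], rows[1:]
    print(f"{len(data)} rows, columns: {header}")

    # Show which metrics would actually leave your environment (non-empty columns).
    for column, values in zip(header, zip(*data)):
        filled = sum(1 for v in values if v.strip())
        print(f"{column}: {filled} non-empty values")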

I am going to leave this one running for a while and am looking forward to seeing what the averages of my peers are. I am also wondering what this tool will evolve into.

One thing that stood out from the “peer results” is the amount of storage per VM: 116.40GB. That did surprise me as I would have estimated this to be around 65GB. Anyway, download it and try it out. It is worth it.

Resource pool shares don’t make sense with vCloud Director?

Duncan Epping · Feb 28, 2012 ·

I’ve had multiple discussions around resource pool level shares in vCloud Director over the last 2 years, so I figured I would write an article about it. It is a lot easier to point people to an article instead, and it also allows me to gather feedback on this topic. If you feel I am completely off, please comment… I am going to quote a question which was raised recently.

One aspect of “noisy neighbor” that seems to never be discussed within vCloud is the allocation of shares. An organization with a single VM has better CPU resource access per VM than an organization that has 100 VMs. The organization resource pools have an equal number of shares, so each VM gets a smaller and smaller allocation of shares as the VM count in an organization virtual data center increases.

Before I explain the rationale behind the design decision around shares behavior in a vCloud environment it is important to understand some of the basics. An Org vDC is nothing more than a resource pool. The chosen “allocation model” for your Org vDC and the specified characteristics determine what your resource pool will look like. I wrote a fairly lengthy article about it a while back; if you don’t understand allocation models take a look at it.

When an Org vDC is created, a resource pool is created at the vSphere layer, and it will typically have the following characteristics. In this example I will use the “Allocation Pool” allocation model as it is the most commonly used:

Org vDC Characteristics –> Resource Pool Characteristics

  • Total amount of resources –> Limit set to Y
  • Percentage of resources guaranteed –> Reservation set to X

On top of that each resource pool has a fixed number of shares. The difference between the limit and the reservation is often referred to as the “burst space”. Typically each VM will also have a reservation set. If 80% of your memory resources are guaranteed, this will result in an 80% memory reservation on your VMs as well. This means that you can keep deploying new VMs into that resource pool until the limit is reached. In other words:

10GHz/10GB allocation pool Org vDC with 80% guaranteed resources = a resource pool with a 10GHz/10GB limit and an 8GHz/8GB reservation. In this pool you can create VMs until you hit those limits. Resources are guaranteed up to 8GHz/8GB!

Now what about those shares? The question is: will the Org vDC with 100 VMs have less resource access than the Org vDC with only 10 VMs? Let’s use that previous example again:

10GHz/10GB allocation pool with 80% of resources guaranteed. This results in a resource pool with a 10GHz/10GB limit and an 8GHz/8GB reservation.

Two Org VDCs are deployed, and each has the exact same characteristics. In “Org VDC – 1” 10 VMs were provisioned, while in “Org VDC – 2” 100 VMs are provisioned. It should be pointed out that the provider charges these customers for their Org VDC. As both decided to have 8GHz/8GB guaranteed, that is what they will pay for, and when they exceed that “guarantee” they will be charged for it on top of that. They are both capped at 10GHz/10GB however.

If there is contention then shares come into play. But when is that exactly? Well, after the 8GHz/8GB of reserved resources has been used. So in that case the Org VDCs will be fighting over:

limit - reservation

In this scenario that is “10GHz/10GB – 8GHz/8GB = 2GHz/2GB”. Is Org VDC 2 entitled to more resource access than Org VDC 1? No it is not. Let me repeat that: NO, Org VDC 2 is not entitled to more resources.
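
To make the arithmetic explicit, here is a minimal sketch of the example above: two identical Org vDCs with equal resource pool shares each get an equal claim on the burst space under contention, regardless of how many VMs they run.

    # Allocation Pool example: 10GHz/10GB Org vDC with 80% of resources guaranteed.
    limit_ghz, limit_gb = 10.0, 10.0
    guarantee = 0.80

    reservation_ghz = limit_ghz * guarantee   # 8 GHz reserved
    reservation_gb = limit_gb * guarantee     # 8 GB reserved
    burst_ghz = limit_ghz - reservation_ghz   # 2 GHz of burst space
    burst_gb = limit_gb - reservation_gb      # 2 GB of burst space

    # Two identical Org vDCs (equal shares) fighting over the burst space:
    # each is entitled to half, whether it runs 10 VMs or 100 VMs.
    for org in ("Org VDC - 1 (10 VMs)", "Org VDC - 2 (100 VMs)"):
        print(f"{org}: {burst_ghz / 2:.1f} GHz / {burst_gb / 2:.1f} GB of burst space")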

Both Org VDC 1 and Org VDC 2 bought the exact same amount of resources. The only difference is that Org VDC 2 chose to deploy more VMs. Does that mean Org VDC 1’s VMs should receive less access to these resources just because it has fewer VMs? No, they should not have less access! A provider cannot, in any shape or form, decide which Org VDC is entitled to more resources in that burst space, especially not based on the number of VMs deployed, as this gives absolutely no indication of the importance of these workloads. Org VDC 2 should buy more resources to ensure their VMs get what they are demanding.

Org VDC 1 cannot suffer because Org VDC 2 decided to overcommit. Both are paying for an equal slice of the pie… and it is up to themselves to determine how to carve that slice up. If they notice their slice of the pie is not big enough, they should buy a bigger or an extra slice!

However, there is a scenario where shares can cause a “problem”… If you use “Pay As You Go”, remove all “guarantees” (reservations), and there is contention, then each resource pool will get the same access to the resources. If you have resource pools (Org VDCs) with 500 VMs and resource pools with 10 VMs this could indeed lead to a problem for the larger resource pools. Keep in mind that there’s a reason these “guarantees” were introduced in the first place, and overcommitting to the point where resources are completely depleted is most definitely not a best practice.
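
For completeness, here is a small sketch of the per-VM dilution the quoted question describes, assuming “Pay As You Go” with no reservations and a made-up, identical share value on every Org vDC resource pool; within each pool the same number of shares is simply spread across more and more VMs.

    # Hypothetical identical share value per Org vDC resource pool.
    rp_shares = 4000

    for vm_count in (10, 500):
        per_vm = rp_shares / vm_count
        print(f"Org vDC with {vm_count:3d} VMs -> roughly {per_vm:.0f} shares worth of access per VM under contention")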

vCloud Director infrastructure resiliency solution

Duncan Epping · Feb 13, 2012 ·

By Chris Colotti (Consulting Architect, Center Of Excellence) and Duncan Epping (Principal Architect, Technical Marketing)

This article assumes the reader has knowledge of vCloud Director, Site Recovery Manager and vSphere. It will not go into depth on some topics; for more in-depth details around some of the concepts we refer to the Site Recovery Manager, vCloud Director and vSphere documentation.

Creating DR solutions for vCloud Director poses multiple challenges. These challenges all have a common theme: the automatic creation of objects by VMware vCloud Director, such as resource pools, virtual machines, folders, and portgroups. vCloud Director and vCenter Server both rely heavily on managed object reference identifiers (MoRef IDs) for these objects. Any unplanned changes to these identifiers could, and often will, result in loss of functionality, as Chris has described in this article. vSphere Site Recovery Manager currently does not support protection of virtual machines managed by vCloud Director for these exact reasons.

The vCloud Director and vCenter objects, referenced by both products, that have been identified as causing problems when their identifiers change are:

  • Folders
  • Virtual machines
  • Resource Pools
  • Portgroups

Besides automatically created objects, the following pre-created static objects are also often used and referenced by vCloud Director.

  • Clusters
  • Datastores

Over the last few months we have worked on and validated a solution which avoids changes to any of these objects. This solution simplifies the recovery of a vCloud infrastructure and increases management infrastructure resiliency. The amazing thing is it can be implemented today with current products.

In this blog post we will give an overview of the developed solution and the basic concepts. For more details, implementation guidance or info about possible automation points we recommend contacting your VMware representative and engaging VMware Professional Services.

Logical Architecture Overview

vCloud Director infrastructure resiliency can be achieved through various scenarios and configurations. This blog post is focused on a single scenario to allow for a simple explanation of the concept. A white paper explaining some of the basic concepts is also currently being developed and will be released soon. The concept can easily be adapted for other scenarios; however, you should inquire first to ensure supportability. This scenario uses a so-called “Active / Standby” approach where hosts in the recovery site are not in use for regular workloads.

In order to ensure all management components are restarted in the correct order, and in the least amount of time, vSphere Site Recovery Manager will be used to orchestrate the fail-over. As of writing, vSphere Site Recovery Manager does not support the protection of VMware vCloud Director workloads. Due to this limitation these will be failed over through several manual steps. All of these steps can be automated using tools like vSphere PowerCLI or vCenter Orchestrator.

The following diagram depicts a logical overview of the management clusters for both the protected and the recovery site.

In this scenario Site Recovery Manager will be leveraged to fail over all vCloud Director management components. In each of the sites it is required to have a management vCenter Server and an SRM Server, which aligns with standard SRM design concepts.

Since SRM cannot be used for vCloud Director workloads there is no requirement to have an SRM environment connecting to the vCloud resource cluster’s vCenter Server. In order to facilitate a fail-over of the VMware vCloud Director workloads a standard disaster recovery concept is used. This concept leverages common replication technology and vSphere features to allow for a fail-over. This will be described below.

The below diagram depicts the VMware vCloud Director infrastructure architecture used for this case study.

Both the Protected and the Recovery Site have a management cluster. Each of these contains a vCenter Server and an SRM Server. These are used to facilitate the disaster recovery procedures. The vCloud Director management virtual machines are protected by SRM. Within SRM a protection group and recovery plan will be created to allow for a fail-over to the Recovery Site.

Please note that storage is not stretched in this environment and that hosts in the Recovery Site are unable to see storage in the Protected Site, and as such are unable to run vCloud Director workloads in a normal situation. It is also important to note that the hosts are attached to the cluster’s DVSwitch to allow for quick access to the vCloud configured port groups and are pre-prepared by vCloud Director.

These hosts are depicted as hosts placed in maintenance mode. They can also be stand-alone hosts that are added to the vCloud Director resource cluster during the fail-over. For simplification and visualization purposes this scenario describes the situation where the hosts are part of the cluster and placed in maintenance mode.

Storage replication technology is used to replicate LUNs from the Protected Site to the Recovery Site. This can be done using asynchronous or synchronous replication; typically this depends on the Recovery Point Objective (RPO) determined in the service level agreement (SLA) as well as the distance between the two sites. In our scenario synchronous replication was used.

Fail-over Procedure

In this section the basic steps required for a successful fail-over of a VMware vCloud Director environment are described. These steps are pertinent to the described scenario.

It is essential that each component of the vCloud Director management stack be booted in the correct order. The order in which the components should be restarted is configured in an SRM recovery plan and can be initiated by SRM with a single button. The following order was used to power-on the vCloud Director management virtual machines:

  1. Database Server (providing vCloud Director, vCenter Server, vCenter Orchestrator, and Chargeback Databases)
  2. vCenter Server
  3. vShield Manager
  4. vCenter Chargeback (if in use)
  5. vCenter Orchestrator (if in use)
  6. vCloud Director Cell 1
  7. vCloud Director Cell 2

When the fail-over of the vCloud Director management virtual machines in the management cluster has succeeded, multiple steps are required to recover the vCloud Director workloads. These are described in a manual fashion but can be automated using PowerCLI or vSphere Orchestrator; a rough scripted sketch of two of these steps follows the list below.

  1. Validate all vCloud Director management virtual machines are powered on
  2. Using your storage management utility break replication for the datastores connected to the vCloud Director resource cluster and make the datastores read/write (if required by storage platform)
  3. Mask the datastores to the recovery site (if required by storage platform)
  4. Using ESXi command line tools, mount the volumes of the vCloud Director resource cluster on each host in the cluster
    • esxcfg-volume -m <volume ID>
  5. Using vCenter Server, rescan the storage and validate all volumes are available
  6. Take the hosts out of maintenance mode for the vCloud Director resource cluster (or add the hosts to your cluster, depending on the chosen strategy)
  7. In our tests the virtual machines were automatically powered on by vSphere HA. vSphere HA is aware of the situation before the fail-over and will power on the virtual machines according to the last known state
    • Alternatively, virtual machines can be powered on manually, leveraging the vCloud API so they are booted in the correct order as defined in their vApp metadata. It should be noted that this could possibly result in vApps being powered on which were powered off before the fail-over, as there is currently no way of determining their state.
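
To give an idea of what such automation could look like, below is a rough pyVmomi sketch of steps 5 and 6: rescan storage on every host of the vCloud resource cluster and take the hosts out of maintenance mode. The vCenter address, credentials and the commented-out mount call are placeholders for illustration only; this is not the validated procedure from the case study.

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    # Example connection details for the resource cluster's vCenter Server (placeholders).
    ctx = ssl._create_unverified_context()
    si = SmartConnect(host="resource-vcenter.example.local",
                      user="administrator@vsphere.local",
                      pwd="password", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        hosts = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.HostSystem], True).view
        for host in hosts:
            storage = host.configManager.storageSystem
            storage.RescanAllHba()   # pick up the newly presented replicated LUNs
            storage.RescanVmfs()     # discover the VMFS volumes on those LUNs
            # storage.MountVmfsVolume("<VMFS UUID>")  # comparable to esxcfg-volume -m
            if host.runtime.inMaintenanceMode:
                WaitForTask(host.ExitMaintenanceMode_Task(timeout=0))
    finally:
        Disconnect(si)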

Using this vCloud Director infrastructure resiliency concept, a fail-over of a vCloud Director environment has been successfully completed and the “cloud” moved from one site to another.

As all vCloud Director management components are virtualized, the virtual machines are moved over to the Recovery Site while maintaining all current managed object reference identifiers (MoRef IDs). Re-signaturing the datastores (giving them a new unique ID) has also been avoided to ensure the relationship between the virtual machines / vApps within vCloud Director and the datastores remained intact.

Is that cool and simple or what? For those wondering, although we have not specifically validated it, yes this solution/concept would also apply to VMware View. Yes it would also work with NFS if you follow my guidance in this article about using a CNAME to mount the NFS datastore.

 

Fling: Auto Deploy GUI

Duncan Epping · Feb 9, 2012 ·

Many of you probably know the PXE Manager fling which Max Daneri created… Max has been working on something really cool, a brand new fling: Auto Deploy GUI! I had the pleasure of test driving the GUI and providing early feedback to Max when he had just started working on it, and since then it has come a long way! It is a great and useful tool which I hope will at some point be part of vCenter. Once again, great work Max! I suggest that all of you check out this excellent fling and provide Max with feedback so that he can continue to develop and improve it.

The Auto Deploy GUI fling is an 8MB download and allows you to configure Auto Deploy without the need to use PowerCLI. It comes with a practical deployment guide which is easy to follow and should allow all of you to test this in your labs! Download it now and get started!

source
The Auto Deploy GUI is a vSphere plug-in for the VMware vSphere Auto Deploy component. The GUI plug-in allows a user to easily manage the setup and deployment requirements in a stateless environment managed by Auto Deploy. Some of the features provided through the GUI include the ability to add/remove Depots, list/create/modify Image Profiles, list VIB details, create/modify rules to map hosts to Image Profiles, check compliance of hosts against these rules and remediate hosts.


