VMware vCloud Director (vCD)

As many of you know, months ago I moved from VMware Professional Services to the VMware Cloud Practice. A major part of our work revolves around VMware vCloud Director (vCD), so you can imagine that I am glad it has finally been released. This lifts the NDA, and as such you can expect a whole bunch of articles about vCD in the near future.

What is it?

VMware vCloud Director is a new abstraction layer. vCD, as I will refer to it from now on, is a layer on top of vCenter and abstracts all the resources vCenter manages. All these resources are combined into large pools for your customers to consume, or should I call them tenants, which seems to be the cool term these days. VMware vCloud Director not only abstracts and pools resources, it also adds a self-service portal. As stated before, it is more or less bolted on top of vCenter/ESX(i). I created a diagram to visualize this a bit more. Please note that this is still a simplistic and high-level overview:

Now, I guess you noticed it says “VMware vCloud Director Cluster”. This cluster is formed by multiple vCD servers or, as we refer to them, “cells”. The cells form vCD and are responsible for the abstraction of the resources and for the portal, amongst other features that will be discussed later.

As stated before, vCD abstracts resources which are managed by vCenter. There are currently three types of resources that can be used by a tenant. Below each resource type I have listed what it maps to at the vSphere layer so that it makes a bit more sense:

  1. Compute
    - clusters and resource pools
  2. Network
    - dvSwitches and/or portgroups
  3. Storage
    - VMFS datastores and NFS shares

These resources will be offered through a self-service portal which is part of vCD. As a vCD Administrator you can use the vCD portal to carve up these resources as required and assign them to a customer or department, often referred to in vCD as an “Organization”. Please note here that vCD is not purely designed for Service Providers; it is also designed for Enterprise environments.

In order to carve up these resources a container will need to be created, and this is what we call a Virtual Datacenter. There are two different types of Virtual Datacenters:

  • Provider Virtual Datacenter (Provider vDC)
  • Organization Virtual Datacenter (Org vDC)

A Provider Virtual Datacenter is the foundation for your Compute Resources. When creating a Provider Virtual Datacenter you will need to select a resource pool; this can also be the root resource pool, aka your vSphere cluster. At the same time you will need to associate a set of datastores with the Provider vDC. Generally speaking this will be all LUNs masked to your cluster. Some of my colleagues described the Provider vDC as the object where you specify the SLA, and I guess that explains the concept a bit more. So for instance you could have a Gold Provider vDC with 15K FC disks and N+2 redundancy for HA while your Silver Provider vDC just offers N+1 redundancy and runs on SATA disks… everything is possible.

After you have created a Provider vDC you can create an Org vDC and tie that Org vDC to a vCD Organization. Please note that an Organization can have multiple Org vDCs associated with it. I depicted this in the following diagram: there are three Org vDCs owned by a single Organization across two Provider vDCs. The two Provider vDCs each have a specific SLA.
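To make the relationships a bit more tangible, here is a minimal sketch of the object model in Python. This is not the vCD API; the class names, field names and example values (cluster names, LUN names, "ExampleOrg") are all hypothetical and only mirror the concepts described above.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ProviderVDC:
    """Provider vDC: a resource pool (or whole cluster) plus its datastores."""
    name: str
    resource_pool: str
    datastores: List[str]
    sla: str  # hypothetical label, e.g. "Gold" or "Silver"


@dataclass
class OrgVDC:
    """Org vDC: carved out of exactly one Provider vDC."""
    name: str
    provider: ProviderVDC


@dataclass
class Organization:
    """A tenant; it may own any number of Org vDCs."""
    name: str
    org_vdcs: List[OrgVDC] = field(default_factory=list)

    def slas(self) -> Set[str]:
        # The SLAs available to this Organization follow from its Provider vDCs.
        return {vdc.provider.sla for vdc in self.org_vdcs}


# Two Provider vDCs, each with its own SLA (all names are made up).
gold = ProviderVDC("pvdc-gold", "Cluster-A", ["fc-lun-01", "fc-lun-02"], "Gold")
silver = ProviderVDC("pvdc-silver", "Cluster-B", ["sata-lun-01"], "Silver")

# One Organization owning three Org vDCs across the two Provider vDCs.
org = Organization("ExampleOrg")
org.org_vdcs += [OrgVDC("prod", gold), OrgVDC("test", silver), OrgVDC("dev", silver)]
```

Calling `org.slas()` here returns both SLA labels, which is the whole point: the tenant picks the service level simply by choosing which Org vDC to deploy into.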

So what can I do with these Org vDCs? Simply said: consume them. You can create vApps, and a vApp is just a logical container for 1 or more virtual machines. This vApp could for instance contain a three tiered app which has an internal network and a firewalled outbound connection for a single VM, which would look something like this:

Of course there are various ways to create a network, but that is way too complex for an introductory article. I will, however, go into more detail around all the cool networking functionality that is offered in one of the following articles.

As you can see, there is a lot possible with vCD; I guess too much to describe in a single article.

To summarize, vCD offers a self service portal. This portal enables you to provision resources to a tenant and enables the tenant to consume these resources by creating vApps. vApps are a container for one or multiple virtual machines and can contain isolated networks. As said, there is a lot more to vCD but I feel that all of you should play around with it a bit more before I dive into some of the specifics. (For those at VMworld, LAB 13!)

As you can imagine I am really excited about this release, and am absolutely thrilled that I can finally talk about this. There are more blog articles coming up, but I just want the dust to settle a bit first so everyone can see those clouds!

More background/download links can be found here:

Release Notes:
http://www.vmware.com/support/vcd/doc/rel_notes_vcloud_director_10.html

Download Landing Page:
http://downloads.vmware.com/d/info/datacenter_downloads/vmware_vcloud_director/1_0

Documentation Landing:
http://www.vmware.com/support/pubs/vcd_pubs.html

Product site:
http://www.vmware.com/products/vcloud-director/

Eval Guide:
http://www.vmware.com/files/pdf/techpaper/VMW-vCloud-Director-EvalGuide.pdf

Screenshots:
http://www.vmware.com/products/vcloud-director/screens.html

VMworld 2010: Labs are the place to be!

As some of you might know, I am not only doing a session at VMworld 2010 but I am also a Lab Captain. We have been working really hard over the last couple of months to get the labs up and running for you guys.

Over the last three days it has been chaos here at VMworld. Setting up, testing and stress testing labs and of course some last minute changes to make sure all of you guys have a great experience.

I must say, looking at the lab environment it has been worth it. We are not talking about a couple of labs here. No, we are talking 480 seats and about 30 different labs ranging from “VMware ESXi Remote Management Utilities” to “Intro to Zimbra Collaboration Suite” and even products which will be formally announced tomorrow.

I took a couple of pictures this morning of the labs just to get you guys as excited as we are:

We all hope you will enjoy the Labs at VMworld 2010, but looking at the content and the set-up I am confident you will! Enjoy,

Soon in a book store near you! HA and DRS Deepdive

Over the last couple of months Frank Denneman and I have been working really hard on a secret project. Although we have spoken about it a couple of times on twitter the topic was never revealed.

Months ago I was thinking about what a good topic would be for my next book. As I already wrote a lot of articles on HA it made sense to combine these and do a full deepdive on HA. However a VMware Cluster is not just HA. When you configure a cluster there is something else that usually is enabled and that is DRS. As Frank is the Subject Matter Expert on Resource Management / DRS it made sense to ask Frank if he was up for it or not… Needless to say that Frank was excited about this opportunity and that was when our new project was born: VMware vSphere 4.1 – HA and DRS deepdive.

As both Frank and I are VMware employees we contacted our management to see what the options were for releasing this information to market. We are very excited that we have been given the opportunity to be the first official publication as part of a brand new VMware initiative, codenamed Rome. The idea behind Rome along with pertinent details will be announced later this year.

Our book is currently going through the final review/editing stages. For those wondering what to expect, a sample chapter can be found here. The primary audience for the book is anyone interested in high availability and clustering. There is no prerequisite knowledge needed to read the book. The book will consist of roughly 220 pages with all the detail you want on HA and DRS. It will not be a “how to” guide; instead it will explain the concepts and mechanisms behind HA and DRS like Primary Nodes, Admission Control Policies, Host Affinity Rules and Resource Pools. On top of that, we will include basic design principles to support the decisions that will need to be made when configuring HA and DRS or when designing a vSphere infrastructure.

I guess it is unnecessary to say that both Frank and I are very excited about the book. We hope that you will enjoy reading it as much as we did writing it. Stay tuned for more info, the official book title and url to order the book. We hope to be able to give you an update soon.

Frank and Duncan

Two new HA Advanced Settings

  • das.perHostConcurrentFailoversLimit
    When multiple VMs are restarted on one host, up to 32 VMs will be powered on concurrently by default. This is to avoid resource contention on the host. This limit can be changed through the HA advanced option: das.perHostConcurrentFailoversLimit. Setting a larger value will allow more VMs to be restarted concurrently and might reduce the overall VM recovery time, but the average latency to recover individual VMs might increase. We recommend using the default value.
  • das.sensorPollingFreq
    The das.sensorPollingFreq option controls the HA polling interval. HA polls the system periodically to update the cluster state with such information as how many VMs are powered on, and so on. The polling interval was 1 second in vSphere 4.0. A smaller value leads to faster VM power on, and a larger value leads to better scalability if a lot of concurrent power operations need to be performed in a large cluster. The default is 10 seconds in vSphere 4.1, and it can be set to a value between 1 and 30 seconds.

I want to note that I would not recommend changing these. There is a very good reason the defaults have been selected. Changing these can lead to instability, however when troubleshooting they might come in handy.
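If you do end up touching these while troubleshooting, a small sanity check can help. The sketch below is a hypothetical helper, not part of any VMware tool; it only encodes the defaults and the 1 to 30 second range quoted above.

```python
# Defaults for vSphere 4.1 as quoted above; the helper itself is hypothetical.
HA_DEFAULTS = {
    "das.perHostConcurrentFailoversLimit": 32,
    "das.sensorPollingFreq": 10,
}


def check_ha_option(name, value):
    """Return a list of warnings for a proposed HA advanced option value."""
    if name not in HA_DEFAULTS:
        return [f"{name}: not one of the two options discussed here"]
    warnings = []
    # Only the polling frequency has a documented range in the text above.
    if name == "das.sensorPollingFreq" and not 1 <= value <= 30:
        warnings.append(f"{name}: {value} is outside the 1-30 second range")
    if value != HA_DEFAULTS[name]:
        warnings.append(
            f"{name}: {value} differs from the default {HA_DEFAULTS[name]}"
        )
    return warnings


for warning in check_ha_option("das.sensorPollingFreq", 45):
    print(warning)
```

A value equal to the default returns an empty list, which is exactly the state I would recommend staying in.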

HA and a Metrocluster

I was reading an excellent article on NetApp metroclusters and VM-Host affinity rules by Larry Touchette the other day. That article is based on the tech report TR-3788, which covers the full solution but did not include the 4.1 enhancements.

The main focus of the article is on VM-Host Affinity Rules. Great stuff, and it will “ensure” you keep your IO local. As explained, when a Fabric Metrocluster is used, the increased latency when going across, for instance, 80 km of fibre will be substantial. By using VM-Host Affinity Rules, where a group of VMs is linked to a group of hosts, this “overhead” can be avoided.

Now, the question of course is: what about HA? The example NetApp provided shows 4 hosts. With only four hosts we all know, hopefully at least, that all of these hosts will be primary. So even if a set of hosts fails, one of the remaining hosts will be able to take over the failover coordinator role and restart the VMs. With a cluster of up to 8 hosts that is still very much true: with a maximum of 5 primaries and 4 hosts on each side, at least a single primary will exist in each site.

But what about more than 8 hosts? What will happen when the link between sites fails? How do I ensure each of the sites has primaries left to restart VMs if needed?
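The reasoning about primaries per site boils down to a worst-case count, which can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes a two-site cluster where primaries can land on any host (so no preferred primaries have been specified).

```python
MAX_PRIMARIES = 5  # maximum number of HA primary nodes, as stated above


def every_site_has_primary(hosts_site_a: int, hosts_site_b: int) -> bool:
    """True if both sites hold at least one primary, however they are placed."""
    primaries = min(MAX_PRIMARIES, hosts_site_a + hosts_site_b)
    # Worst case: as many primaries as possible end up in a single site.
    # The other site is only guaranteed one if that single site cannot
    # absorb all of them.
    return primaries > hosts_site_a and primaries > hosts_site_b


for a, b in [(2, 2), (4, 4), (5, 5), (4, 6)]:
    print(f"{a}+{b} hosts: guaranteed = {every_site_has_primary(a, b)}")
```

With 2+2 or 4+4 hosts the guarantee holds; as soon as one site has 5 or more hosts, all primaries could theoretically end up on that side.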

Take a look at the following diagram I created to visualize all of this:

We have two datacenters here, Datacenter A and B. Both have their own FAS with two shelves and their own set of VMs which run on that FAS. Although storage will be mirrored there is still only one real active copy of the datastore. In this case VM-Host Affinity rules have been created to keep the VMs local in order to avoid IO going across the wire. This is very much similar to what NetApp described.

However in my case there are 5 hosts in total which are a darker color green. These hosts were specified as the preferred primary nodes. This means that each site will have at least 2 primary nodes.

Let’s assume the link between Datacenter A and B dies. Some might assume that this will trigger an HA Isolation Response, but it actually will not.

The reason for this is that an HA primary node still exists in each site. Isolation Response is only triggered when no heartbeats are received. As a primary node sends a heartbeat to both the primary and secondary nodes, a heartbeat will always be received. Again, as I can’t emphasize this enough: an Isolation Response will not be triggered.

However, if the link between these datacenters dies, it will appear to Datacenter A as if Datacenter B is unreachable, and one of the primaries in Datacenter A will initiate restart tasks for the allegedly impacted VMs, and vice versa. However, as the Isolation Response has not been triggered, a lock on the VMDK will still exist and it will be impossible to restart the VMs.

These VMs will remain running within their site. Although it might appear on both ends that the other datacenter has died, HA is “smart” enough to detect it hasn’t, and it will be up to you to decide if you want to fail over those VMs or not.

I am just so excited about these developments, that I can’t get enough of it. Although the “das.preferredprimaries” setting is not supported as of writing, I thought this was cool enough to share it with you guys. I also want to point out that in the diagram I show 2 isolation addresses, this of course is only needed when a gateway is specified which is not accessible at both ends when the network connection between sites is dead. If the gateway is accessible at both sites even in case of a network failure only 1 isolation address, which can be the default gateway, is required.

Layer 2 Adjacency for vMotion (vmkernel)

Recently I had a discussion around Layer 2 adjacency for the vMotion (vmkernel interface) network, meaning that all vMotion interfaces, aka vmkernel interfaces, are required to be on the same subnet, as otherwise vMotion would not function correctly.

Now I remember when this used to be part of the VMware documentation, but that requirement is nowhere to be found today. I even have a memory of the documentation for previous versions stating that it was “recommended” to have layer-2 adjacency, but even that is nowhere to be found. The only reference I could find was an article by Scott Lowe where Paul Pindell from F5 chips in and debunks the myth, but as Paul is not a VMware spokesperson it is not definitive in my opinion. Scott also just published a rectification of his article after we discussed this myth a couple of times over the last week.

So what are the current Networking Requirements around vMotion according to VMware’s documentation?

  • On each host, configure a VMkernel port group for vMotion
  • Ensure that virtual machines have access to the same subnets on source and destination hosts
  • Ensure that the network labels used for virtual machine port groups are consistent across hosts

Now that got me thinking, why would it even be a requirement? As far as I know vMotion is all layer three today, and besides that the vmkernel interface even has the option to specify a gateway. On top of that vMotion does not check if the source vmkernel interface is on the same subnet as the destination interface, so why would we care?
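vMotion may not check it, but nothing stops you from auditing it yourself if you want to stick to the best practice. Below is a minimal sketch using Python's standard `ipaddress` module; the function name and the addresses are made up for illustration.

```python
import ipaddress


def same_subnet(source_if: str, dest_if: str) -> bool:
    """True if both interfaces (given as 'address/prefix') share a network."""
    src = ipaddress.ip_interface(source_if)
    dst = ipaddress.ip_interface(dest_if)
    return src.network == dst.network


# Hypothetical vmkernel interfaces on two hosts.
print(same_subnet("10.0.1.11/24", "10.0.1.12/24"))  # layer-2 adjacent
print(same_subnet("10.0.1.11/24", "10.0.2.12/24"))  # routed, needs a gateway
```

The first pair shares the 10.0.1.0/24 network; the second pair does not, so traffic between them would have to be routed.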

Now that makes me wonder where this myth is coming from… Have we all assumed L2 adjacency was a requirement? Have the requirements changed over time? Has the best practice changed?

Well, one of those is easy to answer: no, the best practice hasn’t changed. Minimizing the number of hops to reduce latency is, and always will be, a best practice. Will vMotion work when your vmkernel interfaces are in two different subnets? Yes, it will. Is it supported? No, it is not, as it has not explicitly gone through VMware’s QA process. However, I have had several discussions with engineering and they promised me a more conclusive statement will be added to our documentation and the KB in order to avoid any misunderstanding.

Hopefully this will debunk, once and for all, this myth that has been floating around for long enough. As stated, it will work; it just hasn’t gone through QA and as such cannot be supported by VMware at this point in time. I am confident though that over time this statement will change to increase flexibility.

Memory states

I was just browsing the vsinodes/procnodes. I noticed the following:

Free memory state thresholds {
soft:64 pct
hard:32 pct
low:16 pct
}

As explained in Frank’s excellent article on memory reservations, ESX/ESXi uses memory states to determine what type of memory reclamation technique to use. Techniques that can be used are TPS, ballooning and swapping. Of course you will always want to avoid ballooning and swapping but that is not the point here.  The point is that as far as I am aware the thresholds for those states have always been:

  • High – 6%
  • Soft – 4%
  • Hard – 2%
  • Low – 1%

This is also what our documentation states. Now if you do the math you will notice that 64% of 6% is roughly 4% (3.84% to be exact), 32% of 6% is roughly 2%, and so on. Although it doesn’t seem to be substantial, it is something I wanted to document, just for completeness’ sake.
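For those who want to verify the math, here is a quick sketch. It assumes, as I do above, that the soft/hard/low percentages in the output are relative to the 6% “high” threshold; the variable names are just for illustration.

```python
HIGH_PCT = 6.0  # documented "high" threshold, as a percentage of memory

# Relative thresholds as shown in the "Free memory state thresholds" output.
relative = {"soft": 64, "hard": 32, "low": 16}

# Convert each relative threshold into a percentage of total memory.
absolute = {state: HIGH_PCT * pct / 100 for state, pct in relative.items()}

for state, pct in absolute.items():
    print(f"{state}: {pct:.2f}%")
```

This gives 3.84%, 1.92% and 0.96%, which round to the documented 4%, 2% and 1%.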
