Want to do some VMware Hands-on Labs but don’t have the kit?

Those who have visited VMworld and done a couple of labs know how awesome these are. Recently VMware announced that the VMware Hands-On Labs (HoL) would be made available online. You had the option to register for the beta and last week they announced the public beta was opened! On twitter some early reports are already popping up and judging by the comments people are loving it.

 

 

 

If you want to know more about the HoL portal make sure to read this blog, I played around with it various times… I loved it. Sign up and try it out. There is no excuse any longer for not having hands-on experience with VMware products, you don’t need to own a lab…

vSphere 5.1 Clustering Deepdive only $17.95, limited time!

Frank and I decided to put the vSphere 5.1 Clustering Deepdive (paper copy) up for sale for only $ 17.95. This is a limited time offer (21st of December), so if you want to get yourself, your friend-husband-father-kids or even grandmother a nice present be quick.

How’s that for a black-friday, cyber-monday, christmas / sinterklaas special? Yes indeed, the paper copy is cheaper than all e-books on vSphere 5.x out there on Amazon and with 5 stars (11 reviews) you know you can’t go wrong.

Happy holidays,

Frank and Duncan
(ps: the kindle copy is only 7.49, so even combined it is cheaper than most e-books out there :-) )

Thinking about a stretched vCloud Director deployment

Lately I have been thinking about what it would take to deploy a stretched vCloud Director (vCD) infrastructure. “The problem” with a vCloud Director infrastructure is that there are so many moving components, this makes it difficult to figure out how to protect each component. Let me point out that I do not have all the definitive answers to this yet, I am writing this article to get a better understanding of the problem myself. If you do not agree with my reasoning please feel free to comment, as I need YOUR help defining the recommended practices around vCD on a stretched infrastructure.

I listed the components I used in my lab:

  • vCenter Server Management
  • vCenter Server Cloud Resources
  • vCloud Director Cells
  • vShield Manager
  • Database Server

That would be 5 moving components, but in reality we are talking more around 8. The thing here is that vCenter Server also has multiple components:

  • Single Sign On
  • Inventory Service
  • Web Client
  • vCenter Server

How do I protect these 8 components? The first 5 listed will be individual VMs and vCloud Director itself will be multiple cells even. What would this look like?

As you can see there are multiple vCenter Servers, one manages the Management Cluster and its components. While the other manages the “Cloud Resource Cluster”. Lets start listing all the components and discuss what the options are and if we can protect them in a special way or not.

vCenter Server (cloud resources and management)

vCenter Server can be protected through various methods. There is vCenter Heartbeat and of course we have vSphere HA (including VM Monitoring). First of all it is key to realize that neither of these solutions are fully “non-disruptive”. Both vSphere HA and vCenter Heartbeat will cause a slight disruption. vSphere HA will simply restart your VM when a host has failed, and vSphere HA – VM Monitoring can restart the Guest OS when the VM has failed. vCenter Heartbeat is a more intelligent solution, it can detect outages using a heartbeat mechanism and respond to that.

I guess the question is availability vs operational simplicity. How important is vCenter Server availability in your environment? Setting up vSphere HA and VM Monitoring is a matter of seconds. Installing and configuring vCenter Heartbeat is probably hours… And think about upgrade processes etc. I personally prefer not using vCenter Heartbeat but going for vSphere HA and VM Monitoring in this scenario, how about you?

What about these vCenter services like SSO / Inventory Service / Web Client etc. Although in a way, from a scalability/performance perspective, it might make sense to split things out… It also makes your environment more vulnerable to failures. What if 1 VM in your “vCenter service chain” is down. That might render your whole solution unusable. I would personally prefer to have vCenter Server, Inventory Service and the Web Client to be installed in a single VM. I can imagine that for SSO you would like to split it out, so that when you have multiple vCenter Server instances you can link them to the same SSO instance.

As mentioned SSO potentially could be deployed in an HA fashion. HA with regards to SSO is an active/standby solution, but I have been told there are other ways of deploying it and more info would be released soon.

Recommended Practice: I am a big fan of keeping things simple. Keep vCenter and at a minimum the Inventory Service together, and potentially the Web Client. Although Heartbeat has the potential of decreasing vCenter Server downtime, in many cloud environments SLAs are around vCloud workload availability and not about vCenter itself. One component that I would recommended to configure in a HA fashion is SSO. Without SSO you cannot login, this is critical for operations.

vCloud Director

Hopefully all of you are aware that vCloud Director can easily scale by deploying new “cells” as we call it. A cell is simply said a virtual machine running the vCD software. These cells are all connected to the same database and can handle load. Not only can they handle load, but they can also continue where another stopped. So from an Availability perspective this is ideal. I already depicted this in the diagram above by the way.

Recommended Practice: Deploy multiple vCloud Director cells in your management cluster. Ensure that at a minimum two cells reside on each of the “sites” of your stretched cluster. In order to achieve this vSphere DRS VM-Host affinity groups should be used!

vShield Manager

vShield Manager is one of the difficult components. It is a single virtual machine. You can protect it using vSphere HA but that is about it as the VM has multiple vCPUs which rules out FT. So what would make sense in this case? I would try to ensure that the vShield Manager is in the same site as vCenter Server. In the case there is a network failure between sites, at least the vShield Manager and vCenter Server can communicate when needed.

Recommended Practice: The vShield Manager virtual appliance resides in the same site as the vCenter Server, in other words it is a recommended practice to have both be part of the same vSphere DRS VM-Host affinity group. It is also recommended to leverage vSphere HA – VM Monitoring to allow for automatic restarts to occur in the case of a host or guest failure.

Database

This is the challenging one… As of vCloud Director 5.1 it is supported to cluster your database. So you could potentially cluster the vCD database. However this Database Server will host more than just vCD, it will probably also host the vCenter Server database and potentially other bits and pieces like Chargeback / Orchestrator etc. Not all of these support a clustered database solution today unfortunately. It is difficult defining a recommended practice in this case. Although Database Clustering will theoretically increase availability it will also complicate operations. From an operational perspective the difficult part is how to manage site isolations. Just imagine the network between Site-A and Site-B is down but all components are still running. What will you do with the database?

This is definitely one I am not sure about what to do with…

Summary

As you can see this is not a fully worked out set of recommended practices guide yet, there is still stuff to be figured out and I am going through the exercise as we speak. If you have an opinion about this, and I am sure many do, don’t hesitate to leave a comment!

VMware Innovate magazine edition available for download!

Internally at VMware we have this cool magazine called “Innovate”. I am part of the team which is responsible for VMware Innovate. I noticed this tweet from Julia Austin and figured I would share it with all of you. This specific edition is about RADIO 2012, which is a VMware R&D innovation offsite. (So looking forward to RADIO 2013!)

There is some cool stuff to be found in this magazine in my opinion. Just one of the many nuggets, did you know VMware was already exploring vSphere FT in 2001? Just a nice reminder of how long typical engineering efforts can take. Download the magazine now!

Ganesh Venkitachalam presented “Hardware Fault Tolerance with Virtual Machines” (or Fault Tolerance, for short) at the “Engineering Offsite 2001.” This was released as a feature called Fault Tolerance for vSphere 4.0.

vSphere Metro Storage Cluster – Uniform vs Non-Uniform

Last week I presented in Belgium at the quarterly VMUG event in Brussels. We did a Q&A and got some excellent questions. One of them was about vSphere Metro Storage Cluster (vMSC) solutions and more explicitly about Uniform vs Non-Uniform architectures. I have written extensively about this in the vSphere Metro Storage Cluster whitepaper but realized I never blogged that part. So although this is largely a repeat of what I wrote in the white paper I hope it is still useful for some of you.

Uniform Versus Nonuniform Configurations

VMware vMSC solutions are classified in two distinct categories, based on a fundamental difference in how hosts access storage. It is important to understand the different types of stretched storage solutions because this will impact your design and operational considerations. Most storage vendors have a preference for one of these solutions, so depending on your preferred vendor it could be you have no choice. The following two main categories are as described on the VMware Hardware Compatibility List:

  • Uniform host access configuration – When ESXi hosts from both sites are all connected to a storage node in the storage cluster across all sites. Paths presented to ESXi hosts are stretched across distance.
  • Nonuniform host access configuration – ESXi hosts in each site are connected only to storage node(s) in the same site. Paths presented to ESXi hosts from storage nodes are limited to the local site.

We will describe the two categories in depth to fully clarify what both mean from an architecture/implementation perspective.

With the Uniform Configuration, hosts in Datacenter A and Datacenter B have access to the storage systems in both datacenters. In effect, the storage-area network is stretched between the sites, and all hosts can access all LUNs. NetApp MetroCluster is an example of this. In this configuration, read/write access to a LUN takes place on one of the two arrays, and a synchronous mirror is maintained in a hidden, read-only state on the second array. For example, if a LUN containing a datastore is read/write on the array at Datacenter A, all ESXi hosts access that datastore via the array in Datacenter A. For ESXi hosts in Datacenter A, this is local access. ESXi hosts in Datacenter B that are running virtual machines hosted on this datastore send read/write traffic across the network between datacenters. In case of an outage, or operator-controlled shift of control of the LUN to Datacenter B, all ESXi hosts continue to detect the identical LUN being presented, except that it is now accessed via the array in Datacenter B.

The notion of “site affinity”—sometimes referred to as “site bias” or “LUN locality”—for a virtual machine is dictated by the read/write copy of the datastore. For example, when a virtual machine has site affinity with Datacenter A, its read/write copy of the datastore is located in Datacenter A.

The ideal situation is one in which virtual machines access a datastore that is controlled (read/write) by the array in the same datacenter. This minimizes traffic between datacenters and avoids the performance impact of reads’ going across the interconnect. It also minimizes unnecessary downtime in case of a network outage between sites. If your virtual machine is hosted in Datacenter B but its storage is in Datacenter A you can imagine the virtual machine won’t be able to do I/O when there is a site partition.

With the Non-uniform Configuration, hosts in Datacenter A have access only to the array in Datacenter A. Nonuniform configurations typically leverage the concept of a “virtual LUN.” This enables ESXi hosts in each datacenter to read and write to the same datastore/LUN. The clustering solution maintains the cache state on each array, so an ESXi host in either datacenter detects the LUN as local. Even when two virtual machines reside on the same datastore but are located in different datacenters, they write locally without any performance impact on either of them.

Note that even in this configuration each of the LUNs/datastores has “site affinity” defined. In other words, if anything happens to the link between the sites, the storage system on the preferred site for a given datastore is the only remaining one that has read/write access to it, thereby preventing any data corruption in the case of a failure scenario. This also means that it is recommended to align virtual machine – host affinity with datastore affinity to avoid any unnecessary disruption caused by a site isolation.

I hope this helps understanding the differences between Uniform vs Non-Uniform configurations. Many more details about vSphere Metro Storage Cluster solutions, including design and operational considerations, can be found in the vSphere Metro Storage Cluster whitepaper. Make sure to read it if you are considering, or have implemented, a stretched storage solution!

Warning: Latest OS X updates causes issues with Fusion 5.0.x!

On the VMware VMTN Forums it is reported that VMware Fusion 5.0.x in combination with the latest OS X updates (MacBook Air and MacBook Pro Update 2.0) causes virtual machines to crash. This is reported in the following threads:

The error encountered is:

VMware Fusion has encountered an error and has shut down the virtual machine.

You can simply solve this as mentioned by Darius:

In the meantime, please try this: With your VM powered off, go into the Virtual Machine > Settings, then choose Display, and turn off the Accelerate 3D Graphics option.  Then close the Settings window and try to power on your VM.

vSphere HA compatibility list, how do I check it?

Someone reported issues that in their environment VMs could not be restarted as there were no compatible hosts available. The relevant part of the error message was:

N3Vim5Fault16NoCompatibleHostE

I don’t know why in this case it happened as the log files unfortunately don’t provide these details. This person had manually restarted all of his VMs and that actually worked okay. This could mean that some how the “compatibility list” that vSphere HA maintains was not complete or it wasincorrect. So the question would be how do you validate that if you ever end up in a scenario like this?

First of all before I forget, create a support dump. That way VMware Global Support Services can help pinpointing your problems and provide tips on how to prevent these from occurring.

On a host, and you will have to SSH in to one, you can actually run a script that provides you with some nice details around this. Lets go through the options of the script and explain what you can get out of it. The script is called “prettyPrint.sh” can be found in “/opt/vmware/fdm/fdm/”.

./prettyPrint.sh hostlist

The hostlist option provides all relevant details about the hosts which are part of this cluster including “hostId”, host name, ip address etc.

./prettyPrint.sh clusterconfig

The clusterconfig option provides all configuration info of your cluster like admission control and isolation response.

./prettyPrint.sh compatlist

The compatlist option provides the list of VMs and host they are compatible with, only for vSphere 5.0.

./prettyPrint.sh vmmetadata

The vmmetadata option provides the list of VMs and host they are compatible with, only for vSphere 5.1.

So in this case “vmmetadata” was important as it lists VMs compatible with which host. In this case “<index>0</index> refers to a VM and “<compatMask>0,1,2,3</compatMask> refers to the hosts it is compatible with. Nice right?!

   <compatMatrix>
      <restartCompat>
         <index>0</index>
         <compatMask>0,1,2,3</compatMask>
      </restartCompat>
      <restartCompat>
         <index>1</index>
         <compatMask>0,1,2,3</compatMask>
      </restartCompat>
      <restartCompat>
         <index>2</index>
         <compatMask>0,1,2,3</compatMask>
      </restartCompat>
   </compatMatrix>

** Update: Added Portgroup Test **

On VMTN someone asked if HA also takes networking in to account when restarting VMs. If a given portgroup is not available on specific hosts will HA smartly place VMs? In my test I removed the “VM Network” portgroup from one of my hosts (host with ID 2). When listing the compatibility list the following shows up:

<restartCompat>
       <index>0</index>
       <compatMask>0,1,3</compatMask>
</restartCompat>

As you can see host with ID 2 is missing.