Distributed vSwitches and vCenter outage, what’s the deal?

Recently my colleague Venky Deshpande released a whitepaper on VDS best practices. This whitepaper describes various architectural options when adopting a VDS-only strategy, a strategy I can see the benefits of. On Facebook multiple people commented on why this would be a bad practice instead of a best practice; here are some of the comments:

“An ESX/ESXi host requires connectivity to vCenter Server to make vDS operations, such as powering on a VM to attach that VM’s network interface.”

“The issue is that if vCenter is a VM and changes hosts during a disaster (like a total power outage) and then is unable to grant itself a port to come back online.”

I figured the best way to debunk these myths was to test them myself. I am confident that there is no problem, but I wanted to make sure I could convince you as well. So what will I be testing?

  • Network connectivity after powering on a VM connected to a VDS while vCenter is down.
  • Restoring network connectivity of the vCenter VM attached to a VDS after a host failure.
  • Restoring network connectivity of the vCenter VM attached to a VDS after HA has moved it to a different host and restarted it.

Before we start I think it is useful to rehash the different types of portgroup binding, which are described in more depth in this KB:

  • Static binding – A port is immediately assigned and reserved for the VM when it is connected to the dvPortgroup through vCenter. This happens during provisioning of the virtual machine!
  • Dynamic binding – Port is assigned to a virtual machine only when the virtual machine is powered on and its NIC is in a connected state. The Port is disconnected when the virtual machine is powered off or the virtual machine’s NIC is disconnected. (Deprecated in 5.0)
  • Ephemeral binding – Port is created and assigned to a virtual machine when the virtual machine is powered on and its NIC is in a connected state. The Port is deleted when the virtual machine is powered off or the virtual machine’s NIC is disconnected. Ephemeral Port assignments can be made through ESX/ESXi as well as vCenter.

Hopefully this makes it clear straight away that there should be no problem at all: "Static binding" is the default, and even when vCenter is down, a VM that was provisioned before vCenter went down can easily be powered on and will have network access. I don’t mind spending some lab hours on this, so let’s put this to the test. Let’s use the defaults and see what the results are.

First I made sure all VMs were connected to a dvSwitch. I powered off a VM and checked its network settings, and this is what it revealed… a port already assigned, even while powered off:

This is not the only place you can see port assignments; you can also verify them on the VDS’s “Ports” tab:

Now let’s test this, as that is ultimately what it is all about. First test: network connectivity after powering on a VM connected to a VDS while vCenter is down:

  • Connected the VM to a dvPortgroup with static binding (this is the default and best practice)
  • Power off VM
  • Power off vCenter VM
  • Connect vSphere Client to host
  • Power on VM (this can also be done from the ESXi shell, as sketched after this list)
  • Ping VM –> Positive result
  • You can even see on the command line that this VM uses its assigned port:
    esxcli network vswitch dvs vmware list
    Client: w2k8-001.eth0
    DVPortgroup ID: dvportgroup-516
    In Use: true
    Port ID: 137
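
For completeness: if no vSphere Client is at hand, the power-on can also be done straight from the ESXi shell. This is a minimal sketch; the Vmid “42” is just an example, use whatever vim-cmd reports for your VM:

    # list the VMs registered on this host and note the Vmid of the VM
    vim-cmd vmsvc/getallvms
    # power on the VM using its Vmid (42 is an example)
    vim-cmd vmsvc/power.on 42
    # verify the power state
    vim-cmd vmsvc/power.getstate 42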

Second test: restoring network connectivity of the vCenter VM attached to a VDS after a host failure:

  • Connected the vCenter VM to a dvPortgroup with static binding (this is the default and best practice)
  • Power off vCenter VM
  • Connect vSphere Client to host
  • Power on vCenter VM
  • Ping vCenter VM –> Positive result

Third test: restoring network connectivity of the vCenter VM attached to a VDS after HA has moved it to a different host and restarted it.

  • Connected the vCenter VM to a dvPortgroup with static binding (this is the default and best practice)
  • Yanked the cable out of the ESXi host on which vCenter was running
  • Opened a ping to the vCenter VM
  • HA re-registered the vCenter VM on a different host and powered it on (a shell-side check for this is sketched after this list)
    • The re-register / power-on took roughly 45 – 60 seconds
  • Ping vCenter VM –> Positive result
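
If you want to confirm from the shell that HA indeed re-registered the vCenter VM on the surviving host, something like the following does the trick (a sketch; the grep pattern and Vmid are examples):

    # the vCenter VM should now show up in this host's inventory
    vim-cmd vmsvc/getallvms | grep -i vcenter
    # and its power state should report "Powered on" (use the Vmid returned above)
    vim-cmd vmsvc/power.getstate 42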

I hope this debunks some of those myths floating around. I am the first to admit that there are still challenges out there, and these will hopefully be addressed soon, but I can assure you that your virtual machines will regain network connectivity as soon as they are powered on, either through HA or manually… yes, even when your vCenter Server is down.

 

Using a CNAME (DNS alias) to mount an NFS datastore

I was playing around in my lab with NFS datastores today. I wanted to fail over a replicated NFS datastore without the need to re-register the virtual machines running on it. I had mounted the NFS datastore using the IP address, and as the IP address is used to create the datastore UUID, it was obvious that this wouldn’t work. I figured there should be a way around it, but after a quick search on the internet I hadn’t found anything yet.
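
To illustrate why mounting by IP address gets in the way: the NFS server you specify at mount time is part of what the UUID is generated from, so remounting the replica from a different address shows up as a brand new datastore. A quick sketch, with made-up IP addresses and “drtest” as an example export/datastore name:

    # original mount, by IP address (hypothetical)
    esxcfg-nas -a -o 192.168.1.50 -s /drtest drtest
    # after failing over and remounting from the replica's (different) IP...
    esxcfg-nas -d drtest
    esxcfg-nas -a -o 192.168.1.51 -s /drtest drtest
    # ...the symlink under /vmfs/volumes points to a different UUID, so the
    # registered virtual machines no longer find their datastore
    ls -lah /vmfs/volumes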

I figured it should be possible to achieve this using a CNAME, but I also recalled something around vCenter screwing this up. I tested it anyway, and with success. This is what I did (the ESXi-shell commands are collected in a short sketch after the list):

  • Added both NFS servers to DNS
  • Created a CNAME (DNS alias) and pointed it to the “active” NFS server
    • I used the name “nasdr” to make it obvious what it is used for
  • Created an NFS share (drtest) on the NFS server
  • Mounted the NFS export using vCenter or through the CLI
    • esxcfg-nas -a -o nasdr -s /drtest drtest
  • Checked the UUID using vCenter or through the CLI
    • ls -lah /vmfs/volumes
    • example output:
      lrwxr-xr-x    1 root     root           17 Feb  6 10:56 drtest -> e9f77a89-7b01e9fd
  • Created a virtual machine on the NFS datastore
  • Enabled replication to my “standby” NFS server
  • I killed my “active” NFS server environment (after validating it had completed replication)
  • Changed the CNAME to point to the secondary NFS server
  • Unmounted the old volume
    • esxcfg-nas -d drtest
  • I did a vmkping to “nasdr” just to validate the destination IP had changed
  • Rescanned my storage using “esxcfg-rescan -A”
  • Mounted the new volume
    • esxcfg-nas -a -o nasdr -s /drtest drtest
  • Checked the UUID using the CLI
    • ls -lah /vmfs/volumes
    • example output:
      lrwxr-xr-x    1 root     root           17 Feb  6 13:09 drtest -> e9f77a89-7b01e9fd
  • Powered on the virtual machine now running on the secondary NFS server
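
For reference, this is the ESXi-shell side of the fail-over pulled together in one place (the alias “nasdr” and datastore name “drtest” are from my lab):

    # unmount the old volume
    esxcfg-nas -d drtest
    # validate that the alias now resolves to the secondary NFS server
    vmkping nasdr
    # rescan and remount using the same alias and export
    esxcfg-rescan -A
    esxcfg-nas -a -o nasdr -s /drtest drtest
    # the UUID should be identical to the one before the fail-over
    ls -lah /vmfs/volumes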

As you can see, both volumes had the exact same UUID. After the fail-over I could power on the virtual machine; there was no need to re-register the virtual machines within vCenter first. Before sharing this with the world I reached out to my friends at NetApp. Vaughn Stewart connected me with Peter Learmonth, who validated my findings and actually pointed me to a blog article he wrote about this topic. I suggest heading over to Peter’s article for more details.

Setting the default affinity rule for Storage DRS

On yesterday’s blog article, “Rob M” commented that the default affinity rule for Storage DRS (SDRS), “keep VM files together”, did not make sense to him. One of the reasons this affinity rule is the default is that customers indicated that, from an operational perspective, it is easier if all files of a given VM (the .vmx and .vmdk’s) reside in the same folder. Troubleshooting in particular was one of the main reasons, as this lowers complexity. I have to say that I fully agree with this; I’ve been in the situation where I needed to recover virtual machines, and having them spread across multiple datastores really complicates things.

But, just like Rob, you might not agree with this and would rather have SDRS handle balancing on a per-file basis. That is possible, and we documented the procedure in our book. I was under the impression that I had blogged about this, but I just noticed that somehow I never did. Here is how you change the affinity rule for the currently provisioned VMs in a datastore cluster:

  1. Go to Datastores and Datastore Clusters
  2. Right click a datastore cluster and select “edit settings”
  3. Click “Virtual machine settings”
  4. Deselect “Keep VMDKs together”
    1. For virtual machines that need to stick together you can override the default by ticking the tick box next to the VM


Also check out this article by Frank about DRS/SDRS affinity rules, useful to know!

How cool and useful is Storage DRS?!

I was just playing around in my lab and had created a whole bunch of VMs when I needed to deploy two large virtual machines. Both of them had 500GB disks. The first one deployed without a hassle, but the second one seemed impossible to deploy… well, not impossible for Storage DRS. Just imagine if you had to figure this out yourself! Frank wrote a great article about the logic behind this and there is no reason for me to repeat it; just head over to Frank’s blog if you want to know more.

And the actual migrations being spawned:

Yes, this is the true value of Storage DRS… initial placement recommendations!

Creating an IP-Pool for VC Ops

I was importing the VC Ops virtual appliance and during the import I got a question around IP addresses. So I figured I would enter two IP addresses and that would be it. As soon as I powered on the VM I received the following error:

Cannot initialize property ‘vami.netmask0.VM_1’ since network ‘VM Network’ has no associated IP Pools configuration.

I figured this would be simple, so I jumped back to “home” and went to the network section… Nothing around IP Pools. Even on the host or cluster layer there was nothing. Luckily my colleague Cormac jumped in and said to check the “Datacenter” object, as there should be an IP Pools tab there. He was right. Weird place and definitely something that needs to be improved. Anyway, configuring the IP Pool itself, now that I had found it, was easy:

  1. Click your Datacenter object
  2. Go to the “IP Pools” tab
  3. Click “Add”
  4. Fill out the details:
    1. Subnet: which network will be used and what is the mask? (You can use a subnet calculator if you don’t know…)
    2. Enter the details of the gateway
    3. Specify a range; the format is “10.1.1.10#10”, which results in a range from 10.1.1.10 through 10.1.1.19 (10 addresses counting from .10)
    4. Don’t forget to tick the “Enable IP Pool” check box
    5. Click on the “Associations” Tab and associate it to a network!
    6. Also, fill out the DNS and proxy details if and when required.
  5. This is what it should look like:

It is as simple as that, but indeed not easy to find, hence the reason I figured a short article was in order.

PS: Creating a range and enabling the IP Pool is not required; “Enable IP Pool” enables the use of the range. In this example I had to use a range as I could only use a specific portion of this subnet.