Scale out building block style, or should I say (yellow) brick style!

I attended VMware PEX a couple of weeks back, and during some of the sessions, and the discussions I had afterwards, I realized that many customers out there still design using legacy concepts. The funny thing is that this mainly applies to server virtualization projects and, to a certain extent, to cloud environments. It appears that designing in building blocks is something the EUC side of this world embraced a long time ago.

I want to use this post to get feedback about your environments: how do you scale up / scale out? I discussed a concept with one of the PEX attendees which I want to share. (This is no rocket science or anything revolutionary, let that be clear.) This attendee worked for one of our partners, a service provider in the US, and was responsible for creating a scalable architecture for an Infrastructure as a Service (IaaS) offering.

The original plan they had was to build an environment that would allow for 10,000 virtual machines. Storage, networking and compute sizing and scaling were all done with these 10k VMs in mind. However, it was expected that in the first 12 months only 1,000 virtual machines would be deployed. You can imagine that internally there was a lot of debate around the upfront investment; especially the storage and compute platforms were a huge discussion. What if the projections were incorrect? What if 10k virtual machines in three years was not realistic? What if the estimated compute and IOps requirements were incorrect? This could lead to substantial underutilization of the environment, and especially in IaaS, where it is difficult to predict how the workload will behave, this could lead to a significant loss. On top of that, they were already floor space constrained… which made it impossible to scale / size for 10k virtual machines straight from the start.

During the discussion I threw the building block (pod, stack, block… all the same) method on the table, as mentioned not unlike what the VDI/EUC folks have been doing for years and not unlike what some of you have been preaching. Kris Boyd mentioned this in his session at Partner Exchange and let me quote him on this, as I fully agree with his statement: “If you know what works well at a certain scale, why not just repeat that?!” The advantage is that the costs are predictable, but, even more important for the customers and the ops team, the result of the implementation would be predictable. So what was discussed, and what will be the approach for this particular environment, or at least will be proposed as a possible architecture?

First of all, a management cluster would be created. This is the mothership of the environment. It will host all vCenter virtual machines, vCloud Director, Chargeback, databases etc. This environment does not have high IOps or compute requirements, so it would be implemented on a small, NFS-based storage device. The reason it was decided to use NFS is that the vCloud Director cells require an NFS share to transfer files. Chris Colotti wrote an article about when this NFS share is used, which might be useful to read for those interested. This “management cluster” approach is discussed in depth in the vCloud Architecture Toolkit.

For the vCloud Director resources the following was discussed. The expectation was 1,000 VMs in the first 12 months, and the architecture would need to cater for this. It was decided to use averages to calculate the requirements for this environment, as the workload was unknown and could literally be anything. How did they come up with a formula in this case? Well, what I suggested was looking at their current “hosted environment” and simply averaging things out: do a dump of all data and try to come up with some common numbers. This is what it resulted in:

  • 1000 VMs (4:1 VM-to-core ratio, average of 6GB memory per VM)
    • Required cores = 250 (for example 21 x dual-socket 6-core hosts)
    • Required memory = 6TB (for example 24 x 256GB hosts)

This did not take any savings due to TPS into account, and the current hardware platform used wasn’t as powerful as the new one. In my opinion it is safe to say that 24 hosts would cater for these 1000 VMs, and that would include N+2. Even if it did not, they agreed that this would be their starting point and maximum cluster size. They wanted to avoid any risks and did not want to push the boundaries too much with regards to cluster sizes. Although I believe 32 hosts in a cluster is no problem at all, I can understand where they were coming from.
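For those who want to play with the numbers themselves, the back-of-the-envelope math above can be sketched as follows. All inputs are the illustrative averages from this example; plug in your own measured numbers.

```python
import math

# Illustrative sizing inputs taken from the example above.
vms = 1000
vms_per_core = 4            # 4:1 VM-to-core consolidation ratio
mem_per_vm_gb = 6           # average memory per VM, no TPS savings assumed
cores_per_host = 12         # dual-socket, 6-core hosts
mem_per_host_gb = 256

required_cores = vms // vms_per_core                          # 250 cores
required_mem_gb = vms * mem_per_vm_gb                         # 6000 GB (~6 TB)

hosts_for_cpu = math.ceil(required_cores / cores_per_host)    # 21 hosts
hosts_for_mem = math.ceil(required_mem_gb / mem_per_host_gb)  # 24 hosts

# Memory is the constraint here; the three-host difference with the CPU
# requirement is what makes 24 hosts roughly cover N+2 as well.
cluster_size = max(hosts_for_cpu, hosts_for_mem)              # 24 hosts
print(cluster_size)
```

The same calculation can of course be repeated per building block as the environment grows, which is exactly what makes the approach predictable.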

The storage part is where it got more interesting. They had a huge debate around upfront costs and did not want to invest at this point in a huge enterprise-level storage solution. As I said, they wanted to make sure the environment would scale, but also wanted to make sure the costs made sense. In their current environment the average disk size was 60GB. Multiply that by 1,000 and you know you will need at least 60TB of storage. That is a lot of spindles. Datacenter floor space was definitely a constraint, so this would be a huge challenge… unless you use techniques like deduplication / compression and have a proper amount of SSD to maintain a certain service level / guarantee performance.
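The storage math is equally simple to sketch. Note that the data-reduction ratio below is purely a hypothetical placeholder; real dedupe/compression savings are workload dependent and should come out of a proof of concept, not a spreadsheet.

```python
# Raw capacity needed before any data reduction; numbers from the example above.
vms = 1000
avg_disk_gb = 60
raw_tb = vms * avg_disk_gb / 1000         # 60 TB of raw capacity

# Hypothetical data-reduction ratio; validate with your own workload
# during a vendor eval before sizing hardware on it.
reduction_ratio = 2.0
effective_tb = raw_tb / reduction_ratio   # 30 TB of physical capacity
print(raw_tb, effective_tb)
```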

During the discussion it was mentioned several times that they would be looking at up-and-coming storage vendors like Tintri, Nimble and Pure Storage. These were the three specifically mentioned by this partner, but I realize there are many others out there. I have to agree that the solutions offered by these vendors are really compelling and each of them has something unique. It is difficult to compare them on paper though, as Tintri does NFS, Nimble does iSCSI, and Pure Storage does FC (and iSCSI soon) but is also SSD-only. Pure Storage especially intrigued them due to the power/cooling/rackspace savings. The great thing about all of these solutions, again, is that they are predictable from a cost / performance perspective, which allows for an easily repeatable architecture. They haven’t made a decision yet and are planning on doing an eval with each of the solutions to see how it integrates, scales, performs and, most importantly, what the operational impact is.

Something we unfortunately did not discuss was networking. These guys, being a traditional networking provider, did not have much control over what would be deployed, as their network department was in charge of this. In order to keep things simple they were aiming for a 10Gbit infrastructure: the cost of networking ports was significant, and they wanted to reduce the number of cables coming out of the rack for simplicity reasons.

All in all it was a great discussion which I thought was worth sharing. Although the post is anonymized, I did ask their permission before I wrote this up :-). I realize that this is by no means a complete picture, but I hope it gives an idea of the approach; if I can find the time I will expand on this with some more examples. I hope that those working on similar architectures are willing to share their stories.

I selected “failover host” and my VMs still end up on a different host after an HA event

I received a question today about HA admission control policies, and more specifically about the “failover host” admission control policy. The question was why VMs were restarted on a different host than the one selected with the “Failover Host” admission control policy. Shouldn’t this policy guarantee that a VM is restarted on the designated host?

The answer is fairly straightforward, and I thought I had blogged about this already, but I cannot find it, so here it goes. Yes, under normal conditions HA will request the designated failover host to restart the failed VMs. However, there are a couple of cases where HA will not restart a VM on the designated failover host(s):

  • When the failover host is not compatible with the virtual machine (e.g. a portgroup or datastore is missing)
  • When the failover host does not have sufficient resources available for the restart
  • When the virtual machine restart fails, HA retries on a different host

Keep that in mind when using this admission control policy: it is not a hard guarantee that the designated failover host will restart all failed VMs.

VMware vCloud Director Infrastructure Resiliency Case Study paper published!

Yesterday the paper that Chris Colotti and I were working on titled “VMware vCloud Director Infrastructure Resiliency Case Study” was finally published. This white paper is an expansion on the blog post I published a couple of weeks back.

Someone asked me at PEX where this solution suddenly came from. Well, it is based on a solution I came up with on a random Friday morning in mid-December, when I woke up at 05:00 in Palo Alto, still jet-lagged. I diagrammed it on a napkin and started scribbling things down in Evernote. I explained the concept to Chris over breakfast, and that is how it started. Over the last two months Chris (+ his team) and I validated the solution, and this is the outcome. I want to thank Chris and team for their hard work and dedication.

I hope that those architecting / implementing DR solutions for vCloud environments will benefit from this white paper. If there are any questions feel free to leave a comment.

Source – VMware vCloud Director Infrastructure Resiliency Case Study

Description: vCloud Director disaster recovery can be achieved through various scenarios and configurations. This case study focuses on a single scenario as a simple explanation of the concept, which can then easily be adapted and applied to other scenarios. In this case study it is shown how vSphere 5.0, vCloud Director 1.5 and Site Recovery Manager 5.0 can be implemented to enable recoverability after a disaster.
I am expecting that the MOBI and EPUB version will also soon be available. When they are I will let you know!

Digging deeper into the VDS construct

The following comment was made on my VDS blog post, and I figured I would investigate this a bit further:

It seems like the ESXi host only tries to sync the vDS state with the storage at boot and never again afterward. You would think that it would keep trying, but it does not.

Now let's look at the “basics” first. When an ESXi host boots, it will get the data required to recreate the VDS structure locally by reading /etc/vmware/dvsdata.db and esx.conf. You can view the dvsdata.db file yourself by running:

net-dvs -f /etc/vmware/dvsdata.db

But is that all that is used? If you check the output of that file you will see that all the data required for a VDS configuration to work is actually stored in there. So what about those files stored on a VMFS volume?

Each VMFS volume that holds a working directory (the folder where the .vmx is stored) for at least one virtual machine connected to a VDS will have the following folder:

drwxr-xr-x    1 root     root          420 Feb  8 12:33 .dvsData

If you go into this folder you will see another folder. The name of this folder appears to be some sort of unique identifier, and when comparing the string to the output of “net-dvs” it turns out to be the identifier of the dvSwitch that was created.

drwxr-xr-x    1 root     root         1.5k Feb  8 12:47 6d 8b 2e 50 3c d3 50 4a-ad dd b5 30 2f b1 0c aa

Within this folder you will find a collection of files:

-rw------- 1 root root 3.0k Feb 9 09:00 106
-rw------- 1 root root 3.0k Feb 9 09:02 136
-rw------- 1 root root 3.0k Feb 9 09:00 138
-rw------- 1 root root 3.0k Feb 9 09:05 152
-rw------- 1 root root 3.0k Feb 9 09:00 153
-rw------- 1 root root 3.0k Feb 9 09:05 156
-rw------- 1 root root 3.0k Feb 9 09:05 159
-rw------- 1 root root 3.0k Feb 9 09:00 160
-rw------- 1 root root 3.0k Feb 9 09:00 161

It is no coincidence that these files are “numbers”: they correspond to the dvPort IDs of the virtual machines stored on this volume. This is the port information of the virtual machines which have their working directory on this particular datastore. This port info is also what HA uses when it needs to restart a virtual machine that uses a dvPort. Let me emphasize that: this is what HA uses when it needs to restart a virtual machine! Is that all?

Well, I am not sure. When I tested the original question I powered on the host without access to the storage system, and powered on my storage system only when the host was fully booted. I have not had this confirmed, but it seems to me that access to the datastore holding these files is somehow required during the boot process of your host, in the case of “static port binding” that is. (Port bindings are described in more depth here.)

Does this imply that if your storage is not available during the boot process, virtual machines cannot connect to the network when they are powered on? Yes, that is correct. I tested it, and when you have a full power outage and your hosts come up before your storage, you will have a “challenge”. As soon as the storage is restored you will probably want to restart your virtual machines, but if you do, you will not get a network connection. I’ve tested this 6 or 7 times in total, and not once did I get a connection.

One workaround is to simply reboot your ESXi hosts: after a reboot the problem is solved, and your virtual machines can be powered on and will get access to the network. Rebooting a host can be a painfully slow exercise though, as I noticed during my test runs in my lab. Fortunately there is a really simple alternative: restarting the management agents! Before you power on your virtual machines, and after your storage connection has been restored, run the following from the ESXi shell: services.sh restart

After the services have been restarted you can power on your virtual machines, and network connectivity will be restored!

Side note: on my article there was one question about the auto-expand property of static portgroups, namely whether it is officially supported and where it is documented. Yes, it is fully supported; there is a KB article about how to enable it, and William Lam recently blogged about it here. That is it for now on the VDS…

Migrating to a VDS switch when using etherchannels

Last week at PEX I had a discussion about migrating to a Distributed Switch (VDS) and some of the challenges one of our partners faced. During their migration they ran into a lot of network problems, which made them decide to change back to a regular vSwitch. They were eager to start using the VDS, but could not take the risk of running into those problems again.

I decided to grab a piece of paper, and we quickly drew out the architecture currently implemented at this customer site and discussed the steps the customer took to migrate. The steps described were exactly the steps documented here, and there was absolutely nothing wrong with them. At least not at first sight… When we dove into their architecture a bit more, a crucial keyword popped up: EtherChannels. Why is this a problem? Well, look at this process for a minute:

  • Create Distributed vSwitch
  • Create dvPortgroups
  • Remove vmnic from vSwitch0
  • Add vmnic to dvSwitch0
  • Move virtual machines to dvSwitch0 port group

Just imagine you are using an EtherChannel and traffic is being load balanced, but now you have one “leg” of the EtherChannel ending up in dvSwitch0 and one in vSwitch0. Not a pretty sight indeed. In this scenario the migration path would need to be:

  1. Create Distributed vSwitch
  2. Create dvPortgroups
  3. Remove all the ports from the EtherChannel configuration on the physical switch
  4. Change vSwitch load balancing from “IP Hash” to “Virtual Port ID”
  5. Remove vmnic from vSwitch0
  6. Add vmnic to dvSwitch0
  7. Move virtual machines to dvSwitch0 port group

For the second vmnic (and any subsequent NICs) only steps 5 and 6 would need to be repeated. After this the dvPortgroup can be configured to use “IP hash” load balancing, and the physical switch ports can be added to the EtherChannel configuration again. You can repeat this for additional portgroups and VMkernel NICs.

I do want to point out that I am personally not a huge fan of EtherChannel configurations in virtual environments. One of the reasons is the complexity, which often leads to problems when things are misconfigured; an example of what can go wrong is described above. If you don’t have any direct requirement to use IP hash… use Load Based Teaming on your VDS instead. Believe me, it will make your life easier in the long run!