Want to attend PuppetConf 22/23 August in San Francisco for free?

Do you want to attend PuppetConf on 22/23 August in San Francisco for free? The Puppet Labs folks were kind enough to provide me with 2 registration codes. If you want to attend their conference in San Francisco this year, just drop a comment saying why you love Puppet and I will randomly pick two winners.

I had hoped to attend myself, but unfortunately it clashes with another conference for me. I believe they will have over 70 sessions, and they have labs, so it is a great way to learn about Puppet and meet like-minded people! More info can be found on PuppetConf.com. Anyway, if you would like to go but haven’t secured your ticket yet, drop a comment here and maybe you will win one of the two tickets I have available. Good luck.

For those who aren’t lucky enough to win a ticket… I have a discount code which gives you $200 off, just use “DuncanEpping” when you register!

Deadline: July 25th 2013

Testing your infrastructure!

Last week I was helping someone on the VMTN community forums. They were hitting what appeared to be strange HA behavior. After some standard questions this person told me that all VMs were powered down after a network outage. Sound like a familiar problem? Yes, I can hear most of you thinking: isolation response set to “power off” and no proper network redundancy?

Well, yes and no. They did indeed have the isolation response configured to “power off” all VMs when the host is isolated. They did, however, have proper network redundancy, so how on earth did this happen? With 2 physical NICs and 2 physical switches, and only 1 being impacted, this should not have happened, right?!?

Wrong! In this case the fail-over from a “vmkernel” perspective worked fine. The first “path” went down, so the second was used for the management vmkernel. All VMs were up and running at this point, and they remained running until… the network connection was restored and the management vmkernel failed back to the original physical NIC. This means that the MAC address that showed up on port 1 popped up on port 2 and then went back to port 1 again. The switch was not impressed: it went through the spanning tree process and traffic was blocked instantly as a result. Now when traffic is blocked bad things can happen, especially when you configure HA to “power off” VMs. Basically, what caused this issue was the fact that spanning tree was not set to the recommended “port fast” on the switch ports, more details here.
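To make the timing a bit more concrete, here is a minimal Python sketch of why a port without “port fast” can be enough to trigger the isolation response. Everything in it is an assumption for illustration: the 15 second forward delay is the classic 802.1D spanning tree default (listening plus learning adds up to 30 seconds), and `isolation_window_s` is a hypothetical placeholder for however long your HA version waits before acting, not a value taken from the environment described above.

```python
# Illustrative model only; all timer values are assumptions, not measurements.
STP_FORWARD_DELAY_S = 15  # classic 802.1D default forward delay per state


def seconds_until_forwarding(portfast: bool,
                             forward_delay_s: int = STP_FORWARD_DELAY_S) -> int:
    """Return how long a port that just came up does not forward traffic."""
    if portfast:
        # With port fast the port skips listening/learning and forwards at once.
        return 0
    # Without it the port sits in listening and then learning first.
    return 2 * forward_delay_s


def isolation_response_triggers(portfast: bool, isolation_window_s: int) -> bool:
    """True when the blocked period reaches the (hypothetical) isolation window."""
    return seconds_until_forwarding(portfast) >= isolation_window_s


if __name__ == "__main__":
    # 25 seconds is a made-up window; substitute the value for your setup.
    for portfast in (False, True):
        blocked = seconds_until_forwarding(portfast)
        triggered = isolation_response_triggers(portfast, isolation_window_s=25)
        print(f"portfast={portfast}: blocked for {blocked}s, "
              f"isolation response triggered: {triggered}")
```

The point of the sketch is simply that the blocked period after fail-back, not the original link failure, is what has to stay below the isolation window.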

I knew instantly that this was the reason for the problem, not because I know stuff about HA but because I had seen this many times in the past while testing environments I configured and designed. Not just testing after implementing a new infrastructure, but also testing after making changes to an infrastructure or introducing a new version / feature. I guess this kind of comes back to the “disaster” scenario as well: test it if you want to know whether it works as expected. Just a simple example: I want to introduce QoS for my vMotion network and make changes to my physical network. Now what? How do I test these changes? How many times do I run through my test scenarios? What kind of “problems” do I introduce during my tests?

So I guess by now some might wonder why on earth I brought this up… well, the problem above could have been prevented by simply testing the infrastructure when it was implemented and after changes were introduced, and maybe even on a regular basis. If HA / networking had been tested properly, those VMs would not have been powered off…

My VMworld San Francisco session recommendations

Every year I do this blog post with recommendations. I hadn’t done so yet this year and several people started asking for it, so I figured it was about time. People always complain sessions aren’t technical enough, and my answer typically is: then you have been attending the wrong sessions… So if you were one of those folks last year, make sure to include some of the sessions below in your schedule, as I will personally guarantee you will get something out of each of these. These are not the average marketing sessions, but rather sessions by deep technical people, or just plain awesome presenters. Note that I tried to limit myself to just 20 / 30 sessions, so some awesome sessions might be missing. Don’t shoot me for that: when going through the list I figured I could easily get to 50… but then I might as well just link the content catalog.

This is my top 30, in no particular order:

  1. VSVC5511 – Deploying vSphere with OpenStack: What It Means to Your Cloud Environment by Scott Lowe and Dan Wendlandt
  2. VSVC5364 – Storage IO Control: Concepts, Configuration and Best Practices to Tame Different Storage Architectures by Sachin Manpathak and Ajay Gulati
  3. VSVC5280 – DRS: New Features, Best Practices and Future Directions by Aashish Parikh and Ajay Gulati
  4. VSVC4966 – vSphere Distributed Switch – Technical Deep Dive by Jason Nash
  5. VSVC4944 – PowerCLI Best Practices – A Deep Dive by Luc Dekens and Alan Renouf
  6. VSVC4886 – Innovations in vMotion: A Technical Preview by Jennifu Wu, Gabe Tarasuk-Levin, Sreekanth Setty and Min Cai
  7. VSVC4830 – vCenter Deep Dive by Ameet Jani and Justin King
  8. VCM5477 – Integration Deep Dive: Cloud Service Automation with NSX and vCloud Automation Center by Somik Behera and Thomas Kraus
  9. VCM5008 – vCenter Operations and the Quest for the Missing Metrics by Eric Sloof and Duco Jaspars
  10. VAPP4683 – Maximize Database Performance in Your Software-Defined Datacenter by Mark Achtemichuk and Michael Webster
  11. VAPP4679 – Software-Defined Datacenter Design Panel for Monster VM’s: Taking the Technology to the Limits for High Utilisation, High Performance Workloads by Andrew Mitchell, Mark Achtemichuk, Mostafa Khalil and Michael Webster
  12. STO5638 – Best Practices with Software Defined Storage by Vaughn Stewart and Chad Sakac
  13. STO5636 – Storage DRS: Deep Dive and Best Practices to Suit Your Storage Environments by Mustafa Uysal and Sachin Manpathak
  14. STO5559 – Storage Industry Trends by Alex Jauch and Vijay Ramachandran
  15. STO5027 – VMware Virtual SAN Technical Best Practices by Cormac Hogan and Kiran Madnani
  16. STO4798 – Software-Defined Storage: The VCDX Way by Wade Holmes and Rawlinson Rivera
  17. STO4791 – Just Because You Could, Doesn’t Mean You Should: Lessons Learned in Storage Best Practices (v2.0) by Patrick Carmichael
  18. SEC5891 – Technical Deep Dive: Build a Collapsed DMZ Architecture for Optimal Scale and Performance Based on NSX Firewall Services by Ranga Maddipudi and Shubha Bheemarao
  19. SEC5828 – Datacenter Transformation with Network Virtualization: Today and Tomorrow by Martin Casado
  20. SEC5582 – Multi-site Deployments with Network Virtualization by Pepe Garcia and Kamau Wanguhu
  21. PHC5640 – The Story Behind Designing and Building a Distributed Automation Framework for vCloud Hybrid Services by Nick Weaver
  22. PHC4750 – How to Build a Hybrid Cloud in Less than a Day by David Hill
  23. NET5716 – Advanced NSX Architecture by Bruce Davie
  24. NET5521 – vSphere Distributed Switch – Design and Best Practices by Ray Budavari and Venky Deshpande
  25. NET5184 – Designing Your Next Generation Datacenter for Network Virtualization by Ray Budavari and Ben Basler
  26. EUC5291 – Horizon View Troubleshooting: Looking under the Hood by Matt Coppinger and Pat Lee
  27. EUC5238 – Horizon Workspace: Data Deep Dive by Rasmus Jensen and Marcello Golfieri
  28. EUC4546 – Architecting VMware Horizon Workspace for Scale and Performance by Kit Colbert, Jared Cook and Andrew Johnson
  29. BCO4977 – VMware vSphere Replication: Technical Walk-Through with Engineering by Aleksey Pershin and Ken Werneburg
  30. BCO4756 – VMware vSphere Data Protection (VDP) Technical Deep Dive And Troubleshooting Session by Jacy Townsend and Darryl Hing


Network port diagram for vSphere 5.x

Somehow I missed this one, but as I reviewed the diagram and helped select the right format I figured I would still share it. This Network port diagram for vSphere 5.x is one awesome resource for those folks who want to get to the bottom of how components interact with each other.

I don’t think there is a lot more I can say about it. If you love diagrams and like to know the details, make sure to hit: http://kb.vmware.com/kb/2054806

Prepare for the worst…

Over the last couple of months I have been contacted by various folks who thought long and hard about their Business Continuity and Disaster Recovery design. They bought a great backup solution which integrated with vSphere and they replicated their SAN to a second site. In their mind they were definitely prepared for the worst… I agree with that to a certain extent: their design was indeed well thought-out and carefully covered all aspects of BC/DR. From an operational perspective, though, things looked different: the first significant failure occurred and they couldn’t fully recall the steps to recover. That is what inspired my tweet on the subject…

Funny thing is that this tweet also triggered some responses like “Go SRM” or “that is where Zerto comes in”, and again I agree that an orchestration layer should be part of your DR plan, but when talking about BC/DR I think it is more about the strategy and the processes that will need to be triggered in a particular scenario. What is typically involved? I am not even going into the business-specific side of things and all the politics that come along with it. Instead, look at your process, take one step back and ask yourself: what if this part of the process fails?

One of the things Lee and I will mention multiple times during our VMworld session on Stretched Clusters is: Test It! Not once, not twice, but various times, and be prepared for the worst to happen. Yes, none of us likes to test the most destructive and disruptive failure scenario, but you can bet that when something goes wrong it will be the scenario you did not test. Although I think SRM, for instance, is a rock solid solution, what if for whatever reason your recovery plan does not work as planned? While testing, make sure you document your recovery plan. Even though you might have a bunch of scripts lying around, who knows if they will work as expected? Some scripts (or SRM type of solutions) depend on certain components / services being up; what if they are not? Besides your BC/DR strategy, of course, a lot of procedures will need to be documented. What kind of procedures are we talking about? Here are just a couple of random ones I would suggest you document, at a bare minimum, while testing your scenarios:

  • Order in which to power-on all physical components in your Datacenter (and power-off)
  • Location of infrastructure related services (AD, DNS, vCenter, Syslog, NTP, etc); when they are virtual and stored on the SAN, document the datastore they reside on, for instance
  • Order in which to power-on all infrastructure related services
  • Order in which to power-on all remaining virtual machines / vApps
  • How to get your vCenter Server up and running from the command line (this will make it a lot easier to get the rest of your VMs up and running)
  • How to power-on virtual machines from the command line after a failure
  • How to re-register a virtual machine from the command line after a failure
  • How to mount a LUN from the command line after a failover
  • How to resignature a LUN from the command line after a failover
  • How to restore a full datastore
  • How to restore a virtual machine
  • etc etc
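The power-on order items above lend themselves to being written down as a dependency map rather than a hand-maintained list. Below is a minimal Python sketch of that idea using the standard library `graphlib` module (Python 3.9 or later). The component names and their dependencies are entirely hypothetical; replace them with what your own testing shows.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each entry lists what must be up BEFORE it.
DEPENDS_ON = {
    "storage-array": [],
    "core-switches": [],
    "esxi-hosts": ["storage-array", "core-switches"],
    "dns": ["esxi-hosts"],
    "ntp": ["esxi-hosts"],
    "active-directory": ["dns", "ntp"],
    "vcenter": ["active-directory"],
    "business-apps": ["vcenter"],
}


def power_on_order(depends_on):
    """Return an order in which every component comes after its dependencies.

    Raises graphlib.CycleError if the map contains a circular dependency,
    which is itself something worth finding out before a disaster.
    """
    return list(TopologicalSorter(depends_on).static_order())


def power_off_order(depends_on):
    """Power-off is simply the power-on order reversed."""
    return list(reversed(power_on_order(depends_on)))


if __name__ == "__main__":
    for step, name in enumerate(power_on_order(DEPENDS_ON), start=1):
        print(f"{step}. power on {name}")
```

Components without a dependency between them (the storage array and the core switches in this made-up map) may come out in either order, so print the result and keep the printout with the rest of your documentation.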

Now I can hear some of you thinking: why would I document that, I know all of that stuff inside out? Well, what if you are on holiday or at home sick? Just imagine your junior colleague is by himself when disaster strikes. Does he know in which order the services of that business critical multi-tier application need to start?

When you do document these, make sure to have a (physical) copy available outside of your infrastructure. Believe me… you wouldn’t be the first to find yourself locked out of a system, looking for the documents needed to recover it, and then realizing they are stored on the very system you need to recover. Those who have ever been in a total datacenter outage know what I am talking about. I have been in the situation where a full datacenter went down due to a power outage. Believe me when I say that bringing up over 300 VMs and all associated physical components without documentation was a living nightmare.

Although you probably get it by now… it is not the tool, but a proper strategy, procedures and documentation that are the key to success! Just do it.

Cool Tool: VisualEsxtop

My ESXTOP page is still one of the most visited pages I have; it actually comes in second, right after the HA Deepdive. Every once in a while I revise the page, and this week it was time to add VisualEsxtop to the list of tools people should use. I figured I would write a regular blog post first and roll it up into the page at the same time. So what is VisualEsxtop?

VisualEsxtop is an enhanced version of resxtop and esxtop. VisualEsxtop can connect to VMware vCenter Server or ESX hosts, and display ESX server stats with a better user interface and more advanced features.

That sounds nice, right? Let’s have a look at how it works. This is what I did to get it up and running:

  • Go to “http://labs.vmware.com/flings/visualesxtop” and click “download”
  • Unzip “VisualEsxtop.zip” into the folder where you want to store the tool
  • Go to the folder
  • Double click “visualesxtop.bat” when running Windows (Or follow William’s tip for the Mac)
  • Click “File” and “Connect to Live Server”
  • Enter the “Hostname”, “Username” and “Password” and hit “Connect”
  • That is it…

Now some simple tips:

  • By default the refresh interval is set to 5 seconds. You can change this by hitting “Configuration” and then “Change Interval”
  • You can also load Batch Output. This might come in handy when you are a consultant, for instance, and a customer sends you captured data. You can do this under: File -> Load Batch Output
  • You can filter output, very useful if you are looking for info on a specific virtual machine / world! See the filter section.
  • When you click “Charts” and double click “Object Types” you will see a list of metrics that you can create a chart with. Just unfold the ones you need and double click them to add them to the right pane
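If you receive captured batch output and just want a quick look at one virtual machine before loading it into the tool, the same kind of filtering can be sketched in a few lines of Python. This is an illustration only, under assumptions: batch captures (for example taken with something like `esxtop -b -d 5 -n 10 > capture.csv`) are CSV files with one column per counter, and the tiny sample below is made up, including the host name, VM names and values. Real captures are far wider.

```python
import csv
import io

# Made-up sample in the CSV layout assumed above: first column is the
# timestamp, every other column is one counter for one object.
SAMPLE = "\n".join([
    r'"(PDH-CSV 4.0) (UTC)(0)",'
    r'"\\esx01\Group Cpu(1001:web01)\% Ready",'
    r'"\\esx01\Group Cpu(1002:db01)\% Ready"',
    '"07/01/2013 10:00:00","1.20","7.50"',
    '"07/01/2013 10:00:05","0.80","9.10"',
])


def filter_columns(csv_text, needle):
    """Keep the timestamp column plus every column whose name contains needle.

    Returns a list of rows; the first row is the filtered header.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    keep = [i for i, name in enumerate(header) if i == 0 or needle in name]
    return [[row[i] for i in keep] for row in [header, *reader]]


if __name__ == "__main__":
    # For a real capture, read the file contents instead of using SAMPLE.
    for row in filter_columns(SAMPLE, "db01"):
        print(row)
```

For anything beyond a quick check, the filter and chart options in VisualEsxtop itself are the more comfortable route.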

There are a bunch of other cool features in there, like color-coding of important metrics for instance. The fact that you can show multiple windows at the same time is also useful, and of course so are the tooltips that provide a description of each counter! If you ask me, this is a tool everyone should download and check out.

If you have feedback, make sure to leave a comment on the flings site as the engineers of this tool will be tracking that to see where improvements can be made.


You don’t need any brains to listen to music pt IV

It has been a while since I have done one of these articles (1, 2, 3). Last one was in 2010 and it started in 2009. (If you only want to read technical / IT related articles, don’t bother continuing reading this one…) Let me quote my first post in 2009 so those who are recent followers understand where this is coming from:

Luciano Pavarotti once said, “You don’t need any brains to listen to music”, and he’s right… that’s why I love music. Whenever I need to clear my head I pick up my iPhone and go outside for a run.

Every once in a while a “new song / album / band” comes along that helps me clear my mind. That helps me take three steps back when I can’t see the solution to a difficult problem.