Yellow Bricks

Prepare for the worst…

Duncan Epping · Jul 9, 2013 ·

Over the last couple of months I have been contacted by various folks who thought long and hard about their Business Continuity and Disaster Recovery design. They bought a great backup solution which integrated with vSphere and they replicated their SAN to a second site. In their mind they were definitely prepared for the worst… I agree with that to a certain extent: their design was indeed well thought-out and carefully covered all aspects of BC/DR. From an operational perspective though, things looked different. When the first significant failure occurred, they couldn’t fully recall the steps to recover. That is what my tweet below was inspired by…

https://twitter.com/DuncanYB/status/352832506552262658

Funny thing is that this tweet also triggered some responses like “Go SRM” or “that is where Zerto comes in”, and again I agree that an orchestration layer should be part of your DR plan. But when talking about BC/DR I think it is more about the strategy and the processes that will need to be triggered in a particular scenario. What is typically involved? I am not even going into the business-specific side of things and all the politics that come along with it. Instead, look at your process, take one step back and ask yourself: what if this part of the process fails?

One of the things Lee and I will mention multiple times during our VMworld session on Stretched Clusters is: Test It! Not once, not twice but various times, and be prepared for the worst to happen. Yes, none of us likes to test the most destructive and disruptive failure scenario, but you bet that when something goes wrong it will be the scenario you did not test. Although I think SRM, for instance, is a rock solid solution, what if for whatever reason your recovery plan does not work as planned? While testing, make sure you document your recovery plan. Even though you might have a bunch of scripts lying around, who knows if they will work as expected? Some scripts (or SRM type of solutions) have a dependency on certain components / services being up; what if they are not? Besides your BC/DR strategy, a lot of procedures will of course need to be documented. What kind of procedures are we talking about? Here are just a couple of random ones I would suggest you document, at a bare minimum, while testing your scenarios:

Order in which to power-on all physical components in your Datacenter (and power-off)
Location of infrastructure related services (AD, DNS, vCenter, Syslogging, NTP, etc); if they are virtual and on the SAN, document the datastore for instance
Order in which to power-on all infrastructure related services
Order in which to power-on all remaining virtual machines /vApps
How to get your vCenter Server up and running from the commandline (this will make it a lot easier to get the rest of your VMs up and running)
How to power-on virtual machines from the commandline after a failure
How to re-register a virtual machine from the commandline after a failure
How to mount a LUN from the commandline after a failover
How to resignature a LUN from the commandline after a failover
How to restore a full datastore
How to restore a virtual machine
etc etc

Now I can hear some of you think: why would I document that, I know all of that stuff inside out? Well, what if you are on holiday or at home sick? Just imagine your junior colleague is by himself when disaster strikes: does he know in which order the services of that business critical multi-tier application need to start?
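To make the "in which order" question a bit more concrete, here is a minimal sketch of how a documented dependency list can be turned into a power-on (and power-off) order. The service names and their dependencies are made up for illustration and will differ in your environment. It uses Python's standard graphlib module (Python 3.9 or later) and is not tied to any VMware tooling.

```python
from graphlib import TopologicalSorter

# Hypothetical services and dependencies, for illustration only.
# Each key lists the services that must be up before it can start.
DEPENDS_ON = {
    "DNS": [],
    "NTP": [],
    "AD": ["DNS", "NTP"],
    "Database": ["AD"],
    "vCenter": ["AD", "Database"],
    "App tier": ["Database"],
    "Web tier": ["App tier"],
}


def power_on_order(depends_on):
    """Return a start order in which every service follows its dependencies."""
    # static_order() raises graphlib.CycleError if the documented
    # dependencies contain a loop, which is worth knowing before a disaster.
    return list(TopologicalSorter(depends_on).static_order())


def power_off_order(depends_on):
    """Return a shutdown order: the reverse of the power-on order."""
    return list(reversed(power_on_order(depends_on)))


if __name__ == "__main__":
    for step, service in enumerate(power_on_order(DEPENDS_ON), start=1):
        print(f"{step}. {service}")
```

The printed list is something you can put straight into the (physical) runbook, and regenerating it after every test keeps the documented order in line with the documented dependencies.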

When you do document these, make sure to have a (physical) copy available outside of your infrastructure, believe me … you wouldn’t be the first finding yourself locked out of a system and trying to find the documents to recover and then realizing they are stored on the system they need to recover. Those who have ever been in a total datacenter outage know what I am talking about. I have been in the situation where a full datacenter went down due to a power-outage, believe me when I say that bringing up over 300 VMs and all associated physical components without documentation was a living nightmare.

Although you probably get it by now… it is not the tool, but a proper strategy, procedures and documentation that are the key to success! Just do it.

Cool Tool: VisualEsxtop

Duncan Epping · Jul 8, 2013 ·

My ESXTOP page is still one of the most visited pages I have; it actually comes in second, right after the HA Deepdive. Every once in a while I revise the page and this week it was time to add VisualEsxtop to the list of tools people should use. I figured I would write a regular blog post first and roll it up into the page at the same time. So what is VisualEsxtop?

VisualEsxtop is an enhanced version of resxtop and esxtop. VisualEsxtop can connect to VMware vCenter Server or ESX hosts, and display ESX server stats with a better user interface and more advanced features.

That sounds nice, right? Let’s have a look at how it works. This is what I did to get it up and running:

Go to “http://labs.vmware.com/flings/visualesxtop” and click “download”
Unzip “VisualEsxtop.zip” into the folder where you want to store the tool
Go to the folder
Double click “visualesxtop.bat” when running Windows (Or follow William’s tip for the Mac)
Click “File” and “Connect to Live Server”
Enter the “Hostname”, “Username” and “Password” and hit “Connect”
That is it…

Now some simple tips:

By default the refresh interval is set to 5 seconds. You can change this by hitting “Configuration” and then “Change Interval”
You can also load Batch Output. This might come in handy when you are a consultant, for instance, and a customer sends you captured data. You can do this under: File -> Load Batch Output
You can filter output, very useful if you are looking for info on a specific virtual machine / world! See the filter section.
When you click “Charts” and double click “Object Types” you will see a list of metrics that you can create a chart with. Just unfold the ones you need and double click them to add them to the right pane
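If a customer sends you a batch capture and you only care about one virtual machine, you can also trim the file down before loading it anywhere. Below is a rough sketch of that idea in Python. It assumes the capture is a plain CSV (as produced by esxtop in batch mode, -b) with the timestamp in the first column and the VM name appearing somewhere in the counter names. The host name, VM names and values in the sample are made up.

```python
import csv
import io


def filter_columns(csv_text, needle):
    """Keep the first (timestamp) column plus every counter whose name contains needle."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0]
    keep = [0] + [
        index
        for index, name in enumerate(header)
        if index > 0 and needle.lower() in name.lower()
    ]
    return [[row[index] for index in keep] for row in rows]


# Made-up sample in the assumed layout; real captures have far more columns.
SAMPLE = (
    '"Time","\\\\esx01\\Group Cpu(1:web01)\\% Used","\\\\esx01\\Group Cpu(2:db01)\\% Used"\n'
    '"07/08/2013 10:00:00","12.5","48.0"\n'
    '"07/08/2013 10:00:05","14.1","51.3"\n'
)

if __name__ == "__main__":
    for row in filter_columns(SAMPLE, "web01"):
        print(row)
```

This is roughly the same idea as the filter option in VisualEsxtop itself, just applied to the file up front so that large captures become easier to handle.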

There are a bunch of other cool features in there, like color-coding of important metrics for instance. The fact that you can show multiple windows at the same time is also useful, and of course so are the tooltips that provide a description of the counter! If you ask me, this is a tool everyone should download and check out.

If you have feedback, make sure to leave a comment on the flings site as the engineers of this tool will be tracking that to see where improvements can be made.

You don’t need any brains to listen to music pt IV

Duncan Epping · Jul 6, 2013 ·

It has been a while since I have done one of these articles (1, 2, 3). Last one was in 2010 and it started in 2009. (If you only want to read technical / IT related articles, don’t bother continuing reading this one…) Let me quote my first post in 2009 so those who are recent followers understand where this is coming from:

Luciano Pavarotti once said, “You don’t need any brains to listen to music”, and he’s right… that’s why I love music. Whenever I need to clear my head I pick up my ~~mp3 player~~ iphone and go outside for run.

Every once in a while a “new song / album / band” comes along that helps me clear my mind. That helps me take three steps back when I can’t see the solution to a difficult problem.

Unmounting datastore fails due to vSphere HA?

Duncan Epping · Jul 5, 2013 ·

On the VMware Community Forums someone reported he was having issues unmounting datastores when vSphere HA was enabled. Internally I contacted various folks to see what was going on. The error that this customer was hitting was the following:

The vSphere HA agent on host '<hostname>' failed to quiesce file activity on datastore '/vmfs/volumes/<volume id>'

After some emails back and forth with Support and Engineering (awesome to work with such a team by the way!) the issue was discovered, and it seems that two separate issues that had to do with unmounting of datastores have been resolved. Keith Farkas explained on the forums how you can figure out whether you are hitting those exact problems and in which release they are fixed, but as I realize those kinds of threads are difficult to find I figured I would post it here for future reference:

You can determine if you are encountering this issue by searching the VC log files. Find the task corresponding to the unmount request, and see if the following error message is logged during the task’s execution (Fixed in 5.1 U1a):

2012-09-28T11:24:08.707Z [7F7728EC5700 error 'DAS'] [VpxdDas::SetDatastoreDisabledForHACallback] Failed to disable datastore /vmfs/volumes/505dc9ea-2f199983-764a-001b7858bddc on host [vim.HostSystem:host-30,10.112.28.11]: N3Csi5Fault16NotAuthenticated9ExceptionE(csi.fault.NotAuthenticated)

While we are on the subject, I’ll also mention that there is another known issue in VC 5.0 that was fixed in VC5.0U1 (the fix is in VC 5.1 too). This issue is related to unmounting a force mounted VMFS datastore. You can determine whether you are hitting this error by again checking the VC log files. If you see an error message such as the following with VC 5.0, then you may be hitting this problem. A workaround, like above, is to disable HA while you unmount the datastore.

2011-11-29T07:20:17.108-08:00 [04528 info 'Default' opID=19B77743-00000A40] [VpxLRO] -- ERROR task-396 -- host-384 -- vim.host.StorageSystem.unmountForceMountedVmfsVolume: vim.fault.PlatformConfigFault:
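If you don’t feel like eyeballing the VC log files, a small script can do the searching for you. The sketch below is just that, a sketch: the two patterns are derived from the log excerpts quoted above, while the function names and the way you feed it a log file are my own choices. A match only tells you that you may be hitting one of these issues; it is not a diagnosis.

```python
import re

# Patterns derived from the two log excerpts quoted above.
SIGNATURES = {
    "HA failed to disable datastore (fixed in 5.1 U1a)": re.compile(
        r"SetDatastoreDisabledForHACallback\].*Failed to disable datastore"
    ),
    "Unmount of force mounted VMFS datastore (fixed in 5.0 U1)": re.compile(
        r"unmountForceMountedVmfsVolume.*PlatformConfigFault"
    ),
}


def find_known_issues(lines):
    """Yield (line_number, issue_label, line) for every line matching a signature."""
    for number, line in enumerate(lines, start=1):
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                yield number, label, line.rstrip()


def scan_file(path):
    """Scan one log file; the path is whatever your VC log file is called."""
    with open(path, errors="replace") as log:
        return list(find_known_issues(log))
```

Calling scan_file() on a copy of the log returns the matching line numbers, so you can jump straight to the unmount task in question.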

CloudPhysics card builder, how awesome is that?

Duncan Epping · Jul 2, 2013 ·

A while ago Irfan Ahmad (CloudPhysics CTO), Frank Denneman and I were discussing various ideas around the CloudPhysics platform… One of the ideas that Irfan and team pitched was this notion of a card builder. Both Frank and I are advisors to CloudPhysics and immediately jumped up and said “YES PLEASE, when can we have it?” Over the last couple of weeks you have probably seen various blog posts pop up about the card builder that CloudPhysics created and I can honestly say that it has exceeded my expectations. (Suggested reads: William’s blog post, Anthony Spiteri’s post) So what is so special about this card builder? I think this paragraph from William’s blog post describes it best:

The vSphere platform provides a very powerful and rich set of APIs (Application Programming Interface) that can be consumed by both vSphere administrators as well as developers. However, there is a high learning curve when using the API and it takes quite a bit of time to learn and of course your manager is expecting the report to be done in the next 5 minutes. Even with abstraction tools such as PowerCLI, quickly building a robust, scalable and performant script is not always a trivial task, not to mention the maintenance and updates to the script because your manager wants to continually add more things to the report.

Not everyone is an API guru like William or a scripting god like Alan Renouf or Luc Dekens. Sure, these guys will knock out an awesome looking report in a matter of minutes, maybe 10 – 15 minutes depending on what kind of metrics they need and how complex the report will be. For normal people, like myself, who aren’t scripting gods this typically takes a lot longer. Personally I am happy if I can produce something within an hour, but when it gets more complex you are probably talking about way more than that, potentially a full day. The CloudPhysics card builder was designed to lower the barrier to create meaningful reports!

How simple is it? I would say, that if I can figure it out in seconds it is dead simple:

Click “Card Builder”
Click “Create card”
Select the “Property”
I selected “Datastore:Name” and “Datastore:Attached Hosts” and below the results

That is it, really easy right? In just a couple of clicks I can see which hosts are connected to which datastores. Yes, of course this was a simple example, but the nice thing is that you can make it as complex as you want or need. Currently this is in a limited Beta, but soon (I mean really soon!!) this will be exposed to the rest of the world. If you want to know more, just check the webinar recording by Irfan; the link can be found on the CPhy website!

Only thing I wonder is… why on earth did no one come up with this concept before for the virtualization space? Creating reports should always be dead simple if you ask me, and now with CloudPhysics Card Builder it finally is.