Unmounting datastore fails due to vSphere HA?

On the VMware Community Forums someone reported he was having issues unmounting datastores when vSphere HA was enabled. Internally I contacted various folks to see what was going on. The error that this customer was hitting was the following:

The vSphere HA agent on host '<hostname>' failed to quiesce file activity on datastore '/vmfs/volumes/<volume id>'

After some emails back and forth with Support and Engineering (awesome to work with such a team by the way!) the issue was discovered, and it seems that two separate issues related to unmounting datastores have been resolved. Keith Farkas explained on the forums how you can figure out whether you are hitting those exact problems and in which release they are fixed, but as I realize those kinds of threads are difficult to find I figured I would post it here for future reference:

You can determine if you are encountering this issue by searching the VC log files. Find the task corresponding to the unmount request, and see if the following error message is logged during the task’s execution (fixed in 5.1 U1a):

2012-09-28T11:24:08.707Z [7F7728EC5700 error 'DAS'] [VpxdDas::SetDatastoreDisabledForHACallback] Failed to disable datastore /vmfs/volumes/505dc9ea-2f199983-764a-001b7858bddc on host [vim.HostSystem:host-30,10.112.28.11]: N3Csi5Fault16NotAuthenticated9ExceptionE(csi.fault.NotAuthenticated)

While we are on the subject, I’ll also mention that there is another known issue in VC 5.0 that was fixed in VC 5.0 U1 (the fix is in VC 5.1 too). This issue relates to unmounting a force-mounted VMFS datastore. You can determine whether you are hitting this error by again checking the VC log files. If you see an error message such as the following with VC 5.0, then you may be hitting this problem. A workaround, like above, is to disable HA while you unmount the datastore.

2011-11-29T07:20:17.108-08:00 [04528 info 'Default' opID=19B77743-00000A40] [VpxLRO] -- ERROR task-396 -- host-384 -- vim.host.StorageSystem.unmountForceMountedVmfsVolume: vim.fault.PlatformConfigFault:
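
If you would rather not eyeball the logs, the check can be scripted. Below is a minimal sketch in Python that scans log lines for the two signatures quoted above. The function name, the labels and the sample list are mine and purely illustrative; in practice you would feed it the lines of your own VC log files, wherever those live in your environment.

```python
import re

# Signatures taken from the two log excerpts quoted above.
# The labels are illustrative names, not official issue identifiers.
SIGNATURES = {
    "HA unmount failure (fixed in 5.1 U1a)": re.compile(
        r"SetDatastoreDisabledForHACallback\].*Failed to disable datastore"
    ),
    "force-mounted VMFS unmount failure (fixed in VC 5.0 U1)": re.compile(
        r"unmountForceMountedVmfsVolume.*PlatformConfigFault"
    ),
}


def find_known_unmount_issues(lines):
    """Return (line_number, label) for every line matching a known signature."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits.append((number, label))
    return hits


# Sample input: the two log lines from this post plus one unrelated line.
sample_log = [
    "2012-09-28T11:24:08.707Z [7F7728EC5700 error 'DAS'] "
    "[VpxdDas::SetDatastoreDisabledForHACallback] Failed to disable datastore "
    "/vmfs/volumes/505dc9ea-2f199983-764a-001b7858bddc on host "
    "[vim.HostSystem:host-30,10.112.28.11]: "
    "N3Csi5Fault16NotAuthenticated9ExceptionE(csi.fault.NotAuthenticated)",
    "2011-11-29T07:20:17.108-08:00 [04528 info 'Default' opID=19B77743-00000A40] "
    "[VpxLRO] -- ERROR task-396 -- host-384 -- "
    "vim.host.StorageSystem.unmountForceMountedVmfsVolume: vim.fault.PlatformConfigFault:",
    "an unrelated log line",
]

for number, label in find_known_unmount_issues(sample_log):
    print(f"line {number}: {label}")
```

A match only tells you that the log line looks like one of the two known problems; you still need to confirm it belongs to the task of your unmount request.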

CloudPhysics card builder, how awesome is that?

A while ago Irfan Ahmad (CloudPhysics CTO), Frank Denneman and I were discussing various ideas around the CloudPhysics platform… One of the ideas that Irfan and team pitched was this notion of a card builder. Both Frank and I are advisors to CloudPhysics and immediately jumped up and said “YES PLEASE, when can we have it?” Over the last couple of weeks you have probably seen various blog posts pop up about the card builder that CloudPhysics created and I can honestly say that it has exceeded my expectations. (Suggested reads: William’s blog post, Anthony Spiteri’s post) So what is so special about this card builder? I think this paragraph from William’s blog post describes it best:

The vSphere platform provides a very powerful and rich set of APIs (Application Programming Interface) that can be consumed by both vSphere administrators as well as developers. However, there is a high learning curve when using the API and it takes quite a bit of time to learn and of course your manager is expecting the report to be done in the next 5 minutes. Even with abstraction tools such as PowerCLI, quickly building a robust, scalable and performant script is not always a trivial task, not to mention the maintenance and updates to the script because your manager wants to continually add more things to the report.

Not everyone is an API guru like William or a scripting god like Alan Renouf or Luc Dekens. Sure, these guys will knock out an awesome looking report in a matter of minutes, maybe 10 – 15 minutes depending on what kind of metrics they need and how complex the report will be. For normal people, like myself, who aren’t scripting gods this typically takes a lot longer. Personally I am happy if I can produce something within an hour, but when it gets more complex you are probably talking about way more than that, potentially a full day. The CloudPhysics card builder was designed to lower the barrier to create meaningful reports!

How simple is it? I would say, that if I can figure it out in seconds it is dead simple:

  1. Click “Card Builder”
    CloudPhysics Card Builder
  2. Click “Create card”
    CloudPhysics Card Builder
  3. Select the “Property”
    CloudPhysics Card Builder
  4. I selected “Datastore:Name” and “Datastore:Attached Hosts” and below the results
    CloudPhysics Card Builder

That is it, really easy right? In just a couple of clicks I can see which hosts are connected to which datastores. Yes of course this was a simple example, but the nice thing is that you can make it as complex as you want or need. Currently this is in a limited Beta, but soon (I mean really soon!!) this will be exposed to the rest of the world. If you want to know more, just check the webinar recording by Irfan; the link can be found on the CloudPhysics website!

Only thing I wonder is… why on earth did no one come up with this concept before for the virtualization space? Creating reports should always be dead simple if you ask me, and now with CloudPhysics Card Builder it finally is.

Survey time!

Today I received two requests to plug a survey. The first survey is on the topic of NAS usage and cloud storage and the second one on the topic of multi-tier apps. Please fill them out, as this is your way of defining what the future of VMware (potential) products and features looks like!

Takes about 2 minutes to fill out:

NAS / Cloud Storage survey:

This is a survey on alternatives to traditional NAS storage systems! We would like your opinion on NAS usage within your environment and consideration for alternatives to NAS solutions.

http://bit.ly/1aohsoG

And the second one, takes 10 minutes roughly, but with the chance of winning a giftcard:

Multi-tier app Survey:
We would like your input on virtualization of multi-tier applications in our quest for continuous improvement

We have created a survey to capture your feedback: http://tinyurl.com/VMware-multi-tier-application . The survey should only take 5-10 minutes to complete.

As an incentive, respondents will be entered in a drawing to win one of three $50 Visa gift cards!

The survey will be open until July 9, 2013, so please participate soon!

Startup intro: SolidFire

This seems to be becoming a true series, introducing startups… Now in the case of SolidFire I am not really sure if I should use the word startup as they have been around since 2010. But then again, it is not a consumer solution that they’ve created and enterprise storage platforms do typically take a lot longer to develop and mature. SolidFire was founded in 2010 by Dave Wright who discovered a gap in the current storage market when he was working for Rackspace. The opportunity Dave saw was in the Quality of Service area. Not many storage solutions out there could provide predictable performance in almost every scenario, were designed for multi-tenancy and offered a rich API. Back then the term Software Defined Storage wasn’t coined yet, but I guess it is fair to say that is how we would describe it today. This is actually how I got in touch with SolidFire. I wrote various articles on the topic of Software Defined Storage, and tweeted about this topic many times, and SolidFire was one of the companies who consistently joined the conversation. So what is SolidFire about?

SolidFire is a storage company: they sell storage systems and today they offer two models, namely the SF3010 and the SF6010. What is the difference between these two? Cache and capacity! With the SF3010 you get 72GB of cache per node and it uses 300GB SSDs, where the SF6010 gives you 144GB of cache per node and uses 600GB SSDs. Interesting? Well to a certain point I would say, SolidFire isn’t really about the hardware if you ask me. It is about what is inside the box, or boxes I should say as the starting point is always 5 nodes. So what is inside?

Architecture

SolidFire’s architecture is based on a scale-out model and of course flash, in the form of SSD. You start out with 5 nodes and you can go up to 100 nodes, all connected to your hosts via iSCSI. Those 100 nodes would be able to provide you 5 million IOps and about 2.1 Petabyte of capacity. Each node that is added linearly scales performance and of course adds capacity. Of course SolidFire offers deduplication, compression and thin provisioning. Considering it is a scale-out model it is probably not needed to point this out, but dedupe and compression are cluster wide. Now the nice thing about the SolidFire architecture is that they don’t use traditional RAID, which means that the long rebuild times when a disk or a node fails do not apply to SolidFire. Rather, SolidFire evenly distributes data across all disks and nodes, so when a single disk or even a node fails, rebuild time is not constrained by a limited amount of resources; many components can help in parallel to get back to a normal state. What I liked most about their architecture is that it already closely aligns with VMware’s Virtual Volume (VVOL) concept; SolidFire is prepared for VVOLs when they are released.
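
To get a feel for what linear scaling means at smaller cluster sizes, here is a rough back-of-the-envelope sketch in Python. It simply divides the 100-node figures quoted above by the node count, so the per-node values are averages I derived, not vendor specifications, and the capacity number does not distinguish between the two models.

```python
# Figures quoted above: 100 nodes deliver about 5 million IOps and 2.1 PB.
MAX_NODES = 100
MAX_CLUSTER_IOPS = 5_000_000
MAX_CLUSTER_CAPACITY_TB = 2100  # 2.1 PB, using 1 PB = 1000 TB

# Assumption: perfectly linear scaling, so per-node values are plain averages.
IOPS_PER_NODE = MAX_CLUSTER_IOPS // MAX_NODES
CAPACITY_TB_PER_NODE = MAX_CLUSTER_CAPACITY_TB / MAX_NODES


def estimate_cluster(nodes):
    """Estimate (IOps, capacity in TB) for a cluster of the given size."""
    if not 5 <= nodes <= MAX_NODES:
        raise ValueError("cluster size must be between 5 and 100 nodes")
    return nodes * IOPS_PER_NODE, nodes * CAPACITY_TB_PER_NODE


for nodes in (5, 20, 100):
    iops, capacity_tb = estimate_cluster(nodes)
    print(f"{nodes:>3} nodes: ~{iops:,} IOps, ~{capacity_tb:,.0f} TB")
```

Under these assumptions the 5-node starting point works out to roughly 250,000 IOps.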

Quality of Service

I have already briefly mentioned this, but Quality of Service (QoS) is one of the key drivers of the SolidFire solution. It revolves around having the ability to provide an X amount of capacity with an X amount of performance (IOps). What does this mean? SolidFire allows you to specify a minimum and maximum number of IOps for a volume, and also a burst space. Let’s quote the SolidFire website as I think they explain it in a clear way:

  • Min IOPS - The minimum number of I/O operations per-second that are always available to the volume, ensuring a guaranteed level of performance even in failure conditions.
  • Max IOPS - The maximum number of sustained I/O operations per-second that a volume can process over an extended period of time.
  • Burst IOPS - The maximum number of I/O operations per-second that a volume will be allowed to process during a spike in demand, particularly effective for data migration, large file transfers, database checkpoints, and other uneven latency sensitive workloads.

Now I do want to point out here that SolidFire storage systems have no “form of admission control” when it comes to QoS. Although it is mentioned that there is a guaranteed level of performance, this is up to the administrator: you as the admin will need to do the math and not overprovision from a performance point of view if you truly want to guarantee a specific performance level. When you do the math, you will need to take failure scenarios into account!
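
The math itself is not complicated. A minimal sketch in Python, assuming you know (or have measured) what a single node can deliver: add up the Min IOPS of all volumes and compare the total with what the cluster can still deliver after a node failure. The function name, the volume list and the 50,000 IOps per node (derived from the "100 nodes, 5 million IOps" figure) are my assumptions, not SolidFire specifications or part of their API.

```python
def min_iops_is_guaranteed(volume_min_iops, nodes, iops_per_node, node_failures=1):
    """Check whether the summed Min IOPS still fits after losing some nodes.

    Returns (fits, committed_iops, available_iops).
    """
    surviving_nodes = nodes - node_failures
    if surviving_nodes <= 0:
        raise ValueError("no nodes left after the assumed failures")
    available = surviving_nodes * iops_per_node
    committed = sum(volume_min_iops)
    return committed <= available, committed, available


# Hypothetical example: 14 volumes with a Min IOPS of 15,000 each on 5 nodes.
volumes = [15_000] * 14
fits, committed, available = min_iops_is_guaranteed(
    volumes, nodes=5, iops_per_node=50_000
)
print(f"committed={committed:,} available={available:,} guaranteed={fits}")
```

In this example the 210,000 committed IOps fit when all 5 nodes are healthy (250,000 available) but no longer fit after a single node failure (200,000 available), which is exactly the kind of overprovisioning you want to catch up front.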

One thing that my automation friends William Lam and Alan Renouf will like is that you can manage all these settings using their REST-based API.

(VMware) Integration

Of course during the conversation integration came up. SolidFire is all about enabling their customers to automate as much as they possibly can and have implemented a REST-based API. They are heavily investing in, for instance, integration with OpenStack but also with VMware. They offer full support for the vSphere Storage APIs – Storage Awareness (VASA) and are also working towards full support for vSphere Storage APIs – Array Integration (VAAI). Currently not all VAAI primitives are supported but they promised me that this is a matter of time. (They support: Block Zeroing, Space Reclamation, Thin Provisioning. See HCL for more details.) On top of that they are also looking at the future and going full steam ahead when it comes to Virtual Volumes. Obvious question from my side: what about replication / SRM? This is being worked on, hopefully more news about this soon!

Now with all this integration did they forget about what is sitting in between their storage system and the compute resources? In other words what are they doing with the network?

Software Defined Networking?

I can be short: no, they did not forget about the network. SolidFire is partnering with Plexxi and Arista to provide a great end-to-end experience when it comes to building a storage environment. Where with Arista the focus is currently more on monitoring the different layers, Plexxi seems to focus more on the configuration and optimization for performance aspect. No end-to-end QoS yet, but a great step forward if you ask me! I can see this being expanded in the future.

Wrapping up

I had already briefly looked at SolidFire after the various tweets we exchanged but this proper introduction has really opened my eyes. I am impressed by what SolidFire has achieved in a relatively short amount of time. Their solution is all about customer experience, whether that is performance related or the ability to automate the full storage provisioning process… their architecture / concept caters for this. I have definitely added them to my list of storage vendors to visit at VMworld, and I am hoping that those who are looking into Software Defined Storage solutions will do the same, as SolidFire belongs on that list.

CPU Affinity and vSphere HA

On the VMware Community Forums someone asked today if CPU Affinity and vSphere HA worked in conjunction and if it was supported. To be fair I never tested this scenario, but I was certain it was supported and would work… Never hurts to validate though before you answer a question like that. I connected to my lab and disabled DRS for a VM so I could enable CPU affinity. I pinned the CPUs down to core 0 and 1 as shown in the screenshot below:

cpu affinity

After pinning the vCPUs to a set of logical CPUs I powered on the VM. The result was, as expected, a “Protected” virtual machine as shown in the screenshot below.

HA protection

But would it get restarted if anything happened to the host? Yes it would, and I tested this of course. I switched the server off which was running this virtual machine and within a minute vSphere HA restarted the virtual machine on one of the other hosts in the cluster. So there you have it, CPU Affinity and vSphere HA work fine.

PS: Would I ever recommend using CPU Affinity? No I would not!

vCenter Single Sign On aka SSO, what do I recommend?

I have had various people asking me over the last 9 months what I would recommend when it comes to SSO. Would I use a multi-site configuration, maybe even an HA configuration or would I go for the Basic configuration? What about when I have multiple vCenter Server instances, would I share the SSO instance between these or deploy multiple SSO instances? All very valid questions I would say. I have kept my head low intentionally the last year to be honest, but after reading this excellent blog post by Josh Odgers where he posted an awesome architectural decision flow chart I figured it was time to voice my opinion. Just look at this impression of the flow chart (for full resolution visit Josh’s website):

Complex? Yes I agree, probably too complex for most people. Difficult to digest, and that is not due to Josh’s diagramming skills. SSO has various deployment models (multi site, HA, basic), and then there is the option to deploy it centralized or localized as well. On top of that there is also the option to protect it using Heartbeat. Now you can probably understand why the flow diagram ended up looking complex. Many different options but what makes sense?

Justin King already mentioned this in his blog series on SSO (part 1, 2, 3, 4) as a suggestion, but let’s drive it home! Although it might seem like it defeats the purpose I would recommend the following in almost every single scenario one can imagine: Basic SSO deployment, local to vCenter Server instance. Really, the KISS principle applies here. (Keep It Simple SSO!) Why do I recommend this? Well for the following simple reasons:

  • SSO in HA mode does not make sense as clustering the SSO database is not supported, so although you just deployed an HA solution you still end up with a single point of failure!
  • You could separate SSO from vCenter, but why would you create a dependency on network connection between the vCenter instance and the SSO instance? It is asking for trouble.
  • A centralized SSO instance sounds like it makes sense, but the problem here is that it requires all connecting vCenter instances to be on the same version. Yes indeed, this complicates your operational model. So go localized for now.

So is there a valid reason to deviate from this? Yes there is and it is called Linked Mode. Linked Mode “requires” SSO to be deployed in a “multi-site” configuration, so a requirement for Linked Mode is probably one of the few reasons I would not follow the KISS principle… Personally I never use Linked Mode though, I find it confusing.

So there you have it, KISS!

(ab)using vSphere advanced settings?

Almost on a daily basis I get questions from colleagues and customers about specific advanced settings. Somehow they spotted a vSphere advanced setting and wonder if they should set it. They go on a hunt to figure out what this specific vSphere advanced setting does and typically find a description on a random website that makes it sound like it is a good idea to configure it. I even had someone asking recently if I could give a list of all optimized values for the advanced kernel parameters. My answer was short and maybe a bit blunt, but I think it was clear:

I hope the sign above makes it clear you should not randomly set advanced settings. Some of you will laugh and say “well that is obvious” while others probably will scratch their head and open their vSphere client and check the advanced settings section. I know I discuss advanced settings every once in a while, but you should only apply these settings when:

  1. You have a requirement to implement this advanced setting, do not tweak them “just because you can”. An example would be in a stretched cluster you set “disk.terminateVMOnPDLDefault” because of the infrastructure implemented.
  2. The advanced setting solves a problem in your environment (and preferably in that case see 3)
  3. When recommended by VMware Global Support Services

If you have implemented an advanced setting, document it and with every update or upgrade validate whether it is still applicable to that specific version. (If you are aspiring to be a VCDX, this is key.) If it no longer applies, remove it and revert to the default!
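
How you document it is up to you, but even a tiny structured record beats a note in someone's head. A minimal sketch in Python of what such a record could look like, with a helper that lists the settings you still need to validate after an upgrade. The record fields, the value and the version strings are hypothetical examples of mine; only the setting name comes from the list above.

```python
from dataclasses import dataclass


@dataclass
class AdvancedSettingRecord:
    name: str
    value: str
    reason: str             # requirement, problem solved, or support recommendation
    validated_version: str  # version this setting was last validated against


def settings_to_revalidate(records, current_version):
    """Return the names of settings not yet validated against the current version."""
    return [r.name for r in records if r.validated_version != current_version]


# Hypothetical documentation entry; value and versions are illustrative only.
records = [
    AdvancedSettingRecord(
        name="disk.terminateVMOnPDLDefault",
        value="True",
        reason="Stretched cluster requirement",
        validated_version="5.0 U1",
    ),
]

print(settings_to_revalidate(records, current_version="5.1"))
```

Anything this returns after an upgrade is a setting to check against the new version and, if it no longer applies, to revert.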