EMC ViPR: My take

When I started writing this article I knew people were going to say that I am biased considering I work for VMware (EMC owns a part of VMware), but so be it. That has never stopped me from posting about potential competitors, so it will not stop me now. After seeing all the heated debates on Twitter between the various storage vendors, I figured it wouldn’t hurt to provide my perspective. I am looking at this from a VMware infrastructure point of view and with my customer hat on. Considering I have a huge interest in Software Defined Storage solutions, this should be my cup of tea. So here you go, my take on EMC ViPR. Note that I have not actually played with the product yet (like most people providing public feedback), so this is purely about the concept of ViPR.

First of all, when I wrote about Software Defined Storage one of the key requirements I mentioned was the ability to leverage existing legacy storage infrastructures… The primary reason for this is that I don’t expect customers to deprecate their legacy storage all at once, if they do at all. Keep that in mind when reading the rest of the article.

Let me briefly summarize what EMC introduced last week. EMC introduced a brand new product called ViPR. ViPR is a Software Defined Storage product; at least that is how EMC labels it. Those who read my articles on SDS know the “abstract / pool / automate” motto by now, and that is indeed what ViPR can offer:

  • It allows you to abstract the control path from the actual underlying hardware, enabling management of different storage devices through a common interface
  • It enables grouping of different types of storage into a single virtual storage pool. Based on policies/profiles the right type of storage can be consumed
  • It offers a single API for managing various devices; in other words, a lower barrier to automate. On top of that, when it comes to integration it allows you, for instance, to use a single “VASA” (vSphere APIs for Storage Awareness) provider instead of the many needed in a multi-vendor environment

So what does that look like?

What surprised me is that ViPR not only works with EMC arrays of all kinds but will also work with 3rd party storage solutions. For now NetApp support has been announced, but I can see that being extended, and I know EMC is aiming to do so. You can also manage your fabric using ViPR; do note that this is currently limited to just a couple of vendors, but how cool is that? When I did vSphere implementations, the one thing I never liked doing was setting up the FC zones. ViPR makes that a lot easier, and I can also see how this will be very useful in environments where workloads move around between clusters. (Chad has a great article with awesome demos here.) So what does this all mean? Let me give an example from a VMware point of view:

Your infrastructure has 3 different storage systems. Each of these systems has various data services and different storage tiers. When you need to add new datastores or introduce a new storage system without ViPR, it means you will need to add new VASA providers, create LUNs, present them, potentially label them, figure out how automation works since API implementations typically differ, and so on. Yes, a lot of work. But what if you had a system sitting in between you and your physical systems that takes some of these burdens on? That is indeed where ViPR comes into play: a single VASA provider on vSphere, a single API, a single UI and self-service.
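
To make the “single API” point a bit more concrete, below is a rough sketch of what provisioning through a single control point could look like instead of scripting three different vendor interfaces. The controller address, endpoint path and payload fields are purely hypothetical placeholders and not the actual ViPR REST API; this is only meant to illustrate the concept.

```python
import requests

# Hypothetical single control point sitting in front of multiple arrays.
# Endpoint path and payload fields are illustrative placeholders only;
# consult the real ViPR REST API documentation for the actual resources.
CONTROLLER = "https://storage-controller.example.local"
HEADERS = {"X-Auth-Token": "<token>"}


def provision_datastore(name, size_gb, policy):
    """Request a new volume from the virtual pool that matches a policy.

    Which physical array (EMC, NetApp, ...) ends up backing the volume is
    decided by the controller based on the policy, not by this script.
    """
    payload = {
        "name": name,
        "size_gb": size_gb,
        "virtual_pool": policy,          # e.g. "gold", "silver", "bronze"
        "export_to": "vsphere-cluster-01",
    }
    response = requests.post(f"{CONTROLLER}/api/volumes",
                             json=payload, headers=HEADERS, verify=False)
    response.raise_for_status()
    return response.json()


# One call, one API, regardless of which of the three arrays serves the pool.
provision_datastore("ds-app-tier1", 512, "gold")
```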

Now, what is all the drama about then, I can hear some of you think, as it sounds pretty compelling. To be honest, I don’t know. Maybe it was the messaging used by EMC, or maybe the competition in the Software Defined space thought the world was crowded enough already? Maybe it is just the way of the storage industry today; considering all the heated debates witnessed over the last couple of years, that is a perfectly plausible explanation. Or maybe the problem is that ViPR enables a Software Defined Storage strategy without necessarily introducing new storage, meaning that where some pitch a full new stack, in this case the current solution is used and a man-in-the-middle solution is introduced.

Don’t get me wrong, I am not saying that ViPR is THE solution for everyone. But it definitely bridges a gap and enables you to realize your SDS strategy. (Yes I know, there are other vendors who offer something similar.) ViPR can help those who have an existing storage solution to abstract / pool / automate. Indeed, not everyone can afford to swap out their full storage infrastructure for a new so-called Software Defined Storage device, and that is where ViPR will come in handy. On top of that, some of you have, and probably always will have, a multi-vendor strategy… again, this is where ViPR can help simplify your operations. The nice thing is that ViPR is an open platform; according to Chad, source code and examples of all critical elements will be published so that anyone can ensure their storage system works with ViPR.

I would like to see ViPR integrate with host-local caching solutions; it would be nice to be able to accelerate specific datastores (read caching / write-back / write-through) using a single interface / policy, meaning as part of the policy ViPR surfaces to vCenter. The same applies to host-side replication solutions, by the way. I would also be interested in seeing how ViPR will integrate with solutions like Virtual Volumes (VVOLs) when it is released… but I guess time will tell.

I am about to start playing with ViPR in my lab, so this is all based on what I have read and heard about ViPR (I like this series by Greg Schultz on ViPR). My understanding, and opinion, might change over time, and if so I will be the first to admit it and edit this article accordingly.

I wonder how those of you who are on the customer side look at ViPR, and I want to invite you to leave a comment.

vSphere HA – VM Monitoring sensitivity

Last week there was a question on VMTN about VM Monitoring sensitivity. I could have sworn I had written an article on that exact topic, but I couldn’t find it, so I figured I would do a new one with a table explaining the levels of sensitivity you can configure VM Monitoring to.

The question was based on a false positive response from VM Monitoring: in this case the virtual machine was frozen due to the consolidation of snapshots, and VM Monitoring responded by restarting the virtual machine. As you can imagine the admin wasn’t too impressed, as it caused downtime for his virtual machine. He wanted to know how to prevent this from happening. The answer was simple: change the sensitivity, as it is set to “high” by default.

As shown in the table below, high sensitivity means that VM Monitoring responds to a missing “VMware Tools heartbeat” within 30 seconds. Before VM Monitoring restarts the VM, though, it will check whether there was any storage or networking I/O in the last 120 seconds (advanced setting: das.iostatsInterval). If the answer is no to both, the VM will be restarted. So if you feel VM Monitoring is too aggressive, change it accordingly!

Sensitivity   Failure Interval   Max Failures   Max Failures Time Window
Low           120 seconds        3              7 days
Medium        60 seconds         3              24 hours
High          30 seconds         3              1 hour

Do note that you can also change the above settings individually in the UI, as seen in the screenshot below. For instance, you could manually increase the failure interval to 240 seconds. How you should configure it is something I cannot answer; it should be based on what you feel is an acceptable response time to a failure, and on what the sweet spot is to avoid a false positive… A lot to think about indeed when introducing VM Monitoring.
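
For those who would rather script this than click through the UI, here is a minimal pyVmomi sketch along those lines that sets the cluster-wide VM Monitoring defaults, for instance raising the failure interval to 240 seconds. The vCenter address, credentials and cluster name are placeholders, and you should verify the exact property names and accepted values against the vSphere API reference for your version.

```python
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholder connection details.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.local",
                  user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)

# Locate the cluster; this assumes a simple inventory with clusters directly
# under the datacenter's host folder.
content = si.RetrieveContent()
cluster = None
for dc in content.rootFolder.childEntity:
    for entity in dc.hostFolder.childEntity:
        if isinstance(entity, vim.ClusterComputeResource) and entity.name == "Cluster-01":
            cluster = entity

# Relax the default VM Monitoring settings for all VMs in the cluster,
# e.g. a 240 second failure interval instead of the 30 second "high" default.
monitoring = vim.cluster.VmToolsMonitoringSettings(
    vmMonitoring="vmMonitoringOnly",  # VM Monitoring only, no app monitoring
    failureInterval=240,              # seconds without a heartbeat before acting
    minUpTime=120,
    maxFailures=3,
    maxFailureWindow=3600,            # window (in seconds) for maxFailures
)
das_config = vim.cluster.DasConfigInfo(
    defaultVmSettings=vim.cluster.DasVmSettings(
        vmToolsMonitoringSettings=monitoring))
spec = vim.cluster.ConfigSpecEx(dasConfig=das_config)

# Apply the change and clean up (waiting for the task is omitted here).
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
Disconnect(si)
```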

vSphere 5.1 Storage DRS Interoperability

A while back I did this article on Storage DRS Interoperability. I had questions about it last week, so I figured I would write a new article which reflects the current state (vSphere 5.1). I also included some details that are part of the interoperability white paper Frank and I wrote, so that we have a fairly complete picture. That white paper is based on 5.0; it will probably be updated at some point in the future.

The first column describes the feature or functionality, the second column the recommended or supported automation mode, and the third and fourth columns show which type of balancing is supported.

Capability                          Automation Mode        Space Balancing   I/O Metric Balancing
Array-based Snapshots               Manual                 Yes               Yes
Array-based Deduplication           Manual                 Yes               Yes
Array-based Thin Provisioning       Manual                 Yes               Yes
Array-based Auto-Tiering            Manual                 Yes               No
Array-based Replication             Manual                 Yes               Yes
vSphere Raw Device Mappings         Fully Automated        Yes               Yes
vSphere Replication                 Fully Automated        Yes               Yes
vSphere Snapshots                   Fully Automated        Yes               Yes
vSphere Thin Provisioned Disks      Fully Automated        Yes               Yes
vSphere Linked Clones               Fully Automated (*)    Yes               Yes
vSphere Storage Metro Clustering    Manual                 Yes               Yes
vSphere Site Recovery Manager       Not Supported          n/a               n/a
VMware vCloud Director              Fully Automated (*)    Yes               Yes
VMware View (Linked Clones)         Not Supported          n/a               n/a
VMware View (Full Clones)           Fully Automated        Yes               Yes

(*) = Change from 5.0

DRS not taking CPU Ready Time into account? Need your help!

For years rumors have been floating around that DRS does not take CPU Ready Time (%RDY) into account when it comes to load balancing the virtual infrastructure. The fact is that %RDY has always been part of the DRS algorithm, not as a first-class citizen but as part of CPU demand, which is a combination of various metrics that includes %RDY. Still, one might ask why %RDY is not a first-class citizen.

There is a good reason %RDY isn’t, though. Just think about what DRS is and does, and how it actually goes about balancing out the environment while trying to please all virtual machines. There are a lot of possible ways to move virtual machines around in a cluster, so you can imagine that it is really complex (and expensive) to calculate the possible impact of migrating a virtual machine “from a host” or “to a host” for all of the first-class citizen metrics; the rough calculation below gives an idea of the scale.
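
A purely illustrative back-of-the-envelope calculation (the cluster size is made up and says nothing about how DRS actually searches):

```python
# Purely illustrative back-of-the-envelope math for a made-up cluster size.
hosts = 32
vms = 512

# Each VM could in principle be moved to any of the other hosts.
single_moves = vms * (hosts - 1)
print(single_moves)        # 15872 candidate single migrations per pass

# Considering combinations of just two moves already explodes the search space,
# and each candidate still has to be evaluated against every metric.
two_move_combos = single_moves * (single_moves - 1) // 2
print(two_move_combos)     # 125952256, roughly 126 million combinations
```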

Now, for a long time the DRS engineering team has been looking for situations in the field where a cluster is balanced according to DRS but there are still virtual machines experiencing performance problems due to high %RDY. The DRS team really wants to fix this problem or bust the myth – what they need is hard data. In other words, vc-support bundles from vCenter and vm-support bundles from all hosts with high ready times. So far, no one has been able to provide these logs / cold hard facts.

If you see this scenario in your environment regularly please let me know. I will personally get you in touch with our DRS engineering team and they will look at your environment and try to solve this problem once and for all. We need YOU!

Tested / Supported / Certified by VMware? (caching / DR solutions)

Lately I have been receiving more and more questions about support for specific “hypervisor side” solutions; by that I mean how VMware deals with solutions which are installed within the hypervisor. I have always found it very difficult to dig up details on this, both externally and internally. I figured it was time to try to make things a bit clearer, if that is possible at all.

For VMware Technology Partners there are various programs they can join. Some of the programs include a rigorous VMware test/certification process which results in being listed on the VMware Compatibility Guide (VCG). You can find the solutions which are officially certified on the VMware Compatibility Guide here; just type the name of the solution in the search bar. For instance, when I type in “Atlantis” I get a link to the Atlantis ILIO page and can see which version of ILIO is supported today with which version of vSphere. Note that in this case only vSphere 4.x is listed, but Atlantis assured me that this will be updated to include vSphere 5.x soon.

Then there are the Partner Verified and Supported Product (PVSP) solutions. These are typically solutions that do not fit the VCG, for instance when it is a new type of solution and there is no certification process for it yet. Of course there are still strict guidelines for these solutions to be listed; for instance, your solution will only be listed on the PVSP (and the VCG for that matter) when you are using public APIs. An example is the Riverbed Steelhead appliance, which follows all of the guidelines and is listed on the PVSP as such. You can find all the solutions which are part of the PVSP program here.

Finally there is the VMware Solutions Exchange section on vmware.com. This is where you will find most other solutions… solutions which are not officially tested/certified (part of the VCG) or part of the PVSP program for various reasons. Note that these solutions, although listed, are not supported by VMware in any way. Of course VMware Support will typically do its best to help a customer out. However, it is not uncommon to be asked to reproduce the problem in an environment which does not have that solution installed, so that it can be determined what is causing the issue and who is best equipped to help solve it.

I am not saying that solutions that are not listed on the VCG or PVSP should be avoided. They could very well solve the problem you have, or fulfill your business requirements and as such be the “must use” component in your stack. It should be noted, though, that introducing any 3rd party solution carries a “risk”, and from an architectural and operational perspective it is heavily recommended to validate what that risk exactly is. How can you minimize that risk? What will you need to do to get the right level of support? And ultimately, which company is responsible for which part? When push comes to shove, you don’t want to be that person spending hours on the phone just figuring out who is supporting what! You just want to be on the phone to solve the problem, right?!

I hope this helps some of you out there who asked me this question.

** Note: the above is not an official VMware Support statement or a VMware Partner Alliances statement; these are my observations made while digging through the links on vmware.com **