Vendors to check out at VMworld

Cormac just released his article about storage vendors to check out at VMworld, right when I was typing up this article. Make sure to read that one as well, as it contains some great suggestions… I was looking at the list of vendors who have a booth at VMworld, and there are a whole bunch I am going to try to check out this round. Of course some of the obvious ones are my friends over at Tintri, Nutanix and Pure Storage, but let's try to list a few lesser known vendors. These are not all storage vendors by the way, but a mix of various types of startups from the VMware ecosystem. I have added my own one-liner to each, so you know what to expect.

  • Actifio – Business Continuity / Disaster Recovery solution that seems to be gaining traction, maybe I should say “Copy Data Management” solution instead, as that is ultimately what it is they do.
  • CloudPhysics – Monitoring / Analytics, the power of many! Or as I stated a while back: where most monitoring solutions stop, CloudPhysics continues.
  • Cumulus Networks – Linux Network Operating System is how they describe themselves, decoupling software from hardware is another way of looking at it… interesting company!
  • Infinio – Downloadable NFS performance enhancer! AKA a memory caching solution for NFS-based infrastructures, check the intro article I wrote a while back…
  • Maxta – Software Defined Storage solution, virtual appliance based and hypervisor agnostic… I haven't spoken with them or seen their solution yet.
  • Panzura – A name that keeps popping up more and more often, a globally distributed cloud storage solution. Haven't dug into it yet, but when I get the chance at VMworld I will…
  • PernixData – Came out of stealth this year, and as you all know is working on a write-back flash caching solution… One of the few offering a clustered write-back solution within the hypervisor.
  • Plexxi – Networking done in a different way, SDN I would say.
  • SolidFire – SolidFire is definitely one cool scale-out storage solution to watch out for, one of the few which actually has a good answer to the question: do you offer Quality of Service? More details about what they do here… Not on the show floor, but outside of the expo.

Just a couple of companies that I feel are interesting and worth talking to.

Change in Permanent Device Loss (PDL) behavior for vSphere 5.1 and up?

Yesterday someone asked me a question on Twitter about a whitepaper by EMC on stretched clusters and Permanent Device Loss (PDL) behavior. For those who don't know what a PDL is, make sure to read this article first. The EMC whitepaper states the following on page 40:

Note:

In a full WAN partition that includes cross-connect, VPLEX can only send SCSI sense code (2/4/3+5) across 50% of the paths since the cross-connected paths are effectively dead. When using ESXi version 5.1 and above, ESXi servers at the non-preferred site will declare PDL and kill VM’s causing them to restart elsewhere (assuming advanced settings are in place); however ESXi 5.0 update 1 and below will only declare APD (even though VPLEX is sending sense code 2/4/3+5). This will result in a VM zombie state. Please see the section Path loss handling semantics (PDL and APD)

Now, as far as I understood it, and I tested this with 5.0 U1, the VMs would indeed not be killed when half of the paths were declared APD and the other half PDL. But I guess something has changed with vSphere 5.1. I knew about one change that isn't clearly documented, so I figured I would do some digging and write a short article on this topic. So here are the changes in behavior:

Virtual Machine using multiple Datastores:

  • vSphere 5.0 U1 and lower: When a Virtual Machine's files are spread across multiple Datastores, it might not be restarted when a Permanent Device Loss scenario occurs.
  • vSphere 5.1 and higher: When a Virtual Machine's files are spread across multiple Datastores and a Permanent Device Loss scenario occurs, vSphere HA will restart the virtual machine, taking the availability of those datastores on the various hosts in your cluster into account.

Half of the paths in APD state:

  • vSphere 5.0 U1 and lower: When the datastore on which your virtual machine resides is not 100% declared in a PDL state (assume half of the paths in APD), the virtual machine will not be killed and restarted.
  • vSphere 5.1 and higher: When the datastore on which your virtual machine resides is not 100% declared in a PDL state (assume half of the paths in APD), the virtual machine will be killed and restarted. This is a huge change compared to 5.0 U1 and lower.

These are the changes in behavior I know about for vSphere 5.1. I have asked engineering to confirm these changes for vSphere Metro Storage Cluster environments; when I have received an answer I will update this blog post.
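For completeness: the "advanced settings" the EMC note refers to are, as far as I know, disk.terminateVMOnPDLDefault (set to True in /etc/vmware/settings on each host) and the vSphere HA advanced option das.maskCleanShutdownEnabled. Below is a minimal pyVmomi sketch, with a hypothetical vCenter name and credentials, that simply lists the HA advanced options per cluster so you can verify the latter is actually in place; treat it as an illustration, not a supported tool.

```python
# Hypothetical sketch: list vSphere HA advanced options per cluster so you can
# check for das.maskCleanShutdownEnabled. The vCenter name and credentials are
# placeholders. Note that disk.terminateVMOnPDLDefault lives in
# /etc/vmware/settings on each host and is not visible through this call.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only: skips certificate validation
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="********", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    for cluster in view.view:
        options = {o.key: o.value
                   for o in (cluster.configurationEx.dasConfig.option or [])}
        print(cluster.name, "das.maskCleanShutdownEnabled =",
              options.get("das.maskCleanShutdownEnabled", "<not set>"))
finally:
    Disconnect(si)
```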

CloudPhysics KB Advisor, how cool is that?

Just imagine: you have 3-8 hosts, an EMC array, Dell hardware, some FibreChannel cards, specific versions of firmware, specific versions of ESXi and vCenter… How do you know what works and what does not? Well, you go to kb.vmware.com, do a search and try to figure out what applies to you and what does not. In this depicted environment of only 3-8 hosts that should be simple, right? Well, with thousands of KB articles I can assure you that it is not… Now imagine that you have 2 arrays and 2 clusters of 8 hosts… Or you add iSCSI to the mix? Yes, it gets extremely complicated really quickly; in fact I would say it is impossible to figure out what does and does not apply to your environment. How do you solve that?

Well, you don't solve that yourself; it requires a big database and an analytics engine behind it… a big data platform even. Luckily, the smart folks at CloudPhysics have solved it for you. Sign up, download the appliance and let them do the work for you… It doesn't get any easier than that if you ask me. Some more details can be found in the press release.

I knew the CPhy guys were working on this; it surprises me that no one else has done this so far, to be honest. What an elegant / simple / awesome solution! Thanks CloudPhysics for making my life once again a whole lot easier.

Hardening recommendation to set limits on VMs or Resource Pools?

I received a question last week about a recommendation in the vSphere 5.1 Hardening Guide. The recommendation is the following:

By default, all virtual machines on an ESXi host share the resources equally. By using the resource management capabilities of ESXi, such as shares and limits, you can control the server resources that a virtual machine consumes.  You can use this mechanism to prevent a denial of service that causes one virtual machine to consume so much of the host’s resources that other virtual machines on the same host cannot perform their intended functions.

Now it might be just me, but I don't get the recommendation, and my answer to this customer was as follows:
Virtual machines can never use more CPU/memory resources than provisioned. For instance, when 4GB of memory is provisioned for a virtual machine, the Guest OS of that VM will never consume more than 4GB. The same applies to CPU: if a VM has a single vCPU, then that VM can never consume more than a single core of a CPU.

So how do I limit my VM? First of all: right-sizing! If your VM needs 4GB then don't provision it with 12GB, as at some point it will consume it. Secondly: shares. Shares are the easiest way to ensure that the "noisy neighbor" isn't pushing away the other virtual machines. Even by leaving the shares set to default you can ensure that at least all "alike VMs" have more or less the same priority when it comes to resources. A quick sketch of how you could raise the shares for a specific VM follows below.
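This is a hypothetical pyVmomi example, reusing the connection object "si" from the earlier snippet and a made-up VM name "db01"; it only bumps the memory shares level to high and leaves limits and reservations alone.

```python
# Hypothetical sketch: raise the memory shares level of VM "db01" to "high".
# Assumes an existing pyVmomi connection object "si" as in the earlier snippet.
from pyVmomi import vim

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "db01")  # raises StopIteration if not found

spec = vim.vm.ConfigSpec()
spec.memoryAllocation = vim.ResourceAllocationInfo(
    shares=vim.SharesInfo(level='high', shares=0))  # shares value only used when level is 'custom'
task = vm.ReconfigVM_Task(spec=spec)  # returns a Task object you can wait on
```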

So what about limits? Try to avoid (VM-level) limits at all times! Why? Well, look at memory for a second: let's say you provision your VM with 4GB and limit it to 4GB, and now someone changes the memory to 8GB but forgets to change the limit. So what happens? Well, your VM uses up the 4GB and moves into the "extra 4GB", but the limit is there, so the VM will experience memory pressure and you will see ballooning / swapping etc. Not a scenario you want to find yourself in, right? Indeed! What about CPU then? Well again, it is a hard limit in ALL scenarios. So if you set a 1GHz limit but have a 2.3GHz CPU, your VM will never consume the full 2.3GHz… A waste? Yes it is. And it is not just VM-level limits; there is also an operational impact with resource pool limits.
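If you want to check whether you already have VMs sitting in that "forgotten limit" state, a small audit helps. Again a hypothetical sketch, reusing the pyVmomi connection "si" from earlier; it flags every VM whose memory limit is set below its provisioned memory, which is exactly the scenario described above.

```python
# Hypothetical audit sketch: flag VMs whose memory limit (in MB, -1 = unlimited)
# is lower than their provisioned memory, i.e. the scenario that leads to
# ballooning / swapping. Assumes an existing pyVmomi connection "si".
from pyVmomi import vim

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
for vm in view.view:
    if vm.config is None:  # skip VMs whose configuration is unavailable
        continue
    limit_mb = vm.config.memoryAllocation.limit
    provisioned_mb = vm.config.hardware.memoryMB
    if limit_mb != -1 and limit_mb < provisioned_mb:
        print(f"{vm.name}: limit {limit_mb} MB < provisioned {provisioned_mb} MB "
              "-> expect ballooning / swapping under memory pressure")
```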

I can understand what the hardening guide is suggesting, but believe me, you don't want to go there. So let it be clear: AVOID using limits at all times!