Tested / Supported / Certified by VMware? (caching / dr solutions)

Lately I have been receiving more and more questions around support for specific “hypervisor side” solutions. With that meaning, how VMware deals with solutions which are installed within the hypervisor. I have always found it very difficult to dig up details around this both externally and internally. I figured it was time to try to make things a bit more clear, if possible at all.

For VMware Technology Partners there are various programs they can join. Some of the programs include a rigid VMware test/certification process which results in being listed on the VMware Compatibility Guide (VCG). You can find those which are officially certified on our VMware Compatibility Guide here, just type the name of the solution in the search bar. For instance when I type in “Atlantis” I get a link to the Atlantis ILIO page and can see which version of ILIO is supported today with which version of vSphere. Note that in this case on vSphere 4.x is listed, but Atlantis assured me that this will be updated to include vSphere 5.x soon.

Then there are the Partner Verified and Supported Product (PVSP) solutions. These are typically solutions that do not fit the VCG, for instance when it is new type of solution and there is no certification process yet. Now of course there are still strict guidelines for these solutions to be listed. For instance, your solution will only be listed on the PVSP (and the VCG for that matter) when you are using public APIs. An example for instance is the Riverbed Steelhead appliance, it follows all of the guidelines and is listed on the PVSP as such. You can find all the solutions which are part of the PVSP program here.

Finally there is the VMware Solutions Exchange section on vmware.com. This is where you will find most other solutions… Solutions which are not officially tested/certified (part of the VCG) or part of the PVSP program because of various reasons. Note that these solutions, although listed, are not supported by VMware in anyway. Now, of course VMware Support typically will do its best to help a customer out. However, it is not uncommon to be asked to reproduce the problem on an environment which does not have that solution installed so that it can be determined what is causing the issue and who is best equipped to help solving the issue.

I am not saying that those solution that are not listed on the VCG or PVSP should be avoided. They could very well solve that problem you have, or be the solution to fulfill your business requirements and as such be the “must use” component in your stack. It should be noted though that when introducing any 3rd party solution that there is a “risk” associated with it. From an architectural and operational perspective it is heavily recommended to validate what that risk exactly is. How you can minimize that risk? What you will need to do to get the right level of support? And ultimately, which company is responsible for which part? As when push comes to shove, you don’t want to be that person spending hours on the phone just figuring out who is supporting what! You just want to be on the phone to solve the problem right?!

I hope this helps some of you out there who asked me this question.

** Note: the above is not an official VMware Support statement or a VMware Partner Alliances statement, these are my observations made while digging through the links on vmware.com **

What is static overhead memory?

We had a discussion internally on static overhead memory. Coincidentally I spoke with Aashish Parikh from the DRS team on this topic a couple of weeks ago when I was in Palo Alto. Aashish is working on improving the overhead memory estimation calculation so that both HA and DRS can be even more efficient when it comes to placing virtual machines. The question was around what determines the static memory and this is the answer that Aashish provided. I found it very useful hence the reason I asked Aashish if it was okay to share it with the world. I added some bits and pieces where I felt additional details were needed though.

First of all, what is static overhead and what is dynamic overhead:

  • When a VM is powered-off, the amount of overhead memory required to power it on is called static overhead memory.
  • Once a VM is powered-on, the amount of overhead memory required to keep it running is called dynamic or runtime overhead memory.

Static overhead memory of a VM depends upon various factors:

  1. Several virtual machine configuration parameters like the number vCPUs, amount of vRAM, number of devices, etc
  2. The enabling/disabling of various VMware features (FT, CBRC; etc)
  3. ESXi Build Number

Note that static overhead memory estimation is calculated fairly conservative and we take a worst-case-scenario in to account. This is the reason why engineering is exploring ways of improving it. One of the areas that can be improved is for instance including host configuration parameters. These parameters are things like CPU model, family & stepping, various CPUID bits, etc. This means that as a result, two similar VMs residing on different hosts would have different overhead values.

But what about Dynamic? Dynamic overhead seems to be more accurate today right? Well there is a good reason for it, with dynamic overhead it is “known” where the host is running and the cost of running the VM on that host can easily be calculated. It is not a matter of estimating it any longer, but a matter of doing the math. That is the big difference: Dynamic = VM is running and we know where versus Static = VM is powered off and we don’t know where it might be powered!

Same applies for instance to vMotion scenarios. Although the platform knows what the target destination will be; it still doesn’t know how the target will treat that virtual machine. As such the vMotion process aims to be conservative and uses static overhead memory instead of dynamic. One of the things or instance that changes the amount of overhead memory needed is the “monitor mode” used (BT, HV or HWMMU).

So what is being explored to improve it? First of all including the additional host side parameters as mentioned above. But secondly, but equally important, based on the vm -> “target host” combination the overhead memory should be calculated. Or as engineering calls it calculating “Static overhead of VM v on Host h”.

Now why is this important? When is static overhead memory used? Static overhead memory is used by both HA and DRS. HA for instance uses it with Admission Control when doing the calculations around how many VMs can be powered on before unreserved resources are depleted. When you power-on a virtual machine the host side “admission control” will validate if it has sufficient unreserved resource available for the “static memory overhead” to be guaranteed… But also DRS and vMotion use the static memory overhead metric, for instance to ensure a virtual machine can be placed on a target host during a vMotion process as the static memory overhead needs to be guaranteed.

As you can see, a fairly lengthy chunk of info on just a single simple metric in vCenter / ESXTOP… but very nice to know!

Increase Storage IO Control logging level

I received this question today around how to increase the Storage IO Control logging level. I knew either Frank or I wrote about this in the past but I couldn’t find it… which made sense as it was actually documented in our book. I figured I would dump the blurp in to an article so that everyone who needs it for whatever reason can use it.

Sometimes it is necessary to troubleshoot your environment and having logs to review is helpful in determining what is actually happening. By default, SIOC logging is disabled, but it should be enabled before collecting logs. To enable logging:

  1. Click Host Advanced Settings.
  2. In the Misc section, select the Misc.SIOControlLogLevel parameter. Set the value to 7 for complete logging.  (Min value: 0 (no logging), Max value: 7)
  3. SIOC needs to be restarted to change the log level, to stop and start SIOC manually, use: /etc/init.d/storageRM {start|stop|status|restart}
  4. After changing the log level, you see the log level changes logged in /var/log/vmkernel

Please note that SIOC log files are saved in /var/log/vmkernel.

Limit a VM from an IOps perspective

Last couple of weeks I heard people either asking questions around how tot limit a VM from an IOps perspective or making comments that Storage IO Control (SIOC) allows you to limit VMs. As I pointed at least three folks to this info I figured I would share it publicly.

There is an IOps limit setting on the virtual disk as an option… This is what allows you to limit a virtual machine / virtual disk to a specific amount of IOps. Now it should be noted that when you set this limit this is handled (vSphere 5.1 and prior) by the local host scheduler, also known as SFQ. One thing to realize though is that when you set a limit on multiple virtual disks for a virtual machine is that all of these limits will be added up and that will be your threshold. In other words:

  • Disk01 – 50 IOps limit
  • Disk02 – 200 IOps limit
  • Combined total: 250 IOps limit
  • If Disk01 only uses 5 IOps then Disk02 can use 245 IOps!

There is one caveat though, “combined total” only goes for the disks which are stored on the same datastore. So if you have 4 disks and they are stored across 4 datastores then each of the individual limits apply respectively.

More details can be found in this KB article: http://kb.vmware.com/kb/1038241

RE: Is VSA the future of Software Defined Storage? (OpenIO)

I was reading this article on the HP blog about the future of Software Defined Storage and how the VSA fits perfectly. Although I agree that a VSA (virtual storage appliance) could potentially be a Software Defined Storage solution I do not really agree with IDC quote used for the basis of this article and on top of that I think some crucial factors are left out. Lets start with the IDC quote:

IDC defines software-based storage as any storage software stack that can be installed on any commodity resources (x86 hardware, hypervisors, or cloud) and/or off-the-shelf computing hardware and used to offer a full suite of storage services and federation between the underlying persistent data placement resources to enable data mobility of its tenants between these resources.

Software Defined Storage solutions to me are not necessarily just a software-based storage product. Yes as said a VSA, or something like Virtual SAN (VSAN), could be part of your strategy but how about the storage you have today? Do we really expect customers to forget about their “legacy” storage and just write it off? Surely that won’t happen, especially not in this economical climate and considering many companies invested heavily in storage when they started virtualizing production workloads. What is missing in this quote, or in that article (although briefly mentioned in linked article), is the whole concept of “abstract, pool, automate”. I guess some of you will say, well that is VMware’s motto right? Well yes and no. Yes, “abstract, pool, automate” is the way of the future if you ask VMware. However this is not something new. Think about Software Defined Networking for instance, this is fully based on the “abstract, pool, automate” concept.

This had me thinking, what is missing today? There are various different initiatives around networking (openflow etc), but what about storage? I created this diagram that from a logical perspective explains what I think we need when it comes to Software Defined Storage. I guess this is what Ray Lucchesi is referring to in his article on Openflow and the storage world. Brent Compton from FusionIO also had an insightful article on this topic, worth reading.

future of software defined storage - OpenIO?

If you look at my diagram… (yes FCoE/Infiniband etc is missing, not because it shouldn’t be supported but just to simplify the diagram) I drew a hypervisor at the top, reason for it being is that I have been in the hypervisor business for years but reality is this could be anything right. From a hypervisor perspective all you should see is a pool. A pool for your IO, a pool where you can store your data. Now this layer should provide you various things. Lets start at the bottom and work our way up.

  • IO Connect = Abstract / Pool logic. This should allow you to connect to various types of storage abstract it for your consumers and pool it of course
  • API = Do I need to explain this? It addresses the “automate” part but also probably even more importantly the integration aspect. Integration is key for a Software Defined Datacenter. The API(s) should be able to provide both north-, south-, east- and west-bound capabilities (for explanation around this read this article, although it is about Software Defined Networking it should get the point across)
  • PE = Policy Engine, takes care of making sure your data ends up on the right tier with the right set of services
  • DS = Data Services. Although the storage can provide specific data services this is also an opportunity for the layer in between the hypervisor and the storage. Matter a fact, data services should be possible on all layers: physical storage, appliance, host based etc
  • $$ = Caching. I drew this out separately for a reason, yes it could be seen as a data service but I wanted it separately as for any layer inserted there is an impact on performance. An elegant and efficient caching solution at the right level could mitigate the impact. Again, caching could be part of this framework but could very well sit outside of it on the host-side or at the storage layer, or appliance based

One thing I want to emphasize here is the importance of the API. I briefly mentioned enabling north-, south-, east- and west-bound capabilities but in order for a solution like this to be successful this is a must. Although with automation you can go a long way, integration is key here! Whether it is seamless integration with your physical systems, integration with your virtual management solution or with an external cloud storage solution… These APIs should be able to provide that kind of functionality and be enable a true pluggable framework experience.

If you look at this approach, and I drew this out before I even looked at Virsto, it kind of resembles what Virsto offers today. Although there are components missing the concept is similar. It also resembles VVOLs in a way, which was discussed at VMworld in 2011 and 2012. I would say that what I described is a combination of both combined with what Software Defined Networking promises.

So where am I going with this? Good question, honestly I don’t know… For me articles like these are a nice way of blowing steam, get the creative juices going and open up the conversation. I do feel the world is ready for the next step from a Software Defined Storage perspective, I guess the really question is who is going to take this next step and when? I would love to hear your feedback.

Install and configure Virsto

Last week I was in Palo Alto and I had a discussion about Virsto. Based on that discussion I figured it was time to setup Virsto in my Lab. I am not going to go in to much details around what Virsto is or does as I already did that in this article around the time VMware acquired Virsto. On top of that there is an excellent article by Cormac which provides an awesome primer. But lets use two quotes from both of the articles to give an idea of what to expect:

source: yellow-bricks.com
Virsto has developed an appliance and a host level service which together forms an abstraction layer for existing storage devices. In other words, storage devices are connected directly to the Virsto appliance and Virsto aggregates these devices in to a large storage pool. This pool is in its turn served up to your environment as an NFS datastore.

source: cormachogan.com
Virsto Software aims to provide the advantages of VMware’s linked clones (single image management, thin provisioning, rapid creation) but deliver better performance than EZT VMDKs.

Just to give an idea, what do I have running in my lab?

  • 3 x ESX 5.1 host
  • 1 x vCenter Server (Windows install as VCVA isn’t supported_
  • VNX 5500

After quickly glancing the quickstart guide I noticed I needed a Windows VM to install some of Virsto’s components, that VM is what Virsto refers to as the “vMaster”. I also need a bunch of empty LUNs which will be used for storage. I also noticed reference of Namespace VMs and vIOService VMs. Hmmm, it sounds complicated, but is it? I am guessing these components will need to be connected. This is kind of the idea, note that the empty LUNs will be automatically connected by to the IOService VMs. I did not add those to the diagram as that would make it more complex than needed.

Virsto Network Architecture Diagram

[Read more...]

vOpenData – feed up!

A couple of weeks ago I asked this question on twitter about what the average disk size of a virtual machine is these days. Within a couple of minutes Ben Thomas replied and said we might be able to create a survey script and he copied William Lam in. Now for those who never worked with William, if you ask him a question like that you can expect him to knock something out… William and Ben decided not to just knock out a survey script, but rather an open community project called vOpenData.

vOpenData

This open community project consists of a script that collects the data (and they collect a significant amount, you can see what they collect here.) and is aiming to provide various trending statistics and data for virtualized environments. The data is fed back in to the vOpenData database. The vOpenData website has a great dashboard which provides you all these cool stats. For instance, at the moment there are 77 infrastructures that provided data to their collection. The question I asked, what is the average disk size, currently says “61.51GB”. That average is based on those 77 infrastructures with over 27.000 VMs in total combined. Nice right!?!

I have already emailed William a bunch of suggestions, and as I will be in Palo Alto this week I am sure some more will bubble up during conversations. I am hoping that everyone sees the power of a solution like this and can help feeding data in to the vOpenData platform.

Go here to download the bits and feed up!

** I have had some people asking me how vOpenData compares to CloudPhysics. I have also seen some people comparing vOpenData to CloudPhysics… To be honest you can’t really compare them. Where vOpenData is about averages and statistics, CloudPhysics is more about analytics and simulation models. **