Platform9 manages private clouds as a service

A couple of months ago I introduced you to Platform9, a new company founded by four former VMware employees. I have been having discussions with them occasionally about what they are working on, I’ve been very intrigued by what they are building, and I am very pleased to see their first version go GA. I want to congratulate them on hitting this major milestone. For those who are not familiar with what they do, this is what their website says:

Platform9 Managed OpenStack is a cloud service that enables Enterprises to manage their internal server infrastructure as efficient private clouds.

In short, they have a PaaS based solution which allows you to easily manage KVM based virtualization hosts. It is a very simple way of creating a private cloud and it will literally get your KVM based solution up and running in minutes, which is very welcome in a world where things seem to become increasingly complex, especially when you talk about KVM/OpenStack.

Besides the GA announcement, the pricing model was also announced. It follows the same “pay per month” model as CloudPhysics uses. In the case of Platform9 the cost is $49 per CPU per month, with an annual subscription being required. This is for what they call their “business tier”, which has unlimited scale. There is also a “lite tier”, which is free but limited in scale and mainly aimed at people who want to test Platform9 and learn about the offering. An Enterprise tier is also in the works and will offer more advanced features and premium support. The features it adds on top of the Business tier appear to be mainly in the software-defined networking and security space, so I would expect things like firewalling, network isolation, single sign-on, etc.
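To put the pricing in perspective, here is a quick back-of-the-envelope calculation in Python. The $49 per CPU per month figure comes from the announcement; the cluster size is a made-up example:

    # Annual cost estimate for the Platform9 "business tier".
    # $49 per CPU per month is from the announcement; the host and CPU
    # counts below are made-up example numbers.
    PRICE_PER_CPU_PER_MONTH = 49  # USD, billed annually

    hosts = 10
    cpus_per_host = 2

    annual_cost = hosts * cpus_per_host * PRICE_PER_CPU_PER_MONTH * 12
    print(f"{hosts} hosts x {cpus_per_host} CPUs: ${annual_cost:,} per year")
    # -> 10 hosts x 2 CPUs: $11,760 per year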

I highly recommend watching the Virtualization Field Day 4 videos as they demonstrate perfectly what they are capable of. The video that is probably most interesting to you is the one where they demonstrate a beta of the offering they are planning for vSphere (embedded below). The beta shows vSphere hosts and KVM hosts in a single pane of glass. The end user can deploy “instances” (virtual machines) in the environment of choice using a single tool, which from an operational perspective is great. On top of that, Platform9 discovers existing workloads on KVM and vSphere and non-disruptively adds them to their management interface.

DRS is just a load balancing solution…

Recently I’ve been hearing this comment more and more: DRS is just a load balancing solution. It seems that some folks spread this FUD to diminish what DRS really is and does. Let me start by saying that DRS is not a load balancing solution. The ultimate goal of DRS is to ensure all workloads receive the resources they demand. Frank Denneman has a great post on this topic, as it has led to some confusion in the past. I would advise reading it if you want to understand why exactly VMs are not moved while the cluster seems imbalanced. In short: why move VMs when they are not constrained? In other words, DRS has a VM-centric view of the virtual world, not a host-centric one… In the end it is all about your applications and how they perform, not necessarily about the infrastructure they are hosted on; DRS cares about VM/application happiness. Also keep in mind that there is a risk and a cost involved with every move you make.
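To make the distinction concrete, here is a minimal sketch in Python of the difference between a host-centric load balancer and a VM-centric view. All names and thresholds are made up; this is an illustration, not VMware’s actual DRS algorithm:

    # Minimal sketch of the conceptual difference between host-centric load
    # balancing and a DRS-like VM-centric view. Names and thresholds are
    # made up; this is not VMware's actual algorithm.

    def host_centric_should_move(host_utilization, cluster_avg, tolerance=0.1):
        """A pure load balancer moves VMs whenever hosts deviate from the mean."""
        return abs(host_utilization - cluster_avg) > tolerance

    def vm_centric_should_move(vm_demand, vm_entitlement_delivered, migration_cost):
        """A DRS-like view only moves a VM when it is actually constrained,
        and only when the benefit outweighs the risk/cost of the migration."""
        shortfall = vm_demand - vm_entitlement_delivered
        return shortfall > 0 and shortfall > migration_cost

    # A cluster can look imbalanced (one host at 80%, average at 50%) while
    # every VM still gets what it demands -- no reason to move anything.
    print(host_centric_should_move(0.8, 0.5))       # True: "imbalance!"
    print(vm_centric_should_move(4096, 4096, 512))  # False: the VM is happy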

Of course there is a lot of functionality that you leverage without thinking about it and take for granted: things like Resource Pools (limits / reservations / shares), DRS Maintenance Mode (fully automated), VM Placement, Admission Control (yes, DRS has one) and, last but not least, the various types of (anti-)affinity rules. Also, before anyone starts shouting about active memory vs consumed (PercentIdleMBInMemDemand solves this) or %RDY not being taken into account… DRS has many knobs you can twist.
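As an example of twisting one of those knobs: a DRS advanced option such as PercentIdleMBInMemDemand can be set through the vSphere API. Below is a sketch using pyVmomi, assuming cluster already references a vim.ClusterComputeResource obtained from an authenticated session:

    # Sketch: setting the PercentIdleMBInMemDemand DRS advanced option with
    # pyVmomi. Assumes "cluster" is a vim.ClusterComputeResource obtained
    # elsewhere (e.g. via pyVim.connect.SmartConnect and a container view).
    from pyVmomi import vim

    spec = vim.cluster.ConfigSpecEx()
    spec.drsConfig = vim.cluster.DrsConfigInfo()
    # 100 = count all idle consumed memory as demand, not just active memory.
    spec.drsConfig.option = [
        vim.option.OptionValue(key="PercentIdleMBInMemDemand", value="100")
    ]

    task = cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)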

But besides that, there is more. Something not a lot of people realize is that, for instance, HA and DRS are loosely coupled but tightly integrated. When you have both enabled on your cluster, HA is able to call upon DRS to make the right placement decision and to defragment resources when needed. What does that mean? Well, let’s assume for a second that you are running at (or almost at) full capacity while taking a host failure into account by leveraging HA admission control, and a host fails. HA will need to restart your VMs, but what if at some point there is not enough spare capacity left on any single host to restart a given VM? In that case HA will call upon DRS to make space available so that these VMs can be restarted. That is nice, right?! And there is more smartness coming when it comes to HA/DRS admission control; hopefully I can tell you all about it soon.
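Conceptually, the defragmentation step looks something like the toy example below. It is a simplified illustration of the idea, not the actual HA/DRS implementation:

    # Simplified illustration of resource defragmentation: when no single
    # host can fit the VM that HA needs to restart, ask "DRS" to migrate a
    # smaller VM away from the host closest to fitting it. Toy example only.

    def find_restart_host(hosts, vm_size):
        """Return the first host with enough free capacity, or None."""
        return next((h for h in hosts if h["free"] >= vm_size), None)

    def defragment_and_restart(hosts, vm_size):
        host = find_restart_host(hosts, vm_size)
        if host is None:
            donor = max(hosts, key=lambda h: h["free"])  # closest to fitting
            moved = min(donor["vms"])                    # cheapest migration
            target = next(h for h in hosts if h is not donor and h["free"] >= moved)
            donor["vms"].remove(moved)
            donor["free"] += moved
            target["vms"].append(moved)
            target["free"] -= moved
            host = find_restart_host(hosts, vm_size)
        return host["name"] if host else None

    hosts = [
        {"name": "esx01", "free": 6, "vms": [4, 4]},
        {"name": "esx02", "free": 4, "vms": [8, 4]},
    ]
    print(defragment_and_restart(hosts, 8))  # "esx01", after moving a 4 GB VM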

Then of course there is also the case where resource pools are implemented. vSphere HA and DRS work in conjunction to ensure that when VMs are failed over, their shares are flattened to avoid strange prioritisation during times of contention. HA and DRS do this because VMs always fail over to the root resource pool of a host; DRS will place the VMs back where they belong when it runs for the first time after the failover has occurred. This is especially important when you have set shares on VMs individually in a resource pool model.
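A quick worked example, with made-up share values, shows why flattening matters. This illustrates the problem, not vSphere’s exact flattening algorithm:

    # Worked example (made-up numbers) of why shares are flattened when VMs
    # fail over into the root resource pool. Shares are relative *within*
    # their pool, so a raw value that was modest inside a small pool would
    # dominate at the root level if taken literally.

    pool_shares = {"Production": 8000, "Test": 2000}  # root-level pool shares

    vm_shares = 4000         # this VM's shares *inside* the Test pool
    test_pool_total = 6000   # sum of all shares inside the Test pool

    # Unflattened: dropped into the root pool as-is, the VM (4000) would
    # outrank the entire Test pool (2000) during contention -- clearly wrong.
    # Flattened: scale the value to what it actually represented, i.e.
    # two thirds of Test's 2000 root-level shares.
    flattened = int(pool_shares["Test"] * vm_shares / test_pool_total)
    print(flattened)  # 1333 root-level shares, preserving relative priority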

So when someone says DRS is just a simple load balancing solution, take their story with a grain of salt…

Must read post on Cloud Native Apps

I don’t do this too often, but I wanted to share an excellent blog post by one of my colleagues. I was writing something along the same lines, as there seems to be a lot of confusion around what cloud native apps are and what they bring. Even when it comes to containers there still seems to be a lot of confusion. What fits where, and how you can leverage certain technologies to their full potential, will all depend on your application architecture if you ask me. If you read the examples of how these types of apps are (or aren’t) administered, you can also see that, with the wrong understanding and knowledge, applying the same logic to an app which was not designed for it could lead to a world of pain.

Anyway, Massimo’s post is a great start for everyone who wants a better understanding of the evolution that is going on in the developer world. Thanks Massimo for taking the time to write this great article. Below is a short excerpt and the link; I urge all of you to read it and soak it in.

Cloud Native Applications for dummies

This is where the virtual machines (aka instances) hosting the code of our cloud native application live. They are completely stateless, they are an army of VMs all identically configured (on a role-basis) and whose entire life cycle is automated. In such an environment traditional IT concepts often associated to virtual machines do not even make any sense. See below for some examples.

  • You don’t install (in the traditional way) these servers, because they are generated by automated scripts that are either triggered by an external event or by a policy (e.g. autoscale a front end layer based on user demand)
  • You don’t operate these servers, for the same reason as above.
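To illustrate the kind of policy-driven life cycle Massimo describes, a minimal autoscaling policy could look like the sketch below. All names and thresholds are hypothetical; a real implementation would call a cloud API instead of the stub functions:

    # Minimal sketch of a policy-driven autoscaler for a stateless front-end
    # layer. All names and thresholds are hypothetical.
    SCALE_UP_AT = 0.75     # average CPU above which an instance is added
    SCALE_DOWN_AT = 0.25   # average CPU below which an instance is removed
    MIN_INSTANCES = 2

    def autoscale(avg_cpu, instances, provision, destroy):
        """Apply the scaling policy once; call this on a timer or event."""
        if avg_cpu > SCALE_UP_AT:
            instances.append(provision())    # generated by script/policy
        elif avg_cpu < SCALE_DOWN_AT and len(instances) > MIN_INSTANCES:
            destroy(instances.pop())         # disposable, never "repaired"
        return instances

    # Example with stub provision/destroy functions:
    fleet = ["web-1", "web-2"]
    fleet = autoscale(0.9, fleet, lambda: "web-3", lambda vm: None)
    print(fleet)  # ['web-1', 'web-2', 'web-3']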

Operational Efficiency (You’re not Facebook/Google/Netflix)

In previous roles, including before I joined VMware, I was a system administrator and a consultant. The tweets below reminded me of the kind of work I did in the past and triggered a train of thought that I wanted to share…

Howard has a great point here. For some reason many people started using Google, Facebook or Netflix as the prime example of operational efficiency. Startups use it in their pitches to describe what they can bring and how they can simplify your life, and yes, I’ve also seen companies like VMware use it in their presentations. When I look back at when I managed these systems, my pain was not the infrastructure (servers / network / storage), even though the environment I was managing was based on what many refer to as legacy: EMC Clariion, NetApp FAS or HP EVA. The servers were never really a problem to manage either; sure, updating firmware was a pain, but it was not my biggest pain point. Provisioning virtual machines was never a huge deal either… My pain was caused by the application landscape many of my customers had.

At companies like Facebook and Google the ratio of applications to admins is different, as Howard points out. I would also argue that in many cases their applications are developed in-house and are designed around agility, availability and efficiency… Unfortunately, for most of you this is not the case. Most applications are provided by vendors who don’t really seem to care about your requirements; they don’t design for agility and availability. No, instead they do what is easiest for them. In the majority of cases these are legacy monolithic (cr)applications with a simple database, which all needs to be hosted on a single VM, and when you get an update that is where the real pain begins. At one of the companies I worked for we had a single department using over 80 different applications to calculate mortgages for the different banks and offerings out there; believe me when I say that that is not easy to manage, and that is where I would spend most of my time.

I do appreciate the whole DevOps movement and I do see the value in optimizing your operations to align with your business needs, but we also need to be realistic. Expecting your IT org to run as efficiently as Google/Facebook/Netflix is just not realistic and is not going to happen. Unless, of course, you invest deeply and develop the majority of your applications in-house, using the same design principles these companies use. Even then I doubt you would reach the same efficiency, as most simply won’t have the scale to reach it. This does not mean you should not aim to optimize your operations, though! Everyone can benefit from optimizing operations, from re-aligning the IT department to the demands of today’s world, from revising procedures… Everyone should go through this motion, constantly, but at the same time stay realistic. Set your expectations based on what lands on the infrastructure, as that is where a lot of the complexity comes in.

(Inter-VM) TPS Disabled by default, what should you do?

We’ve all probably seen the announcement around inter-VM(!!) TPS (transparent page sharing) being disabled by default in future releases of vSphere, and the recommendation to disable it in current versions. The reason for this is that a research paper was published which demonstrates how it is possible to get access to data under certain highly controlled conditions. As the KB article describes:

Published academic papers have demonstrated that by forcing a flush and reload of cache memory, it is possible to measure memory timings to determine an AES encryption key in use on another virtual machine running on the same physical processor of the host server if Transparent Page Sharing is enabled. This technique works only in a highly controlled environment using a non-standard configuration.

There were many people who blogged about the potential impact on your environment or designs. Typically, in the past, people would take 20 to 30% memory sharing into account when sizing their environment. With inter-VM TPS disabled this of course goes out of the window. Frank described this nicely in his post. However, as Frank also described and I mentioned in previous articles, when large pages are being used (usually the case) TPS is not used by default anyway, only under memory pressure…
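The sizing impact is easy to quantify. A quick example using the 20 to 30% figure from above (the workload numbers are made up):

    # Quick sizing example using the 20-30% sharing assumption from the text.
    # The workload numbers are made up.
    total_vm_memory_gb = 1024   # sum of configured VM memory
    sharing_ratio = 0.25        # 25% savings assumed from inter-VM TPS

    with_tps = total_vm_memory_gb * (1 - sharing_ratio)
    without_tps = total_vm_memory_gb  # inter-VM TPS disabled: no savings

    print(f"Host memory to size for with TPS:    {with_tps:.0f} GB")
    print(f"Host memory to size for without TPS: {without_tps:.0f} GB")
    # The delta (256 GB here) is capacity you now need to buy.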

The “under pressure” part is important if you ask me, as TPS is the first memory reclamation technique used when a host is under pressure. If TPS cannot sufficiently reduce the memory pressure, ballooning is leveraged, followed by compression and ultimately swapping. Personally I would like to avoid swapping at all costs, and preferably compression as well. Ballooning typically doesn’t result in a huge performance degradation, so it could be acceptable, but TPS is something I prefer as it just breaks up large pages into small pages and collapses those when possible. Performance loss is hardly measurable in that case. Of course, TPS would be way more effective when pages can be shared between VMs rather than just within a VM.

Anyway, the question remains: should you have (inter-VM) TPS disabled or not? When you assess the risk, you need to ask yourself first who has access to your virtual machines, as the technique requires you to log in to a virtual machine. Before we look at the scenarios, note that I mentioned “inter-VM” a couple of times now: TPS is not completely disabled in future versions. It will be disabled for inter-VM sharing by default, but it can be re-enabled. More on that can be found in this article on the vSphere blog.
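For completeness: per VMware KB 2097593, inter-VM sharing behavior is controlled by the Mem.ShareForceSalting host advanced setting, where a value of 0 restores the old inter-VM TPS behavior. Below is a sketch using pyVmomi, assuming host already references a vim.HostSystem from an authenticated session:

    # Sketch: re-enabling inter-VM page sharing on one host by setting the
    # Mem.ShareForceSalting advanced setting to 0 (integer option, see
    # VMware KB 2097593). Assumes "host" is a vim.HostSystem.
    from pyVmomi import vim

    opt_mgr = host.configManager.advancedOption
    opt_mgr.UpdateOptions(changedValue=[
        vim.option.OptionValue(key="Mem.ShareForceSalting", value=0)
    ])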

Let’s explore three scenarios:

  1. Server virtualisation (private)
  2. Public cloud
  3. Virtual Desktops

In the case of “Server virtualisation”, in most scenarios I would expect that only the system administrators and/or application owners have access to the virtual machines. The question then is: why would they go to these lengths when they have access to the virtual machines anyway? So in the scenario where server virtualization is your use case, and access to your virtual machines is restricted to a limited number of people, I would definitely consider enabling inter-VM TPS.

In a public cloud environment this is of course different. You can imagine that a hacker could buy a virtual machine and try to retrieve the AES encryption key. What he (the hacker) would do with it next is even then still the question. Hopefully the cloud provider ensures that the tenants are isolated from each other from a security/networking point of view. If that is the case, there shouldn’t be much they could do with it. Then again, it could be just one of the many steps they have to take to break in to a system, so I would probably not want to take the risk, even though the risk is low. This is one of the scenarios where I would leave inter-VM TPS disabled.

The third and last scenario is virtual desktops. In the case of a virtual desktop many different users have access to virtual machines… The question, though, is whether you are running or accessing any applications which leverage AES encryption. I cannot answer that for you, so I will leave that up in the air… you will need to assess that risk.

I guess the answer to whether you should or should not disable (inter-VM) TPS is, as always: it depends. I understand why inter-VM TPS was disabled, but if the risk is low I would definitely consider enabling it.