Must read post on Cloud Native Apps

I don't do this too often, but I wanted to share an excellent blog post by one of my colleagues. I was writing something along the same lines, as there seems to be a lot of confusion around what cloud native apps are and what they bring. Even when it comes to containers there still seems to be a lot of confusion. What fits where, and how you can leverage certain technologies to their full potential, will all depend on your application architecture if you ask me. If you read the examples of how these types of apps are (or aren't) administered, you can also see that, with the wrong understanding and knowledge, applying the same logic to an app that was not designed for it could lead to a world of pain.

Anyway, Massimo's post is a great start for everyone who wants to get a better understanding of the evolution that is going on in the developer world. Thanks Massimo for taking the time to write this great article. Below is a short excerpt and the link; I urge all of you to read it and soak it in.

Cloud Native Applications for dummies

This is where the virtual machines (aka instances) hosting the code of our cloud native application live. They are completely stateless: an army of VMs, all identically configured (on a per-role basis), whose entire life cycle is automated. In such an environment, traditional IT concepts often associated with virtual machines do not even make any sense. See below for some examples.

  • You don't install (in the traditional way) these servers, because they are generated by automated scripts that are either triggered by an external event or by a policy (e.g. autoscaling a front-end layer based on user demand)
  • You don't operate these servers, for the same reason as above.
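To make the "triggered by a policy" part a bit more concrete, below is a minimal, purely illustrative sketch of what such an autoscaling policy loop could look like for a stateless front-end tier. The thresholds and the provision/destroy helpers are hypothetical stubs, not any specific cloud or vSphere API.

```python
# Minimal autoscaling sketch for a stateless front-end tier.
# The thresholds and the provision/destroy helpers are hypothetical stubs;
# in a real environment these would call your cloud or orchestration API.
import random
import time

SCALE_OUT_RPS = 500    # assumed requests/sec per instance before adding one
SCALE_IN_RPS = 100     # assumed requests/sec per instance before removing one
MIN_INSTANCES = 2

frontends = ["web-01", "web-02"]

def current_rps_per_instance() -> float:
    return random.uniform(50, 800)               # stand-in for a real metric source

def provision_instance(name: str) -> None:
    print(f"provisioning {name} from template")  # stand-in for an API call

def destroy_instance(name: str) -> None:
    print(f"destroying {name} (stateless, nothing to preserve)")

def reconcile() -> None:
    """One pass of the policy: scale out or in based on demand."""
    rps = current_rps_per_instance()
    if rps > SCALE_OUT_RPS:
        name = f"web-{len(frontends) + 1:02d}"
        provision_instance(name)
        frontends.append(name)
    elif rps < SCALE_IN_RPS and len(frontends) > MIN_INSTANCES:
        destroy_instance(frontends.pop())

if __name__ == "__main__":
    for _ in range(5):                           # a few iterations of the policy loop
        reconcile()
        time.sleep(1)
```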

Operational Efficiency (You’re not Facebook/Google/Netflix)

In previous roles, also before I joined VMware, I was a system administrator and a consultant. The tweets below reminded me of the kind of work I did in the past and triggered a train of thought that I wanted to share…

Howard has a great point here. For some reason many people have started using Google, Facebook or Netflix as the prime examples of operational efficiency. Startups use them in their pitches to describe what they can bring and how they can simplify your life, and yes, I've also seen companies like VMware use them in their presentations. When I look back at the time I managed these systems, my pain was not the infrastructure (servers / network / storage), even though the environment I was managing was based on what many refer to as legacy: EMC CLARiiON, NetApp FAS or HP EVA. The servers were never really the problem to manage either; sure, updating firmware was a pain, but it was not my biggest pain point. Provisioning virtual machines was never a huge deal… My pain was caused by the application landscape many of my customers had.

At companies like Facebook and Google the application-to-admin ratio is different, as Howard points out. I would also argue that in many cases their applications are developed in-house and are designed around agility, availability and efficiency… Unfortunately, for most of you this is not the case. Most applications are provided by vendors that don't really seem to care about your requirements; they don't design for agility and availability. No, instead they do what is easiest for them. In the majority of cases these are legacy, monolithic (cr)applications with a simple database, all of which needs to be hosted on a single VM, and when you get an update that is where the real pain begins. At one of the companies I worked for, a single department used over 80 different applications to calculate mortgages for the different banks and offerings out there. Believe me when I say that that is not easy to manage, and that is where I would spend most of my time.

I do appreciate the whole DevOps movement and I do see the value in optimizing your operations to align with your business needs, but we also need to be realistic. Expecting your IT org to run as efficiently as Google/Facebook/Netflix is just not realistic and is not going to happen. Unless, of course, you invest deeply and develop the majority of your applications in-house, using the same design principles these companies use. Even then I doubt you would reach the same efficiency, as most simply won't have the scale to reach it. This does not mean you should not aim to optimize your operations though! Everyone can benefit from optimizing operations, from re-aligning the IT department to the demands of today's world, from revising procedures… Everyone should go through this exercise, constantly, but at the same time stay realistic. Set your expectations based on what lands on the infrastructure, as that is where a lot of the complexity comes in.

(Inter-VM) TPS Disabled by default, what should you do?

We've all probably seen the announcement around inter-VM(!!) TPS (transparent page sharing) being disabled by default in future releases of vSphere, and the recommendation to disable it in current versions. The reason for this is that a research paper was published which demonstrates how it is possible to gain access to data under certain highly controlled conditions. As the KB article describes:

Published academic papers have demonstrated that by forcing a flush and reload of cache memory, it is possible to measure memory timings to determine an AES encryption key in use on another virtual machine running on the same physical processor of the host server if Transparent Page Sharing is enabled. This technique works only in a highly controlled environment using a non-standard configuration.

Many people have blogged about the potential impact on your environment or designs. Typically, in the past, people would take 20 to 30% memory sharing into account when sizing their environment. With inter-VM TPS disabled, this of course goes out of the window. Frank described this nicely in this post. However, as Frank also described and as I mentioned in previous articles, when large pages are being used (usually the case) TPS is not used by default, only under memory pressure…
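To make the sizing impact a bit more tangible, here is a small back-of-the-envelope sketch. The VM count, memory size and 25% sharing ratio are assumptions purely for illustration, not measured values.

```python
# Back-of-the-envelope sizing sketch: what losing inter-VM TPS savings could
# mean for physical memory planning. All numbers are illustrative assumptions.
vm_count = 100              # assumed number of VMs in the cluster
memory_per_vm_gb = 8        # assumed configured memory per VM
sharing_ratio = 0.25        # the 20-30% sharing historically factored in

configured_gb = vm_count * memory_per_vm_gb
sized_with_tps_gb = configured_gb * (1 - sharing_ratio)
sized_without_tps_gb = configured_gb

print(f"Configured memory:           {configured_gb} GB")
print(f"Sized assuming 25% sharing:  {sized_with_tps_gb:.0f} GB")
print(f"Sized without inter-VM TPS:  {sized_without_tps_gb} GB "
      f"(delta: {sized_without_tps_gb - sized_with_tps_gb:.0f} GB)")
```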

The "under pressure" part is important if you ask me, as TPS is the first memory reclamation technique used when a host is under pressure. If TPS cannot sufficiently reduce the memory pressure, then ballooning is leveraged, followed by compression and ultimately swapping. Personally I would like to avoid swapping at all costs, and preferably compression as well. Ballooning typically doesn't result in a huge performance degradation, so it could be acceptable, but TPS is something I prefer as it just breaks up large pages into small pages and collapses those when possible. Performance loss is hardly measurable in that case. Of course, TPS would be way more effective when pages can be shared between VMs rather than just within the VM.

Anyway, the question remains: should you have (inter-VM) TPS disabled or not? When you assess the risk, you first need to ask yourself who has access to your virtual machines, as the technique requires you to log in to a virtual machine. Before we look at the scenarios, note that I mentioned "inter-VM" a couple of times now; TPS is not completely disabled in future versions. It will be disabled for inter-VM sharing by default, but it can be enabled. More can be found on that in this article on the vSphere blog.

Let's explore three scenarios:

  1. Server virtualization (private)
  2. Public cloud
  3. Virtual Desktops

In the case of "server virtualization", in most scenarios I would expect that only the system administrators and/or application owners have access to the virtual machines. The question then is: why would they go to these lengths when they have access to the virtual machines anyway? So in the scenario where server virtualization is your use case, and access to your virtual machines is restricted to a limited number of people, I would definitely consider re-enabling inter-VM TPS.

In a public cloud environment, however, this is of course different. You can imagine that a hacker could buy a virtual machine and try to retrieve the AES encryption key. What he (the hacker) does with it next is, even then, still the question. Hopefully the cloud provider ensures that the tenants are isolated from each other from a security/networking point of view. If that is the case, there shouldn't be much they could do with it. Then again, it could be just one of the many steps they have to take to break into a system, so I would probably not want to take the risk, even though the risk is low. This is one of the scenarios where I would leave inter-VM TPS disabled.

The third and last scenario is virtual desktops. In the case of a virtual desktop, many different users have access to the virtual machines… The question, though, is whether you are running or accessing any applications which leverage AES encryption. I cannot answer that for you, so I will leave that up in the air… you will need to assess that risk yourself.

I guess the answer to whether you should or should not disable (inter-VM) TPS is, as always: it depends. I understand why inter-VM TPS was disabled, but if the risk is low I would definitely consider enabling it.
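For those who do decide to re-enable it, the knob involved is the Mem.ShareForceSalting host advanced setting (0 restores the old inter-VM sharing behaviour, as described in the VMware KB). Below is a minimal pyVmomi sketch of what that could look like; the vCenter address, credentials and certificate handling are assumptions, so treat this as a lab-only illustration rather than a production script.

```python
# Minimal sketch: set Mem.ShareForceSalting = 0 on every host, which restores
# the legacy inter-VM page sharing behaviour. Connection details below are
# placeholders; test in a lab first.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()                 # lab only: no cert checks
si = SmartConnect(host="vcenter.lab.local",            # hypothetical vCenter
                  user="administrator@vsphere.local",
                  pwd="VMware1!",
                  sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        adv = host.configManager.advancedOption
        # Some integer options are picky about the value type; adjust if the
        # call is rejected by your vCenter/ESXi version.
        adv.UpdateOptions(changedValue=[
            vim.option.OptionValue(key="Mem.ShareForceSalting", value=0)])
        print(f"{host.name}: Mem.ShareForceSalting set to 0")
finally:
    Disconnect(si)
```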

Underlying Infrastructure for your pets and cattle

Last week on Twitter, Maish asked a question that got me thinking. Actually, I have been thinking about this for a while now. The question deals with how you design your infrastructure for the various types of workloads, whether they fall in the "pet" category or the "cattle" category. (If you are not familiar with the terms pets/cattle, read this article by Massimo.)

I asked Maish what it actually means for your infrastructure, and at the same time I gave it some more thought over the last week. Cattle is the type of application architecture which handles failures by providing a distributed solution; it typically scales out instead of up, and the VMs are disposable as they usually won't hold state. With pets this is different: they typically scale up, resiliency is often provided by either a 3rd-party clustering mechanism or the infrastructure underneath, in many cases they contain state, and recoverability is key. As you can imagine, both types of workloads have different requirements of the infrastructure. Going back to Maish's question, I guess the real question is whether you can afford the "what it means for the underlying infrastructure". What do I mean by that?

If you look at the requirements of both architectures, you could say that "pets" will typically demand more from the underlying infrastructure when it comes to resiliency / recoverability. Cattle will have fewer demands from that perspective, but flexibility / agility is more important. You can imagine that you could implement two different infrastructure architectures for these specific workloads, but does this make sense? If you are Netflix, Google, YouTube etc. then it may make sense to do this, due to the scale they operate at and the fact that IT is their core business. In those cases "cattle" is what drives the business, and the "pets" are the back-end systems. The reality, though, is that for the majority this is not the case. Your environment will be a hybrid, and more than likely "pets" will have the upper hand, as that is simply the state of the world today.

That does not mean they cannot co-exist. That is what I believe is the true strength of virtualization: it allows you to run many different types of workloads on the same infrastructure. Whether that is your Exchange environment or your in-house developed scale-out web application which serves hundreds of thousands of customers does not make a difference to your virtualization platform. From an operational perspective, the big benefit here is that you will not have to maintain different run books to manage your workloads. From an ops perspective they will look the same on the outside, although they may differ on the inside. What may change, though, are the services required for those systems, but with the rich ecosystem available for virtualization platforms these days that should not be a problem. Need extra security / micro-segmentation? VMware NSX can provide the security isolation needed to run these applications smoothly. Sub-millisecond latency requirements? Plenty of storage / caching solutions out there can deliver this!
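As a trivial illustration of "one run book, different services", here is a sketch of what a per-workload-class policy could look like. The class names and policy fields are made up for illustration; they are not actual vSphere or NSX constructs.

```python
# Hypothetical sketch: one provisioning entry point, two workload classes,
# each mapped to a different bundle of infrastructure services.
from dataclasses import dataclass

@dataclass
class WorkloadPolicy:
    ha_restart: bool         # restart on host failure (infra-provided resiliency)
    backup: bool             # infrastructure-level recoverability
    autoscale: bool          # disposable instances scaled on demand
    microsegmentation: bool  # e.g. tier isolation with something like NSX

POLICIES = {
    # "pets": stateful, scale-up, the infrastructure provides resiliency
    "pet": WorkloadPolicy(ha_restart=True, backup=True,
                          autoscale=False, microsegmentation=True),
    # "cattle": stateless, scale-out, the application provides resiliency
    "cattle": WorkloadPolicy(ha_restart=False, backup=False,
                             autoscale=True, microsegmentation=True),
}

def provision(vm_name: str, workload_class: str) -> None:
    """Single run book: look up the policy and apply it on provisioning."""
    policy = POLICIES[workload_class]
    print(f"Provisioning {vm_name} as '{workload_class}': {policy}")

provision("exchange-mbx-01", "pet")
provision("web-frontend-042", "cattle")
```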

Will the application architecture shift that is happening right now impact your underlying infrastructure? We have made huge steps in operational efficiency in the last 5 years, and with SDDC we are about to take the next big step. Although I do believe that the application architecture shift will result in infrastructure changes, let's not make the same mistakes we made in the past by creating infrastructure silos per workload. I strongly believe that repeatability, consistency, reliability and predictability are key, and this starts with a solid, scalable and trusted foundation (infrastructure).

Re: Re: The Rack Endgame: A New Storage Architecture For the Data Center

I was reading Frank Denneman's article with regard to new datacenter architectures. This in turn was a response to Stephen Foskett's article about how the physical architecture of datacenter hardware should change. I recommend reading both articles as that will give a bit more background, plus they are excellent reads by themselves. (Gotta love these blogging debates.) Let's start with an excerpt from both articles which summarizes the posts for those who don't want to read them in full.

Stephen:
Top-of-rack flash and bottom-of-rack disk makes a ton of sense in a world of virtualized, distributed storage. It fits with enterprise paradigms yet delivers real architectural change that could “move the needle” in a way that no centralized shared storage system ever will. SAN and NAS aren’t going away immediately, but this new storage architecture will be an attractive next-generation direction!

If you look at what Stephen describes, I think it is more or less in line with what Intel is working towards. The Intel Rack Scale Architecture aims to disaggregate traditional server components and then aggregate them by type of resource, backed by a high-performing, optimized rack fabric enabled by the new photonic architecture Intel is currently working on. This is not the long-term future; this is what Intel showcased last year and said would be available in 2015 / 2016.

Frank:
The hypervisor is rich with information, including a collection of tightly knit resource schedulers. It is the perfect place to introduce policy-based management engines. The hypervisor becomes a single control plane that manages both the resource as well as the demand. A single construct to automate instructions in a single language providing a correct Quality of Service model at application granularity levels. You can control resource demand and distribution from one single pane of management. No need to wait on the completion of the development cycles from each vendor.

There is also a bit in Frank's article where he talks about Virtual Volumes and VAAI: how long it took for all storage vendors to adopt VAAI, and how he believes the same may apply to Virtual Volumes. Frank aims more towards the hypervisor being the aggregator, instead of doing it through changes in the physical space.

So what about Frank's arguments? Well, Frank has a point with regard to VAAI adoption and the fact that some vendors took a long time to implement it. However, the reality is that Virtual Volumes is going full steam ahead. With many storage vendors demoing it at VMworld in San Francisco last week, I have the distinct feeling that things will be different this time. Maybe timing is part of it, as it seems that many customers are at a crossroads and want to optimize their datacenter operations / architecture by adopting SDDC, of which policy-based storage management happens to be a big chunk.

I agree with Frank that the hypervisor is perfectly positioned to be that control plane. However, in order to be that control plane for the future, there needs to be a way to connect "things" to it which allows for far better scale and more flexibility. VMware, if you ask me, has done that for many parts of the datacenter, but one aspect that still needs to be overhauled for sure is storage. VAAI was a great start, but with VMFS there simply are too many constraints and it doesn't cater for granular control.

I feel that the datacenter will need to change on both ends in order to take that next step in the evolution towards the SDDC. Intel Rack Scale Architecture will allow for far greater scale and efficiency than ever seen before. But it will only be successful when the layer that sits on top has the ability to take all of these disaggregated resources, turn them into large shared pools, and allow you to assign resources in a policy-driven (and programmable) manner. Not just assign resources, but also allow you to specify what the level of availability (HA, DR, but also QoS) should be for whatever consumes those resources. Granularity is important here, and of course it shouldn't stop with availability; it applies to any other (data) service that one may require.
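To illustrate what "policy-driven (and programmable)" assignment could look like at that level of granularity, here is a small sketch that matches a per-object policy (availability, QoS, data services) against pools of disaggregated resources. The field names and pool capabilities are assumptions for illustration only, not an actual VMware or Intel API.

```python
# Hypothetical sketch: a per-object policy describing availability, QoS and
# data services, matched against resource pools built from disaggregated
# hardware. Names and capabilities are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class ServicePolicy:
    failures_to_tolerate: int = 1                 # availability (HA)
    dr_replication: bool = False                  # availability (DR)
    iops_limit: Optional[int] = None              # QoS
    data_services: List[str] = field(default_factory=list)  # e.g. ["encryption"]

@dataclass
class ResourcePool:
    name: str
    capabilities: Set[str]

def place(policy: ServicePolicy, pools: List[ResourcePool]) -> ResourcePool:
    """Return the first pool that can satisfy every requested data service."""
    required = set(policy.data_services)
    for pool in pools:
        if required <= pool.capabilities:
            return pool
    raise RuntimeError("no pool satisfies the requested policy")

pools = [
    ResourcePool("rack-a-flash", {"caching", "encryption", "replication"}),
    ResourcePool("rack-b-disk", {"replication"}),
]
db_policy = ServicePolicy(failures_to_tolerate=2, dr_replication=True,
                          iops_limit=5000, data_services=["encryption"])
print(place(db_policy, pools).name)  # -> rack-a-flash
```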

So where does what fit in? If you look at some of the initiatives that were revealed at VMworld, like Virtual Volumes, Virtual SAN and the vSphere APIs for IO Filters, you can see where the world is moving towards, fast. You can see how vSphere is truly becoming that control plane for all resources and how it will be able to provide you with end-to-end policy-driven management. In order to make all of this a reality, the current platform will need to change. Changes that allow for more granularity / flexibility and higher scalability, and that is where all these (new) initiatives come into play. Some partners may take longer to adopt than others, especially those that require fundamental changes to the architecture of the underlying platforms (storage systems for instance), but just like with VAAI I am certain that over time this will happen, as customers will drive this change by making decisions based on the availability of functionality.

Exciting times ahead if you ask me.