
Yellow Bricks

by Duncan Epping



An Industry Roadmap: From storage to data management #STO7903 by @xtosk

Duncan Epping · Sep 1, 2016 ·

This is the session I have been waiting for; I had it very high on my “must see” list, together with the session presented by Christian Dickmann earlier today. Not because it happened to be presented by our Storage and Availability CTO Christos Karamanolis (@XtosK on Twitter), but because of the insights I expected this session to provide. The title I think says it all: An Industry Roadmap: From storage to data management.

** This session literally just finished a second ago and I wanted to publish this write-up asap, so keep that in mind when reading the rest of the article; if there are any typos, my apologies. **

Christos starts by explaining the current problem. Information is growing at an enormous rate, doubling every 2 years, and that is on the conservative side. Where does the data go? According to analysts it is not expected to land on traditional storage; the growth of traditional storage is slowing down and even turning negative. Two new types of storage have emerged and are growing fast: Hyper-scale Server SAN Storage and Enterprise Server SAN Storage, aka hyper-converged systems.

With new types of applications changing the world of IT, data management is more important than ever before. Today’s storage products do not meet the requirements of this rapidly changing IT world and do not provide the agility your business owners demand. Many of the infrastructure problems can be solved by hyper-converged software, all enabled by the hardware evolution we have witnessed over the last years: flash, RDMA, NVMe, 10GbE etc. These hardware changes allowed us to simplify storage architectures and deliver storage as software. But it is not just about storage, it is also about operational simplicity: how do we enable our customers to manage more applications and VMs with less? Storage Policy Based Management has enabled this for both Virtual SAN (hyper-converged) and Virtual Volumes in more traditional environments.
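To make the idea behind policy based management a bit more concrete, here is a minimal conceptual sketch in Python. It is purely illustrative and not VMware’s actual SPBM API; the policy names, rules and datastore capabilities are made up for this example. The point is simply that a policy describes what a VM needs, datastores advertise what they offer, and placement is a matching exercise, so the administrator manages intent rather than LUNs.

```python
# Conceptual illustration of policy-based placement; the policy names, rules and
# datastore capabilities are invented and do not reflect the real SPBM data model.
policies = {
    "gold":   {"failuresToTolerate": 1, "flash": True},
    "bronze": {"failuresToTolerate": 0, "flash": False},
}

datastores = {
    "vsanDatastore": {"failuresToTolerate": 2, "flash": True},
    "nfs-archive":   {"failuresToTolerate": 0, "flash": False},
}

def compatible(policy: dict, capabilities: dict) -> bool:
    """A datastore is compatible when it meets or exceeds every rule in the policy."""
    return (capabilities["failuresToTolerate"] >= policy["failuresToTolerate"]
            and (capabilities["flash"] or not policy["flash"]))

def place_vm(vm: str, policy_name: str) -> str:
    """Place a VM on the first datastore that satisfies its storage policy."""
    policy = policies[policy_name]
    for name, capabilities in datastores.items():
        if compatible(policy, capabilities):
            print(f"{vm} ({policy_name}) -> {name}")
            return name
    raise RuntimeError(f"No compatible datastore for policy {policy_name}")

place_vm("sql01", "gold")     # ends up on vsanDatastore
place_vm("logs01", "bronze")  # any datastore qualifies, first match wins
```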

Data lifecycle management however is still challenging: snapshots, clones, replication, dedupe, checksums, encryption. How do I enable these on a per-VM level? How do we decouple all of these data services from the underlying infrastructure? VMware has been doing that for years; the best example is vSphere Replication, where VMs and virtual disks can be replicated on a case-by-case basis between different types of storage systems. It is even possible to leverage an orchestration solution like Site Recovery Manager to manage your DR strategy end to end from a single interface, from private cloud to private cloud but also from private to public. The private-to-public scenario is enabled by the vCloud Availability suite, where you can pay as you g(r)o(w). All of this is again driven by policy and through the interface you use on a daily basis, the vSphere Web Client.

How can we improve the world of DR? Just imagine there was a portable snapshot: a snapshot that is decoupled from storage, can be moved between environments, and can be stored in public or private clouds, maybe even both at the same time. This is something we as VMware are working on. A portable snapshot could be used for data protection purposes: local copies, and archived copies in remote datacenters with a different SLA/retention.

How does this scale, however, when you have 10,000s of VMs, especially when there are tens or even hundreds of snapshots per VM? This should all be driven by policy. If I can move the data to different locations, can I use this data for other purposes as well? How about leveraging it for test & dev or analytics? Portable snapshots providing application mobility.

Christos next demoed what the above may look like in the future. The demo shows a VM being replicated from vSphere to AWS, but vSphere to vSphere and vSphere to Azure were also available as options. The normal settings are configured (destination datastore and network) and literally within seconds the replication starts. The UI looks very crisp and seems similar to what was shown in the keynote on day 1 (Cross-Cloud Services). But how does this work in the new world of IT? What if I have many new-gen applications, containers / microservices?

A Distributed File System for Cloud Native apps is now introduced. It appears to be a solution which sits on top of Virtual SAN and provides a file system that can scale to 1000s of hosts, with functionality like highly scalable and performant snapshots and clones. The snapshots provided by this Distributed File System are also portable; this concept being developed is called exoclones. It is not something that just lives in the heads of the engineering team: Christos actually showed a demo of an exoclone being exported and imported into another environment.

If VMware does provide that level of data portability, how do you track and control all that data? Data governance is key in most environments: how do we enforce compliance, integrity and availability? This will be the next big challenge for the industry. There are some products which can provide this today, but nothing that can do it cross-cloud and for both current and new application architectures and infrastructures.

For years we seem to have been under the impression that the infrastructure was the center of the universe. In reality it serves a clear purpose: host applications and provide users access to data. Your company’s data is what is most important. We as VMware realize that and are working to ensure we can help you move forward on your next big journey. In short, it is our goal that you can focus on data management and no longer need to focus on the infrastructure.

Great talk,

Recommended viewing: VMUG Sessions

Duncan Epping · Nov 28, 2014 ·

Last week I presented at a couple of VMUGs, and at those VMUGs a whole bunch of sessions were recorded. I receive a lot of requests to speak at VMUGs and although I try to attend many of them, there are still quite a few I unfortunately have to decline. Whenever I visit a VMUG I try to attend various sessions just to get a better understanding of what it is our partners offer, how our customers use our products and what type of questions are raised. Below you can find a couple of the sessions (including my own) which I enjoyed and recommend watching. I understand that it is difficult to find a block of 5 hrs to watch these, but I would urge you to do so as they will prepare you for what is coming in the future.

HA Futures: Pro-active response

Duncan Epping · Oct 4, 2013 ·

We all know (at least I hope so) what HA is responsible for within a vSphere cluster. Although it is great that vSphere HA responds to a failure of a host / VM / application, and in some cases even your storage device, wouldn’t it be nice if vSphere HA could pro-actively respond to conditions which might lead to a failure? That is what we want to discuss in this article.

What we are exploring right now is the ability for HA to avoid unplanned downtime. HA would detect specific (health) conditions that could lead to catastrophic failures and pro-actively move virtual machines off that host. You could for instance think of a situation where 1 out of 2 storage paths goes down. Although this does not directly impact the machines from an availability perspective, it could be catastrophic if that second path goes down as well. So in order to avoid ending up in this situation, vSphere HA would vMotion all the virtual machines to a host which does not have a failure.

This could of course also apply to other components like networking, or even memory or CPU. You could potentially have a memory DIMM which is reporting specific issues that could impact availability; this in turn could trigger HA to pro-actively move all potentially impacted VMs to a different host.
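To make the proposal a little more tangible, here is a minimal sketch in plain Python of what such policy-driven pro-active evacuation logic could look like. It is purely illustrative: the severity levels, condition names and helper functions are assumptions for this example, not an existing vSphere HA interface.

```python
from dataclasses import dataclass, field
from enum import Enum

# Hypothetical severity levels; real HA health conditions are not exposed like this.
class Severity(Enum):
    INFO = 1
    DEGRADED = 2   # e.g. 1 out of 2 storage paths down
    CRITICAL = 3   # e.g. a memory DIMM reporting errors

@dataclass
class Host:
    name: str
    conditions: dict = field(default_factory=dict)   # condition name -> Severity
    vms: list = field(default_factory=list)

def needs_evacuation(host: Host, threshold: Severity = Severity.DEGRADED) -> bool:
    """True when any reported health condition meets or exceeds the policy threshold."""
    return any(sev.value >= threshold.value for sev in host.conditions.values())

def evacuate(host: Host, other_hosts: list) -> None:
    """Pro-actively vMotion all VMs off an unhealthy host to hosts without conditions."""
    targets = [h for h in other_hosts if not h.conditions]
    for i, vm in enumerate(list(host.vms)):
        target = targets[i % len(targets)]   # naive round-robin placement
        host.vms.remove(vm)
        target.vms.append(vm)
        print(f"vMotion {vm}: {host.name} -> {target.name}")

# Example: one of two storage paths fails on esx01, so its VMs are moved to esx02
esx01 = Host("esx01", {"storage-path-redundancy-lost": Severity.DEGRADED}, ["db01", "web01"])
esx02 = Host("esx02", vms=["app01"])
if needs_evacuation(esx01):
    evacuate(esx01, [esx02])
```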

A couple of questions we have for you:

  1. When such partial host failures occur today, how do you address these conditions? When do you bring the host back online?
  2. What level of integration do you expect with management tools? In other words, should we expose an API that your management solution can consume, or do you prefer this to be a stand-alone solution using a CIM provider for instance?
  3. Should HA treat all health conditions the same? I.e., always evacuate all VMs from an “unhealthy” host?
  4. How would you like HA to compare two conditions? E.g., a fan failure on host 1 versus a network path failure on host 2?

Please chime in,

vSphere HA Futures: Restart Order

Duncan Epping · Sep 13, 2013 ·

At VMworld I hosted a group discussion together with Keith Farkas (HA Lead Engineer) on the topic of HA futures. Based on this group discussion session, Keith and I decided to gather more feedback from the field; this post will hopefully help us with that. Please do not hesitate to comment. I will have a couple of articles following this one, but let’s get started with HA futures for the restart order first.

A topic that has come up at various sessions is HA restart ordering / priorities. Today HA provides four levels of restart priority: High, Medium, Low, and Disabled. The thing to note with the current restart priority though is that there is no guarantee VMs are actually restarted in that order when the VMs are started on more than one host. Even when HA restarts them in the right order, there is also no guarantee around when the boot cycle completes. Typically a large virtual machine running for instance a database will take longer to boot than a server just running DNS. So what do we propose? We propose restart orders instead of restart priority. What does this mean, and what would we like to know from you?

There are two complementary ways of implementing this and we would like your feedback including which one you think would be most useful.

  1. Global Restart Order aka Bucketing
  2. VM to VM dependency Chains

Let’s explain these two options, and then I will let you chime in.

Global Restart Order aka Bucketing is basically what you have today with “restart priorities”, only it will actually enforce the restart order and it will allow for more flexibility. With this option you could for instance create 5 buckets and then add virtual machines to these buckets appropriately. These buckets could be: Priority 1, Priority 2, and so on. When a failure has occurred, vSphere HA would first restart all VMs in the bucket “Priority 1” and, when that bucket has finished starting (e.g., wait for the VMware Tools heartbeat to report “alive” for each VM), vSphere HA would continue with the next bucket, and so on. Waiting for VMware Tools to report “alive” is one way to determine that a VM is “ready”. We are thinking of providing three other “wait” options: wait for an application heartbeat, wait a certain amount of time after the VM powers on, or today’s behavior, wait for the power-on task to complete.
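As a thought experiment, the bucketing proposal could be modeled roughly as in the sketch below (plain Python, purely illustrative; the wait-condition names and helper functions are assumptions, not an existing HA interface): restart all VMs in a bucket, wait until each VM reports “ready” according to the configured wait condition, then move on to the next bucket.

```python
import time

def power_on(vm: str) -> None:
    """Stand-in for the actual restart call issued by HA."""
    print(f"powering on {vm}")

def vm_reports_ready(vm: str, condition: str) -> bool:
    """Stand-in for a real readiness check (VMware Tools heartbeat, app heartbeat, ...)."""
    return True

def wait_until_ready(vm: str, condition: str = "tools-heartbeat", timeout: int = 600) -> bool:
    """Poll the configured readiness signal for a single VM until the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if vm_reports_ready(vm, condition):
            return True
        time.sleep(5)
    return False

def restart_by_bucket(buckets, condition: str = "tools-heartbeat") -> None:
    """Restart buckets in order; only move on once every VM in a bucket is ready."""
    for number, bucket in enumerate(buckets, start=1):
        for vm in bucket:
            power_on(vm)
        for vm in bucket:
            if not wait_until_ready(vm, condition):
                # Open question from the post: block, move on after a timeout,
                # or something else? This sketch simply logs and continues.
                print(f"Priority {number}: {vm} did not become ready in time")

# Priority 1: infrastructure services, Priority 2: databases, Priority 3: everything else
restart_by_bucket([["dns01", "ad01"], ["db01", "db02"], ["web01", "app01"]])
```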

I guess a couple of questions we have:

  1. How many levels would you like to see?
  2. Which of the wait conditions (e.g., wait on VMtools) are most useful for you?
  3. Suppose HA could not power on a “Priority 1” VM. Do you want HA to stop powering on the “Priority 2” etc VMs until it can, move to the “Priority 2” group after a timeout, or something else?
The second option is VM-to-VM dependency chains. These can be seen as an explicit restart order for a specific group of VMs which typically would form a service, I guess not unlike the vApp construct today, but without all the caveats and restrictions around it. (vApps are essentially resource pools, and we don’t want resource management in this case, just restart ordering.) In the simplest form, you could imagine specifying ordered lists of VMs, each list specifying the restart order for that set: the VMs in a list would be powered on sequentially. For example, something like the following:

Database VM –> Application Server –> Web Server
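Expressed as a simple ordered list, a dependency-chain restart could look like the sketch below (again plain Python and purely illustrative, not an actual HA construct; the helper callbacks are assumptions): each VM in the chain is only powered on after the one before it reports ready.

```python
def restart_chain(chain, power_on, is_ready, stop_on_failure=False):
    """Power on the VMs in one dependency chain strictly in order.

    stop_on_failure reflects one of the open questions below: if a VM earlier in
    the chain cannot be restarted, do we still power on its dependents, or not?
    """
    for index, vm in enumerate(chain):
        power_on(vm)
        if not is_ready(vm):
            print(f"{vm} failed to become ready")
            if stop_on_failure:
                return chain[index + 1:]   # remaining VMs are left powered off
    return []

# Example chain: Database VM -> Application Server -> Web Server
skipped = restart_chain(["db01", "app01", "web01"],
                        power_on=lambda vm: print(f"powering on {vm}"),
                        is_ready=lambda vm: True)
```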

As you can see, that would offer a significant amount of granularity, but also potentially a lot of operational complexity. How far would you like to go, I guess, is the question. Questions we have for you:

  1. Is an ordered list sufficient to express dependencies in a chain of VMs or do you need more sophistication?
  2. If a VM with dependents fails, do you expect HA to restart the dependent (child) VMs even though the VM they depend on has failed?
  3. What if HA is not able to restart a VM with dependents: should HA restart these dependent VMs after a delay, or only after the first VM is restarted?
A final question. We think bucketing will be easier to manage operationally but it introduces artificial dependencies between VMs and will make it take much longer to restart all VMs after a failure. How significant are these limitations?

That is it for now… Please chime in, as your response will help us define the future of vSphere HA.

Future HA developments… (VMworld – BC3197)

Duncan Epping · Sep 15, 2009 ·

I was just listening to “BC3197 – High Availability – Internals and Best Practices” by Marc Sevigny. Marc is one of the HA engineers and is also my primary source of information when it comes to HA. Although most information can be found on the internet it’s always good to verify your understanding with the people who actually wrote it.

During the session Marc explains, as I have also written about in this article, that when a dual host failure occurs the global startup order is not taken into account. With the current version the startup order is processed per host. In other words, the VMs on “Host A” are restarted first, taking the startup order into account, and then the VMs on “Host B”, again taking the startup order into account.

During the session however Marc revealed that in a future version of HA the global startup settings (cluster based) will be taken into account for any number of host failures! Great stuff. Another thing to mention is that they are also looking into an option which would enable you to pick your primary hosts. For blade environments this will be really useful. Thanks Marc for the insights,
