disaster recovery

RE: Re-Imagining Ransomware Protection with VMware Ransomware Recovery

Duncan Epping · Apr 13, 2023 ·

Last week a blog post was published on VMware’s Virtual Blocks blog on the topic of Ransomware Recovery. Some of the numbers shared were astonishing and hard to put into context. Global damages caused by ransomware, for instance, are estimated to exceed 42 billion dollars in 2024, and this figure is expected to double every year. Also, 66% of all enterprises were hit by ransomware, of which 96% did not regain full access to their data.

Now, it explicitly mentions “enterprises”, but this does not mean that only enterprise organizations are prone to ransomware attacks. Ransomware attacks do not discriminate: every company, non-profit, and even individual is at risk if you ask me. As a smart person once said, data is the new oil, and it seems that everyone is drilling for it, including trespassers who don’t own the land! Of course, depending on the type of organization, solutions and services are available to mitigate the risks of losing access to your company’s most valuable asset: data.

VMware, and many other vendors, have various solutions (and services) to protect your data center, your workloads, and essentially your data. But what do you do if you are breached? How do you recover? How fast can you recover, and how fast do you need to recover? How far back do you need to go, and how far back are you allowed to go? Some of you may wonder why I ask these questions. Well, that has everything to do with the numbers shared at the start of this blog. Unfortunately, today, when organizations are breached, malicious code is often only detected after a significant amount of time. This gives the attacker time to collect information about the environment, spread throughout the environment, activate the attack, and ultimately demand the ransom.

This is when you, the administrator, the consultant, and the cloud admin, will get those questions. How fast can you recover? How far back do we need to go? Where do we recover to? And what about your data? All fair questions, but these shouldn’t be asked after an attack has occurred and ransom is demanded. These are questions we all need to ask constantly, and we should be aligning our Ransomware Recovery strategy with the answers to those questions.

Now, it is fair to say that I am probably somewhat biased, but it is also fair to say that I am as Dutch as it gets and I wouldn’t be writing this blog if I did not believe in this service. VMware’s Ransomware Recovery as a Service, which is part of VMware Cloud Disaster Recovery, provides a unique solution in my humble opinion. First, the service can simply start as a cloud storage service to which you replicate your workloads, without needing to run a full (small, but still) software-defined datacenter. This is especially useful for those organizations that can afford to take ~3 hours to spin up an SDDC when there’s a need to recover (or test the process). However, it is also possible to have an SDDC ready for recovery at all times, which will reduce the recovery time objective significantly.

Of course, VMware provides the ability to protect multiple environments, many different workloads, and many point-in-time copies (snapshots). But it also enables you to verify your recovery point (snapshot) in a fully isolated environment. What you will appreciate is that the solution will not only isolate the workloads, but also provide you with insights at various levels into the probability of the snapshot being infected. First of all, while going through the recovery process, entropy and change rate are shown, which provides insight into when the environment was potentially infected (or when the ransomware was activated, for that matter).

But maybe even more important, through the use of NSX and VMware’s Next Generation Anti-Virus software, a recovery point can be safely tried. A quarantined environment is instantiated and the recovery point can be scanned for vulnerabilities and threats, and an analysis of the workloads to be recovered can be provided, as shown below. This simplifies the recovery and validation process immensely, as it removes the need for many of the manual steps usually involved in this process. Of course, as part of the recovery process, the advanced runbook capabilities of VMware Cloud Disaster Recovery are utilized, enabling the recovery of a full data center, or simply a select group of VMs, by running a recovery plan. This recovery plan includes the order in which workloads need to be powered on and restored, but can also include IP customization, DNS registration, and more.

Depending on the outcome of the analysis, you can then determine what to do with the snapshot. Is the data not compromised? Are the workloads not infected? Are there any known vulnerabilities that we would need to mitigate first? If data is compromised, or the environment is infected in any shape or form, you can simply discard the snapshot and clean the environment. Rinse and repeat until you find a recovery point that is not compromised! If there are known vulnerabilities, and the environment is clean, you can mitigate those and complete the recovery, ultimately resulting in full access to your company’s most valuable asset: data.
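To make the "rinse and repeat" idea a bit more concrete, here is a minimal Python sketch of walking back through recovery points until a clean one is found. Note that this is purely an illustration: `Snapshot`, `is_infected`, and `find_clean_snapshot` are names I made up, not part of any VMware API, and in the actual service the validation happens in the isolated environment rather than through a simple flag.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a recovery point; not a VMware API object."""
    taken_at: datetime
    is_infected: bool  # assumed outcome of the scan in the isolated environment


def find_clean_snapshot(snapshots: Iterable[Snapshot]) -> Optional[Snapshot]:
    """Return the most recent snapshot whose scan came back clean, or None."""
    for snap in sorted(snapshots, key=lambda s: s.taken_at, reverse=True):
        if not snap.is_infected:
            return snap  # newest clean recovery point, so the least data loss
        # Infected: discard this recovery point and try the next older one.
    return None


snapshots = [
    Snapshot(datetime(2023, 4, 10), is_infected=False),
    Snapshot(datetime(2023, 4, 11), is_infected=True),
    Snapshot(datetime(2023, 4, 12), is_infected=True),
]
clean = find_clean_snapshot(snapshots)
```

Starting with the newest snapshot and working backwards keeps the amount of lost data as small as possible, which is also why an indication of the likely time of infection is so valuable: it lets you skip most of the candidates.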

VMware announces Ransomware Recovery as a Service and Data Protection vision!

Duncan Epping · Sep 13, 2022 ·

At VMware Explore there was a whole session (CEIB1236US) dedicated to the vision for Data Protection and Ransomware Recovery as a Service. Ransomware Recovery as a Service especially had my interest, as it is something that keeps coming up with customers. How do I protect my data, and when needed, how do I recover? Probably a year ago or so I had a conversation with VMware CTO for Cloud Storage and Data (Sazzala) on this topic, and we met up with various customers to gather requirements. Those discussions ultimately led to the roadmap for this new service and new features. Below I am going to summarize what was discussed in this session at VMware Explore, but I would urge you to watch the session as it is very valuable, and it is impossible for me to capture everything.

VMware’s Disaster Recovery as a Service solution is a unique offering as it provides the best of both worlds when it comes to Disaster Recovery. With DR you typically have two options:

Fast recovery, relatively high cost.
- Traditionally most customers went for this option: they had a “hot standby” environment that provided full capacity in case of emergency. But as this environment is always up and running and underutilized, it comes with significant overhead.
Slower recovery, relatively low cost.
- This is where VMs are replicated to cheap and deep storage and compute resources are limited (if available at all). When a recovery needs to happen, data rehydration is required and as such, it is a relatively slow process.

With VMware’s offering, you now have a 3rd option: Fast recovery, at a relatively low cost! VMware provides the ability to store backups on cheap storage, and then recover (without hydration) directly in a cloud-based SDDC. It provides a lot of flexibility, as you can have a minimum set of hosts constantly running within your prepared SDDC, and scale out when needed during a failure, or you can even create a full SDDC at the time of recovery.

Now, this offering is available in VMware Cloud on AWS in various regions. During the session, the intention was also shared to deliver similar capabilities on Azure VMware Solution, Oracle Cloud VMware Solution, Google Cloud VMware Engine, and/or Alibaba Cloud VMware Service. Basically all global hyperscalers. Maybe even more important, VMware also discussed additional capabilities that are being worked on: scaling to tens of thousands of VMs, managing multi-petabytes of storage, providing 1-minute RPO levels, providing multi-VM consistency, having end-to-end SLA observability, providing advanced insights into cost and usage, and probably most important… a full REST API.

All of those enhancements are very useful for those aiming to recover from a disaster, not just natural disasters, but also for Ransomware attacks. Some of you may wonder how common a ransomware attack is, but unfortunately, it is very common. Surveys have revealed that 60% of the surveyed organizations were hit by ransomware in the past 12 months, 92% of those who paid the ransom did not gain full access to the data, and the average downtime was 16 days. Those are some scary numbers in my opinion. Especially the downtime associated with an attack, and the fact that full access was not regained even after paying a ransom.

In general, recovery from ransomware is complex, as ransomware typically remains undetected for long periods of time before you are exposed to it. Then, when you are exposed, you don’t have many options: you recover to a healthy point in time or you pay the ransom. When you recover, of course, you want to know if the set you are recovering is infected or not. You also want to have some indication of when the environment was infected, as no one wants to go through 3 months of snapshots to find the right one. That alone would take days, if not weeks, and downtime is extremely expensive. This is where VMware Ransomware Recovery for VMware Cloud DR comes in.

The aim of the VMware Ransomware Recovery for VMware Cloud DR solution is to provide the ability to recover to an Isolated Recovery Environment (including networking). This first of all prevents reinfection at the time of recovery. During the recovery process, the environment is also analyzed by a next-generation anti-virus scanner for known/current threats, simply to prevent a situation where you recover a snapshot that was infected. What I am even more impressed by is that the plan is to also include a visual indication of when an environment was most likely infected. This is done by providing insight into the data change rate and entropy. Now, entropy is not a word most non-native speakers are familiar with (I wasn’t), but it refers to the randomness of the data. Both the change rate and the entropy could indicate abnormal patterns, which then could indicate the time of infection and help identify a healthy snapshot to recover!
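For those who, like me, had to look up entropy: the small Python sketch below computes the Shannon entropy of a block of bytes. This is just an illustration of the concept under my own assumptions. I don’t know how the service calculates entropy internally, and `shannon_entropy` is a helper I wrote for this example. Random bytes from `os.urandom` stand in for encrypted data here.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte.

    0.0 means a single repeated byte value; 8.0 means all 256 byte
    values occur equally often, which is what random data approaches.
    """
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # how often each byte value occurs
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


plain = b"hello world, hello world, hello world" * 100
random_like = os.urandom(len(plain))  # stands in for encrypted content

low = shannon_entropy(plain)         # repetitive text: low entropy
high = shannon_entropy(random_like)  # random-looking data: close to 8 bits per byte
```

Encrypted data looks random, so a sudden jump in entropy (together with a spike in change rate) across consecutive snapshots is a plausible hint that something started encrypting files around that time.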

As mentioned, during recovery the snapshot is scanned by a Next-Gen AV, and of course, when infections are detected they will be reported in the UI. This then provides you the option to discard the recovery and select a different snapshot. Even if no vulnerabilities are found the environment can be powered on fully isolated, providing you the ability to manually inspect before exposing app owners, or end-users, to the environment again.

Now comes the cool part: when you have curated the environment, when you are absolutely sure this is a healthy point in time that was not infected, you have the choice to fail back to your “source” environment or simply remain running in your VMware Cloud while you clean up your “source” site. Before I forget, I’ve been talking about full environments and VMs so far, but of course, it is also the intention to provide the ability to restore files and folders! All in all, a very impressive solution that should be available in the near future.

If you are interested in these capabilities and would like to stay informed, please fill out this form: https://forms.office.com/r/yh69Npq7nY.

Announcing VMware Cloud Disaster Recovery! (VCDR)

Duncan Epping · Sep 30, 2020 ·

Most of you probably saw the announcements around the acquisition of Datrium not too long ago. One of the major drivers for that acquisition was the Disaster Recovery solution which Datrium developed. This week at VMworld this service was announced as a new VMware disaster recovery option. The service is named VMware Cloud Disaster Recovery, and it provides the ability to replicate workloads from on-prem into cloud storage, and recover from cloud storage into VMware Cloud on AWS! The three key pillars of the service are ease of use, fast recovery, and cloud economics.

The solution is extensively covered in three VMworld sessions (HCI2876, HCI2886, HCI2865). I have watched all three and will provide a short summary here. What capabilities does VMware Cloud DR (VCDR) provide and why is VMware heading into this space?

The why was well explained by Mark Chuang in HCI2876. Customers are saying that:

“DR is very complex and expensive to manage, and I can’t add IT Headcount”
“Our data grows 10-15% every year, with physical DR it is hard to accommodate the growth in the datacenter to meet the needs”
“We only test full DR once a year because it is disruptive. Any time there is a major change, how can we know it still works? It is a huge issue!”

I guess that makes it clear why VMware is interested in this space: it is a huge problem for customers and the solution typically comes at a high cost. VMware has always been in the business of solving complex problems in preferably a simple way, and that is exactly what VMware Cloud Disaster Recovery delivers: a simple solution at a relatively low cost.

So what does it bring from a feature/functionality stance?

It all starts with cloud economics, to which ease-of-use also contributes, in my opinion. VMware Cloud Disaster Recovery is super simple to configure and it replicates data to “cheap and deep” cloud storage. This ensures that the cost can be kept low, and note that all of the typical costs that come with cloud storage (network, etc.) are included in the service offering by VMware. The challenge with cloud storage, however, is typically that it is relatively slow when it comes to restoring, but this is where the “on-demand” capabilities come into play. VMware Cloud DR provides the ability to instantly power on workloads through a live mount option, without the need to convert the stored data back to a VM format.

When configuring the VMware Cloud DR solution you will need to install/configure a DRaaS Connector on-prem. This on-prem Connector connects you to the SaaS platform and will provide the required details to the SaaS Orchestrator. Note that you can have multiple DRaaS Connectors for resiliency and performance reasons. When the connection is configured you will then be able to create “Protection Groups” and “DR Plans”. Those who have worked with Site Recovery Manager will recognize the terms. For those who haven’t:

Protection Groups – These groups list the workloads which will be protected by VMware Cloud DR. Of course you can define the protection schedule, basically how many snapshots need to be shipped to remote cloud storage per day/week/month.
DR Plans – These plans list workloads that would need to be failed over when the plan is triggered, and for instance, include the order in which the workloads need to be powered on. Also, if workloads need to get a different IP address in the cloud, then you can specify this here also.
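To give an idea of what a DR Plan captures, below is a small Python sketch of a plan with a power-on order and an IP mapping. Please treat this as a thought model only: `DRPlan`, `DRPlanStep`, `boot_priority`, and `ip_map` are hypothetical names I picked for the example, and this is not how VMware Cloud DR stores or exposes its plans.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DRPlanStep:
    """One workload in a hypothetical DR plan (illustrative, not the VCDR schema)."""
    vm_name: str
    boot_priority: int  # lower number powers on first


@dataclass
class DRPlan:
    name: str
    steps: List[DRPlanStep]
    # Assumed mapping from the on-prem IP address to the address used in the cloud.
    ip_map: Dict[str, str] = field(default_factory=dict)

    def power_on_order(self) -> List[str]:
        """VM names sorted by boot priority; ties keep their listed order."""
        return [s.vm_name for s in sorted(self.steps, key=lambda s: s.boot_priority)]

    def failover_ip(self, source_ip: str) -> str:
        """Return the remapped address, or the original if no mapping exists."""
        return self.ip_map.get(source_ip, source_ip)


plan = DRPlan(
    name="tier-1-apps",
    steps=[
        DRPlanStep("web-01", boot_priority=3),
        DRPlanStep("db-01", boot_priority=1),
        DRPlanStep("app-01", boot_priority=2),
    ],
    ip_map={"10.0.0.10": "172.16.0.10"},
)
order = plan.power_on_order()
```

In this example the database comes up first, then the application server, then the web tier, which is the typical reason you want the order captured in a plan rather than in someone’s head.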

Of course, besides creating protection groups and DR plans, you have the ability to test and fail over the workloads in those plans, again very similar to what Site Recovery Manager offers. Before I forget, you will of course have the option to select the snapshot you want to recover from, so you can go back to any point in time. What is unique here is that VMs are powered on without (initially) moving data from cloud storage to your VMware Cloud on AWS. It basically mounts an NFS share from the SaaS platform, and the scale-out file system ensures that the VMs can be instantly powered on. After you have tested the recovery you can then decide to migrate the VMs to your SDDC, or you can of course also discard the workloads if that is something you desire. Last but not least, you also have the ability to replicate back to on-prem, so that you can bring your workloads back whenever you have recovered your environment from the disaster that occurred and you are ready to run those workloads on-prem again.

Now there are many more details, but I am not going to share those in this post. I may do a couple of additional blogs at a future time. I hope the above gives a good overview of what the offering will provide. For more details, I would recommend watching the VMworld sessions on this topic (HCI2876, HCI2886, HCI2865). The last thing I want to share though is where the solution will be available, or at least what is being planned. As shown below, the offering should be available in multiple regions soon.

VMworld Reveals: Disaster Recovery / Business Continuity enhancements! (#HCI2894BU and #HBI3109BU)

Duncan Epping · Sep 4, 2019 ·

At VMworld, various cool new technologies were previewed. In this series of articles, I will write about some of those previewed technologies. Unfortunately, I can’t cover them all as there are simply too many. This article is about enhancements in the business continuity/disaster recovery space. There were 2 sessions where futures were discussed, namely HCI2894BU and HBI3109BU. Please note that this is a brief summary of those sessions, and that they discuss a Technical Preview: these features/products may never be released, these previews do not represent a commitment of any kind, and the features (or their functionality) are subject to change. Now let’s dive into it, what can you expect for disaster recovery in the future?

The first session I watched was HCI2894BU, which was all about Site Recovery Manager. I think the most interesting part is the future support for Virtual Volumes (vVols) for Site Recovery Manager. It may sound like something simple, but it isn’t. When the version of SRM that supports vVols ships, keep in mind that your vVol-capable storage system also needs to support it. On day 1 HPE Nimble, HPE 3PAR and Pure Storage will support it, and Dell EMC and NetApp are actively working on support. The requirements are that the storage system needs to be vVols 2.0 compliant and support VASA 3.0. Before they dove into the vVols implementation, some history and the current implementation were shared. I found it interesting to know that SRM has over 25,000 customers and has protected more than 3,000,000 workloads over the last decade.

vSphere Replication 6.5, 5 minute RPO for ALL!

Duncan Epping · Nov 16, 2016 ·

I just noticed the following in the vSphere Replication 6.5 release notes which I felt was worth sharing:

5-minute Recovery Point Objective (RPO) support for additional data store types – This version of vSphere Replication extends support for the 5 minute RPO setting to the following new data stores: VMFS 5, VMFS 6, NFS 4.1, NFS 3, VVOL and VSAN 6.5. This allows customers to replicate virtual machine workloads with an RPO setting as low as 5-minutes between these various data store options.

We have had this for vSAN specifically for a while now, but I hadn’t realized yet that we were enabling this for all sorts of datastores in this release. Definitely a great reason to move up to vSphere 6.5, re-evaluate which VMs could use a 5-minute RPO, and use this great replication mechanism that just ships with vSphere for free! More info can be found in the release notes here.
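When re-evaluating which VMs actually achieve the RPO you configured, a quick check like the Python sketch below can help. It simply looks at the timestamps of a series of recovery points and reports the largest gap between two consecutive ones. The function names and the sample timestamps are my own, this does not talk to vSphere Replication in any way, and it ignores the time that has passed since the most recent recovery point.

```python
from datetime import datetime, timedelta
from typing import List


def largest_gap(recovery_points: List[datetime]) -> timedelta:
    """Largest interval between two consecutive recovery points."""
    ordered = sorted(recovery_points)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps, default=timedelta(0))


def meets_rpo(recovery_points: List[datetime], rpo: timedelta) -> bool:
    """True if no two consecutive recovery points are further apart than the RPO."""
    return largest_gap(recovery_points) <= rpo


start = datetime(2016, 11, 16, 12, 0)
# Hypothetical replication times: the last interval is 7 minutes.
points = [start + timedelta(minutes=m) for m in (0, 5, 10, 17)]
ok = meets_rpo(points, timedelta(minutes=5))
```

With the sample data above the last interval is 7 minutes, so a 5-minute RPO is not met for that window.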

If you would like to know more about the 6.5 release, visit this page by William Lam with the links to all docs/downloads.