cloud

Project Dimension – VMware’s Edge Computing effort

Duncan Epping · Nov 20, 2018 ·

Internally some of my focus has been shifting, going forward I will spend more time on edge computing besides vSAN. Edge (and IoT for that matter) has had my interest for a while, and when VMware announced an edge project I was intrigued and interested instantly. At VMworld US the edge computing efforts were announced. The name for the effort is Project Dimension. There were several sessions at VMworld, and I would recommend watching those if you are looking for more info then provided below. The session out of which I took most of the below info was IOT2539BE, titled “Project Dimension: the easy button for edge computing” by Esteban Torres and Guru Shashikumar. Expect more content on Project Dimension in the future as I start getting involved more.

What is Project Dimension? What discussed at VMworld was the following:

A new VMware Cloud service; starting at edge locations
Enable enterprises to consume compute, storage, and networking at the edge like they consume public cloud
VMware will work with OEM partners to deliver and manage hyperconverged appliances in edge locations
- All appliances will be managed by VMware via VMware Cloud

So what does it include? Well as mentioned it includes hardware, the type etc hasn’t been mentioned, but it was said that Dell and Lenovo are the first two OEMs to support Project Dimension. This hyperconverged solution will include:

vSphere
vSAN
Velocloud

This solution will be managed by a “hybrid cloud control plane” as it is referred to, all by VMware. Architecturally this is what the service will look like:

Now what I found very interesting is that during the session someone asked about the potential for Dimension in on-prem datacenters, and the answer was: “Edge is where we are beginning, but the long-term plan is to offer the same model for data centers as well”. Some may notice that in the above list and diagram NSX is missing, as mentioned during the session, this is being planned for, but preferably will be a “lighter” flavor. What also stands out is that the HCI solution includes not only compute but also networking (switches and SD-WAN appliance).

Now, what is most interesting is the management aspect, VMware and the OEM partner will do the full maintenance/lifecycle management for you. This means that if something breaks the OEM will fix it, you as a customer however always contact VMware, single point of contact for everything. If there’s an upgrade then VMware will go through that motion for you. Every edge cluster for instance also has a vCenter Server instance, but you as an administrator/service owner will not be managing that vCenter Server instance, you will be managing the workloads that run in that environment. This to me makes sense, as when you scale out and potentially have hundreds or thousands of locations you don’t want to spend most of your time managing the infra for that, you want to focus on where the company’s revenue is.

Now getting back to the maintenance/upgrades. How does this work, how do you know you have sufficient capacity to allow for an upgrade to happen? VMware will also ensure this is possible by doing some form of admission control, which prevents you to claim 100% of the physical resources. Another interesting thing mentioned is that Dimension will allow you to chose when the upgrade or patches will be applied. In most environments maintenance will have an impact on workloads in some shape or form, so by providing blackout dates a peak season/time can be avoided.

From a hardware point of view and procurement perspective, this service is also different then you are used to. The services will be on a subscription basis. 1 year or 3-year reserved edge clusters, or more of course. And from a hardware perspective, it kind of aligns with what you typically see in the cloud: Small, Medium or Large instance. Which then refers to the number of resources you get per node. Starting with 3 nodes, of course, have the ability to scale up and potentially start smaller than 3 nodes in the future. The process in terms of sign up / procurement is displayed in the diagram below, delivery would be within 1-2 weeks, which seems extremely fast to me.

What I also found interesting was the mention of a “try and buy” option, you pay for 3 months and if you like it you keep it, and your 3 months contract will go to 1 year (or so) automatically.

At this point you may be asking: why is VMware doing this? Well, it is pretty simple: demand and industry changes. We are starting to see a clear trend, more and more workloads are shifting closer to the consumer. This allows our customers to process data faster and more importantly respond faster to the outcome, and of course, take action through machine learning. But the biggest challenge customers have is consistently managing these locations at a global scale, and this is what Project Dimension should solve. This is not just a challenge at the edge, but across edge, on-prem and public cloud if you ask me. There are so many moving parts, various different tools, and interfaces, which just makes things overly complex.

So what is VMware planning on delivering with Project Dimension? Consistently, reliable and secure hyperconverged infrastructure which is managed through a Cloud Control Plane (single pane of glass management for edge environments) and edge-to-cloud connectivity through Velocloud SD-WAN. (Management traffic for now, but “edge to edge” and “edge to on-prem” soon!) There’s a lot of innovation happening at the back-end when it comes to managing and maintaining 1000s of edge locations, but you as a customer are buying simplicity, reliability, and consistency.

Please note, Project Dimension is in beta, and the team is still looking for beta customers. You need to have a valid use case, as I can see some of you thinking “nice for a home lab for a couple of weeks”, but that, of course, is not what the team is looking for. For those who have a good use case, please go to the product page and leave your details behind: http://vmwa.re/dimension

HA Admission control: How can I check how much reserved resources are used?

Duncan Epping · Sep 14, 2018 ·

I had this question twice in the past three months, so somehow this is something which isn’t clear to everyone. With HA Admission Control you set aside capacity for fail-over scenarios, this is capacity which is reserved. But of course VMs also use reservations, how can you see what the total combined used reserved capacity is including Admission Control?

Well, that is actually pretty simple it appears. If you open the Web Client, or the H5 client, and go to your cluster then you have a “Monitor” tab, under the Monitor tab there’s an option called “Resource Reservation” and it has graphs for both CPU as well as Memory. This actually includes admission control. To verify this I took a screenshot in the H5 client before enabling admission control, and after enabling it, as you can see a big increase in “reserved resources”, which indicates that admission control is taken into consideration.

Before:

After:

This is also covered in our Deep Dive book, by the way, if you want to know more, pick it up.

My top VMworld session picks

Duncan Epping · Aug 7, 2018 ·

Every year I post a list of my favorite VMworld sessions, my top picks. There are way too many sessions to see, but these are definitely the sessions I would like to attend personally. That could be because of the speaker, or the content, and preferably both. Yes I know, this list will have some great sessions missing, not because I did not like the abstract or speaker, but simply because I forced myself to limit this list to 10. Before we get started, here are the two sessions I have scheduled, make sure to sign up for those while you still can, as both seem to be at 80+ % capacity right now

The Power of Storage Policy-Based Management [HCI1270BU] – Cormac Hogan & Duncan Epping
Tuesday, Aug 28, 12:30 p.m. – 1:30 p.m.
The world of software-defined storage moves at a rapid pace, and VMware is one of the biggest enablers. In this session, Cormac and Duncan will guide you through the world of software-defined storage initiatives at VMware and provide a primer to VMware vSAN, VMware Virtual Volumes (VVol), persistent cloud-native storage options (Project Hatchway), the VMware vSphere APIs for I/O filtering, and the binding factor in these cases: storage policy-based management. Be warned: We will bring demos!
vSphere Clustering Deep Dive, Part 1: vSphere HA and DRS [VIN1249BU] – Frank Denneman & Duncan Epping
Monday, Aug 27, 12:30 p.m. – 1:30 p.m.
In this session, Duncan and Frank will take you through the trenches of VMware vSphere Distributed Resource Scheduler (DRS) and vSphere High Availability (HA). Find out about options to optimize your DRS settings for your specific requirements and goals, such as if you should be load balancing on active or consumed memory, as well as what has recently changed in the DRS algorithm and if it will impact DRS behavior. And for vSphere HA, you will learn about when it restarts virtual machines (VMs), what kind of restart times to expect, and where you can find evidence that a VM (or multiple) have been restarted. You will find out about all of these items and more. Prepare to dive deep, as the basics will not be covered.

Here are my top picks, note that although I picked Ravi’s session from the Extreme Performance Series, all of them are worth attending!

Extreme Performance Series: vCenter Performance Deep Dive [VIN1759BU] Ravi Soundararajan
Tuesday, Aug 28, 5:00 p.m. – 6:00 p.m.
In this talk, you will get a brief description of the internals of VMware vCenter before going into basic performance troubleshooting and monitoring techniques. Find out about various tools for analyzing resource usage, important metrics like sessions and API calls, and database performance (primarily for the vCenter Server Appliance, but also for vCenter Server for Windows). You will get to understand the differences between vCenter and Platform Services Controller, and consider the impact of linked mode and plug-ins/extensions. By the end of the talk, you’ll understand how your vCenter works, when you may need multiple vCenters, and how Platform Services Controller factors into performance. xPerfSeries
Tech Preview: The Road to a Declarative Compute Control Plane [VIN2256BU] Maarten Wiggers & Frank Denneman
Tuesday, Aug 28, 12:30 p.m. – 1:30 p.m.
Declarative control planes are becoming increasingly popular in the industry. Instead of explicitly defining configurations, declarative control planes tell the architecture what the desired state should be. The desired state could be high priority, or keep particular VMs or containers separate. Within the software-defined data center (SDDC), VMware vSphere offers two declarative control planes: one for networking and one for storage. However, there is no declarative control plane for compute yet.
Compute policy provides a framework to allow our customers the flexibility and control of VM placement and resourcing decisions based on the user’s encompassing application needs. In this session, you will learn about the capabilities introduced in the VMware Cloud SDDC as a path to achieve that goal.
Clustering Deep Dive 2: Quality Control with DRS and Network I/O Control [VIN1735BU] Niels Hagoort & Sahan Gamage
Tuesday, Aug 28, 2:00 p.m. – 3:00 p.m.
In this session, you will go through the trenches of network-aware VMware vSphere DRS and vSphere Network I/O Control. You may ask yourself what these two have to do with each other as, unfortunately, not many people know about the enhancements added to the DRS algorithm around network-aware load balancing. If you want to understand how this can help prevent problems from occurring with network-intensive workloads like NFV, then this is a session you cannot miss!
Project Fractal – The Easy Button for Edge Computing [IOT2593BU] – Dennis Lu & Sridevi Ravuri
Tuesday, Aug 28, 4:00 p.m. – 5:00 p.m.
Come and learn about how VMware can accelerate your adoption of Edge Computing by dealing with the additional complexity and cost of infrastructure management at the Edge, helping you quickly achieve the cost savings and revenue growth benefits of Edge Computing. This is also a great opportunity to shape the direction of VMware’s edge services to help fit customer needs.
vSAN Deployment Topology and Availability Deep Dive: What You Need to Know [HCI2040BU] Paudie O’Riordan & Mansi Shah
Wednesday, Aug 29, 8:00 a.m. – 9:00 a.m.
Today, VMware vSAN can be deployed in many different form factors; for example, vSAN 2-Node ROBO, vSAN Fault domains, Stretch Cluster with and without local protection, and more. These deployment models make vSAN quite flexible and unique. This session will help you understand the different trade-offs and focus on the benefits and overheads of the choice you’ve made in your vSAN proposed design. Join Mansi and Paudie as they discuss these topologies in depth from both an engineering perspective and a practical real-world implementation. Paudie and Mansi will take a no-nonsense review of how to approach designing a fault-tolerant vSAN deployment and give real-world examples of how to achieve the best design from both an availability and performance perspective.
Top 10 Automation Requests and How You Can Save Time [VIN2527BU] Alan Renouf & William Lam
Monday, Aug 27, 2:00 p.m. – 3:00 p.m.
After working firstly as customers and secondly at VMware, Alan and William have encountered hundreds of ways to save time through automation. In this session, they will take you through the top automation requests and how they were completed, teaching you not only how to reproduce them yourself, but also giving you a framework to enable you to automate your top 10 requests.
This session will include a number of techniques and languages, such as PowerShell, PowerCLI, Python, Java, .NET, and simple web applications with JavaScript.
Data Lifecycle Management in Hybrid Clouds [HCI1705BU] Christos Karamanolis & Ilya Languev
Tuesday, Aug 28, 2:00 p.m. – 3:00 p.m.
The focus of IT and DevOps organizations is shifting from storage toward data management independent of infrastructure and locations. This trend is partly driven by a new generation of applications that extract business value from data (big data, analytics, machine learning). Customers need cost-effective data storage but also data mobility, copy management, and on-demand access as business requirements and IT investments evolve. Join Christos Karamanolis (CTO, Storage and Availability) and Ilya Languev (Principal Engineer) as they outline the VMware vision around data lifecycle management that spans private data centers and public clouds. They will discuss VMware’s R&D investments in this space and use real-world examples and demos to highlight the benefits for our customers, both for traditional and cloud-native applications.
VMware CTO Panel: What’s Over the Horizon? [CTO3496PU] Ray O’Farrell, Christos Karamanolis, Chris Wolf, Shawn Bass, Pere Monclus
Tuesday, Aug 28, 5:30 p.m. – 6:30 p.m.
VMware CTOs spend significant time assessing emerging technology trends, taking a practical look at their potential impacts and opportunities for VMware. This session explores emerging areas, inclusive of edge, the Internet of things, artificial intelligence (AI)/machine learning (ML), SD-WAN and network service mesh, distributed data management, and more. There will also be ample time for you to have your most pressing questions answered.
Smart Placement of Workloads in Tomorrow’s Distributed Cloud [CTO2161BU] Daniel Beveridge
Tuesday, Aug 28, 1:00 p.m. – 2:00 p.m.
This session will offer a look at the evolution of cloud as we move from a nega-cloud-focused experience into a more distributed cloud experience where compute evolves toward a mesh of resources. Find out about a technology project sponsored by VMware’s Office of the CTO that has developed a novel approach to the placement of workloads in a vast marketplace of providers, resulting in a seamless cloud burst experience across a range of providers. You will learn about some cutting-edge cloud technology that points toward a new way of consuming cloud services with an emphasis on reducing cost, improving user experience, and offering increased flexibility and agility in workload management.
Optimizing vSAN for Performance [HCI1246BU] Cormac Hogan & Paudie O’Riordan
Tuesday, Aug 28, 3:30 p.m. – 4:30 p.m.
The VMware vSAN team gets many questions on performance. For example, does adding a second disk group improve performance? Does adding a stripe width to an object make things faster? Does increasing the MTU size matter? Does mixing SAS and SATA make a difference? Join this session for answers to these sorts of questions. Paudie and Cormac will discuss the results of various performance tests they initiated in their labs to reach these conclusions. You will learn about the benchmark tool of choice, HCIBench, as well as all the different nuances that can make a difference to your benchmarking results.

Also note, there’s a long list of “deep dive” session at vmworld this year, do a search and register before it is too late!

Now Available: vSphere 6.7 Clustering Deep Dive book!

Duncan Epping · Jul 30, 2018 ·

Over the past couple of months Frank, Niels and I have worked ferociously to update the vSphere Clustering Deep Dive. Some of the material was already brought up to date to vSphere 6.0 U2, but the majority was never updated after vSphere 5.1. As you can imagine, this was a tremendous undertaking. Not only did we need to validate every sentence, all diagrams needed to be updated, and with the introduction of the HTML-5 Client also all screenshots had to be retaken.

Now, just a couple of weeks before VMworld, we are finally at the point where we can press “publish”.

What can you expect? Well, we have said this with previous books, this is not a beginners guide! This is a deep dive, and we aimed to take you in to the trenches of vSphere Clustering technologies. We cover a multitude of different features, and for those who haven’t read the previous books expect the following features to be covered:

vSphere HA
vSphere DRS
vSphere Storage DRS
vSphere Storage I/O Control
vSphere Network I/O Control

We also have a chapter on stretched clusters, in this chapter we describe how to design and implement a vSphere Metro Storage Cluster, leveraging all of the knowledge gained in the previous chapters.

For your convenience, I copied/pasted some of the Amazon info below.

—

Paperback: 566 pages
Publisher: CreateSpace Independent Publishing Platform; 1 edition (July 29, 2018)
Language: English
ISBN-10: 1722625325
ISBN-13: 978-1722625320
Product Dimensions: 5.5 x 1.3 x 8.5 inches
Shipping Weight: 1.8 pounds

—

I hope all of you will enjoy the book as much as we enjoyed writing it. And before I forget, I want to thank my co-authors for the late night discussions, the hard work, insights and fun/laughter at times.

Get it while it is hot! (Look on the right side column for the links to the book!)

Instant Clone in vSphere 6.7 rocks!

Duncan Epping · May 1, 2018 ·

I wrote a blog post a while back about VMFork, which then afterwards got rebranded to Instant Clone. In the vSphere 6.7 release there has been a major change to the architecture of VMFork aka Instant Clone, so I figured I would do an update on the post. As an update doesn’t stand out from the rest of the content I am sharing it as a new post.

Instant Clone was designed and developed to provide a mechanism that allows you to instantaneously create VMs. In the early days it was mainly used by folks who want to deploy desktops, by the desktop community this was often referred to as “just in time” desktops. These desktops would literally be created when the user tried to login, it is indeed that fast. How did this work? Well a good way to describe it is that it is essentially a “vMotion” of a VM on the same host with a linked clone disk. This essentially leads to a situation which looks as follows:

On a host you had a parent VM and a child VM associated with it. You would have a shared base disk, shared memory and then of course unique memory pages and a delta disk for (potential) changes written to disk. The reason customers primarily used this only with VDI at first was the fact that there was no public API for it. Of course folks like Alan Renouf and William Lam fought hard for public APIs internally and they managed to get things like the PowerCLI cmdlets and python vSphere SDK pushed through. Which was great, but unfortunately not 100% fully supported. On top of that there were some architectural challenges with the 1.0 release of Instant Clones. Mainly caused by the fact that VMs were pinned to a host (next to their parent VM) and as such things like HA, DRS, vMotion wouldn’t work. Now with version 2.0 this all changes. William already wrote an extensive blog post about it here. I just went over all of the changes and watched some internal training, and I am going to write some of my findings/learnings down as well, just so that it sticks… First let’s list the things that stood out to me:

Targeted use cases
- VDI
- Container hosts
- Big data / hadoop workers
- DevTest
- DevOps
There are two workflows for instant clone
- Instant clone a running a VM, source and generated VMs continue running
- Instant clone a frozen VM, source is frozen using guestRPC at point in time defined by customer
No UI yet, but “simple API” available
Integration with vSphere Features
- Now supported: HA, DRS, vMotion (Storage / XvMotion etc)
Even when TPS is disabled (default) VMFork still leverages the P-Share technology to collapse the memory pages for efficiency
There is no explicit parent-child relationship any longer

Let’s look at the use cases first, I think the DevTest / DevOps is interesting. You could for instance do an Instant Clone (live) of a VM and then test an upgrade for instance for the application running within the VM. For this you would use the first workflow that I mentioned above: instant clone a running VM. What happens here in the workflow is fairly straight forward. I am using William’s screenshots of the diagrams the developers created to explain it. Thanks William, and dev team 🙂

Now note that above when the first clone is created the source gets a delta disk as well. This is to ensure that the shared disk doesn’t change as that would cause problems for the target. Now when a 2nd VM is created and a 3 the source VM gets additional delta’s. This as you can imagine isn’t optimal and will over time even potentially slow down the source VM. Also one thing to point out is that although the mac address changes for the generated VM, you as the admin still need to make sure the Guest OS picks this up. As mentioned above, as there’s no UI in vSphere 6.7 for this functionality you need to use the API. If you look at the MOB you can actually find the InstantClone_Task and simply call that, for a demo scroll down. But as said, be careful as you don’t want to end up with the same VM with the same IP on the same network multiple times. You can get around the Mac/IP conflict issue rather easy and William has explained how in his post here. You can even change the port group for the NIC, for instance switch over to an isolated network only used for testing these upgrade scenarios etc.

That second workflow would be used for the following use cases: VDI, Container Hosts, Hadoop workers… all more or less the same type of use case. Scale out identical VMs fast! In this case, well lets look at the diagram first:

In the above scenario the Source VM is what they call “frozen”. You can freeze a VM by leveraging the vmware-rpctool and run it with “instantclone.freeze”. This needs to happen from within the guest, and note that you need to have VMware tools installed to have vmware-rpctool available. When this is executed, the VM goes in to a frozen state, meaning that no CPU instructions are executed. Now that you froze the VM you can go through the same instant clone workflow. Instant Clone will know that the VM is frozen. After the instant clone is create you will notice that there’s a single delta disk for the source VM and each generated VM will have its own delta disk as shown above. Big benefit is that the source VM won’t have many delta disks. Plus, you know for sure that every single VM you create from this frozen VM is 100% identical as they all resume from the exact same point in time. Of course when the instant clone is created the new VM is “unfrozen / resumed”, the source will remain frozen. Note that if for whatever reason the source is restarted / power cycled then the “frozen state” is lost. Another added benefit of the frozen VM is that you can automate the “identity / IP / mac” issue when leveraging the “frozen source VM” workflow. How do you do this? Well you disable the network, freeze it, instant clone it (unfreezes automatically), make network changes, enable network. William just did a whole blog post on how to do various required Guest changes, I would highly recommend reading this one as well!

So before you start using Instant Clone, first think about which of the two workflows you prefer and why. So what else did I learn?

As mentioned, and this is something I never realized, but even when TPS is disabled Instant Clone will still share the memory pages through the P-Share mechanism. P-Share is the same mechanism that TPS leverages to collapse memory pages. I always figured that you needed to re-enable TPS again (with or without salting), but that is not the case. You can’t even disable the use of P-Share at this point in time… Which personally I don’t think is a security concern, but you may think different about it. Either way, of course I tested this, below you see the screenshot of the memory info before and after an instant clone. And yes, TPS was disabled. (Look at the shares / saving values…)

Before:

After:

Last but not least, the explicit parent-child relationship caused several problems from a functionality stance (like HA, DRS, vMotion etc not being supported). Per vSphere 6.7 this is no longer the case. There is no strict relationship, and as such all the features you love in vSphere can be fully leveraged even for your Instant Clone VMs. This is why they call this new version of Instant Clone “parentless”.

If you are wondering how you can simply test it without diving in to the API too deep and scripting… You can use the Managed Object Browser (MOB) to invoke the method as mentioned earlier. I recorded this quick demo that shows this, which is based on a demo from one of our Instant Clone engineers. I recommend watching it in full screen as it is much easier to follow that way. (or watch it on youtube in a larger window…) Pay attention, as it is a quick demo, instant clone is extremely fast and the workflow is extremely simple.

And that’s it for now. Hope that helps those interested in Instant Clone / VMFork, and maybe some of you will come up with some interesting use cases that we haven’t thought about it. Would be good if you have use cases to share those in the comment section below. Thanks,