Yellow Bricks

by Duncan Epping


vSphere

Why does HA not power-on VMs after a full cluster shutdown?

Duncan Epping · Dec 20, 2021 ·

I received this question and figured I would write a quick post about it, as it comes up occasionally: why does vSphere HA not power on VMs when a cluster is brought back online after a full cluster shutdown? In this case, the customer had a power outage, and because the backup power unit was running out of power, an administrator cleanly powered off all hosts and VMs. Unfortunately, this happens more frequently than you would think.

When VMs are powered off by an administrator, or by anyone/anything else (PowerCLI scripts, for instance) with permission to power off VMs, vCenter Server marks these VMs as “cleanly powered off”. On top of that, vSphere HA tracks the power state of the VMs. So if a host goes down, HA knows whether each VM was powered on or powered off at the time the host went missing.

Now, when the host (or hosts) returns for duty, vSphere HA will of course verify what the last known state of the cluster was. It reads the list of VMs that were powered on, and it restarts those that were powered on and are configured for HA. It also looks at a VM property called “runtime.cleanPowerOff”, which indicates whether the VM was cleanly powered off by an admin or a script, or whether it was powered off by vSphere HA itself (a PDL response, for instance). Depending on the value of this property, the VM will or will not be restarted.
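
To make this a bit more tangible, here is a minimal PowerCLI sketch that reads this property through the vSphere API; the vCenter address and VM name are placeholders, and the comments simply restate the behaviour described above:

```powershell
# Connect to vCenter Server (placeholder address, adjust for your environment)
Connect-VIServer -Server vcsa.lab.local

# Read runtime.cleanPowerOff for a powered-off VM.
# $true  -> the VM was cleanly powered off by an admin or script, so HA has no reason to restart it
# $false -> the power-off was not a clean admin/script action (e.g. HA powered it off after a PDL response)
$vm = Get-VM -Name "App01"
$vm.ExtensionData.Runtime.CleanPowerOff
```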

Having said all of that, when you power off a VM manually through the UI or via a script, the VM is marked as “cleanly powered off”. This means HA has no reason to restart it, as the powered-off state is not the result of a host, network, or storage failure.

Unexplored Territory #005: AI Enterprise, DPUs, and NVIDIA Launchpad with Luke Wignall

Duncan Epping · Dec 13, 2021 ·

Episode 005 is out! This time we talk to Luke Wignall, who is the Director of Technical Product Marketing at NVIDIA. We talk about some of the announcements made during the NVIDIA GTC conference. Luke discusses NVIDIA Launchpad and AI Enterprise, and of course, we touch on DPUs, aka SmartNICs. A great conversation with A LOT of information to digest. Enjoy the episode, and if you haven’t done so yet, make sure to subscribe! You can of course also listen via your podcast app: Apple: https://apple.co/3lYZGCF, Google: https://bit.ly/3oQVarH, Spotify: https://spoti.fi/3INgN3R.

Can I boot ESXi from an SD card while placing OSDATA on my SAN?

Duncan Epping · Nov 16, 2021 ·

I see this question popping up all the time, literally twice a day on our VMware internal Slack: can I boot ESXi from an SD card while placing OSDATA on my SAN? I guess people are confused after reading this article. It is probably a result of skim-reading, as the article is in-depth and spells it out to the letter. The table in that article does mention FCoE and iSCSI as supported locations for OSDATA.

However, FCoE, iSCSI, and FC are only supported when you boot from SAN; only then is OSDATA supported on a SAN device. When you boot from USB/SD, OSDATA needs to reside on a locally attached device. In other words, the answer to the original question is: no, you cannot boot ESXi from an SD card and place OSDATA on your SAN. Again, for details, read this excellent document.
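
If you want to quickly verify how a given host boots and where the OSDATA (ESX-OSData) volume lives, a small PowerCLI/esxcli sketch like the one below can help. The host name is a placeholder, and this is just an illustration, not an official procedure:

```powershell
# Placeholder host name; assumes an existing Connect-VIServer session
$esxcli = Get-EsxCli -VMHost "esxi01.lab.local" -V2

# Show which device this host booted from
$esxcli.system.boot.device.get.Invoke()

# List the host's filesystems; the ESX-OSData volume shows up as a VMFS-L filesystem
$esxcli.storage.filesystem.list.Invoke() | Select-Object VolumeName, Type, Size
```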

Project Monterey, Capitola, and NVIDIA announcements… What does this mean?

Duncan Epping · Nov 8, 2021 ·

At VMworld, there were various announcements and updates around projects that VMware is working on. Three of these received a lot of attention from folks attending: Project Monterey, Project Capitola, and the NVIDIA + VMware AI Ready Platform. For those who have not followed these projects and announcements, they are all about GPUs, DPUs, and memory. The past couple of weeks I have been working on new content for some upcoming VMUGs, and these projects will most likely be part of those presentations.

When I started looking at the projects, I treated them as three separate, unrelated efforts, but over time I realized that although they are separate projects, they are definitely related. I suspect that these projects, along with other projects like Radium, will play a crucial role in many infrastructures out there. Why? Well, because I strongly believe that data is at the heart of every business transformation! Analytics, artificial intelligence, and machine learning are what most companies will, over time, require and use to grow their business, either through the development of new products and services or through the acquisition of new customers in innovative ways. Of course, many companies already use these technologies extensively.

This is where things like the work with NVIDIA, Project Monterey, and Project Capitola come into play. These projects will enable you to provide a platform to your customers on which they can analyze data, learn from the outcome, and then develop new products and services, be more efficient, and/or expand the business in general. When I think about analyzing data or machine learning, one thing that stands out to me is that today close to 50% of all AI/ML projects never go into production. This happens for a variety of reasons, but the key reasons these projects typically fail are complexity, security/privacy, and time to value.

This is why the collaboration between VMware and NVIDIA is extremely important. The ability to deploy a VMware/NVIDIA certified solution should mitigate a lot of risk, not just from a hardware point of view but also from a software perspective: this work with NVIDIA is not just about a certified hardware stack, as the software stack that is typically required for these environments comes with it as well.

So what do Monterey and Capitola have to do with all of this? If you look at some of Frank’s posts on the topic of AI and ML, it becomes clear that there are a couple of obvious bottlenecks in most infrastructures when it comes to AI/ML workloads. Some of these have been bottlenecks for storage and VMs in general for the longest time as well. What are these bottlenecks? Data movement and memory. This is where Monterey and Capitola come into play. Project Monterey is VMware’s project around SmartNICs (or DPUs, as some vendors call them). SmartNICs/DPUs provide many benefits, but the key benefit is definitely their offloading capabilities. By offloading certain tasks to the SmartNIC, the CPU is freed up for other tasks, which benefits the workloads running on the host. We can also expect the speed of these devices to go up, which will ultimately allow for faster and more efficient data movement. Of course, there are many more benefits, like security and isolation, but that is not what I want to discuss today.

Then lastly, Project Capitola, which is all about memory! Software-Defined Memory, as it is called in this blog. Memory remains the bottleneck in most environments these days; with or without AI/ML, you can never have enough of it! Unfortunately, memory comes at a (high) cost. Project Capitola may not make your DIMMs cheaper, but it does aim to make the use of memory in your hosts and clusters more cost-efficient. Not only does it aim to provide memory tiering within a host, it also aims to provide pooled memory across hosts, which ties directly back to Project Monterey, as low-latency and high-bandwidth connections will be extremely important in that scenario! Is this purely aimed at AI/ML? No, of course not; tiers of memory and pools of memory should be available to all kinds of apps. Does your app need all data in memory, but is there no “nanoseconds latency” requirement? That is where tiered memory comes into play! Do you need more memory for a workload than a single host can offer? That is where pooled memory comes into play! So many cool use cases come to mind.

Some of you may have already realized where these projects were leading to, while some may not have had that “aha moment” yet. Hopefully, the above helps you realize that it is projects like these that will ensure your datacenter is future-proof and will enable the transformation of your business.

If you want to learn more about some of the announcements and projects, just go to the VMworld website and watch the recordings they have available!

vSphere 7.0 U3 contains two great vCLS enhancements

Duncan Epping · Sep 28, 2021 ·

I have written about vCLS a few times, so I am not going to explain what it is or what it does (detailed blog here). I do want to talk about what is part of vSphere 7.0 U3 specifically, though, as I feel these features are probably what most folks have been waiting for. Starting with vSphere 7.0 U3, it is now possible to configure the following for vCLS VMs:

  1. Preferred Datastores for vCLS VMs
  2. Anti-Affinity for vCLS VMs with specific other VMs

I created a quick demo for those who prefer to watch videos to learn these things; if you don’t, skip to the text below. Oh, and before I forget, a bonus enhancement is that the vCLS VMs now have unique names, which was a very common request from customers.

Why would you need the above functionality? Let’s begin with the “preferred datastore” feature, which allows you to specify to which datastores the vCLS VMs should be provisioned. This is useful in a scenario where you have a number of datastores that you would prefer to avoid. Examples could be datastores that are replicated, a datastore that is only intended to be used for ISOs and templates, or maybe you prefer to provision on hybrid storage rather than flash storage.

So how do you configure this? Well, it is simple: you click on your cluster object, then click “Configure”, and then “Datastores” under “vSphere Cluster Services”. You will now see “VCLS Allowed”; if you click “ADD”, you will be able to select the datastores to which these vCLS VMs should be provisioned.
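
For those who prefer to script this, the same setting is exposed through the vSphere API starting with 7.0 U3 as part of the cluster configuration spec. The sketch below is an unofficial illustration based on the ClusterSystemVMsConfigSpec and ClusterDatastoreUpdateSpec objects; the cluster and datastore names are placeholders, so verify the object names against the API reference for your build:

```powershell
# Placeholders: adjust the cluster and datastore names for your environment
$cluster   = Get-Cluster -Name "Cluster-01"
$datastore = Get-Datastore -Name "Datastore-01"

# Build a cluster reconfigure spec that adds the datastore to the vCLS allowed list
$dsUpdate           = New-Object VMware.Vim.ClusterDatastoreUpdateSpec
$dsUpdate.Operation = "add"
$dsUpdate.Datastore = $datastore.ExtensionData.MoRef

$spec                                   = New-Object VMware.Vim.ClusterConfigSpecEx
$spec.SystemVMsConfig                   = New-Object VMware.Vim.ClusterSystemVMsConfigSpec
$spec.SystemVMsConfig.AllowedDatastores = @($dsUpdate)

# Apply the change; $true means the spec is merged with the existing cluster configuration
$cluster.ExtensionData.ReconfigureComputeResource_Task($spec, $true)
```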

Next up, Anti-Affinity for vCLS. You would use this feature for situations where, for instance, a single workload needs to run on a host by itself, something like SAP. In order to achieve this, you can use anti-affinity rules. We are not talking about regular anti-affinity rules here, though. This is the very first time a brand-new mechanism is used on-premises: compute policies. Compute policies have been available to VMware Cloud on AWS customers for a while, but now appear to be coming to on-prem customers as well. What does it do? It enables you to create “anti-affinity” rules for vCLS VMs and specific other VMs in your environment by creating compute policies and using tags!

How does this work? Well, you go to “Policies and Profiles” and then click “Compute Policies”. Now you can click “ADD” and create a policy. You select “Anti Affinity with vSphere Cluster Services (vCLS) VMs”, select the tag you created for the VMs that should not run on the same hosts as the vCLS VMs, and then click create. The vCLS VM scheduler will then ensure that the vCLS VMs do not run on the same hosts as the tagged VMs. If there is a conflict, the vCLS scheduler will move the vCLS VMs to other hosts within the cluster. Let’s reiterate that: the vCLS VMs will be vMotioned to another host in your cluster; the tagged VMs will not be moved!
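
The tagging part of this workflow is easy to script with PowerCLI; creating the compute policy itself is still done in the UI as described above. A small sketch with placeholder category, tag, and VM names:

```powershell
# Placeholder names; assumes an existing Connect-VIServer session
$category = New-TagCategory -Name "vCLS-AntiAffinity" -Cardinality Single -EntityType VirtualMachine
$tag      = New-Tag -Name "KeepAwayFromVCLS" -Category $category

# Tag the workload VM that should not share a host with the vCLS VMs
New-TagAssignment -Tag $tag -Entity (Get-VM -Name "SAP-App01")
```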

Hope that helps!


About the Author

Duncan Epping is a Chief Technologist and Distinguished Engineering Architect at Broadcom. Besides writing on Yellow-Bricks, Duncan is the co-author of the vSAN Deep Dive and the vSphere Clustering Deep Dive book series. Duncan is also the host of the Unexplored Territory Podcast.
