
Yellow Bricks

by Duncan Epping



vSphere 6.7 announced!

Duncan Epping · Apr 17, 2018 ·

It is that time of the year again: a new vSphere release announcement! (For those interested in what’s new for vSAN, make sure to read my other post.) vSphere 6.7, what’s in a name / release? Well, a bunch of stuff, and I am not going to address all of the new functionality as the list would simply be too long. So this list features what I think is worth mentioning and discussing.

  • vSphere Client (HTML-5) is about 95% feature complete
  • Improved vCenter Appliance monitoring
  • Improved vCenter Backup Management
  • ESXi Single Reboot Upgrades
  • ESXi Quick Boot
  • 4K Native Drive Support
  • Max Virtual Disks increase from 60 to 256
  • Max ESXi number of Devices from 512 to 1024
  • Max ESXi paths to Devices from 2048 to 4096
  • Support for RDMA
  • vSphere Persistent Memory
  • DRS initial placement improvements

Note that there’s a whole bunch of stuff missing from this list; for instance, there were many security enhancements, but I don’t see the point in pretending to be an expert on that topic when I know some of the top experts will have a blog out soon.

Not sure what I should say about the vSphere Client (HTML-5) at this point. Everyone has been waiting for this, and everyone has been waiting for it to reach ~90/95% feature complete. And we are there. I have been using it extensively for the past 12 months and I am very happy with how it turned out. I think the majority of you will be very, very happy with what you will see and with the overall experience. It just feels fast(er) and seems more intuitive.

When it comes to management and monitoring of the vCenter Appliance (https://<ip of vCenter>:5480) there are a whole bunch of improvements. For me personally the changes in the Monitoring tab are very useful, and so is the Services tab. Now you can immediately see when a particular disk is running out of space, as shown in the screenshot below, and you can, for instance, restart a particular service in the “Services” tab.

Next is vCenter Backup Management, something a lot of people have been asking for. We introduced Backup and Recovery of the appliance a while ago, which is very useful, but unfortunately it didn’t provide a scheduling mechanism. Sure, you could create a script that would do this for you on a regular cadence, but not everyone wants to bother with that. Now, in the Appliance Management UI, you can simply create a schedule for backups. This is one of those small enhancements that to me is a big deal! I’m sure that Emad or Adam will have a blog out soon on the topic of vCenter enhancements, so make sure to follow their blogs.

Another big deal is the fact that we shaved off a reboot for major upgrades. As of 6.7 you now only have 1 reboot with ESXi. Again, a tiny thing, going from 2 back to 1, but when you have servers taking 10-15 minutes to go through the reboot process and you have dozens of servers to reboot, it makes Single Reboot ESXi Upgrades a big thing. For those on 6.5 right now, you will be able to enjoy the single reboot experience when upgrading to 6.7!

One feature I have personally been waiting for is ESXi Quick Boot. I saw a demo of this last year at our internal R&D conference at VMware and I was impressed. I don’t think many people at that stage saw the importance of the feature, but I am glad it made it in to the release. So what is it? Well, basically it is a way to restart the hypervisor without going through the physical hardware reboot process. This means that you are now removing that last reboot as well; of course this only applies when the server hardware you use supports it. Note that with the first release only a limited set of servers will support it, nevertheless this is a big thing. Not just for reboots, but also for upgrades / updates. A second ESXi memory image can be created and updated, and when rebooting you simply switch over to the latest and greatest instead of doing a full reboot. It will, again, save a lot of time. I looked at a pre-GA build and noticed a number of platforms already listed as supported, which should be a good indication.

Of course you can also see whether the host is supported in the vSphere Client. I found it in the Web Client but not in the H5 Client; maybe I am overlooking it, that could of course be the case.

Then up next are a bunch of core storage enhancements. First, 4K Native Drive Support, very useful for those who want to use the large capacity devices. Not much else to say about it other than that it will also be supported by vSAN. I do hope that those using it for vSAN take the potential performance impact into account. (High capacity, low IOPS >> low IOPS per GB!) Up next is the increase of a bunch of “max values”. The number of virtual disks goes from 60 to 256 virtual disks for PVSCSI. And on top of that, the number of paths and devices is also going up. The number of devices has doubled from 512 to 1024 per host, and so has the number of paths, going from 2048 to 4096. Some of our largest customers will definitely appreciate that!
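
To put those new maximums in perspective, here is a minimal back-of-the-envelope sketch in plain Python. The before/after numbers come straight from the list above; the note that 256 disks would correspond to 64 disks on each of 4 PVSCSI controllers is my own assumption, not something stated in this post.

# Back-of-the-envelope sketch of the 6.7 storage maximums mentioned above.
# Assumption (mine, not from the post): 256 virtual disks would correspond to
# 4 PVSCSI controllers x 64 disks per controller.

OLD = {"virtual_disks_per_vm": 60, "devices_per_host": 512, "paths_per_host": 2048}
NEW = {"virtual_disks_per_vm": 256, "devices_per_host": 1024, "paths_per_host": 4096}

for key in OLD:
    print(f"{key}: {OLD[key]} -> {NEW[key]} ({NEW[key] / OLD[key]:.1f}x)")

# Note that 1024 devices with 4096 paths still averages 4 paths per device,
# the same ratio as the old 512/2048 limits.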

Then there’s also the support for RDMA, which is great for applications requiring extremely low latency and very high bandwidth! Note that when RDMA is used, most of the ESXi network stack is skipped, and when used in pass-through mode this also means that vMotion is not available. So that will only be useful for scale-out applications which have their own load balancing and high availability functionality. For those who can tolerate a bit more latency, a paravirtualized RDMA adaptor will be available; you will need HW version 13 for this though.

vSphere Persistent Memory is something that I was definitely excited about. Although there aren’t too many supported server configurations, or even persistent memory solutions, it is something that introduces new possibilities. Why? Well, this will provide you with performance much higher than SSD at a cost which is lower than DRAM. Think less than 1 microsecond of latency, where DRAM is in the nanosecond range and flash typically is in the low milliseconds under load. I have mentioned this in a couple of my sessions so far: NVDIMM, which is the name commonly used for Persistent Memory, will be big. For those planning on buying persistent memory, do note that your operating system also needs to understand how to use it. There is a Virtual NVDIMM device in vSphere 6.7, and if the Guest OS has support for it then it will be able to use this byte-addressable device. I believe a more extensive blog about vSphere Persistent Memory and some of the constraints will appear on the Virtual Blocks blog soon, so keep an eye on that as well. Cormac already has his favorite new 6.7 features up on his blog, make sure to read that as well.
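
To illustrate what “byte addressable” means from the guest OS point of view, here is a minimal Python sketch. It assumes a Linux guest that exposes the virtual NVDIMM as a DAX-mounted filesystem at a hypothetical path (/mnt/pmem); the point is simply that you memory-map the device and write to it directly, instead of issuing block I/O.

# Minimal sketch: persistent memory as byte-addressable storage.
# Assumptions: Linux guest, virtual NVDIMM exposed as a DAX-mounted
# filesystem at the hypothetical path /mnt/pmem.
import mmap
import os

PMEM_FILE = "/mnt/pmem/example.dat"   # hypothetical file on the NVDIMM
SIZE = 4096

fd = os.open(PMEM_FILE, os.O_CREAT | os.O_RDWR)
os.ftruncate(fd, SIZE)

# Map it into the address space; loads and stores now hit persistent memory
# directly, no read()/write() system call per access.
buf = mmap.mmap(fd, SIZE)
data = b"hello nvdimm"
buf[0:len(data)] = data   # an ordinary in-memory write
buf.flush()               # make sure the stores are persisted

buf.close()
os.close(fd)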

And last but not least, a significant improvement was made to the initial placement process for DRS. Some of this logic was already included in 6.5, but it only worked when HA was disabled. As of 6.7 it is also available when HA is enabled, making it much more likely that you will be able to benefit from the 3x decrease in the time it takes for the initial placement process to complete. A big, big enhancement in the DRS space. I am sure though that Frank Denneman will have more to say about this.

What’s new in vSAN 6.7

Duncan Epping · Apr 17, 2018 ·

As most of you have seen, vSAN 6.7 was just released together with vSphere 6.7. As such, I figured it was time to write a “what’s new” article. There are a whole bunch of cool enhancements and new features, so let’s create a list of the new features first, and then look at them individually in more detail.

  • HTML-5 User Interface support
  • Native vRealize Operations dashboards in the HTML-5 client
  • Support for Microsoft WSFC using vSAN iSCSI
  • Fast Network Failovers
  • Optimization: Adaptive Resync
  • Optimization: Witness Traffic Separation for Stretched Clusters
  • Optimization: Preferred Site Override for Stretched Clusters
  • Optimization: Efficient Resync for Stretched Clusters
  • New Health Checks
  • Optimization: Enhanced Diagnostic Partition
  • Optimization: Efficient Decommissioning
  • Optimization: Efficient and consistent storage policies
  • 4K Native Device Support
  • FIPS 140-2 Level 1 validation

Yes, that is a relatively long list indeed. Let’s take a look at each of the features. First of all, HTML-5 support. I think this is something that everyone has been waiting for. The Web Client was not the most loved user interface that VMware produced, and hopefully the HTML-5 interface will be viewed as a huge step forward. I have played with it extensively over the past 6 months and I must say that it is very snappy. I like how we not just ported over all functionality, but also looked at whether workflows could be improved and whether the presented information/data made sense in each and every screen. This does, however, mean that new functionality will from now on only be available in the HTML-5 client, so use it going forward. Unless of course the functionality you are trying to access isn’t available yet, but most of it should be! For those who haven’t seen it yet, here are a couple of screenshots… ain’t it pretty? 😉

For those who didn’t notice: in the above screenshot you can actually see the swap file, and the policy associated with the swap file, which is a nice improvement!

The next feature is native vROps dashboards for vSAN in the H5 client. I find this one particularly useful. I don’t like context switching, and this feature allows me to see all of the data I need to do my job in a single user interface. No need to switch to the vROps UI; instead, vSphere and vSAN dashboards are now made available in the H5 client. Note that the vROps Client Plugin for the vCenter H5 UI needs to be installed, but that is fairly straightforward.

Next up is support for Microsoft Windows Server Failover Clustering for the vSAN iSCSI service. This is very useful for those running a Microsoft cluster. Create an iSCSI target and expose it to the WSFC virtual machines. (Normally people used RDMs for this.) Of course this is also supported with physical machines. Such a small enhancement, but for customers using Microsoft clustering a big thing, as it now allows you to run those clusters on vSAN without any issues.

Next are a whole bunch of enhancements that have been added based on customer feedback of the past 6-12 months. Fast Network Failovers was one of those. The majority of our customers have a single vmkernel interface with multiple NICs associated with it; some of our customers have a setup where they create two vmkernel interfaces on different subnets, each with a single NIC. What that last group of customers noticed is that in the previous release we waited 90 seconds (TCP timeout) before failing over to the other vmkernel interface when a network/interface had failed. In the 6.7 release we actually introduce a mechanism that allows us to fail over fast, literally within seconds. So a big improvement for customers who have this kind of network configuration (which is very similar to the traditional A/B storage fabric design).

Adaptive Resync is an optimization of the resync function that is part of vSAN. If a failure has occurred (host, disk, flash failure) then data will need to be resynced to ensure that the impacted objects (VMs, disks, etc.) are brought back into compliance with the configured policy. Over the past 12 months the engineering team has worked hard to optimize the resync mechanism as much as possible. In vSAN 6.6.1 a big jump was already made by taking VM latency into account when it came to resync bandwidth allocation, and this has been further enhanced in 6.7. In 6.7 vSAN can calculate the total available bandwidth and ensures quality of service for the guest VMs prevails by allocating those VMs 80% of the available bandwidth and limiting the resync traffic to 20%. Of course, this only applies when congestion is detected. Expect more enhancements in this space in the future.
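
As a simple illustration of that split, here is a minimal Python sketch of the idea (my own simplification with made-up numbers, not vSAN code): guest VM I/O gets 80% and resync gets 20%, but only while congestion is detected.

# Illustrative sketch of the 80/20 split described above (not vSAN code).
def allocate_bandwidth(total_gbps, congested):
    """Split available bandwidth between guest VM I/O and resync traffic."""
    if congested:
        # Under congestion: guarantee 80% to guest VMs, cap resync at 20%.
        return {"vm_io": total_gbps * 0.8, "resync": total_gbps * 0.2}
    # No congestion detected: resync may use whatever the VMs leave unused.
    return {"vm_io": total_gbps, "resync": "best effort"}

print(allocate_bandwidth(10.0, congested=True))    # {'vm_io': 8.0, 'resync': 2.0}
print(allocate_bandwidth(10.0, congested=False))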

A couple of releases ago we introduced Witness Traffic Separation for 2-node configurations, and in 6.7 we introduce support for this feature for Stretched Clusters as well. This is something many Stretched vSAN customers have asked for. It can be configured through the CLI only at this point (esxcli), but that shouldn’t be a huge problem. As mentioned previously, what you end up doing is tagging a vmknic for “witness traffic” only. Pretty straightforward, but very useful:

esxcli vsan network ip set -i vmk<X> -T=witness

Another enhancement for stretched clusters is Preferred Site Override. It is a small enhancement, but in the past, when the preferred site failed and then returned for duty while only being connected to the witness, it could happen that the witness would bind itself directly to the preferred site. This by itself would result in VMs becoming unavailable. The Preferred Site Override functionality prevents this from happening. It ensures that VMs (and all data) remain available in the secondary site. I guess one could also argue that this is not an enhancement but much more a bug fix. And then there is the Efficient Resync for Stretched Clusters feature. This is getting a bit too much into the weeds, but essentially it is a smarter way of bringing components up to the same level within a site after the network between locations has failed. As you can imagine, one location is allowed to progress, which means that the other location needs to catch up when the network returns. With this enhancement we limit the bandwidth / resync traffic needed to do so.

And as with every new release, the 6.7 release of course also has a whole new set of Health Checks. I think the Health Check has quickly become the favorite feature of all vSAN admins, and for good reason. It makes life much easier if you ask me. In the 6.7 release, for instance, we will validate consistency in terms of host settings and report any inconsistency that is found. We also, when downloading the HCL details, will only download the differences between the current and previous version. (In the past we would simply pull the full JSON file.) There are many other small improvements around performance etc. Just give it a spin and you will see.

Something that my team has been pushing hard for (thanks Paudie) is the Enhanced Diagnostic Partition. As most of you know, when you install / run ESXi there’s a diagnostic partition. Unfortunately this diagnostic partition used to be a fixed size; with the current release, ESXi will automatically resize the diagnostic partition when upgrading (or installing greenfield). This is especially useful for large memory host configurations, and actually useful for vSAN in general. No longer do you need to run a script to resize the partition, it will happen automatically for you!

Another optimization that was released in vSAN 6.7 is called “Efficient Decommissioning“. This is all about being smarter in terms of consolidating replicas across hosts/fault domains to free up a host/fault domain and allow maintenance mode to occur. This means that if a component is striped for reasons other than policy, it may be consolidated. And the last optimization is what they refer to as Efficient and consistent storage policies. I am not sure I understand the name, as this is all about the swap object. As of vSAN 6.7 it will be thin provisioned by default (instead of 100% reserved), and the swap object will now also inherit the policy assigned to the VM. So if you have FTT=2 assigned to the VM, then you will have not two but three components for the swap object, still thin provisioned, so it shouldn’t really change the consumed space in most cases.
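
A quick sketch of the component math behind that swap object example, assuming RAID-1 mirroring where the number of replica components is FTT + 1 (witness components come on top of that):

# Sketch: replica component count for a RAID-1 (mirrored) vSAN object is FTT + 1.
# The swap object now inherits the VM policy, but stays thin provisioned.
def mirror_replicas(ftt):
    return ftt + 1

for ftt in (1, 2, 3):
    print(f"FTT={ftt}: {mirror_replicas(ftt)} replica components for the swap object")
# FTT=2 -> 3 replicas, matching the "not two but three" example above.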

Then there are the two last items on the list: 4K Native Device Support and FIPS 140-2 Level 1 validation. I think those speak for themselves. 4K Native Device Support has been asked for by many customers, but we had to wait for vSphere to support it. vSphere supports it as of 6.7, so that means vSAN will also support it day 0. The VMware VMkernel Cryptographic Module v1.0 has achieved FIPS 140-2 validation, and vSAN Encryption leverages that same module. Nice collaboration between the teams, which is now showing its big benefit.

Anyway, there’s more work to do today, so back to my desk to release the next article. Oh, and if you haven’t seen it yet, Virtual Blocks also has a blog, and there’s a nice podcast on the topic of 6.7 as well.

Doing maintenance on a Two-Node (Direct Connect) vSAN configuration

Duncan Epping · Mar 13, 2018 ·

I was talking to a partner and customer last week at a VMUG. They were running a two-node (direct connect) vSAN configuration and had some issues during maintenance which were, to them, not easy to explain. What they did was place the host which was in the “preferred fault domain” into maintenance mode. After they placed that host into maintenance mode, the link between the two hosts failed for whatever reason. After they rebooted the host in the preferred fault domain, it connected back to the witness, but at that point the connection between the hosts had not returned yet. This confused vSAN and resulted in the scenario where the VMs in the secondary fault domain were powered off. As you can imagine, an undesired effect.

This issue will be solved in the near future in a new version of vSAN, but for those who need to do maintenance on a two-node (direct connect) configuration (or full site maintenance in a stretched environment) I would highly recommend the following simple procedure. This will need to be done when doing maintenance on the host which is in the “preferred fault domain”:

  • Change the preferred fault domain
    • Under vSAN, click Fault Domains and Stretched Cluster.
    • Select the secondary fault domain and click the “Mark Fault Domain as preferred for Stretched Cluster” icon.
  • Place the host into maintenance mode
  • Do your maintenance

Fairly straightforward, but important to remember…

Changing advanced vSphere FT related settings, is that supported?

Duncan Epping · Feb 1, 2018 ·

This week I received a question about changing the values of vSphere FT related advanced settings. This customer is working on an environment where uptime is key. Of course the application layer is one side of it, but they also want additional availability from an infrastructure perspective, which means vSphere HA and vSphere FT are key.

They have various VMs they need to enable FT on; these are vSMP VMs (meaning, in this case, dual CPU). Right now each host is limited to 4 FT VMs and at most 8 FT vCPUs, which is controlled by two advanced settings called “das.maxftvmsperhost” and “das.maxFtVCpusPerHost”. The values for these are, obviously, 4 and 8. The question was: can I edit these and still have a supported configuration? Also, why 4 and 8?

I spoke to the product team about this and the answer is: yes, you can safely edit these. These values were set based on the typical bandwidth and resource constraints customers have. An FT VM easily consumes between 1 and 3 Gbps of bandwidth, meaning that if you dedicate a 10Gbps link to it you will fit roughly 4 VMs. I say roughly as, of course, the workload matters: CPU, memory and I/O pattern.

If you have a 40Gbps NIC and you have plenty of cores and memory, you could increase those max numbers for FT VMs per host and FT vCPUs. However, it must be noted that if you run into problems, VMware GSS may request you to revert to the defaults, just to ensure the issues that occur aren’t due to this change, as VMware tests with the default values.
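
For what it is worth, here is a rough Python sketch of the sizing math above. The 1-3 Gbps per FT VM and the 10/40 Gbps NIC sizes come from this post; the 2.5 Gbps midpoint I use as a default is my own assumption.

# Rough FT sizing sketch based on the numbers above.
def estimate_ft_vms(nic_gbps, per_vm_gbps=2.5):
    """Estimate how many FT VMs fit on a dedicated FT logging NIC."""
    return int(nic_gbps // per_vm_gbps)

print(estimate_ft_vms(10))   # ~4 VMs on a 10 Gbps link, matching the default das.maxftvmsperhost
print(estimate_ft_vms(40))   # a 40 Gbps NIC leaves headroom to raise the advanced settings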

UPDATE to this content can be found here: https://www.yellow-bricks.com/2022/11/18/can-you-exceed-the-number-of-ft-enabled-vcpus-per-host-or-number-of-ft-enabled-vcpus-per-vm/

vSAN Adaptive Resync, what does it do?

Duncan Epping · Jan 18, 2018 ·

I am starting to get some more questions about vSAN Adaptive Resync lately. This was introduced a while back, but it is also available in the latest versions of vSAN through vSphere 6.5 Patch 02. As a result, various folks have started to look at it and are starting to wonder what it is. Hopefully by now everyone understands what resync traffic is and when you see resync traffic. The easiest example of course is a host failure. If a host has failed and there is sufficient disk space and there are additional hosts available to make the impacted VMs compliant with their policy again, then vSAN will resync the data.

Resync aims to finish the creation of these new components as soon as possible; the simple reason for this is availability. The longer the resync takes, the longer you are at risk. I think that makes sense, right? In some cases, however, it may occur that when VMs are very busy and a resync is happening, the VM-observed latency goes through the roof. We already had a manual throttling mechanism for when this situation occurs, but of course vSAN should preferably throttle resync traffic properly for you. This is what vSAN Adaptive Resync does.

So how does that work? Well, when the high watermark for VM latency is reached, vSAN will cut the resync bandwidth in half. Next, vSAN will check whether the VM latency is below the low watermark; if not, it will cut resync traffic in half again, and it keeps doing this until the latency is below the low watermark. When the latency is below the low watermark, vSAN will granularly increase the resync bandwidth again until the low watermark is reached, and stay at that level. (Some official info can be found in this KB, and this Virtual Blocks blog.)
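
Here is a minimal Python sketch of that watermark behaviour. It is my own simplification, not vSAN code; the watermark values, step size and latency samples are made up purely for illustration.

# Simplified sketch of the watermark logic described above (not vSAN code;
# watermark values, step size and latency samples are made up).
def simulate(latencies_ms, low_wm=10.0, high_wm=20.0, max_bw=10.0, step=1.0):
    bw, throttling = max_bw, False
    for lat in latencies_ms:
        if lat > high_wm:
            throttling = True            # high watermark hit: start throttling
        if throttling:
            if lat >= low_wm:
                bw = bw / 2              # keep halving until latency drops below the low watermark
            else:
                throttling = False       # latency recovered, stop throttling
        if not throttling and lat < low_wm:
            bw = min(bw + step, max_bw)  # granularly give bandwidth back to resync
        print(f"latency={lat:>5.1f} ms -> resync bandwidth {bw:.2f} Gbps")

simulate([5, 25, 18, 12, 8, 6, 6, 6])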

Hope that helps,

