Yellow Bricks

by Duncan Epping

What is new for vMotion in vSphere 6.0?

Duncan Epping · Feb 5, 2015 ·

vMotion is probably my favourite VMware feature ever. It is one of those features that revolutionized the world, and just when you think there is no room left to innovate, they take it to a whole new level. So what is new?

  • Cross vSwitch vMotion
  • Cross vCenter vMotion
  • Long Distance vMotion
  • vMotion Network improvements
    • No requirement for L2 adjacency any longer!
  • vMotion support for Microsoft Clusters using physical RDMs

That is a nice long list indeed. Let's discuss each of these new features one by one, starting at the top with Cross vSwitch vMotion. Cross vSwitch vMotion basically allows you to do what the name says: it allows you to migrate virtual machines between different vSwitches. Not just from vSS to vSS, but also from vSS to vDS and from vDS to vDS. Note that vDS to vSS is not supported. This is because when migrating from a vDS, metadata of the VM is transferred as well, and the standard vSwitch does not have the logic to handle that metadata. Note that the IP address of the VM you are migrating will not magically change, so you will need to make sure both the source and the destination portgroup belong to the same layer 2 network. All of this is very useful during, for instance, datacenter migrations, when you are moving VMs between clusters, or even when you are migrating to a new vCenter instance.
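For those who prefer to script this, below is a minimal pyVmomi sketch of a cross vSwitch migration. It assumes the vSphere 6.0 RelocateSpec accepts a deviceChange entry for the NIC; the vCenter address, credentials, VM name and destination portgroup are placeholders, so treat it as a starting point rather than a finished tool.

```python
# Minimal sketch, assuming the vSphere 6.0 API: migrate a VM and re-attach its NIC to
# a portgroup on a different standard vSwitch in a single RelocateVM_Task call.
# vCenter address, credentials, VM name and portgroup name are placeholders.
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

si = SmartConnect(host="vcenter-a.lab.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())

def find_obj(si, vimtype, name):
    # Simplified inventory lookup by name.
    view = si.content.viewManager.CreateContainerView(si.content.rootFolder, [vimtype], True)
    try:
        return next(o for o in view.view if o.name == name)
    finally:
        view.Destroy()

vm = find_obj(si, vim.VirtualMachine, "web-01")

# Point the VM's first NIC at the destination portgroup, which lives on another
# vSwitch but must be on the same layer 2 network (the IP address does not change).
nic = next(d for d in vm.config.hardware.device
           if isinstance(d, vim.vm.device.VirtualEthernetCard))
nic.backing = vim.vm.device.VirtualEthernetCard.NetworkBackingInfo(
    deviceName="VM Network on vSwitch1")

spec = vim.vm.RelocateSpec()
spec.deviceChange = [vim.vm.device.VirtualDeviceSpec(
    operation=vim.vm.device.VirtualDeviceSpec.Operation.edit, device=nic)]

task = vm.RelocateVM_Task(spec=spec)  # wait for the task in real code
```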

Next on the list is Cross vCenter vMotion. This is something that came up fairly frequently when talking about vMotion: will we ever have the ability to move a VM to a new vCenter Server instance? Well, as of vSphere 6.0 this is indeed possible. Not only can you move between vCenter Servers, but you can do this with all the different migration types there are: change compute / storage / network. You can even do it without having a shared datastore between the source and destination vCenter, aka a "shared nothing" migration. This functionality will come in handy when you are migrating to a different vCenter instance or even when you are migrating workloads to a different location. Note that it is a requirement for the source and destination vCenter Server to belong to the same SSO domain. What I love about this feature is that when the VM is migrated, things like alarms, events, and HA and DRS settings are all migrated with it. So if you have affinity rules, or changed the host isolation response, or set a limit or reservation, it will follow the VM!
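As a rough sketch only: through the API, cross vCenter vMotion relies on the service locator inside the RelocateSpec. The example below continues the pyVmomi snippet above (it reuses the find_obj helper and the vm object); every address, credential, UUID and thumbprint is a placeholder, and the exact destination lookups will depend on your environment.

```python
# Sketch of a cross vCenter migration, assuming the vSphere 6.0 "service" field of the
# RelocateSpec (vim.ServiceLocator). Reuses find_obj and "vm" from the previous snippet;
# all names, credentials and the thumbprint are placeholders.
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
dst_si = SmartConnect(host="vcenter-b.lab.local", user="administrator@vsphere.local",
                      pwd="password", sslContext=ctx)

spec = vim.vm.RelocateSpec()
spec.service = vim.ServiceLocator(
    url="https://vcenter-b.lab.local",
    instanceUuid=dst_si.content.about.instanceUuid,  # destination vCenter instance UUID
    sslThumbprint="AA:BB:CC:DD:...",                 # destination certificate thumbprint (placeholder)
    credential=vim.ServiceLocator.NamePassword(
        username="administrator@vsphere.local", password="password"))

# Destination objects are looked up through the destination vCenter connection.
spec.host = find_obj(dst_si, vim.HostSystem, "esxi-10.lab.local")
spec.pool = find_obj(dst_si, vim.ClusterComputeResource, "Cluster-B").resourcePool
spec.datastore = find_obj(dst_si, vim.Datastore, "datastore-b01")
spec.folder = find_obj(dst_si, vim.Datacenter, "DC-B").vmFolder

task = vm.RelocateVM_Task(spec=spec)
```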

My personal favourite is Long Distance vMotion. When I say long distance, I do mean long distance. Remember that the max tolerated latency for vMotion was 10ms? With this new feature that just went up to 150ms. Long Distance vMotion uses socket buffer resizing techniques to ensure that migrations succeed when latency is high. Note that this will work with any storage system; both VMFS and NFS based solutions are fully supported. (** was announced with 100ms, but updated to 150ms! **)
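If you want a quick sanity check of the latency between sites before attempting a long distance migration, something as simple as the sketch below will do. It is not an official tool: it measures TCP connect round-trip time from a workstation to the destination vMotion VMkernel address (vMotion uses TCP port 8000) and compares it against the 150ms limit; a plain ping of the vMotion addresses gives a similar indication. The IP address is a placeholder.

```python
# Rough latency sanity check, not an official tool: measure TCP connect round-trip
# time to the destination vMotion VMkernel address (vMotion uses TCP port 8000)
# and compare it to the 150ms support limit. The IP is a placeholder.
import socket
import time

def tcp_rtt_ms(host, port=8000, samples=5, timeout=2.0):
    """Average TCP connect round-trip time in milliseconds."""
    rtts = []
    for _ in range(samples):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except ConnectionRefusedError:
            pass  # a refused connection still completes one network round trip
        rtts.append((time.monotonic() - start) * 1000)
    return sum(rtts) / len(rtts)

rtt = tcp_rtt_ms("192.168.50.10")  # destination host's vMotion VMkernel IP (placeholder)
print(f"average RTT {rtt:.1f} ms ->",
      "within the 150ms limit" if rtt <= 150 else "above the 150ms limit")
```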

Then there are the network enhancements. First and foremost, vMotion traffic is now fully supported over an L3 connection. So there is no longer a need for L2 adjacency for your vMotion network; I know a lot of you have asked for this and I am happy to be able to announce it. On top of that, you can now also specify which VMkernel interface should be used for the migration of cold data. It is not something many people are aware of, but depending on the type of migration you are doing and the type of VM you are migrating, in previous versions it could be that the Management Network was used to transfer data. (Frank Denneman described this scenario in this post.) For this specific scenario it is now possible to define a VMkernel interface for "Provisioning traffic", as shown in the screenshot below. This interface will be used for, and let me quote the documentation here, "Supports the traffic for virtual machine cold migration, cloning, and snapshot creation. You can use the provisioning TCP/IP stack to handle NFC (network file copy) traffic during long-distance vMotion. NFC provides a file-type aware FTP service for vSphere; ESXi uses NFC for copying and moving data between datastores."
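If you would rather script the creation of such an interface than click through the UI, the pyVmomi sketch below shows the general idea. It assumes a standard switch portgroup called "Provisioning" already exists on the host and that the provisioning stack is identified by the netstack key "vSphereProvisioning" (verify that for your build); the host name and addressing are placeholders, and find_obj/si come from the earlier snippets.

```python
# Sketch: create a VMkernel interface on the provisioning TCP/IP stack with pyVmomi.
# Assumes a standard-switch portgroup named "Provisioning" already exists on the host
# and that "vSphereProvisioning" is the netstack key of the provisioning stack
# (an assumption to verify). Host name and addressing are placeholders;
# find_obj and si come from the earlier sketches.
from pyVmomi import vim

host = find_obj(si, vim.HostSystem, "esxi-01.lab.local")

nic_spec = vim.host.VirtualNic.Specification(
    ip=vim.host.IpConfig(dhcp=False, ipAddress="192.168.60.11", subnetMask="255.255.255.0"),
    netStackInstanceKey="vSphereProvisioning")  # assumption: key of the provisioning stack

# Returns the new device name (e.g. "vmk3"); cold migration, cloning and snapshot
# (NFC) traffic can then use this interface instead of the Management Network.
vmk = host.configManager.networkSystem.AddVirtualNic("Provisioning", nic_spec)
print("created", vmk)
```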

Full support for vMotion of Microsoft Cluster virtual machines is also newly introduced in vSphere 6.0. Note that these VMs will need to use physical RDMs, and that this is only supported with Windows 2008, 2008 R2, 2012 and 2012 R2. Very useful, if you ask me, when you need to do maintenance or when there is resource contention of some kind.

That was it for now… There is some more stuff coming with regards to vMotion but I cannot disclose that yet unfortunately.

What’s new for HA in vSphere 6.0?

Duncan Epping · Feb 4, 2015 ·

Instead of one generic post with a bunch of data I picked a couple of features and dug a little bit deeper. Today I will be discussing what is new for HA in vSphere 6.0. Let's start with a list and then look at the features / enhancements individually:

  • Support for Virtual Volumes – With Virtual Volumes a new type of storage entity is introduced in vSphere 6.0.
  • VM Component Protection – This allows HA to respond to a scenario where the connection to the virtual machine’s datastore is impacted temporarily or permanently.
    • “Response for Datastore with All Paths Down”
    • “Response for Datastore with Permanent Device Loss”
  • Increased scale – Cluster limit has grown from 32 to 64 hosts and to a max of 8000 VMs per cluster
  • Registration of “HA Disabled” VMs on hosts after failure

Let's start with support for Virtual Volumes. It may sound like a given, but as the whole concept of a VMFS volume no longer exists with Virtual Volumes, and VMs have "virtual volumes" instead of VMDKs, you can imagine that some work was needed to allow HA to restart virtual machines stored on a VVOL enabled storage system.

VM Component Protection (VMCP) is in my opinion THE big thing that got added to vSphere HA. What this feature basically allows you to do is protect yourself against storage failures. There are two types of failures VMCP will respond to, and those are PDL and APD. Before we look at some of the details, I want to point out that configuring it is extremely simple… just one tickbox to enable it.

[Screenshot: HA in vSphere 6.0]

In the case of a PDL (permanent device loss), something HA was already capable of responding to when configured through the command line, a VM will be restarted instantly when a PDL signal is issued by the storage system. For an APD (all paths down) this is a bit different. A PDL more or less indicates that the storage array does not expect the device to return any time soon. An APD is more of an unknown situation: it may return… it may not… and there is no clue how long it will take. With vSphere 5.1 some changes were introduced to the way APD is handled by the hypervisor, and this mechanism is leveraged by HA to allow for a response. (Cormac wrote an excellent post about this APD handling here.)

When an APD occurs a timer starts. After 140 seconds the APD is declared and the device is marked as "APD timeout". When those 140 seconds have passed, HA will start counting. The HA timeout is 3 minutes. When the 3 minutes have passed, HA can restart the virtual machine, but you can configure VMCP to respond differently if you want it to. You could, for instance, specify that only events are issued when a PDL or APD has occurred. You can also specify how aggressively HA needs to try to restart VMs that are impacted by an APD. Note that aggressive / conservative refers to the likelihood of HA being able to restart VMs. When set to "conservative", HA will only restart the VM that is impacted by the APD if it knows another host can restart it. In the case of "aggressive", HA will try to restart the VM even if it doesn't know the state of the other hosts, which could lead to a situation where your VM is not restarted as there is no host that has access to the datastore the VM is located on. It is also good to know that if the APD is lifted and access to the storage is restored during the roughly 5 minutes and 20 seconds it would take before HA restarts the VM, HA will not do anything unless you explicitly configure it to do so. This is where the "Response for APD recovery after APD timeout" setting comes into play.
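To make the timing concrete, the arithmetic behind that window looks like this:

```python
# The "roughly 5 minutes and 20 seconds" from the paragraph above: 140 seconds until
# the APD timeout is declared, plus the 3 minute delay before VMCP/HA acts.
APD_TIMEOUT_S = 140
VMCP_DELAY_S = 3 * 60

total = APD_TIMEOUT_S + VMCP_DELAY_S
print(f"{total} seconds, i.e. {total // 60} min {total % 60} s, before HA terminates and restarts the VM")
```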

[Screenshot: HA in vSphere 6.0]
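Although the UI is just a tickbox, you can also set this through the API. The sketch below reflects my reading of the vSphere 6.0 cluster configuration objects and should be verified against the API reference before use; the cluster name is a placeholder and find_obj/si are the helpers from the earlier vMotion snippets.

```python
# Sketch: enabling VM Component Protection on a cluster through the API. The property
# names and values below are my reading of the vSphere 6.0 API (verify against the API
# reference); "Cluster-A" is a placeholder, find_obj/si come from the earlier sketches.
from pyVmomi import vim

cluster = find_obj(si, vim.ClusterComputeResource, "Cluster-A")

vmcp = vim.cluster.VmComponentProtectionSettings(
    vmStorageProtectionForPDL="restartAggressive",    # restart VMs on a PDL
    vmStorageProtectionForAPD="restartConservative",  # only restart if another host can run the VM
    vmTerminateDelayForAPDSec=180,                    # the 3 minute delay discussed above
    vmReactionOnAPDCleared="none")                    # "Response for APD recovery after APD timeout"

spec = vim.cluster.ConfigSpecEx()
spec.dasConfig = vim.cluster.DasConfigInfo(
    vmComponentProtecting="enabled",
    defaultVmSettings=vim.cluster.DasVmSettings(vmComponentProtectionSettings=vmcp))

task = cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
```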

Increased scale is pretty straightforward: from 32 to 64 hosts and a total of 8000 VMs per cluster. I don't know too many customers hitting these boundaries, but I do come across a request like this occasionally. So if you want to grow your cluster, you can now do so. Do note that you may hit other limits, like the LUN limit or the VM limit or…

Registration of HA Disabled VMs after a failure is a feature I requested a long time ago, and I am glad to see it made it into the release. Basically, when you have HA disabled for a specific VM, this feature will make sure that the VM gets registered on another host after a failure. This allows you to easily power on that VM when needed, without needing to manually re-register it yourself. Note that HA will not power on the VM, it will just register it for you.

That was it for now…

What is new for Virtual SAN 6.0?

Duncan Epping · Feb 3, 2015 ·

vSphere 6.0 was just announced and with it a new version of Virtual SAN. I don't think there is a need to introduce Virtual SAN, as I have written many, many articles about it in the last 2 years. Personally I am very excited about this release as it adds some really cool functionality if you ask me, so what is new for Virtual SAN 6.0?

  • Support for All-Flash configurations
  • Fault Domains configuration
  • Support for hardware encryption and checksum (See HCL)
  • New on-disk format
    • High performance snapshots / clones
    • 32 snapshots per VM
  •  Scale
    • 64 host cluster support
    • 40K IOPS per host for hybrid configurations
    • 90K IOPS per host for all-flash configurations
    • 200 VMs per host
    • 8000 VMs per Cluster
    • up to 62TB VMDKs
  • Default SPBM Policy
  • Disk / Disk Group serviceability
  • Support for direct attached storage systems to blade (See HCL)
  • Virtual SAN Health Service plugin

That is a nice long list indeed. Let me discuss some of these features a bit more in depth. First of all, "all-flash" configurations, as that is a request I have had many, many times. In this new version of VSAN you can designate which devices should be used for caching and which will serve as the capacity tier. This means that you can use your enterprise grade flash device as a write cache (still a requirement) and then use your regular MLC devices as the capacity tier. Note that of course the devices will need to be on the HCL, and that they will need to be capable of supporting 0.2 TBW (TB written) per day over a period of 5 years; that works out to roughly 365TB of writes over those 5 years. So far tests have shown that you should be able to hit ~90K IOPS per host, which is some serious horsepower in a big cluster indeed.
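Written out, the endurance requirement is simply:

```python
# The endurance requirement spelled out: 0.2 TB written per day, sustained for 5 years.
TBW_PER_DAY = 0.2
DAYS = 365 * 5
print(f"{TBW_PER_DAY * DAYS:.0f} TB written over 5 years")  # 365 TB
```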

Fault Domains is also something that has come up on a regular basis and something I have advocated many times. I was pleased to see how fast the VSAN team could get it into the product. To be clear, no, this is not a stretched cluster solution… but I would see it as the first step; that is my opinion and not VMware's. The Fault Domain feature allows you to specify fault domains, for instance per rack, and when you provision a new virtual machine VSAN will make sure that the components of its objects are placed in different fault domains.

When you do this per rack, even a full rack failure will not impact your virtual machine availability. Very cool indeed. The nice thing about the fault domain feature is also that it is very simple to configure: literally a couple of clicks in the UI, but you can also use RVC or host profiles to configure it if you want to. Do note that you will need 6 hosts at a minimum for Fault Domains to make sense.
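To illustrate the concept (and only the concept, this is not VSAN's actual placement logic), think of it along these lines: with hosts grouped into per-rack fault domains, the two replicas and the witness of an FTT=1 object each land in a different domain, so losing a rack never takes out more than one component.

```python
# Conceptual illustration only, not VSAN's placement algorithm: hosts grouped into
# per-rack fault domains, and the components of an FTT=1 object (two replicas plus a
# witness) each placed in a different domain. Host and rack names are made up.
fault_domains = {
    "rack-A": ["esxi-01", "esxi-02"],
    "rack-B": ["esxi-03", "esxi-04"],
    "rack-C": ["esxi-05", "esxi-06"],
}

def place_components(domains, count):
    """Pick one host from `count` different fault domains."""
    if count > len(domains):
        raise ValueError("need at least as many fault domains as components")
    return [(name, hosts[0]) for name, hosts in list(domains.items())[:count]]

# FTT=1 mirror: 2 data components + 1 witness = 3 components across 3 racks.
for domain, host in place_components(fault_domains, 3):
    print(f"component in {domain} on {host}")
```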

Then of course there is the scalability. Not just the 64 host cluster support, but also the 200 VMs per host is a great improvement. There are also the improvements around snapshots and cloning, which can be attributed to the new on-disk format and the different snapshotting mechanism that is being used: less than 2% performance impact when going up to 32 levels deep is what we have been waiting for. Fair to say that this is where the acquisition of Virsto comes into play, and I think we can expect to see more. Also, the component count has gone up: the max number of components used to be 3000 and has now been increased to 9000.

Then there is the support for blade systems with direct attached storage… this is very welcome; I had many customers asking for this. Note that as always the HCL is leading, so make sure to check the HCL before you decide to purchase equipment to implement VSAN in a blade environment. The same applies to hardware encryption and checksums: they are fully supported, but make sure your components are listed with support for this functionality on the HCL! As far as I know the initial release will have 2 supported systems on there, one IBM system and I believe the Dell FX platform.

All of the operational improvements that were introduced around disk serviceability and being able to tag a device as "local / remote / SSD" are the direct result of feedback from customers and passionate VSAN evangelists internally at VMware. Proactive rebalancing, for instance, is now possible through RVC: if you add or remove a host and want to even out the nodes from a capacity point of view, a simple RVC command will allow you to do this. The "resync" details can now also be found in the UI, something I am very happy about, as that will help people during PoCs not to run into the scenario where they introduce new failures while VSAN is recovering from previous failures.

Last one I want to mention is the Virtual SAN Health Service plugin. This is a separately developed Web Client plugin that provides in-depth information about Virtual SAN. I gave it a try a couple of weeks ago and now have it running in my environment; I am impressed with what is in there, and it is great to see this type of detail straight in the UI. I expect that we will see various iterations in the upcoming year.

vSphere 6.0 finally announced!

Duncan Epping · Feb 3, 2015 ·

Today Pat Gelsinger and Ben Fathi announced vSphere 6.0. (If you missed it, you can still sign up for other events.) I know many of you have been waiting on this and are ready to start your download engines, but please note that this is just the announcement of GA… the bits will follow shortly. I figured I would do a quick post which details what is in vSphere 6.0 / what is new. There were a lot of announcements today, but I am just going to cover vSphere 6.0 and VSAN. I have some more detailed posts to come, so I am not going to go into a lot of depth in this post; I just figured I would post a list of all the stuff that is in the release… or at least that I am aware of, as some stuff wasn't broadly announced.

  • vSphere 6
    • Virtual Volumes
      • Want "Virtual SAN"-like policy-based management for your traditional storage systems? That is what Virtual Volumes brings in vSphere 6.0. If you ask me, this is the flagship feature of this release.
    • Long Distance vMotion
    • Cross vSwitch and vCenter vMotion
    • vMotion of MSCS VMs using pRDMs
    • vMotion L2 adjacency restrictions are lifted!
    • vSMP Fault Tolerance
    • Content Library
    • NFS 4.1 support
    • Instant Clone aka VMFork
    • vSphere HA Component Protection
    • Storage DRS and SRM support
    • Storage DRS deep integration with VASA to understand thin provisioned, deduplicated, replicated or compressed datastores!
    • Network IO Control per VM reservations
    • Storage IOPS reservations
    • Introduction of Platform Services Controller architecture for vCenter
      • SSO, licensing, certificate authority services are grouped and can be centralized for multiple vCenter Server instances
    • Linked Mode support for vCenter Server Appliance
    • Web Client performance and usability improvements
    • Max Config:
      • 64 hosts per cluster
      • 8000 VMs per cluster
      • 480 CPUs per host
      • 12TB of memory
      • 1000 VMs per host
      • 128 vCPUs per VM
      • 4TB RAM per VM
    • vSphere Replication
      • Compression of replication traffic configurable per VM
      • Isolation of vSphere Replication host traffic
    • vSphere Data Protection now includes all vSphere Data Protection Advanced functionality
      • Up to 8TB of deduped data per VDP Appliance
      • Up to 800 VMs per VDP Appliance
      • Application level backup and restore of SQL Server, Exchange, SharePoint
      • Replication to other VDP Appliances and EMC Avamar
      • Data Domain support
  • Virtual SAN 6
    • All flash configurations
    • Blade enablement through certified JBOD configurations
    • Fault Domain aka “Rack Awareness”
    • Capacity planning / “What if scenarios”
    • Support for hardware-based checksumming / encryption
    • Disk serviceability (Light LED on Failure, Turn LED on/off manually etc)
    • Disk / Diskgroup maintenance mode aka evacuation
    • Virtual SAN Health Services plugin
    • Greater scale
      • 64 hosts per cluster
      • 200 VMs per host
      • 62TB max VMDK size
      • New on-disk format enables fast cloning and snapshotting
      • 32 VM snapshots
      • From 20K IOPS to 40K IOPS in hybrid configuration per host (2x)
      • 90K IOPS with All-Flash per host

As you can see, it is a long list of features and products that have been added or improved. I can't wait until the GA release is available. In the upcoming days I will post some more details on some of the above listed features, as there is no point in flooding the blogosphere even more with similar info.

New fling released: VM Resource and Availability Service

Duncan Epping · Feb 2, 2015 ·

I have the pleasure of announcing a brand new fling that was released today. This fling is called "VM Resource and Availability Service" and is something I came up with during a flight to Palo Alto while talking to Frank Denneman. When it comes to HA admission control, the one thing that always bugged me was why it was all based on static values. Yes, it is great to know my VMs will restart, but I would also like to know if they will receive the resources they were receiving before the fail-over. In other words, will my user experience be the same or not? After going back and forth with engineering we decided that this could be worth exploring further, and we decided to create a fling. I want to thank Rahul (DRS team), Manoj and Keith (HA team) for taking the time and going to this extent to explore the concept.

Something which I think is also unique is that this is a SaaS based solution: it allows you to upload a DRM dump and then simulate the failure of one or more hosts in a cluster (in vSphere) and identify how many:

  • VMs would be safely restarted on different hosts
  • VMs would fail to be restarted on different hosts
  • VMs would experience performance degradation after being restarted on a different host

With this information you can better plan the placement and configuration of your infrastructure to reduce downtime of your VMs/services in case of host failures. Is that useful or what? I would like to ask everyone to go through the motions, and of course to provide feedback on whether you feel this is useful information or not. You can leave feedback on this blog post or the fling website; we are aiming to monitor both.
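To give an idea of the kind of answer the service gives you, here is a conceptual toy model (not the fling's actual algorithm, and all numbers are made up): fail a host and check, per impacted VM, whether the remaining capacity still covers its reservation and its demand.

```python
# Conceptual toy model of what the service reports, not the fling's actual algorithm;
# all numbers are made up. Fail one host and classify its VMs: restartable with the
# same resources, restartable but degraded (demand no longer fits), or not restartable
# (reservation no longer fits).
hosts = {"esxi-01": 128, "esxi-02": 48, "esxi-03": 48}   # usable memory (GB) per host
vms = [  # (name, memory reservation GB, active memory demand GB, current host)
    ("db-01",  32, 80, "esxi-01"),
    ("app-01",  0, 24, "esxi-01"),
    ("dev-01", 96, 96, "esxi-01"),
]

def simulate_failure(failed_host):
    capacity = sum(c for h, c in hosts.items() if h != failed_host)
    reserved = demanded = 0
    for name, reservation, demand, host in vms:
        if host != failed_host:
            continue
        if reserved + reservation > capacity:
            print(f"{name}: would fail to restart (not enough unreserved capacity)")
            continue
        reserved += reservation
        if demanded + demand > capacity:
            print(f"{name}: restarts, but with performance degradation")
        else:
            print(f"{name}: restarts with the resources it had before the failure")
        demanded += demand

simulate_failure("esxi-01")
```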

For those who don’t know where to find the DRM dump, Frank described it in his article on the drmdiagnose fling, which I also recommend trying out! There is also a readme file with a bit more in-depth info!

  • vCenter server appliance: /var/log/vmware/vpx/drmdump/clusterX/
  • vCenter server Windows 2003: %ALLUSERSPROFILE%\Application Data\VMware\VMware VirtualCenter\Logs\drmdump\clusterX\
  • vCenter server Windows 2008: %ALLUSERSPROFILE%\VMware\VMware VirtualCenter\Logs\drmdump\clusterX\
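If you are on the appliance, a few lines like the sketch below will grab the most recent dump for a cluster so you can upload it; it assumes it runs directly on the vCenter Server Appliance and uses the path listed above, with clusterX as the placeholder it is.

```python
# Small helper, assuming it runs on the vCenter Server Appliance: return the most
# recent DRM dump for a cluster so it can be uploaded to the fling. "clusterX" is the
# placeholder from the paths above.
import glob
import os

def latest_drmdump(cluster_dir="/var/log/vmware/vpx/drmdump/clusterX"):
    dumps = glob.glob(os.path.join(cluster_dir, "*"))
    if not dumps:
        raise FileNotFoundError("no DRM dumps found in " + cluster_dir)
    return max(dumps, key=os.path.getmtime)

print(latest_drmdump())
```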

So where can you find it? Well, that is really easy: no downloads, as I said… it is fully run as a service:

  1. Open hasimulator.vmware.com to access the web service.
  2. Click on “Simulate Now” to accept the EULA terms, upload the DRM dump file and start the simulation process.
  3. Click on the help icon (at the top right corner) for a detailed description on how to use this service.