I have the pleasure of announcing a brand-new fling that was released today. This fling is called “VM Resource and Availability Service” and is something I came up with during a flight to Palo Alto while talking to Frank Denneman. When it comes to HA Admission Control, the one thing that always bugged me was that it is all based on static values. Yes, it is great to know my VMs will restart, but I would also like to know whether they will receive the resources they were receiving before the fail-over. In other words, will my user experience be the same or not? After going back and forth with engineering, we decided this was worth exploring further and created a fling. I want to thank Rahul (DRS team), Manoj and Keith (HA team) for taking the time and going to this extent to explore the concept.
Something which I think is also unique is that this is a SaaS-based solution: it allows you to upload a DRM dump and then simulate the failure of one or more hosts from a cluster (in vSphere) and identify how many:
- VMs would be safely restarted on different hosts
- VMs would fail to be restarted on different hosts
- VMs would experience performance degradation after being restarted on a different host
With this information, you can better plan the placement and configuration of your infrastructure to reduce downtime of your VMs/services in case of host failures. Is that useful or what? I would like to ask everyone to go through the motions and, of course, to provide feedback on whether you find this information useful. You can leave feedback on this blog post or on the fling website; we aim to monitor both.
For those who don’t know where to find the DRM dump, Frank described it in his article on the drmdiagnose fling, which I also recommend trying out! There is also a readme file with a bit more in-depth info!
- vCenter server appliance: /var/log/vmware/vpx/drmdump/clusterX/
- vCenter server Windows 2003: %ALLUSERSPROFILE%\Application Data\VMware\VMware VirtualCenter\Logs\drmdump\clusterX\
- vCenter server Windows 2008: %ALLUSERSPROFILE%\VMware\VMware VirtualCenter\Logs\drmdump\clusterX\
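If a cluster folder contains several dumps, you typically want the most recent one. Here is a small sketch of how you could pick it, assuming a path like the ones above (the filename pattern is illustrative; actual DRM dump names may differ):

```python
import glob
import os

def latest_drm_dump(cluster_dir):
    """Return the most recently modified file in a drmdump cluster folder.

    cluster_dir: e.g. '/var/log/vmware/vpx/drmdump/cluster26/' (assumed layout).
    Returns None when the folder is empty.
    """
    dumps = glob.glob(os.path.join(cluster_dir, "*"))
    if not dumps:
        return None
    # Newest dump by modification time; DRM writes one file per DRS invocation.
    return max(dumps, key=os.path.getmtime)
```

You would then upload the file this returns to the simulator.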
So where can you find it? Well, that is really easy; no downloads, as I said, it fully runs as a service:
- Open hasimulator.vmware.com to access the web service.
- Click on “Simulate Now” to accept the EULA terms, upload the DRM dump file and start the simulation process.
- Click on the help icon (at the top right corner) for a detailed description on how to use this service.
This is an interesting tool, I like that it is SaaS based.
Here is a PowerCLI one-liner to get the cluster name and the directory name for the cluster folder:
get-cluster | select Name,@{Name="DirectoryName"; Expression={$_.Id.split('-')[2].replace('c','cluster')} }
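For those not using PowerCLI, the same mapping can be sketched in a few lines of Python. It assumes the cluster's managed-object ID has the usual `ClusterComputeResource-domain-cNN` shape (an assumption; verify against your own environment):

```python
def cluster_dir_name(cluster_id):
    """Map a cluster's managed-object ID to its drmdump folder name.

    e.g. 'ClusterComputeResource-domain-c26' -> 'cluster26'
    (assumed ID format, mirroring the PowerCLI one-liner above).
    """
    suffix = cluster_id.split('-')[2]       # e.g. 'c26'
    return suffix.replace('c', 'cluster')   # e.g. 'cluster26'
```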
Cool, thanks. I also have a feature request outstanding to allow retrieving the DRM dump through the API, which will make it even easier.
Glad I could help with a dump file. Was a lot of work :)
Thanks, mate 🙂
I’m a bit confused by the results I get. We have a vMSC cluster with 14 hosts that is not over-provisioned and has ~30% free resources (cluster pRAM 5.3 TB, VM vRAM 3.8 TB). We knew (or thought we knew) that not all VMs would survive the loss of 50% of host resources (7 hosts), but we only need ~100 VMs to run or be restarted in case of a site failure. The fling now tells me that even with the loss of 7 hosts, all 329 VMs would continue to run (or could be restarted) with minimal performance impact.
I know that we don’t right-size VMs; many of them have too much vRAM. But this is something I cannot change.
Does the fling also consider the time ballooning takes to free vRAM? I would guess that if we really lost 7 hosts in that cluster, ballooning would be so slow that hypervisor swapping would start very quickly and we would see a massive performance impact.
- 14 hosts in cluster
- 7 simulated host failures
- 14 avg. # of compatible hosts
- 14 avg. # of compatible hosts (no rules)
- 329 VMs in cluster
- 0 total VMs failed to restart
- 0 non-agent VMs failed to restart
- 0 agent VMs failed to restart
Then I am guessing that, on average, your VMs are not close to consuming what they have been provisioned with. The fling looks at the “demand” of each VM using the DRS data and then simulates a failure. If your VMs are only demanding 25% of what they have been provisioned with, then this could be the result.
Some more details around this can be found here:
http://hasimulator.vmware.com/html/docs/readme.pdf
Thanks for your reply. I’m still not sure whether the result is a theoretical one. Even if our VMs are massively oversized, in case of a failure ballooning first has to free memory, which is slow. I don’t find anything about this in the readme, and https://labs.vmware.com/flings/vm-resource-and-availability-service just shows “Host has been locked out”.
If 7 hosts fail in your cluster and you still have enough physical memory in the surviving hosts’ pRAM pool to serve all VMs’ active memory requirements, why do you think it is going to balloon? You are only going to balloon when physical RAM runs low on a host and it switches to the soft memory reclamation state.
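The reasoning above boils down to a simple capacity check, which could be sketched like this (a toy model only: the thresholds are illustrative and do not reproduce ESXi’s real minFree state boundaries or the fling’s demand model):

```python
def memory_state_after_failures(host_pram_gb, surviving_hosts, vm_active_gb):
    """Toy check: does the surviving hosts' pRAM pool cover the VMs' active memory?

    Returns 'high' (no reclamation expected) when active memory fits with
    headroom, 'soft' (ballooning kicks in) when it barely fits, and 'hard'
    (hypervisor swapping) when it does not. The 90% headroom threshold is
    an assumption for illustration, not ESXi's actual behavior.
    """
    pool = host_pram_gb * surviving_hosts
    demand = sum(vm_active_gb)
    if demand <= pool * 0.9:
        return "high"
    if demand <= pool:
        return "soft"
    return "hard"
```

For the cluster described above (roughly 378 GB pRAM per host, 7 survivors), VMs whose *active* memory is only a quarter of their 3.8 TB of provisioned vRAM would still land comfortably in the “high” state, which matches the fling’s “minimal performance impact” verdict.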
I just tested this in an 8-host cluster where the fling told me that all VMs would still run with minimal performance impact if I disabled 6 hosts. After I put the 5th host into maintenance mode, the remaining hosts entered the hard memory state, hypervisor swapping started, and I received alarms about memory usage (host and VMs). Monitoring also alerted me that some VMs were no longer reachable. So I guess the results of this fling have to be treated with caution.
Ok, the tool is nice, but it does not work reliably in every case.
https://labs.vmware.com/flings/vm-resource-and-availability-service#comment-525686
“Hi Ralf,
Thanks for trying out our fling! We really appreciate your feedback!
You are right in saying that the fling doesn’t handle over-sized VMs *yet*. The problem is, it doesn’t estimate the cost associated with such high memory reclamation. We are actively working on fixing this. I will post an update when this is done.”
Ralf, the changes are in. We have taken a more conservative approach in calculating the resource impact. This should cater to your needs.
I can confirm that the fling now returns much more reasonable results. Thanks!