I’ve been testing the experimental feature Virtual Machine High Availability (aka VM Failure Monitoring) for a couple of days now. I must say it does just what VMware claims in the PDF: it resets a VM within the configured time when the heartbeat is lost. But one thing that struck me is that there’s hardly any evidence that HA did its job; in other words, no events are logged in VirtualCenter as far as I can see. Well, there was one error indicating something was wrong: “Remote console on w2k3-001 disconnected”. I checked several log files but could not find any relevant entries until I checked the file /var/log/vmware/hostd.log. I know the PDF about this feature states “In this experimental version of Virtual Machine Failure Monitoring, no explicit notification is sent to the administrator.”, but I would at least expect some sort of error.
The following lines in the log /var/log/vmware/hostd.log indicated that VMware initiated the reset of the VM:
- Task Created : haTask-112-vim.VirtualMachine.reset-1098
- Event 61 : w2k3-001 on ESX02.esxdemo.local in ha-datacenter is reset
- State Transition (VM_STATE_ON -> VM_STATE_RESETTING)
- w2k3-001 on ESX02.esxdemo.local in ha-datacenter is powered on
- State Transition (VM_STATE_RESETTING -> VM_STATE_ON)
- Task Completed : haTask-112-vim.VirtualMachine.reset-1098
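To avoid reading the whole log by hand, the lines above can be filtered out with a few lines of Python. This is only a rough sketch: the marker strings are copied from the excerpts above, the function names `find_reset_lines` and `scan_file` are made up for this example, and nothing is assumed about the rest of the hostd.log line layout (timestamps, prefixes and so on).

```python
from typing import Iterable, List

# Substrings copied from the hostd.log excerpts in this post. The exact
# layout of the surrounding log line is not assumed.
RESET_MARKERS = (
    "vim.VirtualMachine.reset",  # reset task created / completed
    "is reset",                  # reset event for the VM
    "VM_STATE_RESETTING",        # state transition into or out of a reset
)


def find_reset_lines(lines: Iterable[str]) -> List[str]:
    """Return every line that contains one of the reset markers."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(marker in line for marker in RESET_MARKERS)
    ]


def scan_file(path: str = "/var/log/vmware/hostd.log") -> List[str]:
    """Read a log file and return the reset-related lines."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, errors="replace") as handle:
        return find_reset_lines(handle)
```

Calling `scan_file()` on the ESX host (or on a copy of the log) and printing the result should show the same task, event and state-transition lines listed above, provided the log uses the same wording.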
VM HA was configured with the following parameters:
- das.FailureInterval = 30 (If no heartbeat is received within 30 seconds, initiate a reset)
- das.minUptime = 120 (The VM has to be up for at least 120 seconds before HA kicks in; don’t set it too short, because the VM needs this time to stabilize the heartbeat)
- das.maxFailures = 40 (Maximum number of resets within the das.maxFailureWindow; normally I would never set this above 3, but for testing I’ve set it to 40)
- das.maxFailureWindow = 86400 (86400 seconds is 1 day, see das.maxFailures)
- das.vmFailoverEnabled = true (Enable VM HA)
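To make the interplay between these settings a bit more concrete, here is a small Python sketch of the decision as I read it from the descriptions above. It is not VMware's implementation: the class `VmHaPolicy`, the function `should_reset` and the exact comparison rules are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class VmHaPolicy:
    """Hypothetical container mirroring the das.* settings listed above."""
    failure_interval: int = 30        # das.FailureInterval, seconds
    min_uptime: int = 120             # das.minUptime, seconds
    max_failures: int = 40            # das.maxFailures
    max_failure_window: int = 86400   # das.maxFailureWindow, seconds
    enabled: bool = True              # das.vmFailoverEnabled


def should_reset(
    policy: VmHaPolicy,
    now: float,
    powered_on_at: float,
    last_heartbeat_at: float,
    past_resets: Sequence[float],
) -> bool:
    """Decide whether a VM would be reset under the assumed rules.

    All times are in seconds on the same clock.
    """
    if not policy.enabled:
        return False
    # The VM must have been up long enough for the heartbeat to stabilize.
    if now - powered_on_at < policy.min_uptime:
        return False
    # The heartbeat must have been missing for the full failure interval.
    if now - last_heartbeat_at < policy.failure_interval:
        return False
    # Only resets inside the failure window count against the maximum.
    recent = [t for t in past_resets if now - t < policy.max_failure_window]
    return len(recent) < policy.max_failures
```

With the values above, a VM that has been up for 200 seconds and has been silent for 40 seconds would be reset, while the same VM at 100 seconds of uptime would be left alone. Lowering `max_failures` to 3 shows why a low value stops a VM that keeps crashing from being reset over and over.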
By the way I used the following Microsoft “hidden feature” to force a BSOD:
To enable this feature, add the following value to the registry key HKLM\System\CurrentControlSet\Services\i8042prt\Parameters:
- Name: CrashOnCtrlScroll
- Data Type: REG_DWORD
- Value: 1
Exit Registry Editor, and then restart the computer. Hold down the right Ctrl key and press Scroll Lock twice: Windows will generate a BSOD, and if you have set up VM HA correctly the VM will be reset within the das.FailureInterval time.
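If you need the registry value on several test VMs, importing a .reg file is quicker than clicking through Registry Editor each time. The Python sketch below only builds the text of such a file from the key and value named above; the function name `build_reg_file` is made up for this example, and you should check the result before importing it on anything you care about.

```python
# Registry key from this post, written out in full (HKLM = HKEY_LOCAL_MACHINE).
KEY_PATH = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters"


def build_reg_file(enabled: bool = True) -> str:
    """Return .reg file text that sets CrashOnCtrlScroll to 1 (or 0)."""
    value = 1 if enabled else 0
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY_PATH}]",
        # REG_DWORD values are written as eight hexadecimal digits.
        f'"CrashOnCtrlScroll"=dword:{value:08x}',
        "",
    ]
    # .reg files use Windows line endings.
    return "\r\n".join(lines)
```

Save the returned text to a file with a .reg extension, double-click it inside the VM, and restart as described above. Calling `build_reg_file(enabled=False)` produces a file that switches the feature off again.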
Rich says
Good to see someone experimenting with the experimental!
Rubens says
It is exactly what I was looking for! Cheers