
Yellow Bricks

by Duncan Epping



vSphere 5.0 HA: Application Monitoring intro

Duncan Epping · Aug 11, 2011 ·

I don’t think anyone has blogged about App Monitoring yet, so I figured I would do a “what’s new / intro” to App Monitoring in vSphere 5.0. Prior to vSphere 5, App Monitoring could only be leveraged by partners that had access to the SDK/APIs. A handful of partners did, of which Symantec’s ApplicationHA is probably the best example. The “problem” with that, though, is that you would still need to buy a piece of software, while you might have an in-house development team that could easily bake this into their application… well, with vSphere 5 you can. I grabbed one of the latest code drops and started playing around. Note that I am not going to do an extensive article on this; I am just showing what you have after installing the package. In my case I installed it on a Windows VM.

First of all, after installing the package you will have a new executable. This executable allows you to control the functionality App Monitoring offers without the need to compile a binary yourself. This new command, vmware-appmonitor.exe, takes the following arguments, which are not coincidentally similar to the functions I will show in a second:

  • enable
  • disable
  • markActive
  • isEnabled
  • getAppStatus

When running the command the following output is presented:

C:\VMware-GuestAppMonitorSDK\bin\win32>vmware-appmonitor.exe
Usage: vmware-appmonitor.exe {enable | disable | markActive | isEnabled | getAppStatus}

I guess most parameters speak for themselves. “enable” switches App Monitoring on and “disable” turns it off again. “isEnabled” returns the current status: is it on or off? “getAppStatus” tells you the status of your app: if it is healthy and has been sending heartbeats regularly, the result will be green; if there is a real issue, it will be red. (There is also gray, which means HA has just picked up the VM; its status needs to be cleared and monitoring should start soon.) The most important one is “markActive”. This parameter needs to be called at least every 30 seconds; it is the heartbeat parameter. In other words, “markActive” is what informs HA that the application is still alive!
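
Since “markActive” must be invoked at least every 30 seconds, the natural consumer is a small heartbeat loop wrapped around either the CLI or the API. Below is a minimal Python sketch of that pattern; the vmware-appmonitor.exe invocation is an assumption based on the usage output above, and the loop itself is generic:

```python
import subprocess

def cli_mark_active():
    # Assumed invocation of the SDK's CLI (see the usage output above);
    # adjust the path to wherever vmware-appmonitor.exe is installed.
    subprocess.run(["vmware-appmonitor.exe", "markActive"], check=True)

def heartbeat_loop(send, sleep, interval=10, iterations=None):
    """Call send() every `interval` seconds, comfortably within the
    30-second deadline after which HA considers the application dead.
    `send` and `sleep` are injectable so the loop is easy to test."""
    count = 0
    while iterations is None or count < iterations:
        send()            # e.g. cli_mark_active
        sleep(interval)   # e.g. time.sleep
        count += 1
    return count
```

In a real guest you would run `heartbeat_loop(cli_mark_active, time.sleep)` from your application’s watchdog thread, and stop heartbeating (or call “disable”) when the application shuts down cleanly.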

I am sure that as soon as William Lam gets his hands on the package he will go wild and release a bunch of scripts which will allow you to enhance resiliency for your applications/services. These parameters can also be used by your development team, but in the form of functions. The Application Awareness API allows anyone to talk to it using different languages, like C++ and Java for instance. Currently there are 6 functions defined:

  • VMGuestAppMonitor_Enable()
    Enables Monitoring
  • VMGuestAppMonitor_MarkActive()
    Marks the application as active; it is recommended to call this at least every 30 seconds
  • VMGuestAppMonitor_Disable()
    Disable Monitoring
  • VMGuestAppMonitor_IsEnabled()
    Returns status of Monitoring
  • VMGuestAppMonitor_GetAppStatus()
    Returns the current application status recorded for the application
  • VMGuestAppMonitor_Free()
    Frees the result of the VMGuestAppMonitor_GetAppStatus() call
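
For compiled languages the SDK exposes exactly these calls; from a scripting language you could wrap them in a thin class. The sketch below is illustrative only: it assumes `lib` is some handle exposing the six functions above (for example a ctypes.CDLL around the SDK’s shared library, whose filename I won’t guess here):

```python
class AppMonitor:
    """Context manager over the six VMGuestAppMonitor_* functions listed
    above. `lib` is any object exposing them, e.g. a ctypes.CDLL handle."""

    def __init__(self, lib):
        self.lib = lib

    def __enter__(self):
        self.lib.VMGuestAppMonitor_Enable()      # switch App Monitoring on
        return self

    def beat(self):
        # Heartbeat: call at least every 30 seconds while healthy.
        self.lib.VMGuestAppMonitor_MarkActive()

    def status(self):
        raw = self.lib.VMGuestAppMonitor_GetAppStatus()
        result = str(raw)                        # copy before freeing
        self.lib.VMGuestAppMonitor_Free(raw)     # caller must free the result
        return result

    def __exit__(self, *exc):
        self.lib.VMGuestAppMonitor_Disable()     # switch monitoring off
```

The context-manager shape guarantees “disable” runs even if the application logic raises, so HA isn’t left waiting for heartbeats from a process that exited.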

These functions could be used by your development team to enhance resiliency in a simple way. This is just the start, however. Personally, I would like to see some sort of rolling patch process added on top, and for instance the ability to restart a service or signal a partial VM failure. Or even hint to the hypervisor that there is a partial failure and request a vMotion to a different host to validate if that solves the problem… If you feel there’s something that needs to be added to App Monitoring, let me know and I’ll make sure the PM/Dev team reads this thread.

** disclaimer: some of this info was taken from the vSphere 5.0 Technical Deepdive book **

vSphere 5 Coverage

Duncan Epping · Aug 6, 2011 ·

I just read Eric’s article about all the topics he covered around vSphere 5 over the last couple of weeks, and as I just published the last article I had prepared, I figured it would make sense to post something similar. (Great job by the way, Eric, I always enjoy reading your articles and watching your videos!) Although I hit roughly 10,000 unique views on average per day the first week after the launch, and still 7,000 a day currently, I have the feeling that many were focused on the licensing changes rather than all the new and exciting features that were coming up. Now that the dust has somewhat settled, it makes sense to re-emphasize them. Over the last 6 months I have been working with vSphere 5 and explored these features; my focus for most of those 6 months was to complete the book, but of course I wrote a large number of articles along the way, many of which ended up in the book in some shape or form. This is the list of articles I published. If you feel there is anything I left out that should have been covered, let me know and I will try to dive into it. I can’t make any promises though, as with VMworld coming up my time is limited.

  1. Live Blog: Raising The Bar, Part V
  2. 5 is the magic number
  3. Hot off the press: vSphere 5.0 Clustering Technical Deepdive
  4. vSphere 5.0: Storage DRS introduction
  5. vSphere 5.0: What has changed for VMFS?
  6. vSphere 5.0: Storage vMotion and the Mirror Driver
  7. Punch Zeros
  8. Storage DRS interoperability
  9. vSphere 5.0: UNMAP (vaai feature)
  10. vSphere 5.0: ESXCLI
  11. ESXi 5: Suppressing the local/remote shell warning
  12. Testing VM Monitoring with vSphere 5.0
  13. What’s new?
  14. vSphere 5.0 vMotion Enhancements
  15. vSphere 5.0: vMotion enhancement, tiny but very welcome!
  16. ESXi 5.0 and Scripted Installs
  17. vSphere 5.0: Storage initiatives
  18. Scale Up/Out and impact of vRAM?!? (part 2)
  19. HA Architecture Series – FDM (1/5)
  20. HA Architecture Series – Primary nodes? (2/5)
  21. HA Architecture Series – Datastore Heartbeating (3/5)
  22. HA Architecture Series – Restarting VMs (4/5)
  23. HA Architecture Series – Advanced Settings (5/5)
  24. VMFS-5 LUN Sizing
  25. vSphere 5.0 HA: Changes in admission control
  26. vSphere 5 – Metro vMotion
  27. SDRS and Auto-Tiering solutions – The Injector

Once again, if there is something you feel I should be covering, let me know and I’ll try to dig into it. Preferably something that none of the other blogs have published, of course.

vSphere 5.0 HA: Changes in admission control

Duncan Epping · Aug 3, 2011 ·

I just wanted to point out a couple of changes for HA in vSphere 5.0 with regards to admission control. Although they might seem minor, they are important to keep in mind when redesigning your environment. Let’s discuss each of the admission control policies and list the changes underneath.

  • Host failures cluster tolerates
    Still uses the slot algorithm. The major change here is that you can have a value larger than 4 hosts. The 4-host limit was imposed by the Primary/Secondary node concept. As this constraint has been lifted, it is now possible to select a value up to 31. So in the case of a 16-host cluster you can set the value to 15. (Yes, you could even set it to 31 as the UI doesn’t limit you, but that wouldn’t make sense, would it…) Another change is the default slot size for CPU. The default slot size used to be 256MHz. This has been decreased to 32MHz.
  • Percentage as cluster resources reserved
    This admission control policy has been overhauled and it is now possible to select a percentage for both CPU and Memory separately. In other words you can set CPU to 30% and Memory to 25%. The algorithm hasn’t changed and this is still my preferred admission control policy!
  • Specify Failover host
    Allows you to select multiple hosts instead of just one. So for instance, in an 8-host cluster you can specify two as designated failover hosts. These hosts will not be used during normal operations; keep this in mind!
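
To make the “host failures cluster tolerates” mechanics concrete, here is a small Python sketch of the slot algorithm as described above. It is an illustration of the mechanism, not HA’s actual implementation; the memory floor value is an assumption for the example (the real memory slot size is the largest reservation plus overhead):

```python
def slot_size(vm_reservations, cpu_min_mhz=32, mem_min_mb=32):
    """Slot size = largest reservation across powered-on VMs, floored at
    a minimum (32 MHz for CPU in vSphere 5.0; the memory floor here is
    just a stand-in for reservation + overhead)."""
    cpu = max([c for c, _ in vm_reservations] + [cpu_min_mhz])
    mem = max([m for _, m in vm_reservations] + [mem_min_mb])
    return cpu, mem

def slots_per_host(hosts, slot):
    # A host's slot count is limited by whichever resource runs out first.
    return [min(cpu // slot[0], mem // slot[1]) for cpu, mem in hosts]

def can_tolerate(hosts, vm_reservations, failures):
    """Worst-case check: assume the `failures` hosts holding the MOST
    slots are the ones that fail."""
    per_host = sorted(slots_per_host(hosts, slot_size(vm_reservations)))
    remaining = sum(per_host[:len(per_host) - failures])
    return remaining >= len(vm_reservations)
```

For example, with four hosts of 9 GHz / 32 GB each and twenty VMs reserving 500 MHz / 1 GB each, the cluster can tolerate one host failure but not three.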

For more details on admission control I would like to refer to the HA deepdive (not updated to 5.0 yet) or my book on vSphere 5.0 Clustering which contains many examples of how to correctly set the percentage for instance.

VMworld Session: vSphere Clustering Q&A

Duncan Epping · Aug 1, 2011 ·

We need your help for our VMworld session “VSP1682 – vSphere Clustering Q&A”. In order to ensure we can fill up the full 60 minutes, we want to have a couple of questions ready in case no one in the audience has a question. Although I doubt that will be the case, it is better to be prepared than to stare at each other for 50 minutes. So please help us out and submit some questions about HA, DRS and/or Storage DRS.

Our session is on Monday morning at 08:00, so if you haven’t yet, register today. By the way, Frank has another session, the DRS/Resource Management Deepdive… definitely worth attending. It is VSP3116, on Monday at 11:30 and Thursday at 10:30 (sold out). Make sure to attend one of those; I’ve seen a preview of the slide deck and it will be worth it. Another E P I C session will be VSP1956 on Monday at 13:00. It is the ESXi Quiz, yes… Death to PowerPoint. At this session you will see vExperts taking on VMware employees in a knowledge quiz!

HA Architecture Series – Advanced Settings (5/5)

Duncan Epping · Jul 28, 2011 ·

When doing some research for the vSphere Clustering Technical Deepdive book I stumbled across something which was very surprising and difficult to grasp at first. I figured explaining it in a short article was the best approach. Many of you have read the HA deepdive article or the book and know that das.failuredetectiontime is probably the most commonly used advanced setting when configuring HA. There have been all sorts of recommendations and best practices flying around, many of which were blatantly confusing to be honest. As stated in the previous article, das.failuredetectiontime is no longer needed and has been deprecated. Did anything else change from an advanced settings perspective? Have advanced settings been added or removed? Here is the new list:

  • das.ignoreInsufficientHbDatastore – 5.0 only
    Suppress the host config issue that the number of heartbeat datastores is less than das.heartbeatDsPerHost. Default value is “false”. Can be configured as “true” or “false”.
  • das.heartbeatDsPerHost – 5.0 only
    The number of required heartbeat datastores per host. The default value is 2; value should be between 2 and 5.
  • das.failuredetectiontime – 4.1 and prior
    Number of milliseconds, timeout time, for isolation response action (with a default of 15000 milliseconds). Pre-vSphere 4.0 it was a general best practice to increase the value to 60000 when an active/standby Service Console setup was used. This is no longer needed. For a host with two Service Consoles or a secondary isolation address a failuredetection time of 15000 is recommended.
  • das.isolationaddress[x] – 5.0 and prior
    IP address the ESX host uses to check for isolation when no heartbeats are received, where [x] = 0‐9. VMware HA will use the default gateway as an isolation address and the provided value as an additional checkpoint. I recommend adding an isolation address when a secondary service console is being used for redundancy purposes.
  • das.usedefaultisolationaddress – 5.0 and prior
    Value can be “true” or “false” and needs to be set to false in case the default gateway, which is the default isolation address, should not or cannot be used for this purpose. In other words, if the default gateway is a non-pingable address, set the “das.isolationaddress0” to a pingable address and disable the usage of the default gateway by setting this to “false”.
  • das.isolationShutdownTimeout – 5.0 and prior
    Time in seconds to wait for a VM to become powered off after initiating a guest shutdown, before forcing a power off.
  • das.allowNetwork[x] – 5.0 and prior
    Enables the use of port group names to control the networks used for VMware HA, where [x] = 0 – ?. You can set the value to “Service Console 2” or “Management Network” to use (only) the networks associated with those port group names in the networking configuration.
  • das.bypassNetCompatCheck – 4.1 and prior
    Disable the “compatible network” check for HA that was introduced with ESX 3.5 Update 2. Disabling this check will enable HA to be configured in a cluster which contains hosts in different subnets, so-called incompatible networks. Default value is “false”; setting it to “true” disables the check.
  • das.ignoreRedundantNetWarning – 5.0 and prior
    Remove the error icon/message from your vCenter when you don’t have a redundant Service Console connection. Default value is “false”, setting it to “true” will disable the warning. HA must be reconfigured after setting the option.
  • das.vmMemoryMinMB – 5.0 and prior
    The minimum default slot size used for calculating failover capacity. Higher values will reserve more space for failovers. Do not confuse with “das.slotMemInMB”.
  • das.slotMemInMB – 5.0 and prior
    Sets the slot size for memory to the specified value. This advanced setting can be used when a virtual machine with a large memory reservation skews the slot size, as this will typically result in an artificially conservative number of available slots.
  • das.vmCpuMinMHz – 5.0 and prior
    The minimum default slot size used for calculating failover capacity. Higher values will reserve more space for failovers. Do not confuse with “das.slotCpuInMHz”.
  • das.slotCpuInMHz – 5.0 and prior
    Sets the slot size for CPU to the specified value. This advanced setting can be used when a virtual machine with a large CPU reservation skews the slot size, as this will typically result in an artificially conservative number of available slots.
  • das.sensorPollingFreq – 4.1 and prior
    Set the time interval for HA status updates. As of vSphere 4.1, the default value of this setting is 10. It can be configured between 1 and 30, but it is not recommended to decrease this value as it might lead to less scalability due to the overhead of the status updates.
  • das.perHostConcurrentFailoversLimit – 5.0 and prior
    By default, HA will issue up to 32 concurrent VM power-ons per host. This setting controls the maximum number of concurrent restarts on a single host. Setting a larger value will allow more VMs to be restarted concurrently but will also increase the average latency to recover as it adds more stress on the hosts and storage.
  • das.config.log.maxFileNum – 5.0 only
    Desired number of log rotations.
  • das.config.log.maxFileSize – 5.0 only
    Maximum file size in bytes of the log file.
  • das.config.log.directory – 5.0 only
    Full directory path used to store log files.
  • das.maxFtVmsPerHost – 5.0 and prior
    The maximum number of primary and secondary FT virtual machines that can be placed on a single host. The default value is 4.
  • das.iostatsinterval (VM Monitoring) – 5.0 and prior
    The I/O stats interval determines if any disk or network activity has occurred for the virtual machine. The default value is 120 seconds.
  • das.failureInterval (VM Monitoring) – 5.0 and prior
    The polling interval for failures. Default value is 30 seconds.
  • das.minUptime (VM Monitoring) – 5.0 and prior
    The minimum uptime in seconds before VM Monitoring starts polling. The default value is 120 seconds.
  • das.maxFailures (VM Monitoring) – 5.0 and prior
    Maximum number of virtual machine failures within the specified “das.maxFailureWindow”. If this number is reached, VM Monitoring doesn’t restart the virtual machine automatically. Default value is 3.
  • das.maxFailureWindow (VM Monitoring) – 5.0 and prior
    Minimum number of seconds between failures. Default value is 3600 seconds. If a virtual machine fails more than “das.maxFailures” within 3600 seconds, VM Monitoring doesn’t restart the machine.
  • das.vmFailoverEnabled (VM Monitoring) – 5.0 and prior
    If set to “true”, VM Monitoring is enabled. When it is set to “false”, VM Monitoring is disabled.
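
The interplay between “das.maxFailures” and “das.maxFailureWindow” is easiest to see in a few lines of Python. This mirrors the semantics described above (restart unless the VM has failed more than das.maxFailures times within the window); it is an illustration, not VMware’s code:

```python
def should_restart(previous_failures, now, max_failures=3, window=3600):
    """Would VM Monitoring restart a VM that fails at time `now`?
    previous_failures: timestamps (in seconds) of earlier failures.
    Restart unless this failure would push the count of failures
    within `window` seconds past max_failures."""
    recent = [t for t in previous_failures if now - t <= window]
    return len(recent) + 1 <= max_failures
```

Note that once failures age out of the window, automatic restarts resume; the counter is a sliding window, not a lifetime total.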

Please note that this is the full list I am aware of today; over time I will add/remove entries where and when applicable.

