5.0

VMFS-5 LUN Sizing

Duncan Epping · Jul 29, 2011 ·

I received a question about the VMFS LUN Sizing article I wrote back in 2009: how valid are the formula and values in today’s environment, especially considering VMFS-5 is around the corner? It is a very valid question, so I decided to take my previous article and rewrite it. One thing to keep in mind is that I tried to make it usable for generic consumption. You will still need to figure some things out yourself, as I simply don’t have all the info needed to make it cookie-cutter, but I guess this is as close as it can get.

Parameters:

MinSize = 1.2GB
MaxVMs = 40
SlackSpace = 20%
AvgSizeVMDK = 30GB
AvgDisksVMs = 2
AvgMemSize = 3GB

Before I drop the formula I want to explain the MaxVMs parameter. You will need to figure out how many IOps your LUN can handle first; for a hint, check this article. Besides IOps you will also need to take burst room into account and of course the RTO defined for this environment:

((IOpsPerLUN – 20%) / AVGIOpsPerVM) ≤ (MaxVMsWithinRTO)

Keep in mind that the article I pointed out just a second ago is geared towards worst-case numbers, so no cache or other benefits. Secondly, I subtracted 20%, which is room for bursting. This is by no means a best practice, and this number will need to be tweaked based on the size of your LUN and the total amount of IOps your LUN can handle. For instance, when you are using 8 SATA spindles that 20% might only be 80 IOps, depending on the RAID level used; in the case of SAS it could be 280 IOps with just 8 spindles, and that is a huge difference. Anyway, I leave that up to you to decide, but I used 20% headroom for both disk space (for snapshots and the memory overhead swap files) and performance, just to keep it simple. The second part of this one is MaxVMsWithinRTO. In short, make sure that you can recover the number of VMs on the datastore within the defined recovery time objective (RTO). You don’t want to find yourself in a situation where the RTO is 4 hours but the total amount of time for the restore is 24 hours.
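To make the MaxVMs reasoning concrete, here is a minimal Python sketch. It reads the formula above as "take the smaller of what the IOps budget allows and what you can restore within the RTO", which is my interpretation rather than a rule. The function names and the average of 8 IOps per VM are made up for illustration; the 400 IOps figure follows from the SATA example, where 20% equals 80 IOps.

```python
import math


def max_vms_by_iops(iops_per_lun, avg_iops_per_vm, burst_headroom_pct=20):
    """VMs the LUN can serve after reserving a percentage for bursting."""
    usable_iops = iops_per_lun * (100 - burst_headroom_pct) / 100
    return math.floor(usable_iops / avg_iops_per_vm)


def max_vms(iops_per_lun, avg_iops_per_vm, max_vms_within_rto,
            burst_headroom_pct=20):
    """Cap the IOps-based VM count by what can be restored within the RTO."""
    by_iops = max_vms_by_iops(iops_per_lun, avg_iops_per_vm, burst_headroom_pct)
    return min(by_iops, max_vms_within_rto)


if __name__ == "__main__":
    # Hypothetical inputs: 400 IOps LUN, 8 IOps per VM, 50 VMs restorable in time.
    print(max_vms(400, 8, 50))
```

With these inputs the IOps budget is the limiting factor; lower the restorable VM count below it and the RTO becomes the constraint instead.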

Formula, aaahhh yes, here we go. Note that I did not take traditional constraints around “SCSI Reservation Conflicts” into account, as with VMFS-5 and VAAI SCSI Locking Offload these are lifted. If you have an array which doesn’t support the ATS primitive, make sure you take this into account as well. Although the SCSI locking mechanism has been improved over the last years, it could still limit you when you have a lot of power-on events, vMotion events etc.

(((MaxVMs * AvgDisksVMs) * AvgSizeVMDK) + ( MaxVMs * AvgMemSize)) + SlackSpace ≥ MinSize

Let’s use the numbers defined in the parameters above and do the math:

(((40 * 2) * 30GB) + (40 * 3GB)) + 20% = (2400GB + 120GB) * 1.2 = 3024 GB
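The same calculation as a small Python sketch, so you can plug in your own numbers. The function and parameter names are mine; the defaults mirror the parameters listed above, and the slack is applied as a percentage on top of the disk and swap totals, as in the worked example.

```python
def lun_size_gb(max_vms=40, avg_disks_per_vm=2, avg_vmdk_gb=30,
                avg_mem_gb=3, slack_pct=20):
    """Return the LUN size in GB for the given averages."""
    disk_gb = max_vms * avg_disks_per_vm * avg_vmdk_gb  # space for the VMDKs
    swap_gb = max_vms * avg_mem_gb                      # space for the swap files
    return (disk_gb + swap_gb) * (100 + slack_pct) / 100


if __name__ == "__main__":
    print(lun_size_gb())  # uses the parameters from this article
```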

I hope this helps you make your storage design decisions. One thing to keep in mind of course is that most storage arrays have optimal configurations for LUN sizes in terms of performance. Depending on your IOps requirements you might want to make sure that these align.

HA Architecture Series – Advanced Settings (5/5)

Duncan Epping · Jul 28, 2011 ·

When doing some research for the vSphere Clustering Technical Deepdive book I stumbled across something which was very surprising and difficult to grasp at first. I figured explaining it in a short article was the best approach. Many of you have read the HA deepdive article or the book and know that das.failuredetectiontime is probably the most commonly used advanced setting when configuring HA. There have been all sorts of recommendations and best practices flying around, of which many were blatantly confusing to be honest. As stated in the previous article, das.failuredetectiontime is no longer needed and has been deprecated. Did anything else change from an advanced settings perspective? Have advanced settings been added or removed? Here is the new list:

das.ignoreInsufficientHbDatastore – 5.0 only
Suppress the host config issue that the number of heartbeat datastores is less than das.heartbeatDsPerHost. Default value is “false”. Can be configured as “true” or “false”.
das.heartbeatDsPerHost – 5.0 only
The number of required heartbeat datastores per host. The default value is 2; value should be between 2 and 5.
das.failuredetectiontime – 4.1 and prior
Timeout in milliseconds for the isolation response action (with a default of 15000 milliseconds). Pre-vSphere 4.0 it was a general best practice to increase the value to 60000 when an active/standby Service Console setup was used. This is no longer needed. For a host with two Service Consoles or a secondary isolation address a failure detection time of 15000 is recommended.
das.isolationaddress[x] – 5.0 and prior
IP address the ESX host uses to check on isolation when no heartbeats are received, where [x] = 0‐9. (see screenshot below for an example) VMware HA will use the default gateway as an isolation address and the provided value as an additional checkpoint. I recommend adding an isolation address when a secondary service console is being used for redundancy purposes.
das.usedefaultisolationaddress – 5.0 and prior
Value can be “true” or “false” and needs to be set to false in case the default gateway, which is the default isolation address, should not or cannot be used for this purpose. In other words, if the default gateway is a non-pingable address, set the “das.isolationaddress0” to a pingable address and disable the usage of the default gateway by setting this to “false”.
das.isolationShutdownTimeout – 5.0 and prior
Time in seconds to wait for a VM to become powered off after initiating a guest shutdown, before forcing a power off.
das.allowNetwork[x] – 5.0 and prior
Enables the use of port group names to control the networks used for VMware HA, where [x] = 0 – ?. You can set the value to be “Service Console 2” or “Management Network” to use (only) the networks associated with those port group names in the networking configuration.
das.bypassNetCompatCheck – 4.1 and prior
Disable the “compatible network” check for HA that was introduced with ESX 3.5 Update 2. Disabling this check will enable HA to be configured in a cluster which contains hosts in different subnets, so-called incompatible networks. Default value is “false”; setting it to “true” disables the check.
das.ignoreRedundantNetWarning – 5.0 and prior
Remove the error icon/message from your vCenter when you don’t have a redundant Service Console connection. Default value is “false”, setting it to “true” will disable the warning. HA must be reconfigured after setting the option.
das.vmMemoryMinMB – 5.0 and prior
The minimum default slot size used for calculating failover capacity. Higher values will reserve more space for failovers. Do not confuse with “das.slotMemInMB”.
das.slotMemInMB – 5.0 and prior
Sets the slot size for memory to the specified value. This advanced setting can be used when a virtual machine with a large memory reservation skews the slot size, as this will typically result in an artificially conservative number of available slots.
das.vmCpuMinMHz – 5.0 and prior
The minimum default slot size used for calculating failover capacity. Higher values will reserve more space for failovers. Do not confuse with “das.slotCpuInMHz”.
das.slotCpuInMHz – 5.0 and prior
Sets the slot size for CPU to the specified value. This advanced setting can be used when a virtual machine with a large CPU reservation skews the slot size, as this will typically result in an artificially conservative number of available slots.
das.sensorPollingFreq – 4.1 and prior
Set the time interval for HA status updates. As of vSphere 4.1, the default value of this setting is 10. It can be configured between 1 and 30, but it is not recommended to decrease this value as it might lead to less scalability due to the overhead of the status updates.
das.perHostConcurrentFailoversLimit – 5.0 and prior
By default, HA will issue up to 32 concurrent VM power-ons per host. This setting controls the maximum number of concurrent restarts on a single host. Setting a larger value will allow more VMs to be restarted concurrently but will also increase the average latency to recover as it adds more stress on the hosts and storage.
das.config.log.maxFileNum – 5.0 only
Desired number of log rotations.
das.config.log.maxFileSize – 5.0 only
Maximum file size in bytes of the log file.
das.config.log.directory – 5.0 only
Full directory path used to store log files.
das.maxFtVmsPerHost – 5.0 and prior
The maximum number of primary and secondary FT virtual machines that can be placed on a single host. The default value is 4.
das.iostatsinterval (VM Monitoring) – 5.0 and prior
The I/O stats interval determines if any disk or network activity has occurred for the virtual machine. The default value is 120 seconds.
das.failureInterval (VM Monitoring) – 5.0 and prior
The polling interval for failures. Default value is 30 seconds.
das.minUptime (VM Monitoring) – 5.0 and prior
The minimum uptime in seconds before VM Monitoring starts polling. The default value is 120 seconds.
das.maxFailures (VM Monitoring) – 5.0 and prior
Maximum number of virtual machine failures within the specified “das.maxFailureWindow”. If this number is reached, VM Monitoring doesn’t restart the virtual machine automatically. Default value is 3.
das.maxFailureWindow (VM Monitoring) – 5.0 and prior
Minimum number of seconds between failures. Default value is 3600 seconds. If a virtual machine fails more than “das.maxFailures” within 3600 seconds, VM Monitoring doesn’t restart the machine.
das.vmFailoverEnabled (VM Monitoring) – 5.0 and prior
If set to “true”, VM Monitoring is enabled. When it is set to “false”, VM Monitoring is disabled.

Please note that this is the full list that I am aware of today. Over time I will add / remove settings where and when applicable.
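To illustrate how “das.maxFailures” and “das.maxFailureWindow” interact, here is a rough Python sketch. This is not how HA implements it; it is only my reading of the two descriptions above: once the number of failures inside the window has reached the maximum, VM Monitoring stops restarting the virtual machine. The function name and the timestamps are hypothetical.

```python
def should_restart(previous_failure_times, now, max_failures=3,
                   max_failure_window=3600):
    """Decide whether a VM that just failed at `now` gets restarted.

    previous_failure_times: earlier failure timestamps in seconds.
    Defaults mirror das.maxFailures (3) and das.maxFailureWindow (3600).
    """
    recent = [t for t in previous_failure_times
              if now - t <= max_failure_window]
    # Assumption: reaching max_failures inside the window blocks the restart.
    return len(recent) < max_failures


if __name__ == "__main__":
    print(should_restart([0, 1000, 2000], now=3000))  # three recent failures
    print(should_restart([0, 1000, 2000], now=5000))  # two have aged out
```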

HA Architecture Series – Datastore Heartbeating (3/5)

Duncan Epping · Jul 26, 2011 ·

**disclaimer: Some of the content has been taken from the vSphere 5 Clustering Technical Deepdive book**

The first time I was playing around with 5.0 and particularly HA I noticed a new section in the UI called Datastore Heartbeating.

Those familiar with HA prior to vSphere 5.0 probably know that virtual machine restarts were always initiated, even if only the management network of the host was isolated and the virtual machines were still running. As you can imagine, this added an unnecessary level of stress to the host. This has been mitigated by the introduction of the datastore heartbeating mechanism. Datastore heartbeating adds a new level of resiliency and allows HA to make a distinction between a failed host and an isolated / partitioned host. Isolated vs Partitioned is explained in Part 2 of this series.

Datastore heartbeating enables a master to more correctly determine the state of a host that is not reachable via the management network. The new datastore heartbeat mechanism is only used when the master has lost network connectivity with the slaves, to validate whether the host has failed or is merely isolated/network partitioned. As shown in the screenshot above, two datastores are automatically selected by vCenter. You can rule out specific volumes if and when required, or even make the selection yourself. I would however recommend letting vCenter decide.

As mentioned, by default it will select two datastores. It is possible however to configure an advanced setting (das.heartbeatDsPerHost) to allow for more datastores for datastore heartbeating. I can imagine this is something that you would do when you have multiple storage devices and want to pick a datastore from each, but generally speaking I would not recommend configuring this option as the default should be sufficient for most scenarios.

How does this heartbeating mechanism work? HA leverages the existing VMFS filesystem locking mechanism. The locking mechanism uses a so-called “heartbeat region” which is updated as long as the lock on a file exists. In order to update a datastore heartbeat region, a host needs to have at least one open file on the volume. HA ensures there is at least one file open on this volume by creating a file specifically for datastore heartbeating. In other words, a per-host file is created on the designated heartbeating datastores, as shown in the screenshot below. HA will simply check whether the heartbeat region has been updated.
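Conceptually the check can be pictured as in the Python sketch below. To be clear, this is a simplified model and not the HA code: the function name, the timestamps and the `stale_after` threshold are all assumptions of mine. It only captures the idea that a host which no longer sends network heartbeats but still updates its heartbeat region has not failed.

```python
def classify_unreachable_host(last_region_update, now, stale_after):
    """Classify a host the master can no longer reach over the network.

    last_region_update: time (seconds) the heartbeat region was last updated.
    stale_after: hypothetical threshold after which the region counts as stale.
    """
    if now - last_region_update <= stale_after:
        # Region still being updated: the host is alive.
        return "isolated or partitioned"
    # No network heartbeats and no datastore heartbeats.
    return "failed"


if __name__ == "__main__":
    print(classify_unreachable_host(last_region_update=95, now=100, stale_after=15))
    print(classify_unreachable_host(last_region_update=10, now=100, stale_after=15))
```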

If you are curious which datastores have been selected for heartbeating, just go to the summary tab of your cluster and click “Cluster Status”. The third tab, “Heartbeat Datastores”, will reveal it.

** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **

HA Architecture Series – Primary nodes? (2/5)

Duncan Epping · Jul 25, 2011 ·

**disclaimer: Some of the content has been taken from the vSphere 5 Clustering Technical Deepdive book**

As mentioned in an earlier post vSphere High Availability has been completely overhauled… This means some of the historical constraints have been lifted and that means you can / should / might need to change your design or implementation.

What I want to discuss today is the changes around the Primary / Secondary node concept that was part of HA prior to vSphere 5.0. This concept basically limited you in certain ways… For those new to VMware / vSphere, in the past there was a limit of 5 primary nodes. As a primary node was a requirement to restart virtual machines, you always wanted to have at least 1 primary node available. As you can imagine this added some constraints around your cluster design when it came to Blade environments or Geo-Dispersed clusters.

vSphere 5.0 has completely lifted these constraints. Do you have a Blade Environment and want to run 32 hosts in a cluster? You can right now, as the whole Primary/Secondary node concept has been deprecated. HA uses a new mechanism called the Master/Slave node concept. This concept is fairly straightforward. One of the nodes in your cluster becomes the Master and the rest become Slaves. I guess some of you will have the question “but what if this master node fails?”. Well, it is very simple: when the master node fails an election process is initiated and one of the slave nodes will be promoted to master and pick up where the master left off. On top of that, let’s take the example of a Geo-Dispersed cluster: when the cluster is split in two sites due to a link failure, each “partition” will get its own master. This allows for workloads to be restarted even in a geographically dispersed cluster when the network has failed.

What is this master responsible for? Well basically all the tasks that the primary nodes used to have like:

restarting failed virtual machines
exchanging state with vCenter
monitoring the state of slaves

As mentioned, when a master fails an election process is initiated. The HA master election takes roughly 15 seconds. The election process is simple but robust. The host that is participating in the election with the greatest number of connected datastores will be elected master. If two or more hosts have the same number of datastores connected, the one with the highest Managed Object Id will be chosen. This however is done lexically, meaning that 99 beats 100 as 9 is larger than 1. That is a huge improvement compared to what it was like in 4.1 and prior, isn’t it?
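The election rule is easy to express in a few lines of Python. This is just a sketch of the rule as described here, not the actual HA election code; the `Host` type and the “host-99” style identifiers are made up for illustration. Comparing the Managed Object Ids as strings gives the lexical ordering mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    moid: str        # Managed Object Id, compared as a string (lexically)
    datastores: int  # number of connected datastores


def elect_master(hosts):
    """Most connected datastores wins; ties go to the lexically highest MoID."""
    return max(hosts, key=lambda h: (h.datastores, h.moid))


if __name__ == "__main__":
    candidates = [Host("host-100", 4), Host("host-99", 4), Host("host-7", 3)]
    print(elect_master(candidates).moid)  # "host-99" > "host-100" as strings
```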

For those wondering which host won the election and became the master, go to the summary tab and click “Cluster Status”.

Isolated vs Partitioned

As this is a change in behavior I do want to briefly discuss the difference between an Isolation and a Partition. First of all, a host is considered to be either Isolated or Partitioned when it loses network access to a master but has not failed. To help explain the difference, the states and the associated criteria are listed below:

Isolated
- Is not receiving heartbeats from the master
- Is not receiving any election traffic
- Cannot ping the isolation address
Partitioned
- Is not receiving heartbeats from the master
- Is receiving election traffic
- (at some point a new master will be elected, at which point the state will be reported to vCenter)

In the case of an Isolation, a host is separated from the master and the virtual machines running on it might be restarted, depending on the selected isolation response and the availability of a master. It could occur that multiple hosts are fully isolated at the same time. When multiple hosts are isolated but can still communicate amongst each other over the management networks, it is called a network partition. When a network partition exists, a master election process will be issued so that a host failure or network isolation within this partition will result in appropriate action on the impacted virtual machine(s).
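The criteria for the two states can be written down as a small decision function. The Python sketch below only encodes the criteria listed in this section; the function name and the extra “connected” and “undetermined” return values are my own additions for the cases the list does not cover.

```python
def classify_host_state(receives_master_heartbeats, receives_election_traffic,
                        can_ping_isolation_address):
    """Map the criteria from this section to a host state."""
    if receives_master_heartbeats:
        return "connected"      # not part of the list above, added for completeness
    if receives_election_traffic:
        return "partitioned"    # other hosts are still reachable
    if not can_ping_isolation_address:
        return "isolated"       # nothing is reachable at all
    return "undetermined"       # combination not covered by the criteria


if __name__ == "__main__":
    print(classify_host_state(False, False, False))
    print(classify_host_state(False, True, True))
```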

Black & White version currently $ 21.56 on amazon

Duncan Epping · Jul 22, 2011 ·

Someone just notified me through a comment on this blog that Amazon currently has the book on sale for $ 21.56. This is the black & white edition of vSphere 5 Clustering Technical Deepdive and can be found here. Just wanted to let you guys know so you could benefit from the $ 8.5 discount. There are 6 reviews and ratings currently. If you feel like it and have already finished the book, help us out and post a review and rating. As a self-published book, we can use all the help we can get.