
Yellow Bricks

by Duncan Epping



Fling: ESX System Analyzer

Duncan Epping · Nov 30, 2011 ·

When I joined Tech Marketing in February of this year, my first task literally was the ESX System Analyzer. I was part of the team that developed the specs and tested the app, but the main driving force behind the tool was my colleague Kyle Gleed (@VMwareESXi).

The tool / fling was designed specifically to help people migrate from ESX to ESXi and to smooth the transition, especially in those environments where the Service Console has been customized over the years. If you haven’t migrated yet and want to make the jump to a lean and mean hypervisor, I suggest taking a look at this fling and analyzing your environment to help with planning the transition!

Source: VMware Labs

The ESX System Analyzer is a tool designed to help administrators plan a migration from ESX to ESXi. It analyzes the ESX hosts in your environment and, for each host, collects information on factors that pertain to the migration process:

  • Hardware compatibility with ESXi
  • VMs registered on the ESX host, as well as VMs located on the host’s local disk
  • Modifications to the Service Console
    • RPMs which have been added or removed
    • Files which have been added
    • Users and cronjobs which have been added

This tool also provides summary information for the whole existing environment:

  • Version of VMware Tools and Virtual Hardware for all VMs
  • Version of Filesystem for all datastores

By having this information, administrators can determine what tasks need to be done prior to the migration. Examples include:

  • Relocate VMs from local datastores to shared datastores
  • Make note of what agent software has been added to the host and obtain the equivalent agentless version
  • Replace cronjobs with equivalent remote scripts written with PowerCLI or vCLI
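
Purely as an illustration, and not part of the fling itself, here is a rough pyVmomi sketch of how you could collect some of the same summary data yourself: the VMware Tools and virtual hardware version per VM, plus a flag for VMs that sit on a non-shared (most likely local) datastore. The hostname and credentials are placeholders, and exact property names can differ between vSphere versions.

```python
# Rough sketch (not the fling): list per-VM Tools / virtual hardware versions
# and flag VMs that only live on non-shared (likely local) datastores.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only, skips certificate checks
si = SmartConnect(host="vcenter.example.local", user="administrator",
                  pwd="password", sslContext=ctx)  # placeholders
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        hw = vm.config.version if vm.config else "unknown"      # e.g. "vmx-07"
        tools = vm.guest.toolsVersion if vm.guest else "unknown"
        local_only = vm.datastore and all(
            not ds.summary.multipleHostAccess for ds in vm.datastore)
        note = "  <- not on a shared datastore" if local_only else ""
        print(f"{vm.name}: hardware={hw} tools={tools}{note}")
finally:
    Disconnect(si)
```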

Disk.SchedNumReqOutstanding the story

Duncan Epping · Jun 23, 2011 ·

There has been a lot of discussion in the past around Disk.SchedNumReqOutstanding, what the value should be, and how it relates to the queue depth. Jason Boche wrote a whole article about when Disk.SchedNumReqOutstanding (DSNRO) is used and when it is not, and I guess I would explain it as follows:

When two or more virtual machines are issuing I/Os to the same datastore Disk.SchedNumReqOutstanding will limit the amount of I/Os that will be issued to the LUN.

So what does that mean? It took me a while before I fully got it, so let's try to explain it with an example. This is basically how the VMware I/O scheduler (Start-Time Fair Queuing, aka SFQ) works.

You have set your queue depth for your HBA to 64 and a virtual machine is issuing I/Os to a datastore. As it is just a single VM, up to 64 I/Os will end up in the device driver immediately. In most environments, however, LUNs are shared by many virtual machines, and in most cases these virtual machines should be treated equally. When two or more virtual machines issue I/O to the same datastore, DSNRO kicks in. However, it will only throttle the queue depth when the VMkernel has detected that the threshold of a certain counter has been reached. The name of this counter is Disk.SchedQControlVMSwitches, and by default it is set to 6, meaning that the VMkernel needs to have detected 6 VM switches while handling I/O before it throttles the queue down to the value of Disk.SchedNumReqOutstanding, which defaults to 32. (A VM switch means that the selected I/O does not come from the same VM as the previous I/O.)

The reason the throttling happens is that the VMkernel cannot control the order of the I/Os that have already been issued to the driver. Just imagine you have VM A issuing a lot of I/Os and another, VM B, issuing just a few. VM A would end up using most of the full queue depth all the time. Every time VM B issues an I/O, it will be picked up quickly by the VMkernel scheduler (which is a different topic) and sent to the driver as soon as another I/O completes there, but it will still end up behind the 64 I/Os already in the driver, which adds significantly to its I/O latency. By limiting the number of outstanding requests, we allow the VMkernel to schedule VM B's I/O sooner into the I/O stream from VM A, and thus we reduce the latency penalty for VM B.
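
To make that bookkeeping a bit more concrete, here is a tiny Python sketch of the trigger as described above. This is my own simplification, not VMkernel code: the switch counter only advances when consecutive I/Os come from different VMs, and once it reaches the Disk.SchedQControlVMSwitches threshold the issue limit drops from the HBA queue depth to DSNRO.

```python
# Toy model of the DSNRO trigger (not VMkernel code); values are the defaults
# mentioned in the post.
HBA_QUEUE_DEPTH = 64
DSNRO = 32                   # Disk.SchedNumReqOutstanding
VM_SWITCH_THRESHOLD = 6      # Disk.SchedQControlVMSwitches

def issue_limit(io_stream):
    """Replay a stream of VM ids and return the resulting issue limit."""
    switches, last_vm = 0, None
    for vm in io_stream:
        if last_vm is not None and vm != last_vm:
            switches += 1    # this I/O comes from a different VM than the previous one
        last_vm = vm
        if switches >= VM_SWITCH_THRESHOLD:
            return DSNRO     # fairness kicks in, throttle to DSNRO
    return HBA_QUEUE_DEPTH   # effectively a single VM, keep the full queue depth

print(issue_limit(["A"] * 100))      # 64 -> one VM only, no throttling
print(issue_limit(["A", "B"] * 10))  # 32 -> interleaved I/O, DSNRO applies
```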

Now that brings us to the second part of all the statements out there: should we really set Disk.SchedNumReqOutstanding to the same value as your queue depth? Well, if you want your I/Os processed as quickly as possible without any fairness, you probably should. But if you have mixed workloads on a single datastore and wouldn't want virtual machines to incur excessive latency just because a single virtual machine issues a lot of I/Os, you probably shouldn't.

Is that it? No, not really; there are several questions that remain unanswered.

  • What about sequential I/O in the case of Disk.SchedNumReqOutstanding?
  • How does the VMkernel know when to stop using Disk.SchedNumReqOutstanding?

Let's tackle the sequential I/O question first. By default the VMkernel will allow up to 8 sequential commands (controlled by Disk.SchedQuantum) from a VM to be issued in a row, even when it would normally seem more fair to take an I/O from another VM. This is done so as not to destroy the sequentiality of VM workloads, because I/Os to sectors near the previous I/O are handled an order of magnitude faster (10x is not unusual when excluding cache effects, or when caches are small compared to the disk size) than I/Os to sectors far away. But what is considered sequential? If the next I/O is less than 2000 sectors away from the current one, it is considered sequential (controlled by Disk.SectorMaxDiff).

Now, if for whatever reason one of the VMs becomes idle, you would more than likely prefer your active VM to be able to use the full queue depth again. That is what Disk.SchedQControlSeqReqs is for. By default Disk.SchedQControlSeqReqs is set to 128, meaning that when a VM has been able to issue 128 commands without any switches, Disk.SchedQControlVMSwitches is reset to 0 and the active VM can use the full queue depth of 64 again. With our example above in mind, the idea is that if VM B issues I/Os only very rarely (less than 1 for every 128 from the other VM), we still let VM B pay the high latency penalty, because presumably it is not disk bound anyway.
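
Again purely as an illustration of the defaults mentioned above (my own toy model, not how the VMkernel is actually implemented), the three knobs boil down to checks like these:

```python
# Toy illustration of the defaults discussed above (not VMkernel code).
SECTOR_MAX_DIFF = 2000    # Disk.SectorMaxDiff
SCHED_QUANTUM = 8         # Disk.SchedQuantum
SEQ_REQS_RESET = 128      # Disk.SchedQControlSeqReqs

def is_sequential(prev_sector, next_sector):
    """An I/O within 2000 sectors of the previous one counts as sequential."""
    return abs(next_sector - prev_sector) < SECTOR_MAX_DIFF

def may_extend_streak(sequential_in_a_row):
    """Up to 8 sequential commands from one VM may be issued back to back."""
    return sequential_in_a_row < SCHED_QUANTUM

def resets_switch_counter(commands_without_switch):
    """128 commands without a VM switch reset Disk.SchedQControlVMSwitches to 0,
    handing the active VM the full queue depth again."""
    return commands_without_switch >= SEQ_REQS_RESET

print(is_sequential(10_000, 10_500))    # True  -> treated as sequential
print(may_extend_streak(8))             # False -> time to pick another VM's I/O
print(resets_switch_counter(128))       # True  -> back to the full queue depth
```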

To conclude, now that the penny has finally dropped for me on Disk.SchedNumReqOutstanding, I strongly feel that these advanced settings should not be changed unless specifically requested by VMware GSS. Changing these values can impact fairness within your environment and could lead to unexpected behavior from a performance perspective.

I would like to thank Thor for all the help he provided.

das.failuredetectiontime and its relationship with the isolation response

Duncan Epping · May 27, 2011 ·

I was asked this question twice in the last 3 weeks, coincidentally, and I figured it couldn't hurt to explain it here as well. The question on the VMTN community was as follows:

on 13 sec: a host which hears from none of the partners will ping the isolation address
on 14 sec: if no reply from isolation address it will trigger the isolation response
on 15 sec: the host will be declared dead from the remaining hosts, this will be confirmed by pinging the missing host
on 16 sec: restarts of the VMs will begin

My first question is: Do all these timings come from the das.failuredetectiontime? That is, if das.failuredetectiontime is set to e.g. 30000 (30 sec) then on the 28th second a potential isolated host will try to ping the isolation address and do the Isolation Response action at 29 second?

Or are the Isolation Response timings hardcoded, so that it always happens at 13 sec?

My second question, if the answer is Yes to the above: why is the recommendation to increase das.failuredetectiontime to 20000 when having multiple Isolation Response addresses? If the above is correct, then this would make the potentially isolated host test its isolation addresses at the 18th second, and the restart of the VMs would begin at 21 seconds, but what would be the gain from this really?

To which my answer was very short fortunately:

Yes, the relationship between these timings is das.failuredetectiontime.

Increasing das.failuredetectiontime is usually recommended when an additional das.isolationaddress is specified. The reason for this is that the “ping” and the “result of the ping” take time; by adding 5 seconds to the failure detection time, you allow this test to complete correctly, after which the isolation response can be triggered.

After having a discussion on VMTN about this, giving it some thought, and bouncing my thoughts off the engineers, I came to the conclusion that the recommendation to increase das.failuredetectiontime by 5 seconds when multiple isolation addresses are specified is incorrect. The sequence is always as follows, regardless of the value of das.failuredetectiontime:

  • The ping will always occur at “das.failuredetectiontime -2”
  • The isolation response is always triggered at “das.failuredetectiontime -1”
  • The fail-over is always initiated at “das.failuredetectiontime +1”
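
As a quick sanity check of those offsets, here is the arithmetic for the default das.failuredetectiontime of 15000 (15 seconds) and for the 20000 value from the question above; just a sketch of the math, nothing more.

```python
# Timeline offsets relative to das.failuredetectiontime (value is in milliseconds).
def ha_timeline(das_failuredetectiontime_ms=15000):
    t = das_failuredetectiontime_ms // 1000   # whole seconds
    return {
        "ping isolation address": t - 2,
        "trigger isolation response": t - 1,
        "initiate fail-over": t + 1,
    }

print(ha_timeline())       # default: ping at 13s, response at 14s, fail-over at 16s
print(ha_timeline(20000))  # 20000:   ping at 18s, response at 19s, fail-over at 21s
```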

The timeline in this article explains the process well.

Now, this recommendation to increase das.failuredetectiontime was probably made at a time when many customers were experiencing network issues. Increasing the time decreases the chances of running into an issue where VMs are powered down due to a network outage. Sorry about all the confusion and unclear recommendations.

What if you were to design your own server…

Duncan Epping · May 18, 2011 ·

Lately I have been thinking about the future of servers and, more specifically, server design. Servers are more and more heading towards these massive beasts with all sorts of options that many might not need but end up paying for, as they are already bolted on. On the other hand you have these massive blade chassis that allow for 10 or 14 blades, or whatever your vendor decides is a nice form factor. While thinking about that, I wondered why we have 1U and 2U servers stuffed with options and the possibility to add disks, when all we actually want, in many cases, is to run ESXi as a hypervisor. Even if we want to have local disks, do we really need a 2U server?

After doing some research on the internet, I bumped into something which I thought was a cool concept. Although it isn't what I envisioned, it is close enough to share with you. I haven't seen these types of servers used for virtualization so far, and I wonder why that is. There are multiple vendors with offerings like these, but I wanted to point out the following two as they offer more than others in my opinion and are VMware certified. These servers are traditionally used in HPC (high-performance computing) environments, but if you look at what they offer, they could be suitable for virtualization as well. They are very dense but don't bring along the requirement to buy a full chassis if you just need 3 or 4 servers. Of course you cannot directly compare them to blade servers and chassis, but think about the possibilities for a second; I will expand on that in a moment.

Now in this case, the Super Micro 2U Twin2 has 4 nodes. Each node has a set of 6 SAS drives at its disposal and can hold up to 192GB of RAM. On top of that, it can hold 2 Intel Nehalem/Westmere CPUs and has 20Gbps InfiniBand on board. This by itself is a very cool concept, but what if we were to simplify it? These servers typically have:

  • Expansion slots
  • SATA/SAS controllers
    • Disks
    • CD/DVD
  • Multiple 1GbE links
  • IPMI LAN port

But do we really need all of that? Wouldn't a fully stripped-down server make more sense for a virtualized environment? Do we really need a SATA/SAS controller? Do we need a CD/DVD drive? Do we need multiple 1GbE links plus 20Gbps InfiniBand and, on top of that, an IPMI LAN port? What if someone came out with a server that wasn't geared towards HPC but towards virtualization? Yes, we have seen many vendors take their traditional servers and position them as Virtualization Ready, but are they? So what would I like to see?

Well, for starters I kinda like the form factor above, but I would like to see one without those disks. In most environments there will be shared storage available, so there is no need for local disks. It would be nice if they had an on-board dual SD slot, allowing ESXi to be installed locally. So what if someone could crank out a configuration like this (maybe someone already did; if so, let me know):

  • 2U “Chassis”
  • Max 4 nodes
  • Each node supporting max 2 sockets
  • Each node supporting 192GB (probably overkill)
  • Single 10GbE CNA
  • Single IPMI LAN port
  • SAS/SATA controllers

But what if we went even crazier than that, kinda like what Dell developed with their C5125 microservers: what if you could host 12 server nodes in 3U? Would that be something you would be interested in? Yes, you might be limited to a single processor, but without the requirement for a disk, and with let's say 96GB of memory max, it should be possible. Yes, I understand there would be implications to a design like that, but that is not the point right now.

I don't design hardware or servers, but it seems to me that many options have been explored for all kinds of workloads, yet we haven't reached the full potential for virtualization. Out in the field we see many people creating home labs with barebone casings, and we see people running very stripped-down configurations, but when you walk into a random datacenter you see DL380s, Dell R710s, etc., fully stocked with all the bells and whistles while half of these features are not used. Wouldn't dense, purpose-built virtualization servers be nice? SeaMicro created a nice solution with 512 servers in 10U, but unfortunately the CPUs are not powerful enough for our purpose. Still, I feel there are opportunities out there to really innovate, to lower the cost, lower the chances of failure, and ease management and maintenance!

Which server vendor out there is going to take the next step?

Mythbusters: ESX/ESXi caching I/O?

Duncan Epping · Apr 7, 2011 ·

We had a discussion internally about ESX/ESXi caching I/Os. In particular, this discussion was around the caching of writes, as a customer was concerned about the consistency of their data. I fully understand the concern, and I know that in the past some vendors did write caching; VMware, however, does not do this, for obvious reasons. Although performance is important, it is worthless when your data is corrupt or inconsistent. Of course I looked around for data to back this claim up and bust this myth once and for all. I found a KB article that acknowledges this, and a quote from one of our VMFS engineers.

Source: Satyam Vaghani (VMware Engineering)
ESX(i) does not cache guest OS writes. This gives a VM the same crash consistency as a physical machine: i.e. a write that was issued by the guest OS and acknowledged as successful by the hypervisor is guaranteed to be on disk at the time of acknowledgement. In other words, there is no write cache on ESX to talk about, and so disabling it is moot. So that’s one thing out of our way.

Source – Knowledge Base
VMware ESX acknowledges a write or read to a guest operating system only after that write or read is acknowledged by the hardware controller to ESX. Applications running inside virtual machines on ESX are afforded the same crash consistency guarantees as applications running on physical machines or physical disk controllers.
