ha

HA and ESX 3.0.x log file flooding

Duncan Epping · Mar 6, 2009 ·

VMware just released a new KB article on VMware HA and the flood of log files it can generate when it tries to restart a VM.

The article contains an extensive description of the symptoms, but this is the most important one:

For any virtual machine that VMware HA is wrongly trying to start, it also generates thousands or tens of thousands of vmware-x.log files (one created every few minutes). Contents of each log file shows the virtual machine starting up then failing to start. Despite these logs, the virtual machines are actually pingable and running, and vm-support -x shows them as available running virtual machines on the host.

The resolution for the problem is the following:

Stop and start HA on the cluster. (This stops the flooding)
# find . -name '-*.log' | xargs rm
This command removes all the log files. Make sure you run it from /vmfs/volumes!
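Since the cleanup deletes files in bulk, it can help to list the matches before removing anything. Below is a minimal sketch of that idea, not the KB's own procedure. It assumes the flood files follow the vmware-x.log naming quoted above, so the pattern 'vmware-*.log' is an assumption, as are the helper names list_vm_logs and remove_vm_logs. It uses find's -exec instead of xargs so that VM folders with spaces in their names do not trip up the delete.

```shell
# Hypothetical helpers; the 'vmware-*.log' pattern is an assumption based on
# the "vmware-x.log" naming in the KB quote. The active vmware.log has no
# dash after "vmware", so it does not match and is left alone.

# Print every numbered VM log file under the given directory (dry run).
list_vm_logs() {
    find "$1" -name 'vmware-*.log' -print
}

# Delete the same set of files. -exec passes each path to rm directly,
# so directory names containing spaces are handled safely.
remove_vm_logs() {
    find "$1" -name 'vmware-*.log' -exec rm -f {} \;
}
```

A typical run would call list_vm_logs /vmfs/volumes first, check the output, and only then call remove_vm_logs /vmfs/volumes.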

Changed advanced option in HA not working?

Duncan Epping · Feb 24, 2009 ·

VMware just released this KB article, which is definitely one you should read if you use the VMware HA advanced options. In short: when you edit the HA advanced options, they do not take effect until you re-enable HA on the cluster.

Be aware of this when implementing or troubleshooting HA; it can cost you a lot of time.

Revised: Service Console Redundancy

Duncan Epping · Feb 17, 2009 ·

Several people have asked me to update my original Service Console Redundancy article. Although I am personally still of the opinion that the three options stated in the article are valid, I have rewritten them and dropped one option, as nowadays the majority of companies have a decent infrastructure with VLANs. [Read more…] about Revised: Service Console Redundancy

VMware HA or VMware SRM, what should I use?

Duncan Epping · Feb 12, 2009 ·

I was just reading up on VMTN and noticed this great topic. For some reason there are a lot of people who don’t see the difference between HA and SRM. I suggest reading the full topic, especially Jay Judkowitz’s replies and Smoggy’s reply; both are Subject Matter Experts on SRM and explained to the topic starter what the differences are and when to use each. Here’s an excerpt from the discussion which, in my opinion, captures the essence of the answer:

With SRM, you get a much more well defined failover.
- The VMs start in a specified order
- You can set some VMs to be started serially with others starting in parallel
- You can designate VMs at the recovery site to suspend to make room for recovery VMs
- You can have callout scripts and predefined breakpoints to make sure that critical non-VMware activity is done at the right time and place
- You can set the resource pool at the remote site (with the same size or different as the source resource pool) so that you get a predictable and defined QOS on CPU and memory
Once you have that well defined failover plan, you can test it and audit the results
- Testing will automatically snap the recovery LUNs so you can power on the recovery VMs without interrupting replication
- You can specify a test network at the second site that SRM will automatically put the recovery VMs on during a test so that they do not interfere with the running VMs
- You can therefore do non-disruptive DR testing any time without warning. The recovery plan executes the same as for failover, but in a “test bubble” where storage and network IO are safely segregated away from production work.
- There is a test results page for the recovery plan which lists all test runs, how long they took and how successful they were. From this page, you can drill down to each test run and see exactly what steps succeeded and failed and how long they took to run.
- With the history page, you can grade your organization over time. With the detailed reports, you can troubleshoot specific runs.

I suggest that if you’re looking into Business Continuity / Disaster Recovery and you’ve got questions on what/where/when/how with SRM, you visit the VMTN forums… these guys really know what they are talking about and can really help you understand what BC/DR is about.

Blades and HA / Cluster design

Duncan Epping · Feb 9, 2009 ·

After reading Aaron’s excellent articles (1, 2) on Scott Lowe’s blog, I remembered a discussion I had with a couple of co-workers. The discussion was about VMware HA cluster design in blade environments.

The thing that started this discussion was an HA “problem” that occurred at a customer site. This specific customer had two blade chassis to avoid a single point of failure in his virtual environment. All blade servers were joined in one big cluster to get the most out of the environment in terms of Distributed Resource Scheduling.

Unfortunately for this customer, at one point in time one of his blade chassis failed. In other words: power off on the chassis, all blades gone at the same time. The first thing that comes to mind is: HA will kick in and the VMs will be up and running in no time. [Read more…] about Blades and HA / Cluster design