First of all, let me start by thanking everyone who attended our session at VMworld Copenhagen. The first session filled up quickly, and 5 minutes before we were supposed to start they had to close the doors as the place was packed. I can tell you that is the best compliment you can get! I know a bunch of people took pictures of the session; if you did, we would appreciate it if you could send me a copy! (Eric Sloof shot a video, thanks Eric!)
There is something that was discussed during the presentation, and actually mentioned on the very last slide, which I wanted to share with all of you: some of the HA futures. Now, I am not going to fully elaborate on these as I don’t want to get into any NDA-related issues, but I will try to add a bit more detail as soon as I have the whole video of the session. (I need to know the boundaries.)
- All New Architecture, a single lightweight HA agent process
- Eliminate concept of “Primaries”
- Storage heartbeating as backup communication channel
- Automatic resolution of network partitions
- VMs still protected during partitions, no “fighting” for VM control
- Greater scalability, extensible
- Ability to deal with any number of simultaneous host failures
- New lightweight communication model
- All state required to recover from any failure is persisted
- Improved isolation actions (VMs left running and restarted as needed via storage subsystem monitoring)
- No dependencies on DNS
All the people gathering around after the session with questions (thanks Jannie Hanekom!) …
And of course a big thanks to Eric Sloof for this picture:
daniel says
Sweet! That pretty much sums up all the issues I currently see with HA, hopefully the process of enabling HA will also fail less frequently with this new architecture. Any chance you can tell us the timeframe for a release on this? A month? A year?
Fred Peterson says
“Storage heartbeating as backup communication channel”
L.D.O. this should have been that way from day one. All they had to do was take a single cue from IBM’s HACMP.
Marek says
Duncan, you said that if you want to create a second service console on the “storage” vSwitch, you have to increase the isolation timeout value to 20000 and still set das.isolationaddress to something like the NFS/iSCSI target IP, is that correct? I added a second service console IP without the das.isolationaddress setting and it works normally, so I don’t see where the drawback is?
Duncan Epping says
If you do not specify a secondary isolation address, your storage network will need to be routable in order to be able to ping the default gateway across that interface! What you could do, of course, is specify a second isolation address, which could simply be your NFS/iSCSI target: when you ping it and it replies, your VMs are more than likely also okay!
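For reference, this is what those cluster advanced settings could look like (the option names are the vSphere 4.x HA ones mentioned in the thread; the IP addresses are made-up examples, so substitute your own gateway and storage target):

```
das.isolationaddress     = 192.168.1.1   # default gateway, pinged via the main SC
das.isolationaddress2    = 10.0.0.10     # NFS/iSCSI target, pinged via the storage SC
das.failuredetectiontime = 20000         # ms, increased as discussed above
```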
Rubens Sanches says
Eliminating the concept of “primaries” will be a great change, especially for the design of environments using blade servers. Today there is a limitation of four primary servers, and you have to build clusters spread across multiple blade chassis to avoid those four primary servers ending up in the same blade chassis. Tks
Rob says
Hi Duncan,
Can you elaborate on “Automatic Resolution of Network Partitions”. Not sure what you are getting at there?
Rubens Sanches says
Oops! Two comments above I actually meant to say FIVE primary servers, but only four servers can fail at the same time. Sorry for that mistake.
Duncan says
@Marek: There is only a drawback when your secondary Service Console can’t ping the isolation address, as BOTH paths will be used to verify network isolation. Hence the reason I would recommend using a secondary isolation address, which could for instance be your iSCSI/NFS device, to keep it simple.
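To make the “both paths” behavior concrete, here is a minimal Python sketch (my own illustration, not VMware’s actual code) of the decision being described: a host declares itself isolated only when none of its configured isolation addresses respond.

```python
def ping(address):
    """Placeholder reachability check; a real implementation would send ICMP."""
    return False

def is_isolated(isolation_addresses, ping=ping):
    """Return True only if every configured isolation address is unreachable."""
    return not any(ping(addr) for addr in isolation_addresses)

# With a default gateway plus a secondary address on the storage network,
# losing just one path does not trigger the isolation response:
addrs = ["192.168.1.1", "10.0.0.10"]  # gateway + iSCSI/NFS target (example IPs)
print(is_isolated(addrs, ping=lambda a: a == "10.0.0.10"))  # False: storage path alive
print(is_isolated(addrs, ping=lambda a: False))             # True: all paths dead
```

This also shows Marek’s drawback: if one of the addresses can never be reached from its network, that path contributes nothing to the check and you are effectively back to a single isolation address.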
Duncan says
@Rob: As soon as I have spoken with the right people I will elaborate a bit more about what is coming. I will more than likely meet up with the HA engineers in a couple of weeks.
Marek says
Duncan, thanks for your info. I will do as you suggested. Why this concerned me is that I ran a test (I don’t know if my storage network is routable or not): I assigned a second SC on the storage vSwitch without specifying an isolation address. Then, to verify, I reconfigured the main SC with a wrong gateway (thus breaking it), while constantly pinging one of the VMs running on that host. What happened is that the host became “Not Responding”, but the VM was pingable the whole time (until I fixed the SC). So I thought: why do I need to specify this isolation address? But it could also be that the storage network is reachable (how do I verify that?).
stony007_de says
Hi Duncan, thank you for that fantastic HA session in Copenhagen.
After the session, we exchanged a few words about the max. hosts in an ESX “Metro-Cluster”!
For monitoring the actual HA nodes in the cluster I can use the “ln” command in the AAM CLI. You said there is a PowerCLI script on Yellow-Bricks where you dumped this output; if possible, please post the link, because I can’t find it.
Thank you!
Duncan says
http://www.virtu-al.net/2009/10/28/powercli-listing-cluster-primary-ha-nodes/
Brandon says
You look too young to have a hairline like that. At least you do the right thing and go Vin Diesel :). I also like that microphone, it’s FBI-meets-Secret-Service cool.
In other news, these changes to HA look great to me. Primary and secondary nodes are a problem in clusters that mix blades and physical servers. It’s often a design flaw that has to be accepted by the customer: the blade chassis itself can be a single point of failure, because they are too cheap to buy it any other way. HP’s c-Class can hold 16 blades. I see it all the time.
Duncan Epping says
Vin Diesel? Don’t tell me I look like Shrek.
Brandon says
LMAO. I didn’t mean you look like Vin Diesel. You can shave your head like him without looking like him in the face. Shrek, ahaha, that’s pretty good :).