Last week I was helping someone on the VMTN community forums who was hitting what appeared to be strange HA behavior. After some standard questions this person told me that all VMs were powered down after a network outage. Sound like a familiar problem? Yes, I can hear most of you thinking: isolation response set to “power off” and no proper network redundancy?
Well, yes and no. They did indeed have the isolation response configured to “power off” all VMs when a host is isolated. They did, however, have proper network redundancy, so how on earth did this happen? With 2 physical NICs and 2 physical switches, and only 1 of them impacted, this should not have happened, right?!?
Wrong! In this case the failover from a vmkernel perspective worked fine. The first “path” went down, so the second was used for the management vmkernel. All VMs were up and running at this point, and they remained running until… the network connection was restored and traffic failed back to the original vmnic. In other words, the MAC address that had shown up on port 1 popped up on port 2 and then went back to port 1 again. The switch was not impressed, went through the spanning tree process, and instantly blocked traffic as a result. When traffic is blocked, bad things can happen, especially when you configure HA to “power off” VMs. Basically, what caused this issue was the fact that the switch ports were not set to the recommended “portfast” mode, more details here.
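For those wondering what that looks like on the physical switch side, here is a rough sketch, assuming Cisco IOS access switches and ESXi uplinks configured as trunk ports; the interface name and VLAN list are made up for the example, so verify the exact syntax against your own switch platform and the KB article.

    ! Hypothetical example: host-facing trunk port for an ESXi uplink (vmnic)
    interface GigabitEthernet0/10
     description ESXi host - vmnic0
     switchport mode trunk
     switchport trunk allowed vlan 10,20
     ! Treat the port as an edge port so it goes straight to forwarding
     ! instead of walking through the listening/learning states on link-up
     spanning-tree portfast trunk

At least in theory, with the port treated as an edge port, a link flap or failback on the host-facing port should not leave it blocked for 30 to 50 seconds while spanning tree works through its states, which is exactly the window in which HA can declare a host isolated.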
I knew instantly that this was the cause of the problem, not because I know a lot about HA but because I had seen it many times in the past while testing environments I had configured and designed. Not just testing after implementing a new infrastructure, but also testing after making changes to an infrastructure or introducing a new version or feature. I guess this kind of comes back to the “disaster” scenario as well: test it if you want to know whether it works as expected. Just a simple example: I want to introduce QoS for my vMotion network and make changes to my physical network. Now what? How do I test these changes? How many times do I run through my test scenarios? What kind of “problems” do I introduce during my tests?
So I guess by now some might wonder why on earth I brought this up… well, the problem above could have been prevented by simply testing the infrastructure when it was implemented and after changes were introduced, and maybe even on a regular basis. If HA and networking had been tested properly, those VMs would not have been powered off…
Jason Boche says
Indeed, this should be one of the many tests performed as part of an operational readiness checklist before a cluster is put into production.
Remy Zandwijk says
Duncan, the “details here” link leads to a page which does not talk about “port fast”, but about “TPS” and “Nehalem” instead.
Duncan Epping says
should be fixed
Steven White says
Which switch are you referring to when you say “The switch was not impressed”? If the 2 pNICs are connected to 2 pSwitches, neither of those two access switches would have any knowledge of MACs seen by the other switch; each would only know that a MAC flapped between two ports in its own MAC address table, which is not disallowed unless port security is used.
The port security feature can prevent MAC flapping, but it would be unlikely to be set up in such a way that it prevents only the failback and not the first failover.
The VMware KB article seems to confuse the portfast concept of bypassing the port state transition order with the concept of Topology Change Notifications (TCNs). Cisco describes TCNs in older STP here: http://www.cisco.com/en/US/tech/tk389/tk621/technologies_tech_note09186a0080094797.shtml And even the VMware KB links to Cisco’s coverage of TCNs in newer RSTP here: http://www.cisco.com/en/US/tech/tk389/tk621/technologies_white_paper09186a0080094cfa.shtml
The KB article’s statement that forwarding tables are dumped due to an STP event is incorrect as a description of how Cisco deals with these situations (I’m assuming the use of Cisco switches due to common industry use, and the link from the KB). Moreover, were RSTP in use on these switches, Cisco states that edge ports transition directly to forwarding (no 50-second countdown timer on the server’s pNIC port) and that TCNs are not sent upstream as a result of transitions on edge ports.
I think it is more likely that an STP event of some other type was taking place. Perhaps when the 1st switch was brought back online, they were not careful with the root bridge priority settings within their STP trees, and the reconvergence process temporarily blocked access to the 2nd switch altogether. If the customer’s root bridge is a 4000 or 6000 series Cisco, the first TCN article describes a possible investigation path to determine the actual cause of the STP event.