We had no issues until they deployed a VM (virtual machine) on one of the new VMware ESXi hosts, which was connected to the new infrastructure through a pair of QLogic CNA cards.
This VM had erratic connectivity to a Windows server running IBM’s TSM (Tivoli Storage Manager) located in the same VLAN. The TSM server was connected to Cat4506-01 via NIC teaming, aggregating four network adapters.
The ESXi host running the new VM was connected to Fex103 (hanging off N5K-1) via its CNA-1, and directly to N5K-2 via its CNA-2. Both links were aggregated: vPC from the Nexus 5000s’ point of view, and plain IP-hash load balancing from the ESXi perspective.
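The host-facing side of that setup would look roughly like this. This is a sketch, not our actual configs: interface, port-channel, and vPC numbers are hypothetical, and `mode on` (static aggregation) is assumed because the vSwitch IP-hash policy does not speak LACP:

```
! N5K-1 -- CNA-1 arrives through Fex103 (hypothetical port)
interface Ethernet103/1/1
  switchport mode trunk
  channel-group 10 mode on

! N5K-2 -- CNA-2 connects directly (hypothetical port)
interface Ethernet1/5
  switchport mode trunk
  channel-group 10 mode on

! On both N5Ks: tie the port-channel into the vPC domain
interface port-channel10
  switchport mode trunk
  vpc 10
```

On the ESXi side, the matching piece is the vSwitch teaming policy set to “Route based on IP hash”, with both vmnics active.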
The Windows TSM server was able to talk to everything else in that same VLAN. However, the VM seemed to be able to talk to everything EXCEPT that TSM server, and only some of the time (or so we were told).
We did some troubleshooting ourselves and found that the VM only failed to communicate with the TSM server when the packets flowed through CNA-2. The test was simple:
- When only CNA-1 was UP, communication always worked.
- When only CNA-2 was UP, communication never worked.
- When both CNAs were up, it worked for a bit, but then pings failed from the VM to the TSM server.
We then decided to open a ticket with Cisco TAC: as usual, the issue was urgent, since it was holding back the deployment of more VMs, and we also expected TAC to give us quick, precise insight into our problem.
The first thing they told us was that we should connect both CNAs of the ESXi host either to the N5Ks or to the N2Ks, but not a mix of both. We knew that the vPC-supported topologies don’t allow connecting the two NICs of a server to the same N2K, or to the same N5K, if you are doing aggregation, but we were not aware that the way we had connected the ESXi servers was unsupported. There is actually no white paper or document publicly provided by Cisco that tells you not to connect them this way: NIC 1 to a Nexus 5000 and NIC 2 to a Nexus 2000.
We then asked whether such a configuration was even supported. They avoided giving a direct answer and kept repeating that it was not Cisco best practice and that we should consider modifying the topology.
Even though Cisco TAC never said it was a design requirement for vPC to work in our scenario, they suggested connecting the two Catalyst 4506s in vPC mode to the Nexus 5000 switches. This was unfortunately not an option, because the 4506s had no 10 Gb ports left and buying another 10 Gb module was far too expensive. We needed another solution.
TAC then suggested moving only this VLAN’s L3 gateway and STP root to the vPC domain. We considered that option and even prepared the configuration changes, but we decided to go over the configs again and test a little more: changing the logical design of our customer’s network was never the goal of this project, and it was not something to take lightly.
I asked them if there wasn’t any other way we could proceed. Doing what TAC requested meant sending someone to a remote data center to freeze for hours while we ran tests and captured traffic with God knows how many SPAN sessions. They said there was no other way…
It turned out that, given the STP bridge priorities for this VLAN, the link between N5K-2 and Cat4506-2 was in STP blocking state… This explained why traffic ALWAYS failed when the ESXi host chose the CNA connected to N5K-2.
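Two quick NX-OS show commands are enough to spot this kind of problem from the comfort of your desk (the VLAN number here is hypothetical; on older NX-OS releases the second command is spelled `show mac-address-table`):

```
! Which ports are forwarding/blocking for the VLAN?
N5K-2# show spanning-tree vlan 100

! Where is the TSM server's MAC learned, and is any entry static?
N5K-2# show mac address-table vlan 100
```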
We cleared the static MAC entry on the N5Ks and the problem was solved. No need to waste time, money, and resources sending someone to collect endless sniffer traces.
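The fix itself was a one-liner. Removing a static MAC entry on NX-OS looks roughly like this; the MAC address, VLAN, and interface are hypothetical, not the actual values from this network:

```
N5K-1(config)# no mac address-table static 0050.56ab.cdef vlan 100 interface port-channel10
```

Once the static entry is gone, the switch falls back to normal dynamic MAC learning for that address.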
- Always triple-check ALL your configurations before concluding you are facing a weird issue. Most of the time you will be facing a simple, yet sometimes hard-to-find, human error.
- Always have someone else, including Cisco TAC, double-check them as well.
- This topology works, even if it’s not within Cisco’s best practices.