A customer reported that one of the adapters in one of their UCS B440 M2 blade servers was flapping.
They have four UCS 5108 chassis populated with many B200 and B440 servers. There was no service disruption because each of their B440s has two mezzanine adapters.
The VMs on the ESXi host installed on the blade did not suffer any impact. The vNICs created on the second CNA card took over the traffic whenever the issue occurred. The vNIC in Fabric A did not move to Fabric B because the vNICs in the Service Profile associated with this blade server were not configured with “Hardware Failover”, as recommended by VMware.
Error messages in UCS Manager:
One of the CNA cards in one of their B440 M2 blades logged several errors. The messages under “Admin – Faults” in UCSM looked like these:
- “Adapter host interface 3/3/1/1 link state; down”
- “Virtual interface 1794 link state is down”
- “ether VIF 1794 on server 3/3 of switch B down, reason: Error disabled”
Number of occurrences: 8
Similar messages appeared for 3/3/1/2, 3/3/1/3 and 3/3/1/4, which means that all virtual interfaces related to the first mezzanine adapter of this blade were experiencing the same issue: they all seemed to go down at the same time.
The number of occurrences suggested that the adapter had been flapping more than we had noticed.
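As a rough illustration of how fault lines like these point to a single adapter, here is a minimal Python sketch that groups the host-interface faults by adapter. It assumes the interface ID reads as chassis/blade/adapter/host-interface, which matches the description above but is not confirmed against Cisco's documentation. The function name `group_by_adapter` is hypothetical, and the sample lines for 3/3/1/2 to 3/3/1/4 are reconstructed from the statement that similar messages appeared for them.

```python
import re
from collections import defaultdict

# Assumption: the ID in the fault text is chassis/blade/adapter/host-interface.
HOST_IF = re.compile(r"Adapter host interface (\d+)/(\d+)/(\d+)/(\d+) link state")


def group_by_adapter(fault_lines):
    """Map (chassis, blade, adapter) to the sorted host interfaces reported in faults."""
    grouped = defaultdict(set)
    for line in fault_lines:
        match = HOST_IF.search(line)
        if match:
            chassis, blade, adapter, host_if = map(int, match.groups())
            grouped[(chassis, blade, adapter)].add(host_if)
    return {key: sorted(interfaces) for key, interfaces in grouped.items()}


# Sample input: the first line is quoted from the faults above; the next three
# are reconstructed on the assumption that they used the same wording.
faults = [
    "Adapter host interface 3/3/1/1 link state; down",
    "Adapter host interface 3/3/1/2 link state; down",
    "Adapter host interface 3/3/1/3 link state; down",
    "Adapter host interface 3/3/1/4 link state; down",
    "Virtual interface 1794 link state is down",  # ignored: no interface path
]

for (chassis, blade, adapter), interfaces in group_by_adapter(faults).items():
    print(f"chassis {chassis}, blade {blade}, adapter {adapter}: "
          f"host interfaces with faults {interfaces}")
```

If every affected host interface falls under the same (chassis, blade, adapter) key, as it does here, the adapter itself becomes the first suspect rather than any individual vNIC.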
A week before they reported this to us, they had seen the same issue, but it recovered by itself after 15 minutes. They had also noticed this behavior twice in the last month.
However, it had not always been like that. On three occasions the adapter did not recover automatically and its link state remained down. They had to re-acknowledge the hardware (the adapter itself) for UCSM and vCenter to detect that the adapters were up and working correctly. No changes had been made in the last month.
The issue seemed to be only with Ethernet connectivity, not with FC (Fibre Channel). There were no faults related to the SAN.
Diagnostics from Cisco TAC
We opened a ticket with Cisco TAC and provided the “ucsm” and “chassis” show techs.
They went through the logs and found multiple link flaps on the first adapter but none on the second one, which looked clean.
They also saw no link flaps on the IOM (a FEX 2208), which suggested that the fault was in the first adapter of this B440 blade and that the mezzanine card needed to be replaced.
We created an RMA for the mezzanine adapter (a VIC 1240 card).
We also had Cisco send a field engineer to the customer’s remote data center to replace the card within a scheduled maintenance window.
The problem was solved. No more adapter flapping since then.