Linked Out
Recently we have been fighting a nasty problem. It started around New Year's and had gradually been getting worse.
We use Mellanox 6036 switches for our cloud. They're InfiniBand switches, but we run most of the ports in Ethernet mode, since you can do that if you have the license. That part is actually pretty cool.
We started noticing short outages in our system. Just random, hard-to-find problems: a volume going into read-only mode here, a rabbit problem there. Then a bunch of nodes dropped out for a while. Unsurprisingly, they were all connected to one switch.
So to the switch logs we go! Hmm, the links on the uplinks from the affected leaf switch to the core switch had just suddenly gone down. And come back up again. And gone down. And this had been going on for a while.
Checking the other switch logs, we noticed that links were flapping a lot on other ports too. After we added some monitoring for this, we saw that it was a really widespread problem. Links were flapping left, right and center, across all switches and a lot of ports. Mostly the flaps were short, a few seconds each, but sometimes they lasted longer.
Mar 6 23:41:38 switchname portd[6403]: TID 1208134288: [portd.NOTICE]: portd_handle_trap: fd=15, type=2, data=0x10ccce38
Mar 6 23:41:38 switchname portd[6403]: TID 1208134288: [portd.NOTICE]: sx_api_host_ifc_recv got new trap ,trap id : [8] source log port : [66304] , ifindex: [67] , port is : [DOWN]
Mar 6 23:41: switchname portd[6403]: TID 1208134288: [portd.NOTICE]: portd_handle_trap: fd=15, type=2, data=0x10ccce38
Mar 6 23:41:38 switchname portd[6403]: TID 1208134288: [portd.NOTICE]: sx_api_host_ifc_recv got new trap ,trap id : [8] source log port : [66816] , ifindex: [69] , port is : [DOWN]
Mar 6 23:41:38 switchname issd[6515]: TID 1426372672: [issd.NOTICE]: NPAPI_NOTICE: notice VlanIvrUpdateVlanIfOperStatus port 45, oper 2
Mar 6 23:41:38 switchname portd[6403]: TID 1208134288: [portd.NOTICE]: portd_handle_trap: fd=15, type=2, data=0x10ccce38
Mar 6 23:41:38 switchname portd[6403]: TID 1208134288: [portd.NOTICE]: sx_api_host_ifc_recv failed with error 101 - Driver's Return Status is Non-Zero
Mar 6 23:41:38 switchname portd[6403]: Interface Eth 1/11 changed state to DOWN
And after a while...
Mar 6 23:41:45 switchname portd[6403]: Interface Eth 1/11 changed state to UP
This is what it looked like in my head.
This was really strange. We went through all the logs and debug info, and Mellanox got involved to help us sort it out. The flapping ports didn't have much in common: a bunch of different cable types, uplink ports and host ports, different NICs and everything. The temperature was fine and the firmware was new. We were simply stumped. We changed one of the cables, but that didn't help.
So, after a lot of debugging and "uhm"-ing and "ahm"-ing, we disabled the leaf-to-core uplink port that flapped the most (we had redundant uplink connections). BAM. No more flapping. All other ports immediately stopped flapping. We replaced the cable and re-enabled the port. No flaps. One single bad cable in the fabric had caused ~30-50 ports to flap at random (I assume whenever traffic had gone through the bad cable while the moon, the sun and Jupiter were aligned). This should not happen, but apparently it can.
So, a freak problem, but for anyone else out there: if you have similar issues, replace the worst cable(s) and see where that gets you.
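If you want to find out which cable is "the worst", one option is to tally the flaps per port from the switch syslogs. Here is a rough sketch in Python, not the monitoring we actually run. It assumes the logs contain "Interface ... changed state to DOWN" lines in the same shape as the excerpt above; the function names and the log file path on the command line are made up for illustration.

```python
import re
import sys
from collections import Counter

# Matches state-change lines shaped like the ones in the excerpt, e.g.
# "Mar 6 23:41:38 switchname portd[6403]: Interface Eth 1/11 changed state to DOWN"
# Assumption: other firmware versions may log this differently.
STATE_CHANGE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<switch>\S+)\s+portd\[\d+\]:\s+"
    r"Interface (?P<port>Eth \S+) changed state to (?P<state>UP|DOWN)\s*$"
)


def count_flaps(lines):
    """Count DOWN transitions per (switch, port) pair."""
    flaps = Counter()
    for line in lines:
        match = STATE_CHANGE.match(line)
        if match and match.group("state") == "DOWN":
            flaps[(match.group("switch"), match.group("port"))] += 1
    return flaps


def worst_ports(lines, top=5):
    """Return the `top` most frequently flapping ports, worst first."""
    return count_flaps(lines).most_common(top)


if __name__ == "__main__":
    # Hypothetical usage: python flaps.py /path/to/collected/switch.log
    with open(sys.argv[1], errors="replace") as log:
        for (switch, port), count in worst_ports(log):
            print(f"{switch} {port}: {count} flaps")
```

The port at the top of that list is a reasonable first candidate for disabling or recabling, provided it has a redundant path.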
Geek. Product Owner @CSCfi