We had a mysterious issue in our network that caused certain SSH sessions and HTTPS/TLS sessions to fail intermittently. Some machines were unable to communicate at all, while others could sporadically establish a connection that would then fail at inopportune times.
I performed a comprehensive analysis of our networking infrastructure and router configurations and captured PCAP files to gather enough data to root-cause the problem. The core problem was an MTU mismatch between our gigabit network and our 100-megabit VPN tunnel.
Client side packet capture
This issue took longer to troubleshoot than I would have liked due to the specialized nature of the endpoints involved. Appliances that lack a native ability to capture traffic to PCAP files make for less direct troubleshooting paths.
For reference, here is the type of traffic I was seeing on a system attempting to initiate a secure session:
- [TCP Previous segment not captured] Ignored Unknown Record
- TCP RST
The traffic above is filtered to the window that shows the error state. Earlier in the packet trace I could see that the TCP 3-way handshake succeeded (and it succeeded EVERY time a connection attempt was made). The step that failed most consistently was the certificate exchange during TLS negotiation.
View from the appliance side
Eventually I was able to get a packet capture from the specialized network appliance on the other side of the connection. Thank goodness for the SharkTap! Here's what I saw on the 'other side' which helped me crack this case:
- Destination unreachable (Fragmentation needed) [MTU of next hop: 1446]
- [TCP Dup ACK 967#1] 42484 -> 443 [ACK]
- [TCP Retransmission] 443 -> 42484 [ACK]
This traffic shows that a 1514-byte frame is not being allowed to pass through the gateway. Drilling further into the ICMP traffic shows a Type 3, Code 4 message, which indicates that the next hop has an MTU of 1446.
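The numbers in that capture line up once the headers are accounted for. Here is a rough sketch of the arithmetic, assuming an untagged 14-byte Ethernet II header and 20-byte IPv4 and TCP headers without options (the header sizes are assumptions, not values read from the capture):

```shell
#!/bin/sh
# Sketch of the size arithmetic. Header sizes are assumptions:
# untagged Ethernet II, IPv4 and TCP without options.
ETH_HEADER=14
IPV4_HEADER=20
TCP_HEADER=20

frame_len=1514      # frame length seen in the capture
next_hop_mtu=1446   # value reported in the ICMP Type 3, Code 4 message

# IP packet carried inside the Ethernet frame
ip_packet_len=$((frame_len - ETH_HEADER))
# How many bytes the packet exceeds the next hop's limit by
overshoot=$((ip_packet_len - next_hop_mtu))
# Largest TCP payload that still fits through the next hop
max_tcp_payload=$((next_hop_mtu - IPV4_HEADER - TCP_HEADER))

echo "IP packet: ${ip_packet_len} bytes"
echo "Over next-hop MTU by: ${overshoot} bytes"
echo "Largest TCP payload that fits: ${max_tcp_payload} bytes"
```

Under those assumptions a full-size packet is 54 bytes too large for the tunnel, which would explain why the small handshake packets got through while larger ones, such as the TLS certificate exchange, did not.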
One thing this traffic does not show well is that the Don't Fragment flag is set on the IP packet. Because of that flag, the router cannot fragment the packet itself; it drops the packet and sends the ICMP message back to tell the appliance to send smaller packets. For some reason the appliance does not respect this request. I have a call with the appliance maker tomorrow to let them know; it seems like their appliance should be able to respect this message!
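A host that honors the ICMP message normally records the smaller MTU for that destination. On a Linux system with iproute2 installed (an assumption; a limited appliance shell may not have it), `ip route get` prints any cached per-destination MTU. The sketch below pulls that value out of the command's output. The `extract_cached_mtu` helper and the address in the commented example are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `ip route get` output on stdin and print the
# value that follows the "mtu" keyword, if a path MTU has been cached.
extract_cached_mtu() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "mtu") { print $(i + 1); exit } }'
}

# Example (not run here; 203.0.113.10 is a placeholder address):
# ip route get 203.0.113.10 | extract_cached_mtu
```

If nothing is printed after the ICMP message has arrived, the host has not recorded a reduced path MTU for that destination, which would be consistent with the behavior seen from the appliance.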
Work-around / Proof of Concept
To verify that the MTU size was the real issue, I performed an experiment on the appliance I was troubleshooting. It had enough of a Linux shell to let me adjust the MTU on the Ethernet interface via the CLI. I changed the primary interface from its default of 1500 to 1440 using this command:
ifconfig eth0 mtu 1440 up
Once I ran that command, all of the connectivity problems I'd had with the appliance cleared up. With this knowledge in hand, I can now work with our network team to get a more lasting solution implemented in our infrastructure.
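One way to check which packet sizes a path accepts, before and after a change like this, is to send pings with the Don't Fragment flag set. The sketch below assumes the iputils `ping` found on most Linux systems, whose `-M do` option sets DF. The `ping_payload_for_mtu` and `probe_mtu` helpers and the host name are hypothetical, and the 28-byte overhead assumes an IPv4 header without options plus the 8-byte ICMP header.

```shell
#!/bin/sh
# Hypothetical helpers for probing a path with Don't Fragment set.
# Assumes Linux iputils ping; 28 = 20-byte IPv4 header + 8-byte ICMP header.

# Print the ICMP payload size that yields an IP packet of exactly $1 bytes.
ping_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send one DF-flagged ping sized to the given MTU.
# Returns success only if a reply comes back within 2 seconds.
probe_mtu() {
    host=$1
    mtu=$2
    ping -M do -c 1 -W 2 -s "$(ping_payload_for_mtu "$mtu")" "$host" >/dev/null 2>&1
}

# Example (not run here; remote.example.net is a placeholder):
# probe_mtu remote.example.net 1500 || echo "1500 does not fit"
# probe_mtu remote.example.net 1440 && echo "1440 fits"
```

A failed probe only shows that no reply arrived, so it is worth repeating at a size known to work before concluding that the MTU is the cause.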