We recently identified the cause of a problem affecting one of Papertrail’s service providers. Their hosts occasionally could not establish TCP connections with a small, seemingly random set of Internet hosts.
Troubleshooting was difficult because:
- The affected hosts changed. A given host would be able to establish a connection for a few hours (or days), then not, with no apparent pattern.
- I didn’t have direct access to the service provider’s hosts or the end users’ hosts. This made reproducing the problem in a controlled environment basically impossible. “Paste the output of..” and “Run this packet capture..” were what I had to work with.
- I couldn’t draw conclusions from the end users who reported the problem. Fewer than 5 users reported problems and they were biased towards inquisitive people (who investigated or reported the first occurrence) or those who experienced the problem multiple times. Presumably, some end users couldn’t connect once and just tried again later.
Among this tiny pool, there was one common theme: the remote hosts traversed NAT. My investigation eventually led to tcp_tw_recycle, a Linux kernel option with far more Google results than it deserves. Here’s why.
Linux TCP header verification
Modern Linux kernels verify that TCP header values meet certain requirements. These include “Protect Against Wrapped Sequence numbers” or PAWS, defined in RFC 1323, and RFC 6191 “Reducing the TIME-WAIT State Using TCP Timestamps.”
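To make the PAWS idea concrete, here is a rough sketch of a wraparound-aware timestamp comparison. It is a simplification written for illustration, not the kernel's code: the function names are hypothetical, and the real check has extra conditions (for example, allowances for long-idle connections).

```python
def ts_before(a: int, b: int) -> bool:
    """True if 32-bit timestamp a is earlier than b, allowing for wraparound."""
    diff = (a - b) & 0xFFFFFFFF
    # A set top bit means the difference is negative as a signed 32-bit value.
    return diff >= 0x80000000


def paws_reject(ts_val: int, ts_recent: int) -> bool:
    """Simplified PAWS test: reject a segment whose timestamp is older
    than the most recent one accepted (ts_recent)."""
    return ts_before(ts_val, ts_recent)
```

For example, `paws_reject(100, 200)` is true, while `paws_reject(5, 0xFFFFFFF0)` is false because 5 is treated as coming after the counter wrapped.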
With tcp_tw_recycle enabled, a connection’s TCP header timestamp value is retained in cases where it otherwise would not have been kept. From the kernel source:
[kernel source excerpt not preserved]
The problem: the TCP timestamp is only tracked on a per-remote-IP basis, yet some NAT devices don’t rewrite TCP timestamps in the translation process. As a result, the Internet-facing IP of a NAT device may transmit valid packets with unrelated timestamps.
The problem we saw manifests when more than one remote host (for example, two employees on an office network) tries to connect to this Linux host within a few minutes of one another. The first connection will succeed, but the second connection attempt (from the same public NAT IP) will fail. The kernel compares the second host's timestamp against the one it remembered for that IP and, when it appears older, considers it invalid.
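The failure mode can be reproduced with a toy model. The sketch below is not how the kernel stores this state; the class, method, and timestamp values are made up for illustration, and it ignores timestamp wraparound and the expiry of remembered values. It only shows why remembering one timestamp per remote IP breaks when two hosts share a NAT address.

```python
class PerIpTimestampCache:
    """Toy model: remembers the last TCP timestamp seen per remote IP only."""

    def __init__(self):
        self.last_seen = {}  # remote IP -> last accepted timestamp

    def accept_syn(self, remote_ip: str, ts_val: int) -> bool:
        last = self.last_seen.get(remote_ip)
        if last is not None and ts_val < last:
            return False  # looks like an old duplicate, so it is dropped
        self.last_seen[remote_ip] = ts_val
        return True


cache = PerIpTimestampCache()
nat_ip = "203.0.113.7"  # documentation address standing in for the office NAT

# Two hosts behind the same NAT, each with its own unrelated timestamp clock.
host_a_ok = cache.accept_syn(nat_ip, ts_val=900_000)
host_b_ok = cache.accept_syn(nat_ip, ts_val=150_000)
print(host_a_ok, host_b_ok)  # True False
```

Host B did nothing wrong. Its clock simply started from a different point than host A's, and the server cannot tell the two apart by IP alone.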
In another function, a comment hints at the difference between tracking timestamps on a per-host basis and doing so on a per-port-pair basis. From tcp_ipv4.c:
[kernel source comment not preserved]
Note: I haven’t read anywhere near all of Linux’s TCP header source. If you find an error in this post, let me know.
Although that’s the problem, the root cause is poor documentation. The two places that a systems administrator is most likely to consult are the kernel IP sysctl docs, which suggest consulting “technical experts”:
[sysctl documentation excerpt not preserved]
.. and the tcp.7 man page, which says:
[man page excerpt not preserved]
Neither of these explains what the option changes or how it interacts with NAT. I’ve submitted a patch for the man page. The changed copy warns of the possible impact and says where to learn more:
Enable fast recycling of TIME_WAIT sockets. Enabling this option is not recommended for devices communicating with the general Internet or using NAT (Network Address Translation). Since some NAT gateways pass through IP timestamp values, one IP can appear to have non-increasing timestamps. See RFC 1323 (PAWS), RFC 6191.
There’s also a 2013 BCP (“Network Address Translation Behavioral Requirements Updates”) which informs future NAT implementors of this consideration.
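If you want to check whether a host has the option turned on, something like the following should work. This is a minimal sketch: the helper name is my own, and it assumes the usual /proc/sys layout for net.ipv4 sysctls. It returns None on kernels that don't expose the option at all (it was removed in later kernel versions).

```python
from pathlib import Path
from typing import Optional

SYSCTL_PATH = Path("/proc/sys/net/ipv4/tcp_tw_recycle")


def tw_recycle_enabled(path: Path = SYSCTL_PATH) -> Optional[bool]:
    """Return True/False for the setting, or None if the kernel doesn't expose it."""
    try:
        return path.read_text().strip() != "0"
    except FileNotFoundError:
        return None


print(tw_recycle_enabled())
```

A result of True on an Internet-facing host is worth a second look, for the reasons above.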