TCP Cheatsheet
TCP
- TCP/IP proposed by Vint Cerf and Bob Kahn in 1974
- IPv4 specifications: IP RFC 791, TCP RFC 793
- TCP is optimised for accurate rather than timely delivery
- TCP uses a three-way handshake
- Sequence numbers are picked randomly for security reasons
- SYN packet: client picks random sequence number x, sends SYN packet which may include TCP flags and options
- SYN ACK: server increments x by 1, picks random sequence number y, appends flags and options
- ACK: client increments y by 1
- Data packet sent after the ACK
- Handshake imposes 1 RTT delay and makes TCP connection establishment expensive
- TCP Fast Open (TFO) is available in Linux 3.7+
- TFO includes a data payload with the SYN
- TFO requires cryptographic cookie and works for repeat connections only
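The sequence and acknowledgment numbers exchanged during the handshake can be traced with a small sketch. The function name `three_way_handshake` and the tuple layout are illustrative only, and a plain random draw stands in for whatever initial sequence number scheme a real stack uses.

```python
import random

SEQ_SPACE = 2**32  # sequence numbers are 32-bit values and wrap around


def three_way_handshake(rng=random):
    """Return (flags, seq, ack) tuples for a SYN, SYN-ACK, ACK exchange."""
    x = rng.randrange(SEQ_SPACE)  # client's initial sequence number
    y = rng.randrange(SEQ_SPACE)  # server's initial sequence number
    syn = ("SYN", x, None)  # the first SYN carries no acknowledgment
    syn_ack = ("SYN-ACK", y, (x + 1) % SEQ_SPACE)  # server acknowledges x + 1
    ack = ("ACK", (x + 1) % SEQ_SPACE, (y + 1) % SEQ_SPACE)  # client acknowledges y + 1
    return [syn, syn_ack, ack]


if __name__ == "__main__":
    for flags, seq, ack in three_way_handshake():
        print(f"{flags:8} seq={seq} ack={ack}")
```

Without TFO, application data can only follow the third segment, which is where the 1 RTT setup cost comes from.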
- TCP achieves reliability through retransmission
- TCP retransmission works by the sender detecting segments that have been lost in transmission, typically identified through timeouts or receiving duplicate ACKs, and then resending those segments
- John Nagle documented congestion collapse in 1984
- Congestion collapse affects networks with asymmetric bandwidth
- Nagle’s algorithm reduces the number of packets that need to be sent over the network
- Nagle’s algorithm works by combining a number of small outgoing messages, and sending them all at once
- As long as there is a sent packet for which the sender has received no acknowledgment, the sender should keep buffering its output until it has a full packet’s worth of output, so that output can be sent all at once
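The buffering rule described above can be sketched as a single decision function. This is a simplified model, not a kernel implementation: the name `nagle_should_send` and the default segment size of 1460 bytes are assumptions made for illustration.

```python
def nagle_should_send(buffered_bytes, unacked_in_flight, mss=1460):
    """Decide whether buffered output may be sent now under Nagle's rule.

    buffered_bytes: bytes the application has written but TCP has not sent.
    unacked_in_flight: True if previously sent data is still unacknowledged.
    mss: assumed maximum segment size in bytes.
    """
    if buffered_bytes == 0:
        return False  # nothing to send
    if buffered_bytes >= mss:
        return True  # a full segment is sent straight away
    # A partial segment goes out only when nothing is awaiting an ACK.
    return not unacked_in_flight


if __name__ == "__main__":
    print(nagle_should_send(10, unacked_in_flight=False))   # small write, idle connection
    print(nagle_should_send(10, unacked_in_flight=True))    # small write, ACK outstanding
    print(nagle_should_send(1460, unacked_in_flight=True))  # full segment
```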
- Mechanisms were added to avoid congestion collapse: flow control, congestion control, and congestion avoidance
- Flow control
- Flow control prevents the sender overwhelming the receiver with data
- Each side of a TCP connection advertises its receive window (rwnd) which is available buffer space
- rwnd is initialised from the system default settings
- Each side can advertise a smaller window, or 0 for stop
- Window scaling is specified in RFC 1323
- Original TCP spec allocated 16 bits for the window field, so rwnd was limited to 65,535 bytes
- RFC 1323 allows window scaling option in the first SYN
- Specifies how many bits to shift-left the window size in future ACKs
- Window scaling is enabled by default in all major platforms
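The effect of the shift count on the advertised window can be worked through in a few lines. The helper name `effective_window` is made up for this sketch; the 16-bit field limit and the maximum shift of 14 come from the TCP header and RFC 1323.

```python
MAX_WINDOW_FIELD = 0xFFFF  # largest value the 16-bit window field can hold
MAX_SHIFT = 14             # largest shift count RFC 1323 permits


def effective_window(window_field, shift):
    """Return the receive window in bytes for a header value and shift count."""
    if not 0 <= window_field <= MAX_WINDOW_FIELD:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return window_field << shift


if __name__ == "__main__":
    for shift in (0, 7, 14):
        print(shift, effective_window(MAX_WINDOW_FIELD, shift))
```

With no scaling the ceiling is 65,535 bytes; with the maximum shift it is a little under 1 GiB.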
- Congestion control and congestion avoidance
- Prevents senders and receivers from overwhelming the network
- Mechanisms to estimate bandwidth and adapt speeds to changing network conditions
- In 1988 Van Jacobson and Michael J. Karels documented algorithms to address these problems
- slow-start, congestion avoidance, fast retransmit, and fast recovery
- Many variants: TCP Tahoe, Reno, Vegas, New Reno, BIC, CUBIC, or Compound TCP
- Slow-start
- Measure available capacity by exchanging data
- Server initialises a new congestion window (cwnd) per TCP connection
- Set to a conservative initial default (initcwnd on Linux)
- cwnd is sender-side limit on amount of unacknowledged data
- cwnd is not exchanged
- RFC 6928 specifies initial cwnd as 10 segments in April 2013
- Maximum amount of un-ACKed, in-flight data for a new connection is the minimum of rwnd and cwnd
- For every ACK received the server increases cwnd by 1 segment
- In other words, for every ACK received two new packets can be sent, so cwnd doubles every round trip (exponential growth)
- A TCP connection cannot use the full capacity of link straight away
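To get a feel for how long slow-start takes, the doubling can be simulated. This is an idealised sketch that assumes no packet loss, no delayed ACKs, and a receive window large enough never to be the limit; the function name `slow_start_round_trips` is hypothetical.

```python
def slow_start_round_trips(target_segments, initial_cwnd=10):
    """Count the round trips of doubling needed for cwnd to reach the target."""
    if initial_cwnd <= 0:
        raise ValueError("initial_cwnd must be positive")
    cwnd = initial_cwnd
    round_trips = 0
    while cwnd < target_segments:
        cwnd *= 2  # one extra segment per ACK doubles cwnd each round trip
        round_trips += 1
    return round_trips


if __name__ == "__main__":
    # Round trips needed before 45 segments can be in flight at once.
    print(slow_start_round_trips(45, initial_cwnd=10))
    print(slow_start_round_trips(45, initial_cwnd=4))
```

Under these assumptions a larger initial cwnd saves round trips on short transfers, which is why the checklist asks for an initial value of 10.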
- Slow-start restart (SSR)
- TCP implements an SSR mechanism, resetting the cwnd of idle connections as conditions may have changed
- Disable SSR on the server (sysctl -w net.ipv4.tcp_slow_start_after_idle=0)
- Congestion avoidance
- TCP uses packet loss as feedback mechanism to regulate performance
- Slow-start doubles the data in flight every round trip until:
- It exceeds receiver’s rwnd
- It exceeds a system-configured threshold (ssthresh) window
- A packet is lost, at which point congestion avoidance algorithm takes over
- Packet loss indicates congested link or router
- Originally TCP used Additive Increase and Multiplicative Decrease (AIMD)
- AIMD: halve the congestion window on packet loss, then increase it by a fixed amount per round trip
- RFC 6937 specifies new algorithm, Proportional Rate Reduction (PRR).
- PRR is default in Linux 3.2+
- Bandwidth-delay product (BDP): product of data link’s capacity and end-to-end delay
- BDP is maximum amount of data that can be in-flight
- If min(rwnd, cwnd) = 16 KB = 131,072 bits and RTT = 100 ms = 0.1 s, max throughput = 131,072 bits / 0.1 s = 1,310,720 bps ≈ 1.31 Mbps
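The window and round-trip arithmetic can be checked with two small helpers. The function names are illustrative; the only facts used are that the usable window is the smaller of rwnd and cwnd and that at most one window can be delivered per round trip.

```python
def max_throughput_bps(rwnd_bytes, cwnd_bytes, rtt_seconds):
    """Upper bound on throughput: one window of data per round trip."""
    window_bits = min(rwnd_bytes, cwnd_bytes) * 8
    return window_bits / rtt_seconds


def required_window_bytes(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product in bytes: the window needed to fill the link."""
    return bandwidth_bps * rtt_seconds / 8


if __name__ == "__main__":
    # 16 KB window and 100 ms RTT: about 1.31 Mbps.
    print(max_throughput_bps(16 * 1024, 16 * 1024, 0.1))
    # Window needed to fill a 10 Mbps link at the same RTT: about 125,000 bytes.
    print(required_window_bytes(10_000_000, 0.1))
```

The second call shows why a 65,535-byte window without scaling cannot fill even a modest link at 100 ms.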
- Fast retransmit and fast recovery
- Fast retransmit reduces the time a sender waits before retransmitting a lost segment
- Duplicate acknowledgement is the basis for the fast retransmit mechanism
- If receiver receives a data segment that is out of order it immediately sends a duplicate ACK
- If the sender receives three duplicate ACKs it will retransmit the missing segment
- Fast recovery stops TCP using slow-start after fast retransmit
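The duplicate-ACK trigger can be sketched as a counter over the stream of ACK numbers a sender receives. This is a simplified model that ignores timers and window updates; `fast_retransmit_points` is a hypothetical name.

```python
DUP_ACK_THRESHOLD = 3  # duplicate ACKs needed to trigger fast retransmit


def fast_retransmit_points(ack_numbers):
    """Return the ACK numbers at which the sender would fast-retransmit."""
    retransmits = []
    last_ack = None
    dup_count = 0
    for ack in ack_numbers:
        if ack == last_ack:
            dup_count += 1
            if dup_count == DUP_ACK_THRESHOLD:
                # The segment starting at this ACK number is presumed lost.
                retransmits.append(ack)
        else:
            last_ack = ack  # new data acknowledged, reset the counter
            dup_count = 0
    return retransmits


if __name__ == "__main__":
    # The segment starting at 2000 is lost; later segments arrive out of
    # order, so the receiver keeps acknowledging 2000.
    print(fast_retransmit_points([1000, 2000, 2000, 2000, 2000, 5000]))
```

The original ACK does not count as a duplicate, so the retransmission fires on the fourth ACK carrying the same number.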
- TCP receiver sees packet loss and retransmission as delivery delay when reading from socket
- This is TCP head-of-line (HOL) blocking
- Applications don’t have to reorder and reassemble and can be simple
- The cost is unpredictable latency, commonly known as jitter
- ss is a tool to inspect statistics for open sockets
ss --options --extended --memory --processes --info
- Performance checklist
- Upgrade server kernel to latest version
- Ensure that the initial cwnd is set to 10 segments
- Disable slow-start after idle
- Ensure that window scaling is enabled
- Eliminate redundant transfers
- Compress transferred data
- Position servers closer to user to reduce roundtrip times
- Reuse established TCP connections whenever possible
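Two of the checklist items map to Linux sysctl values that can be read from `/proc/sys`. The sketch below assumes a Linux host; the `CHECKLIST` mapping and the function names are made up for illustration, and the initial cwnd is not covered because on Linux it is set per route rather than through a sysctl.

```python
from pathlib import Path

# Desired values for the sysctl-backed checklist items.
CHECKLIST = {
    "net.ipv4.tcp_slow_start_after_idle": "0",  # slow-start after idle disabled
    "net.ipv4.tcp_window_scaling": "1",         # window scaling enabled
}


def read_sysctl(name, root="/proc/sys"):
    """Read a sysctl value by mapping dots in its name to path separators."""
    return Path(root, *name.split(".")).read_text().strip()


def check_tcp_settings(root="/proc/sys"):
    """Return {name: (actual_value, matches_checklist)} for each setting."""
    results = {}
    for name, wanted in CHECKLIST.items():
        try:
            actual = read_sysctl(name, root)
        except OSError:
            actual = None  # setting missing or unreadable on this system
        results[name] = (actual, actual == wanted)
    return results


if __name__ == "__main__":
    for name, (actual, ok) in check_tcp_settings().items():
        print(f"{name} = {actual} ({'ok' if ok else 'check'})")
```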
Packet layout
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Acknowledgment Number                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data |           |U|A|P|R|S|F|                               |
| Offset| Reserved  |R|C|S|S|Y|I|            Window             |
|       |           |G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |         Urgent Pointer        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             data                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
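The fixed 20-byte part of this layout can be decoded with Python's `struct` module. This is a minimal sketch: it handles only the six flags shown in the diagram and skips options, and the function name `parse_tcp_header` and the dictionary keys are illustrative.

```python
import struct

# Flag bits in the order they occupy the low six bits, least significant first.
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]


def parse_tcp_header(segment):
    """Decode the fixed 20-byte TCP header at the start of a segment."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes")
    # Network byte order: two 16-bit ports, two 32-bit numbers, four 16-bit fields.
    (src_port, dst_port, seq, ack, offset_flags,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    data_offset = offset_flags >> 12  # header length in 32-bit words
    flag_bits = offset_flags & 0x3F   # low six bits hold the flags
    flags = [name for i, name in enumerate(FLAG_NAMES) if flag_bits & (1 << i)]
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_length": data_offset * 4,  # in bytes
        "flags": flags,
        "window": window,
        "checksum": checksum,
        "urgent_pointer": urgent,
    }


if __name__ == "__main__":
    # A hand-built SYN from port 12345 to port 80 with no options.
    syn = struct.pack("!HHIIHHHH", 12345, 80, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
    print(parse_tcp_header(syn))
```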
This post is licensed under CC BY 4.0 by the author.