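This post continues the packetdrill tests of TCP Nagle with several related experiments covering MSS, Delayed ACK, and TLP. As before, the scripts mostly simulate the client side; the Delayed ACK experiment is the exception and uses a listening socket.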
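Base script
The base script performs the TCP three-way handshake and simulates the client side. For an explanation of the script, see 《TCP 基础之三次握手续》 (TCP Basics: The Three-Way Handshake, Continued).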
# cat tcp_nagle_000.pkt
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
+0 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0 > S 0:0(0) <...>
+0.01 < S. 0:0(0) ack 1 win 10000 <mss 1000>
+0 > . 1:1(0) ack 1
#
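The setsockopt line in the base script turns Nagle off with TCP_NODELAY. For comparison, an application could set the same option roughly as in the Python sketch below. The helper names make_tcp_socket and nagle_disabled are made up for illustration and are not part of the packetdrill scripts.

```python
import socket


def make_tcp_socket(nodelay: bool) -> socket.socket:
    """Create an unconnected TCP socket, optionally with Nagle disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if nodelay:
        # Same option as setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) in the script.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nagle_disabled(sock: socket.socket) -> bool:
    """Return True if TCP_NODELAY is currently set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

Leaving the option unset keeps the default, which is Nagle enabled.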
TCP Nagle
What is the TCP Nagle algorithm? In one sentence: at any moment, at most one small packet may be outstanding without an ACK.
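The one-sentence rule can be written as a small predicate. This is only a sketch of the rule as stated here, not the kernel's implementation; the function name and its parameters are hypothetical.

```python
def nagle_allows_send(seg_len: int, mss: int, nodelay: bool,
                      unacked_small: bool) -> bool:
    """Decide whether a segment may be sent now under the one-small-packet rule.

    seg_len       -- payload size of the segment waiting to be sent
    mss           -- maximum segment size in effect for the connection
    nodelay       -- True if TCP_NODELAY is set (Nagle disabled)
    unacked_small -- True if a small segment is already in flight and un-ACKed
    """
    if nodelay:
        return True            # Nagle is off, nothing is held back
    if seg_len >= mss:
        return True            # full-sized segments are never held back
    return not unacked_small   # a small segment needs the "slot" to be free
```

With an MSS of 1000, a 200-byte segment passes only while no other small segment is waiting for its ACK.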
Experiments
First, the case with Nagle disabled. Modify the script to make two consecutive writes of 1200 bytes each (larger than the MSS of 1000 bytes).
# cat tcp_nagle_007.pkt
`ethtool -K tun0 tso off
ethtool -K tun0 gso off`
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
+0 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0 > S 0:0(0) <...>
+0.01 < S. 0:0(0) ack 1 win 10000 <mss 1000>
+0 > . 1:1(0) ack 1
+0.1 write(3, ..., 1200) = 1200
+0 write(3, ..., 1200) = 1200
+0 `sleep 1`
#
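Run the script while capturing packets with tcpdump. The result is shown below.
The first 1200-byte write is split into two packets: a 1000-byte packet (one full MSS) goes out first, immediately followed by a 200-byte small packet. The second 1200-byte write is split the same way, and both the third packet (1000 bytes, one full MSS) and the fourth (200 bytes) are sent without delay. The two later retransmissions of Seq 1:1001 in the capture occur because the script never ACKs the data.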
# packetdrill tcp_nagle_007.pkt
#
# tcpdump -i any -nn port 8080
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
13:54:50.989893 tun0 Out IP 192.168.6.238.41330 > 192.0.2.1.8080: Flags [S], seq 1201261092, win 64240, options [mss 1460,sackOK,TS val 216359805 ecr 0,nop,wscale 7], length 0
13:54:50.999999 tun0 In IP 192.0.2.1.8080 > 192.168.6.238.41330: Flags [S.], seq 0, ack 1201261093, win 10000, options [mss 1000], length 0
13:54:51.000023 tun0 Out IP 192.168.6.238.41330 > 192.0.2.1.8080: Flags [.], ack 1, win 64240, length 0
13:54:51.100103 tun0 Out IP 192.168.6.238.41330 > 192.0.2.1.8080: Flags [.], seq 1:1001, ack 1, win 64240, length 1000: HTTP
13:54:51.100108 tun0 Out IP 192.168.6.238.41330 > 192.0.2.1.8080: Flags [P.], seq 1001:1201, ack 1, win 64240, length 200: HTTP
13:54:51.100120 tun0 Out IP 192.168.6.238.41330 > 192.0.2.1.8080: Flags [.], seq 1201:2201, ack 1, win 64240, length 1000: HTTP
13:54:51.100121 tun0 Out IP 192.168.6.238.41330 > 192.0.2.1.8080: Flags [P.], seq 2201:2401, ack 1, win 64240, length 200: HTTP
13:54:51.313604 tun0 Out IP 192.168.6.238.41330 > 192.0.2.1.8080: Flags [.], seq 1:1001, ack 1, win 64240, length 1000: HTTP
13:54:51.765588 tun0 Out IP 192.168.6.238.41330 > 192.0.2.1.8080: Flags [.], seq 1:1001, ack 1, win 64240, length 1000: HTTP
13:54:52.101745 ? Out IP 192.168.6.238.41330 > 192.0.2.1.8080: Flags [F.], seq 2401, ack 1, win 64240, length 0
13:54:52.101774 ? In IP 192.0.2.1.8080 > 192.168.6.238.41330: Flags [R.], seq 1, ack 1, win 10000, length 0
#
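In this capture the local SYN advertises mss 1460 while the injected SYN-ACK advertises mss 1000, and the data packets are cut at 1000 bytes. A rough Python sketch of that arithmetic follows. It assumes no TCP options take up space in the data segments (none were negotiated here) and that each write is segmented on its own, which matches this capture but is not how the kernel behaves in general. The function names are hypothetical.

```python
from typing import List, Tuple


def send_mss(local_mss: int, peer_mss: int) -> int:
    """The sender may not exceed the MSS advertised by the peer."""
    return min(local_mss, peer_mss)


def segment_writes(writes: List[int], mss: int,
                   start_seq: int = 1) -> List[Tuple[int, int]]:
    """Split each write into MSS-sized pieces; return relative (seq, end) ranges."""
    ranges = []
    seq = start_seq
    for nbytes in writes:
        end = seq + nbytes
        while seq < end:
            nxt = min(seq + mss, end)
            ranges.append((seq, nxt))
            seq = nxt
    return ranges


mss = send_mss(local_mss=1460, peer_mss=1000)
for seq, end in segment_writes([1200, 1200], mss):
    print(f"seq {seq}:{end} length {end - seq}")
```

The four ranges it produces line up with the four data packets in the tcpdump output above.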
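Next, the case with Nagle enabled. Modify the script (the TCP_NODELAY option is removed) and again make two consecutive writes of 1200 bytes each (larger than the MSS of 1000 bytes).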
# cat tcp_nagle_008.pkt
`ethtool -K tun0 tso off
ethtool -K tun0 gso off`
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0 > S 0:0(0) <...>
+0.01 < S. 0:0(0) ack 1 win 10000 <mss 1000>
+0 > . 1:1(0) ack 1
+0.1 write(3, ..., 1200) = 1200
+0 write(3, ..., 1200) = 1200
+0 `sleep 1`
#
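Run the script while capturing packets with tcpdump. The result is shown below.
The first 1200-byte write is again split into two packets: the 1000-byte packet (one full MSS) goes out, immediately followed by the 200-byte small packet. For the second 1200-byte write, the third packet carries a full MSS of 1000 bytes, so it is allowed out right away. The fourth packet, the 200-byte remainder, cannot be sent, because the earlier 200-byte small packet (the second packet) has not been ACKed yet. This matches the rule that at most one un-ACKed small packet may be outstanding at any moment. In the capture, those last 200 bytes (Seq 2201:2401) only leave together with the FIN when the socket is closed.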
# packetdrill tcp_nagle_008.pkt
#
# tcpdump -i any -nn port 8080
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
14:01:18.397880 tun0 Out IP 192.168.123.171.52806 > 192.0.2.1.8080: Flags [S], seq 3023867345, win 64240, options [mss 1460,sackOK,TS val 3176430342 ecr 0,nop,wscale 7], length 0
14:01:18.407992 tun0 In IP 192.0.2.1.8080 > 192.168.123.171.52806: Flags [S.], seq 0, ack 3023867346, win 10000, options [mss 1000], length 0
14:01:18.408027 tun0 Out IP 192.168.123.171.52806 > 192.0.2.1.8080: Flags [.], ack 1, win 64240, length 0
14:01:18.508128 tun0 Out IP 192.168.123.171.52806 > 192.0.2.1.8080: Flags [.], seq 1:1001, ack 1, win 64240, length 1000: HTTP
14:01:18.508133 tun0 Out IP 192.168.123.171.52806 > 192.0.2.1.8080: Flags [P.], seq 1001:1201, ack 1, win 64240, length 200: HTTP
14:01:18.508148 tun0 Out IP 192.168.123.171.52806 > 192.0.2.1.8080: Flags [.], seq 1201:2201, ack 1, win 64240, length 1000: HTTP
14:01:18.721612 tun0 Out IP 192.168.123.171.52806 > 192.0.2.1.8080: Flags [.], seq 1:1001, ack 1, win 64240, length 1000: HTTP
14:01:19.157595 tun0 Out IP 192.168.123.171.52806 > 192.0.2.1.8080: Flags [.], seq 1:1001, ack 1, win 64240, length 1000: HTTP
14:01:19.510161 ? Out IP 192.168.123.171.52806 > 192.0.2.1.8080: Flags [FP.], seq 2201:2401, ack 1, win 64240, length 200: HTTP
14:01:19.510180 ? In IP 192.0.2.1.8080 > 192.168.123.171.52806: Flags [R.], seq 1, ack 1, win 10000, length 0
#
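To make the difference between the two runs concrete, here is a small Python simulation of the sender with no ACKs ever arriving, as in these scripts. It is a simplified sketch: the name simulate_writes is hypothetical, each write is segmented separately, and anything queued behind a held segment simply stays held, whereas a real stack would coalesce the queued data.

```python
from typing import List, Tuple

Range = Tuple[int, int]


def simulate_writes(writes: List[int], mss: int, nodelay: bool,
                    start_seq: int = 1) -> Tuple[List[Range], List[Range]]:
    """Return (sent, held) relative sequence ranges when no ACK ever arrives."""
    sent: List[Range] = []
    held: List[Range] = []
    unacked_small = False          # is a small segment already in flight?
    seq = start_seq
    for nbytes in writes:
        end = seq + nbytes
        while seq < end:
            nxt = min(seq + mss, end)
            small = (nxt - seq) < mss
            allowed = nodelay or not small or not unacked_small
            if allowed and not held:   # keep ordering: nothing overtakes held data
                sent.append((seq, nxt))
                unacked_small = unacked_small or small
            else:
                held.append((seq, nxt))
            seq = nxt
    return sent, held


sent, held = simulate_writes([1200, 1200], mss=1000, nodelay=False)
print("sent:", sent)
print("held:", held)
```

With nodelay=False the last 200-byte range is held, and with nodelay=True all four ranges are sent, which is what the two captures show.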
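Delayed ACK
The Nagle rule, at most one un-ACKed small packet at any moment, applies to the sender; Delayed ACK is a rule for the receiver.
So it is enough to simulate a peer sending small packets until the receiver under test triggers a Delayed ACK.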
# cat tcp_nagle_009.pkt
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [3000], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 10000 <mss 1460>
+0 > S. 0:0(0) ack 1 <...>
+0.01 < . 1:1(0) ack 1 win 10000
+0 accept(3, ..., ...) = 4
+0.01 < P. 1:501(500) ack 1 win 10000
+0.01 < P. 501:601(100) ack 1 win 10000
+0.01 < P. 601:701(100) ack 1 win 10000
+0 `sleep 1`
#
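Run the script while capturing packets with tcpdump. The result is shown below.
The first two segments are ACKed immediately, but after the last 100-byte small packet arrives the receiver has entered Delayed ACK mode, so the ACK only goes out a little over 40 ms later (about 43 ms in this capture).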
# packetdrill tcp_nagle_009.pkt
#
# tcpdump -i any -nn port 8080
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
15:54:41.716548 tun0 In IP 192.0.2.1.59119 > 192.168.18.60.8080: Flags [S], seq 0, win 10000, options [mss 1460], length 0
15:54:41.716577 tun0 Out IP 192.168.18.60.8080 > 192.0.2.1.59119: Flags [S.], seq 235607528, ack 1, win 2920, options [mss 1460], length 0
15:54:41.726660 tun0 In IP 192.0.2.1.59119 > 192.168.18.60.8080: Flags [.], ack 1, win 10000, length 0
15:54:41.736744 tun0 In IP 192.0.2.1.59119 > 192.168.18.60.8080: Flags [P.], seq 1:501, ack 1, win 10000, length 500: HTTP
15:54:41.736759 tun0 Out IP 192.168.18.60.8080 > 192.0.2.1.59119: Flags [.], ack 501, win 2420, length 0
15:54:41.746756 tun0 In IP 192.0.2.1.59119 > 192.168.18.60.8080: Flags [P.], seq 501:601, ack 1, win 10000, length 100: HTTP
15:54:41.746777 tun0 Out IP 192.168.18.60.8080 > 192.0.2.1.59119: Flags [.], ack 601, win 2320, length 0
15:54:41.756774 tun0 In IP 192.0.2.1.59119 > 192.168.18.60.8080: Flags [P.], seq 601:701, ack 1, win 10000, length 100: HTTP
15:54:41.800242 tun0 Out IP 192.168.18.60.8080 > 192.0.2.1.59119: Flags [.], ack 701, win 2220, length 0
#
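The ACK latency can be read straight off the capture. The Python sketch below pairs each incoming data segment with the outgoing pure ACK that covers it and reports the gap in microseconds. The six lines in CAPTURE are copied from the tcpdump output above; the function name ack_delays_us and the parsing approach are my own and only handle this output format.

```python
import re
from datetime import datetime, timedelta

CAPTURE = """
15:54:41.736744 tun0 In IP 192.0.2.1.59119 > 192.168.18.60.8080: Flags [P.], seq 1:501, ack 1, win 10000, length 500: HTTP
15:54:41.736759 tun0 Out IP 192.168.18.60.8080 > 192.0.2.1.59119: Flags [.], ack 501, win 2420, length 0
15:54:41.746756 tun0 In IP 192.0.2.1.59119 > 192.168.18.60.8080: Flags [P.], seq 501:601, ack 1, win 10000, length 100: HTTP
15:54:41.746777 tun0 Out IP 192.168.18.60.8080 > 192.0.2.1.59119: Flags [.], ack 601, win 2320, length 0
15:54:41.756774 tun0 In IP 192.0.2.1.59119 > 192.168.18.60.8080: Flags [P.], seq 601:701, ack 1, win 10000, length 100: HTTP
15:54:41.800242 tun0 Out IP 192.168.18.60.8080 > 192.0.2.1.59119: Flags [.], ack 701, win 2220, length 0
"""


def ack_delays_us(capture: str):
    """Return [(acked_seq, delay_in_microseconds), ...] for each data segment."""
    pending = {}   # end sequence number -> arrival time of the data segment
    delays = []
    for line in capture.strip().splitlines():
        fields = line.split()
        ts = datetime.strptime(fields[0], "%H:%M:%S.%f")
        direction = fields[2]                      # "In" or "Out"
        length = int(re.search(r"length (\d+)", line).group(1))
        seq = re.search(r"seq (\d+):(\d+)", line)  # only present on data segments
        ack = re.search(r"ack (\d+)", line)
        if direction == "In" and length > 0 and seq:
            pending[int(seq.group(2))] = ts
        elif direction == "Out" and length == 0 and ack:
            acked = int(ack.group(1))
            if acked in pending:
                gap = ts - pending.pop(acked)
                delays.append((acked, gap // timedelta(microseconds=1)))
    return delays


for acked, delay in ack_delays_us(CAPTURE):
    print(f"ack {acked}: {delay} us")
```

The first two ACKs follow their data within tens of microseconds, while the third waits roughly 43 ms.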
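So when Nagle on the sender meets Delayed ACK on the receiver, the two delays compound: the sender holds back data because of Nagle, while the receiver holds back the acknowledgment because of Delayed ACK.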
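As a back-of-the-envelope illustration, consider two small writes issued back to back. The sketch below estimates when the second one can leave the sender. It is a deliberately simple model with a hypothetical function name: the receiver has nothing to send back, so the ACK of the first packet waits for the full delayed-ACK timer, and the numbers used (10 ms RTT, 40 ms delayed ACK) are illustrative values in the range seen in the captures above.

```python
def second_small_send_time_ms(rtt_ms: float, delack_ms: float,
                              nodelay: bool) -> float:
    """Time after the first small write at which the second small packet is sent."""
    if nodelay:
        return 0.0                 # Nagle off: both packets go out back to back
    # Nagle on: wait for the ACK of the first packet, which the receiver
    # delays by delack_ms before it travels back across the network.
    return rtt_ms + delack_ms


print(second_small_send_time_ms(10, 40, nodelay=False))
print(second_small_send_time_ms(10, 40, nodelay=True))
```

Under these assumptions the second packet is held for about 50 ms with Nagle enabled, and not at all with TCP_NODELAY set.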
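TLP
The conclusion first: a TLP (Tail Loss Probe) packet is not subject to the Nagle restriction.
This can be simulated on the sender. Under Nagle, the first small packet can be sent and the second cannot; then, once the probe timeout (PTO) expires, the TLP loss probe goes out.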
# cat tcp_nagle_010.pkt
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0 > S 0:0(0) <...>
+0.01 < S. 0:0(0) ack 1 win 10000 <mss 1000,sackOK,nop,nop>
+0 > . 1:1(0) ack 1
+0.1 write(3, ..., 100) = 100
+0 write(3, ..., 100) = 100
+0 `sleep 3`
#
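Run the script while capturing packets with tcpdump. The result is shown below.
After the sender transmits the first 100-byte small packet (Seq 1:101), Nagle prevents it from sending the second 100-byte small packet right away. No ACK arrives from the receiver before the PTO expires, so the PTO fires first (about 213 ms after the first packet in this capture) and a TLP probe is sent. The loss probe carries the second small packet, Seq 101:201, which Nagle had been holding back. After that, since no ACK ever arrives, the sender keeps retransmitting the first packet on retransmission timeouts.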
# packetdrill tcp_nagle_010.pkt
#
# tcpdump -i any -nn port 8080
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
16:19:44.604537 tun0 Out IP 192.168.6.14.42294 > 192.0.2.1.8080: Flags [S], seq 3925795795, win 64240, options [mss 1460,sackOK,TS val 1236130732 ecr 0,nop,wscale 7], length 0
16:19:44.614697 tun0 In IP 192.0.2.1.8080 > 192.168.6.14.42294: Flags [S.], seq 0, ack 3925795796, win 10000, options [mss 1000,sackOK,nop,nop], length 0
16:19:44.614727 tun0 Out IP 192.168.6.14.42294 > 192.0.2.1.8080: Flags [.], ack 1, win 64240, length 0
16:19:44.714849 tun0 Out IP 192.168.6.14.42294 > 192.0.2.1.8080: Flags [P.], seq 1:101, ack 1, win 64240, length 100: HTTP
16:19:44.928236 tun0 Out IP 192.168.6.14.42294 > 192.0.2.1.8080: Flags [P.], seq 101:201, ack 1, win 64240, length 100: HTTP
16:19:45.144257 tun0 Out IP 192.168.6.14.42294 > 192.0.2.1.8080: Flags [P.], seq 1:101, ack 1, win 64240, length 100: HTTP
16:19:45.588252 tun0 Out IP 192.168.6.14.42294 > 192.0.2.1.8080: Flags [P.], seq 1:101, ack 1, win 64240, length 100: HTTP
16:19:46.452286 tun0 Out IP 192.168.6.14.42294 > 192.0.2.1.8080: Flags [P.], seq 1:101, ack 1, win 64240, length 100: HTTP
16:19:47.717090 tun0 Out IP 192.168.6.14.42294 > 192.0.2.1.8080: Flags [F.], seq 201, ack 1, win 64240, length 0
16:19:47.717120 tun0 In IP 192.0.2.1.8080 > 192.168.6.14.42294: Flags [R.], seq 1, ack 1, win 10000, length 0
#
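The roughly 213 ms gap before the probe can be compared with the PTO calculation as I read it in RFC 8985: twice the smoothed RTT, plus an allowance for a delayed ACK when only one segment is in flight, capped by the time left until the RTO. The Python sketch below follows that description. The function name is hypothetical, and the remaining RTO of 210 ms is my own guess (a 200 ms minimum RTO plus the 10 ms RTT), not a value taken from the capture.

```python
from typing import Optional


def tlp_pto_ms(srtt_ms: Optional[float], flight_size: int,
               rto_remaining_ms: float,
               max_ack_delay_ms: float = 200.0) -> float:
    """Probe timeout in milliseconds, following the RFC 8985 description."""
    if srtt_ms is None:
        pto = 1000.0                   # no RTT sample yet
    else:
        pto = 2.0 * srtt_ms
        if flight_size == 1:
            pto += max_ack_delay_ms    # lone segment: allow for a delayed ACK
    return min(pto, rto_remaining_ms)  # never schedule the probe past the RTO


# One 100-byte segment in flight, SRTT of about 10 ms from the handshake.
print(tlp_pto_ms(srtt_ms=10.0, flight_size=1, rto_remaining_ms=210.0))
```

Under these assumptions the result is in the same range as the observed gap; timer granularity in the kernel means only rough agreement should be expected.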
As an extension, here is a test script that sets the TCP_NODELAY option late, after the writes.
# cat tcp_nagle_011.pkt
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0 > S 0:0(0) <...>
+0.01 < S. 0:0(0) ack 1 win 10000 <mss 1000,sackOK,nop,nop>
+0 > . 1:1(0) ack 1
+0.1 write(3, ..., 100) = 100
+0 write(3, ..., 100) = 100
+.01 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
+0 `sleep 3`
#
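Run the script while capturing packets with tcpdump. The result is shown below.
After the sender transmits the first 100-byte small packet (Seq 1:101), Nagle again prevents the second 100-byte small packet from following immediately. Ten milliseconds later the script sets TCP_NODELAY, and the second small packet (Seq 101:201) is sent at once. As before, no ACK arrives before the PTO expires, so the PTO fires and a TLP probe is sent. This time there is no new data waiting to be sent, so the loss probe reuses the last of the un-ACKed packets, namely Seq 101:201. From there the behavior is the same: with no ACK ever arriving, the sender keeps retransmitting the first packet on retransmission timeouts.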
# packetdrill tcp_nagle_011.pkt
#
# tcpdump -i any -nn port 8080
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
20:27:28.444538 tun0 Out IP 192.168.37.39.58472 > 192.0.2.1.8080: Flags [S], seq 3257513821, win 64240, options [mss 1460,sackOK,TS val 1484288010 ecr 0,nop,wscale 7], length 0
20:27:28.454668 tun0 In IP 192.0.2.1.8080 > 192.168.37.39.58472: Flags [S.], seq 0, ack 3257513822, win 10000, options [mss 1000,sackOK,nop,nop], length 0
20:27:28.454706 tun0 Out IP 192.168.37.39.58472 > 192.0.2.1.8080: Flags [.], ack 1, win 64240, length 0
20:27:28.554826 tun0 Out IP 192.168.37.39.58472 > 192.0.2.1.8080: Flags [P.], seq 1:101, ack 1, win 64240, length 100: HTTP
20:27:28.564880 tun0 Out IP 192.168.37.39.58472 > 192.0.2.1.8080: Flags [P.], seq 101:201, ack 1, win 64240, length 100: HTTP
20:27:28.592294 tun0 Out IP 192.168.37.39.58472 > 192.0.2.1.8080: Flags [P.], seq 101:201, ack 1, win 64240, length 100: HTTP
20:27:28.812279 tun0 Out IP 192.168.37.39.58472 > 192.0.2.1.8080: Flags [P.], seq 1:101, ack 1, win 64240, length 100: HTTP
20:27:29.268299 tun0 Out IP 192.168.37.39.58472 > 192.0.2.1.8080: Flags [P.], seq 1:101, ack 1, win 64240, length 100: HTTP
20:27:30.132260 tun0 Out IP 192.168.37.39.58472 > 192.0.2.1.8080: Flags [P.], seq 1:101, ack 1, win 64240, length 100: HTTP
#
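The two TLP experiments differ only in which segment the probe carries. A minimal Python sketch of that choice, as described in the text: prefer new data that is waiting to be sent, otherwise resend the last un-ACKed segment. The function name is hypothetical, and the sketch ignores other conditions a real stack checks, such as the receive window.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]


def choose_tlp_probe(unsent: List[Range],
                     unacked: List[Range]) -> Optional[Range]:
    """Pick the sequence range to send as the loss probe."""
    if unsent:
        return unsent[0]     # new data is waiting: send it, Nagle or not
    if unacked:
        return unacked[-1]   # nothing new: resend the last in-flight segment
    return None              # nothing outstanding, no probe needed


# tcp_nagle_010.pkt: Seq 101:201 is still held back by Nagle.
print(choose_tlp_probe(unsent=[(101, 201)], unacked=[(1, 101)]))
# tcp_nagle_011.pkt: both segments are already in flight.
print(choose_tlp_probe(unsent=[], unacked=[(1, 101), (101, 201)]))
```

Both cases select Seq 101:201, which matches the two captures, although for different reasons.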
Originally published on the WeChat official account Echo Reply: Wireshark & Packetdrill | TCP Nagle 算法续 (TCP Nagle Algorithm, Continued)