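Good decisions come from experience, and experience comes from bad decisions.
Objective
Building again on the packetdrill TCP three-way handshake script, this experiment looks at where the value of the Win field comes from. This time the scenario is the client side (the local kernel actively opens the connection), whereas the earlier article "TCP Three-Way Handshake: The Win Field" modeled the server side.
Base script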
# cat tcp_3hs_007.pkt
// TCP basics: the three-way handshake
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
+0 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0 > S 0:0(0) <...>
+0.01 < S. 0:0(0) ack 1 win 10000 <mss 1000>
+0 > . 1:1(0) ack 1
Test 1
Because > marks a packet that the script expects the protocol stack to send, the SYN is built and sent by the kernel stack itself.
+0 > S 0:0(0) <...>
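// +0: time offset of this line relative to the previous line.
// >: a packet the protocol stack is expected to send.
// 0:0(0): start sequence number:end sequence number(payload length).
// <>: TCP options; ... means the options are not spelled out, so whatever the kernel generates is accepted.
Since this is the client side and the kernel builds the SYN automatically, none of the SYN's fields need to be specified in the script.
1. Run the script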
# packetdrill tcp_3hs_007.pkt
#
packetdrill prints nothing and exits once the script has run to completion.
2. Capture the packets
# tcpdump -i any -nn port 8080
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
21:52:17.517087 tun0 Out IP 192.168.204.198.50960 > 192.0.2.1.8080: Flags [S], seq 2970850519, win 64240, options [mss 1460,sackOK,TS val 4210775127 ecr 0,nop,wscale 8], length 0
21:52:17.617197 ? In IP 192.0.2.1.8080 > 192.168.204.198.50960: Flags [S.], seq 0, ack 2970850520, win 10000, options [mss 1000], length 0
21:52:17.617210 ? Out IP 192.168.204.198.50960 > 192.0.2.1.8080: Flags [.], ack 1, win 64240, length 0
21:52:17.617278 ? Out IP 192.168.204.198.50960 > 192.0.2.1.8080: Flags [F.], seq 1, ack 1, win 64240, length 0
21:52:17.617286 ? In IP 192.0.2.1.8080 > 192.168.204.198.50960: Flags [R.], seq 1, ack 1, win 10000, length 0
^C
5 packets captured
7 packets received by filter
0 packets dropped by kernel
#
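The Win values can also be pulled out of the capture programmatically. The following is a minimal sketch, not part of the original experiment: the helper name flags_and_win and the regular expression are illustrative assumptions, and the input lines are copied from the capture above.

```python
import re

# First three lines of the capture above, copied verbatim.
CAPTURE = [
    "21:52:17.517087 tun0 Out IP 192.168.204.198.50960 > 192.0.2.1.8080: Flags [S], seq 2970850519, win 64240, options [mss 1460,sackOK,TS val 4210775127 ecr 0,nop,wscale 8], length 0",
    "21:52:17.617197 ? In IP 192.0.2.1.8080 > 192.168.204.198.50960: Flags [S.], seq 0, ack 2970850520, win 10000, options [mss 1000], length 0",
    "21:52:17.617210 ? Out IP 192.168.204.198.50960 > 192.0.2.1.8080: Flags [.], ack 1, win 64240, length 0",
]

# Hypothetical pattern: the TCP flags in brackets, then the first "win N" field.
LINE_RE = re.compile(r"Flags \[([^\]]+)\].*?\bwin (\d+)")


def flags_and_win(lines):
    """Return one (flags, win) tuple per tcpdump line that matches."""
    result = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            result.append((match.group(1), int(match.group(2))))
    return result


windows = flags_and_win(CAPTURE)
print(windows)  # [('S', 64240), ('S.', 10000), ('.', 64240)]
```

The first tuple is the kernel-built SYN, the second is the SYN/ACK injected by the script, and the third is the final ACK of the handshake.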
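The capture shows that the SYN carries a Win value of 64240. To see how the kernel arrives at that value, recall how the SYN window is chosen, as described in "TCP Three-Way Handshake: The Win Field".
Several functions are involved in building the client's SYN window. tcp_connect_init initializes the TCP connection and, as part of that, calls tcp_select_initial_window to set up the initial window.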
static void tcp_connect_init(struct sock *sk)
{
...
tcp_select_initial_window(sk, tcp_full_space(sk),
tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
&tp->rcv_wnd,
&tp->window_clamp,
sock_net(sk)->ipv4.sysctl_tcp_window_scaling,
&rcv_wscale,
rcv_wnd);
...
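Next comes tcp_select_initial_window. Its __space argument comes from tcp_full_space(sk), which is normally half of the default (middle) value of tcp_rmem: 131072 / 2 = 65536. space is then rounded down to a multiple of the MSS (44 * 1460 = 64240), and the result is capped at U16_MAX (65535). With default settings the window therefore ends up as 64240.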
/* Determine a window scaling and initial window to offer.
* Based on the assumption that the given amount of space
* will be offered. Store the results in the tp structure.
* NOTE: for smooth operation initial space offering should
* be a multiple of mss if possible. We assume here that mss >= 1.
* This MUST be enforced by all callers.
*/
void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
__u32 *rcv_wnd, __u32 *window_clamp,
int wscale_ok, __u8 *rcv_wscale,
__u32 init_rcv_wnd)
{
/* Make sure space is not negative. */
unsigned int space = (__space < 0 ? 0 : __space);
/* If no clamp set the clamp to the max possible scaled window */
/* If clamp is unset, set it to 65535 << 14 = 1073725440, the largest window that can be represented with window scaling. */
/* space is then limited to the smaller of space and *window_clamp. */
if (*window_clamp == 0)
(*window_clamp) = (U16_MAX << TCP_MAX_WSCALE);
space = min(*window_clamp, space);
/* Quantize space offering to a multiple of mss if possible. */
/* Round space down to a multiple of mss (only when space > mss). */
if (space > mss)
space = rounddown(space, mss);
/* NOTE: offering an initial window larger than 32767
* will break some buggy TCP stacks. If the admin tells us
* it is likely we could be speaking with such a buggy stack
* we will truncate our initial window offering to 32K-1
* unless the remote has sent us a window scaling option,
* which we interpret as a sign the remote TCP is not
* misinterpreting the window field as a signed quantity.
*/
/* Set the receive window rcv_wnd depending on whether ipv4.sysctl_tcp_workaround_signed_windows is set. */
if (sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows)
(*rcv_wnd) = min(space, MAX_TCP_WINDOW);
else
(*rcv_wnd) = min_t(u32, space, U16_MAX);
/* If init_rcv_wnd is given, cap rcv_wnd at init_rcv_wnd * mss. */
if (init_rcv_wnd)
*rcv_wnd = min(*rcv_wnd, init_rcv_wnd * mss);
/* Compute the receive window scale factor rcv_wscale. */
*rcv_wscale = 0;
if (wscale_ok) {
/* Set window scaling on max possible window */
space = max_t(u32, space, sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);
space = max_t(u32, space, sysctl_rmem_max);
space = min_t(u32, space, *window_clamp);
*rcv_wscale = clamp_t(int, ilog2(space) - 15,
0, TCP_MAX_WSCALE);
}
/* Set the clamp no higher than max representable value */
/* Limit window_clamp to the largest value that the computed rcv_wscale can represent. */
(*window_clamp) = min_t(__u32, U16_MAX << (*rcv_wscale), *window_clamp);
}
EXPORT_SYMBOL(tcp_select_initial_window);
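To follow the arithmetic outside the kernel, the function can be modeled in user space. This is a simplified Python sketch, not kernel code: the parameter names tcp_rmem_max and core_rmem_max stand in for sysctl_tcp_rmem[2] and sysctl_rmem_max, and their default values here are assumptions about a typical system.

```python
U16_MAX = 0xFFFF
TCP_MAX_WSCALE = 14
MAX_TCP_WINDOW = 32767


def select_initial_window(space, mss, window_clamp=0, wscale_ok=True,
                          init_rcv_wnd=0, tcp_rmem_max=6291456,
                          core_rmem_max=212992, workaround_signed=False):
    """Model of tcp_select_initial_window().

    Returns (rcv_wnd, rcv_wscale, window_clamp).
    tcp_rmem_max and core_rmem_max are assumed sysctl values.
    """
    space = max(space, 0)                      # never negative
    if window_clamp == 0:                      # no clamp: largest scaled window
        window_clamp = U16_MAX << TCP_MAX_WSCALE
    space = min(window_clamp, space)

    if space > mss:                            # rounddown(space, mss)
        space -= space % mss

    limit = MAX_TCP_WINDOW if workaround_signed else U16_MAX
    rcv_wnd = min(space, limit)
    if init_rcv_wnd:
        rcv_wnd = min(rcv_wnd, init_rcv_wnd * mss)

    rcv_wscale = 0
    if wscale_ok:
        space = max(space, tcp_rmem_max, core_rmem_max)
        space = min(space, window_clamp)
        ilog2 = space.bit_length() - 1         # floor(log2(space)) for space > 0
        rcv_wscale = min(max(ilog2 - 15, 0), TCP_MAX_WSCALE)

    window_clamp = min(U16_MAX << rcv_wscale, window_clamp)
    return rcv_wnd, rcv_wscale, window_clamp


rcv_wnd, rcv_wscale, clamp = select_initial_window(131072 // 2, 1460)
print(rcv_wnd)  # 64240
```

With half of a 131072-byte buffer and an MSS of 1460, the model returns a window of 64240, matching the SYN in the capture. The scale factor it returns depends on the two assumed buffer maxima, so it will differ between systems.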
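Test 2
The < S. line, the SYN/ACK, is a packet injected by the script, so its fields must be specified explicitly, including ack, win, and mss. These were covered earlier and are not repeated here; note that win is mandatory while mss may be omitted.
As another recap of "TCP Three-Way Handshake: The Win Field", this is how the Win value of a SYN/ACK is chosen on the server side. The call chain is tcp_v4_conn_request -> tcp_conn_request -> tcp_openreq_init_rwin. tcp_openreq_init_rwin, shown below, mainly selects the arguments for tcp_select_initial_window and then calls it to initialize the receive-window state.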
void tcp_openreq_init_rwin(struct request_sock *req,
const struct sock *sk_listener,
const struct dst_entry *dst)
{
struct inet_request_sock *ireq = inet_rsk(req);
const struct tcp_sock *tp = tcp_sk(sk_listener);
/* Call tcp_full_space to get the usable receive buffer space of the listening socket and store it in full_space. */
int full_space = tcp_full_space(sk_listener);
u32 window_clamp;
__u8 rcv_wscale;
u32 rcv_wnd;
int mss;
/* Compute mss from the route's advertised MSS, clamped by the listening socket's limit. */
mss = tcp_mss_clamp(tp, dst_metric_advmss(dst));
/* Read the listening socket's window_clamp. */
window_clamp = READ_ONCE(tp->window_clamp);
/* Set this up on the first call only */
/* Use window_clamp if it is set; otherwise use the route's window metric as the request socket's window clamp. */
req->rsk_window_clamp = window_clamp ? : dst_metric(dst, RTAX_WINDOW);
/* limit the window selection if the user enforce a smaller rx buffer */
/* If the user has locked a smaller receive buffer, keep the window clamp within that buffer size. */
if (sk_listener->sk_userlocks & SOCK_RCVBUF_LOCK &&
(req->rsk_window_clamp > full_space || req->rsk_window_clamp == 0))
req->rsk_window_clamp = full_space;
/* BPF-provided initial receive window; if none, fall back to the route's initrwnd metric. */
rcv_wnd = tcp_rwnd_init_bpf((struct sock *)req);
if (rcv_wnd == 0)
rcv_wnd = dst_metric(dst, RTAX_INITRWND);
else if (full_space < rcv_wnd * mss)
full_space = rcv_wnd * mss;
/* tcp_full_space because it is guaranteed to be the first packet */
tcp_select_initial_window(sk_listener, full_space,
mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
&req->rsk_rcv_wnd,
&req->rsk_window_clamp,
ireq->wscale_ok,
&rcv_wscale,
rcv_wnd);
ireq->rcv_wscale = rcv_wscale;
}
EXPORT_SYMBOL(tcp_openreq_init_rwin);
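The argument selection in tcp_openreq_init_rwin can be traced with a small user-space model. This is only a sketch under stated assumptions: the function name openreq_window_params and its parameters are hypothetical, route metrics and the BPF result are passed in as plain integers, and 65536 is an assumed full_space.

```python
TCPOLEN_TSTAMP_ALIGNED = 12  # aligned length of the TCP timestamp option


def openreq_window_params(full_space, mss, tp_window_clamp=0, route_window=0,
                          rcvbuf_locked=False, bpf_rwnd=0, route_initrwnd=0,
                          tstamp_ok=False):
    """Return (full_space, mss_arg, window_clamp, init_rcv_wnd), i.e. the
    values that would be handed to tcp_select_initial_window()."""
    # Listener's clamp if set, otherwise the route's window metric.
    window_clamp = tp_window_clamp if tp_window_clamp else route_window

    # A user-locked receive buffer bounds the clamp.
    if rcvbuf_locked and (window_clamp > full_space or window_clamp == 0):
        window_clamp = full_space

    # BPF value first, then the route's initrwnd metric.
    rcv_wnd = bpf_rwnd
    if rcv_wnd == 0:
        rcv_wnd = route_initrwnd
    elif full_space < rcv_wnd * mss:
        full_space = rcv_wnd * mss

    # Timestamps take 12 bytes out of every segment.
    mss_arg = mss - (TCPOLEN_TSTAMP_ALIGNED if tstamp_ok else 0)
    return full_space, mss_arg, window_clamp, rcv_wnd


fs, m, clamp, init_wnd = openreq_window_params(65536, 1460, tstamp_ok=True)
print(fs, m, clamp, init_wnd)  # 65536 1448 0 0
print(fs // m * m)             # 65160
```

The last line applies the same round-down-to-MSS step as tcp_select_initial_window: with timestamps negotiated, the effective MSS is 1448, so the model yields 65160 rather than the 64240 seen on the client side.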
Test 3
To test the SYN Win value, first change the default (middle) value of tcp_rmem to 65536. full_space is half of that value, so it becomes 32768.
Default tcp_rmem: 131072
# sysctl -a | grep tcp_rmem
net.ipv4.tcp_rmem = 4096 131072 6291456
#
Modified tcp_rmem: 65536
# sysctl -q net.ipv4.tcp_rmem="4096 65536 6291456"
# sysctl -a | grep tcp_rmem
net.ipv4.tcp_rmem = 4096 65536 6291456
#
Running the script again with packetdrill, the tcpdump capture shows win 32120 in the SYN. space is rounded down to a multiple of the MSS (22 * 1460 = 32120), and capping it at U16_MAX leaves it at 32120.
# packetdrill tcp_3hs_007.pkt
#
# tcpdump -i any -nn port 8080
20:27:51.528905 tun0 Out IP 192.168.242.224.48478 > 192.0.2.1.8080: Flags [S], seq 1698096268, win 32120, options [mss 1460,sackOK,TS val 776046799 ecr 0,nop,wscale 7], length 0
20:27:51.629058 ? In IP 192.0.2.1.8080 > 192.168.242.224.48478: Flags [S.], seq 0, ack 1698096269, win 10000, options [mss 1000], length 0
20:27:51.629086 ? Out IP 192.168.242.224.48478 > 192.0.2.1.8080: Flags [.], ack 1, win 32120, length 0
20:27:51.629194 ? Out IP 192.168.242.224.48478 > 192.0.2.1.8080: Flags [F.], seq 1, ack 1, win 32120, length 0
20:27:51.629208 ? In IP 192.0.2.1.8080 > 192.168.242.224.48478: Flags [R.], seq 1, ack 1, win 10000, length 0
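The relationship between the tcp_rmem default and the SYN window can be checked with a few lines of arithmetic. This is a rough sketch that assumes, as the text does, that full_space is half of the tcp_rmem default; the function name syn_window is hypothetical.

```python
MSS = 1460
U16_MAX = 65535


def syn_window(rmem_default, mss=MSS):
    """Expected SYN window for a given tcp_rmem default value."""
    full_space = rmem_default // 2        # assumption: half the buffer is usable
    if full_space > mss:
        full_space -= full_space % mss    # round down to a multiple of mss
    return min(full_space, U16_MAX)       # the Win field is 16 bits wide


for rmem in (65536, 131072, 262144):
    print(rmem, syn_window(rmem))
# 65536 32120
# 131072 64240
# 262144 65535
```

The first two rows match the captures for the modified and default settings. The third row is a model prediction only: once half the buffer exceeds 65535, the 16-bit cap takes over and the result is no longer a multiple of the MSS.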
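Test 4
Continuing with the SYN Win value, first restore the default value of tcp_rmem to 131072. This test changes init_rcv_wnd, which affects rcv_wnd: rcv_wnd becomes the smaller of its current value and init_rcv_wnd * mss.
The packetdrill .pkt file sets initrwnd to 8 on the route. After running the script, the tcpdump capture shows win 11680 in the SYN: init_rcv_wnd * mss is 8 * 1460 = 11680, which is smaller than 64240, so rcv_wnd is 11680.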
# cat tcp_3hs_win_005.pkt
`ip route change 192.0.2.0/24 dev tun0 initrwnd 8`
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
+0 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0 > S 0:0(0) <...>
+.1 < S. 0:0(0) ack 1 win 10000 <mss 1000>
+0 > . 1:1(0) ack 1
#
# packetdrill tcp_3hs_win_005.pkt
#
# tcpdump -i any -nn port 8080
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
20:58:00.716238 tun0 Out IP 192.168.136.195.38074 > 192.0.2.1.8080: Flags [S], seq 4007123868, win 11680, options [mss 1460,sackOK,TS val 4260117824 ecr 0,nop,wscale 7], length 0
20:58:00.816389 ? In IP 192.0.2.1.8080 > 192.168.136.195.38074: Flags [S.], seq 0, ack 4007123869, win 10000, options [mss 1000], length 0
20:58:00.816417 ? Out IP 192.168.136.195.38074 > 192.0.2.1.8080: Flags [.], ack 1, win 11680, length 0
20:58:00.816516 ? Out IP 192.168.136.195.38074 > 192.0.2.1.8080: Flags [F.], seq 1, ack 1, win 11680, length 0
20:58:00.816529 ? In IP 192.0.2.1.8080 > 192.168.136.195.38074: Flags [R.], seq 1, ack 1, win 10000, length 0
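The effect of initrwnd can be explored the same way for other values. This is a small sketch under the same assumptions as the text (full_space of 65536, MSS of 1460); the function name syn_window_with_initrwnd is hypothetical.

```python
def syn_window_with_initrwnd(initrwnd, full_space=65536, mss=1460):
    """Expected SYN window when the route carries an initrwnd metric."""
    space = full_space
    if space > mss:
        space -= space % mss              # round down to a multiple of mss
    rcv_wnd = min(space, 65535)           # cap at U16_MAX
    if initrwnd:                          # 0 means the metric is not set
        rcv_wnd = min(rcv_wnd, initrwnd * mss)
    return rcv_wnd


for n in (0, 8, 43, 44, 100):
    print(n, syn_window_with_initrwnd(n))
# 0 64240
# 8 11680
# 43 62780
# 44 64240
# 100 64240
```

Under these assumptions initrwnd can only shrink the window: from 44 segments upward, 44 * 1460 = 64240 is no longer smaller than the buffer-derived value, so the metric stops having any effect.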
Originally published on the WeChat official account Echo Reply: Wireshark & Packetdrill | TCP Three-Way Handshake: The Win Field, Continued