背景
线上某个中台服务在资源创建时如果执行过程超长,就会触发服务端响应 ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
,导致资源创建失败。
排查过程
1、抓包分析
在出现这个问题之初是因为近期更了新版本才出现了这个问题,回滚旧版本后问题消失,因此本次的排查是在部署新版本后进行抓包分析。
基本信息:
服务网关(enovy):
- 10.20.30.93
- 10.20.30.237
服务端:
- 10.20.30.52
中台服务:
- 10.20.30.73
请求链路:
- 客户端(页面) -> DPVS(外网) -> Nginx -> 中台服务 -> DPVS(内网) -> Nginx -> 服务网关 -> 服务端
即”用户”在中台页面发起资源创建后,会先经过外网 DPVS 透传至 Nginx 并转发至”中台服务”,再由”中台服务”发起 HTTP 请求向内网 DPVS 透传至 Nginx 并转发至”服务网关”,最终由”服务网关”路由到目标”服务端”发起资源创建。
抓包详情如下:
当请求服务端正常接口时表现如下:
从上面几个完整连接过程来看,基本都在 HTTP 响应后,如果超过60s没有继续请求就会触发 FIN,ACK
(结束连接)。
当请求服务端耗时超长的接口时表现如下:
而从上面的请求过程来看,当请求超过 90s 没有响应后,就会触发网关 FIN,ACK
(结束连接),但因为此时服务端没有回应 FIN,ACK
进而导致网关在 45s 后触发 RST
(强制中断)。
已知信息:
- 服务端与服务网关的 TCP Keep-alive 是每 15s 一次,由服务端主动发起
- HTTP正常响应后 60s 的闲置连接就会触发主动中断,由服务端主动发起
- HTTP未能立即响应(响应超时)后 90s 就会触发异常中断,由网关主动发起
如果同时拿”客户端”与”服务端”的抓包结果对比观察:
可以看出客户端和服务端都是在 90s 分别收到来自网关和 DPVS 的 RST 报文,此时有理由怀疑网关或 DPVS 存在超时相关设置问题。
2、检查enovy信息
当前配置
admin: access_log_path: /log/admin_access.log address: socket_address: { address: 0.0.0.0, port_value: 1023 } static_resources: listeners: - name: listener_0 address: socket_address: { address: 0.0.0.0, port_value: 1024 } filter_chains: - filters: - name: envoy.http_connection_manager config: upgrade_configs: - upgrade_type: websocket stat_prefix: ingress_http idle_timeout: 600s codec_type: AUTO access_log: - name: envoy.file_access_log config: path: "/log/info.log" route_config: name: local_route # 已忽略其余配置 virtual_hosts: - name: web_service domains: ["*"] # 已忽略其余配置 routes: - match: prefix: "/upstream_server/" route: cluster: upstream_server_apiserver_service timeout: 6000s http_filters: - name: envoy.router clusters: - name: upstream_server_apiserver_service connect_timeout: 5s type: LOGICAL_DNS lb_policy: ROUND_ROBIN hosts: [ { socket_address: { address: upstream_server-apiserver-service, port_value: 9113 } } ] common_http_protocol_options: idle_timeout: 180s
https://www.bookstack.cn/read/envoyproxy-1.10/144d336a80f65388.md
排查几项超时参数都是远大于 90s,与本次问题无关,初步排除网关问题。
3、本地复现
在初步排除服务网关后,于是尝试在本地通过模拟请求的方式复现并排查问题触发时机,本次分别使用 curl 与 Python requests 进行对比测试。(因为线上使用的是 Python requests)
requests版本:
- 线上:
- 2.22.0
- 本地:
- 2.23.0
- 2.31.0 (测试时最新版: https://pypi.org/project/requests/#history)
省略测试过程。
测试结果:
在 curl 测试中,无论域名请求 DPVS VIP 还是直接请求 Nginx,都无法复现请求超过 90s 后因为 RST 原因导致连接异常断开的现象。测试结果均为 300s 服务端未响应而触发的客户端请求超时(408 Rquest Timeout),但这种连接中断不是因为 Nginx 响应 RST 而触发异常断开,而是正常的 “TCP挥手” 流程。
在 Python requests 测试中,当客户端直接向 Nginx 请求时,不会出现
Connection Reset
的问题,与 curl 请求结果一致;但如果目标地址为 DPVS VIP 时就会在 90s 时复现 RST 问题。requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
问题得到复现,抓包过程与线上一致不再重复贴图。
4、尝试解决
通过查找相似问题的 issue 可以看出,在代码前增加如下片段可以解决这类问题,因此对请求样例头部添加相关代码,然后再次进行请求验证:
#!/usr/bin/python3
# -*- coding: UTF-8 -*-
import requests
import json
import time
import socket
from urllib3.connection import HTTPConnection
HTTPConnection.default_socket_options = (
HTTPConnection.default_socket_options + [
(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
(socket.SOL_TCP, socket.TCP_KEEPIDLE, 45),
(socket.SOL_TCP, socket.TCP_KEEPINTVL, 10),
(socket.SOL_TCP, socket.TCP_KEEPCNT, 6)
]
)
# 省略业务代码
...
print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
response = requests.request("POST", url, headers=headers, data=payload, timeout=360)
print(response.text)
print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
客户端输出:
2023-10-20 10:37:56
stream timeout
2023-10-20 10:42:56
从结果可以看出,即便还是请求 DPVS VIP,但在添加了相关代码后不再出现 RST 问题,也能够像 curl 一样超过 90s 最终触发客户端请求超时(408 Rquest Timeout)。
至此,已经可以确认问题是由 DPVS 产生的,并且有了临时的解决方式,但仍不知根本原因。
模拟环境
为了进一步验证其根本原因,即为何添加了相关代码后能够不会复现问题,并且为什么只有请求 DPVS VIP 时才会出现这个问题,由此进行了模拟环境测试。
0. 基本信息
- 客户端: 192.168.7.91
- Nginx: 192.168.7.27
- 服务端: 192.168.7.92
- DPVS: 192.168.7.82
- VIP: 192.168.7.182
- LIP: 192.168.7.185
注:
- VIP: 作为DPVS的高可用IP,用于接收客户端请求的入口IP
- LIP: 作为DPVS的转发IP,用于转发流量时向后端(RS)通信用的Local IP,而不是用DPVS IP向后端建连
1. 客户端请求VIP-requests下表现
客户端使用 Python requests 方式请求测试域名,分别在观察几个网络阶段的表现。
客户端侧抓包
客户端 -> DPVS
客户端输出
Traceback (most recent call last): File "/usr/local/python-3.7.11/lib/python3.7/site-packages/urllib3/connectionpool.py", line 803, in urlopen **response_kw, File "/usr/local/python-3.7.11/lib/python3.7/site-packages/urllib3/connectionpool.py", line 537, in _make_request response = conn.getresponse() File "/usr/local/python-3.7.11/lib/python3.7/site-packages/urllib3/connection.py", line 461, in getresponse httplib_response = super().getresponse() File "/usr/local/python-3.7.11/lib/python3.7/http/client.py", line 1373, in getresponse response.begin() File "/usr/local/python-3.7.11/lib/python3.7/http/client.py", line 319, in begin version, status, reason = self._read_status() File "/usr/local/python-3.7.11/lib/python3.7/http/client.py", line 280, in _read_status line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1") File "/usr/local/python-3.7.11/lib/python3.7/socket.py", line 589, in readinto return self._sock.recv_into(b) ConnectionResetError: [Errno 104] Connection reset by peer During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/local/python-3.7.11/lib/python3.7/site-packages/requests/adapters.py", line 497, in send chunked=chunked, File "/usr/local/python-3.7.11/lib/python3.7/site-packages/urllib3/connectionpool.py", line 846, in urlopen method, url, error=new_e, _pool=self, _stacktrace=sys.exc_info()[2] File "/usr/local/python-3.7.11/lib/python3.7/site-packages/urllib3/util/retry.py", line 470, in increment raise reraise(type(error), error, _stacktrace) File "/usr/local/python-3.7.11/lib/python3.7/site-packages/urllib3/util/util.py", line 38, in reraise raise value.with_traceback(tb) File "/usr/local/python-3.7.11/lib/python3.7/site-packages/urllib3/connectionpool.py", line 803, in urlopen **response_kw, File "/usr/local/python-3.7.11/lib/python3.7/site-packages/urllib3/connectionpool.py", line 537, in _make_request response = conn.getresponse() File "/usr/local/python-3.7.11/lib/python3.7/site-packages/urllib3/connection.py", line 461, in getresponse httplib_response = super().getresponse() File "/usr/local/python-3.7.11/lib/python3.7/http/client.py", line 1373, in getresponse response.begin() File "/usr/local/python-3.7.11/lib/python3.7/http/client.py", line 319, in begin version, status, reason = self._read_status() File "/usr/local/python-3.7.11/lib/python3.7/http/client.py", line 280, in _read_status line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1") File "/usr/local/python-3.7.11/lib/python3.7/socket.py", line 589, in readinto return self._sock.recv_into(b) urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "clientTest.py", line 45, in <module> response = requests.request("POST", url, headers=headers, data=payload, stream=True, timeout=360) File "/usr/local/python-3.7.11/lib/python3.7/site-packages/requests/api.py", line 59, in request return session.request(method=method, url=url, **kwargs) File "/usr/local/python-3.7.11/lib/python3.7/site-packages/requests/sessions.py", line 589, in request resp = self.send(prep, **send_kwargs) File "/usr/local/python-3.7.11/lib/python3.7/site-packages/requests/sessions.py", line 703, in send r = adapter.send(request, **kwargs) File "/usr/local/python-3.7.11/lib/python3.7/site-packages/requests/adapters.py", line 501, in send raise ConnectionError(err, request=request) requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
DPVS侧抓包
客户端 -> DPVS
DPVS -> Nginx
Nginx侧抓包
DPVS -> Nginx
请求日志:
192.168.7.185:1026 80 - [21/Oct/2023:15:20:40 +0800] "POST /test HTTP/1.1" 499 0 434 "-" "python-requests/2.27.1" - 192.168.7.92:51020 - 90.817 90.818 opstest.nestealin.com "- - -" "192.168.7.185"
从上述抓包表现可以看出,在客户端请求后 90s 左右,DPVS分别向客户端、服务端发送了 RST 致使连接中断。
2. 客户端请求VIP-curl下表现
客户端侧抓包
DPVS侧抓包
客户端 -> DPVS
DPVS -> Nginx
Nginx侧抓包
将本次测试请求的抓包过程与前一次使用 requests 方法对比可以发现,客户端在使用 curl 请求的第 60s 时会发送一次 TCP Keep-Alive 请求,而客户端使用 Python requests 方法请求时全程没有发起 TCP Keep-Alive 请求;而这种”请求保活”的方式也是客户端在使用 curl 时连接过程得以延续的关键。
3. 客户端直接请求Nginx
在客户端下继续使用 requests 向 Nginx 直接发起请求。
客户端侧抓包
Nginx侧抓包
当客户端直接向 Nginx 请求时,请求过程仍旧没有发送,但不会触发 Connection Reset
能够完成 HTTP 请求。
此时可以确认本次”异常断开”的行为是与 DPVS 有直接关系。
4. 客户端增加Keep-Alive
在客户端下继续使用 requests 向 DPVS VIP 发起请求,但本次在代码中添加”临时方法”,实现定期发送 Keep-Alive 请求,并抓取请求过程。
客户端侧抓包
DPVS侧抓包
客户端 -> DPVS
DPVS -> Nginx
Nginx侧抓包
可见,在客户端开启 Keep-Alive 后,DPVS 也不会主动终止 TCP 连接,进而确认本次异常与 DPVS 的连接管理有关。(缩小问题范围)
5. 源码分析
通过源码分析可以大致看出在 DPVS 的 TCP 连接管理中是有一个定时器来实现超时处理的,再结合 dpvs/src/ipvs/ip_vs_proto_tcp.c at master · iqiyi/dpvs 初始化方法可以看出”监听状态”的超时参数是由 /etc/dpvs.conf
中的 ipvs_defs.tcp.timeout.established
定义的。
接下来为了进一步验证配置有效性,通过修改相关文件增加日志输出:
# 以下均为编辑tcp_send_rst方法
vi src/ipvs/ip_vs_proto_tcp.c +999
if (conn->state != DPVS_TCP_S_ESTABLISHED) {
RTE_LOG(WARNING, IPVS, "%s: only RST in ESTABLISHED.\n", __func__);
return EDPVS_OK;
}
// 方法末行增加
RTE_LOG(WARNING, IPVS, "%s: IDLE ESTABLISHED RST Sended.\n", __func__);
return EDPVS_OK;
}
重新编译并重启DPVS
cd /usr/local/dpvs-1.9.0
export DPVS_PATH=/usr/local/dpvs-1.9.0
export PKG_CONFIG_PATH=$DPVS_PATH/dpdk/dpdklib/lib64/pkgconfig:/usr/lib64/pkgconfig
export LIBDPDKPC_PATH=$DPVS_PATH/dpdk/dpdklib/lib64/pkgconfig/libdpdk.pc
make -j 20
make install -j 20
待编译通过后重启 DPVS 及 Keepalived 。
6. 效果验证
客户端直接请求VIP
curl 192.168.7.182 -v
# 因为没有携带域名请求不会命中测试配置,会立刻得到Nginx响应,连接终止
DPVS日志
# 待请求完成后约7s出现如下日志 tailf /data/logs/dpvs/dpvs.log IPVS: tcp_send_rst: only RST in ESTABLISHED. IPVS: tcp_send_rst: only RST in ESTABLISHED.
7s后进行连接回收是受限于
ipvs_defs.tcp.timeout.time_wait
参数,默认值为7
此处验证日志输出有效。
接下来开始请求测试域名
python3 clientTest.py
DPVS日志
# 待请求发起后90s出现如下日志 tailf /data/logs/dpvs/dpvs.log IPVS: tcp_send_rst: IDLE ESTABLISHED RST Sended. IPVS: tcp_send_rst: IDLE ESTABLISHED RST Sended.
- 代表没能正常断开连接,而是让 DPVS 双向删除连接状态
Nginx日志
192.168.7.185:1046 80 - [23/Oct/2023:14:08:13 +0800] "POST /test HTTP/1.1" 499 0 434 "-" "python-requests/2.31.0" - 192.168.7.92:51020 - 90.640 90.641 opstest.nestealin.com "- - -" "192.168.7.185"
- 此时客户端触发
Connection Reset
异常复现问题,中断的时长为 90s
- 此时客户端触发
进而修改 DPVS 配置,延长 TCP 超时时间
vim /etc/dpvs.conf
修改内容如下:
仅展示关键配置
! ipvs config
ipvs_defs {
...
tcp {
! defence_tcp_drop
timeout {
...
established 90
established 100
...
}
}
...
}
对 IPVS 中 TCP established 状态的空闲连接超时设置从 90s 调高至 100s
并对 DPVS 执行配置重加载
kill -SIGHUP ${DPVS_PID}
客户端重新发起测试请求
DPVS日志
tailf /data/logs/dpvs/dpvs.log # Reload日志简要输出,能够确认配置加载生效 CFG_FILE: load dpvs configuation file at 2023/10/23 14:40:11 Monday IPVS: tcp_timeout_established = 100 # 新的测试请求过程中,依旧因为"空闲"连接导致DPVS主动断开 IPVS: tcp_send_rst: IDLE ESTABLISHED RST Sended. IPVS: tcp_send_rst: IDLE ESTABLISHED RST Sended.
Nginx日志
192.168.7.185:1048 80 - [23/Oct/2023:14:28:24 +0800] "POST /test HTTP/1.1" 499 0 434 "-" "python-requests/2.31.0" - 192.168.7.92:51020 - 101.077 101.076 opstest.nestealin.com "- - -" "192.168.7.185"
虽然本次客户端仍会触发 Connection Reset
异常,但中断的时长已经变为 101s,基本与 DPVS 配置调整的预期一致,证明配置有效。
小结
- curl 请求比 Python 的 requests 方法在客户端请求第 60s 时多发送了一次 TCP Keep-Alive 请求,让 TCP 连接保活、请求过程得以继续
- DPVS 对 TCP 连接有设置定时器,对各种阶段的连接进行单独处理,最终由
/etc/dpvs.conf
配置中的ipvs_defs.tcp.timeout
相关参数独立控制
关于408问题
在解决客户端 Connection Reset
问题之后,继续出现了下方的 Read timed out
问题:
2023-10-24 13:02:49
Traceback (most recent call last):
File "/usr/local/python-3.7.11/lib/python3.7/site-packages/urllib3/connectionpool.py", line 426, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/python-3.7.11/lib/python3.7/site-packages/urllib3/connectionpool.py", line 421, in _make_request
httplib_response = conn.getresponse()
File "/usr/local/python-3.7.11/lib/python3.7/http/client.py", line 1373, in getresponse
response.begin()
File "/usr/local/python-3.7.11/lib/python3.7/http/client.py", line 319, in begin
version, status, reason = self._read_status()
File "/usr/local/python-3.7.11/lib/python3.7/http/client.py", line 280, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/local/python-3.7.11/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
TimeoutError: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/python-3.7.11/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/usr/local/python-3.7.11/lib/python3.7/site-packages/urllib3/connectionpool.py", line 727, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/python-3.7.11/lib/python3.7/site-packages/urllib3/util/retry.py", line 410, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/python-3.7.11/lib/python3.7/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/usr/local/python-3.7.11/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
chunked=chunked,
File "/usr/local/python-3.7.11/lib/python3.7/site-packages/urllib3/connectionpool.py", line 428, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/local/python-3.7.11/lib/python3.7/site-packages/urllib3/connectionpool.py", line 353, in _raise_timeout
self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='192.168.7.182', port=80): Read timed out. (read timeout=None)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "clientTest.py", line 46, in <module>
response = requests.request("POST", url, headers=headers, data=payload)
File "/usr/local/python-3.7.11/lib/python3.7/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/python-3.7.11/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/python-3.7.11/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/usr/local/python-3.7.11/lib/python3.7/site-packages/requests/adapters.py", line 529, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='192.168.7.182', port=80): Read timed out. (read timeout=None)
这是由于客户端在请求时设置了超时时间,去掉 timeout
参数后即可解决问题。
...
# 原代码
response = requests.request("POST", url, headers=headers, data=payload, timeout=360)
# 更改后代码,去除timeout参数即可
response = requests.request("POST", url, headers=headers, data=payload)
...
总结
1. 出现RST的原因
客观原因
- 在 DPVS 中存在管理 TCP 连接的定时器,而对”监听状态” ( Established ) 的 TCP 连接默认设置为 90s 超时,即如果在此期间没有任何 TCP 报文传输,DPVS 就会分别向客户端与 RS 发送 RST 报文中断连接 ( 出现 RST 问题的原因1 )
- 在 Python requests 方法中默认不会实现 TCP Keep-Alive 机制,需要额外增加
HTTPConnection.default_socket_options
实现”连接保活” ( 出现 RST 问题的原因2 ) - 因为 DPVS 存在 TCP 空流超时设置( 原因1 ),并且 Python requests 默认没有发送 TCP 心跳( 原因2 ),导致如有请求连接超过 90s 未主动断开的,就会触达 DPVS 默认超时时间,随后 DPVS 会主动发送 RST 报文中断连接
- Python requests 是依赖 urllib3 实现的 TCP 建连,但并没有向上暴露相关修改参数,使得用户无法在使用 requests 时直接管理 TCP 连接
主观原因
- “中台服务”创建资源是串行执行,只有等到”服务端”资源完成创建后这个请求才算完成
- “服务端”的资源创建流程通常需要 5、6 分钟甚至更多,导致一般的超时阈值无法覆盖整个执行过程
- 综上,再结合”客观原因”而触发了创建中的 RST 问题
2. 为什么只在本次版本发布后出现该问题
- 原来”中台服务”有对 RST 异常进行捕获并重试( “服务端”接口已做幂等校验 ),而在本次发版后因为框架升级忽略了相关处理,未能”捕获并重试”进而引发相关问题
3. 出现客户端请求超时(408 Rquest Timeout)的原因
- 客户端有设置
timeout
主动超时参数,导致到了超时阈值后主动断开连接
Reference
HTTP route management - HTTP route - 《Envoy Proxy 1.10 Documentation》
https://requests.readthedocs.io/en/stable/user/advanced/#keep-alive
Connection reset by peer when sending POST · Issue #4937 · psf/requests · GitHub
Connection reset by peer · Issue #212 · edeng23/binance-trade-bot · GitHub
https://urllib3.readthedocs.io/en/latest/user-guide.html#retrying-requests
连接中止: RemoteDisconnected (“没有响应的远程结束关闭连接”)在发送邮件时大约5分钟不活动之后-腾讯云开发者社区-腾讯云
python 环境配置 requests 模块使用,打开tcp keepalive解决请求长时间返回时tcp连接断开问题
http.client — HTTP protocol client — Python 3.12.0 documentation