Envoy/Istio Connection Lifecycle and Edge-Case Anomalies: Inexplicable Connection Resets

image.png
Introduction

Goal of this article: explain what Envoy's connection-control parameters do, and the detailed logic that applies in edge-case failure scenarios. The aim is to reduce request failures caused by connection anomalies and to improve service success rates.

Recently, while troubleshooting occasional connection resets on an Istio Gateway in production, I dug into how Envoy and the Linux kernel handle socket connection closure, including Envoy's connection-management parameters and the finer points of Linux network programming. This article is my record of that investigation.

Original (Chinese): https://blog.mygraphql.com/zh/posts/cloud/envoy/connection-life/

About the cover: Silicon Valley

Silicon Valley is an American comedy television series created by Mike Judge, John Altschuler and Dave Krinsky. It premiered on HBO on April 6, 2014 and ended on December 8, 2019, running 53 episodes. The series parodies the tech-industry culture of Silicon Valley, centering on Richard Hendricks, a programmer who founds a startup called Pied Piper, and chronicles his struggle to keep the company alive against competition from larger entities.

The series received critical acclaim for its writing and humor. It earned numerous accolades, including five consecutive Primetime Emmy Award nominations for Outstanding Comedy Series.

I watched the show in 2018. My English was limited then, and about the only word I could reliably catch was F***. Even so, through the screen you could feel a group of founders full of startup passion, each playing to their strengths, taking on one challenge after another. In a way, it fulfilled a wish I could never realize in the real world.

A classic scene in the show involves a toy ball that, once tossed in the air, randomly turns red or blue. If a player gets blue, they win. It is called the:

SWITCH PITCH BALL

Based on a patented ‘inside-out’ mechanism, this lightweight ball changes colors when it is flipped in the air. The Switch Pitch entered the market in 2001 and has been in continuous production since then. The toy won the Oppenheim Platinum Award for Design and has been featured numerous times on HBO’s Silicon Valley.

image.png

So, does a technical article really need such a long prologue? Well, this article is long and dry. TL;DR, as they say.

As everyone knows, every network-heavy application, Envoy included, sometimes plays its own game of SWITCH PITCH BALL. The randomness can come from an unusually slow peer, an unusually small network MTU, an unusually large HTTP body, a long-lived HTTP keepalive connection, or even a non-conformant HTTP client implementation.

Envoy connection lifecycle management

Excerpted from my: https://istio-insider.mygraphql.com/zh_CN/latest/ch2-envoy/connection-life/connection-life.html

Decoupling upstream and downstream connections

The HTTP/1.1 specification embodies this design principle:
An HTTP proxy is an L7 proxy and should be decoupled from the L3/L4 connection lifecycle.

Therefore, Envoy does not forward headers like Connection: Close or Connection: Keepalive arriving from downstream to the upstream. The downstream connection's lifecycle does, of course, follow the Connection: xyz directives it receives, but the upstream connection's lifecycle is not affected by the downstream connection's. In other words, these are two independently managed connection lifecycles.
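This hop-by-hop rule can be sketched in a few lines of Python. This is not Envoy's code, just an illustration of what the HTTP/1.1 spec requires of any proxy: the `Connection` header, the headers it names, and the other well-known hop-by-hop headers must not be forwarded.

```python
# Minimal sketch of the hop-by-hop header rule a proxy must follow.
# Not Envoy code; the set below is the usual RFC-defined list.
HOP_BY_HOP = {"connection", "keep-alive", "proxy-connection",
              "transfer-encoding", "te", "upgrade"}

def strip_hop_by_hop(headers: dict) -> dict:
    """Return only the headers a proxy may forward to the next hop."""
    # "Connection: foo, bar" additionally marks foo and bar as hop-by-hop.
    listed = {t.strip().lower()
              for t in headers.get("Connection", "").split(",") if t.strip()}
    drop = HOP_BY_HOP | listed
    return {k: v for k, v in headers.items() if k.lower() not in drop}

fwd = strip_hop_by_hop({"Connection": "close", "Host": "example.com",
                        "X-Request-Id": "42"})
# fwd keeps Host and X-Request-Id; Connection: close stays on this hop
```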

Github Issue: HTTP filter before and after evaluation of Connection: Close header sent by upstream#15788 explains this point:
This doesn't make sense in the context of Envoy, where downstream and upstream are decoupled and can use different protocols. I'm still not completely understanding the actual problem you are trying to solve?

Connection timeout configuration parameters

image.png

Figure: Envoy connection timeout timeline


idle_timeout

(Duration) The idle timeout for connections. The idle timeout is defined as the period in which there are no active requests. When the idle timeout is reached the connection will be closed. If the connection is an HTTP/2 downstream connection a drain sequence will occur prior to closing the connection, see drain_timeout. Note that request based timeouts mean that HTTP/2 PINGs will not keep the connection alive. If not specified, this defaults to 1 hour. To disable idle timeouts explicitly set this to 0.

Warning

Disabling this timeout has a highly likelihood of yielding connection leaks due to lost TCP FIN packets, etc.

If the overload action “envoy.overload_actions.reduce_timeouts” is configured, this timeout is scaled for downstream connections according to the value for HTTP_DOWNSTREAM_CONNECTION_IDLE.

max_connection_duration

(Duration) The maximum duration of a connection. The duration is defined as a period since a connection was established. If not set, there is no max duration. When max_connection_duration is reached and if there are no active streams, the connection will be closed. If the connection is a downstream connection and there are any active streams, the drain sequence will kick-in, and the connection will be force-closed after the drain period. See drain_timeout.

Github Issue: http: Allow upper bounding lifetime of downstream connections #8302

Github PR: add max_connection_duration: http conn man: allow to upper-bound downstream connection lifetime. #8591

Github PR: upstream: support max connection duration for upstream HTTP connections #17932

Github Issue: Forward Connection:Close header to downstream#14910
For HTTP/1, Envoy will send a Connection: close header after max_connection_duration if another request comes in. If not, after some period of time, it will just close the connection.

https://github.com/envoyproxy...

Note that max_requests_per_connection isn't (yet) implemented/supported for downstream connections.

For HTTP/1, Envoy will send a Connection: close header after max_connection_duration (and before drain_timeout) if another request comes in. If not, after some period of time, it will just close the connection.

I don't know what your downstream LB is going to do, but note that according to the spec, the Connection header is hop-by-hop for HTTP proxies.

max_requests_per_connection

(UInt32Value) Optional maximum requests for both upstream and downstream connections. If not specified, there is no limit. Setting this parameter to 1 will effectively disable keep alive. For HTTP/2 and HTTP/3, due to concurrent stream processing, the limit is approximate.

Github Issue: Forward Connection:Close header to downstream#14910

We are having this same issue when using istio (istio/istio#32516). We are migrating to use istio with envoy sidecars frontend be an AWS ELB. We see that connections from ELB -> envoy stay open even when our application is sending Connection: Close. max_connection_duration works but does not seem to be the best option. Our applications are smart enough to know when they are overloaded from a client and send Connection: Close to shard load.

I tried writing an envoy filter to get around this but the filter gets applied before the stripping. Did anyone discover a way to forward the connection close header?

drain_timeout - for downstream only

Envoy Doc

(Duration) The time that Envoy will wait between sending an HTTP/2 “shutdown notification” (GOAWAY frame with max stream ID) and a final GOAWAY frame. This is used so that Envoy provides a grace period for new streams that race with the final GOAWAY frame. During this grace period, Envoy will continue to accept new streams.

After the grace period, a final GOAWAY frame is sent and Envoy will start refusing new streams. Draining occurs both when:

  • a connection hits the idle timeout

    • That is, once a connection reaches idle_timeout or max_connection_duration, it enters the draining state and the drain_timeout timer starts. For HTTP/1.1, while in the draining state, Envoy adds a Connection: close header to the response of every request that arrives from downstream.
    • In other words, only after a connection hits idle_timeout or max_connection_duration does it enter the draining state and start the drain_timeout timer.
  • or during general server draining.

The default grace period is 5000 milliseconds (5 seconds) if this option is not specified.

https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/draining

By default, the HTTP connection manager filter will add “Connection: close” to HTTP1 requests (author's note: via the HTTP response), send HTTP2 GOAWAY, and terminate connections on request completion (after the delayed close period).

I used to think draining was triggered only when Envoy was shutting down. As it turns out, any planned connection closure (a connection reaching idle_timeout or max_connection_duration) goes through the drain sequence.

delayed_close_timeout - for downstream only

(Duration) The delayed close timeout is for downstream connections managed by the HTTP connection manager. It is defined as a grace period after connection close processing has been locally initiated during which Envoy will wait for the peer to close (i.e., a TCP FIN/RST is received by Envoy from the downstream connection) prior to Envoy closing the socket associated with that connection.

That is, in some scenarios Envoy writes the HTTP response and wants to close the connection before it has fully read the HTTP request. This is called Server Prematurely/Early Closes Connection. At that point, several situations are possible:

  • The downstream is still in the middle of sending the HTTP request (socket write).
  • Or, in Envoy's kernel, the socket recv buffer still holds data not yet read by Envoy user space. Typically an HTTP body of Content-Length size is still sitting in the kernel's socket recv buffer, never fully loaded into Envoy user space.

In both cases, if Envoy calls close(fd) to close the connection, the downstream may receive an RST from Envoy's kernel. In the end, the downstream may conclude the connection failed without ever reading the HTTP response from the socket, and report an error upward, something like: Peer connection reset.

See: {doc}connection-life-race

To mitigate this, Envoy provides a delayed connection close configuration: wait for the downstream to finish its socket writes, let the data in the kernel socket recv buffer be loaded into user space, and only then call close(fd).

NOTE: This timeout is enforced even when the socket associated with the downstream connection is pending a flush of the write buffer. However, any progress made writing data to the socket will restart the timer associated with this timeout. This means that the total grace period for a socket in this state will be <total_time_waiting_for_write_buffer_flushes>+<delayed_close_timeout>.

That is, each successful socket write resets this timer.

Delaying Envoy’s connection close and giving the peer the opportunity to initiate the close sequence mitigates a race condition that exists when downstream clients do not drain/process data in a connection’s receive buffer after a remote close has been detected via a socket write(). In other words, it mitigates the case where the downstream, once a socket write has failed, never reads the response from the socket.

This race leads to such clients failing to process the response code sent by Envoy, which could result in erroneous downstream processing.

If the timeout triggers, Envoy will close the connection’s socket.

The default timeout is 1000 ms if this option is not specified.

Note:

To be useful in avoiding the race condition described above, this timeout must be set to at least <max round trip time expected between clients and Envoy>+<100ms to account for a reasonable “worst” case processing time for a full iteration of Envoy’s event loop>.

Warning:

A value of 0 will completely disable delayed close processing. When disabled, the downstream connection’s socket will be closed immediately after the write flush is completed or will never close if the write flush does not complete.

Note that, to avoid hurting performance, delayed_close_timeout does not take effect in many cases:

Github PR: http: reduce delay-close issues for HTTP/1.1 and below #19863

Skipping delay close for:

  • HTTP/1.0 framed by connection close (as it simply reduces time to end-framing)
  • as well as HTTP/1.1 if the request is fully read (so there's no FIN-RST race)

Addresses the Envoy-specific parts of #19821
Runtime guard: envoy.reloadable_features.skip_delay_close

It also appears in the Envoy 1.22.0 release notes:

http: avoiding delay-close for:

  • HTTP/1.0 responses framed by connection: close
  • as well as HTTP/1.1 if the request is fully read.

This means for responses to such requests, the FIN will be sent immediately after the response. This behavior can be temporarily reverted by setting envoy.reloadable_features.skip_delay_close to false. If clients are seen to be receiving sporadic partial responses and flipping this flag fixes it, please notify the project immediately.

Race conditions after Envoy closes a connection

Excerpted from my: https://istio-insider.mygraphql.com/zh_CN/latest/ch2-envoy/connection-life/connection-life-race.html

The discussion below relies on some fairly low-level, obscure socket knowledge, such as the edge states and error logic of closing a socket. If these are unfamiliar, I suggest first reading my article 《Socket Close/Shutdown 的臨界狀態與異常邏輯》 in 《Mark's DevOps 雜碎》.

Desynchronized connection state between Envoy and downstream/upstream

Most of the situations below are race conditions with a low probability of occurring. But under heavy traffic, even the rarest case happens eventually. Design for Failure is a programmer's duty.

Downstream sends a request on a connection Envoy is closing

Github Issue: 502 on our ALB when traffic rate drops#13388
Fundamentally, the problem is that ALB is reusing connections that Envoy is closing. This is an inherent race condition with HTTP/1.1.
You need to configure the ALB max connection / idle timeout to be < any envoy timeout.

To have no race conditions, the ALB needs to support max_connection_duration and have that be less than Envoy's max connection duration. There is no way to fix this with Envoy.

In essence:

  1. Envoy calls close(fd), closing the socket and its fd.

    • At the moment of close(fd):

      • If the kernel's socket recv buffer holds data not yet loaded into user space, the kernel sends an RST to the downstream, because that data had already been ACKed at the TCP level yet the application discarded it.
      • Otherwise, the kernel sends a FIN to the downstream.
    • Since the fd is closed, any TCP data the kernel still receives on this connection is doomed to be dropped and answered with an RST.
  2. Envoy has sent the FIN.
  3. The Envoy socket's kernel state moves to FIN_WAIT_1 / FIN_WAIT_2.

On the downstream side, two things are possible:

  • The socket state in the downstream's kernel has already been moved to CLOSE_WAIT by Envoy's FIN, but the downstream program (user space) has not caught up (it has not noticed the CLOSE_WAIT state).
  • Or the downstream's kernel has not received the FIN at all, due to network latency or similar.

Either way, the downstream program re-uses this socket and sends an HTTP request (assume it is split into several IP packets). The result is the same: when some IP packet reaches Envoy's kernel, the kernel replies with an RST. Once the downstream kernel receives that RST, it closes its socket too, so every socket write from some point onward fails. The error reads something like Upstream connection reset. Note that socket write() is asynchronous; it returns without waiting for the peer's ACK.

  • One possibility: some write() call is the one that fails. This is more typical of HTTP keepalive client libraries, or of an HTTP body far larger than the socket send buffer, split across many IP packets.
  • Another possibility: the failure only surfaces at close(), which waits for the ACK. This is more typical of non-keepalive HTTP client libraries, or of the last request on a keepalive connection.
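The sequence above can be reproduced on a loopback socket. Below is a minimal Python sketch (Linux semantics, not Envoy code): the server closes with unread data in its receive buffer, so its kernel emits an RST instead of a FIN, and the client's write only fails once the RST has arrived.

```python
import errno
import socket
import time

# Loopback demo: close() with unread data in the recv buffer -> RST.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()

# The client starts a request; the bytes land in conn's kernel recv buffer.
cli.sendall(b"POST / HTTP/1.1\r\nContent-Length: 10\r\n\r\n")
time.sleep(0.1)   # let the data reach the server's kernel

# The server closes without reading: already-ACKed data is discarded,
# so the kernel sends RST rather than FIN.
conn.close()
time.sleep(0.1)   # let the RST reach the client's kernel

err = 0
try:
    cli.sendall(b"0123456789")   # write on a reset connection
    cli.sendall(b"tail")         # if the first write slipped through,
                                 # a later one surfaces the error
except OSError as e:
    err = e.errno                # ECONNRESET or EPIPE

srv.close()
```

With larger bodies the failing write can be any one of many, which is exactly why these errors look random from the client's point of view.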

From the HTTP perspective, this problem can appear in two scenarios:

  • Server Prematurely/Early Closes Connection.

    The downstream writes the HTTP header, then writes the HTTP body. Envoy, however, writes the response and close(fd)s the socket before it has finished reading the HTTP body. This is called Server Prematurely/Early Closes Connection. Do not assume Envoy never writes a response and closes the socket before fully reading the request. There are at least a few ways it can happen:

    • The header alone is enough to judge a request invalid, so mostly a 4xx/5xx status code is returned.
    • The HTTP request body exceeds Envoy's limit max_request_bytes.

    At that point, there are two cases:

    • The downstream's socket may be in CLOSE_WAIT, a state in which write() still succeeds. But if that HTTP body reaches Envoy's kernel after close(fd) has already run, the socket's fd is closed, so the kernel simply drops the HTTP body and returns an RST to the peer (with the fd closed, no process could ever read the data). The poor downstream then reports an error like Connection reset by peer.
    • When Envoy calls close(fd), the kernel finds that the kernel's socket buffer was not fully consumed by user space. In this case, the kernel sends an RST to the downstream. In the end, the poor downstream reports Connection reset by peer on its next write(fd) or read(fd).

      See: Github Issue: http: not proxying 413 correctly#2929

      +----------------+      +-----------------+
      |Listner A (8000)|+---->|Listener B (8080)|+----> (dummy backend)
      +----------------+      +-----------------+

      This issue is happening, because Envoy acting as a server (i.e. listener B in @lizan's example) closes downstream connection with pending (unread) data, which results in TCP RST packet being sent downstream.

      Depending on the timing, downstream (i.e. listener A in @lizan's example) might be able to receive and proxy complete HTTP response before receiving TCP RST packet (which erases low-level TCP buffers), in which case client will receive response sent by upstream (413 Request Body Too Large in this case, but this issue is not limited to that response code), otherwise client will receive 503 Service Unavailable response generated by listener A (which actually isn't the most appropriate response code in this case, but that's a separate issue).

      The common solution for this problem is to half-close downstream connection using ::shutdown(fd_, SHUT_WR) and then read downstream until EOF (to confirm that the other side received complete HTTP response and closed connection) or short timeout.

A feasible way to reduce this race condition is to delay closing the socket. Envoy has a configuration for it: delayed_close_timeout
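The "common solution" quoted above, shutdown(SHUT_WR) followed by reading until EOF or a short timeout, can be sketched in Python. `graceful_close` is a hypothetical helper for illustration, not an Envoy API:

```python
import socket

def graceful_close(sock: socket.socket, timeout: float = 1.0) -> None:
    """Close a server-side connection without racing the peer's writes.

    Instead of close(fd) -- which sends RST if unread data remains --
    shut down only the write side, then drain the peer until EOF or a
    short timeout, so the already-written response has a chance of
    being read by the peer.
    """
    sock.shutdown(socket.SHUT_WR)   # send FIN; the response is already written
    sock.settimeout(timeout)
    try:
        while sock.recv(4096):      # discard the rest of the request
            pass                    # b"" (EOF) means the peer closed too
    except OSError:
        pass                        # timeout or reset: stop waiting
    sock.close()
```

A server using this to reject an over-sized upload would write its 413 response, then call graceful_close(), so the client can still read the status line instead of hitting an RST.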

  • The downstream, unaware that Envoy has closed the HTTP keepalive connection, re-uses it.

    This is the keepalive re-use case mentioned above. Envoy has already called the kernel's close(fd), moving the socket into the FIN_WAIT_1/FIN_WAIT_2 state, and the FIN has gone out. But the downstream has not received it, or has received it without the application noticing, and meanwhile re-uses this HTTP keepalive connection to send an HTTP request. At the TCP level this looks like a half-closed connection, and the end that has not closed is indeed allowed to send data to its peer. But the kernel that has already called close(fd) (the Envoy end) drops any arriving data packet and returns an RST to the peer (with the fd closed, no process could ever read the data). The poor downstream then reports something like: Connection reset by peer.

    • A feasible way to reduce this race condition: configure the client side with a timeout smaller than Envoy's, so that it, not Envoy, closes the connection first.
Mitigations in Envoy's implementation
Mitigating Server Prematurely/Early Closes Connection

Github Issue: http: not proxying 413 correctly #2929

In the case envoy is proxying large HTTP request, even upstream returns 413, the client of proxy is getting 503.

Github PR: network: delayed conn close #4382 added the delayed_close_timeout configuration option.

Mitigate client read/close race issues on downstream HTTP connections by adding a new connection
close type 'FlushWriteAndDelay'. This new close type flushes the write buffer on a connection but
does not immediately close after emptying the buffer (unlike ConnectionCloseType::FlushWrite).

A timer has been added to track delayed closes for both 'FlushWrite' and 'FlushWriteAndDelay'. Upon
triggering, the socket will be closed and the connection will be cleaned up.

Delayed close processing can be disabled by setting the newly added HCM 'delayed_close_timeout'
config option to 0.

Risk Level: Medium (changes common case behavior for closing of downstream HTTP connections)
Testing: Unit tests and integration tests added.

But while mitigating the problem, the PR above also hurt performance:

Github Issue: HTTP/1.0 performance issues #19821

I was about to say it's probably delay-close related.

So HTTP in general can frame the response with one of three ways: content length, chunked encoding, or frame-by-connection-close.

If you don't haven an explicit content length, HTTP/1.1 will chunk, but HTTP/1.0 can only frame by connection close(FIN).

Meanwhile, there's another problem which is that if a client is sending data, and the request has not been completely read, a proxy responds with an error and closes the connection, many clients will get a TCP RST (due to uploading after FIN(close(fd))) and not actually read the response. That race is avoided with delayed_close_timeout.

It sounds like Envoy could do better detecting if a request is complete, and if so, framing with immediate close and I can pick that up. In the meantime if there's any way to have your backend set a content length that should work around the problem, or you can lower delay close in the interim.

So a further fix was needed:

Github PR: http: reduce delay-close issues for HTTP/1.1 and below #19863

Skipping delay close for:

  • HTTP/1.0 framed by connection close (as it simply reduces time to end-framing)
  • as well as HTTP/1.1 if the request is fully read (so there's no FIN-RST race)

Addresses the Envoy-specific parts of #19821
Runtime guard: envoy.reloadable_features.skip_delay_close

It also appears in the Envoy 1.22.0 release notes. Note that, to avoid hurting performance, delayed_close_timeout does not take effect in many cases:

http: avoiding delay-close for:

  • HTTP/1.0 responses framed by connection: close
  • as well as HTTP/1.1 if the request is fully read.

This means for responses to such requests, the FIN will be sent immediately after the response. This behavior can be temporarily reverted by setting envoy.reloadable_features.skip_delay_close to false. If clients are seen to be receiving sporadic partial responses and flipping this flag fixes it, please notify the project immediately.

Envoy sends a request on an upstream connection already closed by the upstream

Github Issue: Envoy (re)uses connection after receiving FIN from upstream #6815
With Envoy serving as HTTP/1.1 proxy, sometimes Envoy tries to reuse a connection even after receiving FIN from upstream. In production I saw this issue even with couple of seconds from FIN to next request, and Envoy never returned FIN ACK (just FIN from upstream to envoy, then PUSH with new HTTP request from Envoy to upstream). Then Envoy returns 503 UC even though upstream is up and operational.
Istio: 503's with UC's and TCP Fun Times

A sequence diagram of a classic scenario, from https://medium.com/@phylake/why-idle-timeouts-matter-1b3f7d4469fe

image.png

The Reverse Proxy in the diagram can be read as Envoy.

In essence:

  1. The upstream peer calls close(fd) on the socket. From then on, any data the kernel still receives on this TCP connection is doomed to be dropped and answered with an RST.
  2. The upstream peer has sent a FIN.
  3. The upstream socket state moves to FIN_WAIT_1 / FIN_WAIT_2.

On the Envoy side, two things are possible:

  • The socket state in Envoy's kernel has already been moved to CLOSE_WAIT by the peer's FIN, but the Envoy program (user space) has not caught up.
  • Or Envoy's kernel has not received the FIN at all, due to network latency or similar.

Yet the Envoy program re-uses this socket and sends (write(fd)) an HTTP request (assume it is split into several IP packets).

Here again there are two possibilities:

  • When some IP packet reaches the upstream peer, the peer returns an RST, and Envoy's subsequent socket writes may all fail. The error reads something like Upstream connection reset.
  • Since socket writes go through a send buffer and are asynchronous, Envoy may only discover that the socket was closed in the epoll event cycle after the RST arrives, via an EV_CLOSED event. The error reads the same: Upstream connection reset.
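A connection pool can narrow (though never fully close) this window by peeking at an idle socket before re-using it. The following Python sketch shows the idea; `looks_reusable` is a hypothetical helper, and Envoy's actual approach is different, as the rest of this section explains:

```python
import errno
import socket

def looks_reusable(sock: socket.socket) -> bool:
    """Peek at an idle keepalive connection before re-using it.

    A non-blocking MSG_PEEK sees a FIN (recv returns b"") or an RST
    (ECONNRESET) that has already reached our kernel. It cannot rule
    out a close that is still in flight, so retries are still needed.
    """
    try:
        data = sock.recv(1, socket.MSG_PEEK | socket.MSG_DONTWAIT)
    except BlockingIOError:
        return True          # nothing pending: no FIN/RST seen so far
    except OSError as e:
        if e.errno == errno.ECONNRESET:
            return False     # RST already received
        raise
    return data != b""       # b"" means FIN received: peer closed
```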

The Envoy community has discussed this problem at length. The odds can be reduced; complete avoidance is impossible:

Github Issue: HTTP1 conneciton pool attach pending request to half-closed connection #2715
The HTTP1 connection pool attach pending request when a response is complete. Though the upstream server may already closed the connection, this will result the pending request attached to it end up with 503.

The protocol- and configuration-level remedies:

HTTP/1.1 has this inherent timing issue. As I already explained, this is solved in practice by

a) setting Connection: Closed when closing a connection immediately and

b) having a reasonable idle timeout.

The feature @ramaraochavali is adding will allow setting the idle timeout to less than upstream idle timeout to help with this case. Beyond that, you should be using router level retries.

At bottom, because of this design flaw in HTTP/1.1, the problem cannot be avoided entirely. For idempotent operations, you still have to rely on a retry mechanism.

Mitigations in Envoy's implementation

In the implementation, the Envoy community once considered making an upstream connection wait several epoll event cycles before re-use, so that connection-state-update events could be observed. That approach was not great:

Github PR: Delay connection reuse for a poll cycle to catch closed connections.#7159(Not Merged)

So poll cycles are not an elegant way to solve this, when you delay N cycles, EOS may arrive in N+1-th cycle. The number is to be determined by the deployment so if we do this it should be configurable.

As noted in #2715, a retry (at Envoy level or application level) is preferred approach, #2715 (comment). Regardless of POST or GET, the status code 503 has a retry-able semantics defined in RFC 7231.

In the end, it was implemented with a connection re-use delay timer:

All well behaving HTTP/1.1 servers indicate they are going to close the connection if they are going to immediately close it (Envoy does this). As I have said over and over again here and in the linked issues, this is well known timing issue with HTTP/1.1.

So to summarize, the options here are to:

Drop this change
Implement it correctly with an optional re-use delay timer.

The final approach:

Github PR: http: delaying attach pending requests #2871(Merged)

Another approach to #2715, attach pending request in next event after onResponseComplete.

That is, an upstream connection is limited to carrying one HTTP request per epoll event cycle; a connection cannot be re-used by multiple HTTP requests within the same epoll event cycle. This reduces the chance that a connection whose kernel state is already CLOSE_WAIT (FIN received), unnoticed by Envoy user space, is re-used to send a request.

https://github.com/envoyproxy/envoy/pull/2871/files

@@ -209,25 +215,48 @@ void ConnPoolImpl::onResponseComplete(ActiveClient& client) {
    host_->cluster().stats().upstream_cx_max_requests_.inc();
    onDownstreamReset(client);
  } else {
-    processIdleClient(client);
    // Upstream connection might be closed right after response is complete. Setting delay=true
    // here to attach pending requests in next dispatcher loop to handle that case.
    // https://github.com/envoyproxy/envoy/issues/2715
+    processIdleClient(client, true);
  }
}

Some commentary: https://github.com/envoyproxy/envoy/issues/23625#issuecomment-1301108769

There's an inherent race condition that an upstream can close a connection at any point and Envoy may not yet know, assign it to be used, and find out it is closed. We attempt to avoid that by returning all connections to the pool to give the kernel a chance to inform us of FINs but can't avoid the race entirely.

In its implementation details, that Github PR itself had a bug, fixed later:
Github Issue: Missed upstream disconnect leading to 503 UC#6190

Github PR: http1: enable reads when final pipeline response received#6578

A side note: back in 2019, Istio maintained its own fork of the Envoy source to solve this problem itself: Istio Github PR: Fix connection reuse by delaying a poll cycle. #73. In the end, though, Istio returned to vanilla Envoy, adding only some necessary Envoy Filters implemented in native C++.

Mitigation via Istio configuration
Istio Github Issue: Almost every app gets UC errors, 0.012% of all requests in 24h period#13848
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: passthrough-retries
  namespace: myapp
spec:
  workloadSelector:
    labels:
      app: myapp
  configPatches:
  - applyTo: HTTP_ROUTE
    match:
      context: SIDECAR_INBOUND
      listener:
        portNumber: 8080
        filterChain:
          filter:
            name: "envoy.filters.network.http_connection_manager"
            subFilter:
              name: "envoy.filters.http.router"
    patch:
      operation: MERGE
      value:
        route:
          retry_policy:
            retry_back_off:
              base_interval: 10ms
            retry_on: reset
            num_retries: 2

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: qqq-destination-rule
spec:
  host: qqq.aaa.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        idleTimeout: 3s
        maxRetries: 3

Edge states and error logic of Linux connection closing

Excerpted from my: https://devops-insider.mygraphql.com/zh_CN/latest/kernel/network/socket/socket-close/socket-close.html

If you have persisted this far, congratulations: you have reached the main feature.

Closing a socket sounds like the simplest thing in the world. Isn't it just a close(fd) call? Let me take my time with this.

A model of socket closing

Before analyzing anything, I like to build a model of the parties involved, then refine it as the analysis proceeds. This keeps the analysis reasonably complete, and lets cause and effect be derived and checked. Crucially, models can be reused.
Studying socket closing is no exception.

image.png

Figure: a model of socket closing


The figure shows Machine A and Machine B with an established TCP socket between them. Taking Machine A as the example, the model is:

From bottom to top:

  • soft-IRQ / process kernel context handling inbound IP packets
  • the socket object
  • the send buffer attached to the socket object
  • the recv buffer attached to the socket object
  • the process does not access the socket object directly; a VFS layer sits in between, and the process reads and writes the socket through a File Descriptor (fd) handle

    • one socket can be referenced by multiple fds
  • the process uses an integer as the fd's id, passing it as a parameter when calling into the kernel

    • each fd has a read channel and a write channel, each of which can be closed independently
Having seen the model's static elements, here are some of its rules. These rules are enforced by kernel code, and are referenced below:

  • closing a socket FD's read channel

    • when the socket FD read channel is closed, if the recv buffer holds already-ACKed data that the application (user space) has not read, an RST is sent to the peer. Details here: {ref}kernel/network/kernel-tcp/tcp-reset/tcp-reset:TCP RST and unread socket recv buffer
    • after the socket FD read channel is closed, any data still arriving from the peer (TCP half-close) is dropped and mercilessly answered with an RST.

Relevant TCP protocol background

Only the parts related to closing are covered here.

:::{figure-md} Figure: normal TCP close sequence
:class: full-width

image.png

Figure: normal TCP close sequence - from [TCP.IP.Illustrated.Volume.1.The.Protocols]
:::

TCP Half-Close

TCP is a full-duplex connection. By design, the protocol supports long-lived situations in which one side's stream is closed while the other side's stream stays open. There is even a dedicated term for it: Half-Close.

:::{figure-md} Figure: TCP half-close sequence
:class: full-width

image.png

Figure: TCP half-close sequence - from [TCP.IP.Illustrated.Volume.1.The.Protocols]
:::

Plainly put: once one side decides it has nothing more to send, it can close its own write stream first and send a FIN to the peer, telling it: I will not be sending any more data.

Closing a socket fd

[W. Richard Stevens, Bill Fenner, Andrew M. Rudoff - UNIX Network Programming, Volume 1] says there are two kinds of functions for closing a socket fd:

  • close(fd)
  • shutdown(fd)

close(fd)

[W. Richard Stevens, Bill Fenner, Andrew M. Rudoff - UNIX Network Programming, Volume 1] - 4.9 close Function

Since the reference count was still greater than 0, this call to close did not
initiate TCP’s four-packet connection termination sequence.

That is: a socket keeps a reference count of its fds, and close decrements it. Only when the count reaches 0 does the connection-close sequence start (described below).

The default action of close with a TCP socket is to mark the socket as closed and return to the process immediately. <mark>The fd is no longer usable by the process: It cannot be used as an argument to read or write</mark>. But, TCP will try to send any data that is already queued to be sent to the other end, and after this occurs, the normal TCP connection termination sequence takes place.
we will describe the SO_LINGER socket option, which lets us change this default action with a TCP socket.

That is: by default, the call returns immediately. The closed fd can no longer be read or written. In the background, the kernel starts the connection-close sequence, and once all data in the socket send buffer has been sent, it finally sends a FIN to the peer.

Let's set SO_LINGER aside for now.

close(fd) actually closes both the fd's read channel and write channel. So, by the model rule above:

after the socket FD read channel is closed, any data still arriving from the peer is dropped and mercilessly answered with an RST.

So after a socket is closed with close(fd) (once the call returns), if the peer keeps sending data, either because it has not yet received the FIN, or because it received the FIN but treats the connection as merely half-closed, the kernel will mercilessly answer with an RST:

image.png

Figure: a closed socket answers incoming data with RST - from [UNIX Network Programming, Volume 1] - SO_LINGER Socket Option

This is a "flaw" in the design of the TCP protocol: a FIN can only tell the peer "I am closing my outbound stream"; there is no way to tell it "I no longer want to receive; I am closing my inbound stream". Kernel implementations can close the inbound stream, but TCP gives no way to inform the peer, so misunderstandings arise.

shutdown(fd)

As programmers, let's look at the function documentation first.

#include <sys/socket.h>
int shutdown(int sockfd, int howto);
[UNIX Network Programming, Volume 1] - 6.6 shutdown Function

The action of the function depends on the value of the howto argument.

  • SHUT_RD
    The read half of the connection is closed—No more data can be
    received on the socket and <mark>any data currently in the socket receive buffer is discarded</mark>. The process can no longer issue any of the read functions on the socket. <mark>Any data received after this call for a TCP socket is acknowledged and then silently discarded</mark>.
  • SHUT_WR
    The write half of the connection is closed—In the case of TCP, this is
    called a half-close (Section 18.5 of TCPv1). <mark>Any data currently in the socket send buffer will be sent, followed by TCP’s normal connection termination sequence</mark>. As we mentioned earlier, this closing of the write half is done regardless of whether or not the socket descriptor’s reference count is currently greater than 0. The process can no longer issue any of the write functions on the socket.
  • SHUT_RDWR

    The read half and the write half of the connection are both closed — This is equivalent to calling shutdown twice: first with SHUT_RD and then with SHUT_WR.

[UNIX Network Programming, Volume 1] - 6.6 shutdown Function

The normal way to terminate a network connection is to call the close function. But,
there are two limitations with close that can be avoided with shutdown:

  1. close decrements the descriptor’s reference count and closes the socket only if
    the count reaches 0. We talked about this in Section 4.8.

    With shutdown, we can initiate TCP’s normal connection termination sequence (the four segments
    beginning with a FIN), regardless of the reference count.

  2. close terminates both directions of data transfer, reading and writing. Since a TCP connection is full-duplex, <mark>there are times when we want to tell the other end that we have finished sending, even though that end might have more data to send us</mark> (i.e., a half-close of the TCP connection).

That is: shutdown(fd) can close either direction of the full-duplex fd. There are two usual use cases:

  • close only the outbound (write) side, implementing a TCP half-close
  • close both the outbound and inbound (write & read) sides

image.png

Figure: TCP half-close via shutdown(fd) - from [UNIX Network Programming, Volume 1]
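The half-close sequence in the figure can be exercised directly from Python's socket API. A loopback sketch: the client stops sending with SHUT_WR, while the server reads to EOF and still replies on the other direction of the stream.

```python
import socket

# Loopback demo of a TCP half-close via shutdown(SHUT_WR).
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()

cli.sendall(b"request")
cli.shutdown(socket.SHUT_WR)      # FIN: "I will not send any more data"

req = b""
while True:
    chunk = conn.recv(4096)
    if not chunk:                 # b"" = EOF: the client's FIN arrived
        break
    req += chunk

conn.sendall(b"reply")            # the server->client direction still works
conn.close()

reply = cli.recv(4096)            # still readable after SHUT_WR
```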

SO_LINGER

[UNIX Network Programming, Volume 1] - SO_LINGER Socket Option

This option specifies how the close function operates for a connection-oriented protocol (e.g., for TCP and SCTP, but not for UDP). By default, close returns immediately, but if there is any data still remaining in the socket send buffer, the system will try to deliver the data to the peer. The default behavior of close(fd) is:

  • The call returns immediately. The kernel starts the normal close sequence in the background: asynchronously send the data in the socket send buffer, and finally send a FIN.
  • If the socket receive buffer holds data, that data is discarded.

SO_LINGER, as the name suggests, means "to linger".

SO_LINGER is a socket option, defined as follows:

 struct linger {
     int l_onoff;  /* 0=off, nonzero=on */
     int l_linger; /* linger time, POSIX specifies units as seconds */
 };

SO_LINGER affects the behavior of close(fd) as follows:

  1. If l_onoff is 0, the option is turned off. The value of l_linger is ignored and
    the previously discussed TCP default applies: close returns immediately. (This is the default behavior.)
  2. If l_onoff is nonzero

    • l_linger is zero, TCP aborts the connection when it is closed (pp. 1019 – 1020 of TCPv2). That is, TCP discards any data still remaining in the socket send buffer and sends an RST to the peer, not the normal four-packet connection termination sequence. This avoids TCP’s TIME_WAIT state, but in doing so, leaves open the possibility of another incarnation of this connection being created within 2MSL seconds and having old duplicate segments from the just-terminated connection being incorrectly delivered to the new incarnation. (That is, RST gives fast recycling of the local port, with side effects, of course.)
    • l_linger is nonzero, then the kernel will linger when the socket is closed (p. 472 of TCPv2). That is, if there is any data still remaining in the socket send buffer, the process is put to sleep until either: (that is, the process calling close(fd) waits for a successful send with ACK, or a timeout)
    • all the data is sent and acknowledged by the peer TCP
    • or the linger time expires.

    If the socket has been set to nonblocking, it will not wait for the close to complete, even if the linger time is nonzero.

    When using this feature of the SO_LINGER option, it is important for the application to check the return value from close, because if the linger time expires before the remaining data is sent and acknowledged, close returns EWOULDBLOCK and any remaining data in the send buffer is discarded. (That is, with this enabled, if the ACK has not arrived before the timeout, data in the socket send buffer may be lost.)
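The abortive-close case (l_onoff=1, l_linger=0) can be observed on a loopback socket from Python; this is a Linux-semantics sketch, not a recommendation to use it casually:

```python
import errno
import socket
import struct
import time

# Abortive close: SO_LINGER {l_onoff=1, l_linger=0} makes close() send
# an RST (skipping FIN and TIME_WAIT) and discard both socket buffers.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()

conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                struct.pack("ii", 1, 0))   # l_onoff=1, l_linger=0
conn.close()                               # RST goes out immediately
time.sleep(0.1)                            # let the RST reach the client

err = 0
try:
    cli.recv(4096)                         # read on a reset connection
except OSError as e:
    err = e.errno                          # ECONNRESET

srv.close()
```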

Summary of close/shutdown and SO_LINGER

[UNIX Network Programming, Volume 1] - SO_LINGER Socket Option

| Function | Description |
| --- | --- |
| shutdown, SHUT_RD | No more receives can be issued on socket; process can still send on socket; socket receive buffer discarded; any further data received is discarded; no effect on socket send buffer. |
| shutdown, SHUT_WR (the most common use of shutdown) | No more sends can be issued on socket; process can still receive on socket; contents of socket send buffer sent to other end, followed by normal TCP connection termination (FIN); no effect on socket receive buffer. |
| close, l_onoff = 0 (default) | No more receives or sends can be issued on socket; contents of socket send buffer sent to other end. If descriptor reference count becomes 0: normal TCP connection termination (FIN) sent following data in send buffer; socket receive buffer discarded (data unread by user space is dropped; <mark>note: on Linux, if already-ACKed data was left unread by user space, an RST is sent to the peer</mark>). |
| close, l_onoff = 1, l_linger = 0 | No more receives or sends can be issued on socket. If descriptor reference count becomes 0: RST sent to other end; connection state set to CLOSED (no TIME_WAIT state); socket send buffer and socket receive buffer discarded. |
| close, l_onoff = 1, l_linger != 0 | No more receives or sends can be issued on socket; contents of socket send buffer sent to other end. If descriptor reference count becomes 0: normal TCP connection termination (FIN) sent following data in send buffer; socket receive buffer discarded; if the linger time expires before the connection is CLOSED, close returns EWOULDBLOCK and any remaining data in the send buffer is discarded. |

Good further reading:

  • An article from 2016: Resetting a TCP connection and SO_LINGER

Closing words

As long as the long-term direction is broadly right, sustained investment and focus may not solve problems or pay off right away. But some day, they will surprise you. To every engineer.

image.png
