The incident unfolded roughly as follows:
thread exhaustion → Docker hung → CSI kept reconnecting → the node went down. At that point the only way out was to force a reboot of the CVM's operating system.
[root@k8s-master01 log]# grep Resource daemon.log
Nov 18 13:56:05 localhost systemd[113303]: Failed to fork: Resource temporarily unavailable
Nov 18 13:56:05 localhost systemd[113303]: Failed to fork: Resource temporarily unavailable
Nov 18 14:04:04 localhost systemd[4136]: gpgconf: error forking process: Resource temporarily unavailable
Nov 18 14:04:04 localhost systemd[4136]: gpgconf: could not gather active options from '/usr/bin/gpg-agent': Resource temporarily unavailable
Nov 18 14:09:51 localhost systemd[8911]: PAM failed: Resource temporarily unavailable
Nov 18 14:09:51 localhost systemd[8911]: user@0.service: Failed to set up PAM session: Resource temporarily unavailable
Nov 18 14:09:51 localhost systemd[8911]: user@0.service: Failed at step PAM spawning /lib/systemd/systemd: Resource temporarily unavailable
Nov 18 14:10:04 localhost systemd[4136]: PAM failed: Resource temporarily unavailable
Nov 18 14:10:04 localhost systemd[4136]: user@0.service: Failed to set up PAM session: Resource temporarily unavailable
Nov 18 14:10:04 localhost systemd[4136]: user@0.service: Failed at step PAM spawning /lib/systemd/systemd: Resource temporarily unavailable
Nov 18 14:19:02 localhost systemd[9470]: /usr/lib/systemd/user-environment-generators/90gpg-agent: fork: retry: Resource temporarily unavailable
Nov 18 14:19:03 localhost systemd[9470]: /usr/lib/systemd/user-environment-generators/90gpg-agent: fork: retry: Resource temporarily unavailable
Nov 18 14:23:02 localhost dockerd[4447]: runtime/cgo: pthread_create failed: Resource temporarily unavailable
Nov 18 14:23:03 localhost dockerd[2699]: runtime/cgo: pthread_create failed: Resource temporarily unavailable
Nov 18 14:23:03 localhost dockerd[2699]: runtime/cgo: pthread_create failed: Resource temporarily unavailable
Nov 18 14:23:03 localhost systemd[1]: docker.service: Failed to fork: Resource temporarily unavailable
Nov 18 14:23:03 localhost systemd[1]: docker.service: Failed to run 'start' task: Resource temporarily unavailable
Nov 18 14:23:31 localhost systemd[1]: blk-availability.service: Failed to fork: Resource temporarily unavailable
Nov 18 14:23:31 localhost systemd[1]: blk-availability.service: Failed to run 'stop' task: Resource temporarily unavailable
Nov 18 14:25:02 localhost kubelet[1601]: I1118 14:25:02.377330 1601 docker_service.go:263] Docker Info: &{ID:FXVG:DYHB:GCQQ:7UPM:SKM2:CEB7:JAR5:H5OP:M2VN:YJEH:SV4J:64QL Containers:103 ContainersRunning:52 ContainersPaused:0 ContainersStopped:51 Images:467 Driver:overlay2 DriverStatus:[[Backing Filesystem extfs] [Supports d_type true] [Native Overlay Diff true]] SystemStatus:[] Plugins:{Volume:[local] Network:[bridge host ipvlan macvlan null overlay] Authorization:[] Log:[awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog]} MemoryLimit:true SwapLimit:false KernelMemory:true KernelMemoryTCP:false CPUCfsPeriod:true CPUCfsQuota:true CPUShares:true CPUSet:true PidsLimit:false IPv4Forwarding:true BridgeNfIptables:false BridgeNfIP6tables:false Debug:false NFd:23 OomKillDisable:true NGoroutines:45 SystemTime:2025-11-18T14:25:02.365472033+08:00 LoggingDriver:json-file CgroupDriver:cgroupfs NEventsListener:0 KernelVersion:4.19.0-server-amd64 OperatingSystem:UnionTech OS Server 20 Enterprise OSType:linux Architecture:x86_64 IndexServerAddress:https://index.docker.io/v1/ RegistryConfig:0xc0002750a0 NCPU:64 MemTotal:269931233280 GenericResources:[] DockerRootDir:/data/docker HTTPProxy: HTTPSProxy: NoProxy: Name:localhost.localdomain Labels:[] ExperimentalBuild:true ServerVersion:18.09.9 ClusterStore: ClusterAdvertise: Runtimes:map[runc:{Path:runc Args:[]}] DefaultRuntime:runc Swarm:{NodeID: NodeAddr: LocalNodeState:inactive ControlAvailable:false Error: RemoteManagers:[] Nodes:0 Managers:0 Cluster:<nil> Warnings:[]} LiveRestoreEnabled:true Isolation: InitBinary:docker-init ContainerdCommit:{ID:894b81a4b802e4eb2a91d1ce216b8817763c29fb Expected:894b81a4b802e4eb2a91d1ce216b8817763c29fb} RuncCommit:{ID:425e105d5a03fabd737a126ad93d62a9eeede87f-dirty Expected:425e105d5a03fabd737a126ad93d62a9eeede87f-dirty} InitCommit:{ID:fec3683 Expected:fec3683} SecurityOptions:[name=apparmor name=seccomp,profile=default] ProductLicense:Community Engine Warnings:[WARNING: 
No swap limit support WARNING: bridge-nf-call-iptables is disabled WARNING: bridge-nf-call-ip6tables is disabled]}
Nov 18 14:25:02 localhost kubelet[1601]: I1118 14:25:02.416119 1601 fs_resource_analyzer.go:64] Starting FS ResourceAnalyzer
I. Analysis
Phase 1: system-wide thread exhaustion
Nov 18 13:56:05 - systemd[113303]: Failed to fork: Resource temporarily unavailable
Nov 18 14:04:04 - gpgconf: error forking process: Resource temporarily unavailable
Nov 18 14:09:51 - PAM failed: Resource temporarily unavailable
What this shows: at the OS level, new processes and threads can no longer be created.
Phase 2: Docker is hit
Nov 18 14:23:02 - dockerd[4447]: runtime/cgo: pthread_create failed: Resource temporarily unavailable
Nov 18 14:23:03 - dockerd[2699]: runtime/cgo: pthread_create failed: Resource temporarily unavailable
Nov 18 14:23:03 - docker.service: Failed to fork: Resource temporarily unavailable
What this shows: Docker can no longer create threads and begins to fail.
Phase 3: cascading failure of system services
Nov 18 14:23:31 - blk-availability.service: Failed to fork: Resource temporarily unavailable
What this shows: even basic system services can no longer run.
Clues from the Docker Info dump:
Containers:103
ContainersRunning:52
ContainersPaused:0
ContainersStopped:51
The tell: the node holds 52 running containers plus 51 stopped ones, 103 containers in total!
II. Operational response
1. Triage thread usage immediately
Check the current thread state:
# 1. System-wide thread count and limits
echo "Thread limit: $(cat /proc/sys/kernel/threads-max)"
echo "Current thread count: $(ps -eLf | wc -l)"
echo "PID limit: $(cat /proc/sys/kernel/pid_max)"
# 2. Rank processes by thread count (one row per process, sorted by nlwp)
ps -e -o pid,comm,nlwp --no-headers | sort -k3 -nr | head -20
# 3. Thread counts of Docker-related processes
for pid in $(pgrep -f docker); do
echo "Docker process $pid: $(ps -L -p $pid --no-headers | wc -l) threads"
done
# 4. Thread counts per container
docker ps -q | xargs docker inspect --format='{{.State.Pid}}' 2>/dev/null | while read pid; do
if [ -n "$pid" ] && [ -d "/proc/$pid" ]; then
echo "Container PID $pid: $(ps -L -p $pid --no-headers | wc -l) threads"
fi
done
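The raw numbers from step 1 are more useful as a ratio than as absolute values. A minimal sketch (my own addition, not part of the incident tooling) that compares the live thread count against the kernel limit and warns when headroom runs low:

```shell
# Compare current threads against kernel limits; the 10% threshold is an
# arbitrary choice for illustration.
threads_max=$(cat /proc/sys/kernel/threads-max)
pid_max=$(cat /proc/sys/kernel/pid_max)
# ps -eLf prints one line per thread plus a header line, so subtract 1
current=$(( $(ps -eLf | wc -l) - 1 ))
echo "threads: $current / $threads_max (pid_max: $pid_max)"
if [ $(( current * 10 )) -ge $(( threads_max * 9 )) ]; then
    echo "WARNING: less than 10% thread headroom left"
fi
```

Wiring this into a cron job or node-exporter textfile collector gives early warning long before fork() starts failing.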
2. Check Kubernetes Pod status
# Pod count on this node
kubectl get pods --all-namespaces -o wide | grep $(hostname) | wc -l
# Pods in an abnormal phase
kubectl get pods --all-namespaces --field-selector status.phase!=Running
# Pods restarting frequently (RESTARTS is column 5 with --all-namespaces)
kubectl get pods --all-namespaces --no-headers | sort -k5 -nr | head -10
III. Classifying thread leaks
The usual culprits:
1. Java applications (thread leaks)
# Thread counts of Java processes (in ps -eLf output, CMD starts at field 10)
ps -eLf | grep java | awk '{print $10}' | sort | uniq -c | sort -nr
# Or aggregate by PID
ps -eLf | awk '$10 ~ /java/ {print $2}' | sort | uniq -c | sort -nr
2. Go applications (goroutine leaks)
# Check Go processes
ps -eLf | grep -E "(go|bin/)" | awk '{print $10}' | sort | uniq -c | sort -nr
3. Network connection leaks
# Total connection count
ss -tun | wc -l
# Connections per owning process (process info is field 7 in ss -tunp output)
ss -tunp | awk 'NR > 1 {print $7}' | sort | uniq -c | sort -nr
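When ss is unavailable or its column layout differs, the same information can be read straight from the kernel. A sketch, assuming a standard procfs layout, that tallies TCP sockets per state from /proc/net/tcp (field 4 is the state in hex: 01=ESTABLISHED, 06=TIME_WAIT, 0A=LISTEN):

```shell
# Count TCP sockets per state directly from procfs; NR > 1 skips the header.
awk 'NR > 1 { states[$4]++ }
     END { for (s in states) print states[s], s }' /proc/net/tcp | sort -nr
```

A sudden pile-up of one state (for example thousands of 01/ESTABLISHED rows owned by one process) usually points at the connection leak.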
4. Watch high-thread-count processes
# Watch thread counts change in real time
while true; do
clear
echo "=== Thread count ranking ==="
ps -eLf | awk '{print $10}' | sort | uniq -c | sort -nr | head -10
echo "=== Total threads: $(ps -eLf | wc -l) ==="
sleep 5
done
IV. Emergency recovery
Free resources immediately:
# 1. Remove stopped containers (frees resources right away)
docker container prune -f
# 2. Restart the worst thread-leaking process
# Identify the culprit first, then restart selectively
# 3. Raise thread limits temporarily (emergencies only)
echo 200000 > /proc/sys/kernel/threads-max
echo 200000 > /proc/sys/kernel/pid_max
# 4. Restart the Docker service
systemctl restart docker
Long-term fixes:
# Persist the limits in /etc/sysctl.conf
echo "kernel.threads-max = 200000" >> /etc/sysctl.conf
echo "kernel.pid_max = 200000" >> /etc/sysctl.conf
echo "vm.max_map_count = 262144" >> /etc/sysctl.conf
# Apply the configuration
sysctl -p
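Raising the limits only treats the symptom. The Docker Info dump above shows PidsLimit:false, i.e. no per-container PID cap, so a single leaking container can exhaust the whole node. A sketch of the caps available at each layer (field names and availability depend on your Docker, kubelet, and systemd versions; verify before rolling out, and 4096/8192 are illustrative values, not recommendations):

```shell
# Cap PIDs/threads for a single container at run time:
docker run --pids-limit=4096 <image>

# In Kubernetes, cap PIDs per pod via the kubelet configuration file:
#   podPidsLimit: 4096
# (legacy flag form: kubelet --pod-max-pids=4096)

# For systemd-managed services, cap the task count in the unit file:
#   [Service]
#   TasksMax=8192
```

With a per-container cap in place, a leaking workload hits its own limit and fails alone instead of taking the node down.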
Root cause: some application (most likely a Java or Go workload) leaked threads/goroutines until the system's thread supply was exhausted.
Fault chain:
application thread leak → system threads exhausted → Docker cannot create threads → container management fails → CSI plugin reconnects → node hangs
Immediate actions:
Find the process leaking threads
Remove stopped containers to free resources
Restart the offending application
Raise the system thread limits
Run the thread-ranking command first: the process that has created a huge number of threads is the root of the problem.
Operational checks:
1. Rank thread counts live
# Rank every process by thread count (the most direct method; CMD starts at field 10 in ps -eLf)
ps -eLf | awk '{print $10}' | sort | uniq -c | sort -nr | head -20
2. Thread counts per process
# A cleaner per-process thread summary
ps -eTo pid,ppid,pgid,tid,user,pcpu,comm --sort=-pcpu | head -20
# Or, more concisely (one row per process, not per thread)
ps -e -o pid,comm,nlwp | sort -k3 -nr | head -20
Output fields:
pid: process ID; comm: command name; nlwp: number of threads
3. Inspect a specific process's threads
# The five processes with the most threads
ps -e -o pid,comm,nlwp | sort -k3 -nr | head -5
# Then drill into the worst offender
PID=$(ps -e -o pid,comm,nlwp --no-headers | sort -k3 -nr | head -1 | awk '{print $1}')
echo "Inspecting PID: $PID"
# Overview of the process
ps -p $PID -o pid,ppid,comm,pcpu,pmem,vsz,rss,nlwp
# All of its threads
ps -L -p $PID -o tid,pcpu,state,comm
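An alternative that does not depend on ps: read the Threads: field from /proc/&lt;pid&gt;/status directly. (Each awk invocation still forks, so this is a portability fallback, not a fork-free method; on a box where fork is already failing, expect some rows to be missing.)

```shell
# Scan /proc, read the per-process thread count from status, print the top 10.
for d in /proc/[0-9]*; do
    pid=${d#/proc/}
    # Both files can vanish mid-scan if the process exits; ignore errors
    threads=$(awk '/^Threads:/ {print $2}' "$d/status" 2>/dev/null)
    name=$(cat "$d/comm" 2>/dev/null)
    [ -n "$threads" ] && echo "$threads $pid $name"
done | sort -nr | head -10
```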
4. Thread counts of container runtime processes
# Thread counts for Docker, containerd and kubelet
echo "=== Docker-related process thread counts ==="
for proc in docker containerd kubelet; do
pids=$(pgrep $proc)
for pid in $pids; do
thread_count=$(ps -L -p $pid 2>/dev/null | wc -l)
if [ -n "$thread_count" ] && [ "$thread_count" -gt 1 ]; then
echo "$proc(PID:$pid): $((thread_count-1)) threads"
fi
done
done
5. Thread counts per container
# Thread count of every running container (via docker inspect)
docker ps -q | while read container; do
pid=$(docker inspect $container --format '{{.State.Pid}}' 2>/dev/null)
if [ -n "$pid" ] && [ "$pid" -ne 0 ]; then
thread_count=$(ps -L -p $pid 2>/dev/null | wc -l)
container_name=$(docker inspect $container --format '{{.Name}}' | sed 's|/||')
echo "Container $container_name (PID:$pid): $((thread_count-1)) threads"
fi
done
6. Monitor thread counts in real time
# A simple live monitor
while true; do
clear
echo "=== Thread monitor $(date) ==="
echo "Total threads: $(ps -eLf | wc -l)"
echo ""
echo "=== Top 10, aggregated by command name ==="
ps -e -o comm,nlwp --no-headers | awk '{arr[$1]+=$2} END {for (i in arr) print arr[i], i}' | sort -nr | head -10
echo ""
echo "=== Per-process detail ==="
ps -e -o pid,comm,nlwp | sort -k3 -nr | head -10
sleep 5
done
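The loop above shows absolute counts; a leak is easier to spot as growth. A small sketch (the /tmp/nlwp.* paths are arbitrary) that snapshots per-process thread counts twice and prints only the processes whose count increased in between:

```shell
# Snapshot nlwp per process twice and report increases; steady growth over
# repeated runs is the signature of a leak. Assumes comm contains no spaces.
INTERVAL=${INTERVAL:-5}
ps -e -o pid=,comm=,nlwp= > /tmp/nlwp.before
sleep "$INTERVAL"
ps -e -o pid=,comm=,nlwp= > /tmp/nlwp.after
# Join the two snapshots on PID: remember the first pass, report increases
awk 'NR==FNR { prev[$1]=$3; next }
     ($1 in prev) && $3 > prev[$1] { print $1, $2, prev[$1] " -> " $3 }' \
    /tmp/nlwp.before /tmp/nlwp.after
```

Run it a few times in a row: a healthy process fluctuates, a leaking one appears in every pass with a rising count.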
Prime suspects
In a K8s environment, focus on these processes:
Java applications (classic thread leakers)
# Thread counts of all Java processes
ps -eLf | grep java | awk '{print $10}' | sort | uniq -c | sort -nr
# Or
ps -e -o pid,comm,nlwp | grep java | sort -k3 -nr
Go applications (goroutine leaks)
ps -e -o pid,comm,nlwp | grep -E "(go|bin/)" | sort -k3 -nr
Network proxies and sidecars
# Envoy, Istio, Nginx, etc.
ps -e -o pid,comm,nlwp | grep -E "(envoy|nginx|istio)" | sort -k3 -nr
📋 Automated triage script
A single script covering the checks above:
#!/bin/bash
echo "========== System thread state =========="
echo "Thread limit: $(cat /proc/sys/kernel/threads-max)"
echo "Current total threads: $(ps -eLf | wc -l)"
echo "PID limit: $(cat /proc/sys/kernel/pid_max)"
echo ""
echo "========== Top 15 by thread count =========="
ps -e -o pid,comm,nlwp --no-headers | awk '
{
threads[$1] = $3
comm[$1] = $2
}
END {
for (pid in threads) {
print threads[pid] " " pid " " comm[pid]
}
}' | sort -nr | head -15
echo ""
echo "========== Aggregated by process name =========="
ps -e -o comm,nlwp --no-headers | awk '
{
total[$1] += $2
}
END {
for (proc in total) {
print total[proc] " " proc
}
}' | sort -nr | head -10
V. Triage summary
Suggested order:
- First run:
ps -e -o pid,comm,nlwp | sort -k3 -nr | head -20
- Once you have the high-thread-count PID:
ps -L -p <PID> -o tid,pcpu,state,comm
- If the PID belongs to a container: check that container's resource usage
- Keep watching: use the monitoring loop above to track how the counts change
The usual suspects:
- Java applications (misconfigured thread pools)
- Go applications (goroutine leaks)
- Network proxies (huge connection counts)
- Database connection pools
Once the process with the abnormally high thread count is identified, the fix can be targeted.