Analyzing a Memory Spike in a .NET MES Host-Computer Image Capture System

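I: Background

1. The story

A friend from my training camp reached out to me. Their system occasionally suffers sudden memory spikes, he could not find the cause on his own, and he asked me to take a look. He sent over a dump file of more than 20 GB, which is really large. I generally recommend keeping dumps under 10 GB, otherwise WinDbg has a hard time analyzing them.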

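II: Analyzing the memory spike

1. Why did memory spike

Same approach as always: use !address -summary to look at committed memory, then !eeheap -gc to size the managed heap. The output is as follows: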


0:000> !address -summary

--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
Free                                   1870     5ff8`c8447000 (  95.972 TB)           74.98%
<unknown>                              1064     2005`7faca000 (  32.021 TB)  99.98%   25.02%
Heap                                   3594        1`56a34000 (   5.354 GB)   0.02%    0.00%
Image                                  4747        0`35dfb000 ( 861.980 MB)   0.00%    0.00%
Stack                                   522        0`2b440000 ( 692.250 MB)   0.00%    0.00%
Other                                   314        0`00313000 (   3.074 MB)   0.00%    0.00%
TEB                                     174        0`0015c000 (   1.359 MB)   0.00%    0.00%
PEB                                       1        0`00001000 (   4.000 kB)   0.00%    0.00%

--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_FREE                               1870     5ff8`c8447000 (  95.972 TB)           74.98%
MEM_RESERVE                            2326     2001`b95a7000 (  32.007 TB)  99.93%   25.01%
MEM_COMMIT                             8090        5`7e602000 (  21.975 GB)   0.07%    0.02%

0:000> !eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x0000013e0f5919d8
generation 1 starts at 0x0000013e0f49a8b0
generation 2 starts at 0x0000013e09f21000
ephemeral segment allocation context: none
         segment             begin         allocated              size
0000013e09f20000  0000013e09f21000  0000013e0fb15b20  0x5bf4b20(96422688)
Large object heap starts at 0x0000013e19f21000
         segment             begin         allocated              size
0000013e19f20000  0000013e19f21000  0000013e211b6f50  0x7295f50(120151888)
...
00000143d6850000  00000143d6851000  00000143db009118  0x47b8118(75202840)
Total Size:              Size: 0x33bd0f148 (13888450888) bytes.
------------------------------
GC Heap Size:            Size: 0x33bd0f148 (13888450888) bytes.

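The output shows 21.9 GB of committed memory (MEM_COMMIT), 5.3 GB of native Heap, and a managed heap of 13,888,450,888 bytes (13.8 GB in decimal units, about 12.9 GB in the binary units that !address reports). The managed heap accounts for more than half of the committed memory, so that is where to start.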
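
The comparison above mixes hex sizes from !address with a byte count from !eeheap, so it helps to put them on one scale. The following is a minimal sketch that does the arithmetic in Python; the constants are copied from the output above, while the helper names `gib` and `share_of_commit` are made up for this illustration.

```python
# Sizes copied from the debugger output above.
MEM_COMMIT = int("5`7e602000".replace("`", ""), 16)   # !address -summary, MEM_COMMIT row
NATIVE_HEAP = int("1`56a34000".replace("`", ""), 16)  # !address -summary, Heap row
GC_HEAP = 0x33bd0f148                                  # !eeheap -gc, GC Heap Size

GIB = 1024 ** 3  # the "GB" printed by !address is a binary gigabyte


def gib(size_in_bytes: int) -> float:
    """Convert a byte count to binary gigabytes (hypothetical helper)."""
    return size_in_bytes / GIB


def share_of_commit(size_in_bytes: int) -> float:
    """Fraction of committed memory taken by a region (hypothetical helper)."""
    return size_in_bytes / MEM_COMMIT


for name, size in [("MEM_COMMIT", MEM_COMMIT),
                   ("native Heap", NATIVE_HEAP),
                   ("GC heap", GC_HEAP)]:
    print(f"{name:12} {gib(size):7.3f} GiB  {share_of_commit(size):6.1%} of commit")
```

On these numbers the GC heap comes out at a bit under 60% of committed memory, which is what justifies looking at the managed side first.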

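2. What is wrong with the managed heap

To see what is taking up managed memory, the powerful PerfView can give a quick picture of which GC roots hold the most. Screenshot below:

The output clearly shows that FinalizerQueue holds almost all of the managed memory. If you know how the finalizer queue works, the next step is obvious.

Next, use the !fq command to inspect the finalizer queue. Reference output: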


0:000> !fq
SyncBlocks to be cleaned up: 0
Free-Threaded Interfaces to be released: 0
MTA Interfaces to be released: 0
STA Interfaces to be released: 0
----------------------------------
generation 0 has 2722 finalizable objects (0000013f4c737e08->0000013f4c73d318)
generation 1 has 73 finalizable objects (0000013f4c737bc0->0000013f4c737e08)
generation 2 has 20328 finalizable objects (0000013f4c710080->0000013f4c737bc0)
Ready for finalization 34482 objects (0000013f4c73d318->0000013f4c7808a8)
Statistics for all finalizable objects (including all objects ready for finalization):

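The "Ready for finalization" line above is the Freachable section of the finalizer queue, which is where the finalizer thread picks up the objects it has to finalize. About 34,000 objects (34,482) have piled up in that section, which suggests something is wrong with the finalizer thread.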
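
The counts in the !fq output can be cross-checked against the address ranges printed next to them. Below is a small sketch in Python, assuming each queue slot is one 8-byte pointer (the dump is from a 64-bit process); `parse_fq` and the regular expression are written for this illustration and are not part of any debugger tooling.

```python
import re

POINTER_SIZE = 8  # assumption: one 8-byte slot per queued object in a 64-bit process

# Lines copied from the !fq output above.
FQ_OUTPUT = """\
generation 0 has 2722 finalizable objects (0000013f4c737e08->0000013f4c73d318)
generation 1 has 73 finalizable objects (0000013f4c737bc0->0000013f4c737e08)
generation 2 has 20328 finalizable objects (0000013f4c710080->0000013f4c737bc0)
Ready for finalization 34482 objects (0000013f4c73d318->0000013f4c7808a8)
"""

LINE = re.compile(
    r"^(?P<label>.+?) (?:has )?(?P<count>\d+) (?:finalizable )?objects "
    r"\((?P<start>[0-9a-f]+)->(?P<end>[0-9a-f]+)\)$"
)


def parse_fq(text):
    """Return {section label: reported count and count implied by the range}."""
    sections = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            start = int(match["start"], 16)
            end = int(match["end"], 16)
            sections[match["label"]] = {
                "reported": int(match["count"]),
                "from_range": (end - start) // POINTER_SIZE,
            }
    return sections


sections = parse_fq(FQ_OUTPUT)
total = sum(s["reported"] for s in sections.values())
ready = sections["Ready for finalization"]["reported"]
print(f"{ready} of {total} registered objects are waiting for the finalizer thread")
```

The range-derived counts agree with the reported ones, and well over half of all registered finalizable objects are sitting in the Freachable section rather than being drained.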

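3. What is wrong with the finalizer thread

To find the finalizer thread, list the threads with !t, switch to the one marked (Finalizer), and look at its call stack.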


0:000> !t
ThreadCount:      104
UnstartedThread:  0
BackgroundThread: 40
PendingThread:    0
DeadThread:       63
Hosted Runtime:   no
                                                                                                        Lock  
       ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   0    1 3854 0000013e082beb60    26020 Preemptive  0000013E0F6E63A0:0000013E0F6E79D8 0000013e08293ef0 0     STA 
   5    2  708 0000013e082e7bd0    2b220 Preemptive  0000000000000000:0000000000000000 0000013e08293ef0 0     MTA (Finalizer) 

0:000> ~~[708]s
win32u!NtUserMessageCall+0x14:
00007ff8`6b151124 c3              ret
0:005> k
 # Child-SP          RetAddr               Call Site
00 00000029`dbdfea38 00007ff8`6cce1082     win32u!NtUserMessageCall+0x14
01 00000029`dbdfea40 00007fff`9879b2d0     user32!SendMessageTimeoutW+0x102
02 00000029`dbdfead0 00007fff`985c4dc7     halcon!IOWIN32DumpToTexture+0xc90
03 00000029`dbdfef60 00007fff`974bff0e     halcon!IPGenImaMask+0xae7
04 00000029`dbdfefd0 00007fff`9739d0ca     halcon!HHandleClear+0x10e
05 00000029`dbdff050 00007ff7`f5d5a1a2     halcon!HLIClearHandle+0x2a
06 00000029`dbdff090 00007ff7`f5d5b571     halcondotnet!HalconDotNet.HHandleBase.ClearHandleInternal+0x92
07 00000029`dbdff140 00007ff7`f5ddf865     halcondotnet!HalconDotNet.HHandleBase.Dispose+0x21
08 00000029`dbdff180 00007ff8`542d67b6     halcondotnet!HalconDotNet.HHandleBase.Finalize+0x15
09 00000029`dbdff1c0 00007ff8`544934a1     clr!FastCallFinalizeWorker+0x6
0a 00000029`dbdff1f0 00007ff8`54493429     clr!FastCallFinalize+0x55
0b 00000029`dbdff240 00007ff8`54493358     clr!MethodTable::CallFinalizer+0xb5
0c 00000029`dbdff290 00007ff8`5449318b     clr!CallFinalizer+0x5e
0d 00000029`dbdff2d0 00007ff8`544930a4     clr!FinalizerThread::DoOneFinalization+0x95
0e 00000029`dbdff3b0 00007ff8`544923fa     clr!FinalizerThread::FinalizeAllObjects+0xbf
0f 00000029`dbdff3f0 00007ff8`542d7be8     clr!FinalizerThread::FinalizerThreadWorker+0xba
10 00000029`dbdff440 00007ff8`542d7b53     clr!ManagedThreadBase_DispatchInner+0x40
11 00000029`dbdff480 00007ff8`542d7a92     clr!ManagedThreadBase_DispatchMiddle+0x6c
12 00000029`dbdff580 00007ff8`5441c316     clr!ManagedThreadBase_DispatchOuter+0x4c
13 00000029`dbdff5f0 00007ff8`542dbcc5     clr!FinalizerThread::FinalizerThreadStart+0x116
14 00000029`dbdff690 00007ff8`6b3a7374     clr!Thread::intermediateThreadProc+0x8b
15 00000029`dbdff750 00007ff8`6d35cc91     kernel32!BaseThreadInitThunk+0x14
16 00000029`dbdff780 00000000`00000000     ntdll!RtlUserThreadStart+0x21

0:005> !clrstack
OS Thread Id: 0x708 (5)
        Child SP               IP Call Site
00000029dbdff0b8 00007ff86b151124 [InlinedCallFrame: 00000029dbdff0b8] HalconDotNet.HalconAPI.ClearHandle(IntPtr)
00000029dbdff0b8 00007ff7f5d5a1a2 [InlinedCallFrame: 00000029dbdff0b8] HalconDotNet.HalconAPI.ClearHandle(IntPtr)
00000029dbdff090 00007ff7f5d5a1a2 HalconDotNet.HHandleBase.ClearHandleInternal()
00000029dbdff140 00007ff7f5d5b571 HalconDotNet.HHandleBase.Dispose(Boolean)
00000029dbdff180 00007ff7f5ddf865 HalconDotNet.HHandleBase.Finalize()
00000029dbdff5d0 00007ff8542d67b6 [DebuggerU2MCatchHandlerFrame: 00000029dbdff5d0] 

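Judging from the stack, this is a nasty trap: releasing a halcon handle actually has to talk to a window, through user32!SendMessageTimeoutW and the underlying NtUserMessageCall. The window handle is in the rcx register (r15 holds the same value). The output is as follows: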


0:005> r
rax=0000000000001007 rbx=00000029dbdfef10 rcx=00000000000f3736
rdx=000000000000c258 rsi=000000000000c258 rdi=0000000000000000
rip=00007ff86b151124 rsp=00000029dbdfea38 rbp=00007fff985c4ed0
 r8=0000000000000015  r9=0000000000000000 r10=00007fff96d40000
r11=0000000000000000 r12=00000029dbdfefb0 r13=00007fff985c4ed0
r14=0000000000000e20 r15=00000000000f3736
iopl=0         nv up ei pl zr na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246
win32u!NtUserMessageCall+0x14:
00007ff8`6b151124 c3              ret
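
Before looking the window up, the handle can be pulled out of the register dump mechanically. This is a rough sketch in Python: `parse_registers` is a helper invented here, the register lines are copied from the `r` output above, and printing the value as eight uppercase hex digits is only my assumption about the form a window-inspection tool expects.

```python
import re

# General-purpose register lines copied from the `r` output above.
R_OUTPUT = """\
rax=0000000000001007 rbx=00000029dbdfef10 rcx=00000000000f3736
rdx=000000000000c258 rsi=000000000000c258 rdi=0000000000000000
rip=00007ff86b151124 rsp=00000029dbdfea38 rbp=00007fff985c4ed0
 r8=0000000000000015  r9=0000000000000000 r10=00007fff96d40000
r11=0000000000000000 r12=00000029dbdfefb0 r13=00007fff985c4ed0
r14=0000000000000e20 r15=00000000000f3736
"""


def parse_registers(text):
    """Map register name to integer value for every `name=hex` pair."""
    pairs = re.findall(r"\b(r\w+)=([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


regs = parse_registers(R_OUTPUT)
hwnd = regs["rcx"]  # first integer argument register in the Windows x64 convention

# r15 carrying the same value makes it more plausible that rcx was not clobbered.
same_in_r15 = regs["r15"] == hwnd
print(f"candidate window handle: {hwnd:08X} (also in r15: {same_in_r15})")
```

The formatted handle is the value to search for when hunting down the window.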

The next question is which window the handle in rcx belongs to. That takes the powerful Spy++ tool, which I have covered in earlier posts. Screenshot below:

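At this point the whole chain of events is clear: an unresponsive window blocks the finalizer thread, the objects waiting behind it are never finalized and their memory is never reclaimed, and the result is disastrous. I told my friend to focus on halcon and to probe the window with Spy++.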
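
The mechanism can be shown with a toy model. The sketch below is plain Python, not CLR or halcon code: one worker thread stands in for the single finalizer thread, and `window_bound_cleanup` is a made-up stand-in for a finalizer that waits on a window. It only illustrates how one blocked cleanup stalls everything queued behind it.

```python
import queue
import threading

window_responds = threading.Event()  # stands in for the window answering the message
blocked_inside = threading.Event()   # set once the worker is stuck in the cleanup


def quick_cleanup():
    """A finalizer that returns immediately."""


def window_bound_cleanup():
    """A finalizer that cannot return until the window responds."""
    blocked_inside.set()
    window_responds.wait()


def start_finalizer_model(cleanups):
    """Run all cleanups, in order, on ONE worker thread."""
    pending = queue.Queue()
    for cleanup in cleanups:
        pending.put(cleanup)
    finished = []

    def worker():
        while True:
            try:
                cleanup = pending.get_nowait()
            except queue.Empty:
                return
            cleanup()
            finished.append(cleanup)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread, pending, finished


cleanups = [quick_cleanup] * 2 + [window_bound_cleanup] + [quick_cleanup] * 5
thread, pending, finished = start_finalizer_model(cleanups)

blocked_inside.wait(timeout=5)
backlog_while_blocked = pending.qsize()  # cleanups stuck behind the blocked one
finished_while_blocked = len(finished)

window_responds.set()                    # the window starts answering again
thread.join(timeout=5)
backlog_after = pending.qsize()
print(backlog_while_blocked, finished_while_blocked, backlog_after)
```

While the worker is stuck, every later cleanup stays queued no matter how cheap it is. In the real process the same thing happens to the objects in the Freachable section, which is why fixing the unresponsive window (or disposing the halcon handles explicitly on a thread that can reach it) matters more than tuning the GC.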

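III: Summary

As a debugging engineer, make good use of more than one analysis tool. Doing so often gets you twice the result for half the effort.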
