Verifying a folk legend with eBPF: can Java GC logging stall the entire JVM service?

Overview

Real-world Java applications all write GC logs, but not everyone knows that this small log can stall the entire JVM service. This article uses eBPF and other analysis methods to verify whether, in a concrete environment, the problem actually exists.

When proofreading and publishing this article, I had only been back on my feet for a few hours after a second bout of COVID, and after weeks of writing I really did not want to delay publication any longer. If there are mistakes or omissions, please bear with me and point them out.

Java service stalls are the eternal king of topics in the Java world. When Java stalls come to mind, most people's first reaction is these keywords:

  • GC
  • Stop the world(STW)

Writing GC logs causing Java stalls, on the other hand, is probably a niche topic. But I did witness it at a previous e-commerce company. That was around 12/8 in 2017, still on Java 8; the experts involved spent roughly a few weeks on analysis and experiments before delivering a diagnosis and a prescription:

  • Diagnosis: writing the GC log caused JVM GC stalls
  • Prescription: keep the GC log off the busy disk (write it to a ram disk / tmpfs), or write it to a dedicated disk (a sketch follows this list).
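
A minimal sketch of the second prescription, assuming /dev/shm is a tmpfs mount on the host; MyApp and the log file name are placeholders:

# JDK 8 style: point the GC log at a tmpfs (RAM-backed) path
java -Xloggc:/dev/shm/myapp-gc.log -XX:+PrintGCDetails -cp . MyApp

# JDK 9+ unified-logging style
java -Xlog:gc*:file=/dev/shm/myapp-gc.log -cp . MyApp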

Fast-forward to 2023, with OpenJDK 21 already here. Does the problem still exist? A look at the search-engine results:

  • Eliminating Large JVM GC Pauses Caused by Background IO Traffic - 2016, Zhenyun Zhuang@LinkedIn

    • OS-Caused Large JVM Pauses: Investigations and Solutions - 2016, Zhenyun Zhuang@LinkedIn
  • JVM性能調優--YGC (JVM performance tuning: YGC) - heapdump.cn
  • JVM 輸出 GC 日誌導致 JVM 卡住,我 TM 人傻了 (JVM GC logging hangs the JVM, and I'm losing my mind)

OpenJDK's fix

The problem is by now widely recognised, and OpenJDK has shipped a fix:

  • Buffered Logging? - rramakrishna@twitter.com
  • JDK-8229517: Support for optional asynchronous/buffered logging

    • Git Commit
  • JDK-8264323: Add a global option -Xlog:async for Unified Logging
  • Asynchronous Logging in Corretto 17 - Xin Liu@amazon.com

Unfortunately, to keep new releases behaviour-compatible with old ones, this fix is disabled by default: you need to add the Java command-line flag -Xlog:async to manually switch on the "feature" of writing GC logs from an asynchronous native thread.
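
For example (a sketch; the log path and main class are placeholders, and the associated buffer-size knob introduced with the async logging work can be tuned separately if needed):

# let a dedicated native thread flush unified-logging output instead of blocking the logging thread
java -Xlog:async -Xlog:gc*:file=/var/log/myapp/gc.log -cp . MyApp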

So the question becomes: do we simply add the flag and call it a day, or do we first verify whether the service's stalls really originate from this problem, and only then consider changing the flag? Naturally, this article chooses the latter.

Fundamentals

When does the JVM write the GC log?

Safepoint

[Figure: the main safepoint flow. From https://krzysztofslusarski.github.io/2022/12/12/async-manual....]

The figure above roughly illustrates the main safepoint flow. If you are interested, follow the link above or search around; plenty of good articles already explain it, so I will not lecture the experts here.

One thing to note: GC is not the only user of safepoint STOP THE WORLD (STW); there are others.
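
An easy way to see this for yourself is to let the JVM log its safepoints, which names the VM operation each safepoint served (a sketch; your-app.jar is a placeholder):

# print a line per safepoint, including the VM operation that requested it
java -Xlog:safepoint -jar your-app.jar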

Here is just a brief explanation (I am not a JVM expert, so treat the following with caution):

  1. Global safepoint request

    1.1 Some thread issues a request to enter a safepoint, carrying a safepoint operation argument; that argument is essentially the callback to run once STOP THE WORLD (STW) is in effect. It may be a failed allocation that triggers GC, or some other reason.

    1.2 A thread called "VM Thread", on receiving the safepoint request, sets a JVM-global safepoint flag to true (a simplification: if you are an expert, you know this "flag" is really implemented with an OS memory page-protection / page-fault trick).

    1.3 The "VM Thread" then waits for the application threads (App threads) to reach (enter) the safepoint.

    1.4 Application threads poll this safepoint flag at high frequency; when a thread sees it is true, it reaches (enters) the safepoint state.

    Source link: SafepointSynchronize::begin()

  2. Global safepoint

    When the "VM Thread" sees that all App threads have reached the safepoint (the real start of the STW), it executes the safepoint operation. In this article that is, of course, the GC operation.

    Source link: RuntimeService::record_safepoint_synchronized()

  3. End of safepoint operation

    Once the safepoint operation finishes, the "VM Thread" ends the STW.

    Source link: SafepointSynchronize::end()

GC

The key point above: the GC operation runs after the STW begins, and the STW ends only after it completes. The GC operation writes the GC log, and by default that ends up in a write syscall. Proof below.
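
As a quick first check, before the full script below, you can simply count file writes issued from the JVM's "VM Thread" itself (a sketch; it assumes the thread is named "VM Thread", as it appears later in the top -H output):

sudo bpftrace -e 'kprobe:vfs_write /comm == "VM Thread"/ { @[ustack(5)] = count(); }'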

Diagnosis

Diagnosis - the eBPF tracing approach

First, look at the bpftrace program trace-safepoint.bt:

#!/usr/local/bin/bpftrace
// trace-safepoint.bt

BEGIN
{}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*begin*
{
    @java_pid[pid] = pid;
}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*begin*
/@java_pid[pid]/
{
    @sp_begin[pid, tid, comm, ustack] = count();
    @vm_thread_tid[tid] = tid;
    @vm_thread_waiting_sp_sync[tid] = tid;
    @vm_thread_in_STW[tid] = tid;
    printf("SafepointSynchronize::begin %-6d %-16s\n", tid, comm);
}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*RuntimeService*record_safepoint_synchronized*
/@java_pid[pid]/
{
    @sp_synchronized[pid, tid, comm, ustack] = count();
    delete (@vm_thread_waiting_sp_sync[tid]);
    printf("RuntimeService::record_safepoint_synchronized %-6d %-16s\n", tid, comm);
}

uretprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*end*
/@java_pid[pid]/
{
    delete (@vm_thread_in_STW[tid]);
    printf("AsyncLogWriter: %-6d %-16s %s\n", pid, comm, func);
    printf("SafepointSynchronize::end %-6d %-16s\n", tid, comm);
}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*AsyncLogWriter*
/@java_pid[pid] && @vm_thread_in_STW[tid]/
{
    printf("AsyncLogWriter: %-6d %-16s %s\n", pid, comm, func);
    @AsyncLogWriter[pid, tid, comm, ustack] = count();
}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*LogFileOutput*
/@java_pid[pid] && @vm_thread_in_STW[tid]/
{
    printf("LogFileOutput: %-6d %-16s %s\n", pid, comm, func);
    @LogFileOutput[pid, tid, comm, ustack] = count();
}

tracepoint:block:block_rq_complete,
tracepoint:block:block_bio_complete
/@java_pid[pid]/
{
    @block_bio_complete[pid, tid, comm, kstack(4)] = count();
}

tracepoint:syscalls:sys_enter_sync,
tracepoint:syscalls:sys_enter_syncfs,
tracepoint:syscalls:sys_enter_fsync,
tracepoint:syscalls:sys_enter_fdatasync,
tracepoint:syscalls:sys_enter_sync_file_range*,
tracepoint:syscalls:sys_enter_msync
/@java_pid[pid]/
{
    time("%H:%M:%S  ");
    printf("sync syscall: %-6d %-16s %s\n", pid, comm, probe);

    @sync_call[pid, tid, comm, probe, ustack(6)] = count();
    printf("sp sync_call: %-6d %-16s %s\n", pid, comm, probe);

}

kprobe:vfs_read*,
kprobe:vfs_write*,
kprobe:vfs_fsync,
kprobe:vfs_open,
kprobe:vfs_create
/@java_pid[pid]/
{
    @vfs_call[pid, tid, comm, func, ustack(6)] = count();
    printf("sp vfs_call: %-6d %-16s %s\n", pid, comm, probe);
}

END
{}
This program assumes only one java process is running on the machine; if there are several, it needs a small tweak (see the sketch below).
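
One way to do that, for example, is to drop the @java_pid map and filter on a PID passed as a bpftrace positional parameter; a sketch with just one of the probes (the others would change the same way):

sudo bpftrace -e '
uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*begin*
/pid == $1/
{
    printf("SafepointSynchronize::begin %-6d %-16s\n", tid, comm);
}' $(pgrep -n java)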

The BPF probe points we mainly look at are:

  1. uprobe:SafepointSynchronize::begin()
  2. uprobe:RuntimeService::record_safepoint_synchronized()
  3. uprobe:LogFileOutput::*() : the JVM code that writes log output to file
  4. uprobe:AsyncLogWriter::*() : the JVM asynchronous-logging code
  5. kprobe:vfs_write : the kernel file-write operation
  6. uprobe:SafepointSynchronize::end()

Then write a simple Java program, TestGCLog.java:

// TestGCLog.java
import java.util.*;

public class TestGCLog {
    List<Object> objList = new ArrayList<>();
    byte[] m = new byte[100 * 1024]; // each instance carries ~100 KB to pressure the young gen

    public static void main(String[] arg) throws Exception {
        TestGCLog t = new TestGCLog();
        t.test();
    }

    void test() throws Exception {
        long lastprint = System.currentTimeMillis();
        while (true) {
            objList.clear();
            long start = System.currentTimeMillis();
            TestGCLog newObj = new TestGCLog();
            long end = System.currentTimeMillis();
            if (end - start > 10) { // report allocations that stalled for more than 10 ms
                System.out.println(new Date() + " - TestGCLog: slow new object: " + (end - start) + "ms");
            }
            objList.add(newObj);
            Thread.sleep(5);
            long now = System.currentTimeMillis();
            if (now - lastprint > 5 * 1000) { // heartbeat every 5 seconds
                System.out.println(new Date() + " - TestGCLog");
                lastprint = now;
            }
        }
    }
}

Compile and run the Java program, then run bpftrace:

/usr/lib/jvm/java-17-openjdk-amd64/bin/javac TestGCLog.java

/usr/lib/jvm/java-17-openjdk-amd64/bin/java -Xloggc:/mnt/dm0/gc.log  -XX:+UnlockDiagnosticVMOptions -XX:+PreserveFramePointer -Xms128M -Xmx128M -cp . TestGCLog
sudo bpftrace ./trace-safepoint.bt

The bpftrace program produces some interesting output.

Order of events:

SafepointSynchronize::begin 422373 VM Thread       
RuntimeService::record_safepoint_synchronized 422373 VM Thread       
LogFileOutput: 422361 VM Thread        LogFileOutput::write(LogDecorations const&, char const*)
AsyncLogWriter: 422361 VM Thread        AsyncLogWriter::instance()
sp vfs_call: 422361 VM Thread        kprobe:vfs_write
AsyncLogWriter: 422361 VM Thread        VMThread::inner_execute(VM_Operation*)
SafepointSynchronize::end 422373 VM Thread 

As you can see, the vfs_write syscall that writes the file happens synchronously during the STW, i.e. between SafepointSynchronize::begin and SafepointSynchronize::end.
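
To quantify how long those in-STW writes take, a latency histogram of vfs_write from the VM Thread is a simple follow-up (a sketch, independent of the script above):

sudo bpftrace -e '
kprobe:vfs_write /comm == "VM Thread"/ { @start[tid] = nsecs; }
kretprobe:vfs_write /@start[tid]/ {
    // histogram of write duration, in microseconds
    @vm_thread_write_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'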

Some stack traces:

@AsyncLogWriter[422361, 422373, VM Thread, 
    AsyncLogWriter::instance()+0
    LogTagSet::vwrite(LogLevel::type, char const*, __va_list_tag*)+471
    LogTargetHandle::print(char const*, ...)+159
    LogStream::write(char const*, unsigned long)+338
    outputStream::do_vsnprintf_and_write_with_automatic_buffer(char const*, __va_list_tag*, bool)+194
    outputStream::print_cr(char const*, ...)+407
    GCTraceTimeLoggerImpl::log_end(TimeInstant<CompositeCounterRepresentation, CompositeElapsedCounterSource>)+288
    G1CollectedHeap::do_collection_pause_at_safepoint_helper(double)+2116
    G1CollectedHeap::do_collection_pause_at_safepoint(double)+53
    VM_G1CollectForAllocation::doit()+75
    VM_Operation::evaluate()+226
    VMThread::evaluate_operation(VM_Operation*)+177
    VMThread::inner_execute(VM_Operation*)+436
    VMThread::run()+199
    Thread::call_run()+196
    thread_native_entry(Thread*)+233
    start_thread+755
]: 18

@LogFileOutput[422361, 422373, VM Thread, 
    LogFileOutput::write(LogDecorations const&, char const*)+0
    LogTargetHandle::print(char const*, ...)+159
    LogStream::write(char const*, unsigned long)+338
    outputStream::do_vsnprintf_and_write_with_automatic_buffer(char const*, __va_list_tag*, bool)+194
    outputStream::print_cr(char const*, ...)+407
    GCTraceTimeLoggerImpl::log_end(TimeInstant<CompositeCounterRepresentation, CompositeElapsedCounterSource>)+288
    G1CollectedHeap::do_collection_pause_at_safepoint_helper(double)+2116
    G1CollectedHeap::do_collection_pause_at_safepoint(double)+53
    VM_G1CollectForAllocation::doit()+75
    VM_Operation::evaluate()+226
    VMThread::evaluate_operation(VM_Operation*)+177
    VMThread::inner_execute(VM_Operation*)+436
    VMThread::run()+199
    Thread::call_run()+196
    thread_native_entry(Thread*)+233
    start_thread+755
]: 17

@sp_begin[404917, 404924, VM Thread, 
    SafepointSynchronize::begin()+0
    VMThread::run()+199
    Thread::call_run()+196
    thread_native_entry(Thread*)+233
    start_thread+755
]: 1
@sp_synchronized[404917, 404924, VM Thread, 
    RuntimeService::record_safepoint_synchronized(long)+0
    VMThread::inner_execute(VM_Operation*)+389
    VMThread::run()+199
    Thread::call_run()+196
    thread_native_entry(Thread*)+233
    start_thread+755
]: 1

You can refer to the [How to emulate IO stalls on Linux] section below to emulate slow IO, or even a complete hang, on the disk that the GC log is written to.

If you suspend the target disk with sudo dmsetup suspend /dev/$your_dm_dev, you can see directly from the Java program's stdout and from top that the entire JVM has paused as well.

If you then inspect thread states with top -H -p $(pgrep java), you will find a VM Thread in the D (uninterruptible sleep) state.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND           
 430796 labile    20   0 3068740 112820  20416 S   0.3   0.2   0:00.57 VM Periodic Tas   
 430779 labile    20   0 3068740 112820  20416 S   0.0   0.2   0:00.00 java              
 430780 labile    20   0 3068740 112820  20416 S   0.0   0.2   0:02.96 java              
 430781 labile    20   0 3068740 112820  20416 S   0.0   0.2   0:00.02 GC Thread#0       
 430782 labile    20   0 3068740 112820  20416 S   0.0   0.2   0:00.00 G1 Main Marker       
 430785 labile    20   0 3068740 112820  20416 S   0.0   0.2   0:00.05 G1 Service        
 430786 labile    20   0 3068740 112820  20416 D   0.0   0.2   0:00.04 VM Thread 

If you are curious where the VM Thread is stuck, look at:

$ sudo cat /proc/`pgrep java`/task/$thread_id_ofVM_Thread/stack
[<0>] percpu_rwsem_wait+0x107/0x130
[<0>] vfs_write+0x274/0x2a0  <----
[<0>] ksys_write+0x67/0xf0  <----
[<0>] __x64_sys_write+0x19/0x30
[<0>] do_syscall_64+0x5c/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

or:
[<0>] do_get_write_access+0x261/0x330
[<0>] jbd2_journal_get_write_access+0x54/0xa0
[<0>] __ext4_journal_get_write_access+0x8f/0x1e0
[<0>] ext4_reserve_inode_write+0xa0/0xe0
[<0>] __ext4_mark_inode_dirty+0x57/0x230
[<0>] ext4_dirty_inode+0x5c/0x90
[<0>] __mark_inode_dirty+0x5e/0x3b0
[<0>] generic_update_time+0x6c/0xf0
[<0>] file_update_time+0x12f/0x150
[<0>] file_modified+0x27/0x40
[<0>] ext4_buffered_write_iter+0x53/0x130
[<0>] ext4_file_write_iter+0x49/0x80
[<0>] new_sync_write+0xfe/0x190 <----
[<0>] vfs_write+0x20d/0x2a0 <----
[<0>] ksys_write+0x67/0xf0
[<0>] __x64_sys_write+0x19/0x30
[<0>] do_syscall_64+0x5c/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Diagnosis - the gdb tracing approach

Since the focus of this article is BPF, gdb gets only a brief mention.

For setting up gdb to debug OpenJDK, see the [A JDK gdb environment on Ubuntu] section below.

With gdb you can set breakpoints in the JVM functions and kernel calls of interest.

$ gdb -ex 'handle SIGSEGV noprint nostop pass' -p `pgrep java`

(gdb) thread thread_id_of_vm_thread

(gdb) bt
#0  __GI___libc_write (nbytes=95, buf=0x7f7cc4031390, fd=4) at ../sysdeps/unix/sysv/linux/write.c:26
#1  __GI___libc_write (fd=4, buf=0x7f7cc4031390, nbytes=95) at ../sysdeps/unix/sysv/linux/write.c:24
#2  0x00007f7cca88af6d in _IO_new_file_write (f=0x7f7cc4000c20, data=0x7f7cc4031390, n=95) at ./libio/fileops.c:1180
#3  0x00007f7cca88ca61 in new_do_write (to_do=95, data=0x7f7cc4031390 "[3598.772s][info][gc] GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms\n", fp=0x7f7cc4000c20) at ./libio/libioP.h:947
#4  _IO_new_do_write (to_do=95, data=0x7f7cc4031390 "[3598.772s][info][gc] GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms\n", fp=0x7f7cc4000c20) at ./libio/fileops.c:425
#5  _IO_new_do_write (fp=0x7f7cc4000c20, data=0x7f7cc4031390 "[3598.772s][info][gc] GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms\n", to_do=95) at ./libio/fileops.c:422
#6  0x00007f7cca88a558 in _IO_new_file_sync (fp=0x7f7cc4000c20) at ./libio/fileops.c:798
#7  0x00007f7cca87f22a in __GI__IO_fflush (fp=0x7f7cc4000c20) at ./libio/libioP.h:947
#8  0x00007f7cc9eae679 in LogFileStreamOutput::flush (this=0x7f7cc4003e70) at src/hotspot/share/logging/logFileStreamOutput.cpp:93
#9  LogFileStreamOutput::write (this=this@entry=0x7f7cc4003e70, decorations=..., msg=msg@entry=0x7f7cac39b5c0 "GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms")
    at src/hotspot/share/logging/logFileStreamOutput.cpp:131
#10 0x00007f7cc9eadade in LogFileOutput::write_blocking (msg=0x7f7cac39b5c0 "GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms", decorations=..., this=0x7f7cc4003e70)
    at src/hotspot/share/logging/logFileOutput.cpp:308
#11 LogFileOutput::write (msg=0x7f7cac39b5c0 "GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms", decorations=..., this=0x7f7cc4003e70) at src/hotspot/share/logging/logFileOutput.cpp:332
#12 LogFileOutput::write (this=0x7f7cc4003e70, decorations=..., msg=0x7f7cac39b5c0 "GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms") at src/hotspot/share/logging/logFileOutput.cpp:320
#13 0x00007f7cc9eb3a47 in LogTagSet::log (msg=0x7f7cac39b5c0 "GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms", level=LogLevel::Info, 
    this=0x7f7cca74bee0 <LogTagSetMapping<(LogTag::type)43, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::_tagset>) at src/hotspot/share/logging/logTagSet.cpp:85
#14 LogTagSet::vwrite (this=0x7f7cca74bee0 <LogTagSetMapping<(LogTag::type)43, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::_tagset>, level=LogLevel::Info, fmt=<optimized out>, 
    args=<optimized out>) at src/hotspot/share/logging/logTagSet.cpp:150
#15 0x00007f7cc9b7f09f in LogTargetHandle::print (this=this@entry=0x7f7cac39c2f0, fmt=fmt@entry=0x7f7cca41dd6e "%s") at src/hotspot/share/logging/logHandle.hpp:95
#16 0x00007f7cc9eb3232 in LogStream::write (this=0x7f7cac39c260, s=0x7f7cac39b930 " 3.261ms\n", len=9) at src/hotspot/share/logging/logStream.hpp:52
#17 0x00007f7cca03eaa2 in outputStream::do_vsnprintf_and_write_with_automatic_buffer (this=this@entry=0x7f7cac39c260, format=format@entry=0x7f7cca498e6c " %.3fms", ap=ap@entry=0x7f7cac39c158, add_cr=add_cr@entry=true)
    at src/hotspot/share/utilities/ostream.cpp:131
#18 0x00007f7cca03f427 in outputStream::do_vsnprintf_and_write (add_cr=true, ap=0x7f7cac39c158, format=0x7f7cca498e6c " %.3fms", this=0x7f7cac39c260) at src/hotspot/share/utilities/ostream.cpp:144
#19 outputStream::print_cr (this=this@entry=0x7f7cac39c260, format=format@entry=0x7f7cca498e6c " %.3fms") at src/hotspot/share/utilities/ostream.cpp:158
#20 0x00007f7cc9b8fec0 in GCTraceTimeLoggerImpl::log_end (this=<optimized out>, end=...) at src/hotspot/share/gc/shared/gcTraceTime.cpp:70
#21 0x00007f7cc9afc8c4 in GCTraceTimeDriver::at_end (end=..., cb=<optimized out>, this=0x7f7cac39c688) at src/hotspot/share/gc/shared/gcTraceTime.inline.hpp:78
#22 GCTraceTimeDriver::~GCTraceTimeDriver (this=0x7f7cac39c688, __in_chrg=<optimized out>) at src/hotspot/share/gc/shared/gcTraceTime.inline.hpp:61
#23 GCTraceTimeImpl::~GCTraceTimeImpl (this=0x7f7cac39c618, __in_chrg=<optimized out>) at src/hotspot/share/gc/shared/gcTraceTime.hpp:141
#24 GCTraceTimeWrapper<(LogLevel::type)3, (LogTag::type)43, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::~GCTraceTimeWrapper (this=0x7f7cac39c610, __in_chrg=<optimized out>)
    at src/hotspot/share/gc/shared/gcTraceTime.inline.hpp:184
#25 G1CollectedHeap::do_collection_pause_at_safepoint_helper (this=this@entry=0x7f7cc4026ef0, target_pause_time_ms=target_pause_time_ms@entry=200) at src/hotspot/share/gc/g1/g1CollectedHeap.cpp:3071
#26 0x00007f7cc9afcb45 in G1CollectedHeap::do_collection_pause_at_safepoint (this=this@entry=0x7f7cc4026ef0, target_pause_time_ms=200) at src/hotspot/share/gc/g1/g1CollectedHeap.cpp:2887
#27 0x00007f7cc9b7902b in VM_G1CollectForAllocation::doit (this=0x7f7cc8ffe690) at src/hotspot/share/gc/g1/g1VMOperations.cpp:146
#28 0x00007f7cca35b632 in VM_Operation::evaluate (this=this@entry=0x7f7cc8ffe690) at src/hotspot/share/runtime/vmOperations.cpp:70
#29 0x00007f7cca35ced1 in VMThread::evaluate_operation (this=this@entry=0x7f7cc40e0f10, op=0x7f7cc8ffe690) at src/hotspot/share/runtime/vmThread.cpp:269
#30 0x00007f7cca35d2d4 in VMThread::inner_execute (this=this@entry=0x7f7cc40e0f10, op=<optimized out>) at src/hotspot/share/runtime/vmThread.cpp:415
#31 0x00007f7cca35d5c7 in VMThread::loop (this=0x7f7cc40e0f10) at src/hotspot/share/runtime/vmThread.cpp:482
#32 VMThread::run (this=0x7f7cc40e0f10) at src/hotspot/share/runtime/vmThread.cpp:162
#33 0x00007f7cca2d7834 in Thread::call_run (this=0x7f7cc40e0f10) at src/hotspot/share/runtime/thread.cpp:394
#34 0x00007f7cca0326c9 in thread_native_entry (thread=0x7f7cc40e0f10) at src/hotspot/os/linux/os_linux.cpp:724
#35 0x00007f7cca894b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#36 0x00007f7cca926a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Problems that cloud-native environments may aggravate

  • Direct Page Reclaim, once worker-node memory is shared or container memory is limited

    In a cloud environment, one machine may run many containers. Some containers are IO-heavy applications (for example, heavy log writers); they occupy page cache and put contention pressure on the shared disk. Either machine-wide memory shortage or the container's own memory limit can trigger Direct Page Reclaim.

  • IO contention on the local disks shared by a worker node
  • Network-backed PVs such as Ceph are far less stable than local disks; their bandwidth and latency fluctuate.

In a configuration like the one below, multiple containers may end up sharing the local disk that backs emptyDir. Under IO contention, writing the GC log can then hit high latency while inside the STW, directly pushing service API latency past its timeout (a quick check follows the spec below).

apiVersion: v1
kind: Pod
metadata:
  annotations:
  labels:
  name: xyz
spec:
  containers:
  - command: java ... -Xloggc:/usr/share/myapp/logs/gc.log ...
    ...
    volumeMounts:
    - mountPath: /usr/share/myapp/logs
      name: myapp-logs
  volumes:
  - emptyDir: {}
    name: myapp-logs
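
A quick, hypothetical way to check which device actually backs the GC log path in such a Pod (the pod name xyz and the mount path come from the spec above, and the container is assumed to have df available):

kubectl exec xyz -- df -h /usr/share/myapp/logs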

Closing words

Money is a fine thing, and so is attention; people go to great lengths to obtain them, but both have side effects. Look at the IT world today: a kind of buzzword worship appears all too easily, with every third sentence dropping a pile of new technology keywords. Knowing that your hammer is good is fine, but not everything is a nail. Understanding a thing itself, its properties and its shortcomings, is what good technology selection is really about. Some misuse of BPF has already started in the market; let us wait and see. I still believe the most mature use case for eBPF today is tracing problems that are hard to analyse by other means, such as abnormal kernel behaviour.

Later on, I plan to write two twin articles: "Verifying a folk legend with eBPF: can mmap + Java Safepoint stall the entire JVM service?" and "How to measure process-level IO latency".

"Learning without thought is labour lost; thought without learning is perilous." -- The Analects, Wei Zheng (論語·為政)

Appendix

Why can file writes stall in a Linux application?

Readers who know the Linux IO machinery may object: an application's write syscall only copies data into the page cache (marking pages dirty), and dedicated writeback (pdflush) threads perform the real IO later, so the application itself should normally not stall on file writes, unless it calls fsync() itself to force data to disk.

Indeed, Linux's main IO write path is mostly asynchronous. But every rule has exceptions (the devil is in the details), which is exactly what makes the programmer's world fun: Why buffered writes are sometimes stalled - YOSHINORI MATSUNOBU. That article boils down to a few possibilities:

  • READ MODIFY WRITE - some cases require reading the block before writing it
  • WRITE() MAY BE BLOCKED FOR "STABLE PAGE WRITES" - the page being written is already dirty and is currently locked by the writeback thread, about to be flushed to disk.
  • WAITING FOR JOURNAL BLOCK ALLOCATION IN EXT3/4

Is that list complete? I think there is at least one more possibility:

  • Direct Page Reclaim - memory page reclamation

When an application asks the kernel for memory and the kernel finds that the currently free memory (for a container/cgroup, only memory within the limit counts) cannot satisfy the request, it may suspend the current thread and go free more memory. That work typically includes:

  1. Flushing page-cache pages previously marked dirty to disk, which is of course a slow IO operation. The current thread may enter the "uninterruptible sleep" state until the IO writes and the memory freeing complete.

In a cloud environment, one machine may run many containers. Some containers are IO-heavy applications (for example, heavy log writers); they occupy page cache and put contention pressure on the shared disk. Either machine-wide memory shortage or the container's own memory limit can trigger Direct Page Reclaim.

For more, see: Our lessons on Linux writeback: do "dirty" jobs the right way

Brendan Gregg, whom I look up to, has a BPF (bpftrace) diagnostic for this: vmscan.bt
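
In the same spirit, here is a minimal sketch (not vmscan.bt itself) that measures how long threads spend in direct reclaim, using the kernel's vmscan tracepoints:

sudo bpftrace -e '
tracepoint:vmscan:mm_vmscan_direct_reclaim_begin { @start[tid] = nsecs; }
tracepoint:vmscan:mm_vmscan_direct_reclaim_end /@start[tid]/ {
    // time spent in direct reclaim, per command, in microseconds
    @direct_reclaim_us[comm] = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'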

How to emulate IO stalls on Linux

Roughly two methods for emulating IO stalls can be found online:

  • the device-mapper delay method

    • Emulate a slow block device with dm-delay - Christophe Vu-Brugier
    • Emulate a slow block device with dm-delay - flamingbytes
    • dm-delay
  • the SystemTap delay-injection method

    • 巧用Systemtap注入延遲模擬IO設備抖動 (Using SystemTap to inject delays and emulate IO device jitter)
    • 使用SystemTap給I/O設備註入延遲 (Injecting latency into I/O devices with SystemTap)

This article uses the device-mapper delay method: the mapped device itself carries a configured delay.

# load the brd kernel module. rd_nr is the maximum number of ramdisks. rd_size is the ramdisk size in KB.
sudo modprobe brd rd_nr=1 rd_size=1048576

ls -l /dev/ram0
brw-rw---- 1 root disk 1, 0 Aug 24 20:00 /dev/ram0

sudo blockdev --getsize /dev/ram0 # Display the size in 512-byte sectors
2097152

# Create a delayed mapping on top of /dev/ram0 with a 200 ms delay (a single value applies to both reads and writes).
size=$(sudo blockdev --getsize /dev/ram0)
sudo dmsetup remove delayed
echo "0 $size delay /dev/ram0 0 200" | sudo dmsetup create delayed


sudo dmsetup table delayed
0 2097152 delay 1:0 0 50

ls -la /dev/mapper/ | grep delayed
lrwxrwxrwx  1 root root       7 May 19 18:37 delayed -> ../dm-0

sudo mkfs.ext4 /dev/dm-0

sudo mkdir /mnt/dm0
sudo mount /dev/dm-0 /mnt/dm0

cd /mnt/dm0

sudo dd if=/dev/urandom of=/mnt/dm0/mapped.file bs=1G count=1

In addition, if you need it for debugging, you can suspend the device, which directly leaves every process doing IO on the mapped device hanging (uninterruptible sleep):

# suspend
sudo dmsetup suspend /dev/$your_dm_dev

# resume
sudo dmsetup resume /dev/$your_dm_dev

BPF stack traces broken on Ubuntu because glibc is not built with -fno-omit-frame-pointer

sudo apt install libc6-prof

env LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu /usr/lib/jvm/java-17-openjdk-amd64/bin/java ...

TL;DR :

https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1908307

A JDK gdb environment on Ubuntu

Symbol tables for gdb:

sudo apt install openjdk-17-jdk
sudo apt install openjdk-17-dbg

Ubuntu's customized OpenJDK sources:

For details see: https://www.cyberciti.biz/faq/how-to-get-source-code-of-packa...
## Prerequisite: the deb-src entries have already been added to apt sources.list

sudo apt update
cd ~
sudo apt-get source openjdk-17