Verifying a folk legend with eBPF: can Java GC logging stall the entire JVM service?

Overview

Real-world Java applications all write GC logs, but not everyone knows that this small log can stall the entire JVM service. This article uses eBPF and other analysis methods to verify whether, in a concrete environment, the problem actually exists.

When proofreading and publishing this article, I had only been back on my feet for a few hours after a second bout of COVID, and after weeks of writing I really did not want to delay publication any longer. If there are mistakes or omissions, please bear with me and point them out.

Java service stalls are the eternal king of topics in the Java world. When Java stalls come to mind, most people's first reaction is these keywords:

  • GC
  • Stop the world(STW)

Writing GC logs causing Java stalls, on the other hand, is probably a niche topic. But I did witness it at a previous e-commerce company. That was around 12/8 in 2017, still on Java 8; the experts involved spent roughly a few weeks on analysis and experiments before delivering a diagnosis and a prescription:

  • Diagnosis: writing the GC log caused JVM GC stalls
  • Prescription: keep the GC log off the busy disk (write it to a ram disk / tmpfs), or write it to a dedicated disk (a sketch follows this list).
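
A minimal sketch of the second prescription, assuming /dev/shm is a tmpfs mount on the host; MyApp and the log file name are placeholders:

# JDK 8 style: point the GC log at a tmpfs (RAM-backed) path
java -Xloggc:/dev/shm/myapp-gc.log -XX:+PrintGCDetails -cp . MyApp

# JDK 9+ unified-logging style
java -Xlog:gc*:file=/dev/shm/myapp-gc.log -cp . MyApp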

Fast-forward to 2023, with OpenJDK 21 already here. Does the problem still exist? A look at the search-engine results:

  • Eliminating Large JVM GC Pauses Caused by Background IO Traffic - 2016, Zhenyun Zhuang@LinkedIn

    • OS-Caused Large JVM Pauses: Investigations and Solutions - 2016, Zhenyun Zhuang@LinkedIn
  • JVM性能調優--YGC (JVM performance tuning: YGC) - heapdump.cn
  • JVM 輸出 GC 日誌導致 JVM 卡住,我 TM 人傻了 (JVM GC logging hangs the JVM, and I'm losing my mind)

OpenJDK's fix

The problem is by now widely recognised, and OpenJDK has shipped a fix:

  • Buffered Logging? - rramakrishna@twitter.com
  • JDK-8229517: Support for optional asynchronous/buffered logging

    • Git Commit
  • JDK-8264323: Add a global option -Xlog:async for Unified Logging
  • Asynchronous Logging in Corretto 17 - Xin Liu@amazon.com

Unfortunately, to keep new releases behaviour-compatible with old ones, this fix is disabled by default: you need to add the Java command-line flag -Xlog:async to manually switch on the "feature" of writing GC logs from an asynchronous native thread.
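
For example (a sketch; the log path and main class are placeholders, and the associated buffer-size knob introduced with the async logging work can be tuned separately if needed):

# let a dedicated native thread flush unified-logging output instead of blocking the logging thread
java -Xlog:async -Xlog:gc*:file=/var/log/myapp/gc.log -cp . MyApp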

So the question becomes: do we simply add the flag and call it a day, or do we first verify whether the service's stalls really originate from this problem, and only then consider changing the flag? Naturally, this article chooses the latter.

Fundamentals

When does the JVM write the GC log?

Safepoint

[Figure: the main safepoint flow. From https://krzysztofslusarski.github.io/2022/12/12/async-manual....]

The figure above roughly illustrates the main safepoint flow. If you are interested, follow the link above or search around; plenty of good articles already explain it, so I will not lecture the experts here.

One thing to note: GC is not the only user of safepoint STOP THE WORLD (STW); there are others.
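
An easy way to see this for yourself is to let the JVM log its safepoints, which names the VM operation each safepoint served (a sketch; your-app.jar is a placeholder):

# print a line per safepoint, including the VM operation that requested it
java -Xlog:safepoint -jar your-app.jar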

Here is just a brief explanation (I am not a JVM expert, so treat the following with caution):

  1. Global safepoint request

    1.1 Some thread issues a request to enter a safepoint, carrying a safepoint operation argument; that argument is essentially the callback to run once STOP THE WORLD (STW) is in effect. It may be a failed allocation that triggers GC, or some other reason.

    1.2 A thread called "VM Thread", on receiving the safepoint request, sets a JVM-global safepoint flag to true (a simplification: if you are an expert, you know this "flag" is really implemented with an OS memory page-protection / page-fault trick).

    1.3 The "VM Thread" then waits for the application threads (App threads) to reach (enter) the safepoint.

    1.4 Application threads poll this safepoint flag at high frequency; when a thread sees it is true, it reaches (enters) the safepoint state.

    Source link: SafepointSynchronize::begin()

  2. Global safepoint

    When the "VM Thread" sees that all App threads have reached the safepoint (the real start of the STW), it executes the safepoint operation. In this article that is, of course, the GC operation.

    Source link: RuntimeService::record_safepoint_synchronized()

  3. End of safepoint operation

    Once the safepoint operation finishes, the "VM Thread" ends the STW.

    Source link: SafepointSynchronize::end()

GC

The key point above: the GC operation runs after the STW begins, and the STW ends only after it completes. The GC operation writes the GC log, and by default that ends up in a write syscall. Proof below.
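
As a quick first check, before the full script below, you can simply count file writes issued from the JVM's "VM Thread" itself (a sketch; it assumes the thread is named "VM Thread", as it appears later in the top -H output):

sudo bpftrace -e 'kprobe:vfs_write /comm == "VM Thread"/ { @[ustack(5)] = count(); }'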

Diagnosis

Diagnosis - the eBPF tracing approach

First, look at the bpftrace program trace-safepoint.bt:

#!/usr/local/bin/bpftrace
// trace-safepoint.bt

BEGIN
{}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*begin*
{
    @java_pid[pid] = pid;
}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*begin*
/@java_pid[pid]/
{
    @sp_begin[pid, tid, comm, ustack] = count();
    @vm_thread_tid[tid] = tid;
    @vm_thread_waiting_sp_sync[tid] = tid;
    @vm_thread_in_STW[tid] = tid;
    printf("SafepointSynchronize::begin %-6d %-16s\n", tid, comm);
}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*RuntimeService*record_safepoint_synchronized*
/@java_pid[pid]/
{
    @sp_synchronized[pid, tid, comm, ustack] = count();
    delete (@vm_thread_waiting_sp_sync[tid]);
    printf("RuntimeService::record_safepoint_synchronized %-6d %-16s\n", tid, comm);
}

uretprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*end*
/@java_pid[pid]/
{
    delete (@vm_thread_in_STW[tid]);
    printf("AsyncLogWriter: %-6d %-16s %s\n", pid, comm, func);
    printf("SafepointSynchronize::end %-6d %-16s\n", tid, comm);
}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*AsyncLogWriter*
/@java_pid[pid] && @vm_thread_in_STW[tid]/
{
    printf("AsyncLogWriter: %-6d %-16s %s\n", pid, comm, func);
    @AsyncLogWriter[pid, tid, comm, ustack] = count();
}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*LogFileOutput*
/@java_pid[pid] && @vm_thread_in_STW[tid]/
{
    printf("LogFileOutput: %-6d %-16s %s\n", pid, comm, func);
    @LogFileOutput[pid, tid, comm, ustack] = count();
}

tracepoint:block:block_rq_complete,
tracepoint:block:block_bio_complete
/@java_pid[pid]/
{
    @block_bio_complete[pid, tid, comm, kstack(4)] = count();
}

tracepoint:syscalls:sys_enter_sync,
tracepoint:syscalls:sys_enter_syncfs,
tracepoint:syscalls:sys_enter_fsync,
tracepoint:syscalls:sys_enter_fdatasync,
tracepoint:syscalls:sys_enter_sync_file_range*,
tracepoint:syscalls:sys_enter_msync
/@java_pid[pid]/
{
    time("%H:%M:%S  ");
    printf("sync syscall: %-6d %-16s %s\n", pid, comm, probe);

    @sync_call[pid, tid, comm, probe, ustack(6)] = count();
    printf("sp sync_call: %-6d %-16s %s\n", pid, comm, probe);

}

kprobe:vfs_read*,
kprobe:vfs_write*,
kprobe:vfs_fsync,
kprobe:vfs_open,
kprobe:vfs_create
/@java_pid[pid]/
{
    @vfs_call[pid, tid, comm, func, ustack(6)] = count();
    printf("sp vfs_call: %-6d %-16s %s\n", pid, comm, probe);
}

END
{}
This program assumes only one java process is running on the machine; if there are several, it needs a small tweak (see the sketch below).
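
One way to do that, for example, is to drop the @java_pid map and filter on a PID passed as a bpftrace positional parameter; a sketch with just one of the probes (the others would change the same way):

sudo bpftrace -e '
uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*begin*
/pid == $1/
{
    printf("SafepointSynchronize::begin %-6d %-16s\n", tid, comm);
}' $(pgrep -n java)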

The BPF probe points we mainly look at are:

  1. uprobe:SafepointSynchronize::begin()
  2. uprobe:RuntimeService::record_safepoint_synchronized()
  3. uprobe:LogFileOutput::*() : the JVM code that writes log output to file
  4. uprobe:AsyncLogWriter::*() : the JVM asynchronous-logging code
  5. kprobe:vfs_write : the kernel file-write operation
  6. uprobe:SafepointSynchronize::end()

Then write a simple Java program, TestGCLog.java:

// TestGCLog.java
import java.util.*;

public class TestGCLog {
    List<Object> objList = new ArrayList<>();
    byte[] m = new byte[100 * 1024]; // each instance carries ~100 KB to pressure the young gen

    public static void main(String[] arg) throws Exception {
        TestGCLog t = new TestGCLog();
        t.test();
    }

    void test() throws Exception {
        long lastprint = System.currentTimeMillis();
        while (true) {
            objList.clear();
            long start = System.currentTimeMillis();
            TestGCLog newObj = new TestGCLog();
            long end = System.currentTimeMillis();
            if (end - start > 10) { // report allocations that stalled for more than 10 ms
                System.out.println(new Date() + " - TestGCLog: slow new object: " + (end - start) + "ms");
            }
            objList.add(newObj);
            Thread.sleep(5);
            long now = System.currentTimeMillis();
            if (now - lastprint > 5 * 1000) { // heartbeat every 5 seconds
                System.out.println(new Date() + " - TestGCLog");
                lastprint = now;
            }
        }
    }
}

Compile and run the Java program, then run bpftrace:

/usr/lib/jvm/java-17-openjdk-amd64/bin/javac TestGCLog.java

/usr/lib/jvm/java-17-openjdk-amd64/bin/java -Xloggc:/mnt/dm0/gc.log  -XX:+UnlockDiagnosticVMOptions -XX:+PreserveFramePointer -Xms128M -Xmx128M -cp . TestGCLog
sudo bpftrace ./trace-safepoint.bt

The bpftrace program produces some interesting output.

Order of events:

SafepointSynchronize::begin 422373 VM Thread       
RuntimeService::record_safepoint_synchronized 422373 VM Thread       
LogFileOutput: 422361 VM Thread        LogFileOutput::write(LogDecorations const&, char const*)
AsyncLogWriter: 422361 VM Thread        AsyncLogWriter::instance()
sp vfs_call: 422361 VM Thread        kprobe:vfs_write
AsyncLogWriter: 422361 VM Thread        VMThread::inner_execute(VM_Operation*)
SafepointSynchronize::end 422373 VM Thread 

As you can see, the vfs_write syscall that writes the file happens synchronously during the STW, i.e. between SafepointSynchronize::begin and SafepointSynchronize::end.
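
To quantify how long those in-STW writes take, a latency histogram of vfs_write from the VM Thread is a simple follow-up (a sketch, independent of the script above):

sudo bpftrace -e '
kprobe:vfs_write /comm == "VM Thread"/ { @start[tid] = nsecs; }
kretprobe:vfs_write /@start[tid]/ {
    // histogram of write duration, in microseconds
    @vm_thread_write_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'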

Some stack traces:

@AsyncLogWriter[422361, 422373, VM Thread, 
    AsyncLogWriter::instance()+0
    LogTagSet::vwrite(LogLevel::type, char const*, __va_list_tag*)+471
    LogTargetHandle::print(char const*, ...)+159
    LogStream::write(char const*, unsigned long)+338
    outputStream::do_vsnprintf_and_write_with_automatic_buffer(char const*, __va_list_tag*, bool)+194
    outputStream::print_cr(char const*, ...)+407
    GCTraceTimeLoggerImpl::log_end(TimeInstant<CompositeCounterRepresentation, CompositeElapsedCounterSource>)+288
    G1CollectedHeap::do_collection_pause_at_safepoint_helper(double)+2116
    G1CollectedHeap::do_collection_pause_at_safepoint(double)+53
    VM_G1CollectForAllocation::doit()+75
    VM_Operation::evaluate()+226
    VMThread::evaluate_operation(VM_Operation*)+177
    VMThread::inner_execute(VM_Operation*)+436
    VMThread::run()+199
    Thread::call_run()+196
    thread_native_entry(Thread*)+233
    start_thread+755
]: 18

@LogFileOutput[422361, 422373, VM Thread, 
    LogFileOutput::write(LogDecorations const&, char const*)+0
    LogTargetHandle::print(char const*, ...)+159
    LogStream::write(char const*, unsigned long)+338
    outputStream::do_vsnprintf_and_write_with_automatic_buffer(char const*, __va_list_tag*, bool)+194
    outputStream::print_cr(char const*, ...)+407
    GCTraceTimeLoggerImpl::log_end(TimeInstant<CompositeCounterRepresentation, CompositeElapsedCounterSource>)+288
    G1CollectedHeap::do_collection_pause_at_safepoint_helper(double)+2116
    G1CollectedHeap::do_collection_pause_at_safepoint(double)+53
    VM_G1CollectForAllocation::doit()+75
    VM_Operation::evaluate()+226
    VMThread::evaluate_operation(VM_Operation*)+177
    VMThread::inner_execute(VM_Operation*)+436
    VMThread::run()+199
    Thread::call_run()+196
    thread_native_entry(Thread*)+233
    start_thread+755
]: 17

@sp_begin[404917, 404924, VM Thread, 
    SafepointSynchronize::begin()+0
    VMThread::run()+199
    Thread::call_run()+196
    thread_native_entry(Thread*)+233
    start_thread+755
]: 1
@sp_synchronized[404917, 404924, VM Thread, 
    RuntimeService::record_safepoint_synchronized(long)+0
    VMThread::inner_execute(VM_Operation*)+389
    VMThread::run()+199
    Thread::call_run()+196
    thread_native_entry(Thread*)+233
    start_thread+755
]: 1

You can refer to the [How to emulate IO stalls on Linux] section below to emulate slow IO, or even a complete hang, on the disk that the GC log is written to.

If you suspend the target disk with sudo dmsetup suspend /dev/$your_dm_dev, you can see directly from the Java program's stdout and from top that the entire JVM has paused as well.

If you then inspect thread states with top -H -p $(pgrep java), you will find a VM Thread in the D (uninterruptible sleep) state.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND           
 430796 labile    20   0 3068740 112820  20416 S   0.3   0.2   0:00.57 VM Periodic Tas   
 430779 labile    20   0 3068740 112820  20416 S   0.0   0.2   0:00.00 java              
 430780 labile    20   0 3068740 112820  20416 S   0.0   0.2   0:02.96 java              
 430781 labile    20   0 3068740 112820  20416 S   0.0   0.2   0:00.02 GC Thread#0       
 430782 labile    20   0 3068740 112820  20416 S   0.0   0.2   0:00.00 G1 Main Marker       
 430785 labile    20   0 3068740 112820  20416 S   0.0   0.2   0:00.05 G1 Service        
 430786 labile    20   0 3068740 112820  20416 D   0.0   0.2   0:00.04 VM Thread 

If you are curious where the VM Thread is stuck, look at:

$ sudo cat /proc/`pgrep java`/task/$thread_id_ofVM_Thread/stack
[<0>] percpu_rwsem_wait+0x107/0x130
[<0>] vfs_write+0x274/0x2a0  <----
[<0>] ksys_write+0x67/0xf0  <----
[<0>] __x64_sys_write+0x19/0x30
[<0>] do_syscall_64+0x5c/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

or:
[<0>] do_get_write_access+0x261/0x330
[<0>] jbd2_journal_get_write_access+0x54/0xa0
[<0>] __ext4_journal_get_write_access+0x8f/0x1e0
[<0>] ext4_reserve_inode_write+0xa0/0xe0
[<0>] __ext4_mark_inode_dirty+0x57/0x230
[<0>] ext4_dirty_inode+0x5c/0x90
[<0>] __mark_inode_dirty+0x5e/0x3b0
[<0>] generic_update_time+0x6c/0xf0
[<0>] file_update_time+0x12f/0x150
[<0>] file_modified+0x27/0x40
[<0>] ext4_buffered_write_iter+0x53/0x130
[<0>] ext4_file_write_iter+0x49/0x80
[<0>] new_sync_write+0xfe/0x190 <----
[<0>] vfs_write+0x20d/0x2a0 <----
[<0>] ksys_write+0x67/0xf0
[<0>] __x64_sys_write+0x19/0x30
[<0>] do_syscall_64+0x5c/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Diagnosis - the gdb tracing approach

Since the focus of this article is BPF, gdb gets only a brief mention.

For setting up gdb to debug OpenJDK, see the [A JDK gdb environment on Ubuntu] section below.

With gdb you can set breakpoints in the JVM functions and kernel calls of interest.

$ gdb -ex 'handle SIGSEGV noprint nostop pass' -p `pgrep java`

(gdb) thread thread_id_of_vm_thread

(gdb) bt
#0  __GI___libc_write (nbytes=95, buf=0x7f7cc4031390, fd=4) at ../sysdeps/unix/sysv/linux/write.c:26
#1  __GI___libc_write (fd=4, buf=0x7f7cc4031390, nbytes=95) at ../sysdeps/unix/sysv/linux/write.c:24
#2  0x00007f7cca88af6d in _IO_new_file_write (f=0x7f7cc4000c20, data=0x7f7cc4031390, n=95) at ./libio/fileops.c:1180
#3  0x00007f7cca88ca61 in new_do_write (to_do=95, data=0x7f7cc4031390 "[3598.772s][info][gc] GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms\n", fp=0x7f7cc4000c20) at ./libio/libioP.h:947
#4  _IO_new_do_write (to_do=95, data=0x7f7cc4031390 "[3598.772s][info][gc] GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms\n", fp=0x7f7cc4000c20) at ./libio/fileops.c:425
#5  _IO_new_do_write (fp=0x7f7cc4000c20, data=0x7f7cc4031390 "[3598.772s][info][gc] GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms\n", to_do=95) at ./libio/fileops.c:422
#6  0x00007f7cca88a558 in _IO_new_file_sync (fp=0x7f7cc4000c20) at ./libio/fileops.c:798
#7  0x00007f7cca87f22a in __GI__IO_fflush (fp=0x7f7cc4000c20) at ./libio/libioP.h:947
#8  0x00007f7cc9eae679 in LogFileStreamOutput::flush (this=0x7f7cc4003e70) at src/hotspot/share/logging/logFileStreamOutput.cpp:93
#9  LogFileStreamOutput::write (this=this@entry=0x7f7cc4003e70, decorations=..., msg=msg@entry=0x7f7cac39b5c0 "GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms")
    at src/hotspot/share/logging/logFileStreamOutput.cpp:131
#10 0x00007f7cc9eadade in LogFileOutput::write_blocking (msg=0x7f7cac39b5c0 "GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms", decorations=..., this=0x7f7cc4003e70)
    at src/hotspot/share/logging/logFileOutput.cpp:308
#11 LogFileOutput::write (msg=0x7f7cac39b5c0 "GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms", decorations=..., this=0x7f7cc4003e70) at src/hotspot/share/logging/logFileOutput.cpp:332
#12 LogFileOutput::write (this=0x7f7cc4003e70, decorations=..., msg=0x7f7cac39b5c0 "GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms") at src/hotspot/share/logging/logFileOutput.cpp:320
#13 0x00007f7cc9eb3a47 in LogTagSet::log (msg=0x7f7cac39b5c0 "GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms", level=LogLevel::Info, 
    this=0x7f7cca74bee0 <LogTagSetMapping<(LogTag::type)43, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::_tagset>) at src/hotspot/share/logging/logTagSet.cpp:85
#14 LogTagSet::vwrite (this=0x7f7cca74bee0 <LogTagSetMapping<(LogTag::type)43, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::_tagset>, level=LogLevel::Info, fmt=<optimized out>, 
    args=<optimized out>) at src/hotspot/share/logging/logTagSet.cpp:150
#15 0x00007f7cc9b7f09f in LogTargetHandle::print (this=this@entry=0x7f7cac39c2f0, fmt=fmt@entry=0x7f7cca41dd6e "%s") at src/hotspot/share/logging/logHandle.hpp:95
#16 0x00007f7cc9eb3232 in LogStream::write (this=0x7f7cac39c260, s=0x7f7cac39b930 " 3.261ms\n", len=9) at src/hotspot/share/logging/logStream.hpp:52
#17 0x00007f7cca03eaa2 in outputStream::do_vsnprintf_and_write_with_automatic_buffer (this=this@entry=0x7f7cac39c260, format=format@entry=0x7f7cca498e6c " %.3fms", ap=ap@entry=0x7f7cac39c158, add_cr=add_cr@entry=true)
    at src/hotspot/share/utilities/ostream.cpp:131
#18 0x00007f7cca03f427 in outputStream::do_vsnprintf_and_write (add_cr=true, ap=0x7f7cac39c158, format=0x7f7cca498e6c " %.3fms", this=0x7f7cac39c260) at src/hotspot/share/utilities/ostream.cpp:144
#19 outputStream::print_cr (this=this@entry=0x7f7cac39c260, format=format@entry=0x7f7cca498e6c " %.3fms") at src/hotspot/share/utilities/ostream.cpp:158
#20 0x00007f7cc9b8fec0 in GCTraceTimeLoggerImpl::log_end (this=<optimized out>, end=...) at src/hotspot/share/gc/shared/gcTraceTime.cpp:70
#21 0x00007f7cc9afc8c4 in GCTraceTimeDriver::at_end (end=..., cb=<optimized out>, this=0x7f7cac39c688) at src/hotspot/share/gc/shared/gcTraceTime.inline.hpp:78
#22 GCTraceTimeDriver::~GCTraceTimeDriver (this=0x7f7cac39c688, __in_chrg=<optimized out>) at src/hotspot/share/gc/shared/gcTraceTime.inline.hpp:61
#23 GCTraceTimeImpl::~GCTraceTimeImpl (this=0x7f7cac39c618, __in_chrg=<optimized out>) at src/hotspot/share/gc/shared/gcTraceTime.hpp:141
#24 GCTraceTimeWrapper<(LogLevel::type)3, (LogTag::type)43, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::~GCTraceTimeWrapper (this=0x7f7cac39c610, __in_chrg=<optimized out>)
    at src/hotspot/share/gc/shared/gcTraceTime.inline.hpp:184
#25 G1CollectedHeap::do_collection_pause_at_safepoint_helper (this=this@entry=0x7f7cc4026ef0, target_pause_time_ms=target_pause_time_ms@entry=200) at src/hotspot/share/gc/g1/g1CollectedHeap.cpp:3071
#26 0x00007f7cc9afcb45 in G1CollectedHeap::do_collection_pause_at_safepoint (this=this@entry=0x7f7cc4026ef0, target_pause_time_ms=200) at src/hotspot/share/gc/g1/g1CollectedHeap.cpp:2887
#27 0x00007f7cc9b7902b in VM_G1CollectForAllocation::doit (this=0x7f7cc8ffe690) at src/hotspot/share/gc/g1/g1VMOperations.cpp:146
#28 0x00007f7cca35b632 in VM_Operation::evaluate (this=this@entry=0x7f7cc8ffe690) at src/hotspot/share/runtime/vmOperations.cpp:70
#29 0x00007f7cca35ced1 in VMThread::evaluate_operation (this=this@entry=0x7f7cc40e0f10, op=0x7f7cc8ffe690) at src/hotspot/share/runtime/vmThread.cpp:269
#30 0x00007f7cca35d2d4 in VMThread::inner_execute (this=this@entry=0x7f7cc40e0f10, op=<optimized out>) at src/hotspot/share/runtime/vmThread.cpp:415
#31 0x00007f7cca35d5c7 in VMThread::loop (this=0x7f7cc40e0f10) at src/hotspot/share/runtime/vmThread.cpp:482
#32 VMThread::run (this=0x7f7cc40e0f10) at src/hotspot/share/runtime/vmThread.cpp:162
#33 0x00007f7cca2d7834 in Thread::call_run (this=0x7f7cc40e0f10) at src/hotspot/share/runtime/thread.cpp:394
#34 0x00007f7cca0326c9 in thread_native_entry (thread=0x7f7cc40e0f10) at src/hotspot/os/linux/os_linux.cpp:724
#35 0x00007f7cca894b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#36 0x00007f7cca926a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Problems that cloud-native environments may aggravate

  • Direct Page Reclaim, once worker-node memory is shared or container memory is limited

    In a cloud environment, one machine may run many containers. Some containers are IO-heavy applications (for example, heavy log writers); they occupy page cache and put contention pressure on the shared disk. Either machine-wide memory shortage or the container's own memory limit can trigger Direct Page Reclaim.

  • IO contention on the local disks shared by a worker node
  • Network-backed PVs such as Ceph are far less stable than local disks; their bandwidth and latency fluctuate.

In a configuration like the one below, multiple containers may end up sharing the local disk that backs emptyDir. Under IO contention, writing the GC log can then hit high latency while inside the STW, directly pushing service API latency past its timeout (a quick check follows the spec below).

apiVersion: v1
kind: Pod
metadata:
  annotations:
  labels:
  name: xyz
spec:
  containers:
  - command: java ... -Xloggc:/usr/share/myapp/logs/gc.log ...
    ...
    volumeMounts:
    - mountPath: /usr/share/myapp/logs
      name: myapp-logs
  volumes:
  - emptyDir: {}
    name: myapp-logs
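
A quick, hypothetical way to check which device actually backs the GC log path in such a Pod (the pod name xyz and the mount path come from the spec above, and the container is assumed to have df available):

kubectl exec xyz -- df -h /usr/share/myapp/logs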

Closing words

Money is a fine thing, and so is attention; people go to great lengths to obtain them, but both have side effects. Look at the IT world today: a kind of buzzword worship appears all too easily, with every third sentence dropping a pile of new technology keywords. Knowing that your hammer is good is fine, but not everything is a nail. Understanding a thing itself, its properties and its shortcomings, is what good technology selection is really about. Some misuse of BPF has already started in the market; let us wait and see. I still believe the most mature use case for eBPF today is tracing problems that are hard to analyse by other means, such as abnormal kernel behaviour.

Later on, I plan to write two twin articles: "Verifying a folk legend with eBPF: can mmap + Java Safepoint stall the entire JVM service?" and "How to measure process-level IO latency".

"Learning without thought is labour lost; thought without learning is perilous." -- The Analects, Wei Zheng (論語·為政)

Appendix

Why can file writes stall in a Linux application?

Readers who know the Linux IO machinery may object: an application's write syscall only copies data into the page cache (marking pages dirty), and dedicated writeback (pdflush) threads perform the real IO later, so the application itself should normally not stall on file writes, unless it calls fsync() itself to force data to disk.

Indeed, Linux's main IO write path is mostly asynchronous. But every rule has exceptions (the devil is in the details), which is exactly what makes the programmer's world fun: Why buffered writes are sometimes stalled - YOSHINORI MATSUNOBU. That article boils down to a few possibilities:

  • READ MODIFY WRITE - some cases require reading the block before writing it
  • WRITE() MAY BE BLOCKED FOR "STABLE PAGE WRITES" - the page being written is already dirty and is currently locked by the writeback thread, about to be flushed to disk.
  • WAITING FOR JOURNAL BLOCK ALLOCATION IN EXT3/4

Is that list complete? I think there is at least one more possibility:

  • Direct Page Reclaim - memory page reclamation

When an application asks the kernel for memory and the kernel finds that the currently free memory (for a container/cgroup, only memory within the limit counts) cannot satisfy the request, it may suspend the current thread and go free more memory. That work typically includes:

  1. Flushing page-cache pages previously marked dirty to disk, which is of course a slow IO operation. The current thread may enter the "uninterruptible sleep" state until the IO writes and the memory freeing complete.

In a cloud environment, one machine may run many containers. Some containers are IO-heavy applications (for example, heavy log writers); they occupy page cache and put contention pressure on the shared disk. Either machine-wide memory shortage or the container's own memory limit can trigger Direct Page Reclaim.

For more, see: Our lessons on Linux writeback: do "dirty" jobs the right way

Brendan Gregg, whom I look up to, has a BPF (bpftrace) diagnostic for this: vmscan.bt
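
In the same spirit, here is a minimal sketch (not vmscan.bt itself) that measures how long threads spend in direct reclaim, using the kernel's vmscan tracepoints:

sudo bpftrace -e '
tracepoint:vmscan:mm_vmscan_direct_reclaim_begin { @start[tid] = nsecs; }
tracepoint:vmscan:mm_vmscan_direct_reclaim_end /@start[tid]/ {
    // time spent in direct reclaim, per command, in microseconds
    @direct_reclaim_us[comm] = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'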

How to emulate IO stalls on Linux

Roughly two methods for emulating IO stalls can be found online:

  • the device-mapper delay method

    • Emulate a slow block device with dm-delay - Christophe Vu-Brugier
    • Emulate a slow block device with dm-delay - flamingbytes
    • dm-delay
  • the SystemTap delay-injection method

    • 巧用Systemtap注入延遲模擬IO設備抖動 (Using SystemTap to inject delays and emulate IO device jitter)
    • 使用SystemTap給I/O設備註入延遲 (Injecting latency into I/O devices with SystemTap)

This article uses the device-mapper delay method: the mapped device itself carries a configured delay.

# load the brd kernel module. rd_nr is the maximum number of ramdisks. rd_size is the ramdisk size in KB.
sudo modprobe brd rd_nr=1 rd_size=1048576

ls -l /dev/ram0
brw-rw---- 1 root disk 1, 0 Aug 24 20:00 /dev/ram0

sudo blockdev --getsize /dev/ram0 # Display the size in 512-byte sectors
2097152

# Create a delayed mapping on top of /dev/ram0 with a 200 ms delay (a single value applies to both reads and writes).
size=$(sudo blockdev --getsize /dev/ram0)
sudo dmsetup remove delayed
echo "0 $size delay /dev/ram0 0 200" | sudo dmsetup create delayed


sudo dmsetup table delayed
0 2097152 delay 1:0 0 50

ls -la /dev/mapper/ | grep delayed
lrwxrwxrwx  1 root root       7 May 19 18:37 delayed -> ../dm-0

sudo mkfs.ext4 /dev/dm-0

sudo mkdir /mnt/dm0
sudo mount /dev/dm-0 /mnt/dm0

cd /mnt/dm0

sudo dd if=/dev/urandom of=/mnt/dm0/mapped.file bs=1G count=1

In addition, if you need it for debugging, you can suspend the device, which directly leaves every process doing IO on the mapped device hanging (uninterruptible sleep):

# suspend
sudo dmsetup suspend /dev/$your_dm_dev

# resume
sudo dmsetup resume /dev/$your_dm_dev

BPF stack traces broken on Ubuntu because glibc is not built with -fno-omit-frame-pointer

sudo apt install libc6-prof

env LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu /usr/lib/jvm/java-17-openjdk-amd64/bin/java ...

TL;DR :

https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1908307

A JDK gdb environment on Ubuntu

Symbol tables for gdb:

sudo apt install openjdk-17-jdk
sudo apt install openjdk-17-dbg

Ubuntu's customized OpenJDK sources:

For details see: https://www.cyberciti.biz/faq/how-to-get-source-code-of-packa...
## Prerequisite: the deb-src entries have already been added to apt sources.list

sudo apt update
cd ~
sudo apt-get source openjdk-17