動態

詳情 返回 返回

技術分享 | Oracle RAC集羣節點2被驅逐案例 - 動態 詳情

本文為墨天輪數據庫管理服務團隊第135期技術分享,內容原創,作者為技術顧問羅海鷗,如需轉載請聯繫小墨(VX:modb666)並註明來源。如需查看更多文章可關注【墨天輪】公眾號。

適用範圍

Oracle RAC 11G、12C、19C

問題概述

客户一套RAC節點,版本為12.2.0.1。2025-07-31 07:39:21 節點2被驅逐。

以下是節點2alter日誌。

2025-07-31T07:20:49.680473+08:00
Thread 2 advanced to log sequence 116120 (LGWR switch)
  Current log# 7 seq# 116120 mem# 0: +DATA/FDCDB/ONLINELOG/group_7.271.1000847113
  Current log# 7 seq# 116120 mem# 1: +ARC/FDCDB/ONLINELOG/group_7.269.1000847113
2025-07-31T07:20:51.220283+08:00
Archived Log entry 193846 added for T-2.S-116119 ID 0xeac4c440 LAD:1
2025-07-31T07:38:00.059963+08:00
minact-scn: got error during useg scan e:12751 usn:1
minact-scn: useg scan erroring out with error e:12751
2025-07-31T07:39:00.979683+08:00
IPC Receiver dump detected. Sender instance 1 Receiver pnum 570 ospid 41888 [oracle@fdcdb2 (PPA7)], pser 12
2025-07-31T07:39:00.979980+08:00
Errors in file /oracle/app/diag/rdbms/fdcdb/fdcdb2/trace/fdcdb2_ppa7_41888.trc:
2025-07-31T07:39:21.277326+08:00
Detected an inconsistent instance membership by instance 1
Errors in file /oracle/app/diag/rdbms/fdcdb/fdcdb2/trace/fdcdb2_lmon_14788.trc  (incident=1987597):
ORA-29740: evicted by instance number 1, group incarnation 14
Incident details in: /oracle/app/diag/rdbms/fdcdb/fdcdb2/incident/incdir_1987597/fdcdb2_lmon_14788_i1987597.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2025-07-31T07:39:22.156952+08:00
Errors in file /oracle/app/diag/rdbms/fdcdb/fdcdb2/trace/fdcdb2_lmon_14788.trc:
ORA-29740: evicted by instance number 1, group incarnation 14
Errors in file /oracle/app/diag/rdbms/fdcdb/fdcdb2/trace/fdcdb2_lmon_14788.trc  (incident=1987598):
ORA-29740 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /oracle/app/diag/rdbms/fdcdb/fdcdb2/incident/incdir_1987598/fdcdb2_lmon_14788_i1987598.trc
2025-07-31T07:39:22.367538+08:00
CJQ0 (ospid: 17003): terminating the instance due to error 481
2025-07-31T07:39:23.851520+08:00
System state dump requested by (instance=2, osid=17003 (CJQ0)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/app/diag/rdbms/fdcdb/fdcdb2/trace/fdcdb2_diag_14756_20250731073923.trc
2025-07-31T07:39:24.449532+08:00

節點1alter日誌顯示:

2025-07-31T07:38:33 節點1的LMS進程無法和節點2連接

07:39:00 等待超時,

07:39:21 節點2被驅逐

2025-07-31T07:13:53.757631+08:00
Thread 1 advanced to log sequence 77598 (LGWR switch)
  Current log# 4 seq# 77598 mem# 0: +DATA/FDCDB/ONLINELOG/group_4.267.1000847077
  Current log# 4 seq# 77598 mem# 1: +ARC/FDCDB/ONLINELOG/group_4.260.1000847077
2025-07-31T07:13:55.322692+08:00
Archived Log entry 193845 added for T-1.S-77597 ID 0xeac4c440 LAD:1
2025-07-31T07:28:14.833306+08:00
Thread 1 advanced to log sequence 77599 (LGWR switch)
  Current log# 2 seq# 77599 mem# 0: +DATA/FDCDB/ONLINELOG/group_2.263.1000846993
  Current log# 2 seq# 77599 mem# 1: +ARC/FDCDB/ONLINELOG/group_2.258.1000846993
2025-07-31T07:28:16.360189+08:00
Archived Log entry 193847 added for T-1.S-77598 ID 0xeac4c440 LAD:1
2025-07-31T07:38:33.029246+08:00
LMS1 (ospid: 23514) has detected no messaging activity from instance 2
USER (ospid: 23514) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2025-07-31T07:38:33.035568+08:00
LMS4 (ospid: 23521) has detected no messaging activity from instance 2
2025-07-31T07:38:33.035590+08:00
Communications reconfiguration: instance_number 2 by ospid 23514
2025-07-31T07:38:33.035740+08:00
LMS3 (ospid: 23519) has detected no messaging activity from instance 2
2025-07-31T07:38:33.036206+08:00
LMS0 (ospid: 23512) has detected no messaging activity from instance 2
2025-07-31T07:38:33.043106+08:00
USER (ospid: 23521) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2025-07-31T07:38:33.043264+08:00
USER (ospid: 23519) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2025-07-31T07:38:33.043562+08:00
USER (ospid: 23512) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2025-07-31T07:38:33.047480+08:00
LMS2 (ospid: 23516) has detected no messaging activity from instance 2
USER (ospid: 23516) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2025-07-31T07:38:33.077660+08:00
LMON (ospid: 23508) drops the IMR request from LMS4 (ospid: 23521) because IMR is in progress and inst 2 is marked bad.
LMON (ospid: 23508) drops the IMR request from LMS3 (ospid: 23519) because IMR is in progress and inst 2 is marked bad.
LMON (ospid: 23508) drops the IMR request from LMS0 (ospid: 23512) because IMR is in progress and inst 2 is marked bad.
LMON (ospid: 23508) drops the IMR request from LMS2 (ospid: 23516) because IMR is in progress and inst 2 is marked bad.
2025-07-31T07:39:00.964588+08:00
IPC Send timeout detected. Sender: ospid 41916 [oracle@fdcdb1]
Receiver: inst 2 binc 2 ospid 41888
2025-07-31T07:39:21.290230+08:00
Detected an inconsistent instance membership by instance 1
Evicting instance 2 from cluster
Waiting for instances to leave: 2 
2025-07-31T07:39:21.444234+08:00
IPC Send timeout to 2.3 inc 12 for msg type 65518 from opid 26
2025-07-31T07:39:21.444310+08:00
IPC Send timeout to 2.3 inc 12 for msg type 151 from opid 26
2025-07-31T07:39:21.444366+08:00
IPC Send timeout to 2.3 inc 12 for msg type 73 from opid 26
2025-07-31T07:39:21.444401+08:00
IPC Send timeout to 2.2 inc 12 for msg type 65521 from opid 25

問題原因

根本原因:網絡擁塞導致節點2被驅逐,以下是分析過程。

1、lmon分析

節點1 lmon分析

節點1的lmon進程trc文件fdcdb2\_lmon\_14788.trc表明故障原因是網絡無法通信。(根據MOS解釋kjxgrrcfgchk: Initiating reconfig, reason 3表示網絡無法連接。)

*** 2025-07-31T07:38:33.046462+08:00
2025-07-31 07:38:33.046 : kjxgrrcfg: done (device ok) - ret = 3  hist 0x1679a (initial rsn: 3, cachd rsn: 0)
kjxgrrcfgchk: Initiating reconfig, reason=3
kjxgrrcfgchk: COMM rcfg - Disk Vote Required
kjfmReceiverHealthCB_CheckAll: Recievers are healthy.
2025-07-31 07:38:33.046 : kjxgrnetchk: start 0x44281b3b, end 0x4428d185
2025-07-31 07:38:33.046 : kjxgrnetchk: Network Validation wait: 46 sec
kjxgrnetchk: ce-event: from inst 1 to inst 2 ver 0x3cff
kjxgrrcfgchk: prev pstate 6  mapsz 1024
kjxgrrcfgchk: new  bmp: 1 2 
kjxgrrcfgchk: cnct bmp: 1 2 
kjxgrrcfgchk: disc bmp: 
kjxgrrcfgchk: work bmp: 1 2 
kjxgrrcfgchk: rr  bmp: 1 2 

image.png

節點2 lmon分析

07:32:52開始知道節點2被驅逐,一直存在網絡等待。

2025-07-31 07:29:29.457 : GSIPC:BSEND: last proc to post me: pid 38.14864 in kjc.h LINE:2308 ID:KJCS Post snd proxy to flush msg
2025-07-31 07:29:29.457 : GSIPC:BSEND: deq msg 0x2a5f756e0 type 44 qtm 30049 us seq 0.514174518 from opid 38 to 1.2
2025-07-31 07:29:29.457 : GSIPC:BSEND: last proc to post me: pid 38.14864 in kjc.h LINE:2308 ID:KJCS Post snd proxy to flush msg
IPCLW:[0.913]{E}[WAIT]:PROTO: [1753918372165022]PT DMP_REQ at seq 1292873920 from peer 192.168.2.10:14404

*** 2025-07-31T07:32:52.165931+08:00 IPCLW:[0.914]{-}[WAIT]:UTIL: [1753918372165022] PCNH 0x7f6c64fdcf78 State: 1 MSN: 392301845 Seq: 1292873920 Last Rcv 0:0:0.116.116424 Last valid Rcv 0:0:7.273.273813 IPCLW:[0.915]{-}[WAIT]:UTIL: [1753918372165022] Peer: LMS1.KSMSQ_dlm.23514 AckSeq: 1292873920. # Coalesced: 0 IPCLW:[0.916]{-}[WAIT]:UTIL: [1753918372165022] IVPort: 45836 TVPort: 6284 IMPT: 29019 RMPT: 17464 Ongoing Recv: No, last retrans dmp seq 0 IPCLW:[0.917]{-}[WAIT]:UTIL: [1753918372165022] Flags: 0x00000001 THint: 0x44cab1bf00000226 IHint: 0x705ddeb40000001f IPCLW:[0.918]{-}[WAIT]:UTIL: [1753918372165022] IPCLW State Dump IPCLW:[0.919]{-}[WAIT]:UTIL: [1753918372165022] UDP PORT 0x7f6c6dd93ad8, [LMS1.KSMSQ_dlm] Dump IPCLW:[0.920]{-}[WAIT]:UTIL: [1753918372165022] ID: 0x3b5d2e10 Type: UDP PT Opts: [] Flags: [], Address: 192.168.2.20:9149, MPT: No IPCLW:[0.921]{-}[WAIT]:UTIL: [1753918372165022] Last DataReady time 0:0:0.67.67549 Last Unrel msg rcvd time 487199:32:52.165.165022 IPCLW:[0.922]{-}[WAIT]:UTIL: [1753918372165022] ACNH Dump IPCLW:[0.923]{-}[WAIT]:UTIL: [1753918372165022] PCNH Dump IPCLW:[0.924]{-}[WAIT]:UTIL: [1753918372165022] PCNH 0x7f6c65ff79d8 State: 1 MSN: 2057350729 Seq: 2099703066 Last Rcv 8062:40:44.501.501764 Last valid Rcv 8062:40:44.501.501764 IPCLW:[0.925]{-}[WAIT]:UTIL: [1753918372165022] Peer: LMS1.KSMSQ_dlm.9072 AckSeq: 2099703066. # Coalesced: 0 IPCLW:[0.926]{-}[WAIT]:UTIL: [1753918372165022] IVPort: 45836 TVPort: 60222 IMPT: 29019 RMPT: 10109 Ongoing Recv: No, last retrans dmp seq 0 IPCLW:[0.927]{-}[WAIT]:UTIL: [1753918372165022] Flags: 0x00000000 THint: 0x44cab0510000001f IHint: 0x47d663e50000001f IPCLW:[0.928]{-}[WAIT]:UTIL: [1753918372165022] PCNH 0x7f6c65ff7448 State: 1 MSN: 595742835 Seq: 14107475 Last Rcv 8241:8:1.748.748540 Last valid Rcv

2、lms分析

節點1 lms分析

07:32:51開始,lms進程出現網絡相關等待

*** 2025-07-31T07:32:51.968793+08:00
IPCLW:[0.618]{E}[WAIT]:PROTO: [1753918371968513]RETRANS DBG local acnh 0x7f71db3b3ea8 dump:
IPCLW:[0.619]{-}[WAIT]:UTIL: [1753918371968513]  ACNH 0x7f71db3b3ea8 State: 1 MSN: 1776967435 Seq: 461160960 # Pending: 53
IPCLW:[0.620]{-}[WAIT]:UTIL: [1753918371968513]   Peer: LMS0.KSMSQ_dlm.14795 AckSeq: 0
IPCLW:[0.621]{-}[WAIT]:UTIL: [1753918371968513]   Flags: 0x00000000 IHint: 0x64d73e5e0000001f THint: 0x2457a2fa000000fd 
IPCLW:[0.622]{-}[WAIT]:UTIL: [1753918371968513]   Local Address: 192.168.2.10:44600 Remote Address: 192.168.2.20:44440
IPCLW:[0.623]{-}[WAIT]:UTIL: [1753918371968513]   Remote PID: ver 0 flags 1 trans 2 tos 0 opts 0 xdata3 2dd3 xdata2 4fada3f2
IPCLW:[0.624]{-}[WAIT]:UTIL: [1753918371968513]             : mmsz 8472 mmr 9200 mms 2 xdata 1e1dd769
IPCLW:[0.625]{-}[WAIT]:UTIL: [1753918371968513]   IVPort: 26295 TVPort: 55145 IMPT: 14510 RMPT: 11731   Pending Sends: Yes Unacked Sends: Yes 
IPCLW:[0.626]{-}[WAIT]:UTIL: [1753918371968513]   Send Engine Queued: No sshdl -1 ssts 0 rtts 1753918371968771 snderrchk 8 creqcnt 1
IPCLW:[0.627]{-}[WAIT]:UTIL: [1753918371968513]   Unackd Messages 1776967382 -> 1776967434. SSEQ 461160907 Send Time: 0:0:0.210.210159 SMSN # Xmits: 64 EMSN 0:0:0.210.210159
IPCLW:[0.628]{-}[WAIT]:UTIL: [1753918371968513]  Pending send queue: 
IPCLW:[0.629]{-}[WAIT]:UTIL: [1753918371968513]    [0] Mbuf 0x7f71db165af0 MSN 1776967382 Seq 461160907 -> 461160908 # XMits: 64
IPCLW:[0.630]{-}[WAIT]:UTIL: [1753918371968513]    [1] Mbuf 0x7f71d9a7b5f0 MSN 1776967383 Seq 461160908 -> 461160909 # XMits: 64
IPCLW:[0.631]{-}[WAIT]:UTIL: [1753918371968513]    [2] Mbuf 0x7f71db11ba50 MSN 1776967384 Seq 461160909 -> 461160910 # XMits: 63
IPCLW:[0.632]{-}[WAIT]:UTIL: [1753918371968513]    [3] Mbuf 0x7f71db165f10 MSN 1776967385 Seq 461160910 -> 461160911 # XMits: 62
IPCLW:[0.633]{-}[WAIT]:UTIL: [1753918371968513]    [4] Mbuf 0x7f71db40f7f0 MSN 1776967386 Seq 461160911 -> 461160912 # XMits: 62
IPCLW:[0.634]{-}[WAIT]:UTIL: [1753918371968513]    [5] Mbuf 0x7f71db404ed0 MSN 1776967387 Seq 461160912 -> 461160913 # XMits: 61
IPCLW:[0.635]{-}[WAIT]:UTIL: [1753918371968513]    [6] Mbuf 0x7f71db3fd730 MSN 1776967388 Seq 461160913 -> 461160914 # XMits: 60
IPCLW:[0.636]{-}[WAIT]:UTIL: [1753918371968513]    [7] Mbuf 0x7f71db1377d0 MSN 1776967389 Seq 461160914 -> 461160915 # XMits: 60
IPCLW:[0.637]{-}[WAIT]:UTIL: [1753918371968513]    [8] Mbuf 0x7f71db404ab0 MSN 1776967390 Seq 461160915 -> 461160916 # XMits: 59
IPCLW:[0.638]{-}[WAIT]:UTIL: [1753918371968513]    [9] Mbuf 0x7f71db400490 MSN 1776967391 Seq 461160916 -> 461160917 # XMits: 59

節點2 lms分析

07:32:52 出現網絡等待

2025-07-31 07:29:29.457 : GSIPC:BSEND: last proc to post me: pid 38.14864 in kjc.h LINE:2308 ID:KJCS Post snd proxy to flush msg
2025-07-31 07:29:29.457 : GSIPC:BSEND: deq msg 0x2a5f756e0 type 44 qtm 30049 us seq 0.514174518 from opid 38 to 1.2
2025-07-31 07:29:29.457 : GSIPC:BSEND: last proc to post me: pid 38.14864 in kjc.h LINE:2308 ID:KJCS Post snd proxy to flush msg
IPCLW:[0.913]{E}[WAIT]:PROTO: [1753918372165022]PT DMP_REQ at seq 1292873920 from peer 192.168.2.10:14404

*** 2025-07-31T07:32:52.165931+08:00 IPCLW:[0.914]{-}[WAIT]:UTIL: [1753918372165022] PCNH 0x7f6c64fdcf78 State: 1 MSN: 392301845 Seq: 1292873920 Last Rcv 0:0:0.116.116424 Last valid Rcv 0:0:7.273.273813 IPCLW:[0.915]{-}[WAIT]:UTIL: [1753918372165022] Peer: LMS1.KSMSQ_dlm.23514 AckSeq: 1292873920. # Coalesced: 0 IPCLW:[0.916]{-}[WAIT]:UTIL: [1753918372165022] IVPort: 45836 TVPort: 6284 IMPT: 29019 RMPT: 17464 Ongoing Recv: No, last retrans dmp seq 0 IPCLW:[0.917]{-}[WAIT]:UTIL: [1753918372165022] Flags: 0x00000001 THint: 0x44cab1bf00000226 IHint: 0x705ddeb40000001f IPCLW:[0.918]{-}[WAIT]:UTIL: [1753918372165022] IPCLW State Dump IPCLW:[0.919]{-}[WAIT]:UTIL: [1753918372165022] UDP PORT 0x7f6c6dd93ad8, [LMS1.KSMSQ_dlm] Dump IPCLW:[0.920]{-}[WAIT]:UTIL: [1753918372165022] ID: 0x3b5d2e10 Type: UDP PT Opts: [] Flags: [], Address: 192.168.2.20:9149, MPT: No IPCLW:[0.921]{-}[WAIT]:UTIL: [1753918372165022] Last DataReady time 0:0:0.67.67549 Last Unrel msg rcvd time 487199:32:52.165.165022 IPCLW:[0.922]{-}[WAIT]:UTIL: [1753918372165022] ACNH Dump IPCLW:[0.923]{-}[WAIT]:UTIL: [1753918372165022] PCNH Dump IPCLW:[0.924]{-}[WAIT]:UTIL: [1753918372165022] PCNH 0x7f6c65ff79d8 State: 1 MSN: 2057350729 Seq: 2099703066 Last Rcv 8062:40:44.501.501764 Last valid Rcv 8062:40:44.501.501764 IPCLW:[0.925]{-}[WAIT]:UTIL: [1753918372165022] Peer: LMS1.KSMSQ_dlm.9072 AckSeq: 2099703066. # Coalesced: 0

3、等待事件分析

故障期間等待事件情況:

07:30:00至07:50:00 數據庫存在大量row cache lock和GC等待,這應該是故障引起的。

image.png

對比30號相同時間段,30號沒有異常等待事件。

image.png

説明故障期間,私網存在問題。

4、網絡分析

客户這邊數據庫每天07:30都有一波定時任務,並且多套數據庫都經過一個交換機,導致私網延遲30s發生了節點驅逐。

image.png

建議定時任務錯開事件,避免網絡擁塞。

解決方案

客户自行重啓了異常節點,故障已自動恢復。


墨天輪從樂知樂享的數據庫技術社區蓄勢出發,全面升級,提供多類型數據庫管理服務。墨天輪數據庫管理服務旨在為用户構建信賴可託付的數據庫環境,併為數據庫廠商提供中立的生態支持。
墨天輪數據庫服務官網:https://www.modb.pro/service

user avatar maimengdegongjian 頭像 zhaoqianglaoshi 頭像 shoushoudeqie 頭像 shuirongshui 頭像 startshineye 頭像 wxp686 頭像 jiang_5f3236dd7afd1 頭像
點贊 7 用戶, 點贊了這篇動態!
點贊

Add a new 評論

Some HTML is okay.