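Web Platform Dev Diary - Observability in Practice

Core content: Prometheus monitoring integration, health checks, request tracing, structured logging, and the overall observability setup
Tech stack: Go + Gin + Prometheus + Correlation ID + Structured Logging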

📋 Contents

Goals
Observability Architecture
Prometheus Metrics Integration
Health Check Implementation
Correlation ID Request Tracing
Structured Logging System

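🎯 Goals

[x] Prometheus metrics collection and exposure
[x] Health/Readiness probes
[x] Correlation ID request tracing
[x] Structured logging (JSON format)
[x] A complete acceptance test suite
[x] Monitoring stack configuration (Prometheus + Grafana)

Core value:

Observability - see the system's runtime state in real time
Fault diagnosis - locate and troubleshoot problems quickly
Request tracing - end-to-end tracing across services
Production readiness - meets enterprise operations standards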

Project repository on GitHub: https://github.com/Mythetic/web_platform

🏗️ Observability Architecture

The Three Pillars

Observability in modern applications rests on three pillars:

┌────────────────────────────────────────────────────────────────────────┐
│                   The Three Pillars of Observability                   │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  📊 Metrics              📝 Logs                 🔍 Traces             │
│  ──────────────────────  ──────────────────────  ──────────────────────│
│  • System performance    • Application logs      • Request call chains │
│  • HTTP request counts   • Error details         • Cross-service traces│
│  • Latency distribution  • Business operations   • Bottleneck location │
│  • Resource usage        • Structured output     • Dependency analysis │
│                                                                        │
│  Tool: Prometheus        Tools: ELK/Loki         Tool: Jaeger          │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Architecture Implemented in This Chapter

┌──────────────────────────────────────────────────────────────────────┐
│                             User request                             │
└───────────────────────────────────┬──────────────────────────────────┘
                                    │
                                    ▼
┌──────────────────────────────────────────────────────────────────────┐
│                         Gin middleware layer                         │
│  ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐  │
│  │ CorrelationID    │ → │ StructuredLogger │ → │ PrometheusMetrics│  │
│  │ set request ID   │   │ JSON logs        │   │ collect metrics  │  │
│  └──────────────────┘   └──────────────────┘   └──────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌──────────────────────────────────────────────────────────────────────┐
│                            Business layer                            │
│  • API Handlers                                                      │
│  • Business Logic                                                    │
│  • Database Access                                                   │
└──────────────────────────────────────────────────────────────────────┘
                                    │
           ┌────────────────────────┼────────────────────────┐
           ▼                        ▼                        ▼
┌────────────────────┐   ┌────────────────────┐   ┌────────────────────┐
│ /metrics           │   │ /health            │   │ server.log         │
│ (Prometheus)       │   │ (K8s)              │   │ (JSON)             │
└──────────┬─────────┘   └──────────┬─────────┘   └──────────┬─────────┘
           │                        │                        │
           ▼                        ▼                        ▼
┌────────────────────┐   ┌────────────────────┐   ┌────────────────────┐
│ Prometheus         │   │ Load balancer      │   │ Log aggregator     │
│ server             │   │ health check       │   │ (ELK/Loki)         │
└──────────┬─────────┘   └────────────────────┘   └────────────────────┘
           │
           ▼
┌────────────────────┐
│ Grafana            │
│ dashboard          │
└────────────────────┘

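Data Flow

User request → CorrelationID middleware (reuses the incoming X-Request-ID or generates a UUID)
         ↓
         StructuredLogger middleware (records request info)
         ↓
         PrometheusMetrics middleware (starts the timer, increments the in-flight count)
         ↓
         Business handler
         ↓
         PrometheusMetrics middleware (records latency and status code, decrements the in-flight count)
         ↓
         StructuredLogger middleware (records response info)
         ↓
         Response returned (with the X-Request-ID header)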
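
The flow above is the usual "onion" ordering of middleware: each layer runs code before and after the next one. The sketch below reproduces that ordering with the standard library only, so it is not the project's Gin code; the `trace` type and `runChain` helper are names made up for this illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// trace records the order in which middleware code runs.
type trace struct{ steps []string }

// wrap returns a middleware that records an "enter" step before calling the
// next handler and an "exit" step after it returns.
func (t *trace) wrap(name string) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			t.steps = append(t.steps, name+":enter")
			next.ServeHTTP(w, r)
			t.steps = append(t.steps, name+":exit")
		})
	}
}

// runChain sends one request through CorrelationID -> StructuredLogger ->
// PrometheusMetrics -> handler and returns the recorded order.
func runChain() string {
	t := &trace{}
	var h http.Handler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		t.steps = append(t.steps, "handler")
	})
	// Wrap from the innermost layer outwards so the first name runs first.
	names := []string{"CorrelationID", "StructuredLogger", "PrometheusMetrics"}
	for i := len(names) - 1; i >= 0; i-- {
		h = t.wrap(names[i])(h)
	}
	req := httptest.NewRequest(http.MethodGet, "/api/health", nil)
	h.ServeHTTP(httptest.NewRecorder(), req)
	return strings.Join(t.steps, " > ")
}

func main() {
	fmt.Println(runChain())
}
```

In Gin the same ordering follows from the order of the `Use` calls, with `c.Next()` playing the role of `next.ServeHTTP`.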

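📊 Prometheus Metrics Integration

Why Prometheus?

Typical questions:

❓ How many concurrent requests is the system handling right now?
❓ Are API response times normal?
❓ Which endpoints are the slowest?
❓ Is the error rate rising?

What Prometheus provides:

✅ Real-time collection of application metrics
✅ Time-series data storage
✅ A powerful query language (PromQL)
✅ Graphical dashboards (via Grafana)

Metric Type Design

server/middleware/metrics.go defines three core metrics:

1. HTTP request count (Counter)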

var httpRequestsTotal = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total number of HTTP requests",
    },
    []string{"method", "path", "status"},
)

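Purpose: counts total requests per endpoint, broken down by HTTP method, path, and status code.

Query examples:

# Total requests across all endpoints
sum(http_requests_total)

# Failed requests (5xx)
sum(http_requests_total{status=~"5.."})

# Success ratio of the login endpoint over the last 5 minutes
sum(rate(http_requests_total{path="/base/login",status="200"}[5m]))
  / sum(rate(http_requests_total{path="/base/login"}[5m]))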

2. HTTP request latency (Histogram)

var httpRequestDuration = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request latency in seconds",
        Buckets: prometheus.DefBuckets, // [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
    },
    []string{"method", "path"},
)

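Purpose: records the distribution of response times and supports percentile estimates (P50, P95, P99).

Query examples:

# P95 latency across all endpoints (95% of requests complete within this time)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Average response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

# P99 latency of one endpoint; a result above 1 (second) marks it as slow
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{path="/api/some-slow-endpoint"}[5m])))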
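
To make the percentile queries less of a black box, here is a simplified sketch of how a quantile can be estimated from cumulative buckets by linear interpolation, which is the idea behind `histogram_quantile`. It ignores rates, label grouping, and other edge cases that Prometheus handles. The `bucket` type and `sampleBuckets` helper are illustrative, with counts taken from the /api/health sample output in this article.

```go
package main

import (
	"fmt"
	"math"
)

// bucket is one cumulative histogram bucket: Count observations were
// less than or equal to UpperBound.
type bucket struct {
	UpperBound float64
	Count      float64
}

// quantile estimates the q-quantile (0 < q <= 1) from cumulative buckets
// sorted by upper bound, where the last bucket is +Inf.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].Count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total

	// Find the first bucket whose cumulative count reaches the rank.
	i := 0
	for i < len(buckets)-1 && buckets[i].Count < rank {
		i++
	}
	if i == len(buckets)-1 {
		// The rank falls into +Inf: report the highest finite bound.
		return buckets[len(buckets)-2].UpperBound
	}

	lower, below := 0.0, 0.0 // the first bucket is assumed to start at 0
	if i > 0 {
		lower, below = buckets[i-1].UpperBound, buckets[i-1].Count
	}
	inBucket := buckets[i].Count - below
	if inBucket == 0 {
		return lower
	}
	// Assume observations are spread evenly inside the bucket.
	return lower + (buckets[i].UpperBound-lower)*(rank-below)/inBucket
}

// sampleBuckets mirrors the /api/health histogram from the sample output.
func sampleBuckets() []bucket {
	return []bucket{{0.005, 142}, {0.01, 145}, {math.Inf(1), 145}}
}

func main() {
	fmt.Printf("P95 = %.5fs\n", quantile(0.95, sampleBuckets()))
	fmt.Printf("P99 = %.5fs\n", quantile(0.99, sampleBuckets()))
}
```

With these counts the P95 lands inside the first bucket at roughly 4.85 ms. The estimate can never be finer than the bucket boundaries, which is why bucket choice matters for endpoints with tight latency targets.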

3. In-flight HTTP requests (Gauge)

var httpRequestsInFlight = promauto.NewGauge(
    prometheus.GaugeOpts{
        Name: "http_requests_in_flight",
        Help: "Current number of HTTP requests being served",
    },
)

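Purpose: shows how many requests are being processed right now.

Query examples:

# Current number of in-flight requests
http_requests_in_flight

# Peak number of in-flight requests over the last 5 minutes
max_over_time(http_requests_in_flight[5m])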

Middleware Implementation

func PrometheusMetrics() gin.HandlerFunc {
    return func(c *gin.Context) {
        // Skip the /metrics endpoint itself so scrapes are not counted as traffic
        if c.Request.URL.Path == "/metrics" {
            c.Next()
            return
        }

        // 1. Increment the in-flight gauge; defer guarantees the decrement
        httpRequestsInFlight.Inc()
        defer httpRequestsInFlight.Dec()

        // 2. Record the start time
        start := time.Now()

        // 3. Run the remaining middleware and the handler
        c.Next()

        // 4. Compute the request duration in seconds
        duration := time.Since(start).Seconds()

        // 5. Record the metrics
        status := strconv.Itoa(c.Writer.Status())
        method := c.Request.Method
        path := c.FullPath() // route template, not the raw URL (avoids high cardinality); empty for unmatched routes

        httpRequestsTotal.WithLabelValues(method, path, status).Inc()
        httpRequestDuration.WithLabelValues(method, path).Observe(duration)
    }
}

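Key design considerations:

Avoid high cardinality:
- ✅ Use c.FullPath() rather than c.Request.URL.Path
- Reason: the route template is fixed (e.g. /api/user/:id), while actual URLs are unbounded (/api/user/1, /api/user/2, ...)
- High cardinality makes Prometheus memory usage balloon
Skip the /metrics endpoint:
- Otherwise every Prometheus scrape would itself be recorded as application traffic
- Cuts down on meaningless metric data
Use defer to keep the gauge correct:
- The in-flight count is decremented even if the request panics
- The counter and histogram are updated after c.Next(), so a panic that propagates through this middleware skips them unless a recovery middleware runs inside it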
Exposing the Metrics Endpoint

// server/api/v1/system/metrics.go
type MetricsApi struct{}

// GetMetrics serves the default Prometheus registry in the text exposition format.
func (m *MetricsApi) GetMetrics(c *gin.Context) {
    handler := promhttp.Handler()
    handler.ServeHTTP(c.Writer, c.Request)
}

// server/initialize/router.go
metricsApi := &system.MetricsApi{}
router.GET("/metrics", metricsApi.GetMetrics)

Visiting http://localhost:8888/metrics shows:

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/health",status="200"} 145
http_requests_total{method="POST",path="/base/login",status="200"} 23
http_requests_total{method="GET",path="/api/user/getList",status="200"} 67

# HELP http_request_duration_seconds HTTP request latency in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",path="/api/health",le="0.005"} 142
http_request_duration_seconds_bucket{method="GET",path="/api/health",le="0.01"} 145
http_request_duration_seconds_bucket{method="GET",path="/api/health",le="+Inf"} 145
http_request_duration_seconds_sum{method="GET",path="/api/health"} 0.523
http_request_duration_seconds_count{method="GET",path="/api/health"} 145

# HELP http_requests_in_flight Current number of HTTP requests being served
# TYPE http_requests_in_flight gauge
http_requests_in_flight 2

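Prometheus Configuration

The scrape job is configured in deploy/monitoring/prometheus.yml:

scrape_configs:
  - job_name: 'ewp-backend'
    static_configs:
      - targets: ['host.containers.internal:8888']
    metrics_path: '/metrics'
    scrape_interval: 15s  # scrape every 15 seconds

host.containers.internal is the special hostname Podman provides for reaching the host machine.
Prometheus, running inside a container, uses this hostname to connect to port 8888 on the host.
In production, use service discovery (Kubernetes Service, Consul, etc.) instead.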

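🏥 Health Check Implementation

Why health checks?

Scenarios:

Kubernetes needs to know whether a Pod is alive (Liveness)
The load balancer needs to know whether an instance is ready (Readiness)
Operators need a quick way to judge service status

Liveness Probe

Purpose: determines whether the application process is alive; if the probe fails, Kubernetes restarts the container.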

// server/api/v1/system/health.go
func (h *HealthApi) GetHealth(c *gin.Context) {
    response.OkWithData(gin.H{
        "status":    "ok",
        "timestamp": time.Now().Format(time.RFC3339),
    }, c)
}

API response:

GET /api/health

{
  "code": 0,
  "data": {
    "status": "ok",
    "timestamp": "2026-01-05T10:15:30Z"
  },
  "msg": "success"
}

Kubernetes configuration example:

livenessProbe:
  httpGet:
    path: /api/health
    port: 8888
  initialDelaySeconds: 30  # start checking 30 seconds after startup
  periodSeconds: 10        # check every 10 seconds
  timeoutSeconds: 5        # 5-second timeout
  failureThreshold: 3      # restart only after 3 consecutive failures

Readiness Probe

Purpose: determines whether the application is ready to receive traffic; if the probe fails, the instance is taken out of load balancing.

func (h *HealthApi) GetReadiness(c *gin.Context) {
    checks := make(map[string]string)
    allHealthy := true

    // 1. Check the MySQL connection
    if err := checkMySQLConnection(); err != nil {
        checks["mysql"] = "error: " + err.Error()
        allHealthy = false
    } else {
        checks["mysql"] = "ok"
    }

    // 2. Check the Redis connection
    if err := checkRedisConnection(); err != nil {
        checks["redis"] = "error: " + err.Error()
        allHealthy = false
    } else {
        checks["redis"] = "ok"
    }

    // 3. Return the result; a 503 tells the probe this instance is not ready
    if allHealthy {
        response.OkWithData(gin.H{
            "status": "ready",
            "checks": checks,
        }, c)
    } else {
        c.JSON(http.StatusServiceUnavailable, gin.H{
            "code":   503,
            "status": "not ready",
            "checks": checks,
        })
    }
}

Example responses:

On success (HTTP 200):

{
  "code": 0,
  "data": {
    "status": "ready",
    "checks": {
      "mysql": "ok",
      "redis": "ok"
    }
  }
}

On failure (HTTP 503):

{
  "code": 503,
  "status": "not ready",
  "checks": {
    "mysql": "error: connection refused",
    "redis": "ok"
  }
}

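Kubernetes configuration example:

readinessProbe:
  httpGet:
    path: /api/ready
    port: 8888
  initialDelaySeconds: 10   # start checking 10 seconds after startup
  periodSeconds: 5          # check every 5 seconds
  timeoutSeconds: 3         # 3-second timeout
  successThreshold: 1       # one success marks the instance ready
  failureThreshold: 3       # remove only after 3 consecutive failures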

Health Check Implementation Details

// checkMySQLConnection verifies that the MySQL connection is usable.
func checkMySQLConnection() error {
    if global.EWP_DB == nil {
        return errors.New("database connection not initialized")
    }

    sqlDB, err := global.EWP_DB.DB()
    if err != nil {
        return err
    }

    // Ping the database, giving up after 2 seconds
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    return sqlDB.PingContext(ctx)
}

// checkRedisConnection verifies that the Redis connection is usable.
func checkRedisConnection() error {
    if global.EWP_REDIS == nil {
        return errors.New("redis connection not initialized")
    }

    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    return global.EWP_REDIS.Ping(ctx).Err()
}

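Key design points:

Timeout control: each check has a 2-second timeout, so a hung dependency cannot block the probe indefinitely. The two checks run one after another, so the worst case is about 4 seconds, which is longer than the 3-second timeoutSeconds of the readiness probe above.
Dependency checks: ready is reported only when every dependency is healthy
Detailed feedback: the status of each dependency is returned to make troubleshooting easier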
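🔍 Correlation ID Request Tracing

Why Correlation IDs?

Problem scenarios:

A user reports "login failed", but the logs hold thousands of records. How do you find that user's request?
A request passes through several microservices. How do you trace the full call chain?
How do you tie together frontend errors, backend logs, and database slow queries?

What a Correlation ID provides:

A unique UUID assigned to every request
Carried through the request's entire lifecycle
Recorded in logs, response headers, and the call chain
A basis for distributed tracing

Implementation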

// server/middleware/correlation.go
const CorrelationIDKey = "X-Request-ID"

func CorrelationID() gin.HandlerFunc {
    return func(c *gin.Context) {
        // 1. Try to read the Correlation ID from the request header
        correlationID := c.GetHeader(CorrelationIDKey)

        // 2. If there is none, generate a new UUID
        if correlationID == "" {
            correlationID = uuid.New().String()
        }

        // 3. Store it in the Gin context (for use by later middleware)
        c.Set(CorrelationIDKey, correlationID)

        // 4. Set the response header (returned to the client)
        c.Writer.Header().Set(CorrelationIDKey, correlationID)

        c.Next()
    }
}

Usage Scenarios

Scenario 1: tracing a single request

# The client sends a request without a Request ID
curl -i http://localhost:8888/api/health

# The response headers include the generated Request ID
HTTP/1.1 200 OK
X-Request-ID: 3c5f6a8b-1e2d-4f9a-b3c7-8d6e5f4a9b2c
Content-Type: application/json
...

The backend log shows:

{
  "correlation_id": "3c5f6a8b-1e2d-4f9a-b3c7-8d6e5f4a9b2c",
  "method": "GET",
  "path": "/api/health",
  "status": 200,
  "duration": "2.5ms"
}

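Scenario 2: propagation along the request chain

# The client supplies its own Request ID (for tracing)
curl -i -H "X-Request-ID: my-custom-request-id" http://localhost:8888/api/user/getList

# The response keeps the same Request ID
HTTP/1.1 200 OK
X-Request-ID: my-custom-request-id
...

Distributed scenario:

Frontend (Request ID: ABC123)
  ↓
API Gateway (passes ABC123 through)
  ↓
User Service (logs with ABC123)
  ↓ includes ABC123 in a SQL comment when querying the database
  ↓
MySQL Slow Query Log (/* RequestID: ABC123 */ SELECT ...)

Scenario 3: log aggregation and search

Searching in ELK/Loki:

# Find all logs for one request
correlation_id:"3c5f6a8b-1e2d-4f9a-b3c7-8d6e5f4a9b2c"

# Result:
# [Service A] received request
# [Service A] called Service B
# [Service B] queried database
# [Service B] returned result
# [Service A] returned response

Client support: the frontend should keep the same Request ID across retries and long polling
Downstream propagation: calls to other services must pass the Correlation ID along
Database comments: add a /* RequestID: xxx */ comment to SQL queries
Error reporting: include the Correlation ID in error messages so user reports can be located quickly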
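
Downstream propagation is the item that tends to get forgotten, so here is a standard-library sketch of one way to do it: the middleware stores the ID in the request context, and a helper copies it onto every outgoing request. This is not the project's Gin middleware. The `ctxKey` type, `newID`, `outgoing`, `roundTrip`, and the service-b.internal URL are placeholders for illustration, and `crypto/rand` replaces `uuid.New()` only to keep the sketch dependency-free.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

const headerName = "X-Request-ID"

// ctxKey is an unexported context key type, which avoids key collisions.
type ctxKey struct{}

// newID returns a random 128-bit identifier as 32 hex characters.
func newID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// withCorrelationID reuses the incoming ID or generates one, stores it in
// the request context, and echoes it in the response header.
func withCorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(headerName)
		if id == "" {
			id = newID()
		}
		w.Header().Set(headerName, id)
		ctx := context.WithValue(r.Context(), ctxKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// outgoing builds a request to another service and copies the ID from ctx.
func outgoing(ctx context.Context, url string) (*http.Request, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if id, ok := ctx.Value(ctxKey{}).(string); ok {
		req.Header.Set(headerName, id)
	}
	return req, nil
}

// roundTrip simulates Service A receiving a request and calling Service B.
// It returns the ID echoed to the client and the ID Service B observed.
func roundTrip(incomingID string) (echoed, seenByB string) {
	serviceB := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seenByB = r.Header.Get(headerName)
	})
	serviceA := withCorrelationID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		req, err := outgoing(r.Context(), "http://service-b.internal/users")
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		// In-process stand-in for an HTTP client call to Service B.
		serviceB.ServeHTTP(httptest.NewRecorder(), req)
	}))

	in := httptest.NewRequest(http.MethodGet, "/api/user/getList", nil)
	if incomingID != "" {
		in.Header.Set(headerName, incomingID)
	}
	rec := httptest.NewRecorder()
	serviceA.ServeHTTP(rec, in)
	return rec.Header().Get(headerName), seenByB
}

func main() {
	echoed, seen := roundTrip("my-custom-request-id")
	fmt.Println("client saw:", echoed, "| service B saw:", seen)
}
```

Because the ID travels in the context, any code that already receives a `context.Context` can attach it to outgoing calls or log lines without extra parameters.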

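📝 Structured Logging System

Why structured logging?

Problems with traditional text logs:

2026-01-05 10:15:30 [INFO] User login from IP 192.168.1.100
2026-01-05 10:15:31 [INFO] API /api/user/getList took 45ms, status=200

❌ Hard to parse and search
❌ No uniform format
❌ Missing key information (such as the Request ID)
❌ Cannot be aggregated or analyzed efficiently

Advantages of structured (JSON) logs: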

{
  "timestamp": "2026-01-05T10:15:30Z",
  "level": "info",
  "correlation_id": "3c5f6a8b-1e2d-4f9a-b3c7-8d6e5f4a9b2c",
  "method": "GET",
  "path": "/api/user/getList",
  "status": 200,
  "duration": "45ms",
  "duration_ms": 45,
  "ip": "192.168.1.100",
  "user_agent": "Mozilla/5.0...",
  "user_id": "123"
}

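✅ Machine-readable and easy to parse
✅ Consistent fields that are easy to search
✅ Carries full context
✅ Supports efficient aggregation queries

Implementation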

// server/middleware/logger.go
func StructuredLogger() gin.HandlerFunc {
    return func(c *gin.Context) {
        // 1. Record the start time
        start := time.Now()

        // 2. Run the remaining middleware and the handler
        c.Next()

        // 3. Compute the request duration
        duration := time.Since(start)

        // 4. Fetch the Correlation ID set by the CorrelationID middleware
        correlationID, _ := c.Get(CorrelationIDKey)

        // 5. Fetch user info (if the request is authenticated)
        userID := ""
        if claims, exists := c.Get("claims"); exists {
            if jwtClaims, ok := claims.(*systemReq.CustomClaims); ok {
                userID = strconv.Itoa(int(jwtClaims.BaseClaims.ID))
            }
        }

        // 6. Build the structured log entry
        logData := map[string]interface{}{
            "timestamp":      time.Now().Format(time.RFC3339),
            "correlation_id": correlationID,
            "method":         c.Request.Method,
            "path":           c.Request.URL.Path,
            "status":         c.Writer.Status(),
            "duration":       duration.String(),
            "duration_ms":    duration.Milliseconds(),
            "ip":             c.ClientIP(),
            "user_agent":     c.Request.UserAgent(),
        }

        if userID != "" {
            logData["user_id"] = userID
        }

        // 7. Emit the JSON log line; report encoding failures instead of dropping them
        logJSON, err := json.Marshal(logData)
        if err != nil {
            global.EWP_LOG.Error("failed to encode request log: " + err.Error())
            return
        }
        global.EWP_LOG.Info(string(logJSON))
    }
}

Log Field Reference

Field	Type	Description	Example
`timestamp`	string	Log time (ISO 8601)	`2026-01-05T10:15:30Z`
`correlation_id`	string	Request tracing ID	`3c5f6a8b-1e2d-4f9a...`
`method`	string	HTTP method	`GET`, `POST`
`path`	string	Request path	`/api/user/getList`
`status`	int	HTTP status code	`200`, `404`, `500`
`duration`	string	Human-readable duration	`45ms`, `1.2s`
`duration_ms`	int	Duration in milliseconds (easy to aggregate)	`45`, `1200`
`ip`	string	Client IP	`192.168.1.100`
`user_agent`	string	Browser identifier	`Mozilla/5.0...`
`user_id`	string	User ID (if logged in)	`123`

Log Query Examples

Querying in ELK:

// All requests from one user
user_id:"123"

// Slow requests (> 1 second)
duration_ms:>1000

// Failed requests (5xx)
status:>=500

// Requests within a time range
timestamp:[2026-01-05T10:00:00Z TO 2026-01-05T11:00:00Z]

// Aggregation: count requests per status code
{
  "aggs": {
    "status_codes": {
      "terms": { "field": "status" }
    }
  }
}

Querying in Loki:

# All logs for one Request ID
{job="ewp-backend"} | json | correlation_id="3c5f6a8b-1e2d-4f9a-b3c7-8d6e5f4a9b2c"

# Number of requests per minute
sum(count_over_time({job="ewp-backend"}[1m]))

# P99 response time in milliseconds over the last 5 minutes
quantile_over_time(0.99, {job="ewp-backend"} | json | __error__="" | unwrap duration_ms [5m]) by (job)

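Log Level Conventions

// Use different log levels for different situations
global.EWP_LOG.Debug(string(logJSON))   // debug info (not emitted in production)
global.EWP_LOG.Info(string(logJSON))    // normal requests (our choice)
global.EWP_LOG.Warn(string(logJSON))    // warnings (e.g. slow queries)
global.EWP_LOG.Error(string(logJSON))   // errors (e.g. 5xx)
global.EWP_LOG.Fatal(string(logJSON))   // fatal errors (the process exits)

Choosing a log level:

Info: normal HTTP requests (200, 201, 204)
Warn: requests that may indicate a problem (401, 403, 404, request timeouts)
Error: server errors (500, 502, 503, panics)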
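
The level rules above can be folded into one small function. The sketch below uses the standard library's log/slog as a stand-in for the project's logger, and it generalizes the listed codes to whole ranges (everything below 400 is Info, 4xx is Warn, 5xx is Error), which is an assumption rather than something stated here. `levelForStatus` is an illustrative name.

```go
package main

import (
	"context"
	"log/slog"
	"os"
)

// levelForStatus maps an HTTP status code to a log level: 5xx is an error,
// 4xx is a warning, and everything else is informational.
func levelForStatus(status int) slog.Level {
	switch {
	case status >= 500:
		return slog.LevelError
	case status >= 400:
		return slog.LevelWarn
	default:
		return slog.LevelInfo
	}
}

func main() {
	// A JSON handler yields one structured log line per call.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	for _, status := range []int{200, 404, 503} {
		logger.Log(context.Background(), levelForStatus(status), "request completed",
			slog.Int("status", status))
	}
}
```

Calling this from the logging middleware, instead of always logging at Info, lets the log pipeline filter or alert on level without parsing the status field.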

Future Improvements

1. Monitoring and alerting

# Example Prometheus alerting rules
groups:
  - name: ewp_alerts
    rules:
      - alert: HighErrorRate
        # share of requests answered with 5xx over the last 5 minutes
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"

      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        annotations:
          summary: "API latency P95 > 1s"
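2. Distributed tracing

Integrate Jaeger for full distributed tracing:

// Uses the OpenTelemetry API
import (
    "github.com/gin-gonic/gin"
    "go.opentelemetry.io/otel"
)

// tracer comes from the globally registered TracerProvider; the name is illustrative.
var tracer = otel.Tracer("ewp-backend")

func TracingMiddleware() gin.HandlerFunc {
    return func(c *gin.Context) {
        ctx, span := tracer.Start(c.Request.Context(), c.FullPath())
        defer span.End()

        // Make the span context available to downstream handlers
        c.Request = c.Request.WithContext(ctx)
        c.Next()
    }
}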

3. Log aggregation

Ship the logs to ELK or Loki:

# Promtail configuration (Loki)
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: ewp-backend
    static_configs:
      - targets:
          - localhost
        labels:
          job: ewp-backend
          __path__: /path/to/server.log

