Concept Overview
Kubernetes monitoring and logging are key pillars of cluster stability and application health. Monitoring focuses on performance metrics and health status, while log management captures the detailed behavior of applications and system components.
Monitoring Concepts
- Metrics monitoring: collect and analyze performance metrics such as CPU, memory, network, and disk usage
- Health checks: monitor the availability and responsiveness of applications and services (see the probe sketch after this list)
- Alerting: trigger notifications and automated responses when anomalies are detected
- Visualization: present monitoring data intuitively through charts and dashboards
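In Kubernetes, health checks are typically implemented as liveness and readiness probes on a container. A minimal sketch (the name, image, port, and endpoint paths are illustrative and assume the application exposes them):
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo                # hypothetical name
spec:
  containers:
  - name: app
    image: my-web-app:1.0         # illustrative image
    ports:
    - containerPort: 8080
    livenessProbe:                # restart the container if this check fails
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 15
    readinessProbe:               # remove the Pod from Service endpoints while failing
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5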
Log Management Concepts
- Application logs: business and error logs produced by applications
- System logs: logs produced by Kubernetes components and node operating systems
- Audit logs: records of cluster API access and security-related events (see the policy sketch after this list)
- Log aggregation: centralized collection, storage, and analysis of logs across a distributed system
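Audit logs are enabled by pointing the API server at an audit policy file (via --audit-policy-file). A minimal sketch of such a policy, assuming you only need metadata-level records for Secret access:
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Record who read or modified Secrets, without logging request bodies
- level: Metadata
  resources:
  - group: ""
    resources: ["secrets"]
# Ignore everything else
- level: None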
Core Components
- Data collection: Prometheus, Fluentd, Logstash, etc.
- Data storage: Prometheus TSDB, Elasticsearch, Loki, etc.
- Data visualization: Grafana, Kibana, etc.
- Alerting: Alertmanager, ElastAlert, etc.
Core Features
Monitoring Features
- Multi-level monitoring: nodes, Pods, containers, services, and more
- Auto-discovery: newly deployed services are discovered and monitored automatically
- Metric aggregation: metrics can be aggregated, computed, and transformed
- Historical data: long-term storage and querying of historical monitoring data
- Alert rules: flexible configuration and management of alerting rules
Log Management Features
- Real-time collection: logs are collected and shipped in real time
- Structured parsing: log content is parsed and structured automatically
- Full-text search: efficient log search and querying
- Log filtering: logs can be filtered and routed by condition
- Compressed storage: efficient log compression and storage
Hands-on Tutorial
Deploying the Prometheus Monitoring System
# Deploy Prometheus with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Create the monitoring namespace
kubectl create namespace monitoring
# Deploy Prometheus
helm install prometheus prometheus-community/prometheus \
-n monitoring \
--set alertmanager.persistentVolume.enabled=false \
--set server.persistentVolume.enabled=false
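Before moving on, it is worth confirming the release came up. A quick check (the service and label names follow this chart's defaults and may differ across chart versions):
# List the release's pods
kubectl get pods -n monitoring -l app.kubernetes.io/instance=prometheus
# Port-forward the server and browse http://localhost:9090
kubectl port-forward svc/prometheus-server 9090:80 -n monitoring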
Deploying the Grafana Visualization Platform
# Deploy Grafana with Helm
helm repo add grafana https://grafana.github.io/helm-charts
# Deploy Grafana
helm install grafana grafana/grafana \
-n monitoring \
--set persistence.enabled=false \
--set adminPassword=admin123
# Retrieve the Grafana admin password
kubectl get secret --namespace monitoring grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
# Port-forward to access Grafana
kubectl port-forward svc/grafana 3000:80 -n monitoring
Deploying the EFK Logging Stack
# Add the Elastic and Fluent Helm repositories
helm repo add elastic https://helm.elastic.co
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
# Deploy Elasticsearch
helm install elasticsearch elastic/elasticsearch \
-n monitoring \
--set replicas=1 \
--set minimumMasterNodes=1
# Deploy Fluentd
helm install fluentd fluent/fluentd \
-n monitoring
# Deploy Kibana
helm install kibana elastic/kibana \
-n monitoring
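A quick way to verify Elasticsearch is healthy (the service name follows the chart default elasticsearch-master; note that recent chart versions enable TLS and authentication by default, in which case plain HTTP will not work):
kubectl port-forward svc/elasticsearch-master 9200:9200 -n monitoring &
curl http://localhost:9200/_cluster/health?pretty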
Configuring Application Monitoring
# Deployment with Prometheus scrape annotations
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
labels:
app: web-app
spec:
replicas: 3
selector:
matchLabels:
app: web-app
template:
metadata:
labels:
app: web-app
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: app
image: my-web-app:1.0
ports:
- containerPort: 8080
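These annotations only take effect if the Prometheus scrape configuration honors them, as the kubernetes-pods job in the case study below does. To confirm the application really serves metrics on the annotated port and path, a quick check (assuming the Deployment above is running):
# Port-forward the Deployment and fetch the metrics endpoint directly
kubectl port-forward deploy/web-app 8080:8080
curl http://localhost:8080/metrics
# Then look for the pod under http://localhost:9090/targets in the Prometheus UI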
Real-World Case Study
Case: Full-Stack Monitoring for an E-commerce Platform
A large e-commerce platform needs to build a complete monitoring and logging system covering infrastructure, application services, and business metrics:
# Prometheus configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 15s
rule_files:
- "/etc/prometheus-rules/*.rules"
scrape_configs:
    # Kubernetes node monitoring (note: the kubelet read-only port 10255 is deprecated and disabled on modern clusters; see the HTTPS proxy approach under Configuration Details)
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
relabel_configs:
- source_labels: [__address__]
regex: '(.*):10250'
target_label: __address__
replacement: '${1}:10255'
    # Kubernetes Pod monitoring
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
    # Business-metrics monitoring
- job_name: 'business-metrics'
static_configs:
- targets: ['business-metrics-exporter:8080']
---
# Alertmanager configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-config
namespace: monitoring
data:
config.yml: |
global:
smtp_smarthost: 'smtp.example.com:587'
smtp_from: 'alert@example.com'
smtp_auth_username: 'alert'
smtp_auth_password: 'password'
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'team-X-mails'
receivers:
- name: 'team-X-mails'
email_configs:
- to: 'team-X@example.com'
- name: 'webhook'
webhook_configs:
- url: 'http://webhook-service:8080/alert'
---
# Grafana dashboard configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboards
namespace: monitoring
data:
kubernetes-cluster-dashboard.json: |
{
"dashboard": {
"id": null,
"title": "Kubernetes Cluster Monitoring",
"timezone": "browser",
"panels": [
{
"title": "Cluster CPU Usage",
"type": "graph",
"targets": [
{
"expr": "sum(rate(container_cpu_usage_seconds_total[5m])) by (instance)",
"legendFormat": "{{instance}}"
}
]
},
{
"title": "Cluster Memory Usage",
"type": "graph",
"targets": [
{
"expr": "sum(container_memory_usage_bytes) by (instance)",
"legendFormat": "{{instance}}"
}
]
}
]
}
}
---
# Fluentd configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: fluentd-config
namespace: monitoring
data:
fluent.conf: |
<source>
@type tail
path /var/log/containers/*.log
pos_file /var/log/fluentd-containers.log.pos
tag kubernetes.*
read_from_head true
<parse>
@type json
time_format %Y-%m-%dT%H:%M:%S.%NZ
</parse>
</source>
<filter kubernetes.**>
@type kubernetes_metadata
</filter>
<match kubernetes.var.log.containers.app-**>
@type elasticsearch
host elasticsearch
port 9200
logstash_format true
<buffer>
@type file
path /var/log/fluentd-buffers/kubernetes.system.buffer
flush_mode interval
retry_type exponential_backoff
flush_thread_count 2
flush_interval 5s
retry_forever
retry_max_interval 30
chunk_limit_size 2M
queue_limit_length 8
overflow_action block
</buffer>
</match>
---
# Custom monitoring exporter
apiVersion: apps/v1
kind: Deployment
metadata:
name: business-metrics-exporter
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: business-metrics-exporter
template:
metadata:
labels:
app: business-metrics-exporter
spec:
containers:
- name: exporter
image: business/metrics-exporter:1.0
ports:
- containerPort: 8080
env:
- name: DATABASE_HOST
value: "mysql-service"
- name: REDIS_HOST
value: "redis-service"
---
apiVersion: v1
kind: Service
metadata:
name: business-metrics-exporter
namespace: monitoring
spec:
selector:
app: business-metrics-exporter
ports:
- port: 8080
targetPort: 8080
---
# Monitoring alert rules
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-rules
namespace: monitoring
data:
app.rules: |
groups:
- name: app.rules
rules:
- alert: HighRequestLatency
expr: api_http_request_latencies_second{quantile="0.5"} > 1
for: 10m
labels:
severity: page
annotations:
summary: "High request latency on {{ $labels.instance }}"
description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
- alert: ApplicationDown
expr: up{job="kubernetes-pods"} == 0
for: 5m
labels:
severity: page
annotations:
summary: "Application instance {{ $labels.instance }} down"
description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
Advantages of this full-stack monitoring system:
- Complete coverage: monitoring from infrastructure up to business metrics
- Real-time alerting: problems are detected and the right people notified promptly
- Visualization: intuitive dashboards make problem analysis easier
- Log aggregation: centralized log management simplifies troubleshooting
- Auto-discovery: newly deployed services are monitored automatically
- Extensibility: custom metrics and alert rules are supported
Configuration Details
Prometheus Configuration Explained
global:
  scrape_interval: 15s # global scrape interval
  evaluation_interval: 15s # rule evaluation interval
rule_files:
- "first_rules.yml" # 告警規則文件
- "second_rules.yml"
scrape_configs:
  # Scrape Prometheus itself
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
  # Scrape Kubernetes nodes
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
  # Scrape Kubernetes service endpoints
- job_name: 'kubernetes-service-endpoints'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: kubernetes_name
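After changing a scrape configuration, Prometheus can pick it up without a restart. A sketch of the two standard options (the HTTP reload endpoint only works when the server runs with --web.enable-lifecycle):
# Trigger a configuration reload over HTTP
curl -X POST http://localhost:9090/-/reload
# Alternatively, send SIGHUP to the Prometheus process
kill -HUP $(pgrep prometheus)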
Grafana Configuration Explained
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-config
namespace: monitoring
data:
grafana.ini: |
[server]
domain = grafana.example.com
root_url = %(protocol)s://%(domain)s:%(http_port)s/
serve_from_sub_path = true
[security]
admin_user = admin
    admin_password = $__file{/etc/grafana/secrets/admin_password}
[users]
allow_sign_up = false
auto_assign_org = true
auto_assign_org_role = Viewer
[auth.anonymous]
enabled = false
[dashboards]
versions_to_keep = 20
[metrics]
enabled = true
interval_seconds = 10
[unified_alerting]
enabled = true
[alerting]
enabled = false
[log]
mode = console file
level = info
[paths]
logs = /var/log/grafana
plugins = /var/lib/grafana/plugins
provisioning = /etc/grafana/provisioning
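The [paths] section above points provisioning at /etc/grafana/provisioning, where data sources can be declared as files instead of clicked together in the UI. A minimal sketch of a file under /etc/grafana/provisioning/datasources/ (the URL assumes the in-cluster prometheus-server service from the earlier Helm install):
apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  url: http://prometheus-server.monitoring.svc.cluster.local
  isDefault: true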
Fluentd Configuration Explained
<source>
@type tail
@id in_tail_container_logs
path /var/log/containers/*.log
pos_file /var/log/fluentd-containers.log.pos
tag kubernetes.*
read_from_head true
<parse>
@type json
time_key time
time_format %Y-%m-%dT%H:%M:%S.%NZ
</parse>
</source>
<filter kubernetes.**>
@type kubernetes_metadata
@id filter_kube_metadata
kubernetes_url "#{ENV['FLUENT_FILTER_KUBERNETES_URL'] || 'https://kubernetes.default.svc:443/api'}"
bearer_token_file /var/run/secrets/kubernetes.io/serviceaccount/token
ca_file /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
skip_labels false
skip_master_url false
skip_container_metadata false
</filter>
<filter kubernetes.var.log.containers.**_app-**>
@type parser
key_name log
reserve_data true
<parse>
@type json
</parse>
</filter>
<match kubernetes.var.log.containers.**_app-**>
@type rewrite_tag_filter
<rule>
key log_level
pattern ^ERROR$
tag error.${tag}
</rule>
</match>
<match error.**>
@type elasticsearch
host elasticsearch
port 9200
logstash_format true
logstash_prefix error-logs
<buffer>
@type file
path /var/log/fluentd-buffers/error.buffer
flush_mode interval
flush_interval 5s
flush_thread_count 2
retry_type exponential_backoff
retry_forever true
retry_max_interval 30
chunk_limit_size 2M
total_limit_size 512MB
overflow_action block
</buffer>
</match>
<match kubernetes.**>
@type elasticsearch
host elasticsearch
port 9200
logstash_format true
logstash_prefix kubernetes-logs
<buffer>
@type file
path /var/log/fluentd-buffers/kubernetes.buffer
flush_mode interval
flush_interval 10s
flush_thread_count 4
retry_type exponential_backoff
retry_forever true
retry_max_interval 60
chunk_limit_size 4M
total_limit_size 1GB
overflow_action block
</buffer>
</match>
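A configuration like this is normally delivered by a Fluentd DaemonSet that mounts the node's log directory. A minimal sketch (the image tag is illustrative, and the ServiceAccount must carry RBAC permissions for the kubernetes_metadata filter to query the API server):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd        # hypothetical ServiceAccount with pod-read RBAC
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1.16-debian-elasticsearch8-1  # illustrative tag
        volumeMounts:
        - name: varlog                   # gives the tail plugin access to /var/log/containers
          mountPath: /var/log
        - name: config
          mountPath: /fluentd/etc
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: config
        configMap:
          name: fluentd-config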
Troubleshooting
Common Issues and Solutions
- Missing monitoring metrics
# Check Prometheus target status
kubectl port-forward svc/prometheus-server 9090:80 -n monitoring
# Browse http://localhost:9090/targets to inspect target state
# Check service annotations
kubectl describe pod <pod-name> -n <namespace>
# Check network policies
kubectl get networkpolicy -n <namespace>
- Incomplete log collection
# Check the Fluentd logs
kubectl logs <fluentd-pod> -n monitoring
# Check log file permissions
kubectl exec <node-pod> -- ls -la /var/log/containers/
# Check the Fluentd configuration
kubectl describe configmap fluentd-config -n monitoring
- Alerts not firing
# Check the alert rules
kubectl port-forward svc/prometheus-server 9090:80 -n monitoring
# Browse http://localhost:9090/rules to inspect rule state
# Browse http://localhost:9090/alerts to inspect alert state
# Check the Alertmanager configuration
kubectl logs <alertmanager-pod> -n monitoring
- No data in Grafana panels
# Check the data source configuration
kubectl port-forward svc/grafana 3000:80 -n monitoring
# Verify the data source connection inside Grafana
# Validate the PromQL query in Grafana's query editor
# Confirm the selected time range is correct
Best Practices
- Monitoring strategy (see the alerting sketch after this list):
  - Build a layered monitoring stack (infrastructure, platform, application, business)
  - Set sensible alert thresholds and silence periods
  - Review and tune monitoring rules regularly
  - Establish metric baselines and trend analysis
- Log management:
  - Standardize log formats and structure
  - Set log levels appropriately
  - Implement log rotation and cleanup policies
  - Establish a log classification and labeling scheme
- Performance optimization (see the recording-rule sketch after this list):
  - Choose scrape intervals appropriate to the workload
  - Optimize log collection and shipping performance
  - Apply sampling and downsampling strategies
  - Use efficient storage and indexing mechanisms
- Security:
  - Restrict access to monitoring and logging systems
  - Encrypt sensitive monitoring data in transit
  - Audit access to monitoring and log data regularly
  - Implement data backup and disaster recovery
- Operations:
  - Establish monitoring and alert response procedures
  - Run regular health checks on the monitoring system itself
  - Maintain monitoring documentation and runbooks
  - Train team members on the monitoring tools
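For the alert-threshold and silence-period advice above, Alertmanager offers inhibition rules and mute time intervals. A hedged sketch using the matcher syntax of recent Alertmanager releases:
# Suppress warning-level alerts on an instance while a critical alert is already firing there
inhibit_rules:
- source_matchers: ['severity="critical"']
  target_matchers: ['severity="warning"']
  equal: ['instance']
# Define a weekend window; a route can reference it as mute_time_intervals: [weekends]
time_intervals:
- name: weekends
  time_intervals:
  - weekdays: ['saturday', 'sunday']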
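For the scrape-frequency and sampling advice, recording rules let Prometheus precompute expensive aggregations once, instead of evaluating them on every dashboard refresh. A minimal sketch of a rule file:
groups:
- name: cpu-aggregation
  interval: 1m
  rules:
  # Precompute per-instance CPU usage so dashboards query the cheap recorded series
  - record: instance:container_cpu_usage:rate5m
    expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (instance)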
Security Considerations
Monitoring System Security Configuration
apiVersion: v1
kind: Secret
metadata:
name: prometheus-secrets
namespace: monitoring
type: Opaque
data:
admin-password: <base64-encoded-password>
tls-cert: <base64-encoded-cert>
tls-key: <base64-encoded-key>
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: prometheus-network-policy
namespace: monitoring
spec:
podSelector:
matchLabels:
app: prometheus
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: monitoring
- podSelector:
matchLabels:
app: grafana
ports:
- protocol: TCP
port: 9090
egress:
- to:
- namespaceSelector: {}
ports:
- protocol: TCP
port: 443
- protocol: TCP
port: 10250
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/proxy
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- apiGroups:
  - networking.k8s.io
resources:
- ingresses
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: monitoring
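The binding only takes effect if the Prometheus pods actually run under this ServiceAccount. An excerpt of the pod template that would go in the Prometheus Deployment (a sketch; the Helm chart used earlier sets this up for you):
spec:
  template:
    spec:
      serviceAccountName: prometheus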
Logging System Security Configuration
apiVersion: v1
kind: Secret
metadata:
name: elasticsearch-secrets
namespace: monitoring
type: Opaque
data:
elastic-password: <base64-encoded-password>
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: elasticsearch-network-policy
namespace: monitoring
spec:
podSelector:
matchLabels:
app: elasticsearch
policyTypes:
- Ingress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: monitoring
- podSelector:
matchLabels:
app: fluentd
ports:
- protocol: TCP
port: 9200
- protocol: TCP
port: 9300
Command Quick Reference
| Command | Description |
|---|---|
| `kubectl top nodes` | Show node resource usage |
| `kubectl top pods` | Show Pod resource usage |
| `kubectl logs <pod>` | Show a Pod's logs |
| `kubectl logs -f <pod>` | Stream a Pod's logs in real time |
| `kubectl port-forward svc/prometheus-server 9090:80 -n monitoring` | Port-forward to access Prometheus |
| `kubectl port-forward svc/grafana 3000:80 -n monitoring` | Port-forward to access Grafana |
| `kubectl exec -it <prometheus-pod> -n monitoring -- wget -qO- http://localhost:9090/metrics` | Inspect Prometheus' own metrics |
| `kubectl get servicemonitor -n monitoring` | List ServiceMonitor resources |
| `kubectl get prometheusrule -n monitoring` | List PrometheusRule resources |
| `kubectl logs <fluentd-pod> -n monitoring --tail=100` | Show the last 100 lines of Fluentd logs |
Summary
Monitoring and logging are essential safeguards for a stable Kubernetes cluster. After working through this document, you should be able to:
- Understand the core concepts and importance of monitoring and log management
- Deploy and configure mainstream monitoring and logging systems
- Design and implement full-stack monitoring solutions
- Configure sophisticated monitoring rules and alerting strategies
- Troubleshoot common monitoring and logging problems
- Follow monitoring and logging best practices and security guidance
This K8s skills documentation set covers everything from basic concepts to advanced features, providing a comprehensive body of Kubernetes knowledge and practical guidance.