Monitoring a production Kubernetes cluster

12-layer production review: Golden Signals / RED / USE plus SLOs, applied to each layer of the cluster with key metrics, PromQL, and alerts.

Not sure what to monitor? Here is a 12-layer map: three foundational frameworks (Golden Signals · RED · USE) plus SLOs and error budgets, applied to each layer of a production cluster, with key signals, sample PromQL, alerts, and pitfalls.

Why monitor (and monitor correctly)

Kubernetes has many interdependent components: nodes, pods, deployments, services, ingress, storage, autoscalers, and the control plane. A small failure in one layer spreads across the system, and "running" does not mean "healthy":

A pod is Running but the app returns errors or is deadlocked.
A node is Ready but under memory pressure, so pods get evicted.
A deploy succeeds but user-facing latency spikes.
A service is fine but the ingress returns 5xx or times out.

Every monitoring design only has to answer three questions: What is broken? · Why is it broken? · Who/what is impacted?

0 · Foundational principles

Three complementary lenses. Don't pick just one; use the right lens for the right target:

Framework	Measures	Used for
Golden Signals	Latency · Traffic · Errors · Saturation	Every user-facing service/component (Google SRE)
RED	Rate · Errors · Duration	Request-driven services (API, microservice)
USE	Utilization · Saturation · Errors	Resources: node, disk, queue, cache

SLO + error budget

Define an SLO for each critical service (e.g. 99.9% availability, p99 < 300ms). Alert on multi-window burn rate (a fast window plus a slow window) rather than on raw thresholds: page when the budget is burning fast, open a ticket when it is burning slowly. This cuts false positives and fires when users are actually hurting.
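To make "burning fast" concrete, a burn-rate threshold can be derived from how much of the error budget you are willing to lose before paging. The sketch below assumes a 30-day (720 h) SLO period and a policy of paging when 2% of the budget is consumed within 1 hour. Both numbers are illustrative assumptions taken from the multi-window approach in the Google SRE material listed in the references, not values this article prescribes.

```latex
% Burn rate: how many times faster than "exactly on budget" errors are arriving.
\[
\text{burn rate} = \frac{\text{observed error ratio}}{1 - \text{SLO}}
\]
% Fraction of the budget consumed in an alert window of length w, for an SLO period T:
\[
\text{budget consumed} = \text{burn rate} \times \frac{w}{T}
\]
% Assumed policy: page when 2\% of a 30-day (720 h) budget is gone within 1 h.
\[
\text{burn rate} = 0.02 \times \frac{720\,\text{h}}{1\,\text{h}} = 14.4,
\qquad
\text{error ratio threshold} = 14.4 \times (1 - 0.999) = 0.0144
\]
```

With a 99.9% SLO this means a sustained error ratio above roughly 1.44% over the window is enough to page.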

Alert quality: every alert must be actionable and carry a runbook link, exactly one severity, and one routing target. Use inhibition and deduplication to prevent alert fatigue. Too many alerts cause fatigue; too few leave blind spots.

Collection architecture

Pull-based: Prometheus scrapes the exporters; logs and traces travel their own paths; Alertmanager routes alerts to on-call.

Collect: node-exporter · kube-state-metrics · cAdvisor · app /metrics
→ Store: Prometheus · Loki (logs) · Tempo (traces)
→ See / decide: Grafana · Alertmanager
→ Route: PagerDuty · Slack
The 12-layer model

Layer	Focus	Primary method
L1 Control plane	apiserver / etcd / scheduler	Golden Signals (managed caveats)
L2 Nodes & capacity	host resources, autoscaler	USE
L3 Workloads	pods / deployments / statefulsets	workload health
L4 Application	per-service request health	Golden Signals / RED
L5 Data layer	DB / cache / queue / search	Golden Signals per datastore
L6 Network / DNS / Ingress	CoreDNS, ingress, LB	RED + health
L7 Logging	log pipeline & retention	coverage / cost
L8 Tracing / APM	distributed traces	coverage
L9 Alerting & on-call	routing, escalation	quality
L10 Dashboards & SLO	visualization & objectives	coverage
L11 Platform & Security	secrets, NetPol, PSA, GitOps	posture
L12 Cost	spend & waste	FinOps

L1 · Control plane

Focus: apiserver · etcd · scheduler · controller-manager
Method: Golden Signals

The heart of the cluster. On EKS/GKE the control plane is managed, so you only see part of the metrics (the apiserver through its metrics endpoint, but not the internal etcd). On a self-managed cluster, monitor all of it.

Signal	Metric	Suggested threshold
Availability	`up{job="apiserver"}`	= 0 → critical
Errors	`apiserver_request_total{code=~"5.."}`	5xx rate > 1%
Latency	`apiserver_request_duration_seconds` (p99)	p99 > 1s (read) / > 5s (write)
etcd	`etcd_disk_wal_fsync_duration_seconds`	p99 > 10ms → slow
Scheduling	failed scheduling events · pending pods	> 0 sustained

# apiserver 5xx error ratio (5m)
sum(rate(apiserver_request_total{code=~"5.."}[5m]))
  / sum(rate(apiserver_request_total[5m]))

Managed control plane: don't alert on etcd or scheduler metrics you cannot see. Focus on apiserver availability, latency, and failed scheduling (a sign of exhausted capacity; see L2).
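The apiserver signals in the table can be turned into alert rules along these lines. This is a minimal sketch: the group name, alert names, `for` durations, and severities are assumptions, and the thresholds simply reuse the suggested values above.

```yaml
groups:
  - name: control-plane          # hypothetical group name
    rules:
      - alert: ApiserverHighErrorRatio
        # Share of apiserver requests answered with a 5xx over 5 minutes.
        expr: |
          sum(rate(apiserver_request_total{code=~"5.."}[5m]))
            / sum(rate(apiserver_request_total[5m])) > 0.01
        for: 10m
        labels:
          severity: critical
      - alert: ApiserverSlowReads
        # p99 over GET/LIST only; the verb filter leaves out long-running
        # WATCH and CONNECT requests, which would otherwise skew the quantile.
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(apiserver_request_duration_seconds_bucket{verb=~"GET|LIST"}[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
```

On a managed control plane both rules only work if the provider exposes the apiserver metrics endpoint to your Prometheus.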

L2 · Nodes & capacity

Focus: host CPU/mem/disk/network · autoscaler
Method: USE

Physical resources. USE asks: how much is in use (Utilization), is it close to choking (Saturation), and are there errors (Errors). Sources: node-exporter + kube-state-metrics.

Signal	PromQL
CPU utilization	`1 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))`
Memory available	`node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes`
Disk free	`node_filesystem_avail_bytes / node_filesystem_size_bytes`
Saturation	`node_load5` vs core count · disk/mem pressure conditions
Readiness	`kube_node_status_condition{condition="Ready",status="true"}`

# ALERT: node memory > 85% for 10 minutes
- alert: NodeMemoryHigh
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.85
  for: 10m
  labels: { severity: warning }
  annotations: { runbook: "https://runbooks/node-mem" }

Capacity = nodes + autoscaler. Track how node pools scale in and out and how well pods bin-pack (see Karpenter scheduled bin-packing). Wrong resource requests make the autoscaler miscalculate capacity.
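The disk and readiness rows of the table can be alerted on in the same style as the memory rule. The sketch below is illustrative: the 10% threshold, the `for` durations, the severities, and the filesystem filter are assumptions to adapt to your nodes.

```yaml
groups:
  - name: nodes                  # hypothetical group name
    rules:
      - alert: NodeDiskAlmostFull
        # Skip tmpfs and overlay mounts, which do not represent real disks
        # and would otherwise add noise.
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
        for: 15m
        labels:
          severity: warning
      - alert: NodeNotReady
        # kube-state-metrics reports 1 when the condition holds, 0 otherwise.
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
```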

L3 · Workloads

Focus: pods · deployments · statefulsets · HPA · PDB
Method: workload health

The operational state of workloads. The main sources are kube-state-metrics (state) + cAdvisor (per-container resources).

Signal	PromQL
Restarts	`increase(kube_pod_container_status_restarts_total[10m])`
Bad states	`kube_pod_status_phase{phase=~"Pending|Failed"}` · CrashLoopBackOff / ImagePullBackOff
OOMKilled	`kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}`
Replica mismatch	`kube_deployment_status_replicas_available < kube_deployment_spec_replicas`
Rollout / HPA	rollout stuck · `kube_horizontalpodautoscaler_status_current_replicas`

# ALERT: more than 3 container restarts in 10 minutes
- alert: PodRestarting
  expr: increase(kube_pod_container_status_restarts_total[10m]) > 3
  for: 0m
  labels: { severity: warning }

Repeated OOMKilled means the memory limit is too low or the app is leaking. For memory, set request = limit, because memory is non-compressible (see CPU limits on Kubernetes).
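CrashLoopBackOff and replica mismatch from the table can be expressed as rules too. A minimal sketch, assuming kube-state-metrics is scraped; the names, durations, and severities are placeholders.

```yaml
groups:
  - name: workloads              # hypothetical group name
    rules:
      - alert: PodCrashLooping
        # The waiting-reason series is 1 while the container sits in that state.
        expr: |
          max by (namespace, pod, container) (
            kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}
          ) == 1
        for: 5m
        labels:
          severity: warning
      - alert: DeploymentReplicasMismatch
        # Fewer available replicas than the spec asks for, for long enough
        # that a normal rollout should already have finished.
        expr: |
          kube_deployment_status_replicas_available
            < kube_deployment_spec_replicas
        for: 15m
        labels:
          severity: warning
```

The `for` window on the replica rule matters: without it, every rolling update would fire the alert briefly.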

L4 · Application

Focus: per-service request health
Method: Golden Signals / RED

What users actually feel. The app exposes its own /metrics; latency must be exported as a histogram so that quantiles can be computed and aggregated across pods.

Signal	PromQL
Latency p95	`histogram_quantile(0.95, sum by(le)(rate(http_request_duration_seconds_bucket[5m])))`
Error rate	`sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`
Traffic	`sum(rate(http_requests_total[1m]))`
Saturation	in-flight requests · worker/queue pool full · connection pool
Dependencies	error rate and latency of downstream dependencies (DB, cache, external APIs)

Set SLOs on these same signals (e.g. error ratio < 0.1%, p99 < 300ms), then alert on burn rate in L9.
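One way to wire these signals into Prometheus is a recording rule for the error ratio plus an alert on the latency objective. The sketch assumes the metrics carry a `service` label and use the `code` label shown in the table; the rule names and the 10-minute `for` are hypothetical.

```yaml
groups:
  - name: app-slo-signals        # hypothetical group name
    rules:
      # Error ratio per service, recorded once so alerts and dashboards share it.
      - record: service:http_errors:ratio_rate5m
        expr: |
          sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m]))
      - alert: ServiceLatencyP99High
        # 0.3 s matches the example objective p99 < 300ms.
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.3
        for: 10m
        labels:
          severity: warning
```

`histogram_quantile` interpolates inside a bucket, so the histogram needs a bucket boundary near 0.3 s for this threshold to be meaningful.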

L5 · Data layer

Focus: database · cache · queue · search
Method: Golden Signals per datastore

Stateful systems are where incidents hurt most. Use a dedicated exporter for each datastore type.

Datastore	Key signals
PostgreSQL	connections vs max · replication lag · slow queries · deadlocks · cache hit ratio · WAL/disk
Redis	hit ratio · evictions · memory · latency · connected clients · replication
RabbitMQ / Kafka	queue depth · consumer lag · publish/ack rate · unacked · partitions under-replicated
Elasticsearch	cluster status (green/yellow/red) · JVM heap · query latency · pending tasks

Replication lag and PITR are the core of recovery; see PostgreSQL internals and DRP (RPO/RTO/PITR).
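A few of the datastore signals above, sketched as rules. The metric names assume three specific community exporters (postgres_exporter, redis_exporter, and kafka_exporter); other exporters or versions may name things differently, and every threshold here is a placeholder.

```yaml
groups:
  - name: data-layer             # hypothetical group name
    rules:
      - alert: PostgresConnectionsNearMax
        # Assumes postgres_exporter: active connections vs max_connections.
        expr: |
          sum by (instance) (pg_stat_activity_count)
            / on (instance) pg_settings_max_connections > 0.8
        for: 10m
        labels:
          severity: warning
      - alert: RedisLowHitRatio
        # Assumes redis_exporter keyspace counters.
        expr: |
          rate(redis_keyspace_hits_total[5m])
            / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) < 0.8
        for: 30m
        labels:
          severity: warning
      - alert: KafkaConsumerLagHigh
        # Assumes kafka_exporter; 10000 messages is an arbitrary example.
        expr: sum by (consumergroup, topic) (kafka_consumergroup_lag) > 10000
        for: 15m
        labels:
          severity: warning
```

Check the metric names against your exporter's /metrics output before relying on any of these.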

L6 · Network / DNS / Ingress

Focus: CoreDNS · ingress controller · load balancer
Method: RED + health

The path a request travels. A DNS failure is a silent, cluster-wide failure.

Component	Key signals
CoreDNS	query rate · SERVFAIL/NXDOMAIN rate · request duration · cache hit
Ingress (nginx/ALB)	5xx/4xx rate · request latency · active connections · upstream errors
Load balancer	healthy host count · rejected connections · TLS handshake errors
NetworkPolicy	denied/dropped packets (CNI metrics)

# Ingress 5xx ratio (nginx ingress)
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
  / sum(rate(nginx_ingress_controller_requests[5m]))
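DNS deserves its own rule because failures there look like errors everywhere else. A sketch of a SERVFAIL ratio alert, assuming a CoreDNS release that exposes `coredns_dns_responses_total` (older releases used a different metric name); the 1% threshold and severity are assumptions.

```yaml
groups:
  - name: dns                    # hypothetical group name
    rules:
      - alert: CoreDNSServfailHigh
        # Share of DNS responses that came back as SERVFAIL.
        expr: |
          sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))
            / sum(rate(coredns_dns_responses_total[5m])) > 0.01
        for: 10m
        labels:
          severity: critical
```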

L7 · Logging

Focus: log pipeline & retention
Method: coverage / cost

Metrics tell you what is happening; logs tell you why. Logs should be structured (JSON) with consistent fields:

timestamp · namespace · pod · container · app · env · request_id · level · error

Monitor the pipeline itself: log volume (to track cost), Fluent Bit dropped records and ingestion errors, and retention that matches policy.

# LogQL: payment 5xx in the last 5 minutes, grouped by pod
sum by (pod) (count_over_time({app="payment"} | json | status=~"5.." [5m]))

Debug logging in production means a big bill plus noise. Set the level per environment, sample high-volume logs, and alert when the pipeline drops logs (that is lost investigation data).
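An alert on dropped logs might look like the sketch below. It assumes Fluent Bit's built-in Prometheus endpoint is scraped and exposes `fluentbit_output_dropped_records_total` with a `name` label for the output; the `pod` label depends on your scrape config and is an assumption.

```yaml
groups:
  - name: log-pipeline           # hypothetical group name
    rules:
      - alert: FluentBitDroppingRecords
        # Any sustained drop rate means log lines are being lost for good.
        expr: sum by (pod, name) (rate(fluentbit_output_dropped_records_total[5m])) > 0
        for: 10m
        labels:
          severity: warning
```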

L8 · Tracing / APM

Focus: distributed traces
Method: coverage

Answers "where is it slow?" when a request crosses multiple services. OpenTelemetry → Tempo/Jaeger.

Sampling: head-based or tail-based. Keeping 100% of error traces requires tail-based sampling, because a head-based decision is made before the outcome of the request is known; sample the rest to keep cost under control.
Span error rate + per-service latency breakdown → find the bottleneck.
Exemplars: jump from a latency panel (metric) straight to a specific trace; correlate trace ↔ log via trace_id.
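The "keep every error trace, sample the rest" policy can be sketched with the tail sampling processor from the OpenTelemetry Collector contrib distribution. This is a fragment only: it still has to be referenced from a traces pipeline, and the 10-second wait, the 10% rate, and the policy names are assumptions.

```yaml
processors:
  tail_sampling:
    # Wait this long for all spans of a trace to arrive before deciding.
    decision_wait: 10s
    policies:
      # Keep every trace that contains a span with ERROR status.
      - name: keep-all-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep a fixed percentage of everything else.
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

A trace is kept if any policy decides to sample it, so error traces survive regardless of the probabilistic policy.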

Metrics (counts) + logs (detail) + traces (the journey) are the three pillars. Each answers a different question.

L9 · Alerting & on-call

Focus: routing · escalation
Method: quality

Alert on impact, not on every fluctuation. Three tiers:

Tier	Example	Action
Critical	app down · high 5xx · many pods crashing · node NotReady · ingress down · DB unreachable	page immediately
Warning	high CPU/mem · rising restarts · disk nearly full · slow responses	ticket, review during working hours
Info	deploy finished · autoscale event · new node · certificate expiring soon	record only

# Multi-window burn-rate (SLO 99.9% → budget 0.1%)
# Page on fast burn: 14.4x over both 1h AND 5m
- alert: ErrorBudgetFastBurn
  expr: |
    job:slo_errors:ratio_rate1h > (14.4 * 0.001)
    and job:slo_errors:ratio_rate5m > (14.4 * 0.001)
  labels: { severity: critical }
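The `job:slo_errors:ratio_rate*` series used by the fast-burn alert are recording rules that have to exist first. The sketch below shows one way to define them and adds a slow-burn ticket alert. It assumes the SLI is the HTTP 5xx ratio from L4; the 3-day/6-hour window pair with a 1x factor is taken from the multi-window approach in the Google SRE material in the references and should be treated as a starting point.

```yaml
groups:
  - name: slo-burn-rate          # hypothetical group name
    rules:
      - record: job:slo_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m]))
      - record: job:slo_errors:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[1h]))
            / sum by (job) (rate(http_requests_total[1h]))
      - record: job:slo_errors:ratio_rate6h
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[6h]))
            / sum by (job) (rate(http_requests_total[6h]))
      - record: job:slo_errors:ratio_rate3d
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[3d]))
            / sum by (job) (rate(http_requests_total[3d]))
      # Slow burn: spending budget at 1x for days. Open a ticket, do not page.
      - alert: ErrorBudgetSlowBurn
        expr: |
          job:slo_errors:ratio_rate3d > (1 * 0.001)
          and job:slo_errors:ratio_rate6h > (1 * 0.001)
        labels:
          severity: warning
```

The short window in each pair makes the alert resolve soon after the errors stop, instead of staying red until the long window drains.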

Context-poor

"High error rate detected."

Context-rich

"Payment API 5xx > 5% for 5 minutes after deploy v1.2.4; runbook attached."

Every alert: actionable · runbook · severity · routing. Use inhibition (when a node is down, suppress the pod alerts from that node) and deduplication to fight fatigue.
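Routing by severity and inhibiting pod noise behind a node failure could be sketched in Alertmanager like this. The receiver names and the `NodeNotReady` alert name are hypothetical, and the inhibition only works if both the node alert and the pod alerts carry a matching `node` label, which is an assumption about how your rules are labelled.

```yaml
route:
  receiver: slack-warnings        # default: everything lands in chat
  group_by: [alertname, namespace]   # groups duplicates into one notification
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: pagerduty-oncall  # only critical alerts page a human

inhibit_rules:
  # While a node is down, mute warning-level alerts from that same node.
  - source_matchers:
      - 'alertname="NodeNotReady"'
    target_matchers:
      - 'severity="warning"'
    equal: [node]

receivers:
  # Integration settings (Slack webhook, PagerDuty key) are omitted on purpose.
  - name: slack-warnings
  - name: pagerduty-oncall
```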

L10 · Dashboards & SLO

Focus: visualization & objectives
Method: coverage

Dashboards are for humans at 3 a.m.: keep them minimal and don't cram in panels. The minimum set:

SLO / error budget: attainment + burn rate + remaining budget (per critical service).
Cluster overview: nodes ready · total CPU/mem · pods running/pending/failed · recent events.
Application (RED): rate · errors · duration · top failing APIs · current deploy version.
Node (USE): CPU · mem · disk pressure · network · pod capacity.
Security: failed auth · suspicious API calls · RBAC changes · policy violations.

A dashboard with too many panels is confusing during an incident. Each dashboard should answer one question.

L11 · Platform & Security

Focus: secrets · NetworkPolicy · Pod Security · GitOps
Method: posture

Monitoring is not only about resources; it is also about posture (is the configuration secure?).

Secrets: unusual access to Vault/Secrets; secrets exposed as environment variables (leak risk).
NetworkPolicy: % of namespaces with a NetworkPolicy (coverage); denied packets.
Pod Security Admission: mode (enforce/audit) · pods running as root · privileged · hostPath.
Image trust: untrusted registries · critical CVEs (Trivy).
Runtime: Falco (privilege escalation, shell in a container, reads of sensitive files).
GitOps drift: actual state diverges from Git (see Flux + Kyverno).

Audit logs (CloudTrail / K8s audit) are the source for most security signals: unauthorized API calls, RBAC changes, privilege escalation.

L12 · Cost

Focus: spend & waste
Method: FinOps

Cost is an operational signal. Track it to cut waste without sacrificing reliability.

Cost per namespace/workload (Kubecost / OpenCost / Cost Explorer tags).
Over/under-provisioning: requests vs actual usage (p95); sparsely packed nodes burn money.
Spot coverage · on-demand nodes running work that spot could handle.
Orphaned resources: unattached PVs, idle LBs, old snapshots.

Right-sizing plus bin-packing is the biggest lever (see CPU limits and Karpenter consolidation).
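Request-versus-usage can be tracked with a few recording rules. A minimal sketch for CPU, assuming kube-state-metrics and cAdvisor metrics are both scraped; the rule names are hypothetical.

```yaml
groups:
  - name: cost-rightsizing       # hypothetical group name
    rules:
      # CPU cores requested per namespace (kube-state-metrics).
      - record: namespace:cpu_requested_cores:sum
        expr: sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
      # CPU cores actually used per namespace (cAdvisor); container!="" skips
      # the pod-level aggregate series to avoid double counting.
      - record: namespace:cpu_used_cores:rate5m
        expr: sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
      # Values far below 1 indicate over-provisioned requests.
      - record: namespace:cpu_request_utilization:ratio
        expr: namespace:cpu_used_cores:rate5m / namespace:cpu_requested_cores:sum
```

To approximate the p95 view mentioned above, a dashboard could query `quantile_over_time(0.95, namespace:cpu_request_utilization:ratio[7d])`; the 7-day range is an arbitrary example.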

Example: investigating an incident

An app runs on EKS, and after a new version is deployed users report slowness. With the right monitoring you go from symptom to cause in minutes:

Did latency rise after the deploy? (compare to baseline)
Did 5xx rise? (L4)
Are pods restarting or being OOMKilled? (L3)
Is CPU/mem above baseline? (L2)
Is the HPA scaling up or down abnormally? (L3)
What do the app logs say? (L7: filter by request_id)
Are the DB and cache healthy? Connections, replication lag? (L5)
Is the ingress timing out or returning 5xx? (L6)

Without monitoring, the investigation takes hours. With the right dashboards, logs, and alerts, you locate the cause quickly and reduce downtime.

Toolkit

Category	Tools
Open-source	Prometheus · Grafana · Alertmanager · Loki · Fluent Bit · Tempo/Jaeger · kube-state-metrics · node-exporter · cAdvisor
AWS EKS	CloudWatch · Container Insights · AMP · AMG · CloudTrail · GuardDuty
Enterprise	Datadog · New Relic · Dynatrace · Splunk
Security	Falco · Prisma Cloud · Aqua · Sysdig · Trivy
Cost	Kubecost · OpenCost

The best tool is not the most expensive one; it is the one your team can actually use when an incident hits.

References

Puja Maheshvari — How I would design monitoring for a production Kubernetes cluster: medium.com
Google SRE — Monitoring Distributed Systems (Four Golden Signals) & Alerting on SLOs (multi-window burn-rate).
Brendan Gregg — The USE Method. · Tom Wilkie — The RED Method.
Docs: Prometheus · kube-state-metrics · node-exporter · Grafana Loki / Tempo.