MongoDB Replica Set Guide

An in-depth look at the architecture: Replica Set, WiredTiger, data flow, and simulation.

From the Golang client SDK → MongoDB driver → Wire Protocol → WiredTiger → disk. Understand everything that happens inside so you can operate with confidence and debug quickly in production.

Architecture Overview

Replica Set topology, the role of each node, and the full component picture

A MongoDB Replica Set is not just "three databases running side by side". It is a distributed consensus system in which each node has a specific role, nodes exchange heartbeats continuously, data is synchronized through the oplog stream, and the set elects a new leader on its own when the Primary fails, with no operator intervention.

Replica Set Topology

🏗 Replica Set: Overall Topology & Communication Paths

Node Roles

👑

Primary

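The only node that accepts write operations. It writes to WiredTiger storage and records an oplog entry in the same operation. All writes must go through the Primary; the driver learns which node is Primary through server discovery.

On failure: the Secondaries start an election and promote one of them to Primary. With default settings this typically completes in about 10 to 30 seconds, depending on network conditions.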

📡

Secondary

Continuously pulls oplog entries from its sync source (the Primary, or another Secondary when chained replication is enabled) and applies them to local storage. Can serve reads if the client sets readPreference: secondary. May lag behind the Primary.

Priority: determines which member is preferred as Primary in an election. A member with priority 0 can never become Primary.

🗳️

Arbiter

Stores no data and serves no reads or writes. It only votes in elections, providing a tie-breaking vote when the number of data-bearing nodes is even.

Trade-off: saves hardware cost, but adds no read scalability and can never take over as Primary. Use it only when the budget does not allow three full data nodes.

Full Component Map: Inside Each Node

🔬 Anatomy of a MongoDB Node: All Internal Components

Driver & Wire Protocol

Golang client SDK → BSON encoding → Wire Protocol → cluster

Before a line of Golang code reaches MongoDB, the data passes through several layers. Understanding these layers helps you debug connection issues, optimize performance, and choose the right write concern.

🔌 Golang Driver Stack: From Code to Network

Golang Driver: Connection Setup & Server Discovery

Go: Connecting the Golang driver to a replica set

package main

import (
    "context"
    "time"

    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
    "go.mongodb.org/mongo-driver/mongo/readpref"
    "go.mongodb.org/mongo-driver/mongo/writeconcern"
)

// connectReplicaSet connects to replica set rs0 (driver v1.12+ API).
func connectReplicaSet(ctx context.Context) (*mongo.Client, error) {
    // The driver discovers the full topology from any reachable seed
    uri := "mongodb://mongo1:27017,mongo2:27017,mongo3:27017/?replicaSet=rs0"

    // Write concern {w: "majority", j: true}
    journal := true
    wc := writeconcern.Majority()
    wc.Journal = &journal

    opts := options.Client().ApplyURI(uri).
        SetMaxPoolSize(100).                              // Max connections per node
        SetMinPoolSize(5).                                // Warm-up connections
        SetMaxConnIdleTime(10 * time.Minute).
        SetServerSelectionTimeout(5 * time.Second).
        SetHeartbeatInterval(10 * time.Second).           // Client-side monitoring
        SetReadPreference(readpref.SecondaryPreferred()). // Read from a secondary when available
        SetWriteConcern(wc)

    // Connect starts Server Discovery and Monitoring (SDAM) in the background:
    // 1. Connect to the seed nodes
    // 2. Run the "hello" command (legacy name: "isMaster") to get topology info
    // 3. Discover all replica set members
    // 4. Keep monitoring at the heartbeat interval
    client, err := mongo.Connect(ctx, opts)
    if err != nil {
        return nil, err
    }

    // Ping forces server selection, so an unreachable set fails here
    if err := client.Ping(ctx, readpref.Primary()); err != nil {
        return nil, err
    }
    return client, nil
}

BSON Encoding — Binary JSON

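BSON (Binary JSON) is the only document format on the wire between client and MongoDB. It differs from JSON in three ways: every element carries an explicit type tag, documents and strings are length-prefixed (so a parser can skip one in O(1)), and it supports types that JSON lacks (ObjectID, Date, Binary, Decimal128, ...).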
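To make the length-prefixed, type-tagged layout concrete, here is a minimal sketch that hand-encodes a document with a single string field using only the standard library. The helper name `encodeStringDoc` is hypothetical and covers just one element type; real applications should use the driver's bson package.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeStringDoc encodes a BSON document holding one UTF-8 string field.
// Layout: int32 total length | 0x02 type tag | key cstring |
// int32 string length (including NUL) | string bytes | NUL | 0x00 terminator.
func encodeStringDoc(key, value string) []byte {
	var body bytes.Buffer
	body.WriteByte(0x02)  // element type: UTF-8 string
	body.WriteString(key) // field name ...
	body.WriteByte(0x00)  // ... stored as a NUL-terminated cstring

	strLen := make([]byte, 4)
	binary.LittleEndian.PutUint32(strLen, uint32(len(value)+1))
	body.Write(strLen)
	body.WriteString(value)
	body.WriteByte(0x00) // string terminator
	body.WriteByte(0x00) // end-of-document marker

	// The length prefix counts itself (4 bytes) plus the body.
	total := make([]byte, 4)
	binary.LittleEndian.PutUint32(total, uint32(4+body.Len()))
	return append(total, body.Bytes()...)
}

func main() {
	doc := encodeStringDoc("hello", "world")
	fmt.Printf("%d bytes: % x\n", len(doc), doc)
}
```

Because the first four bytes hold the total size, a reader that does not care about this document can jump straight past it without parsing the fields.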

🔢 BSON Binary Layout: Document Structure

Wire Protocol: OP_MSG Format

Wire Protocol: OP_MSG frame (wire version 6, MongoDB 3.6+)

// ── OP_MSG Frame Structure ─────────────────────────────────
// MsgHeader (16 bytes)
messageLength:  4 bytes  // Total message size
requestID:     4 bytes  // Client-assigned, used to match the response
responseTo:    4 bytes  // 0 for requests; the request's requestID for responses
opCode:        4 bytes  // 2013 (0x07DD) = OP_MSG

// OP_MSG body
flagBits:      4 bytes  // checksumPresent(bit0), moreToCome(bit1)

// Section kind 0: command document (BSON)
// Example InsertOne command:
{
  "insert":    "users",          // collection name
  "documents": [{ "name":"Alice", "age":30 }],
  "ordered":   true,
  "writeConcern": { "w": "majority", "j": true, "wtimeout": 5000 },
  "lsid":      { "id": /* UUID */ },    // Logical session ID
  "$db":       "mydb",
  "$readPreference": { "mode": "primary" }
}

// Server response (OP_MSG):
// { "n": 1, "ok": 1.0 }  (success)
// { "n": 0, "writeErrors": [{ "index": 0, "code": 11000, "errmsg": "duplicate key..." }], "ok": 1.0 }  (write error)

💡

Why understand the Wire Protocol?

When you debug slow queries or connection issues from a network capture (tcpdump, Wireshark), you are looking directly at Wire Protocol frames; tools such as mongostat and mongotop only show aggregates derived from them. Matching requestID to responseTo lets you trace one specific request through the whole system. The writeConcern in the frame decides when the server sends the response back to the client.
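The 16-byte MsgHeader described above can be decoded with the standard library alone. The following is a small sketch for reading captured frames, not driver code; the type and function names (`msgHeader`, `parseHeader`, `isReplyTo`) are made up for illustration. All four fields are little-endian int32 values.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const opMsg int32 = 2013 // OP_MSG opcode (0x07DD)

// msgHeader mirrors the 16-byte header that starts every wire message.
type msgHeader struct {
	MessageLength int32
	RequestID     int32
	ResponseTo    int32
	OpCode        int32
}

// encode serializes the header in little-endian byte order.
func (h msgHeader) encode() []byte {
	buf := make([]byte, 16)
	binary.LittleEndian.PutUint32(buf[0:], uint32(h.MessageLength))
	binary.LittleEndian.PutUint32(buf[4:], uint32(h.RequestID))
	binary.LittleEndian.PutUint32(buf[8:], uint32(h.ResponseTo))
	binary.LittleEndian.PutUint32(buf[12:], uint32(h.OpCode))
	return buf
}

// parseHeader decodes the first 16 bytes of a captured frame.
func parseHeader(buf []byte) (msgHeader, error) {
	if len(buf) < 16 {
		return msgHeader{}, errors.New("need 16 bytes for MsgHeader")
	}
	return msgHeader{
		MessageLength: int32(binary.LittleEndian.Uint32(buf[0:])),
		RequestID:     int32(binary.LittleEndian.Uint32(buf[4:])),
		ResponseTo:    int32(binary.LittleEndian.Uint32(buf[8:])),
		OpCode:        int32(binary.LittleEndian.Uint32(buf[12:])),
	}, nil
}

// isReplyTo reports whether resp answers req (responseTo == requestID).
func isReplyTo(req, resp msgHeader) bool {
	return resp.ResponseTo == req.RequestID
}

func main() {
	// Header-only frames with illustrative lengths.
	req := msgHeader{MessageLength: 16, RequestID: 42, ResponseTo: 0, OpCode: opMsg}
	resp := msgHeader{MessageLength: 16, RequestID: 7, ResponseTo: 42, OpCode: opMsg}

	parsed, err := parseHeader(req.encode())
	if err != nil {
		panic(err)
	}
	fmt.Printf("opCode=%d requestID=%d matched=%v\n",
		parsed.OpCode, parsed.RequestID, isReplyTo(parsed, resp))
}
```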

WiredTiger Storage Engine

B-Tree storage, cache management, journal, checkpoint: how data lives on disk

WiredTiger has been the default storage engine since MongoDB 3.2. It drives most performance characteristics: read/write throughput, memory usage, disk I/O patterns, and crash recovery behavior. DevOps engineers need to understand WiredTiger to tune the right knob instead of guessing at parameters.

💾 WiredTiger Write Path: From RAM to Disk

MVCC: Snapshot Isolation in WiredTiger

WiredTiger uses MVCC (Multi-Version Concurrency Control): each read transaction sees a consistent snapshot of the data as of the moment it started, regardless of concurrent writes. Reads take no read locks on documents, so reads do not block writes and writes do not block reads.
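The snapshot idea can be illustrated with a toy versioned store. This is a conceptual sketch only: the `store` and `version` types are hypothetical, and WiredTiger's real update chains, transaction IDs, and garbage collection are far more involved.

```go
package main

import "fmt"

// version is one committed value of a key, tagged with its commit timestamp.
type version struct {
	commitTS int64
	value    string
}

// store keeps every version of every key, oldest first.
type store struct {
	versions map[string][]version
}

func newStore() *store { return &store{versions: map[string][]version{}} }

// write appends a new version; callers must use increasing timestamps per key.
func (s *store) write(key, value string, commitTS int64) {
	s.versions[key] = append(s.versions[key], version{commitTS, value})
}

// readAt returns the newest version committed at or before snapshotTS.
func (s *store) readAt(key string, snapshotTS int64) (string, bool) {
	vs := s.versions[key]
	for i := len(vs) - 1; i >= 0; i-- {
		if vs[i].commitTS <= snapshotTS {
			return vs[i].value, true
		}
	}
	return "", false
}

func main() {
	s := newStore()
	s.write("user:1", "Alice", 1)
	snapshot := int64(3)           // a reader takes its snapshot here
	s.write("user:1", "Alicia", 5) // a concurrent writer commits later

	v, _ := s.readAt("user:1", snapshot)
	fmt.Println("reader with snapshot ts=3 sees:", v)
	v, _ = s.readAt("user:1", 5)
	fmt.Println("reader with snapshot ts=5 sees:", v)
}
```

The reader never waits for the writer: it simply ignores versions newer than its snapshot.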

MongoDB: WiredTiger tuning (mongod.conf)

storage:
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      # Cache size: the critical tuning parameter
      # Default: the larger of 50% of (RAM - 1 GB) or 256 MB
      # With 16 GB RAM the default is 7.5 GB; a lower value leaves more room for the OS file cache
      cacheSizeGB: 4

      journalCompressor: snappy  # snappy | zlib | none

    collectionConfig:
      blockCompressor: snappy    # Compression for collection data
      # snappy: fast + decent compression (default)
      # zlib: slower + better compression (archive data)
      # zstd: best ratio (MongoDB 4.2+)

    indexConfig:
      prefixCompression: true    # Compress index keys that share a common prefix

# Check cache usage:
# db.serverStatus().wiredTiger.cache
# "bytes currently in the cache" / "maximum bytes configured" = cache fill ratio
# Sustained above 80%: eviction pressure; increase cacheSizeGB or add RAM
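The sizing arithmetic behind the config above can be written out as a short sketch. It assumes the documented default of "the larger of 50% of (RAM - 1 GB) or 256 MB" and uses the two serverStatus counters named above as plain inputs; the function names are hypothetical.

```go
package main

import "fmt"

// defaultCacheGB returns the default WiredTiger cache size for a host:
// the larger of 50% of (RAM - 1 GB) or 256 MB.
func defaultCacheGB(ramGB float64) float64 {
	const floorGB = 0.25 // 256 MB
	half := 0.5 * (ramGB - 1)
	if half < floorGB {
		return floorGB
	}
	return half
}

// cacheFillRatio divides "bytes currently in the cache" by
// "maximum bytes configured"; it returns 0 when the maximum is unknown.
func cacheFillRatio(bytesInCache, maxBytes uint64) float64 {
	if maxBytes == 0 {
		return 0
	}
	return float64(bytesInCache) / float64(maxBytes)
}

func main() {
	fmt.Printf("default cache for 16 GB RAM: %.2f GB\n", defaultCacheGB(16))

	// Illustrative counters for a 4 GiB cache.
	ratio := cacheFillRatio(3_600_000_000, 4_294_967_296)
	fmt.Printf("fill ratio: %.0f%% (above 80%% alert line: %v)\n", ratio*100, ratio > 0.8)
}
```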

Oplog & Replication Engine

How data is synchronized between Primary and Secondary

The oplog (operations log) is the heart of replication. It is a special capped collection (local.oplog.rs) on every data-bearing node that records each write operation in idempotent form, meaning that applying the same entry more than once yields the same result.
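Why idempotency matters can be shown with a toy replay. The sketch below contrasts a relative update with an absolute one; the `document` type and both helpers are hypothetical. It reflects the general idea that the oplog records the resulting state of an update rather than the relative operation, so replaying an entry is safe.

```go
package main

import "fmt"

// document is a toy document with integer fields.
type document map[string]int

// applyInc models replaying a relative update such as {$inc: {field: by}}.
func applyInc(d document, field string, by int) { d[field] += by }

// applySet models replaying an absolute update such as {$set: {field: value}}.
func applySet(d document, field string, value int) { d[field] = value }

func main() {
	// Relative form: a duplicate replay corrupts the value.
	a := document{"views": 10}
	applyInc(a, "views", 1)
	applyInc(a, "views", 1) // same entry applied a second time

	// Absolute form: a duplicate replay is harmless.
	b := document{"views": 10}
	applySet(b, "views", 11)
	applySet(b, "views", 11) // same entry applied a second time

	fmt.Println("inc replayed twice:", a["views"])
	fmt.Println("set replayed twice:", b["views"])
}
```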

📋 Oplog Structure & Replication Stream

Replication Lag: Understanding and Monitoring

MongoDB Shell: Checking replication lag

// Show oplog size and window
rs.printReplicationInfo()
// Output:
// configured oplog size:   10240 MB
// log length start to end: 86400 secs (24.0 hrs)
// oplog first event time:  2024-01-15 00:00:00 GMT
// oplog last event time:   2024-01-16 00:00:00 GMT
// now:                     2024-01-16 00:00:00 GMT

rs.printSecondaryReplicationInfo()
// source: mongo2:27017
//         syncedTo: Tue Jan 16 2024 09:59:58 GMT
//         0 secs (0 hrs) behind the primary

// Lag per member (primary optime minus secondary optime):
const status = rs.status();
const primary = status.members.find(m => m.state === 1);  // 1 = PRIMARY
status.members.forEach(m => {
  if (primary && m.state === 2) {  // 2 = SECONDARY
    const lagMs = primary.optimeDate - m.optimeDate;
    print(m.name + " lag: " + lagMs / 1000 + "s");
  }
});

// Alert when lag exceeds a threshold (for example from a monitoring script)
// High lag → secondaries cannot keep up → stale reads with readPreference: secondary

Election & Heartbeat Mechanism

How MongoDB automatically elects a new Primary on failover

MongoDB elects the Primary with a variant of the Raft consensus algorithm. It is not pure Raft (several changes adapt it to the needs of a distributed database), but the core principles are the same: majority votes and term-based leadership.

💓 Heartbeat Cycle & Election Trigger: Step by Step

Conditions for Being Elected Primary

✅ Eligible to stand (Candidate)

Priority > 0: a member with priority 0 never stands as a candidate
Oplog up to date: a voter rejects a candidate whose last optime is behind its own, so the winner is at least as current as a majority of members
Not barred from standing: has not already voted for another candidate in this term and is not in a freeze period after rs.stepDown() or rs.freeze()
No newer term is active
Member is in SECONDARY state (not ROLLBACK or RECOVERING)

⚡ Timing in practice

electionTimeoutMillis: 10 s by default; a member that hears no Primary heartbeat for 10 s calls an election
Total failover time: about 10 to 30 s depending on the network
During this window writes have no Primary to go to; the Golang driver returns a server selection timeout error if serverSelectionTimeout expires first
Application code needs retry logic for writes (the driver's built-in retryable writes retry only once)
After the election the driver discovers the new Primary through SDAM
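The eligibility rules and the majority requirement above can be condensed into a small vote-counting sketch. This is a simplified Raft-style model under stated assumptions, not MongoDB's implementation: the `node` struct and both functions are hypothetical, a single integer stands in for the optime, and priority takeover, dry-run elections, and vetoes are left out.

```go
package main

import "fmt"

// node is a simplified voting member.
type node struct {
	Name        string
	Priority    int
	LastApplied int64 // position of the newest oplog entry applied
	VotedTerm   int64 // highest term in which this node has already voted
	Reachable   bool  // whether the candidate can reach this node
}

// grantsVote applies the voter-side checks: the voter must be reachable,
// must not have voted in this term yet, and must not be ahead of the candidate.
func grantsVote(voter, candidate node, term int64) bool {
	return voter.Reachable &&
		term > voter.VotedTerm &&
		candidate.LastApplied >= voter.LastApplied
}

// winsElection counts the candidate's own vote plus granted votes and
// requires a strict majority of all voting members, reachable or not.
func winsElection(candidate node, others []node, term int64) bool {
	if candidate.Priority == 0 {
		return false // priority 0 members never stand
	}
	votes := 1
	for _, v := range others {
		if grantsVote(v, candidate, term) {
			votes++
		}
	}
	total := len(others) + 1
	return votes >= total/2+1
}

func main() {
	oldPrimary := node{Name: "mongo1", Priority: 1, LastApplied: 100, VotedTerm: 1, Reachable: false}
	peer := node{Name: "mongo3", Priority: 1, LastApplied: 100, VotedTerm: 1, Reachable: true}

	upToDate := node{Name: "mongo2", Priority: 1, LastApplied: 100}
	stale := node{Name: "mongo2", Priority: 1, LastApplied: 90}

	fmt.Println("up-to-date candidate wins:", winsElection(upToDate, []node{oldPrimary, peer}, 2))
	fmt.Println("stale candidate wins:", winsElection(stale, []node{oldPrimary, peer}, 2))
}
```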

Data Flow Scenarios

Step by step on the inside: Write, Read, Failure, Partition, Initial Sync

Scenario 1: Full Write Path, Golang → ACK

Application code calls InsertOne()

The Golang driver receives the document, adds an _id if it is missing, and serializes it to BSON. The driver maintains a connection pool and a view of the topology. Because the operation is a write, it selects the Primary from the topology map.

The driver sends OP_MSG over TCP to the Primary

The frame contains the insert command, the BSON document, the write concern {w:"majority", j:true}, and the session ID. If the pool has no idle connection to the Primary, the driver opens one (TCP handshake + MongoDB authentication).

The Primary receives and processes the command

The server validates the document (BSON size at most 16 MB, nesting depth, any schema validation rules) and checks unique index constraints. If everything passes, the change is written to the WiredTiger cache as a dirty page.

The Primary writes the oplog entry together with the storage write

In one storage transaction the document is written to the collection and the oplog entry is written to local.oplog.rs. The operation is atomic: one cannot exist without the other. With j:true the journal is flushed to disk before the write is acknowledged.

Secondaries fetch the oplog and apply it

Secondaries keep a tailing cursor open on their sync source's oplog. New entries are fetched as they appear and applied to the local WiredTiger instance. Entries are applied in batches by multiple threads in parallel.

w:majority acknowledgment

The Primary waits until at least ⌊N/2⌋+1 voting data-bearing members (itself included) have applied the oplog entry. In a 3-node cluster, 2 nodes must confirm. The Primary then sends the response to the driver.

The driver returns the result to the application

InsertOneResult{InsertedID: ObjectID("...")}. The application receives it knowing the data exists on a majority of nodes, so it stays durable even if the Primary crashes immediately afterwards.
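The w:majority step in this scenario boils down to one calculation: the newest oplog position that at least ⌊N/2⌋+1 members have reached. Below is a minimal sketch, assuming each member's progress is a single integer position; real optimes are (timestamp, term) pairs, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// majorityCommitPoint returns the highest oplog position that a majority
// of members (floor(N/2)+1) have applied. It returns 0 for an empty set.
func majorityCommitPoint(lastApplied []int64) int64 {
	if len(lastApplied) == 0 {
		return 0
	}
	sorted := append([]int64(nil), lastApplied...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	majority := len(sorted)/2 + 1
	// The majority-th most advanced member bounds what a majority has.
	return sorted[majority-1]
}

// canAck reports whether a write at writePos satisfies w:majority.
func canAck(writePos int64, lastApplied []int64) bool {
	return majorityCommitPoint(lastApplied) >= writePos
}

func main() {
	// Primary at 10, one secondary at 8, one lagging at 5.
	progress := []int64{10, 8, 5}
	fmt.Println("majority commit point:", majorityCommitPoint(progress))
	fmt.Println("write at 8 acknowledged:", canAck(8, progress))
	fmt.Println("write at 10 acknowledged:", canAck(10, progress))
}
```

In the example the write at position 10 stays pending until a second member reaches it, which is why a lagging Secondary shows up directly as w:majority write latency.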

Scenario 2: Reading from a Secondary (readPreference)

Go: Read preference routing in Golang

// ── Read Preference Options ────────────────────────────────
// The Go driver sets read preference on the client, database, or
// collection handle, not per Find call. db is a *mongo.Database.

// 1. primary (default): always read from the Primary
//    Returns the most recent data, no replication lag
users := db.Collection("users", options.Collection().SetReadPreference(readpref.Primary()))

// 2. secondary: read only from Secondaries
//    May lag → stale data. Increases read throughput.
users = db.Collection("users", options.Collection().SetReadPreference(readpref.Secondary()))

// 3. secondaryPreferred: a Secondary if one exists, otherwise the Primary
//    Best for read-heavy workloads with acceptable staleness
users = db.Collection("users", options.Collection().SetReadPreference(readpref.SecondaryPreferred()))

// 4. nearest: the node with the lowest RTT (Primary or Secondary)
//    Use for geo-distributed clusters
users = db.Collection("users", options.Collection().SetReadPreference(readpref.Nearest()))

// 5. primaryPreferred: the Primary if available, otherwise a Secondary
//    Keeps reads working while no Primary is available (e.g. during an election)
users = db.Collection("users", options.Collection().SetReadPreference(readpref.PrimaryPreferred()))

// ── With MaxStaleness (bounds stale data) ─────────────────
rp, err := readpref.New(
    readpref.SecondaryMode,
    readpref.WithMaxStaleness(90*time.Second),  // At most 90s of lag (90s is the minimum allowed)
)
if err != nil {
    log.Fatal(err)
}
users = db.Collection("users", options.Collection().SetReadPreference(rp))
// The driver will not route to a Secondary whose estimated lag exceeds 90s

Scenario 3: Primary Failure & Automatic Failover

⚡ Timeline: Primary Down to System Recovery

Scenario 4: Network Partition (Split-Brain Prevention)

❌ Partition: Primary isolated in the minority

▸ The Primary is cut off and left alone. The two Secondaries can still reach each other.
▸ The two Secondaries elect a new Primary (they hold a majority: 2 of 3 votes).
▸ The old Primary notices it cannot reach a majority → steps down to Secondary and rejects writes. Writes it accepted but never replicated are rolled back when it rejoins.
✓ Good outcome: no split-brain. Data consistency is preserved.

⚠️ Partition: Even 50/50 split (with Arbiter)

▸ 4 voting members: Primary + 2 Secondaries + Arbiter. The partition separates Primary+Secondary from Secondary+Arbiter.
▸ Each side holds 2 of 4 votes, but a majority requires 3, so neither side can elect or keep a Primary. Votes cannot cross the partition.
▸ The Arbiter has no data, so it serves no reads or writes. It only votes.
⚠ Risk: the Primary steps down and the whole set rejects writes until the partition heals. This is why an odd number of voting members is recommended.
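Both partition cases follow from the same majority arithmetic, sketched here with two hypothetical helper functions: a side of the partition can elect or keep a Primary only if it holds a strict majority of all configured votes, not just of the votes it can see.

```go
package main

import "fmt"

// majority returns the number of votes needed out of totalVotes.
func majority(totalVotes int) int { return totalVotes/2 + 1 }

// canHoldPrimary reports whether a partition side with sideVotes voting
// members can elect or keep a Primary in a set of totalVotes voting members.
func canHoldPrimary(sideVotes, totalVotes int) bool {
	return sideVotes >= majority(totalVotes)
}

func main() {
	// 3 voting members, split 1 | 2: only the side of 2 keeps working.
	fmt.Println("3 members, side of 1:", canHoldPrimary(1, 3))
	fmt.Println("3 members, side of 2:", canHoldPrimary(2, 3))

	// 4 voting members, split 2 | 2: neither side reaches 3 votes.
	fmt.Println("4 members, side of 2:", canHoldPrimary(2, 4))

	// 5 voting members, split 2 | 3: the side of 3 keeps working.
	fmt.Println("5 members, side of 3:", canHoldPrimary(3, 5))
}
```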

Scenario 5: Initial Sync (New Node Joins the Cluster)

The new node is added to the replica set

rs.add("mongo4:27017"): the node receives the replica set configuration and starts initial sync. State: STARTUP2.

Choosing a sync source

The new node picks a sync source. Depending on the initialSyncSourceReadPreference setting and ping times this may be the Primary or a Secondary; syncing from a Secondary avoids extra load on the Primary. The node then connects and clones all data.

Cloning phase

Full clone of all collections from the sync source. Meanwhile the Primary keeps accepting writes as usual. The new node records the oplog timestamp T_start at which cloning began.

Oplog replay to catch up

After cloning, the node replays all oplog entries from T_start to the present. This can take a long time if the clone was slow, and it fails outright if the entries it needs have already rotated out of the oplog.

Transition to SECONDARY

Once the node has caught up, it transitions to SECONDARY and tails the oplog normally. It can then serve reads and vote in elections.

⚠️

Initial Sync Risk: Oplog Overflow

If the cluster has a high write rate and a small oplog, the oplog can rotate before initial sync finishes. The new node then fails the sync and has to start over. Solution: temporarily increase the oplog size before adding the node, or seed the new node's data directory from a recent filesystem snapshot of an existing member.

DevOps Practical Guide

Metrics, logs, and troubleshooting grounded in the architecture

Key Metrics and Why They Matter (Architectural Mapping)

Metric	Command/Source	Alert Threshold	Architectural Meaning
Replication Lag	`rs.printSecondaryReplicationInfo()`	Alert > 30 s, critical > 5 min	The Secondary's oplog applier cannot keep up with the Primary. Causes: I/O bottleneck, CPU overload, network bandwidth, long-running write operations.
WT Cache Usage	`db.serverStatus().wiredTiger.cache`	Alert > 80%	WiredTiger evicts pages more aggressively → background I/O rises → latency spikes. Increase cacheSizeGB or add RAM.
Oplog Window	`rs.printReplicationInfo()`	Alert < 24 h	If the window is shorter than the expected recovery time, initial sync or catch-up after downtime will fail. Increase the oplog size or reduce the write rate.
Connections	`db.serverStatus().connections`	Alert > 80% of max	Connection exhaustion. Each connection uses up to about 1 MB of RAM. Too many → OOM risk, query queuing.
Query Executors	`db.serverStatus().metrics.queryExecutor`	collectionScans.total rising on update/delete workloads	A COLLSCAN in an update or delete filter reads every document in the collection, which burns cache and I/O. Add an index.
Document Locks	`db.serverStatus().locks`	timeAcquiringMicros rising	Lock contention on a collection or database. WiredTiger uses document-level concurrency control, but some DDL operations (such as parts of an index build) still take collection-level locks.

Reading MongoDB Logs Through an Architectural Lens

MongoDB Logs: Important log patterns and what they mean (abridged samples; exact ids and messages vary by version)

// ── Log 1: Slow Query (default threshold: 100ms) ──────────
{"t":{"$date":"2024-01-15T14:37:22.123Z"},"s":"I","c":"COMMAND","id":51803,
 "attr":{"type":"command","ns":"mydb.orders","command":{"find":"orders",
 "filter":{"status":"pending"}},
 "planSummary":"COLLSCAN",   ← NO INDEX → add one
 "durationMillis":2847}}     ← 2.8 seconds: far too slow

// ── Log 2: Replication lag warning ────────────────────────
{"s":"W","c":"REPL","id":21793,
 "msg":"Replication lag is over the threshold",
 "attr":{"lagMillis":35000,    ← 35 seconds of lag → the secondary likely has an I/O problem
         "threshold":10000}}   ← Alert threshold: 10s

// ── Log 3: Election triggered ─────────────────────────────
{"s":"I","c":"ELECTION","id":21335,
 "msg":"Starting an election",
 "attr":{"reason":"election timeout",   ← Primary did not respond for 10s
         "term":4}}

{"s":"I","c":"ELECTION","id":21353,
 "msg":"Election succeeded",
 "attr":{"electionId":"7fffffff0000000000000004",
         "term":4}}  ← New Primary elected with term=4

// ── Log 4: WiredTiger cache pressure ──────────────────────
{"s":"W","c":"WTRECOV","id":22430,
 "msg":"WiredTiger cache currently full",
 "attr":{"cacheUsagePercent":95}}  ← 95% → high eviction pressure, add RAM

// ── Log 5: Rollback triggered ──────────────────────────────
{"s":"W","c":"REPL","id":21600,
 "msg":"Rollback required",
 "attr":{"commonPoint":"7fff...:123",  ← Common point of the two oplog histories
         "source":"mongo2:27017"}}
// → The old Primary has writes that never replicated → they must be rolled back
// → Those writes are saved under the rollback/ directory
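Because MongoDB 4.4+ writes logs as one JSON object per line, patterns like the slow-query entry above can be detected with the standard library. The sketch below decodes only the fields it needs and flags slow collection scans; the `logEntry` struct covers a hypothetical subset of the log schema, and the sample line is the annotated example above with the arrows removed so that it is valid JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// logEntry maps the subset of a structured log line used for this check.
type logEntry struct {
	Severity  string `json:"s"`
	Component string `json:"c"`
	ID        int    `json:"id"`
	Attr      struct {
		Namespace      string `json:"ns"`
		PlanSummary    string `json:"planSummary"`
		DurationMillis int    `json:"durationMillis"`
	} `json:"attr"`
}

// isSlowCollScan reports whether line is a COMMAND entry that ran a
// collection scan for at least thresholdMs milliseconds.
func isSlowCollScan(line string, thresholdMs int) (bool, error) {
	var e logEntry
	if err := json.Unmarshal([]byte(line), &e); err != nil {
		return false, err
	}
	slow := e.Component == "COMMAND" &&
		e.Attr.PlanSummary == "COLLSCAN" &&
		e.Attr.DurationMillis >= thresholdMs
	return slow, nil
}

func main() {
	line := `{"t":{"$date":"2024-01-15T14:37:22.123Z"},"s":"I","c":"COMMAND","id":51803,` +
		`"attr":{"type":"command","ns":"mydb.orders","planSummary":"COLLSCAN","durationMillis":2847}}`

	slow, err := isSlowCollScan(line, 100)
	if err != nil {
		panic(err)
	}
	fmt.Println("slow collection scan:", slow)
}
```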

Troubleshooting Guide: From Symptom to Root Cause

🔴 Symptom: Write latency rises suddenly

Architecture checks:
1. WiredTiger cache > 80%? → I/O bound; add RAM or increase cacheSizeGB
2. Is w:majority waiting on a Secondary? → Check replication lag
3. Slow queries blocking? → db.currentOp({active:true, secs_running:{$gt:5}})
4. Missing index? → An update or delete whose filter has no index runs a COLLSCAN on the write path
5. Disk IOPS saturation? → db.serverStatus().wiredTiger["block-manager"]

🟡 Symptom: Replication lag rises

Architecture checks:
1. Secondary disk I/O? → The oplog applier is blocked by a slow disk
2. Large write operations? → A multi-document update or delete writes one oplog entry per affected document
3. Network bandwidth between nodes? → ping + iperf3
4. Many indexes on the Secondary? → Every applied oplog entry must update each index
5. CPU overload on the Secondary? → Oplog applier threads are starved

🟣 Symptom: Connection pool exhausted

Architecture checks:
1. Is maxPoolSize too low in the client config?
2. Are slow queries holding connections for a long time?
3. No connection timeout → leaked connections
4. Many microservices connecting to one mongod → each service instance has its own pool, so total connections = instances × maxPoolSize
5. Fix: share one client per process and size maxPoolSize so that the total stays below the server's connection limit; in sharded clusters, connect through mongos

🔵 Symptom: Stale reads from a Secondary

Architecture checks:
1. Using readPreference: secondary while lag is high
2. Fix: set maxStalenessSeconds in the read preference (minimum 90)
3. Use readConcern: majority to read only majority-committed data (prevents reading data that may roll back; does not by itself remove lag)
4. Read after write: use readPreference: primary for that operation
5. Causal consistency: use sessions to guarantee read-your-writes

Essential Commands: Day-to-Day Operations

rs.status()

Full replica set state: role, optime, health, and sync source of each member

rs.conf()

Replica set configuration: priority, votes, hidden, secondaryDelaySecs (slaveDelay before MongoDB 5.0)

db.serverStatus().repl

Replication state of this node: primary flag (isWritablePrimary; ismaster in older versions), hosts, lastWrite

db.currentOp({active:true})

Operations currently running; detects long-running queries

db.collection.explain("executionStats").find({...})

Query plan + execution stats; detects COLLSCAN

db.adminCommand({getLog:"global"})

Recent log entries held in memory

rs.printReplicationInfo()

Oplog size and window (how far back a lagging member or an oplog-based restore can replay)

rs.stepDown(60)

Force the Primary to step down (maintenance); it will not seek re-election for 60 s

Go: Production-ready Golang MongoDB client with retry logic

package db

import (
    "context"
    "time"

    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
    "go.mongodb.org/mongo-driver/mongo/readpref"
    "go.mongodb.org/mongo-driver/mongo/writeconcern"
)

// WriteWithRetry retries InsertOne across an election window.
// Give doc an explicit _id: if a timed-out attempt actually succeeded,
// the retry then fails with a duplicate key error instead of inserting twice.
func WriteWithRetry[T any](
    ctx context.Context,
    col *mongo.Collection,
    doc T,
    maxRetries int,
) (*mongo.InsertOneResult, error) {
    var lastErr error

    for attempt := 0; attempt < maxRetries; attempt++ {
        if attempt > 0 {
            // Exponential backoff: 1s, 2s, 4s...
            backoff := time.Duration(1<<(attempt-1)) * time.Second
            time.Sleep(backoff)
        }

        result, err := col.InsertOne(ctx, doc)
        if err == nil {
            return result, nil
        }

        // Retry only transient errors (election, network)
        if mongo.IsTimeout(err) ||
           mongo.IsNetworkError(err) {
            lastErr = err
            continue
        }

        // Non-retriable: duplicate key, validation error...
        return nil, err
    }

    return nil, lastErr
}

// NewMongoClient builds a client with production settings (driver v1.12+ API).
func NewMongoClient(uri string) (*mongo.Client, error) {
    journal := true
    wc := writeconcern.Majority()
    wc.WTimeout = 5 * time.Second
    wc.Journal = &journal

    return mongo.Connect(context.Background(),
        options.Client().ApplyURI(uri).
            SetMaxPoolSize(100).
            SetMinPoolSize(5).
            SetMaxConnIdleTime(10*time.Minute).
            SetServerSelectionTimeout(30*time.Second). // Cover the election window
            SetReadPreference(readpref.SecondaryPreferred()).
            SetWriteConcern(wc),
    )
}

✅

Production Replica Set Checklist

☐ Always use an odd number of voting members (3, 5, or 7; 7 is the maximum) to avoid tied votes
☐ Give stronger nodes a higher priority so the Primary is usually the best server
☐ Write concern: majority for every critical write; do not rely on w:1
☐ Keep the oplog window above 24 h; increase it if initial sync needs longer
☐ Application code must retry with backoff to ride out an election (up to about 30 s)
☐ Set serverSelectionTimeout to at least 30 s (the driver default) so operations survive an election
☐ Monitor replication lag: alert above 30 s, critical above 5 min
☐ Run a hidden delayed secondary (secondaryDelaySecs: 3600, priority 0; slaveDelay before MongoDB 5.0) as a buffer against human error
☐ WiredTiger cache: start testing at the default of about 50% of RAM, then tune from cache pressure metrics
☐ Before adding a new node, increase the oplog size so initial sync does not fail