Part 8: Optimizing NoSQL and Redis

The first seven articles of this series covered different aspects of optimizing SQL databases. This article turns to the world of NoSQL and Redis.

NoSQL (Not Only SQL) databases come in several types, each optimized for specific use cases. This article focuses on the most widely used systems: MongoDB (document store), Cassandra (column-family store), and Redis (key-value store with many advanced features).

MongoDB: Document design, aggregation pipeline optimization, sharding

MongoDB is one of the most widely used NoSQL databases. It stores data as JSON-like documents (BSON) and is designed for high performance, high availability (through replica sets), and horizontal scaling (through sharding).

MongoDB Architecture


  graph TD
    A[MongoDB Deployment] --> B[Replica Set]
    A --> C[Sharded Cluster]

    B --> D[Primary Node]
    B --> E[Secondary Node 1]
    B --> F[Secondary Node 2]

    C --> G[Config Servers<br>Replica Set]
    C --> H[Mongos Routers]
    C --> I[Shard 1<br>Replica Set]
    C --> J[Shard 2<br>Replica Set]
    C --> K[Shard N<br>Replica Set]

Document Design Optimization

Sound document design is usually the most important factor in MongoDB performance:

Embedding vs Referencing:

MongoDB offers two ways to represent relationships between pieces of data:


  graph TD
    A[Data Modeling] --> B[Embedding<br>Nested Documents]
    A --> C[Referencing<br>Document References]

    B --> D[Pros:<br>- Single query retrieval<br>- Better read performance]
    B --> E[Cons:<br>- Document size limit<br>- Duplication]

    C --> F[Pros:<br>- No duplication<br>- Smaller documents]
    C --> G[Cons:<br>- Multiple queries<br>- Join in application]

Embedding (Nested Documents):

// Embedding comments inside a post
{
  "_id": ObjectId("5f8a76b3e6b5a1d8e77c1234"),
  "title": "MongoDB Optimization",
  "content": "This is a post about MongoDB...",
  "author": {
    "name": "John Doe",
    "email": "john@example.com"
  },
  "comments": [
    {
      "user": "Alice",
      "text": "Great post!",
      "date": ISODate("2023-01-15T10:30:00Z")
    },
    {
      "user": "Bob",
      "text": "Thanks for sharing",
      "date": ISODate("2023-01-15T14:20:00Z")
    }
  ]
}

Referencing (Document References):

// Post document
{
  "_id": ObjectId("5f8a76b3e6b5a1d8e77c1234"),
  "title": "MongoDB Optimization",
  "content": "This is a post about MongoDB...",
  "author_id": ObjectId("5f8a76b3e6b5a1d8e77c5678")
}

// Author document
{
  "_id": ObjectId("5f8a76b3e6b5a1d8e77c5678"),
  "name": "John Doe",
  "email": "john@example.com"
}

// Comment documents
{
  "_id": ObjectId("5f8a76b3e6b5a1d8e77c9012"),
  "post_id": ObjectId("5f8a76b3e6b5a1d8e77c1234"),
  "user": "Alice",
  "text": "Great post!",
  "date": ISODate("2023-01-15T10:30:00Z")
}

Selection guidelines:

Use embedding when:
- The "child" data is always accessed together with the "parent" data
- The "child" data does not grow without bound
- High read performance is required
Use referencing when:
- The "child" data may be accessed independently
- The "child" data grows without bound
- Duplication must be avoided
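To make the trade-off concrete, here is a minimal Python sketch that uses plain dictionaries as stand-ins for collections. No MongoDB driver is involved, and every name in it (`embedded_posts`, `posts`, `authors`, `comments`, `load_embedded`, `load_referenced`) is hypothetical. It only illustrates that the embedded model answers the read with one lookup, while the referenced model needs an application-side join.

```python
# Plain dictionaries stand in for MongoDB collections; all names are illustrative.
embedded_posts = {
    1: {
        "_id": 1,
        "title": "MongoDB Optimization",
        "author": {"name": "John Doe"},
        "comments": [{"user": "Alice", "text": "Great post!"}],
    }
}

posts = {1: {"_id": 1, "title": "MongoDB Optimization", "author_id": 10}}
authors = {10: {"_id": 10, "name": "John Doe"}}
comments = [{"_id": 100, "post_id": 1, "user": "Alice", "text": "Great post!"}]


def load_embedded(post_id):
    """One lookup returns the post together with its author and comments."""
    return embedded_posts[post_id], 1


def load_referenced(post_id):
    """Application-side join: one lookup per collection, three in total."""
    post = posts[post_id]
    author = authors[post["author_id"]]
    post_comments = [
        {"user": c["user"], "text": c["text"]}
        for c in comments
        if c["post_id"] == post_id
    ]
    assembled = {
        "_id": post["_id"],
        "title": post["title"],
        "author": {"name": author["name"]},
        "comments": post_comments,
    }
    return assembled, 3
```

Both functions return the same logical document; the second value is the number of lookups, which in a real deployment would roughly correspond to round-trips to the database.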

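Document Size and Structure:

Document size limit: MongoDB caps each document at 16 MB
Avoid very large arrays: arrays that grow without bound can hit that limit and make updates and multikey indexes more expensive
Use short field names: field names are stored in every document, so in large collections shorter names save space (storage compression reduces the on-disk gain and readability suffers, so measure before renaming)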

// Instead of
{
  "very_long_descriptive_field_name": "value",
  "another_unnecessarily_long_field_name": 123
}

// Prefer
{
  "vldn": "value",  // very_long_descriptive_field_name
  "aulfn": 123      // another_unnecessarily_long_field_name
}
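A rough way to judge whether renaming is worth it is to estimate the uncompressed bytes involved. BSON stores each field name as a string inside every document, so the saving per document is about the difference in name lengths. The helper below is a hypothetical back-of-the-envelope sketch, not a MongoDB API, and it ignores storage compression.

```python
def bytes_saved(renames, document_count):
    """Uncompressed bytes saved by renaming fields, given {long_name: short_name}."""
    per_document = sum(
        len(long_name.encode("utf-8")) - len(short_name.encode("utf-8"))
        for long_name, short_name in renames.items()
    )
    return per_document * document_count


renames = {
    "very_long_descriptive_field_name": "vldn",
    "another_unnecessarily_long_field_name": "aulfn",
}
print(bytes_saved(renames, 10_000_000))
```

For the two fields above the difference is 60 bytes per document, so the gain only becomes noticeable at tens of millions of documents, and less so once compression is applied.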

Schema Validation:

Although MongoDB does not enforce a schema by default, schema validation helps keep data consistent:

db.createCollection("products", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["name", "price", "category"],
      properties: {
        name: {
          bsonType: "string",
          description: "must be a string and is required",
        },
        price: {
          bsonType: "number",
          minimum: 0,
          description: "must be a non-negative number and is required",
        },
        category: {
          bsonType: "string",
          description: "must be a string and is required",
        },
        tags: {
          bsonType: "array",
          items: {
            bsonType: "string",
          },
        },
      },
    },
  },
});

Indexing Strategies in MongoDB

Indexes in MongoDB work much like those in SQL databases, with a few features of their own:

Types of Indexes:

Single Field Index:

db.products.createIndex({ name: 1 }); // 1 for ascending, -1 for descending

Compound Index:

db.products.createIndex({ category: 1, price: -1 });

Multikey Index (for arrays):

db.products.createIndex({ tags: 1 });

Text Index:

db.products.createIndex({ description: "text" });

Geospatial Index:

db.locations.createIndex({ coordinates: "2dsphere" });

Hashed Index:

db.users.createIndex({ _id: "hashed" });

Index Properties:

Unique Index:

db.users.createIndex({ email: 1 }, { unique: true });

Partial Index:

db.orders.createIndex(
  { orderDate: 1 },
  { partialFilterExpression: { status: "active" } }
);

TTL Index (Time-To-Live):

db.sessions.createIndex(
  { lastModified: 1 },
  { expireAfterSeconds: 3600 } // Auto-delete after 1 hour
);

Index Analysis and Optimization:

Explain Plan:

db.products
  .find({ category: "electronics", price: { $gt: 100 } })
  .sort({ price: -1 })
  .explain("executionStats");

Index Statistics:

db.products.stats();
db.products.aggregate([{ $indexStats: {} }]);

Missing Indexes (slow queries):

db.currentOp({
  secs_running: { $gt: 3 },
  op: "query",
});

Index Best Practices:

Create indexes that support your most common queries
In compound indexes, put fields used in equality conditions first, then sort fields, then fields used in range conditions (the ESR guideline)
Among the equality fields, put the high-cardinality ones first
Avoid creating too many indexes (every index slows down write operations)
Use covered queries where possible (queries whose filter and returned fields are all part of one index)
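The field-ordering guideline can be sketched as a small helper that builds a compound index specification from the three roles a field can play in a query. The function name `esr_index_spec` and its argument shape are assumptions made for illustration; the output is simply a list of `(field, direction)` pairs that you could pass to an index-creation call.

```python
def esr_index_spec(equality, sort, ranges):
    """Order index keys as Equality fields, then Sort fields, then Range fields."""
    spec = []
    seen = set()

    def add(field, direction):
        # A field that plays two roles (e.g. sort and range) is indexed once.
        if field not in seen:
            seen.add(field)
            spec.append((field, direction))

    for field in equality:
        add(field, 1)
    for field, direction in sort:
        add(field, direction)
    for field in ranges:
        add(field, 1)
    return spec


# The explain() query above: category is an equality match,
# price is both the sort key and a range filter.
print(esr_index_spec(["category"], [("price", -1)], ["price"]))
```

For that query the result matches the compound index `{ category: 1, price: -1 }` shown earlier.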

Aggregation Pipeline Optimization

The aggregation pipeline is MongoDB's main tool for processing and analyzing data:

Pipeline Stages Order:

The order of the stages in a pipeline has a large effect on performance:


  graph LR
    A["$match<br>(Filter early)"] --> B["$project<br>(Reduce fields)"]
    B --> C["$unwind<br>(Expand arrays)"]
    C --> D["$group<br>(Aggregate data)"]
    D --> E["$sort<br>(Order results)"]
    E --> F["$limit<br>(Reduce output)"]

// Not optimized
db.orders.aggregate([
  { $unwind: "$items" },
  { $match: { status: "completed", "items.price": { $gt: 100 } } },
  { $group: { _id: "$customer_id", total: { $sum: "$items.price" } } },
  { $sort: { total: -1 } },
  { $limit: 10 },
]);

// Optimized
db.orders.aggregate([
  { $match: { status: "completed" } }, // Filter early
  { $unwind: "$items" },
  { $match: { "items.price": { $gt: 100 } } },
  { $group: { _id: "$customer_id", total: { $sum: "$items.price" } } },
  { $sort: { total: -1 } },
  { $limit: 10 },
]);
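The effect of filtering before `$unwind` can be illustrated with a pure-Python simulation of the two pipelines. This is only a model of the stage semantics on a hypothetical three-order data set, not how the server executes a pipeline; it counts how many intermediate documents `$unwind` has to produce in each variant.

```python
orders = [
    {"customer_id": "c1", "status": "completed", "items": [{"price": 150}, {"price": 50}]},
    {"customer_id": "c2", "status": "pending", "items": [{"price": 500}, {"price": 300}, {"price": 20}]},
    {"customer_id": "c1", "status": "completed", "items": [{"price": 200}]},
]


def unwind(docs):
    """Emit one copy of each document per element of its items array."""
    for doc in docs:
        for item in doc["items"]:
            yield {**doc, "items": item}


def totals(docs):
    """Group by customer_id and sum the item prices."""
    result = {}
    for doc in docs:
        key = doc["customer_id"]
        result[key] = result.get(key, 0) + doc["items"]["price"]
    return result


def unwind_first(docs):
    unwound = list(unwind(docs))
    kept = [d for d in unwound if d["status"] == "completed" and d["items"]["price"] > 100]
    return totals(kept), len(unwound)


def match_first(docs):
    completed = [d for d in docs if d["status"] == "completed"]
    unwound = list(unwind(completed))
    kept = [d for d in unwound if d["items"]["price"] > 100]
    return totals(kept), len(unwound)
```

Both variants produce the same totals, but the early `$match` version unwinds only the completed orders, so fewer intermediate documents flow through the later stages.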

Redis: Memory Management, Persistence, Clustering

Redis is an in-memory key-value store with many advanced features, designed for high performance and high reliability.

Redis Memory Management

Memory Policies:

Maxmemory: limits memory usage
Eviction policies: what Redis does when it reaches that limit

# Configuration in redis.conf
maxmemory 2gb
maxmemory-policy allkeys-lru

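Common eviction policies:

noeviction: returns an error on writes when memory is full
allkeys-lru: evicts the least recently used keys
volatile-lru: evicts the least recently used keys among those with an expiration
allkeys-random: evicts random keys
volatile-random: evicts random keys among those with an expiration
volatile-ttl: evicts keys with an expiration, shortest remaining TTL first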


  graph TD
    A[Redis Memory Full] --> B{Has Expiration?}
    B -->|Yes| C{Policy?}
    B -->|No| D{Policy?}

    C -->|volatile-lru| E[Evict LRU Key<br>with Expiration]
    C -->|volatile-ttl| F[Evict Key with<br>Shortest TTL]
    C -->|volatile-random| G[Evict Random Key<br>with Expiration]

    D -->|allkeys-lru| H[Evict LRU Key]
    D -->|allkeys-random| I[Evict Random Key]
    D -->|noeviction| J[Return Error]
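As an illustration of what `allkeys-lru` means, the sketch below implements an exact LRU cache in Python. It is deliberately simplified: the class name `TinyLruCache` is hypothetical, the limit is a key count rather than bytes, and Redis itself approximates LRU by sampling a few keys instead of keeping a full ordering.

```python
from collections import OrderedDict


class TinyLruCache:
    """Exact LRU eviction over a fixed number of keys (illustrative only)."""

    def __init__(self, max_keys):
        self.max_keys = max_keys
        self.data = OrderedDict()  # oldest access first, newest last

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # a read counts as a use
        return self.data[key]

    def set(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        while len(self.data) > self.max_keys:
            self.data.popitem(last=False)  # evict the least recently used key


cache = TinyLruCache(max_keys=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")      # "a" is now more recently used than "b"
cache.set("c", 3)   # over the limit, so "b" is evicted
```

A `volatile-lru` variant would apply the same ordering but only consider keys that carry an expiration.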

Redis Memory Fragmentation:

Memory fragmentation shows up as a gap between the memory the operating system has allocated to Redis (RSS) and the memory Redis actually uses:

# Check the fragmentation ratio
INFO memory
# mem_fragmentation_ratio = used_memory_rss / used_memory

Ratio > 1.5: high fragmentation
Ratio < 1.0: part of Redis's memory has likely been swapped out, so performance suffers

Solutions:

Restart the Redis server (during a maintenance window)
Use activedefrag in Redis 4.0+ (requires the jemalloc allocator)

# Configuration in redis.conf
activedefrag yes
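The ratio check described in this section can be sketched as a small parser over the text returned by `INFO memory`. The function names and the sample text are assumptions for illustration; the thresholds are the ones quoted above.

```python
def parse_info_memory(text):
    """Turn 'key:value' lines from INFO output into a dict, skipping section headers."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        stats[key] = value
    return stats


def fragmentation_status(info_text):
    """Return (ratio, label) using used_memory_rss / used_memory."""
    stats = parse_info_memory(info_text)
    ratio = int(stats["used_memory_rss"]) / int(stats["used_memory"])
    if ratio > 1.5:
        label = "high fragmentation"
    elif ratio < 1.0:
        label = "possible swapping"
    else:
        label = "ok"
    return ratio, label


sample = "# Memory\r\nused_memory:1000000\r\nused_memory_rss:1600000\r\n"
print(fragmentation_status(sample))
```

On very small instances the ratio can look high simply because the fixed overhead dominates, so treat the label as a prompt to investigate rather than an alarm.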

Redis Persistence Options

Redis offers several options for persisting data to disk:


  graph TD
    A[Redis Persistence] --> B[RDB<br>Point-in-time Snapshots]
    A --> C[AOF<br>Append-only File]
    A --> D[Hybrid<br>RDB+AOF]

    B --> B1[Pros:<br>- Compact files<br>- Faster restart<br>- Good for backups]
    B --> B2[Cons:<br>- Potential data loss<br>- Fork process overhead]

    C --> C1[Pros:<br>- Better durability<br>- Append-only operations]
    C --> C2[Cons:<br>- Larger files<br>- Slower restart<br>- Potential slower writes]

    D --> D1[Pros:<br>- Best of both worlds]
    D --> D2[Cons:<br>- More complex<br>- More disk space]

RDB (Redis Database):

# Configuration in redis.conf
save 900 1      # Save after 900 sec if at least 1 key changed
save 300 10     # Save after 300 sec if at least 10 keys changed
save 60 10000   # Save after 60 sec if at least 10000 keys changed

# Manual snapshot
SAVE      # Blocking
BGSAVE    # Non-blocking (fork)
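The `save` lines act as a list of (seconds, changes) thresholds, and a background snapshot is triggered when any one of them is satisfied. The following is a simplified Python model of that rule; `SAVE_POINTS` and `should_snapshot` are hypothetical names, and the real server has additional conditions (for example, it will not start a snapshot while another one is running).

```python
# (seconds, changes) pairs mirroring the three save lines above.
SAVE_POINTS = [(900, 1), (300, 10), (60, 10000)]


def should_snapshot(seconds_since_last_save, changes_since_last_save, save_points=SAVE_POINTS):
    """True if any save point has both its time and its change threshold met."""
    return any(
        seconds_since_last_save >= seconds and changes_since_last_save >= changes
        for seconds, changes in save_points
    )
```

With these settings a burst of writes is snapshotted within about a minute, while a nearly idle instance is snapshotted at most every 15 minutes, which is also the upper bound on data lost after a crash in RDB-only mode.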

AOF (Append-Only File):

# Configuration in redis.conf
appendonly yes
appendfsync everysec  # Options: always, everysec, no

# Rewrite AOF file (compact)
BGREWRITEAOF

Hybrid Approach (RDB + AOF):

# Configuration in redis.conf
appendonly yes
aof-use-rdb-preamble yes

Persistence Best Practices:

Production servers: hybrid approach with appendfsync everysec
Cache-only use case: disable persistence, or take infrequent RDB snapshots
Critical data: AOF with appendfsync always (lower write performance)
Backup strategy: use RDB snapshots

Redis Clustering and High Availability

Redis offers several options for high availability and scaling:

Redis Sentinel:


  graph TD
    A[Client] --> B[Sentinel 1]
    A --> C[Sentinel 2]
    A --> D[Sentinel 3]

    B --- C
    C --- D
    D --- B

    B --> E[Redis Master]
    C --> E
    D --> E

    E --> F[Redis Replica 1]
    E --> G[Redis Replica 2]

    B -.-> F
    B -.-> G
    C -.-> F
    C -.-> G
    D -.-> F
    D -.-> G

# Configuration in sentinel.conf
sentinel monitor mymaster 127.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1

Redis Cluster:

Redis Cluster shards data across multiple Redis nodes and replicates each shard, which provides horizontal scalability and high availability.


  graph TD
    A[Redis Cluster] --> B[Master 1<br>slots 0-5460]
    A --> C[Master 2<br>slots 5461-10922]
    A --> D[Master 3<br>slots 10923-16383]

    B --> E[Replica 1]
    C --> F[Replica 2]
    D --> G[Replica 3]

Redis Cluster provides:

Data sharding across nodes through 16384 hash slots (each key belongs to one slot, computed as CRC16(key) mod 16384).
Replication: each master has 0 to N replicas, with automatic failover if a master dies.
Cluster-aware clients: modern clients (lettuce, ioredis, go-redis, redis-py) track the topology and follow MOVED/ASK redirections.
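The slot mapping can be reproduced in a few lines of Python. The sketch below assumes the CRC16 variant used by Redis Cluster is the XMODEM one (polynomial 0x1021, initial value 0), which is what `binascii.crc_hqx` computes with a zero seed, and it also applies the hash-tag rule from the Redis Cluster specification: if the key contains a non-empty `{...}` section, only that part is hashed. The function name `key_hash_slot` is hypothetical.

```python
import binascii


def key_hash_slot(key):
    """Return the Redis Cluster hash slot (0..16383) for a key."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % 16384


print(key_hash_slot("foo"))
print(key_hash_slot("{user:1}:profile") == key_hash_slot("{user:1}:cart"))
```

Hash tags are the usual way to keep related keys in the same slot so that multi-key commands and transactions on them keep working in a cluster.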

Create the cluster with redis-cli --cluster create, after enabling cluster mode on each node in redis.conf:

cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 15000
cluster-require-full-coverage no   # keep serving the covered slots when some slots are down
cluster-allow-reads-when-down no
cluster-migration-barrier 1

Online resharding (adding/removing nodes):

# Add a new node
redis-cli --cluster add-node 10.0.0.10:7000 10.0.0.1:7000

# Move 4096 slots to the new node
redis-cli --cluster reshard 10.0.0.1:7000 --cluster-from all \
  --cluster-to <new-node-id> --cluster-slots 4096 --cluster-yes

# Check the slot distribution
redis-cli --cluster check 10.0.0.1:7000

Redis Stack, Valkey, and the 2026 landscape

1. Valkey, the open-source fork of Redis (2024)

In March 2024, Redis Ltd. (formerly Redis Labs) changed the license of the Redis core from BSD to RSALv2 + SSPL (neither is OSI-approved). The community (Linux Foundation, AWS, Google, Oracle, and others) forked Valkey from Redis 7.2.4 and kept the BSD license. As of April 2026:

Valkey 7.2 / 8.0 are the OSI-licensed releases that most managed cloud services have moved to (AWS ElastiCache and Google Memorystore already default to Valkey).
Its API and commands are compatible with Redis OSS 7.2, so existing drivers work unchanged.
So wherever this article says "Redis", the same applies to Valkey; only the license and the vendor differ.

2. Redis Stack, Redis's module bundle

The Redis core ships the basic data types (string, list, hash, set, sorted set, stream, and a few others). Redis Stack adds:

RediSearch: full-text search + secondary indexes + vector search (HNSW).
RedisJSON: JSON.SET, JSON.GET, JSON.ARRAPPEND, and related commands for native JSON documents.
RedisTimeSeries: a time-series store with retention and downsampling.
RedisBloom: Bloom filter, Cuckoo filter, Count-min sketch, Top-K.

License note for 2024–2026: Redis Stack is bundled with commercial features. Valkey has open replacement modules in progress (valkey-search, valkey-json, valkey-bloom, and other community projects). Before adopting any of them, verify that the license fits your use case.

3. Redis ACL, multi-user access control (Redis 6+)

Since Redis 6, instead of a single AUTH password, Redis has a full ACL system:

# Create user "readonly" that may only run GET, SCAN, and PING on keys matching "cache:*"
ACL SETUSER readonly on >SecretPass ~cache:* +get +scan +ping

# Create user "workers" for the worker pool
ACL SETUSER workers on >WorkerPass ~jobs:* +@write +@read +@list +@stream

# Show the current user
ACL WHOAMI
# List all ACL rules
ACL LIST
# Persist to file (if aclfile is configured)
ACL SAVE

redis.conf:

aclfile /etc/redis/users.acl

Prefer ACLs over requirepass: they are safer (per-service credentials) and auditable (ACL LOG).

4. Common production anti-patterns

Anti-pattern	Consequence	Use instead
`KEYS pattern*` in production	Blocks the whole Redis server; O(N) with N = number of keys	`SCAN 0 MATCH pattern* COUNT 1000`, repeated until the cursor returns to 0
`FLUSHALL`/`FLUSHDB` to "clear the cache"	Can also delete keys that belong to other services	Delete by prefix with SCAN + DEL / UNLINK
Redis as the primary DB	Data loss on a crash: with appendfsync everysec, up to about one second of writes	Use it for cache / queue / ephemeral state; records that must be durable → another DB
`EXPIRE` after `SET` (2 round-trips)	Race condition: the key lives forever if the client crashes between the two commands	`SET key val EX 600` (1 command)
`MULTI/EXEC` for heavy work	Blocks the command thread; no atomicity across slots in a cluster	Use short Lua scripts / Redis Functions and split the work into small batches
PHP/web sessions stored in Redis without a TTL	Memory grows steadily, like a leak	Always set `EX` to the session lifetime
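The "delete by prefix" replacement for `KEYS` and `FLUSHDB` can be sketched as a cursor loop. The function below assumes a client object that exposes `scan(cursor=..., match=..., count=...)` returning `(next_cursor, keys)` and `unlink(*names)` returning the number of keys removed, which is the shape redis-py uses. `FakeRedis` is a hypothetical in-memory stand-in so the sketch runs without a server; it is not part of any library.

```python
import fnmatch


def delete_by_prefix(client, prefix, batch_size=1000):
    """Delete keys that start with `prefix` using SCAN + UNLINK, batch by batch."""
    deleted = 0
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=prefix + "*", count=batch_size)
        if keys:
            deleted += client.unlink(*keys)
        if cursor == 0:  # SCAN signals the end of the iteration with cursor 0
            return deleted


class FakeRedis:
    """Hypothetical in-memory stand-in for a Redis client (illustration only)."""

    def __init__(self, keys):
        self.store = dict.fromkeys(keys, "value")
        self._snapshot = []

    def scan(self, cursor=0, match="*", count=10):
        if cursor == 0:
            self._snapshot = sorted(self.store)  # iterate over a stable key list
        page = self._snapshot[cursor:cursor + count]
        more = cursor + count < len(self._snapshot)
        next_cursor = cursor + count if more else 0
        matched = [k for k in page if k in self.store and fnmatch.fnmatchcase(k, match)]
        return next_cursor, matched

    def unlink(self, *names):
        return sum(1 for name in names if self.store.pop(name, None) is not None)


fake = FakeRedis(["cache:1", "cache:2", "jobs:1", "cache:3"])
removed = delete_by_prefix(fake, "cache:", batch_size=2)
```

Each SCAN call does a bounded amount of work, so other clients are served between batches. A prefix that contains glob characters such as `*` or `[` would need escaping before being used as a MATCH pattern.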

Cassandra: Data model, partition key, compaction

Cassandra is a wide-column store (in the Bigtable style). Its core differences from MongoDB and Redis: no joins and no transactions across partitions, but very fast writes thanks to its log-structured merge tree (LSM) architecture and masterless replication.

Data model, "query-first, denormalize by default"

Cassandra requires you to design the schema around query patterns, not around the data. Primary key = (partition_key, clustering_key):

-- Example: a chat application needs to query "messages in room X, newest first"
CREATE KEYSPACE chat WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'dc_us': 3,
  'dc_eu': 3
};

USE chat;

CREATE TABLE messages_by_room (
  room_id     UUID,
  message_ts  TIMESTAMP,
  message_id  TIMEUUID,
  sender_id   UUID,
  body        TEXT,
  PRIMARY KEY ((room_id), message_ts, message_id)
) WITH CLUSTERING ORDER BY (message_ts DESC, message_id DESC)
  AND default_time_to_live = 2592000   -- auto-delete after 30 days
  AND compaction = { 'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': 1 };

-- Main query (matches the PRIMARY KEY): very fast, hits a single partition
SELECT * FROM messages_by_room WHERE room_id = ? LIMIT 50;

Golden rules:

The partition key must distribute data well (avoid hot partitions) while still grouping enough rows that a query does not have to hit many nodes.
Keep each partition under roughly 100 MB or 100k rows; beyond that you get "wide partitions" that slow down compaction and reads.
If you need to query along several dimensions, create several tables (denormalize) instead of using secondary indexes (slow, and an anti-pattern at large scale).
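One common way to respect the partition-size rule for a busy room is to add a time bucket to the partition key, for example `((room_id, day), message_ts, message_id)`. That composite key is an assumption for illustration and is not the schema shown above, which partitions by `room_id` alone. The Python sketch below computes such a bucket and a rough partition-size estimate; all function names are hypothetical.

```python
from datetime import datetime, timezone


def day_bucket(ts):
    """Return the UTC calendar day of a timezone-aware timestamp, e.g. '2023-01-15'."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


def bucketed_partition_key(room_id, ts):
    """Partition key for a hypothetical table partitioned by (room_id, day)."""
    return (room_id, day_bucket(ts))


def estimated_partition_mb(rows_per_bucket, avg_row_bytes):
    """Rough uncompressed partition size in MB for one bucket."""
    return rows_per_bucket * avg_row_bytes / 1_000_000


ts = datetime(2023, 1, 15, 10, 30, tzinfo=timezone.utc)
print(bucketed_partition_key("room-1", ts))
print(estimated_partition_mb(200_000, 300))
```

The trade-off is that a query spanning several days has to read several partitions, so the bucket size should follow the most common query window.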

Consistency level, the latency vs durability trade-off

Cassandra uses tunable consistency: each request chooses its own consistency level (CL).

CL	Meaning (N replicas)	Use case
`ONE`	1 replica ACKs	Fast writes, tolerates stale reads
`QUORUM`	`floor(N/2) + 1` replicas ACK	Strong consistency when both reads and writes use QUORUM
`LOCAL_QUORUM`	QUORUM within the local datacenter	Multi-DC: lower cross-DC latency
`EACH_QUORUM`	QUORUM in every DC	Writes are guaranteed to reach every DC
`ALL`	All replicas ACK	Rarely used; one node down → failure

Rule for strong consistency: R + W > N, where R and W are the numbers of replicas that must acknowledge a read and a write, and N is the replication factor (QUORUM reads + QUORUM writes with RF=3 is the common choice: 2 + 2 > 3).

-- Session level
CONSISTENCY LOCAL_QUORUM;

-- Per-query (Python driver example; the CL is set on the statement)
-- session.execute(SimpleStatement(query, consistency_level=ConsistencyLevel.LOCAL_QUORUM))
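The R + W > N rule is easy to check with a couple of helper functions. This is a plain arithmetic sketch with hypothetical function names; it does not talk to a cluster.

```python
def quorum(replicas):
    """Number of replicas that form a quorum: floor(N / 2) + 1."""
    return replicas // 2 + 1


def is_strongly_consistent(read_acks, write_acks, replication_factor):
    """True when every read set overlaps every write set (R + W > N)."""
    return read_acks + write_acks > replication_factor


rf = 3
print(quorum(rf))
print(is_strongly_consistent(quorum(rf), quorum(rf), rf))
```

With RF=3, QUORUM reads and writes need 2 replicas each and still tolerate one node being down, whereas ONE + ONE gives no overlap guarantee.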

Compaction strategy, choosing the right one for the workload

Cassandra is LSM-based: writes are append-only and end up in immutable SSTables, which are compacted periodically to purge tombstones and merge overlapping data. The three main strategies:

Strategy	Best for	Drawbacks
SizeTieredCompaction (STCS), the default	Write-heavy, evenly distributed keys	High space amplification; peak disk usage can reach 2x
LeveledCompaction (LCS)	Read-heavy, updates to existing keys	Write amplification, high CPU
TimeWindowCompaction (TWCS)	Time series (logs, sensors, messages) with TTL	Not suited to updating old keys

The time-series messages example above uses TWCS + TTL, so old data is dropped cleanly once it expires, with no manual DELETE needed.

Cassandra tuning checklist

JVM: G1GC with an 8-16 GB heap; do not go above 32 GB (you lose compressed oops).
Put the commitlog on its own drive (SSD) and data_file_directories on a different drive.
Tombstones: avoid frequent DELETEs; if you need them, consider lowering gc_grace_seconds (keep it longer than your repair interval, or deleted data can reappear) and make sure compaction keeps up.
Watch for wide partitions with nodetool tablehistograms; keep p99 partition size < 100 MB.
Run repair regularly (nodetool repair, scheduled through Reaper) to fix inconsistencies.
Monitoring: nodetool tpstats, nodetool compactionstats, pending mutations, p99 read/write latency.

Alternatives in 2026:

ScyllaDB: a C++ rewrite of Cassandra, CQL-compatible drop-in replacement, with a claimed 2-5x throughput.
Amazon Keyspaces: managed, Cassandra-compatible service on AWS.
Astra DB / DataStax: cloud-native Cassandra as a service.

MongoDB, Cassandra, Redis: each tool solves a different problem

Choosing NoSQL or Redis does not mean "abandoning SQL"; it means picking the tool that fits the nature of the data:

MongoDB: flexible documents, frequently changing schemas, complex aggregations.
Cassandra/ScyllaDB: write-heavy workloads, partitioning by time or entity, multi-DC replication.
Redis/Valkey: cache, pub/sub, queues, leaderboards, rate limiting, ephemeral state.

Whichever engine you choose, three general principles still hold:

Design around query patterns, not around the data (especially with Cassandra).
Never make one DB "do everything": mix SQL for OLTP, Redis for cache/sessions, Cassandra for time series, pgvector for embeddings.
Measure before optimizing: benchmark on real data, and do not trust a vendor's default config for your workload.

The next article covers monitoring, troubleshooting, and maintenance: the long-term operations work that, if neglected, makes every optimization from the earlier articles meaningless once production runs into trouble.

Frequently asked questions

MongoDB: embed documents or reference (normalize)?

Answer: Embed when the data is always read together, the relationship is 1:few, and the children do not need independent updates. Reference when the relationship is many:many, the embedded children could push the document past the 16 MB limit, or you need to update children without touching the parent. Rule of thumb: "what you query together, store together".

Redis is single-threaded, so how does it scale?

Answer: Redis 6+ has I/O threading (multiple threads for network I/O; command execution is still single-threaded). Scale with: (1) Redis Cluster (hash-slot partitioning); (2) read replicas for read-heavy workloads; (3) several instances on the same server (each instance uses about one CPU core). Use pipelining or Lua scripts to reduce round-trips.

Cassandra writes are fast but reads are slow; is the design wrong?

Answer: Not necessarily. Cassandra is optimized for writes. Slow reads usually come from: a partition key that does not match the query pattern (reads scatter across many nodes); accumulated tombstones (from many deletes); or a consistency level that is too high (ALL instead of LOCAL_QUORUM). Check nodetool tablehistograms and enable TRACING ON in cqlsh.

Next article: Monitoring, troubleshooting, and ongoing maintenance (KPIs, alerting, incident response).