Background and Problem Context
Why MongoDB Issues Get Complex at Scale
MongoDB's schemaless design accelerates development but can lead to unbounded document growth, unpredictable query shapes, and fragmented indexes. At scale, factors like sharding strategy, replication lag, and storage engine settings magnify the impact of suboptimal decisions. Problems often arise not during development but months after go-live, when data volume and access patterns shift beyond initial projections.
High-Impact Failure Modes
- Unbounded array growth causing document size limit breaches (16 MB).
- Slow queries from unindexed fields or low-selectivity indexes.
- Replication lag due to write-heavy workloads saturating secondaries.
- WiredTiger cache pressure causing eviction storms and latency spikes.
- Chunk migrations in sharded clusters blocking writes.
- Disk I/O saturation due to unbounded collection growth.
Architectural Considerations
Storage Engine
Most production MongoDB deployments use WiredTiger, which offers document-level concurrency control and compression. A misconfigured cache size can either starve queries or cause OS-level swapping. For read-heavy workloads, ensuring the working set fits in RAM is critical.
Replica Sets
Replication provides redundancy and failover but adds latency sensitivity. Write concern and read preference choices influence consistency, durability, and performance. Inconsistent settings across services can cause surprising anomalies.
Sharding
Sharding distributes data horizontally across multiple replica sets (shards). Choosing a poor shard key—low cardinality, monotonically increasing values—can lead to hotspotting, uneven data distribution, and bottlenecks during chunk migrations.
Diagnostics Playbook
Step 1: Baseline Metrics
Collect db.serverStatus(), db.stats(), and db.currentOp(). Track WiredTiger cache usage, replication optime lag, disk I/O, and index hit ratios. Store baselines for comparison during incidents.
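As a sketch of what "store baselines" can mean in practice, the function below reduces a db.serverStatus()-shaped document to a few comparable numbers. The cache field paths ("bytes currently in the cache", "maximum bytes configured") follow WiredTiger's serverStatus output, but verify them against your server version; the function name and sample data are illustrative.

```javascript
// Sketch: extract a compact baseline from a serverStatus()-shaped document.
// Field paths follow WiredTiger's serverStatus output; verify for your version.
function baselineFromServerStatus(status) {
  const cache = status.wiredTiger.cache;
  const used = cache["bytes currently in the cache"];
  const max = cache["maximum bytes configured"];
  return {
    cacheFillRatio: used / max,      // sustained values near 1.0 mean eviction pressure
    opcounters: status.opcounters,   // point-in-time snapshot; diff two samples for rates
    connections: status.connections.current,
  };
}

// Trimmed sample document standing in for a real serverStatus() result:
const sample = {
  wiredTiger: { cache: { "bytes currently in the cache": 24e9, "maximum bytes configured": 32e9 } },
  opcounters: { insert: 100, query: 5000, update: 300, delete: 10 },
  connections: { current: 120 },
};
console.log(baselineFromServerStatus(sample).cacheFillRatio); // 0.75
```

Persisting one such summary per interval gives the comparison series the incident playbook relies on.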
Step 2: Identify Slow Queries
Enable the profiler or set the slowms threshold to log slow queries. Use db.system.profile.find() to locate offending operations. Always correlate slow queries with their execution plans (explain()).
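Once profiler entries are exported (e.g., via db.system.profile.find().toArray()), ranking them is straightforward. This sketch uses real profiler document fields (millis, ns, op, planSummary); the function name, threshold, and sample data are illustrative.

```javascript
// Sketch: rank slow operations from exported system.profile documents.
// millis, ns, op, and planSummary are standard profiler output fields.
function slowestOps(profileDocs, thresholdMs = 100) {
  return profileDocs
    .filter((d) => d.millis >= thresholdMs)
    .sort((a, b) => b.millis - a.millis)
    .map((d) => ({ ns: d.ns, op: d.op, millis: d.millis, plan: d.planSummary }));
}

const docs = [
  { ns: "shop.orders", op: "query",  millis: 450, planSummary: "COLLSCAN" },
  { ns: "shop.users",  op: "query",  millis: 20,  planSummary: "IXSCAN { _id: 1 }" },
  { ns: "shop.orders", op: "update", millis: 900, planSummary: "COLLSCAN" },
];
console.log(slowestOps(docs)[0]); // the 900 ms COLLSCAN update surfaces first
```

Sorting by planSummary as a secondary key can also group all COLLSCAN offenders together for index review.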
Step 3: Analyze Index Usage
Use explain("executionStats") to confirm whether queries use indexes efficiently. Look for COLLSCAN stages in execution plans, and for a high ratio of documents examined (totalDocsExamined) to documents returned (nReturned)—a sign of poor index selectivity.
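These two checks can be automated over explain output. The sketch below walks the winning plan's stage chain and computes the examined-to-returned ratio; it uses real explain fields (queryPlanner.winningPlan, executionStats.totalDocsExamined, nReturned) but handles only linear plans (it follows inputStage and ignores branching OR plans), and the function name and sample are illustrative.

```javascript
// Sketch: flag inefficient plans from explain("executionStats") output.
// Handles linear plans only (follows inputStage; branching plans not covered).
function auditPlan(explainDoc) {
  const stats = explainDoc.executionStats;
  const stages = [];
  for (let s = explainDoc.queryPlanner.winningPlan; s; s = s.inputStage) stages.push(s.stage);
  const ratio = stats.nReturned === 0 ? Infinity : stats.totalDocsExamined / stats.nReturned;
  return {
    collscan: stages.includes("COLLSCAN"),
    examinedPerReturned: ratio, // near 1 is ideal; large values mean poor selectivity
  };
}

const explainSample = {
  queryPlanner: { winningPlan: { stage: "FETCH", inputStage: { stage: "COLLSCAN" } } },
  executionStats: { nReturned: 10, totalDocsExamined: 50000 },
};
console.log(auditPlan(explainSample)); // { collscan: true, examinedPerReturned: 5000 }
```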
Step 4: Monitor Replication Health
Run rs.printSecondaryReplicationInfo() (formerly rs.printSlaveReplicationInfo()) to detect lag. On lagging secondaries, inspect db.serverStatus().opcounters and disk I/O utilization. Check for blocking index builds or large oplog entries.
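Lag can also be computed directly from rs.status() output, which is convenient for alerting. The member fields used here (stateStr, optimeDate, name) are standard rs.status() fields; the function name and sample replica set are illustrative, and the sketch assumes exactly one reachable primary.

```javascript
// Sketch: compute per-secondary lag from rs.status()-shaped member data.
// Assumes one PRIMARY is present; stateStr/optimeDate are rs.status() fields.
function secondaryLagSeconds(members) {
  const primary = members.find((m) => m.stateStr === "PRIMARY");
  return members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => ({
      name: m.name,
      lagSec: (primary.optimeDate - m.optimeDate) / 1000, // Date subtraction yields ms
    }));
}

const members = [
  { name: "db0:27017", stateStr: "PRIMARY",   optimeDate: new Date("2024-01-01T00:01:00Z") },
  { name: "db1:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T00:00:45Z") },
];
console.log(secondaryLagSeconds(members)); // db1 is 15 seconds behind
```

Comparing the worst lag against the oplog window from db.getReplicationInfo() tells you how close a secondary is to falling off the oplog entirely.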
Step 5: Sharding-Specific Checks
Use sh.status() to inspect chunk distribution and look for chunks concentrated on a single shard. Confirm the balancer is enabled with sh.getBalancerState(), and review the balancer logs and the config.changelog collection for migration activity.
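A quick imbalance check can be run on per-shard chunk counts tallied from sh.status() or config.chunks. The function name, the 2x ratio threshold, and the shard names below are illustrative choices, not MongoDB defaults.

```javascript
// Sketch: detect chunk imbalance from per-shard chunk counts.
// maxRatio is an illustrative threshold, not a MongoDB default.
function chunkImbalance(chunkCounts, maxRatio = 2) {
  const counts = Object.values(chunkCounts);
  const min = Math.min(...counts);
  const max = Math.max(...counts);
  return { min, max, balanced: min === 0 ? max === 0 : max / min <= maxRatio };
}

console.log(chunkImbalance({ shardA: 120, shardB: 40, shardC: 110 }));
// shardA holds 3x shardB's chunks, so this reports balanced: false
```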
Step 6: Resource Contention
Correlate CPU, memory, and disk metrics from MongoDB with system-level monitoring (e.g., iostat, vmstat). WiredTiger eviction metrics often indicate cache misconfiguration.
Common Pitfalls
- Relying on default indexes without analyzing query workload.
- Using large documents with deeply nested arrays without bounding growth.
- Allowing unbounded collections to grow without TTL indexes.
- Choosing shard keys without considering write distribution.
- Running mixed read/write workloads without proper read preference strategy.
Step-by-Step Fixes
Optimizing Indexing
```javascript
// Create a compound index to match the query shape
db.orders.createIndex({ customerId: 1, status: 1, orderDate: -1 })

// Covered query: project only indexed fields and exclude _id
db.orders.find(
  { customerId: 123, status: "OPEN" },
  { _id: 0, customerId: 1, status: 1 }
).hint({ customerId: 1, status: 1, orderDate: -1 })
```
Managing WiredTiger Cache
```yaml
# Set cache size explicitly in mongod.conf (e.g., ~50% of RAM on a 64 GB host)
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 32
```
Reducing Replication Lag
```javascript
// Use w: 1 for latency-sensitive writes (trades durability for speed)
db.orders.insertOne({ /* ... */ }, { writeConcern: { w: 1 } })

// Check the oplog window size
db.getReplicationInfo()
```
Sharding Improvements
```javascript
// Use a hashed shard key to distribute writes evenly
sh.shardCollection("sales.transactions", { transactionId: "hashed" })
```
TTL Indexes for Data Expiry
```javascript
// Automatically expire documents 90 days (7,776,000 seconds) after createdAt
db.events.createIndex({ createdAt: 1 }, { expireAfterSeconds: 7776000 })
```
Best Practices for Long-Term Stability
- Design schemas to match query patterns; avoid deep nesting beyond operational limits.
- Regularly review and prune unused indexes.
- Align shard key choice with anticipated growth and access patterns.
- Monitor oplog size to ensure it can handle replication delays.
- Use change streams and capped collections for real-time use cases instead of polling.
- Automate backups and test restore procedures regularly.
Conclusion
MongoDB's power at scale depends on disciplined schema design, indexing strategy, and operational vigilance. Many large-scale issues—replication lag, cache pressure, slow queries—are predictable with the right baselines and monitoring in place. By following the diagnostics playbook, addressing architectural misalignments, and enforcing best practices like bounded growth and thoughtful shard key selection, enterprises can maintain high performance and reliability even as workloads evolve.
FAQs
1. How can I detect unbounded array growth before it causes problems?
Monitor document sizes using the $bsonSize aggregation operator. Set alerts for documents approaching the 16 MB limit.
2. What is the optimal WiredTiger cache size?
MongoDB's default is 50% of (RAM − 1 GB) or 256 MB, whichever is larger; adjust downward if other processes share the host. Too small a cache increases disk I/O; too large risks OS-level swapping.
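MongoDB's documented default formula is a useful starting point when sizing the cache explicitly; the helper name below is illustrative.

```javascript
// MongoDB's default WiredTiger cache: max(50% of (RAM - 1 GB), 256 MB).
function defaultCacheSizeGB(ramGB) {
  return Math.max(0.5 * (ramGB - 1), 0.25);
}

console.log(defaultCacheSizeGB(64)); // 31.5
console.log(defaultCacheSizeGB(1));  // 0.25 (the 256 MB floor)
```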
3. How do I choose the right shard key?
Pick a high-cardinality field that distributes writes evenly and supports common query filters. Avoid monotonically increasing keys for write-heavy workloads.
4. Why is my secondary lagging even with low write volume?
Check for blocking operations like index builds, large batch updates, or disk bottlenecks. Also verify oplog size is sufficient to cover replication delay windows.
5. Should I always use hashed shard keys?
No. Hashed keys balance writes but may hinder range queries. Choose based on whether range scans or even write distribution is more critical.