Background: What Causes Intermittent 503s and Write Stalls in ArangoDB?
ArangoDB Cluster Roles and Data Flow
ArangoDB's cluster consists of three distinct roles: Agents (RAFT consensus, configuration, and supervision), Coordinators (query planning, routing, scatter/gather), and DB-Servers (shard owners storing data in RocksDB). A client write typically enters at a Coordinator, which resolves the shard, forwards the operation to that shard's leader DB-Server, and, under synchronous replication, waits for follower acknowledgments before returning success. This synchronous path is governed by writeConcern and replicationFactor.
Root-Cause Patterns Seen in the Wild
- Follower lag under synchronous replication: Followers fall behind due to I/O pressure or compaction pauses, causing leader waits and timeouts.
- Shard hot-spotting: Non-uniform key distributions or SmartGraph edge layouts overload a subset of shard leaders.
- RocksDB write stalls: Saturated memtables or compaction debt triggers global or per-column-family stalls, throttling writes.
- Coordinator fan-out overload: Large scatter/gather AQL queries over many shards overwhelm Coordinators, especially with many concurrent clients.
- Agency instability: RAFT leader churn or slow disk on Agents makes supervision and plan syncing sluggish, amplifying failover times.
- Networking and MTU edge cases: Packet loss, jumbo frames mismatch, or SNAT exhaustion lead to transient 503s that mimic database faults.
Architecture Deep Dive: Synchronous Replication Meets Storage Backpressure
WriteConcern vs. ReplicationFactor
replicationFactor defines how many copies of each shard exist; writeConcern defines how many of those copies must acknowledge a write before it is reported as successful. In busy clusters, setting both high improves durability but increases latency sensitivity to follower health: if a follower lags or stalls, leaders accumulate in-flight operations and Coordinators back up.
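As a quick illustration, the arangosh sketch below creates a collection with three copies of each shard but requires only two acknowledgments per write; the collection name and shard count are placeholders, not values from this article.
// arangosh: hypothetical "orders" collection, 3 copies per shard, 2 in-sync acks per write
db._create("orders", {
  numberOfShards: 8,        // placeholder shard count
  replicationFactor: 3,     // total copies of each shard
  writeConcern: 2           // copies that must acknowledge before a write succeeds
});
// verify the effective settings
print(db.orders.properties());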
RocksDB's Role in Stalls
ArangoDB stores its data in RocksDB. When writes outpace compaction, L0 files accumulate, memtables fill, and RocksDB can enter a stall or throttle mode. This propagates upstream: the follower's apply slows → the leader waits for acknowledgments → the Coordinator times out → clients see 503s or latency spikes. Understanding memtable sizing, background job counts, and block cache sizing is crucial to restoring headroom.
Shard Leadership and Hot Paths
Each shard has a leader. If your sharding key or graph partitioning is skewed, a handful of leaders handle a disproportionate share of writes. Even with abundant cluster capacity, a few hot leaders can trigger queues and timeouts.
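If writes cluster on a few key values, choosing a higher-cardinality shard key at collection creation time spreads leaders more evenly. A minimal arangosh sketch, using a hypothetical tenantId attribute and placeholder collection name:
// shard by a high-cardinality attribute (hypothetical "tenantId") instead of _key alone
db._create("events", { numberOfShards: 32, shardKeys: ["tenantId"], replicationFactor: 3 });
// inspect which DB-Servers are responsible for each shard (cluster only)
print(db.events.shards(true));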
Diagnostics: Build a 360° Picture Before Changing Knobs
1) Quick Health Triage
Start with the obvious, but inspect with intent:
curl -s http://COORDINATOR:8529/_admin/server/availability
curl -s http://COORDINATOR:8529/_admin/metrics | grep -E "arangodb_rocksdb|arangodb_replication|arangodb_scheduler|arangodb_aql"
arangosh --server.endpoint tcp://COORDINATOR:8529 --javascript.execute-string "print(JSON.stringify(db._engineStats()));"
Look for RocksDB write stall counters, scheduler queue lengths, replication apply lag, and AQL query counts per state.
2) Followers, Lag, and Synchronous Replication
Check shard distribution and follower status. This reveals leaders under duress and followers that fail to keep up.
curl -s http://COORDINATOR:8529/_admin/cluster/shardDistribution | jq .
curl -s http://COORDINATOR:8529/_admin/metrics | grep arangodb_replication_applier_tailing_lag_msec
curl -s http://COORDINATOR:8529/_admin/metrics | grep -E "arangodb_replication_client_(inserts|writes)"
Spikes in tailing_lag_msec on followers that correlate with 503s usually implicate synchronous-replication wait timeouts.
3) RocksDB Pressure Indicators
Observe L0 files, memtable counts, and stall state per DB-Server:
curl -s http://DBSERVER:8529/_admin/metrics | grep -E "rocksdb_num_files_at_level_0|rocksdb_memtable|rocksdb_stall_micros"
arangosh --server.endpoint tcp://DBSERVER:8529 --javascript.execute-string "var s=db._engineStats(); print(JSON.stringify(s));"
A high and growing rocksdb_num_files_at_level_0 count combined with non-zero stall durations indicates compaction debt. If the block cache hit rate is low, reads are also stressing the disks.
4) Coordinator Pressure and Scheduler Queues
Coordinators run the query optimizer and merge results. Under bursty workloads, their scheduler queues can back up.
curl -s http://COORDINATOR:8529/_admin/metrics | grep -E "arangodb_scheduler_queue"
curl -s http://COORDINATOR:8529/_admin/metrics | grep -E "arangodb_aql_active_queries|arangodb_aql_slow_query_count"
Large scheduler queues or many slow AQL queries point to coordination bottlenecks or unsuitable AQL plans (e.g., early materialization).
5) AQL Plan Inspection (Late Materialization, Index Choice)
Use EXPLAIN and profiling to verify index usage and materialization behavior:
arangosh --server.endpoint tcp://COORDINATOR:8529 --javascript.execute-string "var q='FOR d IN mycol FILTER d.a==@x AND d.b IN @vals RETURN d'; db._explain({query: q, bindVars: {x: 123, vals: [1, 2, 3]}});"
Missing selective indexes or disabled late materialization can cause heavy document fetch and over-fan-out.
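If the explain output shows a full collection scan, adding a selective index usually pays off immediately. A minimal sketch, assuming the filter attributes from the query above and a hypothetical index name:
// arangosh: persistent index covering the filter attributes used above
db.mycol.ensureIndex({ type: "persistent", fields: ["a", "b"], name: "idx_a_b" });
// re-run the explain and confirm an index node replaces the full collection scan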
6) Agency Stability and Disk Health
Agents must be on fast, reliable disks. Confirm RAFT leader stability and write-ahead log persistence:
curl -s http://AGENT:8529/_admin/metrics | grep -E "agency_leader|agency_store|raft"
iostat -x 1 10       # on Agent hosts
dmesg | grep -i error
Frequent leader changes or slow disks on Agents correlate with longer failovers and plan updates.
7) Network Sanity (MTU, Loss, NAT)
Intermittent 503s sometimes trace back to networking. Validate end-to-end MTU and packet loss:
ping -M do -s 8972 PEER_IP   # jumbo frame path test
iperf3 -c PEER_IP -u -b 0    # UDP loss
ss -s | grep -i orphan       # socket pressure
If MTU black holes or NAT exhaustion exist, the network faults will surface as application-level symptoms that look like database faults.
Common Pitfalls and Anti-Patterns
- Maxing out writeConcern by default: Setting writeConcern == replicationFactor universally increases failure sensitivity in mixed workloads.
- Under-provisioned block cache: A tiny RocksDB block cache drives random reads to disk, slowing compaction and apply phases.
- Single coarse shard: Few shards mean fewer leaders; any hot shard creates a single-file bottleneck.
- Unbounded AQL fan-out: Cartesian joins or COLLECT without supporting indexes create massive scatter/gather overhead.
- Agents on slow disks: RAFT persistence on HDDs or busy volumes causes plan/supervision lag.
- Mixing jumbo frames without end-to-end support: Partial jumbo enablement introduces elusive packet drops.
Step-by-Step Fixes
1) Right-Size Synchronous Replication
Start by aligning writeConcern with your SLOs. Consider a lower writeConcern for high-throughput collections and a higher one for transactional or audit data. Confirm and change it via collection properties:
arangosh --server.endpoint tcp://COORDINATOR:8529 --javascript.execute-string "var c=db._collection('orders'); print(JSON.stringify(c.properties()));"
arangosh --server.endpoint tcp://COORDINATOR:8529 --javascript.execute-string "db._collection('orders').properties({writeConcern: 2});"
Also review syncTimeout settings at the server level to avoid premature timeouts during short-lived bursts.
2) Expand and Rebalance Shards
Increase shard count for hot collections to distribute leader responsibility, then trigger a rebalance:
arangosh --server.endpoint tcp://COORDINATOR:8529 --javascript.execute-string "db._collection('events').properties({numberOfShards: 32});"
curl -X POST http://COORDINATOR:8529/_admin/cluster/rebalance
For SmartGraphs, revisit smart attribute selection to improve edge locality without overloading leaders.
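For reference, the smart attribute is fixed when a SmartGraph is created, so changing it means recreating the graph. A sketch of creating an Enterprise Edition SmartGraph, with a hypothetical region attribute and placeholder graph and collection names:
// arangosh (Enterprise Edition): SmartGraph partitioned by a hypothetical "region" attribute
var smartGraph = require("@arangodb/smart-graph");
smartGraph._create(
  "social",                                             // graph name (placeholder)
  [smartGraph._relation("knows", "person", "person")],  // edge definition (placeholders)
  [],                                                    // no orphan collections
  { smartGraphAttribute: "region", numberOfShards: 16, replicationFactor: 3 }
);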
3) Tune RocksDB for Headroom
Focus on three areas: memtables, compaction parallelism, and cache. Typical starting points on DB-Servers with SSDs:
# arangod.conf (DB-Server)
[rocksdb]
write-buffer-size = 134217728               # 128 MiB per memtable
max-write-buffer-number = 6                 # helps absorb bursts
min-write-buffer-number-to-merge = 2
max-background-jobs = 8                     # align with CPU/IO
max-subcompactions = 2
block-cache-size = 641728512                # ~600 MiB (adjust to RAM)
dynamic-level-bytes = true
bytes-per-sync = 1048576
wal-file-timeout = 300
After changes, monitor L0 files and stall counters. Decreases indicate compaction is keeping up.
4) Elevate Indexing and Late Materialization
Ensure queries hit selective indexes and avoid fetching full documents too early. Consider inverted indexes for complex predicates and analyzers:
// create an inverted index (ArangoDB 3.10+)
db.mycol.ensureIndex({ type: "inverted", name: "inv_a_b", fields: ["a", "b"] });
// profile the query to confirm the index is used
db._profileQuery(`FOR d IN mycol FILTER d.a == @x AND d.b IN @vals RETURN d`, { x: 123, vals: [1, 2, 3] });
Enable optimizer rules that promote late materialization if they were previously disabled, and verify the effect via EXPLAIN.
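One way to check is to run an explain and look for the late-document-materialization rule in the applied-rules list (rule name as in recent ArangoDB releases). A sketch, assuming a sorted and limited query over mycol:
// arangosh: explain a sort+limit query and inspect which optimizer rules were applied
var stmt = db._createStatement({
  query: "FOR d IN mycol FILTER d.a == @x SORT d.b LIMIT 10 RETURN d",
  bindVars: { x: 123 }
});
var plan = stmt.explain();
print(plan.plan.rules);   // expect "late-document-materialization" when the rule applies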
5) Protect Coordinators from Fan-Out Storms
Cap concurrency at the client tier and size Coordinator pools to expected fan-out. Use query timeouts and guardrails:
// limit execution time to 5 seconds
db._query(`FOR d IN mycol FILTER d.ts > @t RETURN d`, { t: 1680000000 }, { maxRuntime: 5 });
Consider splitting read-heavy traffic to read-only Coordinators to isolate write paths.
6) Harden the Agency
Place Agents on dedicated, low-latency SSDs; set RAFT election and heartbeat timeouts appropriate to your network. Watch for leader churn and disk saturation.
curl -s http://AGENT:8529/_admin/metrics | grep agency_leader
iostat -x 1 30
Stability at the control plane reduces failover tail latency during incidents.
7) Fix Network Footguns
Make MTU consistent end-to-end; if in doubt, prefer standard 1500 unless every hop supports jumbo. Ensure SNAT pools are sufficient in NATed deployments and that connection tracking tables are not overflowing.
ip link show | grep mtu
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
8) Observe, Alert, and Correlate
Export and alert on the following metrics; correlate spikes with application incidents:
- arangodb_replication_applier_tailing_lag_msec per follower
- rocksdb_num_files_at_level_0 and rocksdb_stall_micros
- arangodb_scheduler_queue_length on Coordinators
- AQL slow query counts and per-collection write rates
curl -s http://NODE:8529/_admin/metrics | grep -E "arangodb_|rocksdb_"
9) Transaction Scope and TTL Hygiene
Long-running streaming transactions and aggressive TTL purges can collide. Stagger TTL jobs and avoid holding large write transactions open during purge windows.
db._query(`FOR d IN mycol FILTER d.expiry < DATE_NOW() REMOVE d IN mycol OPTIONS { ignoreErrors: true }`, {}, {maxRuntime: 10});
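Where the data model allows it, a TTL index lets the server expire documents in small background batches instead of large ad-hoc purges. A sketch, assuming the expiry attribute holds a Unix timestamp in seconds or an ISO 8601 date string (the formats TTL indexes expect):
// arangosh: let the TTL background thread remove expired documents incrementally
db.mycol.ensureIndex({
  type: "ttl",
  fields: ["expiry"],   // must contain seconds-since-epoch numbers or ISO 8601 strings
  expireAfter: 0        // expire as soon as the stored timestamp has passed
});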
10) Capacity Plan for Peaks, Not Averages
Provision memtable, cache, and background jobs for burst traffic. Run controlled load tests with realistic leader/follower placements to validate headroom before peak events.
Pitfalls During Remediation (and How to Avoid Them)
- Turning off synchronous replication cluster-wide: This might hide symptoms but undermines durability guarantees. Prefer targeted tuning per collection.
- Over-sharding without rebalancing: Increasing shards without a rebalance leaves hot leaders in place; always follow with a planned rebalance.
- Oversizing block cache blindly: Starving memtables for cache can worsen stalls. Balance both according to RAM and workload.
- Ignoring AQL plan shape: Storage and replication fixes cannot compensate for pathological query plans. Fix the plan first for quick wins.
- Changing many knobs at once: Apply one change at a time and measure, or you'll struggle to attribute improvements.
Hands-On Playbooks
Playbook A: Followers Lagging, Writes Timing Out
- Confirm lag: grep arangodb_replication_applier_tailing_lag_msec on followers.
- Reduce writeConcern on hot collections (temporarily) and re-evaluate latency.
- Increase memtables and background jobs on lagging followers; verify rocksdb_num_files_at_level_0 declines.
- Rebalance shards so hot leaders gain more peers and spread load.
- Once stable, restore the desired writeConcern and re-test.
Playbook B: 503 Spikes During Analytics Queries
- Profile offending AQL queries; ensure appropriate indexes and late materialization.
- Introduce maxRuntime and throttle client concurrency.
- Scale Coordinators and isolate analytics traffic if necessary.
- Monitor arangodb_scheduler_queue_length and adjust accordingly.
Playbook C: RocksDB Stalls on DB-Servers
- Track L0 and stall metrics; if they keep growing, increase max-background-jobs and write-buffer-size.
- Ensure SSD-backed storage and sufficient IOPS; move WAL and data to fast media.
- Reduce compaction debt by temporarily limiting ingestion rate at clients.
- Re-check stall counters and latencies; iterate until steady-state is healthy.
AQL Optimization Nuggets That Matter Under Load
Prefer Index-Covered Projections
Retrieve only necessary attributes to keep gathers light:
FOR d IN mycol FILTER d.a==@x RETURN {a:d.a, b:d.b}
Use COLLECT WITH COUNT INTO for Fast Aggregations
FOR d IN mycol FILTER d.type==@t COLLECT WITH COUNT INTO c RETURN c
Avoid Exploding Fan-Out with IN on Large Arrays
Pre-stage the values in a temporary collection with an index and join via that index rather than sending huge arrays to IN.
FOR v IN vals FOR d IN mycol FILTER d.k==v.k RETURN d
Operational Guardrails and SLOs
- Latency SLOs: Track p50/p95 for writes per collection; alert when crossing budget.
- Replication SLOs: Alert if follower lag exceeds N milliseconds for M minutes.
- Storage SLOs: Keep rocksdb_num_files_at_level_0 under a steady threshold; keep stall time negligible.
- Control-plane SLOs: Keep Agency leader stability and heartbeat latencies within tight bounds.
Case Study: SmartGraph Edges Causing Hot Leaders
Symptoms
Edge inserts to a SmartGraph intermittently timed out at p99 during peak. Coordinators showed modest CPU; DB-Servers displayed bursts of RocksDB stalls and replication lag spikes.
Findings
- Smart attribute skew created two disproportionately hot shard leaders.
- Followers for those shards lagged under compaction pressure.
- AQL queries fetched full edges early, inflating scatter/gather costs.
Remediation
- Re-selected smart attribute to distribute writes more evenly.
- Increased numberOfShards for the edge collection and executed a rebalance.
- Tuned RocksDB memtables and background jobs on the hot DB-Servers.
- Refactored AQL to late materialize edge documents after index filtering.
Outcome
Follower lag dropped by 85%, p99 write latency improved from 1.8s to 220ms, and no 503s were observed during the next peak event.
Security and Compliance Considerations
Lowering writeConcern or relaxing timeouts can undermine durability guarantees. Document any temporary changes and set automated tasks to restore the intended policies. Use collection-level policies to distinguish compliance-critical data from high-volume telemetry. Keep audit trails of configuration changes, and ensure backups are consistent with the current replication settings.
Conclusion
Intermittent 503s and write stalls in ArangoDB are rarely a single bug; they are emergent behavior at the intersection of synchronous replication, shard leadership, and storage backpressure. The disciplined approach is to measure first (replication lag, RocksDB pressure, Coordinator queues) and then apply targeted fixes: right-size writeConcern, rebalance shards, tune RocksDB for your hardware, and optimize AQL plans to minimize fan-out and early materialization. Stabilize the Agency and the network, and treat capacity as a burst budget, not an average. With these practices, you can turn a flaky cluster into a predictable platform that meets enterprise SLOs under sustained load.
FAQs
1. Should I always match writeConcern to replicationFactor?
No. Use a writeConcern that aligns with durability needs per collection. For ultra-high-throughput telemetry, a lower writeConcern can be acceptable, while regulatory data might require full acknowledgments.
2. How do I detect shard hot-spotting quickly?
Inspect _admin/cluster/shardDistribution and per-shard write metrics. Leaders with disproportionate write rates and long scheduler queues are red flags; increase shards and rebalance.
3. Can bigger block cache alone fix stalls?
Not by itself. You must balance cache with memtables and compaction threads. Stalls typically reflect compaction debt, which requires parallelism and I/O headroom, not just cache.
4. Are Coordinators a common bottleneck?
They can be when queries scatter to many shards or materialize documents early. Scale Coordinators, cap concurrency, and optimize AQL plans to reduce fan-out.
5. Do Agents affect query latency?
Indirectly. Unstable or slow Agents prolong failovers and supervision actions, which can surface as timeouts during reconfiguration or recovery events. Keep them fast and stable.