Background: What Causes Intermittent 503s and Write Stalls in ArangoDB?

ArangoDB Cluster Roles and Data Flow

ArangoDB's cluster consists of three distinct roles: Agents (RAFT consensus, configuration and supervision), Coordinators (query planning, routing, scatter/gather), and DB-Servers (shard owners storing data in RocksDB). A client write typically enters at a Coordinator, which resolves the shard, forwards to the leader DB-Server for that shard, and—under synchronous replication—waits for follower acknowledgments before returning success. This synchronous path is governed by writeConcern and replicationFactor.

Root-Cause Patterns Seen in the Wild

  • Follower lag under synchronous replication: Followers fall behind due to I/O pressure or compaction pauses, causing leader waits and timeouts.
  • Shard hot-spotting: Non-uniform key distributions or SmartGraph edge layouts overload a subset of shard leaders.
  • RocksDB write stalls: Saturated memtables or compaction debt triggers global or per-column-family stalls, throttling writes.
  • Coordinator fan-out overload: Large scatter/gather AQL queries over many shards overwhelm Coordinators, especially with many concurrent clients.
  • Agency instability: RAFT leader churn or slow disk on Agents makes supervision and plan syncing sluggish, amplifying failover times.
  • Networking and MTU edge cases: Packet loss, jumbo frames mismatch, or SNAT exhaustion lead to transient 503s that mimic database faults.

Architecture Deep Dive: Synchronous Replication Meets Storage Backpressure

WriteConcern vs. ReplicationFactor

replicationFactor defines how many copies exist; writeConcern defines how many must acknowledge a write before success. In busy clusters, setting both high improves durability but increases latency sensitivity to follower health. If a follower lags or stalls, leaders accumulate in-flight operations and Coordinators back up.
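
Both values can be set at creation time and inspected or adjusted later from arangosh. A minimal sketch, with the collection name and numbers purely illustrative:

// illustrative: 3 copies of every shard, 2 acknowledgments required before a write succeeds
db._create("orders", { numberOfShards: 6, replicationFactor: 3, writeConcern: 2 });

// writeConcern (and replicationFactor) can be changed later via properties(); numberOfShards cannot
db._collection("orders").properties({ writeConcern: 1 });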

RocksDB's Role in Stalls

ArangoDB stores data in RocksDB. When writes outpace compaction, L0 files accumulate, memtables fill, and RocksDB can enter a stall or throttle mode. This propagates upstream: the follower applies changes slowly → the leader waits for acknowledgments → the Coordinator times out → clients see 503s or latency spikes. Understanding memtable sizing, background job counts, and block cache sizing is crucial to restoring headroom.
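
To watch these signals from arangosh on a single DB-Server, the following sketch filters db._engineStats() for stall-, memtable-, and level-0-related entries; the exact key names differ between ArangoDB and RocksDB versions, so treat the regular expression as an assumption to adapt:

// run against a DB-Server endpoint; key names are version-dependent (assumption)
var stats = db._engineStats();
Object.keys(stats)
  .filter(function (k) { return /level.?0|stall|memtable/i.test(k); })
  .forEach(function (k) { print(k + " = " + JSON.stringify(stats[k])); });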

Shard Leadership and Hot Paths

Each shard has a leader. If your sharding key or graph partitioning is skewed, a handful of leaders handle a disproportionate share of writes. Even with abundant cluster capacity, a few hot leaders can trigger queues and timeouts.
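
A quick way to check whether leadership is spread evenly is to tally shard leaders per DB-Server from the cluster shard distribution. The sketch below assumes the response layout of /_admin/cluster/shardDistribution (per-collection Plan entries with a leader field); verify it against your version:

// run in arangosh connected to a Coordinator; response layout assumed, verify on your version
var dist = arango.GET("/_admin/cluster/shardDistribution").results;
var leaders = {};
Object.keys(dist).forEach(function (col) {
  var plan = dist[col].Plan;
  Object.keys(plan).forEach(function (shard) {
    var leader = plan[shard].leader;
    leaders[leader] = (leaders[leader] || 0) + 1;
  });
});
print(leaders);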

Diagnostics: Build a 360° Picture Before Changing Knobs

1) Quick Health Triage

Start with the obvious, but inspect with intent:

curl -s http://COORDINATOR:8529/_admin/server/availability
curl -s http://COORDINATOR:8529/_admin/metrics | grep -E "arangodb_rocksdb|arangodb_replication|arangodb_scheduler|arangodb_aql"
arangosh --server.endpoint tcp://COORDINATOR:8529 --javascript.execute-string "print(JSON.stringify(db._engineStats()));"

Look for RocksDB write stall counters, scheduler queue lengths, replication apply lag, and AQL query counts per state.

2) Followers, Lag, and Synchronous Replication

Check shard distribution and follower status. This reveals leaders under duress and followers that fail to keep up.

curl -s http://COORDINATOR:8529/_admin/cluster/shardDistribution | jq .
curl -s http://COORDINATOR:8529/_admin/metrics | grep arangodb_replication_applier_tailing_lag_msec
curl -s http://COORDINATOR:8529/_admin/metrics | grep -E "arangodb_replication_client_(inserts|writes)"

Spikes in tailing_lag_msec on followers correlated with 503s usually implicate synchronous replication wait timeouts.
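
To map an individual collection's shards to their responsible servers (leader listed first), a short sketch, assuming a collection named events:

// run in arangosh connected to a Coordinator; "events" is illustrative
var shards = db._collection("events").shards(true);
Object.keys(shards).forEach(function (s) {
  print(s + " -> " + JSON.stringify(shards[s]));
});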

3) RocksDB Pressure Indicators

Observe L0 files, memtable counts, and stall state per DB-Server:

curl -s http://DBSERVER:8529/_admin/metrics | grep -E "rocksdb_num_files_at_level_0|rocksdb_memtable|rocksdb_stall_micros"
arangosh --server.endpoint tcp://DBSERVER:8529 --javascript.execute-string \
  "var s = db._engineStats(); print(JSON.stringify(s));"

High and growing rocksdb_num_files_at_level_0 and non-zero stall durations indicate compaction debt. If block cache hit rate is low, reads are also stressing disks.

4) Coordinator Pressure and Scheduler Queues

Coordinators run the query optimizer and merge results. Under bursty workloads, their scheduler queues can back up.

curl -s http://COORDINATOR:8529/_admin/metrics | grep -E \"arangodb_scheduler_queue\"
curl -s http://COORDINATOR:8529/_admin/metrics | grep -E \"arangodb_aql_active_queries|arangodb_aql_slow_query_count\"

Large scheduler queues or many slow AQL queries point to coordination bottlenecks or unsuitable AQL plans (e.g., early materialization).
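
To see what the Coordinator is actually working on, the @arangodb/aql/queries module lists running and recorded slow queries from arangosh; a small sketch, with the slow-query threshold purely illustrative:

// inspect running and slow AQL queries on a Coordinator (the 2 s threshold is illustrative)
var queries = require("@arangodb/aql/queries");
queries.properties({ slowQueryThreshold: 2 });
print(queries.current());   // queries running right now
print(queries.slow());      // recently recorded slow queries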

5) AQL Plan Inspection (Late Materialization, Index Choice)

Use EXPLAIN and profiling to verify index usage and materialization behavior:

arangosh --server.endpoint tcp://COORDINATOR:8529 --javascript.execute-string \
  "var q = 'FOR d IN mycol FILTER d.a == @x AND d.b IN @vals RETURN d'; db._explain({ query: q, bindVars: { x: 123, vals: [1, 2, 3] } });"

Missing selective indexes or disabled late materialization can cause heavy document fetch and over-fan-out.

6) Agency Stability and Disk Health

Agents must be on fast, reliable disks. Confirm RAFT leader stability and write-ahead log persistence:

curl -s http://AGENT:8529/_admin/metrics | grep -E "agency_leader|agency_store|raft"
iostat -x 1 10   # on Agent hosts
dmesg | grep -i error

Frequent leader changes or slow disks on Agents correlate with longer failovers and plan updates.

7) Network Sanity (MTU, Loss, NAT)

Intermittent 503s sometimes trace back to networking. Validate end-to-end MTU and packet loss:

ping -M do -s 8972 PEER_IP   # jumbo frame path test
iperf3 -c PEER_IP -u -b 0   # UDP loss
ss -s | grep -i orphan       # socket pressure

If MTU black holes or NAT exhaustion exist, the network fault will surface as database-level symptoms such as intermittent 503s.

Common Pitfalls and Anti-Patterns

  • Maxing out writeConcern by default: Setting writeConcern == replicationFactor universally increases failure sensitivity in mixed workloads.
  • Under-provisioned block cache: A tiny RocksDB cache drives random reads to disk, slowing compaction and apply phases.
  • Single coarse shard: Few shards mean fewer leaders; any hot shard creates a single-file bottleneck.
  • Unbounded AQL fan-out: Unindexed nested FOR loops (Cartesian products) or COLLECT over unindexed attributes create massive scatter/gather overhead.
  • Agents on slow disks: RAFT persistence on HDDs or busy volumes causes plan/supervision lag.
  • Mixing jumbo frames without end-to-end support: Partial jumbo enablement introduces elusive packet drops.

Step-by-Step Fixes

1) Right-Size Synchronous Replication

Start by aligning writeConcern with your SLOs. Consider a lower writeConcern for high-throughput collections and a higher one for transactional or audit data. Confirm and change via collection properties:

arangosh --server.endpoint tcp://COORDINATOR:8529 --javascript.execute-string \
  "var c = db._collection('orders'); print(JSON.stringify(c.properties()));"
arangosh --server.endpoint tcp://COORDINATOR:8529 --javascript.execute-string \
  "db._collection('orders').properties({ writeConcern: 2 });"

Also review the server-level synchronous-replication timeout settings to avoid premature timeouts during short-lived bursts.

2) Expand and Rebalance Shards

The number of shards is fixed when a collection is created, so to spread leader responsibility you create a replacement collection with more shards (events_v2 below, as an example), migrate the data (AQL copy or arangodump/arangorestore), and then trigger a rebalance:

arangosh --server.endpoint tcp://COORDINATOR:8529 --javascript.execute-string \
  "db._create('events_v2', { numberOfShards: 32, replicationFactor: 3, writeConcern: 2 });"
curl -X POST http://COORDINATOR:8529/_admin/cluster/rebalance

For SmartGraphs, revisit smart attribute selection to improve edge locality without overloading leaders.
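
Because the smart attribute cannot be changed on an existing graph, re-selecting it means creating a new SmartGraph and migrating data. A hedged sketch (Enterprise Edition; graph, collection, and attribute names are illustrative):

// Enterprise Edition only; names and the chosen smartGraphAttribute are illustrative
var smartGraphs = require("@arangodb/smart-graph");
smartGraphs._create(
  "interactions",
  [smartGraphs._relation("follows", ["users"], ["users"])],
  [],
  { smartGraphAttribute: "tenantId", numberOfShards: 16, replicationFactor: 3 }
);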

3) Tune RocksDB for Headroom

Focus on three areas: memtables, compaction parallelism, and cache. Typical starting points on DB-Servers with SSDs:

# arangod.conf (DB-Server)
[rocksdb]
write-buffer-size = 134217728            # 128 MiB per memtable
max-write-buffer-number = 6              # helps absorb bursts
min-write-buffer-number-to-merge = 2
max-background-jobs = 8                  # align with CPU/IO
max-subcompactions = 2
block-cache-size = 641728512             # ~612 MiB (adjust to RAM)
level-compaction-dynamic-level-bytes = true
bytes-per-sync = 1048576
wal-file-timeout = 300

After changes, monitor L0 files and stall counters. Decreases indicate compaction is keeping up.

4) Elevate Indexing and Late Materialization

Ensure queries hit selective indexes and avoid fetching full documents too early. Consider inverted indexes for complex predicates and analyzers:

// create an inverted index (ArangoDB 3.10+)
db.mycol.ensureIndex({ type: "inverted", name: "inv_a_b", fields: ["a", "b"] });

// profile a query
db._profileQuery(`FOR d IN mycol FILTER d.a == @x AND d.b IN @vals RETURN d`, { x: 123, vals: [1, 2, 3] });

Enable optimizer rules that promote late materialization if previously disabled and verify via EXPLAIN.
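
A sketch for verifying which optimizer rules a plan actually used, and for forcing or disabling a rule by name; rule names such as late-document-materialization vary by ArangoDB version, so cross-check against the rules reported in your own explain output:

// rule name is version-dependent (assumption); the "+" prefix force-enables it for this query
var stmt = db._createStatement({
  query: "FOR d IN mycol FILTER d.a == @x AND d.b IN @vals SORT d.a LIMIT 100 RETURN d",
  bindVars: { x: 123, vals: [1, 2, 3] },
  options: { optimizer: { rules: ["+late-document-materialization"] } }
});
print(stmt.explain().plan.rules);   // lists the rules applied to the chosen plan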

5) Protect Coordinators from Fan-Out Storms

Cap concurrency at the client tier and size Coordinator pools to expected fan-out. Use query timeouts and guardrails:

// limit execution time to 5 seconds
db._query(`FOR d IN mycol FILTER d.ts > @t RETURN d`, { t: 1680000000 }, { maxRuntime: 5 });

Consider splitting read-heavy traffic to read-only Coordinators to isolate write paths.

6) Harden the Agency

Place Agents on dedicated, low-latency SSDs; set RAFT election and heartbeat timeouts appropriate to your network. Watch for leader churn and disk saturation.

curl -s http://AGENT:8529/_admin/metrics | grep agency_leader
iostat -x 1 30

Stability at the control plane reduces failover tail latency during incidents.

7) Fix Network Footguns

Make MTU consistent end-to-end; if in doubt, prefer standard 1500 unless every hop supports jumbo. Ensure SNAT pools are sufficient in NATed deployments and that connection tracking tables are not overflowing.

ip link show | grep mtu
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

8) Observe, Alert, and Correlate

Export and alert on the following metrics; correlate spikes with application incidents:

  • arangodb_replication_applier_tailing_lag_msec per follower
  • rocksdb_num_files_at_level_0, rocksdb_stall_micros
  • arangodb_scheduler_queue_length on Coordinators
  • AQL slow query counts and per-collection write rates
curl -s http://NODE:8529/_admin/metrics | grep -E "arangodb_|rocksdb_"

9) Transaction Scope and TTL Hygiene

Long-running streaming transactions and aggressive TTL purges can collide. Stagger TTL jobs and avoid holding large write transactions open during purge windows.

db._query(`FOR d IN mycol FILTER d.expiry < DATE_NOW() REMOVE d IN mycol OPTIONS { ignoreErrors: true }`, {}, {maxRuntime: 10});
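
Where expiry is stored as a timestamp attribute, a TTL index lets ArangoDB purge expired documents in small background batches instead of large ad-hoc REMOVE runs. A minimal sketch, assuming expiry holds a Unix timestamp in seconds:

// TTL index: documents are removed once "expiry" (Unix timestamp in seconds, assumption) has passed
db.mycol.ensureIndex({ type: "ttl", fields: ["expiry"], expireAfter: 0 });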

10) Capacity Plan for Peaks, Not Averages

Provision memtable, cache, and background jobs for burst traffic. Run controlled load tests with realistic leader/follower placements to validate headroom before peak events.

Pitfalls During Remediation (and How to Avoid Them)

  • Turning off synchronous replication cluster-wide: This might hide symptoms but undermines durability guarantees. Prefer targeted tuning per collection.
  • Over-sharding without rebalancing: Increasing shards without a rebalance leaves hot leaders in place; always follow with a planned rebalance.
  • Oversizing block cache blindly: Starving memtables for cache can worsen stalls. Balance both according to RAM and workload.
  • Ignoring AQL plan shape: Storage and replication fixes cannot compensate for pathological query plans. Fix the plan first for quick wins.
  • Changing many knobs at once: Apply one change at a time and measure, or you'll struggle to attribute improvements.

Hands-On Playbooks

Playbook A: Followers Lagging, Writes Timing Out

  1. Confirm lag: grep arangodb_replication_applier_tailing_lag_msec on followers.
  2. Reduce writeConcern on hot collections (temporarily) and re-evaluate latency.
  3. Increase memtables and background jobs on lagging followers; verify rocksdb_num_files_at_level_0 declines.
  4. Rebalance shards so hot leaders gain more peers and spread load.
  5. Once stable, restore the desired writeConcern and re-test.

Playbook B: 503 Spikes During Analytics Queries

  1. Profile offending AQL queries; ensure appropriate indexes and late materialization.
  2. Introduce maxRuntime and throttle client concurrency.
  3. Scale Coordinators and isolate analytics traffic if necessary.
  4. Monitor arangodb_scheduler_queue_length and adjust accordingly.

Playbook C: RocksDB Stalls on DB-Servers

  1. Track L0 and stall metrics; if growing, increase max-background-jobs and write-buffer-size.
  2. Ensure SSD-backed storage and sufficient IOPS; move WAL and data to fast media.
  3. Reduce compaction debt by temporarily limiting ingestion rate at clients.
  4. Re-check stall counters and latencies; iterate until steady-state is healthy.

AQL Optimization Nuggets That Matter Under Load

Prefer Index-Covered Projections

Retrieve only necessary attributes to keep gathers light:

FOR d IN mycol FILTER d.a==@x RETURN {a:d.a, b:d.b}

Use COLLECT WITH COUNT INTO for Fast Aggregations

FOR d IN mycol FILTER d.type==@t COLLECT WITH COUNT INTO c RETURN c

Avoid Exploding Fan-Out with IN on Large Arrays

Pre-stage the values into a temporary collection, index the joined attribute on the target collection, and join via that index rather than sending huge arrays to IN:

FOR v IN vals FOR d IN mycol FILTER d.k==v.k RETURN d
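
An end-to-end sketch of the pre-staging pattern; the collection names vals and mycol and the key k mirror the snippet above, and the staged values are illustrative:

// stage the lookup values once ("vals", "mycol", and attribute "k" are illustrative)
db._create("vals");
db._query("FOR v IN @values INSERT { k: v } INTO vals", { values: [1, 2, 3] });

// index the joined attribute on the target collection so the inner FILTER becomes an index lookup
db.mycol.ensureIndex({ type: "persistent", fields: ["k"] });
db._query("FOR v IN vals FOR d IN mycol FILTER d.k == v.k RETURN d");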

Operational Guardrails and SLOs

  • Latency SLOs: Track p50/p95 for writes per collection; alert when crossing budget.
  • Replication SLOs: Alert if follower lag exceeds N milliseconds for M minutes.
  • Storage SLOs: Keep rocksdb_num_files_at_level_0 under a steady threshold; keep stall time negligible.
  • Control-plane SLOs: Agency leader stability and heartbeat latencies within tight bounds.

Case Study: SmartGraph Edges Causing Hot Leaders

Symptoms

Edge inserts to a SmartGraph intermittently timed out at p99 during peak. Coordinators showed modest CPU; DB-Servers displayed bursts of RocksDB stalls and replication lag spikes.

Findings

  • Smart attribute skew created two disproportionately hot shard leaders.
  • Followers for those shards lagged under compaction pressure.
  • AQL queries fetched full edges early, inflating scatter/gather costs.

Remediation

  • Re-selected smart attribute to distribute writes more evenly.
  • Increased numberOfShards for the edge collection, executed a rebalance.
  • Tuned RocksDB memtables and background jobs on hot DB-Servers.
  • Refactored AQL to late materialize edge documents after index filtering.

Outcome

Follower lag dropped by 85%, p99 write latency improved from 1.8s to 220ms, and no 503s were observed during the next peak event.

Security and Compliance Considerations

Lowering writeConcern or relaxing timeouts can undermine durability guarantees. Document any temporary changes and set automated tasks to restore intended policies. Use collection-level policies to distinguish between compliance-critical data and high-volume telemetry. Keep audit trails of configuration changes, and ensure backups are consistent with current replication settings.

Conclusion

Intermittent 503s and write stalls in ArangoDB are rarely a single bug; they are emergent behavior at the intersection of synchronous replication, shard leadership, and storage backpressure. The disciplined approach is to measure first—replication lag, RocksDB pressure, Coordinator queues—then apply targeted fixes: right-size writeConcern, rebalance shards, tune RocksDB for your hardware, and optimize AQL plans to minimize fan-out and early materialization. Stabilize the Agency and the network, and treat capacity as a burst budget, not an average. With these practices, you can transform a flaky cluster into a predictable platform that meets enterprise SLOs under sustained load.

FAQs

1. Should I always match writeConcern to replicationFactor?

No. Use a writeConcern that aligns with durability needs per collection. For ultra-high-throughput telemetry, a lower writeConcern can be acceptable, while regulatory data might require full acknowledgments.

2. How do I detect shard hot-spotting quickly?

Inspect _admin/cluster/shardDistribution and per-shard write metrics. Leaders with disproportionate write rates and scheduler queues are red flags; increase shards and rebalance.

3. Can bigger block cache alone fix stalls?

Not by itself. You must balance cache with memtables and compaction threads. Stalls typically reflect compaction debt, which requires parallelism and I/O headroom, not just cache.

4. Are Coordinators a common bottleneck?

They can be when queries scatter to many shards or materialize documents early. Scale Coordinators, cap concurrency, and optimize AQL plans to reduce fan-out.

5. Do Agents affect query latency?

Indirectly. Unstable or slow Agents prolong failovers and supervision actions, which can surface as timeouts during reconfiguration or recovery events. Keep them fast and stable.