Background: How ScyllaDB's Architecture Shapes Failure Modes

Shard-per-Core and Seastar

ScyllaDB runs one reactor (shard) per CPU core, avoiding cross-core locking and shared-state contention. Each shard owns a slice of the token range along with its own I/O queues, memtables, and cache. This design eliminates global contention, but it makes resource imbalance across shards immediately visible as tail latency: if traffic or data skew concentrates on a subset of tokens, a few shards saturate while others sit idle, raising p99 latency even when average CPU looks fine.

I/O Scheduler and Backpressure

ScyllaDB's I/O scheduler prioritizes foreground reads/writes over background compaction and streaming, but mis-tuned disks or noisy neighbors can still starve critical operations. Over-aggressive concurrency in drivers or batch workloads can outpace per-shard admission, leading to timeouts and queue buildup.
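
If you suspect the scheduler is working from stale or mis-measured disk numbers, it helps to review what scylla_io_setup recorded. A minimal sketch, assuming a standard package install where the tuning output lands in /etc/scylla.d (paths may differ on your deployment):

# Review the I/O properties and CPU set Scylla was tuned with
cat /etc/scylla.d/io_properties.yaml   # measured IOPS/bandwidth per mountpoint
cat /etc/scylla.d/cpuset.conf          # cores reserved for Scylla shards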

Compatibility Layer

Because ScyllaDB is protocol- and SSTable-compatible with Cassandra, most Cassandra data models work. However, differences in implementation—shard awareness, prepared statement routing, repair pipeline, and compaction concurrency—change how issues manifest and how you fix them.

High-Impact Symptoms at Scale

Symptom A: p99/p999 Latency Spikes Under Load

Often correlated with shard imbalance, tombstone-heavy reads, compaction backlog, or disk throttling. Spikes may appear only during repair or when TTL expirations create tombstone churn.

Symptom B: Write Timeouts Despite Low Average CPU

Per-shard hotspots or unbalanced token assignments can overload a subset of shards. Overly strict consistency levels and small timeouts amplify the effect.

Symptom C: Read Performance Degrades Over Weeks

Compaction debt accumulates, caches fragment, and tombstones pile up. Queries progressively read more SSTables, increasing read amplification.

Symptom D: Repair and Rebalance Take Too Long

Streaming bandwidth caps, under-provisioned network paths, oversized shards, or large partitions slow down repair. Balancer moves or topology changes can extend operational risk windows.

Symptom E: Cross-DC Consistency Surprises

Mismatched LOCAL_QUORUM vs QUORUM, mixed driver policies, or uneven RF across DCs cause anomalous read-after-write behavior under failure.

Diagnostic Playbook

Step 1: Establish Ground Truth

Collect the following before any changes:

  • Cluster topology (DCs, racks, nodes, tokens per node, shard count).
  • Keyspaces, RF, table compaction strategies, bloom filter settings.
  • Workload profile: request rates, CLs, partitions touched, payload sizes.
  • Scylla version, kernel, filesystem, disk type/RAID, NUMA policy.

Export dashboards from Scylla Monitoring Stack (Prometheus/Grafana) to capture CPU per shard, latency histograms, compaction backlog, cache hit rate, and disk latency. Make this your baseline for comparison.
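
Most of this inventory can be snapshotted from a single node with a handful of read-only commands. A minimal sketch (the 'prod' keyspace is a placeholder):

# Capture a read-only inventory to attach to the baseline
nodetool status                      # DCs, racks, load, ownership per node
nodetool describecluster             # cluster name, snitch, schema agreement
nodetool tablestats prod             # per-table SSTable counts, sizes, latencies
scylla --version; uname -r           # server and kernel versions
df -Th /var/lib/scylla               # filesystem type and capacity of the data path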

Step 2: Confirm Shard Balance

Skew concentrates heat on specific shards. Inspect per-shard metrics: CPU, queue length, and cache hit rate. On each node, verify CPU pinning and IRQ distribution to avoid cross-NUMA penalties.

# Check shards and NUMA pinning
ps -eLo pid,cmd | grep scylla
lscpu | grep -E "NUMA|CPU\(s\)"

Use nodetool cfstats and per-table metrics to check read/write distribution. If a small number of partitions dominate, your model or driver routing may be the culprit.
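
To confirm whether a handful of keys dominate, sample the hottest partitions directly on a suspect node. A sketch, assuming a recent Scylla release whose nodetool supports toppartitions (older releases do not; fall back to tracing and per-table metrics) and placeholder keyspace/table names:

# Sample the most frequently accessed partitions for 5 seconds (keyspace, table, duration in ms)
nodetool toppartitions prod events 5000

If the same few keys appear at the top across samples, fix the data model or client routing rather than the server.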

Step 3: Identify Tombstone and Large Partition Risks

Reads that scan many tombstones or oversized partitions are latency landmines. Use CQL tracing and system tables to quantify:

// Enable tracing for a representative query (in cqlsh, substitute literal values for the ? placeholders)
CONSISTENCY LOCAL_QUORUM;
TRACING ON;
SELECT * FROM events WHERE account_id = ? AND day = ? LIMIT 1000;
TRACING OFF;

// Inspect system_traces.events for tombstone and SSTable-read activity
SELECT activity, source, source_elapsed FROM system_traces.events
 WHERE session_id = <trace_id>;

Step 4: Measure Compaction Debt and Read Amplification

Backlog is visible as many SSTables per read and high pending compactions. Read latency tracks the number of SSTables touched.

# Compaction and sstable stats
nodetool compactionstats
nodetool tablehistograms ks table
nodetool cfstats ks.table | grep -E "SSTable count|Space used"

Step 5: Check I/O Health and Scheduler Pressure

Verify that disks deliver expected throughput and latency under the Scylla I/O scheduler. Compare iostat -x with Scylla metrics (reactor I/O queue delays). Ensure XFS and RAID settings conform to Scylla guidelines.

# Quick disk sanity (run during load)
iostat -x 5
lsblk -o NAME,SIZE,ROTA,TYPE,MOUNTPOINT
cat /sys/block/<dev>/queue/scheduler

Step 6: Validate Driver and Client Behavior

Ensure shard-aware drivers, prepared statements with proper partition keys, idempotency flags for retries, and sane concurrency limits. Non-shard-aware drivers induce cross-shard hops, increasing tail latency.
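
One cheap server-side check is whether clients actually connect to the shard-aware native port (19042 by default) rather than only to 9042. A sketch using ss on a database node:

# Count established client connections per native port
ss -tn state established '( sport = :19042 )' | tail -n +2 | wc -l   # shard-aware connections
ss -tn state established '( sport = :9042 )'  | tail -n +2 | wc -l   # legacy/non-shard-aware connections

Shard-aware drivers typically fall back to 9042 if 19042 is unreachable, so an empty first count usually points at a firewall rule or an outdated driver.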

Step 7: Inspect Repair/Streaming State

Long repairs indicate bandwidth caps, large partitions, or many small SSTables. If you use Scylla Manager, review job history and bandwidth limits.

# Scylla Manager CLI examples (subcommands and flags vary by Manager version; check sctool --help)
sctool status -c <cluster>
sctool tasks -c <cluster>                       # "sctool task list" on older Manager releases
sctool progress -c <cluster> <repair-task-id>   # "sctool task progress" on older releases

Common Root Causes and How to Prove Them

Root Cause 1: Token and Shard Imbalance

Cause: Legacy tokens, uneven vnode distribution, or new nodes added with mismatched tokens. Evidence: One or more shards show higher CPU and queue lengths; nodetool ring reveals uneven ownership.

# Inspect ownership
nodetool ring

# Scylla distributes ownership with vnodes; verify num_tokens is consistent across nodes
# Re-validate ownership after scale-out, once compaction has settled

Root Cause 2: Tombstone Storms

Cause: Heavy use of TTL with frequent updates, or deletes across wide partitions. Evidence: Traces show high tombstone-read counts; tablehistograms show long latency tails; compaction backlog grows.

Root Cause 3: Compaction Debt from Wrong Strategy

Cause: Using STCS for time-series data, or LCS on very large objects with poor locality. Evidence: Many SSTables per read; slow compactions; disk space ballooning.

Root Cause 4: Driver Over-Concurrency and Missing Shard Awareness

Cause: High async concurrency with unlimited in-flight requests, round-robin routing, no token-aware shard routing. Evidence: High server-side queueing; CPU low but p99 high; client sees frequent timeouts.

Root Cause 5: Large Partitions and Hot Partitions

Cause: Poor partition key design or skewed keys. Evidence: nodetool cfstats shows large max partition size; tracing highlights frequent access to a single partition key; hotspot shards align with those keys.
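
To gather that evidence, combine per-table stats with Scylla's built-in large-partition log. A sketch with placeholder names (the system.large_partitions columns may differ slightly across versions):

# Maximum/average partition sizes for the suspect table
nodetool tablestats prod.events | grep -i partition
nodetool tablehistograms prod events
# Partitions that exceeded the large-partition warning threshold during compaction
cqlsh -e "SELECT keyspace_name, table_name, partition_size FROM system.large_partitions LIMIT 20;"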

Root Cause 6: Repair and Streaming Bottlenecks

Cause: Insufficient network, throttled streaming, or concurrent compaction. Evidence: Repair speed flatlines; streaming metrics capped; compactionstats shows competing tasks.

Root Cause 7: Cross-DC Consistency Misuse

Cause: Mixing QUORUM and LOCAL_QUORUM, or uneven RF. Evidence: Read-after-write anomalies under DC failure; coordinator logs show different CLs across services.

Step-by-Step Fixes

1) Make the Driver Shard-Aware and Back-Pressure Friendly

Use a Scylla-optimized driver (Java, Go, Python) with shard awareness enabled. Cap in-flight requests per connection, and use token-aware routing. Enforce sane timeouts and retries only for idempotent statements.

// Java driver 3.x (Scylla fork of the DataStax driver); verify class names against your driver version
import com.datastax.driver.core.*;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

PoolingOptions pooling = new PoolingOptions()
  .setConnectionsPerHost(HostDistance.LOCAL, 1, 1)
  .setMaxRequestsPerConnection(HostDistance.LOCAL, 1024);  // cap in-flight requests per connection
QueryOptions q = new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
SocketOptions s = new SocketOptions().setReadTimeoutMillis(2000);
Cluster cluster = Cluster.builder()
  .addContactPoint("db.local")
  .withLoadBalancingPolicy(new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))  // token-aware routing
  .withQueryOptions(q)
  .withSocketOptions(s)
  .withPoolingOptions(pooling)
  .build();

Ensure prepared statements include the full partition key so the coordinator routes to the owning shard.

2) Fix Data Model Hotspots and Partition Size

Redesign keys to bound partition size and spread writes. For time-series, use bucketing to cap per-partition rows. Avoid monotonically increasing keys without a bucketing component.

// Before: hot partition by device_id only
CREATE TABLE metrics_raw (
  device_id text, ts timestamp, v double,
  PRIMARY KEY (device_id, ts)
) WITH CLUSTERING ORDER BY (ts DESC);

// After: daily buckets distribute writes, cap partition size
CREATE TABLE metrics_raw (
  device_id text, day date, ts timestamp, v double,
  PRIMARY KEY ((device_id, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

3) Choose the Right Compaction Strategy

Pick strategies by access pattern:

  • LCS: Read-heavy workload with high cache hit expectations and small SSTables; better for low read amplification.
  • TWCS: Time-series with TTL and strict time windows; minimizes overlap and tombstone reads.
  • STCS: Write-heavy, large rows; acceptable when read amplification is less critical.

// Switch a table to TWCS for time-bucketed data
ALTER TABLE metrics_raw WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'DAYS',
  'compaction_window_size': '1'
} AND default_time_to_live = 604800;

After changes, allow compaction to settle before interpreting new performance. Monitor 'SSTables per read' and disk space.

4) Tame Tombstones

Prefer TTLs with TWCS over mass deletes; avoid updating expired rows repeatedly. Set gc_grace_seconds based on failure model and repair frequency—too high delays tombstone purges; too low risks resurrection after failures.

// Calibrate gc_grace_seconds with frequent repairs
ALTER TABLE metrics_raw WITH gc_grace_seconds = 86400;

Use tracing to confirm tombstone counts drop after compaction cycles.

5) Reduce Read Amplification

Ensure clustering columns match query order; avoid ALLOW FILTERING. Add narrow materialized views or secondary indexes only when strictly necessary and measurable.

// Denormalize the common per-day query to avoid ALLOW FILTERING
// Note: a day-only partition key can itself grow large; add a bucket component if device counts are high
CREATE TABLE metrics_by_day (
  day date, device_id text, ts timestamp, v double,
  PRIMARY KEY ((day), device_id, ts)
) WITH CLUSTERING ORDER BY (device_id ASC, ts DESC);

6) Right-Size I/O and Storage

Scylla benefits from fast NVMe, XFS, and proper RAID0 over multiple devices. Disable swap for determinism. Isolate database disks from unrelated workloads. Confirm fio baselines meet SLA.

# Example fio baseline (use a dedicated test file, not live data files)
fio --name=randrw --filename=/var/lib/scylla/data/testfile --size=10g --direct=1 \
 --ioengine=libaio --bs=4k --iodepth=64 --rw=randrw --rwmixread=70 --numjobs=4 \
 --time_based --runtime=60 --group_reporting
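
Alongside the fio baseline, confirm the swap and filesystem assumptions mentioned above; a minimal sketch:

# Swap should be disabled on database nodes
swapon --show                      # no output means swap is off
# The data path should be XFS with the expected mount options
findmnt -T /var/lib/scylla -o TARGET,FSTYPE,OPTIONS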

7) Network and Cross-DC Hygiene

Pin traffic classes (client 9042, inter-node 7000/7001) to reliable links. Ensure MTU consistency. For multi-DC, prefer LOCAL_QUORUM for both reads and writes to minimize inter-DC latency dependencies; align RF per DC.

// Align keyspace RF across DCs
ALTER KEYSPACE prod WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'DC1': '3',
  'DC2': '3'
};
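
MTU mismatches between DCs are easy to miss. A quick sketch to compare interface MTUs and verify the path end to end (10.0.0.12 stands in for a peer node, and 8972 assumes 9000-byte jumbo frames):

# Interface MTUs on this node
ip link show | grep -E "mtu [0-9]+"
# Path MTU check: 8972 bytes of payload + 28 bytes of ICMP/IP headers = 9000
ping -M do -s 8972 -c 3 10.0.0.12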

8) Make Repair Predictable

Use Scylla Manager's scheduled repairs, tuned with intensity and parallelism limits (Scylla does not implement Cassandra-style incremental repair). Avoid running repair during peak business hours, and coordinate with compaction to prevent resource conflicts.

# Schedule a weekly repair off-peak (flags vary by Manager version; check sctool repair --help)
sctool repair -c <cluster> --keyspace prod --interval 7d --start-date "2025-01-05T02:00:00Z" \
  --dc DC1,DC2 --parallel 1 --intensity 0.5 --small-table-threshold 1GiB

9) Tune Timeouts, Concurrency, and Admission

Increase server and client timeouts only after removing root causes. Where your Scylla version still honors them, adjust concurrent_reads and concurrent_writes conservatively, watching queue depth and p99. Rely on admission control to prevent background work from starving foreground traffic.

# scylla.yaml snippets (illustrative; validate keys against your Scylla version)
# concurrent_reads/concurrent_writes are Cassandra-era knobs; recent Scylla releases size
# per-shard concurrency automatically and may ignore them
concurrent_reads: 64
concurrent_writes: 128
compaction_enforce_min_threshold: true
# For TLS on internode and streaming traffic, see server_encryption_options

10) Enforce Operational Guardrails

Set per-tenant rate limits via proxies or client-side quotas. Require load tests with production-like data and token distributions before capacity changes.
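
For the load-test requirement, cassandra-stress (shipped with Scylla) is usually enough to approximate production distributions. A sketch with a placeholder node address; tune n, threads, and the population model to match your workload:

# Write-only, then mixed read/write at LOCAL_QUORUM against a test keyspace
cassandra-stress write n=1000000 cl=LOCAL_QUORUM -rate threads=200 -node 10.0.0.11
cassandra-stress mixed ratio\(write=1,read=3\) n=1000000 cl=LOCAL_QUORUM -rate threads=200 -node 10.0.0.11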

Pitfalls to Avoid

  • Relying on average CPU; tail latency is governed by the hottest shard.
  • Using STCS on TTL-heavy time series and then blaming tombstones.
  • Designing partitions with unbounded cardinality or growth (e.g., a global partition).
  • Mixing CL across services (some using QUORUM, others LOCAL_QUORUM), producing hard-to-debug anomalies.
  • Running repair, major compaction, and schema migrations concurrently.
  • Ignoring NUMA boundaries; cross-NUMA memory access erodes Scylla's single-NUMA-shard performance assumptions.
  • Letting drivers auto-reconnect storm a recovering node with unlimited concurrency.

Deep Dives

Tombstones, gc_grace_seconds, and Resurrection Risk

Tombstones mark deletions. They must be replicated and retained for at least gc_grace_seconds to protect against missed replicas. If your operations include frequent node outages or late repairs, keep gc_grace_seconds high enough to cover the worst-case detection window. If you repair reliably (e.g., weekly) and have stable nodes, you can reduce it to accelerate purging. Validate with failure drills; never guess.
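
As a worked example of that sizing, suppose repairs run weekly and you allow one extra day of slack for a missed or late run; the floor for gc_grace_seconds is then:

# 7-day repair interval + 1 day of slack, in seconds
echo $(( (7 + 1) * 24 * 3600 ))   # 691200; gc_grace_seconds should not be lower than this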

Compaction Strategy Trade-offs

LCS minimizes read amplification but increases compaction write amplification; ensure enough disk I/O. TWCS organizes SSTables by time windows, making expired data cheap to drop but potentially increasing read amplification for cross-window scans. STCS is simplest but can produce many overlapping SSTables. Pick one per table based on access paths; there is no one-size-fits-all.

Large Partitions: Why They Hurt and How to Find Them

Queries against large partitions suffer long reads, heavy memory pressure on the owning shard, and cache churn. nodetool cfstats exposes maximum and average partition sizes. The fix is a data model change (bucketing, salting, or additional partition keys). Avoid band-aids like raising timeouts.

Shard-Aware Routing

Shard-aware drivers compute the owning shard from the partition token and keep a connection per shard, sending each request directly to that shard. This eliminates cross-shard handoffs inside a node. If shard awareness is off, the coordinator shard forwards requests to the owning shard, wasting CPU cycles and increasing tail latency.

Streaming Mechanics

Repair and bootstrap rely on SSTable streaming. Throttling protects foreground traffic but can make maintenance windows slip. Size shards consistently and provision predictable bandwidth. If streaming stalls, check for throttling, TLS overhead, and disk contention. Consider temporarily increasing streaming concurrency during planned maintenance with guarded windows.
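
When a repair or bootstrap looks stuck, the quickest signal is whether bytes are still moving; a sketch using standard nodetool commands on the affected node:

# Active streams and per-peer progress
nodetool netstats
# Check whether compaction is competing for the same disk budget
nodetool compactionstats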

Observability and Instrumentation

Dashboards that Matter

  • Latency histograms by read/write and by coordinator vs replica.
  • CPU and run-queue length per shard.
  • Cache hit ratio, row cache (if enabled), and SSTables per read.
  • Compaction throughput, pending tasks, and disk utilization.
  • Repair progress, streaming throughput, dropped mutations.
  • Per-table metrics: tombstones read, partition size distribution.

Logging and Tracing Discipline

Enable CQL tracing for exemplar slow queries, not blanket tracing. Retain logs across nodes in a centralized system. Correlate coordinator timeouts with replica-side saturation to avoid misattributing failures to clients.
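
For the correlation step, pull coordinator-side errors from the server log and place them next to replica-side saturation metrics. A sketch for systemd deployments (the grep patterns are illustrative, not an exhaustive list of Scylla messages):

# Recent timeout/overload messages from the Scylla server log
journalctl -u scylla-server --since "1 hour ago" \
  | grep -iE "timed out|overload|semaphore" | tail -n 50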

Operational Runbooks

Runbook A: Sudden p99 Spike During Peak

  1. Check shard CPU dispersion and queue depth: if a few shards are hot, identify the partition keys hitting them (enable request logging or sample tracing).
  2. Verify compaction backlog; if high, temporarily increase compaction throughput while monitoring foreground latency.
  3. Confirm driver concurrency limits and shard-awareness. Reduce in-flight if queues remain elevated.
  4. Inspect tombstone metrics for the accessed tables; if high, prioritize TWCS or model changes.

Runbook B: Write Timeouts with Low Average CPU

  1. Compare per-shard CPU/queues; hot shards indicate skew.
  2. Check coordinator vs replica latency; high coordinator time suggests queuing before dispatch.
  3. Reduce CL where safe (e.g., from QUORUM to LOCAL_QUORUM) and ensure RF supports it.
  4. Audit token ownership and consider rebalancing if imbalance is structural.

Runbook C: Repair Never Finishes

  1. Verify sctool repair bandwidth and parallelism; raise intensity off-peak.
  2. Ensure no major compactions or schema changes run concurrently.
  3. Check network paths for bottlenecks (MTU, QoS, firewall shaping).
  4. Identify large partitions; split them before retrying repair.

Runbook D: Latency Creep Over Weeks

  1. Trend SSTables per read and pending compactions; switch compaction strategy if mismatched.
  2. Prune or reduce TTL churn; adjust gc_grace_seconds after verifying repair cadence.
  3. Rebuild caches by warming key workloads after maintenance windows (optional).
  4. Audit schema for queries that perform range scans across many SSTables.

Configuration Examples

scylla.yaml Highlights (Illustrative)

# scylla.yaml key settings (illustrative; validate against your version)
# I/O concurrency is normally derived by scylla_io_setup (io_properties.yaml) rather than set by hand
concurrent_reads: 64
concurrent_writes: 128
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
enable_sstable_data_integrity_check: true
enable_cache: true
# Networking
listen_address: 10.0.0.11
rpc_address: 10.0.0.11
native_transport_port: 9042
# Snitch and topology
endpoint_snitch: GossipingPropertyFileSnitch
# Compaction global throttle (use with care)
compaction_static_shares: 200

Table Tuning

// Example table tuned for time-series with TTL and TWCS
CREATE TABLE ts.sensor (
  sensor_id text, bucket date, ts timestamp, v double,
  PRIMARY KEY ((sensor_id, bucket), ts)
) WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'HOURS',
  'compaction_window_size': '6'
} AND default_time_to_live = 2592000
  AND bloom_filter_fp_chance = 0.01
  AND gc_grace_seconds = 86400;

Client-Side Resiliency

// Pseudocode for bounded concurrency and idempotent retries
pool = new SessionPool(maxInFlight=2048, perConnection=512)   // bound total and per-connection in-flight requests
stmt = session.prepare("INSERT INTO ts.sensor (sensor_id, bucket, ts, v) VALUES (?, ?, ?, ?)")
stmt.setIdempotent(true)   // a full-row INSERT is safe to retry; LWT (IF NOT EXISTS) retries need extra care
retryPolicy = ExponentialBackoff(maxRetries=3, base=20ms, jitter=true)
timeout = 2s

Best Practices for Long-Term Stability

  • Model to bound partition sizes; add time buckets for time-series workloads.
  • Use shard-aware, token-aware drivers; limit in-flight per connection.
  • Pick compaction per table based on access pattern (LCS, TWCS, STCS); measure after changes.
  • Schedule regular, off-peak repairs with Scylla Manager; avoid overlapping heavy tasks.
  • Keep gc_grace_seconds aligned with repair cadence and failure assumptions.
  • Standardize hardware: NVMe + XFS, no swap, NUMA-aware CPU pinning.
  • Use LOCAL_QUORUM consistently across services in multi-DC deployments.
  • Instrument everything: per-shard metrics, SSTables per read, tombstone rates.
  • Load test with realistic data distributions and token ownership before capacity changes.
  • Automate schema reviews to flag risky patterns (ALLOW FILTERING, unbounded clustering).

Conclusion

ScyllaDB's performance stems from a shard-per-core architecture that rewards balanced, predictable workloads. The flip side is that small skews or strategy mismatches can inflate tail latency or extend maintenance operations. The most effective fixes are architectural: shard-aware routing, partitioning that caps growth, compaction strategies matched to access patterns, and disciplined repairs. With the diagnostics and runbooks in this guide—plus strong observability—you can turn chronic p99 spikes, compaction debt, and streaming slowdowns into predictable, well-governed operations. Treat schemas, drivers, and operations as a single system, and ScyllaDB will consistently deliver the low-latency, high-throughput behavior it is known for.

FAQs

1. How do I know if drivers are shard-aware and actually routing correctly?

Enable driver logs to confirm shard-id calculation and inspect server-side metrics: cross-shard forward counts should be minimal. Compare latency before and after enabling shard-awareness; the p99 improvement is often immediate.

2. When should I pick TWCS over LCS?

Choose TWCS for strictly time-bucketed, TTL-heavy data accessed mostly by recent windows. Prefer LCS when you need consistently low read amplification across broad time ranges and can afford higher compaction write costs.

3. What's a safe lower bound for gc_grace_seconds?

Base it on your worst-case repair interval plus failure-detection lag, and keep it above the time it takes to repair the entire ring. With reliable daily repairs, 24–48 hours may be reasonable; with weekly repairs, keep it above a week, and if outages or delayed repairs are common, keep it longer still to prevent resurrection.

4. How do I eliminate hot partitions without a breaking schema change?

Add a salting or bucketing component to the partition key and dual-write during a migration period. Backfill asynchronously, then flip reads to the new table or view. Validate with tracing that the hot shards cool down.

5. Why does repair throttle even when my cluster looks idle?

Scylla protects foreground traffic and respects configured streaming limits. Increase repair intensity and bandwidth only during controlled windows, and ensure compaction isn't contending for the same I/O budget.