Understanding Druid's Core Architecture

Node Types and Responsibilities

Druid consists of multiple specialized nodes:

  • Coordinator: Handles segment management and balancing
  • Overlord: Manages ingestion tasks and task slots
  • Historical: Serves immutable segments for queries
  • MiddleManager: Executes ingestion tasks (stream or batch)
  • Broker: Routes queries to Historical and real-time nodes and merges their partial results
  • Router: Optional process that proxies queries to Brokers and serves the web console

Segment Granularity and Partitioning

Segment definition (intervals, shard specs) directly impacts performance. Poorly defined segment granularity can cause high query latencies or unnecessary data scanning.
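For reference, a granularitySpec fragment like the one below (illustrative values, not a recommendation) aligns daily segments with hour-level query granularity and keeps rollup enabled:

{
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "HOUR",
    "rollup": true
  }
}

Matching segmentGranularity to the intervals your dashboards actually query keeps the number of segments touched per query, and therefore scan fan-out, predictable.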

Common Issues in Production Druid Deployments

1. Segment Imbalance

Overloaded Historical nodes while others remain underutilized indicates poor segment assignment. This results in query hotspots and degraded parallelism.
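One quick way to quantify imbalance, assuming Druid SQL is enabled on the Broker (host and port below are placeholders), is to count segments per server through the sys tables:

curl -X POST http://BROKER:8082/druid/v2/sql \
  -H 'Content-Type: application/json' \
  -d '{"query": "SELECT server, COUNT(*) AS segment_count FROM sys.server_segments GROUP BY server ORDER BY segment_count DESC"}'

A large spread between the top and bottom servers is a strong hint that balancing or capacity needs attention.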

2. High Query Latency

Latency spikes may occur due to:

  • Long-running queries from external BI tools
  • Missing indexes or poor filtering specs
  • JVM GC pauses in Historical or Broker nodes

3. Ingestion Delays or Failures

Stream ingestion may lag due to:

  • Insufficient MiddleManager slots
  • Improper tuning configs (e.g., heap size, task partitions)
  • Kafka topic lag accumulating over time

4. Memory and GC Pressure

Historical or Broker nodes with inadequate heap tuning can suffer from GC thrashing, leading to timeouts and dropped queries.
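Before retuning anything, it helps to confirm GC is actually the culprit by enabling GC logging. The flags below are a sketch for JDK 11+ (the log path is a placeholder; on JDK 8 use -verbose:gc with -XX:+PrintGCDetails instead):

-Xlog:gc*:file=/var/log/druid/historical-gc.log:time,uptime:filecount=5,filesize=20m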

Diagnostics and Tools

Using the Druid Console

Monitor:

  • Segment distribution under Coordinator → Servers
  • Query latencies under Broker metrics
  • Ingestion task backlog under Overlord
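If you prefer scripted checks, the same signals are available over HTTP via standard Coordinator and Overlord APIs (host names and ports below assume default configuration):

curl http://COORDINATOR:8081/druid/coordinator/v1/loadstatus
curl http://COORDINATOR:8081/druid/coordinator/v1/servers?simple
curl http://OVERLORD:8090/druid/indexer/v1/pendingTasks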

JMX and System Metrics

Enable a JMX exporter for JVM-level stats and a Druid metrics emitter (for example the statsd or prometheus emitter extensions) for query and segment metrics:

  • JVM heap usage and GC time
  • Query count and latency (query/count, query/time on the Broker)
  • Segment scan pressure (segment/scan/pending, query/segment/time on Historicals)
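If JMX is not already exposed on a process, it can be enabled with standard JVM flags; the port below is arbitrary, and authentication should not be disabled outside a trusted network:

-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9010
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false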

Checking Kafka Lag

Use Kafka consumer group commands:

kafka-consumer-groups.sh --bootstrap-server kafka:9092 --describe --group druid-ingestion
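Druid's own view of consumer lag is also exposed through the Overlord's supervisor API; the supervisor ID below is a placeholder, and for Kafka supervisors the status payload includes per-partition lag:

curl http://OVERLORD:8090/druid/indexer/v1/supervisor/my-datasource/status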

Step-by-Step Fixes

1. Rebalancing Segments

  • Check that the Coordinator balancer is active and that maxSegmentsToMove in the Coordinator dynamic config is greater than zero, for example:
{"maxSegmentsToMove": 100, "killDataSourceWhitelist": []}

The Coordinator rebalances automatically on each run cycle (druid.coordinator.period, default PT60S); there is no manual rebalance endpoint. To watch pending segment moves, inspect the load queue:

curl http://COORDINATOR:8081/druid/coordinator/v1/loadqueue
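To make the balancer move segments more aggressively, maxSegmentsToMove can be raised through the dynamic config API without a restart. The value below is illustrative, and since the POST generally replaces the whole dynamic config, it is safest to GET the current document, edit it, and resubmit:

curl http://COORDINATOR:8081/druid/coordinator/v1/config
curl -X POST http://COORDINATOR:8081/druid/coordinator/v1/config \
  -H 'Content-Type: application/json' \
  -d '{"maxSegmentsToMove": 100}'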

2. Reducing Query Latency

  • Apply selective filters early in query specs to reduce scanned rows (see the example query after this list)
  • Keep bitmap indexes enabled on frequently filtered string dimensions (they are on by default)
  • Adjust druid.broker.cache.* settings to use result caching effectively
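For example, a native timeseries query that filters early and opts into the cache might look like this sketch (datasource, dimension, and interval are placeholders):

{
  "queryType": "timeseries",
  "dataSource": "clickstream",
  "intervals": ["2024-01-01/2024-01-02"],
  "granularity": "hour",
  "filter": {"type": "selector", "dimension": "country", "value": "US"},
  "aggregations": [{"type": "count", "name": "rows"}],
  "context": {"useCache": true, "populateCache": true}
}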

3. Resolving Ingestion Bottlenecks

Review tuningConfig parameters:

{
  "type": "kafka",
  "tuningConfig": {
    "maxRowsPerSegment": 5000000,
    "maxPendingPersists": 0,
    "maxRowsInMemory": 100000,
    "maxTotalRows": 20000000
  }
}

Ensure taskCount and replicas are sufficient for Kafka partitions.
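As a rough sizing sketch (topic name and counts are placeholders), the supervisor's ioConfig should give taskCount enough parallelism to cover the topic's partitions, remembering that worker slot demand is taskCount × replicas:

{
  "ioConfig": {
    "topic": "clickstream-events",
    "consumerProperties": {"bootstrap.servers": "kafka:9092"},
    "taskCount": 4,
    "replicas": 2,
    "taskDuration": "PT1H"
  }
}

With 8 Kafka partitions, for instance, a taskCount of 4 assigns 2 partitions per task, and each replica reads the same partitions again for redundancy.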

4. Optimizing JVM for Historical Nodes

-Xmx16g -Xms16g -XX:+UseG1GC -XX:MaxGCPauseMillis=200

Rather than explicit pinning, Druid serves memory-mapped segments from the OS page cache: size druid.segmentCache.locations to hold your segment set, leave enough free RAM outside the heap for the page cache, and tune druid.processing.buffer.sizeBytes and druid.processing.numThreads for query processing.
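A Historical runtime.properties fragment along those lines might look like the following; the sizes are placeholders to be derived from your hardware, not recommendations:

druid.segmentCache.locations=[{"path":"/var/druid/segment-cache","maxSize":300000000000}]
druid.server.maxSize=300000000000
druid.processing.buffer.sizeBytes=500000000
druid.processing.numThreads=15
druid.processing.numMergeBuffers=4

Direct memory must cover roughly (numThreads + numMergeBuffers + 1) × buffer.sizeBytes, and whatever RAM remains after heap and direct memory becomes the page cache that serves memory-mapped segments.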

Best Practices for Long-Term Stability

  • Align segment granularity with query intervals
  • Use auto-compaction to consolidate small segments
  • Regularly audit segment distribution and disk usage
  • Monitor Kafka ingestion lag and scale partitions appropriately
  • Isolate long-running queries using priority tiers or throttling
  • Enable Broker-level query caching for repeated dashboard use

Conclusion

Operating Druid at scale requires deep visibility into its ingestion pipelines, memory profiles, and query patterns. While Druid provides powerful real-time analytics, its distributed nature means that even subtle misconfigurations can have ripple effects. With robust monitoring, segment-aware planning, and proper JVM tuning, enterprises can ensure consistent performance, minimal query latency, and stable ingestion throughput.

FAQs

1. Why are segments not evenly distributed?

The Coordinator balancer may be disabled, maxSegmentsToMove may be set too low, or Historical nodes may lack disk capacity. Check the Coordinator dynamic config and per-node disk usage.

2. How can I speed up ingestion in Kafka indexing tasks?

Increase taskCount and maxRowsInMemory, and ensure that your Kafka partition count is sufficient for parallelism.

3. What causes random query timeouts?

Likely JVM GC pauses or an overloaded Broker. Tune heap settings and monitor query concurrency under load.

4. Can I compact segments automatically?

Yes, enable auto-compaction under Coordinator settings and define compaction intervals and thresholds.
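A minimal per-datasource entry can be submitted to the Coordinator's compaction config API; the datasource name below is a placeholder and the exact fields accepted vary somewhat by Druid version:

curl -X POST http://COORDINATOR:8081/druid/coordinator/v1/config/compaction \
  -H 'Content-Type: application/json' \
  -d '{"dataSource": "clickstream", "skipOffsetFromLatest": "P1D"}'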

5. Should I use G1GC or CMS for Druid?

G1GC is generally preferred for better pause-time management, and CMS is deprecated and removed in recent JDKs; tune according to your heap size and GC logs.