Understanding GoCD Architecture
Server-Agent Model
GoCD uses a central server that orchestrates builds and distributed agents that execute jobs. Each agent has a set of resources that determine job eligibility. Misalignment between job requirements and available agents can stall pipeline execution.
Pipeline as Code and Material Tracking
GoCD supports configuration via YAML/JSON and tightly tracks materials (e.g., Git, Mercurial) across pipeline dependencies. Material polling failures or cache corruption often lead to pipelines not triggering correctly.
Common Symptoms
- Build pipelines stuck in "waiting for agent" indefinitely
- Pipeline stages failing to fetch artifacts from upstream jobs
- Changes not triggering pipeline runs despite material updates
- Job logs showing hung processes or unresponsive commands
- Slow dashboard performance or server-side memory spikes
Root Causes
1. Agent Resource Mismatch
A job configured with specific resources runs only on an agent that has every one of those resources; if no such agent is available, the job won’t run. This is common in fan-out pipelines with tight concurrency settings.
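The matching rule above can be sketched as a subset check. This is a minimal illustration, not GoCD's scheduler: the function name and the agent inventory are hypothetical, and environments and elastic agents are ignored.

```python
def eligible_agents(job_resources, agents):
    """Return names of agents whose resources cover everything the job requires."""
    required = set(job_resources)
    return [
        name
        for name, resources in agents.items()
        if required <= set(resources)  # job resources must be a subset of agent resources
    ]


# Hypothetical inventory: agent name -> resource tags.
agents = {
    "agent-1": ["linux", "docker"],
    "agent-2": ["linux"],
    "agent-3": ["windows", "msbuild"],
}

print(eligible_agents(["docker", "linux"], agents))  # ['agent-1']
print(eligible_agents(["linux", "gpu"], agents))     # [] -> the job would keep waiting
```

An empty result for a job's resource list is the situation that shows up in the UI as a job waiting for an agent.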
2. Material Polling or Checksum Corruption
GoCD caches material revisions to detect changes. If the material folder is corrupted or out of sync, pipelines may never trigger even when commits exist.
3. Stuck Jobs Due to Detached or Full Agents
Agents that are disconnected or overwhelmed (e.g., full disk, zombie processes) may appear idle but fail to execute jobs properly, leaving them in a queued state.
4. Artifact Fetch Failures
Incorrect fetchartifact paths or deleted upstream artifacts result in 404 errors during stage execution. This can break downstream jobs unexpectedly.
5. Inefficient Fan-in/Fan-out Pipeline Design
Overuse of dependencies without consolidation stages leads to long execution chains and unnecessary waits due to strict dependency resolution rules.
Diagnostics and Monitoring
1. Review Agent Status
Use the Agents tab to monitor registration, disk space, and heartbeat. Look for agents marked as "lost contact" or with high build queue time.
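The same check can be scripted against the JSON returned by the agents API. The sketch below works on an already-parsed payload; the field names (`_embedded.agents`, `agent_state`, `free_space`, `hostname`) and the state values follow the API response shape as I understand it and should be verified against the API version of your server. The disk threshold is an arbitrary assumption.

```python
import json

# States treated as unhealthy here; adjust to what your server reports.
UNHEALTHY_STATES = {"LostContact", "Missing", "Unknown"}


def flag_agents(payload, min_free_bytes=5 * 1024**3):
    """Return (hostname, reasons) for agents that look unhealthy."""
    flagged = []
    for agent in payload.get("_embedded", {}).get("agents", []):
        reasons = []
        if agent.get("agent_state") in UNHEALTHY_STATES:
            reasons.append(f"state={agent['agent_state']}")
        free = agent.get("free_space")
        # free_space may be a non-numeric placeholder, so only compare integers.
        if isinstance(free, int) and free < min_free_bytes:
            reasons.append(f"low disk ({free} bytes free)")
        if reasons:
            flagged.append((agent.get("hostname", "?"), reasons))
    return flagged


# Hypothetical sample payload for illustration.
sample = json.loads("""
{"_embedded": {"agents": [
  {"hostname": "build-01", "agent_state": "Idle", "free_space": 84983328768},
  {"hostname": "build-02", "agent_state": "LostContact", "free_space": 1073741824},
  {"hostname": "build-03", "agent_state": "Building", "free_space": "unknown"}
]}}
""")

for hostname, reasons in flag_agents(sample):
    print(hostname, "->", ", ".join(reasons))
```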
2. Inspect Material Logs
Check go-server.log and go-agent.log for material update errors or checksum mismatches that may prevent triggering.
3. Enable Job Timeout and Logging
Use job-level timeout configurations to prevent indefinite hangs and add detailed logging with verbose flags to debug command failures.
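The job-level timeout itself is set in the job configuration. As a complementary guard inside a task, a small wrapper script can enforce its own limit and log the command's output. The sketch below is one way to do that and is not a GoCD feature; the function name, logger name, and the 124 exit code (borrowed from GNU timeout) are assumptions.

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("task-wrapper")

TIMEOUT_EXIT_CODE = 124  # convention borrowed from GNU timeout


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd, log its output, and return its exit code (124 on timeout)."""
    log.debug("running %s (limit %ss)", cmd, timeout_seconds)
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        log.error("command exceeded %ss and was killed", timeout_seconds)
        return TIMEOUT_EXIT_CODE
    log.debug("stdout: %s", result.stdout.strip())
    log.debug("stderr: %s", result.stderr.strip())
    return result.returncode


# Example: a command that finishes well inside the limit.
exit_code = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_seconds=30)
```

A non-zero return value can be passed to `sys.exit` so the GoCD task fails visibly instead of hanging.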
4. Profile Server Performance
Monitor JVM heap via JMX or enable GoCD Metrics Plugin. Look for thread exhaustion or high GC times indicating performance bottlenecks.
5. Audit Pipeline Config History
Changes in resources, environment variables, or fetchartifact paths can silently break pipelines. Use GoCD's config repository history and audit trails to trace regressions.
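When two revisions of a pipeline definition are at hand (for example, from the config repository's version history), a plain text diff is often enough to spot the change. A minimal sketch using the standard library follows; the YAML snippets are hypothetical.

```python
import difflib


def changed_lines(before, after):
    """Return only the added/removed lines between two config revisions."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(), "before", "after", lineterm=""
    )
    return [
        line
        for line in diff
        if line[:1] in "+-" and not line.startswith(("+++", "---"))
    ]


# Hypothetical revisions of the same job definition.
before = """jobs:
  compile:
    resources: [docker, linux]
"""
after = """jobs:
  compile:
    resources: [docker-20, linux]
"""

for line in changed_lines(before, after):
    print(line)
```

Here the diff surfaces a renamed resource, the kind of edit that leaves a job with no matching agent.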
Step-by-Step Fix Strategy
1. Normalize Agent Resources
resources: ["docker", "linux"]
Ensure enough agents match each job’s resource requirements. Avoid overly narrow labels unless isolation is critical.
2. Clear Material Caches
Stop the server, delete the pipelines/materials directory, and restart. This forces GoCD to refresh material state and resolve hidden corruption.
3. Harden FetchArtifact Usage
<fetchartifact artifactOrigin="gocd" stage="build" job="compile" srcfile="dist/app.jar" />
Validate that the specified file exists and wasn’t purged. Use keepArtifacts in upstream jobs for long-lived builds.
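Part of this validation can be automated by scanning the configuration for suspicious fetch tasks. The sketch below assumes that fetch sources are given through the srcfile or srcdir attributes and are relative to the upstream job's artifact root, so a leading slash is treated as a likely mistake. The function name and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_fetch_tasks(config_xml):
    """Return a list of problems found in <fetchartifact> elements."""
    problems = []
    root = ET.fromstring(config_xml)
    for task in root.iter("fetchartifact"):
        source = task.get("srcfile") or task.get("srcdir")
        label = f"{task.get('stage')}/{task.get('job')}"
        if source is None:
            problems.append(f"{label}: no srcfile or srcdir attribute")
        elif source.startswith("/"):
            problems.append(f"{label}: '{source}' starts with '/'")
    return problems


# Hypothetical task list for illustration.
sample = """
<tasks>
  <fetchartifact artifactOrigin="gocd" stage="build" job="compile" srcfile="dist/app.jar" />
  <fetchartifact artifactOrigin="gocd" stage="build" job="package" srcfile="/dist/app.zip" />
  <fetchartifact artifactOrigin="gocd" stage="build" job="docs" src="site" />
</tasks>
"""

for problem in check_fetch_tasks(sample):
    print(problem)
```

This only catches structural mistakes; whether the artifact still exists on the server has to be checked separately.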
4. Decommission Broken Agents
Unregister agents with persistent failure signs (e.g., stale, offline, or corrupted environments) and re-provision clean instances if needed.
5. Flatten Pipelines Where Possible
Group related jobs into stages to reduce cross-pipeline fetch calls. Use dependency material consolidation to shorten execution chains.
Best Practices
- Tag agents with broad and fallback resources
- Automate config repo validations with pipeline linter tools
- Set artifact expiry policies to avoid disk bloat
- Monitor agent and material health via GoCD APIs
- Backup and version control all pipeline definitions
Conclusion
GoCD offers a highly visual and controllable CI/CD platform, but complex environments require careful orchestration of agents, resources, and artifacts. Execution delays, artifact fetch issues, and pipeline failures often stem from subtle misalignments in agent configuration or pipeline logic. Through consistent monitoring, validation, and resource planning, DevOps teams can maintain resilient and high-throughput delivery pipelines in GoCD.
FAQs
1. Why is my GoCD pipeline stuck on "waiting for agent"?
The job likely requires a resource not matched by any active agent. Reassign or reconfigure agent tags to resolve the mismatch.
2. How do I fix pipeline material not triggering?
Clear the material cache and verify SCM connectivity. Review material logs for errors and ensure polling is not disabled.
3. What causes artifact fetch failures?
Deleted upstream artifacts, wrong fetch paths, or artifacts purged by retention policies. Confirm upstream job status and use keepArtifacts as needed.
4. Can GoCD handle parallel pipeline execution?
Yes, but sufficient agents with matching resources must be available. Grouping and staging strategies help manage concurrency.
5. How do I clean up broken or inactive agents?
From the Agents tab, disable and delete offline or unresponsive agents. Restart the agent service or re-register if needed.