Troubleshooting Pipeline Execution and Agent Issues in GoCD CI/CD

Category: CI/CD (Continuous Integration/Continuous Deployment); By Mindful Chase; 21.Apr

GoCD is an open-source CI/CD server developed by ThoughtWorks that supports complex build and release workflows using pipelines as code. Known for its visual pipeline modeling and support for deployment fan-in/fan-out, GoCD can handle enterprise-grade delivery requirements. However, large-scale GoCD environments often run into pipeline execution delays, stuck agents, stale materials, and artifact fetch failures caused by misconfigured resources, SCM polling issues, or agent saturation. This article is a troubleshooting guide for stabilizing GoCD pipelines and optimizing deployment workflows.

Understanding GoCD Architecture

Server-Agent Model

GoCD uses a central server that orchestrates builds and distributed agents that execute jobs. Each agent has a set of resources that determine job eligibility. Misalignment between job requirements and available agents can stall pipeline execution.

Pipeline as Code and Material Tracking

GoCD supports pipelines-as-code configuration in YAML or JSON (through its config repository plugins) and tightly tracks materials (e.g., Git, Mercurial) across pipeline dependencies. Material polling failures or a corrupted material cache often lead to pipelines not triggering correctly.

Common Symptoms

Build pipelines stuck in "waiting for agent" indefinitely
Pipeline stages failing to fetch artifacts from upstream jobs
Changes not triggering pipeline runs despite material updates
Job logs showing hung processes or unresponsive commands
Slow dashboard performance or server-side memory spikes

Root Causes

1. Agent Resource Mismatch

Jobs configured with specific resources won’t run if no enabled agent advertises all of those resources. This is common in fan-out pipelines, where many jobs compete for a small pool of narrowly tagged agents.
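The matching rule can be sketched as a subset check: a job is schedulable only if some agent's resource set contains every resource the job asks for. The snippet below is a minimal illustration, not GoCD's scheduler. The agent and job names are hypothetical, resources are compared as exact strings, and environments and elastic agent profiles are ignored.

```python
from typing import Dict, List, Set


def eligible_agents(job_resources: Set[str], agents: Dict[str, Set[str]]) -> List[str]:
    """Return the agents whose resources cover everything the job requires."""
    return sorted(name for name, res in agents.items() if job_resources <= res)


def unschedulable_jobs(jobs: Dict[str, Set[str]], agents: Dict[str, Set[str]]) -> List[str]:
    """Return jobs that no agent can pick up (candidates for 'waiting for agent')."""
    return sorted(job for job, needed in jobs.items() if not eligible_agents(needed, agents))


# Hypothetical inventory, for illustration only.
AGENTS = {
    "agent-1": {"linux", "docker"},
    "agent-2": {"linux"},
}
JOBS = {
    "compile": {"linux"},
    "package": {"linux", "docker"},
    "ui-test": {"windows", "chrome"},
}

if __name__ == "__main__":
    # Only "ui-test" has no matching agent in this inventory.
    print(unschedulable_jobs(JOBS, AGENTS))
```

Running a check like this against an exported inventory before changing resource labels shows which jobs would be left without any agent.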

2. Material Polling or Checksum Corruption

GoCD caches material revisions to detect changes. If the material folder is corrupted or out of sync, pipelines may never trigger even when commits exist.

3. Stuck Jobs Due to Detached or Full Agents

Agents that are disconnected or overwhelmed (e.g., full disk, zombie processes) may still appear idle yet fail to pick up or complete work, leaving their assigned jobs queued or hung.

4. Artifact Fetch Failures

Incorrect fetchartifact paths or deleted upstream artifacts result in 404 errors during stage execution. This can break downstream jobs unexpectedly.

5. Inefficient Fan-in/Fan-out Pipeline Design

Overuse of dependencies without consolidation stages leads to long execution chains and unnecessary waits due to strict dependency resolution rules.

Diagnostics and Monitoring

1. Review Agent Status

Use the Agents tab to monitor registration, disk space, and heartbeat. Look for agents marked as "lost contact" or with high build queue time.
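The same check can be scripted against the JSON returned by the GoCD Agents API. The sketch below works on an already-parsed payload and makes no HTTP call. The field names (`_embedded.agents`, `hostname`, `agent_state`, `free_space`) are assumed to match the Agents API response and should be verified against your GoCD version; the 5 GiB threshold is an arbitrary assumption.

```python
from typing import Any, Dict, List

MIN_FREE_BYTES = 5 * 1024**3  # assumed threshold: 5 GiB


def unhealthy_agents(payload: Dict[str, Any], min_free: int = MIN_FREE_BYTES) -> List[str]:
    """Flag agents that lost contact or are running low on disk space."""
    problems = []
    for agent in payload.get("_embedded", {}).get("agents", []):
        name = agent.get("hostname", "<unknown>")
        state = agent.get("agent_state")
        free = agent.get("free_space")  # may be a non-numeric value when unknown
        if state in ("LostContact", "Missing"):
            problems.append(f"{name}: state={state}")
        elif isinstance(free, int) and free < min_free:
            problems.append(f"{name}: low disk ({free} bytes free)")
    return problems


# Hypothetical payload shaped like an Agents API response.
SAMPLE = {
    "_embedded": {
        "agents": [
            {"hostname": "build-01", "agent_state": "Idle", "free_space": 80 * 1024**3},
            {"hostname": "build-02", "agent_state": "LostContact", "free_space": 80 * 1024**3},
            {"hostname": "build-03", "agent_state": "Idle", "free_space": 1024},
            {"hostname": "build-04", "agent_state": "Building", "free_space": "unknown"},
        ]
    }
}

if __name__ == "__main__":
    for line in unhealthy_agents(SAMPLE):
        print(line)
```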

2. Inspect Material Logs

Check the server log (go-server.log) and the agent log (go-agent.log) for material update errors or checksum mismatches that may prevent triggering.
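For large logs, a small filter can pull out only the lines that are both warnings or errors and material-related. This is a rough sketch: the keywords and the sample log lines are assumptions, so adjust the patterns to the messages your GoCD version actually writes.

```python
import re
from typing import Iterable, List

# Assumed patterns; tune them to the real log format.
LEVEL = re.compile(r"\b(ERROR|WARN)\b")
TOPIC = re.compile(r"material|modification|checksum", re.IGNORECASE)


def material_problems(lines: Iterable[str]) -> List[str]:
    """Keep only WARN/ERROR lines that mention materials, modifications, or checksums."""
    return [line.rstrip("\n") for line in lines if LEVEL.search(line) and TOPIC.search(line)]


# Hypothetical log excerpt, for illustration only.
SAMPLE_LOG = [
    "2024-01-01 10:00:00 INFO  Material update completed\n",
    "2024-01-01 10:01:00 ERROR Failed to update material git-repo: timeout\n",
    "2024-01-01 10:02:00 WARN  Disk space is running low\n",
]

if __name__ == "__main__":
    for hit in material_problems(SAMPLE_LOG):
        print(hit)
```

In practice the function would be fed an open file handle, for example `material_problems(open(path))`, with `path` pointing at the server or agent log.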

3. Enable Job Timeout and Logging

Use job-level timeout configurations to prevent indefinite hangs and add detailed logging with verbose flags to debug command failures.
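GoCD's own job timeout is set in the job configuration. As a complementary guard inside a task script, a command can also be wrapped with its own time limit and captured output. The helper below is a generic Python sketch and not a GoCD API; the function name and status strings are made up for illustration.

```python
import subprocess
import sys
from typing import List, Tuple


def run_with_timeout(cmd: List[str], seconds: float) -> Tuple[str, str]:
    """Run cmd and return (status, output); status is 'ok', 'failed', or 'timeout'."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so the log shows one stream
            text=True,
            timeout=seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired as exc:
        partial = exc.output or ""
        if isinstance(partial, bytes):  # partial output can arrive as bytes
            partial = partial.decode(errors="replace")
        return "timeout", partial
    return ("ok" if done.returncode == 0 else "failed"), done.stdout


if __name__ == "__main__":
    status, output = run_with_timeout([sys.executable, "-c", "print('build step')"], 30)
    print(status, output.strip())
```

Returning a status instead of raising lets the calling script log the partial output before it exits non-zero, which makes a hung command easier to diagnose from the job console.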

4. Profile Server Performance

Monitor JVM heap via JMX or enable GoCD Metrics Plugin. Look for thread exhaustion or high GC times indicating performance bottlenecks.

5. Audit Pipeline Config History

Changes in resources, environment variables, or fetchartifact paths can silently break pipelines. Use GoCD's config repository history and audit trails to trace regressions.
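One way to make such silent changes visible is to diff two snapshots of the pipeline configuration and report only the fields that tend to break runs. The sketch below assumes each snapshot has already been reduced to a plain dictionary keyed by job name. The key names (`resources`, `environment_variables`, `fetch_paths`) are hypothetical and not a GoCD schema.

```python
from typing import Any, Dict, List

# Hypothetical field names chosen for this illustration.
WATCHED_KEYS = ("resources", "environment_variables", "fetch_paths")

Snapshot = Dict[str, Dict[str, Any]]


def config_drift(old: Snapshot, new: Snapshot) -> List[str]:
    """Describe added or removed jobs and changes to watched fields."""
    changes = []
    for job in sorted(set(old) | set(new)):
        before, after = old.get(job), new.get(job)
        if before is None:
            changes.append(f"{job}: added")
            continue
        if after is None:
            changes.append(f"{job}: removed")
            continue
        for key in WATCHED_KEYS:
            if before.get(key) != after.get(key):
                changes.append(f"{job}.{key}: {before.get(key)!r} -> {after.get(key)!r}")
    return changes


# Hypothetical snapshots, for illustration only.
OLD = {"compile": {"resources": ["linux"], "fetch_paths": ["dist/app.jar"]}}
NEW = {
    "compile": {"resources": ["linux", "docker"], "fetch_paths": ["dist/app.jar"]},
    "deploy": {"resources": []},
}

if __name__ == "__main__":
    for change in config_drift(OLD, NEW):
        print(change)
```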

Step-by-Step Fix Strategy

1. Normalize Agent Resources

resources: ["docker", "linux"]

Ensure enough agents match each job’s resource requirements. Avoid overly narrow labels unless isolation is critical.

2. Clear Material Caches

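Stop the server, delete the server-side material cache (on a default installation this is typically the pipelines/flyweight directory under the server's working directory; confirm the path for your setup), and restart. This forces GoCD to re-clone its materials and rebuild material state, which resolves hidden corruption.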
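Because the deletion is destructive, it helps to script it with a dry run first. The helper below is a cautious sketch: the cache location is passed in by the caller since it depends on the installation, nothing is removed unless `dry_run=False`, and it assumes the GoCD server has already been stopped.

```python
import shutil
from pathlib import Path
from typing import List


def clear_cache(cache_dir: Path, dry_run: bool = True) -> List[str]:
    """List, and unless dry_run delete, every entry directly under cache_dir."""
    if not cache_dir.is_dir():
        raise FileNotFoundError(f"not a directory: {cache_dir}")
    removed = []
    for entry in sorted(cache_dir.iterdir()):
        removed.append(entry.name)
        if dry_run:
            continue  # report only
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    return removed
```

A typical use is to call `clear_cache(path)` once, review the returned names, and only then call it again with `dry_run=False`.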

3. Harden FetchArtifact Usage

<fetchartifact artifactOrigin="gocd" stage="build" job="compile" srcfile="dist/app.jar" />

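Validate that the specified file exists, that the path is relative to the upstream job's artifact root, and that it wasn’t purged. For long-lived builds, exempt the upstream stage from artifact cleanup (the "Never Cleanup Artifacts" stage option, artifactCleanupProhibited in XML or keep_artifacts in the YAML config plugin).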
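The path validation can be sketched as a lookup against the list of artifacts the upstream job published. How that list is obtained is left open here; the function and its inputs are hypothetical, and the only logic shown is normalizing the path (including a stray leading slash) before comparing.

```python
from posixpath import normpath
from typing import Iterable, Optional


def resolve_srcfile(srcfile: str, published: Iterable[str]) -> Optional[str]:
    """Return the normalized artifact path if it was published, else None."""
    wanted = normpath(srcfile).lstrip("/")
    available = {normpath(path).lstrip("/") for path in published}
    return wanted if wanted in available else None


# Hypothetical artifact listing from the upstream job.
PUBLISHED = ["dist/app.jar", "reports/junit.xml"]

if __name__ == "__main__":
    print(resolve_srcfile("/dist/app.jar", PUBLISHED))  # found after normalization
    print(resolve_srcfile("dist/app.war", PUBLISHED))   # not published
```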

4. Decommission Broken Agents

Unregister agents with persistent failure signs (e.g., stale, offline, or corrupted environments) and re-provision clean instances if needed.

5. Flatten Pipelines Where Possible

Group related jobs into stages to reduce cross-pipeline fetch calls. Use dependency material consolidation to shorten execution chains.
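To decide where flattening pays off, it can help to measure the longest chain of upstream pipelines that must finish before a given pipeline runs. The sketch below computes that chain over a dependency map. The pipeline names are hypothetical and the map is assumed to be acyclic.

```python
from functools import lru_cache
from typing import Dict, List, Tuple


def longest_chain(upstreams: Dict[str, List[str]], pipeline: str) -> List[str]:
    """Return the longest upstream dependency path ending at pipeline."""

    @lru_cache(maxsize=None)
    def chain(name: str) -> Tuple[str, ...]:
        best: Tuple[str, ...] = ()
        for parent in upstreams.get(name, []):
            candidate = chain(parent)
            if len(candidate) > len(best):
                best = candidate
        return best + (name,)

    return list(chain(pipeline))


# Hypothetical dependency map: pipeline -> pipelines it depends on.
DEPS = {
    "deploy": ["package", "smoke"],
    "smoke": ["package"],
    "package": ["build"],
    "build": [],
}

if __name__ == "__main__":
    print(" -> ".join(longest_chain(DEPS, "deploy")))
```

Pipelines sitting on a long chain are the first candidates for merging into stages of a single pipeline.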

Best Practices

Tag agents with broad and fallback resources
Automate config repo validations with pipeline linter tools
Set artifact expiry policies to avoid disk bloat
Monitor agent and material health via GoCD APIs
Backup and version control all pipeline definitions

Conclusion

GoCD offers a highly visual and controllable CI/CD platform, but complex environments require careful orchestration of agents, resources, and artifacts. Execution delays, artifact fetch issues, and pipeline failures often stem from subtle misalignments in agent configuration or pipeline logic. Through consistent monitoring, validation, and resource planning, DevOps teams can maintain resilient and high-throughput delivery pipelines in GoCD.

FAQs

1. Why is my GoCD pipeline stuck on "waiting for agent"?

The job likely requires a resource that no enabled agent provides. Add the missing resource to a suitable agent or relax the job's resource requirements to resolve the mismatch.

2. How do I fix pipeline material not triggering?

Clear the material cache and verify SCM connectivity. Review material logs for errors and ensure polling is not disabled.

3. What causes artifact fetch failures?

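Fetch failures usually come from deleted upstream artifacts, wrong fetch paths, or artifacts purged by cleanup policies. Confirm that the upstream job succeeded and published the file, and exempt the upstream stage from artifact cleanup where needed.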

4. Can GoCD handle parallel pipeline execution?

Yes, but sufficient agents with matching resources must be available. Grouping and staging strategies help manage concurrency.

5. How do I clean up broken or inactive agents?

From the Agents tab, disable and delete offline or unresponsive agents. Restart the agent service or re-register if needed.
