Understanding Concourse CI Architecture
Workers, TSA, and ATC
Concourse consists of the ATC (web UI, API, and scheduler), the TSA (the SSH gateway through which workers register), and one or more workers. Pipelines are declared in YAML, and each step runs in a container that a worker creates through its Garden or containerd runtime.
Resources, Tasks, and Jobs
Pipelines consist of resources (e.g., git, s3), jobs (stages), and tasks (units of work). Each task runs in an isolated container and receives its inputs from resources fetched earlier in the job. Resource types implement check, in, and out scripts to detect new versions, fetch them, and push new ones.
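As a minimal sketch of how these pieces fit together, the pipeline below declares one git resource and one job whose plan fetches that resource and runs a task against it. The resource name, repository URI, job name, and busybox image are placeholders chosen for illustration, not values from any real setup.

```yaml
# Hypothetical pipeline: all names and the URI are placeholders.
resources:
  - name: source-repo
    type: git
    source:
      uri: https://example.com/org/app.git
      branch: main

jobs:
  - name: unit-test
    plan:
      # Fetch a version of the resource that its check script has discovered.
      - get: source-repo
      # Run one unit of work in its own container.
      - task: list-files
        config:
          platform: linux
          image_resource:
            type: registry-image
            source:
              repository: busybox
          inputs:
            - name: source-repo   # directory produced by the get step above
          run:
            path: sh
            args:
              - -ec
              - |
                ls source-repo
```

The get step is what supplies the source-repo directory listed under inputs; without it the task would have nothing to read.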
Common Concourse CI Issues
1. Stuck or Hanging Builds
Occurs when a job is waiting on a resource check that never completes. Root causes include failing check scripts, overloaded workers, or external service timeouts.
2. Resource Check Failures
Errors in check scripts or credential misconfigurations result in 'resource script failed' or 'unexpected end of JSON input' errors during pipeline refresh or job triggering.
3. Volume GC and Disk Exhaustion
Concourse manages build artifacts via volumes on workers. Improper cleanup or build retention policies can lead to full disks and broken build pipelines with 'no space left on device' errors.
4. Secret Leakage in Logs
Secrets passed as environment variables or YAML literals may leak into logs if not redacted. Improper usage of 'echo $SECRET' in task scripts can expose sensitive data.
5. Pipeline Not Triggering Automatically
Commonly caused by a missing 'trigger: true' flag, incorrect resource versions, or paused resources. Also occurs if check_every is too infrequent or worker containers are down.
Diagnostics and Debugging Techniques
Enable TSA and Worker Logs
Inspect TSA logs for worker registration and SSH connectivity issues. Worker logs provide insight into volume lifecycle, container provisioning, and resource fetch behavior.
Check Pipeline via Fly CLI
Use fly validate-pipeline to lint YAML and fly get-pipeline to confirm resource/job configuration. fly watch helps trace real-time task execution.
Inspect Resource Versions
Run fly resource-versions to see version history and identify stale or broken states. Use fly check-resource to manually trigger version detection.
Use Task Debug Shell
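Add a long-running command like 'sleep 9999' to the task so its container stays alive, then open a shell inside it with fly intercept. Set 'privileged: true' only if the debugging tools need elevated access, since it removes much of the container's isolation.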
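A hedged sketch of the debug-shell technique: the job fragment below runs a hypothetical test script and, only if it fails, keeps the container alive so it can be inspected. The job name, task name, script path, and image are assumptions, and a resource named source-repo is assumed to be defined elsewhere in the pipeline.

```yaml
jobs:
  - name: debug-example
    plan:
      - get: source-repo
      - task: run-tests
        config:
          platform: linux
          image_resource:
            type: registry-image
            source:
              repository: busybox
          inputs:
            - name: source-repo
          run:
            path: sh
            args:
              - -c
              - |
                # Hypothetical script path. On failure, hold the container
                # open for inspection, then still report the failure.
                ./source-repo/ci/test.sh || { sleep 9999; exit 1; }
```

While the sleep is running, a command along the lines of fly -t TARGET intercept -j PIPELINE/debug-example -s run-tests should open a shell in that container; TARGET and PIPELINE stand in for your own names. Remove the sleep once the problem is found, since it occupies a worker container for its full duration.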
Monitor Volume and Container Counts
Use fly volumes and fly containers to spot orphaned or zombie volumes and containers. If they belong to a stalled worker, remove that worker with fly prune-worker, or restart the worker so that garbage collection can take effect.
Step-by-Step Resolution Guide
1. Unblock Hanging Builds
Check the status of dependent resources. Manually trigger a check with fly check-resource. Restart workers if build containers are unresponsive.
2. Fix Resource Check Errors
Enable debug mode in resource scripts. Confirm that credentials are passed correctly via credential managers (e.g., Vault, SSM) or fly set-pipeline with vars.
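To illustrate passing credentials without hardcoding them, the resource below takes its SSH key from a variable. This is a sketch: the resource name, repository address, and the variable name git_private_key are hypothetical.

```yaml
resources:
  - name: source-repo
    type: git
    source:
      uri: git@example.com:org/app.git
      branch: main
      # Hypothetical variable name; the value never appears in the pipeline file.
      private_key: ((git_private_key))
```

With a credential manager configured, the variable is resolved from the secret store. With static vars, it could instead be supplied at set time, for example fly -t TARGET set-pipeline -p PIPELINE -c pipeline.yml -l secrets.yml, where secrets.yml is an assumed vars file kept out of version control. If the check still fails, a wrong or missing value for this variable is a likely first suspect.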
3. Resolve Volume Saturation
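Adjust build log retention on jobs (build_log_retention, formerly build_logs_to_retain) and the garbage collection interval on the web node (gc_interval). Use fly prune-worker to remove stalled workers, and consider increasing disk space or deploying ephemeral workers for cleanup.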
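A retention policy might look like the job fragment below. It is a sketch that assumes a Concourse version supporting the job-level build_log_retention setting (the successor to build_logs_to_retain); the numbers, job name, and task file path are illustrative only, and a resource named source-repo is assumed to exist.

```yaml
jobs:
  - name: unit-test
    build_log_retention:
      builds: 50                    # illustrative cap on how many recent builds keep their logs
      minimum_succeeded_builds: 1   # keep at least one successful build's logs
    plan:
      - get: source-repo
      - task: run-tests
        file: source-repo/ci/test.yml   # hypothetical task file inside the repository
```

This setting only governs retained build logs. The garbage collection interval is configured on the web node rather than in the pipeline file, so it is not shown here.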
4. Prevent Secret Exposure
Use params in task definitions and avoid printing secret vars directly. Mask secrets with credential manager integrations and redact output in custom scripts.
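The following sketch shows a secret reaching a task through params and being used without ever being printed. The job name, task name, variable name api_token, and image are assumptions for illustration.

```yaml
jobs:
  - name: deploy
    plan:
      - task: push
        config:
          platform: linux
          image_resource:
            type: registry-image
            source:
              repository: busybox
          params:
            # Hypothetical variable; resolved from a credential manager or vars.
            API_TOKEN: ((api_token))
          run:
            path: sh
            args:
              - -ec
              - |
                # Check that the secret is present without echoing its value.
                test -n "$API_TOKEN"
                echo "API_TOKEN is set"
```

The script confirms the variable is non-empty and logs only a fixed message, so the build output carries no secret material. Avoid shell tracing (set -x) in such scripts, because it prints expanded variables.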
5. Ensure Pipeline Triggers Correctly
Set 'trigger: true' in get steps. Confirm that resources are not paused and review their check_every values. Restart ATC if resources are stale.
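Put together, a pipeline that triggers automatically could look like this sketch. The resource name, URI, two-minute interval, and task file path are placeholders rather than recommended values.

```yaml
resources:
  - name: source-repo
    type: git
    check_every: 2m          # how often Concourse looks for new versions
    source:
      uri: https://example.com/org/app.git
      branch: main

jobs:
  - name: unit-test
    plan:
      - get: source-repo
        trigger: true        # a new version of source-repo starts this job
      - task: run-tests
        file: source-repo/ci/test.yml   # hypothetical task file
```

If the job still does not start after a new commit, forcing a check with fly -t TARGET check-resource -r PIPELINE/source-repo (TARGET and PIPELINE being your own names) helps separate a checking problem from a triggering problem.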
Best Practices for Concourse CI
- Use resource types maintained by the community or Concourse core.
- Define secrets via credential managers and avoid hardcoding in YAML.
- Keep task scripts idempotent and externalized for reuse.
- Split pipelines by domain to minimize graph complexity and runtime overhead.
- Monitor container and volume growth with alerts to avoid disk-related failures.
Conclusion
Concourse CI offers a powerful, declarative approach to CI/CD, but requires disciplined pipeline design, resource management, and observability to avoid runtime issues. By applying structured debugging techniques, using the Fly CLI effectively, and maintaining clean worker environments, teams can ensure reliable and scalable continuous delivery with Concourse.
FAQs
1. Why do my builds hang at 'checking for new versions'?
This indicates a stuck resource check. Run fly check-resource and examine resource scripts and credentials for failures.
2. How can I debug a failed task container?
Add 'sleep 9999' to the command and use fly intercept to open an interactive shell for inspection.
3. What causes 'no space left on device' errors?
Excessive volume retention or failed garbage collection. Remove stalled workers with fly prune-worker and monitor volume usage regularly.
4. Why is my secret exposed in logs?
Directly echoing secrets or not using masked params. Avoid printing secrets and use credential managers for injection.
5. Can I safely delete old builds and containers?
Yes. Old build logs are removed according to each job's retention settings, and fly prune-worker clears out stalled workers. Use retention policies to automate cleanup.