Understanding Concourse CI Architecture

Workers, TSA, and ATC

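A Concourse deployment consists of the ATC (web UI, API, and build scheduler), the TSA (the SSH gateway through which workers register), and one or more workers. Pipelines are declared in YAML, and each step runs in its own container, which the worker creates through its container runtime (Guardian or containerd, both exposed through the Garden API).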

Resources, Tasks, and Jobs

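Pipelines consist of resources (e.g., git, s3), jobs, and tasks. A resource is a versioned external artifact, a job is a plan of steps, and a task is the unit of work that runs a script in a container. Each task runs in isolation and receives its inputs from get steps, which fetch resource versions. Every resource type implements three scripts: check discovers new versions, in fetches a version, and out publishes one.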
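
To make the check contract concrete, here is a minimal Python sketch of the decision a check script makes. Concourse hands the script a JSON request containing the resource's source and the last known version, and expects a JSON array of versions, oldest first. The known_refs list and the "ref" version key are hypothetical stand-ins for whatever the external system would report; a real script would read the request from stdin and query that system itself.

```python
import json

def check(request, known_refs):
    """Return the versions to report, oldest first.

    request    -- parsed JSON request: {"source": {...}, "version": {...} or absent}
    known_refs -- hypothetical list of refs from the external system, oldest first
    """
    current = (request.get("version") or {}).get("ref")
    if current in known_refs:
        # Report the current version and everything newer than it.
        refs = known_refs[known_refs.index(current):]
    else:
        # First check, or the known version vanished: report only the latest.
        refs = known_refs[-1:]
    return [{"ref": r} for r in refs]

request = json.loads(
    '{"source": {"uri": "https://example.com/repo.git"}, "version": {"ref": "b"}}'
)
print(json.dumps(check(request, ["a", "b", "c"])))
```

With the sample request, the script reports version b followed by the newer version c.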

Common Concourse CI Issues

1. Stuck or Hanging Builds

A build hangs when its job is waiting on a resource check that never completes. Root causes include failing check scripts, overloaded workers, and timeouts in external services.
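
When developing or testing a check script outside Concourse, one way to tell a hang from a failure is to run it under a deadline. The sketch below is an illustration, not part of Concourse: run_with_deadline is a hypothetical helper, and the two commands are stand-ins for a healthy and a hanging check script.

```python
import subprocess
import sys

def run_with_deadline(argv, seconds):
    """Run argv and return ('ok' | 'failed' | 'timeout', captured stdout)."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind.
        return "timeout", ""
    return ("ok" if done.returncode == 0 else "failed"), done.stdout

# Stand-ins for a healthy and a hanging check script.
healthy = [sys.executable, "-c", "print('[]')"]
hanging = [sys.executable, "-c", "import time; time.sleep(30)"]

print(run_with_deadline(healthy, 10)[0])
print(run_with_deadline(hanging, 1)[0])
```

A "timeout" result points at the external service or the script's own retry logic, while "failed" points at the script's exit status and stderr.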

2. Resource Check Failures

Errors in check scripts or misconfigured credentials surface as 'resource script failed' or 'unexpected end of JSON input' errors when a resource is checked or a job is triggered.
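
An empty or malformed stdout from a check script is one plausible source of the JSON error above, because Concourse parses that stdout as a JSON array of versions whose keys and values are strings. The following sketch, with the hypothetical helper validate_check_output, shows the kind of sanity check you could run against captured script output before blaming the pipeline.

```python
import json

def validate_check_output(raw):
    """Return a list of problems found in a check script's stdout (empty if fine)."""
    if not raw.strip():
        return ["stdout is empty; the script may have exited before printing JSON"]
    try:
        versions = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"stdout is not valid JSON: {err.msg}"]
    if not isinstance(versions, list):
        return ["top-level value must be a JSON array of versions"]
    problems = []
    for i, version in enumerate(versions):
        # Each version must be an object; JSON keys are always strings,
        # so only the values need an explicit type check.
        if not isinstance(version, dict) or not all(
            isinstance(v, str) for v in version.values()
        ):
            problems.append(f"version {i} must be an object with string values")
    return problems

print(validate_check_output('[{"ref": "abc123"}]'))
print(validate_check_output("fetching...\n[]"))
```

The second call illustrates a common mistake: log lines written to stdout ahead of the JSON make the whole output unparseable.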

3. Volume GC and Disk Exhaustion

Concourse stores build artifacts and resource caches in volumes on workers. When cleanup falls behind or retention policies are too generous, disks fill up and builds fail with 'no space left on device' errors.

4. Secret Leakage in Logs

Secrets passed as environment variables or written as YAML literals can leak into build logs if they are not redacted. A line as simple as echo $SECRET in a task script, or running a shell script with set -x, exposes the value.

5. Pipeline Not Triggering Automatically

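Commonly caused by a missing trigger: true on the get step, a pinned or disabled resource version, or a paused pipeline or job (older Concourse versions could also pause individual resources). It also happens when check_every is too long or no worker is available to run the checks.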

Diagnostics and Debugging Techniques

Enable TSA and Worker Logs

Inspect TSA logs for worker registration and SSH connectivity issues. Worker logs provide insight into volume lifecycle, container provisioning, and resource fetch behavior.

Check Pipeline via Fly CLI

Use fly validate-pipeline to lint the YAML locally and fly get-pipeline to confirm the resource and job configuration that the server actually holds. Use fly watch to stream a build's output as it runs.
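
As a rough illustration, the three commands can be assembled as argument lists and handed to subprocess.run. The fly helper below is hypothetical, and the target name "ci", the file pipeline.yml, and the pipeline and job names are placeholders for your own values.

```python
import shlex

def fly(target, command, *args):
    """Build the argv for a fly invocation against a saved target."""
    return ["fly", "-t", target, command, *args]

# "ci", "pipeline.yml", "my-pipeline", and "build" are placeholders.
commands = [
    fly("ci", "validate-pipeline", "-c", "pipeline.yml"),  # lint the YAML locally
    fly("ci", "get-pipeline", "-p", "my-pipeline"),        # print the stored config
    fly("ci", "watch", "-j", "my-pipeline/build"),         # stream a job's latest build
]
for argv in commands:
    print(shlex.join(argv))
```

To execute a command rather than print it, pass the list to subprocess.run(argv, check=True) on a machine where fly is installed and logged in to the target.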

Inspect Resource Versions

Run fly resource-versions to see version history and identify stale or broken states. Use fly check-resource to manually trigger version detection.

Use Task Debug Shell

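Temporarily replace the task's command with something long-running, such as sleep 9999, then attach to the container with fly intercept. Adding privileged: true is only necessary when the debugging itself needs elevated capabilities; it weakens isolation, so remove it afterwards.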

Monitor Volume and Container Counts

Use fly volumes and fly containers to spot orphaned volumes and containers. Concourse garbage-collects these automatically; if a worker has stalled, remove it with fly prune-worker, or restart the worker so that it re-registers and GC can proceed.
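
A per-worker volume count makes a runaway worker easy to spot. The sketch below assumes the listing has been captured as JSON (for example with fly volumes --json) and that each entry carries a worker_name field; both the flag and the field names should be checked against your fly version, and the sample data is hand-written.

```python
import json
from collections import Counter

def volumes_per_worker(volumes_json):
    """Count volumes by worker from a JSON array of volume objects."""
    volumes = json.loads(volumes_json)
    return Counter(v["worker_name"] for v in volumes)

# Hand-written sample; real output carries more fields per volume.
sample = json.dumps([
    {"id": "v1", "worker_name": "worker-a", "type": "container"},
    {"id": "v2", "worker_name": "worker-a", "type": "resource"},
    {"id": "v3", "worker_name": "worker-b", "type": "task-cache"},
])
for worker, count in volumes_per_worker(sample).most_common():
    print(worker, count)
```

Feeding these counts into an existing metrics system and alerting on sudden growth is usually more useful than inspecting them by hand.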

Step-by-Step Resolution Guide

1. Unblock Hanging Builds

Check the status of the resources the job depends on, and trigger a check manually with fly check-resource. Restart the worker if its build containers are unresponsive.

2. Fix Resource Check Errors

Enable debug mode in resource scripts. Confirm that credentials are passed correctly, either through a credential manager (e.g., Vault, AWS SSM) or through fly set-pipeline with -v or -l.
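
For a custom resource, debug output has to go to stderr, because stdout is reserved for the JSON that Concourse parses. The sketch below assumes a hypothetical debug flag in the resource's source configuration and a hypothetical list of secret field names; it logs which fields arrived without printing their values.

```python
import sys

# Hypothetical names of source fields that hold credentials.
SECRET_KEYS = ("password", "private_key", "token")

def describe_source(source):
    """Summarize the source config without revealing secret values."""
    return {k: ("<set>" if k in SECRET_KEYS else v) for k, v in source.items()}

def debug(source, message):
    """Write a debug line to stderr when source has debug: true."""
    if source.get("debug"):
        print(f"[debug] {message}", file=sys.stderr)

source = {"uri": "https://example.com/repo.git", "token": "s3cr3t", "debug": True}
debug(source, f"source received: {describe_source(source)}")
```

If a credential shows up as missing here, the problem lies in how the pipeline variable is resolved rather than in the script.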

3. Resolve Volume Saturation

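Tune build log retention (build_log_retention on the job, which replaces the deprecated build_logs_to_retain) and the garbage-collection interval (gc_interval on the web node). Remove stalled workers with fly prune-worker, and consider adding disk space or deploying ephemeral workers that start from an empty disk.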
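
Catching saturation early is easier with a simple threshold check on the worker host. This is a minimal sketch using only the Python standard library; disk_report is a hypothetical helper, the 80% threshold is an arbitrary example, and "/" is a placeholder for the worker's work directory.

```python
import shutil

def disk_report(path, warn_at=0.80):
    """Return (fraction_used, needs_attention) for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    # Guard against pseudo-filesystems that report a total size of zero.
    used = usage.used / usage.total if usage.total else 0.0
    return used, used >= warn_at

# "/" is a placeholder; on a worker, point this at its work directory.
used, alert = disk_report("/")
print(f"{used:.0%} used, alert={alert}")
```

Run from cron or a monitoring agent, a check like this gives warning before builds start failing on a full disk.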

4. Prevent Secret Exposure

Use params in task definitions and avoid printing secret vars directly. Mask secrets with credential manager integrations and redact output in custom scripts.
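
For custom scripts that must print output containing sensitive values, a small filter can mask them before they reach the build log. The redact function and the mask string below are illustrative choices, not a Concourse API.

```python
def redact(line, secrets, mask="((redacted))"):
    """Replace every occurrence of each secret value in line with mask."""
    # Longest first, so a secret that contains another one is masked whole.
    # Empty strings are skipped because replacing "" would corrupt the line.
    for secret in sorted(filter(None, secrets), key=len, reverse=True):
        line = line.replace(secret, mask)
    return line

secrets = ["hunter2", "hunter2-extended"]
print(redact("login with hunter2-extended then hunter2", secrets))
```

A filter like this is a safety net only: it cannot catch a secret that has been encoded or split across lines, so not printing secrets in the first place remains the primary defense.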

5. Ensure Pipeline Triggers Correctly

Set trigger: true on the get steps that should start the job. Confirm that the pipeline and job are not paused, and review check_every values. If resource versions still look stale, restart the ATC.
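
A missing trigger flag is easy to overlook in a large pipeline, so it can help to scan for it. The sketch below works on a pipeline already parsed into Python dictionaries (for example by a YAML library) and lists jobs that have get steps but none with trigger: true. The helper name is hypothetical, and only in_parallel and do wrappers are descended into; other nested step types are ignored.

```python
def jobs_without_trigger(pipeline):
    """Return names of jobs that have get steps but none with trigger: true."""
    def get_steps(plan):
        for step in plan:
            if "get" in step:
                yield step
            # Descend into the two most common wrappers only.
            nested = step.get("in_parallel") or step.get("do") or []
            if isinstance(nested, dict):  # in_parallel may be {steps: [...]}
                nested = nested.get("steps", [])
            yield from get_steps(nested)

    missing = []
    for job in pipeline.get("jobs", []):
        gets = list(get_steps(job.get("plan", [])))
        if gets and not any(g.get("trigger") is True for g in gets):
            missing.append(job["name"])
    return missing

pipeline = {
    "jobs": [
        {"name": "unit", "plan": [{"get": "repo", "trigger": True}, {"task": "test"}]},
        {"name": "deploy", "plan": [{"in_parallel": [{"get": "repo", "passed": ["unit"]}]}]},
    ]
}
print(jobs_without_trigger(pipeline))
```

A job reported here is not necessarily broken, since some jobs are meant to be started manually; the list is a prompt for review.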

Best Practices for Concourse CI

  • Use resource types maintained by the community or Concourse core.
  • Define secrets via credential managers and avoid hardcoding in YAML.
  • Keep task scripts idempotent and externalized for reuse.
  • Split pipelines by domain to minimize graph complexity and runtime overhead.
  • Monitor container and volume growth with alerts to avoid disk-related failures.

Conclusion

Concourse CI offers a powerful, declarative approach to CI/CD, but requires disciplined pipeline design, resource management, and observability to avoid runtime issues. By applying structured debugging techniques, using the Fly CLI effectively, and maintaining clean worker environments, teams can ensure reliable and scalable continuous delivery with Concourse.

FAQs

1. Why do my builds hang at 'checking for new versions'?

This indicates a stuck resource check. Run fly check-resource and examine resource scripts and credentials for failures.

2. How can I debug a failed task container?

Add sleep 9999 to the command and use fly intercept to open an interactive shell for inspection.

3. What causes 'no space left on device' errors?

Usually excessive volume retention or failed garbage collection on a worker. Monitor volume usage regularly, and use fly prune-worker to remove workers that have stalled.

4. Why is my secret exposed in logs?

Usually because a script echoes the secret directly or the value is not passed through masked params. Avoid printing secrets and use a credential manager for injection.

5. Can I safely delete old builds and containers?

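Yes, though mostly indirectly: fly has no command for deleting individual builds. Old build logs are removed by the job's retention policy, containers and volumes are garbage-collected automatically, and fly prune-worker removes workers that have stalled. Use retention policies to automate cleanup.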