Mercurial at Scale: Troubleshooting Divergence, Performance, and Integrity

Details: Category: Version Control; By Mindful Chase; 27.Aug; Hits: 107

Mercurial remains a robust, battle-tested DVCS for enterprises that value integrity, flexible workflows, and predictable performance. Yet in large monorepos or multi-repo federations, senior engineers often face subtle failures: divergent heads that resist reconciliation, slow clones over high-latency links, hidden changesets caused by phases or obsolescence, lock contention during CI bursts, and mysterious working-copy corruption after abrupt crashes. This guide presents a pragmatic, deeply technical troubleshooting playbook focused on root-cause analysis, architectural implications, and durable fixes that improve reliability over the long term.

Background and Architectural Context

Mercurial fundamentals at scale

Mercurial stores history in revlogs under the .hg/store directory. The changelog, the manifest, and each tracked file have their own revlog, which holds deltas plus periodic full snapshots. Modern repository formats add features such as generaldelta and sparse-revlog to keep delta chains short on deep histories. At enterprise scale, design choices such as server protocol, network topology, filesystem semantics, and CI fanout shape performance and failure modes.

Key concepts that drive troubleshooting

Phases: draft, public, secret. Phase transitions determine what can be rewritten. Misunderstood phases create hidden or immutable changesets that complicate merges.
Obsolescence and the evolve model: obsolete markers track rewritten history. Inconsistent marker propagation produces confusing pruned or orphan states.
Named branches vs bookmarks: named branches are permanent lineage labels; bookmarks are movable pointers. Mixing them without policy leads to diverging tips.
Working copy dirstate: a compact index of tracked paths, mtimes, and states. Corruption or skew here causes spurious status output and failed merges.
Transaction journals: Mercurial commits and pulls are transactional. Crash recovery uses the journal to roll back an interrupted transaction; stale journals indicate interrupted writes.

Typical Failure Modes and Why They Matter

1) Divergent heads and unresolved branch tips

Multiple heads on a named branch or bookmark often arise from concurrent pushes, partial rewrites, or CI-generated commits. Divergence increases merge risk, build time, and cognitive load for reviewers.

2) Hidden or unexpected changesets

Phases and obsolescence can hide changesets from common commands. Engineers “lose” a branch or rewrite commits locally, push with partial marker propagation, and then discover missing or pruned revisions on peers.

3) Slow clone, pull, or push in WAN or monorepo contexts

Large histories, binary blobs, narrow networks, and legacy wire protocol settings cause I/O amplification. CI farms suffer when every job reclones and re-resolves identical deltas.

4) Working-copy corruption after crashes or anti-virus interference

Unclean shutdowns or external processes that lock or quarantine files inside .hg can damage dirstate or journals. Symptoms include inconsistent hg status, failed updates, and inexplicable conflicts.

5) Lock contention and server-side saturation

Concurrent hooks, pretxnchangegroup checks, or repack operations hold locks while CI flood-pushes. Result: timeouts, aborted transactions, and frustrated developers.

6) Subrepo and largefiles drift

Subrepositories pin specific revisions; careless updates or shallow synchronization create mismatches. The largefiles extension can produce missing standins or content mismatches if caches are not synchronized.

Architecture-Driven Diagnostics

Repository format and requirements

Confirm format and enabled features to understand performance ceilings and tool compatibility.

cat .hg/requires
# Look for entries like: fncache, dotencode, generaldelta, treemanifest, sparserevlog

Outdated formats limit delta strategies and might prevent adopting newer optimizations.

Wire protocol, transport, and caching

Measure protocol overhead and server configuration. Ensure SSH multiplexing, HTTP keep-alive, and server-side bundle caching are configured appropriately.

hg paths
hg showconfig ui.ssh
# For SSH: configure ControlMaster and ControlPersist in ~/.ssh/config

Filesystem and OS nuances

Windows path length, case sensitivity, and antivirus scans are frequent contributors to performance issues. On Linux, ext4/XFS mount options, noatime, and barrier settings can subtly affect throughput for CI runners.

Deep-Dive Troubleshooting Playbook

Identify and explain multiple heads

Start with a branch-level inventory.

hg heads -t
hg heads default
hg log -r "head() and not closed()" --template "{node|short} {branch}\n"

If the number of heads grew recently, compare push timestamps with CI activity. Head proliferation typically follows parallel pushes to the same named branch or divergent bookmarks.
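The inventory above can be automated. The following is a minimal sketch, assuming hg is on PATH and that a template of the form "short node, space, branch name" is acceptable; the function names and the template constant are illustrative, not part of Mercurial.

```python
import subprocess
from collections import Counter

# Assumption of this sketch: one "<short node> <branch name>" pair per line.
HEAD_TEMPLATE = "{node|short} {branch}\n"


def count_heads(log_output):
    """Count heads per named branch from '<node> <branch>' lines."""
    counts = Counter()
    for line in log_output.splitlines():
        # Split on the first space only, because branch names may contain spaces.
        node, _, branch = line.partition(" ")
        if node and branch:
            counts[branch] += 1
    return counts


def branches_with_multiple_heads(log_output):
    """Return {branch: head_count} for branches that have more than one head."""
    return {b: n for b, n in count_heads(log_output).items() if n > 1}


def read_heads(repo_path):
    """Run hg and return the raw listing of open branch heads."""
    result = subprocess.run(
        ["hg", "-R", repo_path, "log",
         "-r", "head() and not closed()", "-T", HEAD_TEMPLATE],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

A CI job could call branches_with_multiple_heads(read_heads(path)) and raise an alert whenever the result is non-empty.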

Locate hidden, obsolete, or pruned changesets

Reveal what phases or obsolescence have concealed.

hg log --hidden -r "hidden()"
hg log --hidden -r "obsolete()"
hg log -r "orphan() or phasedivergent() or contentdivergent()"
hg phase -r .

If a changeset is draft locally but already public on the server, rewriting it locally produces phase-divergent changesets once phases are exchanged. Such a mismatch points to phase propagation issues or a misconfigured server hook.

Verify transaction health and crash residue

Check for stale journals or incomplete transactions, especially after power loss or interrupted pulls.

ls .hg/store/undo* .hg/store/journal* 2>/dev/null
hg recover

hg recover rolls back the interrupted transaction recorded in the journal. If recovery is needed frequently, investigate storage or process termination patterns.
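For fleets of CI runners, a small scanner can flag repositories that carry crash residue. This is a sketch only: the list of lock and journal locations is an assumption based on the paths mentioned in this guide, and find_crash_residue is a hypothetical helper name.

```python
import os
from pathlib import Path

# Paths relative to the repository root; this list is an assumption of the sketch.
LOCK_FILES = (".hg/wlock", ".hg/store/lock")
JOURNAL_GLOBS = (".hg/journal*", ".hg/store/journal*")


def find_crash_residue(repo_root):
    """Return sorted relative paths of lock and journal files found in a repo."""
    root = Path(repo_root)
    found = []
    for rel in LOCK_FILES:
        # lexists() also reports dangling symlinks, which is how locks can
        # appear on POSIX systems.
        if os.path.lexists(root / rel):
            found.append(rel)
    for pattern in JOURNAL_GLOBS:
        found.extend(p.relative_to(root).as_posix() for p in root.glob(pattern))
    return sorted(found)
```

A lock or journal that is present while no hg process is running suggests an interrupted operation. The scanner only reports; it does not delete anything.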

Measure clone/pull bottlenecks

Separate network from server generation time by using bundles and logging.

# Server: generate a full bundle (stream bundles need a version-specific --type; see "hg help bundlespec")
hg bundle --all /tmp/repo.bundle
# Client: test local clone duration
time hg clone /tmp/repo.bundle repo-test
# Compare to network clone
time hg clone ssh://server/repo repo-net

If local bundle clones are fast but network clones crawl, focus on latency, protocol settings, or server CPU. If both are slow, inspect store layout and delta chains.
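The comparison can be scripted so that results are recorded consistently across sites. The sketch below is illustrative: the 60-second threshold and the 3x ratio are arbitrary assumptions to be tuned per repository, and both function names are hypothetical.

```python
import subprocess
import time


def time_command(argv):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def classify_bottleneck(local_seconds, network_seconds,
                        slow_threshold=60.0, ratio=3.0):
    """Apply the rule of thumb from the text to two clone timings.

    Thresholds are assumptions of this sketch, not Mercurial defaults.
    """
    if local_seconds >= slow_threshold:
        # Even the local bundle clone is slow: look at store layout and deltas.
        return "store"
    if network_seconds >= ratio * local_seconds:
        # Local is fast, network crawls: latency, protocol, or server CPU.
        return "network-or-server"
    return "ok"
```

For example, classify_bottleneck(time_command(["hg", "clone", "/tmp/repo.bundle", "repo-test"]), time_command(["hg", "clone", "ssh://server/repo", "repo-net"])) returns one of the three labels.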

Detect dirstate or working-copy anomalies

When hg status reports phantom changes or merges fail with inexplicable conflicts, examine dirstate, timestamps, and case collisions.

hg debugstate | head
hg status -v
hg debugpathcomplete some/path
hg purge --all --print

If anti-virus or backup agents touch .hg, add exclusions and retest.

Audit server hooks and CI

Misbehaving hooks stall pushes or create inconsistent phases. CI robots that rewrite history without publishing obsolescence markers create orphaned histories.

hg showconfig hooks
# Look for pretxnchangegroup, prepushkey, changegroup

Root Causes and Corrective Actions

Cause: concurrent pushes creating divergent heads

Developers and CI push to the same named branch, each with different tips. Bookmarks are not exchanged consistently.

Fix:

Adopt a single-writer policy per named branch. Route all writes through a gatekeeper service that merges and pushes.
Switch to bookmark-centric workflows for short-lived lines, keeping named branches for long-lived release lines.
Enable a server-side pretxnchangegroup hook to reject new heads on protected branches.

# Example: reject multiple heads on default
[hooks]
pretxnchangegroup.rejectheads = python:hooks.rejectmultipleheads
[hooks.rejectmultipleheads]
branches = default, release
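The configuration above refers to a hook function that is not shown. A minimal sketch of what hooks.rejectmultipleheads might look like follows. It assumes an in-process Python hook whose repository view includes the incoming changesets, and it reads the branches list from the [hooks.rejectmultipleheads] section used above; the helper name branches_with_extra_heads is hypothetical.

```python
def branches_with_extra_heads(heads_by_branch, protected):
    """Return the protected branch names that have more than one head."""
    return sorted(
        name for name, heads in heads_by_branch.items()
        if name in protected and len(heads) > 1
    )


def rejectmultipleheads(ui, repo, hooktype, **kwargs):
    """pretxnchangegroup hook sketch: a true return value aborts the push."""
    # Mercurial's internal API works with bytes, hence the b"" literals.
    protected = set(ui.configlist(b"hooks.rejectmultipleheads", b"branches"))
    heads_by_branch = {name: repo.branchheads(name) for name in protected}
    offenders = branches_with_extra_heads(heads_by_branch, protected)
    for name in offenders:
        ui.warn(b"rejecting push: multiple heads on branch %s\n" % name)
    return bool(offenders)
```

The module must be importable by the server's Mercurial process (for example, a hooks.py on its Python path). Keeping the decision in a pure helper makes the policy easy to unit-test without a repository.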

Cause: hidden or obsolete changesets confuse merges

Phase drift or missing obsolescence markers cause developers to see different DAGs. People attempt merges with incomplete knowledge.

Fix:

Standardize phase policies: server publishes to public on integration; developers work in draft phases only.
Propagate obsolescence markers by enabling marker exchange (the experimental.evolution settings) on clients and servers, and by ensuring evolve is installed across teams.
Expose hidden revisions temporarily to repair history, then restore policy.

hg phase --public -r REV
hg debugobsolete OLDNODE NEWNODE   # expects full 40-character node IDs
hg log --hidden -r "hidden()"
hg debugobsolete   # with no arguments, lists the stored markers

Cause: slow clones due to WAN latency and large histories

Long delta chains and repeated negotiation across slow links multiply round-trips. CI amplifies pain with fresh clones per job.

Fix:

Use stream clones or pre-generated bundles on the server; cache them on CI workers.
Enable HTTP/2 or persistent SSH multiplexing; increase server max-connections.
Adopt narrow and sparse checkouts for large monorepos.

# Server: weekly full bundle for CI cache (add a stream --type where your version supports it; see "hg help bundlespec")
hg -R /srv/repo bundle --all /srv/cache/repo-full.hg
# Client: use the cached bundle first
hg clone /srv/cache/repo-full.hg repo
hg -R repo pull -u ssh://server/repo
# Narrow clone (requires the narrow extension on client and server)
hg clone --narrow ssh://server/repo repo-narrow -r default --include path:src/

Cause: working-copy corruption or stale locks after crash

Interrupted operations leave journals and locks behind. Anti-virus quarantines files inside .hg.

Fix:

Run hg recover to roll back the interrupted transaction; delete stale lock files only after confirming that no hg process is still running.
Exclude .hg from anti-virus and backup software; relocate repo to a stable local filesystem.
Rebuild the working copy if dirstate is inconsistent.

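hg recover
# Remove locks only when no hg process is running against this repository
rm -f .hg/store/lock .hg/wlock 2>/dev/null
# Discards uncommitted changes; save them first with "hg diff" if needed
hg update -C .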

Cause: lock contention from long-running hooks or repacks

Server hooks perform heavy checks (lint, scan, monorepo guardians) during changegroup transactions. Meanwhile CI pushes pile up.

Fix:

Move heavy validation to asynchronous post-receive pipelines that can revert via backout if needed.
Tune hook timeouts and reduce scope with targeted checks using revsets.
Schedule repack/maintenance windows; deploy read replicas for CI pulls.

# Example: limit hook scope to the incoming revisions only (run inside a pretxnchangegroup hook)
hg log -r "$HG_NODE:$HG_NODE_LAST"

Cause: subrepo and largefiles drift

Subrepositories pin revisions by value; inconsistent updates cause “works on my machine” builds. Largefiles standins diverge if content store is incomplete.

Fix:

Gate pushes that modify .hgsub and .hgsubstate without coordinated updates across repos.
Mirror largefiles stores and verify presence during CI bootstrap.
Prefer content-addressable artifact stores for big binaries; prune largefiles usage where feasible.

hg showconfig subpaths
hg verify --large
hg outgoing --large

Step-by-Step Repair Procedures

Repair multiple heads on a named branch

Pick a canonical tip, merge the rest, and publish once.

# List heads
hg heads -r "branch(default)"
# Choose a base and merge sequentially
hg update HEAD1
hg merge HEAD2
hg commit -m "Merge head2 into default"
# Repeat as needed, then push
hg push

To prevent recurrence, enforce a hook that rejects new heads and route writes through a single integration bot.

Unhide and reconcile obsolete revisions

When changes appear missing, check phases and obsolescence; convert necessary commits to public carefully.

hg log --hidden -r "hidden()"
hg phase --public -r REV
hg pull
hg evolve --all  # if evolve is enabled

If markers were never propagated, reconstruct with debugobsolete and evolve to a stable DAG.

Recover from working-copy or dirstate damage

Always try recovery before destructive operations.

hg recover
hg update -C .
# If still broken, rebuild from a clean clone and reapply local changes
# (--git keeps renames and binary changes in the patch)
hg diff --git > /tmp/patch.diff
hg clone ssh://server/repo clean
cd clean
hg import --no-commit /tmp/patch.diff
hg commit -m "Reapply local work"

Stabilize CI with bundle caches

Cut clone times by shipping prebuilt bundles to runners and layering incremental pulls on top.

# Nightly on server (add a stream --type where your version supports it; see "hg help bundlespec")
hg -R /srv/repo bundle --all /srv/cache/repo-$(date +%F).hg
ln -sf /srv/cache/repo-$(date +%F).hg /srv/cache/repo-latest.hg
# CI job prolog
hg clone /srv/cache/repo-latest.hg .
hg pull -u ssh://server/repo
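CI prologs benefit from a fallback when the cached bundle is missing. The following sketch builds the command sequence rather than running it, which keeps the logic testable. The function name and its parameters are hypothetical, and the -U flag (skip the initial checkout, since the later pull -u updates anyway) is a choice of this sketch rather than something the procedure above requires.

```python
import os


def bootstrap_commands(bundle_path, remote, dest, bundle_exists=None):
    """Return the hg commands a CI job would run to obtain a fresh checkout."""
    if bundle_exists is None:
        bundle_exists = os.path.exists(bundle_path)
    if bundle_exists:
        return [
            # Seed the store from the local bundle without checking out files.
            ["hg", "clone", "-U", bundle_path, dest],
            # Layer the changesets that landed after the bundle was built.
            ["hg", "-R", dest, "pull", "-u", remote],
        ]
    # No cached bundle available: fall back to a plain network clone.
    return [["hg", "clone", remote, dest]]
```

Each returned list can be passed to subprocess.run(cmd, check=True) in order.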

Resolve subrepo mismatches

Audit subrepo state and update atomically across repos.

hg status -S
cat .hgsub .hgsubstate
hg pull -u
hg update -C

Lock policy: changes to .hgsub require review by repo owners and synchronized releases.
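Drift between two checkouts can be detected by comparing their .hgsubstate files. The sketch below assumes the usual layout of one "revision path" pair per line; both function names are hypothetical.

```python
def parse_hgsubstate(text):
    """Parse .hgsubstate content into a {subrepo_path: revision} mapping."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The revision comes first; the rest of the line is the path.
        revision, _, path = line.partition(" ")
        state[path] = revision
    return state


def substate_drift(expected, actual):
    """Return {path: (expected_rev, actual_rev)} for every mismatching subrepo."""
    drift = {}
    for path in sorted(set(expected) | set(actual)):
        if expected.get(path) != actual.get(path):
            drift[path] = (expected.get(path), actual.get(path))
    return drift
```

A CI step might parse the .hgsubstate of the release branch and of the build checkout, then fail the job when substate_drift returns a non-empty mapping.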

Performance Engineering and Tuning

Store layout and maintenance

Modernize to formats that reduce delta depth and accelerate lookups.

# Confirm generaldelta and treemanifest if applicable
cat .hg/requires
# Run verify periodically on mirrors
hg verify

Schedule maintenance to repack or compact stores during off-peak hours; snapshot after verify to create known-good restore points.

Networking and protocol considerations

Use persistent connections, SSH control masters, and HTTP keep-alive. Align server CPU and I/O capacity with concurrent CI traffic.

# ~/.ssh/config
Host hg-server
  HostName hg.company.internal
  User hg
  ControlMaster auto
  ControlPath ~/.ssh/cm-%r@%h:%p
  ControlPersist 600

Client-side caches and sparse workflows

Adopt narrow clones for teams that only need subtrees, and leverage share to reuse local stores between working copies.

# Share a store between sandboxes
hg share /repos/monorepo /work/monorepo-dev
hg share /repos/monorepo /work/monorepo-experiment
# Sparse/narrow example
hg clone --narrow ssh://hg-server/monorepo app-only -r default --include path:services/app

Revsets for targeted operations

Use revsets to limit command scope and avoid scanning the full DAG.

hg log -r "branch(default) and date('-7')"
hg grep -r "::tip and not obsolete()" "funcName"
hg status -r "ancestor(.)::."

Windows-specific tuning

Enable fncache and dotencode, avoid long paths, and disable real-time scanning of .hg. Prefer NTFS with short paths and exclude repo directories from Defender.

Governance, Workflows, and Policy

Named branches for releases, bookmarks for feature work

Named branches create long-lived lines with clear ancestry, ideal for releases and sustained maintenance. Bookmarks serve fast-moving feature work and are easy to delete or move. Codify rules to prevent push-time surprises.

Phases policy

Public history is immutable; draft history may be rewritten. Make the server the authority for public transitions, and ensure CI never publishes by accident.

[phases]
# Non-publishing repositories (for example, developer or staging repos)
publish = False
# Publishing server: set publish = True (the default)

Hook strategy

Hooks should be fast, deterministic, and idempotent. Reject new heads on protected branches, validate metadata, and defer heavy scanning to asynchronous pipelines.

[hooks]
pretxnchangegroup.reject_heads = python:hooks.rejectmultipleheads
pretxnchangegroup.branch_policy = python:hooks.enforcebranchpolicy

Backouts over force-pushes

When mistakes land in public history, backout creates an explicit corrective commit that preserves auditability. Force-push equivalents undermine reproducibility and confuse replicas.

Observability and Forensics

Server logs and changegroup tracing

Enable structured logs on servers and correlate with CI job IDs. Track changegroup size, head count deltas, and hook timings.

Client tracing

Use HGPLAIN=1 and --debug for reproducible output; capture timings with --time and stack traces with --traceback during failures.

HGPLAIN=1 hg pull --debug
HGPLAIN=1 hg push --debug

Integrity checks

Run hg verify regularly on mirrors and before backups. Verify largefiles stores for completeness; reconcile missing blobs promptly.

hg verify
hg verify --large

Disaster Recovery and Safe Restore

Bundle-first backups

Bundles capture a portable snapshot of changesets and are independent of filesystem peculiarities. Store recurring full and incremental bundles offsite.

# Full bundle
hg -R /srv/repo bundle --all /backups/repo-full-$(date +%F).hg
# Incremental since last tag
hg -R /srv/repo bundle --base "last(tag())" /backups/repo-incr-$(date +%F).hg

Cold restore procedure

Stand up a fresh server, restore from the latest verified bundle, then reopen for incremental pulls.

hg init /srv/repo
hg -R /srv/repo unbundle /backups/repo-full-YYYY-MM-DD.hg
hg -R /srv/repo unbundle /backups/repo-incr-YYYY-MM-DD.hg
hg -R /srv/repo verify
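Choosing which bundles to apply, and in what order, is easy to get wrong under pressure. The sketch below assumes the repo-full-DATE.hg and repo-incr-DATE.hg naming used in the backup examples and ISO dates that sort lexicographically; plan_restore is a hypothetical helper.

```python
import re

# Naming scheme assumed from the backup examples: repo-(full|incr)-YYYY-MM-DD.hg
BACKUP_RE = re.compile(r"^repo-(full|incr)-(\d{4}-\d{2}-\d{2})\.hg$")


def plan_restore(filenames):
    """Return the bundles to unbundle, in order: latest full, then later incrementals."""
    fulls, incrs = [], []
    for name in filenames:
        match = BACKUP_RE.match(name)
        if not match:
            continue  # ignore unrelated files in the backup directory
        kind, date = match.groups()
        (fulls if kind == "full" else incrs).append((date, name))
    if not fulls:
        raise ValueError("no full bundle found")
    full_date, full_name = max(fulls)
    # Keep incrementals taken on or after the day of the chosen full bundle.
    later = sorted(item for item in incrs if item[0] >= full_date)
    return [full_name] + [name for _, name in later]
```

Unbundling changesets that are already present is harmless, so applying every later incremental in date order is a conservative default.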

Post-incident reconciliation

After restore, compare tips, heads, and phases with surviving clones. Recreate missing bookmarks and reapply protected branch policies before accepting new pushes.

Advanced Pitfalls and Mitigations

Case-collision files on case-insensitive filesystems

Two paths differing only by case cause silent chaos on Windows or macOS default mounts. Add pre-receive checks that reject such changes.

# List tracked paths at tip that differ only by case
hg files -r tip | tr '[:upper:]' '[:lower:]' | sort | uniq -d
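A server-side check needs to catch collisions in directory names as well as file names. The following sketch works on any list of tracked paths (for example, the output of hg files). It assumes forward-slash separators, and find_case_collisions is a hypothetical helper name.

```python
from collections import defaultdict


def find_case_collisions(paths):
    """Return groups of path prefixes that differ only by letter case."""
    groups = defaultdict(set)
    for path in paths:
        parts = path.split("/")
        # Check every prefix so that colliding directories are caught too.
        for i in range(1, len(parts) + 1):
            prefix = "/".join(parts[:i])
            groups[prefix.casefold()].add(prefix)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)
```

A hook could reject the incoming changegroup whenever the function returns a non-empty list for the files at the new tip.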

Timestamp skew and dirstate confusion

Mercurial relies on mtimes for fast status checks. Skewed clocks force expensive scans or stale metadata. Normalize NTP and enforce stable clocks on CI workers.
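One visible symptom of clock skew is a working-copy file whose modification time lies in the future. The sketch below scans for such files. It is a heuristic under stated assumptions, not a Mercurial feature: the two-second tolerance is arbitrary and find_future_mtimes is a hypothetical helper.

```python
import os
import time


def find_future_mtimes(root, now=None, tolerance=2.0):
    """Return relative paths under root whose mtime is later than now + tolerance."""
    if now is None:
        now = time.time()
    suspicious = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip repository metadata; only working-copy files matter here.
        dirnames[:] = [d for d in dirnames if d != ".hg"]
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(full).st_mtime
            except OSError:
                continue  # file vanished between listing and stat
            if mtime > now + tolerance:
                suspicious.append(os.path.relpath(full, root))
    return sorted(suspicious)
```

Running it on a CI worker right after checkout gives a quick signal that the worker's clock, or the clock of whatever produced the files, needs attention.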

Partial obsolescence marker propagation

Without consistent marker exchange, evolve shows pruned or orphan lines only on some clients. Make evolve a managed, versioned dependency and require minimum versions.

Binary bloat in history

Big binaries inflate clone and pull sizes. Move to artifact repositories; prune or quarantine legacy blobs via narrow history where policy allows.

Best Practices Checklist

Repository and server

Adopt modern store formats and periodically run hg verify.
Generate and cache stream bundles for CI and remote sites.
Harden server hooks to prohibit new heads on protected branches.
Collect metrics on clone time, changegroup size, and head counts.

Workflow and policy

Public history is append-only; backout rather than rewrite.
Use bookmarks for short-lived work; named branches for releases.
Standardize phases; developers set publish=False locally.
Gate subrepo changes and largefiles usage.

Operations and reliability

Exempt .hg from antivirus and backup agents.
Schedule maintenance windows for heavy server tasks.
Automate disaster recovery with bundle-based backups and restore drills.
Document recovery procedures and keep runbooks current.

Conclusion

Mercurial’s design delivers integrity and performance, but large-scale deployments magnify subtle behaviors around phases, obsolescence, locks, and storage. The most effective fixes blend process and technology: enforce smart branch policies, stabilize phases, cache and stream history to hungry CI fleets, and make integrity checks routine. Treat the repository as critical infrastructure with observability, runbooks, and disaster recovery rehearsals. With these practices, you transform recurring fire drills into predictable, low-risk operations while preserving a clean, comprehensible history for your teams.

FAQs

1. How do phases actually prevent accidental history rewrites?

Public changesets are immutable by policy; commands like rebase and histedit refuse to rewrite them. Enforcing server-side phase transitions ensures that once integrated, history becomes stable, reducing accidental divergence.

2. When should we choose named branches over bookmarks in Mercurial?

Use named branches for long-lived release lines that require maintenance over time. Bookmarks fit short-lived or ephemeral feature work where rebasing and rapid iteration are common without polluting the permanent branch namespace.

3. What’s the safest way to speed up clones for CI without sacrificing integrity?

Publish verified stream bundles on the server and clone from those artifacts, then pull increments from the canonical remote. This approach minimizes network round-trips while preserving cryptographic integrity and auditability.

4. How do we recover if the working copy becomes inconsistent after a crash?

Start with hg recover, then force a clean update with hg update -C. If inconsistencies persist, export local diffs, reclone from a trusted remote, and reimport to ensure a pristine dirstate.

5. Can we safely use evolve and obsolescence in a heterogeneous toolchain?

Yes, but only with disciplined version management and marker propagation policies. Require minimum client versions, test marker exchange in staging, and verify that CI and bots participate fully in the obsolescence protocol.
