Understanding NaN Propagation in Theano Graphs
Symptoms and Impact
- Training loss suddenly becomes NaN or Inf without explicit error
- Model weights diverge or explode in value
- Gradients become zero or NaN, halting effective learning
- Debugging is difficult due to symbolic graph optimization and lazy evaluation
Why It Matters
NaN propagation leads to wasted training cycles, invalid checkpoints, and squandered compute. In reproducibility-sensitive research, these failures can silently invalidate results. In production, silent instability puts models at risk of undetected errors.
Theano Architecture and Debugging Challenges
Symbolic Graphs and Lazy Evaluation
Theano defines computation symbolically and compiles it into optimized C/CUDA code. Errors (like NaNs) may not surface until a specific graph node is executed, complicating diagnosis.
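A minimal sketch of this deferred failure mode (variable names are illustrative): the NaN-producing log is defined symbolically, but the bad values only materialize when the compiled function is called.

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.dvector('x')
    y = T.log(x)                   # symbolic only; nothing is computed here
    f = theano.function([x], y)    # compiled into optimized C code

    print(f(np.array([1.0, 0.0, -1.0])))  # [0., -inf, nan] surfaces only at call time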
Graph Optimizations
Theano aggressively optimizes graphs, sometimes fusing or reordering operations. This can make step-by-step debugging unintuitive, as operations may not execute in Python call order.
Root Causes of NaN and Instability
1. Numeric Instability in Activation Functions
Functions like exp, log, or sigmoid can easily overflow or underflow, especially with unnormalized inputs or high learning rates.
2. Divide-by-Zero or Log-of-Zero Operations
Common in loss functions (e.g., cross-entropy), where zero probabilities are passed to log, resulting in -Inf or NaN.
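For instance, a hand-rolled binary cross-entropy (a hypothetical sketch, not Theano's built-in T.nnet.binary_crossentropy) goes NaN the moment a predicted probability hits exactly 0 or 1:

    import numpy as np
    import theano
    import theano.tensor as T

    p, y = T.dvectors('p', 'y')    # predicted probabilities and 0/1 targets
    xent = -T.mean(y * T.log(p) + (1 - y) * T.log(1 - p))
    f = theano.function([p, y], xent)

    # p == 0 where y == 1 gives 1 * log(0) = -inf; 0 * log(0) evaluates to nan
    print(f(np.array([0.0, 1.0]), np.array([1.0, 1.0])))  # nan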
3. Exploding or Vanishing Gradients
Deep or recurrent models without gradient clipping can quickly produce NaNs, especially with poor initialization or high momentum/learning rates.
4. GPU/CPU Kernel Inconsistencies
Some numeric issues only arise on specific hardware or Theano backends (e.g., CUDA), causing non-reproducible NaN bugs between runs or machines.
5. Silent Graph Optimization Bugs
Certain Theano optimizations may skip error checking or mask intermediate NaNs due to fused operations.
Diagnostics and Debugging
1. Insert theano.printing.Print or theano.printing.pprint

    from theano.printing import Print

    # wrap a symbolic variable; its runtime value is printed whenever the node executes
    output = Print('DEBUG')(some_variable)
Force evaluation and printing of intermediate nodes to detect where NaNs appear.
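One subtlety worth noting: the Print-wrapped variable must feed the function's output, or the node is pruned from the graph and nothing is printed. A minimal end-to-end sketch:

    import numpy as np
    import theano
    import theano.tensor as T
    from theano.printing import Print

    x = T.dvector('x')
    h = Print('pre-log')(x)        # tag the value we want to inspect
    y = T.log(h)                   # use the wrapped variable downstream
    f = theano.function([x], y)

    f(np.array([1.0, 0.0]))        # prints x tagged 'pre-log', then returns [0., -inf]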
2. Use NanGuardMode for Runtime Detection

    from theano.compile.nanguardmode import NanGuardMode

    f = theano.function(
        [...], ...,
        mode=NanGuardMode(nan_is_error=True, inf_is_error=True, big_is_error=True),
    )
Raises explicit errors as soon as NaN or Inf is detected during execution, pinpointing the culprit operation.
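A concrete sketch that trips the guard (the exact exception text varies by Theano version):

    import numpy as np
    import theano
    import theano.tensor as T
    from theano.compile.nanguardmode import NanGuardMode

    x = T.dvector('x')
    f = theano.function(
        [x], T.log(x),
        mode=NanGuardMode(nan_is_error=True, inf_is_error=True, big_is_error=True),
    )
    f(np.array([1.0, 0.0]))  # raises an error identifying the node that produced inf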
3. Track and Log Model Weights
Periodically log min, max, and norm of weights/gradients during training to catch diverging values before NaNs propagate.
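A hypothetical helper along these lines, assuming params is a list of theano.shared variables:

    import numpy as np

    def log_param_stats(params, step):
        # summarize each parameter so divergence is visible before NaNs appear
        for p in params:
            v = p.get_value(borrow=True)
            print('step %d  %s: min=%.3g max=%.3g l2=%.3g'
                  % (step, p.name, v.min(), v.max(), np.sqrt((v ** 2).sum())))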
4. Isolate GPU vs. CPU Behavior
Test training with device=cpu and device=gpu (set via THEANO_FLAGS or .theanorc) to identify backend-specific numeric issues.
5. Reduce Graph Complexity During Debugging
Temporarily disable optimizations (optimizer=None) or simplify the model to identify problematic components.
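One way to do this from Python (a sketch; the Mode behavior and flag spelling depend on the Theano version) is to compile the function with the optimizer disabled:

    import theano
    import theano.tensor as T
    from theano.compile.mode import Mode

    x = T.dvector('x')
    y = T.exp(x) / (1.0 + T.exp(x))   # naive sigmoid, normally fused by the optimizer

    # no graph optimizations: slower, but nodes execute roughly as written
    f = theano.function([x], y, mode=Mode(optimizer=None))
    # equivalently, set the flag: THEANO_FLAGS=optimizer=None python train.py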
Step-by-Step Fix Strategy
1. Normalize Inputs and Initializations
Standardize input features and use robust weight initializations (e.g., Xavier/Glorot) to reduce overflow/underflow risk.
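A minimal Glorot/Xavier uniform initializer sketch (the layer shape and helper name are illustrative):

    import numpy as np
    import theano

    def glorot_uniform(fan_in, fan_out, rng=np.random):
        # Glorot & Bengio (2010): limit = sqrt(6 / (fan_in + fan_out))
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-limit, limit,
                           size=(fan_in, fan_out)).astype(theano.config.floatX)

    W = theano.shared(glorot_uniform(784, 256), name='W')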
2. Apply Gradient Clipping
    from theano import tensor as T

    # T.grad returns one gradient expression per parameter in params
    g = T.grad(cost, params)
    # elementwise clip each gradient into [-1, 1]
    g_clipped = [T.clip(grad, -1., 1.) for grad in g]
Limits the size of gradients, preventing explosion during backpropagation.
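To make the clipping take effect, feed the clipped list (not the raw gradients) into the update rule; a plain-SGD sketch, assuming cost, params, and the inputs x and y come from the surrounding model:

    lr = 0.01  # illustrative learning rate
    updates = [(p, p - lr * gc) for p, gc in zip(params, g_clipped)]
    train = theano.function([x, y], cost, updates=updates)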
3. Add Small Epsilon to Log and Divide Ops
Modify log(x) to log(x + 1e-8) and 1/x to 1/(x + 1e-8) to avoid undefined results when x is zero.
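Applied to the hand-rolled cross-entropy from earlier, the epsilon keeps both log terms finite (T.clip(p, eps, 1 - eps) is a common alternative that also bounds p away from 1):

    import theano.tensor as T

    p, y = T.dvectors('p', 'y')
    eps = 1e-8
    xent = -T.mean(y * T.log(p + eps) + (1 - y) * T.log(1.0 - p + eps))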
4. Use NanGuardMode in CI and Training
Wrap all Theano functions in NanGuardMode during development and CI to ensure no silent NaNs propagate into deployed models.
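One lightweight way to enforce this is a project-local wrapper (guarded_function is a hypothetical helper, not a Theano API):

    import theano
    from theano.compile.nanguardmode import NanGuardMode

    def guarded_function(*args, **kwargs):
        # default every compiled function to NaN/Inf checking unless a mode is given
        kwargs.setdefault(
            'mode',
            NanGuardMode(nan_is_error=True, inf_is_error=True, big_is_error=True),
        )
        return theano.function(*args, **kwargs)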
5. Tune Learning Rates and Optimizer Hyperparameters
Start with conservative learning rates and incrementally increase after confirming stable training. Reduce momentum or try alternate optimizers if divergence persists.
Best Practices
- Always test models on both CPU and GPU for reproducibility
- Integrate NanGuardMode in all new model training scripts
- Monitor loss and gradient statistics at every epoch
- Write unit tests for custom ops to handle edge-case values
- Document optimizer and hardware settings in experiment logs
Conclusion
NaN propagation and silent graph failures in Theano are symptoms of deeper numeric instability, exacerbated by symbolic optimization and legacy code. By leveraging Theano’s built-in diagnostic tools, carefully managing numerical operations, and adopting disciplined initialization and monitoring, teams can mitigate instability in critical ML and research pipelines. For legacy systems, robust debugging and runtime protection are essential for trustworthy results and reproducible science.
FAQs
1. Why do NaNs appear during Theano training?
Common causes include divide-by-zero, log-of-zero, exploding gradients, or hardware-specific bugs. NanGuardMode helps pinpoint these sources during execution.
2. Is Theano safe to use in production?
Theano is no longer actively maintained, so use it with caution. For new projects, migrate to frameworks like TensorFlow, PyTorch, or JAX. For legacy systems, enforce strict diagnostics.
3. How can I debug NaN issues without error messages?
Use theano.printing.Print to trace intermediate values and NanGuardMode to automatically trap NaNs during function execution.
4. Are NaN bugs hardware-dependent?
Yes, some numeric issues only appear on certain GPUs or with specific BLAS/CUDA libraries. Always test on multiple backends if possible.
5. Can I disable Theano graph optimizations for debugging?
Yes. Pass optimizer=None when compiling functions to get more predictable, debuggable execution traces at the cost of runtime speed.