C at Scale: Troubleshooting Async-Signal-Safety and Reentrancy Bugs in Enterprise Services

Details: Category: Programming Languages; By Mindful Chase; 14.Aug; Hits: 84

In sprawling, multi-threaded C services running on Linux or other POSIX systems, intermittent hangs, deadlocks, or mysterious crashes often trace back to a subtle class of faults: async-signal-safety and reentrancy violations. These bugs frequently lurk in error-handling or observability code—for example, logging from a signal handler, capturing a backtrace after SIGSEGV, or performing cleanup on SIGTERM. While small programs may get away with unsafe patterns, enterprise-grade daemons with many threads, complex allocators, and heavy I/O make undefined behavior far more likely. Diagnosing and fixing these issues demands a deep understanding of POSIX signal semantics, thread interactions, memory allocators, and kernel delivery mechanics. This article provides a rigorous, architecture-level troubleshooting guide to pinpoint and permanently eliminate async-signal-unsafe logic in production-grade C systems.

Background and Context

What "async-signal-unsafe" actually means

When a signal arrives, execution can jump to the signal handler at an arbitrary point in the receiving thread, potentially while that thread holds locks or is inside a non-reentrant library routine. POSIX defines a small set of functions that are async-signal-safe; if a handler interrupts an unsafe function and then calls anything outside that set, the behavior is undefined. Because you rarely control what a handler interrupts, treat every call outside the list as a bug. Commonly unsafe calls include malloc, free, printf, pthread_mutex_lock, dlopen/dlsym, and many time-formatting functions (for example localtime and strftime). Violations may appear to "work" under light load but will eventually deadlock or corrupt state in long-running services.
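As a point of reference, the smallest handler that stays inside the rules does nothing but set a flag. The sketch below is illustrative only: the names reload_requested, install_reload_flag, and consume_reload_request are hypothetical, and it assumes the main loop polls the flag from normal context.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical flag. volatile sig_atomic_t is the object type the C
 * standard allows a handler to write without further synchronization. */
static volatile sig_atomic_t reload_requested = 0;

static void on_sighup(int signo) {
    (void)signo;
    reload_requested = 1;   /* no locks, no allocation, no stdio */
}

/* Installs the flag-only handler; returns 0 on success, -1 on failure. */
int install_reload_flag(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sighup;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGHUP, &sa, NULL);
}

/* Call from normal context, e.g. once per event-loop iteration.
 * Returns 1 if at least one SIGHUP arrived since the last call. */
int consume_reload_request(void) {
    if (!reload_requested) return 0;
    reload_requested = 0;
    return 1;
}
```

Several signals that arrive between two polls collapse into one request, which is usually what a reload or shutdown flag wants.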

Why this matters in enterprise environments

Large services frequently use signals for lifecycle (SIGTERM), watchdogs (SIGALRM), hot reload (SIGHUP), crash handling (SIGSEGV/SIGBUS), CPU sampling (SIGPROF), and external control (SIGUSR1/2). Teams also layer complex logging, metrics, and tracing around those handlers. If handler code allocates memory, writes to buffered streams, or touches non-reentrant subsystems, it can deadlock the whole process or corrupt global state. The result is sporadic outages, inconsistent core dumps, and difficult-to-reproduce "Heisenbugs".

Typical symptoms

Process hangs that coincide with log rotation or shutdown signals
Core dumps missing expected stack frames or showing corrupted malloc arenas
Deadlocks involving libpthread or libc internals shortly after SIGTERM/SIGHUP
High CPU from a spin inside a handler attempting to acquire a locked mutex
Occasional "double free or corruption" after a crash handler "prints diagnostics"

Architectural Implications

Signals, threads, and allocators

In multi-threaded programs, a process-directed signal (for example one sent with kill) can be delivered to any thread that does not block it, while synchronous faults such as SIGSEGV are delivered to the thread that caused them. If a signal fires while the target thread is within malloc or free, and the handler calls malloc/free again (directly or indirectly through logging), the handler can deadlock on the allocator's internal locks or corrupt its data structures. Different allocators (glibc ptmalloc, jemalloc, tcmalloc) have distinct locking strategies, but none is safe to re-enter from a handler.

Buffered I/O and stdio locks

Calling printf, fprintf, or puts in a handler attempts to take stdio locks and may flush or allocate buffers. If the interrupted thread already holds a stdio lock, the handler will block forever. Even seemingly harmless formatting (e.g., snprintf) can allocate or invoke locale/memory routines.

Dynamic linking and loader state

Attempting dlopen, dlsym, or even obtaining backtraces that touch unwinder state (libunwind, backtrace()) is typically unsafe in a signal context. The dynamic loader might be in a transient state when interrupted.

Process lifecycle and supervision

Enterprise platforms rely on systemd, Kubernetes, or custom supervisors. Signals are used extensively: graceful termination (SIGTERM), forced kill (SIGKILL), hot reloads (SIGHUP), and watchdog timers. An unsafe handler breaks graceful semantics and may cause timeouts, cascading restarts, or data loss during shutdown.

Foundations: The minimal-safe signal pattern

Design tenets

Handlers must be minimal, returning quickly.
Only call POSIX async-signal-safe functions (e.g., write, _exit, kill, sigqueue, sigaction).
Save errno on entry and restore it before returning if the handler calls anything that can set it.
Never call malloc, stdio, pthread_mutex_lock, or perform complex parsing.
Communicate with a dedicated thread via a pipe/eventfd/signalfd rather than doing work in the handler.
Use sigaltstack for fatal signals to survive stack overflows.

Safe communication via a pipe

A classic, portable pattern is the self-pipe: the handler writes a byte to a non-blocking pipe, and the main event loop or a dedicated thread waits on the read end with select, poll, or epoll and performs the heavy work in normal thread context.

#include <signal.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <stdatomic.h>

static int sigpipe_fds[2] = {-1, -1};
static _Atomic int got_sigterm = 0;

static void set_nonblock(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    (void)fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

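static void term_handler(int signo) {
    int saved_errno = errno;   /* write() may clobber errno of the interrupted code */
    (void)signo;
    got_sigterm = 1;
    char c = 1;
    /* async-signal-safe: write; if the pipe is full (EAGAIN) a wakeup is already queued */
    (void)write(sigpipe_fds[1], &c, 1);
    errno = saved_errno;
}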

int install_handlers(void) {
    if (pipe(sigpipe_fds) == -1) return -1;
    set_nonblock(sigpipe_fds[0]);
    set_nonblock(sigpipe_fds[1]);

    struct sigaction sa = {0};
    sa.sa_handler = term_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGTERM, &sa, NULL) == -1) return -1;
    return 0;
}

/* elsewhere: poll/epoll/select on sigpipe_fds[0] and act on got_sigterm */

Diagnostic Process

1) Reproduce under controlled delivery

Use kill -TERM and kill -HUP while the service is under load. Combine with strace -f -ttT -e trace=signal,write,read,futex to observe handler activity and verify that only async-signal-safe syscalls appear between the signal delivery line and the matching rt_sigreturn. A futex call inside that window usually means the handler is contending on a libc or pthread lock, which is a red flag for unsafe code.

2) Inspect core dumps for allocator lock contention

Enable core dumps and use gdb to inspect backtraces of all threads. Look for frames inside malloc/free/_int_malloc (glibc) or equivalent in jemalloc/tcmalloc. If a handler interrupted the allocator and then attempted to allocate, you will often find one thread inside a signal trampoline with the next frame in write (safe) or worse, in fprintf/vsnprintf (unsafe). Multiple threads parked in futex wait states within libc can indicate deadlocks.

3) Audit handlers with static and dynamic analysis

Search code for sigaction, signal, and handler functions. Verify that handlers only call async-signal-safe functions.
Use compiler instrumentation: -fsanitize=address, -fsanitize=thread to catch races outside the handler path; while sanitizers are not signal-safe themselves, they can expose suspicious usage when a handler triggers code that should not run.
Review link maps and nm output if your handler references nontrivial symbols (e.g., logging library entry points).

4) Trace delivery and masking

Confirm which thread receives signals. In multithreaded systems, set a mask with pthread_sigmask to block signals in worker threads and create a dedicated "signal thread" using sigwaitinfo or signalfd on Linux. This confines complex behavior to a safe context and removes randomness from delivery.
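One way to confirm masking from inside the process is to query the calling thread's mask. The following is a minimal sketch; block_control_signals and signal_is_blocked_here are hypothetical helper names, and the set of control signals is an assumption that mirrors the ones discussed in this article.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>

/* Blocks the control signals in the calling thread. Threads created
 * afterwards inherit this mask. Returns 0 or an errno-style code. */
int block_control_signals(void) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGTERM);
    sigaddset(&set, SIGHUP);
    sigaddset(&set, SIGUSR1);
    return pthread_sigmask(SIG_BLOCK, &set, NULL);
}

/* Returns 1 if signo is blocked in the calling thread, 0 if it is not,
 * and -1 on error. With a NULL new set, pthread_sigmask only reports
 * the current mask and changes nothing. */
int signal_is_blocked_here(int signo) {
    sigset_t cur;
    if (pthread_sigmask(SIG_SETMASK, NULL, &cur) != 0) return -1;
    return sigismember(&cur, signo);
}
```

Calling signal_is_blocked_here at the top of each worker's entry function (in a debug build, for example) catches threads that were created before the mask was set.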

5) Validate alt signal stacks for fatal signals

Memory corruption and stack overflow can prevent handlers from running if they share the default stack. Use sigaltstack to register an alternate stack for SIGSEGV/SIGBUS/SIGFPE handlers that need to do minimal crash reporting. Make sure any stack you use is mapped and writeable early at startup.

Common Pitfalls

Logging inside handlers

Using printf, fprintf, syslog(3), or C++ stream wrappers from a handler is unsafe. Even simple string formatting can acquire locks or allocate memory. Prefer write(STDERR_FILENO, ...) with preformatted, fixed-size buffers, or better, signal a logging thread through a pipe.
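If a handler really has to emit a number (a signal number, say), the formatting can be done by hand into a stack buffer. This is a sketch under that assumption; fmt_u64_dec and log_signal_number are hypothetical helpers, not part of any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Formats v as decimal into buf (capacity cap) without calling any library
 * function. Returns the number of bytes stored, or 0 if cap is too small.
 * No NUL terminator is written. */
size_t fmt_u64_dec(char *buf, size_t cap, uint64_t v) {
    char tmp[20];                      /* UINT64_MAX has 20 decimal digits */
    size_t n = 0;
    do {
        tmp[n++] = (char)('0' + (int)(v % 10));
        v /= 10;
    } while (v != 0);
    if (n > cap) return 0;
    for (size_t i = 0; i < n; i++) buf[i] = tmp[n - 1 - i];   /* reverse */
    return n;
}

/* Writes "signal <n>\n" to fd using only write(2). errno is preserved so
 * the interrupted code does not observe a change. Returns write's result. */
ssize_t log_signal_number(int fd, int signo) {
    int saved_errno = errno;
    char line[32] = "signal ";
    size_t len = 7;                    /* length of "signal " */
    len += fmt_u64_dec(line + len, sizeof line - len - 1, (uint64_t)signo);
    line[len++] = '\n';
    ssize_t r = write(fd, line, len);
    errno = saved_errno;
    return r;
}
```

A handler would call log_signal_number(STDERR_FILENO, signo); everything it touches lives on the handler's own stack.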

Calling malloc/free or functions that allocate

Any function that might allocate (directly or indirectly) is unsafe. That includes backtrace libraries, high-level time formatting, JSON builders, or metrics emission. If you cannot prove it is on the POSIX async-signal-safe list, assume it is unsafe.

Mutexes and critical sections

Handlers that attempt to acquire a pthread_mutex_t or pthread_rwlock_t can deadlock if the interrupted context was holding the same lock. Spinlocks are not safe either; you may spin forever if the interrupted context was inside the critical section.

Relying on SA_RESTART to "fix" I/O patterns

SA_RESTART causes some syscalls to restart automatically after handler return, but it does not make handler code safe. In fact, SA_RESTART can hide EINTR handling bugs and complicate debugging. Design your I/O to handle EINTR gracefully regardless.

Crash handlers that do too much

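It is tempting to collect a backtrace, flush logs, and upload diagnostics in a SIGSEGV handler. Most of that is unsafe. A robust pattern is to write a minimal marker to a pipe or stderr and terminate immediately, letting a supervising process analyze the core dump asynchronously. Note that _exit ends the process without producing a core; if you want the core dump, restore the default disposition and re-raise the signal instead.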

Step-by-Step Fixes

1) Centralize signal ownership with a signal thread

Block all relevant signals in worker threads using pthread_sigmask, then dedicate one thread to synchronously wait for signals via sigwaitinfo. This moves handling logic to a normal thread context where you can use locks, allocators, and logging safely.

#define _GNU_SOURCE
#include <signal.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *signal_thread_fn(void *arg) {
    (void)arg;
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGTERM);
    sigaddset(&set, SIGHUP);
    sigaddset(&set, SIGUSR1);
    for (;;) {
        siginfo_t si;
        int sig = sigwaitinfo(&set, &si);
        if (sig == -1) continue; /* log in normal context */
        if (sig == SIGTERM) {
            /* perform graceful shutdown: safe to use stdio here */
            printf("Received SIGTERM, shutting down...\n");
            break;
        } else if (sig == SIGHUP) {
            printf("Reload requested via SIGHUP\n");
        } else if (sig == SIGUSR1) {
            printf("Diagnostics requested via SIGUSR1\n");
        }
    }
    return NULL;
}

/* Returns 0 on success or an errno-style code from the failing pthread call. */
int install_signal_thread(void) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGTERM);
    sigaddset(&set, SIGHUP);
    sigaddset(&set, SIGUSR1);
    /* block in all threads before creating workers; new threads inherit the mask */
    int rc = pthread_sigmask(SIG_BLOCK, &set, NULL);
    if (rc != 0) return rc;
    pthread_t tid;
    return pthread_create(&tid, NULL, signal_thread_fn, NULL);
}

2) Use signalfd on Linux for integration with event loops

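signalfd turns pending signals into fixed-size struct signalfd_siginfo records that are read from a file descriptor, which can be polled like a socket. This simplifies integration with epoll/select-based reactors and avoids installing traditional handlers for non-fatal signals. The signals must stay blocked in every thread, otherwise they are delivered through their normal disposition instead of the descriptor.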

#include <sys/signalfd.h>
#include <sys/epoll.h>
#include <signal.h>
#include <unistd.h>

int setup_signalfd(void) {
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGTERM);
    sigaddset(&mask, SIGHUP);
    pthread_sigmask(SIG_BLOCK, &mask, NULL);
    int sfd = signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC);
    return sfd;
}
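The read side of a signalfd might look like the sketch below. It repeats a small setup function so that it is self-contained; open_control_signalfd and read_one_signal are hypothetical names, and the choice of SIGTERM and SIGHUP simply mirrors the setup above. Linux only.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <unistd.h>

/* Blocks SIGTERM and SIGHUP in the calling thread and returns a
 * non-blocking signalfd for them, or -1 on failure. */
int open_control_signalfd(void) {
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGTERM);
    sigaddset(&mask, SIGHUP);
    if (pthread_sigmask(SIG_BLOCK, &mask, NULL) != 0) return -1;
    return signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC);
}

/* Reads one record. Returns the signal number, 0 if nothing is pending
 * (the descriptor is non-blocking), or -1 on error. */
int read_one_signal(int sfd) {
    struct signalfd_siginfo si;
    for (;;) {
        ssize_t r = read(sfd, &si, sizeof si);
        if (r == (ssize_t)sizeof si) return (int)si.ssi_signo;
        if (r == -1 && errno == EINTR) continue;
        if (r == -1 && errno == EAGAIN) return 0;
        return -1;
    }
}
```

In an epoll loop, read_one_signal would be called repeatedly after the descriptor becomes readable until it returns 0.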

3) Implement minimal fatal-signal handling with sigaltstack

For SIGSEGV/SIGBUS/SIGFPE/SIGILL, do the bare minimum: write a short, fixed message and terminate. Calling _exit ends the process without a core dump; to keep the core, restore the default disposition and re-raise the signal instead. If you must dump a backtrace, prefer invoking a separate helper process via fork (still risky) or rely on core dumps and external tooling (e.g., systemd-coredump). Install an alternate stack so the handler runs even when the main stack is invalid.

#include <signal.h>
#include <unistd.h>

static char altstack_mem[64 * 1024];

static void fatal_handler(int signo, siginfo_t *si, void *ctx) {
    (void)si;
    (void)ctx;
    /* Fixed message in static storage: avoid stdio, malloc, and strlen */
    static const char msg[] = "Fatal signal\n";
    /* best effort; ignore errors */
    (void)write(STDERR_FILENO, msg, sizeof(msg) - 1);
    /* Restore the default action and re-raise. The signal is blocked while
       this handler runs, so it is delivered when the handler returns and the
       kernel can then write a core dump. Calling _exit here would suppress it. */
    signal(signo, SIG_DFL);
    raise(signo);
}

void install_fatal_handlers(void) {
    /* sigaltstack is per-thread: this registers a stack for the calling thread only */
    stack_t ss = {0};
    ss.ss_sp = altstack_mem;
    ss.ss_size = sizeof(altstack_mem);
    ss.ss_flags = 0;
    sigaltstack(&ss, NULL);

    struct sigaction sa = {0};
    sa.sa_sigaction = fatal_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS,  &sa, NULL);
    sigaction(SIGFPE,  &sa, NULL);
    sigaction(SIGILL,  &sa, NULL);
}

4) Replace unsafe logging paths

Remove all stdio-based logging from handlers. If you must log synchronously, pre-allocate a ring buffer at startup and only use write with fixed offsets in the handler. Better yet, send a single byte to a pipe and let the logging thread emit a structured message.

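/* Handler-side: */
static int log_pipe[2];   /* created with pipe() at startup, before the handler is installed */
static void hup_handler(int signo){ int e=errno; (void)signo; char b=1; (void)write(log_pipe[1], &b, 1); errno=e; }

/* Logger thread: */
void *logger_thread(void *arg){
    (void)arg;
    for(;;){
        char b; ssize_t r = read(log_pipe[0], &b, 1);
        if (r == 1) { /* now safe to log */ continue; }
        if (r == -1 && errno == EINTR) continue;
        break;   /* EOF or hard error: stop instead of spinning */
    }
    return NULL;
}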

5) Harden EINTR handling and restart logic

Make all I/O paths resilient to EINTR. Wrap syscalls with retry loops unless semantics demand interruption (e.g., timeouts). If you depend on SA_RESTART, document it explicitly and test without it to ensure robustness.

#include <errno.h>
#include <unistd.h>

/* read() that retries when interrupted by a signal */
ssize_t r_read(int fd, void *buf, size_t n){
    for(;;){
        ssize_t r = read(fd, buf, n);
        if (r >= 0) return r;
        if (errno == EINTR) continue;
        return -1;
    }
}

/* write() the whole buffer, retrying on EINTR and on short writes */
ssize_t r_write(int fd, const void *buf, size_t n){
    const char *p = buf; size_t left = n;
    while (left){
        ssize_t r = write(fd, p, left);
        if (r > 0){ p += r; left -= (size_t)r; continue; }
        if (r == -1 && errno == EINTR) continue;
        return -1;
    }
    return (ssize_t)n;
}

6) Test with fault injection and stress tools

Integrate tests that bombard the process with signals while running high load. Combine stress-ng or internal load generators with kill -USR1/-HUP/-TERM storms. Verify that logs contain a stable sequence of "signal received" events and that no deadlocks occur.
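When writing assertions for such storm tests, keep in mind that standard (non-realtime) signals do not queue: on Linux, several instances sent while the signal is pending collapse into one delivery, so "N sent" does not imply "N handled". The sketch below demonstrates this in-process; coalescing_demo and count_usr1 are hypothetical names, and the single-delivery result reflects Linux behavior.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t usr1_deliveries = 0;

static void count_usr1(int signo) {
    (void)signo;
    usr1_deliveries++;   /* the signal is masked while its handler runs */
}

/* Sends SIGUSR1 `sends` times to the calling thread while it is blocked,
 * then unblocks it. Returns how many times the handler ran, or -1. */
int coalescing_demo(int sends) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = count_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) == -1) return -1;

    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    if (pthread_sigmask(SIG_BLOCK, &set, NULL) != 0) return -1;

    usr1_deliveries = 0;
    for (int i = 0; i < sends; i++)
        if (raise(SIGUSR1) != 0) return -1;

    /* The pending signal is delivered before this call returns. */
    if (pthread_sigmask(SIG_UNBLOCK, &set, NULL) != 0) return -1;
    return (int)usr1_deliveries;
}
```

A storm test should therefore assert on outcomes (no hang, state eventually consistent, at least one event observed) rather than on an exact count of handled signals.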

Deep Dive: Interactions that amplify risk

Allocators and reentrancy

glibc's ptmalloc and alternative allocators use internal mutexes and per-thread caches. If a handler interrupts a critical section and then allocates, it can deadlock on the same lock or corrupt internal lists. Even if your handler appears to avoid allocation, hidden allocations can occur within locale handling, DNS resolution, and logging frameworks.

Backtrace collection pitfalls

backtrace() from execinfo.h may allocate and is not guaranteed async-signal-safe. On some systems it works "well enough"; on others, it deadlocks. Prefer external core dumps (/proc/sys/kernel/core_pattern, systemd-coredump) and offline symbolization with addr2line or eu-addr2line. If you insist on in-process stack capture, consider forking a child from the handler that immediately execs an external helper to inspect the parent; even then, only async-signal-safe functions may be used in the child before exec/_exit, which rules out calling backtrace() there.

Realtime signals and queuing

Realtime signals (SIGRTMIN..SIGRTMAX) queue with payloads and can overwhelm your process if not serviced quickly. If handlers perform expensive work, the queue backs up and latency spikes. Use sigqueue judiciously and prefer eventfd or pipes for high-rate signaling.
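For low-rate use, a queued realtime signal with a payload can be consumed synchronously instead of through a handler. The following sketch assumes the receiving thread keeps the signal blocked; rt_block, rt_send, and rt_receive are hypothetical helpers.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/* Blocks signo in the calling thread so it can be waited for synchronously. */
int rt_block(int signo) {
    sigset_t s;
    sigemptyset(&s);
    sigaddset(&s, signo);
    return pthread_sigmask(SIG_BLOCK, &s, NULL);
}

/* Queues signo with an integer payload. sigqueue fails with EAGAIN when
 * the per-user limit on queued signals has been reached. */
int rt_send(pid_t pid, int signo, int payload) {
    union sigval v;
    v.sival_int = payload;
    return sigqueue(pid, signo, v);
}

/* Waits up to timeout_ms for signo. Returns 1 and stores the payload,
 * 0 on timeout, -1 on error. */
int rt_receive(int signo, int timeout_ms, int *payload) {
    sigset_t s;
    sigemptyset(&s);
    sigaddset(&s, signo);
    struct timespec ts;
    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (long)(timeout_ms % 1000) * 1000000L;
    siginfo_t si;
    for (;;) {
        int r = sigtimedwait(&s, &si, &ts);
        if (r == signo) { *payload = si.si_value.sival_int; return 1; }
        if (r == -1 && errno == EINTR) continue;
        if (r == -1 && errno == EAGAIN) return 0;   /* timed out */
        return -1;
    }
}
```

Senders should treat an EAGAIN from rt_send as back-pressure rather than retrying in a tight loop.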

setuid binaries and security constraints

In privileged programs, async-signal-unsafe behavior is a security problem as well as a reliability problem: a reentrancy window that merely deadlocks an ordinary service may be exploitable when an attacker can influence when signals arrive. Keep handlers minimal and avoid exposing reentrancy windows that could be leveraged by malicious inputs.

Systematic Audit Checklist

Source review

Enumerate all sigaction/signal sites and documented signals.
For each handler, list every function call. Cross-check against the POSIX async-signal-safe list.
Replace non-compliant calls with pipe/signalfd notifications.

Runtime verification

Run under strace with signal filters; ensure only expected syscalls occur inside handler windows.
Use perf or bcc/eBPF tooling to sample when handlers fire and observe contention.
Capture core dumps for any hang or crash and inspect for libc/allocator lock owner/waiter relationships.

Operations readiness

Document signal behavior for SREs (which signals do what, how long shutdown should take).
Ensure systemd or orchestrator timeouts reflect real shutdown durations.
Provide debugging switches to disable noncritical signals in emergencies.

Patterns to Adopt

Self-pipe for delivery fan-in

One pipe can represent many signals by writing distinct bytes or small structs. Parsing occurs in the main loop. This harmonizes with reactor frameworks (libevent, libuv) and avoids handler complexity.
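A fan-in variant can be sketched by writing the signal number itself as the byte. The names fanin_install and fanin_next are hypothetical, and the sketch assumes signal numbers fit in one byte, which holds on Linux.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static int fanin_pipe[2] = {-1, -1};

static void fanin_handler(int signo) {
    int saved_errno = errno;
    unsigned char b = (unsigned char)signo;
    ssize_t r = write(fanin_pipe[1], &b, 1);   /* async-signal-safe */
    (void)r;                                   /* a full pipe just drops the byte */
    errno = saved_errno;
}

static int set_nonblock_cloexec(int fd) {
    int fl = fcntl(fd, F_GETFL);
    if (fl == -1 || fcntl(fd, F_SETFL, fl | O_NONBLOCK) == -1) return -1;
    int fdfl = fcntl(fd, F_GETFD);
    if (fdfl == -1 || fcntl(fd, F_SETFD, fdfl | FD_CLOEXEC) == -1) return -1;
    return 0;
}

/* Creates the pipe and routes every listed signal into it. */
int fanin_install(const int *signals, size_t count) {
    if (pipe(fanin_pipe) == -1) return -1;
    if (set_nonblock_cloexec(fanin_pipe[0]) == -1 ||
        set_nonblock_cloexec(fanin_pipe[1]) == -1) return -1;
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = fanin_handler;
    sigfillset(&sa.sa_mask);        /* keep handlers from nesting */
    sa.sa_flags = SA_RESTART;
    for (size_t i = 0; i < count; i++)
        if (sigaction(signals[i], &sa, NULL) == -1) return -1;
    return 0;
}

/* Main-loop side: returns the next signal number, 0 if the pipe is empty,
 * or -1 on error. Call after poll/epoll reports the read end readable. */
int fanin_next(void) {
    unsigned char b;
    for (;;) {
        ssize_t r = read(fanin_pipe[0], &b, 1);
        if (r == 1) return (int)b;
        if (r == -1 && errno == EINTR) continue;
        if (r == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) return 0;
        return -1;
    }
}
```

The main loop registers fanin_pipe[0] with its reactor and dispatches on the value returned by fanin_next, so all reload and shutdown logic runs in normal context.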

Dedicated signal thread

Blocking signals everywhere and using one thread with sigwaitinfo is often the cleanest design for portable services. It keeps the rest of the code unaware of asynchronous delivery, improving reasoning and testability.

signalfd for Linux-centric stacks

If your platform target is Linux-only, signalfd integrates best with epoll. It also renders many handler-related pitfalls irrelevant because the "handler" path becomes straightforward file I/O in a normal thread.

Minimal fatal crash path

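For SIGSEGV/"fatal" conditions: write a short message, snapshot a tiny amount of state if you must (e.g., a numeric code), then terminate. Use _exit only if no core is wanted; otherwise restore the default disposition and re-raise the signal so the kernel writes the core dump. Rely on core dumps for full diagnostics.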

Anti-Patterns to Eliminate

Logging libraries from handlers

Even libraries that claim to be "async" often allocate or lock. Unless explicitly designed for signal safety (rare), remove them from handlers.

Complex reload logic in SIGHUP handlers

Re-parsing configuration in the handler tends to allocate, use stdio, and take locks. Move reload logic to a signal thread or main loop triggered by a pipe.

Calling pthread APIs from handlers

pthread_create, pthread_mutex_lock, and friends are not async-signal-safe. Avoid entirely.

Performance Considerations

Latency and spikes

Handlers should add near-zero latency. Any nontrivial work risks tail-latency spikes if it hits contention or preemption at unfortunate times. Moving work out of the handler flattens latency distributions.

Throughput and signal storms

In high-throughput systems, signals can arrive faster than they are consumed. Pipes/signalfd and dedicated threads scale better than per-signal heavy handlers. For periodic jobs (e.g., watchdogs), consider timerfd rather than SIGALRM.
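A periodic watchdog tick built on timerfd, with no signal involved, could look like this sketch. periodic_timer_open and periodic_timer_wait are hypothetical names; timerfd is Linux-specific.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdint.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/* Creates a timer that fires every interval_ms milliseconds.
 * Returns a file descriptor, or -1 on failure or invalid interval. */
int periodic_timer_open(long interval_ms) {
    if (interval_ms <= 0) return -1;   /* a zero it_value would disarm the timer */
    int fd = timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC);
    if (fd == -1) return -1;
    struct itimerspec its;
    its.it_interval.tv_sec = interval_ms / 1000;
    its.it_interval.tv_nsec = (interval_ms % 1000) * 1000000L;
    its.it_value = its.it_interval;    /* first expiry after one interval */
    if (timerfd_settime(fd, 0, &its, NULL) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Blocks until the timer has expired at least once. Returns the number of
 * expirations since the previous read, or -1 on error. */
int64_t periodic_timer_wait(int fd) {
    uint64_t expirations;
    for (;;) {
        ssize_t r = read(fd, &expirations, sizeof expirations);
        if (r == (ssize_t)sizeof expirations) return (int64_t)expirations;
        if (r == -1 && errno == EINTR) continue;
        return -1;
    }
}
```

In a reactor the descriptor would be added to epoll instead of being read in a blocking call; a return value above 1 tells the loop how many ticks it missed.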

NUMA and scheduler interactions

Handlers can interrupt hot paths on any CPU, perturbing cache locality. Centralized handling reduces cross-core noise and can improve cache behavior in hot loops.

Observability: getting the right signals about signals

Lightweight counters

Use volatile sig_atomic_t or lock-free C11 atomics to track counts of received signals inside handlers. Export them periodically from a normal thread. Avoid updating complex metrics frameworks directly in handlers.

#include <stdatomic.h>
static _Atomic unsigned long sigterm_count = 0;
static void term_handler(int s){ (void)s; atomic_fetch_add_explicit(&sigterm_count, 1, memory_order_relaxed); }
/* elsewhere: read sigterm_count and publish via normal metrics path */

Tracing

Consider eBPF uprobes or perf-events to trace sigaction and signal delivery in staging. Keep in mind that tracing frameworks themselves may interact with signal delivery; validate overhead.

Case Study: Hang on shutdown due to unsafe logging

Symptom

A service occasionally hung during graceful shutdown. "kill -TERM" would log "Shutting down", then freeze. Core showed one thread inside fprintf and another holding a stdio lock within a hot path.

Root cause

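The SIGTERM handler called fprintf to print a message. When the signal was delivered to a thread that was itself in the middle of an fprintf on the same FILE*, the handler re-entered stdio while its lock was held and its buffer state was inconsistent, and the process hung.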

Fix

Replaced handler with self-pipe notification and moved logging to the main loop. Added a signal thread to coordinate shutdown and ensure idempotent cleanup. No further hangs under stress.

Step-by-Step Migration Plan

1) Inventory and classify signals

List all signals your process receives and categorize them as "control" (TERM/HUP/USR1) vs "fatal" (SEGV/BUS/FPE/ILL) vs "periodic" (ALRM/PROF). Define appropriate handling strategies for each.

2) Introduce a signal thread and block signals elsewhere

Adopt pthread_sigmask globally during startup and create a single handler thread. Integrate with your event loop or orchestrator semantics.

3) Replace handler code with pipe/eventfd signaling

Where handlers existed, keep only a minimal write to a pipe. Migrate any real logic to the main loop or signal thread.
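On Linux, the eventfd variant of this step can be sketched as follows. The kernel keeps a 64-bit counter, so the handler's write never needs buffer space per event. usr1_eventfd_install and usr1_eventfd_drain are hypothetical names, and SIGUSR1 is just an example signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

static int usr1_efd = -1;

static void usr1_to_eventfd(int signo) {
    (void)signo;
    int saved_errno = errno;
    uint64_t one = 1;
    ssize_t r = write(usr1_efd, &one, sizeof one);   /* adds 1 to the counter */
    (void)r;
    errno = saved_errno;
}

/* Creates the eventfd and installs the minimal handler.
 * Returns the descriptor to register with the event loop, or -1. */
int usr1_eventfd_install(void) {
    usr1_efd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    if (usr1_efd == -1) return -1;
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = usr1_to_eventfd;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGUSR1, &sa, NULL) == -1) return -1;
    return usr1_efd;
}

/* Main-loop side: returns how many notifications arrived since the last
 * call, 0 if none, -1 on error. Reading resets the counter to zero. */
int64_t usr1_eventfd_drain(void) {
    uint64_t n;
    for (;;) {
        ssize_t r = read(usr1_efd, &n, sizeof n);
        if (r == (ssize_t)sizeof n) return (int64_t)n;
        if (r == -1 && errno == EINTR) continue;
        if (r == -1 && errno == EAGAIN) return 0;
        return -1;
    }
}
```

Compared with a pipe, one descriptor is enough and a burst of signals costs a single read on the consumer side.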

4) Add altstack for fatal signals, minimize work

Install sigaltstack and ensure the fatal handler does not allocate or lock. Use core dumps for full diagnostics.

5) Harden EINTR and shutdown paths

Audit all blocking syscalls for EINTR safety. Write tests that deliver signals during I/O to verify robustness.

6) Document operational semantics

Publish runbooks explaining how signals are handled, expected shutdown timing, and safe ways to trigger reloads or diagnostics.

Best Practices

Design principles

Handlers do not allocate, lock, or perform I/O beyond write.
All complex work happens in normal thread context.
Prefer signalfd (Linux) or sigwaitinfo for control signals.
Use sigaltstack for fatal signals and exit quickly.
Keep configuration reloads and shutdown cleanups out of handlers.

Tooling and verification

strace to verify safe syscalls in handlers
gdb on core dumps to detect allocator/stdio lock deadlocks
Stress tests with signal storms
eBPF/perf to observe delivery and latency impact

Team practices

Code reviews require checking handler safety against the POSIX list
Runbooks document signals and timeouts
CI pipeline runs signal-interrupt tests alongside normal suites

Conclusion

Async-signal-safety is one of the least discussed yet most pernicious sources of instability in large C services. Production outages often stem from well-intentioned but unsafe logging, allocation, or synchronization inside signal handlers. The durable fix is architectural: ensure handlers are minimal and delegate substantial work to normal thread contexts via pipes, signalfd, or a dedicated signal thread. Combine that with alt stacks for fatal signals, robust EINTR handling, and disciplined reviews. With these patterns, enterprises can eliminate signal-induced deadlocks and crashes, delivering systems that shut down cleanly, reload configuration safely, and produce reliable diagnostics when it matters most.

FAQs

1. Can I safely log from a signal handler using write()?

Yes, write is async-signal-safe when used on a valid, open file descriptor. Keep messages short, preformatted, and avoid dynamic formatting or buffer growth. Prefer signaling a logging thread to emit structured logs.

2. Are backtrace() and dladdr() safe to call in a handler?

No. They may allocate, take locks, or traverse loader state. Rely on core dumps for full stack traces, or offload diagnostics to a dedicated thread triggered by a pipe or signalfd event.

3. How do I test for handler safety systematically?

Combine code review against the POSIX async-signal-safe list with strace under signal stress. Ensure only safe syscalls appear during handler windows and that no handlers call library functions outside the list.

4. Is SA_RESTART a substitute for proper EINTR handling?

No. SA_RESTART helps some syscalls but does not make unsafe handler code safe. Design your I/O to tolerate EINTR explicitly and keep handlers minimal.

5. When should I choose signalfd over sigwaitinfo?

On Linux, use signalfd when integrating with epoll-based event loops so signals behave like normal file descriptors. Use sigwaitinfo when you prefer a portable, thread-based waiting model without extra file descriptors.
