Last updated 2026-05-28

Graceful Shutdown

Server.Run traps SIGINT and SIGTERM, marks the health endpoint not-ready, and runs a single ordered Shutdown against the three HTTP servers, the retry tracker, and the store. The total time is capped by NOTIFY_SHUTDOWN_TIMEOUT (default 30s).

When you'd care

Tuning a rolling deploy, debugging requests cut off mid-flight, or sizing a Kubernetes terminationGracePeriodSeconds.

The shutdown sequence

Server.health.markNotReady() — /healthz immediately starts returning 503 with not_ready. Kubernetes readiness probes stop steering traffic to the pod within one probe interval (default ~5s).
http.Server.Shutdown is called on the client server, the internal server, and the metrics server in turn. Each waits for in-flight handlers to return; new connections are rejected.
RetryTracker.CancelAll cancels every in-flight at-least-once retry so no goroutine leaks across the process exit.
The store closer is invoked (pgxpool.Pool.Close for Postgres, sdk.Client.Close for EntDB, no-op for memory). This is the last step so any in-flight handler that's still touching the store has time to return first.

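// Shutdown drains the service in a fixed order and returns the first error seen.
func (s *Server) Shutdown(ctx context.Context) error {
    // Step 1: fail readiness so no new traffic is routed to this pod.
    s.health.markNotReady()
    // A single deadline bounds the HTTP draining below.
    ctx, cancel := context.WithTimeout(ctx, s.cfg.ShutdownTimeout)
    defer cancel()

    var firstErr error
    // Step 2: drain the client, internal, and metrics servers in turn.
    for _, srv := range []*http.Server{s.clientServer, s.internalSrv, s.metricsSrv} {
        if err := srv.Shutdown(ctx); err != nil && firstErr == nil {
            firstErr = err
        }
    }
    // Step 3: stop every in-flight retry loop.
    if s.retries != nil {
        s.retries.CancelAll()
    }
    // Step 4: close the store last, once handlers have had time to return.
    if s.closer != nil {
        if err := s.closer.Close(); err != nil && firstErr == nil {
            firstErr = err
        }
    }
    return firstErr
}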
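The first step, the readiness flip, can be illustrated in isolation. The sketch below is an assumption about how such a flag could work, not the project's health implementation: the health type, the plain-text "not_ready" body, and the probe helper are all hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// health is a hypothetical readiness tracker; the zero value reports ready.
type health struct {
	notReady atomic.Bool
}

// markNotReady flips the endpoint to 503 for all subsequent probes.
func (h *health) markNotReady() { h.notReady.Store(true) }

// ServeHTTP answers 200 while ready and 503 with "not_ready" afterwards.
func (h *health) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	if h.notReady.Load() {
		http.Error(w, "not_ready", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ok")
}

// probe performs one in-memory request against /healthz and returns the status.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	return rec.Code
}

func main() {
	h := &health{}
	fmt.Println(probe(h)) // 200
	h.markNotReady()
	fmt.Println(probe(h)) // 503
}
```

An atomic flag keeps the probe handler lock-free, so the flip is visible to the very next probe even while the servers are busy draining.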

Picking a timeout

The default 30s is a reasonable starting point. Two workloads should bump it:

Many open streams — every active StreamEvents connection must drain. The handler exits cleanly on context cancel, but a TCP close handshake takes a few hundred ms times N connections. For 10k open streams plan ~60s.
Slow provider Send in flight — if you registered a provider that talks to a slow backend without its own timeout, NotificationInternalService.Notify calls may need ~10s+ each. Set NOTIFY_SHUTDOWN_TIMEOUT a comfortable margin above the longest acceptable provider latency.

-e NOTIFY_SHUTDOWN_TIMEOUT=60s

Kubernetes terminationGracePeriodSeconds

Set it slightly longer than NOTIFY_SHUTDOWN_TIMEOUT so the kubelet doesn't SIGKILL before notify finishes draining:

spec:
  terminationGracePeriodSeconds: 45      # NOTIFY_SHUTDOWN_TIMEOUT + headroom
  containers:
    - name: notify
      env:
        - { name: NOTIFY_SHUTDOWN_TIMEOUT, value: "30s" }

preStop hook (defensive)

Optional: a short preStop hook gives the load balancer / service mesh time to stop sending new traffic before notify even sees SIGTERM:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 5"]

The container ships FROM scratch, so there is no /bin/sh and the hook above does not work as written. If you want a preStop sleep, project a sidecar with a shell; otherwise rely on the readiness-probe flip that Server.Shutdown performs as its first step to drain upstream proxies.

Retry-tracker drain semantics

The at-least-once RetryTracker spawns one goroutine per (key, connID) pair when a data-change event is scheduled. CancelAll cancels every parent context and drains the map. After shutdown:

No goroutines are leaked — every retry loop exits via <-ctx.Done().
Any unacked data-change event whose retry budget was not yet exhausted is silently dropped. This is by design: data-change events are best-effort hints. A reconnecting client re-fetches state via its own API, so a missed hint is at worst one extra cold fetch.

If you need at-least-once delivery across pod restarts for a specific use case, push it through a durable channel (email, SMS, web push, mobile push) — the in-app channel is real-time-or-nothing by intent.

Store closure

memory — no Close, no-op.
postgres — pgxpool.Pool.Close; waits for all checked-out connections to be returned before closing.
entdb — sdk.Client.Close; closes the gRPC transport.

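The store close runs after the HTTP servers have finished draining. As long as draining completes within the timeout, no in-flight Connect handler is still touching the store when its connections get torn down, which avoids spurious "connection closed" errors in the very last log lines. If the timeout fires first, http.Server.Shutdown returns while handlers may still be running, so such errors can still appear.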
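Because Shutdown calls s.closer.Close() and expects an error, backends whose Close returns nothing need a small adapter. pgxpool.Pool.Close has that shape. The sketch below shows one way to put the three cases behind io.Closer; closerFunc, nopCloser, fromVoidClose, and fakePool are hypothetical names, and the real wiring in notify may differ.

```go
package main

import (
	"fmt"
	"io"
)

// closerFunc adapts a plain function to io.Closer.
type closerFunc func() error

func (f closerFunc) Close() error { return f() }

// nopCloser is what an in-memory store could hand to the server.
var nopCloser io.Closer = closerFunc(func() error { return nil })

// fromVoidClose wraps a Close method that returns nothing.
func fromVoidClose(c interface{ Close() }) io.Closer {
	return closerFunc(func() error {
		c.Close()
		return nil
	})
}

// fakePool stands in for a connection pool in this example.
type fakePool struct{ closed bool }

func (p *fakePool) Close() { p.closed = true }

func main() {
	pool := &fakePool{}
	closers := []io.Closer{nopCloser, fromVoidClose(pool)}
	for _, c := range closers {
		if err := c.Close(); err != nil {
			fmt.Println("close error:", err)
		}
	}
	fmt.Println(pool.closed) // true
}
```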

Observability during shutdown

{"time":"...","level":"INFO","msg":"server_shutdown_signal"}
{"time":"...","level":"INFO","msg":"stream_close","connection_id":"...","user_id":"..."}  // one per stream
{"time":"...","level":"INFO","msg":"stream_close","connection_id":"...","user_id":"..."}
... (per in-flight stream) ...
{"time":"...","level":"WARN","msg":"server_shutdown_error","error":"..."}  // only on failure

If the shutdown completes within NOTIFY_SHUTDOWN_TIMEOUT, the process exits with status 0. If the timeout fires first, the process logs server_shutdown_error with the underlying cause and exits with the first non-nil error returned by Shutdown.
