Last updated 2026-05-28

Graceful Shutdown

Server.Run traps SIGINT and SIGTERM, marks the health endpoint not-ready, and runs a single ordered Shutdown against the three HTTP servers, the retry tracker, and the store. The total time is capped by NOTIFY_SHUTDOWN_TIMEOUT (default 30s).

When you'd care

Tuning a rolling deploy, debugging requests cut off mid-flight, or sizing a Kubernetes terminationGracePeriodSeconds.

The shutdown sequence

  1. Server.health.markNotReady(): /healthz immediately starts returning 503 with not_ready. Kubernetes readiness probes stop steering traffic to the pod within one probe interval (default ~5s).
  2. http.Server.Shutdown is called on the client server, the internal server, and the metrics server in turn. Each waits for in-flight handlers to return; new connections are rejected.
  3. RetryTracker.CancelAll cancels every in-flight at-least-once retry so no goroutine leaks across the process exit.
  4. The store closer is invoked (pgxpool.Pool.Close for Postgres, sdk.Client.Close for EntDB, no-op for memory). This is the last step so that any in-flight handler still touching the store has time to return first.
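
func (s *Server) Shutdown(ctx context.Context) error {
	// Step 1: fail readiness first so upstreams stop sending new traffic.
	s.health.markNotReady()

	// Bound the HTTP drain by the configured shutdown timeout.
	ctx, cancel := context.WithTimeout(ctx, s.cfg.ShutdownTimeout)
	defer cancel()

	var firstErr error

	// Step 2: drain the HTTP servers in order, keeping only the first error.
	for _, srv := range []*http.Server{s.clientServer, s.internalSrv, s.metricsSrv} {
		if err := srv.Shutdown(ctx); err != nil && firstErr == nil {
			firstErr = err
		}
	}

	// Step 3: cancel in-flight retries so no goroutine outlives the process.
	if s.retries != nil {
		s.retries.CancelAll()
	}

	// Step 4: close the store last, after the handlers have returned.
	if s.closer != nil {
		if err := s.closer.Close(); err != nil && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}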
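
Step 1 depends on a readiness flag that can be flipped from the shutdown path while probes are being served concurrently. A minimal sketch of that idea follows, assuming an atomic flag and a plain-text not_ready body; the health type shown here is a hypothetical stand-in, not the project's actual implementation.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// health is a hypothetical stand-in for the server's readiness state.
type health struct {
	notReady atomic.Bool // zero value means "ready"
}

// markNotReady flips the flag; it is safe to call concurrently with probes.
func (h *health) markNotReady() { h.notReady.Store(true) }

// status returns the HTTP status code a readiness probe would receive now.
func (h *health) status() int {
	if h.notReady.Load() {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

// ServeHTTP answers probes: 200 while ready, 503 with not_ready once draining.
func (h *health) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	code := h.status()
	if code != http.StatusOK {
		http.Error(w, "not_ready", code)
		return
	}
	w.WriteHeader(code)
}
```

Because the flag is atomic, the flip becomes visible to the next probe without any locking in the handler.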

Picking a timeout

The default 30s is a reasonable starting point. Two workloads should bump it:

  • Many open streams: every active StreamEvents connection must drain. The handler exits cleanly on context cancel, but each connection's TCP close handshake can take a few hundred milliseconds, and that cost adds up across N connections. For 10k open streams, plan for ~60s.
  • Slow provider Send in flight: if you registered a provider that talks to a slow backend without its own timeout, NotificationInternalService.Notify calls may need ~10s+ each. Set NOTIFY_SHUTDOWN_TIMEOUT a comfortable margin above the longest acceptable provider latency.

For example, to raise it to 60s through the container environment:

-e NOTIFY_SHUTDOWN_TIMEOUT=60s
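
The values shown for NOTIFY_SHUTDOWN_TIMEOUT (30s, 60s) look like Go duration strings. Assuming that is the accepted format, a hedged sketch of parsing it with a 30s fallback might look like the following; parseShutdownTimeout is a hypothetical helper, and how notify actually reads the variable is not shown in this document.

```go
package main

import (
	"fmt"
	"time"
)

// defaultShutdownTimeout mirrors the documented default of 30s.
const defaultShutdownTimeout = 30 * time.Second

// parseShutdownTimeout interprets the raw value of NOTIFY_SHUTDOWN_TIMEOUT.
// An empty value selects the default; malformed or non-positive values are
// rejected rather than silently replaced.
func parseShutdownTimeout(raw string) (time.Duration, error) {
	if raw == "" {
		return defaultShutdownTimeout, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("NOTIFY_SHUTDOWN_TIMEOUT: %w", err)
	}
	if d <= 0 {
		return 0, fmt.Errorf("NOTIFY_SHUTDOWN_TIMEOUT: must be positive, got %s", d)
	}
	return d, nil
}
```

A caller would typically pass os.Getenv("NOTIFY_SHUTDOWN_TIMEOUT") and fail startup on error, so a typo such as 60sec is caught at boot rather than at shutdown.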

Kubernetes terminationGracePeriodSeconds

Set it slightly longer than NOTIFY_SHUTDOWN_TIMEOUT so the kubelet doesn't SIGKILL before notify finishes draining:

spec:
  terminationGracePeriodSeconds: 45 # NOTIFY_SHUTDOWN_TIMEOUT + headroom
  containers:
    - name: notify
      env:
        - { name: NOTIFY_SHUTDOWN_TIMEOUT, value: "30s" }
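
The kubelet's grace-period countdown starts before any preStop hook runs, so the budget has to cover the hook delay as well as the drain. The arithmetic can be sketched as below; gracePeriodSeconds is a hypothetical helper, and the headroom figure is a judgment call rather than a documented constant.

```go
package main

import (
	"math"
	"time"
)

// gracePeriodSeconds returns a terminationGracePeriodSeconds value that covers
// an optional preStop delay, the shutdown timeout, and some headroom, rounded
// up to whole seconds because Kubernetes takes an integer.
func gracePeriodSeconds(preStop, shutdownTimeout, headroom time.Duration) int64 {
	total := preStop + shutdownTimeout + headroom
	return int64(math.Ceil(total.Seconds()))
}
```

With no preStop hook, a 30s timeout, and 15s of headroom this yields 45, matching the manifest above; a 5s preStop sleep with 10s of headroom gives the same figure.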

preStop hook (defensive)

Optional: a short preStop hook gives the load balancer / service mesh time to stop sending new traffic before notify even sees SIGTERM:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 5"]

Note that the container ships FROM scratch, so there is no /bin/sh and the hook above cannot run as written. If you want a preStop sleep, add a sidecar that includes a shell; otherwise rely on the readiness-probe flip at the start of Server.Shutdown to drain upstream proxies.

Retry-tracker drain semantics

The at-least-once RetryTracker spawns one goroutine per (key, connID) pair when a data-change event is scheduled. CancelAll cancels every parent context and drains the map. After shutdown:

  • No goroutines are leaked — every retry loop exits via <-ctx.Done().
  • Any unacked data-change event whose retry budget was not yet exhausted is silently dropped. This is by design: data-change events are best-effort hints. A reconnecting client re-fetches state via its own API, so a missed hint is at worst one extra cold fetch.

If you need at-least-once delivery across pod restarts for a specific use case, push it through a durable channel (email, SMS, web push, mobile push) — the in-app channel is real-time-or-nothing by intent.

Store closure

  • memory: nothing to close (no-op).
  • postgres: pgxpool.Pool.Close; waits for all checked-out connections to be returned before closing.
  • entdb: sdk.Client.Close; closes the gRPC transport.

The store close runs after the HTTP servers have finished draining. Provided the drain completes within the timeout, this guarantees no in-flight Connect handler is still touching the store when its connections are torn down, which avoids spurious "connection closed" errors in the very last log lines.
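
One way to put the three backends behind a single closer is the standard io.Closer interface. The sketch below is an assumption about how that could be wired, not the project's actual code: nopCloser covers the memory store, and closerFunc adapts Close methods that return no error (pgxpool.Pool.Close is one such method).

```go
package main

import "io"

// nopCloser satisfies io.Closer for backends with nothing to release,
// such as an in-memory store.
type nopCloser struct{}

func (nopCloser) Close() error { return nil }

// closerFunc adapts a Close method with no return value to io.Closer.
type closerFunc func()

func (f closerFunc) Close() error {
	f()
	return nil
}

// Compile-time checks that both helpers implement io.Closer.
var (
	_ io.Closer = nopCloser{}
	_ io.Closer = closerFunc(nil)
)
```

With a pool value in hand, closerFunc(pool.Close) would then slot into the same field as any other closer.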

Observability during shutdown

{"time":"...","level":"INFO","msg":"server_shutdown_signal"}
{"time":"...","level":"INFO","msg":"stream_close","connection_id":"...","user_id":"..."} // one per stream
{"time":"...","level":"INFO","msg":"stream_close","connection_id":"...","user_id":"..."}
... (per in-flight stream) ...
{"time":"...","level":"WARN","msg":"server_shutdown_error","error":"..."} // only on failure

If the shutdown completes within NOTIFY_SHUTDOWN_TIMEOUT, the process exits with status 0. If the timeout fires first, Shutdown returns the first non-nil error it encountered; the process logs server_shutdown_error with that cause and exits with that error.

Related