Last updated 2026-05-28
Graceful Shutdown
Server.Run traps SIGINT and
SIGTERM, marks the health endpoint not-ready, and runs
a single ordered Shutdown against the three HTTP
servers, the retry tracker, and the store. The total time is
capped by NOTIFY_SHUTDOWN_TIMEOUT (default 30s).
When you'd care
Tuning a rolling deploy, debugging requests cut off mid-flight, or
sizing a Kubernetes terminationGracePeriodSeconds.
The shutdown sequence
- Server.health.markNotReady(): /healthz immediately starts returning 503 with not_ready. Kubernetes readiness probes stop steering traffic to the pod within one probe interval (default ~5s).
- http.Server.Shutdown is called on the client server, the internal server, and the metrics server in turn. Each waits for in-flight handlers to return; new connections are rejected.
- RetryTracker.CancelAll cancels every in-flight at-least-once retry so no goroutine leaks across the process exit.
- The store closer is invoked (pgxpool.Pool.Close for Postgres, sdk.Client.Close for EntDB, no-op for memory). This is the last step so any in-flight handler that's still touching the store has time to return first.
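The first step of the sequence, the readiness flip, might look roughly like this. The health type, its atomic flag, the plain-text not_ready body, and the probe helper are assumptions made for illustration; the real handler may shape its response differently. The sketch needs Go 1.19 or later for atomic.Bool.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// health is a hypothetical readiness flag backing /healthz.
type health struct {
	notReady atomic.Bool
}

// markNotReady flips the flag; it is safe to call from any goroutine.
func (h *health) markNotReady() { h.notReady.Store(true) }

// ServeHTTP answers 200 until markNotReady is called, then 503 with not_ready.
func (h *health) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if h.notReady.Load() {
		http.Error(w, "not_ready", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ok")
}

// probe returns the status code the handler would answer with right now.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	return rec.Code
}

func main() {
	h := &health{}
	fmt.Println(probe(h)) // 200
	h.markNotReady()
	fmt.Println(probe(h)) // 503
}
```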

    func (s *Server) Shutdown(ctx context.Context) error {
        s.health.markNotReady()
        ctx, cancel := context.WithTimeout(ctx, s.cfg.ShutdownTimeout)
        defer cancel()

        var firstErr error
        for _, srv := range []*http.Server{s.clientServer, s.internalSrv, s.metricsSrv} {
            if err := srv.Shutdown(ctx); err != nil && firstErr == nil {
                firstErr = err
            }
        }
        if s.retries != nil {
            s.retries.CancelAll()
        }
        if s.closer != nil {
            if err := s.closer.Close(); err != nil && firstErr == nil {
                firstErr = err
            }
        }
        return firstErr
    }

Picking a timeout
The default 30s is a reasonable starting point. Two workloads should bump it:
- Many open streams: every active StreamEvents connection must drain. The handler exits cleanly on context cancel, but a TCP close handshake takes a few hundred ms times N connections. For 10k open streams plan ~60s.
- Slow provider Send in flight: if you registered a provider that talks to a slow backend without its own timeout, NotificationInternalService.Notify calls may need ~10s+ each. Set NOTIFY_SHUTDOWN_TIMEOUT a comfortable margin above the longest acceptable provider latency.

    -e NOTIFY_SHUTDOWN_TIMEOUT=60s

Kubernetes terminationGracePeriodSeconds
Set it slightly longer than NOTIFY_SHUTDOWN_TIMEOUT so
the kubelet doesn't SIGKILL before notify finishes draining:

    spec:
      terminationGracePeriodSeconds: 45  # NOTIFY_SHUTDOWN_TIMEOUT + headroom
      containers:
        - name: notify
          env:
            - { name: NOTIFY_SHUTDOWN_TIMEOUT, value: "30s" }

preStop hook (defensive)
Optional: a short preStop hook gives the load balancer / service mesh time to stop sending new traffic before notify even sees SIGTERM:

    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]

Note that the container ships FROM scratch, so there is no
/bin/sh and the exec hook above fails as written. If you want a
preStop sleep, project a sidecar with a shell, or skip the hook and
rely on the readiness-probe flip inside Server.Shutdown to drain
upstream proxies.
Retry-tracker drain semantics
The at-least-once RetryTracker spawns one goroutine per
(key, connID) pair when a data-change event is
scheduled. CancelAll cancels every parent context and
drains the map. After shutdown:
- No goroutines are leaked: every retry loop exits via <-ctx.Done().
- Any unacked data-change event whose retry budget was not yet exhausted is silently dropped. This is by design: data-change events are best-effort hints. A reconnecting client re-fetches state via its own API, so a missed hint is at worst one extra cold fetch.
If you need at-least-once delivery across pod restarts for a specific use case, push it through a durable channel (email, SMS, web push, mobile push) — the in-app channel is real-time-or-nothing by intent.
Store closure
- memory: no Close, no-op.
- postgres: pgxpool.Pool.Close; waits for all checked-out connections to be returned before closing.
- entdb: sdk.Client.Close; closes the gRPC transport.
The store close runs after the HTTP servers have finished draining. This guarantees no in-flight Connect handler is still touching the store when its connections get torn down — avoiding spurious "connection closed" errors in the very last log lines.
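The Shutdown code calls a Close method that returns an error, while pgxpool's Pool.Close returns nothing, so some adapter has to sit between the two. One way this could be done is sketched below; closerFunc, noopCloser, fakePool, and closeStore are hypothetical names for this example, and how the project actually wires its closers is not shown in this page.

```go
package main

import (
	"fmt"
	"io"
)

// closerFunc adapts a Close method that returns nothing to io.Closer.
type closerFunc func()

func (f closerFunc) Close() error {
	f()
	return nil
}

// noopCloser is what a memory store might register: nothing to release.
type noopCloser struct{}

func (noopCloser) Close() error { return nil }

// fakePool is a hypothetical stand-in for a connection pool whose Close
// has no return value.
type fakePool struct{ closed bool }

func (p *fakePool) Close() { p.closed = true }

// closeStore mirrors the nil check in the Shutdown method shown earlier.
func closeStore(c io.Closer) error {
	if c == nil {
		return nil
	}
	return c.Close()
}

func main() {
	pool := &fakePool{}
	err := closeStore(closerFunc(pool.Close))
	fmt.Println(err, pool.closed)         // <nil> true
	fmt.Println(closeStore(noopCloser{})) // <nil>
	fmt.Println(closeStore(nil))          // <nil>
}
```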
Observability during shutdown

    {"time":"...","level":"INFO","msg":"server_shutdown_signal"}
    {"time":"...","level":"INFO","msg":"stream_close","connection_id":"...","user_id":"..."} // one per stream
    {"time":"...","level":"INFO","msg":"stream_close","connection_id":"...","user_id":"..."}
    ... (per in-flight stream) ...
    {"time":"...","level":"WARN","msg":"server_shutdown_error","error":"..."} // only on failure

If the shutdown completes within NOTIFY_SHUTDOWN_TIMEOUT,
the process exits with status 0. If the timeout fires first, the
process logs server_shutdown_error with the underlying
cause and exits with the first non-nil error returned by Shutdown.
Related
- Observability
- Kubernetes deployment
- Realtime engine — RetryTracker invariants