Last updated 2026-05-28

Observability

notify exposes structured JSON logs to stderr, a /healthz endpoint for liveness, and a /metrics endpoint for Prometheus scrapers. The metrics endpoint always returns a parseable body so a scraper hitting it never sees a 404.

When you'd care

Setting up dashboards, debugging a slow tenant, wiring an alerting rule, or chasing an intermittent timeout in production.

Logging

Logs are emitted via Go's log/slog with a JSON handler writing to stderr at the level set by NOTIFY_LOG_LEVEL (default info). Each event is a single JSON object on its own line with no header, so the stream can be ingested directly into Loki, Datadog, Splunk, or CloudWatch.

What gets logged

  • server_listen — one line per listener (client / internal / metrics).
  • notifyd_starting — boot banner with version, commit, store driver, live-connections toggle, listener ports.
  • stream_open / stream_close — every StreamEvents session, including connection_id, user_id, tenant_id, device_type.
  • event_queue_full — one log per dropped event when a client buffer overflows.
  • retry_failed — per-attempt error from the at-least-once retry tracker (key, connection_id, attempt, error).
  • Per-RPC logging interceptor — one line per Connect call with the procedure name and duration.
  • server_shutdown_signal / server_shutdown_error — graceful shutdown.

Sample log line (wrapped here for readability; emitted as a single line)

{
  "time": "2026-05-27T05:14:32.108Z",
  "level": "INFO",
  "msg": "stream_open",
  "connection_id": "8b9f...",
  "user_id": "user-alice",
  "tenant_id": "acme",
  "device_type": "browser"
}

Setting the level

# verbose — shows the per-RPC log lines from the logging interceptor
-e NOTIFY_LOG_LEVEL=debug
# default — boot, stream lifecycle, shutdown, errors
-e NOTIFY_LOG_LEVEL=info
# warnings + errors only — recommended for high-traffic production
-e NOTIFY_LOG_LEVEL=warn
# errors only
-e NOTIFY_LOG_LEVEL=error

Health checks

curl -s http://localhost:9090/healthz
# {"status":"ok"}
  • HTTP 200 with {"status":"ok"} once Server.Run has bound all listeners.
  • HTTP 503 with {"status":"not_ready"} after Shutdown begins and before the process exits.

An alias, /health, is registered for tools and habits that default to that path.

Kubernetes probes

livenessProbe:
  httpGet:
    path: /healthz
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /healthz
    port: 9090
  initialDelaySeconds: 2
  periodSeconds: 5

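/healthz starts returning 503 the instant Shutdown begins. Kubernetes marks the pod unready once the readiness probe has failed failureThreshold times in a row (default 3, so up to roughly 15 seconds with the settings above), after which a rolling update stops sending new traffic to the pod while Shutdown drains in-flight requests.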

Metrics

/metrics on the metrics port returns Content-Type text/plain; version=0.0.4, the standard Prometheus text exposition format. v0.1 ships a placeholder body so scrapers don't get a 404; real metric series land in a follow-up wave without changing the route.

curl -s http://localhost:9090/metrics
# # notify metrics — placeholder

Prometheus scrape config

# prometheus.yml
scrape_configs:
  - job_name: notify
    metrics_path: /metrics
    static_configs:
      - targets: ['notify:9090']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

Metric shapes the follow-up wave will ship

Once the registry is wired, expect these series. Labels are intentionally cardinality-bounded — no per-tenant labels.

Metric                            Type       Labels
notify_notify_requests_total      Counter    code
notify_send_total                 Counter    channel, provider, status
notify_send_duration_seconds      Histogram  channel, provider
notify_store_op_total             Counter    op, code
notify_store_op_duration_seconds  Histogram  op
notify_live_connections           Gauge      device_type
notify_stream_events_total        Counter    kind

Tracing (future)

notify does not emit OpenTelemetry spans today. The Connect interceptors are the right place to add them; once the follow-up lands, set OTEL_* env vars per the standard Go OTel SDK auto-configuration and traces will surface in the backend of your choice.

Recipe: structured-logs pipeline

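Loki + Grafana is a low-cost path for JSON logs. notify writes to stderr, so this example assumes your runtime redirects that stream to files under /var/log/notify/. Drop this in your promtail config: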

scrape_configs:
  - job_name: notify
    static_configs:
      - targets: [localhost]
        labels:
          job: notify
          __path__: /var/log/notify/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            msg: msg
            user_id: user_id
            tenant_id: tenant_id
      - labels:
          level:

Related