Operations & reliability

How the worker stays correct and recovers on its own. The guiding principle throughout: one writer, fail closed, self-heal, and alert loudly when a human is actually needed.

The single-writer worker

A single background worker is the sole writer to the corpus. It runs as a long-lived process (restarted automatically if it exits) and is driven by a blocking scheduler with three jobs:

Job               Schedule            Default
Daily chat sync   cron, once a day    hour 22:00 local (DAILY_SYNC_HOUR)
Meeting poll      fixed interval      every 10 min (ZOOM_POLL_MINUTES)
Daily report      cron, once a day    23:00
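
To make the two schedule types concrete, here is a minimal sketch of how the next run time could be computed for a once-a-day job and for a fixed-interval job. The helper names next_daily_run and next_poll are hypothetical; the real worker leaves this to its scheduler, so treat this only as an illustration of the semantics.

```python
from datetime import datetime, timedelta

DAILY_SYNC_HOUR = 22      # documented default for the daily chat sync
ZOOM_POLL_MINUTES = 10    # documented default for the meeting poll


def next_daily_run(now: datetime, hour: int = DAILY_SYNC_HOUR) -> datetime:
    """Return the next local time at which a once-a-day job at `hour` fires."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    # If today's slot has already passed (or is exactly now), use tomorrow's.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def next_poll(last_poll: datetime, minutes: int = ZOOM_POLL_MINUTES) -> datetime:
    """Fixed-interval job: the next run is the last run plus the interval."""
    return last_poll + timedelta(minutes=minutes)
```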

On startup the worker:

  1. validates its config (fail-fast on misconfiguration),
  2. connects the DB,
  3. migrates the schema,
  4. seeds the reasoning patterns and voice profile,
  5. backfills any missing pattern embeddings,
  6. starts the interactivity listener,
  7. starts the meeting-ingest thread,
  8. runs a full initial backfill, and
  9. starts the health monitor.
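
The fail-fast config check at startup might look roughly like the sketch below. It assumes a settings object validated at construction time; the Settings fields, the ConfigError type, and the model-to-dimension table are all hypothetical and chosen only to mirror the documented failure cases (missing keys, embedding dimension/model mismatch, rerank weights not summing to 1.0).

```python
import math
from dataclasses import dataclass
from typing import Tuple


class ConfigError(ValueError):
    """Raised at load time so a misconfigured worker never starts."""


# Hypothetical model -> dimension table, for illustration only.
KNOWN_EMBEDDING_DIMS = {"example-embed-small": 384, "example-embed-large": 1024}


@dataclass(frozen=True)
class Settings:
    db_path: str
    embedding_model: str
    embedding_dim: int
    rerank_weights: Tuple[float, ...]

    def __post_init__(self) -> None:
        # Missing key: an empty value is treated the same as an absent one.
        if not self.db_path:
            raise ConfigError("db_path is required")
        expected = KNOWN_EMBEDDING_DIMS.get(self.embedding_model)
        if expected is None:
            raise ConfigError(f"unknown embedding model: {self.embedding_model!r}")
        if expected != self.embedding_dim:
            raise ConfigError(
                f"{self.embedding_model} produces {expected} dims, "
                f"but embedding_dim is {self.embedding_dim}"
            )
        # Compare with a tolerance so 0.5 + 0.3 + 0.2 is accepted.
        if not math.isclose(sum(self.rerank_weights), 1.0, abs_tol=1e-9):
            raise ConfigError("rerank weights must sum to 1.0")
```

Validating in the constructor means there is no way to obtain a half-valid settings object, which is one simple way to get the fail-fast behavior.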

The meeting ingester runs as a single thread with its own DB connection, so the single-writer guarantee holds even though ingestion and polling run concurrently.

The approval flow (meetings → review → ingest)

Meeting recordings are never ingested silently. Each finished recording is routed one of three ways:

  • Allowlisted recurring series auto-publish.
  • Denylisted series auto-route to a private archive.
  • Everything else is HELD — an approval card with Publish / Keep private buttons is posted to a review channel, and nothing reaches the corpus until a human clicks. This is the core fail-safe: a held or private meeting can never reach the corpus or a public channel without an explicit human decision.
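
The three-way routing can be sketched as a small pure function. The names Route and route_recording are hypothetical, and the rule that the denylist wins when a series appears on both lists is an assumption made here to stay on the fail-closed side; the source does not say how that case is handled.

```python
from enum import Enum
from typing import Optional, Set


class Route(Enum):
    PUBLISH = "publish"   # allowlisted recurring series
    PRIVATE = "private"   # denylisted series, sent to the private archive
    HELD = "held"         # everything else waits for a human decision


def route_recording(
    series_id: Optional[str], allowlist: Set[str], denylist: Set[str]
) -> Route:
    """Decide where a finished recording goes; the default is HELD."""
    # Assumption: the denylist takes precedence if a series is on both lists.
    if series_id is not None and series_id in denylist:
        return Route.PRIVATE
    if series_id is not None and series_id in allowlist:
        return Route.PUBLISH
    # One-off meetings and unknown series are never auto-published.
    return Route.HELD
```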

The card is posted first and the meeting is recorded only after the post succeeds, so a failed post leaves no record behind and the next poll simply retries it.

The undo window

Publish and Keep-private are not immediate. When clicked, the card swaps to a short countdown card with a single Undo button, and the irreversible action commits only after the window elapses — via a per-meeting timer.

  • Undo window = ZOOM_UNDO_SECONDS, default 5 seconds (0 = commit immediately).
  • The window is per-meeting on its own timer, so rapid consecutive clicks on different meetings are never throttled.
  • Undo restores the exact original card.
  • If the commit itself fails, the actionable card is restored ("try again") so it can be retried.

Publishing resolves the button UI before the slow extraction runs — the actual ingestion is handed to a background queue so the buttons stay responsive.
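
A per-meeting undo window can be sketched with one cancellable timer per meeting. This is only an illustration built on the standard library's threading.Timer; the class name UndoWindows, its methods, and the choice to restart the countdown on a repeated click are assumptions, not the worker's actual code.

```python
import threading
from typing import Callable, Dict

ZOOM_UNDO_SECONDS = 5.0  # documented default; 0 commits immediately


class UndoWindows:
    """One cancellable timer per meeting, so windows never block each other."""

    def __init__(self, seconds: float = ZOOM_UNDO_SECONDS) -> None:
        self._seconds = seconds
        self._timers: Dict[str, threading.Timer] = {}
        self._lock = threading.Lock()

    def click(self, meeting_id: str, commit: Callable[[], None]) -> None:
        """Start the countdown for one meeting; commit runs when it elapses."""
        if self._seconds <= 0:
            commit()
            return

        def fire() -> None:
            with self._lock:
                # If undo() (or a newer click) replaced this timer, do nothing.
                if self._timers.get(meeting_id) is not timer:
                    return
                del self._timers[meeting_id]
            commit()

        timer = threading.Timer(self._seconds, fire)
        timer.daemon = True
        with self._lock:
            previous = self._timers.pop(meeting_id, None)
            if previous is not None:
                previous.cancel()  # assumption: a repeated click restarts the window
            self._timers[meeting_id] = timer
        timer.start()

    def undo(self, meeting_id: str) -> bool:
        """Cancel a pending commit; returns False if the window already closed."""
        with self._lock:
            timer = self._timers.pop(meeting_id, None)
        if timer is None:
            return False
        timer.cancel()
        return True
```

Because each meeting owns its timer, clicking Publish on several meetings in quick succession starts several independent countdowns rather than queuing them.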

Self-heal: reconciling orphaned meetings

A published meeting's status is recorded before its best-effort background ingestion. If that ingestion fails (an LLM error) or is lost to a restart, the meeting becomes stranded — marked published, but with zero entries, and skipped by normal polling because it's already on record.

A reconciler runs at the end of every poll to catch exactly this. It:

  1. finds meetings that are published but never ingested,
  2. skips any currently in-flight,
  3. re-fetches the recording + transcript, and
  4. re-enqueues extraction.

A meeting whose transcript still isn't ready is simply left for a later poll. The publish decision stands — nothing is re-posted or re-approved. The ingested_at timestamp is stamped only after extraction completes, so a meeting that errored stays unmarked and gets retried until it succeeds. In-flight tracking prevents a slow meeting from being ingested twice.
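
The reconciler steps can be sketched against SQLite as follows. Only the ingested_at column comes from the text; the meetings table, the uuid and status columns, and the fetch_transcript and enqueue callables are hypothetical stand-ins.

```python
import sqlite3
from typing import Callable, List, Optional, Set


def find_orphans(conn: sqlite3.Connection, in_flight: Set[str]) -> List[str]:
    """Meetings marked published that never finished ingestion."""
    rows = conn.execute(
        "SELECT uuid FROM meetings "
        "WHERE status = 'published' AND ingested_at IS NULL "
        "ORDER BY uuid"
    ).fetchall()
    # Skip anything a background ingest is still working on.
    return [uuid for (uuid,) in rows if uuid not in in_flight]


def reconcile(
    conn: sqlite3.Connection,
    in_flight: Set[str],
    fetch_transcript: Callable[[str], Optional[str]],
    enqueue: Callable[[str, str], None],
) -> List[str]:
    """Re-enqueue extraction for stranded meetings; returns what was queued."""
    requeued = []
    for uuid in find_orphans(conn, in_flight):
        transcript = fetch_transcript(uuid)
        if transcript is None:
            continue  # transcript not ready yet: leave it for a later poll
        in_flight.add(uuid)  # prevents a second ingest while this one runs
        enqueue(uuid, transcript)
        requeued.append(uuid)
    return requeued
```

Nothing here touches the publish decision itself; the sketch only re-drives extraction, which matches the rule that nothing is re-posted or re-approved.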

Self-heal: Slack socket flapping

The interactivity socket auto-reconnects after a normal network blip. But a token conflict — a second instance running on the same app token — makes the socket flap forever and silently, and the approval buttons stop working. A supervisor thread watches for this:

Threshold         Value     Meaning
flap window       60 s      look-back window for counting drops
flap threshold    6 drops   drops within the window that count as "flapping"
check interval    15 s      how often the supervisor evaluates
action cooldown   180 s     minimum gap between rebuild+alert actions (never thrash)

Every socket drop is recorded. When the drop count crosses the threshold within the window (and the cooldown has elapsed), the supervisor logs an error, fires an alert naming the likely duplicate instance, and force-rebuilds the socket — the same thing a manual restart would do. So a real blip self-heals quietly; a token conflict becomes a loud, actionable signal.
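
The flap check can be sketched as a sliding-window counter with a cooldown. The class name FlapDetector and the injectable clock are assumptions for illustration, and "crosses the threshold" is read here as "at least 6 drops in the window".

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class FlapDetector:
    """Counts socket drops in a look-back window and rate-limits the response."""

    def __init__(
        self,
        window: float = 60.0,       # flap window, seconds
        threshold: int = 6,         # drops that count as flapping
        cooldown: float = 180.0,    # minimum gap between actions, seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._window = window
        self._threshold = threshold
        self._cooldown = cooldown
        self._clock = clock
        self._drops: Deque[float] = deque()
        self._last_action: Optional[float] = None

    def record_drop(self) -> None:
        """Called on every socket disconnect."""
        self._drops.append(self._clock())

    def should_act(self) -> bool:
        """Called by the supervisor on each check interval (15 s in the text)."""
        now = self._clock()
        # Forget drops that have aged out of the look-back window.
        while self._drops and now - self._drops[0] > self._window:
            self._drops.popleft()
        if len(self._drops) < self._threshold:
            return False
        if self._last_action is not None and now - self._last_action < self._cooldown:
            return False  # still cooling down: never thrash
        self._last_action = now
        return True
```

When should_act returns True, the supervisor would log the error, fire the alert, and rebuild the socket; a single blip never reaches the threshold, so it stays quiet.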

Resilient cursors

The worker tracks where each source was last processed in a small cursor table, so a crash never loses or re-does large amounts of work.

  • Chat cursor advances per window — as each window's entries are written, the cursor moves to that window's latest timestamp in the same transaction. A crash mid-channel re-processes at most the one in-flight window, not the whole batch. (Exactly-once isn't guaranteed across the tiny commit gap; a rare duplicate is acceptable — it just gets deduplicated or superseded.) On first run, with no cursor, it pulls from a configured historical start date.

  • Meeting cursor stores a date, advanced inside a transaction after processing. Each poll re-queries the last 2 days back from the cursor so late-arriving recordings aren't missed; UUID-level dedup makes that overlap free (nothing reprocesses). First run looks back 30 days.

  • A sync heartbeat is itself a cursor row, used by the health monitor's freshness checks.
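
The cursor behavior can be illustrated with SQLite: entries and the cursor move in one transaction, and the meeting poll derives its query start from the cursor with an overlap. The schema, the function names, and the source key format are hypothetical; only the 2-day overlap and 30-day first-run look-back come from the text.

```python
import sqlite3
from datetime import date, timedelta
from typing import List, Optional, Tuple

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE entries (source TEXT, ts TEXT, body TEXT NOT NULL);
CREATE TABLE cursors (source TEXT PRIMARY KEY, position TEXT);
"""


def write_window(
    conn: sqlite3.Connection,
    source: str,
    entries: List[Tuple[str, Optional[str]]],
    window_latest_ts: str,
) -> None:
    """Write one window's entries and advance the cursor atomically."""
    with conn:  # commits on success, rolls back if anything raises
        conn.executemany(
            "INSERT INTO entries (source, ts, body) VALUES (?, ?, ?)",
            [(source, ts, body) for ts, body in entries],
        )
        conn.execute(
            "INSERT OR REPLACE INTO cursors (source, position) VALUES (?, ?)",
            (source, window_latest_ts),
        )


def meeting_query_start(cursor: Optional[date], today: date) -> date:
    """Start date for a meeting poll: 2-day overlap, or 30 days on first run."""
    if cursor is None:
        return today - timedelta(days=30)
    return cursor - timedelta(days=2)
```

If a window fails partway through, neither its entries nor the cursor move, so the next run re-processes only that window.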

Health and degraded state

GET /health reports status: "ok" normally and status: "degraded" when the embedding service is unreachable. It also returns corpus size, patterns discovered, and the embedding-service boolean. The API degrades rather than crashes — /retrieve fails per-request while /feedback and /health stay up.

A worker-side health monitor runs every 5 minutes and alerts on: embedding service unreachable, DB check failed, or no successful sync in over 26 hours. It is off-hours aware — because ingestion is a daily batch, it alerts only on an overdue daily run (via the heartbeat), never on "no new entries right now." Alerts are edge-triggered: each problem alerts once when it starts failing, and recovery is logged.
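
Edge-triggered alerting and the overdue-sync check can be sketched as below. The class and function names are hypothetical, and the alert and log callables stand in for whatever the worker actually uses.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, Set

SYNC_OVERDUE = timedelta(hours=26)  # documented threshold


def sync_overdue(last_success: datetime, now: datetime) -> bool:
    """True only when the daily run is late, not when the corpus is merely quiet."""
    return now - last_success > SYNC_OVERDUE


class EdgeTriggeredMonitor:
    """Alerts once when a check starts failing and logs once when it recovers."""

    def __init__(self, alert: Callable[[str], None], log: Callable[[str], None]) -> None:
        self._alert = alert
        self._log = log
        self._failing: Set[str] = set()

    def observe(self, checks: Dict[str, bool]) -> None:
        """`checks` maps a check name to True (healthy) or False (failing)."""
        for name, ok in checks.items():
            if not ok and name not in self._failing:
                self._failing.add(name)       # ok -> failing edge
                self._alert(f"{name}: failing")
            elif ok and name in self._failing:
                self._failing.discard(name)   # failing -> ok edge
                self._log(f"{name}: recovered")
```

With a 5-minute interval, a level-triggered monitor would repeat the same alert twelve times an hour during an outage; tracking the failing set is what keeps each problem to a single alert.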

Alerting

A central notifier posts alerts to a configured target (MONITORING_ALERT_TARGET — a DM or channel; empty falls back to logging only). Each alert is a headline plus a threaded traceback (truncated for length). The alert path is best-effort and never raises, so alerting can't break ingestion.
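
A best-effort notifier of this shape might look like the following sketch. The post callable is a hypothetical transport taking (target, text, thread_id) and returning the posted message's id, and the truncation limit is an assumed value; neither is taken from the source.

```python
import logging
import traceback
from typing import Callable, Optional

log = logging.getLogger("alerts")
MAX_TRACEBACK_CHARS = 3000  # assumption: the real truncation limit is not stated

# Hypothetical transport: post(target, text, thread_id) -> id of the new message.
PostFn = Callable[[str, str, Optional[str]], str]


def notify(
    headline: str, exc: Optional[BaseException], post: PostFn, target: str
) -> bool:
    """Send a headline plus a threaded traceback; never raises."""
    try:
        if not target:
            # Empty target: fall back to logging only.
            log.error("ALERT (log only): %s", headline, exc_info=exc)
            return False
        parent_id = post(target, headline, None)
        if exc is not None:
            tb = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
            # Keep the tail, where the raising frame and message are.
            post(target, tb[-MAX_TRACEBACK_CHARS:], parent_id)
        return True
    except Exception:
        # Alerting must not break ingestion, so swallow and log.
        log.exception("alert delivery failed: %s", headline)
        return False
```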

Conditions that fire an alert:

  • daily chat sync failure and meeting poll failure,
  • embedding service unreachable at startup, and startup backfill failure,
  • health-monitor problems (embeddings down, DB check failed, sync overdue),
  • extraction-judge fail-closed drops — "judge failed; dropped N unverified entries (fail-closed)",
  • Slack socket flapping.

A separate daily report (its own target) summarizes question counts (strong-match / extrapolation / redirected), feedback, corpus stats, and top topics.

Fail-closed behaviors and retries

Fail-closed (when in doubt, keep it OUT):

  • Extraction judge — an entry survives only if explicitly cleared on all three flags (keep / grounded / distinctive). On persistent LLM or parse failure, the whole batch is dropped and an alert fires — a flaky judge never lets garbage through.
  • Approval fail-safe — a held or private meeting never reaches the publish channel or corpus without a human click.
  • Dedup merge-on-error — at ≥0.93 similarity, a candidate is merged even if the stance judge errors (fail toward merge, not insert).
  • Config fail-fast — settings are validated on load and raise on misconfiguration (missing keys, embedding dimension/model mismatch, rerank weights not summing to 1.0).
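
The extraction judge's fail-closed rule can be sketched as a filter that keeps an entry only when all three flags are explicitly true, and drops the whole batch when the judge keeps failing. The judge and alert callables are hypothetical, and the attempt count of 3 is an assumption; the source says only "persistent" failure.

```python
from typing import Callable, Dict, List, Sequence

Verdict = Dict[str, bool]
REQUIRED_FLAGS = ("keep", "grounded", "distinctive")


def judge_batch(
    entries: Sequence[str],
    judge: Callable[[Sequence[str]], List[Verdict]],
    alert: Callable[[str], None],
    attempts: int = 3,  # assumption: the real retry count is not documented
) -> List[str]:
    """Return only entries the judge explicitly cleared; drop all on failure."""
    for _ in range(attempts):
        try:
            verdicts = judge(entries)
            if len(verdicts) != len(entries):
                raise ValueError("verdict count does not match entry count")
            break
        except Exception:
            continue  # LLM or parse failure: try again
    else:
        # Every attempt failed: keep nothing and make noise.
        alert(f"judge failed; dropped {len(entries)} unverified entries (fail-closed)")
        return []
    # A missing or non-True flag counts as a rejection.
    return [
        entry
        for entry, verdict in zip(entries, verdicts)
        if all(verdict.get(flag) is True for flag in REQUIRED_FLAGS)
    ]
```

Treating a missing flag as a rejection is the key design choice: a malformed verdict can only remove an entry, never admit one.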

Retries and isolation (operational level):

  • LLM calls — up to 3 attempts with backoff; does not retry client errors except rate-limit/timeout. A process-wide rate limiter bounds concurrency and calls-per-minute.
  • Chat API — up to 5 retries on rate-limit, honoring the Retry-After header.
  • Transcript download — 2 retries with a longer (90 s) timeout than ordinary calls.
  • Per-item isolation — one channel failing never aborts the chat run; one meeting failing never aborts the poll; background ingest failures are caught and logged.
  • DB concurrency — write-ahead logging so the read-only API and the writer coexist, with a busy timeout, and atomic entry+cursor writes in a single transaction.
  • Restart policy — the worker and API restart automatically unless stopped; the worker waits for the embedding service to be healthy before starting.
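
The two retry policies can be sketched as small wrappers. The exception types and function names are hypothetical, and exponential backoff is an assumption (the text says only "with backoff"); the sleep function is injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error carrying the server's Retry-After value in seconds."""

    def __init__(self, retry_after: float) -> None:
        super().__init__(f"rate limited; retry after {retry_after}s")
        self.retry_after = retry_after


class LLMTimeout(Exception):
    """Hypothetical timeout error (retryable)."""


class ClientError(Exception):
    """Hypothetical non-retryable client error."""


def call_chat_api(
    call: Callable[[], T], max_retries: int = 5, sleep: Callable[[float], None] = time.sleep
) -> T:
    """Retry on rate-limit only, waiting as long as the server asked."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimited as exc:
            if attempt == max_retries:
                raise
            sleep(exc.retry_after)
    raise AssertionError("unreachable")


def call_llm(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Up to `attempts` tries; ClientError is not caught, so it fails at once."""
    for attempt in range(attempts):
        try:
            return call()
        except (RateLimited, LLMTimeout):
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # assumed exponential backoff
    raise AssertionError("unreachable")
```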