Health monitoring

How Forven's health monitor tracks component states, enforces data-stream SLAs, and routes amber and red alerts so you catch trouble early.

The health monitor is Forven's background watchdog. It runs as an async task on a roughly 30-second heartbeat and aggregates the state of the moving parts (the scheduler, the brain workers, the bot, the data collector, and the lab) into a single colour-coded picture: green, amber, or red. It also watches each data stream against its own staleness SLA, so a stalled OHLCV feed or a silent funding stream surfaces before it quietly corrupts a backtest or a live decision.

This page is for operators who want to read those states correctly and know when to act. You'll find the health surface inside the /ops dashboard, alongside the system controls and the scheduler.

What the health monitor watches

The monitor has two jobs: component liveness and data freshness.

Component states. It aggregates signals from the long-running subsystems:

  • the scheduler (are jobs ticking, or are locks stale?)
  • the brain workers (the orchestrator's processing loop)
  • the bot (live trading / scanner loop)
  • the data collector (ingestion pipeline)
  • the lab

Each component resolves to one of three states:

State   Meaning
green   Healthy: running and within its expected cadence.
amber   Overdue or degraded: a stream is past its SLA, or a component is lagging.
red     Critical: a stream is badly stale (past roughly twice its SLA) or a component is down.
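
As a rough illustration of the heartbeat described above, the sketch below polls a set of component probes concurrently and hands each snapshot to a callback. The probe type, the function names, and the rule that a probe which raises is reported as red are all assumptions made for the example; Forven's actual monitor may be structured quite differently.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional

HEARTBEAT_SECONDS = 30  # "roughly 30-second heartbeat" per the text above

# Hypothetical probe type: an async callable returning "green", "amber", or "red".
Probe = Callable[[], Awaitable[str]]


async def collect_states(probes: Dict[str, Probe]) -> Dict[str, str]:
    """Run every probe concurrently; a probe that raises is reported as red (assumption)."""
    names = list(probes)
    results = await asyncio.gather(
        *(probes[name]() for name in names), return_exceptions=True
    )
    return {
        name: "red" if isinstance(result, BaseException) else result
        for name, result in zip(names, results)
    }


async def monitor_loop(
    probes: Dict[str, Probe],
    on_snapshot: Callable[[Dict[str, str]], None],
    interval: float = HEARTBEAT_SECONDS,
    max_ticks: Optional[int] = None,
) -> None:
    """Publish one snapshot per heartbeat; max_ticks exists only so demos can stop."""
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        on_snapshot(await collect_states(probes))
        ticks += 1
        await asyncio.sleep(interval)
```

Running the probes with `asyncio.gather` keeps one slow component from delaying the others within a single heartbeat, which is one plausible reason to build the monitor as an async task.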

Data-stream SLAs. Each market-data stream has its own freshness window. If the newest row for a stream is older than its SLA, the monitor raises an amber alert for that stream; if it crosses roughly twice the SLA, it escalates to red.

Stream               Illustrative SLA
ohlcv                60 minutes
oi (open interest)   3 hours
funding              12 hours

The SLA values above are illustrative defaults drawn from the current build. The exact windows are per-stream and may change between releases — treat them as orientation, not a contract.

The SLAs are per-stream and independent. There is no single rolled-up "database is healthy" verdict — each stream can go amber or red on its own. That is deliberate: a stale funding feed should not be masked by a perfectly fresh ohlcv feed.
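
The per-stream rule can be sketched in a few lines. This is a minimal illustration that reuses the illustrative SLA values from the table above and assumes a red threshold of exactly twice the SLA (the text only says "roughly twice"). The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

# Illustrative windows from the table above; real values are per-stream and may change.
STREAM_SLAS = {
    "ohlcv": timedelta(minutes=60),
    "oi": timedelta(hours=3),
    "funding": timedelta(hours=12),
}

RED_MULTIPLIER = 2  # assumption: the text says "roughly twice the SLA"


def classify_stream(newest_row_at: datetime, sla: timedelta, now: datetime) -> str:
    """Return green, amber, or red from the age of the stream's newest row."""
    age = now - newest_row_at
    if age > sla * RED_MULTIPLIER:
        return "red"
    if age > sla:
        return "amber"
    return "green"


def classify_all(newest_rows: Dict[str, datetime], now: datetime) -> Dict[str, str]:
    """Classify each stream on its own; no rolled-up verdict is computed."""
    return {
        name: classify_stream(newest_rows[name], sla, now)
        for name, sla in STREAM_SLAS.items()
        if name in newest_rows
    }


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
states = classify_all(
    {
        "ohlcv": now - timedelta(minutes=30),  # inside its 60-minute window
        "oi": now - timedelta(hours=4),        # past 3 hours, not yet past 6
        "funding": now - timedelta(hours=25),  # past twice the 12-hour window
    },
    now,
)
print(states)
```

Because the result is a mapping rather than a single value, a fresh ohlcv feed cannot hide a stale funding feed.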

Reading the states

When you open /ops and look at the health section, read it top-down:

  1. Component row first. If a component is amber or red, that is usually the root cause — a paused or hung scheduler will starve everything downstream.
  2. Then the streams. A red ohlcv stream with green components usually means a data-source problem (rate limit, credentials, upstream outage), not an app fault. See Data sources for where each feed comes from.
  3. Cross-check the scheduler. Stale data is often a stalled collector job. The health monitor and the scheduler tell complementary halves of the same story.
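
The reading order above can be written down as a small triage helper. This is only a sketch of the heuristic: the function name and the hint strings are invented for illustration, and the hints are starting points for investigation, not diagnoses.

```python
from typing import Dict


def triage(components: Dict[str, str], streams: Dict[str, str]) -> str:
    """Return a first-look hint, checking components before streams."""
    bad_components = sorted(n for n, state in components.items() if state != "green")
    bad_streams = sorted(n for n, state in streams.items() if state != "green")

    if bad_components:
        # A degraded component usually explains stale streams downstream.
        return "check components first: " + ", ".join(bad_components)
    if bad_streams:
        # Green components with stale streams usually point at the data source.
        return "suspect data source for: " + ", ".join(bad_streams)
    return "all green"
```

For example, `triage({"scheduler": "green"}, {"ohlcv": "red"})` points at the data source rather than the app, which matches step 2.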

Health is about whether the machine is running honestly, not whether a strategy is working. A perfectly green system can still be running strategies that should be killed. Forven is a research tool: a healthy dashboard says nothing about future results.

How alerts route

The health monitor does not just colour cells — it emits alerts. When a component or stream crosses a threshold, the monitor calls the same emit_notification() path everything else uses, so health events flow through the standard notifications routing policy.

  • Health events at warning severity or above are routed out (for example, to Discord) according to your notification preferences.
  • Routing obeys the usual dedupe-by-key and cooldown rules, so a stream that flaps amber/green does not spam you.
  • Critical health alerts are severity-aware: a critical alert is not deduplicated against an earlier lower-severity alert that shares its dedupe key, so a genuinely critical condition is never silently suppressed by an earlier benign one.
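
The routing rules above can be modelled with a small in-memory gate. This is a sketch under stated assumptions, not the notifications code itself: the class name, the severity names other than warning and critical, and the 300-second cooldown are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Assumed severity ladder; only "warning" and "critical" are named in the text.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass
class DedupeGate:
    cooldown_seconds: float = 300.0  # hypothetical cooldown window
    # dedupe_key -> (time sent, severity rank) of the last alert that was routed
    last_sent: Dict[str, Tuple[float, int]] = field(default_factory=dict)

    def should_route(self, dedupe_key: str, severity: str, now: float) -> bool:
        rank = SEVERITY_RANK[severity]
        if rank < SEVERITY_RANK["warning"]:
            return False  # below the routing threshold
        previous = self.last_sent.get(dedupe_key)
        if previous is not None:
            sent_at, sent_rank = previous
            in_cooldown = (now - sent_at) < self.cooldown_seconds
            # Suppress only when the earlier alert was at least as severe,
            # so a critical alert still goes out after a recent warning.
            if in_cooldown and sent_rank >= rank:
                return False
        self.last_sent[dedupe_key] = (now, rank)
        return True
```

With this gate, a warning followed a minute later by a critical alert on the same key routes both, while a repeat of either inside the cooldown is dropped.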

If a stream recovers, its state returns to green on the next heartbeat and subsequent alerts stop.

Turning a health alert into a fix

A health notification can be escalated into work by handing it to a repair agent. The example below uses PowerShell syntax (curl.exe with backtick line continuations); substitute the notification id for NOTIFICATION_ID:

# Create a repair task from a notification (operator action)
curl.exe -X POST http://127.0.0.1:8003/api/notifications/NOTIFICATION_ID/repair `
  -H "Content-Type: application/json" `
  -d '{\"agent_id\": \"full-stack-engineer\"}'

That creates an agent task (type=notification_repair) carrying the event payload; the agent investigates and writes its findings back to the task output. You can also acknowledge an alert (POST /api/notifications/NOTIFICATION_ID/acknowledge) or re-route it through the current policy (POST /api/notifications/NOTIFICATION_ID/resend).
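
If you script these calls from Python instead, a request builder along the following lines may be a useful starting point. It uses only the standard library and the paths quoted above. The helper name is hypothetical, and sending an empty JSON object for acknowledge and resend is an assumption, since the page does not describe their request bodies.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:8003"  # host and port taken from the curl example above
ACTIONS = {"repair", "acknowledge", "resend"}


def build_notification_request(
    notification_id: str, action: str, body: Optional[dict] = None
) -> urllib.request.Request:
    """Build (but do not send) a POST for one of the notification actions."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    # Assumption: an empty JSON object is acceptable when no body is needed.
    data = json.dumps(body or {}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/notifications/{notification_id}/{action}",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_notification_request(
    "NOTIFICATION_ID", "repair", {"agent_id": "full-stack-engineer"}
)
print(request.get_method(), request.full_url)
# To actually send it against a running instance:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Building the request separately from sending it makes the call easy to inspect or log before it touches a live system.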

Steps: check system health

Day to day, you read health from the dashboard. To run a deliberate check:

  1. Open the desktop app and go to /ops.
  2. Find the health section. Note the colour of each component (scheduler, brain workers, bot, data collector, lab).
  3. If any component is amber or red, open the scheduler section on the same page and look for jobs with a recent last_error or a stale lock — see Troubleshooting.
  4. Check the data-stream rows. A stream past its SLA shows amber; one past roughly twice its SLA shows red.
  5. For a stale data stream, confirm the relevant collector job is enabled and running, and that its data source credentials/connectivity are intact.
  6. If you need an audit trail, check the notification center — every health alert that crossed the routing threshold is logged there with event_type, severity, source, and summary.

What you'll see: the /ops health section renders each component and stream as a green / amber / red indicator that refreshes on the monitor's heartbeat (about every 30 seconds). Amber and red items also appear in the in-app notification center, and — if routing is enabled — in Discord.

The daemon heartbeat and history

The health monitor leans on heartbeats written by the running loops. Those heartbeats are short-lived by design: the heartbeat_activity log is pruned to roughly 2 days by the maintenance job. That keeps the table small, but it means heartbeat history is not a long-term audit source — if you are reconstructing an incident from last week, the raw heartbeats will already be gone.

The same is true of the broader audit trail: notifications are retained about 60 days, and the activity_log about 90 days, before pruning. Keep your own backups if you need a longer record. See Database & maintenance for the full retention picture; maintenance windows are settings-driven via the forven:pipeline:settings key.

If you are troubleshooting something that happened more than a couple of days ago, the heartbeat rows that would explain the gap have likely been pruned. Capture a backup before they age out.
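
To judge quickly which records are likely to still exist for a given incident, the approximate retention windows quoted above can be put into a lookup. The figures are the approximate ones from this page, the function name is hypothetical, and actual pruning depends on when the maintenance job last ran.

```python
from datetime import datetime, timedelta, timezone
from typing import List

# Approximate retention windows as stated in this section.
RETENTION = {
    "heartbeat_activity": timedelta(days=2),
    "notifications": timedelta(days=60),
    "activity_log": timedelta(days=90),
}


def surviving_sources(incident_at: datetime, now: datetime) -> List[str]:
    """List the tables that should still hold rows from the time of the incident."""
    age = now - incident_at
    return [name for name, window in RETENTION.items() if age < window]


now = datetime(2024, 3, 1, tzinfo=timezone.utc)
print(surviving_sources(now - timedelta(days=7), now))
```

For an incident a week old, only the notifications and the activity_log remain, which is why a backup taken within the first two days matters for heartbeat-level detail.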

Caveats

A few honest rough edges to keep in mind:

  • No global rollup. Stream SLAs are evaluated per stream. The monitor can fire amber on funding while ohlcv is green; there is no single overall data-health number to glance at.
  • Startup catch-up can look alarming. After an app restart, the scheduler collapses any job that is more than a minute stale into a single immediate run, rather than replaying the whole missed queue. You may briefly see a burst of activity — that is the catch-up, not a fault.
  • Stale locks are not always recoverable on sight. If a job appears hung, the scheduler will not force-recover its lock while a background task or worker thread is still alive — the lock is held until that thread exits. A persistently red scheduler often means a slow external call (an LLM or exchange request) is still running. See Troubleshooting.
  • Health says nothing about the correctness of trades. It tracks liveness and freshness only.
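
The startup catch-up behaviour in the list above can be illustrated with a small planning function. This is one interpretation, not the scheduler's code: it assumes "stale" means the job's next due time is more than a minute in the past, and the function name and return shape are invented for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

STALE_THRESHOLD = timedelta(minutes=1)  # "more than a minute stale" per the text


def plan_startup_run(
    next_due: datetime, interval: timedelta, now: datetime
) -> Tuple[int, int]:
    """Return (runs to execute now, missed slots that are skipped)."""
    overdue = now - next_due
    if overdue <= STALE_THRESHOLD:
        return (0, 0)  # not stale; the normal schedule handles it
    # Slots that came due while the app was down, including next_due itself.
    missed_slots = overdue // interval + 1
    # Collapse them into one immediate run instead of replaying each slot.
    return (1, missed_slots - 1)


now = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
print(plan_startup_run(now - timedelta(hours=1), timedelta(minutes=15), now))
```

A 15-minute job that was due an hour ago has five slots outstanding; under this reading the scheduler runs it once and skips the other four, which is the short burst you see after a restart.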

Forven is a research tool. A green health dashboard means the system is running and its data is fresh — nothing more. It is not a measure of strategy quality, results are not predictive of future performance, and nothing here is financial advice.