Troubleshooting & recovery

Diagnose and recover from common Forven faults — stuck scheduler locks, an offline daemon, a tripped kill-switch, and MCP connection errors.

When something in the lab stalls, the cause is almost always one of a handful of known faults: a scheduler lock that never released, a paused system that looks broken, a tripped safety control, or an MCP client that can't reach the backend. This page is the operator's recovery runbook. Each section names the real symptom, the real cause, and the exact route or command to clear it.

Forven is local-first. Nearly every fault is recoverable from your own machine — no support ticket, no remote reset. Start with health monitoring to see which component is red, then jump to the matching section below.

Before you start

Two quick checks resolve a surprising share of "it's broken" reports:

  1. Is the system paused? A paused system skips every scheduler job and blocks trading. That is not a fault — it is the system pause flag doing its job. Open /ops and confirm the system is running, not paused.
  2. What autonomy mode are you in? In manual mode, no autonomous jobs run at all. If research, testing, or scanning seems frozen, you may simply be in manual. See autonomy modes.

Pause and autonomy mode are orthogonal. The system can be running while autonomy is manual — and in that case the scheduler is working exactly as configured.

Scheduler stuck, stale locks & zombie threads

The scheduler runs 35+ built-in jobs (the promotion loop, verdict loop, scanner, daily learning, maintenance, and more). Each job that's mid-run holds a lock in the scheduler_jobs table via its running_since timestamp. If a job hangs or its process is killed, that lock can survive — and the job appears stuck forever.

Symptoms

  • A job's last_run_at is old and next_run_at is in the past, but it never fires.
  • The scheduler section of /ops shows a job as running long past its expected duration.
  • A last_error is set but the job won't retry.
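
To see which jobs currently hold a lock, a read-only query against the scheduler_jobs table is usually enough. The sketch below is a minimal illustration, not Forven's own tooling: the table name and the running_since column come from this page, while the helper name jobs_holding_locks and the database location in the usage comment are assumptions.

```python
import sqlite3


def jobs_holding_locks(conn: sqlite3.Connection) -> list[dict]:
    """Return every scheduler_jobs row whose running_since is set (a held lock).

    Hypothetical helper: it only reads, and it selects all columns so it does
    not depend on column names this page does not document.
    """
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT * FROM scheduler_jobs WHERE running_since IS NOT NULL"
    ).fetchall()
    return [dict(row) for row in rows]


# Usage (assumed path: FORVEN_HOME defaults to ~/.forven, database forven.db):
#   from pathlib import Path
#   conn = sqlite3.connect(Path.home() / ".forven" / "forven.db")
#   for job in jobs_holding_locks(conn):
#       print(job)
```

Compare each row's running_since against the job's expected duration before concluding that the lock is stale.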

How recovery works

Forven recovers stale locks automatically. On each tick the scheduler calls recover_stale_scheduler_job_locks(); at startup, reset_scheduler_job_locks() clears orphaned locks before the first tick. The per-tick recovery works like this:

  1. It checks whether the job still has a live background task (asyncio.Task not done) or a live zombie thread — a worker thread that outlived a timed-out sync job (the B-30 tracking path).
  2. If either is still alive, the lock is kept — this is deliberate, and prevents the same job from running twice.
  3. Otherwise, if running_since is older than the per-kind stale threshold, or older than the hard absolute cap of 3900 seconds (65 minutes), the lock is force-recovered (running_since is set to NULL).
  4. The next tick re-runs the job immediately.

Zombie threads release their own lock through a done_callback once the underlying thread finally exits.
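
The decision described in the steps above can be sketched as a single pure function. This is an illustration of the rules as stated on this page, not the body of recover_stale_scheduler_job_locks(): the function name, its parameters, and the example threshold are assumptions; only the 3900-second cap and the liveness rule come from the text.

```python
from datetime import datetime, timedelta, timezone

HARD_CAP_SECONDS = 3900  # hard absolute cap quoted on this page


def should_force_recover(running_since, now, stale_threshold_s,
                         has_live_task, has_live_zombie):
    """Return True when a lock should be cleared (running_since set to NULL).

    Hypothetical signature: stale_threshold_s stands in for the per-kind
    stale threshold, whose real values are not listed here.
    """
    if running_since is None:
        return False  # no lock is held
    if has_live_task or has_live_zombie:
        return False  # a live worker keeps its lock, so the job never runs twice
    age_s = (now - running_since).total_seconds()
    return age_s > stale_threshold_s or age_s > HARD_CAP_SECONDS


# Example: a lock held for 4000 seconds with no live worker is recovered.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(should_force_recover(now - timedelta(seconds=4000), now, 600, False, False))
```

Because liveness is checked first, a long-running but healthy job is never interrupted by this path; only locks with no worker behind them are cleared.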

What to do

Most stale locks clear on their own within one tick. If a job stays stuck, the simplest recovery is a backend restart — startup runs reset_scheduler_job_locks(), which clears any orphaned lock before the first tick. Close and reopen the desktop app (or restart your dev launcher) and watch the scheduler section of /ops: the job's last_error and stale running_since should clear and next_run_at should advance.

If the lock refuses to clear, the cause is usually a genuinely live worker blocked on a slow external call — a cold backtest cache, a slow LLM completion, or an exchange request that hasn't returned. That is the recovery system working as designed: it will not yank a lock out from under a thread that's still executing. Find and resolve the hung call rather than forcing the lock.

Note on startup catch-up

After a restart, the scheduler collapses any job that's more than a minute stale into a single immediate run, rather than firing every missed interval. Expect one burst of activity on launch, not a flood — this is intended.

Research daemon offline

The research daemon is the autonomous loop that invents hypotheses, runs them through the gauntlet, and retires losers. If it's quiet, work through these in order.

Checklist

  1. System paused? A paused system halts the daemon. Resume from /ops.
  2. Autonomy mode? In manual mode the daemon does no generation. In semi_auto, generation is paused but testing and scanning still run. Only auto runs the full loop. See autonomy modes.
  3. Generation paused? Generation can be frozen independently to drain a backlog — a deliberate throttle, not a fault.
  4. Heartbeat fresh? The health monitor tracks the daemon heartbeat on roughly a 30-second interval. A stale heartbeat surfaces as an amber or red component state.
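
The first three checks combine into a small truth table. The sketch below encodes it as described on this page; the function name and return shape are hypothetical, and it assumes that pausing generation in auto mode leaves testing and scanning running (the page describes it only as a throttle to drain a backlog).

```python
def daemon_activity(system_paused: bool, mode: str,
                    generation_paused: bool = False) -> dict:
    """Report which daemon activities are expected to run (illustrative only)."""
    if mode not in ("manual", "semi_auto", "auto"):
        raise ValueError(f"unknown autonomy mode: {mode!r}")
    if system_paused or mode == "manual":
        # A paused system or manual mode means no autonomous work at all.
        return {"generation": False, "testing": False, "scanning": False}
    # semi_auto never generates; auto generates unless generation is paused.
    generation = mode == "auto" and not generation_paused
    return {"generation": generation, "testing": True, "scanning": True}


print(daemon_activity(system_paused=False, mode="semi_auto"))
```

If the observed behavior matches this table, the daemon is not offline; it is doing what the current pause and mode settings ask for.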

Generation pause is sticky in manual mode

If the mode is manual, resuming generation on its own has no effect: set_generation_paused(False) is rejected and the mode stays manual. To leave manual, explicitly set the system mode to semi_auto or auto. Toggling "Resume Generation" by itself will appear to do nothing.

Approval queue blocked

If pipeline transitions or capital-tier promotions seem to stall waiting for sign-off, the approvals queue is the gate. A blocked queue is usually one of:

  • The system is paused, so the work behind the approval never starts.
  • Autonomy is manual, so nothing is being proposed for approval.
  • An approval is genuinely pending and waiting on you.

Open the approvals queue, clear or revise the pending items, then confirm the system is running and in the intended autonomy mode. If a specific task is wedged behind an approval, inspect it in the task queue — the task-detail audit log shows the exact tool call and error context.

Kill-switch tripped & trading halted

The kill-switch is a drawdown circuit breaker. When portfolio drawdown exceeds max_drawdown_pct (illustrative default 10%), it emergency-closes all open positions at market and halts trading until you reset it manually. This is a safety feature, not a bug — but you need to know how to read and clear it. For the full control, see risk & safety.

Symptoms

  • /api/risk reports kill_switch_active = true with a kill_switch_triggered_at timestamp.
  • New trades are refused; is_trading_allowed() returns false.
  • The risk page shows the kill-switch banner.
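
The first symptom can be checked from a script. This is a minimal sketch under assumptions: it assumes /api/risk returns a JSON object with kill_switch_active and kill_switch_triggered_at at the top level, that the backend is on the default http://127.0.0.1:8003, and that auth is off. The helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_risk(base_url: str = "http://127.0.0.1:8003", timeout: float = 5.0) -> dict:
    """GET /api/risk and decode the body as JSON (response shape is assumed)."""
    with urlopen(f"{base_url}/api/risk", timeout=timeout) as resp:
        return json.load(resp)


def describe_kill_switch(risk: dict) -> str:
    """Summarize the two kill-switch fields named on this page."""
    if risk.get("kill_switch_active"):
        return f"kill-switch ACTIVE since {risk.get('kill_switch_triggered_at')}"
    return "kill-switch not active"


# Usage, with the backend running:
#   print(describe_kill_switch(fetch_risk()))
```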

Recovery

  1. Open the /risk page and read the trigger time and current portfolio state.
  2. Confirm positions actually closed. If a close failed after its retries, you'll see close_reason = kill_switch_close_fail — that position is still open and needs manual review.
  3. Reset only when you understand why drawdown breached. Use the reset-kill-switch button on the /risk page (operator-only), backed by POST /api/ops/reset-kill-switch. This clears kill_switch_active and re-baselines the high-water mark.

Resetting re-baselines your high-water mark

A reset re-baselines the high-water mark (HWM) to your current equity. The next kill-switch threshold is then measured from the reset point, not the original peak. Reset deliberately, after you understand the drawdown — not reflexively to silence the halt. Nothing here is financial advice.
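
A toy model makes the re-baselining effect concrete. The class below is an illustration of a high-water-mark drawdown breaker in general, not Forven's implementation; the class name, method names, and equity figures are invented for the example, and the 10% limit is the illustrative default quoted above.

```python
class DrawdownBreaker:
    """Toy drawdown circuit breaker measured from a high-water mark (HWM)."""

    def __init__(self, starting_equity: float, max_drawdown_pct: float = 10.0):
        self.hwm = starting_equity
        self.max_drawdown_pct = max_drawdown_pct
        self.active = False

    def trip_level(self) -> float:
        """Equity below which the breaker trips, given the current HWM."""
        return self.hwm * (1 - self.max_drawdown_pct / 100)

    def update(self, equity: float) -> bool:
        """Feed a new equity reading; returns True while the breaker is tripped."""
        if self.active:
            return True  # stays tripped until reset manually
        self.hwm = max(self.hwm, equity)
        drawdown_pct = (self.hwm - equity) / self.hwm * 100
        if drawdown_pct > self.max_drawdown_pct:
            self.active = True
        return self.active

    def reset(self, current_equity: float) -> None:
        """Clear the trip and re-baseline the HWM to current equity."""
        self.active = False
        self.hwm = current_equity


breaker = DrawdownBreaker(10_000)   # trips below roughly 9,000
breaker.update(8_900)               # 11% drawdown, so the breaker trips
breaker.reset(8_900)                # HWM is now 8,900
print(round(breaker.trip_level()))  # next trip is measured from the reset point
```

In this example the second threshold sits at roughly 8,010 rather than 9,000, which is why a reflexive reset quietly widens the total loss the breaker will tolerate from the original peak.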

A related auto-halt is the daily-loss halt: if the same-day realized loss exceeds max_daily_loss_pct (illustrative default 5%), new trades are blocked until the next UTC day. It clears on its own at UTC midnight, so there is nothing to reset.
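
To work out when a daily-loss halt will lift in your own time zone, compute the next UTC midnight. This is a small convenience sketch; the function name is hypothetical, and it expects a timezone-aware datetime.

```python
from datetime import datetime, time, timedelta, timezone


def daily_halt_clears_at(now: datetime) -> datetime:
    """Return the next UTC midnight after `now` (which must be timezone-aware)."""
    now_utc = now.astimezone(timezone.utc)
    next_day = now_utc.date() + timedelta(days=1)
    return datetime.combine(next_day, time(0, 0), tzinfo=timezone.utc)


# Usage: show the clearing time in the machine's local zone.
clears = daily_halt_clears_at(datetime.now(timezone.utc))
print(clears.astimezone())
```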

Strategy container / bot reset

If a live or paper strategy container is in a bad state — for example a position that exists on the exchange but has no matching trade record, or a record stuck pending reconciliation — the issue is usually reconciliation, not corruption.

  • Phantom positions (filled on the exchange, missing from SQLite) are detected at startup and roughly every 30 minutes. Recovery creates a fresh trade record that is not linked to any strategy, so it may need manual review. See circuit breakers and the live-safety material for the reconciliation model.
  • Pending-open trades (no exchange order ID captured) free their risk slot after 180 seconds; the exchange-verify path closes genuinely unfilled trades within that window.
  • To take a single live position off the board manually, use force-close on the live trading page, backed by POST /api/trading/close.

If a container is misbehaving at the lifecycle level rather than the position level, demote it and re-validate through the gauntlet rather than editing state by hand.

Backup & restore

Everything Forven knows lives under FORVEN_HOME (default ~/.forven): config.json, the SQLite databases, the workspace .md files, and the ChromaDB memory. Backing up the lab is a file copy.

# Stop the app first so the SQLite WAL is checkpointed and quiescent
Copy-Item -Recurse "$env:USERPROFILE\.forven" "$env:USERPROFILE\.forven-backup-$(Get-Date -Format yyyyMMdd)"

Two cautions from the way storage is laid out:

  • The databases run in WAL mode, so each is three files (forven.db, forven.db-wal, forven.db-shm). Copy them together, with the app stopped, so the write-ahead log is consistent.
  • The encryption key is stored outside FORVEN_HOME in a non-synced location (%LOCALAPPDATA%\Forven\.forven_key on Windows). OneDrive and Dropbox do not sync it, and it isn't inside the folder you just copied. Back the key up separately, or your encrypted credentials in auth.json can't be decrypted on restore. See environment variables for FORVEN_ENCRYPTION_KEY.
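
After copying, it is worth confirming that the backed-up database opens cleanly. The sketch below runs SQLite's built-in PRAGMA integrity_check against the copy; the helper name and the path in the usage comment are assumptions, and it should be pointed at the backup, not the live database.

```python
import sqlite3
from pathlib import Path


def backup_is_intact(db_path) -> bool:
    """Return True if the SQLite file exists and passes PRAGMA integrity_check."""
    if not Path(db_path).is_file():
        return False  # sqlite3.connect would otherwise create an empty file
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError:
        return False  # for example, the file is not a SQLite database
    finally:
        conn.close()
    return rows == [("ok",)]


# Usage (hypothetical backup folder name):
#   print(backup_is_intact(Path.home() / ".forven-backup-20240501" / "forven.db"))
```

A passing check covers the database file only. It says nothing about whether the separately stored encryption key was backed up.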

Routine pruning, WAL checkpointing, and optional VACUUM are handled by the maintenance job — see database & maintenance. If you need to clear specific categories of data without touching credentials, use the operator-gated factory-reset categories documented there.

MCP connection issues

The Forven MCP server is a separate stdio process that an MCP client (such as Claude Desktop) spawns, and that in turn makes HTTP calls to your running Forven backend. Most failures are configuration, not bugs. Match the symptom below.

"Tool call timed out"

Long backtests on a cold cache can exceed the default HTTP timeout. Raise it:

{
  "mcpServers": {
    "forven": {
      "command": "C:\\Users\\you\\.venv\\Scripts\\python.exe",
      "args": ["-m", "forven.mcp_server"],
      "env": {
        "PYTHONPATH": "C:\\Users\\you\\projects\\forven",
        "FORVEN_API_URL": "http://127.0.0.1:8003",
        "FORVEN_MCP_TIMEOUT": "180"
      }
    }
  }
}

FORVEN_MCP_TIMEOUT is in seconds (default 60). Increase it for slow caches or long gauntlet runs.

"Connection refused"

The MCP server is just an HTTP client — if the backend is down or on a different port, every tool call fails. Confirm the backend is up and FORVEN_API_URL matches its address (default http://127.0.0.1:8003):

# Is the backend answering on the expected port?
Invoke-WebRequest http://127.0.0.1:8003/api/health -UseBasicParsing

# Start the backend if it isn't running
python -m uvicorn forven.api:app --port 8003
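
The same reachability check can be scripted in Python, for example in a pre-flight step before launching an MCP client. This is a minimal sketch: the /api/health route and default address come from this page, the function name is hypothetical, and treating only HTTP 200 as healthy is an assumption.

```python
from urllib.error import URLError
from urllib.request import urlopen


def backend_is_up(api_url: str = "http://127.0.0.1:8003", timeout: float = 3.0) -> bool:
    """Return True if GET {api_url}/api/health answers with HTTP 200."""
    try:
        with urlopen(f"{api_url}/api/health", timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx responses.
        return False


# Usage: pass the same value you set for FORVEN_API_URL in the MCP config.
#   print(backend_is_up("http://127.0.0.1:8003"))
```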

"401 Invalid or missing operator key"

If the backend was started with FORVEN_AUTH_REQUIRED=true, the MCP config must carry the matching keys. Add FORVEN_API_KEY and FORVEN_OPERATOR_KEY to the server's env block in your MCP config. By default the backend runs without auth on localhost, so this only bites when you've turned auth on.

"forven not importable" / module-not-found

When the client spawns the server from its own working directory, Python can't find the forven package unless PYTHONPATH points at the repo root. Set PYTHONPATH to your absolute repo path, and make command an absolute path to the venv interpreter — relative paths fail because the client doesn't spawn from your project directory.
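
The configuration mistakes described in the last two sections can be caught before restarting the client. The linter below is a hypothetical helper, not part of Forven: it checks one server entry from the mcpServers block against the rules stated on this page (absolute command path, PYTHONPATH set, and both keys present when auth is enabled).

```python
from pathlib import PurePosixPath, PureWindowsPath


def lint_mcp_entry(entry: dict, auth_required: bool = False) -> list[str]:
    """Return a list of human-readable problems found in one MCP server entry."""
    problems = []
    command = entry.get("command", "")
    # Accept absolute paths in either Windows or POSIX form.
    if not (PureWindowsPath(command).is_absolute()
            or PurePosixPath(command).is_absolute()):
        problems.append("command should be an absolute path to the venv interpreter")
    env = entry.get("env", {})
    if not env.get("PYTHONPATH"):
        problems.append("PYTHONPATH should point at the absolute repo root")
    if auth_required:
        for key in ("FORVEN_API_KEY", "FORVEN_OPERATOR_KEY"):
            if not env.get(key):
                problems.append(f"{key} is missing from env")
    return problems


entry = {"command": "python", "args": ["-m", "forven.mcp_server"], "env": {}}
for problem in lint_mcp_entry(entry, auth_required=True):
    print(problem)
```

To use it on a real file, load the client config with json.load and pass config["mcpServers"]["forven"] as the entry.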

Tools don't appear

The client only reads MCP config at startup. After editing .mcp.json or claude_desktop_config.json, restart the client — tools do not hot-reload.

Where to look for MCP logs

On Windows, the client writes a per-server log to %APPDATA%\Claude\logs\mcp-server-forven.log. Check it for the exact failure — timeout, refused connection, or 401 — before changing config.

For the full client config, transport details, and tool catalog, see the MCP server reference.

When all else fails

A clean restart resolves transient lock and worker faults more often than any single command:

  1. Stop the system from /ops (or close the desktop app) so trading and jobs quiesce.
  2. Confirm no live positions are mid-close.
  3. Restart. Startup re-runs lock recovery, phantom reconciliation, and the health checks.
  4. Re-open health monitoring and confirm every component is green before resuming autonomy.

If a fault survives a restart, it's a real state issue — capture the component state, the offending job's last_error, and the relevant log, rather than forcing flags by hand.

A research tool, not a trading bot

Forven is a research environment for building and stress-testing strategies. Backtest and paper results are illustrative, are not predictive of live performance, and nothing in this documentation is financial advice. Beta builds hard-lock execution to paper, so most recovery here concerns the research pipeline rather than real capital.

  • Health monitoring — read component states and data-stream SLAs first
  • Scheduler & jobs — list, enable, and tune the built-in jobs
  • MCP server — connect a client and verify the tools
  • FAQ — short answers to the common questions