Data sources

Reference for every market-data source Forven ingests — Binance/CCXT, Binance Vision, Polygon, Yahoo, CSV — plus symbol formats, enrichment streams, and the market calendar.

This is the source-by-source reference for Forven's market-data layer. The data manager page covers the /data UI and the day-to-day backfill workflow; this page documents what each source provides, how symbols are formatted across them, the enrichment streams that ride alongside OHLCV, and how the market calendar shapes session awareness.

It is written for developers and operators who need the exact source names, symbol conventions, config keys, and API endpoints. Everything below is what the data layer actually does — no source is documented that Forven does not ingest.

The sources at a glance

Forven ingests OHLCV from five source families; the table lists six rows because Binance and its CCXT adapter are shown separately. A symbol's asset class is detected from its format, and the right source is used automatically when you do not pick one.

| Source | Identifier | Good for | Requires a key |
| --- | --- | --- | --- |
| Binance (spot + futures) | binance | Crypto OHLCV, the default for live-traded symbols | No |
| CCXT adapter | ccxt | Additional exchanges beyond Binance | No (exchange-dependent) |
| Binance Vision | binance-vision | Bulk historical crypto archives (years of bars) | No |
| Polygon.io | polygon | Multi-asset: stocks, forex, indices, crypto | Yes (POLYGON_API_KEY) |
| Yahoo Finance | yahoo | Macro series (VIX, DXY, bonds, sector ETFs) | No |
| CSV upload | csv | Your own bars or data not covered above | No |

The live /api/data/sources endpoint reports the same list with an availability flag, a required_key flag, and the asset_types each source can serve. Use it to confirm what is reachable in your install before you start a fetch.

# List available sources and which ones need a key
curl http://127.0.0.1:8003/api/data/sources
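
For scripting, the same check can be sketched in Python. This is a minimal sketch, not the documented client: it assumes the endpoint returns a JSON list of objects, and the field names "id" and "available" are guesses (only required_key and asset_types are named on this page). Both helper functions are hypothetical.

```python
import json
from urllib.request import urlopen

SOURCES_URL = "http://127.0.0.1:8003/api/data/sources"


def fetch_sources(url: str = SOURCES_URL) -> list[dict]:
    """Download the source list from the local backend (assumed to be a JSON list)."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def usable_sources(sources: list[dict], asset_type: str) -> list[str]:
    """Return the ids of sources flagged available that serve `asset_type`.

    "id" and "available" are assumed field names, not confirmed ones.
    """
    return [
        s["id"]
        for s in sources
        if s.get("available") and asset_type in s.get("asset_types", [])
    ]
```

Calling usable_sources(fetch_sources(), "crypto") before a fetch would tell a script which identifiers are worth trying in this install.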

Binance and CCXT

Binance is the default crypto source and covers both spot and futures markets. The CCXT adapter sits behind it to reach additional exchanges where you need them. Neither requires a key for public OHLCV.

Because Forven routes live orders to HyperLiquid, it is common to hold both Binance and HyperLiquid candles for the same symbol. The quality check can compare the two; see the divergence note under Verifying data quality below.

Binance Vision

Binance Vision is the bulk historical downloader. It pulls monthly and daily archives directly from data.binance.vision, which is the fastest way to seed years of history without thousands of paginated API calls.

It is built to survive interruptions: it probes for the true start date of a symbol, tracks which dates it has already covered, and resumes a partial backfill rather than restarting. Use it for the first big backfill of a symbol; use Binance for keeping the tail current.
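
The resume behavior can be pictured as a set difference between the dates a symbol should have and the dates already covered. The sketch below is illustrative only: missing_days is a hypothetical helper, and how Forven actually stores its coverage record is not described here.

```python
from datetime import date, timedelta


def missing_days(start: date, end: date, covered: set[date]) -> list[date]:
    """Days in [start, end] that are not yet in the covered set."""
    span = (end - start).days
    return [
        start + timedelta(days=i)
        for i in range(span + 1)
        if start + timedelta(days=i) not in covered
    ]


# A backfill interrupted after three archives resumes with only the two gaps.
todo = missing_days(
    date(2024, 1, 1),
    date(2024, 1, 5),
    covered={date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 4)},
)
```

Here todo holds 3 January and 5 January, so a restart downloads two archives instead of five.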

Polygon.io

Polygon is the multi-asset source — stocks (AAPL), forex (EUR-USD), indices, and crypto. It is the only source here that requires a key.

Set the key in Settings → API Keys or via the POLYGON_API_KEY environment variable. Without it, Polygon will not appear as usable in the /data ingestion picker.

# Provide the Polygon key via environment variable
$env:POLYGON_API_KEY = "your-polygon-key"

The Polygon client rate-limits itself conservatively at 4 calls per minute by default, which sits just under the free tier's ~5/min ceiling. These figures are illustrative of the free tier, not a guarantee. If you have a paid plan with a higher quota and still see throttling, raise the limit.
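
To make the 4-per-minute figure concrete, a self-throttling client can simply space calls at least 15 seconds apart. The class below is a minimal sketch of that idea; MinIntervalLimiter is a hypothetical name and this is not Forven's actual limiter, whose algorithm is not documented here.

```python
import time


class MinIntervalLimiter:
    """Space calls evenly so at most `calls_per_minute` happen per minute."""

    def __init__(self, calls_per_minute: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / calls_per_minute  # 4/min -> 15 s between calls
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next call is allowed and return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Schedule the following call one interval after this one starts.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay
```

A caller would invoke limiter.wait() before each request. The clock and sleep parameters exist only so the behavior can be exercised without real waiting.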

Yahoo Finance

Yahoo supplies the macro series Forven uses for context and enrichment — VIX, DXY, bond yields, and sector ETFs. You will rarely fetch from it directly; it feeds the macro enrichment stream described below.

CSV upload

CSV import lets you bring your own bars. One caveat worth knowing up front: if you import a CSV for a symbol/timeframe you also fetch from Binance, the two can collide. Forven does not lose the earlier data — both are combined on save — but the last source to write stamps the dataset's source metadata. Decide which source is canonical for a symbol and stick to it.

Symbol formats

The same instrument is spelled differently by each source. Forven normalizes between four formats at the import and export boundaries, so you generally type the canonical form and let the layer translate.

| Context | Format | Example |
| --- | --- | --- |
| Filesystem / canonical | BASE-QUOTE | BTC-USDT |
| CCXT | BASE/QUOTE, optionally with a :SETTLE suffix | BTC/USDT, BTC/USDT:USDT |
| Polygon | prefixed ticker | X:BTCUSD |
| Binance Vision | concatenated | BTCUSDT |

Asset class is detected from the symbol's shape, so you do not declare it:

  • Crypto: BTC-USDT, BTC/USDT:USDT
  • Stocks: AAPL
  • Forex: EUR-USD
  • Indices: index tickers

When you type a symbol in the /data ingestion picker, use the canonical BASE-QUOTE form (or a plain ticker for equities). The layer maps it to whatever the chosen source expects.
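
The translations in the table are mostly string reshaping. The functions below are a hedged sketch of that mapping for crypto pairs; all four names are hypothetical, and the sketch does not guess at rules this page leaves unstated (for example, whether a USDT quote is rewritten to USD for Polygon).

```python
from typing import Optional


def to_ccxt(canonical: str, settle: Optional[str] = None) -> str:
    """BTC-USDT -> BTC/USDT, or BTC/USDT:USDT when a settle currency is given."""
    base, quote = canonical.split("-")
    symbol = f"{base}/{quote}"
    return f"{symbol}:{settle}" if settle else symbol


def from_ccxt(symbol: str) -> str:
    """BTC/USDT:USDT -> BTC-USDT (the settle suffix is dropped)."""
    pair = symbol.split(":", 1)[0]
    return pair.replace("/", "-")


def to_binance_vision(canonical: str) -> str:
    """BTC-USDT -> BTCUSDT."""
    return canonical.replace("-", "")


def to_polygon_crypto(canonical: str) -> str:
    """BTC-USD -> X:BTCUSD (prefix plus concatenated pair)."""
    return "X:" + canonical.replace("-", "")
```

Keeping one canonical form and converting only at the boundary means file names and dataset keys never depend on which source supplied the bars.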

Enrichment streams

OHLCV is the spine, but Forven collects nine background streams in total and can merge the derivative and macro ones onto bars for a backtest. The collectors run proactively, ranked by staleness so cold symbols do not starve.

| Stream | What it measures |
| --- | --- |
| OHLCV | Spot and futures candles (the base series) |
| Funding rates | Perp funding paid/received |
| Open interest | Outstanding contract notional |
| Long/short ratio | Account or position skew |
| Taker volume | Aggressive buy/sell flow |
| Liquidations | Forced-close volume |
| Fear/greed index | Sentiment proxy |
| Macro indicators | VIX, DXY, bonds, sector ETFs |
| BTC dominance | BTC share of crypto market cap |
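
Ranking collectors by staleness can be sketched as a sort on the last successful collection time, with never-collected symbols first. This illustrates the scheduling idea only; rank_by_staleness and its input shape are assumptions, not Forven's scheduler.

```python
from typing import Optional


def rank_by_staleness(last_collected: dict[str, Optional[float]]) -> list[str]:
    """Order symbols oldest-first; symbols never collected (None) come first."""
    return sorted(
        last_collected,
        key=lambda sym: (last_collected[sym] is not None, last_collected[sym] or 0.0),
    )


# Timestamps are seconds since an arbitrary epoch; None means "never collected".
queue = rank_by_staleness({"BTC-USDT": 1000.0, "ETH-USDT": 400.0, "SOL-USDT": None})
```

Because the order is recomputed from timestamps each sweep, a symbol that keeps getting skipped only becomes more urgent, which is what prevents starvation.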

Enrichment happens on demand during a backtest. The load phase merges available streams onto the OHLCV frame using a merge-asof join — each bar is matched to the nearest prior value of each stream, so no future information leaks backward.
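
A backward as-of join can be sketched in a few lines of standard-library Python. This is an illustration of the technique, not Forven's loader: asof_backward is a hypothetical helper, and it assumes a value stamped exactly at the bar's timestamp counts as visible. In pandas, merge_asof with its default direction="backward" performs the same kind of match.

```python
from bisect import bisect_right


def asof_backward(bar_times, stream_times, stream_values):
    """For each bar time, return the latest stream value stamped at or before it.

    Both time lists must be sorted ascending. Bars that precede the first
    stream row get None, so nothing is borrowed from the future.
    """
    out = []
    for t in bar_times:
        i = bisect_right(stream_times, t)  # count of stream rows with time <= t
        out.append(stream_values[i - 1] if i else None)
    return out


# Funding prints at t=0, 100, 200; bars close at t=60, 120, 180.
merged = asof_backward([60, 120, 180], [0, 100, 200], [0.01, 0.02, 0.03])
```

The bar at 180 still carries the 0.02 print from t=100, because the 0.03 print at t=200 had not happened yet.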

Two things to know about how this avoids lookahead bias:

  • Bucket-aggregate streams are shifted to bucket close. Taker buy/sell ratio and liquidations are sampled at a bucket's start but summarize the forward window, which is only known at close. Forven shifts their timestamps to the bucket close before merging, so an in-progress bucket can never be merged onto a finer bar.
  • Missing streams stay absent, not zeroed. If a stream is unavailable for a symbol, its columns are simply not present (rather than silently filled with zeros that a strategy might trade on). Where a default is sensible within an available stream, funding fills as 0 and ratios as 1.
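
The bucket-close shift described in the first bullet can be shown with a small sketch. It assumes an hourly liquidation bucket and a finer bar stamped at its close; both helper names are hypothetical and the numbers are made up for illustration.

```python
from bisect import bisect_right


def latest_at_or_before(rows, t):
    """Value of the last (timestamp, value) row stamped at or before t, else None."""
    times = [ts for ts, _ in rows]
    i = bisect_right(times, t)
    return rows[i - 1][1] if i else None


def shift_to_bucket_close(rows, bucket_seconds):
    """Restamp start-stamped buckets at the moment their total becomes known."""
    return [(ts + bucket_seconds, value) for ts, value in rows]


HOUR = 3600
liquidations = [(0, 1_250_000.0), (HOUR, 300_000.0)]  # stamped at bucket start
shifted = shift_to_bucket_close(liquidations, HOUR)

bar_close = 900  # a 15-minute bar closing inside the first hour
leaky = latest_at_or_before(liquidations, bar_close)  # sees the unfinished hour
safe = latest_at_or_before(shifted, bar_close)        # None: not known yet
```

Without the shift, the 15-minute bar would be given an hourly total that includes 45 minutes of flow that had not occurred when the bar closed.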

Point-in-time reconstruction (as_of) is supported for OHLCV only, via the revision log. Enrichment streams (funding, OI, and so on) do not support as_of, so a backtest that uses point-in-time OHLCV cannot get matching point-in-time enrichment from the on-demand merge and has to source that enrichment data separately.

The market calendar

Crypto trades around the clock, but equities and forex do not. Forven carries a market calendar so session-aware strategies and data checks know when a market is actually open: NYSE hours and holidays for equities, session windows for forex, and always-on for crypto. This keeps a stock backtest from treating an overnight gap as a missing bar, and lets the data layer reason about expected coverage per asset class.
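
A session check of this kind might look like the sketch below. It is a simplified illustration, not Forven's calendar: it covers only crypto and NYSE regular hours (09:30 to 16:00 exchange-local time), expects a datetime already converted to exchange-local time, takes holidays as a caller-supplied set, and leaves forex out because this page does not list its session windows. The "crypto" and "stocks" labels and the function name are assumptions.

```python
from datetime import date, datetime, time

NYSE_OPEN = time(9, 30)
NYSE_CLOSE = time(16, 0)


def is_session_open(asset_class: str, local_dt: datetime, holidays=frozenset()) -> bool:
    """Return True if the market for `asset_class` is open at `local_dt`.

    `local_dt` must already be in the exchange's local time zone.
    """
    if asset_class == "crypto":
        return True  # always on
    if asset_class == "stocks":
        if local_dt.weekday() >= 5 or local_dt.date() in holidays:
            return False  # weekend or exchange holiday
        return NYSE_OPEN <= local_dt.time() < NYSE_CLOSE
    raise ValueError(f"no session rules sketched for {asset_class!r}")
```

A gap detector that consults a function like this only counts a bar as missing when the market was open at that timestamp.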

How data lands on disk

Every persisted bar is a closed bar. Forming candles are dropped at the write boundary, and each write is atomic: Forven writes to a temporary file, fsyncs it, then atomically renames it into place. A crash between write and rename leaves a stray .tmp file, which a background orphan scan cleans up after it ages out. Before any bar enters the lake it passes an OHLC sanity check — high ≥ low, open and close within the bar's range, positive prices, non-negative volume — so corrupt bars never reach a backtest.

You do not manage any of this directly; it is the contract that lets you trust what the data manager shows you.
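
For readers who want to see the shape of that contract, here is a minimal sketch of the sanity check and the write-fsync-rename sequence using only the standard library. The function names and the byte-oriented interface are illustrative assumptions; Forven's own writer and file format are not shown on this page.

```python
import os
import tempfile


def is_sane_bar(open_, high, low, close, volume) -> bool:
    """Apply the OHLC rules: positive prices, high >= low, open and close in range."""
    prices_positive = min(open_, high, low, close) > 0
    in_range = low <= open_ <= high and low <= close <= high
    return prices_positive and high >= low and in_range and volume >= 0


def atomic_write(path: str, data: bytes) -> None:
    """Write to a temp file in the same directory, fsync, then rename into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk before renaming
        os.replace(tmp_path, path)  # atomic when source and target share a filesystem
    except BaseException:
        # A handled error can tidy up immediately; a hard crash would leave the .tmp.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Readers of the target path see either the old file or the new one, never a half-written file, because the rename is the only step that touches the final name.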

Configuration

Where the data lake lives, and how regime gating behaves, are controlled by a small set of environment variables and settings keys. Full reference for each lives in environment variables and the configuration reference.

Where data is stored

| Variable | Meaning |
| --- | --- |
| FORVEN_HOME | Base directory for packaged installs (e.g. %LOCALAPPDATA%\Forven). Data is stored under $FORVEN_HOME/data/. |
| FORVEN_DATA_DIR | Explicit override for the data-lake root. If set, all streams (OHLCV, funding, OI, derivatives, macro) live here. |
| FORVEN_DB | SQLite database path. The catalog is queried to discover which symbols are actively traded so the keep-alive sweep knows what to keep warm. |

Keep FORVEN_DATA_DIR and FORVEN_HOME consistent. If they diverge, the OHLCV lake and the enrichment streams can end up under different roots, and a packaged install may read empty enrichment streams. Forven asserts root consistency at startup and raises an alarm on mismatch, but it is easiest to set one or the other deliberately and leave it.
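
The precedence between the two variables can be sketched as follows. This is one plausible reading, not Forven's startup code: it assumes FORVEN_DATA_DIR wins when set, that the fallback is FORVEN_HOME/data, and that "consistent" means the two resolve to the same directory. Both function names are hypothetical, and the behavior when neither variable is set is not documented, so the sketch returns None.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_root(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Pick the data-lake root: explicit override first, then FORVEN_HOME/data."""
    override = env.get("FORVEN_DATA_DIR")
    if override:
        return Path(override)
    home = env.get("FORVEN_HOME")
    return Path(home) / "data" if home else None


def roots_consistent(env: Mapping[str, str] = os.environ) -> bool:
    """True unless both variables are set and point at different directories."""
    override, home = env.get("FORVEN_DATA_DIR"), env.get("FORVEN_HOME")
    if not (override and home):
        return True
    return Path(override) == Path(home) / "data"
```

Running a check like roots_consistent in a deployment script would catch a divergent pair before the application raises its own alarm.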

Remote data engine (optional)

For a shared-server setup, the data layer can federate to a remote Forven instance instead of reading local Parquet.

| Setting / variable | Meaning |
| --- | --- |
| remote_engine_enabled | Route data queries to a remote Forven instance. |
| remote_engine_url | Base URL of the remote engine. |
| FORVEN_REMOTE_ENGINE_DATA_ROOT | Remote data-engine root path; overrides the settings value. |
| FORVEN_REMOTE_ENGINE_ALLOWED_ROOT | Security boundary: remote paths must sit under this root. |
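
A containment check of the kind FORVEN_REMOTE_ENGINE_ALLOWED_ROOT implies can be sketched with the standard library. The function below is a hypothetical illustration of the general technique (normalize, then compare against the root); it is not taken from Forven's code.

```python
import os


def is_under_root(candidate: str, allowed_root: str) -> bool:
    """Return True if `candidate` resolves to a path inside `allowed_root`.

    Relative candidates are interpreted against the root. Normalizing first
    defeats "../" tricks; absolute candidates outside the root are rejected.
    """
    root = os.path.realpath(allowed_root)
    target = os.path.realpath(os.path.join(root, candidate))
    try:
        return os.path.commonpath([root, target]) == root
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False
```

Comparing with commonpath rather than a string prefix avoids accepting a sibling such as /srv/lake-old when the root is /srv/lake.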

Regime gating

Regime detection (covered in full on market regimes) is configured here because it decides which strategies a market's current condition will admit.

| Setting | Default | Meaning |
| --- | --- | --- |
| regime_min_confidence | 0.3 | Minimum detection confidence [0.0–1.0] to pass the gate. Raise for stricter gating. |
| strict_regime_gating | | If true, block strategies incompatible with the detected regime and reject low-confidence detections. Permissive when false. |
| allow_unknown_regime_strategies | | If true, allow strategies with no entry in the compatibility matrix. Blocked in strict mode if false. |
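
One way to read how the three settings interact is sketched below. This is an interpretation of the table, not Forven's gate: it assumes that permissive mode admits everything, that strict mode applies the confidence floor first, and that the compatibility matrix maps a strategy name to a set of regime labels. GateSettings, admits, and the sample regime names are all hypothetical. Only the 0.3 default comes from this page, so the two booleans are left without defaults.

```python
from dataclasses import dataclass


@dataclass
class GateSettings:
    strict_regime_gating: bool
    allow_unknown_regime_strategies: bool
    regime_min_confidence: float = 0.3


def admits(strategy: str, regime: str, confidence: float,
           matrix: dict[str, set[str]], settings: GateSettings) -> bool:
    """Decide whether `strategy` may run under the detected `regime`."""
    if not settings.strict_regime_gating:
        return True  # permissive mode: nothing is blocked
    if confidence < settings.regime_min_confidence:
        return False  # strict mode rejects low-confidence detections
    compatible = matrix.get(strategy)
    if compatible is None:
        # Strategy has no entry in the compatibility matrix.
        return settings.allow_unknown_regime_strategies
    return regime in compatible
```

Under this reading, raising regime_min_confidence only has an effect once strict_regime_gating is on.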

Steps: ingest data from a new source

You drive ingestion from the /data page. The path is the same regardless of which source you choose.

  1. Open the /data page and select the Ingestion tab.
  2. Choose a source: Binance, Polygon.io, Binance Vision, CSV, or Yahoo Finance. (For Polygon, make sure POLYGON_API_KEY is set first.)
  3. Enter the symbol in canonical form — BTC-USDT, AAPL, EUR-USD — and the timeframe(s) you want.
  4. Set a date range, or choose the all_available option to backfill the full history (best paired with Binance Vision for crypto).
  5. Click Fetch and watch progress in the Activity Log.
  6. When it completes, the dataset appears in the Coverage Matrix. If gaps are flagged, click the symbol cell to auto-backfill them and extend the tail to the present.

What you'll see: a new row in the Coverage Matrix whose color reflects freshness, plus entries in the data activity log for each collector success, gap fill, and orphan scan. The dataset is then ready to backtest against.

The same flow is available over the API for scripting:

# Start an async ingestion job (returns a run_id)
# curl.exe avoids the Invoke-WebRequest alias that "curl" maps to in Windows PowerShell
curl.exe -X POST http://127.0.0.1:8003/api/data/ingestion/submit `
  -H "Content-Type: application/json" `
  -d '{\"symbol\":\"BTC-USDT\",\"timeframe\":\"1h\",\"exchange\":\"binance\",\"all_available\":true}'

# Poll the run
curl http://127.0.0.1:8003/api/data/ingestion/runs
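
The same two calls can be scripted in Python with the standard library. This is a sketch under assumptions: the request fields mirror the curl example, but the response shapes (a JSON object from submit, and a list of run objects each carrying a "run_id" from the runs endpoint) are guesses, and all three function names are hypothetical.

```python
import json
from typing import Optional
from urllib.request import Request, urlopen

BASE_URL = "http://127.0.0.1:8003"


def build_submit_request(symbol: str, timeframe: str, exchange: str,
                         all_available: bool = True) -> Request:
    """Build the POST request for /api/data/ingestion/submit."""
    body = json.dumps({
        "symbol": symbol,
        "timeframe": timeframe,
        "exchange": exchange,
        "all_available": all_available,
    }).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/data/ingestion/submit",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: Request) -> dict:
    """Send a request to the local backend and decode the JSON reply."""
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)


def find_run(runs: list[dict], run_id: str) -> Optional[dict]:
    """Pick one job out of the runs listing (assumes each entry has "run_id")."""
    return next((r for r in runs if r.get("run_id") == run_id), None)
```

A script would call send(build_submit_request("BTC-USDT", "1h", "binance")), keep the returned run_id, and then look it up with find_run on each poll of the runs endpoint.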

Verifying data quality

Before you trust a dataset, run a quality check. It scans for gaps, reconciles close prices across sources, and computes a checksum.

# Run validation for a symbol/timeframe
# curl.exe avoids the Invoke-WebRequest alias that "curl" maps to in Windows PowerShell
curl.exe -X POST http://127.0.0.1:8003/api/data/quality `
  -H "Content-Type: application/json" `
  -d '{\"symbol\":\"BTC-USDT\",\"timeframe\":\"1h\"}'

The result reports overlap_bars, max_divergence_pct, and the missing-bar count. When you hold the same symbol from two sources — say Binance and HyperLiquid — the divergence figure tells you how far they disagree, which is worth reviewing before you rely on either for a live-adjacent test.
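
To build intuition for what overlap_bars and max_divergence_pct express, here is a small sketch that compares closes from two sources over their shared timestamps. The formula (absolute difference relative to the second source, in percent) is an assumption for illustration; this page does not state how Forven computes the figure, and compare_closes is a hypothetical helper.

```python
def compare_closes(a: dict[int, float], b: dict[int, float]) -> tuple[int, float]:
    """Return (overlap_bars, max percent difference in close) over shared timestamps.

    `a` and `b` map bar timestamp -> close price for the two sources.
    """
    shared = sorted(a.keys() & b.keys())
    if not shared:
        return 0, 0.0
    worst = max(abs(a[t] - b[t]) / b[t] * 100.0 for t in shared)
    return len(shared), worst


# Made-up closes: the two sources overlap on two bars and differ by 1% on one.
binance = {0: 100.0, 60: 101.0, 120: 102.0}
hyperliquid = {60: 100.0, 120: 102.0, 180: 103.0}
overlap_bars, max_div = compare_closes(binance, hyperliquid)
```

A small overlap count makes the divergence figure less informative, so both numbers are worth reading together.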

API surface

The data layer is served by the /api/data/* router on the local backend (127.0.0.1:8003). The most useful endpoints:

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /api/data/sources | List sources with availability, required_key, and asset_types. |
| GET | /api/data/datasets | List local datasets with symbols, timeframes, row counts, date ranges, checksums. |
| POST | /api/data/ingestion/submit | Start an async fetch job; returns a run_id. |
| GET | /api/data/ingestion/runs | List ingestion jobs with status, bars_fetched, bars_new. |
| GET | /api/data/{symbol}/{timeframe} | Dataset detail: source, range, checksum, gaps, quality metrics. |
| GET | /api/data/{symbol}/{timeframe}/ohlcv | Read the last N bars as JSON (limit, default 100). |
| POST | /api/data/quality | Run gap/divergence/checksum validation. |
| GET | /api/data/health | Per-stream stats, latest collection times, error counts. |
| GET | /api/data/activity | Activity log: backfills, collector results, gap fills, orphan scans. |
| GET | /api/data/export/{symbol}/{timeframe} | Download a dataset as CSV, Parquet, or JSON. |
| GET | /api/data/engine/status | Engine status: enabled flag, lake root, remote config, backfill queue. |
| POST | /api/data/engine/catchup | Execute backfill for stale pairs (max_tasks). |

See the API reference for the full router catalog and authentication.

Caveats

  • Free-tier rate limits. Polygon's free tier is roughly 5 calls/minute and Forven defaults to 4. Large multi-asset backfills will be slow without a paid key. The numbers here are illustrative of the providers' published limits, not a Forven guarantee.
  • Source metadata is last-write-wins. Mixing CSV imports and live fetches for the same symbol/timeframe means the most recent write stamps the source label. Pick one canonical source per symbol.
  • Point-in-time is OHLCV-only. Enrichment streams cannot be reconstructed as-of a past date.

Forven is a research tool. Clean data improves the honesty of a backtest, but no dataset makes a result predictive of future performance, and nothing here is financial advice.