docs: add Databento integration plan and roadmap items

Bu5hm4nn
2026-03-29 09:52:06 +02:00
parent 8079ca58e7
commit c02159481d
8 changed files with 1050 additions and 1 deletions


@@ -0,0 +1,780 @@
# Databento Historical Data Integration Plan
## Overview
Integrate the Databento historical API into the backtesting and scenario comparison pages, replacing yfinance for historical data there. The integration supports configurable start prices/values independent of portfolio settings, with intelligent caching to avoid redundant downloads.
## Architecture
### Current State
- **Backtest page** (`app/pages/backtests.py`): Uses `YFinanceHistoricalPriceSource` via `BacktestPageService`
- **Event comparison** (`app/pages/event_comparison.py`): Uses seeded event presets with yfinance data
- **Historical provider** (`app/services/backtesting/historical_provider.py`): Protocol-based architecture with `YFinanceHistoricalPriceSource` and `SyntheticHistoricalProvider`
### Target State
- Add `DatabentoHistoricalPriceSource` implementing `HistoricalPriceSource` protocol
- Add `DatabentoOptionSnapshotSource` implementing `OptionSnapshotSource` protocol (future)
- Smart caching layer: only re-download when parameters change
- Pre-seeded scenario data via batch downloads
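The protocol-based target state can be sketched as follows. The `DailyClosePoint` fields and the `load_daily_closes` method name follow the code later in this plan; the actual definitions in `historical_provider.py` may differ in detail:

```python
from dataclasses import dataclass
from datetime import date
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class DailyClosePoint:
    """One daily closing price (assumed shape)."""
    date: date
    close: float


@runtime_checkable
class HistoricalPriceSource(Protocol):
    """Any source of daily closes can back the backtester."""

    def load_daily_closes(
        self, symbol: str, start_date: date, end_date: date
    ) -> list[DailyClosePoint]: ...


class FixedPriceSource:
    """Trivial source that satisfies the protocol structurally."""

    def load_daily_closes(self, symbol, start_date, end_date):
        return [DailyClosePoint(date=start_date, close=100.0)]
```

Because `HistoricalPriceSource` is a structural protocol, `DatabentoHistoricalPriceSource` only needs to provide the same method signature; no inheritance is strictly required.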
## Databento Data Sources
### Underlyings and Datasets
| Instrument | Dataset | Symbol Format | Notes |
|------------|---------|----------------|-------|
| GLD ETF | `XNAS.BASIC` or `EQUS.PLUS` | `GLD` | US equities consolidated |
| GC=F Futures | `GLBX.MDP3` | `GC` (parent/continuous symbology) | Gold futures; `GC=F` is the Yahoo alias |
| Gold Options | `OPRA.PILLAR` | `GLD` underlying | Options on GLD ETF |
### Schemas
| Schema | Use Case | Fields |
|--------|----------|--------|
| `ohlcv-1d` | Daily backtesting | open, high, low, close, volume |
| `ohlcv-1h` | Intraday scenarios | Hourly bars |
| `trades` | Tick-level analysis | Full trade data |
| `definition` | Instrument metadata | Expiries, strike prices, tick sizes |
## Implementation Plan
### Phase 1: Historical Price Source (DATA-DB-001)
**File:** `app/services/backtesting/databento_source.py`
```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass
from datetime import date, timedelta
from pathlib import Path

import pandas as pd

from app.services.backtesting.historical_provider import DailyClosePoint, HistoricalPriceSource

try:
    import databento as db
    DATABENTO_AVAILABLE = True
except ImportError:
    DATABENTO_AVAILABLE = False


@dataclass(frozen=True)
class DatabentoCacheKey:
    """Cache key for Databento data requests."""

    dataset: str
    symbol: str
    schema: str
    start_date: date
    end_date: date

    def _key_hash(self) -> str:
        key_str = f"{self.dataset}_{self.symbol}_{self.schema}_{self.start_date}_{self.end_date}"
        return hashlib.sha256(key_str.encode()).hexdigest()[:16]

    def cache_path(self, cache_dir: Path) -> Path:
        return cache_dir / f"dbn_{self._key_hash()}.parquet"

    def metadata_path(self, cache_dir: Path) -> Path:
        return cache_dir / f"dbn_{self._key_hash()}_meta.json"


@dataclass
class DatabentoSourceConfig:
    """Configuration for the Databento data source."""

    api_key: str | None = None  # Falls back to DATABENTO_API_KEY env var
    cache_dir: Path = Path(".cache/databento")
    dataset: str = "XNAS.BASIC"
    schema: str = "ohlcv-1d"
    stype_in: str = "raw_symbol"
    # Re-download threshold
    max_cache_age_days: int = 30


class DatabentoHistoricalPriceSource(HistoricalPriceSource):
    """Databento-based historical price source for backtesting."""

    def __init__(self, config: DatabentoSourceConfig | None = None) -> None:
        if not DATABENTO_AVAILABLE:
            raise RuntimeError("databento package required: pip install databento")
        self.config = config or DatabentoSourceConfig()
        self.config.cache_dir.mkdir(parents=True, exist_ok=True)
        self._client: db.Historical | None = None

    @property
    def client(self) -> db.Historical:
        if self._client is None:
            self._client = db.Historical(key=self.config.api_key)
        return self._client

    def _load_from_cache(self, key: DatabentoCacheKey) -> list[DailyClosePoint] | None:
        """Load cached data if available and fresh."""
        cache_file = key.cache_path(self.config.cache_dir)
        meta_file = key.metadata_path(self.config.cache_dir)
        if not cache_file.exists() or not meta_file.exists():
            return None
        try:
            with open(meta_file) as f:
                meta = json.load(f)
            # Check cache age
            download_date = date.fromisoformat(meta["download_date"])
            if (date.today() - download_date).days > self.config.max_cache_age_days:
                return None
            # Check parameters match
            if meta["dataset"] != key.dataset or meta["symbol"] != key.symbol:
                return None
            # Load parquet and convert
            return self._df_to_daily_points(pd.read_parquet(cache_file))
        except Exception:
            return None

    def _save_to_cache(self, key: DatabentoCacheKey, df: pd.DataFrame) -> None:
        """Save data to cache."""
        df.to_parquet(key.cache_path(self.config.cache_dir), index=False)
        meta = {
            "download_date": date.today().isoformat(),
            "dataset": key.dataset,
            "symbol": key.symbol,
            "schema": key.schema,
            "start_date": key.start_date.isoformat(),
            "end_date": key.end_date.isoformat(),
            "rows": len(df),
        }
        with open(key.metadata_path(self.config.cache_dir), "w") as f:
            json.dump(meta, f, indent=2)

    def _fetch_from_databento(self, key: DatabentoCacheKey) -> pd.DataFrame:
        """Fetch data from the Databento API."""
        data = self.client.timeseries.get_range(
            dataset=key.dataset,
            symbols=key.symbol,
            schema=key.schema,
            start=key.start_date.isoformat(),
            end=(key.end_date + timedelta(days=1)).isoformat(),  # Exclusive end
            stype_in=self.config.stype_in,
        )
        return data.to_df()

    def _df_to_daily_points(self, df: pd.DataFrame) -> list[DailyClosePoint]:
        """Convert a DataFrame to a sorted DailyClosePoint list."""
        points = []
        for idx, row in df.iterrows():
            # Databento ohlcv schemas carry the bar timestamp in ts_event
            ts = row.get("ts_event", row.get("ts_recv", idx))
            if hasattr(ts, "date"):
                row_date = ts.date()
            else:
                row_date = date.fromisoformat(str(ts)[:10])
            # DBNStore.to_df() returns float-scaled prices by default
            # (price_type="float"); only raw int64 fixed-point prices
            # (units of 1e-9) would need dividing by 1e9.
            points.append(DailyClosePoint(date=row_date, close=float(row["close"])))
        return sorted(points, key=lambda p: p.date)

    def load_daily_closes(self, symbol: str, start_date: date, end_date: date) -> list[DailyClosePoint]:
        """Load daily closing prices from Databento (with caching)."""
        key = DatabentoCacheKey(
            dataset=self._resolve_dataset(symbol),
            symbol=self._resolve_symbol(symbol),
            schema=self.config.schema,
            start_date=start_date,
            end_date=end_date,
        )
        # Try cache first
        cached = self._load_from_cache(key)
        if cached is not None:
            return cached
        # Fetch from Databento and cache the result
        df = self._fetch_from_databento(key)
        self._save_to_cache(key, df)
        return self._df_to_daily_points(df)

    def _resolve_dataset(self, symbol: str) -> str:
        """Resolve a symbol to its Databento dataset."""
        symbol_upper = symbol.upper()
        if symbol_upper in ("GLD", "GLDM", "IAU"):
            return "XNAS.BASIC"  # ETFs on Nasdaq
        if symbol_upper in ("GC=F", "GC", "GOLD"):
            return "GLBX.MDP3"  # CME gold futures
        if symbol_upper == "XAU":
            return "XNAS.BASIC"  # Treat as GLD proxy
        return self.config.dataset  # Use configured default

    def _resolve_symbol(self, symbol: str) -> str:
        """Resolve a vault-dash symbol to a Databento symbol."""
        symbol_upper = symbol.upper()
        if symbol_upper == "XAU":
            return "GLD"  # Proxy XAU via GLD prices
        if symbol_upper == "GC=F":
            return "GC"  # Use parent symbol for continuous contracts
        return symbol_upper

    def get_cost_estimate(self, symbol: str, start_date: date, end_date: date) -> float:
        """Estimate the cost in USD of a data request."""
        try:
            return self.client.metadata.get_cost(
                dataset=self._resolve_dataset(symbol),
                symbols=self._resolve_symbol(symbol),
                schema=self.config.schema,
                start=start_date.isoformat(),
                end=(end_date + timedelta(days=1)).isoformat(),
            )
        except Exception:
            return 0.0  # Treat cost-estimation failures as "unknown"


class DatabentoBacktestProvider:
    """Databento-backed historical provider for synthetic backtesting."""

    provider_id = "databento_v1"
    pricing_mode = "synthetic_bs_mid"

    def __init__(
        self,
        price_source: DatabentoHistoricalPriceSource,
        implied_volatility: float = 0.16,
        risk_free_rate: float = 0.045,
    ) -> None:
        self.price_source = price_source
        self.implied_volatility = implied_volatility
        self.risk_free_rate = risk_free_rate

    def load_history(self, symbol: str, start_date: date, end_date: date) -> list[DailyClosePoint]:
        return self.price_source.load_daily_closes(symbol, start_date, end_date)

    # ... rest delegates to SyntheticHistoricalProvider logic
```
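The cache-key scheme is deterministic: identical request parameters always hash to the same file path, which is what lets repeat runs skip the download. A standalone sketch of just that hashing logic (mirroring `DatabentoCacheKey.cache_path`, same field order):

```python
import hashlib
from datetime import date
from pathlib import Path


def cache_file_for(dataset: str, symbol: str, schema: str,
                   start: date, end: date, cache_dir: Path) -> Path:
    """Hash the request parameters into a stable parquet file path."""
    key_str = f"{dataset}_{symbol}_{schema}_{start}_{end}"
    key_hash = hashlib.sha256(key_str.encode()).hexdigest()[:16]
    return cache_dir / f"dbn_{key_hash}.parquet"


cache_dir = Path(".cache/databento")
# Same parameters -> same path; any parameter change -> different path
a = cache_file_for("XNAS.BASIC", "GLD", "ohlcv-1d", date(2024, 1, 1), date(2024, 12, 31), cache_dir)
b = cache_file_for("XNAS.BASIC", "GLD", "ohlcv-1d", date(2024, 1, 1), date(2024, 12, 31), cache_dir)
c = cache_file_for("XNAS.BASIC", "GLD", "ohlcv-1h", date(2024, 1, 1), date(2024, 12, 31), cache_dir)
```

Truncating the SHA-256 digest to 16 hex characters keeps filenames short while leaving collisions practically impossible at this cache size.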
### Phase 2: Backtest Settings Model (DATA-DB-002)
**File:** `app/models/backtest_settings.py`
```python
from dataclasses import dataclass, field
from datetime import date
from uuid import UUID

from app.models.backtest import ProviderRef


@dataclass(frozen=True)
class BacktestSettings:
    """User-configurable backtest settings (independent of portfolio)."""

    # Scenario identification
    settings_id: UUID
    name: str
    # Data source configuration
    data_source: str = "databento"  # "databento", "yfinance", "synthetic"
    dataset: str = "XNAS.BASIC"
    schema: str = "ohlcv-1d"
    # Date range
    start_date: date = date(2024, 1, 1)
    end_date: date = date(2024, 12, 31)
    # Independent scenario configuration (not derived from portfolio)
    underlying_symbol: str = "GLD"
    start_price: float = 0.0  # 0 = auto-derive from first close
    underlying_units: float = 1000.0  # Independent of portfolio
    loan_amount: float = 0.0  # Debt position for LTV analysis
    margin_call_ltv: float = 0.75
    # Templates to test
    template_slugs: tuple[str, ...] = field(default_factory=lambda: ("protective-put-atm-12m",))
    # Provider reference
    provider_ref: ProviderRef = field(default_factory=lambda: ProviderRef(
        provider_id="databento_v1",
        pricing_mode="synthetic_bs_mid",
    ))
    # Cache metadata
    cache_key: str = ""  # Populated when data is fetched
    data_cost_usd: float = 0.0  # Cost of last data fetch
```
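The `start_price = 0` convention ("auto-derive from first close") implies a small resolution step when a backtest run starts. A hedged sketch of that step — the function name is illustrative, not taken from the codebase:

```python
def resolve_start_price(configured: float, closes: list[float]) -> float:
    """Return the configured start price, or the first historical close
    when the setting is 0 (the auto-derive sentinel)."""
    if configured > 0:
        return configured
    if not closes:
        raise ValueError("cannot auto-derive start price from empty history")
    return closes[0]
```

Raising on empty history (rather than silently returning 0) keeps a misconfigured date range from producing a degenerate scenario.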
### Phase 3: Cache Management (DATA-DB-003)
**File:** `app/services/backtesting/databento_cache.py`
```python
from __future__ import annotations

import json
from dataclasses import dataclass
from datetime import date, timedelta
from pathlib import Path

from app.services.backtesting.databento_source import DatabentoCacheKey


@dataclass
class CacheEntry:
    """Metadata for a cached Databento dataset."""

    cache_key: DatabentoCacheKey
    file_path: Path
    download_date: date
    size_bytes: int
    cost_usd: float


class DatabentoCacheManager:
    """Manages the Databento data cache lifecycle."""

    def __init__(self, cache_dir: Path = Path(".cache/databento")) -> None:
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def list_entries(self) -> list[CacheEntry]:
        """List all cached entries."""
        entries = []
        for meta_file in self.cache_dir.glob("*_meta.json"):
            with open(meta_file) as f:
                meta = json.load(f)
            cache_file = meta_file.with_name(meta_file.stem.replace("_meta", "") + ".parquet")
            if cache_file.exists():
                entries.append(CacheEntry(
                    cache_key=DatabentoCacheKey(
                        dataset=meta["dataset"],
                        symbol=meta["symbol"],
                        schema=meta["schema"],
                        start_date=date.fromisoformat(meta["start_date"]),
                        end_date=date.fromisoformat(meta["end_date"]),
                    ),
                    file_path=cache_file,
                    download_date=date.fromisoformat(meta["download_date"]),
                    size_bytes=cache_file.stat().st_size,
                    cost_usd=0.0,  # Would need to be tracked separately
                ))
        return entries

    def invalidate_expired(self, max_age_days: int = 30) -> list[Path]:
        """Remove cache entries older than max_age_days."""
        removed = []
        cutoff = date.today() - timedelta(days=max_age_days)
        for entry in self.list_entries():
            if entry.download_date < cutoff:
                entry.file_path.unlink(missing_ok=True)
                meta_file = entry.file_path.with_name(entry.file_path.stem + "_meta.json")
                meta_file.unlink(missing_ok=True)
                removed.append(entry.file_path)
        return removed

    def clear_all(self) -> int:
        """Delete all cached files, returning the number removed."""
        count = 0
        for file in self.cache_dir.glob("*"):
            if file.is_file():
                file.unlink()
                count += 1
        return count

    def get_cache_size(self) -> int:
        """Total cache size in bytes."""
        return sum(f.stat().st_size for f in self.cache_dir.glob("*") if f.is_file())

    def should_redownload(self, key: DatabentoCacheKey, params_changed: bool,
                          max_age_days: int = 30) -> bool:
        """Determine whether data should be re-downloaded."""
        if params_changed:
            return True
        cache_file = key.cache_path(self.cache_dir)
        meta_file = key.metadata_path(self.cache_dir)
        if not cache_file.exists() or not meta_file.exists():
            return True
        try:
            with open(meta_file) as f:
                meta = json.load(f)
            download_date = date.fromisoformat(meta["download_date"])
            return (date.today() - download_date).days > max_age_days
        except Exception:
            return True
```
### Phase 4: Backtest Page UI Updates (DATA-DB-004)
**Key changes to `app/pages/backtests.py`:**
1. Add Databento configuration section
2. Add independent start price/units inputs
3. Show estimated data cost before fetching
4. Cache status indicator
```python
# In backtests.py
with ui.card().classes("w-full ..."):
    ui.label("Data Source").classes("text-lg font-semibold")
    data_source = ui.select(
        {"databento": "Databento (historical market data)", "yfinance": "Yahoo Finance (free, limited)"},
        value="databento",
        label="Data source",
    ).classes("w-full")
    # Databento-specific settings
    with ui.column().classes("w-full gap-2").bind_visibility_from(data_source, "value", lambda v: v == "databento"):
        ui.label("Dataset configuration").classes("text-sm text-slate-500")
        dataset_select = ui.select(
            {"XNAS.BASIC": "Nasdaq Basic (GLD)", "GLBX.MDP3": "CME Globex (GC=F)"},
            value="XNAS.BASIC",
            label="Dataset",
        ).classes("w-full")
        schema_select = ui.select(
            {"ohlcv-1d": "Daily bars", "ohlcv-1h": "Hourly bars"},
            value="ohlcv-1d",
            label="Resolution",
        ).classes("w-full")
        # Cost estimate
        cost_label = ui.label("Estimated cost: $0.00").classes("text-sm text-slate-500")
        # Cache status
        cache_status = ui.label("").classes("text-xs text-slate-400")

# Independent scenario settings
with ui.card().classes("w-full ..."):
    ui.label("Scenario Configuration").classes("text-lg font-semibold")
    ui.label("Configure start values independent of portfolio settings").classes("text-sm text-slate-500")
    start_price_input = ui.number(
        "Start price",
        value=0.0,
        min=0.0,
        step=0.01,
    ).classes("w-full")
    ui.label("Set to 0 to auto-derive from first historical close").classes("text-xs text-slate-400 -mt-2")
    underlying_units_input = ui.number(
        "Underlying units",
        value=1000.0,
        min=0.0001,
        step=0.0001,
    ).classes("w-full")
    loan_amount_input = ui.number(
        "Loan amount ($)",
        value=0.0,
        min=0.0,
        step=1000,
    ).classes("w-full")
```
### Phase 5: Scenario Pre-Seeding (DATA-DB-005)
**File:** `app/services/backtesting/scenario_bulk_download.py`
```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from pathlib import Path

from app.services.backtesting.databento_source import DatabentoCacheKey

try:
    import databento as db
    DATABENTO_AVAILABLE = True
except ImportError:
    DATABENTO_AVAILABLE = False


@dataclass
class ScenarioPreset:
    """Pre-configured scenario ready for backtesting."""

    preset_id: str
    display_name: str
    symbol: str
    dataset: str
    window_start: date
    window_end: date
    default_start_price: float  # First close in window
    default_templates: tuple[str, ...]
    event_type: str
    tags: tuple[str, ...]
    description: str


def download_historical_presets(
    client: db.Historical,
    presets: list[ScenarioPreset],
    output_dir: Path,
) -> dict[str, Path]:
    """Bulk download historical data for all presets.

    Returns a mapping of preset_id to cached file path.
    """
    results = {}
    for preset in presets:
        cache_key = DatabentoCacheKey(
            dataset=preset.dataset,
            symbol=preset.symbol,
            schema="ohlcv-1d",
            start_date=preset.window_start,
            end_date=preset.window_end,
        )
        cache_file = cache_key.cache_path(output_dir)
        # Download only if not already cached
        if not cache_file.exists():
            data = client.timeseries.get_range(
                dataset=preset.dataset,
                symbols=preset.symbol,
                schema="ohlcv-1d",
                start=preset.window_start.isoformat(),
                end=preset.window_end.isoformat(),
            )
            data.to_parquet(cache_file)
        results[preset.preset_id] = cache_file
    return results


def create_default_presets() -> list[ScenarioPreset]:
    """Create default scenario presets for gold hedging research."""
    return [
        ScenarioPreset(
            preset_id="gld-2020-covid-crash",
            display_name="GLD March 2020 COVID Crash",
            symbol="GLD",
            dataset="XNAS.BASIC",
            window_start=date(2020, 2, 15),
            window_end=date(2020, 4, 15),
            default_start_price=143.0,  # Approx GLD close on 2020-02-15
            default_templates=("protective-put-atm-12m", "protective-put-95pct-12m"),
            event_type="crash",
            tags=("covid", "crash", "high-vol"),
            description="March 2020 COVID market crash - extreme volatility event",
        ),
        ScenarioPreset(
            preset_id="gld-2022-rate-hike-cycle",
            display_name="GLD 2022 Rate Hike Cycle",
            symbol="GLD",
            dataset="XNAS.BASIC",
            window_start=date(2022, 1, 1),
            window_end=date(2022, 12, 31),
            default_start_price=168.0,
            default_templates=("protective-put-atm-12m", "ladder-50-50-atm-95pct-12m"),
            event_type="rate_cycle",
            tags=("rates", "fed", "extended"),
            description="Full year 2022 - aggressive Fed rate hikes",
        ),
        ScenarioPreset(
            preset_id="gcf-2024-rally",
            display_name="GC=F 2024 Gold Rally",
            symbol="GC",
            dataset="GLBX.MDP3",
            window_start=date(2024, 1, 1),
            window_end=date(2024, 12, 31),
            default_start_price=2060.0,
            default_templates=("protective-put-atm-12m",),
            event_type="rally",
            tags=("gold", "futures", "rally"),
            description="Gold futures rally in 2024",
        ),
    ]
```
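Since presets are plain data, persisting them to the config file mentioned in DATA-DB-005 is a straightforward JSON round-trip; only the `date` fields need explicit encoding. A sketch using a trimmed-down preset (the field subset is illustrative):

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class MiniPreset:
    """Subset of ScenarioPreset fields, enough to show the round-trip."""
    preset_id: str
    symbol: str
    window_start: date
    window_end: date


def preset_to_json(p: MiniPreset) -> str:
    d = asdict(p)
    d["window_start"] = p.window_start.isoformat()  # date -> ISO string
    d["window_end"] = p.window_end.isoformat()
    return json.dumps(d)


def preset_from_json(s: str) -> MiniPreset:
    d = json.loads(s)
    return MiniPreset(
        preset_id=d["preset_id"],
        symbol=d["symbol"],
        window_start=date.fromisoformat(d["window_start"]),
        window_end=date.fromisoformat(d["window_end"]),
    )
```

The same two conversions (ISO strings out, `date.fromisoformat` back in) apply to the full `ScenarioPreset` and to `BacktestSettings` in Phase 6.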
### Phase 6: Settings Persistence (DATA-DB-002)
**File:** `app/models/backtest_settings_repository.py`
```python
import json
from dataclasses import asdict
from datetime import date
from pathlib import Path
from uuid import UUID, uuid4

from app.models.backtest_settings import BacktestSettings


class BacktestSettingsRepository:
    """Persistence for backtest settings."""

    def __init__(self, base_path: Path | None = None) -> None:
        self.base_path = base_path or Path(".workspaces")

    def _settings_path(self, workspace_id: str) -> Path:
        return self.base_path / workspace_id / "backtest_settings.json"

    def load(self, workspace_id: str) -> BacktestSettings:
        """Load backtest settings, creating defaults if not found."""
        path = self._settings_path(workspace_id)
        if path.exists():
            with open(path) as f:
                data = json.load(f)
            return BacktestSettings(
                settings_id=UUID(data["settings_id"]),
                name=data.get("name", "Default Backtest"),
                data_source=data.get("data_source", "databento"),
                dataset=data.get("dataset", "XNAS.BASIC"),
                schema=data.get("schema", "ohlcv-1d"),
                start_date=date.fromisoformat(data["start_date"]),
                end_date=date.fromisoformat(data["end_date"]),
                underlying_symbol=data.get("underlying_symbol", "GLD"),
                start_price=data.get("start_price", 0.0),
                underlying_units=data.get("underlying_units", 1000.0),
                loan_amount=data.get("loan_amount", 0.0),
                margin_call_ltv=data.get("margin_call_ltv", 0.75),
                template_slugs=tuple(data.get("template_slugs", ("protective-put-atm-12m",))),
                cache_key=data.get("cache_key", ""),
                data_cost_usd=data.get("data_cost_usd", 0.0),
            )
        # Fall back to defaults for a new workspace
        return BacktestSettings(
            settings_id=uuid4(),
            name="Default Backtest",
        )

    def save(self, workspace_id: str, settings: BacktestSettings) -> None:
        """Persist backtest settings."""
        path = self._settings_path(workspace_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        data = asdict(settings)
        # Convert non-JSON-serializable fields
        data["settings_id"] = str(data["settings_id"])
        data["start_date"] = data["start_date"].isoformat()
        data["end_date"] = data["end_date"].isoformat()
        data["template_slugs"] = list(data["template_slugs"])
        data["provider_ref"] = {
            "provider_id": settings.provider_ref.provider_id,
            "pricing_mode": settings.provider_ref.pricing_mode,
        }
        with open(path, "w") as f:
            json.dump(data, f, indent=2)
```
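The repository's main subtlety is converting non-JSON types (UUID, date, tuple) symmetrically on save and load. A minimal round-trip sketch with stand-in fields — function names and the directory layout mirror the repository above but are simplified:

```python
import json
from datetime import date
from pathlib import Path
from tempfile import TemporaryDirectory
from uuid import UUID, uuid4


def save_settings(path: Path, settings_id: UUID, start: date, slugs: tuple[str, ...]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    data = {
        "settings_id": str(settings_id),   # UUID -> str
        "start_date": start.isoformat(),   # date -> ISO string
        "template_slugs": list(slugs),     # tuple -> list
    }
    path.write_text(json.dumps(data, indent=2))


def load_settings(path: Path) -> tuple[UUID, date, tuple[str, ...]]:
    data = json.loads(path.read_text())
    return (
        UUID(data["settings_id"]),
        date.fromisoformat(data["start_date"]),
        tuple(data["template_slugs"]),
    )
```

A round trip through a temporary workspace directory recovers the original values exactly, which is the property the repository's unit tests should assert.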
## Roadmap Items
### DATA-DB-001: Databento Historical Price Source
**Dependencies:** None
**Estimated effort:** 2-3 days
**Deliverables:**
- `app/services/backtesting/databento_source.py`
- `tests/test_databento_source.py` (mocked API)
- Environment variable `DATABENTO_API_KEY` support
### DATA-DB-002: Backtest Settings Model
**Dependencies:** DATA-DB-001 (data source configuration fields)
**Estimated effort:** 1 day
**Deliverables:**
- `app/models/backtest_settings.py`
- Repository for persistence
### DATA-DB-003: Cache Management
**Dependencies:** DATA-DB-001
**Estimated effort:** 1 day
**Deliverables:**
- `app/services/backtesting/databento_cache.py`
- Cache cleanup CLI command
### DATA-DB-004: Backtest Page UI Updates
**Dependencies:** DATA-DB-001, DATA-DB-002
**Estimated effort:** 2 days
**Deliverables:**
- Updated `app/pages/backtests.py`
- Updated `app/pages/event_comparison.py`
- Cost estimation display
### DATA-DB-005: Scenario Pre-Seeding
**Dependencies:** DATA-DB-001
**Estimated effort:** 1-2 days
**Deliverables:**
- `app/services/backtesting/scenario_bulk_download.py`
- Pre-configured presets for gold hedging research
- Bulk download script
### DATA-DB-006: Options Data Source (Future)
**Dependencies:** DATA-DB-001
**Estimated effort:** 3-5 days
**Deliverables:**
- `DatabentoOptionSnapshotSource` implementing `OptionSnapshotSource`
- OPRA.PILLAR integration for historical options chains
## Configuration
Add to `.env`:
```
DATABENTO_API_KEY=db-xxxxxxxxxxxxxxxxxxxxxxxx
```
Add to `requirements.txt`:
```
databento>=0.30.0
```
Add to `pyproject.toml`:
```toml
[project.optional-dependencies]
databento = ["databento>=0.30.0"]
```
## Testing Strategy
1. **Unit tests** with mocked Databento responses (`tests/test_databento_source.py`)
2. **Integration tests** with recorded VCR cassettes (`tests/cassettes/*.yaml`)
3. **E2E tests** using cached data (`tests/test_backtest_databento_playwright.py`)
## Cost Management
- Use `metadata.get_cost()` before fetching to show estimated cost
- Default to cached data when available
- Batch download for large historical ranges (>1 year)
- Consider Databento flat rate plans for heavy usage
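One simple way to enforce "estimate before fetching" is a gate that compares the `metadata.get_cost()` estimate against a budget the user has confirmed. The function name and default threshold here are illustrative, not part of the plan:

```python
def should_fetch(estimated_cost_usd: float, cached_available: bool,
                 budget_usd: float = 5.0) -> bool:
    """Prefer cached data; otherwise fetch only while under budget."""
    if cached_available:
        return False  # default to cached data when available
    return estimated_cost_usd <= budget_usd
```

The UI would surface the estimate and let the user raise `budget_usd` explicitly, rather than failing silently.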
## Security Considerations
- API key stored in environment variable, never in code
- Cache files contain only market data (no PII)
- Rate limiting respected (100 requests/second per IP)


@@ -1,5 +1,5 @@
version: 1
-updated_at: 2026-03-27
+updated_at: 2026-03-28
structure:
  backlog_dir: docs/roadmap/backlog
  in_progress_dir: docs/roadmap/in-progress
@@ -13,14 +13,20 @@ notes:
  - Pre-alpha policy: we may cut or replace old features without backward compatibility until alpha is declared.
  - Alpha migration policy: once alpha is declared, compatibility only needs to move forward; backward migrations are not required.
priority_queue:
+  - DATA-DB-001
+  - DATA-DB-002
+  - DATA-DB-004
  - CONV-001
  - EXEC-002
+  - DATA-DB-003
+  - DATA-DB-005
  - DATA-002A
  - DATA-001A
  - OPS-001
  - BT-003
  - BT-002A
  - GCF-001
+  - DATA-DB-006
recently_completed:
  - PORTFOLIO-003
  - PORTFOLIO-002
@@ -44,6 +50,12 @@ recently_completed:
  - CORE-002B
states:
  backlog:
+    - DATA-DB-001
+    - DATA-DB-002
+    - DATA-DB-003
+    - DATA-DB-004
+    - DATA-DB-005
+    - DATA-DB-006
    - CONV-001
    - EXEC-002
    - DATA-002A


@@ -0,0 +1,34 @@
id: DATA-DB-001
title: Databento Historical Price Source
status: backlog
priority: high
dependencies: []
estimated_effort: 2-3 days
created: 2026-03-28
updated: 2026-03-28
description: |
  Integrate Databento historical API as a data source for backtesting and scenario
  comparison pages. This replaces yfinance for historical data on backtest pages
  and provides reliable, high-quality market data.
acceptance_criteria:
  - DatabentoHistoricalPriceSource implements HistoricalPriceSource protocol
  - Cache layer prevents redundant downloads when parameters unchanged
  - Environment variable DATABENTO_API_KEY used for authentication
  - Cost estimation available before data fetch
  - GLD symbol resolved to XNAS.BASIC dataset
  - GC=F symbol resolved to GLBX.MDP3 dataset
  - Unit tests with mocked Databento responses pass
implementation_notes: |
  Key files:
  - app/services/backtesting/databento_source.py (new)
  - tests/test_databento_source.py (new)
  Uses ohlcv-1d schema for daily bars. The cache key includes dataset, symbol,
  schema, start_date, and end_date. Cache files are Parquet format for fast
  loading. Metadata includes download_date for age validation.
dependencies_detail:
  - None - this is the foundation for Databento integration


@@ -0,0 +1,39 @@
id: DATA-DB-002
title: Backtest Settings Model
status: backlog
priority: high
dependencies:
  - DATA-DB-001
estimated_effort: 1 day
created: 2026-03-28
updated: 2026-03-28
description: |
  Create BacktestSettings model that captures user-configurable backtest parameters
  independent of portfolio settings. This allows running scenarios with custom start
  prices and position sizes without modifying the main portfolio.
acceptance_criteria:
  - BacktestSettings dataclass defined with all necessary fields
  - start_price can be 0 (auto-derive) or explicit value
  - underlying_units independent of portfolio.gold_ounces
  - loan_amount and margin_call_ltv for LTV analysis
  - data_source field supports "databento" and "yfinance"
  - Repository persists settings per workspace
  - Default settings created for new workspaces
implementation_notes: |
  Key fields:
  - settings_id: UUID for tracking
  - data_source: "databento" | "yfinance" | "synthetic"
  - dataset: "XNAS.BASIC" | "GLBX.MDP3"
  - underlying_symbol: "GLD" | "GC" | "XAU"
  - start_date, end_date: date range
  - start_price: 0 for auto-derive, or explicit
  - underlying_units: position size for scenario
  - loan_amount: debt level for LTV analysis
  Settings are stored in .workspaces/{workspace_id}/backtest_settings.json
dependencies_detail:
  - DATA-DB-001: Need data source configuration fields


@@ -0,0 +1,40 @@
id: DATA-DB-003
title: Databento Cache Management
status: backlog
priority: medium
dependencies:
  - DATA-DB-001
estimated_effort: 1 day
created: 2026-03-28
updated: 2026-03-28
description: |
  Implement cache lifecycle management for Databento data. Cache files should be
  invalidated after a configurable age (default 30 days) and when request parameters
  change. Provide a CLI tool for cache inspection and cleanup.
acceptance_criteria:
  - DatabentoCacheManager lists all cached entries
  - Entries invalidated after max_age_days
  - Parameter-change detection triggers re-download
  - Cache size tracking available
  - CLI command to clear all cache
  - CLI command to show cache statistics
implementation_notes: |
  Cache files stored in .cache/databento/:
  - dbn_{hash}.parquet: Data file
  - dbn_{hash}_meta.json: Metadata (download_date, params, rows)
  Cache invalidation rules:
  1. Age > 30 days: re-download
  2. Parameters changed: re-download
  3. File corruption: re-download
  CLI commands:
  - vault-dash cache list
  - vault-dash cache clear
  - vault-dash cache stats
dependencies_detail:
  - DATA-DB-001: Needs DatabentoCacheKey structure


@@ -0,0 +1,50 @@
id: DATA-DB-004
title: Backtest Page UI Updates
status: backlog
priority: high
dependencies:
  - DATA-DB-001
  - DATA-DB-002
estimated_effort: 2 days
created: 2026-03-28
updated: 2026-03-28
description: |
  Update backtest and event comparison pages to support the Databento data source
  and independent scenario configuration. Show estimated data cost and cache
  status in the UI.
acceptance_criteria:
  - Data source selector shows Databento and yfinance options
  - Databento config shows dataset and resolution dropdowns
  - Dataset selection updates cost estimate display
  - Cache status shows age of cached data
  - Independent start price input (0 = auto-derive)
  - Independent underlying units and loan amount
  - Event comparison page uses same data source config
  - Settings persist across sessions
implementation_notes: |
  Page changes:
  Backtests page:
  - Add "Data Source" section with Databento/yfinance toggle
  - Add dataset selector (XNAS.BASIC for GLD, GLBX.MDP3 for GC=F)
  - Add resolution selector (ohlcv-1d, ohlcv-1h)
  - Show estimated cost with refresh button
  - Show cache status (age, size)
  - "Configure Scenario" section with independent start price/units
  Event comparison page:
  - Same data source configuration
  - Preset scenarios show if data cached
  - Cost estimate for missing data
  State management:
  - Use workspace-level BacktestSettings
  - Load on page mount, save on change
  - Invalidate cache when params change
dependencies_detail:
  - DATA-DB-001: Need DatabentoHistoricalPriceSource
  - DATA-DB-002: Need BacktestSettings model


@@ -0,0 +1,48 @@
id: DATA-DB-005
title: Scenario Pre-Seeding from Bulk Downloads
status: backlog
priority: medium
dependencies:
  - DATA-DB-001
estimated_effort: 1-2 days
created: 2026-03-28
updated: 2026-03-28
description: |
  Create pre-configured scenario presets for gold hedging research and implement
  bulk download capability to pre-seed event comparison pages. This allows quick
  testing against historical events without per-event data fetching.
acceptance_criteria:
  - Default presets include COVID crash, rate hike cycle, gold rally events
  - Bulk download script fetches all preset data
  - Presets stored in config file (JSON/YAML)
  - Event comparison page shows preset data availability
  - One-click "Download All Presets" button
  - Progress indicator during bulk download
implementation_notes: |
  Default presets:
  - GLD March 2020 COVID Crash (extreme volatility)
  - GLD 2022 Rate Hike Cycle (full year)
  - GC=F 2024 Gold Rally (futures data)
  Bulk download flow:
  1. Create batch job for each preset
  2. Show progress per preset
  3. Store in cache directory
  4. Update preset availability status
  Preset format:
  - preset_id: unique identifier
  - display_name: human-readable name
  - symbol: GLD, GC, etc.
  - dataset: Databento dataset
  - window_start/end: date range
  - default_start_price: first close
  - default_templates: hedging strategies
  - event_type: crash, rally, rate_cycle
  - tags: for filtering
dependencies_detail:
  - DATA-DB-001: Needs cache infrastructure


@@ -0,0 +1,46 @@
id: DATA-DB-006
title: Databento Options Data Source
status: backlog
priority: low
dependencies:
  - DATA-DB-001
estimated_effort: 3-5 days
created: 2026-03-28
updated: 2026-03-28
description: |
  Implement a historical options data source using Databento's OPRA.PILLAR dataset.
  This enables historical options chain lookups for accurate backtesting with
  real options prices, replacing synthetic Black-Scholes pricing.
acceptance_criteria:
  - DatabentoOptionSnapshotSource implements OptionSnapshotSource protocol
  - OPRA.PILLAR dataset used for GLD/SPY options
  - Option chain lookup by snapshot_date and symbol
  - Strike and expiry filtering supported
  - Cached per-date for efficiency
  - Fallback to synthetic pricing when data unavailable
implementation_notes: |
  OPRA.PILLAR provides consolidated options data from all US options exchanges.
  Key challenges:
  1. OPRA data volume is large - need efficient caching
  2. Option symbology differs from regular symbols
  3. Need strike/expiry resolution in symbology
  Implementation approach:
  - Use 'definition' schema to get instrument metadata
  - Use 'trades' or 'ohlcv-1d' for price history
  - Cache per (symbol, expiration, strike, option_type, date)
  - Use continuous contracts for futures options (GC=F)
  Symbology:
  - GLD options: Use underlying symbol "GLD" with OPRA
  - GC options: Use parent symbology "GC" for continuous contracts
  This is a future enhancement - not required for initial backtesting,
  which uses synthetic Black-Scholes pricing.
dependencies_detail:
  - DATA-DB-001: Needs base cache infrastructure