Config-Driven ML Experiments: Hypotheses as Code

When hardcoded stops scaling

Early in the project, the training script was intentionally minimal. One symbol, one feature set, one target definition. All hardcoded. The goal was to move fast and understand the domain: what features had signal, what target horizons were realistic, whether the data pipeline was even producing clean input. Premature abstraction at that stage would have slowed down the thing that actually mattered, which was learning whether the approach was viable at all.

Once the pipeline was stable and the hypothesis space started expanding (more symbols, different timeframes, different feature combinations), the hardcoded approach stopped working. Not because it was naive, but because it had served its purpose and the problem had changed. The question was no longer whether to introduce a separation between experiment definition and execution. That was already clear. The interesting question was what form that separation should take for this specific problem.

A lot of ML experiment tracking tooling (MLflow, W&B, etc.) solves a different problem: logging and comparison after the fact. What was needed here was upstream control. A way to define what to test before running it, with enough structure that the runner could consume any valid experiment without modification. That meant treating each hypothesis as a first-class artifact: a YAML file that fully specifies the experiment, feeding a generic runner that knows nothing about which symbol or feature set it's evaluating.

The experiment config

Each experiment config defines everything the runner needs: what data to use, which features to calculate, how to define the prediction target, position sizing, and training parameters. Nothing is hardcoded in the runner.

experiments/btc-xgb-4h-momentum.yaml

experiment_id: btc-xgb-4h-momentum
symbol: BTC/USDT
timeframe: 4h
model_type: xgboost

feature_groups:
  - momentum
  - volume
  - volatility

target:
  type: binary
  horizon: 3        # bars forward to measure
  threshold: 0.008  # 0.8% move required for a 1 label

position:
  size: 0.02        # 2% of portfolio per trade
  hold_max: 6       # exit after 6 bars regardless

training:
  train_ratio: 0.70
  min_samples: 1000
  early_stopping_rounds: 50
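To make the target block concrete, here is a minimal sketch of how horizon and threshold could translate into labels. It assumes the move is measured close-to-close, exactly horizon bars ahead; the function name make_binary_target and the sample prices are hypothetical, and the real pipeline may define the move differently.

```python
import pandas as pd


def make_binary_target(close: pd.Series, horizon: int, threshold: float) -> pd.Series:
    """Label a bar 1 if the close `horizon` bars ahead is at least `threshold` higher.

    Hypothetical helper: assumes a close-to-close forward return.
    """
    # Forward return: price `horizon` bars ahead relative to the current price.
    forward_return = close.shift(-horizon) / close - 1.0
    labels = (forward_return >= threshold).astype("float")
    # The last `horizon` bars have no future price, so their label is unknown.
    labels[forward_return.isna()] = float("nan")
    return labels.dropna().astype(int)


# Made-up prices, with the horizon and threshold from the config above.
close = pd.Series([100.0, 100.2, 100.5, 101.0, 100.9, 101.1, 101.0])
target = make_binary_target(close, horizon=3, threshold=0.008)
```

Only the first bar sees a move of at least 0.8% three bars later, and the final three bars are dropped because their outcome is not yet known.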

The runner

The runner has one job: load a config, run the pipeline, report the result. It never needs to know which symbol you're testing or what features you've selected. Those are the config's concern.

runner.py

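import sys
import yaml
from src.pipeline import ExperimentPipeline

def run(config_path: str) -> None:
    # Load the experiment definition; safe_load refuses arbitrary YAML tags
    with open(config_path, encoding="utf-8") as f:
        config = yaml.safe_load(f)

    # The pipeline reads everything it needs from the config
    pipeline = ExperimentPipeline(config)
    result = pipeline.run()

    if result.gate_passed:
        print(
            f"✓ {config['experiment_id']} registered. "
            f"Sharpe: {result.sharpe:.2f}, "
            f"WR: {result.win_rate:.1%}, "
            f"vs B&H: {result.alpha:+.1%}"
        )
    else:
        # Name the experiment on failure too, so batch output stays readable
        print(
            f"✗ {config['experiment_id']} rejected. "
            f"Gate failures: {', '.join(result.failures)}"
        )

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python runner.py <experiment.yaml>")
    run(sys.argv[1])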
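The runner only reads gate_passed and failures from the result; the gate itself isn't shown in this post. As a rough sketch of what such a check could look like, the following compares metrics against minimums. The GateResult class, the evaluate_gate function, and every threshold value are assumptions made for illustration, not the project's actual criteria.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    gate_passed: bool
    failures: list[str] = field(default_factory=list)


# Hypothetical thresholds; the real gate's criteria are not shown in this post.
GATE_MINIMUMS = {"sharpe": 1.0, "win_rate": 0.5, "alpha": 0.0}


def evaluate_gate(
    metrics: dict[str, float],
    minimums: dict[str, float] = GATE_MINIMUMS,
) -> GateResult:
    """Compare each metric with its minimum and collect readable failures."""
    failures = []
    for name, minimum in minimums.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name} missing")
        elif value < minimum:
            failures.append(f"{name} {value:.2f} < {minimum:.2f}")
    return GateResult(gate_passed=not failures, failures=failures)


passing = evaluate_gate({"sharpe": 1.4, "win_rate": 0.56, "alpha": 0.03})
failing = evaluate_gate({"sharpe": 0.7, "win_rate": 0.56, "alpha": -0.02})
```

Returning the failures as a list of strings keeps them easy to print from the runner and to store in a text array column.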

Testing a new hypothesis means writing a YAML file and running a single command: python runner.py experiments/btc-xgb-4h-momentum.yaml. The runner itself doesn't change, so there is no code review, no merge, and no risk of breaking an existing model.
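Because the runner accepts any config, it helps to reject malformed ones before the pipeline starts. The following is a minimal sketch of such a check, assuming the key layout of the YAML shown earlier; validate_config and REQUIRED_KEYS are hypothetical names, and a schema library could do the same job.

```python
REQUIRED_KEYS = {
    "experiment_id": str,
    "symbol": str,
    "timeframe": str,
    "model_type": str,
    "feature_groups": list,
    "target": dict,
    "position": dict,
    "training": dict,
}


def validate_config(config: dict, known_groups: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the config is usable."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected_type):
            problems.append(f"{key} should be {expected_type.__name__}")

    groups = config.get("feature_groups")
    if isinstance(groups, list):
        for group in groups:
            if not isinstance(group, str) or group not in known_groups:
                problems.append(f"unknown feature group: {group}")

    training = config.get("training")
    if isinstance(training, dict):
        ratio = training.get("train_ratio")
        if not isinstance(ratio, (int, float)) or not 0 < ratio < 1:
            problems.append("training.train_ratio should be between 0 and 1")
    return problems


known = {"momentum", "volume", "volatility", "trend"}
config = {
    "experiment_id": "btc-xgb-4h-momentum",
    "symbol": "BTC/USDT",
    "timeframe": "4h",
    "model_type": "xgboost",
    "feature_groups": ["momentum", "volume", "volatility"],
    "target": {"type": "binary", "horizon": 3, "threshold": 0.008},
    "position": {"size": 0.02, "hold_max": 6},
    "training": {"train_ratio": 0.70, "min_samples": 1000, "early_stopping_rounds": 50},
}
ok_problems = validate_config(config, known)

# A broken variant: one unknown group and no target block.
bad = {k: v for k, v in config.items() if k != "target"}
bad["feature_groups"] = ["momentum", "sentiment"]
bad_problems = validate_config(bad, known)
```

Collecting every problem instead of raising on the first one means a bad config can be fixed in a single pass.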

Feature groups as named contracts

Feature groups are named collections of indicators registered in a central dictionary. A config references groups by name, and the FeaturePipeline resolves them to a concrete list of column names. This matters because the same feature list needs to be calculated identically at training time, backtest time, and live inference time. Saving the resolved feature names alongside the model ensures this. Even if you later rename or reorganise a group, models trained against the old definition will still resolve correctly, provided the individual indicators are still calculated.

src/features.py

import pandas as pd
import talib

FEATURE_GROUPS: dict[str, list[str]] = {
    "momentum":   ["rsi_14", "rsi_7", "macd_signal", "macd_hist", "roc_10"],
    "volume":     ["obv_ratio", "vwap_deviation", "volume_zscore"],
    "volatility": ["atr_pct", "bb_width", "daily_range_pct"],
    "trend":      ["ema_cross_9_21", "ema_cross_21_55", "adx_14"],
}

class FeaturePipeline:
    def __init__(self, feature_groups: list[str]):
        self.feature_names = [
            name
            for group in feature_groups
            for name in FEATURE_GROUPS[group]
        ]

    def fit_transform(self, df: pd.DataFrame) -> tuple[pd.DataFrame, dict]:
        features = self._calculate_all(df)
        metadata = {
            "feature_names": self.feature_names,
            "n_features": len(self.feature_names),
        }
        return features[self.feature_names].dropna(), metadata

    def transform(self, df: pd.DataFrame, metadata: dict) -> pd.DataFrame:
        # Use the feature names saved at training time, not the current group
        # definitions, and leave this instance's own feature_names untouched
        feature_names = metadata["feature_names"]
        return self._calculate_all(df)[feature_names].dropna()

    def _calculate_all(self, df: pd.DataFrame) -> pd.DataFrame:
        result = df.copy()
        result["rsi_14"] = talib.RSI(df["close"], timeperiod=14)
        result["rsi_7"]  = talib.RSI(df["close"], timeperiod=7)
        # ... rest of indicator calculations
        return result
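Saving the resolved feature names alongside the model can be as simple as a JSON sidecar file. The sketch below assumes the model lives at a file path and writes the metadata next to it; the helper names, the .features.json suffix, and the trimmed group definitions are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def _sidecar_path(model_path: Path) -> Path:
    # Hypothetical naming convention: <model stem>.features.json
    return model_path.with_name(model_path.stem + ".features.json")


def save_feature_metadata(model_path: Path, metadata: dict) -> Path:
    """Write the resolved feature list next to the model file."""
    sidecar = _sidecar_path(model_path)
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar


def load_feature_metadata(model_path: Path) -> dict:
    """Read back the feature list the model was trained on."""
    return json.loads(_sidecar_path(model_path).read_text(encoding="utf-8"))


# Group definitions at training time (trimmed for brevity).
groups = {"momentum": ["rsi_14", "rsi_7"], "volume": ["obv_ratio"]}
feature_names = [name for g in ["momentum", "volume"] for name in groups[g]]
metadata = {"feature_names": feature_names, "n_features": len(feature_names)}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "btc-xgb-4h-momentum.json"
    save_feature_metadata(model_path, metadata)

    # Later the group is reorganised, but the saved list is unaffected.
    groups["momentum"] = ["rsi_14", "roc_10"]
    restored = load_feature_metadata(model_path)
```

The restored metadata can then be passed to transform, which is what keeps inference aligned with training after a group changes.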

Tracking results in PostgreSQL

Every training run is a row in the database: config, metrics, whether it passed the gate, which model file it produced. This makes it possible to query across all experiments: find every model with a Sharpe above 1.0, find all BTC experiments that passed the gate, compare different feature group combinations on the same symbol. A JSON registry or flat files would make this kind of analysis much harder.

CREATE TABLE experiments (
    experiment_id  TEXT PRIMARY KEY,
    config         JSONB        NOT NULL,
    created_at     TIMESTAMPTZ  DEFAULT now()
);

CREATE TABLE model_results (
    id             SERIAL PRIMARY KEY,
    experiment_id  TEXT         REFERENCES experiments(experiment_id),
    model_path     TEXT,
    train_from     TIMESTAMPTZ,
    train_to       TIMESTAMPTZ,
    metrics        JSONB,
    gate_passed    BOOLEAN,
    gate_failures  TEXT[],
    trained_at     TIMESTAMPTZ  DEFAULT now()
);

-- Find all experiments that passed the gate, sorted by Sharpe
SELECT
    experiment_id,
    (metrics->>'sharpe')::float   AS sharpe,
    (metrics->>'win_rate')::float AS win_rate,
    (metrics->>'alpha')::float    AS alpha,
    trained_at
FROM model_results
WHERE gate_passed = true
ORDER BY sharpe DESC;
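On the write side, each run has to become a row in model_results. The sketch below only builds the statement and its parameters; it assumes a driver that uses %s placeholders and adapts Python lists to arrays (psycopg does both), and that the matching experiments row already exists. The function name, model path, and dates are illustrative.

```python
import json
from datetime import datetime, timezone

INSERT_RESULT_SQL = """
INSERT INTO model_results
    (experiment_id, model_path, train_from, train_to,
     metrics, gate_passed, gate_failures)
VALUES (%s, %s, %s, %s, %s::jsonb, %s, %s)
"""


def build_result_params(
    experiment_id: str,
    model_path: str,
    train_from: datetime,
    train_to: datetime,
    metrics: dict,
    failures: list[str],
) -> tuple:
    """Return parameters for INSERT_RESULT_SQL in column order."""
    return (
        experiment_id,
        model_path,
        train_from,
        train_to,
        json.dumps(metrics),  # serialised for the JSONB column
        not failures,         # gate_passed: true when nothing failed
        list(failures),       # gate_failures, stored as TEXT[]
    )


params = build_result_params(
    experiment_id="btc-xgb-4h-momentum",
    model_path="models/btc-xgb-4h-momentum.json",  # hypothetical path
    train_from=datetime(2023, 1, 1, tzinfo=timezone.utc),
    train_to=datetime(2024, 1, 1, tzinfo=timezone.utc),
    metrics={"sharpe": 1.4, "win_rate": 0.56, "alpha": 0.03},
    failures=[],
)
# With a live connection this would be: cursor.execute(INSERT_RESULT_SQL, params)
```

Keeping the statement parameterised avoids building SQL from strings and leaves type conversion to the driver.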

What this buys you

After many training iterations, three benefits of the config-driven approach stand out.

Reproducibility is the first one. Every model can be rebuilt from its config file and the training window recorded alongside its result. There's no mental state to reconstruct about what settings were active that day.

Iteration speed is the second. Trying a new feature combination doesn't require a code change, a review, or any risk of breaking existing models. Write the YAML, run the command.

Analysis is the third. Because every result is in PostgreSQL with a consistent schema, it's straightforward to query across all experiments. Which feature groups appear most often in passing models? Which timeframes produce the most reliable results? Which symbols have enough volume for the model's minimum trade count? These are SQL queries, not grepping through log files.

The pattern isn't specific to trading. Any problem with multiple hypothesis variations (A/B testing infrastructure, hyperparameter search, recommendation model variants) benefits from the same separation between experiment definition and execution engine.