When hardcoded stops scaling
Early in the project, the training script was intentionally minimal. One symbol, one feature set, one target definition. All hardcoded. The goal was to move fast and understand the domain: what features had signal, what target horizons were realistic, whether the data pipeline was even producing clean input. Premature abstraction at that stage would have slowed down the thing that actually mattered, which was learning whether the approach was viable at all.
Once the pipeline was stable and the hypothesis space started expanding (more symbols, different timeframes, different feature combinations), the hardcoded approach stopped working. Not because it was naive, but because it had served its purpose and the problem had changed. The question was no longer whether to introduce a separation between experiment definition and execution. That was already clear. The interesting question was what form that separation should take for this specific problem.
A lot of ML experiment tracking tooling (MLflow, W&B, etc.) solves a different problem: logging and comparison after the fact. What was needed here was upstream control. A way to define what to test before running it, with enough structure that the runner could consume any valid experiment without modification. That meant treating each hypothesis as a first-class artifact: a YAML file that fully specifies the experiment, feeding a generic runner that knows nothing about which symbol or feature set it's evaluating.
The experiment config
Each experiment config defines everything the runner needs: what data to use, which features to calculate, how to define the prediction target, position sizing, and training parameters. Nothing is hardcoded in the runner.

    experiment_id: btc-xgb-4h-momentum
    symbol: BTC/USDT
    timeframe: 4h
    model_type: xgboost
    feature_groups:
      - momentum
      - volume
      - volatility
    target:
      type: binary
      horizon: 3        # bars forward to measure
      threshold: 0.008  # 0.8% move required for a 1 label
    position:
      size: 0.02        # 2% of portfolio per trade
      hold_max: 6       # exit after 6 bars regardless
    training:
      train_ratio: 0.70
      min_samples: 1000
      early_stopping_rounds: 50

The runner
The runner has one job: load a config, run the pipeline, report the result. It never needs to know which symbol you're testing or what features you've selected. Those are the config's concern.

    import sys

    import yaml

    from src.pipeline import ExperimentPipeline


    def run(config_path: str) -> None:
        # Load the experiment definition; nothing below is symbol- or feature-specific.
        with open(config_path) as f:
            config = yaml.safe_load(f)

        pipeline = ExperimentPipeline(config)
        result = pipeline.run()

        # Report a one-line summary if the gate passed, otherwise list the failures.
        if result.gate_passed:
            print(
                f"✓ {config['experiment_id']} registered — "
                f"Sharpe: {result.sharpe:.2f}, "
                f"WR: {result.win_rate:.1%}, "
                f"vs B&H: {result.alpha:+.1%}"
            )
        else:
            print(f"✗ Gate failures: {', '.join(result.failures)}")


    if __name__ == "__main__":
        run(sys.argv[1])

A new hypothesis is a single command:

    python runner.py experiments/btc-xgb-4h-momentum.yaml

No code review required, no merge needed, no risk of breaking an existing model.
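Skipping code review puts more weight on the config itself being well-formed. The post doesn't show how a config is checked before it reaches the pipeline, so the following is only a sketch of one possible pre-flight check. The `validate_config` function, the `REQUIRED_KEYS` table, and the specific rules are assumptions for illustration, not part of the original runner.

```python
from typing import Any

# Hypothetical schema: top-level key -> type expected after yaml.safe_load.
REQUIRED_KEYS: dict[str, type] = {
    "experiment_id": str,
    "symbol": str,
    "timeframe": str,
    "model_type": str,
    "feature_groups": list,
    "target": dict,
    "position": dict,
    "training": dict,
}


def validate_config(config: dict[str, Any], known_groups: set[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems: list[str] = []

    # Every top-level key must be present and have the expected type.
    for key, expected in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected):
            problems.append(f"{key} should be {expected.__name__}")

    # Every referenced feature group must be a registered name.
    groups = config.get("feature_groups")
    if isinstance(groups, list):
        for group in groups:
            if not isinstance(group, str) or group not in known_groups:
                problems.append(f"unknown feature group: {group}")

    # The train/test split ratio must be a proper fraction.
    training = config.get("training")
    if isinstance(training, dict):
        ratio = training.get("train_ratio")
        if not isinstance(ratio, (int, float)) or not 0 < ratio < 1:
            problems.append("training.train_ratio must be between 0 and 1")

    return problems
```

Called right after `yaml.safe_load` with the registered group names, a check like this lets the runner reject a typo in seconds instead of failing partway through a training run.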
Feature groups as named contracts
Feature groups are named collections of indicators registered in a central dictionary. A config references groups by name, and the FeaturePipeline resolves them to a concrete list of column names. This matters because the same features need to be calculated identically at training time, backtest time, and live inference time. Saving the resolved feature names alongside the model ensures this: even if you later rename or reorganise a group, a model trained against the old definition still gets exactly the columns it was trained on, provided the indicator columns themselves are still calculated.

    import pandas as pd
    import talib

    FEATURE_GROUPS: dict[str, list[str]] = {
        "momentum": ["rsi_14", "rsi_7", "macd_signal", "macd_hist", "roc_10"],
        "volume": ["obv_ratio", "vwap_deviation", "volume_zscore"],
        "volatility": ["atr_pct", "bb_width", "daily_range_pct"],
        "trend": ["ema_cross_9_21", "ema_cross_21_55", "adx_14"],
    }


    class FeaturePipeline:
        def __init__(self, feature_groups: list[str]):
            self.feature_names = [
                name
                for group in feature_groups
                for name in FEATURE_GROUPS[group]
            ]

        def fit_transform(self, df: pd.DataFrame) -> tuple[pd.DataFrame, dict]:
            features = self._calculate_all(df)
            metadata = {
                "feature_names": self.feature_names,
                "n_features": len(self.feature_names),
            }
            return features[self.feature_names].dropna(), metadata

        def transform(self, df: pd.DataFrame, metadata: dict) -> pd.DataFrame:
            # Use saved feature names, not current group definitions
            self.feature_names = metadata["feature_names"]
            return self._calculate_all(df)[self.feature_names].dropna()

        def _calculate_all(self, df: pd.DataFrame) -> pd.DataFrame:
            result = df.copy()
            result["rsi_14"] = talib.RSI(df["close"], timeperiod=14)
            result["rsi_7"] = talib.RSI(df["close"], timeperiod=7)
            # ... rest of indicator calculations
            return result

Tracking results in PostgreSQL
Every training run is a row in the database: config, metrics, whether it passed the gate, which model file it produced. This makes it possible to query across all experiments: find every model with a Sharpe above 1.0, find all BTC experiments that passed the gate, compare different feature group combinations on the same symbol. A JSON registry or flat files would make this kind of analysis much harder.

    CREATE TABLE experiments (
        experiment_id TEXT PRIMARY KEY,
        config        JSONB NOT NULL,
        created_at    TIMESTAMPTZ DEFAULT now()
    );

    CREATE TABLE model_results (
        id            SERIAL PRIMARY KEY,
        experiment_id TEXT REFERENCES experiments(experiment_id),
        model_path    TEXT,
        train_from    TIMESTAMPTZ,
        train_to      TIMESTAMPTZ,
        metrics       JSONB,
        gate_passed   BOOLEAN,
        gate_failures TEXT[],
        trained_at    TIMESTAMPTZ DEFAULT now()
    );

    -- Find all experiments that passed the gate, sorted by Sharpe
    SELECT
        experiment_id,
        metrics->>'sharpe'   AS sharpe,
        metrics->>'win_rate' AS win_rate,
        metrics->>'alpha'    AS alpha,
        trained_at
    FROM model_results
    WHERE gate_passed = true
    ORDER BY (metrics->>'sharpe')::float DESC;

What this buys you
After running many training iterations, the config-driven approach made a few things clear that would have been difficult to see otherwise.
Reproducibility is the first one. Every model can be rebuilt exactly from its config file. There's no mental state to reconstruct about what settings were active that day.
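Rebuilding from the config only works if the file on disk is still the one that produced the model. The post doesn't say how that link is enforced, so this is a sketch of one hypothetical safeguard: a fingerprint of the parsed config that could be stored next to each result row. The `config_fingerprint` name is an assumption, and a fingerprint only covers the config, not the data snapshot or any random seeds.

```python
import hashlib
import json
from typing import Any


def config_fingerprint(config: dict[str, Any]) -> str:
    """Return a SHA-256 hex digest that ignores key order in the config."""
    # Sorting keys and fixing separators gives one canonical JSON string
    # per logical config, regardless of how the YAML file was laid out.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

If the fingerprint of today's YAML differs from the one recorded at training time, the file has been edited since, and the stored JSONB config is the one to trust.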
Iteration speed is the second. Trying a new feature combination doesn't require a code change, a review, or any risk of breaking existing models. Write the YAML, run the command.
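Once a hypothesis is just data, generating a batch of them is a short loop. As a rough sketch (the `feature_group_variants` helper and its id naming scheme are made up for illustration), one config per combination of feature groups could be produced like this:

```python
import copy
from itertools import combinations
from typing import Any, Iterator


def feature_group_variants(
    base: dict[str, Any], groups: list[str], size: int
) -> Iterator[dict[str, Any]]:
    """Yield one config per combination of `size` feature groups."""
    for combo in combinations(groups, size):
        # Deep-copy so nested sections of the base config are never shared.
        variant = copy.deepcopy(base)
        variant["feature_groups"] = list(combo)
        # Hypothetical naming scheme: base id plus the chosen group names.
        variant["experiment_id"] = f"{base['experiment_id']}-{'-'.join(combo)}"
        yield variant
```

Each yielded dict can be written out with `yaml.safe_dump` and run with the same command as any hand-written config.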
Analysis is the third. Because every result is in PostgreSQL with a consistent schema, it's straightforward to query across all experiments. Which feature groups appear most often in passing models? Which timeframes produce the most reliable results? Which symbols have enough volume for the model's minimum trade count? These are SQL queries, not grepping through log files.
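As an example, the first of those questions might look like the sketch below. It assumes the `config` column stores the YAML as-is, so that `feature_groups` is a JSON array, and that `conn` is an open DB-API connection such as one from psycopg. The `feature_group_counts` helper is illustrative, not part of the original code.

```python
# Count how often each feature group appears among gate-passing runs.
# jsonb_array_elements_text expands the JSON array into one row per group.
FEATURE_GROUP_COUNTS_SQL = """
SELECT g.feature_group, COUNT(*) AS passing_runs
FROM model_results AS r
JOIN experiments AS e USING (experiment_id)
CROSS JOIN LATERAL
    jsonb_array_elements_text(e.config->'feature_groups') AS g(feature_group)
WHERE r.gate_passed
GROUP BY g.feature_group
ORDER BY passing_runs DESC;
"""


def feature_group_counts(conn) -> list:
    """Run the query on an open connection and return (group, count) rows."""
    with conn.cursor() as cur:
        cur.execute(FEATURE_GROUP_COUNTS_SQL)
        return cur.fetchall()
```

Note that this counts passing runs, not distinct experiments: an experiment that was retrained and passed twice contributes twice.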
The pattern isn't specific to trading. Any problem with many hypothesis variations (A/B testing infrastructure, hyperparameter search, recommendation model variants) benefits from the same separation between experiment definition and execution engine.