Model Training Pipelines and the Fail-Fast Methodology

The feature consistency problem

Early in the project, training runs produced models that looked reasonable in isolation but were difficult to compare across experiments. Sharpe ratios were inconsistent in ways that did not track cleanly with changes to the experiment configs. Some of the variance was expected randomness. Some of it was a real problem that took time to find.

The root cause was that training and backtesting were computing rolling features from different starting points in the historical data. Training computed features over the full historical dataset before the split, so bars near the split had a full lookback window behind them. Backtesting computed the same features from the start of the backtest window, so its first bars had little or no history to look back on. The two calculations therefore used different effective lookback windows for the initial bars, which produced different feature distributions at the edges. The models were being evaluated on features that were not computed the same way as the features they had been trained on.
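The mismatch is easy to reproduce on a toy series. The sketch below is illustrative only: the prices, the window length, and the rolling_mean helper are made up for the example, and it assumes bars without a full window come out as missing values (an implementation that falls back to shorter windows would differ at the same bars, just with numbers instead of gaps).

```python
def rolling_mean(values, window):
    """Trailing mean over `window` bars; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


# Hypothetical closing prices; the backtest window starts at index 6.
prices = [100.0, 101.0, 103.0, 102.0, 105.0, 107.0, 106.0, 108.0, 110.0, 109.0]
backtest_start = 6
window = 3

# Training-style: compute over the full history, then slice out the window.
full_history = rolling_mean(prices, window)[backtest_start:]

# Backtest-style: slice first, then compute from the start of the window.
from_window_start = rolling_mean(prices[backtest_start:], window)

print(full_history)
print(from_window_start)
```

The two lists agree only once the backtest-style calculation has accumulated a full window. The first window - 1 bars differ, and with longer lookbacks the affected region grows accordingly.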

This is a subtle form of evaluation inconsistency. It does not corrupt the training set by exposing it to future data. It corrupts the comparison between training performance and backtest performance by making them measure slightly different things. The result is that backtest metrics are not a reliable predictor of paper trading performance, which defeats the purpose of running backtests.

FeaturePipeline as the single source of truth

The fix was a centralized FeaturePipeline class that is the only thing in the system allowed to compute features. Training calls it. Backtesting calls it. Live inference calls it. They all call it the same way with the same inputs.

The pipeline saves feature metadata at training time: the exact list of feature names, the feature groups used, and the resolved column ordering. At inference time, the saved metadata is loaded and used to select the same features in the same order. If a feature group's membership changes between a model's training run and a later inference run, the saved names still pick out the columns the model was trained on. The metadata records names and ordering only, so it does not guard against a change to how an individual feature is computed.

pipeline/features.py

from dataclasses import dataclass
import pandas as pd

@dataclass
class FeatureMetadata:
    feature_names: list[str]
    groups: list[str]
    n_features: int
    computed_at: str  # ISO timestamp of training run

class FeaturePipeline:
    def __init__(self, feature_groups: list[str]):
        self.feature_groups = feature_groups
        self.feature_names  = self._resolve(feature_groups)

    def fit_transform(
        self, df: pd.DataFrame
    ) -> tuple[pd.DataFrame, FeatureMetadata]:
        features = self._compute(df)
        metadata = FeatureMetadata(
            feature_names=self.feature_names,
            groups=self.feature_groups,
            n_features=len(self.feature_names),
            computed_at=pd.Timestamp.now(tz="UTC").isoformat(),
        )
        return features[self.feature_names].dropna(), metadata

    def transform(
        self, df: pd.DataFrame, metadata: FeatureMetadata
    ) -> pd.DataFrame:
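        # Select columns by the names saved at training time, in the saved
        # order, rather than by the current group definitions. The pipeline's
        # own feature_names are left untouched so later calls are unaffected.
        return self._compute(df)[metadata.feature_names].dropna()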

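    def _resolve(self, groups: list[str]) -> list[str]:
        # FEATURE_GROUPS is a registry mapping each group name to its ordered
        # feature names; it is assumed to be defined elsewhere in the project.
        return [name for g in groups for name in FEATURE_GROUPS[g]]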

    def _compute(self, df: pd.DataFrame) -> pd.DataFrame:
        # Indicator calculations for all registered groups
        ...

The metadata file is as important as the model file. A model loaded without its metadata cannot be used correctly. The training runner saves both to the same directory and the execution layer loads both together.
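One way to keep the two files together is to write the metadata as JSON into the run directory and refuse to load a run that lacks it. The sketch below is a minimal illustration, not the project's actual persistence code: the file name feature_metadata.json, the save_metadata and load_metadata helpers, and the example feature and group names are all hypothetical. FeatureMetadata is repeated here only so the example runs on its own.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class FeatureMetadata:
    feature_names: list[str]
    groups: list[str]
    n_features: int
    computed_at: str


def save_metadata(metadata: FeatureMetadata, run_dir: Path) -> Path:
    """Write metadata into the run directory (file name is an assumption)."""
    path = run_dir / "feature_metadata.json"
    path.write_text(json.dumps(asdict(metadata), indent=2))
    return path


def load_metadata(run_dir: Path) -> FeatureMetadata:
    """Load metadata and fail loudly if it is missing or inconsistent."""
    path = run_dir / "feature_metadata.json"
    if not path.exists():
        raise FileNotFoundError(f"no feature metadata in {run_dir}")
    metadata = FeatureMetadata(**json.loads(path.read_text()))
    if metadata.n_features != len(metadata.feature_names):
        raise ValueError("n_features does not match feature_names")
    return metadata


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    original = FeatureMetadata(
        feature_names=["rsi_14", "sma_20"],  # hypothetical feature names
        groups=["momentum", "trend"],        # hypothetical group names
        n_features=2,
        computed_at="2024-01-01T00:00:00+00:00",
    )
    save_metadata(original, run_dir)
    restored = load_metadata(run_dir)
```

Raising on a missing file, rather than falling back to the current group definitions, keeps a model from being silently paired with features it was not trained on.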

The three-tier model gate

Every training run ends at the gate. No model reaches paper trading without passing all three tiers.

Tier one is sanity checks: does the model file exist, are metrics populated, does the training data meet minimum coverage requirements. There is no point evaluating performance on a model that did not save correctly or on data that was below the minimum threshold. The gate fails here and logs the reason.

Tier two is performance checks: is total return positive, is Sharpe above a minimum threshold, is win rate in a plausible range, does the strategy beat buy-and-hold. These thresholds are conservative. The goal is not to pass the best models. It is to reject models that could not possibly be useful in execution.

Tier three is statistical checks: is the test sample large enough for the reported metrics to be reliable, are the returns statistically significant. A model that shows a Sharpe of 1.2 on 40 test trades has not demonstrated anything.
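The effect of sample size can be shown with a rough significance test on per-trade returns. This is a sketch under stated assumptions, not the project's actual test: the returns are invented, two_sided_p_value is a hypothetical helper, and it uses a normal approximation from the standard library, which is optimistic for small samples. A t-test such as scipy.stats.ttest_1samp would be the more usual choice.

```python
import math
from statistics import NormalDist, mean, stdev


def two_sided_p_value(trade_returns):
    """Approximate p-value for the hypothesis 'mean per-trade return is zero'.

    Normal approximation to the test statistic; optimistic for small samples.
    """
    n = len(trade_returns)
    standard_error = stdev(trade_returns) / math.sqrt(n)
    z = mean(trade_returns) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical per-trade returns with a small positive average.
pattern = [0.02, -0.01, 0.03, -0.03, 0.01]

small_sample = pattern * 8    # 40 trades
large_sample = pattern * 80   # 400 trades

p_small = two_sided_p_value(small_sample)
p_large = two_sided_p_value(large_sample)
print(f"40 trades:  p = {p_small:.3f}")
print(f"400 trades: p = {p_large:.6f}")
```

Both samples have the same average return and nearly the same volatility per trade. Only the larger one clears a 0.05 threshold, which is why the statistical tier checks the sample size as well as the reported metrics.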

pipeline/gate.py

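from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# Assumed shape of TrainingResult, inferred from the fields run_gate reads
# below. The project's own class may live elsewhere and carry more fields.
@dataclass
class TrainingResult:
    model_path: Path
    metrics: Optional[dict]
    n_train_samples: int
    n_test_samples: int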

@dataclass
class GateResult:
    passed: bool
    failures: list[str]

def run_gate(result: TrainingResult, config: dict) -> GateResult:
    failures = []

    # Tier 1: Sanity
    if not result.model_path.exists():
        failures.append("model file missing")
    if not result.metrics:  # catches both None and an empty dict
        failures.append("metrics not populated")
    if result.n_train_samples < config["training"].get("min_samples", 500):
        failures.append(f"insufficient training data: {result.n_train_samples} samples")
    if failures:
        return GateResult(passed=False, failures=failures)

    # Tier 2: Performance
    m = result.metrics
    if m["total_return"] <= 0:
        failures.append(f"non-positive return: {m['total_return']:.2%}")
    if m["sharpe"] < 0.5:
        failures.append(f"Sharpe below threshold: {m['sharpe']:.2f}")
    if not (0.45 <= m["win_rate"] <= 0.75):
        failures.append(f"win rate out of range: {m['win_rate']:.1%}")
    if m["alpha"] < 0:
        failures.append(f"negative alpha vs buy-and-hold: {m['alpha']:.2%}")
    if failures:
        return GateResult(passed=False, failures=failures)

    # Tier 3: Statistical
    if result.n_test_samples < 100:
        failures.append(f"test sample too small: {result.n_test_samples} trades")
    # The significance check is skipped when no p-value was recorded.
    p_value = m.get("p_value")
    if p_value is not None and p_value > 0.05:
        failures.append(f"returns not statistically significant: p={p_value:.3f}")

    return GateResult(passed=not failures, failures=failures)

What fail-fast means for this domain

Fail-fast in software engineering usually means catching errors early so they do not propagate. In ML experiment iteration it means something slightly different: run the cheap checks before the expensive ones, and stop any experiment that cannot produce a useful result before investing compute in it.

Data quality checks run before feature calculation. Feature calculation runs before model training. The gate runs after training but before any model reaches paper trading. Each step is a checkpoint. An experiment with a misconfigured target definition or insufficient data fails immediately with a clear reason. The researcher reads the log, fixes the config, and reruns. Nobody waits for a slow training run to complete only to discover that the run was invalid from the start.
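The ordering described above can be sketched as a list of stages that runs cheapest first and stops at the first failure. Everything here is hypothetical scaffolding for illustration: the stage functions, the ctx dictionary, and the run_stages helper are not part of the project's code, and the training stage is a stand-in for the slow step.

```python
from typing import Callable, Optional

# Each stage returns None on success or a failure reason string.
Stage = tuple[str, Callable[[dict], Optional[str]]]


def check_data_quality(ctx: dict) -> Optional[str]:
    if ctx["n_rows"] < ctx["min_rows"]:
        return f"insufficient data: {ctx['n_rows']} rows"
    return None


def compute_features(ctx: dict) -> Optional[str]:
    ctx["features_ready"] = True
    return None


def train_model(ctx: dict) -> Optional[str]:
    ctx["trained"] = True  # stand-in for the expensive step
    return None


def run_stages(stages: list[Stage], ctx: dict) -> tuple[bool, list[str]]:
    """Run stages in order and stop at the first failure."""
    log = []
    for name, stage in stages:
        reason = stage(ctx)
        if reason is not None:
            log.append(f"{name}: FAILED ({reason})")
            return False, log
        log.append(f"{name}: ok")
    return True, log


STAGES = [
    ("data_quality", check_data_quality),
    ("features", compute_features),
    ("training", train_model),
]

ctx = {"n_rows": 120, "min_rows": 500}
ok, log = run_stages(STAGES, ctx)
print(ok, log)
```

With only 120 rows against a minimum of 500, the run stops at the data quality stage and the training stage is never called.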

This also shapes how you think about experiment iteration. Rather than carefully designing each config before running it, you can move faster and let the gate be the filter. Write a reasonable config, run it, read the gate output. If it fails at the sanity tier, fix the config. If it fails at the performance tier, the model trained but was not strong enough. If it fails at the statistical tier, the test sample was too small or the returns were not significant. If it passes, it earned its way to paper trading. The gate is not just a safety check. It is the feedback mechanism that makes rapid iteration trustworthy.

What the results showed

Capital protection models trained reliably. These are models configured to identify conditions where holding cash is better than holding the asset. Bear markets and high-volatility sideways periods have strong feature signatures that XGBoost identifies well. Across different symbols and timeframes, the gate pass rate for capital protection configs was markedly higher than for alpha-generating configs.

Generating consistent positive alpha is harder. The difficulty is not the model architecture or the feature set. A pattern that generates alpha attracts capital, which arbitrages the pattern away. The right response to this is not a more sophisticated model but a faster iteration loop. The config system and the gate were built for exactly this. The gate enforces that any model reaching execution has earned it. The logging infrastructure keeps a complete record of every experiment that did not pass, which is as informative as the ones that did.