ML Engineering

May 20, 2026

Backtest Variants: Walk-Forward, Regime Testing, and Window Sampling

Why a single train/test holdout is not enough for financial models, and how three complementary backtest approaches give a more complete picture of model robustness before any strategy reaches paper trading.

Python, Backtesting, Statistics, XGBoost

Why a single holdout is not enough

The standard ML evaluation approach: split chronologically, train on the earlier portion, test on the later portion, report metrics. This is the minimum requirement. For financial models it is not sufficient on its own.

A single holdout window samples one market regime. If the training data runs from 2021 through mid-2023 and the test window is late 2023, the model is evaluated on whatever conditions existed in that specific period. If that period happened to be low-volatility and the model was trained on high-volatility data, the reported metrics reflect that particular transition, not the model's general reliability across market conditions.

Three complementary approaches fill in the rest of the picture. Walk-forward validation tests whether performance holds as the training window advances. Regime testing asks whether performance holds across structurally different market conditions. Window sampling tests whether reported metrics are sensitive to the specific holdout dates chosen. Together they answer three distinct questions that a single holdout cannot.

Walk-forward validation

Walk-forward validation repeatedly moves the end of the training window forward in time and tests on the period that immediately follows. Rather than training once on 70% and testing on 30%, you train on months 1 through 12, test on month 13, then train on months 1 through 13, test on month 14, and so on. The training set stays anchored at the start of the data and grows with each step, which makes this the expanding-window form of walk-forward. The result is a series of performance scores across many windows rather than one score on one window.

A model that produces consistent Sharpe ratios across many walk-forward windows is more trustworthy than one that performs well on a single holdout. The variance of the walk-forward scores is informative on its own. High variance means performance is sensitive to which window it is evaluated on.

backtesting/walk_forward.py
from collections.abc import Iterator

import pandas as pd


def walk_forward_splits(
    df: pd.DataFrame,
    initial_train_months: int = 12,
    test_months: int = 1,
    step_months: int = 1,
) -> Iterator[tuple[pd.DataFrame, pd.DataFrame]]:
    """Yield (train, test) pairs with an expanding training window.

    Expects df to have a sorted DatetimeIndex.
    """
    train_end = df.index[0] + pd.DateOffset(months=initial_train_months)

    while True:
        test_end = train_end + pd.DateOffset(months=test_months)
        # Stop once the next test window would run past the available data.
        if test_end > df.index[-1]:
            break

        # Train on everything before train_end; test on the window right after it.
        train_df = df[df.index < train_end]
        test_df = df[(df.index >= train_end) & (df.index < test_end)]

        if len(train_df) > 0 and len(test_df) > 0:
            yield train_df, test_df

        # Advance the boundary; the training set grows, the test window slides.
        train_end += pd.DateOffset(months=step_months)

# Usage: collect Sharpe across all windows.
# df, config, train(), and backtest() are defined elsewhere and not shown here.
sharpes = []
for train_df, test_df in walk_forward_splits(df, initial_train_months=12):
    model, metadata = train(train_df, config)
    metrics = backtest(test_df, model, metadata)
    sharpes.append(metrics["sharpe"])

if sharpes:  # an empty list means the data was too short for a single window
    scores = pd.Series(sharpes)
    print(f"Walk-forward Sharpe: mean={scores.mean():.2f}, "
          f"std={scores.std():.2f}, "
          f"negative windows={(scores < 0).sum()}/{len(scores)}")

Regime testing

Markets move through structurally different conditions: bull markets with persistent upward trend and low volatility, bear markets with sustained drawdown, and sideways markets with no clear direction and high intraday noise. Each regime has different feature distributions and different optimal strategies.

A model trained and tested predominantly on bull-market data can look strong and then fail the moment the regime changes. Regime testing partitions the test period by market condition and evaluates performance on each partition separately. The goal is to understand which regimes the model is actually reliable in, not just whether it looks good on average.

backtesting/regime.py
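import numpy as np
import pandas as pd

# sharpe_ratio() is not defined in this listing; it is assumed to be the
# project's own metric helper, imported from wherever it lives.


def classify_regime(df: pd.DataFrame, window: int = 50) -> pd.Series:
    """Label each bar bull, bear, or sideways from the slope of a rolling mean."""
    # Per-bar percentage change of the rolling mean of closes.
    trend = df["close"].rolling(window).mean().pct_change()

    # Default to sideways; the first `window` bars have no trend value and stay sideways.
    regime = pd.Series("sideways", index=df.index)
    regime[trend >  0.002] = "bull"
    regime[trend < -0.002] = "bear"
    return regime


def evaluate_by_regime(
    df: pd.DataFrame,
    probas: np.ndarray,
    entry_threshold: float,
    horizon: int,
) -> dict[str, dict]:
    """Compute Sharpe and win rate separately for each regime.

    probas is assumed to align with the last len(probas) rows of df.
    """
    regime = classify_regime(df)
    # Return realized over the next `horizon` bars; NaN for the final `horizon` rows.
    forward_return = df["close"].pct_change(horizon).shift(-horizon).iloc[-len(probas):]
    signals = (np.asarray(probas) >= entry_threshold).astype(int)
    results = {}

    for r in ["bull", "bear", "sideways"]:
        mask = (regime == r).values[-len(probas):]
        if mask.sum() < 20:
            results[r] = {"n": int(mask.sum()), "skipped": "insufficient samples"}
            continue
        in_regime = forward_return[mask]
        active = signals[mask]
        # Strategy return: the forward return on signal bars, zero on flat bars.
        strat = (in_regime * active).dropna()
        # Win rate counts only bars where a signal fired, so flat bars do not dilute it.
        trades = in_regime[active == 1].dropna()
        results[r] = {
            "n":        int(mask.sum()),
            "sharpe":   round(sharpe_ratio(strat), 3),
            "win_rate": round(float((trades > 0).mean()), 4) if len(trades) else float("nan"),
        }
    return results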
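
The listing above calls `sharpe_ratio`, which this post does not show. The following is a minimal sketch of what such a helper might look like, not the project's actual implementation. It assumes per-period simple returns and a zero risk-free rate, and `periods_per_year` is a hypothetical parameter: 252 suits daily bars, while intraday bars or multi-bar forward returns would need a different factor.

```python
import numpy as np
import pandas as pd


def sharpe_ratio(returns: pd.Series, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio of per-period returns (illustrative sketch).

    Assumes a zero risk-free rate. Returns 0.0 when there are fewer than two
    observations or the returns have no variance, so the division is always safe.
    """
    returns = pd.Series(returns).dropna()
    if len(returns) < 2:
        return 0.0
    std = returns.std(ddof=1)  # sample standard deviation
    if std == 0 or np.isnan(std):
        return 0.0
    return float(returns.mean() / std * np.sqrt(periods_per_year))
```

Returning 0.0 for degenerate inputs is a design choice that keeps per-regime reports from failing on small partitions; returning NaN instead would make those cases more visible.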

Capital protection models show their strongest performance in bear and sideways regimes, which is consistent with what they are designed to identify. A model that shows strong performance only in bull regimes is not a capital protection model regardless of what its config says.

Window sampling

Walk-forward tells you whether performance persists through time. Regime testing tells you whether it holds across market conditions. Window sampling asks a different question: how sensitive is this model to the specific dates chosen for the train/test boundary?

The approach: take the full data range and generate multiple splits by varying the split date within a range. Train on each prefix, test on a following window of fixed length, and compare the distribution of results across split points. A robust model shows similar metrics regardless of where the boundary falls. A fragile model shows high variance, which indicates that the reported performance depends on which specific bars ended up in the test set.

This is particularly useful for shorter backtests where a single unusually good or bad period can dominate the result. Window sampling exposes that sensitivity before the model reaches paper trading rather than after.
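
The approach described above can be sketched as a split generator. This is one possible reading, not the post's actual implementation: the function name `window_sample_splits` and its parameters are hypothetical, the split dates are evenly spaced rather than randomly drawn, and `df` is assumed to have a sorted DatetimeIndex.

```python
from collections.abc import Iterator

import pandas as pd


def window_sample_splits(
    df: pd.DataFrame,
    first_split: pd.Timestamp,
    last_split: pd.Timestamp,
    n_splits: int = 10,
    test_days: int = 30,
) -> Iterator[tuple[pd.Timestamp, pd.DataFrame, pd.DataFrame]]:
    """Yield (split_date, train, test) for evenly spaced split dates.

    Each train set is the prefix of df before the split date. Each test set is
    the fixed-length window of test_days that starts at the split date.
    """
    split_dates = pd.date_range(first_split, last_split, periods=n_splits)
    test_span = pd.Timedelta(days=test_days)

    for split in split_dates:
        test_end = split + test_span
        if test_end > df.index[-1]:
            break  # not enough data left for a full test window
        train = df[df.index < split]
        test = df[(df.index >= split) & (df.index < test_end)]
        if len(train) > 0 and len(test) > 0:
            yield split, train, test
```

Each split can then be trained and backtested the same way as in the walk-forward loop, collecting one Sharpe ratio per split date. The spread of those values, for example via `pd.Series(scores).describe()`, shows how much the result depends on where the boundary falls.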

Putting the three approaches together

A model that passes the gate on a single holdout has earned a provisional result. The three backtest variants turn that provisional result into something more meaningful.

Walk-forward answers: does performance persist as time advances? Regime testing answers: does performance depend on market conditions? Window sampling answers: does performance depend on the specific holdout chosen?
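
One way to view the three sets of results side by side is to collect them into a single report. The sketch below is illustrative only: `summarize_scores` and `robustness_report` are hypothetical names, and the report deliberately contains no pass/fail thresholds, since the post does not define any.

```python
import pandas as pd


def summarize_scores(scores: list[float]) -> dict[str, float]:
    """Summarize per-window Sharpe ratios from walk-forward or window sampling."""
    s = pd.Series(scores, dtype="float64").dropna()
    if s.empty:
        return {"n": 0}
    return {
        "n": int(len(s)),
        "mean": round(float(s.mean()), 3),
        # Sample standard deviation is undefined for a single window.
        "std": round(float(s.std(ddof=1)), 3) if len(s) > 1 else 0.0,
        "min": round(float(s.min()), 3),
        "share_negative": round(float((s < 0).mean()), 3),
    }


def robustness_report(
    walk_forward_sharpes: list[float],
    window_sample_sharpes: list[float],
    regime_results: dict[str, dict],
) -> dict[str, dict]:
    """Collect the three backtest variants into one dictionary for review."""
    return {
        "walk_forward": summarize_scores(walk_forward_sharpes),
        "window_sampling": summarize_scores(window_sample_sharpes),
        "by_regime": regime_results,  # e.g. the output of evaluate_by_regime
    }
```

Here `walk_forward_sharpes` would be the `sharpes` list from the walk-forward loop, and `regime_results` the dictionary returned by `evaluate_by_regime`.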

A model that produces consistent results across all three is worth paper trading. A model that passes the gate but shows high walk-forward variance or strong regime dependence is worth understanding better before committing to execution. The information is not a reason to disqualify the model. It is context for interpreting the gate result and setting realistic expectations for what paper trading will show.