ML Engineering

April 22, 2026

A Practical Setup for Training ML Models on Market Data

The setup decisions that matter most when training ML models on market data, from data acquisition and target definition through chronological splits, feature engineering, and validation that reflects real trading.

Python, XGBoost, Alpaca, Data Engineering

What this covers

Most ML tutorials train on static datasets where the train/test split is random and the target variable is given. Market data breaks both assumptions. The data is time-ordered, so a random split leaks future information into training. The target, whether a price move happened, has to be defined carefully so that it does not encode information the model wouldn't have had at prediction time.

This covers the setup decisions that matter: data acquisition and normalisation, defining a prediction target without leakage, feature engineering that stays honest about lookback windows, chronological splitting, training XGBoost, and validation that tells you whether the model actually works rather than whether it memorised the training set.

Data acquisition

Alpaca provides free OHLCV data for US equities and crypto. The key thing when pulling historical bars is to request adjusted data. Corporate actions like splits and dividends create discontinuities in the raw price series that produce false signals for momentum features.
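To see why adjustment matters, here is a minimal sketch with made-up prices around a hypothetical 2-for-1 split. The series, the split ratio, and the manual back-adjustment are all illustrative assumptions; in practice the data provider applies the adjustment for you.

```python
import pandas as pd

# Hypothetical raw closes; a 2-for-1 split takes effect at position 3.
raw_close = pd.Series([100.0, 102.0, 104.0, 52.5, 53.0])

split_ratio = 2.0
split_pos = 3  # first post-split bar

# Back-adjust: scale pre-split prices down by the split ratio.
adjusted_close = raw_close.copy()
adjusted_close.iloc[:split_pos] = raw_close.iloc[:split_pos] / split_ratio

raw_returns = raw_close.pct_change()
adjusted_returns = adjusted_close.pct_change()

# The raw series shows a one-bar drop of about 50% that never happened
# economically; the adjusted series shows a gain of roughly 1%.
print(raw_returns.round(4).tolist())
print(adjusted_returns.round(4).tolist())
```

Any momentum feature computed over a window containing the raw discontinuity (RSI, ROC, MACD) would register a crash that no holder of the stock experienced.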

src/collectors/alpaca.py
import alpaca_trade_api as tradeapi
import pandas as pd
from datetime import datetime, timezone

class AlpacaCollector:
    def __init__(self, api_key: str, secret_key: str):
        self.api = tradeapi.REST(
            api_key, secret_key,
            base_url="https://data.alpaca.markets"
        )

    def get_bars(
        self,
        symbol: str,
        timeframe: str,  # "1Day", "1Hour", "4Hour"
        start: str,
        end: str,
    ) -> pd.DataFrame:
        bars = self.api.get_bars(
            symbol,
            timeframe,
            start=start,
            end=end,
            adjustment="all",  # split + dividend adjusted
        ).df

        bars.index = bars.index.tz_convert("UTC")
        bars.columns = [c.lower() for c in bars.columns]

        return bars[["open", "high", "low", "close", "volume"]]

Always normalise timestamps to UTC before storing. Mixing naive and tz-aware datetimes is a reliable source of hard-to-find bugs in backtesting.
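A small helper along these lines can enforce the rule at the storage boundary. This is a sketch, not part of the collector above: `to_utc_index` and its `assume_tz` parameter are hypothetical names, and the zone you assume for naive timestamps depends on where the data came from.

```python
import pandas as pd

def to_utc_index(df: pd.DataFrame, assume_tz: str = "UTC") -> pd.DataFrame:
    """Return a copy of df whose DatetimeIndex is tz-aware UTC.

    assume_tz (hypothetical parameter) is the zone applied to naive
    timestamps before converting; tz-aware timestamps are converted as-is.
    """
    out = df.copy()
    idx = pd.DatetimeIndex(out.index)
    if idx.tz is None:
        idx = idx.tz_localize(assume_tz)
    out.index = idx.tz_convert("UTC")
    return out

stamps = pd.to_datetime(["2024-01-02 09:30", "2024-01-02 10:30"])
naive = pd.DataFrame({"close": [1.0, 2.0]}, index=stamps)
aware = pd.DataFrame(
    {"close": [1.0, 2.0]}, index=stamps.tz_localize("America/New_York")
)

# Both frames describe the same instants and end up with the same UTC index.
naive_utc = to_utc_index(naive, assume_tz="America/New_York")
aware_utc = to_utc_index(aware)
print(naive_utc.index.equals(aware_utc.index))
```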

Defining the target: the most important decision

The target variable is a forward-looking quantity: does price rise by at least X% over the next N bars? There are two ways to get this wrong.

The first is calculating on the full dataset before splitting. For forward returns, the test set's targets only use data from within the test period, which is fine, but the last N rows of the training set get targets that depend on prices inside the test period, so the training labels reach across the split boundary. Feature normalisation has the same problem on a larger scale: if you ever normalise features using the full dataset's statistics (mean, std), those statistics encode information from the test period and contaminate training. The safest habit is to calculate everything within each split.

The second mistake is using a single forward bar's return as the target. A single bar is noisy. A horizon of 3-5 bars with a minimum threshold (0.8% to 1.5% depending on the asset) produces a cleaner signal.

src/targets.py
import numpy as np
import pandas as pd

def binary_target(
    close: pd.Series,
    horizon: int,
    threshold: float,
) -> pd.Series:
    """
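    Returns 1 if price rises by at least threshold (a fraction,
    e.g. 0.008 for 0.8%) over the next horizon bars, else 0.
    NaN for the final horizon rows where the forward return
    cannot be calculated.
    """
    forward_return = close.shift(-horizon) / close - 1
    target = (forward_return >= threshold).astype("Int64")
    # NaN >= threshold evaluates to False, so mask the tail explicitly;
    # otherwise the final horizon rows would be labelled 0 instead of NaN.
    return target.where(forward_return.notna())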

# --- Usage ---

split_idx = int(len(df) * 0.70)
train_df  = df.iloc[:split_idx].copy()
test_df   = df.iloc[split_idx:].copy()

# Calculate targets independently within each split
train_df["target"] = binary_target(train_df["close"], horizon=3, threshold=0.008)
test_df["target"]  = binary_target(test_df["close"],  horizon=3, threshold=0.008)

# Drop rows at the end of each split where target is NaN
train_df = train_df.dropna(subset=["target"])
test_df  = test_df.dropna(subset=["target"])
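The boundary effect is easy to verify on a toy series. The sketch below uses made-up closes and a hypothetical helper, `forward_label`, that mirrors the target definition; it compares labels computed on the full series with labels computed inside the training split only.

```python
import pandas as pd

def forward_label(close: pd.Series, horizon: int, threshold: float) -> pd.Series:
    """Hypothetical helper: 1 if the forward return over `horizon` bars is at
    least `threshold` (a fraction), 0 otherwise, NA where it is undefined."""
    fwd = close.shift(-horizon) / close - 1
    label = (fwd >= threshold).astype("Int64")
    return label.where(fwd.notna())

# Made-up closes; rows 0-6 are the training split, rows 7-9 the test split.
close = pd.Series(
    [100.0, 100.5, 101.0, 101.2, 103.0, 104.0, 103.5, 105.0, 106.0, 105.5]
)
split_idx = 7
horizon = 3

full_labels = forward_label(close, horizon, 0.008)
train_labels = forward_label(close.iloc[:split_idx], horizon, 0.008)

# The last `horizon` training rows are the ones whose full-dataset label
# was computed from test-period closes.
boundary = slice(split_idx - horizon, split_idx)
print(full_labels.iloc[boundary].tolist())   # defined, but built from test prices
print(train_labels.iloc[boundary].tolist())  # NA, so these rows get dropped
```

Computing within the split costs `horizon` rows at the end of each split, which is a small price for labels that never look past the boundary.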

Feature engineering and the lookback window rule

Every feature must use only data available at the time of the prediction. The lookback window rule: if a feature uses the last N bars, it must be calculated using a rolling window, never a full-series normalisation. This is where most implementations silently leak.

Volume z-score is the classic trap. Normalising volume by the full-series mean and standard deviation uses future statistics. Normalising by a rolling 20-bar mean and standard deviation uses only the current bar and the bars before it. Backtest results can differ substantially between the two.

src/features.py
import pandas as pd
import numpy as np
import talib

def momentum_features(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)
    out["rsi_14"]    = talib.RSI(df["close"], timeperiod=14)
    out["rsi_7"]     = talib.RSI(df["close"], timeperiod=7)
    out["macd_hist"] = talib.MACD(df["close"])[2]
    out["roc_10"]    = talib.ROC(df["close"], timeperiod=10)
    return out

def volume_features(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)

    rolling_mean = df["volume"].rolling(20).mean()
    rolling_std  = df["volume"].rolling(20).std()

    # Correct: rolling normalisation (past data only)
    out["volume_zscore"] = (df["volume"] - rolling_mean) / rolling_std

    # Wrong: full-series normalisation (leaks future statistics)
    # out["volume_zscore"] = (df["volume"] - df["volume"].mean()) / df["volume"].std()

    # TA-Lib expects float64 inputs; volume often arrives as an integer column
    obv = talib.OBV(df["close"], df["volume"].astype(float))
    out["obv_ratio"] = obv / obv.rolling(20).mean()

    return out

def volatility_features(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)
    out["atr_pct"]        = talib.ATR(df["high"], df["low"], df["close"], 14) / df["close"]
    upper, _, lower = talib.BBANDS(df["close"])  # returns (upper, middle, lower)
    out["bb_width"]       = (upper - lower) / df["close"]
    out["daily_range_pct"] = (df["high"] - df["low"]) / df["close"]
    return out
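One way to catch silent leaks is a truncation test: compute a feature on the full series, recompute it on a series cut off at some bar, and check that the values before the cutoff did not change. The sketch below assumes synthetic volume data and a hypothetical helper named `is_causal`; it is a sanity check to adapt, not a complete test suite.

```python
import numpy as np
import pandas as pd

def rolling_zscore(s: pd.Series, window: int = 20) -> pd.Series:
    return (s - s.rolling(window).mean()) / s.rolling(window).std()

def full_series_zscore(s: pd.Series) -> pd.Series:
    return (s - s.mean()) / s.std()

def is_causal(feature_fn, series: pd.Series, cutoff: int) -> bool:
    """Hypothetical check: True if feature values before `cutoff` are
    unchanged when every bar from `cutoff` onward is removed."""
    on_full = feature_fn(series).iloc[:cutoff]
    on_truncated = feature_fn(series.iloc[:cutoff])
    return bool(np.allclose(on_full, on_truncated, equal_nan=True))

# Synthetic volume with a regime shift late in the series.
rng = np.random.default_rng(0)
volume = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=200))
volume.iloc[150:] *= 3

rolling_ok = is_causal(rolling_zscore, volume, cutoff=100)
full_ok = is_causal(full_series_zscore, volume, cutoff=100)
print(rolling_ok, full_ok)
```

The rolling version passes because each value depends only on its own window. The full-series version fails because removing later bars changes the mean and standard deviation applied to every earlier bar.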

The chronological split

Never shuffle time series data. A random split will place future observations in the training set, and the model will learn patterns that include information it wouldn't have had at prediction time. The result looks like a high-performing model but degrades immediately in live trading.

Chronological split: everything before the split date trains the model, everything after tests it. Some practitioners add a gap between train and test equal to the maximum feature lookback period (e.g., 55 bars for a 55-period EMA) to ensure no feature calculation in the test set uses data from the training period.

src/splitting.py
import pandas as pd

def chronological_split(
    df: pd.DataFrame,
    train_ratio: float = 0.70,
    gap_bars: int = 0,
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Split by time position, not randomly.
    gap_bars: bars to drop at the start of test to avoid any
    feature calculation overlap with the training window.
    """
    n = len(df)
    split = int(n * train_ratio)

    train = df.iloc[:split]
    test  = df.iloc[split + gap_bars:]

    return train, test

# --- Example ---
train, test = chronological_split(df, train_ratio=0.70, gap_bars=55)

print(f"Train: {train.index[0]} to {train.index[-1]} ({len(train)} bars)")
print(f"Test:  {test.index[0]} to {test.index[-1]} ({len(test)} bars)")

Walk-forward validation takes this further: train on months 1-12, test on month 13, then train on 1-13, test on 14, and so on. This gives you a realistic picture of how the model holds up as time passes and the training window shifts.
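An expanding-window version of walk-forward validation can be sketched as a generator. The function name `walk_forward_splits` and its parameters are assumptions for illustration, and the example works in bars rather than calendar months to keep it short.

```python
from typing import Iterator

import pandas as pd

def walk_forward_splits(
    df: pd.DataFrame,
    initial_train_bars: int,
    test_bars: int,
    gap_bars: int = 0,
) -> Iterator[tuple[pd.DataFrame, pd.DataFrame]]:
    """Yield (train, test) pairs where train always starts at the first bar
    and grows by test_bars each fold; test is the next block after the gap."""
    train_end = initial_train_bars
    while train_end + gap_bars + test_bars <= len(df):
        test_start = train_end + gap_bars
        yield df.iloc[:train_end], df.iloc[test_start:test_start + test_bars]
        train_end += test_bars

# Synthetic daily bars for demonstration only.
index = pd.date_range("2024-01-01", periods=100, freq="D", tz="UTC")
bars = pd.DataFrame({"close": range(100)}, index=index)

folds = list(walk_forward_splits(bars, initial_train_bars=60, test_bars=10))
for train, test in folds:
    print(f"train {len(train)} bars, test {test.index[0].date()} to {test.index[-1].date()}")
```

Each fold retrains from scratch on everything before its test block, so the metrics across folds show how performance drifts as the training window extends.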

Training XGBoost

XGBoost works well for tabular market data. It handles the non-linear relationships between features naturally, doesn't require feature scaling, and the early stopping mechanism prevents overfitting without requiring you to guess the right number of estimators. Keep max_depth shallow (3-5) to reduce overfitting. Financial features tend to interact in low-order ways.

src/training.py
import xgboost as xgb
import numpy as np
from dataclasses import dataclass

@dataclass
class TrainingResult:
    model: xgb.XGBClassifier
    feature_names: list[str]
    feature_metadata: dict
    predictions: np.ndarray
    probas: np.ndarray
    y_test: np.ndarray

def train(
    train_df,
    test_df,
    feature_names: list[str],
    feature_metadata: dict,
    config: dict,
) -> TrainingResult:
    target = "target"

    X_train = train_df[feature_names].dropna()
    y_train = train_df.loc[X_train.index, target]

    X_test  = test_df[feature_names].dropna()
    y_test  = test_df.loc[X_test.index, target]

    model = xgb.XGBClassifier(
        n_estimators=500,
        learning_rate=0.05,
        max_depth=4,           # shallow to reduce overfit
        subsample=0.8,
        colsample_bytree=0.8,
        min_child_weight=5,    # minimum sum of instance weight (hessian) in a child
        early_stopping_rounds=config.get("early_stopping_rounds", 50),
        eval_metric="logloss",
        random_state=42,
        verbosity=0,
    )

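    # Early stopping must not monitor the test set, or the stopping point is
    # tuned on the data used for evaluation. Hold out the most recent part of
    # the training window instead ("val_ratio" is an assumed config key).
    val_start = int(len(X_train) * (1 - config.get("val_ratio", 0.2)))
    X_fit, X_val = X_train.iloc[:val_start], X_train.iloc[val_start:]
    # Cast to plain ints: rows with a missing target were dropped earlier,
    # so the nullable Int64 dtype is no longer needed.
    y_fit = y_train.iloc[:val_start].astype(int)
    y_val = y_train.iloc[val_start:].astype(int)

    model.fit(
        X_fit, y_fit,
        eval_set=[(X_val, y_val)],
        verbose=False,
    )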

    predictions = model.predict(X_test)
    probas      = model.predict_proba(X_test)[:, 1]

    return TrainingResult(
        model=model,
        feature_names=feature_names,
        feature_metadata=feature_metadata,
        predictions=predictions,
        probas=probas,
        y_test=y_test.values,
    )

Validation that reflects real trading

Accuracy on the test set is a weak signal for trading models. A model that predicts 1 every time in a bull market has high accuracy and loses money the moment the market turns. The metrics that matter are Sharpe ratio (risk-adjusted return), win rate (the fraction of trades that are profitable), and comparison to buy-and-hold (whether the model actually adds alpha or is just capturing the market trend).
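The accuracy point can be made concrete with a constant predictor. The labels below are synthetic, with roughly 70% positives chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical labels: the target is 1 on roughly 70% of bars.
y_true = (rng.random(1000) < 0.7).astype(int)

# A "model" that has learned nothing and always predicts 1.
always_long = np.ones_like(y_true)

accuracy = (always_long == y_true).mean()
base_rate = y_true.mean()

# Accuracy of the constant predictor is exactly the share of positives.
print(f"accuracy={accuracy:.3f}, base rate={base_rate:.3f}")
```

A model has to beat the base rate before its accuracy says anything, and even then a strategy that is in the market on every bar is just buy-and-hold.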

src/validation.py
import numpy as np
import pandas as pd

def sharpe_ratio(returns: pd.Series, periods_per_year: int = 252) -> float:
    if returns.std() == 0:
        return 0.0
    return (returns.mean() / returns.std()) * np.sqrt(periods_per_year)

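def backtest(
    test_df: pd.DataFrame,
    probas: np.ndarray,
    entry_threshold: float = 0.60,
    horizon: int = 3,
) -> dict:
    """
    Simulate entering on signal, holding for horizon bars.
    Decisions are taken every horizon-th bar so holding periods never
    overlap; compounding overlapping forward returns would count each
    bar's move up to horizon times.
    """
    # Assumes probas line up with the last len(probas) rows of test_df
    close = test_df["close"].iloc[-len(probas):]
    signals = pd.Series((probas >= entry_threshold).astype(int), index=close.index)

    # Forward return over the holding period, at non-overlapping entry points
    forward_return = close.pct_change(horizon).shift(-horizon)
    forward_return = forward_return.iloc[::horizon].dropna()
    signals = signals.loc[forward_return.index]

    strat   = forward_return * signals   # 0 when flat
    buyhold = forward_return
    trades  = strat[signals == 1]        # win rate is measured on trades only

    total_return    = (1 + strat).prod() - 1
    buy_hold_return = (1 + buyhold).prod() - 1
    # Each return spans horizon bars; 252 // horizon assumes daily bars
    periods = max(252 // horizon, 1)

    return {
        "sharpe":          round(sharpe_ratio(strat, periods), 3),
        "win_rate":        round((trades > 0).mean(), 4) if len(trades) else 0.0,
        "total_return":    round(total_return, 4),
        "buy_hold_return": round(buy_hold_return, 4),
        "alpha":           round(total_return - buy_hold_return, 4),
        "n_trades":        int(signals.sum()),
        "signal_rate":     round(signals.mean(), 4),
    }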
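The annualisation factor in the Sharpe ratio has to match the frequency of the returns being measured. The sketch below uses synthetic returns and a hypothetical function, `annualised_sharpe`, to show how the same per-period series maps to different annual figures; 252 is the usual count for daily US equity bars and 365 for daily crypto bars, and other timeframes need their own count.

```python
import numpy as np
import pandas as pd

def annualised_sharpe(returns: pd.Series, periods_per_year: int) -> float:
    """Mean over standard deviation of per-period returns, scaled by the
    square root of the number of periods in a year (risk-free rate of 0)."""
    std = returns.std()
    if std == 0:
        return 0.0
    return float(returns.mean() / std * np.sqrt(periods_per_year))

# Synthetic per-period returns for illustration only.
rng = np.random.default_rng(7)
per_period = pd.Series(rng.normal(loc=0.0005, scale=0.01, size=500))

equity_daily = annualised_sharpe(per_period, periods_per_year=252)
crypto_daily = annualised_sharpe(per_period, periods_per_year=365)

# Same returns, different calendars: the ratio scales by sqrt(365 / 252).
print(round(equity_daily, 3), round(crypto_daily, 3))
```

Using 252 on hourly or multi-bar returns misstates the Sharpe ratio by a constant factor, which matters when comparing strategies across timeframes.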

What to expect from the first run

Capital protection models tend to validate well. These identify conditions where holding cash beats holding the asset. Bear markets and high-volatility sideways periods have strong feature signatures that a gradient boosted model can identify reliably.

Generating consistent alpha is harder. Markets are adversarial in a way that most ML datasets are not: when a pattern is profitable, other participants find it and arbitrage it away. This is why the evaluation loop matters more than the model architecture. A weak model that you can evaluate honestly and iterate on quickly will outperform a sophisticated model trained once on a static dataset.

Start simple: one symbol, one timeframe, two or three feature groups. Build the evaluation infrastructure first. Then expand.

The three most common mistakes in the first few attempts are: calculating forward returns on the full dataset before splitting, using full-series feature normalisation instead of rolling windows, and evaluating on accuracy instead of Sharpe. Fix these before worrying about model architecture.