Architecture of a Systematic Trading Platform

What this system is

This platform started as two parallel experiments. One was learning where different ML approaches break down across equity and crypto markets. The other was a harder question to answer in the abstract: how far can AI-assisted development take you on a genuinely complex engineering problem? Algorithmic trading is a good test for both. The feedback loop is honest. Paper trading shows quickly whether a model that looks good in training is actually good or whether it got lucky on one historical window.

The system that grew out of that work has four main subsystems. A data pipeline pulls equity data from Polygon (a WebSocket live feed plus REST for history) and crypto market data from CoinGecko and GeckoTerminal, with Binance as the primary crypto gap filler. A centralized feature pipeline transforms raw OHLCV data into model-ready features. An experiment-driven training infrastructure runs hypotheses defined in YAML configs and tracks all results in PostgreSQL. Broker execution integrations route equity orders through Alpaca and crypto orders through Binance behind a shared interface, with a validation layer and a local ledger that tracks what the system intended versus what actually executed.

The YAML experiment config pattern is covered in depth in Config-Driven ML Experiments: Hypotheses as Code, and the core XGBoost training setup in A Practical Setup for Training ML Models on Market Data. This article describes how the four subsystems fit together and the design constraints that shaped the boundaries between them.

Design constraints known before writing any code

Four constraints were clear from the start. Each one had a direct architectural consequence.

Data integrity. Market data from four sources with different schemas, update frequencies, and reliability characteristics would need to arrive at a single canonical representation. Any gap in the data reaching model training would corrupt feature calculations in ways that could be hard to detect until a model produced inexplicable results.

Feature consistency. Whatever features were calculated during training needed to be calculated identically during backtesting and live inference. This sounds simple. It is harder to maintain than it looks as the feature library grows and as different parts of the pipeline make different assumptions about lookback windows and rolling periods.
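
One way to picture this constraint is to treat the feature definitions themselves as data that gets saved with the model. The sketch below is only an illustration under stated assumptions: FeatureSpec, the JSON layout, and the single rolling-mean feature are hypothetical and are not the platform's actual classes. The idea is that inference rebuilds its features from the saved definitions rather than from whatever the code currently defaults to.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical record of one feature's definition."""
    name: str
    column: str
    window: int


def rolling_mean(values, window):
    """Trailing mean; None until a full window of history is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window: i + 1]
            out.append(sum(chunk) / window)
    return out


def compute(rows, specs):
    """Compute every feature in specs from a list of OHLCV dicts."""
    return {
        s.name: rolling_mean([r[s.column] for r in rows], s.window)
        for s in specs
    }


def save_metadata(specs):
    """Serialize the definitions used at training time."""
    return json.dumps([asdict(s) for s in specs])


def load_metadata(blob):
    """Rebuild the exact training-time definitions at inference."""
    return [FeatureSpec(**d) for d in json.loads(blob)]


rows = [{"close": c} for c in [10.0, 11.0, 12.0, 13.0, 14.0]]
train_specs = [FeatureSpec("sma_3", "close", 3)]
train_features = compute(rows, train_specs)

blob = save_metadata(train_specs)          # stored next to the model
live_features = compute(rows, load_metadata(blob))
```

Because the window length travels with the model, a later change to a default lookback cannot silently alter what a deployed model sees.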

Evaluation rigour. Financial markets are non-stationary, so a model validated on one period can degrade on the next, and standard ML evaluation metrics such as accuracy do not map directly to trading performance. The evaluation layer needed to measure what actually mattered: Sharpe ratio, win rate relative to a benchmark, and alpha against buy-and-hold.
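
The three measures can be sketched in a few lines. The article does not give exact definitions, so everything here is an assumption made for illustration: per-period simple returns, a zero risk-free rate, 252 periods per year, alpha taken as total strategy return minus total buy-and-hold return, and win rate read as the share of periods in which the strategy beat the benchmark.

```python
import math
from statistics import mean, stdev


def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    sd = stdev(returns)
    if sd == 0:
        return 0.0
    return mean(returns) / sd * math.sqrt(periods_per_year)


def total_return(returns):
    """Compound per-period simple returns into one total return."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def alpha_vs_buy_and_hold(strategy_returns, asset_returns):
    """Excess total return over simply holding the asset (assumed definition)."""
    return total_return(strategy_returns) - total_return(asset_returns)


def win_rate_vs_benchmark(strategy_returns, asset_returns):
    """Share of periods in which the strategy beat the benchmark."""
    wins = sum(1 for s, b in zip(strategy_returns, asset_returns) if s > b)
    return wins / len(strategy_returns)


# Illustrative numbers only: the strategy is long in periods 1 and 3, flat otherwise.
asset = [0.01, -0.02, 0.015, -0.01]
position = [1, 0, 1, 0]
strategy = [p * r for p, r in zip(position, asset)]

sharpe = sharpe_ratio(strategy)
alpha = alpha_vs_buy_and_hold(strategy, asset)
win_rate = win_rate_vs_benchmark(strategy, asset)
```

A model can score well on classification accuracy and still show negative alpha here, which is the gap this constraint is meant to expose.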

Execution safety. Moving from paper trading to live trading means real orders. The execution layer needed to validate orders before submission and maintain its own record of positions so that a discrepancy between what the system believed and what the broker reported could be caught immediately.
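
A rough sketch of the two safeguards, pre-submission validation and a local position record that can be reconciled against the broker, might look like the following. Order, Ledger, the quantity limit, and the shape of the broker's position report are all hypothetical; no real broker API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    symbol: str
    side: str        # "buy" or "sell"
    quantity: float


def validate(order, max_quantity):
    """Return a list of problems; an empty list means the order may be submitted."""
    problems = []
    if order.side not in ("buy", "sell"):
        problems.append(f"unknown side: {order.side}")
    if order.quantity <= 0:
        problems.append("quantity must be positive")
    elif order.quantity > max_quantity:
        problems.append("quantity exceeds limit")
    return problems


class Ledger:
    """Local record of the positions the system believes it holds."""

    def __init__(self):
        self.positions = {}

    def record_fill(self, order):
        signed = order.quantity if order.side == "buy" else -order.quantity
        self.positions[order.symbol] = self.positions.get(order.symbol, 0.0) + signed

    def reconcile(self, broker_positions, tolerance=1e-9):
        """Return {symbol: (local, broker)} for every position that disagrees."""
        mismatches = {}
        for symbol in set(self.positions) | set(broker_positions):
            local = self.positions.get(symbol, 0.0)
            remote = broker_positions.get(symbol, 0.0)
            if abs(local - remote) > tolerance:
                mismatches[symbol] = (local, remote)
        return mismatches


ledger = Ledger()
order = Order("AAPL", "buy", 10)
if not validate(order, max_quantity=100):
    ledger.record_fill(order)

# Pretend the broker reports a partial fill the system did not account for.
mismatches = ledger.reconcile({"AAPL": 7.0})
```

Running the reconciliation after every fill, rather than once a day, is what turns a silent drift into an immediate alert.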

How the subsystems connect

Data flows in one direction through the system. Collectors ingest from each source and normalize into a canonical OHLCV schema stored in PostgreSQL. The feature pipeline reads from that schema and computes features on demand, saving metadata at training time to guarantee identical calculation at inference. The training runner reads from the feature pipeline, trains a model, runs it through a three-tier gate, and commits passing models to a models directory. The execution layer loads a gate-passing model, generates signals through the same feature pipeline used in training, and submits orders to the broker.

No subsystem reaches across another's boundary. The execution layer does not know how features are calculated. The feature pipeline does not know what model will consume its output. The training runner does not know which broker will execute the resulting signals. These clean interfaces were a deliberate choice: the most likely thing to change in a system like this is a data source. Binance's WebSocket volume stream broke mid-project. Replacing it touched one collector class and nothing else in the pipeline.
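
The collector boundary can be illustrated with a small sketch. The Bar schema, the Collector base class, and both payload shapes below are invented for this example and do not reproduce any real exchange's message format. The point is only that each source's quirks stay inside one class and everything downstream sees the same canonical row.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Bar:
    """Hypothetical canonical OHLCV row shared by every collector."""
    symbol: str
    ts: datetime
    open: float
    high: float
    low: float
    close: float
    volume: float


class Collector(ABC):
    """The only contract the rest of the pipeline depends on."""

    @abstractmethod
    def normalize(self, raw: dict) -> Bar:
        """Convert one source-specific payload into a canonical Bar."""


class SourceACollector(Collector):
    """Made-up payload shape: epoch milliseconds, single-letter keys."""

    def normalize(self, raw: dict) -> Bar:
        ts = datetime.fromtimestamp(raw["t"] / 1000, tz=timezone.utc)
        return Bar(raw["s"], ts, float(raw["o"]), float(raw["h"]),
                   float(raw["l"]), float(raw["c"]), float(raw["v"]))


class SourceBCollector(Collector):
    """Made-up payload shape: ISO-8601 timestamps, spelled-out keys."""

    def normalize(self, raw: dict) -> Bar:
        ts = datetime.fromisoformat(raw["time"])
        return Bar(raw["symbol"], ts, float(raw["open"]), float(raw["high"]),
                   float(raw["low"]), float(raw["close"]), float(raw["volume"]))


bar_a = SourceACollector().normalize(
    {"s": "BTCUSDT", "t": 1704067200000, "o": "100", "h": "110",
     "l": "95", "c": "105", "v": "12.5"})
bar_b = SourceBCollector().normalize(
    {"symbol": "BTCUSDT", "time": "2024-01-01T00:00:00+00:00", "open": 100,
     "high": 110, "low": 95, "close": 105, "volume": 12.5})
```

Swapping a broken source then means writing one new subclass; the feature pipeline never learns that anything changed.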

pipeline/runner.py

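import yaml
from pipeline.data import DataPipeline
from pipeline.features import FeaturePipeline
from pipeline.training import Trainer
from pipeline.gate import ModelGate
from pipeline.registry import ModelRegistry
# RunResult is used below but was not imported in the original listing;
# the module path here is an assumption.
from pipeline.results import RunResult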

def run(config_path: str) -> RunResult:
    with open(config_path) as f:
        config = yaml.safe_load(f)

    # 1. Data: ensure coverage, run quality checks
    data = DataPipeline()
    data.ensure_coverage(config["symbol"], config["timeframe"])
    df = data.load(config["symbol"], config["timeframe"])
    quality_errors = data.check_quality(df)
    if quality_errors:
        return RunResult.failed(quality_errors)

    # 2. Features: fit and save metadata for inference parity
    features = FeaturePipeline(config["feature_groups"])
    X, metadata = features.fit_transform(df)

    # 3. Training: chronological split, XGBoost, gate metrics
    result = Trainer(config).train(X, metadata)

    # 4. Gate: three-tier validation before any model touches execution
    gate_result = ModelGate().run(result, config)
    if not gate_result.passed:
        return RunResult.gate_failed(gate_result.failures)

    # 5. Registry: commit model + metadata together
    ModelRegistry().commit(result, metadata, config)
    return RunResult.success(result)

The experiment config as the organizing principle

The YAML experiment config is the central artifact that ties training to execution. An experiment defines a symbol, timeframe, model type, feature groups, target definition, position sizing, and training parameters. The runner consumes any valid config without modification. All results are tracked in PostgreSQL against the config that produced them.
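
To make "any valid config" concrete, here is a minimal validation sketch. It works on the dict that yaml.safe_load would return for an experiment file. Only symbol, timeframe, and feature_groups appear as keys in the runner above; the other key names and every value shown are assumptions chosen to mirror the fields listed in this section.

```python
# Key names beyond symbol, timeframe, and feature_groups are hypothetical.
REQUIRED_KEYS = (
    "symbol",
    "timeframe",
    "model_type",
    "feature_groups",
    "target",
    "position_sizing",
    "training",
)


def validate_config(config):
    """Return a list of problems; an empty list means the runner can consume it."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    groups = config.get("feature_groups")
    if groups is not None and not isinstance(groups, list):
        problems.append("feature_groups must be a list")
    return problems


# Illustrative experiment, written as the dict a YAML loader would produce.
experiment = {
    "symbol": "BTCUSDT",
    "timeframe": "1h",
    "model_type": "xgboost",
    "feature_groups": ["momentum", "volatility"],
    "target": {"horizon_bars": 4},
    "position_sizing": {"fraction": 0.1},
    "training": {"n_estimators": 200},
}

problems = validate_config(experiment)
incomplete = validate_config({"symbol": "BTCUSDT", "feature_groups": "momentum"})
```

Rejecting a malformed file before any data is loaded keeps a typo in one experiment from showing up later as an unexplained result.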

The practical effect of this design is that the training codebase stays stable as the number of experiments grows. Adding a new hypothesis is a new file, not a code change. At a few dozen configs, the separation between experiment definition and execution keeps the codebase from accumulating branching logic that makes it hard to trust any individual result. This pattern is covered in depth in Config-Driven ML Experiments: Hypotheses as Code.

Hardware and continuous operation

Training runs on dedicated hardware with an RTX 4090. Most XGBoost training runs are fast enough that GPU acceleration is not the bottleneck, but it matters for larger multi-class architectures and for batch training across many configs simultaneously.

The data collectors and paper trading sessions run continuously. Collectors include gap-filling routines that detect missing windows and backfill them on a schedule. Paper trading consumes live market data through the same feature pipeline used in training, which is the only way to trust that what was measured in backtest reflects what would be seen in execution.
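
The detection half of a gap-filling routine can be sketched as below. This assumes sorted bar timestamps on a fixed interval in a market that trades continuously, as crypto does; an equities version would also need a trading calendar so that nights and weekends are not reported as gaps. The function name and return shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_gaps(timestamps, step):
    """Return (first_missing, last_missing) pairs for sorted bar timestamps."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > step:
            gaps.append((prev + step, cur - step))
    return gaps


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
one_minute = timedelta(minutes=1)

# One-minute bars with minutes 3 and 4 missing.
stored = [t0 + timedelta(minutes=m) for m in (0, 1, 2, 5, 6)]
gaps = find_gaps(stored, one_minute)
```

Each returned pair is a window a scheduled backfill job could request from the gap-filling source.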