Validation Geometry Is Part of the Model

The dangerous part was not the classifier. The label file was, sitting there looking convenient.

A saved train_y.npy artifact existed, and for this baseline it was a trap. It contained magnitude-filtered positive targets only, which made it unusable for a directional classifier. Train on it anyway and the result would look like a model comparison while actually being a dataset-artifact comparison.

So I made the LightGBM minute baseline read Pramaana's per-asset feature parquet files directly. I wasn't trying to tune trees until they confessed. I was trying to close the model-class capacity objection in the minute-level ceiling experiment without letting label construction, overlapping windows, or temporal bleed sneak into the room wearing a lab coat.

The research question behind the ICAIF 2026 paper is deliberately narrow: minute-scale cryptocurrency direction from OHLCV candles appears reproducibly capped near 52% across a broad set of model and feature configurations, while the same research stack recovers materially more directional information at the hourly horizon. Across seven minute configurations and approximately 36 million rows of minute-scale OHLCV history, the observed range is 51.4% to 52.3%. The LightGBM baseline is the seventh configuration: 46 microstructure-proxy features, a 15-minute forward return target in basis points, a stride-15 de-overlap, a no-trade filter that keeps only rows with |target| > 10 BPS, and a per-asset 85/15 temporal split with a 15-row purge before validation.

The transferable idea is simple and annoyingly easy to violate: in time-series ML, validation geometry isn't bookkeeping. It's part of the model.

The baseline was a geometry test, not a tuning contest

A naive capacity comparison asks, “Does a different model class beat the neural setup?” That sounds reasonable until the labels are temporal, overlapping, filtered, and asset-scoped. The better question is, “Can I compare model classes without changing the target semantics or leaking nearby time into validation?”

So the script documents the protocol before it imports anything. That docstring is doing more than explaining a file. It pins down the shape of the experiment.

#!/usr/bin/env python3
"""Run a LightGBM directional baseline on the frozen M6 sniper matrix.

This closes the model-class capacity objection for the paper's minute-level
ceiling section. The script reads Pramaana's per-asset
``data/tmp_sniper_feat_*.parquet`` files directly because the saved
``train_y.npy`` artifact currently contains magnitude-filtered positive targets
only and is therefore not usable for a directional classifier.

Protocol matched to ``scripts/preprocess_sniper.py``:
  - 46 engineered microstructure-proxy features
  - 15-minute forward return target in BPS
  - stride-15 de-overlap
  - |target| > 10 BPS no-trade filter
  - per-asset 85/15 temporal split with a 15-row purge before validation
"""

from __future__ import annotations

import json
import time
from pathlib import Path

import lightgbm as lgb
import numpy as np
import polars as pl
from scipy import stats
from sklearn.metrics import balanced_accuracy_score, roc_auc_score


TRAIN_RATIO = 0.85

I like this kind of comment because you can falsify it. Every important experimental choice is named: feature count, target horizon, de-overlap, no-trade filtering, split ratio, and purge width. If any of those numbers later changes in the code, the docstring has to change with it. (The real script also exposes these knobs as argparse CLI flags; I trimmed the flag-parsing boilerplate from this excerpt to keep the protocol in focus.)

The saved label artifact failed the most basic requirement for this comparison. It did not represent both directions, and directional classification needs the sign of the target. Magnitude-filtered positive targets only? That's a different task, not a neutral shortcut. So the baseline reconstructs the binary label from the 15-minute forward return in basis points inside the feature-parquet path, rather than inheriting a label artifact whose semantics were already wrong for this purpose.
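A minimal sketch of that reconstruction, assuming the forward return is already available as a list of basis-point values. The function name `direction_labels` and the decision to drop rows sitting exactly on the threshold are illustrative assumptions, not details taken from the actual script.

```python
from typing import List, Tuple


def direction_labels(
    targets_bps: List[float], no_trade_bps: float = 10.0
) -> Tuple[List[int], List[int]]:
    """Return (kept_row_indices, binary_labels) for an up-versus-down task.

    Rows with |target| <= no_trade_bps fall inside the no-trade zone and
    are dropped. Surviving rows get label 1 for up moves, 0 for down moves.
    """
    kept: List[int] = []
    labels: List[int] = []
    for i, target in enumerate(targets_bps):
        if abs(target) > no_trade_bps:
            kept.append(i)
            labels.append(1 if target > 0 else 0)
    return kept, labels


# Toy forward returns in BPS (made-up values for illustration only).
targets = [25.0, -3.0, -18.5, 10.0, 12.1, -40.0]
kept, labels = direction_labels(targets)
print(kept)    # [0, 2, 4, 5]
print(labels)  # [1, 0, 1, 0]
```

Both classes survive the filter, which is exactly the property the positive-only artifact lacked.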

There's a useful mental model here. A time-series split is less like cutting a deck of cards and more like cutting wet paint. The boundary smears unless you leave room for it to dry. In this baseline, the 15-row purge is that dry strip between training and validation.

The timeline is the experiment

The pipeline has only a few stages, but the order matters. The baseline starts from per-asset parquet files, reconstructs the directional target from the forward return, filters out the no-trade zone, de-overlaps with stride 15, applies a per-asset temporal split, inserts a purge gap, and only then fits and validates the model.

That sequence looks almost too ordinary, which is exactly the trap. The ordinary-looking steps are where most of the statistical damage would happen if they were skipped or reordered.

For each asset, the geometry is anchored by time rather than random assignment. The training interval comes first. The validation interval comes last. The purge gap sits between them. The overlapping-window hazard is caused by the target construction itself: a 15-minute forward return means nearby rows can share future information unless the split respects the horizon. The stride-15 de-overlap reduces that hazard, and the purge gap protects the split boundary.

Geometry choice	What it protects against	Concrete value in this baseline
Per-asset temporal split	Cross-time contamination within each asset	85/15 train/validation
Purge gap	Boundary bleed from nearby rows	15 rows before validation
De-overlap	Repeated labels from overlapping horizons	stride-15
No-trade filter	Tiny targets treated as tradable direction	|target| > 10 BPS
Label reconstruction	Wrong task inherited from saved labels	sign of 15-minute forward return in BPS
Raw tree features	Scaling mismatch for tree baseline	tree model does not require RobustScaler
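The split rows of the table above can be sketched for a single asset as follows. This is a simplified illustration, not the loader from the real script: the function name, the placeholder asset names, and the order of operations (stride first, then the 85/15 cut, with the purge counted in de-overlapped rows and taken from the end of the training side) are all assumptions.

```python
from typing import Dict, List, Tuple


def split_asset(
    n_rows: int,
    train_ratio: float = 0.85,
    purge_rows: int = 15,
    stride: int = 15,
) -> Tuple[List[int], List[int]]:
    """Return (train_idx, val_idx) into one asset's time-ordered rows."""
    # De-overlap: keep every `stride`-th row so kept rows no longer share
    # minutes of their 15-minute forward windows.
    kept = list(range(0, n_rows, stride))
    # Temporal cut: earlier rows train, later rows validate. No shuffling.
    cut = int(len(kept) * train_ratio)
    # Purge: discard the last `purge_rows` training rows before validation.
    train_idx = kept[: max(cut - purge_rows, 0)]
    val_idx = kept[cut:]
    return train_idx, val_idx


# Hypothetical panel with unequal history lengths (one row per minute).
rows_per_asset: Dict[str, int] = {"ASSET_A": 30_000, "ASSET_B": 18_000}
splits = {name: split_asset(n) for name, n in rows_per_asset.items()}

for name, (train_idx, val_idx) in splits.items():
    gap_minutes = val_idx[0] - train_idx[-1]
    print(name, len(train_idx), len(val_idx), gap_minutes)
```

Because the split runs per asset, each asset's validation rows come from the end of its own history, whatever the other assets' lengths are.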

The naive version would be shorter. Load X, load y, fit classifier, report accuracy. It would also be wrong in exactly the way that makes a result hard to debug: the code would run, the metrics would print, and the comparison would look scientific. The error would live in the meaning of y and the geometry of the split, not in a stack trace.

That distinction matters because most bad financial ML baselines do not fail loudly. They often fail by making the wrong thing convenient. A cached label array, a random split helper, a global shuffle, a validation set reused for early stopping, a forward-return target used without de-overlap: none of these choices necessarily creates an obvious programming error. They create an evidentiary error. The model may be implemented correctly while the experiment answers a question I did not intend to ask.

Why the saved label file had to be rejected

A directional classifier is only as honest as its labels. In this baseline, the target is “15-minute forward return in BPS; binary label is sign(target).” That gives the model a two-sided problem: up versus down after filtering out the no-trade zone.

The existing saved train_y.npy artifact did not satisfy that contract. It contained magnitude-filtered positive targets only. That makes it unsuitable for a directional classifier because it no longer represents the binary sign task the baseline is supposed to measure. There is no clever model-side fix for that. Once the target artifact encodes the wrong task, using a different classifier just gives the wrong task a new costume.

The baseline therefore goes back to data/tmp_sniper_feat_*.parquet. That choice matters because the feature parquet is upstream of the bad shortcut. It lets the script reconstruct labels under the same protocol used by the sniper preprocessing path: 46 engineered microstructure-proxy features, 15-minute forward return in basis points, stride-15 de-overlap, the no-trade filter, and the per-asset temporal split with purge.

This is the part of baseline design that feels unglamorous but decides whether the result means anything. A model-class objection says, “Maybe the neural family is the reason the minute result clusters near 52%.” A contaminated label artifact would make the answer meaningless. Reconstructing labels from feature parquet keeps the comparison focused on capacity instead of accidentally comparing two different tasks.

The same principle applies outside this specific paper. If an intermediate artifact was built for a different objective, it is not a neutral cache. It is an encoded research decision. A saved target array can carry filtering, horizon, class definition, censoring, asset selection, and split assumptions. If those assumptions no longer match the experiment, downstream code should not pretend the file is just bytes on disk. It is a contract, and in this case the contract was wrong for the classifier I needed to run.
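One way to make that contract explicit is a guard that refuses a label array that cannot support a two-sided task. The sketch below is hypothetical: `check_directional_contract` and its 5% minimum class share are illustrative choices, not something the real script is known to contain.

```python
from typing import Sequence


def check_directional_contract(
    labels: Sequence[float], min_class_share: float = 0.05
) -> None:
    """Raise ValueError if `labels` cannot support an up-versus-down task.

    Values > 0 count as "up"; everything else counts as "down". An array
    holding only positive targets therefore has an empty "down" class.
    """
    if len(labels) == 0:
        raise ValueError("label array is empty")
    n_up = sum(1 for y in labels if y > 0)
    n_down = len(labels) - n_up
    share = min(n_up, n_down) / len(labels)
    if share < min_class_share:
        raise ValueError(
            f"minority class share {share:.3f} is below {min_class_share}; "
            "this artifact does not encode a directional task"
        )


# A positive-only, magnitude-filtered array (toy values) is rejected.
try:
    check_directional_contract([12.0, 30.5, 11.2, 48.9])
    positive_only_rejected = False
except ValueError:
    positive_only_rejected = True

# A two-sided binary label array passes silently.
check_directional_contract([1, 0, 1, 0, 0, 1])
print(positive_only_rejected)  # True
```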

The capacity objection needed a clean target

The minute-ceiling section reports seven configurations. The first three use tens of millions of labels and increasingly rich feature sets. The fourth and fifth reduce sample size but de-overlap and alter the loss. The sixth uses a different 15-minute basis-point target and compact microstructure-proxy features. The seventh replaces the neural CQR family with a classical tree model on the frozen microstructure-proxy matrix.

That seventh row is the key capacity-control move. If the ceiling were merely an artifact of the neural setup, a different model class on the same frozen feature/target construction should have had a chance to break away. Instead, the LightGBM baseline reports 52.231% accuracy with CI [52.042, 52.420], AUC 0.53046, balanced accuracy 52.21%, and n=268342 validation samples. In the minute configurations table, that appears as 52.23% ± 0.189 for row 7.

The ceiling holds: 52.231% accuracy, 95% CI [52.042, 52.420]. Swap the neural CQR family for a classical tree on the same frozen feature and target matrix, and the result lands in the same 51.4–52.3% band. The minute ceiling is not an artifact of the neural setup.
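The reported interval can be sanity-checked from the accuracy and the validation sample count alone. The sketch below uses a normal-approximation (Wald) interval, which is an assumption: the script uses SciPy for its binomial interval and the exact method is not stated here. At this sample size the approximation gives a half-width of about 0.189 percentage points.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def accuracy_interval(
    acc: float, n: int, level: float = 0.95
) -> Tuple[float, float, float]:
    """Normal-approximation interval for a binomial proportion.

    Returns (lower, upper, half_width), all as fractions in [0, 1].
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sqrt(acc * (1.0 - acc) / n)
    return acc - half_width, acc + half_width, half_width


# Accuracy and validation sample count as reported for the LightGBM row.
lower, upper, half_width = accuracy_interval(0.52231, 268_342)
print(f"{100 * lower:.3f} to {100 * upper:.3f}, "
      f"half-width {100 * half_width:.3f} pp")
```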

For metric provenance, this was a CPU-only LightGBM run from scripts/run_lightgbm_sniper_baseline.py using the Python scientific stack in the paper environment: Polars for parquet loading, NumPy for arrays, SciPy for the binomial interval and significance calculation, scikit-learn metrics for balanced accuracy and AUC, and the LightGBM scikit-learn API for the classifier. The run writes data/lightgbm_sniper_baseline.json and data/lightgbm_sniper_baseline_model.txt; the reported metric is taken from that JSON artifact, not copied by hand into the manuscript.

The important part is not that LightGBM has a particular personality. It is that the baseline swapped model class while holding the validation geometry and target construction in place. Without that, “LightGBM versus neural” would be a noisy argument about everything except the model.

The training block reflects that narrow purpose. It uses a tree classifier with raw engineered features, reserves the final 15% of the training side for early stopping, and computes class weighting from the fit subset. The held-out validation interval remains untouched by early stopping.

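# X_val / y_val are the held-out validation interval; nothing below fits
# or tunes on them.
X_train, y_train, X_val, y_val, feature_names, asset_stats = _load_sniper_arrays(pramaana)

# Reserve the final 15% of the training side for early stopping only.
fit_end = int(len(X_train) * 0.85)
X_fit, X_es = X_train[:fit_end], X_train[fit_end:]
y_fit, y_es = y_train[:fit_end], y_train[fit_end:]

# Class weight comes from the fit subset alone: negatives per positive.
n_pos = int(y_fit.sum())
n_neg = int(len(y_fit) - n_pos)
scale_pos_weight = n_neg / max(n_pos, 1)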

clf = lgb.LGBMClassifier(
    n_estimators=2000,
    learning_rate=0.03,
    num_leaves=31,
    max_depth=6,
    min_child_samples=200,
    subsample=0.85,
    colsample_bytree=0.85,
    reg_alpha=0.5,
    reg_lambda=1.0,
    scale_pos_weight=scale_pos_weight,
    objective="binary",
    metric="binary_logloss",
    n_jobs=-1,
    random_state=42,
    force_col_wise=True,
    verbosity=-1,
)

clf.fit(
    X_fit,
    y_fit,
    eval_set=[(X_es, y_es)],
    eval_metric="binary_logloss",
    callbacks=[
        lgb.early_stopping(stopping_rounds=100),
        lgb.log_evaluation(period=100),
    ],
)

The hyperparameters are not the story I care about here. The non-obvious detail is the split inside the training side: the final 15% of train is used only for early stopping, which keeps validation as the final held-out interval rather than letting it become a tuning surface.

That one choice prevents a common baseline failure. If I had passed the paper validation interval as the LightGBM eval_set, early stopping would have made the validation set part of the training procedure. The model would not directly fit labels from validation, but the selected number of boosting rounds would be chosen by validation performance. That is enough to contaminate the final metric. The point of the baseline is not to squeeze the last basis point out of LightGBM; it is to answer whether a classical tree classifier breaks the minute ceiling under the same target and split discipline.

The results JSON captures the protocol in a compact form. This is the artifact I want beside the paper because it records not only the metric, but also the target, preprocessing, split, and scaling assumptions that make the metric interpretable.

{
  "label": "LightGBM M7 sniper baseline",
  "created_utc": "2026-05-04T02:58:24Z",
  "source": "/home/the author/Development/Python/crypto-fl-v2/data/tmp_sniper_feat_*.parquet",
  "protocol": {
    "feature_set": "46 microstructure proxy features",
    "target": "15-minute forward return in BPS; binary label is sign(target)",
    "preprocessing": "stride-15 de-overlap; |target| > 10 BPS no-trade filter",
    "split": "Per-asset 85/15 temporal split with 15-row purge before validation; final 15% of train used only for early stopping",
    "scaling": "Raw engineered features; tree model does not require RobustScaler"
  },
  "train_samples": 1520254,
  "early_stop_samples": 228039,
  "validation_samples": 268342,
  "n_features": 46
}

I care more about the protocol object than the timestamp. A metric without this surrounding geometry is just a number looking for a story, and the protocol is what stops the wrong story from attaching itself.
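A record like this can also be checked against itself. The sketch below is a reader-side sanity check, not part of the real script. It assumes that `train_samples` counts the whole training side before the early-stopping slice is carved off, an assumption the arithmetic turns out to be consistent with.

```python
import json

# Sample counts copied from the JSON record above.
record = json.loads(
    '{"train_samples": 1520254, '
    '"early_stop_samples": 228039, '
    '"validation_samples": 268342}'
)

# Same slicing rule as the training block: the final 15% of the
# training side is held back for early stopping.
fit_end = int(record["train_samples"] * 0.85)
expected_early_stop = record["train_samples"] - fit_end

# Share of all labelled rows that ended up in held-out validation.
total = record["train_samples"] + record["validation_samples"]
validation_share = record["validation_samples"] / total

print(expected_early_stop == record["early_stop_samples"])
print(f"validation share: {validation_share:.4f}")
```

The validation share lands close to 0.15, as expected for an 85/15 split applied per asset and then pooled.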

Temporal bleed arrives without a stack trace

The hard part about temporal bleed is that it rarely announces itself. No exception fires to tell you, “Your validation rows are too close to your training rows.” What you get is a better-looking number, and that number is seductive because it can be explained as model quality.

The minute experiments are especially exposed to this because the target horizon is short and overlapping labels are easy to create accidentally. A 15-minute forward return target means row t and row t+1 can be describing heavily shared future intervals. If the split boundary cuts through those neighborhoods without a purge, the validation side can remain too close to what the training side has already seen.
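The amount of shared future can be written down directly. This sketch assumes one row per minute and a target measured over the window (t, t + 15]; the function name is illustrative.

```python
def shared_future_fraction(lag_rows: int, horizon: int = 15) -> float:
    """Fraction of the forward window shared by two rows `lag_rows` apart.

    Row t looks at minutes (t, t + horizon]; row t + k looks at
    (t + k, t + k + horizon]. They overlap on horizon - k minutes.
    """
    return max(0, horizon - abs(lag_rows)) / horizon


for lag in (1, 5, 14, 15):
    print(lag, round(shared_future_fraction(lag), 3))
```

Adjacent rows share 14 of their 15 future minutes, while rows 15 apart share none, which is why a stride of 15 and a purge on the order of the horizon are natural choices for this target.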

So the geometry is engineered at multiple levels rather than resting on one protective trick.

Hazard	Naive baseline failure	Geometry response
Saved label artifact has wrong semantics	Classifier trains on a target that is not the directional task	Reconstruct labels from feature parquet
Adjacent windows share future information	Validation resembles training near the boundary	Insert 15-row purge before validation
Overlapping targets inflate sample familiarity	Many rows encode nearly the same horizon	Apply stride-15 de-overlap
Asset timelines differ	Global shuffle mixes unrelated time positions	Split per asset temporally
Early stopping erodes validation independence	Held-out set becomes part of model selection	Use final 15% of train only for early stopping

This is also why I resist treating train/test split as an afterthought in financial ML. For IID tabular data, the split is often a convenience. For time series, it's a claim about causality: what information was available before the prediction time, and what was not. If that claim is false, the model can be perfectly implemented and still be scientifically useless.

The per-asset split matters for a second reason. Cryptocurrency pairs don't all have identical listing histories, liquidity regimes, or missing-data structure. A global split by row count would mix assets at unrelated calendar positions. A global random split would be worse. The baseline needs each asset’s validation samples to come from the end of that asset’s own history, with the purge applied at that asset boundary. That preserves the meaning of “future held out” even when the panel is irregular.

The no-trade filter is a different kind of guardrail. It doesn't prevent leakage. What it prevents is the classifier being evaluated on economically tiny moves as if every infinitesimal return were a meaningful directional event. The target is still a research abstraction rather than a full trading simulator with costs and execution, but the |target| > 10 BPS filter keeps the sign label from being dominated by noise around zero. That matters when the entire empirical question is about a narrow 51.4% to 52.3% band. At that scale, sloppy target construction can easily masquerade as something real.

The minute ceiling becomes more credible when the baseline cannot cheat

The paper's data-and-methods section frames the smaller de-overlapped and microstructure-proxy runs as controls that test whether cleaner labels, a different short-horizon target, or a classical tree classifier would move the result. The LightGBM row belongs to that logic. No new headline result. A stress test aimed at one specific objection.

The minute ceiling is not “models cannot beat chance.” The reported range sits above chance in a statistical sense. The issue is effect size. Across the seven configurations, directional accuracy remains inside a 0.9 percentage-point band, from 51.4% to 52.3%. With sample sizes in the millions for the larger configurations, the open question isn't whether the models detect something. It's why substantial changes in features, scaling, loss functions, target construction, model class, and sample count don't move the result into a stronger range.
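The gap between statistical significance and effect size is easy to see with the LightGBM row's own numbers. The sketch below computes a one-sample z-score against 50% chance. It treats validation samples as independent, which is an optimistic simplifying assumption for financial series even after de-overlap.

```python
from math import sqrt


def z_vs_chance(acc: float, n: int) -> float:
    """z-score of an observed accuracy against a 50% null hypothesis."""
    standard_error = sqrt(0.25 / n)  # sqrt(p0 * (1 - p0) / n) with p0 = 0.5
    return (acc - 0.5) / standard_error


acc, n = 0.52231, 268_342
z = z_vs_chance(acc, n)
edge_pp = 100 * (acc - 0.5)
print(f"z = {z:.1f}, edge over chance = {edge_pp:.3f} pp")
```

A z-score above 20 is far from chance, yet the edge is little more than two percentage points: detectable, and small.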

The hourly positive control is what keeps this from becoming nihilism. At the hourly horizon, the FT-Transformer/CQR walk-forward run reached 54.69% directional accuracy on 386,056 future-held-out hourly samples, with a 95% interval of 54.53–54.84 and 80.53% conformal coverage. That result comes from the five-fold expanding walk-forward CQR run recorded in backtest_results/walkforward_cqr_hourly.json and the corresponding resume log, executed with the PyTorch FT-Transformer/CQR training stack on the project’s CUDA workstation environment. One runtime distinction is worth stating plainly: the hourly result is not a LightGBM CPU run. It's a neural FT-Transformer/CQR evaluation under expanding walk-forward temporal validation, with the metric emitted by the walk-forward script and then pulled into data/results_manifest.json for paper table generation.

That contrast matters because it shows the research stack does not mechanically inflate every task. Minute OHLCV direction clusters near the ceiling. Hourly direction recovers more directional information under a stricter temporal protocol.

But the contrast only means something if the minute baseline is clean. Had the LightGBM run used a broken label artifact, the row would not close the capacity objection. It would open a new hole. Reconstructing labels from parquet and enforcing split geometry is what lets the baseline answer the narrow question it was built to answer.

The evidence-bundle design reinforces this. The paper repository is separate from the trading system because the paper is the manuscript and reproducibility layer, while Pramaana is the experiment apparatus. The paper artifacts point back to the scripts and result files that generated each table row: scripts/run_lightgbm_sniper_baseline.py, data/lightgbm_sniper_baseline.json, data/lightgbm_sniper_baseline_model.txt, and the per-asset parquet inputs for the LightGBM capacity control; backtest_results/walkforward_cqr_hourly.json, the walk-forward logs, and scripts/walkforward_cqr_hourly.py for the hourly positive control. That's the level at which a result becomes inspectable. A table row should not be a manually typed number; it should be the visible tip of an artifact chain.

What the feature importances can and cannot say

The LightGBM run also writes a model file and feature importances. The top features include return_60m, z_effort_60m, price_vs_vwap_5m, and z_effort_15m. I don't treat those importances as a market theory. Tree feature importance is useful for checking that the model is not obviously broken. It isn't a causal explanation of minute returns.

What it can say is more modest: the classifier found most of its splits in recent return, effort, and VWAP-relative features, which is consistent with the feature family the baseline was meant to test. That helps catch the opposite failure mode, a model that reports a plausible metric while accidentally training on an ID column, timestamp encoding, or mislabeled target. Feature inspection is not proof. It's a useful diagnostic once the validation geometry is already correct.

This is also why I report AUC alongside accuracy. Accuracy answers the sign-decision question at the default threshold. AUC checks whether the probability ranking carries directional information independent of that threshold. In the LightGBM run, AUC is 0.53046, which aligns with the story told by 52.231% accuracy: the model is picking up something small, not discovering a dramatically separable classification boundary.
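The difference between the two metrics shows up even on a toy example. The sketch below uses made-up probabilities and a simple pairwise AUC written out for clarity. The real script uses scikit-learn's `roc_auc_score`, and the quadratic loop here is only suitable for small inputs.

```python
from typing import Sequence


def accuracy_at(
    probs: Sequence[float], labels: Sequence[int], threshold: float = 0.5
) -> float:
    """Share of rows where the thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def pairwise_auc(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outranks a random negative."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy scores: every probability sits below 0.5, so the default threshold
# predicts "down" everywhere, yet positives still tend to rank higher.
toy_probs = [0.30, 0.45, 0.40, 0.48]
toy_labels = [0, 0, 1, 1]
toy_accuracy = accuracy_at(toy_probs, toy_labels)
toy_auc = pairwise_auc(toy_probs, toy_labels)
print(toy_accuracy, toy_auc)  # 0.5 0.75
```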

The high-confidence slice is similarly restrained. The JSON records a threshold of p >= 0.60 or p <= 0.40, with 2,305 samples, 0.859% coverage, and 57.007% accuracy. Interesting as a calibration and selectivity diagnostic. It does not rescue the minute problem. A tiny high-confidence region with better accuracy is not the same as a broad tradable edge, especially before costs, slippage, and execution constraints. For the paper’s central claim, the full validation metric is the right number to emphasize.
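A high-confidence slice of this kind takes only a few lines to compute. The sketch below uses toy probabilities; the function name and data are illustrative, and only the 0.60 and 0.40 thresholds come from the description above.

```python
from typing import Sequence, Tuple


def high_confidence_slice(
    probs: Sequence[float],
    labels: Sequence[int],
    upper: float = 0.60,
    lower: float = 0.40,
) -> Tuple[int, float, float]:
    """Return (n_selected, coverage, accuracy) for confident predictions."""
    picked = [(p, y) for p, y in zip(probs, labels) if p >= upper or p <= lower]
    n_selected = len(picked)
    coverage = n_selected / len(probs)
    # p >= upper predicts "up" (1); p <= lower predicts "down" (0).
    correct = sum(1 for p, y in picked if (p >= upper) == (y == 1))
    accuracy = correct / n_selected if n_selected else float("nan")
    return n_selected, coverage, accuracy


toy_probs = [0.62, 0.55, 0.38, 0.49, 0.71, 0.45, 0.35, 0.52]
toy_labels = [1, 0, 0, 1, 0, 1, 1, 0]
slice_stats = high_confidence_slice(toy_probs, toy_labels)
print(slice_stats)  # (4, 0.5, 0.5)

# The reported coverage follows from the reported counts.
reported_coverage_pct = 100 * 2305 / 268_342
print(round(reported_coverage_pct, 3))  # 0.859
```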

The discipline is to make shortcuts impossible

The small engineering choice I'd generalize is this: when a saved artifact has ambiguous or wrong semantics, don't patch around it downstream. Rebuild from the nearest artifact whose meaning is still compatible with the experiment.

In this case, the nearest compatible artifact was the per-asset feature parquet. That forced the script to carry the target definition, filtering rule, de-overlap policy, split geometry, purge width, and scaling assumption in the same place as the model run. Less convenient than loading train_y.npy. More honest.

There's a tradeoff. Reconstructing labels from parquet couples the baseline to the upstream feature files and requires the paper repo to reference Pramaana's data path. The evidence bundle handles that by recording local evidence anchors rather than pretending the manuscript repository contains every large matrix. Compact artifacts get tracked directly with metadata and hashes; larger upstream matrices are referenced by path and size unless full hashing is requested. An engineering compromise, and one that keeps the paper repository focused on reproducibility rather than becoming a warehouse for heavyweight experiment outputs.
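The compact-versus-large distinction might look something like the sketch below. It is a hypothetical illustration of the idea, not the evidence bundle's actual code: the function name, the size limit, the `force_hash` switch, and the file names are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Optional, Union

Anchor = Dict[str, Union[str, int, Optional[str]]]


def evidence_anchor(
    path: Path, hash_limit_bytes: int = 50 * 1024 * 1024, force_hash: bool = False
) -> Anchor:
    """Record path and size always; add a SHA-256 only for compact files
    or when full hashing is explicitly requested."""
    size = path.stat().st_size
    anchor: Anchor = {"path": str(path), "size_bytes": size, "sha256": None}
    if force_hash or size <= hash_limit_bytes:
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            # Stream in 1 MiB chunks so large files never sit in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        anchor["sha256"] = digest.hexdigest()
    return anchor


with tempfile.TemporaryDirectory() as tmp:
    small = Path(tmp) / "baseline.json"
    small.write_bytes(b'{"n_features": 46}')
    large = Path(tmp) / "features.parquet"
    large.write_bytes(b"\0" * 4096)  # stand-in for a heavyweight matrix

    small_anchor = evidence_anchor(small, hash_limit_bytes=1024)
    large_anchor = evidence_anchor(large, hash_limit_bytes=1024)
    forced_anchor = evidence_anchor(large, hash_limit_bytes=1024, force_hash=True)

print(small_anchor["sha256"] is not None, large_anchor["sha256"] is None)
```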

The payoff is that the baseline result has a shape I can defend. It says: on the frozen M6 sniper matrix, with 46 microstructure-proxy features, 15-minute forward-return sign labels, stride-15 de-overlap, a no-trade filter, per-asset temporal splitting, a 15-row purge, and a held-out validation interval, a classical tree model lands at 52.231% accuracy. That's a capacity-control statement, not a vague leaderboard entry.

It also changes how I evaluate future baselines. I want the experiment script to make the invalid path difficult. If the wrong label file is easy to load, someone eventually will. If early stopping on validation is a one-line convenience, it will creep in. If a random split helper is available in the same file as time-series code, it will eventually be called in the wrong place. Good research code does not merely implement the correct protocol; it removes temptations that produce attractive but meaningless numbers.

What I changed in how I think about baselines

I used to think of baselines as simpler models. I now think of them as simpler claims. A good baseline should remove one objection at a time. This LightGBM run removes the objection that the minute ceiling is merely a neural architecture artifact. It doesn't claim to solve execution, transaction costs, order-book dynamics, or richer market state. It says only that replacing the model family on this engineered short-horizon matrix does not break the ceiling.

That restraint is the virtue. A baseline that tries to answer every objection simultaneously becomes another opaque system. Answer one objection under carefully preserved geometry and it becomes useful evidence.

For time-series ML, the geometry is the evidence. The split, purge, stride, and label source aren't clerical details that come after the “real” modeling work. They're the rails that keep the model from learning yesterday's shadow of tomorrow.

A classifier trained on the wrong artifact can look competent. A classifier trained inside the wrong timeline can look brilliant. I'd rather have the modest number I can trust than the impressive one produced by a boundary I forgot to draw. The next credible test won't be another minute-candle model with a larger parameter count. It will be the one that adds the state variables the minute horizon is missing (order-book pressure, queue dynamics, spread formation, and execution-aware microstructure) while keeping the same discipline about labels, time, and evidence.

In time-series work, validation geometry is not an implementation detail. It is the contract that makes a model's number mean something. The same discipline that protects a live trading gate or a self-indexing memory write protects the scientific claim.