GBM Lite Models (Retired)¶

Retired Model

GBM Lite has been removed from the dashboard as of March 2026 due to overoptimistic predictions. This page is kept for reference. See active models.

Ultra-lightweight gradient boosting models optimized for maximum stock coverage with minimal data requirements.

Overview¶

GBM Lite models use machine learning to rank stocks by expected returns while requiring only 2 quarters of historical data - enabling predictions for ~98% of stocks in the database.

Key Features¶

Minimal Data Requirements: Only 2 quarters needed (vs 8+ for full GBM)
Maximum Coverage: Works for 589/598 stocks (98.5%)
Strong Performance: Rank IC 0.50 (1y) and 0.40 (3y)
Efficient: 59 features (vs 464 for full GBM)

Available Variants¶

GBM Lite 1y¶

Predicts 1-year forward returns.

Rank IC: 0.50
Decile Spread: 66%
Coverage: 589 stocks

GBM Lite 3y¶

Predicts 3-year forward returns.

Rank IC: 0.40
Decile Spread: 145%
Coverage: 589 stocks

How It Works¶

1. Feature Engineering¶

The model uses 59 engineered features derived from:

Current Snapshot (27 base features):
- Profitability: profit_margins, operating_margins, gross_margins, ROE, ROA
- Growth: revenue_growth, earnings_growth
- Balance Sheet: debt_to_equity, current_ratio, quick_ratio
- Valuation: PE, PB, PS, EV/EBITDA, EV/Revenue
- Dividends: dividend_yield, payout_ratio
- Company: market_cap, beta
- Market: VIX, 10Y Treasury
- Price Momentum: returns_1m, returns_3m, returns_6m, returns_1y, volatility, volume_trend

Engineered Features (2.2x per base feature):
- Computed yields: FCF yield, OCF yield, earnings yield
- Log transforms: log(market_cap)
- QoQ changes: Quarter-over-quarter deltas
- Missingness flags: Binary indicators for missing data
- Categorical: Sector encoding
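A minimal sketch of how features of this kind could be derived with pandas. The column names (`ticker`, `quarter`, `free_cash_flow`, `net_income`, `sector`) and the `engineer_features` helper are hypothetical, and one-hot encoding is only one possible reading of "sector encoding"; the project's feature configuration may differ.

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: column names are assumptions, not the project's schema."""
    out = df.sort_values(["ticker", "quarter"]).copy()

    # Computed yields: cash flow or earnings relative to market cap
    out["fcf_yield"] = out["free_cash_flow"] / out["market_cap"]
    out["earnings_yield"] = out["net_income"] / out["market_cap"]

    # Log transform to compress the heavy right tail of market cap
    out["log_market_cap"] = np.log(out["market_cap"])

    # QoQ change: difference from the previous quarter of the same stock
    out["profit_margins_qoq"] = out.groupby("ticker")["profit_margins"].diff()

    # Missingness flag: 1 where the raw value is missing, else 0
    out["profit_margins_missing"] = out["profit_margins"].isna().astype(int)

    # Categorical: one-hot sector encoding (an assumed encoding scheme)
    return pd.get_dummies(out, columns=["sector"], prefix="sector")

# Toy data: two stocks, two quarters each (values are made up)
sample = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB"],
    "quarter": ["2024Q1", "2024Q2", "2024Q1", "2024Q2"],
    "market_cap": [1000.0, 1100.0, 500.0, 450.0],
    "free_cash_flow": [50.0, 66.0, 10.0, 9.0],
    "net_income": [40.0, 55.0, 5.0, 4.5],
    "profit_margins": [0.10, 0.12, None, 0.05],
    "sector": ["Tech", "Tech", "Energy", "Energy"],
})
features = engineer_features(sample)
print(features.filter(like="profit_margins"))
```

Note that the QoQ delta is the only feature here that looks backward, and it needs just one prior quarter, which is what keeps the minimum history at 2 quarters.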

2. What's Excluded (vs Full GBM)¶

To reach the 2-quarter minimum, we removed features that need deeper history:

❌ Lag features (would require 3+ quarters)
❌ Rolling windows (would require 4-6+ quarters)
❌ YoY changes (would require 5 quarters)
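The history requirements above can be illustrated on a toy quarterly series. This is a sketch with made-up revenue values and an assumed 4-quarter rolling window; it only shows the first quarter in which each kind of feature becomes defined.

```python
import pandas as pd

# Six quarters of toy revenue for one stock (values are made up)
revenue = pd.Series(
    [100.0, 104.0, 110.0, 108.0, 115.0, 121.0],
    index=["Q1", "Q2", "Q3", "Q4", "Q5", "Q6"],
)

qoq = revenue / revenue.shift(1) - 1      # quarter-over-quarter change
yoy = revenue / revenue.shift(4) - 1      # year-over-year change
rolling_mean = revenue.rolling(4).mean()  # 4-quarter rolling average (window length assumed)

# First quarter in which each feature is defined
print(qoq.first_valid_index())           # Q2: available with 2 quarters of data
print(rolling_mean.first_valid_index())  # Q4: needs 4 quarters
print(yoy.first_valid_index())           # Q5: needs 5 quarters
```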

3. Training Process¶

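import lightgbm as lgb

# Cross-sectional normalization (assumes `features` is a numeric DataFrame
# with a "date" index level)
# Winsorize each feature at its 1st and 99th percentiles
features_normalized = features.clip(features.quantile(0.01), features.quantile(0.99), axis=1)
# Standardize (z-score) the winsorized features within each date
features_normalized = features_normalized.groupby(level="date").transform(
    lambda col: (col - col.mean()) / col.std()
)

# LightGBM with a regression objective; stocks are ranked by predicted value.
# Early stopping needs a validation set (`valid_set` is assumed here).
model = lgb.train(
    params={'objective': 'regression', 'metric': 'rmse'},
    train_set=train_set,
    num_boost_round=500,
    valid_sets=[valid_set],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)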

4. Prediction Output¶

For each stock, the model provides:

Expected Return: Predicted percentage return over horizon
Percentile Rank: 0-100 ranking vs all stocks
Decile: 1-10 grouping (10 = top 10% expected returns)
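As a rough illustration, the percentile rank and decile can both be derived from the expected return column with pandas. The frame layout and column names below are assumptions, not the prediction script's actual output schema.

```python
import pandas as pd

# Toy predictions for 20 stocks: expected returns from 0% to 19%
preds = pd.DataFrame({
    "ticker": [f"S{i:02d}" for i in range(20)],
    "expected_return": [i / 100 for i in range(20)],
})

# Percentile rank: share of stocks with an equal or lower expected return, scaled to 100
preds["percentile"] = preds["expected_return"].rank(pct=True) * 100

# Decile: 10 equal-sized buckets, where 10 holds the highest expected returns
preds["decile"] = pd.qcut(
    preds["expected_return"], 10, labels=list(range(1, 11))
).astype(int)

print(preds.tail(4))
```

With 20 stocks each decile holds exactly two names, so filtering on `decile == 10` returns the top 10% of the universe.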

Performance Metrics¶

Rank Information Coefficient (IC)¶

Measures the rank (Spearman) correlation between predicted and realized returns:

GBM Lite 1y: 0.50 (strong predictive power)
GBM Lite 3y: 0.40 (good long-term signal)
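For reference, a Rank IC for a single date can be computed as the Pearson correlation of the two rank vectors, which is the definition of Spearman correlation. The `rank_ic` helper and the toy values are illustrative; in practice the figure is typically averaged over many dates, and how the numbers above were aggregated is not stated here.

```python
import pandas as pd

def rank_ic(predicted: pd.Series, realized: pd.Series) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    return predicted.rank().corr(realized.rank())

# Toy cross-section of five stocks on one date (values are made up)
predicted = pd.Series([0.12, 0.05, -0.02, 0.08, 0.01])
realized = pd.Series([0.20, 0.02, -0.10, 0.04, 0.06])

ic = rank_ic(predicted, realized)
print(round(ic, 2))  # 0.7
```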

Decile Spread¶

Average return difference between top and bottom deciles:

GBM Lite 1y: 66% (top decile outperforms bottom decile by 66 percentage points on average)
GBM Lite 3y: 145% (top decile outperforms bottom decile by 145 percentage points over 3 years)
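A minimal sketch of the decile spread calculation for one cross-section, assuming a frame with hypothetical `predicted` and `realized` return columns. The `decile_spread` helper is illustrative and not taken from the project code.

```python
import pandas as pd

def decile_spread(df: pd.DataFrame) -> float:
    """Mean realized return of the top predicted decile minus the bottom one."""
    deciles = pd.qcut(df["predicted"], 10, labels=False)  # 0 = bottom, 9 = top
    mean_by_decile = df.groupby(deciles)["realized"].mean()
    return mean_by_decile.loc[9] - mean_by_decile.loc[0]

# Toy cross-section of 20 stocks where realized returns rise with the prediction
toy = pd.DataFrame({"predicted": [float(i) for i in range(20)]})
toy["realized"] = toy["predicted"] * 0.02 - 0.10

spread = decile_spread(toy)
print(f"{spread:.0%}")  # 36%
```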

Comparison to Full GBM¶

Metric	GBM Lite 1y	GBM Full 1y	Delta
Rank IC	0.50	0.59	-15%
Decile Spread	66%	75%	-12%
Features	59	464	-87%
Coverage	589 stocks	589 stocks	Same
Min Quarters	2	8	-75%

Key Insight: GBM Lite achieves 85-88% of full GBM's performance while covering the same stocks with 87% fewer features.
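The Delta column is the relative change from full GBM to GBM Lite. The arithmetic can be reproduced from the table values (the dictionary keys below are just labels for this example):

```python
# Values copied from the comparison table
lite = {"rank_ic": 0.50, "decile_spread": 66, "features": 59, "min_quarters": 2}
full = {"rank_ic": 0.59, "decile_spread": 75, "features": 464, "min_quarters": 8}

# Relative change of GBM Lite versus full GBM for each metric
deltas = {name: (lite[name] - full[name]) / full[name] for name in lite}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.0%}")
# rank_ic: -15%
# decile_spread: -12%
# features: -87%
# min_quarters: -75%
```

The retained performance is the complement of the first two deltas: 0.50 / 0.59 is about 85% of the Rank IC and 66 / 75 is 88% of the decile spread.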

Theoretical Foundation¶

Cross-Sectional Learning¶

GBM models learn relative patterns, not absolute values:

Z-score normalization per date: All features standardized within each time period
Ranking focus: Predictions are used for relative ordering, not as exact return forecasts
Regime-agnostic: Works across market conditions by focusing on cross-sectional relationships
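A small worked example of why per-date z-scores focus on relative position. In this toy panel (made-up tickers and PE values), every PE doubles in the second period, yet each stock's z-score is unchanged because its position within the cross-section is the same.

```python
import pandas as pd

panel = pd.DataFrame({
    "date": ["2024-03-31"] * 3 + ["2024-06-30"] * 3,
    "ticker": ["AAA", "BBB", "CCC"] * 2,
    "pe": [10.0, 20.0, 30.0, 20.0, 40.0, 60.0],  # every PE doubles in the second period
})

# Z-score each feature within its own date (cross-sectional standardization)
panel["pe_z"] = panel.groupby("date")["pe"].transform(
    lambda x: (x - x.mean()) / x.std()
)

print(panel)
```

Both dates produce z-scores of -1, 0, and 1, so a model trained on `pe_z` sees the same inputs regardless of the market-wide valuation level.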

Why It Works with Minimal History¶

Current snapshot + momentum captures most signal:
- Fundamental quality metrics (profitability, growth, leverage)
- Valuation multiples (relative cheapness)
- Recent price momentum (trend signals)
- QoQ changes (acceleration/deceleration)

What historical depth adds (full GBM):
- Mean reversion patterns (rolling averages)
- Volatility trends (rolling std)
- Long-term momentum (lags, slopes)

For stock ranking, the current state plus recent changes provide most of the discriminating power.

Use Cases¶

Best For¶

New listings: Stocks with limited trading history
Broad coverage: When you need predictions for almost all stocks
Resource efficiency: Fast training and prediction
Baseline model: Good starting point before adding complexity

Not Ideal For¶

Absolute return forecasts: Use LSTM models instead
Maximum accuracy: Use full GBM if you have 8+ quarters
Market timing: Use Opportunistic GBM instead

Implementation Example¶

from invest.scripts.run_gbm_predictions import run_predictions

# Run GBM Lite 1y predictions
predictions = run_predictions(
    variant='lite',
    horizon='1y',
    db_path='data/stock_data.db'
)

# Get the top two deciles (top 20% of stocks)
top_stocks = predictions[predictions['decile'] >= 9]
print(f"Top 20% stocks: {len(top_stocks)}")

References¶

GBM Full Models - For comparison with full-featured version
Training Script
Feature Configuration
Prediction Script

Academic Background¶

Gradient Boosting Machines¶

Chen & Guestrin (2016). "XGBoost: A Scalable Tree Boosting System". KDD '16
Ke et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree". NIPS '17

Factor Models & Cross-Sectional Prediction¶

Fama & French (1993). "Common risk factors in the returns on stocks and bonds"
Gu, Kelly, & Xiu (2020). "Empirical Asset Pricing via Machine Learning". Review of Financial Studies
Moritz & Zimmermann (2016). "Tree-based Conditional Portfolio Sorts"

Information Coefficient¶

Grinold & Kahn (2000). "Active Portfolio Management" (Information Ratio framework)