# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

Alpha Lab is a quantitative research experiment framework for the `qshare` library. It uses a notebook-centric approach for exploring trading strategies and ML models. The codebase is organized around two prediction tasks:

- **cta_1d**: CTA (Commodity Trading Advisor) futures 1-day return prediction
- **stock_15m**: Stock 15-minute forward return prediction using high-frequency features

## Directory Structure

```
alpha_lab/
├── common/              # Shared utilities
│   ├── __init__.py
│   ├── paths.py         # Path management
│   └── plotting.py      # Common plotting functions
│
├── cta_1d/              # CTA 1-day return prediction
│   ├── __init__.py      # Re-exports from src/
│   ├── config.yaml      # Task configuration
│   ├── src/             # Implementation modules
│   │   ├── __init__.py
│   │   ├── loader.py    # CTA1DLoader
│   │   ├── train.py     # Training functions
│   │   ├── backtest.py  # Backtest functions
│   │   └── labels.py    # Label blending utilities
│   └── *.ipynb          # Experiment notebooks
│
├── stock_15m/           # Stock 15-minute return prediction
│   ├── __init__.py      # Re-exports from src/
│   ├── config.yaml      # Task configuration
│   ├── src/             # Implementation modules
│   │   ├── __init__.py
│   │   ├── loader.py    # Stock15mLoader
│   │   └── train.py     # Training functions
│   └── *.ipynb          # Experiment notebooks
│
└── results/             # Output directory (gitignored)
```

## Common Commands

### Development Setup

```bash
# Install dependencies
pip install -r requirements.txt

# Create environment configuration
cp .env.template .env
# Edit .env with your DolphinDB host and data paths
```

### Running Experiments

```bash
# Start Jupyter for interactive experiments
jupyter notebook

# Train CTA model from config
python -m cta_1d.train --config cta_1d/config.yaml --output results/cta_1d/exp01

# Train Stock 15m model
python -m stock_15m.train --config stock_15m/config.yaml --output results/stock_15m/exp01

# Run CTA backtest
python -m cta_1d.backtest \
    --model results/cta_1d/exp01/model.json \
    --dt-range 2023-01-01 2023-12-31 \
    --output results/cta_1d/backtest_01
```

### Python API Usage

```python
# CTA 1D workflow
from cta_1d import CTA1DLoader, train_model, TrainConfig

loader = CTA1DLoader(return_type='o2c_twap1min', normalization='dual')
dataset = loader.load(dt_range=['2020-01-01', '2023-12-31'])

config = TrainConfig(dt_range=['2020-01-01', '2023-12-31'], feature_sets=['alpha158'])
model, metrics = train_model(config, output_dir='results/exp01')

# Stock 15m workflow
from stock_15m import Stock15mLoader, train_model, TrainConfig

loader = Stock15mLoader(normalization_mode='dual')
dataset = loader.load(
    dt_range=['2020-01-01', '2023-12-31'],
    feature_path='/data/parquet/stock_1min_alpha158',
    kline_path='/data/parquet/stock_1min_kline'
)
```

## Architecture

### Module Organization

All implementation code lives in `src/` subdirectories:

- **`cta_1d/src/`**: CTA-specific implementations
  - `loader.py`: CTA1DLoader class
  - `train.py`: train_model, TrainConfig
  - `backtest.py`: run_backtest, BacktestConfig
  - `labels.py`: Label blending utilities
- **`stock_15m/src/`**: Stock-specific implementations
  - `loader.py`: Stock15mLoader class
  - `train.py`: train_model, TrainConfig

Root `__init__.py` files re-export public APIs for backward compatibility:

```python
from cta_1d import CTA1DLoader  # Imports from cta_1d.src
```

### Data Flow

Both tasks follow a consistent pattern:

1. **Loaders** (`src/loader.py`): Fetch data from DolphinDB (CTA) or Parquet files (Stock), apply normalization, compute sample weights, and return a `pl_Dataset`
2. **Training** (`src/train.py`): XGBoost with early stopping; outputs model JSON + metrics
3. **Backtest** (`src/backtest.py`): CTA-only; uses `qshare.eval.cta.backtest.CTABacktester` for strategy simulation

### Key Classes

- **`CTA1DLoader`**: Loads alpha158/hffactor features from DolphinDB; supports 5 normalization modes (`zscore`, `cs_zscore`, `rolling_20`, `rolling_60`, `dual`)
- **`Stock15mLoader`**: Loads Alpha158 on 1-min data; computes 15-min forward returns; normalization modes: `industry`, `cs_zscore`, `dual`
- **`pl_Dataset`**: From `qshare.data`; provides `.with_segments()`, `.split()`, `.to_numpy()` methods

### Normalization Modes

**CTA 1D** (`dual` blending):

- `zscore`: Fit-time mean/std normalization
- `cs_zscore`: Cross-sectional z-score per datetime
- `rolling_20/60`: Rolling window normalization
- `dual`: Weighted blend (default: [0.2, 0.1, 0.3, 0.4])

**Stock 15m**:

- `industry`: Industry-neutralized returns
- `cs_zscore`: Cross-sectional z-score
- `dual`: 80% industry-neutral + 20% cs_zscore

### Experiment Tracking

Manual tracking in `results/{task}/README.md`:

```markdown
## 2025-01-15: Baseline XGB
- Notebook: `cta_1d/03_baseline_xgb.ipynb` (cells 1-50)
- Config: eta=0.5, lambda=0.1
- Train IC: 0.042
- Test IC: 0.038
- Notes: Dual normalization, 4 trades/day
```

### Dependencies on qshare

The codebase relies heavily on the `qshare` library (already installed in the venv):

- `qshare.data.pl_Dataset`: Dataset container with Polars backend
- `qshare.io.ddb`: DolphinDB session management
- `qshare.io.polars`: Parquet loading utilities
- `qshare.algo.polars`: Industry neutralization, cross-sectional z-score
- `qshare.eval.cta.backtest`: CTA backtesting framework
- `qshare.config.research.cta`: Predefined column lists (HFFACTOR_COLS)

### Configuration Files

YAML configs define data ranges, model hyperparameters, and output settings:

```yaml
data:
  dt_range: ['2020-01-01', '2023-12-31']
  feature_sets: [alpha158, hffactor]
  normalization: dual
model:
  type: xgb
  params: {eta: 0.05, max_depth: 6}
```

Load with: `python -m cta_1d.train --config config.yaml` or `yaml.safe_load()` directly.
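
For notebook use, parsing the config with `yaml.safe_load()` yields a plain nested dict. A minimal sketch, with the example config above inlined as a string (in practice you would open `cta_1d/config.yaml` from the repo root):

```python
import yaml

# The example config from the section above, inlined for illustration
CONFIG_TEXT = """
data:
  dt_range: ['2020-01-01', '2023-12-31']
  feature_sets: [alpha158, hffactor]
  normalization: dual
model:
  type: xgb
  params: {eta: 0.05, max_depth: 6}
"""

cfg = yaml.safe_load(CONFIG_TEXT)

# The resulting dict mirrors the YAML nesting
print(cfg['data']['feature_sets'])    # ['alpha158', 'hffactor']
print(cfg['model']['params']['eta'])  # 0.05
```

The nested `data` keys line up with the loader/config kwargs shown in "Python API Usage" (e.g. `dt_range`, `feature_sets`), so the dict can be unpacked into `TrainConfig` or loader calls as needed.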