A trading signal generation platform I built to test whether machine learning actually works for algorithmic trading. Spoiler: it's complicated.
┌─────────────────────────────────────────────────────────────────┐
│ WHAT THIS IS                                                    │
├─────────────────────────────────────────────────────────────────┤
│ 200+ Technical Indicators  RSI, MACD, Bollinger, volume, etc.   │
│ 5 ML Models                RF, XGBoost, LightGBM, LSTM, etc.    │
│ Ensemble Methods           Voting, stacking, Bayesian avg       │
│ Backtesting Engine         Walk-forward validation, real costs  │
│ FastAPI Server             REST + WebSocket streaming           │
│ Production Monitoring      Drift detection, auto-retrain        │
│ 6,000+ Lines of Python     Not a toy project                    │
└─────────────────────────────────────────────────────────────────┘
I wanted to find out whether ML could generate profitable trading signals. I'd read a bunch of papers claiming 60-70% accuracy and decided to build the full pipeline myself rather than trust someone else's backtested metrics.
Built this to test:
- Can traditional ML (RF, XGBoost) beat LSTM/Transformers for price prediction?
- Do 200+ features actually help or just overfit?
- How much do transaction costs destroy theoretical edge?
- What's the real Sharpe ratio after slippage?
I used Python because that's where the ML ecosystem lives, FastAPI for the server because it's fast and has first-class WebSocket support, and Redis for caching because repeated feature calculation is expensive.
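That last point matters more than it sounds: a feature matrix takes seconds to compute and milliseconds to fetch. A minimal sketch of the cache-aside pattern with redis-py; the key layout and TTL here are illustrative, not the repo's actual scheme:

import pickle
import redis

r = redis.Redis(host="localhost", port=6379)

def cached_features(symbol, compute_fn, ttl=3600):
    key = f"features:{symbol}"                  # one cached frame per symbol
    raw = r.get(key)
    if raw is not None:
        return pickle.loads(raw)                # cache hit: skip recomputation
    features = compute_fn(symbol)               # cache miss: compute once...
    r.set(key, pickle.dumps(features), ex=ttl)  # ...and store with an expiry
    return features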
After extensive backtesting on 4 years of data (2020-2024):
┌─────────────────────────────────────────────────────────────────┐
│ Metric           Value      Reality Check                       │
├─────────────────────────────────────────────────────────────────┤
│ Sharpe Ratio     1.8-2.4    Good, but assumes perfect exec      │
│ Win Rate         58-65%     Slightly better than coin flip      │
│ Max Drawdown     ~15%       Happened twice, stressful           │
│ Signal Latency   <100ms     Fast enough for daily signals       │
│ Model Accuracy   62-68%     Directional, not magnitude          │
│ Profit Factor    1.8        After 0.1% transaction costs        │
└─────────────────────────────────────────────────────────────────┘
Key learning: Ensemble methods (combining RF + XGBoost + LightGBM) beat individual models. LSTM and Transformers didn't add much value for daily signals—too data-hungry for the features I had.
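For reference, those headline numbers come from the standard definitions. A minimal sketch, assuming `rets` is a pandas Series of daily strategy returns (per-trade returns for the profit factor); these helpers are illustrative, not lifted from engine.py:

import numpy as np

def sharpe(rets, periods_per_year=252):
    # Annualized Sharpe, assuming a zero risk-free rate
    return np.sqrt(periods_per_year) * rets.mean() / rets.std()

def max_drawdown(rets):
    equity = (1 + rets).cumprod()                # equity curve from returns
    return (equity / equity.cummax() - 1).min()  # most negative peak-to-trough dip

def profit_factor(rets):
    # Gross profits / gross losses; > 1 means net profitable
    return rets[rets > 0].sum() / -rets[rets < 0].sum()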
Quick start:

git clone https://github.com/JasonTeixeira/AlphaStream.git
cd AlphaStream
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
# Train models on AAPL, GOOGL, MSFT
python train_models.py train --symbols AAPL,GOOGL,MSFT
# Start API server
python -m api.main

With Docker:
docker-compose up -d
docker-compose logs -f api

Built in 3 layers:
Data Pipeline          Load OHLCV, validate, cache
        │
        ▼
Feature Engineering    200+ indicators (price, volume, volatility)
        │
        ▼
ML Pipeline            Train 5 models, ensemble predictions
  ├── Random Forest    Baseline, interpretable
  ├── XGBoost          Best single model (usually)
  ├── LightGBM         Fast for large datasets
  ├── LSTM             Sequential patterns (underwhelming)
  └── Transformer      Attention-based (compute-heavy)
        │
        ▼
Backtesting Engine     Walk-forward validation, real costs
        │
        ▼
FastAPI Server         REST + WebSocket streaming
        │
        ▼
Redis Cache            Feature cache (optional but fast)
AlphaStream/
├── ml/
│   ├── models.py           # 5 model implementations
│   ├── features.py         # 200+ technical indicators
│   ├── dataset.py          # Data loading + preprocessing
│   ├── train.py            # Training pipeline
│   ├── validation.py       # Data quality checks
│   └── monitoring.py       # Drift detection
├── backtesting/
│   └── engine.py           # Portfolio simulation
├── api/
│   └── main.py             # FastAPI server
├── config/
│   ├── training.yaml       # Model hyperparameters
│   └── logging.yaml        # Logging config
├── tests/
│   └── test_models.py      # Unit tests
├── train_models.py         # CLI for training
└── docker-compose.yml      # Orchestration
I implemented everything TA-Lib offers, plus some custom ones:
Price-based:
- Moving averages (SMA, EMA, WMA)
- Bollinger Bands
- RSI, MACD, Stochastic
Volume:
- OBV, VWAP, MFI
- Accumulation/Distribution
Volatility:
- ATR, Historical Vol
- Parkinson, Garman-Klass
Market microstructure:
- Bid-ask spread proxy
- Order flow imbalance
- Volume profile
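To give a flavor of what features.py computes, here's how two of these look in pandas (a sketch of the textbook formulas, not the repo's exact code):

import numpy as np
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    # Wilder's RSI: ratio of smoothed average gain to smoothed average loss
    delta = close.diff()
    gain = delta.clip(lower=0).ewm(alpha=1 / period, adjust=False).mean()
    loss = (-delta.clip(upper=0)).ewm(alpha=1 / period, adjust=False).mean()
    return 100 - 100 / (1 + gain / loss)

def parkinson_vol(high: pd.Series, low: pd.Series, window: int = 20) -> pd.Series:
    # Parkinson estimator: volatility from the high-low range, annualized (252 days)
    log_hl_sq = np.log(high / low) ** 2
    return np.sqrt(log_hl_sq.rolling(window).mean() * 252 / (4 * np.log(2)))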
Single models are noisy. Ensembles smooth predictions:
- Voting: Majority wins (simple, works)
- Stacking: Meta-learner on top of base models (better)
- Blending: Weighted combination (tunable)
- Bayesian Averaging: Probabilistic (overkill for this)
Best results: Simple voting of RF + XGBoost + LightGBM.
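That winning combination is only a few lines with scikit-learn. A sketch with assumed hyperparameters (the real ones live in config/training.yaml); X_train and y_train stand in for the feature matrix and 0/1 direction labels:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
        ("xgb", XGBClassifier(n_estimators=300, learning_rate=0.05)),
        ("lgbm", LGBMClassifier(n_estimators=300, learning_rate=0.05)),
    ],
    voting="soft",  # average class probabilities; "hard" would take a majority vote
)
ensemble.fit(X_train, y_train)
proba_up = ensemble.predict_proba(X_test)[:, 1]  # doubles as the confidence score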
Get Prediction:
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"symbol": "AAPL", "model_type": "ensemble"}'Response:
{
"symbol": "AAPL",
"prediction": 1,
"confidence": 0.72,
"action": "BUY",
"timestamp": "2024-01-01T12:00:00"
}

Batch Signals:
curl -X POST http://localhost:8000/signals \
-H "Content-Type: application/json" \
-d '{"symbols": ["AAPL", "GOOGL", "MSFT"], "threshold": 0.6}'Backtesting:
curl -X POST http://localhost:8000/backtest \
-H "Content-Type: application/json" \
-d '{
"symbol": "AAPL",
"start_date": "2023-01-01",
"end_date": "2024-01-01",
"model_type": "xgboost",
"initial_capital": 100000
}'

WebSocket Streaming:

const ws = new WebSocket('ws://localhost:8000/ws/stream');
ws.send(JSON.stringify({
action: 'subscribe',
symbols: ['AAPL', 'GOOGL']
}));
ws.onmessage = (event) => {
const signal = JSON.parse(event.data);
console.log('Signal:', signal);
};

Production monitoring is critical because models degrade:
- Data Drift: Kolmogorov-Smirnov test on feature distributions (see the sketch below)
- Concept Drift: Track prediction accuracy over rolling windows
- Automated Alerts: Slack/email when drift detected
- Auto-Retrain: Trigger retraining when performance degrades
I learned this the hard way after a model trained in 2022 stopped working in 2023 (regime change).
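The data-drift check is conceptually simple: a two-sample KS test per feature, comparing live inputs against the training sample. A minimal sketch with scipy (the function name and threshold are illustrative, not monitoring.py's API):

from scipy.stats import ks_2samp

def drifted_features(train_sample, live_sample, p_threshold=0.01):
    # Both arguments are DataFrames with the same feature columns
    flagged = []
    for col in train_sample.columns:
        stat, p_value = ks_2samp(train_sample[col].dropna(), live_sample[col].dropna())
        if p_value < p_threshold:  # distributions differ: candidate for retraining
            flagged.append((col, stat, p_value))
    return flagged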
- Data quality: Bad ticks, splits, dividends—had to write extensive validation
- Feature explosion: 200+ features led to overfitting. Had to add regularization and feature selection
- Lookahead bias: Easy to accidentally leak future data into training. Walk-forward validation caught this (see the sketch after this list)
- Transaction costs: Theoretical edge vanished with 0.1% costs. Had to optimize for fewer trades
- Model drift: Models degrade fast. Needed monitoring and auto-retrain logic
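On the lookahead-bias point: the fix is expanding-window validation where every test fold sits strictly after its training data. scikit-learn's TimeSeriesSplit does this out of the box (X, y, and model are placeholders here):

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5, test_size=60)  # e.g. 60 trading days per test fold
for train_idx, test_idx in tscv.split(X):
    model.fit(X.iloc[train_idx], y.iloc[train_idx])         # past data only
    print(model.score(X.iloc[test_idx], y.iloc[test_idx]))  # scored on the future block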
Running the tests:

# All tests
pytest tests/ -v
# With coverage
pytest tests/ --cov=ml --cov=backtesting --cov-report=html
# Specific test
pytest tests/test_models.py -k test_random_forest
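For a flavor of what the unit tests assert, here's an illustrative one (not the repo's actual test): train a tiny model on random data and sanity-check its outputs.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def test_random_forest_outputs_valid_probabilities():
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = rng.integers(0, 2, size=200)
    model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
    proba = model.predict_proba(X)
    assert proba.shape == (200, 2)
    assert np.allclose(proba.sum(axis=1), 1.0)  # each row is a probability distribution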
Docker (recommended):

docker-compose up -d
docker-compose logs -f api

Production checklist:
- Use Redis for caching (10x faster feature lookups)
- Enable GPU if using LSTM/Transformers (CPU is slow)
- Set up Prometheus/Grafana for monitoring
- Configure alerts for drift detection
- Add API rate limiting (prevent abuse)
- Use load balancer for multiple instances
Future improvements:

- Reinforcement learning: Try RL for position sizing and timing
- Alternative data: Sentiment, news, options flow (expensive to get)
- Multi-timeframe: Combine daily + hourly signals
- Risk management: Position sizing based on volatility (sketched below)
- Database: Persist predictions for historical analysis
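The risk-management item is the most concrete of these: scale position size inversely with recent volatility so every trade risks roughly the same slice of capital. A sketch, with illustrative parameters:

def position_size(capital, atr, risk_fraction=0.01, stop_atr_multiple=2.0):
    # Risk a fixed fraction of capital per trade, stop placed N ATRs from entry
    dollars_at_risk = capital * risk_fraction      # e.g. 1% of a $100k account
    stop_distance = stop_atr_multiple * atr        # wider stops in volatile names
    return max(int(dollars_at_risk / stop_distance), 0)  # shares to trade

# position_size(100_000, atr=3.50) -> 142 shares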
Done:

- Core ML pipeline
- 200+ technical indicators
- Backtesting with real costs
- FastAPI + WebSocket
- Docker deployment
- Drift detection

Not yet:

- Database persistence (PostgreSQL)
- API authentication (JWT)
- Reinforcement learning agents
- Cloud deployment guide (AWS/GCP)
License: MIT
- Build Story: JOURNEY.md (coming soon)
- API Docs: /docs when the server is running
- Config: config/training.yaml