End-to-end pipeline for forecasting Bitcoin’s next-day close. The project ingests market data, engineers technical + macro features, trains a small ensemble of classical models, and surfaces predictions in both a CLI and a Streamlit dashboard (with optional LLM commentary).
- Multi-source ingestion: OHLCV via ccxt, Google News sentiment with optional `kk08/CryptoBERT`, and US Treasury yield curves (Fiscal Data API).
- Consistent feature store: rolling stats, RSI, volatility, momentum, interest rate pivots, and day-ahead sentiment alignment.
- Model ensemble: Linear Regression, Ridge, Random Forest, and XGBoost with shared scaler + feature-list artifacts.
- Streamlit dashboard: run forecasts, trigger fresh ingestion, compare model outputs, chart price history in UTC or Pacific time, and request Gemini commentary with citations.
- Reproducible CLI + notebooks: train/predict scripts, analysis notebooks, and timestamped `data/` + `models/` directories for every run.

- Quantitative performance: validation across multiple models using Mean Absolute Error (MAE)

- Interpretability: examines model feature influence using SHAP

- Model Fit: Visual comparison showing how the best-performing models track the actual historical close price.
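As an illustration of the technical features mentioned above, RSI over closing prices can be sketched in a few lines. This is a simple-average variant for clarity; the repo's `helpers/feature_engineering.py` may differ (e.g. Wilder smoothing or pandas rolling windows):

```python
def rsi(closes, period=14):
    """Relative Strength Index over a list of closes (simple-average variant)."""
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closing prices")
    gains, losses = [], []
    for prev, curr in zip(closes, closes[1:]):
        change = curr - prev
        gains.append(max(change, 0.0))   # upward moves
        losses.append(max(-change, 0.0))  # downward moves (as positives)
    avg_gain = sum(gains[-period:]) / period
    avg_loss = sum(losses[-period:]) / period
    if avg_loss == 0:
        return 100.0  # no losses in the window: maximally overbought
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```

A strictly rising window yields 100, a strictly falling one yields 0, with typical values oscillating between the classic 30/70 thresholds.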

```bash
conda env create -f environment.yml
conda activate crypto_env
# or
python -m venv .venv
.venv\Scripts\activate  # Windows
pip install -r requirements.txt
```

Create `.env` in the repo root:

```
GEMINI_API_KEY=your-gemini-key
```

Gemini is only required if you want AI commentary inside Streamlit.
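The repo may load the key with `python-dotenv` or similar; as an illustration of what that involves, here is a minimal stdlib-only `.env` loader (`load_env` is a hypothetical helper, not from the repo):

```python
import os
from pathlib import Path

def load_env(path=".env"):
    """Minimal .env loader: put KEY=value lines into os.environ.

    Skips blanks and comments; does not override variables already set.
    """
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```

With this in place, code that needs Gemini can simply check `os.environ.get("GEMINI_API_KEY")` and disable commentary when it is absent.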
```bash
streamlit run streamlit_app.py
```

Key workflow:
- Pick a saved `models/` run; paths auto-populate.
- Choose an existing `data/` snapshot or toggle “Ingest fresh data (~10-15 min)” to pull a new 20-day window.
- Adjust the history slider and timezone (UTC or Pacific with DST) for the chart + preview table.
- Run the forecast. Results persist in-session so you can tweak settings without re-running the pipeline.
- (Optional) Toggle AI prediction and commentary to call Gemini; citations are appended automatically via Google Search tools.

Outputs include:
- Summary metrics (latest close, ensemble average, model spread).
- Chart with historical closes plus the Linear Regression prediction marker.
- Downloadable prediction table and feature preview with both UTC and PST timestamps.
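The summary metrics above are simple reductions over the per-model predictions; a sketch of how they could be computed (hypothetical helper, assuming predictions arrive as a model-name-to-price dict):

```python
def summarize_predictions(preds: dict) -> dict:
    """Dashboard-style summary: ensemble average and model spread.

    `preds` maps model name (e.g. "mlr", "ridge", "rf", "xgb")
    to its next-day close prediction.
    """
    values = list(preds.values())
    return {
        "ensemble_average": sum(values) / len(values),  # mean across models
        "model_spread": max(values) - min(values),      # disagreement width
    }
```

A wide spread signals that the models disagree and the ensemble average should be read with more caution.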
The CLI mirrors the dashboard but is script-friendly.
```bash
python main.py train --ingest --save-models \
  --lookback-days 1095 --hours 26280 --ir-lookback-days 1095
```

```bash
python main.py predict --models-dir models/20251112_170018 --ingest
# refer to train_with_training_data.ipynb
```

| Path | Purpose |
|---|---|
| `main.py` | CLI entrypoint (train, predict, forecast) |
| `streamlit_app.py` | Dashboard with ingestion toggle, charting, downloads, Gemini commentary |
| `main_scripts/train.py` | Preprocess, engineer features, train/evaluate, persist artifacts |
| `main_scripts/test.py` | Feature rebuild + model inference |
| `helpers/data_ingestation.py` | Independent quant/news/interest ingestion commands |
| `helpers/feature_engineering.py` | RSI, lags, rolling stats, volatility, momentum features |
| `helpers/llm_support.py` | Prompt builder + citation helper for Gemini |
| `helpers/queries.py` | Default news/search query lists |
| `analysis/*.ipynb` | Evaluation visuals (`plots_for_readers.ipynb`, `true_vs_predicted.ipynb`, etc.) |
| `standalone_training/` | Sandbox scripts/notebooks for reproducible experiments |
| `data/{timestamp}/` | Saved CSV snapshots (quant, sentiment, interest) |
| `models/{timestamp}/` | Serialized models (mlr, ridge, rf, xgb), scaler, feature list |
Every ingest run creates:

```
data/{YYYYMMDD_HHMMSS}/quant/quant_bitcoin_test_*.csv
data/{YYYYMMDD_HHMMSS}/sentiment/google_news_sentiment_*.csv
data/{YYYYMMDD_HHMMSS}/interest/interest_rates_test_*.csv
```

Training saves to:

```
models/{YYYYMMDD_HHMMSS}/mlr_model.joblib
models/{YYYYMMDD_HHMMSS}/ridge_model.joblib
models/{YYYYMMDD_HHMMSS}/rf_model.joblib
models/{YYYYMMDD_HHMMSS}/xgb_model.joblib
models/{YYYYMMDD_HHMMSS}/scaler.joblib
models/{YYYYMMDD_HHMMSS}/feature_list.joblib
```
The Streamlit app and CLI auto-discover these directories, so keeping timestamped folders untouched preserves reproducibility.
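Because the directory names follow `YYYYMMDD_HHMMSS`, the newest run sorts last lexicographically, so auto-discovery can be as simple as a `max` over directory names. A sketch of the idea (hypothetical helper; the app's actual discovery logic may differ):

```python
from pathlib import Path
from typing import Optional

def latest_run(root: str = "models") -> Optional[Path]:
    """Return the newest timestamped run directory under `root`.

    YYYYMMDD_HHMMSS names sort chronologically as plain strings,
    so the lexicographic max is the most recent run.
    """
    runs = [p for p in Path(root).iterdir() if p.is_dir()]
    return max(runs, default=None, key=lambda p: p.name)
```

This is also why renaming or editing the timestamped folders breaks reproducibility: the discovery relies on the naming scheme.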
- Sentiment model missing: if `kk08/CryptoBERT` cannot be loaded, sentiment defaults to `0.0`; warnings are logged.
- Feature guard: predictions need at least 15 historical rows after preprocessing; otherwise the pipeline aborts early.
- Gemini errors: ensure `GEMINI_API_KEY` is set; the dashboard gracefully disables commentary when unavailable.
- Widget deprecations: the app already uses the new `width` parameter (replacing `use_container_width`) to stay compatible with Streamlit >= 1.40.
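The graceful-degradation pattern behind the sentiment fallback can be sketched as follows (`make_sentiment_scorer` is a hypothetical helper, not the repo's actual loader):

```python
import logging

logger = logging.getLogger(__name__)

def make_sentiment_scorer(load_model):
    """Return a text -> score function, falling back to neutral 0.0.

    `load_model` would, in practice, try to load something like
    kk08/CryptoBERT via transformers; any load failure is logged
    and replaced by a constant neutral scorer.
    """
    try:
        model = load_model()
    except Exception as exc:
        logger.warning("sentiment model unavailable, defaulting to 0.0: %s", exc)
        return lambda text: 0.0  # neutral sentiment for every headline
    return lambda text: model(text)
```

Keeping the fallback inside a factory like this means downstream feature code never has to special-case a missing model.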
- `analysis/plots_for_readers.ipynb` – curated visuals for presentations/blog posts.
- `analysis/true_vs_predicted.ipynb` – compares model forecasts vs actual closes.
- `next_day_forecast_llm.ipynb` – scripted forecast run with LLM prompt generation.
- `standalone_training/train_with_training_data.ipynb` – reproducible training experiments outside the CLI.
- Expand model zoo (CatBoost/LightGBM, LSTM or Transformer for Hourly/Minute Predictions).
- Deploy Streamlit as a scheduled Cloud Run/Spaces app.
- Add automated backtesting metrics and alerting hooks.
`helpers/llm_support.py`:
- `get_prompt(predictions, yesterdays_close)`: structured forecasting prompt for the LLM.
- `add_citations(response)`: attaches citation links if metadata is available.
