The information provided here is for testing and educational purposes only and should not be construed as financial advice. Please consult with a licensed financial advisor before making any financial decisions. This is all theoretical and not proven to work! This is a work in progress and nothing is guaranteed and things may/will break.
An advanced stonk market prediction model using Generative Reinforcement Policy Optimization (GRPO) with the Qwen2.5-1.5B-Instruct model, featuring enhanced reward functions and training methodology.
- Overview
- Key Features
- Installation
- Usage
- Training Process
- Training Architecture Diagram
- Model Architecture
- Reward Function
- Directory Structure
- Results & Evaluation
- Super Saiyan Mode (Stage II)
- Troubleshooting
- Contributing
Stonk-Trainer v2 is a specialized framework for training language models to predict stonk market movements. It leverages GRPO (Generative Reinforcement Policy Optimization) to train LLMs specifically on stonk prediction tasks, incorporating concepts from reinforcement learning and behavioral economics.
The model is trained to analyze a company's stonk information (including historical data, price movements, and news) and predict whether the stonk will go up or down, along with percentage change estimates and confidence levels.
The reward function has been significantly improved to encourage data-driven predictions and appropriate confidence calibration:
- Confidence Penalty for Wrong Predictions: Applies a sliding scale penalty (from -0.2 to -1.5) based on confidence level when the prediction direction is incorrect.
- Data Utilization Reward (15%): Rewards the model for referencing specific data points in its reasoning.
- Rebalanced Components: Direction (35%), Magnitude (20%), Format (20%), Confidence (10%), Data Utilization (15%).
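The weighting above can be sketched as a simple aggregation function. This is an illustrative sketch, not the project's actual code: component scores are assumed to be normalized to [0, 1], and the function name is hypothetical.

```python
def combined_reward(direction, magnitude, fmt, confidence, data_use,
                    correct_direction, confidence_level):
    """Weighted sum of the reward components, with the sliding confidence
    penalty applied when the predicted direction is wrong (sketch only)."""
    reward = (0.35 * direction +
              0.20 * magnitude +
              0.20 * fmt +
              0.10 * confidence +
              0.15 * data_use)
    if not correct_direction:
        # Linearly interpolate the penalty from -0.2 (low confidence)
        # to -1.5 (full confidence), per the sliding scale described above.
        reward += -0.2 + (-1.5 - (-0.2)) * confidence_level
    return reward
```

A perfect prediction scores 1.0, while a confidently wrong one is pushed well below zero.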
- Implements filtering to create a balanced dataset with equal representation of upward and downward price movements.
- Applies diversity penalties during training if the model shows bias toward consistently predicting one direction.
- Enhanced KL divergence calculation for more stable GRPO training.
- Better gradient handling and loss computation.
- 4-bit quantization support for efficient training.
- Stage I: Balanced dataset training using modified GRPO.
- Stage II: "Super Saiyan" mode with natural distribution training (see Super Saiyan Mode).
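The Stage I balanced-dataset filtering described above can be sketched as follows. This is a hypothetical helper, not the project's actual filtering code; the `label` field name is an assumption.

```python
import random

def balance_dataset(examples, seed=42):
    """Keep an equal number of 'up' and 'down' examples so neither
    direction dominates training (illustrative sketch)."""
    ups = [e for e in examples if e["label"] == "up"]
    downs = [e for e in examples if e["label"] == "down"]
    n = min(len(ups), len(downs))  # cap both classes at the smaller count
    rng = random.Random(seed)
    balanced = rng.sample(ups, n) + rng.sample(downs, n)
    rng.shuffle(balanced)          # avoid all-up-then-all-down ordering
    return balanced
```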
Linux is the only OS that has been tested. The trainer may work on Windows, but this has not been confirmed. The current implementation requires CUDA, so macOS is not supported at this time.
- Python 3.10+
- CUDA-capable GPU with 11GB+ VRAM (24GB recommended for larger batch sizes)
- 32GB+ System RAM (recommended)
- Clone this repository:

```bash
git clone https://github.com/yourusername/Stonk-Trainer.git
cd Stonk-Trainer/Stonk-Trainer_v2
```

- Create a conda environment (recommended):

```bash
conda create -n stonk_env python=3.10
conda activate stonk_env
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Make scripts executable:

```bash
chmod +x *.sh
```

Two options are provided for running the training:

```bash
./train_direct.sh
./train_large_dataset.sh
```

You may need to edit the Python path in the scripts to match your environment.
Key parameters (customizable in the scripts):
- `--epochs`: Number of training epochs (default: 5)
- `--batch_size`: Batch size for training (default: 8)
- `--lr`: Learning rate (default: 1e-5); experiment with other values, e.g. 2e-5 to 5e-5
- `--kl_coef`: KL divergence coefficient (default: 0.15); experiment with other values, e.g. 0.05 to 0.2
- `--save_steps`: Steps between saving checkpoints (default: 100)
- `--diverse_predictions`: Enable diversity penalties
- `--max_train_samples`: Maximum number of training samples (default: 2000)
After training completes, test the model with:
```bash
./test_direct.sh
./test_model.sh
```

Test results will be saved to `test_results.log`.
- Data Loading: The model loads the 2084Collective/deepstock-sp500-companies-with-info-and-user-prompt_buy_sell_v2 dataset.
- Data Filtering: Creates a balanced dataset with roughly equal up/down examples (~505 up, ~504 down).
- Model Preparation: Loads the Qwen2.5-1.5B-Instruct model with 4-bit quantization and applies LoRA adapters.
- GRPO Training Loop:
- Generates predictions for stonk data
- Computes rewards based on prediction quality
- Calculates policy gradient loss and KL divergence
- Updates model parameters
- Checkpointing: Saves the model regularly and keeps the best-performing version.
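The core of the GRPO training loop above can be sketched as a single toy update step. This is an illustrative, minimal version (not the project's actual code): the real trainer operates on per-token logits and batched tensors.

```python
def grpo_step(logprobs, ref_logprobs, rewards, kl_coef=0.15):
    """One simplified GRPO update (sketch). logprobs / ref_logprobs are
    per-sample log-probabilities of the generated predictions under the
    current policy and the frozen reference model; rewards come from
    the reward function."""
    baseline = sum(rewards) / len(rewards)          # group-average baseline
    advantages = [r - baseline for r in rewards]    # centered rewards
    # Policy-gradient term: increase log-prob of above-average samples.
    pg_loss = -sum(a * lp for a, lp in zip(advantages, logprobs)) / len(rewards)
    # Monte Carlo KL estimate keeps the policy near the reference model.
    kl = sum(lp - rlp for lp, rlp in zip(logprobs, ref_logprobs)) / len(rewards)
    return pg_loss + kl_coef * kl
```

In practice this loss is backpropagated through the LoRA parameters with gradient clipping before the optimizer step.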
The Stonk-Trainer v2 employs a two-stage GRPO (Generative Reinforcement Policy Optimization) training architecture:
┌──────────────────────────── Stage I ────────────────────────────┐ ┌──────────────────────── Stage II ───────────────────────────┐
│ │ │ │
│ ┌─────────┐ ┌──────────────┐ ┌────────────────┐ │ │ ┌─────────────┐ ┌──────────────┐ ┌────────────────┐ │
│ │ Qwen2.5 │────>│ 4-bit Quant │────>│ Balanced │ │ │ │ Stage I │───>│ 4-bit Quant │───>│ Natural │ │
│ │ Model │ │ LoRA (r=16) │ │ Dataset │ │ │ │ Model │ │ LoRA (r=8) │ │ Distribution │ │
│ └─────────┘ └──────────────┘ └────────────────┘ │ │ └─────────────┘ └──────────────┘ └────────────────┘ │
│ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │
│ v v v │ │ v v v │
│ ┌─────────┐ ┌──────────────┐ ┌────────────────┐ │ │ ┌─────────────┐ ┌──────────────┐ ┌────────────────┐ │
│ │Reference│ │Reward Function│ │Policy Gradient │ │ │ │ Reference │ │Enhanced │ │Market Stats │ │
│ │Model │────>│Format: 20% │────>│KL coef: 0.15 │ │ │ │ Model │───>│Reward │───>│Tracking │ │
│ │(Frozen) │ │Direction: 35%│ │ │ │ │ │ (Stage I) │ │Direction: 25%│ │Up/Down Ratio │ │
│ └─────────┘ │Magnitude: 20%│ └────────────────┘ │ │ └─────────────┘ │Magnitude: 20%│ └────────────────┘ │
│ │Confidence:10%│ │ │ │ │Confidence:25%│ │ │
│ │Data Use: 15% │ │ │ │ │Data Use: 30% │ │ │
│ └──────────────┘ v │ │ └──────────────┘ v │
│ ┌────────────────┐ │ │ ┌────────────────┐ │
│ │Best Model │ │ │ │KL-Divergence │ │
│ │Checkpointing │───────┼──┼────────────────────────────────────────>│KL coef: 0.05 │ │
│ └────────────────┘ │ │ └────────────────┘ │
│ │ │ │ │
└─────────────────────────────────────────────────────────────────┘ └────────────────────────────────────────────────┼────────────┘
│
v
┌────────────────────┐
│ Final Adapted │
│ Stonk Prediction │
│ Model │
└────────────────────┘
- Two-Stage Process:
- Stage I: Training with balanced dataset (equal up/down examples)
- Stage II: Fine-tuning with natural market distribution
- Reference Model: Frozen copy of the base model used to calculate KL divergence
- Reward Function Components:
- Direction reward (35%): Correctly predicting up/down movement
- Magnitude reward (20%): Accuracy of percentage change prediction
- Format reward (20%): Following correct output format
- Confidence reward (10%): Appropriate confidence calibration
- Data utilization reward (15%): Referencing specific data points
- Policy Gradient with KL Divergence:
- Stage I: Higher KL coefficient (0.15) for stability
- Stage II: Lower KL coefficient (0.05) for adaptation
- Implementation Details:
- Stage I: 4-bit quantization with LoRA (r=16)
- Stage II: 4-bit quantization with LoRA (r=8)
- Best model checkpointing between stages
This unified GRPO approach ensures the model learns fundamental prediction patterns in a balanced setting before adapting to real-world market distributions while maintaining prediction quality.
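The KL regularization that anchors both stages can be sketched in a few lines. This is a minimal illustration over a single probability distribution; the actual trainer computes it per token between the current policy and the frozen reference model, with the coefficient dropping from 0.15 (Stage I) to 0.05 (Stage II).

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) between two probability distributions (sketch).
    Zero when the policy matches the reference; grows as it drifts."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```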
- Base Model: Qwen2.5-1.5B-Instruct
- Adaptation: LoRA (Low-Rank Adaptation) with r=16
- Quantization: 4-bit precision for memory efficiency
- Optimization: AdamW with gradient clipping
- Regularization: KL divergence from reference model
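A configuration sketch of how this model setup might look with `transformers` and `peft`. This is an assumption-laden illustration, not the project's exact code: the `target_modules` list in particular is a guess, and running it requires a CUDA GPU and a model download (so it is shown as a config fragment only).

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
# LoRA adapters: r=16 in Stage I, r=8 in Stage II
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```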
The reward function is a critical component that evaluates prediction quality across multiple dimensions:
- Checks if model follows the correct format with thinking sections.
- Rewards correct prediction of price movement direction (up/down).
- Evaluates the accuracy of the percentage change prediction.
- Rewards appropriate confidence levels and penalizes overconfidence on wrong predictions.
- Rewards referencing specific data points (ticker, price, news, etc.).
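The data-utilization check could be implemented along these lines. This is a hypothetical sketch: the field names and string heuristics are assumptions, not the project's actual logic.

```python
def data_utilization_score(reasoning, ticker, price):
    """Fraction of expected data references found in the model's
    reasoning text (illustrative sketch)."""
    text = reasoning.lower()
    checks = [
        ticker.lower() in text,   # mentions the ticker symbol
        str(price) in text,       # cites the current price
        "news" in text,           # refers to the news items
        "%" in text,              # quotes a percentage move
    ]
    return sum(checks) / len(checks)
```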
- `Stonk_Trainer.py` - Main training and testing script with improved reward function
- `requirements.txt` - Dependencies required for running the model
- `train_direct.sh` / `train_large_dataset.sh` - Scripts to run training with optimized parameters
- `test_direct.sh` / `test_model.sh` - Scripts to test the trained model
- `Super_Sayin_GRPO_Trainer_StageII/` - Advanced Stage II training framework
- `stonk_trainer_grpo/` - Output directory for trained models and checkpoints
Training produces several outputs:
- `training_large_dataset.log`: Training progress and statistics
- `test_results.log`: Evaluation results after testing
- `stonk_trainer_grpo/`: Directory containing:
  - `checkpoints/`: Periodic model snapshots
  - `best_model/`: Best performing model based on average reward
  - `evaluation_results/`: Visualizations and metrics (when using test scripts)
After training a model with the balanced dataset (Stage I), you can proceed to "Super Saiyan Mode" (Stage II) which:
- Trains on a natural market distribution (not artificially balanced)
- Uses enhanced reward functions that prioritize data utilization
- Adapts to market bias through advanced tracking
- Implements low-rank fine-tuning to prevent catastrophic forgetting
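The market-bias tracking mentioned above could look something like this. The class and method names are illustrative, not taken from the project.

```python
from collections import Counter

class MarketStatsTracker:
    """Sketch of Stage II up/down ratio tracking: records each
    prediction so training can detect directional bias."""
    def __init__(self):
        self.counts = Counter()

    def record(self, prediction):
        self.counts[prediction] += 1

    def up_ratio(self):
        total = sum(self.counts.values())
        # Neutral prior of 0.5 before any predictions are recorded.
        return self.counts["up"] / total if total else 0.5
```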
To use Super Saiyan Mode:
```bash
cd Super_Sayin_GRPO_Trainer_StageII
./setup.sh
./train_stage2.sh ../stonk_trainer_grpo/best_model
```

See the `Super_Sayin_GRPO_Trainer_StageII/README.md` for detailed instructions.
- CUDA out of memory errors:
- Reduce batch size in the training script
- Ensure no other processes are using GPU memory
- Enable gradient accumulation (modify Stonk_Trainer.py)
- NaN losses during training:
- Reduce learning rate
- Check for extreme reward values
- Increase KL coefficient for more stable training
- Model produces low-quality predictions:
- Ensure dataset is properly loaded and filtered
- Check reward function components and weights
- Try increasing the training samples
- Script path errors:
- Modify the PYTHON_PATH in the .sh files to match your environment
Contributions to improve Stonk-Trainer are welcome! Please feel free to submit issues or pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.
- The Qwen team for the base model
- Special thanks to Lukas Nel, creator of the 2084Collective/deepstock-sp500-companies-with-info-and-user-prompt_buy_sell_v2 dataset used by the Stonk-Trainer
- The PyTorch and Hugging Face communities