This document explains the architecture and design of the Stonk-Trainer v2 model, including both Stage I and Stage II components.
The Stonk-Trainer v2 project uses a two-stage training approach built on Group Relative Policy Optimization (GRPO):
┌─────────────────────────────────────────────────────────────────────────────────┐
│                                Stonk-Trainer v2                                 │
├─────────────────────────────────────────────┬───────────────────────────────────┤
│                                             │                                   │
│              Stage I Training               │         Stage II Training         │
│           (Balanced Distribution)           │      (Natural Distribution)       │
│                                             │                                   │
├─────────────────────────────────────────────┼───────────────────────────────────┤
│                                             │                                   │
│  • Base Model: Qwen2.5-1.5B-Instruct        │  • Base Model: Stage I Model      │
│  • Dataset: 50/50 Up/Down Balance           │  • Dataset: Natural Market Skew   │
│  • Learning Rate: Higher (1e-5)             │  • Learning Rate: Lower (5e-6)    │
│  • Reward Function: Basic GRPO              │  • Reward Function: Enhanced GRPO │
│                                             │                                   │
└─────────────────────────────────────────────┴───────────────────────────────────┘
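For reference, the two stage setups above can be captured in a single settings structure. The following Python sketch is purely illustrative; the key names, dataset labels, and checkpoint path are assumptions, not the project's actual configuration schema:

```python
# Illustrative stage configurations mirroring the table above.
STAGE_CONFIGS = {
    "stage1": {
        "base_model": "Qwen/Qwen2.5-1.5B-Instruct",
        "dataset": "balanced_50_50",            # equal Up/Down examples
        "learning_rate": 1e-5,                  # higher LR for initial training
        "reward_fn": "basic_grpo",
    },
    "stage2": {
        "base_model": "checkpoints/stage1_best",  # hypothetical Stage I checkpoint
        "dataset": "natural_distribution",        # real market skew
        "learning_rate": 5e-6,                    # lower LR for fine-tuning
        "reward_fn": "enhanced_grpo",
    },
}
```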
The foundation of the Stonk-Trainer is the Qwen2.5-1.5B-Instruct language model, which provides strong language understanding and generation capabilities while remaining small enough to fine-tune on consumer-grade hardware.
┌─────────────────────────────────────────────────────────────┐
│                 Qwen2.5-1.5B-Instruct Model                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  • Parameters: 1.5 Billion                                  │
│  • Context Window: 2048 tokens                              │
│  • Architecture: Transformer-based language model           │
│  • Quantization: 4-bit precision (for memory efficiency)    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
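A minimal sketch of loading the base model in 4-bit precision with Hugging Face Transformers and bitsandbytes is shown below. The specific quantization settings (`nf4`, fp16 compute) are illustrative assumptions; the project may configure these differently:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Quantization settings for 4-bit loading (values here are assumptions).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # normalized-float 4-bit weights
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 for speed
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
```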
Low-Rank Adaptation (LoRA) is used for efficient fine-tuning:
┌─────────────────────────────────────────────────────────────┐
│                     LoRA Configuration                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  • R: 8 (rank of low-rank matrices)                         │
│  • Alpha: 16 (scaling factor)                               │
│  • Dropout: 0.05                                            │
│  • Target Modules: Query, Key, Value, Output projections    │
│  • Bias: "none"                                             │
│  • Task Type: "CAUSAL_LM"                                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
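Expressed with the PEFT library, the configuration above might look like the following sketch (the `q_proj`/`k_proj`/`v_proj`/`o_proj` module names are an assumption based on Qwen2-style attention layers):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,               # rank of the low-rank update matrices
    lora_alpha=16,     # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)  # wraps the base model loaded above
model.print_trainable_parameters()          # only adapter weights are trainable
```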
┌───────────────────────────────────────────────────────────────┐
│                       Dataset Pipeline                        │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│   ┌─────────────────┐        ┌───────────────────────┐        │
│   │   Raw Market    │        │                       │        │
│   │   Data Source   ├───────►│    Filtering Logic    │        │
│   │                 │        │                       │        │
│   └─────────────────┘        └───────────┬───────────┘        │
│                                          │                    │
│                                          ▼                    │
│   ┌─────────────────┐        ┌───────────────────────┐        │
│   │    Stage I:     │        │                       │        │
│   │  Balanced Set   │◄───────┤    Data Processor     │        │
│   │     (50/50)     │        │                       │        │
│   └─────────────────┘        └───────────┬───────────┘        │
│                                          │                    │
│                                          ▼                    │
│   ┌─────────────────┐        ┌───────────────────────┐        │
│   │    Stage II:    │        │                       │        │
│   │  Natural Distr. │◄───────┤      Formatting       │        │
│   │                 │        │                       │        │
│   └─────────────────┘        └───────────────────────┘        │
│                                                               │
└───────────────────────────────────────────────────────────────┘
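The Stage I balancing step could be implemented roughly as follows; field names such as `direction` are illustrative assumptions about the processed data format:

```python
import random

def balance_up_down(examples, seed=42):
    """Downsample the majority class so 'up' and 'down' outcomes appear
    50/50 (Stage I). Stage II skips this step and keeps the natural
    market skew."""
    rng = random.Random(seed)
    ups = [e for e in examples if e["direction"] == "up"]
    downs = [e for e in examples if e["direction"] == "down"]
    n = min(len(ups), len(downs))
    balanced = rng.sample(ups, n) + rng.sample(downs, n)
    rng.shuffle(balanced)
    return balanced
```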
The Group Relative Policy Optimization (GRPO) training loop is a specialized reinforcement learning approach for language models:
┌───────────────────────────────────────────────────────────────┐
│                      GRPO Training Loop                       │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│   ┌───────────────────────────┐                               │
│   │                           │                               │
│   │       Forward Pass        │                               │
│   │     Generate Response     ├────────────────┐              │
│   │                           │                │              │
│   └───────────────────────────┘                │              │
│                                                ▼              │
│   ┌───────────────────────────┐    ┌───────────────────────┐  │
│   │                           │    │                       │  │
│   │      Compute Reward       │◄───┤  Extract Prediction   │  │
│   └─────────────┬─────────────┘    │     and Reasoning     │  │
│                 │                  └───────────────────────┘  │
│                 ▼                                             │
│   ┌───────────────────────────┐                               │
│   │                           │                               │
│   │    Compute Policy Loss    │                               │
│   │      with KL Penalty      │                               │
│   │                           │                               │
│   └─────────────┬─────────────┘                               │
│                 │                                             │
│                 ▼                                             │
│   ┌───────────────────────────┐                               │
│   │                           │                               │
│   │       Backward Pass       │                               │
│   │     Update Parameters     │                               │
│   │                           │                               │
│   └───────────────────────────┘                               │
│                                                               │
└───────────────────────────────────────────────────────────────┘
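At its core, the loop optimizes a policy-gradient objective with a KL penalty toward a reference model. The sketch below shows a simplified, group-relative form of that loss for one prompt's sampled responses; the tensor shapes, the advantage normalization, and the `kl_coef` value are illustrative rather than the project's exact implementation:

```python
import torch

def grpo_loss(logprobs, ref_logprobs, rewards, kl_coef=0.05):
    """Simplified GRPO-style loss for one group of responses sampled
    from the same prompt.

    logprobs:     (group_size,) summed token log-probs under the policy
    ref_logprobs: (group_size,) same quantity under the frozen reference
    rewards:      (group_size,) scalar reward per sampled response
    """
    # Group-relative advantage: normalize rewards within the group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Policy-gradient term: raise log-probs of above-average responses.
    pg_loss = -(advantages * logprobs).mean()

    # KL penalty keeps the updated policy close to the reference model.
    kl_penalty = (logprobs - ref_logprobs).mean()

    return pg_loss + kl_coef * kl_penalty
```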
The reward function is composed of multiple components that evaluate different aspects of the model's performance:
┌───────────────────────────────────────────────────────────────┐
│                        Reward Function                        │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│   ┌───────────────────────────┐    ┌───────────────────────┐  │
│   │                           │    │                       │  │
│   │    Prediction Accuracy    │    │   Reasoning Quality   │  │
│   │    Component (0.0-0.6)    │    │  Component (0.0-0.4)  │  │
│   │                           │    │                       │  │
│   └─────────────┬─────────────┘    └───────────┬───────────┘  │
│                 │                              │              │
│                 └───────────────┬──────────────┘              │
│                                 │                             │
│                                 ▼                             │
│             ┌───────────────────────────────────────┐         │
│             │                                       │         │
│             │          Total Reward Score           │         │
│             │               (0.0-1.0)               │         │
│             │                                       │         │
│             └───────────────────────────────────────┘         │
│                                                               │
└───────────────────────────────────────────────────────────────┘
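One plausible reading of this composition is sketched below; the binary accuracy scaling and the function signature are assumptions, not the project's exact scoring rule:

```python
def total_reward(prediction_correct, reasoning_score):
    """Combine the two reward components from the diagram.

    prediction_correct: bool, did the model call the direction correctly?
    reasoning_score:    float in [0.0, 1.0] from the reasoning evaluator.
    """
    accuracy_component = 0.6 if prediction_correct else 0.0  # 0.0-0.6
    reasoning_component = 0.4 * reasoning_score              # 0.0-0.4
    return accuracy_component + reasoning_component          # 0.0-1.0
```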
The reasoning quality component evaluates the explanations provided by the model:
┌───────────────────────────────────────────────────────────────┐
│                 Reasoning Quality Evaluation                  │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│   ┌───────────────────────────┐    ┌───────────────────────┐  │
│   │                           │    │                       │  │
│   │     Data Usage Score      │    │   Logical Structure   │  │
│   │         (0.0-0.2)         │    │    Score (0.0-0.2)    │  │
│   │                           │    │                       │  │
│   └─────────────┬─────────────┘    └───────────┬───────────┘  │
│                 │                              │              │
│                 └───────────────┬──────────────┘              │
│                                 │                             │
│                                 ▼                             │
│             ┌───────────────────────────────────────┐         │
│             │                                       │         │
│             │     Total Reasoning Quality Score     │         │
│             │               (0.0-0.4)               │         │
│             │                                       │         │
│             └───────────────────────────────────────┘         │
│                                                               │
└───────────────────────────────────────────────────────────────┘
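A heuristic sketch of the two sub-scores is given below; the keyword list, thresholds, and the `provided_data_points` input are purely illustrative, and the real evaluator is likely more sophisticated:

```python
def reasoning_quality(response_text, provided_data_points):
    """Score reasoning on data usage (0.0-0.2) and logical
    structure (0.0-0.2), returning a total in [0.0, 0.4]."""
    text = response_text.lower()

    # Data usage: reward citing figures from the provided market context.
    cited = sum(1 for point in provided_data_points if point.lower() in text)
    data_usage = 0.2 * min(cited / max(len(provided_data_points), 1), 1.0)

    # Logical structure: reward connective/causal language.
    connectives = ["because", "therefore", "however", "which suggests"]
    hits = sum(1 for word in connectives if word in text)
    logical_structure = 0.2 * min(hits / 2, 1.0)

    return data_usage + logical_structure
```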
┌───────────────────────────────────────────────────────────────┐
│                Stage I to Stage II Transition                 │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│   ┌───────────────────────────┐    ┌───────────────────────┐  │
│   │                           │    │                       │  │
│   │     Save Best Stage I     │    │  Load Stage I Model   │  │
│   │     Model Checkpoint      │    │     for Stage II      │  │
│   │                           │    │                       │  │
│   └─────────────┬─────────────┘    └───────────┬───────────┘  │
│                 │                              │              │
│                 └───────────────┬──────────────┘              │
│                                 │                             │
│                                 ▼                             │
│             ┌───────────────────────────────────────┐         │
│             │                                       │         │
│             │    Adapt Learning Rate and Dataset    │         │
│             │   for Natural Distribution Training   │         │
│             │                                       │         │
│             └───────────────────────────────────────┘         │
│                                                               │
└───────────────────────────────────────────────────────────────┘
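The handoff could look like the sketch below: resume from the best Stage I adapter and drop the learning rate. The checkpoint path and optimizer choice are assumptions, and `base_model` refers to the 4-bit model loaded earlier:

```python
import torch
from peft import PeftModel

# Load the best Stage I LoRA adapter on top of the quantized base model.
model = PeftModel.from_pretrained(
    base_model,
    "checkpoints/stage1_best",  # hypothetical Stage I checkpoint path
    is_trainable=True,          # keep adapter weights trainable in Stage II
)

# Lower learning rate for natural-distribution training.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=5e-6,
)
```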
┌───────────────────────────────────────────────────────────────┐
│                    Performance Evaluation                     │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│   ┌───────────────────┐        ┌───────────────────┐          │
│   │                   │        │                   │          │
│   │     Accuracy      │        │  Average Reward   │          │
│   │      Metrics      │        │      Metrics      │          │
│   │                   │        │                   │          │
│   └─────────┬─────────┘        └─────────┬─────────┘          │
│             │                            │                    │
│             │              ┌─────────────┴──────────────┐     │
│             │              │                            │     │
│             │              │ Reasoning Quality Analysis │     │
│             │              │                            │     │
│             │              └─────────────┬──────────────┘     │
│             │                            │                    │
│             └────────────┬───────────────┘                    │
│                          │                                    │
│                          ▼                                    │
│   ┌─────────────────────────────────────────────┐             │
│   │                                             │             │
│   │       Comprehensive Evaluation Report       │             │
│   │                                             │             │
│   └─────────────────────────────────────────────┘             │
│                                                               │
└───────────────────────────────────────────────────────────────┘
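Aggregating the three metric families into a report could be as simple as the sketch below; the layout of each `results` record is an assumption:

```python
def evaluation_report(results):
    """Summarize per-example evaluation records, each assumed to carry
    `correct` (bool), `reward` (float), and `reasoning_score` (float)."""
    n = len(results)
    return {
        "accuracy": sum(r["correct"] for r in results) / n,
        "avg_reward": sum(r["reward"] for r in results) / n,
        "avg_reasoning_quality": sum(r["reasoning_score"] for r in results) / n,
    }
```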
One advantage of using the Qwen2.5-1.5B-Instruct model is that it requires fewer hardware resources than larger models, making it accessible to more users:
┌─────────────────────────────────────────────────────────────┐
│                    Hardware Requirements                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  • GPU: NVIDIA GPU with at least 8GB VRAM                   │
│    - Recommended: GTX 1080 Ti (11GB) or better              │
│    - The code has been optimized to run at 95-100%          │
│      utilization on a GTX 1080 Ti without OOM errors        │
│                                                             │
│  • RAM: 32GB recommended (16GB minimum)                     │
│    - Training processes can use 8-12GB RAM                  │
│    - Additional RAM needed for data processing              │
│    - Less RAM may result in slower performance              │
│                                                             │
│  • Storage: At least 50GB free space                        │
│    - Python environment: ~6-8GB for Conda environment       │
│    - HuggingFace cache: ~15-20GB for models and datasets    │
│    - Training datasets: ~12GB                               │
│    - Model checkpoints: ~5-10GB depending on saved epochs   │
│    - Logs and evaluation results: ~1GB                      │
│                                                             │
│  • CPU: 4+ cores recommended for data preprocessing         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                    Software Environment                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  • Python: 3.10+                                            │
│                                                             │
│  • PyTorch: 2.0.0+                                          │
│                                                             │
│  • CUDA: 11.7+ (11.8 recommended)                           │
│                                                             │
│  • Transformers: 4.38.0+                                    │
│                                                             │
│  • PEFT: 0.6.0+ (for LoRA adapter implementation)           │
│                                                             │
└─────────────────────────────────────────────────────────────┘
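A quick way to verify that an installed environment meets these requirements is a standard version and CUDA check:

```python
import torch
import transformers
import peft

print(f"PyTorch:        {torch.__version__}")
print(f"Transformers:   {transformers.__version__}")
print(f"PEFT:           {peft.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
```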