A PyTorch-based project for training language models using Hugging Face Transformers, with uv for modern Python dependency management.
🚀 New to this project? Check out QUICKSTART.md for a step-by-step guide!
🐳 Want to use Docker? See DOCKER.md for isolated development environment!
📊 Need to monitor training? See TENSORBOARD.md and WANDB.md for visualization guides!
💾 Want to use trained models? See CHECKPOINT_USAGE.md for loading checkpoints and running inference!
🔷 Need PyTorch Geometric? See PYTORCH_GEOMETRIC.md for installing PyG with PyTorch 2.9.1 + CUDA 12.8!
- PyTorch: State-of-the-art deep learning framework (CUDA 12.8)
- Hugging Face Transformers: Access to thousands of pre-trained models
- Accelerate: Simplified distributed training
- DeepSpeed: Advanced distributed training optimization
- Datasets: Easy access to datasets from Hugging Face Hub
- TensorBoard: Visualization and experiment tracking
- Weights & Biases: Experiment tracking and collaboration
- Loguru: Better Python logging
- JupyterLab: Interactive development environment
- NetworkX: Graph and network analysis
- PyTorch Geometric (optional): Graph neural networks (build from source for PyTorch 2.9.1)
- Gdown: Download files from Google Drive
- Peft: Parameter-efficient fine-tuning
- Click: Command-line interface
- Bitsandbytes: 8-bit optimizers and quantization (CUDA-accelerated)
- Python-dotenv: Load environment variables from .env file
- Evaluate: Metrics for model evaluation
- uv: Fast, reliable Python package management
- Docker: Isolated, reproducible development environment
- Checkpoint Loading: Easy model loading for testing and inference
- And other necessary packages for LLM training
- Python 3.12 or higher
- uv (Python package manager)
- CUDA 12.8 (for GPU support)
- Docker with GPU support (nvidia-docker2)
- NVIDIA drivers
For an isolated, reproducible environment:

```bash
# One-command setup
./docker_setup.sh

# Enter container and start working
./docker_shell.sh
```

See DOCKER.md for complete Docker documentation.
1. Install uv (if not already installed):

   ```bash
   pip install uv
   ```

2. Clone the repository:

   ```bash
   git clone https://github.com/quy-ng/GLMs.git
   cd GLMs
   ```

3. Install dependencies:

   ```bash
   uv sync
   ```
This will create a virtual environment and install all required dependencies:
- PyTorch with CUDA 12.8 support
- Hugging Face Transformers
- Hugging Face Datasets
- Accelerate
- DeepSpeed
- TensorBoard
- Weights & Biases (wandb)
- Loguru
- JupyterLab
- NetworkX
- scikit-learn
- And all their dependencies
For VS Code users, select the Python interpreter:

- Open VS Code: `code .`
- Press `Ctrl+Shift+P` → "Python: Select Interpreter"
- Choose `.venv/bin/python`

See .vscode/README.md if you have issues selecting the interpreter.
If you need Graph Neural Networks (PyG), install from source:

```bash
# Run directly with bash (activates venv internally)
bash scripts/install_pyg_from_source.sh

# Verify installation
uv run python scripts/verify_pyg.py
```

This builds PyG and core dependencies (pyg-lib, torch-scatter, torch-sparse, torch-cluster) for PyTorch 2.9.1 + CUDA 12.8. torch-spline-conv is skipped by default (set `export SKIP_SPLINE_CONV=0` to include it if needed).

Features:

- Saves .whl files to `pyg_build_artifacts/` for reuse
- Comprehensive verification with functionality tests
- Can reuse wheels on other machines (see PYTORCH_GEOMETRIC.md)

Note: Building from source takes 15-30 minutes; reusing saved wheels takes ~1 minute.
First, verify that all dependencies are properly installed:

```bash
uv run python scripts/verify_setup.py
```

This will test all libraries, check hardware capabilities (BF16/FP16 support), and provide recommendations.
The project supports multi-GPU distributed training with Accelerate and DeepSpeed:

For 2 GPUs (default):

```bash
uv run accelerate launch --config_file accelerate_ds_config.yaml scripts/run.py --train_config configs/config_pretrain.json
```

For 4 GPUs:

```bash
uv run accelerate launch --config_file accelerate_ds_config_4gpu.yaml scripts/run.py --train_config configs/config_pretrain.json
```

For 8 GPUs:

```bash
uv run accelerate launch --config_file accelerate_ds_config_8gpu.yaml scripts/run.py --train_config configs/config_pretrain.json
```

For a quick test:

```bash
uv run accelerate launch --config_file accelerate_ds_config.yaml scripts/run.py --train_config configs/config_quick_test.json
```

Compare DeepSpeed vs. non-DeepSpeed:

```bash
# With DeepSpeed (more memory efficient)
uv run accelerate launch --config_file accelerate_ds_config.yaml scripts/run.py --train_config configs/config_quick_test.json

# Without DeepSpeed (standard DDP, slightly faster)
uv run accelerate launch --config_file accelerate_config_no_deepspeed.yaml scripts/run.py --train_config configs/config_quick_test.json
```

See ACCELERATE_CONFIGS.md for detailed configuration options, multi-GPU setup, and a DeepSpeed vs. non-DeepSpeed comparison.
```bash
uv run python main.py
uv run python scripts/train_example.py
```

This example demonstrates:
- Loading a dataset from Hugging Face
- Loading a pre-trained BERT model
- Generating embeddings
- Using Accelerate for device management
```bash
uv run jupyter lab
```

This launches JupyterLab for interactive development and experimentation.
After starting a training run, you can monitor training progress in real-time using TensorBoard.
1. Start TensorBoard:

```bash
uv run tensorboard --logdir ./logs
```

2. Open in browser:

Navigate to http://localhost:6006 to view:
- Training and validation loss curves
- Learning rate schedule
- GPU utilization
- Custom metrics
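The training script writes these logs for you, but custom scalars can be added to the same `--logdir` by hand. A minimal sketch (the `manual_demo` directory and metric names are illustrative):

```python
# Hedged sketch: write custom scalars into the TensorBoard log directory.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="./logs/manual_demo")
for step in range(100):
    loss = 1.0 / (step + 1)              # stand-in for a real training loss
    writer.add_scalar("train/loss", loss, step)
    writer.add_scalar("train/lr", 1e-4, step)
writer.close()
```

Pointing `tensorboard --logdir ./logs` at the parent directory then shows this run alongside the real ones.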
3. Monitor a specific run:

```bash
# For a specific run directory
uv run tensorboard --logdir ./logs/run_20260113_120000

# To use a different port
uv run tensorboard --logdir ./logs --port 6007
```

4. Access remotely (Docker/SSH):

```bash
# Forward port when connecting via SSH
ssh -L 6006:localhost:6006 user@remote-server

# Or bind to all interfaces (use with caution)
uv run tensorboard --logdir ./logs --bind_all
```

The training script automatically logs to ./logs by default (configured in the training config JSON files).
For detailed TensorBoard usage, troubleshooting, and advanced features, see TENSORBOARD.md.
Weights & Biases (wandb) provides cloud-based experiment tracking and team collaboration. It's enabled by default in all training configs.
1. First-time setup (one time only):

```bash
uv run wandb login
```

Follow the prompts to authenticate (or create a free account at https://wandb.ai).
2. Run training:
Once logged in, wandb automatically tracks your experiments:
```bash
uv run accelerate launch --config_file accelerate_ds_config.yaml scripts/run.py --train_config configs/config_quick_test.json
```

The script will output a wandb URL where you can view your experiment in real time from anywhere.
3. Disable wandb (optional):
To turn off wandb, edit your config and set use_wandb: false:
```json
{
  "logging_config": {
    "use_wandb": false,
    ...
  }
}
```

Key features:
- Real-time metrics from any device
- Automatic hyperparameter tracking
- Team collaboration and sharing
- Experiment comparison
- Cloud storage of all runs
- Loss graphs, learning rate curves, and all metrics visualized
For detailed wandb setup, configuration options, and best practices, see WANDB.md.
Use both TensorBoard and wandb: Both are enabled by default! Use TensorBoard for local visualization and wandb for cloud monitoring.
After training, you can load saved model checkpoints for evaluation, testing, or interactive inference.
1. Test Evaluation:
Evaluate a checkpoint on the test dataset:
```bash
uv run python scripts/load_checkpoint.py \
  --checkpoint_dir ./outputs/checkpoint-1000 \
  --mode test \
  --test_config configs/config_quick_test.json
```

This will show test loss, perplexity, and accuracy metrics with a SOTA comparison.
2. Interactive Mode:
Run an interactive session for masked language modeling:
```bash
uv run python scripts/load_checkpoint.py \
  --checkpoint_dir ./outputs/checkpoint-1000 \
  --mode interactive
```

Type text with [MASK] tokens to see predictions:

```
Enter text: The capital of France is [MASK].
> Predictions: Paris (82%), Lyon (4%), Marseille (2%), ...
```
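The same masked-prediction step can be reproduced in a few lines with the Transformers fill-mask pipeline. A hedged sketch, using stock `bert-base-uncased` instead of a trained checkpoint:

```python
# Hedged sketch: top-k masked-token predictions via the fill-mask pipeline.
# bert-base-uncased stands in for ./outputs/checkpoint-1000 here.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The capital of France is [MASK].", top_k=3):
    print(f"{pred['token_str']}: {pred['score']:.1%}")
```

To try a local checkpoint instead, point `model=` at the checkpoint directory (it must contain both model weights and tokenizer files).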
3. Inference Mode:
Run inference on specific text:
```bash
uv run python scripts/load_checkpoint.py \
  --checkpoint_dir ./outputs/checkpoint-1000 \
  --mode inference \
  --text "The cat [MASK] on the mat."
```

For the complete checkpoint usage guide, examples, and troubleshooting, see CHECKPOINT_USAGE.md.
```bash
# Activate the virtual environment
source .venv/bin/activate   # On Linux/Mac
# or
.venv\Scripts\activate      # On Windows

# Run scripts
python main.py
python scripts/train_example.py
```

```
GLMs/
├── accelerate_ds_config.yaml            # Accelerate + DeepSpeed configuration (2 GPUs)
├── accelerate_config_no_deepspeed.yaml  # Standard DDP configuration (no DeepSpeed)
├── configs/                             # Training configuration files
│   ├── README.md                        # Config documentation
│   ├── config_pretrain.json             # Full pretraining configuration
│   └── config_quick_test.json           # Quick test configuration
├── scripts/
│   ├── run.py                           # Main training script
│   ├── load_checkpoint.py               # Checkpoint loading and inference
│   ├── train_example.py                 # Example training script
│   └── verify_setup.py                  # Setup verification script
├── src/
│   └── glms/                            # Source code package
├── logs/                                # Training logs (created automatically)
├── outputs/                             # Model checkpoints (created automatically)
├── main.py                              # Main entry point
├── pyproject.toml                       # Project configuration and dependencies
├── README.md                            # This file (main documentation)
├── QUICKSTART.md                        # Quick start guide
├── CHECKPOINT_USAGE.md                  # Checkpoint loading guide
├── TENSORBOARD.md                       # TensorBoard guide
├── WANDB.md                             # Weights & Biases guide
├── DOCKER.md                            # Docker usage guide
├── TROUBLESHOOTING.md                   # Common issues and solutions
├── ACCELERATE_CONFIGS.md                # Multi-GPU and DeepSpeed config guide
└── .gitignore                           # Git ignore rules
```
All dependencies are managed in pyproject.toml:
Core ML Libraries:
- torch: PyTorch deep learning framework (CUDA 12.8)
- transformers: Hugging Face Transformers library
- datasets: Hugging Face Datasets library
- accelerate: Training acceleration library
- deepspeed: Advanced distributed training
- scikit-learn: Machine learning utilities
Experiment Tracking & Logging:
- tensorboard: Visualization and tracking
- wandb: Weights & Biases experiment tracking
- loguru: Better Python logging
Development Tools:
- jupyterlab: Interactive development environment
- networkx: Graph and network analysis
```bash
uv add package-name     # Add a dependency
uv remove package-name  # Remove a dependency
uv sync --upgrade       # Upgrade all dependencies
uv tree                 # Show the dependency tree
```

The project uses two types of configuration files:
1. Accelerate configuration (`accelerate_ds_config.yaml`):
   - Controls distributed training settings
   - DeepSpeed ZeRO stage configuration
   - Mixed precision settings
   - Multi-GPU/multi-node setup

2. Training configuration (`configs/config_*.json`):
   - Model architecture settings
   - Training hyperparameters (learning rate, batch size, etc.)
   - Dataset configuration
   - Logging and checkpoint settings
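A hedged sketch of how a script might consume the JSON side; apart from `logging_config`/`use_wandb`, which appears earlier in this README, the key names are made up for illustration:

```python
# Hedged sketch: write a toy training config, then read it back the way a
# script like run.py might. "learning_rate" is an illustrative key, not
# the project's real schema.
import json
import os
import tempfile

cfg = {"logging_config": {"use_wandb": False}, "learning_rate": 1e-4}
path = os.path.join(tempfile.mkdtemp(), "config_demo.json")
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)

with open(path) as f:
    config = json.load(f)
use_wandb = config.get("logging_config", {}).get("use_wandb", True)
print(use_wandb)  # False
```

Defaulting missing keys with `.get(...)` keeps older configs working when new options are added.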
1. Create or modify a training configuration in `configs/`
2. Adjust `accelerate_ds_config.yaml` for your hardware setup
3. Run training: `accelerate launch --config_file accelerate_ds_config.yaml scripts/run.py --train_config configs/your_config.json`
4. Monitor training with TensorBoard: `tensorboard --logdir ./logs`
5. Track experiments with Weights & Biases (set `use_wandb: true` in the config)
- Use `config_quick_test.json` for quick validation
- Leverage Accelerate or DeepSpeed for distributed training
- Track experiments with TensorBoard or Weights & Biases
- Use Loguru for better logging (logs are saved to the `logs/` directory)
- Checkpoints are saved to the `outputs/` directory
This project is open source and available under the MIT License.
Contributions are welcome! Please feel free to submit a Pull Request.