A PyTorch-based project for training language models using Hugging Face Transformers, with uv for modern Python dependency management.
🚀 New to this project? Check out QUICKSTART.md for a step-by-step guide!
🐳 Want to use Docker? See DOCKER.md for isolated development environment!
📊 Need to monitor training? See TENSORBOARD.md and WANDB.md for visualization guides!
💾 Want to use trained models? See CHECKPOINT_USAGE.md for loading checkpoints and running inference!
🔷 Need PyTorch Geometric? See PYTORCH_GEOMETRIC.md for installing PyG with PyTorch 2.9.1 + CUDA 12.8!
- PyTorch: State-of-the-art deep learning framework (CUDA 12.8)
- Hugging Face Transformers: Access to thousands of pre-trained models
- Accelerate: Simplified distributed training
- DeepSpeed: Advanced distributed training optimization
- Datasets: Easy access to datasets from Hugging Face Hub
- TensorBoard: Visualization and experiment tracking
- Weights & Biases: Experiment tracking and collaboration
- Loguru: Better Python logging
- JupyterLab: Interactive development environment
- NetworkX: Graph and network analysis
- PyTorch Geometric (optional): Graph neural networks (build from source for PyTorch 2.9.1)
- Gdown: Download files from Google Drive
- Peft: Parameter-efficient fine-tuning
- Click: Command-line interface
- Bitsandbytes: 8-bit optimizers and quantization (CUDA-accelerated)
- Python-dotenv: Load environment variables from .env file
- Evaluate: Metrics for model evaluation
- uv: Fast, reliable Python package management
- Docker: Isolated, reproducible development environment
- Checkpoint Loading: Easy model loading for testing and inference
- And other necessary packages for LLM training
- Python 3.12 or higher
- uv (Python package manager)
- CUDA 12.8 (for GPU support)
- Docker with GPU support (nvidia-docker2)
- NVIDIA drivers
For an isolated, reproducible environment:

```bash
# One-command setup
./docker_setup.sh

# Enter container and start working
./docker_shell.sh
```

See DOCKER.md for complete Docker documentation.
1. Install uv (if not already installed):

   ```bash
   pip install uv
   ```

2. Clone the repository:

   ```bash
   git clone https://github.com/quy-ng/GLMs.git
   cd GLMs
   ```

3. Install dependencies:

   ```bash
   uv sync
   ```
This will create a virtual environment and install all required dependencies:
- PyTorch with CUDA 12.8 support
- Hugging Face Transformers
- Hugging Face Datasets
- Accelerate
- DeepSpeed
- TensorBoard
- Weights & Biases (wandb)
- Loguru
- JupyterLab
- NetworkX
- scikit-learn
- And all their dependencies
For VS Code users, select the Python interpreter:

- Open VS Code: `code .`
- Press `Ctrl+Shift+P` → "Python: Select Interpreter"
- Choose `.venv/bin/python`

See .vscode/README.md if you have issues selecting the interpreter.
If you need Graph Neural Networks (PyG), install from source:

```bash
# Run directly with bash (activates venv internally)
bash scripts/install_pyg_from_source.sh

# Verify installation
uv run python scripts/verify_pyg.py
```

This builds PyG and core dependencies (pyg-lib, torch-scatter, torch-sparse, torch-cluster) for PyTorch 2.9.1 + CUDA 12.8. torch-spline-conv is skipped by default (set `export SKIP_SPLINE_CONV=0` to include it if needed).

Features:

- Saves .whl files to `pyg_build_artifacts/` for reuse
- Comprehensive verification with functionality tests
- Can reuse wheels on other machines (see PYTORCH_GEOMETRIC.md)

Note: Building from source takes 15-30 minutes; reusing saved wheels takes ~1 minute.
First, verify that all dependencies are properly installed:

```bash
uv run python scripts/verify_setup.py
```

This will test all libraries, check hardware capabilities (BF16/FP16 support), and provide recommendations.
The project supports multi-GPU distributed training with Accelerate and DeepSpeed:

For 2 GPUs (default):

```bash
uv run accelerate launch --config_file accelerate_ds_config.yaml scripts/run.py --train_config configs/config_pretrain.json
```

For 4 GPUs:

```bash
uv run accelerate launch --config_file accelerate_ds_config_4gpu.yaml scripts/run.py --train_config configs/config_pretrain.json
```

For 8 GPUs:

```bash
uv run accelerate launch --config_file accelerate_ds_config_8gpu.yaml scripts/run.py --train_config configs/config_pretrain.json
```

For a quick test:

```bash
uv run accelerate launch --config_file accelerate_ds_config.yaml scripts/run.py --train_config configs/config_quick_test.json
```

Compare DeepSpeed vs. non-DeepSpeed:

```bash
# With DeepSpeed (more memory efficient)
uv run accelerate launch --config_file accelerate_ds_config.yaml scripts/run.py --train_config configs/config_quick_test.json

# Without DeepSpeed (standard DDP, slightly faster)
uv run accelerate launch --config_file accelerate_config_no_deepspeed.yaml scripts/run.py --train_config configs/config_quick_test.json
```

See ACCELERATE_CONFIGS.md for detailed configuration options, multi-GPU setup, and a DeepSpeed vs. non-DeepSpeed comparison.
```bash
uv run python main.py
uv run python scripts/train_example.py
```

This example demonstrates:
- Loading a dataset from Hugging Face
- Loading a pre-trained BERT model
- Generating embeddings
- Using Accelerate for device management
```bash
uv run jupyter lab
```

This launches JupyterLab for interactive development and experimentation.
After starting a training run, you can monitor training progress in real-time using TensorBoard.
1. Start TensorBoard:

```bash
uv run tensorboard --logdir ./logs
```

2. Open in browser:

Navigate to http://localhost:6006 to view:
- Training and validation loss curves
- Learning rate schedule
- GPU utilization
- Custom metrics
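The training script writes these logs for you, but custom scalars can be added to the same `--logdir` by hand. A minimal sketch (the `manual_demo` directory and metric names are illustrative):

```python
# Hedged sketch: write custom scalars into the TensorBoard log directory.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="./logs/manual_demo")
for step in range(100):
    loss = 1.0 / (step + 1)              # stand-in for a real training loss
    writer.add_scalar("train/loss", loss, step)
    writer.add_scalar("train/lr", 1e-4, step)
writer.close()
```

Pointing `tensorboard --logdir ./logs` at the parent directory then shows this run alongside the real ones.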
3. Monitor a specific run:

```bash
# For a specific run directory
uv run tensorboard --logdir ./logs/run_20260113_120000

# To use a different port
uv run tensorboard --logdir ./logs --port 6007
```

4. Access remotely (Docker/SSH):

```bash
# Forward port when connecting via SSH
ssh -L 6006:localhost:6006 user@remote-server

# Or bind to all interfaces (use with caution)
uv run tensorboard --logdir ./logs --bind_all
```

The training script automatically logs to ./logs by default (configured in the training config JSON files).
For detailed TensorBoard usage, troubleshooting, and advanced features, see TENSORBOARD.md.
Weights & Biases (wandb) provides cloud-based experiment tracking and team collaboration. It's enabled by default in all training configs.
1. First-time setup (one time only):

```bash
uv run wandb login
```

Follow the prompts to authenticate (or create a free account at https://wandb.ai).
2. Run training:
Once logged in, wandb automatically tracks your experiments:
```bash
uv run accelerate launch --config_file accelerate_ds_config.yaml scripts/run.py --train_config configs/config_quick_test.json
```

The script will output a wandb URL where you can view your experiment in real time from anywhere.
3. Disable wandb (optional):
To turn off wandb, edit your config and set use_wandb: false:
```json
{
  "logging_config": {
    "use_wandb": false,
    ...
  }
}
```

Key features:
- Real-time metrics from any device
- Automatic hyperparameter tracking
- Team collaboration and sharing
- Experiment comparison
- Cloud storage of all runs
- Loss graphs, learning rate curves, and all metrics visualized
For detailed wandb setup, configuration options, and best practices, see WANDB.md.
Use both TensorBoard and wandb: Both are enabled by default! Use TensorBoard for local visualization and wandb for cloud monitoring.
After training, you can load saved model checkpoints for evaluation, testing, or interactive inference.
1. Test Evaluation:
Evaluate a checkpoint on the test dataset:
```bash
uv run python scripts/load_checkpoint.py \
  --checkpoint_dir ./outputs/checkpoint-1000 \
  --mode test \
  --test_config configs/config_quick_test.json
```

This will show test loss, perplexity, and accuracy metrics with a SOTA comparison.
2. Interactive Mode:
Run an interactive session for masked language modeling:
```bash
uv run python scripts/load_checkpoint.py \
  --checkpoint_dir ./outputs/checkpoint-1000 \
  --mode interactive
```

Type text with [MASK] tokens to see predictions:

```
Enter text: The capital of France is [MASK].
> Predictions: Paris (82%), Lyon (4%), Marseille (2%), ...
```
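The same masked-prediction step can be reproduced in a few lines with the Transformers fill-mask pipeline. A hedged sketch, using stock `bert-base-uncased` instead of a trained checkpoint:

```python
# Hedged sketch: top-k masked-token predictions via the fill-mask pipeline.
# bert-base-uncased stands in for ./outputs/checkpoint-1000 here.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The capital of France is [MASK].", top_k=3):
    print(f"{pred['token_str']}: {pred['score']:.1%}")
```

To try a local checkpoint instead, point `model=` at the checkpoint directory (it must contain both model weights and tokenizer files).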
3. Inference Mode:
Run inference on specific text:
```bash
uv run python scripts/load_checkpoint.py \
  --checkpoint_dir ./outputs/checkpoint-1000 \
  --mode inference \
  --text "The cat [MASK] on the mat."
```

For the complete checkpoint usage guide, examples, and troubleshooting, see CHECKPOINT_USAGE.md.
```bash
# Activate the virtual environment
source .venv/bin/activate   # On Linux/Mac
# or
.venv\Scripts\activate      # On Windows

# Run scripts
python main.py
python scripts/train_example.py
```

```
GLMs/
├── accelerate_ds_config.yaml            # Accelerate + DeepSpeed configuration (2 GPUs)
├── accelerate_config_no_deepspeed.yaml  # Standard DDP configuration (no DeepSpeed)
├── configs/                             # Training configuration files
│   ├── README.md                        # Config documentation
│   ├── config_pretrain.json             # Full pretraining configuration
│   └── config_quick_test.json           # Quick test configuration
├── scripts/
│   ├── run.py                           # Main training script
│   ├── load_checkpoint.py               # Checkpoint loading and inference
│   ├── train_example.py                 # Example training script
│   └── verify_setup.py                  # Setup verification script
├── src/
│   └── glms/                            # Source code package
├── logs/                                # Training logs (created automatically)
├── outputs/                             # Model checkpoints (created automatically)
├── main.py                              # Main entry point
├── pyproject.toml                       # Project configuration and dependencies
├── README.md                            # This file (main documentation)
├── QUICKSTART.md                        # Quick start guide
├── CHECKPOINT_USAGE.md                  # Checkpoint loading guide
├── TENSORBOARD.md                       # TensorBoard guide
├── WANDB.md                             # Weights & Biases guide
├── DOCKER.md                            # Docker usage guide
├── TROUBLESHOOTING.md                   # Common issues and solutions
├── ACCELERATE_CONFIGS.md                # Multi-GPU and DeepSpeed config guide
└── .gitignore                           # Git ignore rules
```
All dependencies are managed in pyproject.toml:
Core ML Libraries:
- torch: PyTorch deep learning framework (CUDA 12.8)
- transformers: Hugging Face Transformers library
- datasets: Hugging Face Datasets library
- accelerate: Training acceleration library
- deepspeed: Advanced distributed training
- scikit-learn: Machine learning utilities
Experiment Tracking & Logging:
- tensorboard: Visualization and tracking
- wandb: Weights & Biases experiment tracking
- loguru: Better Python logging
Development Tools:
- jupyterlab: Interactive development environment
- networkx: Graph and network analysis
```bash
uv add package-name     # Add a dependency
uv remove package-name  # Remove a dependency
uv sync --upgrade       # Upgrade all dependencies
uv tree                 # Show the dependency tree
```

The project uses two types of configuration files:
1. Accelerate configuration (`accelerate_ds_config.yaml`):
   - Controls distributed training settings
   - DeepSpeed ZeRO stage configuration
   - Mixed precision settings
   - Multi-GPU/multi-node setup

2. Training configuration (`configs/config_*.json`):
   - Model architecture settings
   - Training hyperparameters (learning rate, batch size, etc.)
   - Dataset configuration
   - Logging and checkpoint settings
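A hedged sketch of how a script might consume the JSON side; apart from `logging_config`/`use_wandb`, which appears earlier in this README, the key names are made up for illustration:

```python
# Hedged sketch: write a toy training config, then read it back the way a
# script like run.py might. "learning_rate" is an illustrative key, not
# the project's real schema.
import json
import os
import tempfile

cfg = {"logging_config": {"use_wandb": False}, "learning_rate": 1e-4}
path = os.path.join(tempfile.mkdtemp(), "config_demo.json")
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)

with open(path) as f:
    config = json.load(f)
use_wandb = config.get("logging_config", {}).get("use_wandb", True)
print(use_wandb)  # False
```

Defaulting missing keys with `.get(...)` keeps older configs working when new options are added.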
1. Create or modify a training configuration in `configs/`
2. Adjust `accelerate_ds_config.yaml` for your hardware setup
3. Run training: `accelerate launch --config_file accelerate_ds_config.yaml scripts/run.py --train_config configs/your_config.json`
4. Monitor training with TensorBoard: `tensorboard --logdir ./logs`
5. Track experiments with Weights & Biases (set `use_wandb: true` in the config)
- Use `config_quick_test.json` for quick validation
- Leverage Accelerate or DeepSpeed for distributed training
- Track experiments with TensorBoard or Weights & Biases
- Use Loguru for better logging (logs are saved to the `logs/` directory)
- Checkpoints are saved to the `outputs/` directory
This project is open source and available under the MIT License.
Contributions are welcome! Please feel free to submit a Pull Request.