
LLM Project Bootstrap (vibe coding and best practices for researchers)

A PyTorch-based project for training language models using Hugging Face Transformers, with uv for modern Python dependency management.

🚀 New to this project? Check out QUICKSTART.md for a step-by-step guide!

🐳 Want to use Docker? See DOCKER.md for an isolated development environment!

📊 Need to monitor training? See TENSORBOARD.md and WANDB.md for visualization guides!

💾 Want to use trained models? See CHECKPOINT_USAGE.md for loading checkpoints and running inference!

🔷 Need PyTorch Geometric? See PYTORCH_GEOMETRIC.md for installing PyG with PyTorch 2.9.1 + CUDA 12.8!

⚠️ Having issues? See TROUBLESHOOTING.md for common problems and solutions!

Features

  • PyTorch: State-of-the-art deep learning framework (CUDA 12.8)
  • Hugging Face Transformers: Access to thousands of pre-trained models
  • Accelerate: Simplified distributed training
  • DeepSpeed: Advanced distributed training optimization
  • Datasets: Easy access to datasets from Hugging Face Hub
  • TensorBoard: Visualization and experiment tracking
  • Weights & Biases: Experiment tracking and collaboration
  • Loguru: Better Python logging
  • JupyterLab: Interactive development environment
  • NetworkX: Graph and network analysis
  • PyTorch Geometric (optional): Graph neural networks (build from source for PyTorch 2.9.1)
  • gdown: Download files from Google Drive
  • PEFT: Parameter-efficient fine-tuning
  • Click: Command-line interface toolkit
  • bitsandbytes: 8-bit optimizers and quantization for CUDA GPUs
  • python-dotenv: Load environment variables from a .env file
  • Evaluate: Hugging Face library for model evaluation metrics
  • uv: Fast, reliable Python package management
  • Docker: Isolated, reproducible development environment
  • Checkpoint Loading: Easy model loading for testing and inference
  • And other necessary packages for LLM training

Prerequisites

Option 1: Local Installation

  • Python 3.12 or higher
  • uv (Python package manager)
  • CUDA 12.8 (for GPU support)

Option 2: Docker (Recommended)

  • Docker with GPU support (nvidia-docker2)
  • NVIDIA drivers

Installation

Docker Installation (Recommended)

For an isolated, reproducible environment:

# One-command setup
./docker_setup.sh

# Enter container and start working
./docker_shell.sh

See DOCKER.md for complete Docker documentation.

Local Installation

  1. Install uv (if not already installed):

    pip install uv
  2. Clone the repository:

    git clone https://github.com/quy-ng/GLMs.git
    cd GLMs
  3. Install dependencies:

    uv sync

    This will create a virtual environment and install all required dependencies:

    • PyTorch with CUDA 12.8 support
    • Hugging Face Transformers
    • Hugging Face Datasets
    • Accelerate
    • DeepSpeed
    • TensorBoard
    • Weights & Biases (wandb)
    • Loguru
    • JupyterLab
    • NetworkX
    • scikit-learn
    • And all their dependencies
  4. For VS Code users: Select Python interpreter

    • Open VS Code: code .
    • Press Ctrl+Shift+P → "Python: Select Interpreter"
    • Choose .venv/bin/python

    See .vscode/README.md if you have issues selecting the interpreter.

Optional: Install PyTorch Geometric

If you need graph neural networks, install PyTorch Geometric (PyG) from source:

# Run directly with bash (activates venv internally)
bash scripts/install_pyg_from_source.sh

# Verify installation
uv run python scripts/verify_pyg.py

This builds PyG and its core dependencies (pyg-lib, torch-scatter, torch-sparse, torch-cluster) for PyTorch 2.9.1 + CUDA 12.8. torch-spline-conv is skipped by default; set export SKIP_SPLINE_CONV=0 before running the script to include it.

Features:

  • Saves .whl files to pyg_build_artifacts/ for reuse
  • Comprehensive verification with functionality tests
  • Can reuse wheels on other machines (see PYTORCH_GEOMETRIC.md)

Note: Building from source takes 15-30 minutes. Reusing saved wheels takes ~1 minute.

Usage

Verify the setup

First, verify that all dependencies are properly installed:

uv run python scripts/verify_setup.py

This will test all libraries, check hardware capabilities (BF16/FP16 support), and provide recommendations.
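The hardware portion of those checks boils down to a few torch.cuda queries; a minimal sketch of the idea (the actual checks in scripts/verify_setup.py may differ):

```python
def hardware_summary():
    """Report CUDA and BF16 capability, degrading gracefully without PyTorch."""
    summary = {"cuda_available": None, "bf16_supported": None}
    try:
        import torch  # deferred so the sketch still runs where PyTorch is absent
    except ImportError:
        return summary
    summary["cuda_available"] = torch.cuda.is_available()
    # BF16 support depends on the GPU architecture (Ampere or newer)
    summary["bf16_supported"] = (
        torch.cuda.is_bf16_supported() if summary["cuda_available"] else False
    )
    return summary

info = hardware_summary()
```

On an unsupported GPU, verify_setup.py's recommendation to prefer FP16 mixed precision follows directly from the bf16_supported flag.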

Run training with Accelerate + DeepSpeed

The project supports multi-GPU distributed training with Accelerate and DeepSpeed:

For 2 GPUs (default):

uv run accelerate launch --config_file accelerate_ds_config.yaml scripts/run.py --train_config configs/config_pretrain.json

For 4 GPUs:

uv run accelerate launch --config_file accelerate_ds_config_4gpu.yaml scripts/run.py --train_config configs/config_pretrain.json

For 8 GPUs:

uv run accelerate launch --config_file accelerate_ds_config_8gpu.yaml scripts/run.py --train_config configs/config_pretrain.json

For quick test:

uv run accelerate launch --config_file accelerate_ds_config.yaml scripts/run.py --train_config configs/config_quick_test.json

Compare DeepSpeed vs Non-DeepSpeed:

# With DeepSpeed (more memory efficient)
uv run accelerate launch --config_file accelerate_ds_config.yaml scripts/run.py --train_config configs/config_quick_test.json

# Without DeepSpeed (standard DDP, slightly faster)
uv run accelerate launch --config_file accelerate_config_no_deepspeed.yaml scripts/run.py --train_config configs/config_quick_test.json

See ACCELERATE_CONFIGS.md for detailed configuration options, multi-GPU setup, and DeepSpeed vs non-DeepSpeed comparison.

Run the main entry point

uv run python main.py

Run the example training script

uv run python scripts/train_example.py

This example demonstrates:

  • Loading a dataset from Hugging Face
  • Loading a pre-trained BERT model
  • Generating embeddings
  • Using Accelerate for device management
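The embedding workflow above can be sketched in a few lines, assuming a standard Hugging Face checkpoint (bert-base-uncased here is illustrative; scripts/train_example.py is the authoritative version):

```python
def embed_batch(texts, model_name="bert-base-uncased"):
    """Sketch: mean-pooled BERT embeddings with Accelerate handling devices."""
    from accelerate import Accelerator
    from transformers import AutoModel, AutoTokenizer
    import torch

    accelerator = Accelerator()  # picks GPU/CPU automatically
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = accelerator.prepare(AutoModel.from_pretrained(model_name))
    batch = tokenizer(texts, padding=True, return_tensors="pt").to(accelerator.device)
    with torch.no_grad():
        # Mean-pool the final hidden states into one vector per input text
        return model(**batch).last_hidden_state.mean(dim=1)
```

Because Accelerator owns device placement, the same function runs unchanged on CPU, a single GPU, or under `accelerate launch`.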

Start JupyterLab

uv run jupyter lab

This launches JupyterLab for interactive development and experimentation.

Monitor Training with TensorBoard

After starting a training run, you can monitor training progress in real-time using TensorBoard.

1. Start TensorBoard:

uv run tensorboard --logdir ./logs

2. Open in browser:

Navigate to http://localhost:6006 to view:

  • Training and validation loss curves
  • Learning rate schedule
  • GPU utilization
  • Custom metrics

3. Monitor specific run:

# For a specific run directory
uv run tensorboard --logdir ./logs/run_20260113_120000

# To use a different port
uv run tensorboard --logdir ./logs --port 6007

4. Access remotely (Docker/SSH):

# Forward port when connecting via SSH
ssh -L 6006:localhost:6006 user@remote-server

# Or bind to all interfaces (use with caution)
uv run tensorboard --logdir ./logs --bind_all

The training script automatically logs to ./logs by default (configured in training config JSON files).

For detailed TensorBoard usage, troubleshooting, and advanced features, see TENSORBOARD.md.

Monitor Training with Weights & Biases

Weights & Biases (wandb) provides cloud-based experiment tracking and team collaboration. It's enabled by default in all training configs.

1. First-time setup (one-time only):

uv run wandb login

Follow the prompts to authenticate (or create a free account at https://wandb.ai).

2. Run training:

Once logged in, wandb automatically tracks your experiments:

uv run accelerate launch --config_file accelerate_ds_config.yaml scripts/run.py --train_config configs/config_quick_test.json

The script will output a wandb URL to view your experiment in real-time from anywhere.

3. Disable wandb (optional):

To turn off wandb, edit your config and set use_wandb: false:

{
  "logging_config": {
    "use_wandb": false,
    ...
  }
}
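If you toggle this often, the edit is easy to script. A small helper assuming the logging_config layout shown above (the helper name is ours, not part of the project):

```python
import json
from pathlib import Path

def disable_wandb(config_path):
    """Set logging_config.use_wandb to false in a training config JSON."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config.setdefault("logging_config", {})["use_wandb"] = False
    path.write_text(json.dumps(config, indent=2))
    return config
```

Running `disable_wandb("configs/config_quick_test.json")` rewrites the file in place and leaves every other setting untouched.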

Key features:

  • Real-time metrics from any device
  • Automatic hyperparameter tracking
  • Team collaboration and sharing
  • Experiment comparison
  • Cloud storage of all runs
  • Loss graphs, learning rate curves, and all metrics visualized

For detailed wandb setup, configuration options, and best practices, see WANDB.md.

Use both TensorBoard and wandb: Both are enabled by default! Use TensorBoard for local visualization and wandb for cloud monitoring.

Load Checkpoints for Testing and Inference

After training, you can load saved model checkpoints for evaluation, testing, or interactive inference.

1. Test Evaluation:

Evaluate a checkpoint on the test dataset:

uv run python scripts/load_checkpoint.py \
    --checkpoint_dir ./outputs/checkpoint-1000 \
    --mode test \
    --test_config configs/config_quick_test.json

This reports test loss, perplexity, and accuracy, alongside reference (SOTA) numbers for comparison.

2. Interactive Mode:

Run an interactive session for masked language modeling:

uv run python scripts/load_checkpoint.py \
    --checkpoint_dir ./outputs/checkpoint-1000 \
    --mode interactive

Type text with [MASK] tokens to see predictions:

Enter text: The capital of France is [MASK].
> Predictions: Paris (82%), Lyon (4%), Marseille (2%), ...
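The percentages come from a softmax over the model's logits at the [MASK] position; a self-contained sketch of that ranking step, using made-up logit values:

```python
import math

def top_predictions(logits, k=3):
    """Rank candidate tokens for a [MASK] slot by softmax probability."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return sorted(probs.items(), key=lambda kv: -kv[1])[:k]

# Hypothetical logits for "The capital of France is [MASK]."
logits = {"Paris": 9.1, "Lyon": 6.0, "Marseille": 5.3, "Berlin": 3.2}
ranked = top_predictions(logits)
```

The interactive mode does the same thing over the full tokenizer vocabulary rather than a handful of candidates.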

3. Inference Mode:

Run inference on specific text:

uv run python scripts/load_checkpoint.py \
    --checkpoint_dir ./outputs/checkpoint-1000 \
    --mode inference \
    --text "The cat [MASK] on the mat."

For complete checkpoint usage guide, examples, and troubleshooting, see CHECKPOINT_USAGE.md.
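Under the hood, a checkpoint saved with Transformers' save_pretrained can be reloaded in a few lines; a minimal sketch of the idea (scripts/load_checkpoint.py likely does more, e.g. device placement and metric computation):

```python
def load_for_inference(checkpoint_dir):
    """Sketch: reload a masked-LM checkpoint for inference.

    Assumes the checkpoint directory contains both model and tokenizer
    files, as produced by save_pretrained.
    """
    from transformers import AutoModelForMaskedLM, AutoTokenizer  # deferred import

    tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
    model = AutoModelForMaskedLM.from_pretrained(checkpoint_dir)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model
```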

Run with Python directly in the virtual environment

# Activate the virtual environment
source .venv/bin/activate  # On Linux/macOS
# or
.venv\Scripts\activate  # On Windows

# Run scripts
python main.py
python scripts/train_example.py

Project Structure

GLMs/
├── accelerate_ds_config.yaml      # Accelerate + DeepSpeed configuration (2 GPUs)
├── accelerate_config_no_deepspeed.yaml  # Standard DDP configuration (no DeepSpeed)
├── configs/                        # Training configuration files
│   ├── README.md                  # Config documentation
│   ├── config_pretrain.json       # Full pretraining configuration
│   └── config_quick_test.json     # Quick test configuration
├── scripts/
│   ├── run.py                     # Main training script
│   ├── load_checkpoint.py         # Checkpoint loading and inference
│   ├── train_example.py           # Example training script
│   └── verify_setup.py            # Setup verification script
├── src/
│   └── glms/                      # Source code package
├── logs/                          # Training logs (created automatically)
├── outputs/                       # Model checkpoints (created automatically)
├── main.py                        # Main entry point
├── pyproject.toml                 # Project configuration and dependencies
├── README.md                      # This file (main documentation)
├── QUICKSTART.md                  # Quick start guide
├── CHECKPOINT_USAGE.md            # Checkpoint loading guide
├── TENSORBOARD.md                 # TensorBoard guide
├── WANDB.md                       # Weights & Biases guide
├── DOCKER.md                      # Docker usage guide
├── TROUBLESHOOTING.md             # Common issues and solutions
├── ACCELERATE_CONFIGS.md          # Multi-GPU and DeepSpeed config guide
└── .gitignore                     # Git ignore rules

Dependencies

All dependencies are managed in pyproject.toml:

Core ML Libraries:

  • torch: PyTorch deep learning framework (CUDA 12.8)
  • transformers: Hugging Face Transformers library
  • datasets: Hugging Face Datasets library
  • accelerate: Simplified distributed training and device placement
  • deepspeed: Advanced distributed training
  • scikit-learn: Machine learning utilities

Experiment Tracking & Logging:

  • tensorboard: Visualization and tracking
  • wandb: Weights & Biases experiment tracking
  • loguru: Better Python logging

Development Tools:

  • jupyterlab: Interactive development environment
  • networkx: Graph and network analysis

Development

Add new dependencies

uv add package-name

Remove dependencies

uv remove package-name

Update dependencies

uv sync --upgrade

Show dependency tree

uv tree

Training Your Own Models

Configuration Files

The project uses two types of configuration files:

  1. Accelerate Configuration (accelerate_ds_config.yaml):

    • Controls distributed training settings
    • DeepSpeed ZeRO stage configuration
    • Mixed precision settings
    • Multi-GPU/multi-node setup
  2. Training Configuration (configs/config_*.json):

    • Model architecture settings
    • Training hyperparameters (learning rate, batch size, etc.)
    • Dataset configuration
    • Logging and checkpoint settings
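A training config of this shape might look like the fragment below. Apart from logging_config.use_wandb (shown earlier), the field names here are illustrative; consult configs/README.md and the shipped configs for the real schema.

```json
{
  "model_config": { "hidden_size": 768, "num_layers": 12 },
  "training_config": { "learning_rate": 5e-5, "batch_size": 32, "num_epochs": 3 },
  "logging_config": { "use_wandb": true, "log_dir": "./logs" }
}
```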

Training Workflow

  1. Create or modify a training configuration in configs/
  2. Adjust accelerate_ds_config.yaml for your hardware setup
  3. Run training with:
    uv run accelerate launch --config_file accelerate_ds_config.yaml scripts/run.py --train_config configs/your_config.json
  4. Monitor training with TensorBoard:
    uv run tensorboard --logdir ./logs
  5. Track experiments with Weights & Biases (enabled by default via use_wandb: true in the config)

Tips

  • Use config_quick_test.json for quick validation
  • Leverage Accelerate or DeepSpeed for distributed training
  • Track experiments with TensorBoard or Weights & Biases
  • Use Loguru for better logging (logs saved to logs/ directory)
  • Checkpoints are saved to outputs/ directory

License

This project is open source and available under the MIT License.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
