Note: All Python imports in this project use absolute imports relative to the `src/` directory. Scripts and modules should be run with `src/` as the working directory or as the Python path root. For example:

```
cd src
python cli/cli.py ...
```

If you encounter `ImportError`, ensure your working directory is `src/`.
This repository contains code for training a natural language to SQL (NL2SQL) transformer model with comprehensive TensorBoard logging.
Disclaimer: This project was originally built as a learning exercise and a proof of concept for research on relation-aware attention and pointer-generator mechanisms. The codebase has not been hardened for production use and has only minimal test coverage. It is provided as a reference for educational and experimental purposes.
This means that certain parts of the training pipeline may be brittle or incomplete. There is no formal support or guarantee of backwards compatibility, and the repository is maintained mainly for illustrative purposes. Feel free to fork it and adapt pieces for your own experiments.
1. Clone the repository:

   ```
   git clone https://github.com/Shaurya-Sethi/nl2sql-rat-pointer.git
   cd nl2sql-rat-pointer
   ```

2. Install dependencies:

   ```
   pip install -r requirements.txt
   ```

3. Ensure you have the required data files and model checkpoints.
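After installing, a quick smoke test can confirm the environment is usable. This assumes PyTorch is among the dependencies in requirements.txt (the checkpoints referenced above are `.pt` files):

```python
# Quick environment check. Assumes PyTorch is installed via requirements.txt.
import torch

print(f"PyTorch {torch.__version__}; CUDA available: {torch.cuda.is_available()}")
```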
The model uses a YAML configuration file for both model architecture and training parameters. TensorBoard logging is configured in the `logging` section of the config:
```yaml
logging:
  tensorboard_log_dir: "runs"   # Base directory for TensorBoard logs
  log_every_n_steps: 10         # Log training metrics every N steps
  log_grad_norm: true           # Whether to log gradient norms
  log_grad_histogram: false     # Whether to log parameter histograms (expensive)
  log_memory: true              # Whether to log memory usage
```
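For illustration, this section can be loaded with PyYAML and handed to a `SummaryWriter` roughly as follows; the project's real wiring lives in `src/utils/training.py`, so the names below are only a sketch:

```python
# Illustrative sketch: read the logging config and open a TensorBoard writer.
# The actual setup lives in src/utils/training.py; names here are assumptions.
import os
from datetime import datetime

import yaml
from torch.utils.tensorboard import SummaryWriter

with open("src/config.yaml") as f:
    log_cfg = yaml.safe_load(f)["logging"]

# One timestamped subdirectory per run (matching run names like runs/20230601-120000_sft)
run_dir = os.path.join(log_cfg["tensorboard_log_dir"],
                       datetime.now().strftime("%Y%m%d-%H%M%S") + "_sft")
writer = SummaryWriter(log_dir=run_dir)
```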
For pretraining:

```
python src/train.py --phase pretrain --config src/config.yaml
```

For supervised fine-tuning:
```
python src/train.py --phase sft --config src/config.yaml --pretrained_model path/to/pretrained_model.pt
```

TensorBoard is integrated for comprehensive metric logging during training.
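For reference, the flags used above map onto a standard argparse interface along these lines (a sketch of the CLI surface, not the actual contents of `src/train.py`):

```python
# Sketch of the documented CLI flags; the real parser in src/train.py may differ.
import argparse

parser = argparse.ArgumentParser(description="NL2SQL training entry point")
parser.add_argument("--phase", choices=["pretrain", "sft"], required=True,
                    help="Training phase: pretraining or supervised fine-tuning")
parser.add_argument("--config", required=True, help="Path to the YAML config file")
parser.add_argument("--pretrained_model", default=None,
                    help="Checkpoint to start from (used with --phase sft)")
args = parser.parse_args()
```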
The following metrics are logged (a sketch of the corresponding `SummaryWriter` calls follows the list):

Training metrics:

- Loss (per step and per epoch)
- Perplexity
- Token accuracy
- Learning rate
- Gradient norms
- Memory usage

Validation metrics:

- Loss
- Perplexity
- Token accuracy
- Throughput (tokens per second)

Optional histogram metrics (enabled via `log_grad_histogram`):

- Weight histograms
- Gradient histograms
- Per-layer gradient norms
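As a sketch of what the per-step scalar logging might look like with `torch.utils.tensorboard` (the tag names and helper below are illustrative, not the project's exact code from `src/utils/training.py`):

```python
# Illustrative per-step logging helper; actual tags and intervals come from
# src/utils/training.py and the config's log_every_n_steps setting.
import math

from torch.utils.tensorboard import SummaryWriter

def log_step_metrics(writer: SummaryWriter, loss: float, lr: float,
                     tokens_per_sec: float, step: int) -> None:
    writer.add_scalar("train/loss", loss, step)
    writer.add_scalar("train/perplexity", math.exp(loss), step)  # exp of CE loss
    writer.add_scalar("train/learning_rate", lr, step)
    writer.add_scalar("train/tokens_per_sec", tokens_per_sec, step)
```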
To view training metrics locally:

```
# Start TensorBoard server
tensorboard --logdir=runs

# Access TensorBoard in your browser at http://localhost:6006
```

You can also compare multiple runs:
```
tensorboard --logdir_spec run1:runs/20230601-120000_sft,run2:runs/20230602-130000_sft
```

If your model is training on Google Cloud Platform (GCP), follow these steps to monitor training:
1. SSH into your GCP instance with port forwarding:

   ```
   gcloud compute ssh your-instance-name -- -L 6006:localhost:6006
   ```

2. Start TensorBoard on the remote instance:

   ```
   tensorboard --logdir=runs --bind_all
   ```

3. Access TensorBoard in your local browser at http://localhost:6006.
To share results online via tensorboard.dev:

1. Install the tensorboard plugin:

   ```
   pip install tensorboard-plugin-wit
   ```

2. Upload logs to tensorboard.dev:

   ```
   tensorboard dev upload --logdir runs \
     --name "NL2SQL Experiment" \
     --description "Training results for NL2SQL model"
   ```

3. Follow the link provided to view your results online. They'll be available for 90 days.
- Training Loss: Should steadily decrease
- Validation Loss: Should decrease but may plateau
- Perplexity: The exponentiated loss value; lower is better
- Gradient Norm: Measures the magnitude of gradients (see the sketch after this list)
  - Very high values (>10) might indicate potential instability
  - Very low values (<0.01) might indicate vanishing gradients
  - Watch for sudden spikes or drops
- Learning Rate: Should follow the expected schedule (warmup, decay, etc.)
- Memory Usage: Monitor GPU memory to detect leaks or inefficiencies
- Token Accuracy: Direction should correlate with loss improvement; useful for tracking concrete progress
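For concreteness, the two derived quantities above can be computed like this (an illustrative sketch, not taken from the project's code):

```python
# Illustrative helpers for the perplexity and gradient-norm checks described above.
import math

import torch

def perplexity(cross_entropy_loss: float) -> float:
    """Perplexity is the exponentiated (natural-log) cross-entropy loss."""
    return math.exp(cross_entropy_loss)

def global_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm over all parameter gradients; compare against the ranges above."""
    norms = [p.grad.detach().norm() for p in model.parameters() if p.grad is not None]
    return torch.norm(torch.stack(norms)).item() if norms else 0.0
```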
To modify what's being logged:

- Edit the TensorBoard configuration in `src/config.yaml`
- For more detailed changes, modify the logging code in `src/utils/training.py`
- "No data found": Ensure your log directory is correct and that training has saved some data
- High memory usage: Set
log_grad_histogram: falseto reduce memory overhead - Missing GPU metrics: Install pynvml with
pip install pynvml
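For context, pynvml reads GPU memory roughly as follows (a standalone sketch, not the project's exact code):

```python
# Standalone sketch: query GPU memory via NVIDIA's NVML bindings (pynvml).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first visible GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU memory: {mem.used / 1024**2:.0f} / {mem.total / 1024**2:.0f} MiB used")
pynvml.nvmlShutdown()
```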