There's a pattern in this codebase of swallowing errors or not checking for problems until it's too late.
Example 1: Tokenizer files
```python
# litgpt/tokenizer.py:45
except json.JSONDecodeError:  # Some files like the Llama 3.2 one have bugs
```
This silently catches the JSON error and moves on. What if the file is actually corrupted? What if the user passed the wrong path? They'll get weird behavior ten steps later and have no idea why.
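Something like the following would surface the problem at the point of failure. This is a minimal sketch, not litgpt's actual code; the helper name and error wording are illustrative:

```python
import json
from pathlib import Path


def load_tokenizer_config(path: Path) -> dict:
    """Load a tokenizer config, failing loudly with an actionable message."""
    if not path.is_file():
        raise FileNotFoundError(
            f"Tokenizer config not found: {path}. "
            "Did you pass the right checkpoint directory?"
        )
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except json.JSONDecodeError as e:
        # Re-raise with context instead of silently swallowing the error.
        raise ValueError(
            f"{path} is not valid JSON (line {e.lineno}, column {e.colno}): {e.msg}. "
            "The file may be corrupted or truncated; try re-downloading the checkpoint."
        ) from e
```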
Example 2: No checkpoint validation
When you load a checkpoint, the code just... loads it. No checks for:
- Is this actually a valid checkpoint?
- Does it match the model architecture you're trying to load it into?
- Is it corrupted?
- Are the tensor shapes what you expect?
You find out about problems when you get a cryptic PyTorch error during the first forward pass.
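A pre-load check could look roughly like this, assuming a plain PyTorch state-dict checkpoint (no nested wrapper dicts). The function name and error text are just for illustration:

```python
import torch


def validate_checkpoint(checkpoint_path, model: torch.nn.Module) -> None:
    """Compare a checkpoint's tensors against the target model before loading."""
    try:
        state_dict = torch.load(checkpoint_path, map_location="cpu")
    except Exception as e:
        raise ValueError(f"Could not read checkpoint {checkpoint_path}: {e}") from e

    model_sd = model.state_dict()
    missing = sorted(model_sd.keys() - state_dict.keys())
    unexpected = sorted(state_dict.keys() - model_sd.keys())
    mismatched = [
        (k, tuple(state_dict[k].shape), tuple(model_sd[k].shape))
        for k in state_dict.keys() & model_sd.keys()
        if state_dict[k].shape != model_sd[k].shape
    ]

    problems = []
    if missing:
        problems.append(f"keys missing from checkpoint: {missing[:5]}")
    if unexpected:
        problems.append(f"unexpected keys in checkpoint: {unexpected[:5]}")
    if mismatched:
        problems.append(f"shape mismatches (key, checkpoint, model): {mismatched[:5]}")
    if problems:
        raise ValueError(
            f"Checkpoint {checkpoint_path} does not match the model architecture:\n  "
            + "\n  ".join(problems)
        )
```

That turns the cryptic forward-pass failure into a message that names the offending keys before any weights are loaded.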
Example 3: Device compatibility
No upfront checks for whether the model will fit in memory. You start training, it OOMs 30 minutes in, and you've wasted GPU time and money.
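An upfront check doesn't have to be exact to be useful. A rough sketch, assuming single-GPU training with an Adam-style optimizer; the bytes-per-parameter heuristic is an assumption, not a precise accounting:

```python
import torch


def check_memory_budget(model: torch.nn.Module, bytes_per_param: int = 18) -> None:
    """Warn before training if the model is unlikely to fit on the current GPU.

    bytes_per_param ~= 2 (bf16 weights) + 2 (grads) + 12 (fp32 master + Adam states),
    plus some slack for activations. A heuristic, not an exact number.
    """
    if not torch.cuda.is_available():
        return
    n_params = sum(p.numel() for p in model.parameters())
    estimated = n_params * bytes_per_param
    total = torch.cuda.get_device_properties(0).total_memory
    if estimated > total:
        raise RuntimeError(
            f"Estimated training footprint ~{estimated / 1e9:.1f} GB exceeds the "
            f"{total / 1e9:.1f} GB available on this GPU. Consider a smaller model, "
            "lower precision, or activation checkpointing."
        )
```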
What to do:
Add validation layers:
- Check checkpoint metadata before loading weights
- Validate JSON properly and give clear error messages
- Estimate memory requirements and warn before starting
- Add a `--validate` mode that checks everything without actually running (see the sketch after this list)
Real users will hit these issues. Make the error messages helpful.
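For illustration, a `--validate` mode could just wire the checks above into the existing entry point and exit early. This is a generic argparse sketch, not litgpt's actual CLI; `build_model`, `train`, and the file names are hypothetical:

```python
import argparse
from pathlib import Path


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--checkpoint_dir", type=Path, required=True)
    parser.add_argument("--validate", action="store_true",
                        help="Run all pre-flight checks, then exit without training.")
    args = parser.parse_args()

    model = build_model(args.checkpoint_dir)  # hypothetical model constructor
    load_tokenizer_config(args.checkpoint_dir / "tokenizer_config.json")
    validate_checkpoint(args.checkpoint_dir / "lit_model.pth", model)
    check_memory_budget(model)

    if args.validate:
        print("All pre-flight checks passed.")
        return

    train(model)  # hypothetical training entry point


if __name__ == "__main__":
    main()
```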