Error Handling is... Trusting #2192

@AI-God-Dev

Description

There's a pattern in this codebase of swallowing errors or not checking for problems until it's too late.

Example 1: Tokenizer files

```python
# litgpt/tokenizer.py:45
except json.JSONDecodeError:  # Some files like the Llama 3.2 one have bugs
```

This silently catches JSON errors and moves on. What if the file is actually corrupted? What if the user passed the wrong path? They'll get weird behavior 10 steps later and have no idea why.
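
If the catch has to stay for known-buggy files, it should at least fail loudly for everything else. Here's a minimal sketch of what a stricter loader could look like; the name `load_tokenizer_config` is hypothetical, not litgpt's actual API:

```python
# Hypothetical sketch: fail loudly with file name and position instead of
# swallowing the error. `load_tokenizer_config` is an illustrative name.
import json
from pathlib import Path

def load_tokenizer_config(path: Path) -> dict:
    if not path.is_file():
        raise FileNotFoundError(f"Tokenizer config not found: {path}")
    text = path.read_text(encoding="utf-8")
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # Re-raise with enough context that the user can fix the file now,
        # instead of hitting weird behavior 10 steps later.
        raise ValueError(
            f"{path} is not valid JSON (line {err.lineno}, column {err.colno}): {err.msg}"
        ) from err
```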

Example 2: No checkpoint validation

When you load a checkpoint, the code just... loads it. No checks for:

  • Is this actually a valid checkpoint?
  • Does it match the model architecture you're trying to load it into?
  • Is it corrupted?
  • Are the tensor shapes what you expect?

You find out about problems when you get a cryptic PyTorch error during the first forward pass.
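
One way to catch this at load time instead: compare the checkpoint's keys and tensor shapes against the model's own state dict before calling `load_state_dict`. A minimal sketch, assuming the checkpoint is a plain state-dict file (the helper name `validate_checkpoint` is hypothetical):

```python
# Hypothetical sketch: check keys and shapes up front so a mismatch fails
# here with a readable message, not deep inside the first forward pass.
import torch

def validate_checkpoint(model: torch.nn.Module, checkpoint_path: str) -> None:
    # torch.load itself raises on a truncated or corrupted file,
    # which covers the basic integrity check.
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    expected = model.state_dict()
    missing = expected.keys() - state_dict.keys()
    unexpected = state_dict.keys() - expected.keys()
    if missing or unexpected:
        raise ValueError(
            f"Checkpoint does not match model architecture: "
            f"missing keys {sorted(missing)[:5]}, unexpected keys {sorted(unexpected)[:5]}"
        )
    for name, tensor in state_dict.items():
        if tensor.shape != expected[name].shape:
            raise ValueError(
                f"Shape mismatch for {name!r}: checkpoint has {tuple(tensor.shape)}, "
                f"model expects {tuple(expected[name].shape)}"
            )
```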

Example 3: Device compatibility

No upfront checks for whether the model will fit in memory. You start training, it OOMs 30 minutes in, and you've wasted GPU time and money.
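
A cheap upfront estimate avoids most of that. The sketch below uses a rule-of-thumb of 16 bytes per parameter for Adam-style mixed-precision training (weights, gradients, optimizer moments); that constant and the helper name are assumptions, and activation memory is ignored entirely:

```python
# Hypothetical sketch: rule-of-thumb memory check before training starts.
# The 16 bytes/param figure is a rough assumption for mixed-precision Adam;
# activation memory is workload-dependent and not counted here.
import torch

def warn_if_model_wont_fit(model: torch.nn.Module, bytes_per_param: int = 16) -> None:
    n_params = sum(p.numel() for p in model.parameters())
    needed = n_params * bytes_per_param
    if torch.cuda.is_available():
        free, total = torch.cuda.mem_get_info()  # (free, total) in bytes
        if needed > free:
            raise RuntimeError(
                f"Estimated training memory {needed / 1e9:.1f} GB exceeds free GPU "
                f"memory {free / 1e9:.1f} GB (of {total / 1e9:.1f} GB total). "
                f"Consider a smaller model, LoRA, or gradient checkpointing."
            )
```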

What to do:

Add validation layers:

  • Check checkpoint metadata before loading weights
  • Validate JSON properly and give clear error messages
  • Estimate memory requirements and warn before starting
  • Add a --validate mode that checks everything without actually running (a sketch follows this list)
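
Here's a sketch of what that dry-run mode could look like, reusing the three hypothetical helpers above; `--validate` is not an existing litgpt flag:

```python
# Hypothetical sketch: a dry run that fails fast on bad inputs before any
# GPU time is spent. Reuses load_tokenizer_config, validate_checkpoint, and
# warn_if_model_wont_fit from the sketches above.
from pathlib import Path

import torch

def run_validation(model: torch.nn.Module, checkpoint: str, tokenizer_config: str) -> None:
    load_tokenizer_config(Path(tokenizer_config))  # clear JSON errors
    validate_checkpoint(model, checkpoint)         # architecture/shape match
    warn_if_model_wont_fit(model)                  # memory estimate
    print("All validation checks passed.")
```

Wiring this behind a `--validate` CLI flag would let users check a whole config end to end without paying for a single training step.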

Real users will hit these issues. Make the error messages helpful.

Metadata


    Labels

    enhancement (New feature or request)
