Error Handling is... Trusting #2192

@AI-God-Dev

Description

There's a pattern in this codebase of swallowing errors or not checking for problems until it's too late.

Example 1: Tokenizer files

```python
# litgpt/tokenizer.py:45
except json.JSONDecodeError:  # Some files like the Llama 3.2 one have bugs
```

This silently catches JSON errors and moves on. What if the file is actually corrupted? What if the user passed the wrong path? They'll get weird behavior 10 steps later and have no idea why.
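
If the catch has to stay for known-buggy files, it should at least fail loudly for everything else. Here's a minimal sketch of what a stricter loader could look like; the name `load_tokenizer_config` is hypothetical, not litgpt's actual API:

```python
# Hypothetical sketch: fail loudly with file name and position instead of
# swallowing the error. `load_tokenizer_config` is an illustrative name.
import json
from pathlib import Path

def load_tokenizer_config(path: Path) -> dict:
    if not path.is_file():
        raise FileNotFoundError(f"Tokenizer config not found: {path}")
    text = path.read_text(encoding="utf-8")
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # Re-raise with enough context that the user can fix the file now,
        # instead of hitting weird behavior 10 steps later.
        raise ValueError(
            f"{path} is not valid JSON (line {err.lineno}, column {err.colno}): {err.msg}"
        ) from err
```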

Example 2: No checkpoint validation

When you load a checkpoint, the code just... loads it. No checks for:

  • Is this actually a valid checkpoint?
  • Does it match the model architecture you're trying to load it into?
  • Is it corrupted?
  • Are the tensor shapes what you expect?

You find out about problems when you get a cryptic PyTorch error during the first forward pass.
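
One way to catch this at load time instead: compare the checkpoint's keys and tensor shapes against the model's own state dict before calling `load_state_dict`. A minimal sketch, assuming the checkpoint is a plain state-dict file (the helper name `validate_checkpoint` is hypothetical):

```python
# Hypothetical sketch: check keys and shapes up front so a mismatch fails
# here with a readable message, not deep inside the first forward pass.
import torch

def validate_checkpoint(model: torch.nn.Module, checkpoint_path: str) -> None:
    # torch.load itself raises on a truncated or corrupted file,
    # which covers the basic integrity check.
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    expected = model.state_dict()
    missing = expected.keys() - state_dict.keys()
    unexpected = state_dict.keys() - expected.keys()
    if missing or unexpected:
        raise ValueError(
            f"Checkpoint does not match model architecture: "
            f"missing keys {sorted(missing)[:5]}, unexpected keys {sorted(unexpected)[:5]}"
        )
    for name, tensor in state_dict.items():
        if tensor.shape != expected[name].shape:
            raise ValueError(
                f"Shape mismatch for {name!r}: checkpoint has {tuple(tensor.shape)}, "
                f"model expects {tuple(expected[name].shape)}"
            )
```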

Example 3: Device compatibility

No upfront checks for whether the model will fit in memory. You start training, it OOMs 30 minutes in, and you've wasted GPU time and money.
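
A cheap upfront estimate avoids most of that. The sketch below uses a rule-of-thumb of 16 bytes per parameter for Adam-style mixed-precision training (weights, gradients, optimizer moments); that constant and the helper name are assumptions, and activation memory is ignored entirely:

```python
# Hypothetical sketch: rule-of-thumb memory check before training starts.
# The 16 bytes/param figure is a rough assumption for mixed-precision Adam;
# activation memory is workload-dependent and not counted here.
import torch

def warn_if_model_wont_fit(model: torch.nn.Module, bytes_per_param: int = 16) -> None:
    n_params = sum(p.numel() for p in model.parameters())
    needed = n_params * bytes_per_param
    if torch.cuda.is_available():
        free, total = torch.cuda.mem_get_info()  # (free, total) in bytes
        if needed > free:
            raise RuntimeError(
                f"Estimated training memory {needed / 1e9:.1f} GB exceeds free GPU "
                f"memory {free / 1e9:.1f} GB (of {total / 1e9:.1f} GB total). "
                f"Consider a smaller model, LoRA, or gradient checkpointing."
            )
```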

What to do:

Add validation layers:

  • Check checkpoint metadata before loading weights
  • Validate JSON properly and give clear error messages
  • Estimate memory requirements and warn before starting
  • Add a --validate mode that checks everything without actually running (a sketch follows this list)
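
Here's a sketch of what that dry-run mode could look like, reusing the three hypothetical helpers above; `--validate` is not an existing litgpt flag:

```python
# Hypothetical sketch: a dry run that fails fast on bad inputs before any
# GPU time is spent. Reuses load_tokenizer_config, validate_checkpoint, and
# warn_if_model_wont_fit from the sketches above.
from pathlib import Path

import torch

def run_validation(model: torch.nn.Module, checkpoint: str, tokenizer_config: str) -> None:
    load_tokenizer_config(Path(tokenizer_config))  # clear JSON errors
    validate_checkpoint(model, checkpoint)         # architecture/shape match
    warn_if_model_wont_fit(model)                  # memory estimate
    print("All validation checks passed.")
```

Wiring this behind a `--validate` CLI flag would let users check a whole config end to end without paying for a single training step.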

Real users will hit these issues. Make the error messages helpful.

Metadata


    Labels

    enhancement (New feature or request)
