# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is an ML model deployment project that trains a RandomForest classifier on Census income data and serves predictions via a FastAPI REST API. The model predicts whether income exceeds $50K/year based on census attributes.

## Common Commands

### Run the API Server Locally
```bash
uvicorn main:app --reload
```
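With the server running, the POST endpoint expects a JSON body matching `ModelInput`. A minimal sketch of building such a payload, assuming the Python-friendly field names described in the Architecture section below (the exact field list lives in `schema.py`, so treat these names as illustrative):

```python
import json

# Hypothetical inference payload; field names follow the underscore
# style that ModelInput uses (e.g. marital_status). Consult schema.py
# for the authoritative field list and allowed Literal values.
payload = {
    "age": 39,
    "workclass": "State-gov",
    "education": "Bachelors",
    "marital_status": "Never-married",
    "occupation": "Adm-clerical",
    "relationship": "Not-in-family",
    "race": "White",
    "sex": "Male",
    "hours_per_week": 40,
    "native_country": "United-States",
}

body = json.dumps(payload)
print(body)
```

This body can then be sent to the inference endpoint with any HTTP client.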

### Train the Model
```bash
python train.py
```

Run specific pipeline steps:
```bash
python train.py main.steps=data_cleaning
python train.py main.steps=train_model
python train.py main.steps=check_score
```

### Run Tests
```bash
pytest
```

Run a single test file:
```bash
pytest test_case/test_api_server.py
```

### Linting
```bash
flake8 --max-line-length 99
```

### DVC Commands
```bash
dvc pull   # Pull data/models from remote storage
dvc push   # Push data/models to remote storage
```

## Architecture

### API Layer (`main.py`, `schema.py`)
- FastAPI application with GET (welcome message) and POST (inference) endpoints
- `ModelInput` Pydantic model validates inference requests with strict Literal types for categorical fields
- Field name mapping handled via `config.yml` (converts Python-friendly names like `marital_status` to dataset names like `marital-status`)
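The field-name renaming can be sketched in plain Python. The mapping below is an illustrative subset; the real one is loaded from `infer.update_keys` in `config.yml` at runtime:

```python
# Illustrative subset of the infer.update_keys mapping; the real
# mapping comes from config.yml.
update_keys = {
    "marital_status": "marital-status",
    "native_country": "native-country",
    "hours_per_week": "hours-per-week",
}

def rename_fields(payload: dict) -> dict:
    """Rename API field names to the dataset's column names."""
    return {update_keys.get(key, key): value for key, value in payload.items()}

row = rename_fields({"age": 39, "marital_status": "Never-married"})
print(row)  # {'age': 39, 'marital-status': 'Never-married'}
```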

### Training Pipeline (`train.py`, `training/`)
- Uses Hydra for configuration management (`config.yml`)
- Three pipeline stages: `data_cleaning` -> `train_model` -> `check_score`
- `training/modelling/data.py`: Data preprocessing with OneHotEncoder for categoricals, LabelBinarizer for labels
- `training/modelling/model.py`: RandomForestClassifier with 10-fold stratified cross-validation
- `training/val_model.py`: Model validation with per-category slice metrics
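Hydra parses the `main.steps=...` override in the real pipeline; the sketch below only mimics how a comma-separated steps value could select stages in pipeline order. The `"all"` default is an assumption, not confirmed from `train.py`:

```python
# Sketch of step selection; train.py delegates the actual parsing
# to Hydra. "all" as a run-everything value is a hypothetical default.
PIPELINE = ["data_cleaning", "train_model", "check_score"]

def select_steps(steps_value: str) -> list[str]:
    """Return requested steps in pipeline order; 'all' runs everything."""
    if steps_value == "all":
        return list(PIPELINE)
    requested = set(steps_value.split(","))
    unknown = requested - set(PIPELINE)
    if unknown:
        raise ValueError(f"Unknown steps: {sorted(unknown)}")
    return [step for step in PIPELINE if step in requested]

print(select_steps("train_model,check_score"))  # ['train_model', 'check_score']
```

Keeping the result in `PIPELINE` order means `check_score` never runs before `train_model`, regardless of how the override is written.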

### Inference (`training/inferance_model.py`)
- Loads serialized model artifacts from the `model/` directory (model.joblib, encoder.joblib, lb.joblib)
- Applies the same preprocessing pipeline used during training

### Data Flow
1. Raw data: `data/census.csv`
2. Cleaned data: `data/clean_census.csv` (removes nulls and duplicates, drops the education-num/capital columns)
3. Model artifacts: `model/model.joblib`, `model/encoder.joblib`, `model/lb.joblib`
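The cleaning rules in step 2 can be sketched as follows. This is a plain-Python illustration of the logic, not the actual `data_cleaning` stage (which operates on `data/census.csv`, presumably via pandas); the exact set of dropped columns is an assumption based on the description above:

```python
# Sketch of the cleaning rules: drop the education-num/capital columns,
# drop rows with missing values, and drop exact duplicates.
# DROP_COLUMNS is illustrative; see the data_cleaning stage for the truth.
DROP_COLUMNS = {"education-num", "capital-gain", "capital-loss"}

def clean(rows: list[dict]) -> list[dict]:
    seen, cleaned = set(), []
    for row in rows:
        row = {k: v for k, v in row.items() if k not in DROP_COLUMNS}
        if any(v is None for v in row.values()):
            continue  # drop rows with missing values
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # drop duplicates
        seen.add(key)
        cleaned.append(row)
    return cleaned

rows = [
    {"age": 39, "education-num": 13, "workclass": "State-gov"},
    {"age": 39, "education-num": 13, "workclass": "State-gov"},
    {"age": 50, "education-num": 9, "workclass": None},
]
print(clean(rows))  # [{'age': 39, 'workclass': 'State-gov'}]
```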

## Configuration

All configuration is in `config.yml`:
- `data.cat_features`: List of categorical feature column names
- `infer.update_keys`: Mapping from API field names to dataset column names
- `infer.columns`: Expected column order for inference
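An illustrative fragment of the shape `config.yml` takes, with abbreviated value lists; consult the real file for the full feature and column lists:

```yaml
# Illustrative fragment only -- values abbreviated.
data:
  cat_features:
    - workclass
    - education
    - marital-status
infer:
  update_keys:
    marital_status: marital-status
    native_country: native-country
  columns:
    - age
    - workclass
    - education
```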

## Deployment

- Heroku deployment via `Procfile` using uvicorn
- DVC automatically pulls data on Heroku startup (see the conditional in `main.py`)
- GitHub Actions runs flake8 and pytest on push to main
- AWS S3 used as DVC remote storage