This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
This is an ML model deployment project that trains a RandomForest classifier on Census income data and serves predictions via a FastAPI REST API. The model predicts whether income exceeds $50K/year based on census attributes.

Start the API server with auto-reload: `uvicorn main:app --reload`

Run the training pipeline: `python train.py`

Run specific pipeline steps:
- `python train.py main.steps=data_cleaning`
- `python train.py main.steps=train_model`
- `python train.py main.steps=check_score`

Run the tests: `pytest`

Run a single test file: `pytest test_case/test_api_server.py`

Lint: `flake8 --max-line-length 99`

Sync data and models with DVC:
- `dvc pull`: Pull data/models from remote storage
- `dvc push`: Push data/models to remote storage

API server (`main.py`):
- FastAPI application with GET (welcome message) and POST (inference) endpoints
- `ModelInput` Pydantic model validates inference requests with strict `Literal` types for categorical fields
- Field name mapping handled via `config.yml` (converts Python-friendly names like `marital_status` to dataset names like `marital-status`)
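The field name mapping described above amounts to a dictionary rename. The following is a minimal sketch, not the project's actual code: `UPDATE_KEYS` stands in for the mapping stored in `config.yml` (only the `marital_status` entry is taken from the text; the real mapping likely has more entries), and `to_dataset_keys` is a hypothetical helper name.

```python
from typing import Any, Dict

# Stand-in for the mapping kept in config.yml. Only this one entry comes
# from the description above; the remaining entries are not shown here.
UPDATE_KEYS: Dict[str, str] = {"marital_status": "marital-status"}


def to_dataset_keys(
    payload: Dict[str, Any], update_keys: Dict[str, str]
) -> Dict[str, Any]:
    """Rename API field names to dataset column names.

    Keys without an entry in update_keys are passed through unchanged.
    """
    return {update_keys.get(key, key): value for key, value in payload.items()}


if __name__ == "__main__":
    request_body = {"age": 39, "marital_status": "Never-married"}
    print(to_dataset_keys(request_body, UPDATE_KEYS))
```

Keeping the mapping in configuration rather than in code means the request model can use valid Python identifiers while the preprocessing step still sees the hyphenated column names from the dataset.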

Training pipeline (`train.py`):
- Uses Hydra for configuration management (`config.yml`)
- Three pipeline stages: `data_cleaning` -> `train_model` -> `check_score`
- `training/modelling/data.py`: Data preprocessing with OneHotEncoder for categoricals, LabelBinarizer for labels
- `training/modelling/model.py`: RandomForestClassifier with 10-fold stratified cross-validation
- `training/val_model.py`: Model validation with per-category slice metrics
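To make "per-category slice metrics" concrete: the idea is to group predictions by the value of one categorical feature and score each group separately. The sketch below is an illustration using only the standard library, not the code in `training/val_model.py`; the function name `slice_metrics`, the choice of precision and recall, and the convention of returning 1.0 when a denominator is zero are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def slice_metrics(
    slice_values: Sequence[Hashable],
    y_true: Sequence[int],
    y_pred: Sequence[int],
) -> Dict[Hashable, Tuple[float, float]]:
    """Return {slice_value: (precision, recall)} for binary 0/1 labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for value, truth, pred in zip(slice_values, y_true, y_pred):
        bucket = counts[value]  # creates the slice even if it only has true negatives
        if pred == 1 and truth == 1:
            bucket["tp"] += 1
        elif pred == 1 and truth == 0:
            bucket["fp"] += 1
        elif pred == 0 and truth == 1:
            bucket["fn"] += 1

    results = {}
    for value, c in counts.items():
        predicted_positive = c["tp"] + c["fp"]
        actual_positive = c["tp"] + c["fn"]
        # Assumed convention: an empty denominator scores 1.0.
        precision = c["tp"] / predicted_positive if predicted_positive else 1.0
        recall = c["tp"] / actual_positive if actual_positive else 1.0
        results[value] = (precision, recall)
    return results


if __name__ == "__main__":
    education = ["Bachelors", "Bachelors", "HS-grad", "HS-grad"]
    print(slice_metrics(education, [1, 0, 1, 1], [1, 1, 0, 1]))
```

Slices with few rows produce noisy scores, so it can help to report the row count next to each metric when reading the output.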

Inference:
- Loads serialized model artifacts from the `model/` directory (`model.joblib`, `encoder.joblib`, `lb.joblib`)
- Applies the same preprocessing pipeline used during training
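Because inference depends on all three artifact files, a pre-flight check can give a clearer error than a failed load, for example when the artifacts have not been pulled yet. This is a hedged sketch: `missing_artifacts` is a hypothetical helper that does not exist in the repository, and it only checks that the files are present without loading them.

```python
from pathlib import Path
from typing import List

# File names taken from the artifact list above.
ARTIFACT_NAMES = ("model.joblib", "encoder.joblib", "lb.joblib")


def missing_artifacts(model_dir: str = "model") -> List[str]:
    """Return the artifact file names that are absent from model_dir."""
    base = Path(model_dir)
    return [name for name in ARTIFACT_NAMES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_artifacts()
    if missing:
        print("Missing artifacts:", ", ".join(missing))
    else:
        print("All model artifacts are present.")
```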

Data and model artifacts:
- Raw data: `data/census.csv`
- Cleaned data: `data/clean_census.csv` (removes nulls, duplicates, drops education-num/capital columns)
- Model artifacts: `model/model.joblib`, `model/encoder.joblib`, `model/lb.joblib`

All configuration is in `config.yml`:
- `data.cat_features`: List of categorical feature column names
- `infer.update_keys`: Mapping from API field names to dataset column names
- `infer.columns`: Expected column order for inference
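The reason for keeping an expected column order is that the fitted encoder and model see features by position, so a request has to be arranged the same way as the training data. A minimal sketch of that step follows; `order_row` is a hypothetical helper, and the three-column list in the usage example is illustrative only, not the actual contents of `infer.columns`.

```python
from typing import Any, Dict, List


def order_row(record: Dict[str, Any], columns: List[str]) -> List[Any]:
    """Arrange a record's values in the given column order.

    Raises KeyError if any expected column is missing from the record.
    """
    missing = [column for column in columns if column not in record]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return [record[column] for column in columns]


if __name__ == "__main__":
    # Illustrative subset of columns; the real list lives in config.yml.
    columns = ["age", "workclass", "marital-status"]
    record = {"marital-status": "Never-married", "age": 39, "workclass": "State-gov"}
    print(order_row(record, columns))
```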

Deployment and CI:
- Heroku deployment via `Procfile` using uvicorn
- DVC automatically pulls data on Heroku startup (see conditional in `main.py`)
- GitHub Actions runs flake8 and pytest on push to main
- AWS S3 used as DVC remote storage
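The startup conditional in `main.py` is not reproduced in this file. One common way to write such a check, shown here purely as an assumption about its shape, is to test for the `DYNO` environment variable that Heroku sets on its dynos and for a `.dvc` directory. `startup_commands` is a hypothetical function, and the `dvc config core.no_scm true` step is included on the assumption that the deployed slug has no Git metadata.

```python
import os
from typing import List, Mapping


def startup_commands(environ: Mapping[str, str], dvc_dir_exists: bool) -> List[str]:
    """Return the shell commands to run at startup; empty when not on Heroku."""
    # Heroku sets DYNO on every dyno, so its presence signals a Heroku runtime.
    if "DYNO" in environ and dvc_dir_exists:
        return [
            "dvc config core.no_scm true",  # assumed: no .git directory in the slug
            "dvc pull",
        ]
    return []


if __name__ == "__main__":
    # Print rather than execute, so the sketch is safe to run anywhere.
    for command in startup_commands(os.environ, os.path.isdir(".dvc")):
        print("would run:", command)
```

Returning the commands instead of executing them keeps the decision logic testable without a Heroku environment or DVC remote credentials.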