Commit f7db9ac ("add claude md", parent 1f8d865)

1 file changed: CLAUDE.md (+84 lines, -0)
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is an ML model deployment project that trains a RandomForest classifier on Census income data and serves predictions via a FastAPI REST API. The model predicts whether income exceeds $50K/year based on census attributes.

## Common Commands

### Run the API Server Locally
```bash
uvicorn main:app --reload
```

### Train the Model
```bash
python train.py
```

Run specific pipeline steps:
```bash
python train.py main.steps=data_cleaning
python train.py main.steps=train_model
python train.py main.steps=check_score
```

### Run Tests
```bash
pytest
```

Run a single test file:
```bash
pytest test_case/test_api_server.py
```

### Linting
```bash
flake8 --max-line-length 99
```

### DVC Commands
```bash
dvc pull   # Pull data/models from remote storage
dvc push   # Push data/models to remote storage
```

## Architecture

### API Layer (`main.py`, `schema.py`)
- FastAPI application with GET (welcome message) and POST (inference) endpoints
- `ModelInput` Pydantic model validates inference requests with strict Literal types for categorical fields
- Field name mapping handled via `config.yml` (converts Python-friendly names like `marital_status` to dataset names like `marital-status`)
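
The request schema described above can be sketched roughly as follows; the field names and `Literal` values here are illustrative assumptions, not the repository's actual schema:

```python
# Hypothetical sketch of a ModelInput schema with strict Literal categoricals;
# field names and allowed values are assumptions, not the repo's actual schema.
from typing import Literal

from pydantic import BaseModel

class ModelInput(BaseModel):
    age: int
    workclass: Literal["Private", "State-gov", "Self-emp-not-inc"]
    marital_status: Literal["Never-married", "Married-civ-spouse", "Divorced"]

# Valid request body: passes validation
sample = ModelInput(age=39, workclass="State-gov", marital_status="Never-married")
print(dict(sample))
```

With `Literal` types, an out-of-vocabulary categorical value (e.g. `workclass="Nope"`) is rejected at validation time with a 422, before it ever reaches the model.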

### Training Pipeline (`train.py`, `training/`)
- Uses Hydra for configuration management (`config.yml`)
- Three pipeline stages: `data_cleaning` -> `train_model` -> `check_score`
- `training/modelling/data.py`: Data preprocessing with OneHotEncoder for categoricals, LabelBinarizer for labels
- `training/modelling/model.py`: RandomForestClassifier with 10-fold stratified cross-validation
- `training/val_model.py`: Model validation with per-category slice metrics
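
The model stage can be sketched with scikit-learn as below; the hyperparameters and helper name are illustrative assumptions, not the repository's code:

```python
# Sketch of RandomForest training with 10-fold stratified cross-validation;
# hyperparameters and the helper name are assumptions, not the repo's code.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def train_and_score(X, y):
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    scores = cross_val_score(clf, X, y, cv=cv)  # one score per fold
    clf.fit(X, y)  # refit on the full training set after CV
    return clf, scores
```

`StratifiedKFold` keeps the >$50K/<=$50K class ratio roughly constant across folds, which matters because the Census income labels are imbalanced.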

### Inference (`training/inferance_model.py`)
- Loads serialized model artifacts from the `model/` directory (`model.joblib`, `encoder.joblib`, `lb.joblib`)
- Applies the same preprocessing pipeline used during training
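
Loading the three artifacts can be sketched as follows; the file names match the description above, but the helper name is an assumption:

```python
# Sketch of loading the serialized artifacts listed above; artifact file
# names match the description, but the helper name is an assumption.
from pathlib import Path

import joblib

def load_artifacts(model_dir: Path = Path("model")):
    model = joblib.load(model_dir / "model.joblib")      # trained classifier
    encoder = joblib.load(model_dir / "encoder.joblib")  # OneHotEncoder for categoricals
    lb = joblib.load(model_dir / "lb.joblib")            # LabelBinarizer for the target
    return model, encoder, lb
```

Loading the fitted encoder and label binarizer alongside the model is what guarantees inference applies exactly the preprocessing used during training.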

### Data Flow
1. Raw data: `data/census.csv`
2. Cleaned data: `data/clean_census.csv` (removes nulls and duplicates; drops the education-num and capital columns)
3. Model artifacts: `model/model.joblib`, `model/encoder.joblib`, `model/lb.joblib`
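
The cleaning step can be sketched with pandas; the dropped column names are inferred from the description ("capital columns" is taken to mean `capital-gain` and `capital-loss`):

```python
# Sketch of the data_cleaning stage: drop nulls, duplicates, and the
# education-num/capital columns; column names are inferred, not the repo's code.
import pandas as pd

def clean_census(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna().drop_duplicates()
    drop_cols = ["education-num", "capital-gain", "capital-loss"]
    return df.drop(columns=[c for c in drop_cols if c in df.columns])
```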

## Configuration

All configuration lives in `config.yml`:
- `data.cat_features`: List of categorical feature column names
- `infer.update_keys`: Mapping from API field names to dataset column names
- `infer.columns`: Expected column order for inference
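
An illustrative `config.yml` fragment matching the keys above; the listed values are assumptions, not the repository's actual configuration:

```yaml
data:
  cat_features:
    - workclass
    - marital-status
    - occupation
infer:
  update_keys:
    marital_status: marital-status
    native_country: native-country
  columns:
    - age
    - workclass
    - marital-status
    - occupation
```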

## Deployment

- Heroku deployment via `Procfile` using uvicorn
- DVC automatically pulls data on Heroku startup (see the conditional in `main.py`)
- GitHub Actions runs flake8 and pytest on push to main
- AWS S3 used as DVC remote storage
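
A typical `Procfile` for this setup looks like the following; this is an illustrative sketch, and the repository's actual entry may differ:

```
web: uvicorn main:app --host=0.0.0.0 --port=${PORT}
```

Heroku assigns the port at dyno startup via the `PORT` environment variable, so the bind address cannot be hard-coded.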
