Project Context: Implementation of a Two-Stage Recommendation Architecture (Two-Tower Retrieval + CatBoost Ranking), engineered as an installable Python package. The workflow is managed by a custom orchestrator (`run_pipeline.py`) that enforces strict time-based data splitting, reproducibility, and experiment tracking via MLflow.
"Act in the entire ML lifecycle: from mathematical conception and model architecture, feature engineering and experimentation, to the implementation of a robust 'production-grade' MLOps pipeline."
This project implements a Two-Stage Recommendation Architecture for the H&M Personalized Fashion Recommendations challenge, combining a Neural Retrieval stage (Two-Tower) with a Gradient Boosting Ranking stage (CatBoost).
The solution is structured as a production-ready Python package, utilizing a modular pipeline that automates data processing, training, evaluation, and artifact management.
Key Technical Philosophy:
- Architecture-First: Implementation of a standard RecSys pattern (Retrieval + Ranking) rather than ad-hoc scripts.
- MLOps Orchestration: Centralized control via `run_pipeline.py` with full MLflow integration for experiment tracking.
- Local Reproducibility: Dependency management via `uv` and removal of cloud-specific dependencies so that results can be reproduced across different environments.
- Two-Stage Recommendation System:
- Stage 1 (Retrieval): Neural Dual Encoder (Two-Tower) built with TensorFlow Recommenders (TFRS) to map users and items into a shared 32D embedding space. Generates top-K candidates via efficient BruteForce similarity search (a TFRS sketch follows this list).
- Stage 2 (Ranking): CatBoost Classifier trained to re-rank the retrieved candidates using dense behavioral features and item metadata.
- Feature Engineering: Strict separation of static and dynamic features, including calculated metrics such as `purchase_cycle` and `price_sensitivity` (sketched after this list).
- Deep Feature Engineering: Embeddings combined with behavioral features (Category Affinity, Price Sensitivity, Tenure).
- Hyperparameter Tuning: Optuna integration for CatBoost with MLflow tracking.
- Pipeline Orchestrator: A custom Python script (`scripts/run_pipeline.py`) manages the execution DAG, ensuring the correct dependency order (Preprocess → Train → Rank → Evaluate).
- Experiment Tracking (MLflow):
- Nested Runs: Hierarchical tracking of pipeline steps.
- Artifact Management: Storage of serialized models, scalers, and metric plots.
- Metric Logging: Tracking of MAP@12 at both Retrieval and Ranking stages.
- Reproducibility: Strictly pinned dependencies via `uv.lock` and config-driven parameterization.
- Time-Based Split: Strict temporal separation into Training (365 days), Fine-Tuning (30 days), and Validation (7 days) windows to mimic production forecasting and prevent data leakage (see the sketch after this list).
- Incremental Benchmarking: Evaluation of each stage independently (Baseline vs. Retrieval vs. Final Ranking).
- Interpretability: SHAP analysis applied to the Ranker to explain feature importance.
- Smoke Tests (`tests/fast_test.py`): Fast execution checks for model compilation and pipeline integrity.
- Logic Validation (`tests/verify_features.py`): Verification of feature engineering logic, guaranteeing that temporal constraints are respected.
- Model Inspection (`tests/inspect_model.py`): Utilities to validate input signatures and saved model artifacts.
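The retrieval stage (Stage 1) can be pictured with a minimal TensorFlow Recommenders sketch. This is an illustrative reconstruction, not the code in `src/model.py`: the feature names `customer_id` / `article_id`, and the `user_ids_vocab`, `item_ids_vocab`, and `candidate_ds` (a `tf.data.Dataset` of article IDs) inputs are assumptions.

```python
# A minimal Two-Tower sketch, assuming string customer/article IDs.
import tensorflow as tf
import tensorflow_recommenders as tfrs

EMBEDDING_DIM = 32  # the shared embedding space described above


def build_tower(vocabulary):
    # StringLookup + Embedding: maps raw IDs into the 32-dim space
    return tf.keras.Sequential([
        tf.keras.layers.StringLookup(vocabulary=vocabulary, mask_token=None),
        tf.keras.layers.Embedding(len(vocabulary) + 1, EMBEDDING_DIM),
    ])


class TwoTowerModel(tfrs.Model):
    def __init__(self, user_ids_vocab, item_ids_vocab, candidate_ds):
        super().__init__()
        self.user_tower = build_tower(user_ids_vocab)
        self.item_tower = build_tower(item_ids_vocab)
        # Retrieval task: in-batch softmax loss + FactorizedTopK metrics
        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=candidate_ds.batch(4096).map(self.item_tower)
            )
        )

    def compute_loss(self, features, training=False):
        user_emb = self.user_tower(features["customer_id"])
        item_emb = self.item_tower(features["article_id"])
        return self.task(user_emb, item_emb)


# Illustrative training and candidate generation (Stage 1 output):
# model = TwoTowerModel(user_ids_vocab, item_ids_vocab, candidate_ds)
# model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))
# model.fit(train_ds.batch(8192), epochs=5)
# index = tfrs.layers.factorized_top_k.BruteForce(model.user_tower, k=100)
# index.index_from_dataset(
#     candidate_ds.batch(4096).map(lambda ids: (ids, model.item_tower(ids))))
# scores, article_ids = index(tf.constant(["customer_123"]))
```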
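Similarly, the leakage-safe split and the behavioral features can be sketched with pandas. The window sizes follow the 365/30/7-day scheme above, but the feature definitions (`purchase_cycle`, `price_sensitivity`, `tenure`) are illustrative assumptions, not the project's exact formulas; a transactions DataFrame with `t_dat`, `customer_id`, and `price` columns is assumed.

```python
# A minimal sketch of the temporal split and behavioral features.
import pandas as pd


def temporal_split(tx: pd.DataFrame):
    tx = tx.assign(t_dat=pd.to_datetime(tx["t_dat"]))
    last_date = tx["t_dat"].max()
    valid_start = last_date - pd.Timedelta(days=7)
    finetune_start = valid_start - pd.Timedelta(days=30)
    train_start = finetune_start - pd.Timedelta(days=365)

    train = tx[(tx["t_dat"] >= train_start) & (tx["t_dat"] < finetune_start)]
    finetune = tx[(tx["t_dat"] >= finetune_start) & (tx["t_dat"] < valid_start)]
    valid = tx[tx["t_dat"] >= valid_start]
    return train, finetune, valid


def behavioural_features(train: pd.DataFrame) -> pd.DataFrame:
    # Features are computed on the training window only, so the
    # validation week never leaks into them.
    grp = train.sort_values("t_dat").groupby("customer_id")
    feats = pd.DataFrame({
        # purchase_cycle: mean days between consecutive purchases
        "purchase_cycle": grp["t_dat"].apply(lambda s: s.diff().dt.days.mean()),
        # price_sensitivity: customer's mean price vs. the global mean price
        "price_sensitivity": grp["price"].mean() / train["price"].mean(),
        # tenure: days since the customer's first observed purchase
        "tenure": (train["t_dat"].max() - grp["t_dat"].min()).dt.days,
    })
    return feats.reset_index()


# Hypothetical usage:
# train_tx, finetune_tx, valid_tx = temporal_split(
#     pd.read_parquet("data/transactions_train.parquet"))
# customer_feats = behavioural_features(train_tx)
```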
The repository follows a modular "src-layout" pattern:
```
.
├── data/                  # Data lake (Raw CSVs & Processed Parquet)
├── scripts/               # Controller Layer (Imperative Shell)
│   ├── run_pipeline.py    # MAIN ENTRY POINT (Orchestrator)
│   ├── train.py           # Training logic
│   └── ...
├── src/                   # Service Layer (Functional Core)
│   ├── model.py           # TFRS Two-Tower Model Architecture
│   ├── data_utils.py      # tf.data pipelines & Preprocessing
│   └── config.py          # Single Source of Truth for Configs
├── mlruns/                # Local MLflow Tracking Store
├── docs/                  # Additional Documentation
│   ├── KAGGLE_LEARNINGS.md  # Benchmarking & Strategy
│   └── FINAL_RESULTS.md     # Methodologies & Results
├── pyproject.toml         # Project Dependencies (uv managed)
└── uv.lock                # Exact Dependency Lockfile
```
Note: This project has been fully refactored for Local Execution. All Cloud/GCP dependencies were removed to ensure cost-effective, high-performance local training.
- Python 3.8+
- uv (Fast Python package installer)
- Clone and Setup Environment:

  ```bash
  # Install uv if not present
  pip install uv

  # Create virtual environment
  uv venv

  # Activate (Windows)
  .venv\Scripts\Activate.ps1

  # Install dependencies
  uv pip install -e .
  ```
- Prepare Data: Place the H&M competition CSV files in `data/` and run:

  ```bash
  python scripts/convert_csv_to_parquet.py
  ```
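For reference, the conversion amounts to something like the sketch below; the actual `scripts/convert_csv_to_parquet.py` may cast dtypes and handle chunking differently, and the file names assume the standard competition downloads.

```python
# A minimal sketch of the CSV-to-Parquet conversion step.
from pathlib import Path

import pandas as pd

DATA_DIR = Path("data")

for name in ["transactions_train", "articles", "customers"]:
    df = pd.read_csv(DATA_DIR / f"{name}.csv")
    # Parquet is columnar and compressed, so downstream steps load far faster
    df.to_parquet(DATA_DIR / f"{name}.parquet", index=False)
```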
Execute the full end-to-end pipeline with a single command:
```bash
python scripts/run_pipeline.py
```

| Step | Command | Description |
|---|---|---|
| 1. Preprocess | `preprocess` | Partitions data into Training and Validation sets using strict temporal splitting logic to prevent data leakage. |
| 2. TFRecord | `tfrecord` | Transforms processed Parquet files into the optimized TFRecord format to maximize GPU throughput. |
| 3. Baseline | `baseline` | Establishes a performance benchmark (MAP@12) using a simple "Most Popular" heuristic strategy. |
| 4. Train | `train` | Executes a two-phase training strategy: base training on 365 days of history followed by fine-tuning on recent data to adapt to shifting trends. |
| 5. Evaluate-TT | `evaluate-tt` | Measures the retrieval quality (MAP@12) of the Two-Tower model in isolation against the validation set. |
| 6. Candidates | `candidates` | Performs efficient similarity search to generate the top-K candidate items for each user. |
| 7. Tune | `tune` | Executes Optuna Bayesian optimization to find the best hyperparameters for the CatBoost ranker (see the sketch below the table). |
| 8. Ranking | `ranking` | Trains a CatBoost classifier to re-rank the candidate list based on fine-grained interaction probabilities. |
| 9. Evaluate | `evaluate` | Computes the final MAP@12 of the integrated system (Retrieval + Ranking) on the validation set. |
| 10. SHAP | `shap` | Runs SHAP (SHapley Additive exPlanations) to analyze feature contributions to the Ranker's predictions. |
| 11. Submission | `submission` | Generates the submission file formatted for the Kaggle competition leaderboard. |
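The tuning step (7) can be illustrated with a minimal Optuna + CatBoost sketch. The search space, trial count, the `data/rank_train.parquet` / `data/rank_val.parquet` file names, and the per-trial nested-run logging are illustrative assumptions, not the project's exact code.

```python
# A minimal sketch of Bayesian hyperparameter tuning for the ranker.
import mlflow
import optuna
import pandas as pd
from catboost import CatBoostClassifier

train = pd.read_parquet("data/rank_train.parquet")  # hypothetical path
val = pd.read_parquet("data/rank_val.parquet")      # hypothetical path
X_train, y_train = train.drop(columns=["label"]), train["label"]
X_val, y_val = val.drop(columns=["label"]), val["label"]


def objective(trial: optuna.Trial) -> float:
    params = {
        "depth": trial.suggest_int("depth", 4, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1.0, 10.0),
        "iterations": 500,
        "eval_metric": "AUC",
        "verbose": 0,
    }
    model = CatBoostClassifier(**params)
    model.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=50)
    auc = model.get_best_score()["validation"]["AUC"]
    with mlflow.start_run(nested=True):  # one child run per trial
        mlflow.log_params(params)
        mlflow.log_metric("val_auc", auc)
    return auc


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print("Best params:", study.best_params)
```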
```bash
# Run specific steps only
python scripts/run_pipeline.py --steps train candidates ranking evaluate

# Run all steps including tuning
python scripts/run_pipeline.py --steps all

# Skip tuning (use saved or default params)
python scripts/run_pipeline.py --skip-tuning
```

Why use the pipeline?
- Reproducibility: Ensures steps run in the correct order.
- Tracking: Automatically logs all params, metrics, and artifacts to MLflow using Nested Runs (a minimal sketch of this pattern follows).
- Incremental Evaluation: Compares MAP@12 across stages (Baseline → Two-Tower → 2-Stage); the metric itself is also sketched below.
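A minimal sketch of the nested-run pattern, assuming each step is a plain function that returns a dict of metrics; the real `run_pipeline.py` also handles argument parsing, step ordering, and artifact logging.

```python
# A minimal sketch of nested MLflow runs wrapped around pipeline steps.
import mlflow


def run_pipeline(steps):
    with mlflow.start_run(run_name="pipeline"):  # parent run
        for name, step_fn in steps:
            with mlflow.start_run(run_name=name, nested=True):  # one child per step
                metrics = step_fn() or {}
                for key, value in metrics.items():
                    mlflow.log_metric(key, value)


# Hypothetical usage, assuming preprocess/train/evaluate functions exist:
# run_pipeline([("preprocess", preprocess), ("train", train), ("evaluate", evaluate)])
```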
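For reference, MAP@12 (the metric compared across stages) can be computed as below, where `actual` is a customer's purchased articles in the validation week and `predicted` a ranked list of up to 12 recommended article IDs.

```python
# A minimal sketch of MAP@12 as used on the Kaggle leaderboard.
import numpy as np


def apk(actual: set, predicted: list, k: int = 12) -> float:
    if not actual:
        return 0.0
    hits, score = 0, 0.0
    for i, p in enumerate(predicted[:k]):
        if p in actual:
            hits += 1
            score += hits / (i + 1)  # precision at each hit position
    return score / min(len(actual), k)


def map_at_k(actuals, predictions, k: int = 12) -> float:
    return float(np.mean([apk(a, p, k) for a, p in zip(actuals, predictions)]))


# Example: the single relevant article is recommended at rank 2 -> AP@12 = 0.5
print(map_at_k([{"0706016001"}], [["0108775015", "0706016001"]]))
```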
To visualize experiments, launch the MLflow UI:
```bash
mlflow ui
```

If training completes but the model fails to save (or you need to re-save it with a new signature), use the recovery script to avoid re-training:

```bash
python scripts/recover_model.py
```

This script loads the last best checkpoint, re-indexes the candidates, and saves the final model artifact.
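A hedged sketch of that recovery idea is shown below. It reuses the illustrative `TwoTowerModel`, vocabularies, and `candidate_ds` from the retrieval sketch earlier, and the checkpoint and output paths are placeholders, not the project's actual artifacts.

```python
# A minimal sketch of model recovery: restore the best checkpoint weights,
# rebuild the BruteForce index, and export it without re-training.
import tensorflow as tf
import tensorflow_recommenders as tfrs

model = TwoTowerModel(user_ids_vocab, item_ids_vocab, candidate_ds)
model.load_weights("checkpoints/best.ckpt")  # hypothetical checkpoint path

index = tfrs.layers.factorized_top_k.BruteForce(model.user_tower, k=100)
index.index_from_dataset(
    candidate_ds.batch(4096).map(lambda ids: (ids, model.item_tower(ids))))

# Call the index once so its input signature is traced before export
_ = index(tf.constant(["customer_123"]))
tf.saved_model.save(index, "artifacts/retrieval_index")  # hypothetical output path
```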
Jordão Fernandes de Andrade
Data Scientist & Economist (MSc)
[email protected]
This project is licensed under the MIT License.