gogainda/floportop

Will This Movie Be Good?

A machine learning tool that predicts a movie's IMDb rating from its metadata and plot description — with the ability to suggest similar existing movies for reference.

The Problem

It's hard to judge a movie concept early because:

  • Complex Factors: Success depends on story, genre, and audience taste
  • Human Intuition: Limited comparisons and subjective biases
  • No Reference Points: "Great" ideas can be risky without context

Our Solution

Learn patterns from thousands of past movies to:

  1. Predict expected audience rating
  2. Find similar movies for context and benchmarking

All inputs are available before release — making predictions realistic and useful for creators.

Datasets

| Dataset | Purpose | Key Features |
| --- | --- | --- |
| IMDb Dataset | Core Training Labels | Year, runtime, genres, IMDb score/votes |
| TMDB Movies Dataset | NLP Features | Plot overview, budget, revenue, credits |

Model Inputs

  • Plot Summary ("overview") - Converted to embeddings via NLP
  • Genres (Action, Drama, Sci-Fi, etc.)
  • Runtime + Release Year
  • Budget (optional)
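
As a rough illustration of how a genre input can become numeric features, the sketch below multi-hot encodes a comma-separated genre string. The vocabulary `GENRE_VOCAB` and the helper `encode_genres` are hypothetical names made up for this example; the project's actual encoding is in notebooks/02_feature_engineering.ipynb and may differ.

```python
# Hypothetical genre vocabulary, for illustration only; the project's real list may differ.
GENRE_VOCAB = ["Action", "Comedy", "Drama", "Sci-Fi", "Thriller"]


def encode_genres(genres: str, vocab: list[str] = GENRE_VOCAB) -> list[int]:
    """Return a multi-hot vector: 1 for each vocabulary genre present in the input."""
    # Normalise case and whitespace so "action, sci-fi" matches "Action,Sci-Fi".
    present = {g.strip().lower() for g in genres.split(",") if g.strip()}
    # Genres outside the vocabulary are silently ignored.
    return [1 if g.lower() in present else 0 for g in vocab]


if __name__ == "__main__":
    print(encode_genres("Action, Sci-Fi"))
```

A fixed vocabulary keeps the feature vector the same length for every movie, which is what a tabular model such as a gradient-boosted regressor expects.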

Deliverables

Must Have

  1. Rating Prediction (Metadata)

    • Input: Year, runtime, genres
    • Output: Predicted IMDb rating (e.g., 7.3/10)
  2. Rating + Plot Understanding (NLP)

    • Enhanced predictions using plot summary, keywords, and tagline
    • Better accuracy through content understanding
  3. Simple Demo UI

    • Select genre + runtime
    • Paste a plot description
    • Get instant predicted rating

Stretch Goals

  • Similar Movies Suggestions: Top 5 most similar existing movies with their ratings as benchmarks
  • Explainability: Show which factors (keywords, genre, runtime) influenced the prediction
  • Confidence Scoring: High/Medium/Low confidence levels based on training data coverage

Example

Input:

  • Genre: Sci-Fi, Thriller
  • Runtime: 118 min
  • Plot: "A detective investigates crimes in a city controlled by AI..."

Output:

  • Predicted Rating: 7.2/10
  • Similar Movies:
    • Blade Runner 2049 (8.0)
    • Minority Report (7.6)
    • Ex Machina (7.7)
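
The similar-movies list can be thought of as a nearest-neighbour lookup over plot embeddings. The sketch below shows the idea in plain Python with cosine similarity over toy vectors. It is a stand-in for a real vector index such as FAISS, not the project's implementation, and the function names and the three-dimensional vectors are invented for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query: list[float], catalog: dict[str, list[float]], k: int = 5) -> list[str]:
    """Return up to k catalog titles, most similar to the query first."""
    ranked = sorted(catalog, key=lambda title: cosine(query, catalog[title]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real plot embeddings.
    toy_catalog = {
        "movie_a": [1.0, 0.0, 0.0],
        "movie_b": [0.9, 0.1, 0.0],
        "movie_c": [0.0, 1.0, 0.0],
    }
    print(top_k_similar([1.0, 0.05, 0.0], toy_catalog, k=2))
```

A brute-force scan like this is fine for a handful of vectors; an index library becomes worthwhile once the catalog reaches tens of thousands of movies.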

Model Performance

| Metric | Value | Notes |
| --- | --- | --- |
| R² Score | 0.42 | For new movies (no vote data) |
| Training Data | 39k movies | Rich dataset with TMDB plot data |
| Algorithm | GradientBoosting | Best performer across 18 experiments |
| Features | 49 | IMDb metadata + 20 PCA components from plot embeddings |

See notebooks/03_model_training.ipynb for full experiment results.
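
For readers unfamiliar with the metric: assuming the standard coefficient of determination, an R² of 0.42 means the model accounts for roughly 42% of the variance in ratings, where 1.0 is a perfect fit and 0.0 is no better than always predicting the mean. The sketch below computes it from scratch; the helper name `r_squared` and the sample ratings are illustrative and are not project results.

```python
def r_squared(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total variation
    if ss_tot == 0:
        raise ValueError("R² is undefined when all true values are identical")
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up ratings, only to exercise the formula.
    print(r_squared([6.0, 7.0, 8.0], [6.5, 7.0, 7.5]))
```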

Tools & Models

Python 3.12 · pandas · NumPy · scikit-learn · GradientBoostingRegressor · sentence-transformers · all-MiniLM-L6-v2 · BAAI/bge-base-en-v1.5 · FAISS · FastAPI · jQuery · Select2 · Docker · Fly.io · Google Cloud Storage

Project Structure

floportop/
├── apps/
│   ├── api/                 # FastAPI app
│   └── frontend/            # Streamlit app
├── src/
│   └── floportop/           # Shared prediction/search package
├── deploy/
│   ├── cloudbuild.yaml      # Google Cloud Build config
│   └── docker/
│       ├── Dockerfile
│       └── .dockerignore
├── requirements/
│   ├── prod.in
│   ├── prod.lock
│   └── dev.txt
├── models/                  # Trained model artifacts
├── cache/                   # Runtime model caches
├── data/                    # Local datasets (not in production image)
├── notebooks/
│   ├── 01_data_pipeline.ipynb        # IMDb + TMDB → clean datasets
│   ├── 02_feature_engineering.ipynb   # Embeddings, PCA, genre encoding → features
│   ├── 03_model_training.ipynb       # 18 experiments → model v5
│   └── archive/                      # Team explorations & earlier iterations
├── scripts/                 # Data and notebook helpers
├── docs/
│   └── restructure-plan.md
├── Makefile
├── start.sh
└── README.md

API

Running the API

pip install -e .
PYTHONPATH=src:. uvicorn apps.api.app:app --reload

The API will be available at http://localhost:8000. Interactive docs at http://localhost:8000/docs.

Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| / | GET | Health check |
| /predict | GET | Predict movie rating from metadata |
| /similar-film | GET | Find similar movies by text query |

Examples

Predict rating (v5):

curl "http://localhost:8000/predict?startYear=2024&runtimeMinutes=148&genres=Action,Sci-Fi&overview=A%20team%20of%20astronauts%20travel%20through%20a%20wormhole%20in%20search%20of%20a%20new%20home%20for%20humanity"

Parameters:

  • startYear (required): Release year
  • runtimeMinutes (required): Movie length in minutes
  • overview (required): Plot description, used for semantic analysis
  • genres (optional): Comma-separated genres (default: "Drama")
  • budget (optional): Production budget in dollars
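
When calling the endpoint from code rather than curl, the query string has to be percent-encoded. The sketch below builds a /predict URL from the parameters listed above using only the standard library. The helper name `build_predict_url` is hypothetical, and the base URL assumes the local development server.

```python
from urllib.parse import quote, urlencode


def build_predict_url(base_url, start_year, runtime_minutes, overview,
                      genres=None, budget=None):
    """Assemble a /predict URL; optional parameters are omitted when not given."""
    params = {
        "startYear": start_year,
        "runtimeMinutes": runtime_minutes,
        "overview": overview,
    }
    if genres:
        # The API takes genres as one comma-separated string.
        params["genres"] = ",".join(genres)
    if budget is not None:
        params["budget"] = budget
    # quote_via=quote encodes spaces as %20, matching the curl example.
    return f"{base_url.rstrip('/')}/predict?{urlencode(params, quote_via=quote)}"


if __name__ == "__main__":
    print(build_predict_url("http://localhost:8000", 2024, 148,
                            "A team of astronauts travel through a wormhole",
                            genres=["Action", "Sci-Fi"]))
```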

Find similar movies:

curl "http://localhost:8000/similar-film?query=dark+sci-fi+time+travel&k=5"

Note: The similarity search index is built lazily on the first /similar-film call. Subsequent calls use the cached index from cache/.
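
The build-once, reuse-afterwards behaviour described in the note can be sketched as a small disk cache. This is a generic pattern, not the project's code: `load_or_build` is a hypothetical helper, and `pickle` stands in for whatever serialization the real index uses (a FAISS index would normally be saved with FAISS's own read/write functions).

```python
import pickle
from pathlib import Path


def load_or_build(cache_file: Path, build):
    """Return the cached object if it exists; otherwise build it once and cache it."""
    if cache_file.exists():
        # Fast path: every call after the first one lands here.
        with cache_file.open("rb") as fh:
            return pickle.load(fh)
    obj = build()  # slow path: only taken on the first call
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as fh:
        pickle.dump(obj, fh)
    return obj


if __name__ == "__main__":
    # Hypothetical cache location and a trivial stand-in for an expensive build step.
    index = load_or_build(Path("cache/demo_index.pkl"), lambda: {"ids": [1, 2, 3]})
    print(index)
```

The trade-off is a slow first request in exchange for faster startup, which matters when the service has to boot quickly.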

Search engine CLI

PYTHONPATH=src:. python -m floportop.movie_search "dark sci-fi time travel"

Team

| Name | Role |
| --- | --- |
| Igor Novokshonov | Team Leader |
| Benjamin Steinborn | Developer |
| Jesús López | Developer |
| Kyle Thomas | Developer |
| mucahit TIMAR | Developer |

🚀 Deployment & Operations Guide

1. Prerequisites & Container Engine

  • Project ID: wagon-bootcamp-479218
  • Region: europe-west1
  • Engine: Use OrbStack (recommended for Mac) or Docker Desktop.
  • Note: OrbStack is a lightweight, drop-in replacement that uses the same docker commands but with better performance on Apple Silicon.

2. Architecture & Platform Fix

Critical: Google Cloud Run requires linux/amd64 images.

  • The Issue: Apple Silicon Macs (M1/M2/M3) build arm64 images by default.
  • The Fix: Use Remote Builds. By running gcloud builds submit, the image is built natively on Google’s amd64 servers, bypassing local architecture mismatches.

3. Deployment Commands

| Task | Command | Description |
| --- | --- | --- |
| Build & Push | make gcp_build | Remote build on GCP; ensures amd64 compatibility. |
| Live Deploy | make gcp_deploy | Launches the latest image to the public Cloud Run URL. |
| Full Ship | make gcp_ship | Runs both build and deploy in one sequence. |

Example of a manual deploy with the required resources:

gcloud run deploy floportop-v2 \
  --image gcr.io/wagon-bootcamp-479218/floportop-v2 \
  --memory 2Gi \
  --set-env-vars KAGGLE_API_TOKEN=your_token_here \
  --region europe-west1


4. Monitoring & App Access

  • Streamlit UI: https://floportop-v2-25462497140.europe-west1.run.app
  • Features: Rating prediction + Similar films search (two tabs)
  • Note: Cold starts take ~60s due to model loading. The container runs both Streamlit (port 8501, exposed) and FastAPI (port 8080, internal).
  • Logs: View live server logs in the terminal:
    gcloud run services logs read floportop-v2 --region europe-west1
    

5. Troubleshooting: exec format error

If the app deploys but the logs show exec user process caused "exec format error", you have pushed an arm64 image instead of amd64.

  • Verification: Run docker inspect [IMAGE_NAME] | grep Architecture.
  • The Fix: Re-run make gcp_build, or pass --platform linux/amd64 when building the image manually with docker build.
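
The same architecture check can be scripted, for example as a pre-push guard. The sketch below parses the JSON that docker inspect prints for an image and reports whether every entry is amd64. The function names are hypothetical, and the sample JSON is trimmed to the two fields that matter here.

```python
import json


def image_architectures(inspect_output: str) -> list[str]:
    """Extract the Architecture field from each entry of `docker inspect` JSON output."""
    return [entry.get("Architecture", "unknown") for entry in json.loads(inspect_output)]


def is_amd64(inspect_output: str) -> bool:
    """True only if at least one image was inspected and all of them are amd64."""
    archs = image_architectures(inspect_output)
    return bool(archs) and all(arch == "amd64" for arch in archs)


if __name__ == "__main__":
    # Trimmed sample of what `docker inspect <image>` returns on an Apple Silicon build.
    sample = '[{"Architecture": "arm64", "Os": "linux"}]'
    print(image_architectures(sample), is_amd64(sample))
```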

⚠️ Critical Deployment Notes

  • Memory Requirements: This service requires at least 2Gi of RAM to load the FAISS index and models.
  • Image Size: Optimized to ~1.8GB using CPU-only PyTorch and production-only dependencies.
  • Ports: Container runs API on 8080 (internal) and Streamlit on 8501 (exposed to Cloud Run).
  • FAISS Index: Downloaded from GCS during build (https://storage.googleapis.com/floportop-models/index.faiss).
  • Lazy Imports: Do not move the Kaggle import back to the top of movie_search.py; it must remain inside the function to allow the API to boot.
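
The lazy-import pattern referred to in the last note looks roughly like the sketch below. The function names are hypothetical and are not the actual contents of movie_search.py. The point is that the import statement runs only when the function is called, so importing the module (and therefore booting the API) never touches the Kaggle package.

```python
def get_kaggle():
    """Import the kaggle package on first use instead of at module import time."""
    # Deferred import: nothing is loaded until this function is actually called,
    # so a missing package or missing credentials cannot break module import.
    import kaggle
    return kaggle


def search(query: str) -> str:
    """Hypothetical entry point that works without kaggle being importable."""
    return f"searching for: {query}"


if __name__ == "__main__":
    # Runs even when kaggle is not installed, because get_kaggle() is never called.
    print(search("dark sci-fi time travel"))
```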

Docker Build

# Build optimized image (CPU-only, ~1.8GB)
docker build -f deploy/docker/Dockerfile -t floportop .

# Run locally (exposes both API and Streamlit UI)
docker run -p 8080:8080 -p 8501:8501 floportop

# Access:
# - Streamlit UI: http://localhost:8501
# - API directly: http://localhost:8080

# Test API endpoints
curl http://localhost:8080/
curl "http://localhost:8080/predict?startYear=2024&runtimeMinutes=120&genres=Action&overview=A%20hero%20saves%20the%20world"
curl "http://localhost:8080/similar-film?query=comedy&k=5"

Le Wagon Data Science & AI Bootcamp

Final project for Le Wagon Batch #2201 (2025)


This project demonstrates real-world data processing, NLP, and machine learning — combining prediction with discovery to help creators and fans alike.

About

Movie recommendation engine
