MovieRAG

Overview

MovieRAG is a Retrieval (and Cache) -Augmented Generation application that uses an external movie data source (IMDb) to provide context-aware responses to movie recommendation requests. It supports the following three types of architectures:

1. VectorRAG

Combines a ChromaDB based retriever with an OpenAI generator. The retriever fetches relevant vector-embedded movie data (based on user prompt) and provides it to the generator with the original user’s query. The generator uses this additional context to generate an accurate and relevant response.

2. GraphRAG

Converts user prompt into a Cypher query using the gpt-4, which retrieves accurate movie data from a cloud neo4j graph database instance.

3. CAG

Cache-Augmented Generation passes in the entire dataset (much smaller than the original) into the chat buffer memory.

Installation

Ensure you have the necessary dependencies installed by running:

pip install -r requirements.txt

Usage

Curate Dataset (Optional)

Raw IMDb structured data has already been curated for word embedding based RAG usage. This curated data has been stored as data/curated/movie_curated.csv. This step can be skipped if the existing curated data is satisfactory.

1. Download Raw Data

Download the following four IMDb Non-Commercial Datasets and move them to the data/raw directory:

name.basics.tsv
title.basics.tsv
title.principals.tsv
title.ratings.tsv

2. Execute Data Curator

The data curator will use the .tsv data in data/raw to create a curated .csv movie data set in data/curated, formatted and structured to allow for useful word embeddings. With default settings, this application takes approximately 2 minutes to run to completion on my computer.

python data_curator.py [-h] [-r RATING] [-v VOTE] [-c]

-h, --help: Show the help message and exit.
-r RATING, --rating RATING: Minimum rating required for a movie to be included in the curated dataset. Type: float | Default: 7.0
-v VOTE, --vote VOTE: Minimum number of votes required for a movie to qualify. Type: int | Default: 1000
-c, --cag: Store curated data in the data/curated/cag folder (as opposed to the default data/curated/rag folder)

Export OpenAI API key

To authenticate your API reference with OpenAI, create an API key and set environment variable OPENAI_API_KEY.

Running ChromaRAG

To execute the ChromaRAG application, run:

python chroma_rag.py [-h] [-v] [-k K]

-h, --help: Show help message and exit.
-v, --verbose: Enable verbose mode to print results from the query similarity search.
-k K: Specify the number of documents to return from the query similarity search (default: 3).

Export neo4j Instance Credentials

Set up a free cloud instance of the Neo4j database on Neo4j Aura (or locally using the Neo4j Desktop application). Use NEO4J_URI, NEO4J_USERNAME, and NEO4J_PASSWORD values from the instance to establish environment variables. Import data to this instance from the data/curated/graph folder. Feel free to use the attached neo4j_importer_cypher_script.cypher to set up graph schema.

Running Neo4jRAG

To execute the Neo4jRAG application, run:

python neo4j_rag.py [-h] [-v]

-h, --help: Show help message and exit.
-v, --verbose: Enable verbose mode to print results from the graph cypher chain.

Running MovieCAG

To execute the MovieCAG application, run:

python cag.py [-h] [-v]

-h, --help: Show help message and exit.
-v, --verbose: Enable verbose mode to print the entire input fed into the chat model.

Features

Vector Retrieval-Augmented Generation (RAG)
Combines ChromaDB-based document retrieval with OpenAI's GPT to generate context-aware movie recommendations. Ideal for unstructured data.
Cache-Augmented Generation (CAG)
Provides a cache memory to OpenAI's GPT to generate context-aware movie recommendations.
Context-Enriched Responses
Enhances natural language queries using semantically relevant data pulled from a curated IMDb dataset.
Custom Curated Dataset
Includes a script to process raw IMDb .tsv data into a clean .csv format optimized for word embeddings and vector search.
Flexible Query Parameters
Easily configure the number of context documents returned with -k, or enable debug mode with -d for insight into retrieval results.
Modular Design for Experimentation
Clean separation between data curation, retrieval, and generation logic for easier experimentation and extension.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MovieRAG

Overview

1. VectorRAG

2. GraphRAG

3. CAG

Table of Contents

Installation

Usage

Curate Dataset (Optional)

1. Download Raw Data

2. Execute Data Curator

Export OpenAI API key

Running ChromaRAG

Export neo4j Instance Credentials

Running Neo4jRAG

Running MovieCAG

Features

Vector Retrieval-Augmented Generation (RAG)

Cache-Augmented Generation (CAG)

Context-Enriched Responses

Custom Curated Dataset

Flexible Query Parameters

Modular Design for Experimentation

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
data/curated		data/curated
README.md		README.md
cag.py		cag.py
chroma_rag.py		chroma_rag.py
data_curator.py		data_curator.py
neo4j_importer_cypher_script.cypher		neo4j_importer_cypher_script.cypher
neo4j_rag.py		neo4j_rag.py
requirements.txt		requirements.txt

axj2613/MovieRAG

Folders and files

Latest commit

History

Repository files navigation

MovieRAG

Overview

1. VectorRAG

2. GraphRAG

3. CAG

Table of Contents

Installation

Usage

Curate Dataset (Optional)

1. Download Raw Data

2. Execute Data Curator

Export OpenAI API key

Running ChromaRAG

Export neo4j Instance Credentials

Running Neo4jRAG

Running MovieCAG

Features

Vector Retrieval-Augmented Generation (RAG)

Cache-Augmented Generation (CAG)

Context-Enriched Responses

Custom Curated Dataset

Flexible Query Parameters

Modular Design for Experimentation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages