MovieRAG is a Retrieval (and Cache) -Augmented Generation application that uses an external movie data source (IMDb) to provide context-aware responses to movie recommendation requests. It supports the following three types of architectures:
Combines a ChromaDB based retriever with an OpenAI generator. The retriever fetches relevant vector-embedded movie data (based on user prompt) and provides it to the generator with the original user’s query. The generator uses this additional context to generate an accurate and relevant response.
Converts user prompt into a Cypher query using the gpt-4, which retrieves accurate movie data from a cloud neo4j graph database instance.
Cache-Augmented Generation passes in the entire dataset (much smaller than the original) into the chat buffer memory.
Ensure you have the necessary dependencies installed by running:
pip install -r requirements.txt
Raw IMDb structured data has already been curated for word embedding based RAG usage. This curated data has been stored
as data/curated/movie_curated.csv. This step can be skipped if the existing curated data is satisfactory.
Download the following four IMDb Non-Commercial Datasets and move them to the data/raw
directory:
name.basics.tsvtitle.basics.tsvtitle.principals.tsvtitle.ratings.tsv
The data curator will use the .tsv data in data/raw to create a curated .csv movie data set in data/curated,
formatted and structured to allow for useful word embeddings. With default settings, this application takes approximately
2 minutes to run to completion on my computer.
python data_curator.py [-h] [-r RATING] [-v VOTE] [-c]
- -h, --help: Show the help message and exit.
- -r RATING, --rating RATING: Minimum rating required for a movie to be included in the curated dataset. Type: float | Default: 7.0
- -v VOTE, --vote VOTE: Minimum number of votes required for a movie to qualify. Type: int | Default: 1000
- -c, --cag: Store curated data in the
data/curated/cagfolder (as opposed to the defaultdata/curated/ragfolder)
To authenticate your API reference with OpenAI, create an API key and set
environment variable OPENAI_API_KEY.
To execute the ChromaRAG application, run:
python chroma_rag.py [-h] [-v] [-k K]
- -h, --help: Show help message and exit.
- -v, --verbose: Enable verbose mode to print results from the query similarity search.
- -k K: Specify the number of documents to return from the query similarity search (default: 3).
Set up a free cloud instance of the Neo4j database on Neo4j Aura (or locally using
the Neo4j Desktop application). Use NEO4J_URI, NEO4J_USERNAME, and NEO4J_PASSWORD
values from the instance to establish environment variables. Import data to this instance from the data/curated/graph
folder. Feel free to use the attached neo4j_importer_cypher_script.cypher to set up graph schema.
To execute the Neo4jRAG application, run:
python neo4j_rag.py [-h] [-v]
- -h, --help: Show help message and exit.
- -v, --verbose: Enable verbose mode to print results from the graph cypher chain.
To execute the MovieCAG application, run:
python cag.py [-h] [-v]
- -h, --help: Show help message and exit.
- -v, --verbose: Enable verbose mode to print the entire input fed into the chat model.
- Combines ChromaDB-based document retrieval with OpenAI's GPT to generate context-aware movie recommendations. Ideal for unstructured data.
- Provides a cache memory to OpenAI's GPT to generate context-aware movie recommendations.
- Enhances natural language queries using semantically relevant data pulled from a curated IMDb dataset.
- Includes a script to process raw IMDb .tsv data into a clean .csv format optimized for word embeddings and vector search.
- Easily configure the number of context documents returned with -k, or enable debug mode with -d for insight into retrieval results.
- Clean separation between data curation, retrieval, and generation logic for easier experimentation and extension.