diff --git a/README.md b/README.md index deac85df..0b576ce7 100644 --- a/README.md +++ b/README.md @@ -99,6 +99,7 @@ For further insights on enhancing RAG applications with dense content representa | Recipe | Description | | --- | --- | | [/recommendation-systems/content_filtering.ipynb](python-recipes/recommendation-systems/content_filtering.ipynb) | Intro content filtering example with redisvl | +| [/recommendation-systems/collaborative_filtering.ipynb](python-recipes/recommendation-systems/collaborative_filtering.ipynb) | Intro collaborative filtering example with redisvl | ### See also An exciting example of how Redis can power production-ready systems is highlighted in our collaboration with [NVIDIA](https://developer.nvidia.com/blog/offline-to-online-feature-storage-for-real-time-recommendation-systems-with-nvidia-merlin/) to construct a state-of-the-art recommendation system. diff --git a/python-recipes/recommendation-systems/collaborative_filtering.ipynb b/python-recipes/recommendation-systems/collaborative_filtering.ipynb new file mode 100644 index 00000000..e96054d3 --- /dev/null +++ b/python-recipes/recommendation-systems/collaborative_filtering.ipynb @@ -0,0 +1,1706 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)\n", + "\n", + "# Collaborative Filtering in RedisVL\n", + "\n", + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Recommendation systems are a common application of machine learning and serve many industries from e-commerce to music streaming platforms.\n", + "\n", + "There are many different architectures that can be followed to build a recommendation system. In a previous example notebook we demonstrated how to do [content filtering with RedisVL](content_filtering.ipynb). 
We encourage you to start there before diving into this notebook.\n", + "\n", + "In this notebook we'll demonstrate how to build a [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering)\n", + "recommendation system and use the large IMDB movies dataset as our example data.\n", + "\n", + "To generate our vectors we'll use the popular Python package [Surprise](https://surpriselib.com/)." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# NBVAL_SKIP\n", + "!pip install scikit-surprise --quiet" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import requests\n", + "import pandas as pd\n", + "import numpy as np\n", + "\n", + "from surprise import SVD\n", + "from surprise import Dataset, Reader\n", + "from surprise.model_selection import train_test_split\n", + "\n", + "\n", + "# Replace the values below with your own if using a Redis Cloud instance\n", + "REDIS_HOST = os.getenv(\"REDIS_HOST\", \"localhost\") # ex: \"redis-18374.c253.us-central1-1.gce.cloud.redislabs.com\"\n", + "REDIS_PORT = os.getenv(\"REDIS_PORT\", \"6379\") # ex: 18374\n", + "REDIS_PASSWORD = os.getenv(\"REDIS_PASSWORD\", \"\") # ex: \"1TNxTEdYRDgIDKM2gDfasupCADXXXX\"\n", + "\n", + "# If SSL is enabled on the endpoint, use rediss:// as the URL prefix\n", + "REDIS_URL = f\"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To build a collaborative filtering example using the Surprise library and the Movies dataset, we first need to load the data, format it according to the requirements of Surprise, and then apply a collaborative filtering algorithm like SVD." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "def fetch_dataframe(file_name):\n", + "    try:\n", + "        df = pd.read_csv('datasets/collaborative_filtering/' + file_name)\n", + "    except FileNotFoundError:\n", + "        url = 'https://redis-ai-resources.s3.us-east-2.amazonaws.com/recommenders/datasets/collaborative-filtering/'\n", + "        r = requests.get(url + file_name)\n", + "        if not os.path.exists('datasets/collaborative_filtering'):\n", + "            os.makedirs('datasets/collaborative_filtering')\n", + "        with open('datasets/collaborative_filtering/' + file_name, 'wb') as f:\n", + "            f.write(r.content)\n", + "        df = pd.read_csv('datasets/collaborative_filtering/' + file_name)\n", + "    return df\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "ratings_df = fetch_dataframe('ratings_small.csv') # for a larger example use 'ratings.csv' instead\n", + "\n", + "# only keep the columns we need: userId, movieId, rating\n", + "ratings_df = ratings_df[['userId', 'movieId', 'rating']]\n", + "\n", + "reader = Reader(rating_scale=(0.0, 5.0))\n", + "\n", + "ratings_data = Dataset.load_from_df(ratings_df, reader)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# What is Collaborative Filtering" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A lot is going to happen in the code cell below. We split our full data into train and test sets. We define the collaborative filtering algorithm to use, which in this case is the Singular Value Decomposition (SVD) algorithm. Lastly, we fit our model to our data.\n", + "\n", + "It's worth going into more detail about why we chose this algorithm and what it is computing in the `svd.fit(train_set)` method we're calling.\n", + "First, let's think about what data it's receiving - our ratings data. 
This only contains the userIds, movieIds, and the users' ratings of their watched movies on a scale of 1 to 5.\n", + "\n", + "We can put this data into a matrix with rows being users and columns being movies:\n", + "\n", + "| RATINGS| movie_1 | movie_2 | movie_3 | movie_4 | movie_5 | movie_6 | ....... |\n", + "| ----- | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |\n", + "| user_1 | 4 | 1 | | 4 | | 5 | |\n", + "| user_2 | | 5 | 5 | 2 | 1 | | |\n", + "| user_3 | | | | | 1 | | |\n", + "| user_4 | 4 | 1 | | 4 | | ? | |\n", + "| user_5 | | 4 | 5 | 2 | | | |\n", + "| ...... | | | | | | | |\n", + "\n", + "Our empty cells aren't zeros; they're missing ratings. For example, `user_1` has never rated `movie_3`, so they may like it or hate it." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Unlike Content Filtering, here we're only considering the ratings that users assign. We don't know the plot or genre or release year of any of these films. We don't even know the title.\n", + "But we can still build a recommender by assuming that users have similar tastes to each other. As an intuitive example, we can see that `user_1` and `user_4` have very similar ratings on several movies, so we will assume that `user_4` will rate `movie_6` highly, just as `user_1` did. This is the idea behind collaborative filtering." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "That's the intuition, but what about the math? Since we only have this matrix to work with, what we want to do is decompose it into two constituent matrices.\n", + "Let's call our ratings matrix `[R]`. We want to find two other matrices, a user matrix `[U]` and a movie matrix `[M]`, that fit the equation:\n", + "\n", + "`[U] * [M] = [R]`\n", + "\n", + "`[U]` will look like:\n", + "|user_1_feature_1 | user_1_feature_2 | user_1_feature_3 | user_1_feature_4 | ... 
| user_1_feature_k |\n", + "| ----- | --------- | --------- | --------- | --- | --------- |\n", + "|user_2_feature_1 | user_2_feature_2 | user_2_feature_3 | user_2_feature_4 | ... | user_2_feature_k |\n", + "|user_3_feature_1 | user_3_feature_2 | user_3_feature_3 | user_3_feature_4 | ... | user_3_feature_k |\n", + "| ... | . | . | . | ... | . |\n", + "|user_N_feature_1 | user_N_feature_2 | user_N_feature_3 | user_N_feature_4 | ... | user_N_feature_k |\n", + "\n", + "`[M]` will look like:\n", + "\n", + "| movie_1_feature_1 | movie_2_feature_1 | movie_3_feature_1 | ... | movie_M_feature_1 |\n", + "| --- | --- | --- | --- | --- |\n", + "| movie_1_feature_2 | movie_2_feature_2 | movie_3_feature_2 | ... | movie_M_feature_2 |\n", + "| movie_1_feature_3 | movie_2_feature_3 | movie_3_feature_3 | ... | movie_M_feature_3 |\n", + "| movie_1_feature_4 | movie_2_feature_4 | movie_3_feature_4 | ... | movie_M_feature_4 |\n", + "| ... | . | . | ... | . |\n", + "| movie_1_feature_k | movie_2_feature_k | movie_3_feature_k | ... | movie_M_feature_k |\n", + "\n", + "\n", + "These features are called the latent features (or latent factors) and are the values we're trying to find when we call the `svd.fit(train_set)` method. The algorithm that computes these features from our ratings matrix is the SVD algorithm. The number of users and movies is set by our data. The size of the latent feature vectors, `k`, is a parameter we choose. We'll keep it at the default of 100 for this notebook." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# split the data into training and testing sets (80% train, 20% test)\n", + "train_set, test_set = train_test_split(ratings_data, test_size=0.2)\n", + "\n", + "# use SVD (Singular Value Decomposition) for collaborative filtering\n", + "svd = SVD(n_factors=100, biased=False) # we'll set biased to False so that predictions are of the form \"rating_prediction = user_vector dot item_vector\"\n", + "\n", + "# train the algorithm on the train_set\n", + "svd.fit(train_set)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Extracting The User and Movie Vectors" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that the SVD algorithm has computed our `[U]` and `[M]` matrices - which are both really just lists of vectors - we can load them into our Redis instance.\n", + "\n", + "The Surprise SVD model stores user and movie vectors in two attributes:\n", + "\n", + "- `svd.pu`: user features matrix (a matrix where each row corresponds to the latent features of a user).\n", + "- `svd.qi`: item features matrix (a matrix where each row corresponds to the latent features of an item/movie).\n", + "\n", + "It's worth noting that the matrix `svd.qi` is the transpose of the matrix `[M]` we defined above. This way each row corresponds to one movie." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "we have 671 users with feature vectors of size 100\n", + "we have 8397 movies with feature vectors of size 100\n" + ] + } + ], + "source": [ + "user_vectors = svd.pu # user latent features (matrix)\n", + "movie_vectors = svd.qi # movie latent features (matrix)\n", + "\n", + "print(f'we have {user_vectors.shape[0]} users with feature vectors of size {user_vectors.shape[1]}')\n", + "print(f'we have {movie_vectors.shape[0]} movies with feature vectors of size {movie_vectors.shape[1]}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Predicting User Ratings\n", + "The great thing about collaborative filtering is that, using our user and movie vectors, we can predict the rating any user will give to any movie in our dataset.\n", + "And unlike content filtering, there is no assumption that all the movies a user will be recommended are similar to each other. A user can be recommended dark horror films and light-hearted animations.\n", + "\n", + "Looking back at our SVD algorithm, the equation is `[User_features] * [Movie_features].transpose = [Ratings]`.\n", + "So to predict how a user will rate a movie they haven't seen yet, we just need to take the dot product of that user's feature vector and that movie's feature vector." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "the predicted rating of user 347 on movie 5515 is 1.1069607933289707\n" + ] + } + ], + "source": [ + "# Surprise maps raw userId and movieId values to inner ids, so we have to use its mapping to know which rows to use\n", + "inner_uid = train_set.to_inner_uid(347) # userId\n", + "inner_iid = train_set.to_inner_iid(5515) # movieId\n", + "\n", + "# predict one user's rating of one film\n", + "predicted_rating = np.dot(user_vectors[inner_uid], movie_vectors[inner_iid])\n", + "print(f'the predicted rating of user {347} on movie {5515} is {predicted_rating}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Adding Movie Data\n", + "While our collaborative filtering algorithm was trained solely on users' ratings of movies, and doesn't require any data about the movies themselves - like the title, genres, or release year - we'll want that information stored as metadata.\n", + "\n", + "We can grab this data from our `movies_metadata.csv` file, clean it, and join it to our user ratings via the `movieId` column." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
belongs_to_collectionbudgetgenreshomepageidimdb_idoriginal_languageoriginal_titleoverviewpopularity...release_daterevenueruntimespoken_languagesstatustaglinetitlevideovote_averagevote_count
0{'id': 10194, 'name': 'Toy Story Collection', ...30000000[{'id': 16, 'name': 'Animation'}, {'id': 35, '...http://toystory.disney.com/toy-story862tt0114709enToy StoryLed by Woody, Andy's toys live happily in his ...21.946943...1995-10-3037355403381.0[{'iso_639_1': 'en', 'name': 'English'}]ReleasedNaNToy StoryFalse7.75415
1NaN65000000[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...NaN8844tt0113497enJumanjiWhen siblings Judy and Peter discover an encha...17.015539...1995-12-15262797249104.0[{'iso_639_1': 'en', 'name': 'English'}, {'iso...ReleasedRoll the dice and unleash the excitement!JumanjiFalse6.92413
2{'id': 119050, 'name': 'Grumpy Old Men Collect...0[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...NaN15602tt0113228enGrumpier Old MenA family wedding reignites the ancient feud be...11.712900...1995-12-220101.0[{'iso_639_1': 'en', 'name': 'English'}]ReleasedStill Yelling. Still Fighting. Still Ready for...Grumpier Old MenFalse6.592
3NaN16000000[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...NaN31357tt0114885enWaiting to ExhaleCheated on, mistreated and stepped on, the wom...3.859495...1995-12-2281452156127.0[{'iso_639_1': 'en', 'name': 'English'}]ReleasedFriends are the people who let you be yourself...Waiting to ExhaleFalse6.134
4{'id': 96871, 'name': 'Father of the Bride Col...0[{'id': 35, 'name': 'Comedy'}]NaN11862tt0113041enFather of the Bride Part IIJust when George Banks has recovered from his ...8.387519...1995-02-1076578911106.0[{'iso_639_1': 'en', 'name': 'English'}]ReleasedJust When His World Is Back To Normal... He's ...Father of the Bride Part IIFalse5.7173
\n", + "

5 rows × 23 columns

\n", + "
" + ], + "text/plain": [ + " belongs_to_collection budget \\\n", + "0 {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 \n", + "1 NaN 65000000 \n", + "2 {'id': 119050, 'name': 'Grumpy Old Men Collect... 0 \n", + "3 NaN 16000000 \n", + "4 {'id': 96871, 'name': 'Father of the Bride Col... 0 \n", + "\n", + " genres \\\n", + "0 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... \n", + "1 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... \n", + "2 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... \n", + "3 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... \n", + "4 [{'id': 35, 'name': 'Comedy'}] \n", + "\n", + " homepage id imdb_id original_language \\\n", + "0 http://toystory.disney.com/toy-story 862 tt0114709 en \n", + "1 NaN 8844 tt0113497 en \n", + "2 NaN 15602 tt0113228 en \n", + "3 NaN 31357 tt0114885 en \n", + "4 NaN 11862 tt0113041 en \n", + "\n", + " original_title \\\n", + "0 Toy Story \n", + "1 Jumanji \n", + "2 Grumpier Old Men \n", + "3 Waiting to Exhale \n", + "4 Father of the Bride Part II \n", + "\n", + " overview popularity ... \\\n", + "0 Led by Woody, Andy's toys live happily in his ... 21.946943 ... \n", + "1 When siblings Judy and Peter discover an encha... 17.015539 ... \n", + "2 A family wedding reignites the ancient feud be... 11.712900 ... \n", + "3 Cheated on, mistreated and stepped on, the wom... 3.859495 ... \n", + "4 Just when George Banks has recovered from his ... 8.387519 ... \n", + "\n", + " release_date revenue runtime \\\n", + "0 1995-10-30 373554033 81.0 \n", + "1 1995-12-15 262797249 104.0 \n", + "2 1995-12-22 0 101.0 \n", + "3 1995-12-22 81452156 127.0 \n", + "4 1995-02-10 76578911 106.0 \n", + "\n", + " spoken_languages status \\\n", + "0 [{'iso_639_1': 'en', 'name': 'English'}] Released \n", + "1 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... 
Released \n", + "2 [{'iso_639_1': 'en', 'name': 'English'}] Released \n", + "3 [{'iso_639_1': 'en', 'name': 'English'}] Released \n", + "4 [{'iso_639_1': 'en', 'name': 'English'}] Released \n", + "\n", + " tagline \\\n", + "0 NaN \n", + "1 Roll the dice and unleash the excitement! \n", + "2 Still Yelling. Still Fighting. Still Ready for... \n", + "3 Friends are the people who let you be yourself... \n", + "4 Just When His World Is Back To Normal... He's ... \n", + "\n", + " title video vote_average vote_count \n", + "0 Toy Story False 7.7 5415 \n", + "1 Jumanji False 6.9 2413 \n", + "2 Grumpier Old Men False 6.5 92 \n", + "3 Waiting to Exhale False 6.1 34 \n", + "4 Father of the Bride Part II False 5.7 173 \n", + "\n", + "[5 rows x 23 columns]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies_df = fetch_dataframe('movies_metadata.csv')\n", + "movies_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "budget 0\n", + "genres 0\n", + "id 0\n", + "imdb_id 0\n", + "original_language 0\n", + "overview 0\n", + "popularity 0\n", + "release_date 0\n", + "revenue 0\n", + "runtime 0\n", + "status 0\n", + "tagline 0\n", + "title 0\n", + "vote_average 0\n", + "vote_count 0\n", + "dtype: int64" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\n", + "import datetime\n", + "movies_df.drop(columns=['homepage', 'production_countries', 'production_companies', 'spoken_languages', 'video', 'original_title', 'video', 'poster_path', 'belongs_to_collection'], inplace=True)\n", + "\n", + "# drop rows that have missing values\n", + "movies_df.dropna(subset=['imdb_id'], inplace=True)\n", + "\n", + "movies_df['original_language'] = movies_df['original_language'].fillna('unknown')\n", + "movies_df['overview'] = movies_df['overview'].fillna('')\n", + "movies_df['popularity'] = 
movies_df['popularity'].fillna(0)\n", + "movies_df['release_date'] = movies_df['release_date'].fillna('1900-01-01').apply(lambda x: datetime.datetime.strptime(x, \"%Y-%m-%d\").timestamp())\n", + "movies_df['revenue'] = movies_df['revenue'].fillna(0)\n", + "movies_df['runtime'] = movies_df['runtime'].fillna(0)\n", + "movies_df['status'] = movies_df['status'].fillna('unknown')\n", + "movies_df['tagline'] = movies_df['tagline'].fillna('')\n", + "movies_df['title'] = movies_df['title'].fillna('')\n", + "movies_df['vote_average'] = movies_df['vote_average'].fillna(0)\n", + "movies_df['vote_count'] = movies_df['vote_count'].fillna(0)\n", + "movies_df['genres'] = movies_df['genres'].apply(lambda x: [g['name'] for g in eval(x)] if x != '' else []) # convert to a list of genre names\n", + "movies_df['imdb_id'] = movies_df['imdb_id'].apply(lambda x: x[2:] if str(x).startswith('tt') else x).astype(int) # remove leading 'tt' from imdb_id\n", + "\n", + "# make sure we've filled all missing values\n", + "movies_df.isnull().sum()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll have to map these movies to their ratings, which we'll do with the `links.csv` file that matches `movieId`, `imdbId`, and `tmdbId`.\n", + "Let's do that now." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "links_df = fetch_dataframe('links_small.csv') # for a larger example use 'links.csv' instead\n", + "\n", + "movies_df = movies_df.merge(links_df, left_on='imdb_id', right_on='imdbId', how='inner')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll want to move our SVD user vectors and movie vectors, along with their corresponding userIds and movieIds, into two dataframes for later processing." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
budgetgenresidimdb_idoriginal_languageoverviewpopularityrelease_daterevenueruntimestatustaglinetitlevote_averagevote_countmovieIdimdbIdtmdbIdmovie_vector
030000000[Animation, Comedy, Family]862114709enLed by Woody, Andy's toys live happily in his ...21.946943815040000.037355403381.0ReleasedToy Story7.754151114709862.0[0.12184447241197785, -0.16994406060791697, 0....
165000000[Adventure, Fantasy, Family]8844113497enWhen siblings Judy and Peter discover an encha...17.015539819014400.0262797249104.0ReleasedRoll the dice and unleash the excitement!Jumanji6.9241321134978844.0[0.14683581574270926, -0.06365576587872183, 0....
20[Romance, Comedy]15602113228enA family wedding reignites the ancient feud be...11.712900819619200.00101.0ReleasedStill Yelling. Still Fighting. Still Ready for...Grumpier Old Men6.592311322815602.0[0.16698051985699827, -0.02406109383254372, 0....
316000000[Comedy, Drama, Romance]31357114885enCheated on, mistreated and stepped on, the wom...3.859495819619200.081452156127.0ReleasedFriends are the people who let you be yourself...Waiting to Exhale6.134411488531357.0[-0.10740791019437969, 0.09007945525146789, 0....
40[Comedy]11862113041enJust when George Banks has recovered from his ...8.387519792403200.076578911106.0ReleasedJust When His World Is Back To Normal... He's ...Father of the Bride Part II5.7173511304111862.0[0.11311012532803581, 0.025998675845395405, 0....
\n", + "
" + ], + "text/plain": [ + " budget genres id imdb_id original_language \\\n", + "0 30000000 [Animation, Comedy, Family] 862 114709 en \n", + "1 65000000 [Adventure, Fantasy, Family] 8844 113497 en \n", + "2 0 [Romance, Comedy] 15602 113228 en \n", + "3 16000000 [Comedy, Drama, Romance] 31357 114885 en \n", + "4 0 [Comedy] 11862 113041 en \n", + "\n", + " overview popularity \\\n", + "0 Led by Woody, Andy's toys live happily in his ... 21.946943 \n", + "1 When siblings Judy and Peter discover an encha... 17.015539 \n", + "2 A family wedding reignites the ancient feud be... 11.712900 \n", + "3 Cheated on, mistreated and stepped on, the wom... 3.859495 \n", + "4 Just when George Banks has recovered from his ... 8.387519 \n", + "\n", + " release_date revenue runtime status \\\n", + "0 815040000.0 373554033 81.0 Released \n", + "1 819014400.0 262797249 104.0 Released \n", + "2 819619200.0 0 101.0 Released \n", + "3 819619200.0 81452156 127.0 Released \n", + "4 792403200.0 76578911 106.0 Released \n", + "\n", + " tagline \\\n", + "0 \n", + "1 Roll the dice and unleash the excitement! \n", + "2 Still Yelling. Still Fighting. Still Ready for... \n", + "3 Friends are the people who let you be yourself... \n", + "4 Just When His World Is Back To Normal... He's ... \n", + "\n", + " title vote_average vote_count movieId imdbId \\\n", + "0 Toy Story 7.7 5415 1 114709 \n", + "1 Jumanji 6.9 2413 2 113497 \n", + "2 Grumpier Old Men 6.5 92 3 113228 \n", + "3 Waiting to Exhale 6.1 34 4 114885 \n", + "4 Father of the Bride Part II 5.7 173 5 113041 \n", + "\n", + " tmdbId movie_vector \n", + "0 862.0 [0.12184447241197785, -0.16994406060791697, 0.... \n", + "1 8844.0 [0.14683581574270926, -0.06365576587872183, 0.... \n", + "2 15602.0 [0.16698051985699827, -0.02406109383254372, 0.... \n", + "3 31357.0 [-0.10740791019437969, 0.09007945525146789, 0.... \n", + "4 11862.0 [0.11311012532803581, 0.025998675845395405, 0.... 
" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# build a dataframe out of the user vectors and their userIds\n", + "user_vectors_and_ids = {train_set.to_raw_uid(inner_id): user_vectors[inner_id].tolist() for inner_id in train_set.all_users()}\n", + "user_vector_df = pd.Series(user_vectors_and_ids).to_frame('user_vector')\n", + "\n", + "# now do the same for the movie vectors and their movieIds\n", + "movie_vectors_and_ids = {train_set.to_raw_iid(inner_id): movie_vectors[inner_id].tolist() for inner_id in train_set.all_items()}\n", + "movie_vector_df = pd.Series(movie_vectors_and_ids).to_frame('movie_vector')\n", + "\n", + "# merge the movie vectors into the movies dataframe, matching the movieId column against the vector dataframe's index\n", + "movies_df = movies_df.merge(movie_vector_df, left_on='movieId', right_index=True, how='inner')\n", + "movies_df['movieId'] = movies_df['movieId'].astype(str) # need to cast to a string as this is a tag field in our search schema\n", + "movies_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## RedisVL Handles the Scale\n", + "\n", + "Especially for large datasets like the 45,000-movie catalog we're dealing with, you'll want Redis to do the heavy lifting of vector search.\n", + "All that's needed is to define the search index and load the data we've cleaned and merged with our vectors.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "12:05:35 redisvl.index.index INFO Index already exists, overwriting.\n" + ] + } + ], + "source": [ + "from redis import Redis\n", + "from redisvl.schema import IndexSchema\n", + "from redisvl.index import SearchIndex\n", + "\n", + "client = Redis.from_url(REDIS_URL)\n", + "\n", + "movie_schema = IndexSchema.from_yaml(\"collaborative_filtering_schema.yaml\")\n", + "\n", + "movie_index = 
SearchIndex(movie_schema, redis_client=client)\n", + "movie_index.create(overwrite=True, drop=True)\n", + "\n", + "movie_keys = movie_index.load(movies_df.to_dict(orient='records'))" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "number of movies 8358\n", + "size of movie df 8358\n", + "unique movie ids 8352\n", + "unique movie titles 8115\n", + "unique movies rated 9065\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
budgetgenresidimdb_idoriginal_languageoverviewpopularityrelease_daterevenueruntimestatustaglinetitlevote_averagevote_countmovieIdimdbIdtmdbIdmovie_vector
030000000[Animation, Comedy, Family]862114709enLed by Woody, Andy's toys live happily in his ...21.946943815040000.037355403381.0ReleasedToy Story7.754151114709862.0[0.12184447241197785, -0.16994406060791697, 0....
165000000[Adventure, Fantasy, Family]8844113497enWhen siblings Judy and Peter discover an encha...17.015539819014400.0262797249104.0ReleasedRoll the dice and unleash the excitement!Jumanji6.9241321134978844.0[0.14683581574270926, -0.06365576587872183, 0....
20[Romance, Comedy]15602113228enA family wedding reignites the ancient feud be...11.712900819619200.00101.0ReleasedStill Yelling. Still Fighting. Still Ready for...Grumpier Old Men6.592311322815602.0[0.16698051985699827, -0.02406109383254372, 0....
316000000[Comedy, Drama, Romance]31357114885enCheated on, mistreated and stepped on, the wom...3.859495819619200.081452156127.0ReleasedFriends are the people who let you be yourself...Waiting to Exhale6.134411488531357.0[-0.10740791019437969, 0.09007945525146789, 0....
40[Comedy]11862113041enJust when George Banks has recovered from his ...8.387519792403200.076578911106.0ReleasedJust When His World Is Back To Normal... He's ...Father of the Bride Part II5.7173511304111862.0[0.11311012532803581, 0.025998675845395405, 0....
\n", + "
" + ], + "text/plain": [ + " budget genres id imdb_id original_language \\\n", + "0 30000000 [Animation, Comedy, Family] 862 114709 en \n", + "1 65000000 [Adventure, Fantasy, Family] 8844 113497 en \n", + "2 0 [Romance, Comedy] 15602 113228 en \n", + "3 16000000 [Comedy, Drama, Romance] 31357 114885 en \n", + "4 0 [Comedy] 11862 113041 en \n", + "\n", + " overview popularity \\\n", + "0 Led by Woody, Andy's toys live happily in his ... 21.946943 \n", + "1 When siblings Judy and Peter discover an encha... 17.015539 \n", + "2 A family wedding reignites the ancient feud be... 11.712900 \n", + "3 Cheated on, mistreated and stepped on, the wom... 3.859495 \n", + "4 Just when George Banks has recovered from his ... 8.387519 \n", + "\n", + " release_date revenue runtime status \\\n", + "0 815040000.0 373554033 81.0 Released \n", + "1 819014400.0 262797249 104.0 Released \n", + "2 819619200.0 0 101.0 Released \n", + "3 819619200.0 81452156 127.0 Released \n", + "4 792403200.0 76578911 106.0 Released \n", + "\n", + " tagline \\\n", + "0 \n", + "1 Roll the dice and unleash the excitement! \n", + "2 Still Yelling. Still Fighting. Still Ready for... \n", + "3 Friends are the people who let you be yourself... \n", + "4 Just When His World Is Back To Normal... He's ... \n", + "\n", + " title vote_average vote_count movieId imdbId \\\n", + "0 Toy Story 7.7 5415 1 114709 \n", + "1 Jumanji 6.9 2413 2 113497 \n", + "2 Grumpier Old Men 6.5 92 3 113228 \n", + "3 Waiting to Exhale 6.1 34 4 114885 \n", + "4 Father of the Bride Part II 5.7 173 5 113041 \n", + "\n", + " tmdbId movie_vector \n", + "0 862.0 [0.12184447241197785, -0.16994406060791697, 0.... \n", + "1 8844.0 [0.14683581574270926, -0.06365576587872183, 0.... \n", + "2 15602.0 [0.16698051985699827, -0.02406109383254372, 0.... \n", + "3 31357.0 [-0.10740791019437969, 0.09007945525146789, 0.... \n", + "4 11862.0 [0.11311012532803581, 0.025998675845395405, 0.... 
" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# sanity check we merged all dataframes properly and have the right sizes of movies, users, vectors, ids, etc.\n", + "number_of_movies = len(movies_df.to_dict(orient='records'))\n", + "size_of_movie_df = movies_df.shape[0]\n", + "\n", + "print('number of movies', number_of_movies)\n", + "print('size of movie df', size_of_movie_df)\n", + "\n", + "unique_movie_ids = movies_df['id'].nunique()\n", + "print('unique movie ids', unique_movie_ids)\n", + "\n", + "unique_movie_titles = movies_df['title'].nunique()\n", + "print('unique movie titles', unique_movie_titles)\n", + "\n", + "unique_movies_rated = ratings_df['movieId'].nunique()\n", + "print('unique movies rated', unique_movies_rated)\n", + "movies_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For a complete solution we'll store the user vectors and their watched list in Redis also. We won't be searching over these user vectors so no need to define an index for them. A direct JSON look up will suffice." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "from redis.commands.json.path import Path\n", + "\n", + "# use a Redis pipeline to store all user data in a single batched transaction\n", + "with client.pipeline() as pipe:\n", + "    for user_id, user_vector in user_vectors_and_ids.items():\n", + "        user_key = f\"user:{user_id}\"\n", + "        watched_list_ids = ratings_df[ratings_df['userId'] == user_id]['movieId'].tolist()\n", + "\n", + "        user_data = {\n", + "            \"user_vector\": user_vector,\n", + "            \"watched_list_ids\": watched_list_ids\n", + "        }\n", + "        pipe.json().set(user_key, Path.root_path(), user_data)\n", + "    pipe.execute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In content filtering we compute similarity between items, using the cosine distance between item vectors. In collaborative filtering we instead compute the predicted rating a user will give to a movie by taking the inner product of the user vector and the movie vector.\n", + "\n", + "This is why in our `collaborative_filtering_schema.yaml` we use `ip` (inner product) as our distance metric.\n", + "\n", + "It's also why we use the user vector as the query vector when we run a query. Let's pick an arbitrary user and their corresponding user vector to see what this looks like."
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "vector distance: -3.63527393,\t predicted rating: 4.63527393,\t title: Fight Club, \n", + "vector distance: -3.60445881,\t predicted rating: 4.60445881,\t title: All About Eve, \n", + "vector distance: -3.60197020,\t predicted rating: 4.60197020,\t title: Lock, Stock and Two Smoking Barrels, \n", + "vector distance: -3.59518766,\t predicted rating: 4.59518766,\t title: Midnight in Paris, \n", + "vector distance: -3.58543396,\t predicted rating: 4.58543396,\t title: It Happened One Night, \n", + "vector distance: -3.54092789,\t predicted rating: 4.54092789,\t title: Anne Frank Remembered, \n", + "vector distance: -3.51044893,\t predicted rating: 4.51044893,\t title: Pulp Fiction, \n", + "vector distance: -3.50941706,\t predicted rating: 4.50941706,\t title: Raging Bull, \n", + "vector distance: -3.49180365,\t predicted rating: 4.49180365,\t title: Cool Hand Luke, \n", + "vector distance: -3.47437143,\t predicted rating: 4.47437143,\t title: Rear Window, \n", + "vector distance: -3.41378117,\t predicted rating: 4.41378117,\t title: The Usual Suspects, \n", + "vector distance: -3.40533876,\t predicted rating: 4.40533876,\t title: Princess Mononoke, \n" + ] + } + ], + "source": [ + "from redisvl.query import RangeQuery\n", + "\n", + "user_vector = client.json().get(f\"user:{352}\")[\"user_vector\"]\n", + "\n", + "# the distance metric 'ip' inner product is computing \"score = 1 - u * v\" and returning the minimum, which corresponds to the max of \"u * v\"\n", + "# this is what we want. 
The predicted rating on a scale of 0 to 5 is then -(score - 1) == -score + 1\n", + "query = RangeQuery(vector=user_vector,\n", + "                   vector_field_name='movie_vector',\n", + "                   num_results=12,\n", + "                   return_score=True,\n", + "                   return_fields=['title', 'genres']\n", + "                   )\n", + "\n", + "results = movie_index.query(query)\n", + "\n", + "for r in results:\n", + "    # compute our predicted rating on a scale of 0 to 5 from our vector distance\n", + "    r['predicted_rating'] = 1.0 - float(r['vector_distance'])\n", + "    print(f\"vector distance: {float(r['vector_distance']):.08f},\\t predicted rating: {r['predicted_rating']:.08f},\\t title: {r['title']}, \")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Adding All the Bells & Whistles\n", + "Vector search handles the bulk of our collaborative filtering recommendation system and is a great way to generate personalized recommendations that are unique to each user.\n", + "\n", + "To take our RecSys further we can use RedisVL filter expressions for more control over what users are shown. Why have only one feed of recommended movies when you can have several, each with its own theme and personalized to each user?"
+ ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "import datetime\n", + "\n", + "from redisvl.query.filter import Tag, Num, Text\n", + "\n", + "def get_recommendations(user_id, filters=None, num_results=10):\n", + "    # look up the stored user vector and use it as the query vector\n", + "    user_vector = client.json().get(f\"user:{user_id}\")[\"user_vector\"]\n", + "    query = RangeQuery(vector=user_vector,\n", + "                       vector_field_name='movie_vector',\n", + "                       num_results=num_results,\n", + "                       filter_expression=filters,\n", + "                       return_fields=['title', 'overview', 'genres'])\n", + "\n", + "    results = movie_index.query(query)\n", + "\n", + "    return [(r['title'], r['overview'], r['genres'], r['vector_distance']) for r in results]\n", + "\n", + "Top_picks_for_you = get_recommendations(user_id=42) # general SVD results, no filter\n", + "\n", + "block_buster_filter = Num('revenue') > 30_000_000\n", + "block_buster_hits = get_recommendations(user_id=42, filters=block_buster_filter)\n", + "\n", + "classics_filter = Num('release_date') < datetime.datetime(1990, 1, 1).timestamp()\n", + "classics = get_recommendations(user_id=42, filters=classics_filter)\n", + "\n", + "popular_filter = (Num('popularity') > 50) & (Num('vote_average') > 7)\n", + "Whats_popular = get_recommendations(user_id=42, filters=popular_filter)\n", + "\n", + "indie_filter = (Num('revenue') < 1_000_000) & (Num('popularity') > 10)\n", + "indie_hits = get_recommendations(user_id=42, filters=indie_filter)\n", + "\n", + "fruity = Text('title') % 'apple|orange|peach|banana|grape|pineapple'\n", + "fruity_films = get_recommendations(user_id=42, filters=fruity)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " top picks \\\n", + "0 The Shawshank Redemption \n", + "1 Forrest Gump \n", + "2 Cinema Paradiso \n", + "3 Lock, Stock and Two Smoking Barrels \n", + "4 The African Queen \n", + "5 The Silence of the Lambs \n", + "6 Pulp Fiction \n", + "7 Raiders of the Lost Ark \n", + "8 The Empire Strikes Back \n", + "9 Indiana Jones and the Last Crusade \n", + "\n", + " block busters \\\n", + "0 Forrest Gump \n", + "1 The Silence of the Lambs \n", + "2 Pulp Fiction \n", + "3 Raiders of the Lost Ark \n", + "4 The Empire Strikes Back \n", + "5 Indiana Jones and the Last Crusade \n", + "6 Schindler's List \n", + "7 The Lord of the Rings: The Return of the King \n", + "8 The Lord of the Rings: The Two Towers \n", + "9 Terminator 2: Judgment Day \n", + "\n", + " classics what's popular \\\n", + "0 Cinema Paradiso The Shawshank Redemption \n", + "1 The African Queen Pulp Fiction \n", + "2 Raiders of the Lost Ark The Dark Knight \n", + "3 The Empire Strikes Back Fight Club \n", + "4 Indiana Jones and the Last Crusade Whiplash \n", + "5 Star Wars Blade Runner \n", + "6 The Manchurian Candidate The Avengers \n", + "7 The Godfather: Part II Guardians of the Galaxy \n", + "8 Castle in the Sky Gone Girl \n", + "9 Back to the Future Big Hero 6 \n", + "\n", + " indie hits fruity films \n", + "0 Castle in the Sky What's Eating Gilbert Grape \n", + "1 My Neighbor Totoro A Clockwork Orange \n", + "2 All Quiet on the Western Front The Grapes of Wrath \n", + "3 Army of Darkness Pineapple Express \n", + "4 All About Eve James and the Giant Peach \n", + "5 The Professional Bananas \n", + "6 Shine Orange County \n", + "7 Yojimbo Herbie Goes Bananas \n", + "8 Belle de Jour The Apple Dumpling Gang \n", + "9 Local Hero Adam's Apples " + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# put all these titles into a single pandas dataframe, where each column is one category\n", + "all_recommendations = 
pd.DataFrame(columns=[\"top picks\", \"block busters\", \"classics\", \"what's popular\", \"indie hits\", \"fruity films\"])\n", + "all_recommendations[\"top picks\"] = [m[0] for m in Top_picks_for_you]\n", + "all_recommendations[\"block busters\"] = [m[0] for m in block_buster_hits]\n", + "all_recommendations[\"classics\"] = [m[0] for m in classics]\n", + "all_recommendations[\"what's popular\"] = [m[0] for m in Whats_popular]\n", + "all_recommendations[\"indie hits\"] = [m[0] for m in indie_hits]\n", + "all_recommendations[\"fruity films\"] = [m[0] for m in fruity_films]\n", + "\n", + "all_recommendations.head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Keeping Things Fresh\n", + "You've probably noticed that a few movies are repeated across these lists. That's not surprising: all our results are personalized to the same user, and fields like `popularity`, `vote_average` and `revenue` are likely highly correlated. It's also likely that at least some of the movies we predict a user will rate highly are ones they've already watched and rated highly.\n", + "\n", + "We need a way to filter out movies that a user has already seen, and movies that we've already recommended to them.\n", + "We could use a Tag filter on our queries to exclude movies by their id, but this gets cumbersome quickly.\n", + "Luckily Redis offers an easy way to keep recommendations new and interesting: Bloom filters."
+ ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "# rewrite the get_recommendations() function to use a bloom filter and apply it before we return results\n", + "def get_unique_recommendations(user_id, filters=None, num_results=10):\n", + "    user_data = client.json().get(f\"user:{user_id}\")\n", + "    user_vector = user_data[\"user_vector\"]\n", + "    watched_movies = user_data[\"watched_list_ids\"]\n", + "\n", + "    # use a Bloom Filter to filter out movies that the user has already watched\n", + "    client.bf().insert('user_watched_list', [f\"{user_id}:{movie_id}\" for movie_id in watched_movies])\n", + "\n", + "    query = RangeQuery(vector=user_vector,\n", + "                       vector_field_name='movie_vector',\n", + "                       num_results=num_results * 5, # fetch more results to account for watched movies\n", + "                       filter_expression=filters,\n", + "                       return_fields=['title', 'overview', 'genres', 'movieId'],\n", + "                       )\n", + "    results = movie_index.query(query)\n", + "\n", + "    # check every candidate against the bloom filter in one call\n", + "    matches = client.bf().mexists(\"user_watched_list\", *[f\"{user_id}:{r['movieId']}\" for r in results])\n", + "\n", + "    recommendations = [\n", + "        (r['title'], r['overview'], r['genres'], r['vector_distance'], r['movieId'])\n", + "        for i, r in enumerate(results) if matches[i] == 0\n", + "    ][:num_results]\n", + "\n", + "    # add these recommendations to the bloom filter so they don't appear again\n", + "    client.bf().insert('user_watched_list', [f\"{user_id}:{r[4]}\" for r in recommendations])\n", + "    return recommendations\n", + "\n", + "# example usage\n", + "# create a bloom filter for all our users\n", + "try:\n", + "    client.bf().create(\"user_watched_list\", 0.01, 10000)\n", + "except Exception:\n", + "    # the filter already exists, so delete it and start fresh\n", + "    client.delete(\"user_watched_list\")\n", + "    client.bf().create(\"user_watched_list\", 0.01, 10000)\n", + "\n", + "user_id = 42\n", + "\n", + "top_picks_for_you = get_unique_recommendations(user_id=user_id, num_results=5) # general SVD results, no filter\n", +
"block_buster_hits = get_unique_recommendations(user_id=user_id, filters=block_buster_filter, num_results=5)\n", + "classics = get_unique_recommendations(user_id=user_id, filters=classics_filter, num_results=5)\n", + "whats_popular = get_unique_recommendations(user_id=user_id, filters=popular_filter, num_results=5)\n", + "indie_hits = get_unique_recommendations(user_id=user_id, filters=indie_filter, num_results=5)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "vscode": { + "languageId": "ruby" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " top picks block busters \\\n", + "0 Cinema Paradiso The Manchurian Candidate \n", + "1 Lock, Stock and Two Smoking Barrels Toy Story \n", + "2 The African Queen The Godfather: Part II \n", + "3 The Silence of the Lambs Back to the Future \n", + "4 Eat Drink Man Woman The Godfather \n", + "\n", + " classics what's popular indie hits \n", + "0 Castle in the Sky Fight Club All Quiet on the Western Front \n", + "1 12 Angry Men Whiplash Army of Darkness \n", + "2 My Neighbor Totoro Blade Runner All About Eve \n", + "3 It Happened One Night Gone Girl The Professional \n", + "4 Stand by Me Big Hero 6 Shine " + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# put all these titles into a single pandas dataframe , where each column is one category\n", + "all_recommendations = pd.DataFrame(columns=[\"top picks\", \"block busters\", \"classics\", \"what's popular\", \"indie hits\"])\n", + "all_recommendations[\"top picks\"] = [m[0] for m in top_picks_for_you]\n", + "all_recommendations[\"block busters\"] = [m[0] for m in block_buster_hits]\n", + "all_recommendations[\"classics\"] = [m[0] for m in classics]\n", + "all_recommendations[\"what's popular\"] = [m[0] for m in whats_popular]\n", + "all_recommendations[\"indie hits\"] = [m[0] for m in indie_hits]\n", + "\n", + "all_recommendations.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "That's it! 
That's all it takes to build a highly scalable, personalized, customizable collaborative filtering recommendation system with Redis and RedisVL.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Deleted 4358 keys\n", + "Deleted 2000 keys\n", + "Deleted 1000 keys\n", + "Deleted 500 keys\n", + "Deleted 500 keys\n" + ] + }, + { + "data": { + "text/plain": [ + "671" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# clean up your index\n", + "while remaining := movie_index.clear():\n", + " print(f\"Deleted {remaining} keys\")\n", + "\n", + "client.delete(\"user_watched_list\")\n", + "client.delete(*[f\"user:{user_id}\" for user_id in user_vectors_and_ids.keys()])" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "redis-ai-res", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/python-recipes/recommendation-systems/collaborative_filtering_schema.yaml b/python-recipes/recommendation-systems/collaborative_filtering_schema.yaml new file mode 100644 index 00000000..af58d793 --- /dev/null +++ b/python-recipes/recommendation-systems/collaborative_filtering_schema.yaml @@ -0,0 +1,40 @@ +index: + name: movies + prefix: movie + storage_type: json + +fields: + - name: movieId + type: tag + - name: genres + type: tag + - name: original_language + type: tag + - name: overview + type: text + - name: popularity + type: numeric + - name: release_date + type: numeric + - name: revenue + type: numeric + - name: runtime + type: numeric + - name: status + type: tag + - name: tagline + type: 
text + - name: title + type: text + - name: vote_average + type: numeric + - name: vote_count + type: numeric + + - name: movie_vector + type: vector + attrs: + dims: 100 + distance_metric: ip + algorithm: flat + datatype: float32 \ No newline at end of file diff --git a/requirements.txt b/requirements.txt index 216712e5..08d8a236 100644 --- a/requirements.txt +++ b/requirements.txt @@ -20,4 +20,4 @@ redisvl>=0.3.0 pytest ragas datasets - +scikit-surprise