redis-developer
diff --git a/python-recipes/recommender-systems/content_filtering.ipynb b/python-recipes/recommender-systems/content_filtering.ipynb
Lines changed: 94 additions & 41 deletions
diff --git a/python-recipes/recommender-systems/content_filtering_schema.yaml b/python-recipes/recommender-systems/content_filtering_schema.yaml
Lines changed: 34 additions & 0 deletions
@@ -26,9 +26,22 @@
     "recommender and use the movies dataset as our example data."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Content Filtering recommender systems are built on the premise that a person will want to be recommended things that are similar to things they already like.\n",
+    "\n",
+    "In the case of movies, if a person watches and enjoys a nature documentary we should recommend other nature documentaries. Or if they like classic black & white horror films we should recommend more of those.\n",
+    "\n",
+    "The question we need to answer is: what does it mean for movies to be similar? There are exact-matching strategies, such as using a movie's labelled genre ('Horror' or 'Sci Fi'), but these can lock people into only a few genres. And what if it is not the genre that a person likes, but certain story arcs that are common across many genres?\n",
+    "\n",
+    "For our content filtering recommender we'll measure similarity between movies as semantic similarity of their descriptions and keywords."
+   ]
+  },
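A minimal sketch of what "semantic similarity" between two descriptions can mean once each description is a vector. It assumes cosine distance (1 minus cosine similarity), which matches the `distance_metric: cosine` setting in the schema file of this change; the three-dimensional vectors are made-up toy values, not model output.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; values near 0 mean 'pointing the same way'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy "embeddings" (hypothetical values chosen for illustration only).
nature_doc_1 = [0.9, 0.1, 0.0]
nature_doc_2 = [0.8, 0.2, 0.1]
horror_film = [0.0, 0.2, 0.9]

d_similar = cosine_distance(nature_doc_1, nature_doc_2)
d_different = cosine_distance(nature_doc_1, horror_film)
print(f"nature vs nature: {d_similar:.3f}")
print(f"nature vs horror: {d_different:.3f}")
```

The smaller the distance, the more alike the two descriptions are taken to be, so the second nature documentary would rank ahead of the horror film.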
   {
    "cell_type": "code",
-   "execution_count": 63,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -56,12 +69,12 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Start by loading the movies data and doing a quick inspection of it."
+    "Start by downloading the movies data and doing a quick inspection of it."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 64,
+   "execution_count": 2,
    "metadata": {},
    "outputs": [
     {
@@ -223,15 +236,26 @@
        "4  1914  /title/tt0004457/  "
       ]
      },
-     "execution_count": 64,
+     "execution_count": 2,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
+    "try:\n",
+    "    df = pd.read_csv(\"datasets/content_filtering/25k_imdb_movie_dataset.csv\")\n",
+    "except FileNotFoundError:\n",
+    "    import os\n",
+    "    import requests\n",
+    "    url = 'https://redis-ai-resources.s3.us-east-2.amazonaws.com/recommenders/datasets/content-filtering/25k_imdb_movie_dataset.csv'\n",
+    "    r = requests.get(url)\n",
+    "    r.raise_for_status()  # stop here if the download failed\n",
+    "    # save the file as a csv, creating the target directory if needed\n",
+    "    os.makedirs('./datasets/content_filtering', exist_ok=True)\n",
+    "    with open('./datasets/content_filtering/25k_imdb_movie_dataset.csv', 'wb') as f:\n",
+    "        f.write(r.content)\n",
+    "    df = pd.read_csv(\"datasets/content_filtering/25k_imdb_movie_dataset.csv\")\n",
     "\n",
-    "# modified from https://www.kaggle.com/datasets/utsh0dey/25k-movie-dataset\n",
-    "df = pd.read_csv(\"datasets/imdb_movies/25k_imdb_movie_dataset.csv\")\n",
     "df.head()"
    ]
   },
@@ -248,7 +272,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 65,
+   "execution_count": 3,
    "metadata": {},
    "outputs": [
     {
@@ -266,7 +290,7 @@
        "dtype: int64"
       ]
      },
-     "execution_count": 65,
+     "execution_count": 3,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -303,17 +327,17 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "RedisVL supports complex query logic, beyond just vector similarity. To showcase this we'll generate an embedding from each movies' `overview` text and list of `plot keywords`,\n",
-    "and use the remaining fields like, `genres`, `year`, and `rating` as filterable fields to target our vector queries to.\n",
+    "Since we measure movie similarity as the semantic similarity of movie descriptions, we need a way to generate semantic vector embeddings of these descriptions.\n",
     "\n",
-    "#There are many choices for text vectorization, but here we'll use a pretrained model from HuggingFace's transformers library.\n",
+    "RedisVL supports many different embedding generators. For this example we'll use a HuggingFace model that is rated well for semantic similarity use cases.\n",
     "\n",
-    "RedisVL supports many different embedding generators. For this example we'll use a HuggingFace model that is rated well for semantic similarity use cases."
+    "RedisVL also supports complex query logic beyond vector similarity alone. To showcase this, we'll generate an embedding from each movie's `overview` text and list of `plot keywords`,\n",
+    "and use the remaining fields, such as `genres`, `year`, and `rating`, as filterable fields to narrow our vector queries.\n"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 66,
+   "execution_count": 4,
    "metadata": {},
    "outputs": [
     {
@@ -322,7 +346,7 @@
        "'The Story of the Kelly Gang. Story of Ned Kelly, an infamous 19th-century Australian outlaw. ned kelly, australia, historic figure, australian western, first of its kind, directorial debut, australian history, 19th century, victoria australia, australian'"
       ]
      },
-     "execution_count": 66,
+     "execution_count": 4,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -335,23 +359,40 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 67,
+   "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "# this step will take a while, but only needs to be done once for your entire dataset\n",
-    "# currently taking 10 minutes to run, so we've gone ahead and saved the vectors to a file\n",
+    "# currently taking 10 minutes to run, so we've gone ahead and saved the vectors to a file for you\n",
+    "# if you don't want to wait, you can skip the cell and load the vectors from the file in the next cell\n",
     "import pickle\n",
     "from redisvl.utils.vectorize import HFTextVectorizer\n",
     "\n",
     "vectorizer = HFTextVectorizer(model = 'sentence-transformers/paraphrase-MiniLM-L6-v2')\n",
     "\n",
+    "df['embedding'] = df['full_text'].apply(lambda x: vectorizer.embed(x, as_buffer=False))\n",
+    "pickle.dump(df['embedding'], open('datasets/content_filtering/text_embeddings.pkl', 'wb'))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pickle\n",
+    "\n",
     "try:\n",
-    "    with open('datasets/imdb_movies/text_embeddings.pkl', 'rb') as vector_file:\n",
+    "    with open('datasets/content_filtering/text_embeddings.pkl', 'rb') as vector_file:\n",
     "        df['embedding'] = pickle.load(vector_file)\n",
     "except:\n",
-    "    df['embedding'] = df['full_text'].apply(lambda x: vectorizer.embed(x, as_buffer=False))\n",
-    "    pickle.dump(df['embedding'], open('datasets/imdb_movies/text_embeddings.pkl', 'wb'))"
+    "    import requests  # may not be imported yet if the CSV was already on disk\n",
+    "    embeddings_url = 'https://redis-ai-resources.s3.us-east-2.amazonaws.com/recommenders/datasets/content-filtering/text_embeddings.pkl'\n",
+    "    r = requests.get(embeddings_url)\n",
+    "    with open('./datasets/content_filtering/text_embeddings.pkl', 'wb') as f:\n",
+    "        f.write(r.content)\n",
+    "    df['embedding'] = pickle.loads(r.content)"
    ]
   },
   {
@@ -364,16 +405,25 @@
     "We'll load this from the accompanying `content_filtering_schema.yaml` file."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This schema defines what each entry will look like within Redis. It needs to specify the name of each field, like `title`, `rating`, and `rating_count`, as well as the type of each field, like `text` or `numeric`.\n",
+    "\n",
+    "The vector component of each entry similarly needs its dimension (dims), distance metric, algorithm and datatype (dtype) attributes specified."
+   ]
+  },
   {
    "cell_type": "code",
-   "execution_count": 68,
+   "execution_count": 6,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "16:54:56 redisvl.index.index INFO   Index already exists, overwriting.\n"
+      "15:15:43 redisvl.index.index INFO   Index already exists, overwriting.\n"
      ]
     }
    ],
@@ -402,7 +452,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 69,
+   "execution_count": 7,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -423,18 +473,18 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 70,
+   "execution_count": 8,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "{'id': 'movie:b60e3c2f9e0d43dd8ed33f2f835fb4e0', 'vector_distance': '0.584870040417', 'title': 'The Odyssey', 'overview': 'The aquatic adventure of the highly influential and fearlessly ambitious pioneer, innovator, filmmaker, researcher, and conservationist, Jacques-Yves Cousteau, covers roughly thirty years of an inarguably rich in achievements life.'}\n",
-      "{'id': 'movie:82808ccd86c44864814c67d8c88ca0d1', 'vector_distance': '0.63329231739', 'title': 'The Inventor', 'overview': 'Inventing flying contraptions, war machines and studying cadavers, Leonardo da Vinci tackles the meaning of life itself with the help of French princess Marguerite de Nevarre.'}\n",
-      "{'id': 'movie:32ca15de9f6f4054b01fddd93e24eba6', 'vector_distance': '0.658123672009', 'title': 'Ruin', 'overview': 'The film follows a nameless ex-Nazi captain who navigates the ruins of post-WWII Germany determined to atone for his crimes during the war by hunting down the surviving members of his former SS Death Squad.'}\n",
-      "{'id': 'movie:d2c917e916cc47f3af335f0ec0e1bb50', 'vector_distance': '0.688094437122', 'title': 'The Raven', 'overview': 'A man with incredible powers is sought by the government and military.'}\n",
-      "{'id': 'movie:fbc21874295d479292eff1486bc49c20', 'vector_distance': '0.694671392441', 'title': 'Get the Girl', 'overview': 'Sebastain \"Bash\" Danye, a legendary gun for hire hangs up his weapon to retire peacefully with his \\'it\\'s complicated\\' partner Renee. Their quiet lives are soon interrupted when they find an unconscious woman on their property, Maddie. While nursing her back to health, some bad me... Read all'}\n"
+      "{'id': 'movie:be648c0ed83b460d9e01f03940c7c7cf', 'vector_distance': '0.584870040417', 'title': 'The Odyssey', 'overview': 'The aquatic adventure of the highly influential and fearlessly ambitious pioneer, innovator, filmmaker, researcher, and conservationist, Jacques-Yves Cousteau, covers roughly thirty years of an inarguably rich in achievements life.'}\n",
+      "{'id': 'movie:bc1375b4d7dd47e2a117c94bebdffa28', 'vector_distance': '0.63329231739', 'title': 'The Inventor', 'overview': 'Inventing flying contraptions, war machines and studying cadavers, Leonardo da Vinci tackles the meaning of life itself with the help of French princess Marguerite de Nevarre.'}\n",
+      "{'id': 'movie:469d553ae60846b9b8c7128fc57fe079', 'vector_distance': '0.658123672009', 'title': 'Ruin', 'overview': 'The film follows a nameless ex-Nazi captain who navigates the ruins of post-WWII Germany determined to atone for his crimes during the war by hunting down the surviving members of his former SS Death Squad.'}\n",
+      "{'id': 'movie:493527b640bc48a2941ee46df71a4018', 'vector_distance': '0.688094437122', 'title': 'The Raven', 'overview': 'A man with incredible powers is sought by the government and military.'}\n",
+      "{'id': 'movie:bd123271d6a04128acb93dc48b8e5847', 'vector_distance': '0.694671392441', 'title': 'Get the Girl', 'overview': 'Sebastain \"Bash\" Danye, a legendary gun for hire hangs up his weapon to retire peacefully with his \\'it\\'s complicated\\' partner Renee. Their quiet lives are soon interrupted when they find an unconscious woman on their property, Maddie. While nursing her back to health, some bad me... Read all'}\n"
      ]
     }
    ],
@@ -468,7 +518,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 71,
+   "execution_count": 9,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -478,7 +528,7 @@
     "    flexible_filter = (\n",
     "        (Num(\"year\") > release_year) &  # only show movies released after this year\n",
     "        (Tag(\"genres\") == genres) &     # only show movies that match at least one in list of genres\n",
-    "        (Text(\"full_text\") % keywords)   # only show movies that contain at least one of the keywords\n",
+    "        (Text(\"full_text\") % keywords)  # only show movies that contain at least one of the keywords\n",
     "    )\n",
     "    return flexible_filter\n",
     "\n",
@@ -495,9 +545,20 @@
     "    return recommendations"
    ]
   },
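To make the intent of the combined filter concrete, here is a plain-Python reading of its three inline comments, applied to a few in-memory records. This is only a sketch: it does not use RedisVL, the helper `matches_filter` and the sample movies are hypothetical, and the substring check is a simplification of real full-text matching.

```python
def matches_filter(movie, release_year, genres, keywords):
    """Mirror the filter comments: newer than a year, shares a genre, mentions a keyword."""
    released_after = movie["year"] > release_year
    shares_genre = any(genre in movie["genres"] for genre in genres)
    text = movie["full_text"].lower()
    has_keyword = any(keyword.lower() in text for keyword in keywords)
    return released_after and shares_genre and has_keyword


# Hypothetical records, invented for this illustration.
movies = [
    {"title": "A", "year": 1995, "genres": ["Horror"], "full_text": "A vampire haunts a castle"},
    {"title": "B", "year": 1985, "genres": ["Horror"], "full_text": "A vampire in the city"},
    {"title": "C", "year": 2001, "genres": ["Comedy"], "full_text": "Vampire roommates"},
    {"title": "D", "year": 2005, "genres": ["Horror", "Thriller"], "full_text": "A haunted house"},
]

kept = [
    movie["title"]
    for movie in movies
    if matches_filter(movie, release_year=1990, genres=["Horror"], keywords=["vampire", "monster"])
]
print(kept)
```

Each of the three conditions removes a different record here: one is too old, one has the wrong genre, and one mentions none of the keywords.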
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a final demonstration we'll find movies similar to the classic horror film 'Nosferatu'.\n",
+    "The process has 3 steps:\n",
+    "- fetch the vector embedding of our film Nosferatu\n",
+    "- optionally define any hard filters we want. Here we'll specify we want horror movies made on or after 1990\n",
+    "- perform the vector range query to find similar movies that meet our filter criteria"
+   ]
+  },
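The three steps can be sketched without Redis at all. The example below assumes cosine distance and a small in-memory catalog; every title other than Nosferatu, all vectors, and the helper names `cosine_distance` and `range_query` are hypothetical and only serve to show the flow of fetch, filter, then range query.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def range_query(query_vec, items, distance_threshold, keep):
    """Return (distance, title) pairs within the threshold that pass the hard filter."""
    hits = []
    for item in items:
        if not keep(item):
            continue
        distance = cosine_distance(query_vec, item["embedding"])
        if distance <= distance_threshold:
            hits.append((distance, item["title"]))
    return sorted(hits)


# Made-up catalog with toy 3-dimensional embeddings.
catalog = [
    {"title": "Nosferatu", "year": 1922, "genres": ["Horror"], "embedding": [0.1, 0.9, 0.2]},
    {"title": "Modern Vampire", "year": 1992, "genres": ["Horror"], "embedding": [0.2, 0.8, 0.3]},
    {"title": "Old Ghost", "year": 1960, "genres": ["Horror"], "embedding": [0.1, 0.9, 0.25]},
    {"title": "Space Comedy", "year": 2005, "genres": ["Comedy"], "embedding": [0.9, 0.1, 0.1]},
    {"title": "Slasher", "year": 2010, "genres": ["Horror"], "embedding": [0.9, 0.2, 0.1]},
]

# Step 1: fetch the embedding of the film we want recommendations for.
query_vec = next(m["embedding"] for m in catalog if m["title"] == "Nosferatu")


# Step 2: define the hard filter (horror movies made on or after 1990).
def keep(item):
    return "Horror" in item["genres"] and item["year"] >= 1990


# Step 3: run the range query over everything that passes the filter.
results = range_query(query_vec, catalog, distance_threshold=0.2, keep=keep)
print([title for _, title in results])
```

The filter decides which movies are eligible at all, and the distance threshold decides which of those are close enough to count as similar.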
   {
    "cell_type": "code",
-   "execution_count": 72,
+   "execution_count": 10,
    "metadata": {},
    "outputs": [
     {
@@ -542,17 +603,9 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 73,
+   "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Deleted 143 keys\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "# clean up your index\n",
     "while remaining := index.clear():\n",
 
@@ -0,0 +1,34 @@
+index:
+    name: movies_recommendation
+    prefix: movie
+    storage_type: json
+
+fields:
+    - name: title
+      type: text
+    - name: rating
+      type: numeric
+    - name: rating_count
+      type: numeric
+    - name: genres
+      type: tag
+    - name: overview
+      type: text
+    - name: keywords
+      type: tag
+    - name: cast
+      type: tag
+    - name: writer
+      type: text
+    - name: year
+      type: numeric
+    - name: full_text
+      type: text
+
+    - name: embedding
+      type: vector
+      attrs:
+          dims: 384
+          distance_metric: cosine
+          algorithm: flat
+          dtype: float32
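Before loading records into an index built from this schema, it can help to check that each record has the declared fields and an embedding of the declared size. The sketch below copies the field names, types, and `dims: 384` from the YAML above; the helper `check_record` and the sample record values are hypothetical, and it uses only the standard library rather than any RedisVL validation.

```python
# Field names and types copied from the schema; the embedding is checked separately.
FIELD_TYPES = {
    "title": "text",
    "rating": "numeric",
    "rating_count": "numeric",
    "genres": "tag",
    "overview": "text",
    "keywords": "tag",
    "cast": "tag",
    "writer": "text",
    "year": "numeric",
    "full_text": "text",
}
VECTOR_DIMS = 384


def check_record(record):
    """Return a list of human-readable problems; an empty list means the record looks fine."""
    problems = []
    for name, kind in FIELD_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif kind == "numeric" and (
            isinstance(record[name], bool) or not isinstance(record[name], (int, float))
        ):
            problems.append(f"{name} should be numeric")
    embedding = record.get("embedding", [])
    if len(embedding) != VECTOR_DIMS:
        problems.append(f"embedding has {len(embedding)} dims, expected {VECTOR_DIMS}")
    return problems


# Hypothetical record with placeholder values.
good_record = {
    "title": "Example Movie", "rating": 7.5, "rating_count": 1200,
    "genres": "Horror", "overview": "An example overview.", "keywords": "example",
    "cast": "Example Actor", "writer": "Example Writer", "year": 1999,
    "full_text": "Example Movie. An example overview. example",
    "embedding": [0.0] * VECTOR_DIMS,
}
bad_record = dict(good_record, year="1999", embedding=[0.1, 0.2, 0.3])

print(check_record(good_record))
print(check_record(bad_record))
```

A mismatch between the embedding length and `dims` is a common source of confusing results, so it is worth catching before any data is written.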