|
954 | 954 | "docs = await loader.aload()\n",
|
955 | 955 | "print(docs)"
|
956 | 956 | ]
|
| 957 | + }, |
| 958 | + { |
| 959 | + "cell_type": "markdown", |
| 960 | + "metadata": {}, |
| 961 | + "source": [ |
| 962 | + "# Hybrid Search with AlloyDBVectorStore\n", |
| 963 | + "\n", |
| 964 | + "Hybrid search combines multiple lookup strategies to return more comprehensive and relevant results. Specifically, it combines dense embedding vector search (for semantic similarity) with TSV (Text Search Vector) based keyword search (for lexical matching). This approach is particularly useful for applications that need to search customized text and metadata efficiently, especially when a specialized embedding model isn't feasible or necessary.\n", |
| 965 | + "\n", |
| 966 | + "By integrating both semantic and lexical capabilities, hybrid search helps overcome the limitations of each individual method:\n", |
| 967 | + "* **Semantic Search**: Excellent for understanding the meaning of a query, even if the exact keywords aren't present. However, it can sometimes miss highly relevant documents that contain the precise keywords but have a slightly different semantic context.\n", |
| 968 | + "* **Keyword Search**: Highly effective for finding documents with exact keyword matches, and generally fast. Its weakness is that it cannot account for synonyms, misspellings, or conceptual relationships." |
| 969 | + ] |
| 970 | + }, |
| 971 | + { |
| 972 | + "cell_type": "markdown", |
| 973 | + "metadata": {}, |
| 974 | + "source": [ |
| 975 | + "## Hybrid Search Config\n", |
| 976 | + "\n", |
| 977 | + "You can use hybrid search with `AlloyDBVectorStore` by providing a `HybridSearchConfig`.\n", |
| 978 | + "\n", |
| 979 | + "With a `HybridSearchConfig` provided, the `AlloyDBVectorStore` class can efficiently manage a hybrid search vector store using AlloyDB as the backend, automatically handling the creation and population of the necessary TSV columns when possible." |
| 980 | + ] |
| 981 | + }, |
| 982 | + { |
| 983 | + "cell_type": "markdown", |
| 984 | + "metadata": {}, |
| 985 | + "source": [ |
| 986 | + "### Building the config\n", |
| 987 | + "\n", |
| 988 | + "The hybrid search config accepts the following parameters:\n", |
| 989 | + "* **tsv_column:** The name of the TSV column. Default: `<content_column>_tsv`\n", |
| 990 | + "* **tsv_lang:** A supported text search language. Default: `pg_catalog.english`\n", |
| 991 | + "* **fts_query:** If provided, this is used for secondary retrieval instead of the user-provided query.\n", |
| 992 | + "* **fusion_function:** Determines how the results are merged. Default: equally weighted sum ranking.\n", |
| 993 | + "* **fusion_function_parameters:** Parameters for the fusion function.\n", |
| 994 | + "* **primary_top_k:** Maximum number of results fetched for primary retrieval. Default: `4`\n", |
| 995 | + "* **secondary_top_k:** Maximum number of results fetched for secondary retrieval. Default: `4`\n", |
| 996 | + "* **index_name:** Name of the index built on the `tsv_column`.\n", |
| 997 | + "* **index_type:** `GIN` or `GIST`. Default: `GIN`" |
| 998 | + ] |
| 999 | + }, |
| 1000 | + { |
| 1001 | + "cell_type": "markdown", |
| 1002 | + "metadata": {}, |
| 1003 | + "source": [ |
| 1004 | + "Here is an example `HybridSearchConfig`:" |
| 1005 | + ] |
| 1006 | + }, |
| 1007 | + { |
| 1008 | + "cell_type": "code", |
| 1009 | + "execution_count": null, |
| 1010 | + "metadata": {}, |
| 1011 | + "outputs": [], |
| 1012 | + "source": [ |
| 1013 | + "from langchain_google_alloydb_pg import (\n", |
| 1014 | + " HybridSearchConfig,\n", |
| 1015 | + " reciprocal_rank_fusion,\n", |
| 1016 | + ")\n", |
| 1017 | + "\n", |
| 1018 | + "hybrid_search_config = HybridSearchConfig(\n", |
| 1019 | + " tsv_column=\"hybrid_description\",\n", |
| 1020 | + " tsv_lang=\"pg_catalog.english\",\n", |
| 1021 | + " fusion_function=reciprocal_rank_fusion,\n", |
| 1022 | + " fusion_function_parameters={\n", |
| 1023 | + " \"rrf_k\": 60,\n", |
| 1024 | + " \"fetch_top_k\": 10,\n", |
| 1025 | + " },\n", |
| 1026 | + ")" |
| 1027 | + ] |
| 1028 | + }, |
| 1029 | + { |
| 1030 | + "cell_type": "markdown", |
| 1031 | + "metadata": {}, |
| 1032 | + "source": [ |
| 1033 | + "**Note:** This example uses `reciprocal_rank_fusion` as the fusion function, but you can also use `weighted_sum_ranking`.\n", |
| 1034 | + "\n", |
| 1035 | + "Make sure to pass the parameters that match the fusion function you choose:\n", |
| 1036 | + "\n", |
| 1037 | + "`reciprocal_rank_fusion`:\n", |
| 1038 | + "* rrf_k: The RRF parameter k. Defaults to 60\n", |
| 1039 | + "* fetch_top_k: The number of documents to fetch after merging the results. Defaults to 4\n", |
| 1040 | + "\n", |
| 1041 | + "`weighted_sum_ranking`:\n", |
| 1042 | + "* primary_results_weight: The weight for the primary source's scores. Defaults to 0.5\n", |
| 1043 | + "* secondary_results_weight: The weight for the secondary source's scores. Defaults to 0.5\n", |
| 1044 | + "* fetch_top_k: The number of documents to fetch after merging the results. Defaults to 4\n" |
| 1045 | + ] |
| 1046 | + }, |
| 1047 | + { |
| 1048 | + "cell_type": "markdown", |
| 1049 | + "metadata": {}, |
| 1050 | + "source": [ |
| 1051 | + "## Usage\n", |
| 1052 | + "\n", |
| 1053 | + "Let's assume we are using the previously mentioned table [`products`](#create-a-vector-store-using-existing-table), which stores product details for an e-commerce business.\n" |
| 1054 | + ] |
| 1055 | + }, |
| 1056 | + { |
| 1057 | + "cell_type": "markdown", |
| 1058 | + "metadata": {}, |
| 1059 | + "source": [ |
| 1060 | + "### With a new hybrid search table\n", |
| 1061 | + "To create a new AlloyDB table with a TSV column, specify the hybrid search config when initializing the vector store.\n", |
| 1062 | + "\n", |
| 1063 | + "In this case, all similarity searches use hybrid search." |
| 1064 | + ] |
| 1065 | + }, |
| 1066 | + { |
| 1067 | + "cell_type": "code", |
| 1068 | + "execution_count": null, |
| 1069 | + "metadata": {}, |
| 1070 | + "outputs": [], |
| 1071 | + "source": [ |
| 1072 | + "TABLE_NAME = \"hybrid_search_products\"\n", |
| 1073 | + "\n", |
| 1074 | + "await engine.ainit_vectorstore_table(\n", |
| 1075 | + " table_name=TABLE_NAME,\n", |
| 1076 | + " # schema_name=SCHEMA_NAME,\n", |
| 1077 | + " vector_size=VECTOR_SIZE,\n", |
| 1078 | + " id_column=\"product_id\",\n", |
| 1079 | + " content_column=\"description\",\n", |
| 1080 | + " embedding_column=\"embed\",\n", |
| 1081 | + " metadata_columns=[\"name\", \"category\", \"price_usd\", \"quantity\", \"sku\", \"image_url\"],\n", |
| 1082 | + " metadata_json_column=\"metadata\",\n", |
| 1083 | + " hybrid_search_config=hybrid_search_config,\n", |
| 1084 | + " store_metadata=True,\n", |
| 1085 | + ")\n", |
| 1086 | + "\n", |
| 1087 | + "vs_hybrid = await AlloyDBVectorStore.create(\n", |
| 1088 | + " engine,\n", |
| 1089 | + " table_name=TABLE_NAME,\n", |
| 1090 | + " # schema_name=SCHEMA_NAME,\n", |
| 1091 | + " embedding_service=embedding,\n", |
| 1092 | + " # Connect to existing VectorStore by customizing below column names\n", |
| 1093 | + " id_column=\"product_id\",\n", |
| 1094 | + " content_column=\"description\",\n", |
| 1095 | + " embedding_column=\"embed\",\n", |
| 1096 | + " metadata_columns=[\"name\", \"category\", \"price_usd\", \"quantity\", \"sku\", \"image_url\"],\n", |
| 1097 | + " metadata_json_column=\"metadata\",\n", |
| 1098 | + " hybrid_search_config=hybrid_search_config,\n", |
| 1099 | + ")\n", |
| 1100 | + "\n", |
| 1101 | + "# Fetch product documents from the previously created store\n", |
| 1102 | + "docs = await custom_store.asimilarity_search(\"products\", k=5)\n", |
| 1103 | + "# Add documents as usual; the hybrid search vector store also populates the TSV values in tsv_column\n", |
| 1104 | + "await vs_hybrid.aadd_documents(docs)\n", |
| 1105 | + "\n", |
| 1106 | + "# Use hybrid search\n", |
| 1107 | + "hybrid_docs = await vs_hybrid.asimilarity_search(\"products\", k=5)\n", |
| 1108 | + "print(hybrid_docs)" |
| 1109 | + ] |
| 1110 | + }, |
| 1111 | + { |
| 1112 | + "cell_type": "markdown", |
| 1113 | + "metadata": {}, |
| 1114 | + "source": [ |
| 1115 | + "### With a pre-existing table\n", |
| 1116 | + "\n", |
| 1117 | + "If a hybrid search config was **not** provided to `init_vectorstore_table` when the table was created, the table does not contain a `tsv_column`. You can still use hybrid search by passing a `HybridSearchConfig` to the vector store.\n", |
| 1118 | + "\n", |
| 1119 | + "Because the TSV column is not present, the TSV vectors are computed on the fly during hybrid search." |
| 1120 | + ] |
| 1121 | + }, |
| 1122 | + { |
| 1123 | + "cell_type": "code", |
| 1124 | + "execution_count": null, |
| 1125 | + "metadata": {}, |
| 1126 | + "outputs": [], |
| 1127 | + "source": [ |
| 1128 | + "# Set the existing table name\n", |
| 1129 | + "TABLE_NAME = \"products\"\n", |
| 1130 | + "# SCHEMA_NAME = \"my_schema\"\n", |
| 1131 | + "\n", |
| 1132 | + "hybrid_search_config = HybridSearchConfig(\n", |
| 1133 | + " tsv_lang=\"pg_catalog.english\",\n", |
| 1134 | + " fusion_function=reciprocal_rank_fusion,\n", |
| 1135 | + " fusion_function_parameters={\n", |
| 1136 | + " \"rrf_k\": 60,\n", |
| 1137 | + " \"fetch_top_k\": 10,\n", |
| 1138 | + " },\n", |
| 1139 | + ")\n", |
| 1140 | + "\n", |
| 1141 | + "# Initialize AlloyDBVectorStore with the hybrid search config\n", |
| 1142 | + "custom_hybrid_store = await AlloyDBVectorStore.create(\n", |
| 1143 | + " engine,\n", |
| 1144 | + " table_name=TABLE_NAME,\n", |
| 1145 | + " # schema_name=SCHEMA_NAME,\n", |
| 1146 | + " embedding_service=embedding,\n", |
| 1147 | + " # Connect to existing VectorStore by customizing below column names\n", |
| 1148 | + " id_column=\"product_id\",\n", |
| 1149 | + " content_column=\"description\",\n", |
| 1150 | + " embedding_column=\"embed\",\n", |
| 1151 | + " metadata_columns=[\"name\", \"category\", \"price_usd\", \"quantity\", \"sku\", \"image_url\"],\n", |
| 1152 | + " metadata_json_column=\"metadata\",\n", |
| 1153 | + " hybrid_search_config=hybrid_search_config,\n", |
| 1154 | + ")\n", |
| 1155 | + "\n", |
| 1156 | + "# Use hybrid search\n", |
| 1157 | + "hybrid_docs = await custom_hybrid_store.asimilarity_search(\"products\", k=5)\n", |
| 1158 | + "print(hybrid_docs)" |
| 1159 | + ] |
| 1160 | + }, |
| 1161 | + { |
| 1162 | + "cell_type": "markdown", |
| 1163 | + "metadata": {}, |
| 1164 | + "source": [ |
| 1165 | + "In this case, all the similarity searches will make use of hybrid search." |
| 1166 | + ] |
| 1167 | + }, |
| 1168 | + { |
| 1169 | + "cell_type": "markdown", |
| 1170 | + "metadata": {}, |
| 1171 | + "source": [ |
| 1172 | + "### Applying Hybrid Search to Specific Queries\n", |
| 1173 | + "\n", |
| 1174 | + "To use hybrid search only for certain queries, omit the configuration during initialization and pass it directly to the search method when needed." |
| 1175 | + ] |
| 1176 | + }, |
| 1177 | + { |
| 1178 | + "cell_type": "code", |
| 1179 | + "execution_count": null, |
| 1180 | + "metadata": {}, |
| 1181 | + "outputs": [], |
| 1182 | + "source": [ |
| 1183 | + "# Use hybrid search for this query only\n", |
| 1184 | + "hybrid_docs = await custom_store.asimilarity_search(\n", |
| 1185 | + " \"products\", k=5, hybrid_search_config=hybrid_search_config\n", |
| 1186 | + ")\n", |
| 1187 | + "print(hybrid_docs)" |
| 1188 | + ] |
| 1189 | + }, |
| 1190 | + { |
| 1191 | + "cell_type": "markdown", |
| 1192 | + "metadata": {}, |
| 1193 | + "source": [ |
| 1194 | + "## Hybrid Search Index\n", |
| 1195 | + "\n", |
| 1196 | + "If you created the AlloyDB table with a `tsv_column`, you can optionally create an index on that column." |
| 1197 | + ] |
| 1198 | + }, |
| 1199 | + { |
| 1200 | + "cell_type": "code", |
| 1201 | + "execution_count": null, |
| 1202 | + "metadata": {}, |
| 1203 | + "outputs": [], |
| 1204 | + "source": [ |
| 1205 | + "await vs_hybrid.aapply_hybrid_search_index()" |
| 1206 | + ] |
957 | 1207 | }
|
958 | 1208 | ],
|
959 | 1209 | "metadata": {
|
|