| 1 | +--- |
| 2 | +title: Offline Data Loading |
| 3 | +weight: 3 |
| 4 | + |
| 5 | +### FIXED, DO NOT MODIFY |
| 6 | +layout: learningpathall |
| 7 | +--- |
| 8 | + |
| 9 | +In this section, we show how to load private knowledge into our RAG pipeline. |
| 10 | + |
| 11 | +### Create the Collection |
| 12 | +We use [Zilliz Cloud](https://zilliz.com/cloud), deployed on AWS with Arm-based machines, to store and retrieve the vector data. To get started quickly, [register an account](https://docs.zilliz.com/docs/register-with-zilliz-cloud) on Zilliz Cloud for free. |
| 13 | + |
| 14 | +> Besides Zilliz Cloud, self-hosted Milvus is also an option, though it is more complicated to set up. We can deploy [Milvus Standalone](https://milvus.io/docs/install_standalone-docker-compose.md) or a [Milvus cluster on Kubernetes](https://milvus.io/docs/install_cluster-milvusoperator.md) on Arm-based machines. For more information about Milvus installation, refer to the [installation documentation](https://milvus.io/docs/install-overview.md). |
| 15 | + |
| 16 | +We set `uri` and `token` to the [Public Endpoint and API key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#free-cluster-details) of the cluster in Zilliz Cloud. |
| 17 | +```python |
| 18 | +from pymilvus import MilvusClient |
| 19 | + |
| 20 | +milvus_client = MilvusClient( |
| 21 | +    uri="<your_zilliz_public_endpoint>", token="<your_zilliz_api_key>" |
| 22 | +) |
| 23 | + |
| 24 | +collection_name = "my_rag_collection" |
| 25 | + |
| 26 | +``` |
| 27 | +Check if the collection already exists and drop it if it does. |
| 28 | +```python |
| 29 | +if milvus_client.has_collection(collection_name): |
| 30 | +    milvus_client.drop_collection(collection_name) |
| 31 | +``` |
| 32 | +Create a new collection with specified parameters. |
| 33 | + |
| 34 | +If we don't specify any field information, Milvus automatically creates a default `id` field as the primary key and a `vector` field to store the vector data. A reserved JSON field stores any fields that are not defined in the schema, together with their values. |
| 35 | +```python |
| 36 | +milvus_client.create_collection( |
| 37 | +    collection_name=collection_name, |
| 38 | +    dimension=384,  # Output dimension of all-MiniLM-L6-v2, the embedding model used below |
| 39 | +    metric_type="IP",  # Inner product distance |
| 40 | +    consistency_level="Strong",  # Strong consistency level |
| 41 | +) |
| 42 | +``` |
| 43 | +We use inner product distance as the metric type. For more information about distance types, refer to the [Similarity Metrics page](https://milvus.io/docs/metric.md?tab=floating). |
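With the `IP` metric, a larger score means a closer match. For unit-length vectors, the inner product equals cosine similarity. The sketch below illustrates this with made-up 2-dimensional vectors. The helper names are hypothetical, and whether your embedding model returns normalized vectors is an assumption you should verify.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Inner product divided by the product of the vector lengths."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


# Toy vectors for illustration only; real embeddings have hundreds of dimensions.
query = [3.0, 4.0]
doc = [4.0, 3.0]

ip_raw = inner_product(query, doc)  # depends on vector lengths
ip_unit = inner_product(normalize(query), normalize(doc))  # length-independent
cos = cosine_similarity(query, doc)

print(ip_raw, ip_unit, cos)
```

If the vectors are not normalized, the raw inner product also rewards longer vectors, so the ranking can differ from a cosine ranking.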
| 44 | + |
| 45 | +### Prepare the data |
| 46 | + |
| 47 | +We use the FAQ pages from the [Milvus Documentation 2.4.x](https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip) as the private knowledge in our RAG. They are a good data source for a simple RAG pipeline. |
| 48 | + |
| 49 | +Download the zip file and extract documents to the folder `milvus_docs`. |
| 50 | + |
| 51 | +```shell |
| 52 | +wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip |
| 53 | +unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs |
| 54 | +``` |
| 55 | + |
| 56 | +We load all Markdown files from the folder `milvus_docs/en/faq`. For each document, we simply split the file content on "# ", which roughly separates the main sections of the Markdown file. |
| 57 | + |
| 58 | + |
| 59 | +```python |
| 60 | +from glob import glob |
| 61 | + |
| 62 | +text_lines = [] |
| 63 | + |
| 64 | +for file_path in glob("milvus_docs/en/faq/*.md", recursive=True): |
| 65 | +    with open(file_path, "r") as file: |
| 66 | +        file_text = file.read() |
| 67 | + |
| 68 | +    text_lines += file_text.split("# ") |
| 69 | +``` |
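To see what splitting on "# " produces, here is a small sketch on a made-up Markdown string. Because "## " also contains "# ", subsections are split as well, a stray `#` is left at the end of the preceding chunk, and a file that starts with a heading yields an empty first chunk. The `clean_chunks` helper is hypothetical and not part of the pipeline above; it shows one possible way to tidy the chunks.

```python
# Made-up sample document for illustration only.
sample = "# Title\nIntro text.\n## Subsection\nDetails.\n"

# Same splitting rule as in the loading loop above.
chunks = sample.split("# ")


def clean_chunks(raw_chunks):
    """Hypothetical helper: trim whitespace and stray '#' characters, drop empty chunks."""
    cleaned = (chunk.strip().strip("#").strip() for chunk in raw_chunks)
    return [chunk for chunk in cleaned if chunk]


cleaned = clean_chunks(chunks)

print(chunks)
print(cleaned)
```

Whether to apply such cleanup is a design choice. The rough split is enough for a simple demo, while cleaner chunks avoid embedding empty strings.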
| 70 | + |
| 71 | +### Insert data |
| 72 | +We use [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), a simple but efficient embedding model that converts text into embedding vectors. |
| 73 | +```python |
| 74 | +from langchain_huggingface import HuggingFaceEmbeddings |
| 75 | + |
| 76 | +embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2") |
| 77 | +``` |
| 78 | + |
| 79 | +Iterate through the text lines, create embeddings, and then insert the data into Milvus. |
| 80 | + |
| 81 | +The `text` field is not defined in the collection schema. Milvus automatically adds it to the reserved JSON dynamic field, and at a high level it can be treated as a normal field. |
| 82 | +```python |
| 83 | +from tqdm import tqdm |
| 84 | + |
| 85 | +data = [] |
| 86 | + |
| 87 | +text_embeddings = embedding_model.embed_documents(text_lines) |
| 88 | + |
| 89 | +for i, (line, embedding) in enumerate( |
| 90 | +    tqdm(zip(text_lines, text_embeddings), desc="Creating embeddings") |
| 91 | +): |
| 92 | +    data.append({"id": i, "vector": embedding, "text": line}) |
| 93 | + |
| 94 | +milvus_client.insert(collection_name=collection_name, data=data) |
| 95 | +``` |
| 96 | +```text |
| 97 | +Creating embeddings: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 72/72 [00:18<00:00, 3.91it/s] |
| 98 | +``` |
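The rows passed to `insert` are plain dictionaries with `id`, `vector`, and `text` keys. The sketch below, which uses toy 3-dimensional vectors instead of real embeddings, shows one way to validate the rows and split them into batches before inserting. The `build_rows` and `batched` helpers and the batch size are hypothetical and are not part of the Milvus API.

```python
def build_rows(texts, embeddings, expected_dim):
    """Hypothetical helper: pair each text with its vector and check the dimension."""
    if len(texts) != len(embeddings):
        raise ValueError("texts and embeddings must have the same length")
    rows = []
    for i, (text, vector) in enumerate(zip(texts, embeddings)):
        if len(vector) != expected_dim:
            raise ValueError(f"row {i}: expected {expected_dim} values, got {len(vector)}")
        rows.append({"id": i, "vector": vector, "text": text})
    return rows


def batched(rows, batch_size):
    """Hypothetical helper: yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Toy data for illustration only; real vectors come from the embedding model.
toy_texts = ["first chunk", "second chunk", "third chunk"]
toy_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]

rows = build_rows(toy_texts, toy_embeddings, expected_dim=3)
batches = list(batched(rows, batch_size=2))

print(len(rows), [len(batch) for batch in batches])
```

In the pipeline above, each batch could then be passed to `milvus_client.insert(collection_name=collection_name, data=batch)`. For the small FAQ data set, a single insert call is sufficient.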