Commit 784cdd0

Merge pull request #1337 from madeline-underwood/milvus

Milvus_KB to review

2 parents 829c536 + fccb522

6 files changed: +69 -69 lines changed

content/learning-paths/servers-and-cloud-computing/milvus-rag/_index.md

Lines changed: 6 additions & 10 deletions
@@ -1,21 +1,17 @@
 ---
-title: Build a Retrieval-Augmented Generation (RAG) application using Zilliz Cloud on Arm servers
-
-draft: true
-cascade:
-  draft: true
+title: Build a RAG application using Zilliz Cloud on Arm servers
 
 minutes_to_complete: 20
 
-who_is_this_for: This is an introductory topic for software developers who want to create a RAG application on Arm servers.
+who_is_this_for: This is an introductory topic for software developers who want to create a Retrieval-Augmented Generation (RAG) application on Arm servers.
 
 learning_objectives:
-- Create a simple RAG application using Zilliz Cloud
-- Launch a LLM service on Arm servers
+- Create a simple RAG application using Zilliz Cloud.
+- Launch an LLM service on Arm servers.
 
 prerequisites:
-- Basic understanding of a RAG pipeline.
-- An AWS Graviton3 c7g.2xlarge instance, or any [Arm based instance](/learning-paths/servers-and-cloud-computing/csp) from a cloud service provider or an on-premise Arm server.
+- A basic understanding of a RAG pipeline.
+- An AWS Graviton3 c7g.2xlarge instance, or any [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp) from a cloud service provider or an on-premise Arm server.
 - A [Zilliz account](https://zilliz.com/cloud), which you can sign up for with a free trial.
 
 author_primary: Chen Zhang

content/learning-paths/servers-and-cloud-computing/milvus-rag/_review.md

Lines changed: 6 additions & 6 deletions
@@ -12,23 +12,23 @@ review:
 
     - questions:
         question: >
-            Can Llama3.1 model run on Arm?
+            Can Meta Llama 3.1 run on Arm?
         answers:
            - "Yes"
            - "No"
        correct_answer: 1
        explanation: >
-           The Llama-3.1-8B model from Meta can be used on Arm-based servers with llama.cpp.
+           You can use the Llama 3.1-8B model from Meta on Arm-based servers with llama.cpp.
 
    - questions:
        question: >
-           Which of the following is true about about Zilliz Cloud?
+           Which of the following is true about Zilliz Cloud?
        answers:
-           - "It is a fully-managed version of Milvus vector database"
-           - "It is a self-hosted version of Milvus vector database"
+           - "It is a fully managed version of Milvus vector database."
+           - "It is a self-hosted version of Milvus vector database."
        correct_answer: 1
        explanation: >
-           Zilliz Cloud is a fully-managed version of Milvus.
+           Zilliz Cloud is a fully managed version of Milvus.
 
 
 

content/learning-paths/servers-and-cloud-computing/milvus-rag/launch_llm_service.md

Lines changed: 15 additions & 15 deletions
@@ -1,23 +1,23 @@
 ---
-title: Launch LLM Server
+title: Launch the LLM Server
 weight: 4
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-In this section, you will build and run the `llama.cpp` server program using an OpenAI-compatible API on your running AWS Arm-based server instance.
+### Llama 3.1 Model and Llama.cpp
 
-### Llama 3.1 model & llama.cpp
+In this section, you will build and run the `llama.cpp` server program using an OpenAI-compatible API on your AWS Arm-based server instance.
 
 The [Llama-3.1-8B model](https://huggingface.co/cognitivecomputations/dolphin-2.9.4-llama3.1-8b-gguf) from Meta belongs to the Llama 3.1 model family and is free to use for research and commercial purposes. Before you use the model, visit the Llama [website](https://llama.meta.com/llama-downloads/) and fill in the form to request access.
 
-[llama.cpp](https://github.com/ggerganov/llama.cpp) is an open source C/C++ project that enables efficient LLM inference on a variety of hardware - both locally, and in the cloud. You can conveniently host a Llama 3.1 model using `llama.cpp`.
+[Llama.cpp](https://github.com/ggerganov/llama.cpp) is an open-source C/C++ project that enables efficient LLM inference on a variety of hardware, both locally and in the cloud. You can conveniently host a Llama 3.1 model using `llama.cpp`.
 
 
-### Download and build llama.cpp
+### Download and build Llama.cpp
 
-Run the following commands to install make, cmake, gcc, g++, and other essential tools required for building llama.cpp from source:
+Run the following commands to install make, cmake, gcc, g++, and other essential tools required for building Llama.cpp from source:
 
 ```bash
 sudo apt install make cmake -y
@@ -27,13 +27,13 @@ sudo apt install build-essential -y
 
 You are now ready to start building `llama.cpp`.
 
-Clone the source repository for llama.cpp:
+Clone the source repository for Llama.cpp:
 
 ```bash
 git clone https://github.com/ggerganov/llama.cpp
 ```
 
-By default, `llama.cpp` builds for CPU only on Linux and Windows. You don't need to provide any extra switches to build it for the Arm CPU that you run it on.
+By default, `llama.cpp` builds for CPU only on Linux and Windows. You do not need to provide any extra switches to build it for the Arm CPU that you run it on.
 
 Run `make` to build it:
 
@@ -64,23 +64,23 @@ You can now download the model using the huggingface cli:
 ```bash
 huggingface-cli download cognitivecomputations/dolphin-2.9.4-llama3.1-8b-gguf dolphin-2.9.4-llama3.1-8b-Q4_0.gguf --local-dir . --local-dir-use-symlinks False
 ```
-The GGUF model format, introduced by the llama.cpp team, uses compression and quantization to reduce weight precision to 4-bit integers, significantly decreasing computational and memory demands and making Arm CPUs effective for LLM inference.
+The GGUF model format, introduced by the Llama.cpp team, uses compression and quantization to reduce weight precision to 4-bit integers, significantly decreasing computational and memory demands and making Arm CPUs effective for LLM inference.
 
 
-### Re-quantize the model weights
+### Requantize the model weights
 
-To re-quantize the model, run:
+To requantize the model, run:
 
 ```bash
 ./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf Q4_0_8_8
 ```
 
-This will output a new file, `dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf`, which contains reconfigured weights that allow `llama-cli` to use SVE 256 and MATMUL_INT8 support.
+This outputs a new file, `dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf`, which contains reconfigured weights that allow `llama-cli` to use SVE 256 and MATMUL_INT8 support.
 
 This requantization is optimal specifically for Graviton3. For Graviton2, the optimal requantization should be performed in the `Q4_0_4_4` format, and for Graviton4, the `Q4_0_4_8` format is the most suitable for requantization.
 
 ### Start the LLM Server
-You can utilize the `llama.cpp` server program and send requests via an OpenAI-compatible API. This allows you to develop applications that interact with the LLM multiple times without having to repeatedly start and stop it. Additionally, you can access the server from another machine where the LLM is hosted over the network.
+You can use the `llama.cpp` server program and send requests through an OpenAI-compatible API. This allows you to develop applications that interact with the LLM multiple times without having to repeatedly start and stop it. It also lets you access the server over the network from a machine other than the one hosting the LLM.
 
 Start the server from the command line, and it listens on port 8080:
 
@@ -91,10 +91,10 @@ Start the server from the command line, and it listens on port 8080:
 The output from this command should look like:
 
 ```output
-'main: server is listening on 127.0.0.1:8080 - starting the main loop
+main: server is listening on 127.0.0.1:8080 - starting the main loop
 ```
 
-You can also adjust the parameters of the launched LLM to adapt it to your server hardware to obtain ideal performance. For more parameter information, see the `llama-server --help` command.
+You can also adjust the parameters of the launched LLM to adapt it to your server hardware and achieve ideal performance. For more parameter information, run the `llama-server --help` command.
 
 You have started the LLM service on your AWS Graviton instance with an Arm-based CPU. In the next section, you will directly interact with the service using the OpenAI SDK.
 

content/learning-paths/servers-and-cloud-computing/milvus-rag/offline_data_loading.md

Lines changed: 21 additions & 20 deletions
@@ -5,30 +5,31 @@ weight: 3
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
+## Create a dedicated cluster
 
-In this section, you will learn how to setup a cluster on Zilliz Cloud. You will then learn how to load your private knowledge database into the cluster.
+In this section, you will set up a cluster on Zilliz Cloud.
 
-### Create a dedicated cluster
+Begin by [registering](https://docs.zilliz.com/docs/register-with-zilliz-cloud) for a free account on Zilliz Cloud.
 
-You will need to [register](https://docs.zilliz.com/docs/register-with-zilliz-cloud) for a free account on Zilliz Cloud.
+After you register, [create a cluster](https://docs.zilliz.com/docs/create-cluster).
 
-After you register, [create a cluster](https://docs.zilliz.com/docs/create-cluster) on Zilliz Cloud. In this Learning Path, you will create a dedicated cluster deployed in AWS using Arm-based machines to store and retreive the vector data as shown:
+Now create a **Dedicated** cluster deployed in AWS using Arm-based machines to store and retrieve the vector data, as shown:
 
 ![cluster](create_cluster.png)
 
-When you select the `Create Cluster` Button, you should see the cluster running in your Default Project.
+When you select the **Create Cluster** button, you should see the cluster running in your **Default Project**.
 
 ![running](running_cluster.png)
 
 {{% notice Note %}}
-You can use self-hosted Milvus as an alternative to Zilliz Cloud. This option is more complicated to set up. We can also deploy [Milvus Standalone](https://milvus.io/docs/install_standalone-docker-compose.md) and [Kubernetes](https://milvus.io/docs/install_cluster-milvusoperator.md) on Arm-based machines. For more information about Milvus installation, please refer to the [installation documentation](https://milvus.io/docs/install-overview.md).
+You can use self-hosted Milvus as an alternative to Zilliz Cloud. This option is more complicated to set up. You can also deploy [Milvus Standalone](https://milvus.io/docs/install_standalone-docker-compose.md) and [Kubernetes](https://milvus.io/docs/install_cluster-milvusoperator.md) on Arm-based machines. For more information about installing Milvus, see the [Milvus installation documentation](https://milvus.io/docs/install-overview.md).
 {{% /notice %}}
 
-### Create the Collection
+## Create the Collection
 
-With the dedicated cluster running in Zilliz Cloud, you are now ready to create a collection in your cluster.
+With the Dedicated cluster running in Zilliz Cloud, you are now ready to create a collection in your cluster.
 
-Within your activated python `venv`, start by creating a file named `zilliz-llm-rag.py` and copy the contents below into it:
+Within your activated Python virtual environment `venv`, start by creating a file named `zilliz-llm-rag.py`, and copy the contents below into it:
 
 ```python
 from pymilvus import MilvusClient
@@ -38,7 +39,7 @@ milvus_client = MilvusClient(
 )
 
 ```
-Replace <your_zilliz_public_endpoint> and <your zilliz_api_key> with the `URI` and `Token` for your running cluster. Refer to [Public Endpoint and Api key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#free-cluster-details) in Zilliz Cloud for more details.
+Replace *<your_zilliz_public_endpoint>* and *<your_zilliz_api_key>* with the `URI` and `Token` for your running cluster. Refer to [Public Endpoint and API key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#free-cluster-details) in Zilliz Cloud for further information.
 
 Now, append the following code to `zilliz-llm-rag.py` and save the contents:
 
@@ -56,16 +57,16 @@ milvus_client.create_collection(
     consistency_level="Strong", # Strong consistency level
 )
 ```
-This code checks if a collection already exists and drops it if it does. You then, create a new collection with the specified parameters.
+This code checks if a collection already exists and drops it if it does. It then creates a new collection with the specified parameters.
 
-If you don't specify any field information, Milvus will automatically create a default `id` field for primary key, and a `vector` field to store the vector data. A reserved JSON field is used to store non-schema-defined fields and their values.
-You will use inner product distance as the default metric type. For more information about distance types, you can refer to [Similarity Metrics page](https://milvus.io/docs/metric.md?tab=floating)
+If you do not specify any field information, Milvus automatically creates a default `id` field for the primary key, and a `vector` field to store the vector data. A reserved JSON field is used to store non-schema-defined fields and their values.
+You can use inner product distance as the default metric type. For more information about distance types, see the [Similarity Metrics page](https://milvus.io/docs/metric.md?tab=floating).
 
 You can now prepare the data to use in this collection.
 
-### Prepare the data
+## Prepare the data
 
-In this example, you will use the FAQ pages from the [Milvus Documentation 2.4.x](https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip) as the private knowledge that is loaded in your RAG dataset/collection.
+In this example, you will use the FAQ pages from the [Milvus Documentation 2.4.x](https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip) as the private knowledge that is loaded into your RAG dataset.
 
 Download the zip file and extract documents to the folder `milvus_docs`.
 
@@ -74,7 +75,7 @@ wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/m
 unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs
 ```
 
-You will load all the markdown files from the folder `milvus_docs/en/faq` into your data collection. For each document, use "# " to separate the content in the file, which can roughly separate the content of each main part of the markdown file.
+Now load all the markdown files from the folder `milvus_docs/en/faq` into your data collection. For each document, use "# " to separate the content in the file. This roughly divides the content of each main section of the markdown file.
 
 Open `zilliz-llm-rag.py` and append the following code to it:
 
@@ -91,9 +92,9 @@ for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
 ```
 
 ### Insert data
-You will now prepare a simple but efficient embedding model [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) that can convert the loaded text into embedding vectors.
+Now prepare a simple but efficient embedding model, [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), that can convert the loaded text into embedding vectors.
 
-You will iterate through the text lines, create embeddings, and then insert the data into Milvus.
+You can iterate through the text lines, create embeddings, and then insert the data into Milvus.
 
 Append and save the code shown below into `zilliz-llm-rag.py`:
 
@@ -115,10 +116,10 @@ for i, (line, embedding) in enumerate(
 
 milvus_client.insert(collection_name=collection_name, data=data)
 ```
-Run the python script, to check that you have successfully created the embeddings on the data you loaded into the RAG collection:
+Run the Python script to check that you have successfully created the embeddings on the data you loaded into the RAG collection:
 
 ```bash
-python3 python3 zilliz-llm-rag.py
+python3 zilliz-llm-rag.py
 ```
 
 The output should look like:
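For reference, the fragments appended to `zilliz-llm-rag.py` in this file combine into a flow like the sketch below. It is a condensed illustration, not the exact Learning Path code: the collection name, the splitting logic, and the use of `SentenceTransformer` are assumptions that mirror the steps described above.

```python
from glob import glob

from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

# Placeholders: substitute your cluster's public endpoint and API key.
milvus_client = MilvusClient(uri="<your_zilliz_public_endpoint>", token="<your_zilliz_api_key>")
collection_name = "my_rag_collection"  # assumed name

# Split each FAQ page on "# " headings, as described above.
text_lines = []
for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
    with open(file_path) as f:
        text_lines += f.read().split("# ")

# all-MiniLM-L6-v2 produces 384-dimensional embeddings.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(text_lines)

# Insert one row per text chunk: an id, its vector, and the raw text.
data = [
    {"id": i, "vector": embedding.tolist(), "text": line}
    for i, (line, embedding) in enumerate(zip(text_lines, embeddings))
]
milvus_client.insert(collection_name=collection_name, data=data)
```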

content/learning-paths/servers-and-cloud-computing/milvus-rag/online_rag.md

Lines changed: 10 additions & 13 deletions
@@ -5,14 +5,11 @@ weight: 5
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
+## Prepare the Embedding Model
 
-In this section, you will build the online RAG part of your application.
+In your Python script, generate a test embedding and print its dimension and the first few elements.
 
-### Prepare the embedding model
-
-In your python script, generate a test embedding and print its dimension and first few elements.
-
-For the LLM, you will use the OpenAI SDK to request the Llama service launched before. You don't need to use any API key because it is running locally on your machine.
+For the LLM, you will use the OpenAI SDK to request the Llama service that you launched previously. You do not need to use an API key because it is running locally on your machine.
 
 Append the code below to `zilliz-llm-rag.py`:
 
@@ -31,7 +28,7 @@ Run the script. The output should look like:
 
 ### Retrieve data for a query
 
-You will specify a frequent question about Milvus and then search for the question in the collection and retrieve the semantic top-3 matches.
+Now specify a common question about Milvus, and search for it in the collection to retrieve the top 3 semantic matches.
 
 Append the code shown below to `zilliz-llm-rag.py`:
 
@@ -55,7 +52,7 @@ retrieved_lines_with_distances = [
 ]
 print(json.dumps(retrieved_lines_with_distances, indent=4))
 ```
-Run the script again and the output with the top 3 matches will look like:
+Run the script again, and the output with the top 3 matches should look like:
 
 ```output
 [
@@ -68,18 +65,18 @@ Run the script again and the output with the top 3 matches will look like:
         0.5974207520484924
     ],
     [
-        "What is the maximum dataset size Milvus can handle?\n\n \nTheoretically, the maximum dataset size Milvus can handle is determined by the hardware it is run on, specifically system memory and storage:\n\n- Milvus loads all specified collections and partitions into memory before running queries. Therefore, memory size determines the maximum amount of data Milvus can query.\n- When new entities and and collection-related schema (currently only MinIO is supported for data persistence) are added to Milvus, system storage determines the maximum allowable size of inserted data.\n\n###",
+        "What is the maximum dataset size Milvus can handle?\n\n \nTheoretically, the maximum dataset size Milvus can handle is determined by the hardware it is run on, specifically system memory and storage:\n\n- Milvus loads all specified collections and partitions into memory before running queries. Therefore, memory size determines the maximum amount of data Milvus can query.\n- When new entities and collection-related schema (currently only MinIO is supported for data persistence) are added to Milvus, system storage determines the maximum allowable size of inserted data.\n\n###",
         0.5833579301834106
     ]
 ]
 ```
-### Use LLM to get a RAG response
+### Use the LLM to obtain a RAG response
 
 You are now ready to use the LLM and obtain a RAG response.
 
-For the LLM, you will use the OpenAI SDK to request the Llama service you launched in the previous section. You don't need to use any API key because it is running locally on your machine.
+For the LLM, you will use the OpenAI SDK to request the Llama service you launched in the previous section. You do not need to use an API key because it is running locally on your machine.
 
-You will then convert the retrieved documents into a string format. Define system and user prompts for the Language Model. This prompt is assembled with the retrieved documents from Milvus. Finally use the LLM to generate a response based on the prompts.
+You will then convert the retrieved documents into a string format. Define system and user prompts for the language model; the user prompt is assembled from the documents retrieved from Milvus. Finally, use the LLM to generate a response based on the prompts.
 
 Append the code below into `zilliz-llm-rag.py`:
 
@@ -117,7 +114,7 @@ print(response.choices[0].message.content)
 ```
 
 {{% notice Note %}}
-Make sure your llama.cpp server from the previous section is running before you proceed
+Make sure your llama.cpp server from the previous section is running before you proceed.
 {{% /notice %}}
 
 Run the script one final time with these changes using `python3 zilliz-llm-rag.py`. The output should look like:
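Taken together, the retrieval and generation steps in this file reduce to a flow like the sketch below. It is an illustration under the same assumptions as before (collection name, embedding model, placeholder credentials, and local server address), not the exact Learning Path code.

```python
import json

from openai import OpenAI
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

milvus_client = MilvusClient(uri="<your_zilliz_public_endpoint>", token="<your_zilliz_api_key>")
collection_name = "my_rag_collection"  # assumed name
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Retrieve the top 3 semantic matches for a common question about Milvus.
question = "How is data stored in milvus?"
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[embedding_model.encode(question).tolist()],
    limit=3,
    search_params={"metric_type": "IP", "params": {}},  # inner product distance
    output_fields=["text"],
)
retrieved = [(hit["entity"]["text"], hit["distance"]) for hit in search_res[0]]
print(json.dumps(retrieved, indent=4))

# Assemble the retrieved documents into the user prompt and ask the local LLM.
context = "\n".join(text for text, _ in retrieved)
llm_client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = llm_client.chat.completions.create(
    model="dolphin-2.9.4-llama3.1-8b",
    messages=[
        {"role": "system", "content": "Answer the question using only the provided context."},
        {"role": "user", "content": f"<context>\n{context}\n</context>\n<question>\n{question}\n</question>"},
    ],
)
print(response.choices[0].message.content)
```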
