Commit bae5968

Merge pull request #1303 from zc277584121/main
milvus RAG
2 parents be75d9c + 7080d48 commit bae5968

7 files changed: +453 -0 lines changed
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
---
title: Use Milvus/Zilliz to build RAG on Arm Architecture

minutes_to_complete: 20

who_is_this_for: This is an introductory topic for engineers who want to create a RAG application on Arm machines.

learning_objectives:
    - Create a simple RAG application using Milvus/Zilliz
    - Launch an LLM service on Arm machines

prerequisites:
    - Basic understanding of the RAG pipeline.
    - An [AWS account](/learning-paths/servers-and-cloud-computing/csp/aws/) to access instance types with different AWS Graviton processors.
    - A [Zilliz account](https://zilliz.com/cloud); you can sign up for a free trial.

author_primary: Chen Zhang

### Tags
skilllevels: Introductory
subjects: ML
armips:
    - Cortex-A
tools_software_languages:
    - Python
operatingsystems:
    - Linux


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
---
next_step_guidance: Thank you for completing the Milvus RAG tutorial.

recommended_path: /learning-paths/servers-and-cloud-computing/llama-cpu/

further_reading:
    - resource:
        title: Zilliz Documentation
        link: https://zilliz.com/cloud
        type: documentation
    - resource:
        title: Milvus Documentation
        link: https://milvus.io/
        type: documentation
    - resource:
        title: llama.cpp repository
        link: https://github.com/ggerganov/llama.cpp
        type: website


# ================================================================================
# FIXED, DO NOT MODIFY
# ================================================================================
weight: 21 # set to always be larger than the content in this path, and one more than 'review'
title: "Next Steps" # Always the same
layout: "learningpathall" # All files under learning paths have this same wrapper
---
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
---
review:
    - questions:
        question: >
            Can Milvus run on Arm systems?
        answers:
            - "Yes"
            - "No"
        correct_answer: 1
        explanation: >
            Milvus can run on Arm-based systems. It supports deployment on Arm-based machines, whether through Zilliz Cloud, Docker, or Kubernetes.

    - questions:
        question: >
            Can the Llama 3.1 model run on Arm systems?
        answers:
            - "Yes"
            - "No"
        correct_answer: 1
        explanation: >
            The Llama-3.1-8B model from Meta can run on an AWS Arm-based server CPU with the llama.cpp tool.



# ================================================================================
# FIXED, DO NOT MODIFY
# ================================================================================
title: "Review" # Always the same title
weight: 20 # Set to always be larger than the content in this path
layout: "learningpathall" # All files under learning paths have this same wrapper
---
Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
---
title: Launch LLM Service on Arm
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

In this section, you will build and launch the `llama.cpp` service on an Arm-based CPU.

### Llama 3.1 model & llama.cpp

The [Llama-3.1-8B model](https://huggingface.co/cognitivecomputations/dolphin-2.9.4-llama3.1-8b-gguf) from Meta belongs to the Llama 3.1 model family and is free to use for research and commercial purposes. Before you use the model, visit the Llama [website](https://llama.meta.com/llama-downloads/) and fill in the form to request access.

[llama.cpp](https://github.com/ggerganov/llama.cpp) is an open-source C/C++ project that enables efficient LLM inference on a variety of hardware - both locally and in the cloud. You can conveniently host a Llama 3.1 model using `llama.cpp`.


### Download and build llama.cpp

Run the following commands to install make, cmake, gcc, g++, and other essential tools required for building llama.cpp from source:

```bash
sudo apt install make cmake -y
sudo apt install gcc g++ -y
sudo apt install build-essential -y
```

You are now ready to start building `llama.cpp`.

Clone the source repository for llama.cpp:

```bash
git clone https://github.com/ggerganov/llama.cpp
```

By default, `llama.cpp` builds for CPU only on Linux and Windows. You don't need to provide any extra switches to build it for the Arm CPU that you run it on.

Run `make` to build it:

```bash
cd llama.cpp
make GGML_NO_LLAMAFILE=1 -j$(nproc)
```

Check that `llama.cpp` has built correctly by running the help command:

```bash
./llama-cli -h
```

If `llama.cpp` has been built correctly, you will see the help option displayed. The output snippet looks like this:

```output
example usage:

text generation: ./llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128

chat (conversation): ./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv
```

You can now download the model using the Hugging Face CLI:

```bash
huggingface-cli download cognitivecomputations/dolphin-2.9.4-llama3.1-8b-gguf dolphin-2.9.4-llama3.1-8b-Q4_0.gguf --local-dir . --local-dir-use-symlinks False
```

The GGUF model format, introduced by the llama.cpp team, uses compression and quantization to reduce weight precision to 4-bit integers, significantly decreasing computational and memory demands and making Arm CPUs effective for LLM inference.


### Re-quantize the model weights

To re-quantize the model, run:

```bash
./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf Q4_0_8_8
```

This outputs a new file, `dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf`, which contains reconfigured weights that allow `llama-cli` to use SVE 256 and MATMUL_INT8 support.

> This requantization is optimal specifically for Graviton3. For Graviton2, the optimal format is `Q4_0_4_4`, and for Graviton4, `Q4_0_4_8` is the most suitable.

### Start the LLM Service

You can use the llama.cpp server program and send requests via an OpenAI-compatible API. This allows you to develop applications that interact with the LLM multiple times without having to repeatedly start and stop it. Additionally, you can access the server over the network from a machine other than the one hosting the LLM.

Start the server from the command line. It listens on port 8080:

```shell
./llama-server -m dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf -n 2048 -t 64 -c 65536 --port 8080
```

The output confirms that the server is running:

```text
main: server is listening on 127.0.0.1:8080 - starting the main loop
```
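
To quickly confirm that the OpenAI-compatible endpoint is responding, you can send a test request with the OpenAI Python SDK from another terminal. This is a minimal sketch, assuming the `openai` Python package is installed and the server is running locally on port 8080; the API key is a placeholder because the llama.cpp server does not validate it.

```python
from openai import OpenAI

# Point the SDK at the local llama.cpp server (placeholder API key, not validated)
client = OpenAI(base_url="http://localhost:8080/v1", api_key="no-key-required")

response = client.chat.completions.create(
    model="dolphin-2.9.4-llama3.1-8b",  # llama.cpp serves the loaded model regardless of this name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(response.choices[0].message.content)
```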

You can also adjust the parameters of the launched LLM to match your server hardware and obtain the best performance. For more information about the parameters, run `llama-server --help`.

If you struggle to perform this step, you can refer to [this document](https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama-cpu/llama-chatbot/) for more information.

You have started the LLM service on your Arm-based CPU. In the next section, you will interact directly with the service using the OpenAI SDK.
Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
---
title: Offline Data Loading
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

In this section, we show you how to load private knowledge into the RAG pipeline.

### Create the Collection

We use [Zilliz Cloud](https://zilliz.com/cloud), deployed on AWS with Arm-based machines, to store and retrieve the vector data. To get started quickly, simply [register an account](https://docs.zilliz.com/docs/register-with-zilliz-cloud) on Zilliz Cloud for free.

> In addition to Zilliz Cloud, self-hosted Milvus is also a (more complicated to set up) option. You can deploy [Milvus Standalone](https://milvus.io/docs/install_standalone-docker-compose.md) or [Milvus on Kubernetes](https://milvus.io/docs/install_cluster-milvusoperator.md) on Arm-based machines. For more information about Milvus installation, refer to the [installation documentation](https://milvus.io/docs/install-overview.md).

Set the `uri` and `token` to the [Public Endpoint and API key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#free-cluster-details) of your cluster in Zilliz Cloud.

```python
from pymilvus import MilvusClient

milvus_client = MilvusClient(
    uri="<your_zilliz_public_endpoint>", token="<your_zilliz_api_key>"
)

collection_name = "my_rag_collection"
```

Check if the collection already exists, and drop it if it does:

```python
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)
```

Create a new collection with the specified parameters.

If you don't specify any field information, Milvus automatically creates a default `id` field for the primary key and a `vector` field to store the vector data. A reserved JSON field is used to store non-schema-defined fields and their values.

```python
embedding_dim = 384  # output dimension of the all-MiniLM-L6-v2 embedding model used later in this section

milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)
```

We use inner product distance as the default metric type. For more information about distance types, refer to the [Similarity Metrics page](https://milvus.io/docs/metric.md?tab=floating).
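
To confirm the collection was created with the expected settings, you can optionally describe it. This is a small illustrative check using the pymilvus `MilvusClient` API:

```python
# Optional: inspect the newly created collection's schema and settings
print(milvus_client.describe_collection(collection_name))
```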

### Prepare the data

We use the FAQ pages from the [Milvus Documentation 2.4.x](https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip) as the private knowledge in our RAG, which is a good data source for a simple RAG pipeline.

Download the zip file and extract the documents to the folder `milvus_docs`:

```shell
wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs
```

We load all markdown files from the folder `milvus_docs/en/faq`. For each document, we simply split the content on "# ", which roughly separates the main sections of each markdown file.

```python
from glob import glob

text_lines = []

# Read every FAQ markdown file and split it into sections on "# " headings
for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
    with open(file_path, "r") as file:
        file_text = file.read()

    text_lines += file_text.split("# ")
```
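
As an optional sanity check, you can print how many chunks were produced and preview one of them. This small sketch is not part of the original pipeline:

```python
# Quick look at the split results
print(f"Loaded {len(text_lines)} text chunks")
print(text_lines[1][:200])  # preview the start of one chunk
```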

### Insert data

We prepare a simple but efficient embedding model, [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), that can convert text into embedding vectors.

```python
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
```
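
As a quick check, you can embed a test string and confirm that the vector length matches the `embedding_dim` used when the collection was created; all-MiniLM-L6-v2 produces 384-dimensional vectors. This is an optional sketch:

```python
# Verify the embedding dimension matches the collection's vector dimension
test_embedding = embedding_model.embed_query("This is a test")
print(len(test_embedding))  # expected: 384
```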

Iterate through the text lines, create embeddings, and then insert the data into Milvus.

The `text` field is a new field that is not defined in the collection schema. It is automatically added to the reserved JSON dynamic field, which can be treated as a normal field at a high level.

```python
from tqdm import tqdm

data = []

# Embed all text chunks in one batch
text_embeddings = embedding_model.embed_documents(text_lines)

for i, (line, embedding) in enumerate(
    tqdm(zip(text_lines, text_embeddings), desc="Creating embeddings")
):
    data.append({"id": i, "vector": embedding, "text": line})

milvus_client.insert(collection_name=collection_name, data=data)
```

```text
Creating embeddings: 100%|██████████| 72/72 [00:18<00:00, 3.91it/s]
```
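
To verify that the inserted data is searchable, you can run a quick vector search against the collection. This is an optional sketch, not part of the original pipeline; the query string is just an example.

```python
# Optional check: retrieve the top-3 most similar chunks for a sample question
query = "How is data stored in Milvus?"
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[embedding_model.embed_query(query)],
    limit=3,
    search_params={"metric_type": "IP", "params": {}},
    output_fields=["text"],
)
for hit in search_res[0]:
    print(round(hit["distance"], 3), hit["entity"]["text"][:80])
```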
