
Commit 93bd421

Add transform benchmark (#254)
* Add transform benchmark

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 9ae9839 commit 93bd421

File tree

8 files changed: +390 −4 lines changed


.github/workflows/continuous-benchmark-hnsw.yaml

Lines changed: 21 additions & 1 deletion
```diff
@@ -26,8 +26,28 @@ jobs:
       with:
         hcloud_token: ${{ secrets.HCLOUD_TOKEN }}
         db_host: ${{ secrets.POSTGRES_HOST }}
-        server_name: "benchmark-server-3"
+        server_name: "benchmark-server-1"
     - name: Run bench
       id: hnsw-indexing-update
       run: |
         cd ansible/playbooks && ansible-playbook playbook-hnsw-index.yml --extra-vars "bench=update"
+
+  runTransformHealingBenchmark:
+    runs-on: ubuntu-latest
+    container: alpine/ansible:2.18.1
+    needs: runUpdateHealingBenchmark
+    steps:
+      - uses: actions/checkout@v3
+      - uses: webfactory/[email protected]
+        with:
+          ssh-private-key: ${{ secrets.SSH_PRIVATE_KEY }}
+      - name: Create inventory
+        uses: ./.github/workflows/actions/create-inventory
+        with:
+          hcloud_token: ${{ secrets.HCLOUD_TOKEN }}
+          db_host: ${{ secrets.POSTGRES_HOST }}
+          server_name: "benchmark-server-1"
+      - name: Run bench
+        id: hnsw-indexing-transform
+        run: |
+          cd ansible/playbooks && ansible-playbook playbook-hnsw-index.yml --extra-vars "bench=transform"
```

ansible/README.md

Lines changed: 8 additions & 1 deletion
````diff
@@ -45,4 +45,11 @@ benchmark-db ansible_host=${YOUR_DB_SERVER_IP} ansible_user=${YOUR_DB_SERVER_USE
 Then from [ansible/playbooks](playbooks) run:
 ```bash
 ansible-playbook playbook-hnsw-index.yml --extra-vars "bench=update"
-```
+```
+
+## How to add a new benchmark
+
+* Create a new playbook in the [ansible/playbooks](playbooks) directory (e.g. `playbook-hnsw-index.yml`). The playbook defines which role to run on which machine (e.g. run `run-hnsw-indexing-update` on machines of the `remote_machines` group).
+* Add a new folder in [ansible/playbooks/roles](playbooks/roles) (e.g. `run-hnsw-indexing-update`) with 2 sub-folders: `tasks` (required) and `files` (optional). Add `main.yml` in the `tasks` folder. The role defines the tasks (`main.yml`) required to run the benchmark, for example copying scripts, setting up the benchmark server, and running the benchmark.
+* Optionally, in the [ansible/playbooks/group_vars](playbooks/group_vars) directory, add a new yml file to define variables specific to the role (e.g. `hnsw-indexing-update.yml`). Shared variables can also be defined here (e.g. in `common_vars.yml`).
+* Optionally, in the [ansible/playbooks/files](playbooks/files) directory, add files that are common across several roles and/or playbooks.
````
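The README steps above can be sketched as a minimal playbook plus role layout. Everything below is illustrative: the playbook name, role name, and variable file are assumptions patterned on the existing `run-hnsw-indexing-update` role, not actual files in the repository.

```yaml
# playbooks/playbook-my-new-bench.yml (hypothetical name)
# Runs the hypothetical role `run-my-new-bench` on the `remote_machines` group.
- hosts: remote_machines
  roles:
    - run-my-new-bench

# Matching layout under playbooks/:
#   roles/run-my-new-bench/tasks/main.yml   <- required: the benchmark tasks
#   roles/run-my-new-bench/files/           <- optional: scripts copied to the server
#   group_vars/my-new-bench.yml             <- optional: role-specific variables
```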
Lines changed: 19 additions & 0 deletions
```diff
@@ -0,0 +1,19 @@
+qdrant_python_client_version: "1.14.0"
+logging_dir: "/tmp/logs"
+working_dir: "/tmp/experiments"
+dataset_url: "https://storage.googleapis.com/ann-filtered-benchmark/datasets/laion-small-clip-no-filters-1.tgz"
+dataset_name: "laion-small-clip-no-filters-1"
+dataset_dim: "512"
+dataset_2_url: "https://storage.googleapis.com/ann-filtered-benchmark/datasets/laion-small-clip-no-filters-2.tgz"
+dataset_2_name: "laion-small-clip-no-filters-2"
+servers:
+  - name: "qdrant"
+    registry: "ghcr.io"
+    image: "qdrant/qdrant"
+    version: "dev"
+    feature_flags: "true"
+  - name: "qdrant"
+    registry: "docker.io"
+    image: "qdrant/qdrant"
+    version: "master"
+    feature_flags: "false"
```
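The `servers` list above drives an A/B run: the same benchmark executes once against the `dev` image with feature flags enabled and once against `master` with them disabled. A minimal sketch of how such a matrix expands into image references (this is illustrative Python, not the playbook's actual templating):

```python
# Illustrative mirror of the `servers` group variable above.
servers = [
    {"registry": "ghcr.io", "image": "qdrant/qdrant", "version": "dev", "feature_flags": "true"},
    {"registry": "docker.io", "image": "qdrant/qdrant", "version": "master", "feature_flags": "false"},
]

# Expand each entry into the Docker image reference the role would pull.
refs = [f'{s["registry"]}/{s["image"]}:{s["version"]}' for s in servers]
print(refs)  # ['ghcr.io/qdrant/qdrant:dev', 'docker.io/qdrant/qdrant:master']
```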

ansible/playbooks/roles/run-hnsw-indexing-common/tasks/main.yml

Lines changed: 2 additions & 2 deletions
```diff
@@ -10,7 +10,7 @@
     - { src: "{{ bench }}.py", dest: "{{ working_dir }}/{{ bench }}.py" }
     - { src: "../../files/hnsw-indexing/requirements.txt", dest: "{{ working_dir }}/requirements.txt" }
     - { src: "../../files/hnsw-indexing/docker-compose.yaml", dest: "{{ working_dir }}/docker-compose.yaml" }
-    - { src: get_score.py, dest: "{{ working_dir }}/get_score.py" }
+    - { src: "../../files/hnsw-indexing/get_score.py", dest: "{{ working_dir }}/get_score.py" }
 
 - name: Start Docker container on the remote machine
   ansible.builtin.shell: |
@@ -31,7 +31,7 @@
 - name: "Execute the script on the remote machine: {{ server_name }}-{{ server_version }}"
   ansible.builtin.shell: |
     {{ working_dir }}/run-bench.sh > "{{ working_dir }}/log-{{ server_name }}-{{ server_version }}-{{ bench }}.log" 2>&1
-  async: 3600 # 60 minutes
+  async: 7200 # 120 minutes
   poll: 30 # Check every 30 seconds
   environment:
     OUTPUT_FILENAME: "{{ working_dir }}/output-{{ server_name }}-{{ server_version }}-{{ bench }}.json"
```
Lines changed: 62 additions & 0 deletions
```diff
@@ -0,0 +1,62 @@
+#!/bin/bash
+# This script is used to set up a virtual environment, install dependencies, and run a Python script.
+
+set -euo pipefail
+
+if [ -z "${DATASET_DIM:-}" ]; then
+  echo "Error: DATASET_DIM is not set"
+  exit 1
+fi
+
+if [ -z "${BENCH:-}" ]; then
+  echo "Error: BENCH is not set"
+  exit 2
+fi
+
+if [ -z "${DATASET_NAME:-}" ]; then
+  echo "Error: DATASET_NAME is not set"
+  exit 3
+fi
+
+if [ -z "${DATASET_NAME_2:-}" ]; then
+  echo "Error: DATASET_NAME_2 is not set"
+  exit 4
+fi
+
+if [ -z "${OUTPUT_FILENAME:-}" ]; then
+  echo "Error: OUTPUT_FILENAME is not set"
+  exit 5
+fi
+
+if [ -z "${WORK_DIR:-}" ]; then
+  WORK_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+  echo "Warn: WORK_DIR is not set, defaults to script's ${WORK_DIR}"
+fi
+
+cd "${WORK_DIR}"
+
+# Check if venv exists
+if [ ! -d "${WORK_DIR}/venv" ]; then
+  echo "Creating virtual environment..."
+  python3 -m venv "${WORK_DIR}/venv"
+  source "${WORK_DIR}/venv/bin/activate"
+
+  echo "Installing requirements..."
+  pip install -r "${WORK_DIR}/requirements.txt"
+
+  deactivate
+else
+  echo "Virtual environment already exists. Skipping setup."
+fi
+
+echo "Activating virtual environment..."
+source "${WORK_DIR}/venv/bin/activate"
+
+NOW=$(date "+%Y-%m-%dT%H:%M:%SZ")
+echo "${NOW}"
+echo "Running..."
+python -u "${WORK_DIR}/${BENCH}.py"
+echo "Python script completed with exit code: $?"
+deactivate
+
+exit 0
```
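The runner script above fails fast when any required variable is missing, then picks the benchmark module from `${BENCH}.py`. A minimal standalone sketch of that guard pattern (the variable names match the script; the default values here are only illustrative, not what the playbook actually exports):

```shell
#!/bin/sh
# Sketch of run-bench.sh's fail-fast guard pattern for required variables.
set -eu

BENCH="${BENCH:-transform}"        # illustrative default; the playbook sets this
DATASET_DIM="${DATASET_DIM:-512}"  # illustrative default

# Same check style the real script applies to each required variable.
if [ -z "${BENCH}" ]; then
  echo "Error: BENCH is not set"
  exit 2
fi

echo "Would run: python -u ./${BENCH}.py (dim=${DATASET_DIM})"
```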
Lines changed: 210 additions & 0 deletions
```diff
@@ -0,0 +1,210 @@
+"""
+Test Qdrant's accuracy in scenarios of continuous updates of real data.
+
+
+This script will:
+
+- Create a Qdrant collection, and make initial upload of all available vectors from `data/dataset1`
+- Measure the accuracy of the search
+- Start replacing vectors of collection by removing points and replacing them with new ones from `data/dataset2`
+- Once finished, measure the accuracy of the search
+
+"""
+
+import json
+import os
+import random
+import sys
+import time
+from datetime import datetime
+from pathlib import Path
+
+import numpy as np
+import tqdm
+from qdrant_client import QdrantClient, models
+
+QDRANT_COLLECTION_NAME = "benchmark"
+OUTPUT_FILENAME = os.getenv("OUTPUT_FILENAME", "output.json")
+DATASET_DIM = int(os.getenv("DATASET_DIM", 512))
+DATASET_NAME = os.getenv("DATASET_NAME", "laion-small-clip-no-filters-1")
+DATASET_NAME_2 = os.getenv("DATASET_NAME_2", "laion-small-clip-no-filters-2")
+DATA_DIR = Path(__file__).parent / "data" / DATASET_NAME
+DATA_DIR_2 = Path(__file__).parent / "data" / DATASET_NAME_2
+
+VECTORS_FILE_2 = DATA_DIR_2 / "vectors.npy"
+VECTORS_FILE_1 = DATA_DIR / "vectors.npy"
+
+TEST_DATA_FILE_2 = DATA_DIR_2 / "tests.jsonl"
+TEST_DATA_FILE_1 = DATA_DIR / "tests.jsonl"
+
+TOTAL_VECTORS = 100_000
+BATCH_SIZE = 500
+
+
+def read_test_data(file: Path, limit: int = 1000):
+    """
+    {
+        "query": [
+            0.022043373435735703,
+            -0.022230295464396477,
+            ....
+        ],
+        "closest_ids": [
+            43749,
+            43756,
+            ....
+        ]
+    }
+    """
+    with open(file, "r") as f:
+        for idx, line in enumerate(f):
+            if idx >= limit:
+                break
+
+            yield json.loads(line)
+
+
+class QdrantBenchmark:
+
+    def __init__(self, url):
+
+        client = QdrantClient(url=url, prefer_grpc=True)
+        self.client = client
+
+        self.client.delete_collection(QDRANT_COLLECTION_NAME)
+
+        self.collection = self.client.create_collection(
+            QDRANT_COLLECTION_NAME,
+            vectors_config=models.VectorParams(
+                size=DATASET_DIM,
+                distance=models.Distance.COSINE,
+            ),
+            optimizers_config=models.OptimizersConfigDiff(
+                deleted_threshold=0.001,
+                vacuum_min_vector_number=100,
+            ),
+        )
+
+    def initial_upload(self, vectors: np.ndarray):
+        self.client.upload_collection(
+            collection_name=QDRANT_COLLECTION_NAME,
+            vectors=vectors,
+            ids=range(len(vectors)),
+        )
+
+    def upload_points(self, vectors: np.ndarray, ids: list[int]):
+        points = [
+            models.PointStruct(id=idx, vector=vectors[idx].tolist()) for idx in ids
+        ]
+
+        self.client.upsert(
+            collection_name=QDRANT_COLLECTION_NAME,
+            points=points,
+        )
+
+    def validate_test_data(self, file: Path) -> float:
+        total_results = 0
+        matched_results = 0
+        for test in tqdm.tqdm(read_test_data(file), desc="Validating test data"):
+            query = test["query"]
+            closest_ids = set(test["closest_ids"])
+
+            results = self.client.query_points(
+                collection_name=QDRANT_COLLECTION_NAME,
+                query=query,
+                limit=len(closest_ids),
+            )
+
+            results_idx = set(obj.id for obj in results.points)
+
+            matched_results += len(closest_ids & results_idx)
+            total_results += len(closest_ids)
+
+        return matched_results / total_results
+
+    def delete_points(self, points_to_delete: set):
+        self.client.delete(
+            collection_name=QDRANT_COLLECTION_NAME,
+            points_selector=models.PointIdsList(
+                points=[idx for idx in points_to_delete]
+            ),
+        )
+
+    def wait_ready(self) -> float:
+        wait_interval = 0.2
+        confirmations_required = 2
+
+        start_time = time.time()
+        confirmations = 0
+        first_green_time: float | None = None
+
+        while True:
+            collection_info = self.client.get_collection(QDRANT_COLLECTION_NAME)
+            if collection_info.status == models.CollectionStatus.GREEN:
+                confirmations += 1
+                first_green_time = first_green_time or time.time()
+                if confirmations == confirmations_required:
+                    return first_green_time - start_time
+            else:
+                confirmations = 0
+                first_green_time = None
+            time.sleep(wait_interval)
+
+    def __del__(self):
+        self.client.close()
+
+
+def store_to_file(data_dict):
+    timestamped_dict = data_dict.copy()
+    timestamped_dict["timestamp"] = datetime.now().isoformat()
+
+    with open(OUTPUT_FILENAME, "w", encoding="utf-8") as f:
+        json.dump(timestamped_dict, f, ensure_ascii=False)
+
+
+def main():
+    result = {}
+    vectors_1 = np.load(VECTORS_FILE_1)
+    vectors_2 = np.load(VECTORS_FILE_2)
+
+    benchmark = QdrantBenchmark("http://localhost:6333")
+    benchmark.initial_upload(vectors_1)
+    benchmark.wait_ready()
+
+    initial_precision = benchmark.validate_test_data(TEST_DATA_FILE_1)
+    print("Precision dataset1: ", initial_precision)
+    result["initial_precision"] = initial_precision
+    result["precision_before_iteration"] = initial_precision
+
+    points_to_migrate = list(range(TOTAL_VECTORS))
+
+    random.shuffle(points_to_migrate)
+
+    total_indexing_time = 0
+    for i in tqdm.tqdm(range(0, len(points_to_migrate), BATCH_SIZE), desc="Iterating"):
+        batch = points_to_migrate[i : i + BATCH_SIZE]
+
+        benchmark.delete_points(set(batch))
+
+        benchmark.upload_points(vectors_2, batch)
+
+        total_indexing_time += benchmark.wait_ready()
+
+    print(f"Indexing: {total_indexing_time}")
+    result["indexing_total_time_s"] = total_indexing_time
+
+    precision_after_iteration = benchmark.validate_test_data(TEST_DATA_FILE_2)
+    print(f"Precision dataset2: {precision_after_iteration}")
+    result["precision_after_iteration"] = precision_after_iteration
+
+    store_to_file(result)
+
+
+if __name__ == "__main__":
+    sys.stdout.reconfigure(line_buffering=True)
+    sys.stderr.reconfigure(line_buffering=True)
+
+    main()
+
+    sys.stdout.flush()
+    sys.stderr.flush()
```
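In the benchmark above, `validate_test_data` scores search accuracy as the overlap between the ground-truth `closest_ids` and the ids actually returned, aggregated over all test queries. The per-query metric reduces to a set intersection; a standalone sketch with toy ids (not benchmark data):

```python
def precision(closest_ids, returned_ids):
    """Fraction of ground-truth nearest neighbors recovered by the search."""
    truth = set(closest_ids)
    return len(truth & set(returned_ids)) / len(truth)

# Toy example: 3 of the 4 true neighbors were returned.
print(precision([1, 2, 3, 4], [2, 3, 4, 9]))  # 0.75
```

The benchmark reports this ratio once before the delete/upsert churn (`precision_before_iteration`) and once after (`precision_after_iteration`), so a drop between the two numbers indicates accuracy lost to index healing.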
