Commit 23c8197
Divide retrieval module to 'hybrid', 'semantic' and 'lexical' (#1135)

* add run_util.py so code can be reproduced easily
* add a new semantic_retrieval node and complete its feature (semantic retrieval now returns columns with a `_semantic` suffix, e.g. "retrieved_contents_semantic")
* add lexical retrieval
* refactor hybrid cc and hybrid rrf into the hybridretrieval node
* test non-overlap
* working hybrid cc
* working hybrid rrf
* cast retrieved_contents and related info
* finally test_evaluator simple.yaml passed!
* change test yaml files to the modified version
* change sample config yaml files to the modified version
* update documentation to the latest version
* delete 'retrieval' from support.py
* fix query expansion with new retrieval
* fix several errors
* recreate retrieval/base.py and run_util.py
* change path for evaluate_retrieval_node
* replace resources/result_project with the latest version
* re-add test_retrieval_base.py to the test files
* do not track tests/resources/result_project/resources/chroma
* re-add test_hybrid_base.py
* fix test code
* remove 'frequent error' test code in g_eval
* fix test code and errors
* re-add pseudo_project_dir
* fix hybrid retrieval errors
* fix test_restart_evaluate_leads_start_evaluate code
* re-fix test_restart_evaluate_leads_start_evaluate
* track LFS file "resources/chroma.sqlite3"
* skip test_restart_evaluate_leads_start_evaluate in GitHub Actions
* resolve import issue on GitHub
* edit README
* delete resources/chroma again
1 parent b97f680, commit 23c8197

File tree

204 files changed: +4274 / -3310 lines

README.md

Lines changed: 21 additions & 28 deletions

````diff
@@ -4,11 +4,10 @@ RAG AutoML tool for automatically finding an optimal RAG pipeline for your data.
 
 ![Thumbnail](https://github.com/user-attachments/assets/6bab243d-a4b3-431a-8ac0-fe17336ab4de)
 
-![Discord](https://img.shields.io/discord/1204010535272587264) ![PyPI - Downloads](https://img.shields.io/pypi/dm/AutoRAG)
+![PyPI - Downloads](https://img.shields.io/pypi/dm/AutoRAG)
 [![LinkedIn](https://img.shields.io/badge/LinkedIn-Connect-blue?style=flat-square&logo=linkedin)](https://www.linkedin.com/company/104375108/admin/dashboard/)
 ![X (formerly Twitter) Follow](https://img.shields.io/twitter/follow/AutoRAG_HQ)
 [![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Follow-orange?style=flat-square&logo=huggingface)](https://huggingface.co/AutoRAG)
-[![Static Badge](https://img.shields.io/badge/Roadmap-5D3FD3)](https://github.com/orgs/Auto-RAG/projects/1/views/2)
 
 <img src=https://github.com/user-attachments/assets/9a4d0381-a161-457f-a787-e7eb3593ce00 width="251.5" height="55.2"/>
 
@@ -26,26 +25,10 @@ Try now and find the best RAG pipeline for your own use-case.
 
 Explore our 📖 [Document](https://marker-inc-korea.github.io/AutoRAG/)!!
 
----
-
-## AutoRAG GUI (beta)
-
-AutoRAG GUI is a web-based GUI for AutoRAG.
-If AutoRAG is a little bit complicated to you, try AutoRAG GUI.
-
-Your Optimized RAG pipeline is just a few clicks away.
-
-| Project Management | Easy Configuration | Parsed Page View |
-|:---:|:---:|:---:|
-| ![Image](https://github.com/user-attachments/assets/87289d84-ff65-4810-bc41-3f30b36b7ddf) | ![Image](https://github.com/user-attachments/assets/dbe0a49b-ebf2-4c9c-b17d-1be1c2cd1060) | ![Image](https://github.com/user-attachments/assets/d8a50512-3299-4b68-b48e-e2f49d688f01) |
-
-Click the docs to use the AutoRAG GUI beta version! [AutoRAG GUI Docs](https://marker-inc-korea.github.io/AutoRAG/gui/gui.html).
-
-### GUI Installation
-
-1. Clone the repository
-2. Run Docker Compose `docker compose up -d`
-3. Access the GUI at `http://localhost:3000`
+```
+Notice: We are no longer support "AutoRAG GUI"
+And we will focus to maintain only AutoRAG core library in the future. Thank you.
+```
 
 ---
 
@@ -293,23 +276,33 @@ We highly recommend using pre-made config YAML files for starter.
 - [Sample YAML Guide](https://marker-inc-korea.github.io/AutoRAG/optimization/sample_config.html)
 - [Make Custom YAML Guide](https://marker-inc-korea.github.io/AutoRAG/optimization/custom_config.html)
 
-Here is an example of the config YAML file to use `retrieval`, `prompt_maker`, and `generator` nodes.
+Here is an example of the config YAML file to use three retrieval nodes, `prompt_maker`, and `generator` nodes.
 
 ```yaml
 node_lines:
-- node_line_name: retrieve_node_line # Set Node Line (Arbitrary Name)
+- node_line_name: retrieve_node_line
   nodes:
-    - node_type: retrieval # Set Retrieval Node
+    - node_type: lexical_retrieval
+      strategy:
+        metrics: [ retrieval_f1, retrieval_recall, retrieval_ndcg, retrieval_mrr ]
+      top_k: 3
+      modules:
+        - module_type: bm25
+    - node_type: semantic_retrieval
       strategy:
-        metrics: [ retrieval_f1, retrieval_recall, retrieval_ndcg, retrieval_mrr ] # Set Retrieval Metrics
+        metrics: [ retrieval_f1, retrieval_recall, retrieval_ndcg, retrieval_mrr ]
       top_k: 3
      modules:
        - module_type: vectordb
          vectordb: default
-        - module_type: bm25
+    - node_type: hybrid_retrieval
+      strategy:
+        metrics: [ retrieval_f1, retrieval_recall, retrieval_ndcg, retrieval_mrr ]
+      top_k: 3
+      modules:
        - module_type: hybrid_rrf
          weight_range: (4,80)
-- node_line_name: post_retrieve_node_line # Set Node Line (Arbitrary Name)
+- node_line_name: post_retrieve_node_line
   nodes:
    - node_type: prompt_maker # Set Prompt Maker Node
      strategy:
````

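The `hybrid_rrf` module in the YAML above fuses the lexical and semantic rankings. As a rough illustration of reciprocal rank fusion (not AutoRAG's implementation; the function name and the smoothing constant `k=60` are assumptions for this sketch), each document is scored by the sum of reciprocal ranks across the two rankings:

```python
def rrf_fuse(semantic_ids, lexical_ids, k=60, top_k=3):
    """Toy reciprocal rank fusion: score(doc) = sum over rankings of 1 / (k + rank)."""
    scores = {}
    for ranking in (semantic_ids, lexical_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first, truncated to top_k.
    fused = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [doc_id for doc_id, _ in fused[:top_k]]

# "a" is ranked first by both retrievers and "c" appears in both lists,
# so they beat documents found by only one retriever.
fused = rrf_fuse(["a", "b", "c"], ["a", "c", "d"], top_k=2)
```

A document retrieved by both BM25 and the vector index accumulates two reciprocal-rank contributions, which is why hybrid retrieval can outperform either module alone.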
autorag/autorag/data/legacy/qacreation/base.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -8,7 +8,7 @@
 from tqdm import tqdm
 
 import autorag
-from autorag.nodes.retrieval.vectordb import vectordb_ingest_api, vectordb_pure
+from autorag.nodes.semanticretrieval.vectordb import vectordb_ingest_api, vectordb_pure
 from autorag.utils.util import (
     save_parquet_safe,
     fetch_contents,
```

autorag/autorag/deploy/api.py

Lines changed: 12 additions & 2 deletions

```diff
@@ -248,9 +248,19 @@ def run_api_server(
         self.app.run(host=host, port=port, **kwargs)
 
     def extract_retrieve_passage(self, df: pd.DataFrame) -> List[RetrievedPassage]:
-        retrieved_ids: List[str] = df["retrieved_ids"].tolist()[0]
+        if "retrieved_ids" not in df.columns and "retrieved_ids_semantic" in df.columns:
+            retrieved_ids: List[str] = df["retrieved_ids_semantic"].tolist()[0]
+            scores = df["retrieve_scores_semantic"].tolist()[0]
+        elif (
+            "retrieved_ids" not in df.columns
+            and "retrieved_ids_semantic" not in df.columns
+        ):
+            retrieved_ids: List[str] = df["retrieved_ids_lexical"].tolist()[0]
+            scores = df["retrieve_scores_lexical"].tolist()[0]
+        else:
+            retrieved_ids: List[str] = df["retrieved_ids"].tolist()[0]
+            scores = df["retrieve_scores"].tolist()[0]
         contents = fetch_contents(self.corpus_df, [retrieved_ids])[0]
-        scores = df["retrieve_scores"].tolist()[0]
         if "path" in self.corpus_df.columns:
             paths = fetch_contents(self.corpus_df, [retrieved_ids], column_name="path")[
                 0
```

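The change above keeps the old plain `retrieved_ids`/`retrieve_scores` columns working while falling back to the new suffixed columns when only one retrieval node ran. The same preference order (plain, then `_semantic`, then `_lexical`) can be sketched as a standalone helper; the helper name is hypothetical and not part of AutoRAG:

```python
import pandas as pd

def pick_retrieval_columns(df: pd.DataFrame):
    """Return (ids, scores) for the first row, preferring the plain columns,
    then the _semantic ones, then the _lexical ones."""
    for suffix in ("", "_semantic", "_lexical"):
        ids_col = f"retrieved_ids{suffix}"
        scores_col = f"retrieve_scores{suffix}"
        if ids_col in df.columns:
            return df[ids_col].tolist()[0], df[scores_col].tolist()[0]
    raise KeyError("no retrieved_ids* column found")

# A result frame produced by a lexical-only pipeline: only suffixed columns exist.
df = pd.DataFrame({
    "retrieved_ids_lexical": [["doc1", "doc2"]],
    "retrieve_scores_lexical": [[1.2, 0.7]],
})
ids, scores = pick_retrieval_columns(df)
```

Centralizing the fallback this way keeps API consumers unaware of which retrieval node produced the final result.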
autorag/autorag/evaluator.py

Lines changed: 2 additions & 2 deletions

```diff
@@ -12,8 +12,8 @@
 
 from autorag.node_line import run_node_line
 from autorag.nodes.retrieval.base import get_bm25_pkl_name
-from autorag.nodes.retrieval.bm25 import bm25_ingest
-from autorag.nodes.retrieval.vectordb import (
+from autorag.nodes.lexicalretrieval.bm25 import bm25_ingest
+from autorag.nodes.semanticretrieval.vectordb import (
     vectordb_ingest_api,
     filter_exist_ids,
     filter_exist_ids_from_retrieval_gt,
```
New file (name not shown)

Lines changed: 2 additions & 0 deletions

```diff
@@ -0,0 +1,2 @@
+from .hybrid_cc import HybridCC
+from .hybrid_rrf import HybridRRF
```
New file (name not shown)

Lines changed: 58 additions & 0 deletions

```diff
@@ -0,0 +1,58 @@
+import abc
+
+import pandas as pd
+
+from autorag.nodes.retrieval.base import BaseRetrieval
+from autorag.utils import result_to_dataframe
+from autorag.utils.util import pop_params, fetch_contents
+
+
+class HybridRetrieval(BaseRetrieval, metaclass=abc.ABCMeta):
+    def __init__(self, project_dir: str, *args, **kwargs):
+        super().__init__(project_dir)
+
+    @result_to_dataframe(["retrieved_contents", "retrieved_ids", "retrieve_scores"])
+    def pure(self, previous_result: pd.DataFrame, *args, **kwargs):
+        previous_info = self.cast_to_run(previous_result, *args, **kwargs)
+        _pure_params = pop_params(self._pure, kwargs)
+        ids, scores = self._pure(previous_info, **_pure_params)
+        contents = fetch_contents(self.corpus_df, ids)
+        return contents, ids, scores
+
+    def cast_to_run(self, previous_result: pd.DataFrame, *args, **kwargs):
+        return hybrid_cast(previous_result)
+
+    @classmethod
+    def cast_to_run_class(cls, previous_result: pd.DataFrame):
+        return hybrid_cast(previous_result)
+
+
+def hybrid_cast(
+    previous_result: pd.DataFrame,
+):
+    assert "query" in previous_result.columns, "previous_result must have query column."
+    queries = previous_result["query"].tolist()
+
+    assert "retrieved_contents_semantic" in previous_result.columns
+    assert "retrieved_contents_lexical" in previous_result.columns
+    assert "retrieve_scores_semantic" in previous_result.columns
+    assert "retrieve_scores_lexical" in previous_result.columns
+    assert "retrieved_ids_semantic" in previous_result.columns
+    assert "retrieved_ids_lexical" in previous_result.columns
+
+    contents_semantic = previous_result["retrieved_contents_semantic"].tolist()
+    contents_lexical = previous_result["retrieved_contents_lexical"].tolist()
+    scores_semantic = previous_result["retrieve_scores_semantic"].tolist()
+    scores_lexical = previous_result["retrieve_scores_lexical"].tolist()
+    ids_semantic = previous_result["retrieved_ids_semantic"].tolist()
+    ids_lexical = previous_result["retrieved_ids_lexical"].tolist()
+
+    return {
+        "queries": queries,
+        "retrieved_contents_semantic": contents_semantic,
+        "retrieved_contents_lexical": contents_lexical,
+        "retrieve_scores_semantic": scores_semantic,
+        "retrieve_scores_lexical": scores_lexical,
+        "retrieved_ids_semantic": ids_semantic,
+        "retrieved_ids_lexical": ids_lexical,
+    }
```

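`hybrid_cast` is essentially a reshape: one DataFrame row per query becomes a dict of parallel lists keyed by the suffixed column names, with `query` renamed to `queries`. A toy equivalent of that reshaping (standalone sketch using only pandas, not importing AutoRAG):

```python
import pandas as pd

# A one-row previous_result shaped like the combined output of the
# semantic and lexical retrieval nodes (list-valued cells per query).
previous_result = pd.DataFrame({
    "query": ["what is RAG?"],
    "retrieved_contents_semantic": [["passage A"]],
    "retrieved_contents_lexical": [["passage B"]],
    "retrieve_scores_semantic": [[0.9]],
    "retrieve_scores_lexical": [[3.1]],
    "retrieved_ids_semantic": [["id-a"]],
    "retrieved_ids_lexical": [["id-b"]],
})

# The same reshaping hybrid_cast performs: DataFrame columns -> dict of lists,
# renaming "query" to "queries".
info = {col: previous_result[col].tolist() for col in previous_result.columns}
info["queries"] = info.pop("query")
```

The resulting `info` dict is what `HybridCC._pure` (below) indexes with keys like `retrieved_ids_semantic`.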
autorag/autorag/nodes/retrieval/hybrid_cc.py renamed to autorag/autorag/nodes/hybridretrieval/hybrid_cc.py

Lines changed: 46 additions & 33 deletions

```diff
@@ -1,12 +1,12 @@
-import os
 from pathlib import Path
 from typing import Tuple, List, Union
 
 import numpy as np
 import pandas as pd
 
-from autorag.nodes.retrieval.base import HybridRetrieval
-from autorag.utils.util import pop_params, fetch_contents, result_to_dataframe
+from autorag.nodes.hybridretrieval.base import HybridRetrieval
+from autorag.nodes.hybridretrieval.run import evaluate_retrieval_node
+from autorag.strategy import select_best
 
 
 def normalize_mm(scores: List[str], fixed_min_value: float = 0):
@@ -53,17 +53,16 @@ def normalize_dbsf(scores: List[str], fixed_min_value: float = 0):
 class HybridCC(HybridRetrieval):
     def _pure(
         self,
-        ids: Tuple,
-        scores: Tuple,
+        info: dict,
         top_k: int,
         weight: float,
         normalize_method: str = "mm",
         semantic_theoretical_min_value: float = -1.0,
         lexical_theoretical_min_value: float = 0.0,
     ):
         return hybrid_cc(
-            ids,
-            scores,
+            (info["retrieved_ids_semantic"], info["retrieved_ids_lexical"]),
+            (info["retrieve_scores_semantic"], info["retrieve_scores_lexical"]),
             top_k,
             weight,
             normalize_method,
@@ -79,34 +78,48 @@ def run_evaluator(
         *args,
         **kwargs,
     ):
-        if "ids" in kwargs and "scores" in kwargs:
-            data_dir = os.path.join(project_dir, "data")
-            corpus_df = pd.read_parquet(
-                os.path.join(data_dir, "corpus.parquet"), engine="pyarrow"
+        assert "strategy" in kwargs, "You must specify the strategy to use."
+        assert (
+            "input_metrics" in kwargs
+        ), "You must specify the input metrics to use, which is list of MetricInput."
+        strategies = kwargs.pop("strategy")
+        input_metrics = kwargs.pop("input_metrics")
+        weight_range = kwargs.pop("weight_range", (0.0, 1.0))
+        test_weight_size = kwargs.pop("test_weight_size", 101)
+        weight_candidates = np.linspace(
+            weight_range[0], weight_range[1], test_weight_size
+        ).tolist()
+
+        result_list = []
+        instance = cls(project_dir, *args, **kwargs)
+        for weight_value in weight_candidates:
+            result_df = instance.pure(previous_result, weight=weight_value, **kwargs)
+            result_list.append(result_df)
+
+        if strategies.get("metrics") is None:
+            raise ValueError("You must at least one metrics for retrieval evaluation.")
+        result_list = list(
+            map(
+                lambda x: evaluate_retrieval_node(
+                    x,
+                    input_metrics,
+                    strategies.get("metrics"),
+                ),
+                result_list,
             )
+        )
 
-            params = pop_params(hybrid_cc, kwargs)
-            assert (
-                "ids" in params and "scores" in params and "top_k" in params
-            ), "ids, scores, and top_k must be specified."
-
-            @result_to_dataframe(
-                ["retrieved_contents", "retrieved_ids", "retrieve_scores"]
-            )
-            def __cc(**cc_params):
-                ids, scores = hybrid_cc(**cc_params)
-                contents = fetch_contents(corpus_df, ids)
-                return contents, ids, scores
-
-            return __cc(**params)
-        else:
-            assert (
-                "target_modules" in kwargs and "target_module_params" in kwargs
-            ), "target_modules and target_module_params must be specified if there is not ids and scores."
-            instance = cls(project_dir, *args, **kwargs)
-            result = instance.pure(previous_result, *args, **kwargs)
-            del instance
-            return result
+        # select best result
+        best_result_df, best_weight = select_best(
+            result_list,
+            strategies.get("metrics"),
+            metadatas=weight_candidates,
+            strategy_name=strategies.get("strategy", "normalize_mean"),
+        )
+        return {
+            "best_result": best_result_df,
+            "best_weight": best_weight,
+        }
 
 
 def hybrid_cc(
```

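The new `run_evaluator` sweeps weight candidates and keeps the best convex combination of the two rankings. The fusion step itself, for a single query, can be sketched as follows (min-max normalization only; the names and defaults here are illustrative and do not match the library's `hybrid_cc` signature):

```python
def cc_fuse(ids, scores, weight, top_k):
    """Convex combination of two rankings for one query.
    ids and scores are (semantic, lexical) pairs of parallel lists."""
    def normalize_mm(s):
        # Min-max normalize to [0, 1]; constant lists collapse to 0.
        lo, hi = min(s), max(s)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in s]

    semantic = dict(zip(ids[0], normalize_mm(scores[0])))
    lexical = dict(zip(ids[1], normalize_mm(scores[1])))
    # weight blends the two normalized score spaces; missing docs score 0.
    fused = {
        doc: weight * semantic.get(doc, 0.0) + (1 - weight) * lexical.get(doc, 0.0)
        for doc in {*semantic, *lexical}
    }
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return [d for d, _ in ranked], [s for _, s in ranked]

# weight=1.0 trusts only the semantic ranking; weight=0.0 only the lexical one.
fused_ids, fused_scores = cc_fuse(
    (["a", "b"], ["b", "c"]), ([0.9, 0.3], [7.0, 1.0]), weight=0.6, top_k=2
)
```

The weight sweep in `run_evaluator` amounts to calling a fusion like this once per candidate weight, scoring each fused result with the retrieval metrics, and keeping the weight that scores best.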