
Commit 344b3e8

Jinyu Wang authored and committed

- update to fix 'Clear-text logging of sensitive information' issues
- update to fix 'Uncontrolled data used in path expression' issues
- add acr online demo
- add BM25Retriever
- update AzureOpenAIClient authentication method
- update LLM metric
- add chunking workflow, update tagging workflow
- update data protocol utils
- updates to yaml configs
- remove visualization scripts; remove absolute paths
- update README, Transparency doc, requirements file
1 parent 818c3b3 commit 344b3e8
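Among the items in the commit message is a new `BM25Retriever`. As a rough, hypothetical illustration of what a BM25-based retriever does — Okapi BM25 scoring over a small corpus — here is a minimal sketch; the class name, tokenizer, and defaults are invented for illustration and are not the repository's actual implementation:

```python
import math
from collections import Counter

class SimpleBM25Retriever:
    """Hypothetical sketch of a BM25 (Okapi) retriever; not PIKE-RAG's actual class."""

    def __init__(self, documents, k1=1.5, b=0.75):
        self.docs = [doc.lower().split() for doc in documents]  # naive whitespace tokenizer
        self.k1, self.b = k1, b
        self.N = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.N
        self.tfs = [Counter(d) for d in self.docs]              # per-document term frequencies
        self.df = Counter()                                     # document frequency per term
        for tf in self.tfs:
            self.df.update(tf.keys())

    def _idf(self, term):
        n = self.df.get(term, 0)
        return math.log((self.N - n + 0.5) / (n + 0.5) + 1.0)

    def score(self, query, index):
        tf, dl = self.tfs[index], len(self.docs[index])
        total = 0.0
        for term in query.lower().split():
            f = tf.get(term, 0)
            if f:
                norm = f + self.k1 * (1.0 - self.b + self.b * dl / self.avgdl)
                total += self._idf(term) * f * (self.k1 + 1.0) / norm
        return total

    def retrieve(self, query, k=3):
        order = sorted(range(self.N), key=lambda i: self.score(query, i), reverse=True)
        return order[:k]

documents = [
    "the patient medical record archive",
    "mining equipment maintenance manual",
    "patient treatment plan overview",
]
retriever = SimpleBM25Retriever(documents)
print(retriever.retrieve("patient medical record", k=2))  # → [0, 2]
```

Unlike embedding-based retrieval (discussed in the README below), BM25 matches on exact lexical overlap, which is one reason the two are often combined in RAG pipelines.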

File tree

96 files changed (+1677, -1718 lines)


.gitignore

Lines changed: 0 additions & 4 deletions

```diff
@@ -144,7 +144,3 @@ cython_debug/
 env_configs/
 data/
 logs/
-
-# May enable in the future
-notebooks/
-test/
```
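The commit message also lists a new chunking workflow. Sliding-window chunking with overlap is one common baseline for that kind of document segmentation; the sketch below is a hypothetical illustration (function name and defaults invented, not taken from the repository):

```python
def chunk_tokens(tokens, chunk_size=128, overlap=32):
    """Split a token list into overlapping chunks; the last `overlap` tokens
    of each chunk reappear at the start of the next chunk so that content
    near a boundary is never seen without context."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):  # final chunk reaches the end
            break
    return chunks

# 10 toy tokens, chunks of 4 with an overlap of 2:
print(chunk_tokens(list(range(10)), chunk_size=4, overlap=2))
# → [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Context-aware segmentation, as described in the README below, goes further by choosing boundaries at semantic breaks rather than at fixed token offsets.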

RAI_TRANSPARENCY.md

Lines changed: 7 additions & 5 deletions

```diff
@@ -2,17 +2,17 @@
 
 ## What is PIKE-RAG?
 
-PIKE-RAG, or PrIvate KnowledgE and Rationale Augmentation Generation, represents an advanced evolution in AI-assisted content interpretation, tailored for industrial applications requiring private knowledge and domain-specific rationale. Unlike conventional Retrieval-Augmented Generation (RAG) systems which primarily rely on retrieval to inform large language models (LLMs), PIKE-RAG introduces a methodology that integrates private knowledge extraction with the generation of a coherent rationale. This technique enables the LLMs to progressively navigate towards more accurate and contextually relevant responses. By parsing data to create detailed knowledge structures akin to a heterogeneous knowledge graph, and guiding LLMs to construct coherent rationale in a knowledge-aware manner, PIKE-RAG surpasses traditional models by extracting, understanding, linking and applying private knowledge that is not readily accessible through standard retrieval processes.
+PIKE-RAG, or sPecIalized KnowledgE and Rationale Augmentation Generation, represents an advanced evolution in AI-assisted content interpretation, tailored for industrial applications requiring specialized knowledge and domain-specific rationale. Unlike conventional Retrieval-Augmented Generation (RAG) systems which primarily rely on retrieval to inform large language models (LLMs), PIKE-RAG introduces a methodology that integrates domain-specific knowledge extraction with the generation of a coherent rationale. This technique enables the LLMs to progressively navigate towards more accurate and contextually relevant responses. By parsing data to create detailed knowledge structures akin to a heterogeneous knowledge graph, and guiding LLMs to construct coherent rationale in a knowledge-aware manner, PIKE-RAG surpasses traditional models by extracting, understanding, linking and applying specialized knowledge that is not readily accessible through standard retrieval processes.
 
 ## What can PIKE-RAG do?
 
-The strength of PIKE-RAG lies in its sophisticated ability to link disparate pieces of information across extensive data sets, thereby facilitating the answering of complex queries that are beyond the reach of typical keyword and vector-based search methods. By constructing a heterogeneous knowledge graph from a user-provided private dataset, PIKE-RAG can adeptly navigate through a sea of information to locate and synthesize content that addresses user questions, even when the answers are scattered across multiple documents. Moreover, PIKE-RAG's advanced functionality enables it to tackle thematic inquiries, such as identifying prevailing motifs or patterns within a dataset. This capacity to handle both specific, cross-referenced questions and broader, thematic ones makes PIKE-RAG a powerful tool for extracting actionable insights from large volumes of data.
+The strength of PIKE-RAG lies in its sophisticated ability to link disparate pieces of information across extensive data sets, thereby facilitating the answering of complex queries that are beyond the reach of typical keyword and vector-based search methods. By constructing a heterogeneous knowledge graph from a user-provided domain-specific dataset, PIKE-RAG can adeptly navigate through a sea of information to locate and synthesize content that addresses user questions, even when the answers are scattered across multiple documents. Moreover, PIKE-RAG's advanced functionality enables it to tackle thematic inquiries, such as identifying prevailing motifs or patterns within a dataset. This capacity to handle both specific, cross-referenced questions and broader, thematic ones makes PIKE-RAG a powerful tool for extracting actionable insights from large volumes of data.
 
 ## What is/are PIKE-RAG's intended use(s)?
 
 PIKE-RAG is an advanced system crafted to revolutionize how large language models assist in complex industrial tasks. Here are its intended uses:
 
-- PIKE-RAG is specifically engineered to enhance large language models in sectors where the need for deep, domain-specific knowledge is paramount, and where conventional retrieval-augmented systems fall short. Its intended use is in complex industrial applications where the extraction and application of specialized knowledge are critical for achieving accurate and insightful outcomes. By incorporating private knowledge and constructing coherent rationales, PIKE-RAG is adept at guiding LLMs to provide precise responses tailored to intricate queries.
+- PIKE-RAG is specifically engineered to enhance large language models in sectors where the need for deep, domain-specific knowledge is paramount, and where conventional retrieval-augmented systems fall short. Its intended use is in complex industrial applications where the extraction and application of specialized knowledge are critical for achieving accurate and insightful outcomes. By incorporating specialized knowledge and constructing coherent rationales, PIKE-RAG is adept at guiding LLMs to provide precise responses tailored to intricate queries.
 
 - Aimed at addressing the multifaceted challenges encountered in industrial environments, PIKE-RAG is poised for deployment in scenarios requiring a high level of logical reasoning and the ability to navigate through specialized corpora. It is ideal for use cases that demand not just information retrieval but also a deep understanding of that information and its application in a logical, reasoned manner. This makes PIKE-RAG particularly valuable for tasks where decision-making is heavily reliant on exclusive, industry-specific insights.
 
@@ -36,11 +36,13 @@ PIKE-RAG's performance and utility were rigorously assessed through a comprehens
 
 ## What are the limitations of PIKE-RAG? How can users minimize the impact of PIKE-RAG's limitations when using the system?
 
-PIKE-RAG, like any advanced system, comes with certain limitations that users should be aware of in order to effectively utilize the system within its operational boundaries. The open-source version of PIKE-RAG is crafted with a general-purpose approach in mind. As a result, while it outperforms other untrained and unadjusted methods across the datasets being tested, it may not deliver the optimal performance possible within a specific domain. This is because it is not fine-tuned to the unique nuances and specialized requirements that certain domains may present.
+PIKE-RAG, like any advanced system, comes with certain limitations that users should be aware of in order to effectively utilize the system within its operational boundaries. The open-source version of PIKE-RAG is crafted with a general-purpose approach in mind. As a result, while it outperforms other untrained and unadjusted methods across the datasets being tested, it may not deliver the optimal performance possible within a specific domain. This is because it is not fine-tuned to the unique nuances and specialized requirements that certain domains may present. In other words, please note that PIKE-RAG was developed for research purposes and is not intended for real-world industrial applications without further testing and development.
 
 Users can mitigate the limitations posed by PIKE-RAG's generalist nature by introducing domain-specific customizations. For instance, when deploying PIKE-RAG in a particular domain, users can enhance performance by tailoring the corpus pre-processing steps to better reflect the domain's specificities, thereby ensuring that the knowledge base PIKE-RAG draws from is highly relevant and fine-tuned. Additionally, during the rationale generation phase, users can incorporate domain-specific demonstrations or templates that guide the system to construct rationales that are more aligned with domain expertise and practices. This bespoke approach can significantly improve the relevance and accuracy of the system's outputs, making them more actionable and trustworthy for domain-specific applications.
 
-By understanding and addressing these limitations through targeted domain-specific adjustments, users can effectively harness the power of PIKE-RAG and reduce the impact of its constraints, thus maximizing the system's utility and performance in specialized contexts.
+Another limitation is that PIKE-RAG was mainly evaluated in English. To apply PIKE-RAG in other languages, some prompt engineering work may be needed to rewrite the prompts in the target language, along with adequate testing.
+
+Please keep in mind that applying PIKE-RAG to real-world industrial applications requires domain experts to customize the whole pipeline. By understanding and addressing these limitations through targeted domain-specific adjustments, users can effectively harness the power of PIKE-RAG and reduce the impact of its constraints, thus maximizing the system's utility and performance in specialized contexts.
 
 ## What operational factors and settings allow for effective and responsible use of PIKE-RAG?
```
README.md

Lines changed: 39 additions & 73 deletions

````diff
@@ -1,93 +1,59 @@
-# PIKE-RAG: PrIvate KnowledgE and Rationale Augmented Generation
+<p align="center">
+  <img src="./docs/source/images/logo/PIKE-RAG_horizontal_black-font.svg" alt="PIKE-RAG" style="width: 80%; max-width: 100%; height: auto;">
+</p>
 
-## Quick Start
-
-Please set your `PYTHONPATH` before running the scripts:
-
-### Windows
-
-```powershell
-$Env:PYTHONPATH=PATH-TO-THIS-REPO
-
-# If you are exactly under this repository directory, you can do it by
-$Env:PYTHONPATH=$PWD
-```
-
-### Linux / Mac OS
+<p align="center">
+  <a href="https://pike-rag.azurewebsites.net/">🌐Online Demo</a>
+  <a href="https://arxiv.org/abs/2501.11551">📊Technical Report</a>
+</p>
 
-```sh
-export PYTHONPATH=PATH-TO-THIS-REPO
+[![License](https://img.shields.io/github/license/microsoft/PIKE-RAG)](https://github.com/microsoft/PIKE-RAG/blob/main/LICENSE)
+[![CodeQL](https://github.com/microsoft/PIKE-RAG/actions/workflows/github-code-scanning/codeql/badge.svg)](https://github.com/microsoft/PIKE-RAG/actions/workflows/github-code-scanning/codeql)
+[![Release](https://img.shields.io/github/v/release/microsoft/PIKE-RAG)](https://github.com/microsoft/PIKE-RAG/releases)
+[![ReleaseDate](https://img.shields.io/github/release-date-pre/microsoft/PIKE-RAG)](https://github.com/microsoft/PIKE-RAG/releases)
+[![Commits](https://img.shields.io/github/commits-since/microsoft/PIKE-RAG/latest/main)](https://github.com/microsoft/PIKE-RAG/commits/main)
+[![Pull Requests](https://img.shields.io/github/issues-pr/microsoft/PIKE-RAG)](https://github.com/microsoft/PIKE-RAG/pulls)
+[![Issues](https://img.shields.io/github/issues/microsoft/PIKE-RAG)](https://github.com/microsoft/PIKE-RAG/issues)
 
-# If you are exactly under the repository directory, you can do it by
-export PYTHONPATH=$PWD
-```
+# PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented Generation
 
-## .env File
+## Why PIKE-RAG?
 
-Please follow below environment configuration variable names to create your *.env* file, we suggest you put it under
-`PIKE-RAG/env_configs/` which has already been added to *.gitignore* file:
+In recent years, Retrieval-Augmented Generation (RAG) systems have made significant progress in extending the capabilities of Large Language Models (LLMs) through external retrieval. However, these systems still face challenges in meeting the complex and diverse needs of real-world industrial applications. Relying solely on direct retrieval is insufficient for extracting deep domain-specific knowledge from professional corpora and performing logical reasoning. To address this issue, we propose the PIKE-RAG (sPecIalized KnowledgE and Rationale Augmented Generation) method, which focuses on extracting, understanding, and applying domain-specific knowledge while building coherent reasoning logic to gradually guide LLMs toward accurate responses.
 
-### For Azure OpenAI Client
+<p align="center">
+  <img src="docs/source/images/readme/pipeline.png" alt="Overview of PIKE-RAG Framework" style="width: 80%; max-width: 100%; height: auto;">
+</p>
 
-```sh
-AZURE_OPENAI_ENDPOINT = "YOUR-ENDPOINT(https://xxx.openai.azure.com/)"
-OPENAI_API_TYPE = "azure"
-OPENAI_API_VERSION = "2023-07-01-preview"
-```
+The PIKE-RAG framework mainly consists of several basic modules, including document parsing, knowledge extraction, knowledge storage, knowledge retrieval, knowledge organization, knowledge-centric reasoning, and task decomposition and coordination. By adjusting the submodules within the main modules, it is possible to build RAG systems that focus on different capabilities to meet the diverse needs of real-world scenarios.
 
-*Note that the way to access GPT API with key is disabled in Azure now.*
+For example, in a use case such as *searching a patient's historical medical records*, the focus is on *factual information retrieval capability*. The main challenges are that (1) the understanding and extraction of knowledge are often hindered by inappropriate knowledge segmentation, which disrupts semantic coherence and leads to a complex and inefficient retrieval process; and (2) commonly used embedding-based knowledge retrieval is limited by the embedding models' ability to align professional terms and aliases, reducing system accuracy. With PIKE-RAG, we can improve the accuracy of knowledge extraction and retrieval by using context-aware segmentation techniques, automatic term-label alignment techniques, and multi-granularity knowledge extraction methods during the knowledge extraction process, thereby enhancing factual information retrieval capability, as shown in the pipeline below.
 
-To access GPT resource from Azure, please remember to login to Azure CLI using your *SC-* account:
+<p align="center">
+  <img src="docs/source/images/readme/L1_pipeline.png" alt="A Pipeline Focusing on Factual Information Retrieval" style="width: 80%; max-width: 100%; height: auto;">
+</p>
 
-```sh
-# Install Azure-CLI and other dependencies. Sudo permission is required.
-bash scripts/install_az.sh
+For a complex task like *suggesting reasonable treatment plans and coping measures for patients*, more advanced capabilities are required: strong domain-specific knowledge is needed to accurately understand the task and, at times, reasonably decompose it; advanced data retrieval, processing, and organization techniques are needed for potential tendency prediction; and multi-agent planning is also useful for balancing creativity and reliability. In such cases, the richer pipeline below can be initialized to achieve this.
 
-# Login Azure CLI using device code.
-bash scripts/login_az.sh
-```
+<p align="center">
+  <img src="docs/source/images/readme/L4_pipeline.png" alt="A Pipeline Focusing on Fact-based Innovation and Generation" style="width: 80%; max-width: 100%; height: auto;">
+</p>
 
-### For Azure Meta LlaMa Client
+In public benchmark tests, PIKE-RAG demonstrated excellent performance on several multi-hop question answering datasets such as HotpotQA, 2WikiMultiHopQA, and MuSiQue. Compared to existing benchmark methods, PIKE-RAG excelled in metrics like accuracy and F1 score. On the HotpotQA dataset, PIKE-RAG achieved an accuracy of 87.6%; on 2WikiMultiHopQA it reached 82.0%; and on the more challenging MuSiQue dataset, it achieved 59.6%. These results indicate that PIKE-RAG has significant advantages in handling complex reasoning tasks, especially in scenarios that require integrating multi-source information and performing multi-step reasoning.
 
-Since the endpoint and API keys varied among different LlaMa models, you can add multiple
-(`llama_endpoint_name`, `llama_key_name`) pairs you want to use into the *.env* file, and specify the names when
-initializing `AzureMetaLlamaClient` (you can modify the llm client args in the YAML files). If `null` is set to be the
-name, the (`LLAMA_ENDPOINT`, `LLAMA_API_KEY`) would be used as the default environment variable name.
+PIKE-RAG has been tested and has significantly improved question-answering accuracy in fields such as industrial manufacturing, mining, and pharmaceuticals. In the future, we will continue to explore its application in more fields. Additionally, we will continue to explore other forms of knowledge and logic and their optimal adaptation to specific scenarios.
 
-```sh
-# Option 1: Set only one pair in one time, update these variables every time you want to change the LlaMa model.
-LLAMA_ENDPOINT = "YOUR-LLAMA-ENDPOINT"
-LLAMA_API_KEY = "YOUR-API-KEY"
+## For More Details
 
-# Option 2: Add multiple pairs into the .env file, for example:
-LLAMA3_8B_ENDPOINT = "..."
-LLAMA3_8B_API_KEY = "..."
+- 📊 The [Technical Report](https://arxiv.org/abs/2501.11551) illustrates the industrial RAG problem classification, introduces the main components of PIKE-RAG, and shows some experimental results on public benchmarks.
+- 🌐 The [Online Demo](https://pike-rag.azurewebsites.net/) is a showcase of our Knowledge-Aware decomposition pipeline for the L2 RAG task.
 
-LLAMA3_70B_ENDPOINT = "..."
-LLAMA3_70B_API_KEY = "..."
-```
-
-#### Ways to Get the Available Azure Meta LLaMa **Endpoints**, **API Keys** and **Model Names**
-
-The way we have implemented the LLaMa model so far involves requesting the deployed model on the GCR server. You can
-find the available settings by following the steps below:
-
-1. Open [Azure Machine Learning Studio](https://ml.azure.com/home); signing in may be required;
-2. Click *Workspaces* on the left side (expand the menu by clicking the three horizontal lines in the top left corner if
-you cannot find it);
-3. Choose and click on a valid workspace, e.g., *gcrllm2ws*;
-4. Click *Endpoints* on the left side (expand the menu by clicking the three horizontal lines in the top left corner if
-you cannot find it); you can find the available model list on this page;
-5. Choose and click the model you want to use, e.g., *gcr-llama-3-8b-instruct*:
-  - **model** name: in tab "Details", scroll to find "Deployment summary"; the *Live traffic allocation* string (e.g.,
-*meta-llama-3-8b-instruct-4*) is the model name you need to set up in your YAML file;
-  - **LLAMA_ENDPOINT** & **LLAMA_API_KEY**: can be found in tab "Consume".
-
-#### Handling the Issue "Specified deployment could not be found"
+## Quick Start
 
-If you get the error message "Specified deployment could not be found", it indicates that the GCR team has changed the
-server deployment location. In this case, you need to check the available model list in
-[Azure Machine Learning Studio](https://ml.azure.com/home) and update the YAML config again.
+1. Clone this repo and set up the Python environment; refer to [this document](docs/guides/environment.md);
+2. Create a `.env` file to save your endpoint information (and some other environment variables if needed); refer to [this document](docs/guides/env_file.md);
+3. Modify the *yaml config* files and try the scripts under *examples/*; refer to [this document](docs/guides/examples.md);
+4. Build up your own pipeline and/or add your own components!
 
 ## Contributing
````
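The transparency doc in this commit notes that PIKE-RAG was mainly evaluated in English and that users should supply domain-specific prompt templates. One way to picture that customization point is a small template registry keyed by language and task; this is a hypothetical sketch with invented names, not PIKE-RAG's actual prompt mechanism:

```python
# Hypothetical template registry; keys and wording are illustrative only.
TEMPLATES = {
    ("en", "qa"): "Answer using only the context below.\nContext: {context}\nQuestion: {question}",
    ("zh", "qa"): "仅根据以下上下文回答问题。\n上下文:{context}\n问题:{question}",
}

def build_prompt(language, task, **fields):
    # Fall back to the (tested) English template for unsupported languages;
    # per the transparency doc, new languages need their own templates
    # plus adequate testing before being relied upon.
    template = TEMPLATES.get((language, task)) or TEMPLATES[("en", task)]
    return template.format(**fields)

print(build_prompt("en", "qa", context="<retrieved chunks>", question="What is PIKE-RAG?"))
```

Keeping templates in data rather than in code makes the localization and domain-customization work the doc recommends a matter of editing the registry, not the pipeline.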
