
Commit 505ad72

Merge pull request road-core#305 from tisnik/llm-evaluation-infrastructure
LLM evaluation infrastructure
2 parents f7e0677 + 11a0489 commit 505ad72

26 files changed (+2042, -2 lines)

scripts/evaluation/README.md

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
# Evaluation

## Description

Currently we have two types of evaluation.

1. `consistency`: Compares responses against a ground-truth answer for a specific provider+model. The objective of this evaluation is to flag any variation in that provider+model's response. Currently a combination of similarity distances is used to calculate the final score, and cut-off scores are used to flag deviations (see the conceptual sketch below). This also stores a .csv file with the query, pre-defined answer, API response & score. Input for this is a [json file](eval_data/question_answer_pair.json).

2. `model`: Compares responses against a single ground-truth answer. Here we can evaluate more than one provider+model at a time. This creates a json file as a summary report with scores (f1-score) for each provider+model. Along with selected QnAs from the above json file, we can also provide additional QnAs using a parquet file (optional): [Sample QnA set (parquet)](eval_data/interview_qna_30_per_title.parquet) with 30 queries per OCP documentation title.
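As a rough illustration of the *consistency* idea only (the actual scoring lives in `ResponseEvaluation` and combines several metrics; the embedding model and cutoff value below are assumptions, not the project's choices):

```
# Conceptual sketch, not the project's implementation: flag a response whose
# cosine distance from the pre-defined answer exceeds a per-question cutoff.
from sentence_transformers import SentenceTransformer, util

def deviates(ground_truth: str, response: str, cutoff: float = 0.3) -> bool:
    model = SentenceTransformer("all-mpnet-base-v2")  # assumed embedding model
    emb = model.encode([ground_truth, response])
    cos_distance = 1 - util.cos_sim(emb[0], emb[1]).item()
    return cos_distance > cutoff
```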
**Notes**
- QnAs should `not` be used for model training or tuning. They were created only for evaluation purposes.
- QnAs were generated from OCP docs by LLMs, so it is possible that some of the questions/answers are not entirely correct. We are constantly trying to verify both questions & answers manually. If you find any QnA pair that should be modified or removed, please create a PR.
- The OLS API should be ready/live with all the required provider+model combinations configured.
- We may want to run both the consistency and the model evaluation together. To avoid multiple API calls for the same query, the *model* evaluation first checks the .csv file generated by the *consistency* evaluation and calls the API only when a response is not already present in that file.
### e2e test case

These evaluations are also part of the **e2e test cases**. Currently the *consistency* evaluation is primarily used to gate PRs. The final e2e suite also invokes the *model* evaluation, which reuses the .csv files generated by earlier suites; if any of those files is missing, the last suite will fail.
### Usage
```
python -m scripts.evaluation.driver
```
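For example, a *model* evaluation run against a local OLS instance could look like the following (the arguments are described below; the provider+model and judge values shown are simply the driver's defaults, and the output directory is a placeholder):

```
python -m scripts.evaluation.driver \
  --eval_type model \
  --eval_provider_model_id watsonx+ibm/granite-3-8b-instruct \
  --judge_provider ollama \
  --judge_model llama3.1:latest \
  --eval_out_dir eval_output
```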
### Input Data/QnA pool

[Json file](eval_data/question_answer_pair.json)

[Sample QnA set (parquet)](eval_data/interview_qna_30_per_title.parquet)

Please refer to the above files for the structure and add new data accordingly.
### Arguments

**eval_type**: Controls which evaluation we want to run. Currently there are 3 options.
1. `consistency` -> Compares model-specific answers for the QnAs provided in the json file.
2. `model` -> Compares a set of models based on their responses and generates a summary report. For this we can provide additional QnAs in parquet format, along with the json file.
3. `all` -> Runs both of the above evaluations.
**eval_api_url**: OLS API URL. Default is `http://localhost:8080`. If OLS is deployed in a cluster, pass the cluster API URL.

**eval_api_token_file**: Path to a text file containing the OLS API token. Required if OLS is deployed in a cluster.
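If OLS is deployed on an OpenShift cluster and accepts your user's token (an assumption; check your deployment's auth setup), the file can typically be populated with something like:

```
oc whoami -t > ols_api_key.txt
```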
**eval_scenario**: Primarily required to identify which pre-defined answers need to be compared. Values can be `with_rag` or `without_rag`. Currently we always evaluate the API with RAG.

**eval_query_ids**: Option to give a set of query ids for evaluation. By default all queries are processed.

**eval_provider_model_id**: A set of provider/model combinations (given as ids) to be compared.

**qna_pool_file**: Applicable only for `model` evaluation. Path to the parquet file with additional QnAs. Default is None.

**eval_out_dir**: Directory where the output csv/json files will be saved.

**eval_metrics**: By default all scores/metrics are calculated, but this decides which scores will be used to create the graph. This is a list of metrics, e.g. cosine, euclidean distance, precision/recall/F1 score, answer relevancy score, LLM based similarity score.

**judge_provider / judge_model**: Provider / model for the judge LLM. This is required for LLM based evaluation (answer relevancy score, LLM based similarity score) and needs to be configured correctly through the config yaml file. [Sample provider/model configuration](../../examples/olsconfig.yaml)

**eval_modes**: Apart from the OLS API, we may want to evaluate a vanilla model, or a model with just the OLS parameters/prompt/RAG, so that we have baseline scores. This is a list of modes, e.g. vanilla, ols_param, ols_prompt, ols_rag & ols (the actual API). An example combining several of these arguments is shown below.
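For instance, to run both evaluations against a cluster deployment and compare the OLS responses with a vanilla-model baseline (the cluster URL is a placeholder; metric names other than the default `cos_score` depend on what `ResponseEvaluation` implements):

```
python -m scripts.evaluation.driver \
  --eval_type all \
  --eval_api_url https://ols.apps.example.com \
  --eval_api_token_file ols_api_key.txt \
  --eval_modes vanilla ols \
  --eval_metrics cos_score
```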
### Outputs

The evaluation scripts create the files below.
- CSV file with responses for the given provider/model & modes.
- Response evaluation result with scores (for the consistency check).
- Final csv file with all results, plus a json score summary & graph (for the model evaluation).

[Evaluation Result](eval_data/result/README.md)
# RAG retrieval script
```
python -m scripts.evaluation.query_rag
```
This generates a .csv file with the retrieved chunks and their similarity scores for a given set of queries. It is not part of the actual evaluation, but it is useful for a spot check to understand the text that we send to LLMs as context (this may explain any deviation in the response).
#### Arguments

*db-path*: Path to the RAG index.

*product-index*: RAG index ID.

*model-path*: Path or name of the embedding model.

*queries*: Set of queries separated by spaces. If not passed, default queries are used.

*top-k*: Number of chunks to retrieve. Default is 10.

*output_dir*: Directory in which to save the .csv file.
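A typical invocation might look like this (the paths, index ID and embedding model name are placeholders, and the exact flag spelling should be checked against the argument parser in `query_rag`, which is not shown in this diff):

```
python -m scripts.evaluation.query_rag \
  --db-path ./vector_db \
  --product-index ocp-product-docs \
  --model-path sentence-transformers/all-mpnet-base-v2 \
  --queries "how to configure the image registry" "what is a cluster operator" \
  --top-k 5 \
  --output_dir ./rag_chunks
```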

scripts/evaluation/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
"""Modules for evaluation."""

scripts/evaluation/driver.py

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
"""Driver for evaluation."""

import argparse
import sys

from httpx import Client

from scripts.evaluation.response_evaluation import ResponseEvaluation


def _args_parser(args):
    """Arguments parser."""
    parser = argparse.ArgumentParser(description="Response validation module.")
    parser.add_argument(
        "--eval_provider_model_id",
        nargs="+",
        default=["watsonx+ibm/granite-3-8b-instruct"],
        type=str,
        help="Identifier for Provider/Model to be used for model eval.",
    )
    parser.add_argument(
        "--judge_provider",
        default="ollama",
        type=str,
        help="Provider name for judge model; required for LLM based evaluation",
    )
    parser.add_argument(
        "--judge_model",
        default="llama3.1:latest",
        type=str,
        help="Judge model; required for LLM based evaluation",
    )
    parser.add_argument(
        "--eval_out_dir",
        default=None,
        type=str,
        help="Result destination.",
    )
    parser.add_argument(
        "--eval_query_ids",
        nargs="+",
        default=None,
        help="Ids of questions to be validated. Check json file for valid ids.",
    )
    parser.add_argument(
        "--eval_scenario",
        choices=["with_rag", "without_rag"],
        default="with_rag",
        type=str,
        help="Scenario for which responses will be evaluated.",
    )
    parser.add_argument(
        "--qna_pool_file",
        default=None,
        type=str,
        help="Additional file having QnA pool in parquet format.",
    )
    parser.add_argument(
        "--eval_type",
        choices=["consistency", "model", "all"],
        default="model",
        type=str,
        help="Evaluation type.",
    )
    parser.add_argument(
        "--eval_metrics",
        nargs="+",
        default=["cos_score"],
        help="Evaluation score/metric.",
    )
    parser.add_argument(
        "--eval_modes",
        nargs="+",
        default=["ols"],
        help="Evaluation modes ex: with just prompt/rag etc.",
    )
    parser.add_argument(
        "--eval_api_url",
        default="http://localhost:8080",
        type=str,
        help="API URL",
    )
    parser.add_argument(
        "--eval_api_token_file",
        default="ols_api_key.txt",
        type=str,
        help="Path to text file with API token (applicable when deployed on cluster)",
    )
    return parser.parse_args(args)


def main():
    """Evaluate response."""
    args = _args_parser(sys.argv[1:])

    client = Client(base_url=args.eval_api_url, verify=False)  # noqa: S501

    if "localhost" not in args.eval_api_url:
        with open(args.eval_api_token_file, mode="r", encoding="utf-8") as t_f:
            token = t_f.read().rstrip()
        client.headers.update({"Authorization": f"Bearer {token}"})

    resp_eval = ResponseEvaluation(args, client)

    match args.eval_type:
        case "consistency":
            resp_eval.validate_response()
        case "model":
            resp_eval.evaluate_models()
        case _:
            resp_eval.validate_response()
            resp_eval.evaluate_models()


if __name__ == "__main__":
    main()
Two binary files (309 KB and 268 KB) are not shown.
