
Commit 265d1c6

Merge pull request #47 from ansible/aap_38439
PR to apply the E2E OLS evaluation framework for the AAP chatbot
2 parents 8d092ac + e346349

File tree: 6 files changed (+65, −1 lines)

scripts/evaluation/README.md

Lines changed: 6 additions & 0 deletions
@@ -11,6 +11,7 @@ Currently we have 2 types of evaluations.
 - QnAs were generated from OCP docs by LLMs. It is possible that some of the questions/answers are not entirely correct. We are constantly trying to verify both Questions & Answers manually. If you find any QnA pair to be modified or removed, please create a PR.
 - OLS API should be ready/live with all the required provider+model configured.
 - It is possible that we want to run both consistency and model evaluation together. To avoid multiple API calls for same query, *model* evaluation first checks .csv file generated by *consistency* evaluation. If response is not present in csv file, then only we call API to get the response.
+- User needs to install python `matplotlib`, and `rouge_score` before running the evaluation.

 ### e2e test case

@@ -21,6 +22,11 @@ These evaluations are also part of **e2e test cases**. Currently *consistency* e
 python -m scripts.evaluation.driver
 ```

+### Sample run command
+```
+OPENAI_API_KEY=IGNORED python -m scripts.evaluation.driver --qna_pool_file ./scripts/evaluation/eval_data/aap-sample.parquet --eval_provider_model_id my_rhoai+granite3-8b --eval_metrics answer_relevancy answer_similarity_llm cos_score rougeL_precision --eval_modes vanilla --judge_model granite3-8b --judge_provider my_rhoai3 --eval_query_ids qna1
+```
+
 ### Input Data/QnA pool
 [Json file](eval_data/question_answer_pair.json)
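The README note above about reusing the consistency-evaluation CSV describes a simple read-through cache: model evaluation only calls the OLS API when a query has no stored response. A minimal sketch of that lookup, assuming an illustrative CSV layout with `query` and `response` columns (the column names and function signature are not taken from the repo):

```python
import pandas as pd

def get_response(query: str, csv_path: str, call_api) -> str:
    """Return a cached response for a query, calling the OLS API only on a miss."""
    try:
        cached = pd.read_csv(csv_path)
        hit = cached.loc[cached["query"] == query, "response"]
        if not hit.empty:
            return hit.iloc[0]  # reuse the answer consistency evaluation already stored
    except FileNotFoundError:
        pass  # consistency evaluation has not produced a CSV yet
    return call_api(query)  # cache miss: make the single API call
```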

Binary file (8.46 KB) not shown.
Binary file (38 KB) not shown.

scripts/evaluation/olsconfig.yaml

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+# olsconfig.yaml sample for local ollama server
+#
+# 1. install local ollama server from https://ollama.com/
+# 2. install llama3.1:latest model with:
+#    ollama pull llama3.1:latest
+# 3. Copy this file to the project root of cloned lightspeed-service repo
+# 4. Install dependencies with:
+#    make install-deps
+# 5. Start lightspeed-service with:
+#    OPENAI_API_KEY=IGNORED make run
+# 6. Open https://localhost:8080/ui in your web browser
+#
+llm_providers:
+  - name: ollama
+    type: openai
+    url: "http://localhost:11434/v1/"
+    models:
+      - name: "mistral"
+      - name: 'llama3.2:latest'
+  - name: my_rhoai
+    type: openai
+    url: "https://granite3-8b-wisdom-model-staging.apps.stage2-west.v2dz.p1.openshiftapps.com/v1"
+    credentials_path: ols_api_key.txt
+    models:
+      - name: granite3-8b
+ols_config:
+  # max_workers: 1
+  reference_content:
+    # product_docs_index_path: "./vector_db/vector_db/aap_product_docs/2.5"
+    # product_docs_index_id: aap-product-docs-2_5
+    # embeddings_model_path: "./vector_db/embeddings_model"
+  conversation_cache:
+    type: memory
+    memory:
+      max_entries: 1000
+  logging_config:
+    app_log_level: info
+    lib_log_level: warning
+    uvicorn_log_level: info
+  default_provider: ollama
+  default_model: 'llama3.2:latest'
+  query_validation_method: llm
+  user_data_collection:
+    feedback_disabled: false
+    feedback_storage: "/tmp/data/feedback"
+    transcripts_disabled: false
+    transcripts_storage: "/tmp/data/transcripts"
+dev_config:
+  # config options specific to dev environment - launching OLS in local
+  enable_dev_ui: true
+  disable_auth: true
+  disable_tls: true
+  pyroscope_url: "https://pyroscope.pyroscope.svc.cluster.local:4040"
+  # llm_params:
+  #   temperature_override: 0
+  # k8s_auth_token: optional_token_when_no_available_kube_config
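With this config in place (steps 1–5 in the comments), the service listens on port 8080 with TLS and auth disabled. A minimal smoke test in Python, assuming the `/v1/query` endpoint and JSON body shape used by lightspeed-service — verify the path against your checkout before relying on it:

```python
import requests

# Assumes lightspeed-service is running locally via `make run` and exposes
# POST /v1/query; dev_config above disables TLS and auth, so plain HTTP works.
resp = requests.post(
    "http://localhost:8080/v1/query",
    json={"query": "How do I create a job template in AAP?"},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```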

scripts/evaluation/utils/constants.py

Lines changed: 2 additions & 0 deletions
@@ -11,6 +11,8 @@
     "azure_openai+gpt-4o": ("azure_openai", "gpt-4o"),
     "ollama+llama3.1:latest": ("ollama", "llama3.1:latest"),
     "ollama+mistral": ("ollama", "mistral"),
+    "my_rhoai+granite3-8b": ("my_rhoai", "granite3-8b"),
+    "my_rhoai3+granite3-1-8b": ("my_rhoai3", "granite3-1-8b"),
 }

 NON_LLM_EVALS = {
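The two new entries extend the table that maps a combined `provider+model` ID — the value passed as `--eval_provider_model_id` in the README's sample command — to its `(provider, model)` tuple. A minimal sketch of how such a table is typically consumed; the dict and function names here are illustrative, not the repo's actual identifiers:

```python
# Illustrative stand-in for the mapping extended in constants.py.
PROVIDER_MODEL_MAP: dict[str, tuple[str, str]] = {
    "ollama+mistral": ("ollama", "mistral"),
    "my_rhoai+granite3-8b": ("my_rhoai", "granite3-8b"),
    "my_rhoai3+granite3-1-8b": ("my_rhoai3", "granite3-1-8b"),
}

def resolve(eval_provider_model_id: str) -> tuple[str, str]:
    """Split an ID like 'my_rhoai+granite3-8b' into (provider, model)."""
    try:
        return PROVIDER_MODEL_MAP[eval_provider_model_id]
    except KeyError:
        raise ValueError(f"unknown provider+model id: {eval_provider_model_id}") from None

assert resolve("my_rhoai+granite3-8b") == ("my_rhoai", "granite3-8b")
```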

scripts/evaluation/utils/relevancy_score.py

Lines changed: 1 addition & 1 deletion
@@ -42,7 +42,7 @@ def get_score(
             # raise
             sleep(time_to_breath)

-        if out:
+        if out and isinstance(out, dict):
             valid_flag = out["Valid"]
             gen_questions = out["Question"]
             score = 0
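The guard added here matters because the judge model's output is parsed from free-form text: if parsing yields anything other than a dict (a bare list, string, or `None`), the `out["Valid"]` lookup would raise. A small reproduction of the failure mode the check prevents, using `json.loads` and example payloads for illustration — the repo's actual parsing step may differ:

```python
import json

# The judge is expected to return an object like {"Valid": 1, "Question": [...]},
# but a model can emit any JSON value that still parses cleanly.
for raw in ('{"Valid": 1, "Question": ["q1"]}', '["q1", "q2"]', 'null'):
    out = json.loads(raw)
    if out and isinstance(out, dict):  # the check added in this commit
        print("usable:", out["Valid"], out["Question"])
    else:
        print("skipping non-dict judge output:", out)
```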
