Commit f7a2c33

andytael, markxnelson, ldemarchis authored

Build a Better ChatBot LiveLab (#540)

* init new workshop Signed-off-by: Mark Nelson <[email protected]>
* 1st edit
* renamed .md
* Update manifest.json
* ToC
* refactoring
* refactoring
* set OCI credentials added
* Lab 1 Update
* Lab 1 update
* Lab 2 updated
* Lab 5 updated
* Lab 5 updated
* Lab 5 updated
* Lab 5 updated
* Update testbed.png
* Update evaluating.md
* Objectives updated
* Intros updated
* Lab 6 updated
* Title updated
* lab 3 updated
* Update RAG.md
* Update RAG.md
* changed images directory
* Add .gitkeep to track images folder
* Add files via upload
* updated introduction
* Update get-started.md
* Update get-started.md
* updated get-started
* updated introduction and get started
* general update
* updated api-server
* Update server.md
* Update server.md
* corrections
* updated explore.md
* Update explore.md
* Update explore.md
* Update explore.md
* updated evaluating.md
* generic updateds
* corrections
* corrected typos
* Update introduction.md
* references update
* update acknowledgments
* updated source document
* Add files via upload
* source document related content changed
* Add files via upload
* updated RAG.md
* Update server.md
* Update experimenting.md
* help.md created
* minor fixes
* Add files via upload
* general review
* estimated times for lab1&lab2
* estimated times added to lab3-6
* General updates
* alias podman=docker added
* Update get-started.md
* volume version+OCI credentials as optional
* Update introduction.md
* Update get-started.md
* Update get-started.md
* Update get-started.md
* copy button added
* note indentation added
* Update explore.md
* Create getting_started-30_testset.json
* updates
* updates to 1.1
* re-directoring
* clean-up
* test
* refactoring
* Update manifest.json
* Update manifest.json
* updated explore.md
* updated explore.md
* updated prepare.md
* updated rag.md
* updated explore.md
* Update explore.md
* Update explore.md
* prepare.md adapted for Sandbox version
* updated introduction.md and server.md for sanbox version
* Rename sandbox to tenancy
* spelling and linting
* Adding deploy section
* Deploy using IaC
* cleanup and fixes
* linting, spelling
* minor changes
* typo fix
* linting
* numbers
* desktop linting, spelling
* lint
* manifest update
* updates
* manifest updates

---------

Signed-off-by: Mark Nelson <[email protected]>
Co-authored-by: Mark Nelson <[email protected]>
Co-authored-by: Lorenzo De Marchis <[email protected]>
1 parent 9184bb4 commit f7a2c33

File tree: 141 files changed, +3491 −0 lines changed

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
# Evaluating Performance

## Introduction

We are confident that adjusting certain parameters can improve the quality and accuracy of the chatbot's responses. However, can we be sure that a specific configuration remains reliable when scaled to hundreds or even thousands of different questions?

In this lab, you will explore the *Testbed* feature. The Testbed allows you to evaluate your chatbot at scale by generating a Q&A test dataset and automatically running it against your current configuration.

**Note**: The example shown in this lab relies on gpt-4o-mini. Feel free to use a local LLM (e.g., llama3.1) if you prefer not to, or cannot, use OpenAI models.

Estimated Time: 15 minutes
### Objectives

In this lab, you will:

* Explore the *Testbed* tab
* Generate a Q&A Test dataset
* Perform an evaluation on the Q&A Testset

### Prerequisites

* All previous labs successfully completed
## Task 1: Navigate to the Testbed tab

Access the *Testbed* from the left-hand menu:

![testbed](./images/testbed.png)

As a first step, you can either upload an existing Q&A test set (from a local file or from a saved collection in the database) or generate a new one from a local PDF file.
## Task 2: Generate a Q&A Test dataset

The AI Optimizer allows you to generate as many questions and answers as you need, based on a single document from your knowledge base. To enable test dataset generation, select the corresponding radio button:

![generate](./images/generatenew.png)

1. Upload a document

    Upload the same document that was used to create the vector store. You can download it from [this link](https://docs.oracle.com/en/database/oracle/oracle-database/23/vecse/ai-vector-search-users-guide.pdf).

2. Increase the number of questions to be generated to 10 or more

    Keep in mind that the process can take a significant amount of time, especially if you are using a local LLM without sufficient hardware resources. If you use a remote OpenAI model instead, the generation time is far less sensitive to the number of Q&A pairs to create.

3. Leave the default options for:

    * Q&A Language Model: **gpt-4o-mini**
    * Q&A Embedding Model: **text-embedding-3-small**

4. Click the **Generate Q&A** button and wait until the process finishes:

    ![patience](./images/patience.png)

5. Browse the generated questions and answers:

    ![qa-browse](./images/qa-browse.png)

    Note that the **Question** and **Answer** fields are editable, allowing you to modify the proposed Q&A pairs based on the **Context** (which is randomly extracted and not editable) and the **Metadata** generated by the Testbed engine.

    In the *Metadata* field you'll find a **topic** tag that classifies each Q&A pair. The topic list is generated automatically by analyzing the document content and is assigned to each Q&A pair. It is used in the final report to break down the **Overall Correctness Score** and highlight areas where the chatbot lacks precision.

    You can also export the generated Q&A dataset using the **Download** button. This allows you to review and edit it, for example in Visual Studio Code or with a short script like the one after this list.

    ![qa-json](./images/qa-json.png)
6. Update the **Test Set Name**

    Replace the automatically generated default name to make it easier to identify the test dataset later, especially when running repeated tests with different chatbot configurations. For example, change it from:

    ![default-test-set](./images/default-test-set.png)

    to something more descriptive, like:

    ![test-rename](./images/test-rename.png)
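If you exported the dataset with the **Download** button, you can also skim it outside the UI. The sketch below is only illustrative: the file name `my_testset.json` is a placeholder, and the field names (`question`, `metadata`, `topic`) are assumptions based on what the Testbed UI displays, so adjust them to match your actual export.

```python
# Sketch: skim a Q&A test set exported from the Testbed.
# File name and field names are assumptions; check your export.
import json

with open("my_testset.json", encoding="utf-8") as f:
    testset = json.load(f)  # assumed: a top-level list of Q&A records

for i, qa in enumerate(testset):
    meta = qa.get("metadata") or {}
    topic = meta.get("topic", "n/a") if isinstance(meta, dict) else "n/a"
    print(f"{i:02d} [{topic}] {qa.get('question', '')[:80]}")
```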
## Task 3: Evaluate the Q&A Testset

Now you are ready to perform an evaluation on the Q&A pairs you generated in the previous step.

1. In the left-hand menu:

    * Under **Language Model Parameters**, select **gpt-4o-mini** from the **Chat model** dropdown list.
    * Ensure **Enable RAG?** is selected (if it wasn't already).
    * In the **Select Alias** dropdown list, choose the **TEST2** value.
    * Leave all other parameters unchanged.

2. With **gpt-4o-mini** selected as the evaluation model, click the **Start Evaluation** button and wait a few seconds. All questions from your dataset will be submitted to the chatbot using the configuration defined in the left pane:

    ![start-eval](./images/start-eval.png)
3. Let's examine the result report, starting with the first section:

    ![result](./images/result-topic.png)

    This section displays:

    * The chatbot's **Evaluation Settings**, as configured in the left-hand pane before launching the bulk test.
    * The **RAG Settings**, including the database and vector store used, the name of the embedding **model** used, and all associated parameters (e.g., **chunk size**, **top-k**).
    * The **Overall Correctness Score**, representing the percentage of questions for which the LLM judged the chatbot's response correct compared to the reference answer (for example, 27 correct answers out of 30 questions yield a score of 90%).
    * The **Correctness By Topic**, which breaks down the results based on the automatically generated topics assigned to each Q&A pair in the dataset.
    The second section of the report contains details on each question submitted, with a focus on the **Failures** collection and the **Full Report** list. To view all fields, scroll horizontally. In the image below, the second frame has been scrolled to the right:

    ![result](./images/result-question.png)

    The main fields displayed are:

    * **question**: the submitted question
    * **reference_answer**: the expected answer used as a benchmark
    * **reference_context**: the source document section used to generate the Q&A pair
    * **agent_answer**: the response provided by the chatbot based on the current configuration and vector store
    * **correctness_reason**: an explanation (if any) of why the response was considered incorrect. If correct, this field displays **None**.
You can download the results in different formats:

* Click the **Download Report** button to generate an HTML summary of the *Overall Correctness Score* and *Correctness by Topic*.
* To export the **Full Report** and the **Failures** list, download them as .csv files using the download icons shown in the interface:

![csv](./images/download-csv.png)
## Task 4 (optional): Try a different Q&A Testset

Now let's perform a test using an external saved test dataset, which you can download [here](https://raw.githubusercontent.com/markxnelson/developer/refs/heads/main/ai-optimizer/getting_started-30_testset.json). This file contains 30 pre-generated questions.

If you wish to remove any Q&A pairs that you consider irrelevant or unhelpful, you can edit the file (by hand, or with a short script like the one below), save it, and then reload it as a local file, following the steps shown in the screenshot below:

![load-tests](./images/load-tests.png)
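For example, a few lines of Python can drop the entries you don't want and save a curated copy to reload. This sketch assumes the file parses as a top-level JSON list of Q&A records, which you should confirm by opening the file first; the indexes to drop are, of course, your own choice.

```python
# Sketch: remove selected Q&A pairs from the downloaded test set and
# save a curated copy to reload as a local file. Assumes the file is
# a top-level JSON list; inspect it first and adapt if it is not.
import json

with open("getting_started-30_testset.json", encoding="utf-8") as f:
    testset = json.load(f)

drop = {3, 17}  # example: positions of the pairs you judged unhelpful
kept = [qa for i, qa in enumerate(testset) if i not in drop]

with open("curated_testset.json", "w", encoding="utf-8") as f:
    json.dump(kept, f, indent=2, ensure_ascii=False)

print(f"kept {len(kept)} of {len(testset)} Q&A pairs")
```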
Next, let's update the Chat Model parameters by setting the **Temperature** to **0** in the left-hand pane, under the **Language Model Parameters** section.

Why? Q&A datasets are typically generated with a low level of creativity to minimize randomness and focus on expressing core concepts clearly, avoiding unnecessary "frills" in the answers.
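The Testbed applies this setting for you through the UI. Purely as an illustration of what the knob means, here is how the same parameter looks on a direct OpenAI chat call (a sketch assuming the `openai` Python package and an `OPENAI_API_KEY` in your environment):

```python
# Illustration only: temperature 0 asks the model for its most likely
# tokens, making answers nearly deterministic across repeated runs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,  # minimal randomness, the same knob the Testbed exposes
    messages=[{"role": "user", "content": "What is AI Vector Search?"}],
)
print(resp.choices[0].message.content)
```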
Now, repeat the test to see whether the **Overall Correctness Score** improves.

* To compare with previous results, open the dropdown under **Previous Evaluations for...** and click the **View** button to display the associated report.

![previous](./images/previous.png)

* You can repeat the tests as many times as needed, changing the **Vector Store**, **Search Type**, and **Top K** parameters to apply the same tuning strategies you've used previously with individual questions, now extended to a full test using curated and reproducible data.
## Acknowledgements

* **Author** - Lorenzo De Marchis, Developer Evangelist, May 2025
* **Contributors** - Mark Nelson, John Lathouwers, Corrado De Bari, Jorge Ortiz Fuentes, Andy Tael
* **Last Updated By** - Lorenzo De Marchis, May 2025
(The commit also adds nine binary image files, 11 KB, 36.2 KB, 126 KB, 44.1 KB, 86.7 KB, 10.4 KB, 99.7 KB, 84.3 KB, and 152 KB in size; their content is not shown.)
