README.md: 23 additions & 23 deletions
@@ -3,16 +3,16 @@
 LLM Comparator is an interactive visualization tool for analyzing side-by-side
 LLM evaluation results. It is designed to help people qualitatively analyze how
 responses from two models differ at example- and slice-levels. Users can
-interactively discover insights like "Model A's responses are better than B's on
-email rewriting tasks because Model A tends to generate bulleted lists more
-often."
+interactively discover insights like *"Model A's responses are better than B's
+on email rewriting tasks because Model A tends to generate bulleted lists more
+often."*




 ## Using LLM Comparator

-You can open LLM Comparator at https://pair-code.github.io/llm-comparator/.
+You can play with LLM Comparator at https://pair-code.github.io/llm-comparator/.

 You can either select one of the example files we provide, or you can upload
 your own JSON file (e.g.,
@@ -25,19 +25,19 @@ that follows our format which we describe below.
 We provide an example file for comparing
 the model responses between [Gemma](https://ai.google.dev/gemma) 1.1 and 1.0
 for prompts obtained from the
-[Chatbot Arena Conversations dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations). You can click the link below to play with it:
 A positive score represents that A's response is better than B's; a negative
 score indicates B is better; and zero meaning a tie.
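The sign convention in the context lines above can be sketched in Python. This is a minimal illustration, not code from the repository: the function name `describe_score` and the verdict strings are hypothetical.

```python
def describe_score(score: float) -> str:
    """Map a judge score to a verdict using the README's sign convention.

    Hypothetical helper: positive means A's response is better,
    negative means B's is better, and zero is a tie.
    """
    if score > 0:
        return "A is better"
    if score < 0:
        return "B is better"
    return "tie"


# The sample entry in the README uses a score of -1.25.
print(describe_score(-1.25))
```

Only the sign is interpreted here; how the magnitude of the score should be read is not stated in this excerpt.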
@@ -83,7 +83,7 @@ All the fields presented below are required.
 "examples": [
 {
 "input_text": "This is a prompt.",
-"tags": ["Coding"], # A list of keywords for categorizing prompts
+"tags": ["Math"], # A list of keywords for categorizing prompts
 "output_text_a": "Response to the prompt from the first model (A)",
 "output_text_b": "Response to the prompt from the other model (B)",
 "score": -1.25, # Score from the judge LLM
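As a rough sketch of the entry shape shown in this hunk, the Python snippet below builds one example and checks it for the five keys visible here. `VISIBLE_KEYS` and `missing_keys` are hypothetical names, and the check is deliberately partial: the hunk is truncated, so any other required fields are not covered.

```python
import json

# Assumption: only the keys visible in this hunk. The README says all
# presented fields are required, but the rest of the list is not shown here.
VISIBLE_KEYS = ("input_text", "tags", "output_text_a", "output_text_b", "score")


def missing_keys(example: dict) -> list:
    """Return the visible keys that are absent from one example entry."""
    return [key for key in VISIBLE_KEYS if key not in example]


example = {
    "input_text": "This is a prompt.",
    "tags": ["Math"],  # A list of keywords for categorizing prompts
    "output_text_a": "Response to the prompt from the first model (A)",
    "output_text_b": "Response to the prompt from the other model (B)",
    "score": -1.25,  # Score from the judge LLM
}

print(missing_keys(example))
# Serialize as real JSON; the "#" comments in the README snippet are
# annotations and would not be valid inside a JSON file.
print(json.dumps({"examples": [example]}, indent=2))
```

The output of `json.dumps` here is a fragment, not a complete input file, since the top-level fields outside `examples` do not appear in this excerpt.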
@@ -100,13 +100,13 @@ All the fields presented below are required.

 ### Additional Data

-Users can optionally provide additional information to be analyzed in LLM
+You can optionally provide additional information to be analyzed in LLM
 Comparator.

 #### Custom Fields

 If you have additional information about each prompt, it can be displayed as
-a column in the table and aggregated information is visualized as a chart
+columns in the table and aggregated information is visualized as charts
 on the right side of the interface. It supports various data types, such as:

 -`number`: Numeric data, visualized as histograms (e.g., word count for prompt,
@@ -231,18 +231,18 @@ npm run serve

 ## Citing LLM Comparator

-If you use LLM Comparator as part of your work, please cite our paper at
-https://arxiv.org/abs/2402.10524.
+If you use LLM Comparator as part of your work, please cite our research paper
+at https://arxiv.org/abs/2402.10524.

 ```
 @inproceedings{kahng2024comparator,
-title={{LLM Comparator}: Visual Analytics for Side-by-Side Evaluation of
-Large Language Models},
+title={{LLM Comparator}: Visual Analytics for Side-by-Side Evaluation of Large Language Models},
 author={Kahng, Minsuk and Tenney, Ian and Pushkarna, Mahima and Liu, Michael Xieyang and Wexler, James and Reif, Emily and Kallarackal, Krystal and Chang, Minsuk and Terry, Michael and Dixon, Lucas},
-booktitle={Extended Abstracts of the CHI Conference on Human Factors in
-Computing Systems},
+booktitle={Extended Abstracts of the CHI Conference on Human Factors in Computing Systems},