
Commit e80be93

Update code, data, and documentation for launch
1 parent 817f8a0 commit e80be93

25 files changed with 314 additions and 229 deletions

README.md

Lines changed: 23 additions & 23 deletions
@@ -3,16 +3,16 @@
 LLM Comparator is an interactive visualization tool for analyzing side-by-side
 LLM evaluation results. It is designed to help people qualitatively analyze how
 responses from two models differ at example- and slice-levels. Users can
-interactively discover insights like "Model A's responses are better than B's on
-email rewriting tasks because Model A tends to generate bulleted lists more
-often."
+interactively discover insights like *"Model A's responses are better than B's
+on email rewriting tasks because Model A tends to generate bulleted lists more
+often."*
 
 ![Screenshot of LLM Comparator interface](documentation/images/llm_comparator_screenshot.png)
 
 
 ## Using LLM Comparator
 
-You can open LLM Comparator at https://pair-code.github.io/llm-comparator/.
+You can play with LLM Comparator at https://pair-code.github.io/llm-comparator/.
 
 You can either select one of the example files we provide, or you can upload
 your own JSON file (e.g.,
@@ -25,19 +25,19 @@ that follows our format which we describe below.
 We provide an example file for comparing
 the model responses between [Gemma](https://ai.google.dev/gemma) 1.1 and 1.0
 for prompts obtained from the
-[Chatbot Arena Conversations dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations). You can click the link below to play with it:
+[Chatbot Arena Conversations dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations).
+You can click the link below to play with it:
 https://pair-code.github.io/llm-comparator/?results_path=https://pair-code.github.io/llm-comparator/data/example_arena.json
 
 The tool helps you analyze *when* and *why* Gemma 1.1 is better or worse than
-1.0 and *how* responses from two models qualitatively differ.
+1.0 and *how* responses from two models differ.
 
-- ***When***: The **Score Distribution** panel shows that the quality of
-responses from Model A (Gemma 1.1) is considered better than that from Model B
-(Gemma 1.0) (larger blue area than orange),
-according to the LLM-based evaluation method
+- ***When***: The **Score Distribution** and **Metrics by Prompt Category**
+panels show that the quality of responses from Model A (Gemma 1.1) is considered
+better than that from Model B (Gemma 1.0) (larger blue area than orange;
+>50% win rate), according to the LLM-based evaluation method
 ([LLM-as-a-judge](https://arxiv.org/abs/2306.05685)).
-This holds true for most prompt categories
-(as in **Metrics by Prompt Category** panel).
+This holds true for most prompt categories (e.g., Humanities, Math).
 - ***Why***: The **Rationale Summary** panel dives into the reasons behind these
 score differences.
 In this case, the LLM judge focused mostly on the amount of details. It also
@@ -60,8 +60,8 @@ must follow the schema described below.
 
 We assume that a user has a set of input prompts to test. For each prompt, they
 need to prepare the responses to the prompt from two LLMs (i.e., Model A, Model
-B), and a numerical score obtained from automatic side-by-side evaluation (also
-known as [LLM-as-a-judge](https://arxiv.org/abs/2306.05685) or
+B), and a numerical score obtained from side-by-side evaluation (e.g.,
+[LLM-as-a-judge](https://arxiv.org/abs/2306.05685),
 [AutoSxS](https://cloud.google.com/vertex-ai/generative-ai/docs/models/side-by-side-eval)).
 A positive score represents that A's response is better than B's; a negative
 score indicates B is better; and zero meaning a tie.
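
The sign convention described in this README hunk can be sketched in a few lines of TypeScript. This is only an illustration of the stated rule (positive means A is better, negative means B is better, zero is a tie); the names `Winner`, `winnerFromScore`, and `tallyWinners` are hypothetical and are not taken from the LLM Comparator codebase, and the sample scores are made up.

```typescript
// Hypothetical helper names; only the sign convention comes from the README.
type Winner = 'A' | 'B' | 'tie';

// Positive score: A's response is better. Negative: B's is better. Zero: tie.
function winnerFromScore(score: number): Winner {
  if (score > 0) return 'A';
  if (score < 0) return 'B';
  return 'tie';
}

// Counts how many examples each side wins, given one score per example.
function tallyWinners(scores: number[]): Record<Winner, number> {
  const counts: Record<Winner, number> = {A: 0, B: 0, tie: 0};
  for (const score of scores) {
    counts[winnerFromScore(score)] += 1;
  }
  return counts;
}

// Made-up scores for illustration.
const tally = tallyWinners([-1.25, 0.5, 0, 1.5]);
console.log(tally);
```

The tally deliberately stops short of a win rate, because the README does not say how the tool treats ties when computing it.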
@@ -83,7 +83,7 @@ All the fields presented below are required.
 "examples": [
 {
 "input_text": "This is a prompt.",
-"tags": ["Coding"], # A list of keywords for categorizing prompts
+"tags": ["Math"], # A list of keywords for categorizing prompts
 "output_text_a": "Response to the prompt from the first model (A)",
 "output_text_b": "Response to the prompt from the other model (B)",
 "score": -1.25, # Score from the judge LLM
@@ -100,13 +100,13 @@ All the fields presented below are required.
 
 ### Additional Data
 
-Users can optionally provide additional information to be analyzed in LLM
+You can optionally provide additional information to be analyzed in LLM
 Comparator.
 
 #### Custom Fields
 
 If you have additional information about each prompt, it can be displayed as
-a column in the table and aggregated information is visualized as a chart
+columns in the table and aggregated information is visualized as charts
 on the right side of the interface. It supports various data types, such as:
 
 - `number`: Numeric data, visualized as histograms (e.g., word count for prompt,
@@ -231,18 +231,18 @@ npm run serve
 
 ## Citing LLM Comparator
 
-If you use LLM Comparator as part of your work, please cite our paper at
-https://arxiv.org/abs/2402.10524.
+If you use LLM Comparator as part of your work, please cite our research paper
+at https://arxiv.org/abs/2402.10524.
 
 ```
 @inproceedings{kahng2024comparator,
-title={{LLM Comparator}: Visual Analytics for Side-by-Side Evaluation of
-Large Language Models},
+title={{LLM Comparator}: Visual Analytics for Side-by-Side Evaluation of Large Language Models},
 author={Kahng, Minsuk and Tenney, Ian and Pushkarna, Mahima and Liu, Michael Xieyang and Wexler, James and Reif, Emily and Kallarackal, Krystal and Chang, Minsuk and Terry, Michael and Dixon, Lucas},
-booktitle={Extended Abstracts of the CHI Conference on Human Factors in
-Computing Systems},
+booktitle={Extended Abstracts of the CHI Conference on Human Factors in Computing Systems},
 year={2024},
 publisher={ACM},
+doi={10.1145/3613905.3650755},
+url={https://arxiv.org/abs/2402.10524}
 }
 ```
 
client/app.ts

Lines changed: 2 additions & 3 deletions
@@ -15,7 +15,6 @@
 * limitations under the License.
 */
 
-// tslint:disable:g3-no-void-expression
 // tslint:disable:no-new-decorators
 import './components/charts';
 import './components/custom_functions';
@@ -89,14 +88,14 @@ export class LlmComparatorAppElement extends MobxLitElement {
 </div>
 <div class="link-icon">
 <a href=${feedbackLink} target="_blank">
-<mwc-icon class="icon" title="Open Form">
+<mwc-icon class="icon" title="Send Feedback">
 feedback
 </mwc-icon>
 </a>
 </div>
 <div class="link-icon">
 <a href=${documentationLink} target="_blank">
-<mwc-icon class="icon" title="Open project page">
+<mwc-icon class="icon" title="Open Documentation Page">
 help_outline
 </mwc-icon>
 </a>

client/components/bar_chart.ts

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ export interface AggregatedEntry {
 
 /**
 * Component for bar charts. Currently for rating scores by individual raters.
-* TODO(b/311744307): Extract common parts in the histogram.
+* TODO: Extract common parts in the histogram.
 */
 @customElement('comparator-bar-chart')
 export class BarChartElement extends MobxLitElement {

client/components/charts.ts

Lines changed: 1 addition & 1 deletion
@@ -373,7 +373,7 @@ export class ChartsElement extends MobxLitElement {
 const renderChartsForCustomFields: Array<[string, any]> =
 this.appState
 .columns
-// TODO(b/315388387): Will not need when custom functions are
+// TODO: Will not need when custom functions are
 // merged.
 .filter((field: Field) => field.id.startsWith('custom_field:'))
 .filter(

client/components/custom_functions.ts

Lines changed: 2 additions & 2 deletions
@@ -245,7 +245,7 @@ export class CustomFunctionsElement extends MobxLitElement {
 </comparator-binary-stacked-bar-chart>`;
 }
 
-// TODO(b/326139568): Merge into the side-by-side histogram code in charts.ts.
+// TODO: Merge into the side-by-side histogram code in charts.ts.
 private renderChartForNumberType(customFunc: CustomFunction) {
 const getHistogramSpec = () =>
 this.appState.histogramSpecForCustomFuncs[customFunc.id];
@@ -423,7 +423,7 @@ export class CustomFunctionsElement extends MobxLitElement {
 'disabled': customFunc.precomputed === true,
 });
 
-// TODO(b/323336525): Improve the design for displaying custom func rows.
+// TODO: Improve the design for displaying custom func rows.
 // prettier-ignore
 return html`
 <tr class=${customFuncRowStyle(customFunc.id)}>

client/components/dataset_selection.css

Lines changed: 2 additions & 2 deletions
@@ -38,8 +38,8 @@
 }
 
 .panel-instruction {
-color: #555;
-line-height: 16px;
+color: var(--comparator-grey-800);
+line-height: 18px;
 margin: 5px 0;
 padding: 2px 0;
 }

client/components/dataset_selection.ts

Lines changed: 12 additions & 6 deletions
@@ -29,7 +29,7 @@ import {AppState} from '../services/state_service';
 import {styles} from './dataset_selection.css';
 
 /**
-* Dataset Selection component.
+* Component for selecting data files.
 */
 @customElement('comparator-dataset-selection')
 export class DatasetSelectionElement extends MobxLitElement {
@@ -53,11 +53,17 @@ export class DatasetSelectionElement extends MobxLitElement {
 
 return html`
 <div>
-The json file must contain these three properties: "metadata", "models",
-and "examples".
+The json file must contain these three properties:
+<span class="filepath">metadata</span>,
+<span class="filepath">models</span>,
+and <span class="filepath">examples</span>.
 <br />
-Each example must have "input_text", "tags", "output_text_a",
-"output_text_b", and "score".
+Each example in <span class="filepath">examples</span> must have
+<span class="filepath">input_text</span>,
+<span class="filepath">tags</span>,
+<span class="filepath">output_text_a</span>,
+<span class="filepath">output_text_b</span>,
+and <span class="filepath">score</span>.
 <br />
 Please refer to our document for details:
 <a href="${documentationLink}" target="_blank">${documentationLink}</a>
@@ -94,7 +100,7 @@ export class DatasetSelectionElement extends MobxLitElement {
 
 const textareaPlaceholder = 'Enter a URL to load the json file from.';
 const urlLoadPath =
-this.appState.appLink + '?results_path=https://.../results.json';
+this.appState.appLink + '?results_path=https://.../...json';
 const panelIntro = html`
 Enter the URL path of a json file prepared for LLM Comparator.`;
 const panelOutro = html`
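
The help text changed in this file lists the properties an input file must contain. A presence check for those properties might look like the sketch below. The property names are taken from that help text; the function name `findMissingProperties`, the `isRecord` helper, and the format of the returned messages are assumptions made for illustration and are not part of the commit. The sketch only checks that each key exists, not that its value has the right type.

```typescript
// Property names come from the help text in dataset_selection.ts.
// Everything else (function names, message format) is hypothetical.
const REQUIRED_TOP_LEVEL = ['metadata', 'models', 'examples'] as const;
const REQUIRED_PER_EXAMPLE = [
  'input_text', 'tags', 'output_text_a', 'output_text_b', 'score',
] as const;

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Returns the paths of required properties that are absent from `data`.
function findMissingProperties(data: unknown): string[] {
  if (!isRecord(data)) return ['(root is not a JSON object)'];
  const problems: string[] = [];
  for (const key of REQUIRED_TOP_LEVEL) {
    if (!(key in data)) problems.push(key);
  }
  const examples = data['examples'];
  if (Array.isArray(examples)) {
    examples.forEach((example: unknown, index: number) => {
      for (const key of REQUIRED_PER_EXAMPLE) {
        if (!isRecord(example) || !(key in example)) {
          problems.push(`examples[${index}].${key}`);
        }
      }
    });
  }
  return problems;
}

// Placeholder values; the real shape of `metadata` and `models` is not shown here.
const missing = findMissingProperties({
  metadata: {},
  models: [],
  examples: [{input_text: 'This is a prompt.', tags: ['Math'], score: -1.25}],
});
console.log(missing);
// → [ 'examples[0].output_text_a', 'examples[0].output_text_b' ]
```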

client/components/example_details.ts

Lines changed: 10 additions & 11 deletions
@@ -153,7 +153,7 @@ export class ExampleDetailsElement extends MobxLitElement {
 </comparator-histogram>`;
 }
 
-// TODO(b/311725252): Create a separate data-table component.
+// TODO: Create a separate data-table component.
 private renderRaterTable() {
 const selectedExample = this.selectedExample;
 if (selectedExample == null) {
@@ -237,18 +237,17 @@ export class ExampleDetailsElement extends MobxLitElement {
 <th class="score" rowspan="2">Score ${renderSortIcons()}</th>
 <th class="label" rowspan="2">Rating</th>
 <th class="flipped" rowspan="2">Flipped?</th>
-<th class="rationale" rowspan="2">
-Rationale
-<small>(Careful for flipped cases!)</small>
-</th>
-${this.appState.customFieldsOfPerRatingType.map((field: Field) =>
-renderCustomFieldHeaderCell(field),
-)}
+<th class="rationale" rowspan="2">Rationale</th>
+${
+this.appState.customFieldsOfPerRatingType.map(
+(field: Field) => renderCustomFieldHeaderCell(field),
+)}
 </tr>
 <tr class="second-row">
-${this.appState.customFieldsOfPerRatingType.map((field: Field) =>
-renderCustomFieldHeaderCellSecondRow(field),
-)}
+${
+this.appState.customFieldsOfPerRatingType.map(
+(field: Field) => renderCustomFieldHeaderCellSecondRow(field),
+)}
 </tr>`;
 
 // Table body.

client/components/example_table.css

Lines changed: 12 additions & 1 deletion
@@ -217,6 +217,11 @@ td.score.b-win {
 text-decoration: underline;
 }
 
+.selected .rater-info-link {
+color: var(--comparator-grey-800);
+font-weight: 600;
+}
+
 td.score:hover .rater-info-link {
 color: var(--comparator-grey-800);
 }
@@ -257,7 +262,8 @@ ul.rationale-list li.cluster-selected::before {
 
 .text-holder,
 .list-holder,
-.sequence-chunks-holder {
+.sequence-chunks-holder,
+.score-holder {
 height: 119px; /* Set default as 17px x 7 rows */
 overflow-x: hidden;
 overflow-y: scroll;
@@ -273,6 +279,11 @@ ul.rationale-list li.cluster-selected::before {
 overflow-wrap: anywhere;
 }
 
+.score-holder {
+overflow-y: hidden;
+padding-top: 0;
+}
+
 tr.monospace .text-holder {
 font-family: monospace;
 }

client/components/example_table.ts

Lines changed: 19 additions & 12 deletions
@@ -91,12 +91,16 @@ export class ExampleTableElement extends MobxLitElement {
 
 private styleHolder(example: Example) {
 return styleMap({
-'height':
-this.appState.selectedExample !== example
-? `${
-this.appState.numberOfLinesPerOutputCell * LINE_HEIGHT_IN_CELL
-}px`
-: 'auto',
+'height': this.appState.getIsExampleExpanded(example.index) !== true ?
+`${
+this.appState.numberOfLinesPerOutputCell *
+LINE_HEIGHT_IN_CELL}px` :
+'auto',
+'min-height': this.appState.getIsExampleExpanded(example.index) === true ?
+`${
+this.appState.numberOfLinesPerOutputCell *
+LINE_HEIGHT_IN_CELL}px` :
+null,
 });
 }
 
@@ -233,14 +237,17 @@ export class ExampleTableElement extends MobxLitElement {
 
 private renderRow(example: Example, rowIndex: number) {
 const handleDoubleClickRow = () => {
-this.appState.selectedExample =
-this.appState.selectedExample === example ? null : example;
+this.appState.isExampleExpanded[example.index] =
+this.appState.getIsExampleExpanded(example.index) === true ? false :
+true;
 };
 const styleRow = classMap({
 'selected': this.appState.selectedExample === example,
 'monospace': this.appState.useMonospace === true,
 });
 
+const styleHolder = this.styleHolder(example);
+
 // Use text diff only when both are texts.
 const textDiff =
 typeof example.output_text_a === 'string' &&
@@ -376,10 +383,12 @@ export class ExampleTableElement extends MobxLitElement {
 </div>
 ${renderHistogram}` :
 '';
-const renderScore = example.score == null ? 'null' : html`
+const renderScore = example.score == null ? 'Null' : html`
+<div class="score-holder" style=${styleHolder}>
 <div class="score-number">${example.score.toFixed(2)}</div>
 ${scoreDescription}
-${raterInfoLink}`;
+${raterInfoLink}
+</div>`;
 
 const styleScore = classMap({
 'score': true,
@@ -467,8 +476,6 @@ export class ExampleTableElement extends MobxLitElement {
 ) :
 html``;
 
-const styleHolder = this.styleHolder(example);
-
 // Custom fields.
 const renderCustomField = (field: Field, columnIndex: number) => {
 if (field.type === FieldType.PER_RATING_PER_MODEL_CATEGORY) {
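
The changes to this file move row expansion from the single `selectedExample` to a per-example flag, so double-clicking toggles one row without affecting the others. The sketch below isolates that idea outside of Lit and MobX. It is a simplified stand-in, not the app's state service: the class name `RowExpansionState`, the `toggle` and `holderStyle` methods, and the body of `getIsExampleExpanded` are hypothetical, and the values 17 and 7 are assumed from the CSS comment "17px x 7 rows" rather than read from the TypeScript source.

```typescript
// Assumed from the CSS comment "Set default as 17px x 7 rows".
const LINE_HEIGHT_IN_CELL = 17;

// Hypothetical stand-in for the relevant part of the app state.
class RowExpansionState {
  numberOfLinesPerOutputCell = 7;

  // Keyed by example index; a missing key means the row is collapsed.
  private readonly isExampleExpanded: {[index: number]: boolean} = {};

  getIsExampleExpanded(index: number): boolean {
    return this.isExampleExpanded[index] === true;
  }

  // Mirrors the double-click handler: flip the flag for this row only.
  toggle(index: number): void {
    this.isExampleExpanded[index] = !this.getIsExampleExpanded(index);
  }

  // Collapsed rows get a fixed height; expanded rows grow with their
  // content but never shrink below the collapsed height.
  holderStyle(index: number): {height: string; minHeight: string | null} {
    const collapsedHeight =
        `${this.numberOfLinesPerOutputCell * LINE_HEIGHT_IN_CELL}px`;
    return this.getIsExampleExpanded(index) ?
        {height: 'auto', minHeight: collapsedHeight} :
        {height: collapsedHeight, minHeight: null};
  }
}

const rows = new RowExpansionState();
rows.toggle(3);
console.log(rows.holderStyle(3));  // expanded row
console.log(rows.holderStyle(4));  // still collapsed
```

Because the flags are stored per index, several rows can be expanded at once, which the earlier `selectedExample`-based check could not express.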
