Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Wei-Lin Chiang*¹, Lianmin Zheng*¹, Ying Sheng², Anastasios N. Angelopoulos¹, Tianle Li¹, Dacheng Li¹, Banghua Zhu¹, Hao Zhang³, Michael I. Jordan¹, Joseph E. Gonzalez¹, Ion Stoica¹
Abstract

Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating their alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using for efficient and accurate evaluation and ranking of models. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowdsourced human votes are in good agreement with those of expert raters. These analyses collectively establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies. Our demo is publicly available at https://chat.lmsys.org.
1. Introduction
Recent advancements in large language models (LLMs) have significantly expanded their capabilities beyond traditional natural language processing boundaries, addressing a broad array of general tasks (OpenAI, 2023; Gemini et al., 2023; Touvron et al., 2023). These developments underscore the potential of LLMs but have also raised concerns with respect to performance evaluation. Current benchmarks often fail to capture the nuanced and diverse aspects of these models, particularly in assessing their alignment with human preferences in real-world, open-ended tasks.

*Equal contribution. ¹UC Berkeley, ²Stanford, ³UCSD. Correspondence to: Wei-Lin Chiang <weichiang@berkeley.edu>.

Figure 1. Classification of LLM benchmarks: we categorize along two dimensions: whether the questions come from a static dataset or a live, fresh source, and whether the evaluation metric relies on ground truth or (approximated) human preferences. MMLU (Hendrycks et al., 2020), HellaSwag (Zellers et al., 2019), GSM-8K (Cobbe et al., 2021), MT-Bench (Zheng et al., 2023b), and AlpacaEval (Li et al., 2023) are common examples of static benchmarks. Chatbot Arena is the platform introduced in this paper.
To assess the performance of LLMs, the research community has introduced a variety of benchmarks. These benchmarks can be categorized based on two factors: the source of questions (either static or live) and the evaluation metric (either ground truth or human preference). According to these factors, benchmarks can be classified into four categories, as shown in Figure 1. While a range of benchmarks is beneficial, the most prevalent current method for evaluating LLMs remains static, ground-truth-based evaluation, partly because such evaluations are inexpensive and reproducible.

However, these static, ground-truth-based benchmarks exhibit several limitations. First, the questions within these benchmarks are not open-ended, hindering the ability to capture the flexible and interactive use found in real-world settings (Zheng et al., 2023b). Second, the test sets in these benchmarks are static, meaning they can become contaminated over time, which undermines the reliability of the evaluation results (Yang et al., 2023). Furthermore, for many complex tasks, establishing a definitive ground truth is not only challenging but sometimes unattainable. Consequently, current benchmarks fail to adequately address the needs of state-of-the-art LLMs, particularly in evaluating user preferences. Thus, there is an urgent need for an open, live evaluation platform based on human preference that can more accurately mirror real-world usage.

Creating such a benchmark platform entails significant challenges. It requires the collection of live, fresh, and diverse user questions to accurately represent real-world scenarios.
arXiv:2403.04132v1 [cs.AI] 7 Mar 2024
Additionally, developing scalable, incremental, and efficient ranking systems is essential for evaluating a large number of models. Moreover, ensuring the quality of human evaluations is crucial given the noisy nature of human preferences.

To this end, we introduce Chatbot Arena, a benchmarking platform for LLMs that features anonymous, randomized battles in a crowdsourced setting. Chatbot Arena is a free website open to all users.¹ On this website, a user can ask a question and get answers from two anonymous LLMs. Afterward, the user casts a vote for the model that delivers the preferred response, with the models' identities revealed only after voting. This crowdsourced method effectively gathers a diverse array of fresh user prompts, accurately reflecting real-world LLM applications. Armed with this data, we employ a suite of powerful statistical techniques, ranging from the statistical model of Bradley & Terry (1952) to the E-values of Vovk & Wang (2021), to estimate the ranking over models as reliably and sample-efficiently as possible. With these tools in hand, we have designed efficient sampling algorithms specifically to select model pairs in a way that accelerates the convergence of rankings while retaining statistical validity.
We conduct a thorough analysis of the collected data to ensure the credibility of our platform. We demonstrate that the user-generated questions are sufficiently diverse to encompass a wide range of LLM use cases and sufficiently challenging to differentiate between models. Furthermore, we confirm that the crowd-sourced votes are highly consistent with expert evaluations.
We have been running our system since April 2023 and have received over 240K votes from about 90K users in over 100 different languages as of January 2024. To encourage user engagement, we have made over 50 state-of-the-art models available for free. We also collaborate with leading model developers such as OpenAI, Google, Anthropic, Mistral, Hugging Face, and various universities, incorporating their latest models into our platform. We keep the community engaged by routinely updating the leaderboard, publishing analytical blogs, releasing datasets, and sharing information via tweets. Because of its unique and significant value, our leaderboard has emerged as one of the most referenced in the LLM field and has become an industry benchmark. We commit to making our data and code available, ensuring that this platform is open-source and openly accessible.
We make the following contributions:

• We build the first large-scale crowdsourced live LLM evaluation platform, with over 1M user visits.²
• We conduct an in-depth analysis of the collected data, including prompt diversity, prompt quality, vote quality, and insights on human feedback.
• We will publicly release a human preference dataset with over 100K pairwise votes collected from Chatbot Arena.
• We design an efficient sampling algorithm that actively chooses which model pairs to show, improving sample efficiency, sometimes substantially.

¹https://chat.lmsys.org
²Estimated by Google Analytics as of March 2024. Note that user visits may not convert to votes, as our website also offers a “direct chat” mode.
2. Related Work
LLM Benchmarks. We briefly review common LLM benchmarks, following the classification presented in Figure 1. The most prevalent benchmarks are static, ground-truth-based ones, typically in the form of multiple-choice questions or question-answering tasks with predefined answers and test cases. These benchmarks encompass a range of topics including language understanding, mathematics, coding, and logical reasoning. Prominent examples in this category are MMLU (Hendrycks et al., 2020), HellaSwag (Zellers et al., 2019), GSM-8K (Cobbe et al., 2021), BigBench (Srivastava et al., 2023), AGIEval (Zhong et al., 2023), and HumanEval (Chen et al., 2021). Benchmarks focusing on safety, such as ToxicChat (Lin et al., 2023), and comprehensive suites like HELM (Liang et al., 2022), also exist. In addition to closed-ended questions, benchmarks can include open-ended questions that are evaluated by human judgment, rated by experts or crowd workers such as those on Amazon Mechanical Turk (Karpinska et al., 2021; Geng et al., 2023; Wang et al., 2023). A recent trend is to use GPT-4 to approximate human judgment (Chiang & Lee, 2023), with notable instances being MT-Bench (Zheng et al., 2023b) and AlpacaEval (Li et al., 2023). In addition to static benchmarks, live benchmarks that include fresh questions are also available. These questions can be obtained from annual exams or weekly online contests such as Codeforces (Li et al., 2022; Huang et al., 2023). They can also be sourced from human interaction. Some studies have explored using live human interaction for reinforcement learning from human preference (Bai et al., 2022; Ouyang et al., 2022; Touvron et al., 2023). However, these studies are typically limited to specific organizations. In this paper, we introduce Chatbot Arena, the first open, large-scale, crowdsourced benchmark platform that utilizes live human interaction.

Risks of Static Benchmarks. Static benchmarks have certain issues, including contamination, saturation, overfitting, and a lack of human alignment (Yang et al., 2023; Oren et al., 2023). DynaBench (Kiela et al., 2021) identifies these challenges and recommends the use of a live benchmark that incorporates a human-in-the-loop approach for classical NLP benchmarks. Our system adopts a similar spirit. However, our focus is on chatting with LLMs, and we implement
this on a significantly larger user scale.
Ranking System. Ranking systems are a well-studied topic in statistics. Related topics include probability models (Hunter, 2004; Rao & Kupper, 1967), rank elicitation (Szörényi et al., 2015; Busa-Fekete et al., 2014a;b), and online experiment design (Chernoff, 1992; Karimi et al., 2021). The Elo rating system has also been used for LLMs (Bai et al., 2022; Boubdir et al., 2023). Contributing to this literature, we introduce techniques for accelerating ranking convergence and detecting abnormalities, specifically applied to large-scale, real-world settings of LLMs.

Human Preference Dataset. Owing to the significance of human preferences, several datasets and analyses exist that incorporate human preferences. These include OpenAssistant (Köpf et al., 2023), HH-RLHF (Bai et al., 2022), LMSYS-Chat-1M (Zheng et al., 2023a), and synthetic approximations of human preferences such as UltraFeedback (Cui et al., 2023) and Nectar (Zhu et al., 2023). Our prior data release, LMSYS-Chat-1M (Zheng et al., 2023a), was similarly collected via crowdsourcing. However, LMSYS-Chat-1M comprises solely conversations and lacks human preference data, rendering it unsuitable for direct use in ranking studies. This paper focuses on the analysis of preference data for ranking purposes.
3. Human Preference Data Collection
In this section, we discuss our interface design to collect
human preferences and present summary statistics.
3.1. Interface
Chatbot Arena crowdsources feedback from users for model evaluation. Our goal is to design an easy-to-use interface that reduces friction for users contributing data. Since we collect feedback from many users, it is difficult to set a consistent grading rubric across different people. Hence, we adopt a pairwise comparison mechanism where users only need to compare two model responses and vote for the better one, instead of being required to provide an absolute score.
In each battle, two anonymous models are sampled. To encourage data diversity, we do not preset any input prompt on the website. Users are free to input any prompt for the two models. We believe this creates incentives for user engagement, particularly given that we offer a free service. It also helps us collect a diverse set of inputs representing real-world usage. After the models provide their answers, users compare them side-by-side and vote for the preferred answer. If a user cannot choose in the first turn, the user can continue chatting until identifying a winner. For those who are unsure, we also present two buttons, “tie” and “both are bad.” Figure 8 shows a screenshot of our interface. Before using our service, users are required to accept the terms of use, which give us their consent to release the data publicly.
3.2. Data Statistics
We began collecting data in April 2023. As of January 2024, we have received around 240K votes from over 90K users. Our data involves more than 50 models, including both proprietary models like GPT-4, Claude, and Gemini, as well as open models such as LLaMA and Mistral. These conversations cover more than 100 languages, with 77% in English, 5% in Chinese, and the remaining languages, such as Russian, German, Spanish, French, and Japanese, each representing less than 2% of the total. Each data point includes multi-turn conversations between the user and two LLMs, and a vote indicating which model the user prefers. We summarize statistics in Table 1 along with other existing human preference datasets.

Figure 10 in the Appendix shows the vote count per model. On average, 8K votes are collected for each model. In Figure 2, we select a set of representative models and present their win rates and numbers of battles. Note that we employ non-uniform sampling to concentrate votes on model pairs with similar performance, which have higher uncertainty. This helps us reduce the number of votes required to reach stable results. We later develop an adaptive sampling method and demonstrate its effectiveness against random sampling; see Section 5 for further analysis.

To ensure anonymity, we use keywords to filter out conversations containing model identities, such as model names (e.g., GPT, Claude) or companies (e.g., OpenAI, Anthropic). To avoid misuse, we adopt the OpenAI moderation API to flag conversations that contain unsafe content. The flagged user requests account for 3% of the total requests. Figure 9 in the Appendix shows the number of valid user votes over time, where we receive 1-2K votes per day in recent months, with spikes as we introduce new models or leaderboard updates.
4. From Pairwise Comparisons to Rankings
Our data consists of pairwise comparisons, but how can we use these comparisons to recover a ranking over all $M$ models? This is a well-studied topic in the literature on learning to rank (Liu et al., 2009), and we present our perspective here. We let $\mathcal{A} = \{(m, m') : m < m' \text{ and } m, m' \in [M]\}$ denote our comparative data set.

We consider a sequential setting, where at time $t \in \mathbb{N}$, we serve the human a pair of models $A_t \in \mathcal{A}$ (which we pick), and in turn we observe the human's response $H_t \in [0, 1]$. As an example, we might have $A_t = (1, 2)$ and $H_t = 1$, indicating that the human prefers model 2 over model 1. In the ensuing text, we will primarily focus on the binary case, where $H_t \in \{0, 1\}$, but our approach will generalize to
Table 1. Statistics of human preference datasets, including Anthropic HH (Bai et al., 2022), OpenAssistant Conversations (Köpf et al., 2023), and Chatbot Arena (as of 2024/1/21). Tokens are counted with Llama 2's tokenizer. “Convs” = conversations; “Langs” = languages.

Dataset                  | # Convs | # Models | # Users | # Langs | Avg. # Turns per Sample | Avg. # Tokens per Prompt | Avg. # Tokens per Response
Anthropic HH             | 338,704 | -        | 143     | 1       | 2.3 | 18.9 | 78.9
OpenAssistant            | 66,497  | -        | 13,500  | 35      | -   | 36.9 | 214.2
Chatbot Arena (20240121) | 243,329 | 50       | 90,051  | 149     | 1.3 | 94.9 | 269.0
Figure 2. Win-rate (left) and battle count (right) between a subset of models in Chatbot Arena.
any form of feedback, including the possibility of allowing
the human to express different degrees of preference or to
say the models are tied.
One critical goal is to estimate the win matrix: $\theta^*(a) = \mathbb{E}[H_t \mid A_t = a]$ for all $a \in \mathcal{A}$; see the left panel of Figure 2 for an illustration of the (empirical) win matrix. In the binary case, the $a$ entry in the win matrix corresponds to the probability that the human prefers model $a_2$ to $a_1$ when shown the pair $a$. Finding the win matrix is a relatively straightforward mean-estimation problem; we will provide details in Section 5.
Formally, consider a score $s(P) \in \mathbb{R}^M$, where $P$ is a joint distribution over $\mathcal{A}$ and $H$ (by default, we will target a uniform distribution over $\mathcal{A}$). Each model has a true score $s(P)_m$, and better models will have higher scores. In particular, we have the rank of model $m$:

$$\mathrm{rank}(P)_m = 1 + \sum_{m' \in [M]} \mathbf{1}\{s(P)_{m'} > s(P)_m\}. \tag{1}$$

The best model has rank 1. If there is another model tied for best, they will both get assigned rank 1.
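The rank definition in (1) is simple to compute once scores are in hand. A minimal sketch (`rank_from_scores` is an illustrative helper, not part of the paper's codebase):

```python
def rank_from_scores(scores):
    """rank_m = 1 + number of models with a strictly higher score,
    so the best model has rank 1 and tied models share a rank."""
    return [1 + sum(other > s for other in scores) for s in scores]

# Models with score 3.5 are tied for best and both receive rank 1.
print(rank_from_scores([2.0, 3.5, 3.5, 1.0]))  # [3, 1, 1, 4]
```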
Picking a score. A standard score function in this setting is the vector of Bradley-Terry (BT) coefficients (Bradley & Terry, 1952). In the Bradley-Terry model, $H_t \in \{0, 1\}$, and the probability that model $m$ beats model $m'$ is modeled via a logistic relationship:

$$P(H_t = 1) = \frac{1}{1 + e^{\xi_{m'} - \xi_m}}, \tag{2}$$

where $\xi$ is an $M$-length vector of so-called BT coefficients. Without loss of generality, we take $\xi_1 = 0$ (since the model is invariant to addition in $\xi$). Our goal is to estimate the population Bradley-Terry coefficients, i.e., those that minimize the binary cross-entropy:

$$s(P) = \operatorname*{argmin}_{\xi} \; \mathbb{E}_{(A,H) \sim P}\left[\ell\left(H, \frac{1}{1 + e^{\xi_{A_2} - \xi_{A_1}}}\right)\right], \tag{3}$$

where $\ell$ is the binary cross-entropy loss, $\ell(h, p) = -(h \log(p) + (1 - h) \log(1 - p))$.

Although the BT model technically assumes a parametric form for the model win rates, the seminal results of Huber et al. (1967) and White (1982) show that maximum likelihood estimators are still asymptotically normal even when these assumptions do not hold, so long as the so-called “sandwich” covariance matrix is used; see Section 5 for details, and see Appendix B for a nonparametric extension of the Bradley-Terry model. Finally, we remark that previous evolutions of our online interface reported different ranking scores, such as the Elo score (Elo, 1967), instead of the BT coefficients. We made this change because the BT coefficients are better for the purpose of statistical estimation.
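As a concrete illustration, the BT coefficients in (3) can be recovered by minimizing the binary cross-entropy with plain gradient descent. The sketch below is a simplified stand-in for a production solver (uniform weighting, batch gradient descent); `fit_bt` and its battle-record format are assumptions for this example, not the platform's actual code:

```python
import numpy as np

def fit_bt(battles, num_models, lr=0.2, steps=3000):
    """Fit Bradley-Terry coefficients xi by gradient descent on the
    binary cross-entropy, pinning xi_0 = 0 for identifiability.

    battles: iterable of (m, m_prime, h), with h = 1 if m was preferred."""
    xi = np.zeros(num_models)
    for _ in range(steps):
        grad = np.zeros(num_models)
        for m, mp, h in battles:
            p = 1.0 / (1.0 + np.exp(xi[mp] - xi[m]))  # P(m beats m'), eq. (2)
            grad[m] += p - h
            grad[mp] += h - p
        xi -= lr * grad / len(battles)
        xi -= xi[0]  # shift so that xi_0 = 0
    return xi

# Model 1 beats model 0 in 75 of 100 battles: xi_1 should approach log 3.
battles = [(1, 0, 1)] * 75 + [(1, 0, 0)] * 25
print(fit_bt(battles, 2)[1])  # approximately 1.10
```

Tied or graded outcomes ($H_t \in [0,1]$) drop into the same loop unchanged, since the cross-entropy gradient only depends on $h - p$.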
5. Efficient Approximate Ranking
In Section 4 we described how to calculate the win matrix, score, and rank. Now we describe our estimation procedures.
Win matrix estimation. Estimation of the win matrix is relatively straightforward. Define $X_t(a) = \frac{1}{P_t(a)} H_t \mathbf{1}\{A_t = a\}$, where $P_t(a)$ is the probability of sampling pair $a$ at time $t$, and let $X_t$ be the corresponding vector. Then the estimator is

$$\hat{\theta}_T = \frac{1}{T} \sum_{t=1}^{T} X_t. \tag{4}$$

Note that $\mathbb{E}[X_t(a)] = \theta^*(a)$ for all $t$, and thus $\hat{\theta}_T$ is an unbiased estimator of $\theta^*$. We will furthermore estimate the covariance matrix as

$$\hat{\Sigma}_T = \frac{1}{T} \sum_{t=1}^{T} (X_t - \hat{\theta}_T)(X_t - \hat{\theta}_T)^\top. \tag{5}$$

Under the appropriate regularity conditions, we have that

$$\sqrt{T}\,\hat{\Sigma}_T^{-1/2}(\hat{\theta}_T - \theta^*) \to \mathcal{N}(0, I_d), \tag{6}$$

and we construct confidence intervals accordingly. For an understanding of the appropriate regularity conditions, see Durrett (2019), Theorem 8.2.8, where condition (ii) is trivially satisfied so long as $P_t(a) > \epsilon > 0$, and condition (i) is implied by the almost-sure convergence of $P_t(a)$ to a limiting distribution $P(a)$.
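A minimal sketch of the inverse-propensity-weighted estimator (4); the `(pair, vote, probability)` record format is a hypothetical simplification of the platform's logs:

```python
def estimate_win_matrix(observations):
    """Unbiased IPW estimate of theta*(a) = E[H_t | A_t = a], eq. (4).

    observations: list of (a, h, p) tuples, where a is the model pair
    shown at time t, h is the human's vote in [0, 1], and p = P_t(a) is
    the probability that pair a was sampled at that time."""
    T = len(observations)
    theta = {}
    for a, h, p in observations:
        # X_t(a) = H_t * 1{A_t = a} / P_t(a); other pairs contribute 0.
        theta[a] = theta.get(a, 0.0) + h / (p * T)
    return theta

# Pair (0, 1) always wins its votes, pair (0, 2) always loses them.
obs = [((0, 1), 1, 0.5), ((0, 2), 0, 0.5)] * 10
print(estimate_win_matrix(obs))  # entries near 1.0 and 0.0
```

Reweighting by $1/P_t(a)$ is what keeps the estimate unbiased even under the non-uniform, adaptive sampling described below.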
Estimating the BT scores. To estimate the BT coefficients, mirroring (3), we perform (reweighted) maximum likelihood estimation on our data points:

$$s(\hat{P}) = \operatorname*{argmin}_{\xi} \sum_{t=1}^{T} \frac{1}{P(A_t)} \ell\left(H_t, \frac{1}{1 + e^{\xi_{A_{t,2}} - \xi_{A_{t,1}}}}\right), \tag{7}$$

where $A_t \sim P$. We perform the inverse weighting by $P(A_t)$ because this allows us to target a score with a uniform distribution over $\mathcal{A}$.

To compute confidence intervals on the BT coefficients, we employ two strategies: (1) the pivot bootstrap (DiCiccio & Efron, 1996), and (2) the “sandwich” robust standard errors outlined in Huber et al. (1967) (see also Freedman (2006) for an outline of the necessary technical assumptions). Ultimately, based on the results of a simulation study described in Appendix A, we choose to deploy the sandwich intervals due to their smaller size in large samples.
Approximate rankings. Finally, we report an approximate ranking for each model that accounts for the uncertainty in the estimation of the score. Given an $M$-dimensional confidence set $\mathcal{C}$ satisfying

$$P(s(P) \in \mathcal{C}) \geq 1 - \alpha, \tag{8}$$

we extract an approximate ranking $R_m = 1 + \sum_{m' \in [M]} \mathbf{1}\{\inf \mathcal{C}_{m'} > \sup \mathcal{C}_m\}$. The uniform validity of $\mathcal{C}$ directly implies that $P(\exists m : R_m > \mathrm{rank}(P)_m) \leq \alpha$; i.e., with high probability, no model's performance is understated. A guarantee on the other side, that no model's performance is overstated, is possible by interchanging the $\inf$ and $\sup$. To get the uniform confidence set, we construct the chi-squared interval implied by the central limit theorem using the sandwich estimate of the variance. In other words, we construct the interval $\{\xi : T\,\|\hat{V}^{-1/2}(\hat{\xi} - \xi)\|_2^2 \leq \chi^2_{1-\alpha, M-1}\}$, where $\hat{\xi}$ is our MLE of the BT coefficients and $\hat{V}$ is the sandwich variance of the logistic regression.
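Extracting ranks from per-model confidence intervals can be sketched directly; `approximate_ranks` and its `(lower, upper)` interval format are illustrative assumptions:

```python
def approximate_ranks(intervals):
    """R_m = 1 + #{m' : inf C_m' > sup C_m}: a model is ranked below
    another only when their confidence intervals do not overlap.

    intervals: list of (lower, upper) bounds, one per model score."""
    return [1 + sum(lo2 > hi for lo2, _ in intervals) for _, hi in intervals]

# Separated intervals yield distinct ranks; overlapping ones share rank 1.
print(approximate_ranks([(0.9, 1.1), (0.5, 0.7), (0.0, 0.2)]))  # [1, 2, 3]
print(approximate_ranks([(0.0, 1.0), (0.5, 1.5)]))              # [1, 1]
```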
Active sampling rule. Our sampling rule was to choose the model pair $a \in \mathcal{A}$ proportionally to the reduction in confidence interval size from sampling that pair:

$$P_t(a) \propto \sqrt{\frac{\hat{\Sigma}_{t,a,a}}{|\{t : A_t = a\}|}} - \sqrt{\frac{\hat{\Sigma}_{t,a,a}}{|\{t : A_t = a\}| + 1}}. \tag{9}$$
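Rule (9) amounts to computing, for each pair, how much one extra vote would shrink its interval width. A NumPy sketch (the function name and input layout are illustrative assumptions):

```python
import numpy as np

def sampling_distribution(var_diag, counts):
    """P_t(a) proportional to sqrt(var_a / n_a) - sqrt(var_a / (n_a + 1)),
    the expected reduction in CI width from one more sample of pair a.

    var_diag: diagonal of the estimated covariance, one entry per pair.
    counts:   number of times each pair has been sampled so far."""
    shrink = np.sqrt(var_diag / counts) - np.sqrt(var_diag / (counts + 1))
    return shrink / shrink.sum()

# With equal variance, the rarely sampled pair gets most of the mass.
p = sampling_distribution(np.array([1.0, 1.0]), np.array([10.0, 1000.0]))
print(p)  # heavily favors the first pair
```

This concentrates votes where uncertainty per vote is largest, matching the non-uniform sampling described in Section 3.2.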
5.1. Detecting Anomalous Users
On a different note, we take a first step towards identifying anomalous IP addresses in our dataset. In a dataset of $U$ unique IPs, we let $\mathrm{IP} = \{1, \ldots, U\}$ be the set of all IP addresses. Consider a “test” user, outside this database, who gives ratings $H'_1, \ldots, H'_n$ when presented actions $A'_1, \ldots, A'_n$. The idea of our procedure is to compare the distribution of ratings for the new user to the historical distribution of ratings for a given action. We let $\mathcal{H}_a = \{H_t : A_t = a\}$, and every time a user submits a vote, we calculate the following number:

$$p_i = \frac{1}{|\mathcal{H}_{A'_i}| + 1}\left(1 + \sum_{h \in \mathcal{H}_{A'_i}} \mathbf{1}\{h \geq H'_i\}\right). \tag{10}$$

Under the null hypothesis that $\mathcal{H}_{A'_i}$ is exchangeable with $H'_i$, $p_i$ is a valid p-value (see Appendix C for a proof). Furthermore, the dependence of these p-values is asymptotically negligible.

With this p-value in hand, we can test against this null hypothesis sequentially by using Fisher's combination test (Fisher, 1928) along with a variant of the Bonferroni correction. In particular, for each user, after their $j$th vote, we compute $M_j = -2\sum_{i=1}^{j} \log(p_i)$. At 5 randomly chosen values of $j$ between 1 and 100, we identify a user as anomalous if $M_j \geq \chi^2_{2j, 1-\alpha/5}$. (The times are randomly chosen so as to prevent anomalous users from strategizing to hack this p-value.) Despite the heuristic application of this procedure, it seems to work well in our small-scale tests reported in Table 5.
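A minimal sketch of the per-vote p-value (10) and the Fisher combination statistic; the final chi-squared threshold lookup is omitted here (in practice one could use `scipy.stats.chi2.ppf`), and both helper names are illustrative:

```python
import math

def vote_p_value(historical, new_vote):
    """p_i = (1 + #{h in H_a : h >= h'}) / (|H_a| + 1), eq. (10):
    the conservative rank of the new vote among historical votes
    for the same model pair."""
    tail = sum(1 for h in historical if h >= new_vote)
    return (1 + tail) / (len(historical) + 1)

def fisher_statistic(p_values):
    """M_j = -2 * sum_i log p_i; under the null it is approximately
    chi-squared with 2j degrees of freedom."""
    return -2.0 * sum(math.log(p) for p in p_values)

# A user who always votes 1 on a pair where 1 is historically rare
# accumulates small p-values and a large Fisher statistic.
history = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
ps = [vote_p_value(history, 1) for _ in range(20)]
print(fisher_statistic(ps))  # large relative to a chi-squared with 40 dof
```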
Figure 3. Similarity matrix of the top-16 topic clusters; the number following each topic label is the cluster size as a percentage of all prompts: Word Play and Phonetics (1.0%), AI Impact and Applications (0.9%), Email and Letter Writing Assistance (0.8%), Operations & Fleet Management (0.8%), Sports and Athletics Queries (0.8%), Poetry Writing & Styles (0.8%), Cooking and Recipes (0.7%), Web Development Essentials (0.7%), SQL Database Table Queries (0.5%), Movie Reviews and Discussions (0.5%), Original Joke Requests (0.5%), Medical Queries and Information (0.4%), Role-Playing Games (0.4%), Animal Behavior and Pet Care Queries (0.4%), Advanced Mathematical Concepts (0.4%), Philosophical Texts & Concepts (0.4%). Note that similarity is computed from each cluster's centroid embeddings, hence the diagonal is always one.
6. Data Analysis
To examine whether Arena's crowdsourced data reflects real-world use cases, we conduct topic modeling on the user prompts. We show how effective these prompts are at distinguishing models. Lastly, we validate vote quality by relabeling data with experts.
6.1. Topic Modeling on User Prompts
To study prompt diversity, we build a topic modeling pipeline with BERTopic³ (Grootendorst, 2022). We start by transforming user prompts into representation vectors using OpenAI's text embedding model (text-embedding-3-small). To mitigate the curse of dimensionality for data clustering, we employ UMAP (Uniform Manifold Approximation and Projection) (McInnes et al., 2020) to reduce the embedding dimension from 1,536 to 5. We then use the hierarchical density-based clustering algorithm HDBSCAN to identify topic clusters, with a minimum cluster size of 32. Finally, to obtain topic labels, we sample 10 prompts from each topic cluster and feed them into GPT-4-Turbo for topic summarization.
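The cluster-to-cluster similarity reported in Figure 3 (cosine similarity of centroid embeddings) can be sketched as follows; the helper name and toy embeddings are illustrative, not the paper's pipeline code:

```python
import numpy as np

def centroid_similarity(cluster_embeddings):
    """Cosine similarity between cluster centroids; the diagonal is
    always 1, matching the note in Figure 3.

    cluster_embeddings: list of (n_i, d) arrays of prompt embeddings."""
    centroids = np.stack([e.mean(axis=0) for e in cluster_embeddings])
    normed = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return normed @ normed.T

# Two clusters with orthogonal embeddings yield the identity matrix.
a = np.array([[1.0, 0.0], [2.0, 0.0]])
b = np.array([[0.0, 1.0], [0.0, 3.0]])
print(centroid_similarity([a, b]))
```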
The pipeline identifies 600 clusters covering a wide range of topics, including poetry writing, coding, math, and medical queries. We present the top-16 topic clusters in Figure 3. We observe that the largest cluster accounts for only 1% of the entire set, the rest quickly drop to below 0.5%, and the similarity between clusters is small, indicating a long-tailed and diverse distribution. Due to space limits, we present the similarity matrix and cluster hierarchy of the top-64 clusters in Figures 11 and 12 in the Appendix.
6.2. Can Arena Prompts Distinguish Models?
Next, we study how effective these topic clusters are at distinguishing models' strengths. Constructing challenging prompts has become increasingly difficult due to LLMs'
3https://github.com/MaartenGr/BERTopicTable 2. GPT-4-0613’s win-rate against Llama-2-70b-chat on 30
sample prompts from various topic clusters. We use GPT-4-turbo
as judge to evaluate model responses in pairwise comparison.
Topic Cluster Win-rate Size
Python Game Programming Challenge 96.7% 0.2%
C/C++ Process Multi-Threading 86.7% 0.3%
SQL Query Database Assistance 73.3% 0.2%
Poetry Writing Prompts 66.7% 1.1%
Python Coding Basics 65.0% 0.2%
Linguistic Analysis & Wordplay 58.3% 0.7%
Travel Itinerary Planning 58.3% 0.4%
Movie Recommendations & Ratings 53.3% 0.2%
fast growing capabilities. For example, open models such as
Llama-2-70b-chat can likely answer inquiries about movie
or travel recommendation as good as GPT-4, but not in
other domains such as reasoning or coding. To demon-
strate, we sample 30 prompts from seven topic clusters and
compare the performance of Llama-2-70b-chat and GPT-4.
To control variables, we factor out user votes and consider
LLM-as-judge (Zheng et al., 2023b) to evaluate model re-
sponse. Results are shown in Table 2, where we see GPT-4
has significantly higher win-rate (up to 97%) in clusters that
require coding and reasoning skills. On the other hand, for
clusters with less problem-solving tasks, GPT-4 win-rate
drops to below 60%. We show examples in Appendix D.1.
This result shows models may exhibit varying strengths in
different areas, but also highlights some of the topic clusters
in Chatbot Arena are effective in differentiate models.
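A minimal sketch of the win-rate computation behind Table 2, assuming the per-prompt judge verdicts are available as strings; the tie-handling convention (half a win per side) is our assumption, since the text does not state one:

```python
from collections import Counter

def win_rate(verdicts, model):
    """Fraction of battles won by `model`, counting ties as half wins."""
    counts = Counter(verdicts)
    return (counts[model] + 0.5 * counts["tie"]) / len(verdicts)

# Hypothetical verdicts from the GPT-4-Turbo judge on 30 sampled prompts.
verdicts = ["gpt-4-0613"] * 20 + ["llama-2-70b-chat"] * 6 + ["tie"] * 4
rate = win_rate(verdicts, "gpt-4-0613")  # (20 + 0.5 * 4) / 30
```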
Building a Challenging Benchmark. To further demonstrate the prompt quality, we show that it is possible to construct a challenging benchmark from crowdsourced user prompts. To ensure both topic coverage and quality, we first run the topic modeling pipeline and then follow a procedure similar to Zheng et al. (2023a) to select challenging questions sampled from each topic cluster. Example prompts and evaluation procedures can be found in Appendix D.2 and Appendix D.3, respectively. We observe that the selected prompts are highly effective at differentiating models. In Figure 4, we compare Arena Bench against a widely used LLM benchmark, MT-Bench (Zheng et al., 2023b). We can see that Arena Bench effectively reveals a significant gap in performance between proprietary models and the strongest open models.
6.3. Validating Vote Quality
To assess the quality of crowdsourced votes, we randomly selected 160 battles between GPT-4-Turbo and Llama-2-13b, as well as between GPT-4-Turbo and GPT-3.5-Turbo-0613. We then asked experts4 to label their preference for each comparison. The experts were given the prompts and answers blindly, and were asked to carefully fact-check the models' answers with external resources such as search engines. Manually labeling each
4 The labelers are graduate students at UC Berkeley.
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
[Figure 4: bar chart of Arena Bench and MT-Bench scores (0-8) for models from Llama-2-7B-Chat through GPT-4-Turbo.]
Figure 4. Models' performance on Arena Bench and MT-Bench, showing an increased gap between open and proprietary models. Both use GPT-4 as the judge.
Table 3. Pairwise agreement rates between crowd-users, the GPT-4 judge, and experts on pairwise battles. The top part of the table is for GPT-4-Turbo vs. Llama-2-13b-chat; the bottom is for GPT-4-Turbo vs. GPT-3.5-Turbo-0613.
Llama-2-13b Expert 1 Expert 2 GPT-4
Crowd 72.8% 77.8% 75.6%
Expert 1 - 89.8% 81.0%
Expert 2 - - 78.5%
GPT-3.5-Turbo Expert 1 Expert 2 GPT-4
Crowd 73.8% 83.1% 75.6%
Expert 1 - 79.4% 76.3%
Expert 2 - - 79.3%
data point took 3-5 minutes on average. For reference, we also used GPT-4 as a judge for the same pairwise comparisons. The agreement rates between crowd-users, experts, and the GPT-4 judge are presented in Table 3. The corresponding win-rates are shown in Table 4.
To summarize, we observe high agreement rates (72% to 83%) between Arena crowd-users and experts in both setups. Note that the agreement rates between the two experts are at similar levels (79.4% and 89.8%). As for the 10%-20% disagreement between experts, it mostly stems from user prompts that have no ground-truth answer: depending on the preferences of the evaluator, either answer can sometimes be argued to be better than the other, as in the examples in Appendix D.4. The 5%-10% gap between the crowd-vs-expert and expert-vs-expert agreement rates is mostly attributed to crowd users making mistakes or overlooking factual errors in the models' responses. Overall, the agreement rates presented in Table 3 validate the decent quality of crowdsourced votes in Chatbot Arena.

Table 4. GPT-4-Turbo's win-rate according to crowd-users, the GPT-4 judge, and experts in pairwise battles against Llama-2-13b and GPT-3.5-Turbo-0613.
Baseline Arena User Expert 1 Expert 2 GPT-4
Llama-2-13b 81.2% 89.4% 86.9% 78.8%
GPT-3.5-Turbo 76.3% 82.5% 89.4% 79.4%
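The agreement rates in Table 3 reduce to a simple per-battle match count between two raters. A sketch with hypothetical labels (the "model_a"/"model_b"/"tie" vocabulary is our assumption):

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of battles on which two raters give the same verdict."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical verdicts from a crowd-user and an expert on 10 battles.
crowd  = ["model_a", "model_a", "model_b", "tie", "model_a",
          "model_b", "model_a", "model_a", "tie", "model_b"]
expert = ["model_a", "model_b", "model_b", "tie", "model_a",
          "model_b", "model_a", "model_a", "model_a", "model_b"]
rate = agreement_rate(crowd, expert)  # 8 of the 10 verdicts match
```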
[Figure 5: BT coefficient intervals (roughly 0.0-2.5) for 28 models, from zephyr-7b-beta (#19-37) up to gpt-4-turbo (#1), showing corrected and uncorrected intervals.]
Figure 5. Intervals for the BT coefficients with and without mul-
tiplicity correction. The multiplicity correction, in this case a
chi-square CLT interval, is technically required for the purpose of
calculating the ranking, because it ensures all scores are simulta-
neously contained in their intervals (and the ranking is a function
of all the scores). However, it induces extra conservatism, so we
report both intervals.
7. Experiments
7.1. Ranking system
Computing the rank on real data. In this section, we report results from our experiments on approximate ranking. For this experiment, we replayed T = 213,576 historical votes from our online platform and calculated the BT coefficients using our earlier-described estimation algorithm with confidence intervals; see Figure 5 for these intervals (with and without multiplicity correction; the formal notion of approximate ranking technically requires multiplicity correction, but it makes the intervals looser).
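The paper's estimation algorithm additionally produces confidence intervals; as a simplified stand-in for the point estimate, the BT coefficients can be fit by maximum likelihood with plain gradient ascent in numpy. Everything below (function name, synthetic battles) is illustrative, not the paper's implementation:

```python
import numpy as np

def fit_bt(battles, n_models, lr=0.1, steps=2000):
    """Maximum-likelihood Bradley-Terry fit by gradient ascent.

    battles: list of (i, j, y) with y = 1 if model i beat model j.
    P(i beats j) is modeled as sigmoid(theta_i - theta_j).
    """
    theta = np.zeros(n_models)
    i_idx = np.array([b[0] for b in battles])
    j_idx = np.array([b[1] for b in battles])
    y = np.array([b[2] for b in battles], dtype=float)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(theta[i_idx] - theta[j_idx])))
        grad = np.zeros(n_models)
        np.add.at(grad, i_idx, y - p)   # d log-lik / d theta_i
        np.add.at(grad, j_idx, p - y)   # d log-lik / d theta_j
        theta += lr * grad / len(battles)
        theta -= theta.mean()           # fix the additive identifiability constant
    return theta

# Synthetic replay: model 2 is strongest, model 0 weakest.
rng = np.random.default_rng(1)
true = np.array([-1.0, 0.0, 1.0])
battles = []
for _ in range(3000):
    i, j = rng.choice(3, size=2, replace=False)
    p = 1.0 / (1.0 + np.exp(-(true[i] - true[j])))
    battles.append((i, j, int(rng.random() < p)))
theta = fit_bt(battles, 3)
```

On this synthetic data the fitted coefficients recover the true ordering of the three models.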
Evaluating the coverage of the intervals. A natural follow-
up question is whether or not the intervals are doing their job
correctly: whether they cover the true BT coefficients with
probability at least (and almost exactly) 1−α. Of course,
[Figure 6: coverage (left) and average interval width (right) as a function of n, for M ∈ {4, 7, 10, 15, 20}.]
Figure 6. Intervals for the BT coefficients as a function of the
number of samples and the number of models M.
this cannot be evaluated on real data, so we run a simulation. A vector of BT coefficients is drawn, with each coordinate sampled i.i.d. from a Beta(1/γ, 1/γ) distribution; we take γ = 2 in Figure 6 (and vary γ in Appendix A). Given these coefficients, a dataset is synthesized, and the coverage and average width are computed over 20 trials. The results for the uncorrected intervals can be seen in Figure 6. The coverage of the intervals behaves as expected, centering around 1−α regardless of the number of models. Meanwhile, the more models are included, the larger the intervals become.
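The synthesis step above can be sketched as follows; the helper name and the (i, j, y) battle encoding are ours:

```python
import numpy as np

def synthesize(M, n, gamma=2.0, seed=0):
    """Draw BT coefficients i.i.d. from Beta(1/gamma, 1/gamma) and
    synthesize up to n pairwise battles (i, j, y) under the BT model,
    where y = 1 means model i beats model j."""
    rng = np.random.default_rng(seed)
    theta = rng.beta(1.0 / gamma, 1.0 / gamma, size=M)
    pairs = rng.integers(0, M, size=(n, 2))
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]          # drop self-pairings
    p_win = 1.0 / (1.0 + np.exp(-(theta[pairs[:, 0]] - theta[pairs[:, 1]])))
    y = (rng.random(len(pairs)) < p_win).astype(int)
    return theta, np.column_stack([pairs, y])

theta, battles = synthesize(M=7, n=1000)  # gamma = 2, as in Figure 6
```

Refitting BT coefficients and intervals on many such synthetic datasets yields the coverage and width curves of Figure 6.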
Evaluating the active sampling rule. Next, we evaluate our active sampling rule from Equation (9) for win matrix estimation. We evaluate this rule by fitting BT coefficients to our 213,576-point holdout set and then sampling from that distribution using our active sampling algorithm. The results are displayed in Figure 7. It is hard to tell by looking at the plots, but the improvement is substantial: to estimate θ∗ to a precision of 0.2, random sampling needs 6,800 samples while adaptive sampling needs 4,400; meanwhile, to estimate the score to a precision of 0.3, random sampling needs 17,200 samples while adaptive needs 16,400. Thus, the random baseline requires 54% and 5% more data, respectively, to achieve the same precision. One can see from the plots in Figure 7 that these results are not cherry-picked: the sample-efficiency of our method is better at all values on the horizontal axis.
7.2. Anomalous User Detection
We evaluate the outlier detection method from Section 5.1. We construct the evaluation set by manually identifying 25 anomalous users whose inputs are highly repetitive or meaningless (e.g., asking "hi" 100 times or inputting garbled text). We randomly sample 25 normal users with
[Figure 7: number of samples n versus average interval width, for random and pairwise-adaptive sampling.]
Figure 7. Interval widths on the win matrix (upper panel) and on the BT coefficients (lower panel) as a function of the number of samples, for random and adaptive sampling. Improvements from adaptive sampling can be seen in both cases, although they are more subtle on the scale of the score.
Table 5. Confusion matrices for different values of α. "Pred." means predicted. Positive means anomalous; negative means normal.
α= 0.1 Pred. Positive Pred. Negative
Actual Positive 13/14 12/36
Actual Negative 1/14 24/36
α= 0.3 Pred. Positive Pred. Negative
Actual Positive 21/29 4/21
Actual Negative 8/29 17/21
at least 50 votes, and inspect their input prompts to ensure no abnormal behavior. As described in Section 5.1, for each user we compute five statistics Mj and flag the user as anomalous if Mj ≥ χ²_{2j, 1−α/5}. We present results for two different values of α (i.e., the significance level) in Table 5. We find the detection method effective (e.g., reaching a 90% true positive rate and a 60-70% true negative rate). We inspect the false negative errors and find that they come from users who do not always behave abnormally, making them harder to detect.
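The decision rule can be sketched with scipy's chi-square quantile function. The statistics Mj themselves come from the procedure in Section 5.1 and are not reproduced here, so the inputs below are hypothetical:

```python
from scipy.stats import chi2

def is_anomalous(M_stats, alpha=0.1):
    """Flag a user whose j-th statistic M_j (j = 1..5) exceeds the
    chi-square critical value chi2_{2j, 1 - alpha/5}; dividing alpha
    by five is the Bonferroni correction over the five tests."""
    return any(
        M_j >= chi2.ppf(1.0 - alpha / 5.0, df=2 * j)
        for j, M_j in enumerate(M_stats, start=1)
    )

# Hypothetical users: one with an extreme fifth statistic, one benign.
flagged = is_anomalous([1.0, 2.0, 3.0, 4.0, 50.0], alpha=0.1)  # True
benign  = is_anomalous([0.5, 1.0, 1.5, 2.0, 2.5], alpha=0.1)   # False
```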
8. Discussion
Limitations. Although our user base is extensive, we an-
ticipate that it will primarily consist of LLM hobbyists and
researchers who are eager to experiment with and evaluate
the latest LLMs. This inclination may result in a biased
distribution of users. Additionally, despite the wide array of
topics encompassed by the prompts discussed in previous
sections, the data predominantly comes from our online
chat interface. This source might not accurately reflect the
real-world usage of LLMs in production environments or
specialized domains, potentially leading to a skewed prompt
distribution. Moreover, our study concentrates on assessing
the helpfulness of LLMs but overlooks their safety aspects.
We recognize the possibility and necessity of a parallel
mechanism to evaluate the safety of these models.
Future Directions. In our future work, we plan to develop
comprehensive topic leaderboards and establish a dedicated
section for multimodal and agent-based LLMs in more dy-
namic, gamified settings, catering to more complex tasks.
We also believe our approach to detecting harmful users could be improved and made more formally rigorous using the theory of nonnegative supermartingales and e-values (Howard et al., 2020; Waudby-Smith & Ramdas, 2020; Vovk & Wang, 2021; Ramdas et al., 2023); this would handle the dependence, but the variants we tried did not perform well in terms of power.
9. Conclusion
In this paper, we present Chatbot Arena, an open platform for evaluating LLMs through crowdsourced, pairwise human preferences. We conduct an in-depth analysis of the crowdsourced user prompts and preference votes to validate their diversity and quality. We develop an efficient model sampling and ranking algorithm. Our dataset, including 100K pairwise preference votes, will be released for future research.
Acknowledgments
This project is supported by sponsorship from Kaggle,
MBZUAI, a16z, Together AI, Anyscale, and HuggingFace.
This project is also partly supported by Accenture, AMD,
Google, IBM, Intel, Microsoft, Samsung SDS, SAP, Uber,
and VMware. The authors would like to thank Siyuan Zhuang for insightful discussions and Tijana Zrnić for helpful feedback on the manuscript.
References
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., Das-
Sarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T.,
et al. Training a helpful and harmless assistant with rein-
forcement learning from human feedback. arXiv preprint
arXiv:2204.05862 , 2022.
Boubdir, M., Kim, E., Ermis, B., Hooker, S., and Fadaee, M.
Elo uncovered: Robustness and best practices in language
model evaluation, 2023.
Bradley, R. A. and Terry, M. E. Rank analysis of incom-
plete block designs: I. the method of paired comparisons.
Biometrika , 39(3/4):324–345, 1952.
Busa-Fekete, R., Huellermeier, E., and Szörényi, B.
Preference-based rank elicitation using statistical models:
The case of mallows. In Xing, E. P. and Jebara, T. (eds.),
Proceedings of the 31st International Conference on Ma-
chine Learning , volume 32 of Proceedings of MachineLearning Research , pp. 1071–1079, Bejing, China, 22–
24 Jun 2014a. PMLR. URL https://proceedings.
mlr.press/v32/busa-fekete14.html .
Busa-Fekete, R., Huellermeier, E., and Szörényi, B.
Preference-based rank elicitation using statistical models:
The case of mallows. In Xing, E. P. and Jebara, T. (eds.),
Proceedings of the 31st International Conference on Ma-
chine Learning , volume 32 of Proceedings of Machine
Learning Research , pp. 1071–1079, Bejing, China, 22–24
Jun 2014b. PMLR. URL https://proceedings.
mlr.press/v32/busa-fekete14.html .
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O.,
Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman,
G., et al. Evaluating large language models trained on
code. arXiv preprint arXiv:2107.03374 , 2021.
Chernoff, H. Sequential Design of Experiments , pp.
345–360. Springer New York, New York, NY,
1992. ISBN 978-1-4612-4380-9. doi: 10.1007/
978-1-4612-4380-9_27. URL https://doi.org/
10.1007/978-1-4612-4380-9_27 .
Chiang, C.-H. and Lee, H.-y. Can large language mod-
els be an alternative to human evaluations? In Rogers,
A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceed-
ings of the 61st Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers) , pp.
15607–15631, Toronto, Canada, July 2023. Association
for Computational Linguistics. doi: 10.18653/v1/2023.
acl-long.870. URL https://aclanthology.org/
2023.acl-long.870 .
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H.,
Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano,
R., et al. Training verifiers to solve math word problems.
arXiv preprint arXiv:2110.14168 , 2021.
Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., Ni, Y., Xie, G.,
Liu, Z., and Sun, M. Ultrafeedback: Boosting language
models with high-quality feedback, 2023.
DiCiccio, T. J. and Efron, B. Bootstrap confidence intervals.
Statistical science , 11(3):189–228, 1996.
Durrett, R. Probability: theory and examples , volume 49.
Cambridge university press, 2019.
Elo, A. E. The proposed uscf rating system, its develop-
ment, theory, and applications. Chess Life , 22(8):242–
247, 1967.
Fisher, R. A. Statistical methods for research workers . Num-
ber 5. Oliver and Boyd, 1928.
Freedman, D. A. On the so-called “huber sandwich esti-
mator”’ and “robust standard errors”’. The American
Statistician , 60(4):299–302, 2006.
Gemini, T., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B.,
Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A.,
et al. Gemini: a family of highly capable multimodal
models. arXiv preprint arXiv:2312.11805 , 2023.
Geng, X., Gudibande, A., Liu, H., Wallace, E., Abbeel,
P., Levine, S., and Song, D. Koala: A dia-
logue model for academic research. Blog post,
April 2023. URL https://bair.berkeley.edu/
blog/2023/04/03/koala/ .
Grootendorst, M. Bertopic: Neural topic modeling
with a class-based tf-idf procedure. arXiv preprint
arXiv:2203.05794 , 2022.
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M.,
Song, D., and Steinhardt, J. Measuring massive multitask
language understanding. In International Conference on
Learning Representations , 2020.
Howard, S. R., Ramdas, A., McAuliffe, J., and Sekhon,
J. Time-uniform chernoff bounds via nonnegative super-
martingales. 2020.
Huang, Y., Lin, Z., Liu, X., Gong, Y., Lu, S., Lei, F., Liang,
Y., Shen, Y., Lin, C., Duan, N., et al. Competition-level
problems are effective llm evaluators. arXiv preprint
arXiv:2312.02143 , 2023.
Huber, P. J. et al. The behavior of maximum likelihood
estimates under nonstandard conditions. In Proceedings
of the fifth Berkeley symposium on mathematical statistics
and probability , volume 1, pp. 221–233. Berkeley, CA:
University of California Press, 1967.
Hunter, D. R. MM algorithms for generalized Bradley-
Terry models. The Annals of Statistics , 32(1):384 – 406,
2004. doi: 10.1214/aos/1079120141. URL https:
//doi.org/10.1214/aos/1079120141 .
Karimi, M. R., Gürel, N. M., Karlaš, B., Rausch, J., Zhang,
C., and Krause, A. Online active model selection for
pre-trained classifiers. In International Conference on
Artificial Intelligence and Statistics , pp. 307–315. PMLR,
2021.
Karpinska, M., Akoury, N., and Iyyer, M. The perils of
using Mechanical Turk to evaluate open-ended text gen-
eration. In Moens, M.-F., Huang, X., Specia, L., and
Yih, S. W.-t. (eds.), Proceedings of the 2021 Confer-
ence on Empirical Methods in Natural Language Pro-
cessing , pp. 1265–1285, Online and Punta Cana, Domini-
can Republic, November 2021. Association for Computa-
tional Linguistics. doi: 10.18653/v1/2021.emnlp-main.
97. URL https://aclanthology.org/2021.
emnlp-main.97.
Kiela, D., Bartolo, M., Nie, Y., Kaushik, D., Geiger, A., Wu,
Z., Vidgen, B., Prasad, G., Singh, A., Ringshia, P., et al.
Dynabench: Rethinking benchmarking in nlp. In Pro-
ceedings of the 2021 Conference of the North American
Chapter of the Association for Computational Linguistics:
Human Language Technologies , pp. 4110–4124, 2021.
Köpf, A., Kilcher, Y., von Rütte, D., Anagnostidis, S.,
Tam, Z.-R., Stevens, K., Barhoum, A., Duc, N. M., Stan-
ley, O., Nagyfi, R., et al. Openassistant conversations–
democratizing large language model alignment. arXiv
preprint arXiv:2304.07327 , 2023.
Langley, P. Crafting papers on machine learning. In Langley,
P. (ed.), Proceedings of the 17th International Conference
on Machine Learning (ICML 2000) , pp. 1207–1216, Stan-
ford, CA, 2000. Morgan Kaufmann.
Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I.,
Guestrin, C., Liang, P., and Hashimoto, T. B. Alpacae-
val: An automatic evaluator of instruction-following
models. https://github.com/tatsu-lab/
alpaca_eval , 2023.
Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J.,
Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago,
A., et al. Competition-level code generation with alpha-
code. Science , 378(6624):1092–1097, 2022.
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D.,
Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar,
A., et al. Holistic evaluation of language models. arXiv
preprint arXiv:2211.09110 , 2022.
Lin, Z., Wang, Z., Tong, Y., Wang, Y., Guo, Y., Wang,
Y., and Shang, J. ToxicChat: Unveiling hidden chal-
lenges of toxicity detection in real-world user-AI con-
versation. In Bouamor, H., Pino, J., and Bali, K.
(eds.), Findings of the Association for Computational
Linguistics: EMNLP 2023 , pp. 4694–4702, Singa-
pore, December 2023. Association for Computational
Linguistics. doi: 10.18653/v1/2023.findings-emnlp.
311. URL https://aclanthology.org/2023.
findings-emnlp.311 .
Liu, T.-Y. et al. Learning to rank for information retrieval.
Foundations and Trends® in Information Retrieval, 3(3):
225–331, 2009.
McInnes, L., Healy, J., and Melville, J. Umap: Uniform
manifold approximation and projection for dimension
reduction, 2020.
OpenAI. Gpt-4 technical report. arXiv preprint
arXiv:2303.08774 , 2023.
Oren, Y., Meister, N., Chatterji, N., Ladhak, F., and
Hashimoto, T. B. Proving test set contamination in black
box language models. arXiv preprint arXiv:2310.17623 ,
2023.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright,
C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K.,
Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L.,
Simens, M., Askell, A., Welinder, P., Christiano, P., Leike,
J., and Lowe, R. Training language models to follow
instructions with human feedback, 2022.
Ramdas, A., Grünwald, P., Vovk, V., and Shafer, G. Game-
theoretic statistics and safe anytime-valid inference. Sta-
tistical Science , 38(4):576–601, 2023.
Rao, P. V . and Kupper, L. L. Ties in paired-comparison
experiments: A generalization of the bradley-terry model.
Journal of the American Statistical Association , 62(317):
194–204, 1967. doi: 10.1080/01621459.1967.10482901.
Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid,
A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A.,
Garriga-Alonso, A., et al. Beyond the imitation game:
Quantifying and extrapolating the capabilities of language
models. Transactions on Machine Learning Research ,
2023.
Szörényi, B., Busa-Fekete, R., Paul, A., and Hüller-
meier, E. Online rank elicitation for plackett-
luce: A dueling bandits approach. In Cortes, C.,
Lawrence, N., Lee, D., Sugiyama, M., and Garnett,
R. (eds.), Advances in Neural Information Process-
ing Systems , volume 28. Curran Associates, Inc.,
2015. URL https://proceedings.neurips.
cc/paper_files/paper/2015/file/
7eacb532570ff6858afd2723755ff790-Paper.
pdf.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi,
A., Babaei, Y ., Bashlykov, N., Batra, S., Bhargava, P.,
Bhosale, S., et al. Llama 2: Open foundation and fine-
tuned chat models. arXiv preprint arXiv:2307.09288 ,
2023.
Vovk, V. and Wang, R. E-values: Calibration, combination
and applications. The Annals of Statistics , 49(3):1736–
1754, 2021.
Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A.,
Khashabi, D., and Hajishirzi, H. Self-instruct: Align-
ing language models with self-generated instructions. In
Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Pro-
ceedings of the 61st Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers) ,
pp. 13484–13508, Toronto, Canada, July 2023. Associ-
ation for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.754. URL https://aclanthology.
org/2023.acl-long.754 .
Waudby-Smith, I. and Ramdas, A. Estimating means of
bounded random variables by betting. arXiv preprint
arXiv:2010.09686 , 2020.
White, H. Maximum likelihood estimation of misspeci-
fied models. Econometrica: Journal of the econometric
society , pp. 1–25, 1982.
Yang, S., Chiang, W.-L., Zheng, L., Gonzalez, J. E., and
Stoica, I. Rethinking benchmark and contamination for
language models with rephrased samples. arXiv preprint
arXiv:2311.04850 , 2023.
Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi,
Y. Hellaswag: Can a machine really finish your sentence?
In Proceedings of the 57th Annual Meeting of the Asso-
ciation for Computational Linguistics , pp. 4791–4800,
2019.
Zheng, L., Chiang, W.-L., Sheng, Y., Li, T., Zhuang, S., Wu,