
Commit 40626e7

docs: improve consistency in punctuation of metric list (huggingface#605)
Co-authored-by: Clémentine Fourrier <[email protected]>
1 parent d146f9b commit 40626e7

1 file changed

docs/source/metric-list.mdx

Lines changed: 32 additions & 32 deletions
@@ -1,22 +1,22 @@
 # Metric List

-## Automatic metrics for multiple choice tasks
+## Automatic metrics for multiple-choice tasks

 These metrics use log-likelihood of the different possible targets.
-- `loglikelihood_acc`: Fraction of instances where the choice with the best logprob was correct - also exists in a faster version for tasks where the possible choices include only one token (`loglikelihood_acc_single_token`)
-- `loglikelihood_acc_norm`: Fraction of instances where the choice with the best logprob, normalized by sequence length, was correct - also exists in a faster version for tasks where the possible choices include only one token (`loglikelihood_acc_norm_single_token`)
-- `loglikelihood_acc_norm_nospace`: Fraction of instances where the choice with the best logprob, normalized by sequence length, was correct, with the first space ignored
-- `loglikelihood_f1`: Corpus level F1 score of the multichoice selection - also exists in a faster version for tasks where the possible choices include only one token (`loglikelihood_f1_single_token`)
-- `mcc`: Matthew's correlation coefficient (a measure of agreement between statistical distributions),
-- `recall_at_1`: Fraction of instances where the choice with the best logprob was correct - also exists in a faster version for tasks where the possible choices include only one token per choice (`recall_at_1_single_token`)
-- `recall_at_2`: Fraction of instances where the choice with the 2nd best logprob or better was correct - also exists in a faster version for tasks where the possible choices include only one token per choice (`recall_at_2_single_token`)
-- `mrr`: Mean reciprocal rank, a measure of the quality of a ranking of choices ordered by correctness/relevance - also exists in a faster version for tasks where the possible choices include only one token (`mrr_single_token`)
+- `loglikelihood_acc`: Fraction of instances where the choice with the best logprob was correct - also exists in a faster version for tasks where the possible choices include only one token (`loglikelihood_acc_single_token`).
+- `loglikelihood_acc_norm`: Fraction of instances where the choice with the best logprob, normalized by sequence length, was correct - also exists in a faster version for tasks where the possible choices include only one token (`loglikelihood_acc_norm_single_token`).
+- `loglikelihood_acc_norm_nospace`: Fraction of instances where the choice with the best logprob, normalized by sequence length, was correct, with the first space ignored.
+- `loglikelihood_f1`: Corpus level F1 score of the multichoice selection - also exists in a faster version for tasks where the possible choices include only one token (`loglikelihood_f1_single_token`).
+- `mcc`: Matthew's correlation coefficient (a measure of agreement between statistical distributions).
+- `recall_at_1`: Fraction of instances where the choice with the best logprob was correct - also exists in a faster version for tasks where the possible choices include only one token per choice (`recall_at_1_single_token`).
+- `recall_at_2`: Fraction of instances where the choice with the 2nd best logprob or better was correct - also exists in a faster version for tasks where the possible choices include only one token per choice (`recall_at_2_single_token`).
+- `mrr`: Mean reciprocal rank, a measure of the quality of a ranking of choices ordered by correctness/relevance - also exists in a faster version for tasks where the possible choices include only one token (`mrr_single_token`).
 - `target_perplexity`: Perplexity of the different choices available.
-- `acc_golds_likelihood`:: A bit different, it actually checks if the average logprob of a single target is above or below 0.5
-- `multi_f1_numeric`: Loglikelihood F1 score for multiple gold targets
+- `acc_golds_likelihood`: A bit different, it actually checks if the average logprob of a single target is above or below 0.5.
+- `multi_f1_numeric`: Loglikelihood F1 score for multiple gold targets.

 All these metrics also exist in a "single token" version (`loglikelihood_acc_single_token`, `loglikelihood_acc_norm_single_token`, `loglikelihood_f1_single_token`, `mcc_single_token`, `recall@2_single_token` and `mrr_single_token`). When the multichoice option compares only one token (ex: "A" vs "B" vs "C" vs "D", or "yes" vs "no"), using these metrics in the single token version will divide the time spent by the number of choices. Single token evals also include:
-- `multi_f1_numeric`: computes the f1 score of all possible choices and averages it.
+- `multi_f1_numeric`: Computes the f1 score of all possible choices and averages it.

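For readers new to these metrics, here is a minimal sketch of how the log-likelihood accuracy family can be computed from per-choice logprobs. The function name and inputs are illustrative assumptions, not lighteval's internal API:

```python
import numpy as np

def loglikelihood_metrics(choice_logprobs, choice_token_lengths, gold_index):
    """Illustrative computation of loglikelihood_acc, loglikelihood_acc_norm and mrr
    for a single multiple-choice instance (not lighteval's actual implementation).

    choice_logprobs: summed logprob of each candidate continuation.
    choice_token_lengths: number of tokens in each candidate, for length normalization.
    gold_index: index of the correct choice.
    """
    logprobs = np.asarray(choice_logprobs, dtype=float)
    lengths = np.asarray(choice_token_lengths, dtype=float)

    # loglikelihood_acc: is the highest-logprob choice the gold one?
    acc = float(np.argmax(logprobs) == gold_index)

    # loglikelihood_acc_norm: same, after normalizing each logprob by sequence length
    acc_norm = float(np.argmax(logprobs / lengths) == gold_index)

    # mrr: reciprocal rank of the gold choice when choices are sorted by logprob (best first)
    rank = np.argsort(-logprobs).tolist().index(gold_index) + 1
    mrr = 1.0 / rank

    return {"loglikelihood_acc": acc, "loglikelihood_acc_norm": acc_norm, "mrr": mrr}

# Example: 4 choices, the gold answer is choice 2
print(loglikelihood_metrics([-12.3, -9.8, -7.1, -15.0], [5, 4, 3, 6], gold_index=2))
# -> {'loglikelihood_acc': 1.0, 'loglikelihood_acc_norm': 1.0, 'mrr': 1.0}
```

As the paragraph above notes, the `_single_token` variants apply when every choice is a single token, which roughly divides the evaluation time by the number of choices.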
 ## Automatic metrics for perplexity and language modeling
 These metrics use log-likelihood of prompt.
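These metrics build on the standard notion of perplexity, the exponential of the negative average log-likelihood. A minimal sketch follows; the exact weighting (per word, per byte, corpus-level, etc.) depends on the specific metric:

```python
import math

def perplexity(token_logprobs):
    """Illustrative: perplexity = exp(-mean per-token log-likelihood), natural log."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Example: 4 tokens with log-likelihoods
print(round(perplexity([-2.1, -0.3, -1.7, -0.9]), 3))  # exp(1.25) ≈ 3.49
```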
@@ -32,45 +32,45 @@ These metrics need the model to generate an output. They are therefore slower.
 - `exact_match`: Fraction of instances where the prediction matches the gold with the exception of the border whitespaces (= after a `strip` has been applied to both).
 - `quasi_exact_match`: Fraction of instances where the normalized prediction matches the normalized gold (normalization done on whitespace, articles, capitalization, ...). Other variations exist, with other normalizers, such as `quasi_exact_match_triviaqa`, which only normalizes the predictions after applying a strip to all sentences.
 - `prefix_exact_match`: Fraction of instances where the beginning of the prediction matches the gold at the exception of the border whitespaces (= after a `strip` has been applied to both).
-- `prefix_quasi_exact_match`: Fraction of instances where the normalized beginning of the prediction matches the normalized gold (normalization done on whitespace, articles, capitalization, ...)
-- `exact_match_indicator`: Exact match with some preceding context (before an indicator) removed
-- `f1_score_quasi`: Average F1 score in terms of word overlap between the model output and gold, with both being normalized first
-- `f1_score`: Average F1 score in terms of word overlap between the model output and gold without normalisation
-- `f1_score_macro`: Corpus level macro F1 score
-- `f1_score_macro`: Corpus level micro F1 score
+- `prefix_quasi_exact_match`: Fraction of instances where the normalized beginning of the prediction matches the normalized gold (normalization done on whitespace, articles, capitalization, ...).
+- `exact_match_indicator`: Exact match with some preceding context (before an indicator) removed.
+- `f1_score_quasi`: Average F1 score in terms of word overlap between the model output and gold, with both being normalized first.
+- `f1_score`: Average F1 score in terms of word overlap between the model output and gold without normalisation.
+- `f1_score_macro`: Corpus level macro F1 score.
+- `f1_score_macro`: Corpus level micro F1 score.
 - `maj_at_5` and `maj_at_8`: Model majority vote. Takes n (5 or 8) generations from the model and assumes the most frequent is the actual prediction.
 - Summarization:
-- `rouge`: Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/)
+- `rouge`: Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/).
 - `rouge1`: Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on 1-gram overlap.
 - `rouge2`: Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on 2-gram overlap.
 - `rougeL`: Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on longest common subsequence overlap.
 - `rougeLsum`: Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on longest common subsequence overlap.
-- `rouge_t5` (BigBench): Corpus level ROUGE score for all available ROUGE metrics
+- `rouge_t5` (BigBench): Corpus level ROUGE score for all available ROUGE metrics.
 - `faithfulness`: Faithfulness scores based on the SummaC method of [Laban et al. (2022)](https://aclanthology.org/2022.tacl-1.10/).
-- `extractiveness`: Reports, based on [(Grusky et al., 2018)](https://aclanthology.org/N18-1065/)
+- `extractiveness`: Reports, based on [(Grusky et al., 2018)](https://aclanthology.org/N18-1065/):
 - `summarization_coverage`: Extent to which the model-generated summaries are extractive fragments from the source document,
 - `summarization_density`: Extent to which the model-generated summaries are extractive summaries based on the source document,
 - `summarization_compression`: Extent to which the model-generated summaries are compressed relative to the source document.
 - `bert_score`: Reports the average BERTScore precision, recall, and f1 score [(Zhang et al., 2020)](https://openreview.net/pdf?id=SkeHuCVFDr) between model generation and gold summary.
-- Translation
+- Translation:
 - `bleu`: Corpus level BLEU score [(Papineni et al., 2002)](https://aclanthology.org/P02-1040/) - uses the sacrebleu implementation.
 - `bleu_1`: Average sample BLEU score [(Papineni et al., 2002)](https://aclanthology.org/P02-1040/) based on 1-gram overlap - uses the nltk implementation.
 - `bleu_4`: Average sample BLEU score [(Papineni et al., 2002)](https://aclanthology.org/P02-1040/) based on 4-gram overlap - uses the nltk implementation.
 - `chrf`: Character n-gram matches f-score.
 - `ter`: Translation edit/error rate.
-- Copyright
+- Copyright:
 - `copyright`: Reports:
-- `longest_common_prefix_length`: average length of longest common prefix between model generation and reference,
-- `edit_distance`: average Levenshtein edit distance between model generation and reference,
-- `edit_similarity`: average Levenshtein edit similarity (normalized by length of longer sequence) between model generation and reference.
+- `longest_common_prefix_length`: Average length of longest common prefix between model generation and reference,
+- `edit_distance`: Average Levenshtein edit distance between model generation and reference,
+- `edit_similarity`: Average Levenshtein edit similarity (normalized by the length of longer sequence) between model generation and reference.
 - Math:
-- `quasi_exact_match_math`: Fraction of instances where the normalized prediction matches the normalized gold (normalization done for math, where latex symbols, units, etc are removed)
-- `maj_at_4_math`: Majority choice evaluation, using the math normalisation for the predictions and gold
-- `quasi_exact_match_gsm8k`: Fraction of instances where the normalized prediction matches the normalized gold (normalization done for gsm8k, where latex symbols, units, etc are removed)
-- `maj_at_8_gsm8k`: Majority choice evaluation, using the gsm8k normalisation for the predictions and gold
+- `quasi_exact_match_math`: Fraction of instances where the normalized prediction matches the normalized gold (normalization done for math, where latex symbols, units, etc are removed).
+- `maj_at_4_math`: Majority choice evaluation, using the math normalisation for the predictions and gold.
+- `quasi_exact_match_gsm8k`: Fraction of instances where the normalized prediction matches the normalized gold (normalization done for gsm8k, where latex symbols, units, etc are removed).
+- `maj_at_8_gsm8k`: Majority choice evaluation, using the gsm8k normalisation for the predictions and gold.
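To make the exact-match and word-overlap F1 metrics above concrete, here is a minimal sketch with a deliberately simplified normalizer; lighteval's actual normalization also handles articles, numbers, and task-specific rules, and varies per metric:

```python
from collections import Counter
import string

def normalize(text: str) -> str:
    # Simplified normalization: lowercase, strip punctuation, collapse whitespace.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    # Strict comparison after stripping border whitespace only.
    return float(prediction.strip() == gold.strip())

def quasi_exact_match(prediction: str, gold: str) -> float:
    # Comparison after normalization.
    return float(normalize(prediction) == normalize(gold))

def f1_word_overlap(prediction: str, gold: str) -> float:
    # Word-level F1 between normalized prediction and gold.
    pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The answer is 42.", "the answer is 42"))        # 0.0
print(quasi_exact_match("The answer is 42.", "the answer is 42"))  # 1.0
print(f1_word_overlap("Paris is the capital", "the capital is Paris"))  # 1.0
```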

 ## LLM-as-Judge
-- `llm_judge_gpt3p5`: Can be used for any generative task, the model will be scored by a GPT3.5 model using the OpenAI API
-- `llm_judge_llama_3_405b`: Can be used for any generative task, the model will be scored by a Llama 3.405B model using the HuggingFace API
+- `llm_judge_gpt3p5`: Can be used for any generative task, the model will be scored by a GPT3.5 model using the OpenAI API.
+- `llm_judge_llama_3_405b`: Can be used for any generative task, the model will be scored by a Llama 3.405B model using the HuggingFace API.
 - `llm_judge_multi_turn_gpt3p5`: Can be used for any generative task, the model will be scored by a GPT3.5 model using the OpenAI API. It is used for multiturn tasks like mt-bench.
 - `llm_judge_multi_turn_llama_3_405b`: Can be used for any generative task, the model will be scored by a Llama 3.405B model using the HuggingFace API. It is used for multiturn tasks like mt-bench.
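For context on the LLM-as-Judge metrics, this is a generic sketch of the pattern they implement: send the question, reference, and model answer to a judge model and parse a score. The prompt, scoring scale, default model, and parsing below are illustrative assumptions, not lighteval's actual judge template:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_score(question: str, gold: str, answer: str, model: str = "gpt-3.5-turbo") -> int:
    """Illustrative LLM-as-judge call: ask a judge model to rate an answer from 1 to 10."""
    prompt = (
        "Rate the assistant answer from 1 to 10 given the reference answer.\n"
        f"Question: {question}\nReference: {gold}\nAssistant answer: {answer}\n"
        "Reply with a single integer."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = response.choices[0].message.content.strip()
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits) if digits else 0

# Example usage (requires network access and an API key):
# print(judge_score("What is 2 + 2?", "4", "The answer is 4."))
```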
