Commit 8967a42

BERTScore
1 parent 5eaa05a commit 8967a42

File tree

1 file changed (+47, -1 lines)

book/8-genai.tex

@@ -48,4 +48,50 @@ \subsection{Perplexity}
\orangebox{Did you know that...}
{Perplexity is closely related to entropy in information theory. In fact, $\mathrm{Perplexity} = 2^{H(P)}$,
where \(H(P)\) is the entropy. This means that perplexity can be seen as the effective
number of equally likely words the model is choosing from at each step!}
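
The identity $\mathrm{Perplexity} = 2^{H(P)}$ can be checked numerically. A minimal Python sketch (illustrative, not part of the book's code) on a toy next-word distribution:

```python
import math

def perplexity(probs):
    """Perplexity as 2**H(P), with entropy H(P) in bits (log base 2)."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

# A uniform choice over 4 tokens: the model is effectively picking
# among 4 equally likely words, so perplexity is exactly 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0

# A peaked distribution has fewer "effective" choices, so lower perplexity.
print(perplexity([0.7, 0.1, 0.1, 0.1]))
```

Note how the second distribution, despite having four possible words, behaves like choosing among roughly two and a half equally likely ones.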


% ---------- BERTScore ----------
\clearpage
\thispagestyle{genaistyle}
\section{BERTScore}
\subsection{BERTScore}

% reference:
% https://arxiv.org/pdf/1904.09675
% https://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT

BERTScore is a metric for evaluating text generation that leverages contextual embeddings from pre-trained language models such as BERT.
Instead of relying solely on surface-level n-gram overlap (as in BLEU or ROUGE), BERTScore computes similarity by aligning tokens from the
candidate and reference sentences in embedding space.

% equation (greedy matching with cosine similarity; embeddings are pre-normalized)
\begin{center}
$\displaystyle R_{\mathrm{BERT}} = \frac{1}{|x|}\sum_{x_i \in x}\max_{\hat{x}_j \in \hat{x}} x_i^{\top}\hat{x}_j, \qquad
P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|}\sum_{\hat{x}_j \in \hat{x}}\max_{x_i \in x} x_i^{\top}\hat{x}_j, \qquad
F_{\mathrm{BERT}} = 2\,\frac{P_{\mathrm{BERT}} \cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}$
\end{center}
where $x$ is the sequence of reference token embeddings and $\hat{x}$ the sequence of candidate token embeddings, pre-normalized so that the inner product $x_i^{\top}\hat{x}_j$ equals cosine similarity.

Mathematically, for each token in a candidate sentence, BERTScore finds its most similar token in the reference sentence (and vice versa)
using cosine similarity. Precision, recall, and F1 are then aggregated over all matched pairs, yielding a semantics-oriented score that correlates
strongly with human judgments.

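The greedy matching just described can be sketched in a few lines of NumPy. This toy version (not the \texttt{bert-score} package) uses randomly generated stand-in embeddings in place of a real BERT encoder, and omits refinements such as IDF weighting:

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """Toy BERTScore: greedy cosine matching between token embeddings.

    cand_emb, ref_emb: arrays of shape (num_tokens, embedding_dim).
    """
    # L2-normalize each row so dot products become cosine similarities.
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T  # (|candidate|, |reference|) cosine similarity matrix

    # Precision: each candidate token matched to its best reference token.
    precision = sim.max(axis=1).mean()
    # Recall: each reference token matched to its best candidate token.
    recall = sim.max(axis=0).mean()
    return 2 * precision * recall / (precision + recall)

# Identical embeddings must yield a perfect score of 1.0.
emb = np.random.default_rng(0).normal(size=(5, 8))
print(bertscore_f1(emb, emb))
```

In practice one would extract the embeddings from a pre-trained contextual encoder; the matching and aggregation logic stays the same.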
\textbf{When to use BERTScore?}

Use BERTScore when evaluating tasks where semantic similarity matters more than exact wording, such as machine translation, summarization, or
dialogue generation. It is especially useful when generated text can be phrased differently from the references but still conveys the same meaning.

\coloredboxes{
\item Context-aware. Uses deep contextual embeddings, capturing meaning beyond surface word matches.
\item Better correlation with humans. Empirical studies show BERTScore aligns more closely with human evaluation than BLEU or ROUGE.
}
{
\item Computationally heavy. Requires embedding extraction with large pre-trained models, making it slower than n-gram metrics.
\item Model dependence. Scores vary depending on which pre-trained model (e.g., BERT, RoBERTa, multilingual BERT) is used.
\item Bias inheritance. Any biases in the underlying language model embeddings can influence the scores.
}

\clearpage

\thispagestyle{customstyle}

\orangebox{Did you know that...}
{The original BERTScore paper included a picture of Bert from Sesame Street, paying homage to the model's namesake and adding a playful touch to an
otherwise technical paper.}
