\orangebox{Did you know that...}
{Perplexity is closely related to entropy in information theory. In fact, $\text{Perplexity} = 2^{H(P)}$,
where \( H(P) \) is the entropy. This means that perplexity can be seen as the effective
number of equally likely words the model is choosing from at each step!}
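As a quick sanity check of this identity (a minimal sketch, not tied to any particular model), consider a toy next-word distribution over four equally likely words:

```python
import math

# Toy next-word distribution: 4 equally likely words.
probs = [0.25, 0.25, 0.25, 0.25]

# Entropy in bits: H(P) = -sum p * log2(p)
entropy = -sum(p * math.log2(p) for p in probs)

# Perplexity = 2^H(P): the "effective vocabulary size".
perplexity = 2 ** entropy

print(entropy)     # 2.0
print(perplexity)  # 4.0 -> choosing among 4 equally likely words
```

A model that is genuinely choosing among 4 equally likely words has entropy 2 bits and perplexity exactly 4, matching the interpretation above.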

% ---------- BERTScore ----------
\clearpage
\thispagestyle{genaistyle}
\section{BERTScore}
\subsection{BERTScore}

% reference:
% https://arxiv.org/pdf/1904.09675
% https://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT

BERTScore is a metric for evaluating text generation that leverages contextual embeddings from pre-trained language models like BERT.
Instead of relying solely on surface-level n-gram overlap (as in BLEU or ROUGE), BERTScore computes similarity by aligning tokens from the
candidate and reference sentences in embedding space.

% equation
\begin{center}
$\displaystyle
R_{\mathrm{BERT}} = \frac{1}{|x|}\sum_{x_i \in x}\max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^{\top}\hat{\mathbf{x}}_j
\qquad
P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|}\sum_{\hat{x}_j \in \hat{x}}\max_{x_i \in x} \mathbf{x}_i^{\top}\hat{\mathbf{x}}_j
\qquad
F_{\mathrm{BERT}} = 2\,\frac{P_{\mathrm{BERT}}\cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}$
\end{center}
Here $x$ denotes the reference tokens, $\hat{x}$ the candidate tokens, and the embeddings $\mathbf{x}_i, \hat{\mathbf{x}}_j$ are pre-normalized, so the inner product equals cosine similarity.

Mathematically, for each token in the candidate sentence, BERTScore finds its most similar token in the reference sentence (and vice versa)
using cosine similarity. Precision, recall, and F1 are then aggregated over all matched pairs, yielding a semantically oriented score that correlates
strongly with human judgments.
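The greedy matching just described can be sketched in a few lines of Python. This is a simplified illustration: random vectors stand in for real BERT embeddings, and the paper's optional IDF weighting is omitted.

```python
import numpy as np

def bertscore_f1(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Greedy-matching BERTScore sketch.

    cand_emb: (n_cand, d) token embeddings of the candidate sentence.
    ref_emb:  (n_ref, d)  token embeddings of the reference sentence.
    """
    # Normalize rows so that dot products are cosine similarities.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)

    sim = cand @ ref.T  # (n_cand, n_ref) pairwise cosine similarities

    # Each candidate token greedily matches its best reference token (precision),
    # and each reference token its best candidate token (recall).
    precision = sim.max(axis=1).mean()
    recall = sim.max(axis=0).mean()
    return 2 * precision * recall / (precision + recall)

# Identical embeddings give a perfect score of 1.0.
emb = np.random.default_rng(0).normal(size=(5, 8))
print(round(bertscore_f1(emb, emb), 6))  # 1.0
```

In practice the embeddings would come from a model such as BERT rather than a random generator; the matching and aggregation logic stays the same.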

\textbf{When to use BERTScore?}

Use BERTScore when evaluating tasks where semantic similarity matters more than exact wording, such as machine translation, summarization, or
dialogue generation. It is especially useful when generated text can be phrased differently from the references but still convey the same meaning.

\coloredboxes{
\item Context-aware. Uses deep contextual embeddings, capturing meaning beyond surface word matches.
\item Better correlation with humans. Empirical studies show BERTScore aligns more closely with human evaluation than BLEU or ROUGE.
}
{
\item Computationally heavy. Requires embedding extraction with large pre-trained models, making it slower than n-gram metrics.
\item Model dependence. Scores vary depending on which pre-trained model (e.g., BERT, RoBERTa, multilingual BERT) is used.
\item Bias inheritance. Any biases in the underlying language model embeddings can influence the scores.
}

\clearpage

\thispagestyle{customstyle}

\orangebox{Did you know that...}
{The original BERTScore paper included a picture of Bert from Sesame Street, paying homage to the model’s namesake and adding a playful touch to an
otherwise technical paper.}