Commit d58dc01

[FIX] - Fix for summarization edge case (#1201)
This PR adds a fix for the issue mentioned in #1108. However, I have a few points to discuss @shahules786:

- I had added `conciseness_score` to penalize long summaries, but I also do not want to promote very short and skimpy summaries; I need to find a middle ground.
- Is `averaging` a good way to combine `QA_score` and `conciseness_score`?
- Ranking-based metrics to measure the quality of summarization (as mentioned by shahul in the above issue).

Given the conclusions we reach on these discussion points, I will push more commits; let's keep this PR open until we resolve them.

---------

Co-authored-by: Shahules786 <[email protected]>
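For context on the second discussion point, here is a minimal standalone sketch (illustrative only, not ragas code; the helper names and numbers are made up) contrasting plain averaging with a weighted combination, using the character-length conciseness term from the docs change below:

```python
# Illustrative sketch of the two combination strategies under discussion.
# Not ragas code; names and numbers here are made up for the example.

def conciseness_score(summary: str, context: str) -> float:
    # 1 minus the relative length of the summary, per the updated docs formula;
    # min() caps copies of the context at 0 and 1e-10 guards an empty context.
    return 1 - min(len(summary), len(context)) / (len(context) + 1e-10)

def simple_average(qa_score: float, c_score: float) -> float:
    # Original behaviour: both components weighted equally.
    return (qa_score + c_score) / 2

def weighted_combination(qa_score: float, c_score: float, coeff: float = 0.5) -> float:
    # Documented alternative: `coeff` shifts weight between the two components.
    return qa_score * coeff + c_score * (1 - coeff)

context = "A company is launching a new product, a smartphone app for fitness tracking."
summary = "A company is launching a fitness tracking app."
qa = 0.8  # e.g. 4 of 5 generated questions answered correctly

c = conciseness_score(summary, context)
print(f"conciseness={c:.2f}, average={simple_average(qa, c):.2f}, "
      f"weighted(coeff=0.7)={weighted_combination(qa, c, coeff=0.7):.2f}")
```

Pushing `coeff` above 0.5 lets one component dominate, which is one way to keep very short but uninformative summaries from scoring well while still rewarding brevity.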
1 parent 7f1073f commit d58dc01

2 files changed (+43 −49 lines)

docs/concepts/metrics/summarization_score.md

Lines changed: 10 additions & 6 deletions

`````diff
@@ -11,18 +11,21 @@ We compute the question-answer score using the answers, which is a list of `1`s
 \text{QA score} = \frac{|\text{correctly answered questions}|}{|\text{total questions}|}
 ````
 
-We also introduce an option to penalize larger summaries by proving a conciseness score. If this option is enabled, the final score is calculated as the average of the summarization score and the conciseness score. This conciseness scores ensures that summaries that are just copies of the text do not get a high score, because they will obviously answer all questions correctly.
+We also introduce an option to penalize larger summaries by proving a conciseness score. If this option is enabled, the final score is calculated as the weighted average of the summarization score and the conciseness score. This conciseness scores ensures that summaries that are just copies of the text do not get a high score, because they will obviously answer all questions correctly. Also, we do not want the summaries that are empty. We add a small value `1e-10` to the denominator to avoid division by zero.
 
 ```{math}
 :label: conciseness-score
-\text{conciseness score} = 1 - \frac{\text{length of summary}}{\text{length of context}}
+\text{conciseness score} = 1 - \frac{\min(\text{length of summary}, \text{length of context})}{\text{length of context} + \text{1e-10}}
 ````
 
+We also provide a coefficient `coeff`(default value 0.5) to control the weightage of the scores.
+
 The final summarization score is then calculated as:
 
 ```{math}
 :label: summarization-score
-\text{Summarization Score} = \frac{\text{QA score} + \text{conciseness score}}{2}
+\text{Summarization Score} = \text{QA score}*\text{coeff} + \\
+\text{conciseness score}*\text{(1-coeff)}
 ````
 
 ```{hint}
`````
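Before the example section of the same file, a quick numeric check of the updated formulas (illustrative numbers only, not taken from the commit):

```python
# Illustrative numbers: a 400-character context, a 100-character summary,
# and 4 of 5 generated questions answered correctly from the summary.
context_len, summary_len = 400, 100
qa_score = 4 / 5                                                          # 0.8
conciseness = 1 - min(summary_len, context_len) / (context_len + 1e-10)   # ~0.75
coeff = 0.5
print(qa_score * coeff + conciseness * (1 - coeff))                       # ~0.775
```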
`````diff
@@ -61,13 +64,14 @@ The final summarization score is then calculated as:
 ## Example
 
 ```{code-block} python
-from datasets import Dataset
 from ragas.metrics import summarization_score
 from ragas import evaluate
+from datasets import Dataset
+
 
 data_samples = {
-    'contexts' : [[c1], [c2]],
-    'summary': [s1, s2]
+    'contexts':[["A company is launching a new product, a smartphone app designed to help users track their fitness goals. The app allows users to set daily exercise targets, log their meals, and track their water intake. It also provides personalized workout recommendations and sends motivational reminders throughout the day."]],
+    'summary':['A company is launching a fitness tracking app that helps users set exercise goals, log meals, and track water intake, with personalized workout suggestions and motivational reminders.'],
 }
 dataset = Dataset.from_dict(data_samples)
 score = evaluate(dataset,metrics=[summarization_score])
`````
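The docs example now uses a single concrete row. For several rows, `contexts` stays a list of string lists (each row's contexts are joined with newlines inside `_ascore`, as the source diff below shows) and `summary` stays a flat list of strings — a sketch with dummy strings:

```python
# Dummy strings, shown only to illustrate the expected column shapes.
data_samples = {
    "contexts": [
        ["row 1, context passage A", "row 1, context passage B"],
        ["row 2, single context passage"],
    ],
    "summary": [
        "summary for row 1",
        "summary for row 2",
    ],
}
```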

src/ragas/metrics/_summarization.py

Lines changed: 33 additions & 43 deletions

`````diff
@@ -8,7 +8,7 @@
 from langchain.pydantic_v1 import BaseModel
 
 from ragas.llms.output_parser import RagasoutputParser, get_json_format_instructions
-from ragas.llms.prompt import Prompt, PromptValue
+from ragas.llms.prompt import Prompt
 from ragas.metrics.base import EvaluationMode, MetricWithLLM
 
 if t.TYPE_CHECKING:
@@ -145,65 +145,49 @@ class SummarizationScore(MetricWithLLM):
     name: str = "summary_score"  # type: ignore
     max_retries: int = 1
     length_penalty: bool = True
-    evaluation_mode: EvaluationMode = EvaluationMode.ca  # type: ignore[reportIncompatibleMethodOverride]
+    coeff: float = 0.5
+    evaluation_mode: EvaluationMode = EvaluationMode.ca  # type: ignore
     question_generation_prompt: Prompt = field(
         default_factory=lambda: TEXT_GENERATE_QUESTIONS
     )
     answer_generation_prompt: Prompt = field(
         default_factory=lambda: TEXT_GENERATE_ANSWERS
     )
-
-    def _get_extract_keyphrases_prompt(self, text) -> PromptValue:
-        return TEXT_EXTRACT_KEYPHRASES.format(text=text)
-
-    def _get_question_generation_prompt(self, text, keyphrases) -> PromptValue:
-        return TEXT_GENERATE_QUESTIONS.format(text=text, keyphrases=keyphrases)
-
-    def _get_answer_generation_prompt(
-        self, questions: t.List, summary: str
-    ) -> PromptValue:
-        return TEXT_GENERATE_ANSWERS.format(summary=summary, questions=questions)
+    extract_keyphrases_prompt: Prompt = field(
+        default_factory=lambda: TEXT_EXTRACT_KEYPHRASES
+    )
 
     async def _ascore(self, row: Dict, callbacks: Callbacks) -> float:
-        # text is the contexts provided
-        # summary is the summary generated by the model
-        # TODO: add support for the query used as well
         text: str = "\n".join(row["contexts"])
         summary: str = row["summary"]
         keyphrases = await self._extract_keyphrases(text, callbacks)
         questions = await self._get_questions(text, keyphrases, callbacks)
         answers = await self._get_answers(questions, summary, callbacks)
 
-        scores = []
+        scores = {}
         qa_score = self._compute_qa_score(answers)
-        scores.append(qa_score)
+        scores["qa_score"] = qa_score
         if self.length_penalty:
             conciseness_score = self._compute_conciseness_score(text, summary)
-            scores.append(conciseness_score)
+            scores["conciseness_score"] = conciseness_score
         return self._compute_score(scores)
 
     def _compute_score(self, scores) -> float:
-        """Returns average score of the different scores."""
-        return sum(scores) / len(scores)
+        return (
+            scores["qa_score"] * (1 - self.coeff)
+            + scores.get("conciseness_score", 0) * self.coeff
+        )
 
     def _compute_qa_score(self, answers: t.List[str]) -> float:
-        """Returns a score between 0 and 1 reflecting the fraction of
-        correct answers, ie with a value 'yes'
-        """
         correct = sum([1 for a in answers if a.lower() == "1"])
         return correct / len(answers)
 
     def _compute_conciseness_score(self, text, summary) -> float:
-        """Returns the conciseness score of the summary. This is calculated as
-        (1- relative_length_of_summary), where relative_length_of_summary is the
-        ratio of the length of the summary to the length of the original text.
-        This promotes shorter summaries.
-        """
-        return 1 - (len(summary) / len(text))
+        return 1 - min(len(summary), len(text)) / (len(text) + 1e-10)
 
     async def _extract_keyphrases(self, text: str, callbacks: Callbacks) -> t.List[str]:
         assert self.llm is not None, "LLM is not initialized"
-        p_value = self._get_extract_keyphrases_prompt(text)
+        p_value = self.extract_keyphrases_prompt.format(text=text)
         result = await self.llm.generate(
             prompt=p_value,
             callbacks=callbacks,
@@ -223,7 +207,9 @@ async def _get_questions(
         self, text: str, keyphrases: list[str], callbacks: Callbacks
     ) -> t.List[str]:
         assert self.llm is not None, "LLM is not initialized"
-        p_value = self._get_question_generation_prompt(text, keyphrases)
+        p_value = self.question_generation_prompt.format(
+            text=text, keyphrases=keyphrases
+        )
         result = await self.llm.generate(
             prompt=p_value,
             callbacks=callbacks,
@@ -244,7 +230,9 @@ async def _get_answers(
         self, questions: t.List[str], summary: str, callbacks: Callbacks
     ) -> t.List[str]:
         assert self.llm is not None, "LLM is not initialized"
-        p_value = self._get_answer_generation_prompt(questions, summary)
+        p_value = self.answer_generation_prompt.format(
+            questions=questions, summary=summary
+        )
         result = await self.llm.generate(
             prompt=p_value,
             callbacks=callbacks,
@@ -261,17 +249,19 @@ async def _get_answers(
 
         return response.answers
 
+    def adapt(self, language: str, cache_dir: str | None = None) -> None:
+        assert self.llm is not None, "set LLM before use"
 
-    def adapt(self, language: str, cache_dir: str | None = None) -> None:
-        assert self.llm is not None, "set LLM before use"
-
-        logger.info(f"Adapting summarization to {language}")
-        self.question_generation_prompt = self.question_generation_prompt.adapt(
-            language, self.llm, cache_dir
-        )
-        self.answer_generation_prompt = self.answer_generation_prompt.adapt(
-            language, self.llm, cache_dir
-        )
+        logger.info(f"Adapting summarization to {language}")
+        self.question_generation_prompt = self.question_generation_prompt.adapt(
+            language, self.llm, cache_dir
+        )
+        self.answer_generation_prompt = self.answer_generation_prompt.adapt(
+            language, self.llm, cache_dir
+        )
+        self.answer_generation_prompt = self.answer_generation_prompt.adapt(
+            language, self.llm, cache_dir
+        )
 
 
 summarization_score = SummarizationScore()
`````
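Since `coeff` and `length_penalty` are ordinary dataclass fields on `SummarizationScore` (see the hunk above), a non-default weighting would presumably be configured by instantiating the metric rather than using the module-level `summarization_score` singleton. A sketch under that assumption — the constructor keywords and import path are inferred from the diff, not documented API:

```python
from datasets import Dataset
from ragas import evaluate

# Assumption: SummarizationScore is a dataclass (it uses `field(...)` above), so its
# fields should be settable as keyword arguments; the import path mirrors the file
# touched in this commit. Treat this as a sketch, not documented API.
from ragas.metrics._summarization import SummarizationScore

# coeff controls how the QA and conciseness scores are weighted (0.5 weights them equally).
custom_summary_score = SummarizationScore(coeff=0.7, length_penalty=True)

data_samples = {
    'contexts': [["A company is launching a smartphone app to help users track their fitness goals."]],
    'summary': ['A company is launching a fitness tracking app.'],
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[custom_summary_score])
```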
