Description
Hi!
I have a question about how question sets are generated for the Summarization Score metric. I am working on producing high-quality summaries and need reliable evaluation metrics to assess them, and the Summarization Score metric has been very useful for that.
However, I am running into issues with the volatility of the scores: even though the input is always the same, the generated questions differ on every run. Is there a way to keep the question set consistent when the input is the same?
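One direction I have been considering is pinning the generator LLM to deterministic decoding (temperature 0). This is only a rough sketch of the idea; the metric class name and the `llm=` constructor argument are assumptions on my side, and even at temperature 0 most hosted LLM APIs are not strictly deterministic, so this may only reduce (not eliminate) the drift:

```python
# Sketch: pass a temperature-0 LLM to the metric so question generation
# is as deterministic as the backend allows. Class/argument names assumed.
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import SummarizationScore  # assumed export name

deterministic_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))
metric = SummarizationScore(llm=deterministic_llm)  # then use `metric` with evaluate() as usual
```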
I also have an idea related to this: if a consistent question set cannot be generated with an LLM, what do you think about scoring the summary multiple times (n times) and averaging the results to get a more stable score? I have included a rough sketch of what I mean below. If you have any other good suggestions, I would greatly appreciate them.
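To make the averaging idea concrete, here is a minimal sketch. `score_once` and `run_metric_on_my_sample` are placeholders for whatever call actually produces a single Summarization Score for my fixed input (they are not ragas functions); reporting the standard deviation alongside the mean would also show how volatile the metric still is:

```python
import statistics
from typing import Callable

def averaged_score(score_once: Callable[[], float], n: int = 5) -> tuple[float, float]:
    """Score the same fixed input n times and return (mean, stdev).

    `score_once` is a placeholder for whatever call produces one
    Summarization Score for my sample — it is not a real ragas function.
    """
    scores = [score_once() for _ in range(n)]
    spread = statistics.stdev(scores) if n > 1 else 0.0
    return statistics.mean(scores), spread

# e.g. mean_score, spread = averaged_score(lambda: run_metric_on_my_sample(), n=5)
```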