Hi,
I was wondering if you have any insights on how the pre-trained evaluators perform when applied to a task they weren't trained on.
For example, using unieval-sum for a different text generation task (not summarization) where the same evaluation aspects matter to us, roughly as in the sketch below.
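For concreteness, I'd be calling it roughly like this (a sketch following the summarization example in the README, if I'm reading it right; the `src_list`/`ref_list`/`output_list` contents are placeholders for our own task's data):

```python
# Sketch: reusing the summarization evaluator (unieval-sum) on outputs
# from a different generation task. Based on the usage shown in the
# UniEval README; the list contents below are placeholders for our data.
from utils import convert_to_json
from metric.evaluator import get_evaluator

# Inputs, references, and model outputs from our (non-summarization) task
src_list = ['<task input / context>']
ref_list = ['<human-written reference>']
output_list = ['<model output to be scored>']

# Pack the data the way the pre-trained evaluators expect it
data = convert_to_json(output_list=output_list,
                       src_list=src_list,
                       ref_list=ref_list)

# 'summarization' loads unieval-sum and scores coherence, consistency,
# fluency, and relevance -- the same aspects we care about for our task
evaluator = get_evaluator('summarization')
scores = evaluator.evaluate(data, print_result=True)
```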
Do you think we can still rely on the scores, at least for comparing models against each other?