
RAGAS with huggingface models #1090

@SalvatoreRa

Description


Describe the bug

I tried using RAGAS with a model that is not from OpenAI. Regardless of which model I use, I get this error back:

File /opt/conda/lib/python3.10/site-packages/ragas/evaluation.py:237, in evaluate(dataset, metrics, llm, embeddings, callbacks, in_ci, is_async, run_config, raise_exceptions, column_map)
    235 results = executor.results()
    236 if results == []:
--> 237     raise ExceptionInRunner()
    239 # convert results to dataset_like
    240 for i, _ in enumerate(dataset):

ExceptionInRunner: The runner thread which was running the jobs raised an exeception. Read the traceback above to debug it. You can also pass raise_exceptions=False incase you want to show only a warning message instead.

/opt/conda/lib/python3.10/site-packages/ipykernel/iostream.py:123: RuntimeWarning: coroutine 'as_completed.<locals>.sema_coro' was never awaited
  await self._event_pipe_gc()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback

I worked around this with:

import nest_asyncio
nest_asyncio.apply()

However, while it no longer raises an error, it now returns:
{'faithfulness': nan, 'answer_relevancy': nan, 'context_utilization': nan}
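
My guess is that the embeddings object may be part of the problem: evaluate() seems to expect a LangChain-compatible Embeddings object (something exposing embed_documents()/embed_query()) rather than a raw SentenceTransformer. A minimal sketch of what I mean, assuming langchain's HuggingFaceEmbeddings wrapper is an acceptable substitute (this is an assumption on my part, not something I found in the docs):

from langchain.embeddings import HuggingFaceEmbeddings

# Hypothetical substitute for the raw SentenceTransformer used below:
# exposes the embed_documents()/embed_query() interface that LangChain
# (and, I assume, ragas) calls.
embedding_model = HuggingFaceEmbeddings(model_name="microsoft/mpnet-base")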

Code to Reproduce

import pandas as pd
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
from langchain import HuggingFacePipeline
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_utilization
)
from ragas import evaluate
from datasets import Dataset

import nest_asyncio
nest_asyncio.apply()

# embedding model
embedding_model = SentenceTransformer("microsoft/mpnet-base")

# evaluator
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

device = 0  # Use GPU (0 is typically the first GPU device)

pipe = pipeline(
    model=model,
    tokenizer=tokenizer,
    device=device,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    temperature=0.1,
    do_sample=True,
    max_new_tokens=200,
    repetition_penalty=1.1,  # without this the output begins repeating
)

evaluator = HuggingFacePipeline(pipeline=pipe)

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], 
    ['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
}
dataset = Dataset.from_dict(data_samples)

# ragas
result = evaluate(
    dataset=dataset,
    llm=evaluator,
    embeddings=embedding_model,
    raise_exceptions=False,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_utilization,
    ]
)

print(result)
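
For completeness, here is what I would try next: passing the LLM and embeddings through ragas' LangChain wrappers explicitly, and re-running with raise_exceptions=True so the underlying exception surfaces instead of being swallowed into NaN. This is only a sketch based on my reading of the API; LangchainLLMWrapper / LangchainEmbeddingsWrapper are my assumption about the intended entry points:

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain.embeddings import HuggingFaceEmbeddings

# Explicitly wrap the LangChain pipeline and a LangChain embeddings object
# so evaluate() receives the types it (presumably) expects.
ragas_llm = LangchainLLMWrapper(evaluator)
ragas_embeddings = LangchainEmbeddingsWrapper(
    HuggingFaceEmbeddings(model_name="microsoft/mpnet-base")
)

result = evaluate(
    dataset=dataset,
    llm=ragas_llm,
    embeddings=ragas_embeddings,
    raise_exceptions=True,  # surface the underlying exception instead of NaN
    metrics=[faithfulness, answer_relevancy, context_utilization],
)
print(result)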

Error trace
No error is raised; the run completes, but every metric comes back as NaN.
Expected behavior
It should return numeric values for the evaluation metrics.

Thank you very much for your help
