Are evaluation data self sufficient? #3527

nsorros · 2022-11-04T09:54:53Z

I am using the annotation tool to create question-answer pairs and then loading that data as described in the evaluation guide https://haystack.deepset.ai/tutorials/05_evaluation to assess my QA system.

I am wondering whether I also need to load the initial documents, i.e. the ones that were used for annotation, into the document store on top of the evaluation data in order for the evaluation to work?

I ask because I am seeing errors with some combinations. For example, when I use an in-memory document store with a TF-IDF retriever and load only the evaluation data, not the initial docs, I get "Retrieval requires dataframe df and tf-idf matrix but fit() did not calculate them probably due to an empty document store." This is resolved when I also load the initial docs 🤔

bogdankostic · 2022-11-07T15:13:10Z

Hi @nsorros! You are using TfidfRetriever, right?
Calling add_eval_data on your DocumentStore should be enough to also load the corresponding Documents that the evaluation runs on. I suspect there is a mismatch in index names here. Please make sure that the value you pass to the index parameter when initializing the DocumentStore is the same as the value you pass to the doc_index parameter when calling add_eval_data.
I adapted Tutorial 5 to work with TfidfRetriever instead of BM25Retriever in the following way, which worked fine for me:

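from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import PreProcessor

host = "localhost"  # assumption: Elasticsearch runs locally; adjust to your setup

# The same two index names are used for the DocumentStore and for add_eval_data
doc_index = "tutorial5_docs"
label_index = "tutorial5_labels"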

document_store = ElasticsearchDocumentStore(
    host=host,
    username="",
    password="",
    index=doc_index,  # IMPORTANT: this needs to be same as doc_index param in add_eval_data
    label_index=label_index,
    embedding_field="emb",
    embedding_dim=768,
    excluded_meta_data=["emb"],
)

preprocessor = PreProcessor(
    split_by="word",
    split_length=200,
    split_overlap=0,
    split_respect_sentence_boundary=False,
    clean_empty_lines=False,
    clean_whitespace=False,
)

document_store.add_eval_data(
    filename="data/tutorial5/nq_dev_subset_v2.json",
    doc_index=doc_index,  # IMPORTANT: this needs to be same as index param when initialising DocumentStore
    label_index=label_index,
    preprocessor=preprocessor,
)

Let me know if you have further questions :)
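
A quick way to see why the evaluation file can be enough on its own: add_eval_data reads a SQuAD-formatted file, and that format stores each document text ("context") next to its questions ("qas"). The sketch below counts both using only the standard library. The helper names summarize_squad and summarize_squad_file and the inline example are hypothetical and not part of Haystack; the field names assume the usual SQuAD layout.

```python
import json


def summarize_squad(squad: dict) -> dict:
    """Count document texts (contexts) and questions in SQuAD-style data."""
    n_contexts = 0
    n_questions = 0
    for article in squad.get("data", []):
        for paragraph in article.get("paragraphs", []):
            # Each paragraph carries the document text the labels refer to.
            if paragraph.get("context"):
                n_contexts += 1
            n_questions += len(paragraph.get("qas", []))
    return {"contexts": n_contexts, "questions": n_questions}


def summarize_squad_file(path: str) -> dict:
    """Same summary, reading the SQuAD-style JSON from a file."""
    with open(path, encoding="utf-8") as f:
        return summarize_squad(json.load(f))


# Tiny made-up example in the SQuAD layout (not real evaluation data).
example = {
    "data": [
        {
            "title": "doc1",
            "paragraphs": [
                {
                    "context": "Haystack is an open source framework.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is Haystack?",
                            "answers": [
                                {"text": "an open source framework", "answer_start": 12}
                            ],
                        }
                    ],
                }
            ],
        }
    ]
}

print(summarize_squad(example))
```

If summarize_squad_file on your exported file reports zero contexts, the file holds no document texts, and an empty document store after add_eval_data would be expected regardless of the index names.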

nsorros · 2022-11-11T14:19:48Z

I did check the indices to make sure I was using the same ones, but I must have been doing something wrong. I redid it today and it works 👍 I will update if there are any issues. Thanks
