1.0.0 #1857

tholor · 2021-12-08T08:05:23Z

tholor
Dec 8, 2021
Maintainer

🎁 Haystack 1.0

We worked hard to bring you an early Christmas present: 1.0 is out! Over the last few months, we redesigned many essential parts of Haystack, introduced new features, and simplified many user-facing methods. We believe Haystack is now much easier to use and a solid base for the many exciting features we have planned. This release is a major milestone on our journey with you, the community, and we want to thank you again for all the great contributions, discussions, questions, and bug reports that helped us build a better Haystack. This journey has just started 🚀

⭐ Highlights

Improved Evaluation of Pipelines

Evaluation helps you find out how well your system is doing on your data. This includes Pipeline level evaluation to ensure that the system's output is really what you're after, but also Node level evaluation so that you can figure out whether it's your Reader or Retriever that is holding back the performance.

In this release, evaluation is much simpler and cleaner to perform. All the functionality is now baked into the Pipeline class and you can kick off the process by providing Label or MultiLabel objects to the Pipeline.eval() method.

eval_result = pipeline.eval(
    labels=labels,
    params={"Retriever": {"top_k": 5}},
)

The output is an EvaluationResult object, which stores each Node's prediction for each sample in a Pandas DataFrame, so you can easily inspect granular predictions and potential mistakes without re-running the whole thing. The EvaluationResult.calculate_metrics() method returns the relevant metrics for your evaluation, and you can print a convenient summary report via the new Pipeline.print_eval_report() method.

metrics = eval_result.calculate_metrics()

pipeline.print_eval_report(eval_result)
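To make the difference between Pipeline-level and Node-level evaluation concrete, here is a minimal, library-free sketch. It is not how Haystack computes its metrics: the record layout and the helper names `retriever_recall` and `reader_exact_match` are hypothetical, chosen only to show why per-node numbers help locate a bottleneck.

```python
# Illustrative only: plain-Python stand-ins for per-sample predictions.
# The record layout and function names are hypothetical, not Haystack's API.

samples = [
    {"gold_doc": "d1", "retrieved": ["d1", "d7"], "gold_answer": "Eddard", "predicted": "Eddard"},
    {"gold_doc": "d2", "retrieved": ["d2", "d3"], "gold_answer": "Winterfell", "predicted": "The North"},
    {"gold_doc": "d4", "retrieved": ["d5", "d6"], "gold_answer": "Nymeria", "predicted": "Lady"},
]


def retriever_recall(records):
    """Fraction of samples whose gold document is among the retrieved ones."""
    hits = sum(1 for r in records if r["gold_doc"] in r["retrieved"])
    return hits / len(records)


def reader_exact_match(records):
    """Fraction of samples whose predicted answer equals the gold answer."""
    hits = sum(
        1
        for r in records
        if r["predicted"].strip().lower() == r["gold_answer"].strip().lower()
    )
    return hits / len(records)


recall = retriever_recall(samples)
exact_match = reader_exact_match(samples)
```

In this toy data the Retriever finds the right document for two of three samples, while the Reader answers only one correctly. The second sample points at the Reader (right document, wrong answer) and the third at the Retriever (gold document never retrieved).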

If you'd like to start evaluating your own systems on your own data, check out our Evaluation Tutorial!

Table QA

A lot of valuable information is stored in tables - we've heard this again and again from the community. While they are an efficient structured data format, it hasn't been possible to search for table contents using traditional NLP techniques. But now, with the new TableTextRetriever and TableReader our users have all the tools they need to query for relevant tables and perform Question Answering.

The TableTextRetriever is the result of our team's research into table retrieval methods, which you can read about in this paper that was presented at EMNLP 2021. Behind the scenes, it uses three transformer-based encoders: one for text passages, one for tables, and one for the query. However, in Haystack, you can swap it in for any other dense retrieval model and start working with tables. The TableReader is built upon the TAPAS model. When handed Documents that contain tables, it can return a single cell as an answer or perform an aggregation operation on a set of cells to form a final answer.

retriever = TableTextRetriever(
    document_store=document_store,
    query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
    passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
    table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
    embed_meta_fields=["title", "section_title"]
)

reader = TableReader(
    model_name_or_path="google/tapas-base-finetuned-wtq",
    max_seq_len=512
)

Have a look at the Table QA documentation if you'd like to learn more or dive into the Table QA tutorial to start unlocking the information in your table data.
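As a rough picture of what "return a single cell or aggregate a set of cells" means, the toy sketch below combines selected cells into an answer. It is not TAPAS or the TableReader: the table, the cell selection, the helper `answer_from_cells`, and the operator names are all assumptions made for illustration.

```python
# Toy illustration of "select cells, then aggregate"; this is not TAPAS or
# the TableReader. The table, selection, and helper name are hypothetical.

table = {
    "Mountain": ["Mount Everest", "K2", "Kangchenjunga"],
    "Height (m)": ["8848", "8611", "8586"],
}


def answer_from_cells(table, cells, aggregation="NONE"):
    """Combine selected (column, row) cells into one answer string."""
    values = [table[column][row] for column, row in cells]
    if aggregation == "NONE":
        return ", ".join(values)
    if aggregation == "COUNT":
        return str(len(values))
    numbers = [float(v) for v in values]
    if aggregation == "SUM":
        return str(sum(numbers))
    if aggregation == "AVERAGE":
        return str(sum(numbers) / len(numbers))
    raise ValueError(f"Unknown aggregation: {aggregation}")


# A single selected cell is returned as-is.
single_cell = answer_from_cells(table, [("Mountain", 0)])

# Several selected cells are reduced to one value.
average_height = answer_from_cells(
    table,
    [("Height (m)", 0), ("Height (m)", 1), ("Height (m)", 2)],
    "AVERAGE",
)
```

In a real model, both the cell selection and the choice of operator are predicted from the query; here they are passed in by hand.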

Improved Debugging of Pipelines & Nodes

We've made debugging much simpler and also more informative! As long as your node receives a boolean debug argument, it can propagate its input, output or even some custom information to the output of the pipeline. It is now a built-in feature of all existing nodes and can also easily be inherited by your custom nodes.

result = pipeline.run(
    query="Who is the father of Arya Stark?",
    params={
        "debug": True
    }
)

{'ESRetriever': {'input': {'debug': True,
                           'query': 'Who is the father of Arya Stark?',
                           'root_node': 'Query',
                           'top_k': 1},
                 'output': {'documents': [<Document: {'content': "\n===In the Riverlands===\nThe Stark army reaches the Twins, a bridge strong", ...}>]
                            ...}

To find out more about this feature, check out debugging. To learn how to define custom debug information, have a look at custom debugging.
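Debug output for a multi-node pipeline can get long, so a small helper for skimming it may be handy. The sketch below assumes only the shape shown in the example output (node name mapped to "input" and "output" dictionaries); `summarize_debug` is a hypothetical helper, not part of Haystack, and the sample data is written by hand.

```python
# Hypothetical helper for skimming debug output shaped like the example above:
# {node_name: {"input": {...}, "output": {...}}}. Not part of Haystack.


def summarize_debug(debug_info):
    """Return one line per node listing the keys it received and produced."""
    lines = []
    for node_name, record in debug_info.items():
        input_keys = sorted(record.get("input", {}))
        output_keys = sorted(record.get("output", {}))
        lines.append(f"{node_name}: in={input_keys} out={output_keys}")
    return lines


# Hand-written sample data mirroring the structure shown above.
sample_debug = {
    "ESRetriever": {
        "input": {
            "debug": True,
            "query": "Who is the father of Arya Stark?",
            "root_node": "Query",
            "top_k": 1,
        },
        "output": {"documents": ["<Document ...>"]},
    }
}

summary = summarize_debug(sample_debug)
```

Pass in whichever part of your pipeline result holds the per-node debug information.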

FARM Migration

Those of you following Haystack from its first days will know that Haystack first evolved out of the FARM framework. While FARM is designed to handle diverse NLP models and tasks, Haystack gives full end-to-end support to search and question answering use cases with a focus on coordinating all components that take a proof-of-concept into production.

Haystack has always relied on FARM for much lower-level processing and modeling. To reduce the implementation overhead and simplify debugging, we have migrated the relevant parts of FARM into the new haystack/modeling package.

⚠️ Breaking Changes & Migration Guide

Migration to v1.0

With the release of v1.0, we decided to make some bold changes.
We believe this has brought a significant improvement in usability and makes the project more future-proof.
This does come with a few breaking changes, and we will do our best to guide you on how to go from v0.x to v1.0.
For more details see the Migration Guide and if you need more guidance, just reach out via Slack.

New Package Structure & Changed Imports

Due to the ever-increasing number of Nodes and Document Stores being integrated into Haystack,
we felt the need to implement a repository structure that makes it easier to navigate to what you're looking for. We've also shortened the length of the imports.

haystack.document_stores

All Document Stores can now be directly accessed from here
Note the pluralization of document_store to document_stores

haystack.nodes

This directory directly contains any class that can be used as a node
This includes File Converters and PreProcessors

haystack.pipelines

This contains all the base, custom and pre-made pipeline classes
Note the pluralization of pipeline to pipelines

haystack.utils

Any utility functions

➡️ For the large majority of imports, the old style still works but this will be deprecated in future releases!
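If you want to update a codebase in bulk, the two pluralizations noted above can be applied mechanically. The following is a minimal sketch under that assumption; `rewrite_import` is a hypothetical helper, and it deliberately ignores every other moved module, which still needs a manual check against the Migration Guide.

```python
import re

# Only the two renames stated above are handled; other moved modules
# (for example node classes now under haystack.nodes) need manual edits.
_RENAMES = {
    "haystack.document_store": "haystack.document_stores",
    "haystack.pipeline": "haystack.pipelines",
}

# The negative lookahead stops already-pluralized names from being changed.
_PATTERN = re.compile(
    r"\b(haystack\.document_store|haystack\.pipeline)(?![A-Za-z0-9_])"
)


def rewrite_import(line):
    """Pluralize the old package names in one line of source code."""
    return _PATTERN.sub(lambda match: _RENAMES[match.group(1)], line)


old_line = "from haystack.pipeline import ExtractiveQAPipeline"
new_line = rewrite_import(old_line)
```

Running the helper over a line that already uses the new names leaves it unchanged, so it is safe to apply more than once.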

Primitive Objects

Instead of relying on dictionaries, Haystack now standardizes more of the inputs and outputs of Nodes using primitive classes such as Document, Answer, Label, and MultiLabel.

With these, there is now support for data structures beyond text and the REST API schema is built around their structure.
Using these classes also allows for the autocompletion of fields in your IDE.

Tip: To see examples of these primitive classes being returned, have a look at Ready-Made Pipelines.

Many of the fields in these classes have also been renamed or removed.
You can see a more comprehensive list of them in this Github issue.
Below, we will go through a few cases that are likely to impact established workflows.

Input Document Format

This dictionary schema used to be the recommended way to prepare your data to be indexed.
Now we strongly recommend using our dedicated Document class as a replacement.
The text field has been renamed content to accommodate for cases where it is used for another data format,
for example in Table QA.

v0.x:

doc = {
    'text': 'DOCUMENT_TEXT_HERE',
    'meta': {'name': DOCUMENT_NAME, ...}
}

v1.0:

doc = Document(
    content='DOCUMENT_TEXT_HERE',
    meta={'name': DOCUMENT_NAME, ...}
)

From here, you can take the same steps to write Documents into your Document Store.

document_store.write_documents([doc])
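If you have many v0.x-style dictionaries lying around, the rename from text to content can be scripted before constructing Document objects. The sketch below is a hypothetical migration helper (`to_v1_fields` is not part of Haystack) that works on plain dictionaries only, so it runs without Haystack installed.

```python
# Hypothetical migration helper, not part of Haystack: it only renames the
# "text" key to "content" so the result matches the v1.0 field names.


def to_v1_fields(old_doc):
    """Return a copy of a v0.x document dict using v1.0 field names."""
    new_doc = {key: value for key, value in old_doc.items() if key != "text"}
    if "text" in old_doc:
        new_doc["content"] = old_doc["text"]
    return new_doc


old_docs = [
    {"text": "Arya Stark is the daughter of Eddard Stark.", "meta": {"name": "arya.txt"}},
    {"text": "Winterfell is the seat of House Stark.", "meta": {"name": "winterfell.txt"}},
]
new_docs = [to_v1_fields(doc) for doc in old_docs]

# With Haystack installed, each entry could then be turned into
# Document(content=d["content"], meta=d["meta"]) as shown above.
```

The original dictionaries are left untouched, which makes it easy to compare old and new data during the migration.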

Response format of Reader

All Reader Nodes now return Answer objects instead of dictionaries.

v0.x:

[
    {
        'answer': 'Fang',
        'score': 13.26807975769043,
        'probability': 0.9657130837440491,
        'context': """Криволапик (Kryvolapyk, kryvi lapy "crooked paws")
            ===Fang (Hagrid's dog)===
            *Chinese (PRC): 牙牙 (ya2 ya) (from 牙 "tooth", 牙,"""
    }
]

v1.0:

[
    <Answer {'answer': 'Eddard', 'type': 'extractive', 'score': 0.9946763813495636, 'context': "s Nymeria after a legendary warrior queen. She travels...", 'offsets_in_document': [{'start': 147, 'end': 153}], 'offsets_in_context': [{'start': 72, 'end': 78}], 'document_id': 'ba2a8e87ddd95e380bec55983ee7d55f', 'meta': {'name': '43_Arya_Stark.txt'}}>,
    <Answer {'answer': 'King Robert', 'type': 'extractive', 'score': 0.9251320660114288, 'context': 'ordered by the Lord of Light. Melisandre later reveals to Gendry that...', 'offsets_in_document': [{'start': 1808, 'end': 1819}], 'offsets_in_context': [{'start': 70, 'end': 81}], 'document_id': '7b67b0e27571c2b2025a34b4db18ad49', 'meta': {'name': '349_List_of_Game_of_Thrones_characters.txt'}}>,
    <Answer {'answer': 'Ned', 'type': 'extractive', 'score': 0.8103329539299011, 'context': " girl disguised as a boy all along and is surprised to learn she is Arya...", 'offsets_in_document': [{'start': 920, 'end': 923}], 'offsets_in_context': [{'start': 74, 'end': 77}], 'document_id': '7b67b0e27571c2b2025a34b4db18ad49', 'meta': {'name': '349_List_of_Game_of_Thrones_characters.txt'}}>,
    ...
]

Label Structure

The attributes of the Label object have gone through some changes.
To see their current structure see Label.

v0.x:

label = Label(
    question=QUESTION_TEXT_HERE,
    answer=ANSWER_STRING_HERE,
    ...
)

v1.0:

label = Label(
    query=QUERY_TEXT_HERE,
    answer=Answer(...),
    ...
)

REST API Format

The response format of the /query endpoint matches that of the primitive objects, only in JSON form.
This means there are similar breaking changes to those described above for the Answer format of a Reader.
Particularly, the names of the offset fields have changed and need to be aligned to the new format when coming from Haystack v0.x.
For detailed examples and guidance see the Migration Guide.
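For client code, the practical consequence is that offsets now arrive as lists of start/end pairs. The sketch below shows one way to read them from a JSON body. The payload is written by hand: the top-level "answers" key and the document text are assumptions, while the answer fields mirror the Answer example in the Reader section. `answer_span` is a hypothetical helper.

```python
import json

# Hand-written payload for illustration. The top-level "answers" key and the
# document text are assumptions; the answer fields mirror the Answer example.
response_body = json.dumps(
    {
        "answers": [
            {
                "answer": "Eddard",
                "type": "extractive",
                "score": 0.99,
                "offsets_in_document": [{"start": 30, "end": 36}],
                "document_id": "doc-1",
            }
        ]
    }
)
document_text = "Arya Stark is the daughter of Eddard Stark."


def answer_span(answer, text):
    """Slice the answer out of the document text using the first offset pair."""
    span = answer["offsets_in_document"][0]
    return text[span["start"]:span["end"]]


first_answer = json.loads(response_body)["answers"][0]
extracted = answer_span(first_answer, document_text)
```

Checking that the sliced text equals the returned answer string is a quick way to confirm a client reads the new offset fields correctly.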

Other breaking changes

Save/load of FAISSDocumentStore by @ZanSara in FAISS document store should save it's configuration alongside the index #1459
Add AzureConverter & change response format of FileConverter.convert() by @bogdankostic in Add AzureConverter #1813

🤓 Detailed Changes

New Contributors

@mathislucka made their first contribution in feat: normalize embeddings for faiss cosine similarity #1352
@ZanSara made their first contribution in FAISS document store should save it's configuration alongside the index #1459
@ju-gu made their first contribution in changed delete_all_documents to delete_documents in Tutorial5 #1477
@adithyaur99 made their first contribution in Update sql.py to ignore multi thread issues. #1442
@mhamdan91 made their first contribution in Adding TfidfRetriever to __init__.py of the retriever package #1575
@CandiceYu8 made their first contribution in [fix] MySQL connection 'check_same_thread' error #1585
@gak97 made their first contribution in Add checkpointing for reader.train() to allow stopping + resuming training #1554
@fingoldo made their first contribution in Cosine similarity for the rest of DocStores. #1569
@AlonEirew made their first contribution in fix issue #1687 - DPR training fails on multiple GPU's #1688
@tstadel made their first contribution in Tutorial for DocumentClassifier at Index Time #1697
@nishanthcgit made their first contribution in Capitalize starting letter in params #1750
@ArzelaAscoIi made their first contribution in Huggingface private model support via API tokens (FARMReader) #1775
@SjSnowball made their first contribution in Introduced an arg to add synonyms - Elasticsearch #1625
@AhmedIdr made their first contribution in Added max_seq_length and batch_size params to embeddingretriever #1817
@gabinguo made their first contribution in Fix bug ranker: wrong lambda function #1824

❤️ Thanks to all contributors and the whole community!

This discussion was created from the release 1.0.0.

lalitpagaria · 2021-12-08T08:11:34Z

lalitpagaria
Dec 8, 2021

Awesome! Onward and upward 🚀🚀🚀

0 replies

mapapa · 2022-01-10T17:42:40Z

mapapa
Jan 10, 2022

Very nice, quick question, when are you planning to release the master repo that has the ray>=1.9.1 change that resolves the log4j security issue? Many thanks

2 replies

tholor Jan 10, 2022
Maintainer Author

We will probably do a minor release next week (latest) that includes this change

mapapa Jan 10, 2022

Amazing, much appreciated, many thanks for all your wonderful work!

mapapa · 2022-01-10T23:23:08Z

mapapa
Jan 10, 2022

Hihi,

Is it possible to upgrade Pillow from 8.3.2 to 9.0.0 because:

The Pillow package for Python contains a flaw in pil/pdfparser.py that is triggered as carriage return characters are not properly handled in a regular expression. This may allow a context-dependent attacker to hang or slow down a Python process using the library.

Severity Source: CVSS V3 from RBS

Many thanks,
Manos

4 replies

julian-risch Jan 11, 2022
Maintainer

Hi @mapapa, thanks for mentioning this. 👍 Could you please create an issue for that via this link? If you would like to, we would very much appreciate it if you created a pull request with that small change. It's basically about changing this line:

haystack/requirements.txt

Line 39 in a44b6c1

pillow==8.3.2

We will take care of running tests then. If you don't have time for the pull request, just let us know and we'll take over.

julian-risch Jan 11, 2022
Maintainer

So far, I don't see a reason why not to upgrade the version. 9.0.0 doesn't support python 3.6 anymore (https://pillow.readthedocs.io/en/stable/installation.html) but we dropped the support for python 3.6 anyway here: #1059

mapapa Jan 11, 2022

Hi @julian-risch, I have raised a new issue and tried to push the respective code change but I am getting the following error:

remote: Permission to deepset-ai/haystack.git denied to mapapa

Thanks

julian-risch Jan 11, 2022
Maintainer

Thanks for creating the issue. I responded there: #1988
