Tutorial 8, preprocessing: inconsistent suggested word count when splitting #3103
stevenhaley started this conversation in General
Replies: 1 comment 1 reply
-
Ah yes, I think it should be consistent between the two. So, as you say, it would be 100 everywhere. We'll fix this soon-ish. Or please feel free to create an issue from it and start a PR. :) cc: @julian-risch
-
The Jupyter notebook and the Python file for Tutorial 8 (Preprocessing/splitting) are inconsistent about what document length is ideal. This is in the section with the header "PreProcessor": they sometimes use 100 and sometimes 1_000.
The notebook's markdown documentation recommends "100 words for dense retrieval methods" and uses that value in the code (split_by="word", split_length=100), but its in-code comments suggest 1_000. The Python file, on the other hand, uses 1_000 both in the comments and in the code itself (split_by="word", split_length=1000). I assume that's meant to be 100 everywhere, as sentence transformers such as multi-qa-mpnet-base-dot-v1 have a maximum of 512 word pieces.
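For reference, here is a minimal sketch of what a consistent setup might look like, assuming 100 is the intended split length (parameter names follow the Haystack v1 PreProcessor API used in the tutorial; the example document is hypothetical):

```python
from haystack import Document
from haystack.nodes import PreProcessor

# Hypothetical document standing in for the tutorial's converted files.
doc = Document(content="Some long text about the Game of Thrones universe ...")

# Assuming 100 is the intended value, matching the notebook's markdown
# recommendation of "100 words for dense retrieval methods".
preprocessor = PreProcessor(
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
)
split_docs = preprocessor.process(documents=[doc])
print(len(split_docs))

# Why 1_000 words would be too long: the dense retriever's underlying
# sentence-transformers model truncates inputs at its maximum sequence
# length, which the multi-qa-mpnet-base-dot-v1 model card gives as
# 512 word pieces.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")
print(model.max_seq_length)  # expected: 512
```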
Also, if you're updating the code file anyway, you could fix the broken link https://haystack.deepset.ai/docs/latest/preprocessingmd.
PS: I wasn't sure if this was technically a bug report or a discussion. I figure it's low priority, so a discussion is better.