Tutorial 8, preprocessing: inconsistent suggested word count when splitting #3103
stevenhaley started this conversation in General
Replies: 1 comment 1 reply
-
Ah yes, I think it should be consistent between the two. So, as you say, it would be 100 everywhere. We'll fix this soon-ish. Or please feel free to create an issue from it and start a PR. :) cc: @julian-risch
-
The Jupyter notebook and the Python file for Tutorial 8 (Preprocessing/splitting) are inconsistent about what document length is ideal. This is in the section with the header "PreProcessor": they sometimes use 100 and sometimes 1_000.
The notebook's markdown documentation recommends "100 words for dense retrieval methods" and uses that value in the code (split_by="word", split_length=100), but its in-code comments suggest 1_000. The Python file, on the other hand, uses 1_000 both in the comments and in the code itself (split_by="word", split_length=1000). I assume that's meant to be 100 everywhere, as sentence transformers such as multi-qa-mpnet-base-dot-v1 have a maximum of 512 word pieces.
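For reference, here is a minimal sketch of what a consistent setup might look like, assuming 100 is the intended split length (parameter names follow the Haystack v1 PreProcessor API used in the tutorial; the example document is hypothetical):

```python
from haystack import Document
from haystack.nodes import PreProcessor

# Hypothetical document standing in for the tutorial's converted files.
doc = Document(content="Some long text about the Game of Thrones universe ...")

# Assuming 100 is the intended value, matching the notebook's markdown
# recommendation of "100 words for dense retrieval methods".
preprocessor = PreProcessor(
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
)
split_docs = preprocessor.process(documents=[doc])
print(len(split_docs))

# Why 1_000 words would be too long: the dense retriever's underlying
# sentence-transformers model truncates inputs at its maximum sequence
# length, which the multi-qa-mpnet-base-dot-v1 model card gives as
# 512 word pieces.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")
print(model.max_seq_length)  # expected: 512
```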
Also, if you're updating the code file anyway, you could fix the broken link https://haystack.deepset.ai/docs/latest/preprocessingmd.
PS: I wasn't sure if this was technically a bug report or a discussion. I figure it's low priority, so a discussion is better.