What are the most important steps in preprocessing text? #2156

gabriead · 2022-02-10T09:32:29Z

Hi community,
I want to use a custom dataset that will contain technical documents. What are the most important text preprocessing steps I should perform? Should I mark the headline (and if so, how?), the start and end of each paragraph, etc.? Currently I am using the entire text of an article, with end-of-line tags, as the context value and the headline as a meta value. What meta information is important for improving search results?

Thanks a lot!

Answered by julian-risch

julian-risch · 2022-02-10T15:07:29Z
Maintainer

Hi @gabriead, first of all, preprocessing and indexing are pretty straightforward, and I'd say the approach you describe sounds good, so you don't need to worry too much about it. You can find some tips and examples in our tutorial on preprocessing, which you have probably already found: https://haystack.deepset.ai/tutorials/preprocessing
What data to index as metadata depends on your search application and your use case. Having the headline as metadata isn't a bad idea; still, I would also add it to the article text if it's not already in there. Metadata can help with filtering search results. We have an article on that topic here: https://www.deepset.ai/blog/metadata-filtering-in-haystack Happy reading! :)
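
To make the headline advice concrete, here is a minimal sketch in plain Python. It assumes the dictionary layout with "content" and "meta" keys that Haystack document stores accept; the helper name build_document, the "headline" meta key, and the sample texts are hypothetical and only serve as illustration.

```python
def build_document(headline: str, body: str) -> dict:
    """Return a document dict with the headline in both the text and the metadata."""
    body = body.strip()
    # Add the headline to the article text only if it is not already in there.
    if headline in body:
        content = body
    else:
        content = f"{headline}\n\n{body}"
    return {"content": content, "meta": {"headline": headline}}


docs = [
    build_document("Pump maintenance", "Check the seals every 500 operating hours."),
    build_document("Valve calibration", "Valve calibration\n\nUse the supplied gauge."),
]

# Metadata can later narrow down the candidates; here a plain list filter
# stands in for a metadata filter applied at query time.
pump_docs = [d for d in docs if d["meta"]["headline"] == "Pump maintenance"]
print(pump_docs[0]["content"])
```

Keeping the headline in both places means it can be matched by the retriever as part of the text and still be used for filtering.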

vibha0411 · 2022-12-12T20:46:10Z

Hi, I get a lot of \n characters when I convert my PDF to text.
Is it possible to remove '\n' characters by using the Haystack preprocessing pipeline?
Or is there some alternative to this?

vblagoje Dec 13, 2022
Maintainer

cc @bogdankostic

bogdankostic Dec 13, 2022

Hi @vibha0411, please have a look at our PreProcessor node, we provide two options to remove \n chars: clean_empty_lines and clean_whitespace.

vibha0411 · 2022-12-13T14:44:53Z

Thank you for your quick response :) Yes, I have tried clean_empty_lines. However, as far as I understand, it only removes 3 consecutive empty lines, not a single one. Hence I still end up with a lot of '\n' in my text.
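
The effect described here can be reproduced with a small standalone sketch. It does not call Haystack; it only assumes, as stated in this thread, that the cleaning step collapses runs of three or more consecutive newlines and leaves shorter runs alone. The function name collapse_blank_runs and the sample text are hypothetical.

```python
import re


def collapse_blank_runs(text: str) -> str:
    """Collapse runs of three or more newlines into two (hypothetical stand-in)."""
    return re.sub(r"\n{3,}", "\n\n", text)


pdf_text = "Section 1\n\n\n\nFirst line of a paragraph\nsecond line of the same paragraph\n"
cleaned = collapse_blank_runs(pdf_text)

# The long run after the heading shrinks, but the single newlines that
# PDF conversion leaves inside a paragraph are still present.
print(repr(cleaned))
print("remaining newlines:", cleaned.count("\n"))
```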

bogdankostic Dec 15, 2022

If your goal is simply to replace every newline char with a single whitespace char, you can directly access the content field of each document and call Python's built-in replace method:

document.content = document.content.replace("\n", " ")
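
Applied to a whole list of documents, the same idea might look like the sketch below. The Doc dataclass is a hypothetical stand-in for any object with a content attribute, and using a regular expression instead of str.replace is a swapped-in variant that also collapses runs of newlines and avoids double spaces.

```python
import re
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a document object with a `content` field."""
    content: str


def flatten_newlines(text: str) -> str:
    """Replace each run of newlines (plus adjacent spaces or tabs) with one space."""
    return re.sub(r"[ \t]*\n+[ \t]*", " ", text).strip()


documents = [Doc("first line\nsecond line\n\nthird line \n"), Doc("no newline here")]
for document in documents:
    document.content = flatten_newlines(document.content)

print([d.content for d in documents])
```

Note that this also removes paragraph breaks, so it is best run only when paragraph boundaries are not needed for later splitting.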

jeromemassot Dec 31, 2022

An option to remove "\n" should be added to the preprocessing options, IMO. It is not very convenient to fall back on str.replace when Haystack offers a dedicated PreProcessor that handles every text-cleaning job except this one, which is a very common one. Thanks.

vibha0411 · 2022-12-15T16:43:48Z

Thanks, this is really helpful.
