What are the most important steps in preprocessing text? #2156
-
Hi community, Thanks a lot!
Replies: 4 comments 4 replies
-
Hi @gabriead, first of all, preprocessing and indexing are pretty straightforward, and I'd say the approach you describe sounds good, so you don't need to worry too much about it. You can find some tips and examples in our tutorial on preprocessing, which you have probably already found: https://haystack.deepset.ai/tutorials/preprocessing
-
Hi, I get a lot of \n characters when I convert my PDF to text.
-
Thank you for your quick response :)
Yes, I have tried clean_empty_lines.
However, as far as I understand, it only removes runs of
3 consecutive empty lines, not single ones.
Hence I end up with a lot of '\n' in my text.
On Tue, Dec 13, 2022 at 2:40 PM bogdankostic wrote:
Hi @vibha0411, please have a look at our PreProcessor node
(https://docs.haystack.deepset.ai/docs/preprocessor). It provides
two options to remove \n chars: clean_empty_lines and
clean_whitespace.
-
Thanks, this is really helpful.
On Thu, Dec 15, 2022 at 5:39 PM bogdankostic wrote:
If your goal is simply to replace all newline chars with a single whitespace
char, you can directly access the content field of each document and call
Python's built-in replace method:
document.content = document.content.replace("\n", " ")
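One thing to watch with a plain replace is that consecutive newlines turn into consecutive spaces. A minimal sketch of a variant that collapses each run of newlines (and the whitespace around it) into a single space is shown below. The `Doc` class and the `flatten_newlines` helper are hypothetical names used for illustration only; `Doc` merely stands in for any object with a `content` string field and is not the library's document class.

```python
import re
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a document object with a `content` field."""
    content: str


def flatten_newlines(text: str) -> str:
    """Collapse every run of newlines, plus adjacent whitespace, into one space."""
    # \s also matches "\n", so "\n \n\n" is treated as a single run.
    return re.sub(r"\s*\n\s*", " ", text).strip()


# Example input resembling text extracted from a PDF.
docs = [Doc("Line one\nline two\n\n\nline three \n")]

for doc in docs:
    doc.content = flatten_newlines(doc.content)

print(docs[0].content)
```

Note that this also removes paragraph breaks, so it only fits cases where the line structure of the source carries no meaning.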
What data to index as metadata depends on your search application and your use case. Having the headline as metadata isn't a bad idea; still, I would also add it to the article text if it's not already in there. Metadata can help with filtering search results. We have an article on that topic here: https://www.deepset.ai/blog/metadata-filtering-in-ha…
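The two suggestions above (keep the headline as metadata, but also make sure it is part of the text) can be sketched roughly as follows. This is only an illustration with documents represented as plain dictionaries; the `with_headline` and `filter_by_meta` helpers, the `headline` and `category` keys, and the sample articles are all hypothetical and do not reflect the library's own document class or filter syntax.

```python
from typing import Any, Dict, List

Document = Dict[str, Any]  # hypothetical shape: {"content": str, "meta": dict}


def with_headline(doc: Document) -> Document:
    """Return a copy whose content starts with the headline if it is missing."""
    headline = doc["meta"].get("headline", "")
    content = doc["content"]
    if headline and headline not in content:
        content = f"{headline}\n{content}"
    return {"content": content, "meta": dict(doc["meta"])}


def filter_by_meta(docs: List[Document], **criteria: Any) -> List[Document]:
    """Keep only documents whose metadata matches every given key/value pair."""
    return [
        d for d in docs
        if all(d["meta"].get(key) == value for key, value in criteria.items())
    ]


# Made-up sample articles for illustration.
articles = [
    {"content": "Rates were raised again today.",
     "meta": {"headline": "Central bank raises rates", "category": "finance"}},
    {"content": "Local team wins final\nThe match ended 2-1.",
     "meta": {"headline": "Local team wins final", "category": "sports"}},
]

prepared = [with_headline(a) for a in articles]
sports_only = filter_by_meta(prepared, category="sports")
print(len(sports_only))
```

Keeping the headline in both places means it contributes to text retrieval while remaining available as a filter field.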