Puzzled by results processing a Project Gutenberg text #12381
-
I downloaded from Project Gutenberg: ANABASIS BY XENOPHON (pg1170.txt). Sample text: ----------- Outputs:
Replies: 3 comments 1 reply
-
Hey, thanks for your post! Generally, depending on the data, ML models are almost never 100% accurate and will make some misclassifications.
-
I tried all sizes of en_core_web_xx. The results changed a bit with each, but not much for the better. The result posted is with this model: Hacked Python code is in the zip I attached. I mostly used snippets of code I found online in website posts purporting to teach about spaCy and Python. The snippets were eerily similar across several different websites, as if the authors were each borrowing extensively from other websites. I am new to this. I don't know what training a model would entail, time-wise. I was hoping spaCy would give a good enough result out of the box, but alas, no. I have no idea how much of a time commitment it would be to train a language model using Project Gutenberg English texts (as in, both text from native English writers and text translated from Russian or Greek or French).
-
OK. To me, Anabasis, as translated to English, is in the domain of plain
simple English text.
Maybe Project Gutenberg should train an English language model for spaCy.
I will use the mostly accurate Proper Nouns to continue my project.
Thanks for the feedback.
…On Mon, Mar 13, 2023 at 8:26 AM Edward ***@***.***> wrote:
The main issue is that the pretrained models were trained on a different
corpus; that is why they seem to have trouble analyzing the ANABASIS
BY XENOPHON text, which is a very specific domain.
About training, you can read here all about the process
<https://spacy.io/usage/training> and decide whether it's worth your time.
I hope this was helpful!
Hey, thanks for your post!
Can you provide more information about which spaCy model you are using? Is it a pre-trained model, or did you train from scratch?
You could try other pre-trained models and see whether the results improve.
Generally, depending on the data, ML models are almost never 100% accurate and will make some misclassifications.
We have a little thread about this issue here, which goes into that topic in more detail.
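For anyone who wants to try the "other pre-trained models" suggestion in a repeatable way, here is a minimal sketch, not an official recipe. It assumes spaCy is installed along with at least one of the standard English pipelines listed in `MODELS`; pipelines that are not installed are skipped. The helper names (`proper_nouns`, `compare_models`, `disagreements`) are made up for this example. It counts the tokens each pipeline tags as `PROPN` and lists the words the pipelines disagree on, which is a quick way to see where the models struggle with the text.

```python
from collections import Counter

# Standard English pipeline names; which of them are installed locally
# is an assumption, so missing ones are skipped below.
MODELS = ["en_core_web_sm", "en_core_web_md", "en_core_web_lg", "en_core_web_trf"]


def proper_nouns(doc):
    """Count the tokens a pipeline tagged as proper nouns (coarse tag PROPN)."""
    return Counter(tok.text for tok in doc if tok.pos_ == "PROPN")


def compare_models(text, model_names=MODELS):
    """Run each installed pipeline over the same text and collect PROPN counts."""
    import spacy  # imported lazily so the other helpers work without spaCy

    results = {}
    for name in model_names:
        try:
            nlp = spacy.load(name)
        except OSError:
            # spacy.load raises OSError when the model package is not installed.
            continue
        results[name] = proper_nouns(nlp(text))
    return results


def disagreements(results):
    """Words tagged PROPN by at least one pipeline but not by all of them."""
    if not results:
        return set()
    word_sets = [set(counts) for counts in results.values()]
    return set.union(*word_sets) - set.intersection(*word_sets)
```

To use it, read a chunk of the book into a string (for example the first 100,000 characters of pg1170.txt), call `results = compare_models(text)`, and then inspect `disagreements(results)`. Words that only some pipelines tag as proper nouns are good candidates for manual checking.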