Spacy for computational grammar and semantic analysis #9827
-
For a while I have been wanting to find a way to check (and formulate) grammar rules computationally. Let's say a language learner asks, "Why do I have to use the article 'the' with this word in this case?" I refer to my grammar books and say, "Well, you always use the article in this and that case" (i.e., if it's a specific one rather than a general one, or maybe it's just an expression like "the internet"). I would like to test / prove that assertion: going over a corpus, I'd like an algorithm that identifies all sentences fitting the description of the situation and checks whether the article is properly used. It might confirm that the rule has very few exceptions, like "an is used before a word with a vowel sound", or it might show the situation to be more complicated than was thought. I don't know what tools allow this, but I think it would be cool if spaCy supported it, because then it could be a really useful multi-functional linguistics library.
So: I assume spaCy has a sentence tree parser. What tools does it offer for semantic analysis? I know spaCy has some downloadable data packages like "web_core_en" or something. Does spaCy offer methods that interface directly with really high quality, built-in corpora, like for the application above? Can anyone recommend a way I could do the above, i.e. testing and proving grammar rules computationally?
Thank you
-
It sounds like you should be able to do most of this with the rule-based matchers (https://spacy.io/usage/rule-based-matching).
"Does spaCy offer methods that interface directly with really high quality, built-in corpora, like for the application above?"
No. Large, high-quality corpora tend to require licensing arrangements which prevent us from redistributing them. If you want to test assertions about language on a lot of unannotated text, you have more options, like Wikipedia. You might want to take a look at the SPIKE project from AllenAI (https://spike.apps.allenai.org/datasets), where the web demo integrates the query language with datasets like Wikipedia, US patents, etc.
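For the kind of article check described in the original question, a minimal sketch using the Matcher could look like the following. It is an illustration only: the model name en_core_web_sm and the "first letter is a vowel" heuristic are assumptions, and a real test of the rule would need phonetic information (e.g. "an hour", "a university").

```python
# Sketch: scan text for "a"/"an" followed by a word and flag cases where the
# article disagrees with a crude vowel-letter heuristic.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
matcher = Matcher(nlp.vocab)

# Match the indefinite article followed by any alphabetic token.
matcher.add("INDEF_ARTICLE", [[{"LOWER": {"IN": ["a", "an"]}}, {"IS_ALPHA": True}]])

def check_articles(text):
    doc = nlp(text)
    for _, start, end in matcher(doc):
        article, word = doc[start], doc[start + 1]
        # Crude stand-in for "vowel sound": first letter is a vowel letter.
        # This misses cases like "an hour" or "a university".
        expects_an = word.text[0].lower() in "aeiou"
        uses_an = article.lower_ == "an"
        if expects_an != uses_an:
            print(f"Possible exception: '{article.text} {word.text}' in: {article.sent.text}")

check_articles("I saw an dog and a elephant near an old house.")
```

Run over a large corpus, the mismatches this prints would be the candidate exceptions to examine by hand.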
-
Thank you.
What is the preferred way to access Wikipedia as a dataset? Do you just download a dataset from Kaggle or something, or do you write a custom scraper in Scrapy?
That Allen project seems to offer syntactic searching; I'll look more into that.
From what I could tell, spaCy's rule-based matcher doesn't have syntactic matching, is that correct? I'm picturing searching through a dataset for a sentence of a particular syntactic form.
Thanks very much.
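For what it's worth, spaCy does offer matching over the dependency tree via its DependencyMatcher, which is the closest built-in tool to searching a dataset for sentences of a particular syntactic form. A minimal sketch, again assuming the en_core_web_sm model and a purely illustrative pattern (a verb whose noun object has no determiner attached):

```python
# Sketch: find verb + direct-object pairs with the DependencyMatcher, then
# flag objects that have no determiner among their children.
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
matcher = DependencyMatcher(nlp.vocab)

pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "object",
     "RIGHT_ATTRS": {"DEP": "dobj", "POS": "NOUN"}},
]
matcher.add("VERB_OBJECT", [pattern])

doc = nlp("She opened the door. He read book on the train.")
for match_id, token_ids in matcher(doc):
    verb, obj = doc[token_ids[0]], doc[token_ids[1]]  # order follows the pattern
    has_det = any(child.dep_ == "det" for child in obj.children)
    if not has_det:
        print(f"Object without a determiner: '{verb.text} {obj.text}' in: {obj.sent.text}")
```

The pattern anchors on a verb and walks to its direct object with the ">" (immediate head) operator; any attribute the tagger or parser assigns (POS, DEP, LEMMA, etc.) can be used in RIGHT_ATTRS.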