Spacy for computational grammar and semantic analysis #9827
-
For a while I have been wanting to find a way to check (and formulate) grammar rules computationally. Let's say a language learner asks, "Why do I have to use the article 'the' with this word in this case?" I refer to my grammar books and say, "Well, you always use the article in this and that case" (i.e., if it's a specific one rather than a general one, or maybe it's just an expression like "the internet"). I would like to test / prove that assertion: going over a corpus, I'd like an algorithm that identifies all sentences fitting the description of the situation and checks whether the article is properly used. It might confirm that the rule has very few exceptions, like "an is used before a word with a vowel sound", or it might show the situation to be more complicated than was thought. I don't know what tools allow this, but I think it would be cool if spaCy supported it, because then it could be a really useful multi-functional linguistics library.
So: I assume spaCy has a sentence tree parser. What tools does it offer for semantic analysis? I know spaCy has some downloadable data packages like "web_core_en" or something. Does spaCy offer methods that interface directly with really high quality, built-in corpora, like for the application above? Can anyone recommend a way I could do the above, i.e. testing and proving grammar rules computationally?
Thank you
-
It sounds like you should be able to do most of this with the rule-based matchers (https://spacy.io/usage/rule-based-matching).
"Does spaCy offer methods that interface directly with really high quality, built-in corpora, like for the application above?"
No. Large, high-quality corpora tend to require licensing arrangements which prevent us from redistributing them. If you want to test assertions about language on a lot of unannotated text, you have more options, like Wikipedia. You might want to take a look at the SPIKE project from AllenAI (https://spike.apps.allenai.org/datasets), where the web demo integrates the query language with datasets like Wikipedia, US patents, etc.
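For the kind of article check described in the original question, a minimal sketch using the Matcher could look like the following. It is an illustration only: the model name en_core_web_sm and the "first letter is a vowel" heuristic are assumptions, and a real test of the rule would need phonetic information (e.g. "an hour", "a university").

```python
# Sketch: scan text for "a"/"an" followed by a word and flag cases where the
# article disagrees with a crude vowel-letter heuristic.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
matcher = Matcher(nlp.vocab)

# Match the indefinite article followed by any alphabetic token.
matcher.add("INDEF_ARTICLE", [[{"LOWER": {"IN": ["a", "an"]}}, {"IS_ALPHA": True}]])

def check_articles(text):
    doc = nlp(text)
    for _, start, end in matcher(doc):
        article, word = doc[start], doc[start + 1]
        # Crude stand-in for "vowel sound": first letter is a vowel letter.
        # This misses cases like "an hour" or "a university".
        expects_an = word.text[0].lower() in "aeiou"
        uses_an = article.lower_ == "an"
        if expects_an != uses_an:
            print(f"Possible exception: '{article.text} {word.text}' in: {article.sent.text}")

check_articles("I saw an dog and a elephant near an old house.")
```

Run over a large corpus, the mismatches this prints would be the candidate exceptions to examine by hand.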
-
Thank you.
What is the preferred way to access Wikipedia as a dataset? Do you just download a dataset from Kaggle or something, or do you write a custom scraper in Scrapy?
That Allen project seems to offer syntactic searching; I'll look more into that.
From what I could tell, spaCy's rule-based matcher doesn't have syntactic matching, is that correct? I'm picturing searching through a dataset for a sentence of a particular syntactic form.
Thanks very much.
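For what it's worth, spaCy does offer matching over the dependency tree via its DependencyMatcher, which is the closest built-in tool to searching a dataset for sentences of a particular syntactic form. A minimal sketch, again assuming the en_core_web_sm model and a purely illustrative pattern (a verb whose noun object has no determiner attached):

```python
# Sketch: find verb + direct-object pairs with the DependencyMatcher, then
# flag objects that have no determiner among their children.
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
matcher = DependencyMatcher(nlp.vocab)

pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "object",
     "RIGHT_ATTRS": {"DEP": "dobj", "POS": "NOUN"}},
]
matcher.add("VERB_OBJECT", [pattern])

doc = nlp("She opened the door. He read book on the train.")
for match_id, token_ids in matcher(doc):
    verb, obj = doc[token_ids[0]], doc[token_ids[1]]  # order follows the pattern
    has_det = any(child.dep_ == "det" for child in obj.children)
    if not has_det:
        print(f"Object without a determiner: '{verb.text} {obj.text}' in: {obj.sent.text}")
```

The pattern anchors on a verb and walks to its direct object with the ">" (immediate head) operator; any attribute the tagger or parser assigns (POS, DEP, LEMMA, etc.) can be used in RIGHT_ATTRS.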