Add phrase counts or parts-of-speech token counts after extracting entities from a sentence #15

@neomatrix369

Description

Following on from PR #13, it appears there are other types of phrase we could count, e.g. pronouns, dates, organisations, etc. The details can be discussed. Some of this is already in place, and a number of other types remain to be covered:

Named entity recognition features:

  • PERSON | People, including fictional.
  • NORP | Nationalities or religious or political groups.
  • FAC | Buildings, airports, highways, bridges, etc.
  • ORG | Companies, agencies, institutions, etc.
  • GPE | Countries, cities, states.
  • LOC | Non-GPE locations, mountain ranges, bodies of water.
  • PRODUCT | Objects, vehicles, foods, etc. (Not services.)
  • EVENT | Named hurricanes, battles, wars, sports events, etc.
  • WORK_OF_ART | Titles of books, songs, etc.
  • LAW | Named documents made into laws.
  • LANGUAGE | Any named language. (Related to the "Language Detection Feature" #4 feature request.)
  • DATE | Absolute or relative dates or periods.
  • TIME | Times smaller than a day.
  • PERCENT | Percentage, including “%”.
  • MONEY | Monetary values, including unit.
  • QUANTITY | Measurements, as of weight or distance.
  • ORDINAL | “first”, “second”, etc.
  • CARDINAL | Numerals that do not fall under another type.
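
As a rough illustration of what "counts per entity type" could look like, here is a minimal sketch. The helper names (`count_entity_labels`, `entity_labels_with_spacy`) and the `<label>_count` naming are hypothetical and not part of the library; the model name `en_core_web_sm` is an assumption and would need to be installed separately.

```python
from collections import Counter

# Entity labels copied from the list above.
NER_LABELS = (
    "PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT",
    "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT", "MONEY",
    "QUANTITY", "ORDINAL", "CARDINAL",
)


def count_entity_labels(labels):
    """Return one count per known entity label (zero when absent).

    Labels that are not in NER_LABELS are ignored.
    """
    seen = Counter(labels)
    return {f"{label.lower()}_count": seen.get(label, 0) for label in NER_LABELS}


def entity_labels_with_spacy(text, model="en_core_web_sm"):
    """Extract entity labels from text with spaCy (hypothetical helper).

    spaCy is imported lazily so the counting helper above works without it.
    """
    import spacy

    nlp = spacy.load(model)
    return [ent.label_ for ent in nlp(text).ents]


if __name__ == "__main__":
    # Hand-written labels, standing in for entity_labels_with_spacy(...) output.
    counts = count_entity_labels(["PERSON", "GPE", "PERSON", "DATE"])
    print(counts["person_count"], counts["gpe_count"], counts["org_count"])
```

Keeping the counting step separate from the extraction step means the same counter could be fed from either spaCy or NLTK output.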

Parts of speech features:

  • (NOUN | noun | girl, cat, tree, air, beauty) Noun phrase count via "Added Noun phrase count" #13 by @ritikjain51 and "Add noun phrase count to the granular features functionality" #47
  • ADJ | adjective | big, old, green, incomprehensible, first
  • ADP | adposition | in, to, during
  • ADV | adverb | very, tomorrow, down, where, there
  • AUX | auxiliary | is, has (done), will (do), should (do)
  • CONJ | conjunction | and, or, but
  • CCONJ | coordinating conjunction | and, or, but
  • DET | determiner | a, an, the
  • INTJ | interjection | psst, ouch, bravo, hello
  • NUM | numeral | 1, 2017, one, seventy-seven, IV, MMXIV
  • PART | particle | ’s, not
  • PRON | pronoun | I, you, he, she, myself, themselves, somebody
  • PROPN | proper noun | Mary, John, London, NATO, HBO
  • PUNCT | punctuation | ., (, ), ?
  • SCONJ | subordinating conjunction | if, while, that
  • SYM | symbol | $, %, §, ©, +, −, ×, ÷, =, :), 😝
  • VERB | verb | run, runs, running, eat, ate, eating
  • SPACE | space
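
A parts-of-speech token count could follow the same pattern. The sketch below is only an assumption about the shape of the feature: `count_pos_tags` and `tagged_tokens_with_spacy` are hypothetical names, and the spaCy model `en_core_web_sm` is assumed to be installed if the second helper is used.

```python
from collections import Counter

# Coarse-grained tags copied from the list above.
POS_TAGS = (
    "NOUN", "ADJ", "ADP", "ADV", "AUX", "CONJ", "CCONJ", "DET", "INTJ",
    "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "SPACE",
)


def count_pos_tags(tagged_tokens):
    """Count tokens per coarse POS tag.

    tagged_tokens is an iterable of (token_text, pos_tag) pairs.
    Tags that are not in POS_TAGS are ignored.
    """
    seen = Counter(tag for _, tag in tagged_tokens)
    return {f"{tag.lower()}_count": seen.get(tag, 0) for tag in POS_TAGS}


def tagged_tokens_with_spacy(text, model="en_core_web_sm"):
    """Tag text with spaCy and return (token_text, pos_tag) pairs.

    Hypothetical helper; spaCy is imported lazily so count_pos_tags
    can be used and tested without it.
    """
    import spacy

    nlp = spacy.load(model)
    return [(token.text, token.pos_) for token in nlp(text)]


if __name__ == "__main__":
    # Hand-tagged sentence, standing in for tagged_tokens_with_spacy(...) output.
    sample = [("She", "PRON"), ("runs", "VERB"), ("fast", "ADV"), (".", "PUNCT")]
    counts = count_pos_tags(sample)
    print(counts["pron_count"], counts["verb_count"], counts["noun_count"])
```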

See https://spacy.io/api/annotation#section-named-entities and http://www.nltk.org/book/ for details on the above items.

We will replace one or more existing functionalities in the library with the above, on a case-by-case basis. It would be best to put each set in its own group with a unique name, such as name-entity-recognition-features and parts-of-speech-features respectively, and club both groups with the granular features.
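
One possible way to express this grouping, purely as a sketch: the `FEATURE_GROUPS` registry, the `ner`/`pos` column prefixes and the function names below are hypothetical, and only a few tags per group are listed to keep the example short. How these groups plug into the existing granular features is left open.

```python
from collections import Counter

# Hypothetical registry: group name -> (column prefix, tags to count).
# Only a handful of tags are listed here for brevity.
FEATURE_GROUPS = {
    "name-entity-recognition-features": ("ner", ("PERSON", "ORG", "GPE", "DATE")),
    "parts-of-speech-features": ("pos", ("NOUN", "VERB", "ADJ", "PRON")),
}


def prefixed_counts(tags, known_tags, prefix):
    """Count each known tag and return prefixed feature names."""
    seen = Counter(tags)
    return {f"{prefix}_{tag.lower()}_count": seen.get(tag, 0) for tag in known_tags}


def build_feature_groups(tags_by_group):
    """Build one dict of counts per registered group.

    tags_by_group maps a group name to the list of tags found in a text;
    groups missing from the input get all-zero counts.
    """
    return {
        group: prefixed_counts(tags_by_group.get(group, []), known_tags, prefix)
        for group, (prefix, known_tags) in FEATURE_GROUPS.items()
    }


if __name__ == "__main__":
    features = build_feature_groups({
        "name-entity-recognition-features": ["ORG", "ORG", "DATE"],
        "parts-of-speech-features": ["NOUN", "VERB", "NOUN"],
    })
    for group, counts in features.items():
        print(group, counts)
```

Prefixing the column names keeps the two groups distinguishable if they are later flattened into a single row of granular features.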

Both NLTK and spaCy would be used to provide these functionalities.

Metadata

Assignees

No one assigned

    Labels

      • 2. medium-priority | Good if it can be attended to be soon, but not urgent enough
      • enhancement | New feature or request
      • granular feature(s) | Low-level/granular feature(s)
      • hacktoberfest | Classify topic. Part of the Hacktoberfest 2020 (https://hacktoberfest.digitalocean.com)
      • help wanted | Extra attention is needed
