The pressing issues with this part of the pipeline concern robustness, scalability, and testing. For the final product, we need a lot of simplification. To organize and document the development, I will track it in the issue tracker here.
Currently, if I understand correctly, the procedure can be roughly sketched as follows. I will edit as I go along; please comment if I am mistaken.
- The question is cleaned. `nltk` is used to detect and remove adjectives like 'nearest', so that the important nouns can be isolated and recognized in subsequent steps.
- Important words in the questions are annotated.
- Extract functional roles based on syntactic structures and connective words, via a grammar implemented in ANTLR. This yields a parse tree.
- Convert parse trees into transformations between concept types.
- Find input and output concept types by matching questions to categories that are associated with concept transformation models.
- The order of concept types is derived from the function of each phrase in which they occur: subcondition is calculated before the condition, etcetera. A table is generated that calculates the order for each functional part, which is then itself combined in a rule-based way (see Algorithm 1 in the paper).
- Transform concept types into `cct` types via manually constructed rules, based on the concepts/extents/transformations that were found in previous steps.
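Purely as an illustration, the stages above can be sketched as a chain of toy functions. All names and logic here are stand-ins, not the repository's actual API: the real code uses `nltk` for cleaning, annotation models for keywords, and an ANTLR grammar for parsing.

```python
# Illustrative stand-ins only: the real pipeline uses nltk for cleaning,
# annotation models for keywords, and an ANTLR grammar for parsing.

def clean(question: str) -> str:
    # stand-in for the nltk step that removes adjectives like 'nearest'
    return " ".join(w for w in question.split() if w != "nearest")

def annotate(question: str) -> list:
    # stand-in for keyword annotation; pretend we recognise two concepts
    concepts = {"hospital", "school"}
    return [(w, "concept" if w.strip("?") in concepts else "other")
            for w in question.split()]

def parse(tagged: list) -> dict:
    # stand-in for the ANTLR parse tree: keep only annotated concepts
    return {"concepts": [w.strip("?") for w, tag in tagged if tag == "concept"]}

def to_cct(tree: dict) -> list:
    # stand-in for the manual concept-type -> CCT rules
    return [f"CCT({c})" for c in tree["concepts"]]

result = to_cct(parse(annotate(clean("Which is the nearest hospital?"))))
print(result)  # ['CCT(hospital)']
```

Each arrow in this chain is a place where the real pipeline can break, which is the point of the fragility discussion below.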
The issue is that this is rather fragile; it depends (among other things) on:
- All concepts and entities being annotated properly.
- Having a complete rule set for converting concept types into CCT types.
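To illustrate the second point: the conversion to `cct` types amounts to a lookup in a manually maintained rule table, so any concept type without an entry breaks the whole question. A toy sketch, with invented rule names and entries:

```python
# Invented rule entries, purely to illustrate the completeness problem;
# the real rules are manually constructed in the repository.
CCT_RULES = {
    "ObjectAmount": "R(Obj, Count)",
    "FieldValue": "R(Loc, Ratio)",
}

def concept_to_cct(concept_type: str) -> str:
    # a single missing entry makes the whole question unanswerable
    if concept_type not in CCT_RULES:
        raise KeyError(f"no CCT rule for concept type {concept_type!r}")
    return CCT_RULES[concept_type]
```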
We have chosen blockly to constrain the natural language at the user end, in such a way that the questions that may be presented to the parser are questions that the parser can handle. However, this only formats the question to reduce the problems of an otherwise unchanged natural language processing pipeline. As discussed in the meeting and elsewhere:
- Given that we already know the type of entity when constructing a query via `blockly` instead of freeform text, we will no longer need named entity recognition or question cleaning. This would strip out the `nltk`, `spaCy`, and `allennlp` packages, tremendously simplifying the process.
- To guarantee robustness, the visual blocks need to be in perfect accordance with the parser. For this, they should be automatically constructed from one common source of truth.
- In fact, given that the `blockly`-constructed query can output something different than what's written on the blocks, we might even forego the natural language parser completely, in favour of JSON output at the `blockly` level (or another format that is easily parsed). This would eliminate even the ANTLR parser, further reducing complexity. The downside is that we would no longer be able to parse freeform text (though that would be impacted by the removal of named entity recognition anyway). We could describe this format with JSON Schema to really pin it down.
- To make sure that no regressions are introduced, we should have expected output for every step (that is, not just expected output from the whole pipeline).
This would make this repository not so much a geo-question-parser as a geo-question-formulator. That is a good thing: the current code is very complex and closely fitted to the specific questions in the original corpus, which isn't acceptable in a situation where users can pose their own questions.
Note: If we simplify to this extent, it might be nice to use rdflib.js to output a transformation graph directly, but that is for later.
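To make the JSON idea concrete, here is one possible shape for a `blockly`-emitted question. The field names are entirely made up; the real ones would come from the common source of truth, and a real JSON Schema would replace the minimal stdlib check used here:

```python
import json

# Made-up example of what blockly could emit instead of natural language;
# field names are hypothetical, not an agreed format.
RAW = """
{
  "question": "which hospital is nearest to each school",
  "output": {"concept": "object", "entity": "hospital"},
  "condition": {"relation": "nearest",
                "to": {"concept": "object", "entity": "school"}}
}
"""

def is_valid(doc: dict) -> bool:
    # toy stand-in for validation against a real JSON Schema
    return {"question", "output"} <= doc.keys() and "concept" in doc["output"]

question = json.loads(RAW)
print(is_valid(question))  # True
```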
The process would thus become:
- In `blockly`, construct JSON that represents a question.
- Convert that question into transformations between concept types.
- Find input and output concept types by matching questions to transformation categories.
- Find concept type ordering.
- Transform concept types into `cct` types via rules.
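The per-step regression testing mentioned above could record the expected intermediate result of every stage, not just the final CCT expression, so that a change in any stage fails loudly. A minimal sketch, with invented stage names and toy logic:

```python
# Toy stages for the simplified pipeline; names and logic are invented.
def run_stages(question_json: dict) -> dict:
    concepts = [question_json["output"], *question_json.get("inputs", [])]
    ordered = sorted(c["concept"] for c in concepts)
    cct = [f"CCT({c})" for c in ordered]
    # return every intermediate result so each stage can be checked
    return {"concepts": concepts, "ordered": ordered, "cct": cct}

# expected output per stage, not just for the whole pipeline
EXPECTED = {
    "concepts": [{"concept": "object"}],
    "ordered": ["object"],
    "cct": ["CCT(object)"],
}

actual = run_stages({"output": {"concept": "object"}})
for stage, expected in EXPECTED.items():
    assert actual[stage] == expected, f"regression in stage {stage}"
```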
I'm not sure to what extent we can still simplify step 2. Depending on how much code would be left, it would be nice to port/rewrite it in JavaScript, alongside `blockly`, so that we can visualize most things client-side and with minimal moving parts.