The pressing issues with this part of the pipeline concern robustness, scalability, and testing. For the final product, we need a lot of simplification. To organize and document the development, I will track it in the issue tracker here.
Currently, if I understand correctly, the procedure can be roughly sketched as follows. I will edit as I go along; please comment if I am mistaken.
- The question is cleaned. `nltk` is used to detect and remove adjectives like 'nearest', so that the important nouns can be isolated and recognized in subsequent steps.
- Important words in the questions are annotated.
- Extract functional roles based on syntactic structures and connective words, via a grammar implemented in ANTLR. This yields a parse tree.
- Convert parse trees into transformations between concept types.
- Find input and output concept types by matching questions to categories that are associated with concept transformation models.
- The order of concept types is derived from the function of each phrase in which they occur: subcondition is calculated before the condition, etcetera. A table is generated that calculates the order for each functional part, which is then itself combined in a rule-based way (see Algorithm 1 in the paper).
- Transform concept types into `cct` types via manually constructed rules, based on the concepts/extents/transformations that were found in previous steps.
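Purely as an illustration, the stages above can be sketched as a chain of toy functions. All names and logic here are stand-ins, not the repository's actual API: the real code uses `nltk` for cleaning, annotation models for keywords, and an ANTLR grammar for parsing.

```python
# Illustrative stand-ins only: the real pipeline uses nltk for cleaning,
# annotation models for keywords, and an ANTLR grammar for parsing.

def clean(question: str) -> str:
    # stand-in for the nltk step that removes adjectives like 'nearest'
    return " ".join(w for w in question.split() if w != "nearest")

def annotate(question: str) -> list:
    # stand-in for keyword annotation; pretend we recognise two concepts
    concepts = {"hospital", "school"}
    return [(w, "concept" if w.strip("?") in concepts else "other")
            for w in question.split()]

def parse(tagged: list) -> dict:
    # stand-in for the ANTLR parse tree: keep only annotated concepts
    return {"concepts": [w.strip("?") for w, tag in tagged if tag == "concept"]}

def to_cct(tree: dict) -> list:
    # stand-in for the manual concept-type -> CCT rules
    return [f"CCT({c})" for c in tree["concepts"]]

result = to_cct(parse(annotate(clean("Which is the nearest hospital?"))))
print(result)  # ['CCT(hospital)']
```

Each arrow in this chain is a place where the real pipeline can break, which is the point of the fragility discussion below.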
The issue is that this is rather fragile; it depends (among other things) on:
- All concepts and entities being annotated properly.
- Having a complete rule set for converting concept types into CCT types.
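To illustrate the second point: the conversion to `cct` types amounts to a lookup in a manually maintained rule table, so any concept type without an entry breaks the whole question. A toy sketch, with invented rule names and entries:

```python
# Invented rule entries, purely to illustrate the completeness problem;
# the real rules are manually constructed in the repository.
CCT_RULES = {
    "ObjectAmount": "R(Obj, Count)",
    "FieldValue": "R(Loc, Ratio)",
}

def concept_to_cct(concept_type: str) -> str:
    # a single missing entry makes the whole question unanswerable
    if concept_type not in CCT_RULES:
        raise KeyError(f"no CCT rule for concept type {concept_type!r}")
    return CCT_RULES[concept_type]
```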
We have chosen blockly to constrain the natural language at the user end, in such a way that the questions that may be presented to the parser are questions that the parser can handle. However, this only formats the question to reduce the problems of an otherwise unchanged natural language processing pipeline. As discussed in the meeting and elsewhere:
- Given that we already know the type of entity when constructing a query via `blockly` instead of freeform text, we will no longer need named entity recognition or question cleaning. This would strip out the `nltk`, `spaCy`, and `allennlp` packages, tremendously simplifying the process.
- To guarantee robustness, the visual blocks need to be in perfect accordance with the parser. For this, they should be automatically constructed from one common source of truth.
- In fact, given that the `blockly`-constructed query can output something different than what's written on the blocks, we might even forego the natural language parser completely, in favour of JSON output at the `blockly` level (or another format that is easily parsed). This would eliminate even the ANTLR parser, further reducing complexity. The downside is that we would no longer be able to parse freeform text (though that would be impacted by the removal of named entity recognition anyway). We could describe this format with JSON Schema to really pin it down.
- To make sure that no regressions are introduced, we should have expected output for every step (that is, not just expected output from the whole pipeline).
This would make this repository not so much a geo-question-parser as a geo-question-formulator. That is a good thing: the current code is very complex and closely fitted to the specific questions in the original corpus, which isn't acceptable in a situation where users can pose their own questions.
Note: If we simplify to this extent, it might be nice to use rdflib.js to output a transformation graph directly, but that is for later.
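To make the JSON idea concrete, here is one possible shape for a `blockly`-emitted question. The field names are entirely made up; the real ones would come from the common source of truth, and a real JSON Schema would replace the minimal stdlib check used here:

```python
import json

# Made-up example of what blockly could emit instead of natural language;
# field names are hypothetical, not an agreed format.
RAW = """
{
  "question": "which hospital is nearest to each school",
  "output": {"concept": "object", "entity": "hospital"},
  "condition": {"relation": "nearest",
                "to": {"concept": "object", "entity": "school"}}
}
"""

def is_valid(doc: dict) -> bool:
    # toy stand-in for validation against a real JSON Schema
    return {"question", "output"} <= doc.keys() and "concept" in doc["output"]

question = json.loads(RAW)
print(is_valid(question))  # True
```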
The process would thus become:
- In `blockly`, construct JSON that represents a question.
- Convert that question into transformations between concept types.
- Find input and output concept types by matching questions to transformation categories.
- Find concept type ordering.
- Transform concept types into `cct` types via rules.
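The per-step regression testing mentioned above could record the expected intermediate result of every stage, not just the final CCT expression, so that a change in any stage fails loudly. A minimal sketch, with invented stage names and toy logic:

```python
# Toy stages for the simplified pipeline; names and logic are invented.
def run_stages(question_json: dict) -> dict:
    concepts = [question_json["output"], *question_json.get("inputs", [])]
    ordered = sorted(c["concept"] for c in concepts)
    cct = [f"CCT({c})" for c in ordered]
    # return every intermediate result so each stage can be checked
    return {"concepts": concepts, "ordered": ordered, "cct": cct}

# expected output per stage, not just for the whole pipeline
EXPECTED = {
    "concepts": [{"concept": "object"}],
    "ordered": ["object"],
    "cct": ["CCT(object)"],
}

actual = run_stages({"output": {"concept": "object"}})
for stage, expected in EXPECTED.items():
    assert actual[stage] == expected, f"regression in stage {stage}"
```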
I'm not sure to what extent we can still simplify step 2. Depending on how much code would be left, it would be nice to port/rewrite it in JavaScript, alongside `blockly`, so that we can visualize most things client-side and with minimal moving parts.