Text parser 2

This is a sort of scrap book for the work on text-parser2.

Research approaches based on AST/TST

Research into using a universal AST (Abstract Syntax Tree) to represent a document. A complex language parser can provide a rich and complete AST while a simplistic parser can simply build a flat tree.

Example implementation of a patching AST parser for JavaScript

Current work and results are sometimes updated here: https://gist.github.com/764184

Outline:

The text parser runs in the embedded nodejs runtime thread.
Kod can send an "edited text" message to the parser at any given time.
The parser can send a "tree updated" message to kod at any given time.

Informal protocol

Kod sending a "text edited" message to the parser system:

[ui -> parser] "text edited" {
                  document: [KDocument],
                  modifiedRange: [123, 4],
                  changeDelta: 4,
                  version: 12345
               }

document is a shared object which represents the document which text was altered. This document can be queried for:
- language type
- modification timestamp
- complete source text
modifiedRange represent the character range in the source before the edit occured. E.g. removing "re" from "fred" results in the range [1,2] ("edit started at position 1 and extended 2 characters").
changeDelta tells how many characters where added or removed. In the above case of removing "re" from "fred", the value is -2 ("two characters where removed")
version is an opaque value which uniquely identifies this edit. The version is simply passed along back to the ui when the paser is done working.

This information should be sufficient to calculate the (probably domain-specific) effective extent of the edit. The parser should also be able to consult the AST on what node(s) of the tree was affected by the edit and need to be replaced or re-evaluated.

At this point in time, the parser should know the following:

What node in the AST need to be replaced
The full semantically complete range of characters in the new source which need to be parsed

Next, the parser simply parses the sbustring of the new source into a partial AST and finally replaces the old node with the new root node from the partial AST.

Picture yourself a tree... Imagine the AST is a tree in nature and the edit is a scratched branch on that tree. We need to cut of that branch from the tree, but we want to do as little damage as possible, thus we cut of the branch as far out as possible. Then we consult your local god to create a new branch without scratches and insert that one where we cut off the damaged branch.

Finally, the parser sends a message to Kod to inform "the UI" that the semantics of document has changed:

[parser -> ui] "ast changed" {
                  document: [KDocument],
                  affectedRange: [120, 12],
                  version: version,
                  ast: [ast]
               }

document is a shared object representing the document in question.
affectedRange represents a character range (in the current source text space) which the update affected.
version carries the opaque version value we received from the UI when starting our parsing.
ast is the new/current AST (TODO: maybe the ast could be a persistent property of the document instead so we don't need to pass it along all the time?)

When the UI receives a "ast changed" message it compares the current version of the document to the version received.

editAccepted = false
document.lock() // temporarily retains exclusive access to the document
currentVersion = document.version
if (currentVersion == version) {
  editAccepted = true
  updateAST(ast)
}
document.unlock()

If editAccepted is false, then another edit occured while the parser was working and the parser is currently busy interpreting that change. Here we need some way to coalesce edits. Here's one idea for an approach to a solution:

failedEdits = []
on("edit failed"), {
  failedEdits.push(affectedRange)
}
on("ast changed", {
  unionRange = calculateUnionOfRanges(failedEdits)
  failedEdits = []
  if (  unionRange.start < affectedRange.start
     || unionRange.end > affectedRange.end ) {
    // some of the failed edits where supposed to cause a change beyond what was
    // just changed. As there might be "dirty" parts outside of what wehere just
    // updated
    [ui -> parser] "text edited" {... modifiedRange: unionRange ...}
  }
}

The UI updates displayed text in the affectedRange to have its style updated to visually reflect the containg elements' semantic meaning (i.e. its kind, relation, etc). The UI can do other things as well, like updating a visual tree representation of the AST in which the user can click to jump between the visual tree and the source text.

Revision 2 of the informal protocol

We limit the number of threads involved to exactly 2: {main} and {nodejs}.

{main} makes changes to a document's text and enqueues "text updated" for {nodejs} in a sequential fashion using a shared atomic LIFO queue.
{nodejs} dequeues "text updated" events--when notified--from the shared LIFO.

This reasonable limitation would remove the need for version passing and parse-retrying. However, one problem still exists and is that of mutual access to a document's text.

One obvious solution is that {main} makes a complete copy of the text when enqueuing the "text edited" event. Such a solution have the following characteristics and effects:

Less performant as {main} need to block during the copy-ing of it's text
Removes the need for versioning and complex managed coalescing of edits, simpifying both the implementation and the API
Removes the need to implement a special v8 C++ wrapper object to allow mutual access to text between objc and v8, which could potentially be very messy since one would have to create a custom implementation of NSString where the backing memory is accessible.

Given this revision, the API messages would instead look like this:

[ui -> parser] "text edited" {
                  document: [KDocument],
                  text: "complete current source...",
                  modifiedRange: [123, 4],
                  changeDelta: 4
               }

[parser -> ui] "ast changed" {
                  document: [KDocument],
                  affectedRange: [120, 12],
                  ast: [ast]
               }

After a successful parse in {nodejs}, the "ast changed" message need to be delayed by one runloop tick. This important since there might be queued "text edited" messages which need to be processed before we send "ast changed". Something like this:

# runloop notified to run an iteration

recv> edit1()                  # finQ: []        sendQ: []
push> fin1                     # finQ: [ fin1 ]  sendQ: []
if finQ.length == 0:
  # not happening since finQ isn't empty
  # ...

yield
  # a new edit was made while we processed edit1, thus the event got queued on
  # the runloop with a higher priority than our nextTick closure

recv> edit2()                  # finQ: [ fin1 ]  sendQ: []
pop>  fin1() -> res1 > sendQ   # finQ: []        sendQ: [ res1 ]
push> fin2                     # finQ: [ fin2 ]  sendQ: [ res1 ]
if finQ.length == 0:
  # not happening since finQ isn't empty
  # ...

yield
  # this time no new edit was made, but the runloop still triggers an iteration
  # since we queued a nextTick()

pop>  fin2() -> res1 > sendQ   # finQ: []        sendQ: [ res1, res2 ]
if finQ.length == 0:
  # this do happe since finQ is empty
  for each res in sendQ:
    res()
  clear sendQ                  # finQ: []        sendQ: []

yield
  # runloop goes into idle mode...

Which could be implemented like this:

var finQ = []
var sendQ = []

function dequeue() {
  var queuedFinalizer = finQ.shift()
  if (queuedFinalizer) {
    var result = queuedFinalizer()
    sendQ.push(result)
  }
  if (finQ.length == 0) {
    sendQ.forEach(function (result) {
      send(result)
    })
    sendQ = []
  }
}

on("text edited", message, {
  var finalizer = parse(message)
  finQ.push(finalizer)
  process.nextTick(dequeue)
})

Ideas

Mathematica-style semantic expansion of the cursor

In Mathematica frontend, a user can press a key [Ctrl+.], and the token the cursor is on will be selected (highlighted), when user presses the key again, the selection expands to highlight the next smallest semantic unit. When the key is pressed again, it extends further.

-- http://xahlee.org/emacs/syntax_tree_walk.html

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Text parser 2

Text parser 2

Research approaches based on AST/TST

Example implementation of a patching AST parser for JavaScript

Informal protocol

Revision 2 of the informal protocol

Ideas

Mathematica-style semantic expansion of the cursor

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally