Incremental Parsing #2387
Conversation
Remove the condition that the file has to be in the `consideredFilesList` of the files context, as merely parsing a file does not add it to the `consideredFilesList`.
Rework the incremental parsing handoff so that (1) invalidation no longer prepares incremental reparses only for changed files while forcing unchanged files through a full parse again, and (2) invalidation no longer computes eager reparse metadata inside FlowrAnalyzerCache.

Previously, a mixed update such as "A changed, B unchanged" behaved like this:
- reparse info was generated for A
- no reparse info was generated for B
- the cache pipeline was rebuilt
- A was reparsed incrementally
- B was parsed from scratch

The new architecture fixes that by treating the previous successful parse run as the baseline for the next one:
- File invalidation events now record the file path together with the old source text in the incremental analysis context.
- FlowrAnalyzerCache snapshots the latest completed Tree-sitter parse results after successful parse-oriented runs.
- On the next parse request, TreeSitterExecutor derives reparse info lazily from:
  - the old parse tree
  - the old source text, if the file was invalidated
  - the current file content
- Changed files get a minimal edit region and are reparsed incrementally.
- Unchanged files now reuse their previous parse tree directly instead of being parsed again from scratch.
- Parser call sites, invalidation plumbing, and documentation were updated to support this context-driven flow.

Net effect: incremental parsing now correctly handles mixed workloads by incrementally reparsing changed files while reusing old parse results for unchanged files.
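The per-file decision described above can be sketched as follows. All names here (`PrevParseSnapshot`, `ReparseDecision`, `deriveReparse`) are hypothetical stand-ins for illustration, not the actual flowR API:

```typescript
// Hypothetical sketch of the lazy reparse decision, not the actual flowR API.
// A "snapshot" pairs a file's previous parse tree with its previous source text.
interface PrevParseSnapshot<Tree> {
	tree:    Tree;
	oldText: string;
}

type ReparseDecision<Tree> =
	| { kind: 'reuse';       tree: Tree }                                   // unchanged: keep the old tree
	| { kind: 'incremental'; tree: Tree; oldText: string; newText: string } // changed: edit + incremental reparse
	| { kind: 'full';        newText: string };                             // no baseline: parse from scratch

function deriveReparse<Tree>(
	snapshot:    PrevParseSnapshot<Tree> | undefined,
	invalidated: boolean,
	newText:     string
): ReparseDecision<Tree> {
	if(snapshot === undefined) {
		// first run or the snapshot was dropped: full parse
		return { kind: 'full', newText };
	} else if(!invalidated || snapshot.oldText === newText) {
		// unchanged file: reuse the previous tree directly
		return { kind: 'reuse', tree: snapshot.tree };
	}
	// changed file: derive an edit from old vs. new text and reparse incrementally
	return { kind: 'incremental', tree: snapshot.tree, oldText: snapshot.oldText, newText };
}
```

The key property is that the decision is made lazily at parse time from the snapshot, so unchanged files never lose their baseline just because a sibling file changed.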
Adapt the incremental parsing tests to the new architecture, in which incremental state is derived lazily from the previous successful parse run instead of from eagerly stored reparse info. The tests no longer inspect the removed reparseInfoMap. Instead they now:
- capture the previous Tree-sitter trees after the first analysis run
- invalidate files by updating their content
- verify that invalidation clears the current parse pipeline
- trace the second parse run to observe which previous tree is reused for which file
- assert that changed files use their own previous tree for incremental reparsing
- assert that unchanged files reuse their previous tree directly without reparsing

This specifically covers the mixed case that motivated the refactor: when one file changes and another does not, the changed file must be reparsed incrementally and the unchanged file must keep its old parse tree rather than being parsed from scratch.
Group incremental parsing tests by one vs multiple update sets, remove the old broad successive-state cases, and add focused multi-step edge cases for single-file and multi-file analyzer reuse.
Pull request overview
This PR introduces incremental parsing support (Tree-sitter) by propagating file invalidation events through the analyzer/cache and reusing prior parse trees + minimal edit regions, with new tests validating correctness against full reparses.
Changes:
- Add file-level invalidation events and an incremental analysis context to carry old file contents + previous parse trees between runs.
- Update parser plumbing to pass analyzer context + filePath into parse calls, enabling Tree-sitter incremental reparsing.
- Add unit + integration tests to verify edit-region computation and incremental parsing equivalence to full parsing.
Reviewed changes
Copilot reviewed 25 out of 26 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| test/functionality/incremental/incremental-parsing.test.ts | New end-to-end incremental parsing scenarios (tree reuse + result equivalence vs full parse). |
| test/functionality/incremental/edit-computation.test.ts | Unit tests for minimal edit-region computation. |
| src/r-bridge/shell.ts | Mark RShell as non-incremental. |
| src/r-bridge/shell-executor.ts | Mark RShellExecutor as non-incremental. |
| src/r-bridge/parser.ts | Extend parser interface to accept context + optional filePath; pass context through parseRequests. |
| src/r-bridge/lang-4.x/tree-sitter/tree-sitter-executor.ts | Implement incremental Tree-sitter parsing via previous tree reuse + computed edit region. |
| src/project/plugins/file-plugins/files/flowr-sweave-file.ts | Adjust FlowrFile base type usage for Sweave wrapper. |
| src/project/plugins/file-plugins/files/flowr-rmarkdown-file.ts | Adjust FlowrFile base type usage for Rmd wrapper. |
| src/project/plugins/file-plugins/files/flowr-jupyter-file.ts | Adjust FlowrFile base type usage for Jupyter wrapper. |
| src/project/incremental/incremental-parse/incremental-parse.ts | Compute incremental reparse info from context (previous tree + old/new content). |
| src/project/incremental/incremental-parse/edit-computation.ts | Compute a minimal Tree-sitter Edit region from old/new text. |
| src/project/flowr-analyzer.ts | Add invalidation event receiver; route parseStandalone through ctx-aware parser APIs. |
| src/project/flowr-analyzer-builder.ts | Add documentation for builder configure. |
| src/project/context/flowr-file.ts | Add invalidation callbacks + inline content update that triggers invalidation. |
| src/project/context/flowr-analyzer-meta-context.ts | Make meta context respond to invalidation events. |
| src/project/context/flowr-analyzer-incremental-analysis-context.ts | New context to store old contents + prior parse trees for incremental runs. |
| src/project/context/flowr-analyzer-files-context.ts | Register file invalidation handlers; expose getFile; handle invalidation events. |
| src/project/context/flowr-analyzer-dependencies-context.ts | Make deps context respond to invalidation events. |
| src/project/context/flowr-analyzer-context.ts | Wire incremental context into the analyzer context; propagate invalidation events; reset via Full invalidation. |
| src/project/cache/flowr-cache.ts | Generalize invalidation events (Full + FileInvalidate) and propagate to dependents. |
| src/project/cache/flowr-analyzer-cache.ts | Store parse snapshots for incremental reuse; handle new invalidation event types. |
| src/documentation/wiki-analyzer.ts | Document incremental analysis context + incremental parsing behavior. |
| src/dataflow/internal/process/functions/call/built-in/built-in-source.ts | Pass analyzer context into parser.parse calls. |
| src/config.ts | Add `this: void` to FlowrConfig helper methods to prevent accidental `this` binding. |
| package.json | Add diff dependency. |
| package-lock.json | Update lock entries for diff. |
Comments suppressed due to low confidence (3)
src/project/cache/flowr-analyzer-cache.ts:164
The `parse`/`normalize`/`dataflow` methods capture `const d = this.get()` before calling `runTapeUntil(force, ...)`. If `force` is true, `runTapeUntil` resets/recreates the pipeline, so `d` refers to the old pipeline output and the `until` predicate will never observe the new results (or may return stale results). Consider moving the `get()` call inside `runTapeUntil`/`until` (e.g., `() => this.get().parse`) or performing the reset before capturing the output object.
public async parse(force?: boolean): Promise<NonNullable<AnalyzerCacheType<Parser>['parse']>> {
const d = this.get();
return this.runTapeUntil(force, () => d.parse);
}
/**
* Get the parse output for the request if already available, otherwise return `undefined`.
* This will not trigger a new parse.
* @see {@link FlowrAnalyzerCache#parse} - to get the parse output, parsing if necessary.
*/
public peekParse(): NonNullable<AnalyzerCacheType<Parser>['parse']> | undefined {
return this.get()?.parse;
}
/**
* Get the normalized abstract syntax tree for the request, normalizing if necessary.
* @param force - Do not use the cache, instead force new analyses.
* @see {@link FlowrAnalyzerCache#peekNormalize} - to get the normalized AST if already available without triggering a new normalization.
*/
public async normalize(force?: boolean): Promise<NonNullable<AnalyzerCacheType<Parser>['normalize']>> {
const d = this.get();
return this.runTapeUntil(force, () => d.normalize);
}
/**
* Get the normalized abstract syntax tree for the request if already available, otherwise return `undefined`.
* This will not trigger a new normalization.
* @see {@link FlowrAnalyzerCache#normalize} - to get the normalized AST, normalizing if necessary.
*/
public peekNormalize(): NonNullable<AnalyzerCacheType<Parser>['normalize']> | undefined {
return this.get()?.normalize;
}
/**
* Get the dataflow graph for the request, computing if necessary.
* @param force - Do not use the cache, instead force new analyses.
* @see {@link FlowrAnalyzerCache#peekDataflow} - to get the dataflow graph if already available without triggering a new computation.
*/
public async dataflow(force?: boolean): Promise<NonNullable<AnalyzerCacheType<Parser>['dataflow']>> {
const d = this.get();
return this.runTapeUntil(force, () => d.dataflow);
}
src/project/context/flowr-analyzer-files-context.ts:246
`addFile` registers an `onInvalidate` handler unconditionally. If the same `FlowrFileProvider` instance is added again (allowed when `exist === f`), this will attach duplicate handlers and cause multiple invalidation cascades per edit. Consider only registering the handler when the file is first added, or make `addOnInvalidate` idempotent (e.g., store handlers in a `Set`).
public addFile(file: string | FlowrFileProvider | RParseRequestFromFile, roles?: readonly FileRole[]) {
const f = this.fileLoadPlugins(wrapFile(file, roles));
f.addOnInvalidate(c => {
this.context.analyzer?.receive(c);
});
if(f.path() === FlowrFile.INLINE_PATH) {
this.inlineFiles.push(f);
} else {
const exist = this.files.get(f.path());
guard(exist === undefined || exist === f, `File ${f.path()} already added to the context.`);
this.files.set(f.path(), f);
}
package.json:214
The PR adds the `diff` dependency, but there are no imports/usages of `diff` in the repository. If it is no longer needed, please remove it from `dependencies` (and the lockfile); otherwise, consider adding the usage in this PR so the new dependency is justified.
"dependencies": {
"@eagleoutice/tree-sitter-r": "^1.1.2",
"@jupyterlab/nbformat": "^4.5.4",
"@xmldom/xmldom": "^0.9.7",
"clipboardy": "^4.0.0",
"command-line-args": "^6.0.1",
"command-line-usage": "^7.0.3",
"commonmark": "^0.31.2",
"dagre": "^0.8.5",
"diff": "^8.0.3",
"gray-matter": "^4.0.3",
"joi": "^18.0.1",
"lz-string": "^1.5.0",
"n-readlines": "^1.0.3",
public parseStandalone(data: `${typeof fileProtocol}${string}` | string | RParseRequest): Tree {
	console.log('parseStandalone');
	const request = isParseRequest(data) ? data : requestFromInput(data);
	if(this.parser.name === 'tree-sitter') {
Leftover console.log('parseStandalone') will print on every standalone parse (including library consumers) and is inconsistent with the rest of the codebase’s logging. Please remove it or use the existing logger infrastructure if you need trace output.
export function computeEditRegion(oldContent: string, newContent: string): Parser.Edit {
	const oldLen = oldContent.length;
	const newLen = newContent.length;

	// 1) Longest common prefix
	let startIndex = 0;
	while(
		startIndex < oldLen &&
		startIndex < newLen &&
		oldContent[startIndex] === newContent[startIndex]
	) {
		startIndex++;
	}

	// 2) Longest common suffix, without overlapping the prefix
	let oldSuffixIndex = oldLen;
	let newSuffixIndex = newLen;
	while(
		oldSuffixIndex > startIndex &&
		newSuffixIndex > startIndex &&
		oldContent[oldSuffixIndex - 1] === newContent[newSuffixIndex - 1]
	) {
		oldSuffixIndex--;
		newSuffixIndex--;
	}

	const oldEndIndex = oldSuffixIndex;
	const newEndIndex = newSuffixIndex;

	return {
		startIndex,
		oldEndIndex,
		newEndIndex,
		startPosition:  indexToPoint(oldContent, startIndex),
		oldEndPosition: indexToPoint(oldContent, oldEndIndex),
		newEndPosition: indexToPoint(newContent, newEndIndex),
	};
}
computeEditRegion calls indexToPoint three times, and indexToPoint scans from the start of the text each time. For larger files this makes edit computation O(n) with a larger constant factor on every incremental parse. Consider computing row/column positions in a single pass up to max(startIndex, oldEndIndex, newEndIndex) or caching newline offsets to avoid repeated full scans.
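A common way to avoid the repeated full scans mentioned above is to precompute line-start offsets once per text and binary-search them per query. This is a sketch under the assumption that a row/column pair is what `indexToPoint` produces; the PR's actual signature may differ, and depending on the tree-sitter binding, `Point` columns may be measured in bytes rather than characters:

```typescript
// Precompute the start offset of every line once, then map any index to a
// { row, column } point via binary search instead of rescanning the text.
function lineStarts(text: string): number[] {
	const starts = [0];
	for(let i = 0; i < text.length; i++) {
		if(text[i] === '\n') {
			starts.push(i + 1);
		}
	}
	return starts;
}

function indexToPointCached(starts: number[], index: number): { row: number; column: number } {
	// binary search for the last line start that is <= index
	let lo = 0, hi = starts.length - 1;
	while(lo < hi) {
		const mid = Math.ceil((lo + hi) / 2);
		if(starts[mid] <= index) {
			lo = mid;
		} else {
			hi = mid - 1;
		}
	}
	return { row: lo, column: index - starts[lo] };
}
```

With this scheme, `computeEditRegion` pays one O(n) pass to build the offsets and then O(log n) per point lookup, instead of three independent scans.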
export class FlowrSweaveFile extends FlowrFile {
	private readonly wrapped: FlowrFileProvider<string>;
	private data?: SweaveInfo;
This change drops the explicit FlowrFile<string> generic and weakens the content() return type to StringableContent, which reduces type safety for callers. If the content is always textual here, consider keeping extends FlowrFile<string> for a more precise API.
export class FlowrRMarkdownFile extends FlowrFile {
	private data?: RmdInfo;
	private readonly wrapped: FlowrFileProvider<string>;
This change drops the explicit FlowrFile<string> generic and weakens the content() return type to StringableContent, which reduces type safety for callers. If the content is always textual here, consider keeping extends FlowrFile<string> for a more precise API.
export class FlowrJupyterFile extends FlowrFile {
	private readonly wrapped: FlowrFileProvider<string>;
This change drops the explicit FlowrFile<string> generic and weakens the content() return type to StringableContent, which reduces type safety for callers. If the content is always textual here, consider keeping extends FlowrFile<string> for a more precise API.
No description provided.