Incremental Parsing#2387

Draft
EagleoutIce wants to merge 19 commits into main from staging/incremental-analysis

Conversation

@EagleoutIce
Member

No description provided.

EagleoutIce and others added 19 commits March 6, 2026 20:15
Remove the condition that the file has to be in the files context's consideredFilesList, as merely parsing a file does not add it to consideredFilesList
Rework the incremental parsing handoff so that (1) invalidation no longer
prepares incremental reparses only for changed files while forcing unchanged
files through a full parse again, and (2) invalidation no longer
computes eager reparse metadata inside FlowrAnalyzerCache.

Previously, a mixed update such as “A changed, B unchanged” behaved like
this:
- reparse info was generated for A
- no reparse info was generated for B
- the cache pipeline was rebuilt
- A was reparsed incrementally
- B was parsed from scratch

The new architecture fixes that by treating the previous successful parse
run as the baseline for the next one:

- File invalidation events now record the file path together with the old
  source text in the incremental analysis context.
- FlowrAnalyzerCache snapshots the latest completed Tree-sitter parse
  results after successful parse-oriented runs.
- On the next parse request, TreeSitterExecutor derives reparse info
  lazily from:
  - the old parse tree
  - the old source text, if the file was invalidated
  - the current file content
- Changed files get a minimal edit region and are reparsed incrementally.
- Unchanged files now reuse their previous parse tree directly instead of
  being parsed again from scratch.
- Parser call sites, invalidation plumbing, and documentation were
  updated to support this context-driven flow.

Net effect: incremental parsing now correctly handles mixed workloads by
incrementally reparsing changed files while reusing old parse results for
unchanged files.
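
The baseline-driven handoff described above can be illustrated with a minimal, self-contained sketch. The names here (`ParseSnapshot`, `decideReparse`, and the shape of the returned decision) are illustrative assumptions for exposition, not flowR's actual API:

```typescript
// Illustrative sketch of the lazy reparse decision; names and shapes are
// assumptions, not flowR's real API.
interface ParseSnapshot {
	tree:        unknown;  // previous Tree-sitter tree for this file
	oldContent?: string;   // recorded only if the file was invalidated
}

type ReparseDecision =
	| { kind: 'reuse-tree' }   // unchanged: keep the old tree directly
	| { kind: 'incremental'; startIndex: number; oldEndIndex: number; newEndIndex: number }
	| { kind: 'full-parse' };  // no baseline available

function decideReparse(snapshot: ParseSnapshot | undefined, newContent: string): ReparseDecision {
	if(snapshot === undefined) {
		return { kind: 'full-parse' };
	}
	if(snapshot.oldContent === undefined || snapshot.oldContent === newContent) {
		// never invalidated (or content identical): reuse the previous tree
		return { kind: 'reuse-tree' };
	}
	// changed: derive a minimal edit region via longest common prefix/suffix
	const oldC = snapshot.oldContent;
	let start = 0;
	while(start < oldC.length && start < newContent.length && oldC[start] === newContent[start]) {
		start++;
	}
	let oldEnd = oldC.length, newEnd = newContent.length;
	while(oldEnd > start && newEnd > start && oldC[oldEnd - 1] === newContent[newEnd - 1]) {
		oldEnd--;
		newEnd--;
	}
	return { kind: 'incremental', startIndex: start, oldEndIndex: oldEnd, newEndIndex: newEnd };
}
```

An 'incremental' decision would then translate into a Tree-sitter `tree.edit(...)` followed by a reparse that is handed the old tree, while 'reuse-tree' skips parsing entirely.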
Adapt the incremental parsing tests to the new architecture where
incremental state is derived lazily from the previous successful parse
run instead of from eagerly stored reparse info.

The tests no longer inspect the removed reparseInfoMap. Instead they now:
- capture the previous Tree-sitter trees after the first analysis run
- invalidate files by updating their content
- verify that invalidation clears the current parse pipeline
- trace the second parse run to observe which previous tree is reused for
  which file
- assert that changed files use their own previous tree for incremental
  reparsing
- assert that unchanged files reuse their previous tree directly without
  reparsing

This specifically covers the mixed case that motivated the refactor:
when one file changes and another does not, the changed file must be
reparsed incrementally and the unchanged file must keep its old parse
tree rather than being parsed from scratch.
Group incremental parsing tests by one vs multiple update sets, remove the old broad successive-state cases, and add focused multi-step edge cases for single-file and multi-file analyzer reuse.

Copilot AI left a comment

Pull request overview

This PR introduces incremental parsing support (Tree-sitter) by propagating file invalidation events through the analyzer/cache and reusing prior parse trees + minimal edit regions, with new tests validating correctness against full reparses.

Changes:

  • Add file-level invalidation events and an incremental analysis context to carry old file contents + previous parse trees between runs.
  • Update parser plumbing to pass analyzer context + filePath into parse calls, enabling Tree-sitter incremental reparsing.
  • Add unit + integration tests to verify edit-region computation and incremental parsing equivalence to full parsing.

Reviewed changes

Copilot reviewed 25 out of 26 changed files in this pull request and generated 5 comments.

Summary per file:

test/functionality/incremental/incremental-parsing.test.ts: New end-to-end incremental parsing scenarios (tree reuse + result equivalence vs full parse).
test/functionality/incremental/edit-computation.test.ts: Unit tests for minimal edit-region computation.
src/r-bridge/shell.ts: Mark RShell as non-incremental.
src/r-bridge/shell-executor.ts: Mark RShellExecutor as non-incremental.
src/r-bridge/parser.ts: Extend parser interface to accept context + optional filePath; pass context through parseRequests.
src/r-bridge/lang-4.x/tree-sitter/tree-sitter-executor.ts: Implement incremental Tree-sitter parsing via previous tree reuse + computed edit region.
src/project/plugins/file-plugins/files/flowr-sweave-file.ts: Adjust FlowrFile base type usage for Sweave wrapper.
src/project/plugins/file-plugins/files/flowr-rmarkdown-file.ts: Adjust FlowrFile base type usage for Rmd wrapper.
src/project/plugins/file-plugins/files/flowr-jupyter-file.ts: Adjust FlowrFile base type usage for Jupyter wrapper.
src/project/incremental/incremental-parse/incremental-parse.ts: Compute incremental reparse info from context (previous tree + old/new content).
src/project/incremental/incremental-parse/edit-computation.ts: Compute a minimal Tree-sitter Edit region from old/new text.
src/project/flowr-analyzer.ts: Add invalidation event receiver; route parseStandalone through ctx-aware parser APIs.
src/project/flowr-analyzer-builder.ts: Add documentation for builder configure.
src/project/context/flowr-file.ts: Add invalidation callbacks + inline content update that triggers invalidation.
src/project/context/flowr-analyzer-meta-context.ts: Make meta context respond to invalidation events.
src/project/context/flowr-analyzer-incremental-analysis-context.ts: New context to store old contents + prior parse trees for incremental runs.
src/project/context/flowr-analyzer-files-context.ts: Register file invalidation handlers; expose getFile; handle invalidation events.
src/project/context/flowr-analyzer-dependencies-context.ts: Make deps context respond to invalidation events.
src/project/context/flowr-analyzer-context.ts: Wire incremental context into the analyzer context; propagate invalidation events; reset via Full invalidation.
src/project/cache/flowr-cache.ts: Generalize invalidation events (Full + FileInvalidate) and propagate to dependents.
src/project/cache/flowr-analyzer-cache.ts: Store parse snapshots for incremental reuse; handle new invalidation event types.
src/documentation/wiki-analyzer.ts: Document incremental analysis context + incremental parsing behavior.
src/dataflow/internal/process/functions/call/built-in/built-in-source.ts: Pass analyzer context into parser.parse calls.
src/config.ts: Add this: void to FlowrConfig helper methods to prevent accidental this binding.
package.json: Add diff dependency.
package-lock.json: Update lock entries for diff.
Comments suppressed due to low confidence (3)

src/project/cache/flowr-analyzer-cache.ts:164

  • parse/normalize/dataflow capture const d = this.get() before calling runTapeUntil(force, ...). If force is true, runTapeUntil resets/recreates the pipeline, so d refers to the old pipeline output and the until predicate will never observe the new results (or may return stale results). Consider moving the get() call inside runTapeUntil/until (e.g., () => this.get().parse) or performing the reset before capturing the output object.
	public async parse(force?: boolean): Promise<NonNullable<AnalyzerCacheType<Parser>['parse']>> {
		const d = this.get();
		return this.runTapeUntil(force, () => d.parse);
	}

	/**
	 * Get the parse output for the request if already available, otherwise return `undefined`.
	 * This will not trigger a new parse.
	 * @see {@link FlowrAnalyzerCache#parse} - to get the parse output, parsing if necessary.
	 */
	public peekParse(): NonNullable<AnalyzerCacheType<Parser>['parse']> | undefined {
		return this.get()?.parse;
	}

	/**
	 * Get the normalized abstract syntax tree for the request, normalizing if necessary.
	 * @param force - Do not use the cache, instead force new analyses.
	 * @see {@link FlowrAnalyzerCache#peekNormalize} - to get the normalized AST if already available without triggering a new normalization.
	 */
	public async normalize(force?: boolean): Promise<NonNullable<AnalyzerCacheType<Parser>['normalize']>> {
		const d = this.get();
		return this.runTapeUntil(force, () => d.normalize);
	}

	/**
	 * Get the normalized abstract syntax tree for the request if already available, otherwise return `undefined`.
	 * This will not trigger a new normalization.
	 * @see {@link FlowrAnalyzerCache#normalize} - to get the normalized AST, normalizing if necessary.
	 */
	public peekNormalize(): NonNullable<AnalyzerCacheType<Parser>['normalize']> | undefined {
		return this.get()?.normalize;
	}

	/**
	 * Get the dataflow graph for the request, computing if necessary.
	 * @param force - Do not use the cache, instead force new analyses.
	 * @see {@link FlowrAnalyzerCache#peekDataflow} - to get the dataflow graph if already available without triggering a new computation.
	 */
	public async dataflow(force?: boolean): Promise<NonNullable<AnalyzerCacheType<Parser>['dataflow']>> {
		const d = this.get();
		return this.runTapeUntil(force, () => d.dataflow);
	}
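
The stale-capture issue flagged above can be demonstrated with a self-contained stand-in. The `Cache` class here is illustrative, not the real `FlowrAnalyzerCache`; its `runTapeUntil` only mimics the reset-and-recompute behavior of a forced run:

```typescript
// Stand-in cache illustrating why capturing `get()` before a forced run
// observes stale data once the pipeline object is recreated.
class Cache {
	private slot = { parse: 'old-result' as string | undefined };

	get() {
		return this.slot;
	}

	runTapeUntil(force: boolean, until: () => string | undefined): string | undefined {
		if(force) {
			this.slot = { parse: undefined }; // reset recreates the pipeline object
			this.slot.parse = 'new-result';   // ...and recomputes its results
		}
		return until();
	}

	// captures the pipeline object before the reset: observes the stale value
	parseStale(force: boolean) {
		const d = this.get();
		return this.runTapeUntil(force, () => d.parse);
	}

	// defers the lookup into the predicate: observes the recreated pipeline
	parseFresh(force: boolean) {
		return this.runTapeUntil(force, () => this.get().parse);
	}
}
```

With `force` set, `parseStale` still returns the pre-reset result while `parseFresh` sees the recomputed one, matching the fix suggested in the comment.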

src/project/context/flowr-analyzer-files-context.ts:246

  • addFile registers an onInvalidate handler unconditionally. If the same FlowrFileProvider instance is added again (allowed when exist === f), this will attach duplicate handlers and cause multiple invalidation cascades per edit. Consider only registering the handler when the file is first added, or make addOnInvalidate idempotent (e.g., store handlers in a Set).
	public addFile(file: string | FlowrFileProvider | RParseRequestFromFile, roles?: readonly FileRole[]) {
		const f = this.fileLoadPlugins(wrapFile(file, roles));

		f.addOnInvalidate(c => {
			this.context.analyzer?.receive(c);
		});

		if(f.path() === FlowrFile.INLINE_PATH) {
			this.inlineFiles.push(f);
		} else {
			const exist = this.files.get(f.path());
			guard(exist === undefined || exist === f, `File ${f.path()} already added to the context.`);
			this.files.set(f.path(), f);
		}
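
Making `addOnInvalidate` idempotent, as suggested, could be sketched like this (the `FileStub` class is a simplified stand-in, not the actual `FlowrFile` API):

```typescript
type InvalidateHandler = (cause: string) => void;

// Simplified stand-in for a file provider whose invalidation handlers live
// in a Set, so re-registering the same handler instance is a no-op.
class FileStub {
	private readonly onInvalidateHandlers = new Set<InvalidateHandler>();

	addOnInvalidate(handler: InvalidateHandler): void {
		this.onInvalidateHandlers.add(handler); // Set semantics: one entry per instance
	}

	invalidate(cause: string): void {
		for(const handler of this.onInvalidateHandlers) {
			handler(cause);
		}
	}
}
```

One caveat: a Set only deduplicates identical function instances. Since `addFile` creates a fresh arrow function on every call, a Set alone would not prevent duplicates there; registering the handler only when the file is first added may be the more robust variant of the fix.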

package.json:214

  • The PR adds the diff dependency, but there are no imports/usages of diff in the repository. If it’s not needed anymore, please remove it from dependencies (and the lockfile); otherwise, consider adding the usage in this PR so the new dependency is justified.
  "dependencies": {
    "@eagleoutice/tree-sitter-r": "^1.1.2",
    "@jupyterlab/nbformat": "^4.5.4",
    "@xmldom/xmldom": "^0.9.7",
    "clipboardy": "^4.0.0",
    "command-line-args": "^6.0.1",
    "command-line-usage": "^7.0.3",
    "commonmark": "^0.31.2",
    "dagre": "^0.8.5",
    "diff": "^8.0.3",
    "gray-matter": "^4.0.3",
    "joi": "^18.0.1",
    "lz-string": "^1.5.0",
    "n-readlines": "^1.0.3",


Comment on lines 217 to 220
	public parseStandalone(data: `${typeof fileProtocol}${string}` | string | RParseRequest): Tree {
		console.log('parseStandalone');
		const request = isParseRequest(data) ? data : requestFromInput(data);
		if(this.parser.name === 'tree-sitter') {

Copilot AI Apr 8, 2026

Leftover console.log('parseStandalone') will print on every standalone parse (including library consumers) and is inconsistent with the rest of the codebase’s logging. Please remove it or use the existing logger infrastructure if you need trace output.

Comment on lines +9 to +45
export function computeEditRegion(oldContent: string, newContent: string): Parser.Edit {
	const oldLen = oldContent.length;
	const newLen = newContent.length;

	// 1) Longest common prefix
	let startIndex = 0;
	while(
		startIndex < oldLen &&
		startIndex < newLen &&
		oldContent[startIndex] === newContent[startIndex]
	) {
		startIndex++;
	}

	// 2) Longest common suffix, without overlapping the prefix
	let oldSuffixIndex = oldLen;
	let newSuffixIndex = newLen;
	while(
		oldSuffixIndex > startIndex &&
		newSuffixIndex > startIndex &&
		oldContent[oldSuffixIndex - 1] === newContent[newSuffixIndex - 1]
	) {
		oldSuffixIndex--;
		newSuffixIndex--;
	}

	const oldEndIndex = oldSuffixIndex;
	const newEndIndex = newSuffixIndex;

	return {
		startIndex,
		oldEndIndex,
		newEndIndex,
		startPosition:  indexToPoint(oldContent, startIndex),
		oldEndPosition: indexToPoint(oldContent, oldEndIndex),
		newEndPosition: indexToPoint(newContent, newEndIndex),
	};

Copilot AI Apr 8, 2026

computeEditRegion calls indexToPoint three times, and indexToPoint scans from the start of the text each time. For larger files this makes edit computation O(n) with a larger constant factor on every incremental parse. Consider computing row/column positions in a single pass up to max(startIndex, oldEndIndex, newEndIndex) or caching newline offsets to avoid repeated full scans.
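
Caching newline offsets, as the comment suggests, could look like the following sketch (self-contained; `indexToPointCached` is a hypothetical replacement, and the real `indexToPoint` signature in the PR may differ):

```typescript
interface Point { row: number; column: number }

// Precompute the start offset of every line once (O(n) over the text)...
function buildLineOffsets(text: string): number[] {
	const offsets = [0];
	for(let i = 0; i < text.length; i++) {
		if(text[i] === '\n') {
			offsets.push(i + 1);
		}
	}
	return offsets;
}

// ...so that each index -> Point lookup becomes an O(log n) binary search
// for the last line start <= index, instead of a full rescan of the text.
function indexToPointCached(offsets: number[], index: number): Point {
	let lo = 0, hi = offsets.length - 1;
	while(lo < hi) {
		const mid = (lo + hi + 1) >> 1;
		if(offsets[mid] <= index) {
			lo = mid;
		} else {
			hi = mid - 1;
		}
	}
	return { row: lo, column: index - offsets[lo] };
}
```

All three points in `computeEditRegion` could then share one `buildLineOffsets` result per content string, so the text is scanned once rather than three times.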

Comment on lines +10 to 12
export class FlowrSweaveFile extends FlowrFile {
	private readonly wrapped: FlowrFileProvider<string>;
	private data?: SweaveInfo;

Copilot AI Apr 8, 2026

This change drops the explicit FlowrFile<string> generic and weakens the content() return type to StringableContent, which reduces type safety for callers. If the content is always textual here, consider keeping extends FlowrFile<string> for a more precise API.

Comment on lines +13 to 16
export class FlowrRMarkdownFile extends FlowrFile {
	private data?: RmdInfo;
	private readonly wrapped: FlowrFileProvider<string>;


Copilot AI Apr 8, 2026

This change drops the explicit FlowrFile<string> generic and weakens the content() return type to StringableContent, which reduces type safety for callers. If the content is always textual here, consider keeping extends FlowrFile<string> for a more precise API.

Comment on lines +10 to 12
export class FlowrJupyterFile extends FlowrFile {
	private readonly wrapped: FlowrFileProvider<string>;


Copilot AI Apr 8, 2026

This change drops the explicit FlowrFile<string> generic and weakens the content() return type to StringableContent, which reduces type safety for callers. If the content is always textual here, consider keeping extends FlowrFile<string> for a more precise API.
