holtvogt commented Aug 7, 2025

JPlag's current Python support relies on ANTLR grammars that struggle with modern Python syntax (3.10+ match statements, 3.11+ exception groups, 3.12+ type aliases) and require maintaining custom grammars that lag behind language evolution.

This PR introduces Tree-sitter as JPlag's new parsing foundation, starting with Python. Tree-sitter provides native parsing performance, community-maintained grammars that stay current with language specs, and cross-platform native library distribution.

Technical Notes

  • Requires JDK 22 for the Foreign Function and Memory (FFM) API, plus jextract to generate Java bindings from the Tree-sitter C headers
  • Depends on Zig build support in Tree-sitter repositories for Windows library compilation
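
The two toolchain requirements above can be checked before invoking Maven. The following is a minimal sketch and not part of this PR: the class `ToolchainCheck` and its helpers are hypothetical, and it assumes that jextract is expected on the `PATH` under the name `jextract` (or `jextract.bat` on Windows).

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical preflight check; names are illustrative and not taken from JPlag.
class ToolchainCheck {
    // The PR states that JDK 22 is required for the FFM API.
    static final int REQUIRED_FEATURE_VERSION = 22;

    static boolean isJdkSufficient(int featureVersion) {
        return featureVersion >= REQUIRED_FEATURE_VERSION;
    }

    // Returns the first regular file among candidateNames found in any PATH directory.
    static Optional<Path> findOnPath(String pathVariable, String... candidateNames) {
        if (pathVariable == null || pathVariable.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathVariable.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : candidateNames) {
                Path candidate = Path.of(dir, name);
                if (Files.isRegularFile(candidate)) {
                    return Optional.of(candidate);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        int feature = Runtime.version().feature();
        System.out.println("JDK feature version " + feature
                + (isJdkSufficient(feature) ? " is sufficient" : " is too old, JDK 22 is required"));

        // Assumption: the executable is called jextract (jextract.bat on Windows).
        Optional<Path> jextract = findOnPath(System.getenv("PATH"), "jextract", "jextract.bat");
        System.out.println(jextract.map(p -> "jextract found at " + p)
                .orElse("jextract not found on PATH"));
    }
}
```

Running this with the same JDK that Maven uses gives a quick indication of whether a build failure might be related to the toolchain rather than to the sources.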

Architecture Changes

New language-tree-sitter-utils module

  • Uses Java's FFM API bindings for Tree-sitter native libraries
  • Introduces an abstract base class for future Tree-sitter language implementations
  • Enables cross-platform native library loading
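
Cross-platform loading usually comes down to mapping the running platform to the right library file. The sketch below illustrates one possible scheme under stated assumptions: the class `NativeLibraryNames`, the `native/<version>/<os>-<arch>/` resource layout, the library base name, and the version string `1.0.0` are all hypothetical and may differ from what the module in this PR actually does.

```java
import java.util.Locale;

// Hypothetical helper; the resource layout is an assumption, not JPlag's actual layout.
class NativeLibraryNames {
    // Maps the os.name system property to a short platform identifier.
    static String normalizeOs(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.startsWith("mac") || os.contains("darwin")) {
            return "macos";
        }
        if (os.startsWith("windows")) {
            return "windows";
        }
        if (os.contains("linux")) {
            return "linux";
        }
        throw new UnsupportedOperationException("Unsupported OS: " + osName);
    }

    // Maps the os.arch system property to a canonical architecture name.
    static String normalizeArch(String osArch) {
        return switch (osArch.toLowerCase(Locale.ROOT)) {
            case "amd64", "x86_64" -> "x86_64";
            case "aarch64", "arm64" -> "aarch64";
            default -> throw new UnsupportedOperationException("Unsupported architecture: " + osArch);
        };
    }

    // Applies the platform's shared-library naming convention.
    static String libraryFileName(String os, String baseName) {
        return switch (os) {
            case "windows" -> baseName + ".dll";
            case "macos" -> "lib" + baseName + ".dylib";
            case "linux" -> "lib" + baseName + ".so";
            default -> throw new UnsupportedOperationException("Unsupported OS: " + os);
        };
    }

    // Builds a versioned classpath resource path for the native library.
    static String resourcePath(String osName, String osArch, String baseName, String version) {
        String os = normalizeOs(osName);
        return "native/" + version + "/" + os + "-" + normalizeArch(osArch) + "/"
                + libraryFileName(os, baseName);
    }

    public static void main(String[] args) {
        // "1.0.0" is a placeholder version used only for illustration.
        System.out.println(resourcePath(System.getProperty("os.name"),
                System.getProperty("os.arch"), "tree-sitter-python", "1.0.0"));
    }
}
```

Taking the OS and architecture as parameters, rather than calling `System.mapLibraryName` directly, keeps the mapping testable for platforms other than the one running the code.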

First Tree-sitter Python language module implementation

  • PythonParser: Orchestrates parsing and delegates to token collector
  • PythonTokenCollector: AST visitor that maps Tree-sitter nodes to JPlag tokens
  • TreeSitterPython: Singleton grammar loader using FFM API
  • Handler-based token extraction with support for nested constructs
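
To illustrate what handler-based token extraction with nested constructs can look like, here is a minimal, self-contained sketch. It does not use the Tree-sitter bindings or JPlag's API: `Node`, `HandlerVisitor`, `PythonLikeCollector`, and the token names are hypothetical stand-ins, and only the node type strings follow the tree-sitter-python grammar.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Illustrative sketch only; these types are not the classes introduced by this PR.
class HandlerVisitorSketch {
    // Simplified stand-in for a syntax tree node.
    record Node(String type, List<Node> children) {
        static Node of(String type, Node... children) {
            return new Node(type, List.of(children));
        }
    }

    // Base visitor: subclasses register handlers per node type, traversal is shared.
    abstract static class HandlerVisitor {
        private final Map<String, Consumer<Node>> enterHandlers = new HashMap<>();
        private final Map<String, Consumer<Node>> exitHandlers = new HashMap<>();
        private final List<String> tokens = new ArrayList<>();

        protected void onEnter(String nodeType, Consumer<Node> handler) {
            enterHandlers.put(nodeType, handler);
        }

        protected void onExit(String nodeType, Consumer<Node> handler) {
            exitHandlers.put(nodeType, handler);
        }

        protected void emit(String token) {
            tokens.add(token);
        }

        // Depth-first traversal: enter handler, children, then exit handler.
        final void visit(Node node) {
            Consumer<Node> enter = enterHandlers.get(node.type());
            if (enter != null) {
                enter.accept(node);
            }
            for (Node child : node.children()) {
                visit(child);
            }
            Consumer<Node> exit = exitHandlers.get(node.type());
            if (exit != null) {
                exit.accept(node);
            }
        }

        List<String> tokens() {
            return List.copyOf(tokens);
        }
    }

    // Hypothetical collector; the token names are made up for this example.
    static class PythonLikeCollector extends HandlerVisitor {
        PythonLikeCollector() {
            onEnter("function_definition", node -> emit("FUNCTION_BEGIN"));
            onExit("function_definition", node -> emit("FUNCTION_END"));
            onEnter("if_statement", node -> emit("IF_BEGIN"));
            onExit("if_statement", node -> emit("IF_END"));
            onEnter("return_statement", node -> emit("RETURN"));
        }
    }

    // Tree for: def f(): if cond: return
    static Node sampleTree() {
        return Node.of("module",
                Node.of("function_definition",
                        Node.of("block",
                                Node.of("if_statement",
                                        Node.of("block", Node.of("return_statement"))))));
    }

    public static void main(String[] args) {
        PythonLikeCollector collector = new PythonLikeCollector();
        collector.visit(sampleTree());
        System.out.println(collector.tokens());
    }
}
```

Because exit handlers run after all children have been visited, the begin and end tokens of nested constructs are emitted in properly nested order without the collector tracking any state itself.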

Native library build system

  • GitHub Actions workflow for cross-platform compilation (Linux/macOS/Windows)
  • Local native library compilation via mvn -Pbuild-native-libraries generate-resources
  • Supports versioning of native Tree-sitter libraries to enable language modules to run on different Tree-sitter releases

Testing

# Build native libraries locally
mvn -Pbuild-native-libraries generate-resources

# Test new Python module
mvn clean test-compile -pl :python -am
mvn test -pl :python

# Test CLI integration
mvn -Pwith-report-viewer clean package assembly:single
java -jar cli/target/jplag-*-jar-with-dependencies.jar -l python [files]

Related

holtvogt and others added 30 commits May 21, 2025 22:46
Convert to abstract class with handler maps and template method pattern.
Reduces boilerplate and enforces consistent visitor implementations.
The "Adapter" suffix was misleading as these classes are direct parsers,
not adapters between incompatible interfaces. Updated all references
and documentation to reflect the cleaner, more accurate naming.
Removed redundant token list initialization from PythonTokenCollector and added a centralized token list in TreeSitterVisitor. Introduced a method to retrieve collected tokens.
Updated the language module documentation to clarify core components and added sections for ANTLR and Tree-sitter parsing technologies. Included detailed examples for setting up language modules with Tree-sitter, emphasizing its implementation specifics.
@holtvogt holtvogt marked this pull request as ready for review September 2, 2025 21:03
@tsaglam tsaglam added the enhancement, major, and language labels Sep 4, 2025
@robinmaisch robinmaisch requested a review from a team September 4, 2025 13:52
@tsaglam tsaglam requested a review from Copilot September 15, 2025 06:55
Copilot AI left a comment

Pull Request Overview

This PR adds Tree-sitter-based Python language support to JPlag. It covers modern Python syntax (3.10+ match statements, 3.11+ exception groups, 3.12+ type aliases) by using Tree-sitter's native parser and community-maintained grammars.

  • Implements new language-tree-sitter-utils module with abstract base classes and native library management
  • Adds Tree-sitter Python language module with token extraction for all Python constructs
  • Updates Java compiler target from JDK 21 to JDK 22 for Foreign Function and Memory API support

Reviewed Changes

Copilot reviewed 52 out of 53 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| scripts/build-native-libraries.sh | Cross-platform native library build script for Tree-sitter grammars |
| pom.xml | Updates Java version to 22 and adds Tree-sitter dependencies and build profile |
| language-tree-sitter-utils/ | New module with abstract base classes for Tree-sitter language implementations |
| languages/python/ | Complete Tree-sitter Python implementation with parser, token collector, and tests |
| CI workflows | Updates Java version and adds native library build steps |

ruro commented Dec 20, 2025

I'm interested in this PR, so I tried building it. However, the build failed with errors along the lines of:

[ERROR] COMPILATION ERROR : 
[INFO] -------------------------------------------------------------
[ERROR] /build/source/core/src/main/java/de/jplag/options/JPlagOptions.java:[60,39] package JPlagOptionsBuilder does not exist
[ERROR] /build/source/core/src/main/java/de/jplag/options/JPlagOptions.java:[75,20] cannot find symbol
  symbol: method withLanguage(de.jplag.Language)
[ERROR] /build/source/core/src/main/java/de/jplag/options/JPlagOptions.java:[211,24] cannot find symbol
  symbol: method withBaseCodeSubmissionDirectory(java.io.File)
[INFO] 3 errors 
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for JPlag Plagiarism Detector 6.2.0:
[INFO] 
[INFO] JPlag Plagiarism Detector .......................... SUCCESS [ 11.823 s]
[INFO] JPlag: API for Language Modules .................... SUCCESS [  7.408 s]
[INFO] JPlag: Utilities for Language Module Testing ....... SUCCESS [  2.252 s]
[INFO] JPlag: Language Modules (Parent) ................... SUCCESS [  1.451 s]
[INFO] JPlag: Java Language Module ........................ SUCCESS [  4.986 s]
[INFO] JPlag: Core ........................................ FAILURE [  1.579 s]
[INFO] JPlag: Text Language Module ........................ SKIPPED
[INFO] language-tree-sitter-utils ......................... SKIPPED
[INFO] JPlag: Utilities for ANTLR-based Language Modules .. SKIPPED
[INFO] JPlag: Python3 Language Module ..................... SKIPPED
[INFO] python ............................................. SKIPPED
[INFO] JPlag: C# Language Module .......................... SKIPPED
[INFO] JPlag: C Language Module ........................... SKIPPED
[INFO] JPlag: C++ Language Module ......................... SKIPPED
[INFO] JPlag: Go Language Module .......................... SKIPPED
[INFO] JPlag: Kotlin Language Module ...................... SKIPPED
[INFO] JPlag: R Language Module ........................... SKIPPED
[INFO] JPlag: Rust Language Module ........................ SKIPPED
[INFO] JPlag: Scala Language Module ....................... SKIPPED
[INFO] JPlag: Scheme Language Module ...................... SKIPPED
[INFO] JPlag: SCXML Language Module ....................... SKIPPED
[INFO] JPlag: Swift Language Module ....................... SKIPPED
[INFO] JPlag: EMF Metamodel Language Module ............... SKIPPED
[INFO] JPlag: EMF Metamodel (Dynamic) Language Module ..... SKIPPED
[INFO] JPlag: EMF Model Language Module ................... SKIPPED
[INFO] JPlag: TypeScript Language Module .................. SKIPPED
[INFO] JPlag: JavaScript Language Module .................. SKIPPED
[INFO] JPlag: LLVM IR Language Module ..................... SKIPPED
[INFO] JPlag: Multi-Language Module ....................... SKIPPED
[INFO] JPlag: Command Line Interface ...................... SKIPPED
[INFO] endtoend-testing ................................... SKIPPED
[INFO] coverage-report .................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  36.082 s
[INFO] Finished at: 2025-12-19T23:08:44Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.13.0:compile (default-compile) on project jplag: Compilation failure: Compilation failure: 
[ERROR] /build/source/core/src/main/java/de/jplag/options/JPlagOptions.java:[60,39] package JPlagOptionsBuilder does not exist
[ERROR] /build/source/core/src/main/java/de/jplag/options/JPlagOptions.java:[75,20] cannot find symbol
[ERROR]   symbol: method withLanguage(de.jplag.Language)
[ERROR] /build/source/core/src/main/java/de/jplag/options/JPlagOptions.java:[211,24] cannot find symbol
[ERROR]   symbol: method withBaseCodeSubmissionDirectory(java.io.File)

I was able to successfully build both v6.2.0 and v6.3.0, but both jplag:feature/tree-sitter-parser-integration and holtvogt:feature/tree-sitter-parser-integration fail with the above error.

It seems that de.jplag.cli isn't being included in the build (or at least not at this stage of the build)? I am not a Java dev, so I might be missing something obvious. Any ideas what could be the cause?

holtvogt commented

@ruro Thanks for your interest in this PR. Unfortunately, I couldn't reproduce this when I cloned the fork again.

However, the Tree-sitter integration requires JDK 22 and jextract because it uses Java's newly introduced Foreign Function & Memory API (JEP 454) to link against the native functions of the Tree-sitter C libraries.

Did you install both beforehand? Feel free to look at #2370 if you haven't already. You'll find short installation instructions for jextract in the comments. Let me know if that solves the build issue for you.
