This repository contains the unified solution for SemEval-2026 Task 13, an international competition focused on the detection, attribution, and analysis of source code generated by Large Language Models (LLMs) versus human-written code.
The project is structured in a modular way to address the three subtasks of the competition, sharing a common execution environment and a centralized data-analysis pipeline.
The project is divided into three main modules, each with specific objectives and architectures. Click on the task name to access detailed documentation.
| Task | Name | Objective | Problem Type |
|---|---|---|---|
| Subtask A | Machine-Generated Code Detection | Distinguish whether code is written by a Human or a Machine. | Binary Classification |
| Subtask B | Multi-Class Authorship Detection | Identify the specific author model (e.g., GPT-4, Llama-3). | Multi-class Classification |
| Subtask C | Mixed-Source Analysis | Analyze modifications, refactoring, and hybrid Human/AI code. | Regression / Hybrid |
The folder organization is designed to separate data, analysis visuals, and source code.
```
.
├── 📁 data/                 # Datasets (parquet) split by Task
├── 📁 img/                  # Visual outputs from analysis scripts (EDA)
│   ├── 📁 img_TaskA/        # Plots specific to Task A
│   ├── 📁 img_TaskB/        # Plots specific to Task B
│   └── 📁 img_TaskC/        # Plots specific to Task C
│
├── 📁 info_dataset/         # Scripts for statistical data analysis
│   ├── 🐍 info_dataset_subTaskA.py
│   ├── 🐍 info_dataset_subTaskB.py
│   └── 🐍 info_dataset_subtaskC.py
│
├── 📁 src/                  # Model source code
│   ├── 📁 src_TaskA/        # Complete pipeline for Subtask A
│   ├── 📁 src_TaskB/        # Complete pipeline for Subtask B
│   └── 📁 src_TaskC/        # Complete pipeline for Subtask C
│
├── 🐍 data.py               # Downloads the chosen dataset from Kaggle
│
├── 📝 README.md
├── 📄 prepare.sh            # Setup automation script (folder creation & env)
├── ⚙️ environment.yml       # Shared Conda dependencies
└── ⚙️ .env                  # Environment variables (generated by prepare.sh)
```
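Since the datasets are distributed as `.parquet` files, any split can be inspected directly with pandas. A minimal sketch (the file name below is hypothetical; substitute whichever parquet file the Kaggle download produces):

```python
import pandas as pd

# Hypothetical file name: use an actual .parquet file from data/.
# Requires pyarrow or fastparquet as the parquet engine.
df = pd.read_parquet("data/subtask_a_train.parquet")

print(df.shape)             # number of samples and columns
print(df.columns.tolist())  # available fields
```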
**Important:** remember to generate the `kaggle.json` file from your Kaggle account:

```json
{"username":"your_username","key":"your_kaggle_key"}
```

Since the three tasks share the same base dependencies and structure, a centralized setup has been prepared to facilitate startup. Prerequisites:
- Anaconda or Miniconda installed on your system.
- An NVIDIA GPU with updated drivers (recommended for training).
- Linux/Mac OS (or WSL for Windows) to run bash scripts.
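Before proceeding, you can optionally verify that the prerequisites are in place:

```bash
nvidia-smi       # should list your NVIDIA GPU and driver version
conda --version  # confirms Anaconda/Miniconda is on PATH
```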
Run the `prepare.sh` script from the project root. This script will:

- Create the output directory structure (`results`, `checkpoints`, etc.).
- Generate the `.env` file for environment variables.
- Create the Conda virtual environment defined in `environment.yml` and install its dependencies.

A minimal sketch of these steps is shown below for reference.
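This sketch is illustrative only; directory names beyond `results` and `checkpoints` are assumptions, and the actual `prepare.sh` in this repository may differ:

```bash
#!/usr/bin/env bash
set -e

# 1. Output directory structure (names beyond results/checkpoints are assumed)
mkdir -p results checkpoints

# 2. Environment-variable template (values to be filled in by the user)
cat > .env <<'EOF'
KAGGLE_USERNAME=Your_kaggle_username
KAGGLE_KEY=Your_kaggle_key
DATA_PATH=./data
IMG_PATH=./img
EOF

# 3. Shared Conda environment
conda env create -f environment.yml
```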
Make the script executable and run it:

```bash
chmod +x prepare.sh
./prepare.sh
```

Once the setup is complete, activate the environment:

```bash
conda activate semeval
```

Open the `.env` file generated in the project root and ensure that the `DATA_PATH` variable points to the directory containing the `.parquet` files (or the folder downloaded from Kaggle).
Example `.env`:

```
KAGGLE_USERNAME=Your_kaggle_username
KAGGLE_KEY=Your_kaggle_key
DATA_PATH=./data
IMG_PATH=./img
COMET_API_KEY=comet_api_key
COMET_PROJECT_NAME=comet_project_name
COMET_WORKSPACE=comet_name_workspace
COMET_EXPERIMENT_NAME=comet_experiment_name
```
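The project's scripts presumably read these values at startup; a minimal sketch of the pattern using `python-dotenv` (assumed to be among the shared dependencies):

```python
import os

from dotenv import load_dotenv  # python-dotenv; assumed to be in environment.yml

load_dotenv()  # reads the .env file from the project root

DATA_PATH = os.getenv("DATA_PATH", "./data")
IMG_PATH = os.getenv("IMG_PATH", "./img")
print(f"Datasets: {DATA_PATH} | Plots: {IMG_PATH}")
```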
Remember to install the Kaggle dependencies if you haven't already:

```bash
pip install kaggle
```

Download your preferred dataset by running:

```bash
python data.py
```

Edit `competition_name` to select the specific dataset you wish to download from Kaggle. The dataset will be downloaded automatically into the `data` folder.
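For orientation, a hypothetical sketch of what `data.py` might look like, using the official Kaggle API client; the actual script may differ, and the slug assigned to `competition_name` below is a placeholder:

```python
import os

from dotenv import load_dotenv
from kaggle.api.kaggle_api_extended import KaggleApi

load_dotenv()  # exposes KAGGLE_USERNAME / KAGGLE_KEY / DATA_PATH from .env

competition_name = "your-competition-slug"  # edit: the dataset you want to download

api = KaggleApi()
api.authenticate()  # uses KAGGLE_USERNAME / KAGGLE_KEY (or ~/.kaggle/kaggle.json)

# Download the competition files into DATA_PATH (./data by default).
# Note: the result is a .zip archive that may need extracting.
api.competition_download_files(
    competition_name,
    path=os.getenv("DATA_PATH", "./data"),
)
```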