Unified solution for SemEval-2026 Task 13: GenAI Code Detection & Attribution. A modular framework covering Subtasks A (Detection), B (Authorship), and C (Mixed-Source Analysis), designed to distinguish and analyze LLM-generated vs. Human code with a centralized pipeline.


🏆 SemEval-2026 Task 13: GenAI Code Detection & Attribution


This repository contains the unified solution for SemEval-2026 Task 13, an international competition focused on the distinction, attribution, and analysis of source code generated by Large Language Models (LLMs) versus human-written code.

The project is structured modularly to address the three Subtasks of the competition, sharing a common execution environment and a centralized data analysis pipeline.


📌 Subtasks Overview

The project is divided into three main modules, each with specific objectives and architectures. Click on the task name to access detailed documentation.

| Task | Name | Objective | Problem Type |
| --- | --- | --- | --- |
| Subtask A | Machine-Generated Code Detection | Distinguish whether code is written by a Human or a Machine. | Binary Classification |
| Subtask B | Multi-Class Authorship Detection | Identify the specific author model (e.g., GPT-4, Llama-3). | Multi-class Classification |
| Subtask C | Mixed-Source Analysis | Analyze modifications, refactoring, and hybrid Human/AI code. | Regression / Hybrid |

📂 Repository Structure

The folder organization is designed to separate data, analysis visuals, and source code.

```text
.
├── 📁 data/                    # Datasets (parquet) split by Task
├── 📁 img/                     # Visual outputs from analysis scripts (EDA)
│   ├── 📁 img_TaskA/           # Plots specific to Task A
│   ├── 📁 img_TaskB/           # Plots specific to Task B
│   └── 📁 img_TaskC/           # Plots specific to Task C
│
├── 📁 info_dataset/            # Scripts for statistical data analysis
│   ├── 🐍 info_dataset_subTaskA.py
│   ├── 🐍 info_dataset_subTaskB.py
│   └── 🐍 info_dataset_subtaskC.py
│
├── 📁 src/                     # Model source code
│   ├── 📁 src_TaskA/           # Complete pipeline for Subtask A
│   ├── 📁 src_TaskB/           # Complete pipeline for Subtask B
│   └── 📁 src_TaskC/           # Complete pipeline for Subtask C
│
├── 🐍 data.py                  # Downloads the chosen dataset from Kaggle
│
├── 📝 README.md
├── 📄 prepare.sh               # Setup automation script (folder creation & env)
├── ⚙️ environment.yml          # Shared Conda dependencies
└── ⚙️ .env                     # Environment variables (generated by prepare.sh)
```

> [!IMPORTANT]
> Remember to generate the `kaggle.json` API token file from your Kaggle account:
>
> ```json
> {"username":"your_username","key":"your_kaggle_key"}
> ```
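The Kaggle CLI expects this token at `~/.kaggle/kaggle.json`, readable only by the owner. A minimal Python sketch of installing it there (the helper name and the `base` parameter are illustrative, not part of this repository):

```python
import json
import os
from pathlib import Path

def install_kaggle_token(username: str, key: str, base: Path = Path.home()) -> Path:
    """Write kaggle.json where the Kaggle CLI looks for it (~/.kaggle/kaggle.json)."""
    cfg_dir = base / ".kaggle"
    cfg_dir.mkdir(parents=True, exist_ok=True)
    cfg_path = cfg_dir / "kaggle.json"
    cfg_path.write_text(json.dumps({"username": username, "key": key}))
    os.chmod(cfg_path, 0o600)  # the CLI warns if the token is readable by others
    return cfg_path
```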

🚀 Quick Installation Guide

Since the three tasks share the same base dependencies and structure, a centralized setup is provided to get you started quickly.

1. Prerequisites

  • Anaconda or Miniconda installed on your system.
  • An NVIDIA GPU with updated drivers (recommended for training).
  • Linux/Mac OS (or WSL for Windows) to run bash scripts.

2. Automatic Setup

Run the prepare.sh script from the project root. This script will:

  1. Create the output directory structure (results, checkpoints, etc.).
  2. Generate the .env file for environment variables.
  3. Create and install the Conda virtual environment defined in environment.yml.
```bash
chmod +x prepare.sh
./prepare.sh
```

3. Activating the Environment

Once the setup is complete, activate the environment:

```bash
conda activate semeval
```

4. Data Configuration

Open the .env file generated in the project root. Ensure that the DATA_PATH variable points to the directory containing the .parquet files (or the folder downloaded from Kaggle).
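As a sketch of how a script can consume `DATA_PATH` (the helper name and the parquet filename below are hypothetical, assuming the `.env` variables have been exported into the environment):

```python
import os
from pathlib import Path

def resolve_data_file(filename: str) -> Path:
    """Resolve a dataset file against DATA_PATH (falling back to ./data)."""
    data_dir = Path(os.getenv("DATA_PATH", "./data"))
    path = data_dir / filename
    if not path.exists():
        raise FileNotFoundError(f"{path} not found; check DATA_PATH in .env")
    return path

# usage (filename is illustrative; adjust to your actual Task A split):
# df = pd.read_parquet(resolve_data_file("subtaskA_train.parquet"))
```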

Example `.env`:

```ini
KAGGLE_USERNAME=your_kaggle_username
KAGGLE_KEY=your_kaggle_key

DATA_PATH=./data
IMG_PATH=./img

COMET_API_KEY=comet_api_key
COMET_PROJECT_NAME=comet_project_name
COMET_WORKSPACE=comet_workspace_name
COMET_EXPERIMENT_NAME=comet_experiment_name
```

5. Download Dataset

Remember to install Kaggle dependencies if you haven't already:

```bash
pip install kaggle
```

Download your preferred dataset by running:

```bash
python data.py
```

Edit the `competition_name` variable in `data.py` to select the specific dataset you wish to download from Kaggle. The dataset is downloaded automatically into the `data` folder.
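Under the hood, the download step corresponds to a Kaggle CLI invocation; a hedged sketch of the equivalent command (the slug and helper name below are placeholders, not the real `competition_name` from `data.py`):

```python
def build_download_cmd(competition_name: str, out_dir: str = "./data") -> list:
    """Return the Kaggle CLI invocation equivalent to the script's download step."""
    return ["kaggle", "competitions", "download",
            "-c", competition_name,  # competition slug set in data.py
            "-p", out_dir]           # target folder (matches DATA_PATH)

# e.g.: subprocess.run(build_download_cmd("some-competition"), check=True)
```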


✨ Author ✨

Giovanni Giuseppe Iacuzzo
AI & Cybersecurity Engineering Student
Kore University of Enna

