This repository contains the unified solution for SemEval-2026 Task 13, an international competition focused on the detection, attribution, and analysis of source code generated by Large Language Models (LLMs) versus human-written code.
The project is structured in a modular way to address the three subtasks of the competition, sharing a common execution environment and a centralized data-analysis pipeline.
The project is divided into three main modules, each with specific objectives and architectures. Click on the task name to access detailed documentation.
| Task | Name | Objective | Problem Type |
|---|---|---|---|
| Subtask A | Machine-Generated Code Detection | Distinguish whether code is written by a Human or a Machine. | Binary Classification |
| Subtask B | Multi-Class Authorship Detection | Identify the specific author model (e.g., GPT-4, Llama-3). | Multi-class Classification |
| Subtask C | Mixed-Source Analysis | Analyze modifications, refactoring, and hybrid Human/AI code. | Regression / Hybrid |
The folder organization is designed to separate data, analysis visuals, and source code.
```
.
├── 📁 data/                 # Datasets (parquet) split by Task
├── 📁 img/                  # Visual outputs from analysis scripts (EDA)
│   ├── 📁 img_TaskA/        # Plots specific to Task A
│   ├── 📁 img_TaskB/        # Plots specific to Task B
│   └── 📁 img_TaskC/        # Plots specific to Task C
│
├── 📁 info_dataset/         # Scripts for statistical data analysis
│   ├── 🐍 info_dataset_subTaskA.py
│   ├── 🐍 info_dataset_subTaskB.py
│   └── 🐍 info_dataset_subtaskC.py
│
├── 📁 src/                  # Model source code
│   ├── 📁 src_TaskA/        # Complete pipeline for Subtask A
│   ├── 📁 src_TaskB/        # Complete pipeline for Subtask B
│   └── 📁 src_TaskC/        # Complete pipeline for Subtask C
│
├── 🐍 data.py               # Downloads the chosen dataset from Kaggle
│
├── 📝 README.md
├── 📄 prepare.sh            # Setup automation script (folder creation & env)
├── ⚙️ environment.yml       # Shared Conda dependencies
└── ⚙️ .env                  # Environment variables (generated by prepare.sh)
```
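Since the datasets are distributed as `.parquet` files, any split can be inspected directly with pandas. A minimal sketch (the file name below is hypothetical; substitute whichever parquet file the Kaggle download produces):

```python
import pandas as pd

# Hypothetical file name: use an actual .parquet file from data/.
# Requires pyarrow or fastparquet as the parquet engine.
df = pd.read_parquet("data/subtask_a_train.parquet")

print(df.shape)             # number of samples and columns
print(df.columns.tolist())  # available fields
```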
**Important:** remember to generate the `kaggle.json` file from your Kaggle account:

```json
{"username":"your_username","key":"your_kaggle_key"}
```

Since the three tasks share the same base dependencies and structure, a centralized setup has been prepared to facilitate startup. Prerequisites:
- Anaconda or Miniconda installed on your system.
- An NVIDIA GPU with updated drivers (recommended for training).
- Linux/Mac OS (or WSL for Windows) to run bash scripts.
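Before proceeding, you can optionally verify that the prerequisites are in place:

```bash
nvidia-smi       # should list your NVIDIA GPU and driver version
conda --version  # confirms Anaconda/Miniconda is on PATH
```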
Run the `prepare.sh` script from the project root. This script will:

- Create the output directory structure (`results`, `checkpoints`, etc.).
- Generate the `.env` file for environment variables.
- Create the Conda virtual environment defined in `environment.yml` and install its dependencies.

A minimal sketch of these steps is shown below for reference.
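This sketch is illustrative only; directory names beyond `results` and `checkpoints` are assumptions, and the actual `prepare.sh` in this repository may differ:

```bash
#!/usr/bin/env bash
set -e

# 1. Output directory structure (names beyond results/checkpoints are assumed)
mkdir -p results checkpoints

# 2. Environment-variable template (values to be filled in by the user)
cat > .env <<'EOF'
KAGGLE_USERNAME=Your_kaggle_username
KAGGLE_KEY=Your_kaggle_key
DATA_PATH=./data
IMG_PATH=./img
EOF

# 3. Shared Conda environment
conda env create -f environment.yml
```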
Make the script executable and run it:

```bash
chmod +x prepare.sh
./prepare.sh
```

Once the setup is complete, activate the environment:

```bash
conda activate semeval
```

Open the `.env` file generated in the project root and ensure that the `DATA_PATH` variable points to the directory containing the `.parquet` files (or the folder downloaded from Kaggle).
Example `.env`:

```
KAGGLE_USERNAME=Your_kaggle_username
KAGGLE_KEY=Your_kaggle_key
DATA_PATH=./data
IMG_PATH=./img
COMET_API_KEY=comet_api_key
COMET_PROJECT_NAME=comet_project_name
COMET_WORKSPACE=comet_name_workspace
COMET_EXPERIMENT_NAME=comet_experiment_name
```
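The project's scripts presumably read these values at startup; a minimal sketch of the pattern using `python-dotenv` (assumed to be among the shared dependencies):

```python
import os

from dotenv import load_dotenv  # python-dotenv; assumed to be in environment.yml

load_dotenv()  # reads the .env file from the project root

DATA_PATH = os.getenv("DATA_PATH", "./data")
IMG_PATH = os.getenv("IMG_PATH", "./img")
print(f"Datasets: {DATA_PATH} | Plots: {IMG_PATH}")
```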
Remember to install the Kaggle dependencies if you haven't already:

```bash
pip install kaggle
```

Download your preferred dataset by running:

```bash
python data.py
```

Edit `competition_name` to select the specific dataset you wish to download from Kaggle. The dataset will be downloaded automatically into the `data` folder.
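For orientation, a hypothetical sketch of what `data.py` might look like, using the official Kaggle API client; the actual script may differ, and the slug assigned to `competition_name` below is a placeholder:

```python
import os

from dotenv import load_dotenv
from kaggle.api.kaggle_api_extended import KaggleApi

load_dotenv()  # exposes KAGGLE_USERNAME / KAGGLE_KEY / DATA_PATH from .env

competition_name = "your-competition-slug"  # edit: the dataset you want to download

api = KaggleApi()
api.authenticate()  # uses KAGGLE_USERNAME / KAGGLE_KEY (or ~/.kaggle/kaggle.json)

# Download the competition files into DATA_PATH (./data by default).
# Note: the result is a .zip archive that may need extracting.
api.competition_download_files(
    competition_name,
    path=os.getenv("DATA_PATH", "./data"),
)
```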