Commit 2cbd4ea

Updated README.md
1 parent 0f27946 commit 2cbd4ea

2 files changed

+39
-23
lines changed

README.md

Lines changed: 35 additions & 23 deletions
Original file line number | Diff line number | Diff line change
@@ -1,53 +1,63 @@
1+
# 🚀 Improving Source Code Similarity Detection with GraphCodeBERT and Additional Feature Integration
12

2-
# Improving Source Code Similarity Detection with GraphCodeBERT and Additional Feature Integration
3-
This repository contains the implementation of a novel approach for source code similarity detection that integrates an additional output feature into the classification process to enhance model performance. The approach is based on the GraphCodeBERT model, which has been extended with a custom output feature layer and a concatenation mechanism to improve feature representation. The model has been trained and evaluated on the IR-Plag dataset, demonstrating significant improvements in precision, recall, and f-measure. The full implementation, including model architecture, training strategies, and evaluation metrics, is available in this repository.
3+
This repository contains the implementation of a novel approach for source code similarity detection that integrates an additional output feature into the classification process to enhance model performance. The approach is based on the **GraphCodeBERT** model, which has been extended with a custom output feature layer and a concatenation mechanism to improve feature representation. The model has been trained and evaluated on the **IR-Plag dataset**, demonstrating significant improvements in precision, recall, and f-measure. The full implementation, including model architecture, training strategies, and evaluation metrics, is available in this repository.
44

55
[![arXiv](https://img.shields.io/badge/arXiv-2408.08903-b31b1b.svg)](https://arxiv.org/abs/2408.08903)
6+
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
67

8+
---
79

810
## 🌍 Introduction
11+
912
Accurate and efficient detection of similar source code fragments is crucial for maintaining software quality, improving developer productivity, and ensuring code integrity. With the rise of deep learning (DL) and natural language processing (NLP) techniques, transformer-based models have become a preferred approach for understanding and processing source code.
1013

11-
In this project, we extend the capabilities of GraphCodeBERT—a transformer model specifically designed to process the structural and semantic properties of programming languages. By integrating an additional output feature layer and using a concatenation mechanism, our approach enhances the model's ability to represent source code, leading to better performance in similarity detection tasks.
14+
In this project, we extend the capabilities of **GraphCodeBERT**—a transformer model specifically designed to process the structural and semantic properties of programming languages. By integrating an additional output feature layer and using a concatenation mechanism, our approach enhances the model's ability to represent source code, leading to better performance in similarity detection tasks.
1215

13-
### Repository Contents
16+
### 📂 Repository Contents
1417

15-
- `graphcodebert_fint.ipynb`: Jupyter Notebook that includes the full implementation of the model, from data loading and preprocessing to training, evaluation, and results interpretation. Detailed comments and documentation are provided within the notebook. It is optimized to be used in Google Colab since the use of a GPU is highly recommended.
16-
- `fine-tunning-graphcodebert-karnalim-with-features.py`: The source code in the form of a standard python app.
18+
- **`graphcodebert_fint.ipynb`**: A Jupyter Notebook that includes the full implementation of the model, from data loading and preprocessing to training, evaluation, and results interpretation. Detailed comments and documentation are provided within the notebook. **It is optimized to be used in Google Colab since the use of a GPU is highly recommended.**
19+
- **`fine-tunning-graphcodebert-karnalim-with-features.py`**: The source code in the form of a standard Python app.
1720

21+
---
1822

1923
## 🛠️ Methodology
2024

21-
### Model Architecture
22-
The model is an extension of GraphCodeBERT, which is a transformer-based model pre-trained on large corpora of code and designed to capture both textual and structural properties of code. We introduce a custom output feature layer and concatenate the pooled output of the transformer with this processed feature, allowing the model to learn a richer representation of the source code.
25+
### 🔍 Model Architecture
26+
27+
The model is an extension of **GraphCodeBERT**, a transformer-based model pre-trained on large corpora of code and designed to capture both textual and structural properties of code. We introduce a custom output feature layer and concatenate the pooled output of the transformer with this processed feature, allowing the model to learn a richer representation of the source code.
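
The concatenation step described above can be sketched roughly as follows. This is a minimal illustration and not the repository's code: the class name `FeatureConcatClassifier`, the layer sizes, and the random tensor standing in for the encoder's pooled output are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class FeatureConcatClassifier(nn.Module):
    """Hypothetical head: joins a pooled encoder output with a processed extra feature."""

    def __init__(self, hidden_size: int = 768, feature_dim: int = 1,
                 feature_hidden: int = 16, num_labels: int = 2):
        super().__init__()
        # Small layer that processes the additional feature (sizes are assumptions).
        self.feature_layer = nn.Sequential(
            nn.Linear(feature_dim, feature_hidden),
            nn.ReLU(),
        )
        # Classifier over the concatenation [pooled output ; processed feature].
        self.classifier = nn.Linear(hidden_size + feature_hidden, num_labels)

    def forward(self, pooled_output: torch.Tensor, extra_feature: torch.Tensor) -> torch.Tensor:
        processed = self.feature_layer(extra_feature)
        combined = torch.cat([pooled_output, processed], dim=-1)
        return self.classifier(combined)


# Random tensors stand in for the transformer's pooled output and the extra feature,
# so the sketch runs without downloading any pretrained weights.
batch_size = 4
pooled = torch.randn(batch_size, 768)
extra = torch.randn(batch_size, 1)

model = FeatureConcatClassifier()
logits = model(pooled, extra)  # one row of class scores per code pair
```

In a real pipeline, `pooled` would come from the encoder's forward pass rather than `torch.randn`; how the additional feature is computed is not specified in this README, so it is left as an opaque input here.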
28+
29+
### 📊 Dataset
30+
31+
We utilize the **IR-Plag dataset**, which is specifically designed for benchmarking source code similarity detection techniques, particularly in academic plagiarism contexts. The dataset contains 467 code files, with 355 labeled as plagiarized. The diversity in coding styles and structures within this dataset makes it ideal for evaluating the effectiveness of our model.
2332

24-
### Dataset
25-
We utilize the IR-Plag dataset, which is specifically designed for benchmarking source code similarity detection techniques, particularly in academic plagiarism contexts. The dataset contains 467 code files, with 355 labeled as plagiarized. The diversity in coding styles and structures within this dataset makes it ideal for evaluating the effectiveness of our model.
33+
### 🏋️ Training and Evaluation
2634

27-
### Training and Evaluation
2835
The training process included random splits of the dataset into training, validation, and test sets. Key metrics such as precision, recall, and f-measure were computed to evaluate the model's performance. The notebook documents the training arguments, including batch size, number of epochs, and learning rate adjustments.
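
A rough sketch of the evaluation flow described above, using only the standard library. The 80/10/10 split ratio, the seed, and the helper names `random_split` and `precision_recall_f` are assumptions for illustration; the README does not state the actual split proportions.

```python
import random


def random_split(items, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle a copy of items and cut it into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def precision_recall_f(y_true, y_pred, positive=1):
    """Compute precision, recall, and F-measure for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Toy data: 100 example ids and a handful of made-up labels and predictions.
train, val, test = random_split(range(100))
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
p, r, f = precision_recall_f(y_true, y_pred)
```

Fixing the seed makes a given split reproducible; repeating the split with different seeds is one way to realize the "random splits" mentioned above.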
2936

37+
---
3038

31-
## 📈 Results
32-
Our experimental results show that the integration of an additional output feature significantly enhances the model's performance. Specifically, our extended version of GraphCodeBERT achieved the highest precision, recall, and f-measure compared to other state-of-the-art techniques.
39+
## 📈 Results
40+
41+
Our experimental results show that the integration of an additional output feature significantly enhances the model's performance. Specifically, our extended version of **GraphCodeBERT** achieved the highest precision, recall, and f-measure compared to other state-of-the-art techniques.
3342

3443
The table below summarizes the performance of various approaches:
3544

36-
| Approach | Precision | Recall | F-Measure |
37-
|----------------------------------|-----------|--------|-----------|
38-
| CodeBERT | 0.72 | 1.00 | 0.84 |
39-
| Output Analysis | 0.88 | 0.93 | 0.90 |
40-
| Boosting (XGBoost) | 0.88 | 0.99 | 0.93 |
41-
| Bagging (Random Forest) | 0.95 | 0.97 | 0.96 |
42-
| GraphCodeBERT | 0.98 | 0.95 | 0.96 |
43-
| **Our GraphCodeBERT variant** | **0.98** | **1.00**| **0.99** |
45+
| **Approach** | **Precision** | **Recall** | **F-Measure** |
46+
|-----------------------------------|:-------------:|:----------:|:-------------:|
47+
| CodeBERT | 0.72 | 1.00 | 0.84 |
48+
| Output Analysis | 0.88 | 0.93 | 0.90 |
49+
| Boosting (XGBoost) | 0.88 | 0.99 | 0.93 |
50+
| Bagging (Random Forest) | 0.95 | 0.97 | 0.96 |
51+
| GraphCodeBERT | 0.98 | 0.95 | 0.96 |
52+
| **Our GraphCodeBERT variant** | **0.98** | **1.00** | **0.99** |
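
As a consistency check, assuming the F-Measure column is the usual harmonic mean of precision $P$ and recall $R$ (the README does not state the formula), the last row can be reproduced from its own precision and recall:

$$
F = \frac{2PR}{P + R} = \frac{2 \times 0.98 \times 1.00}{0.98 + 1.00} = \frac{1.96}{1.98} \approx 0.99
$$

The same formula gives $\frac{2 \times 0.72 \times 1.00}{0.72 + 1.00} = \frac{1.44}{1.72} \approx 0.84$ for the CodeBERT row, matching the table.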
4453

54+
---
4555

4656
## 📚 Reference
4757

4858
If you use this work, please cite:
4959

50-
```
60+
```bibtex
5161
@misc{martinezgil2024graphcodebert,
5262
title={Improving Source Code Similarity Detection Through GraphCodeBERT and Integration of Additional Features},
5363
author={Jorge Martinez-Gil},
@@ -58,6 +68,8 @@ If you use this work, please cite:
5868
}
5969
```
6070

71+
---
72+
6173
## 📄 License
6274

63-
The project is provided under the MIT License.
75+
This project is licensed under the MIT License.

requirements.txt

Lines changed: 4 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,4 @@
1+
torch==2.0.1
2+
transformers==4.31.0
3+
scikit-learn==1.3.0
4+
jsonlib-python3==1.6.1
