This is the repository for the experiments of the paper "Exploring NLP Techniques for Code Smell Detection: A Comparative Study." The study compares various NLP-based models for detecting code smells to a baseline model, highlighting their strengths and weaknesses.
This research was conducted as part of a CIFRE doctoral program (Convention Industrielle de Formation par la Recherche) in collaboration with Adservio, Université Paris-Saclay, and ISEP.
First, create a conda environment and install the dependencies:
conda create -n mlcqenv python=3.10
conda activate mlcqenv
conda install --file requirements.txt
To recreate the JSON file containing the code snippets from the paths specified in the MLCQ dataset:
- First, set up a GitHub token to communicate with the API; see the GitHub documentation for more information on creating a token.
- Export the token as an environment variable:
export GITHUB_TOKEN=<your_github_token>
- Run the DataExtractor script:
python DataExtractor.py
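The token exported above is typically read from the environment and sent as an authorization header with each API request. The following is a minimal sketch of that pattern, not the actual contents of DataExtractor.py; the function name `build_github_request` and the example path are hypothetical.

```python
import os
import urllib.request

API_ROOT = "https://api.github.com"


def build_github_request(path, token=None):
    """Build (but do not send) an authenticated GitHub REST API request.

    Hypothetical helper for illustration; the repository's script may differ.
    """
    # Fall back to the GITHUB_TOKEN environment variable when no token is passed.
    token = token or os.environ.get("GITHUB_TOKEN")
    if not token:
        raise RuntimeError("GITHUB_TOKEN is not set; export it before running.")
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    url = f"{API_ROOT}/{path.lstrip('/')}"
    return urllib.request.Request(url, headers=headers)


# Example with a placeholder token and a made-up repository path.
request = build_github_request("repos/some-owner/some-repo", token="placeholder")
print(request.full_url)
```

Failing early when the variable is missing gives a clearer error than an unauthenticated request being rejected or rate-limited later.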
The baseline is J48, a decision tree-based algorithm widely considered state-of-the-art for code smell detection (see 1 and 2).
This model uses code metrics as features, so they must be computed first. We use Designite for this; install it by following the instructions in its official repository.
- Run python baseline/MetricsExtractor.py to prepare the code snippets as .java files.
- Run python baseline/DesigniteRun.py to execute Designite on the Java files, producing a DesigniteOutput file.
- Run baseline/DatasetCreator.py to prepare the final dataset to feed to the model.
- Finally, run train.py to train and test the model.
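For intuition about how the baseline uses these metrics: J48 is Weka's implementation of C4.5, which chooses splits by gain ratio. The sketch below computes the gain ratio of a single threshold split on one numeric metric. It is illustrative only: the metric values and labels are made up, and it omits pruning and the other refinements of the real algorithm.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` on a numeric metric."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # A split that separates nothing carries no information.
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right))
    info_gain = entropy(labels) - remainder
    # Split information penalizes splits that produce very uneven partitions.
    split_info = entropy(["left"] * len(left) + ["right"] * len(right))
    return info_gain / split_info


# Made-up values of a single metric (e.g. method length) and smell labels.
metric = [5, 8, 40, 55]
smell = ["clean", "clean", "smelly", "smelly"]
print(gain_ratio(metric, smell, threshold=8))
print(gain_ratio(metric, smell, threshold=5))
```

A threshold that separates the two classes cleanly scores higher than one that leaves a mixed partition, which is why the tree would prefer it.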
Tip: You can speed up the Designite processing by specifying the number of workers when using MetricsExtractor.py. This divides the dataset into batches, enabling parallel processing for faster execution.
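The batching idea behind the tip can be sketched as follows. This is an assumption about the general approach, not the code in MetricsExtractor.py: the names `split_into_batches`, `process_batch`, and `run_in_parallel` are hypothetical, and the per-batch work is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(items, num_workers):
    """Split items into at most num_workers contiguous, near-equal batches."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    size, extra = divmod(len(items), num_workers)
    batches, start = [], 0
    for i in range(num_workers):
        # The first `extra` batches take one additional item each.
        end = start + size + (1 if i < extra else 0)
        if end > start:
            batches.append(items[start:end])
        start = end
    return batches


def process_batch(batch):
    # Placeholder for the real work, such as running an analysis tool on a batch.
    return [name.upper() for name in batch]


def run_in_parallel(items, num_workers):
    """Process each batch in its own worker and flatten the results in order."""
    batches = split_into_batches(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(process_batch, batches))
    return [item for batch in results for item in batch]


print(split_into_batches(list(range(7)), 3))
print(run_in_parallel(["a.java", "b.java", "c.java"], 2))
```

Threads are used here on the assumption that each worker mostly waits on an external process; CPU-bound Python work would be better served by a process pool.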
Several models are available, each with different components. To train the final BiLSTM with attention model, run:
python bilstm_attn_train.py --batch_size 16 --epochs 20 --learning_rate 0.0001 --hidden_dim 512 --num_layers 2
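A training script would usually declare the flags in the command above with an argument parser along these lines. This is a hedged sketch: the flag names come from the command, but the types and the defaults (taken from the example values) are assumptions, and the real script may define them differently.

```python
import argparse


def build_parser():
    """Parser mirroring the flags in the training command (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="BiLSTM with attention training (sketch)")
    parser.add_argument("--batch_size", type=int, default=16, help="samples per batch")
    parser.add_argument("--epochs", type=int, default=20, help="number of training epochs")
    parser.add_argument("--learning_rate", type=float, default=0.0001, help="optimizer step size")
    parser.add_argument("--hidden_dim", type=int, default=512, help="LSTM hidden state size")
    parser.add_argument("--num_layers", type=int, default=2, help="number of stacked LSTM layers")
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["--batch_size", "32", "--learning_rate", "0.001"])
print(args.batch_size, args.epochs, args.learning_rate, args.hidden_dim, args.num_layers)
```

Flags that are not passed fall back to their defaults, so a single hyperparameter can be varied without repeating the full command.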
To run the CodeBERT model:
python bert.py
All results are stored in their corresponding log files.
Authors:
- Djamel Mesbah (djamel.mesbah@adservio.fr / djamel.mesbah@universite-paris-saclay.fr)
- Nour El Madhoun (nour.el-madhoun@isep.fr)
- Hani Chalouati (hani.chalouati@adservio.fr)
- Khaldoun Al Agha (alagha@lri.fr)
This work relies on:
- The MLCQ dataset
- The Designite tool
- The CodeBERT pretrained model from Hugging Face