ML-Based Python Code Autocompletion

Authors

Zakir Hussain Nadir Hussain

Course: Machine Learning for Software Analysis

Professor: Fabio Pinelli

Submitted: Feb 14, 2025

1. Introduction and Objectives

Code autocompletion is a key feature of modern Integrated Development Environments (IDEs): it helps developers by predicting and suggesting the next tokens as they write code. This project develops a machine learning-based autocompletion system for Python whose goal is to predict the next token in a code sequence. We trained different models using PyTorch, drawing on deep learning architectures such as LSTMs, GRUs, and Transformers.

Project Objectives:

Implement a machine learning model to predict the next token in Python code
Train the model on a dataset of Python code snippets
Evaluate the model's performance using accuracy, perplexity, and cross-entropy loss
Demonstrate inference for autocompletion

2. Methodology

2.1 Dataset Collection/Adaptation

The dataset used for this project is the Py150 dataset, a collection of 150,000 open-source Python source files [1]. Due to limited GPU and memory resources, subsets of 500, 1000, and 2000 Python files were selected for different runs, and each subset was split into 80% for training and 20% for validation.
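Selecting a fixed-size subset of files can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `sample_python_files`, the `seed` parameter, and the use of a shuffled random sample are assumptions, since the README does not say how the subsets were chosen.

```python
import random
from pathlib import Path


def sample_python_files(root, n_files, seed=0):
    """Return a reproducible random sample of up to n_files .py paths under root.

    Assumption: files are picked uniformly at random; the original project
    may have used a different selection rule (for example, the first N files).
    """
    # Sort first so the result does not depend on filesystem ordering.
    paths = sorted(Path(root).rglob("*.py"))
    rng = random.Random(seed)
    rng.shuffle(paths)
    return paths[:n_files]
```

For example, `sample_python_files("data", 500)` would return up to 500 paths from a hypothetical `data` folder, and the same seed gives the same subset on every run.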

2.2 Data Preprocessing

The dataset was preprocessed to remove unnecessary elements, such as comments and docstrings, ensuring clean input data for training.

Preprocessing Pipeline Steps:

Loading Python Files:
- The dataset consists of raw Python (.py) files stored in a directory
- A script iterates through the directory structure, loads only valid Python files
- Reads their content into memory
- Ensures dataset contains only relevant Python programming code
Cleaning the Code:
- Removes comments (e.g., # This is a comment)
- Removes docstrings (multi-line strings used for documentation)
- Removes extra newlines and spaces
- Uses the re library to produce a consistent format for tokenization
- In some implementations, the dataset was fed directly to the tokenizer without preprocessing
Tokenization: The processed dataset is passed through multiple tokenizers in different implementations:

Tokenizer	# of Source Files	Vocab Size
Word-Level Tokenizer	1000	12847
SentencePiece Tokenizer	500	130000
SentencePiece Tokenizer	100000	8000
Custom	500	39901
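The cleaning and word-level tokenization steps above could look roughly like the following. This is a minimal sketch, not the repository's code: the function names (`clean_code`, `tokenize`, `build_vocab`, `encode`), the regular expressions, and the `<pad>`/`<unk>` special tokens are all assumptions. The regex-based cleaning is deliberately simple: it removes every triple-quoted string (not only docstrings) and would also cut a `#` that appears inside a string literal.

```python
import re
from collections import Counter

# Assumed patterns; the original project's regular expressions are not shown.
TRIPLE_QUOTED = re.compile(r"(\"\"\"|''').*?\1", re.DOTALL)
COMMENT = re.compile(r"#[^\n]*")
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")


def clean_code(source):
    """Strip triple-quoted strings, comments, trailing spaces, and blank lines."""
    source = TRIPLE_QUOTED.sub("", source)
    source = COMMENT.sub("", source)
    lines = [line.rstrip() for line in source.splitlines()]
    return "\n".join(line for line in lines if line.strip())


def tokenize(source):
    """Split code into identifiers/keywords, numbers, and single symbols.

    Whitespace (including newlines and indentation) is discarded here.
    """
    return TOKEN.findall(source)


def build_vocab(token_lists, min_count=1):
    """Map each token seen at least min_count times to an integer id."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, count in counts.most_common():
        if count >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Convert tokens to ids, using the <unk> id for unseen tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
```

A vocabulary built this way grows with the number of distinct identifiers in the corpus, which is one reason word-level vocabularies depend strongly on how many source files are used.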

Sequence Generation:
- The dataset is converted into fixed-length sequences of tokens using the trained tokenizer
- Each sequence consists of L consecutive input tokens, with L=20 or L=50 depending on the run
- The first L tokens serve as input
- The (L+1)-th token is the target output (the expected next token)
- This setup lets the model learn contextual relationships between tokens
Dataset Splitting:
- The generated sequences are divided 80/20 into training and validation sets
- During inference, code snippets were taken from the training set, the validation set, and unseen code
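The sequence generation and splitting steps can be sketched as a sliding window over the encoded tokens. This is an illustrative sketch under assumptions: the helper names `make_sequences` and `train_val_split`, the stride of one token, and shuffling before the split are not stated in the README.

```python
import random


def make_sequences(token_ids, seq_len):
    """Build (input, target) pairs: seq_len input ids followed by the next id.

    Assumption: the window advances one token at a time.
    """
    pairs = []
    for start in range(len(token_ids) - seq_len):
        window = token_ids[start:start + seq_len]
        target = token_ids[start + seq_len]
        pairs.append((window, target))
    return pairs


def train_val_split(pairs, train_fraction=0.8, seed=0):
    """Shuffle the pairs and split them into training and validation sets."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

With `seq_len=20`, a file of 100 tokens yields 80 pairs. Note that overlapping windows from the same file can land on both sides of a shuffled split, so splitting by file before generating sequences is a stricter alternative.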

3. Model Architecture

Figure 1: Model Architecture showing LSTM and GRU implementations with tokenizer flow

We trained multiple models on varying sets of training data. Specifically, we trained LSTM and GRU models with different strategies for tokenization, hyper-parameters, and varying numbers of epochs.
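A next-token model of this kind is commonly built as an embedding layer, a recurrent layer, and a linear projection onto the vocabulary. The sketch below shows that pattern in PyTorch; it is an assumption about the general shape, not the repository's exact architecture. The class name `NextTokenRNN`, the single recurrent layer, and the absence of dropout are illustrative choices.

```python
import torch
import torch.nn as nn


class NextTokenRNN(nn.Module):
    """Embedding -> LSTM or GRU -> linear layer over the vocabulary."""

    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256, cell="lstm"):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        rnn_cls = nn.LSTM if cell == "lstm" else nn.GRU
        self.rnn = rnn_cls(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, emb_dim)
        outputs, _ = self.rnn(embedded)        # (batch, seq_len, hidden_dim)
        # Use the state at the last position to predict the next token.
        return self.out(outputs[:, -1, :])     # (batch, vocab_size)
```

For instance, `NextTokenRNN(vocab_size=12847, emb_dim=512, hidden_dim=512, cell="gru")` mirrors the dimensions reported for the Word-Level GRU run, and the returned logits can be passed to `nn.CrossEntropyLoss` together with the target token ids.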

4. Training

The training was carried out on Google Colab and laptop machines. Due to limited resources and the large dataset size, we selected subsets of 500, 1000, or 2000 source files. Training configuration and results:

Model	Tokenizer	Dataset Size	Epochs	Emb Dim	Hidden Dim	Sequence Length	Accuracy %	Perplexity	Cross-entropy
LSTM	Custom	500	10	128	256	20	84.47	2.19	0.7861
LSTM	Word-Level	1000	15	512	512	50	93.33	1.42	0.35
LSTM	SentencePiece	500	20	128	256	50	86.43	1.77	0.5705
GRU	Word-Level	1000	10	512	512	50	93.74	1.40	0.34
LSTM	SentencePiece	2000	20	128	256	50	48.71	9.00	2.199
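The three metrics in the table are closely related: perplexity is the exponential of the mean cross-entropy (in nats), and the reported pairs agree with that relation to within rounding. The sketch below computes all three from predicted probability distributions; the function names are illustrative, and the project's own metric code is not shown in this README.

```python
import math


def cross_entropy(probabilities, targets):
    """Mean negative log-probability assigned to each target token."""
    total = sum(math.log(p[t]) for p, t in zip(probabilities, targets))
    return -total / len(targets)


def perplexity(mean_cross_entropy):
    """Perplexity is exp of the mean cross-entropy (natural log)."""
    return math.exp(mean_cross_entropy)


def accuracy(probabilities, targets):
    """Fraction of positions where the most probable token is the target."""
    hits = 0
    for p, t in zip(probabilities, targets):
        predicted = max(range(len(p)), key=p.__getitem__)
        hits += predicted == t
    return hits / len(targets)


# Cross-entropy and perplexity pairs copied from the results table.
reported = [(0.7861, 2.19), (0.35, 1.42), (0.5705, 1.77), (0.34, 1.40), (2.199, 9.00)]
consistent = all(abs(perplexity(ce) - ppl) < 0.02 for ce, ppl in reported)
```

Because of this relation, perplexity and cross-entropy always rank the runs in the same order; accuracy is the only independent metric of the three.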

5. Evaluation Results

Our top three models delivered promising results with varying configurations:

Best Performer - GRU with Word-Level tokenizer:
- 93.74% accuracy
- Dataset size: 1000
- Low perplexity: 1.40
- Cross-entropy loss: 0.34
- Only required 10 epochs
Runner-up - LSTM with Word-Level tokenizer:
- 93.33% accuracy
- Dataset size: 1000
- Required 15 epochs
- Perplexity: 1.42, cross-entropy loss: 0.35
Notable Mention - LSTM with SentencePiece tokenizer:
- 86.43% accuracy
- Smaller dataset: 500 source files
- 20 epochs training

6. Results and Discussion

The experiments give some insight into how model architecture and tokenization affect results:

The GRU model with the Word-Level tokenizer was the top performer (93.74% accuracy)
The Word-Level tokenization strategy (vocab size: 12847) proved particularly effective
512-dimensional embedding and hidden layers gave the best results among the configurations tested
A larger dataset alone did not improve results: the 2000-file SentencePiece run reached only 48.71% accuracy, well below the 500- and 1000-file runs

7. Setup and Execution Instructions

Setting up training

We provide different versions of the source code in this repo:

SentencePiece500-LSTM
SentencePiece100K-LSTM
Custom-LSTM

For Custom-LSTM and SentencePiece100K-LSTM, ensure you have a data folder containing the complete dataset. The dataset can be downloaded from the Py150 Dataset page [1].

Figure 5: Project folder structure showing key files and directories

Some models were trained directly on Google Colab:

https://colab.research.google.com/drive/17OFBm209293CkQC-KswJZO19_CVcIZ0o?usp=sharing

Execution/Inference

For testing or execution of trained models:

Open this Google Colab notebook and run the cells to load the saved models and their configurations:

https://colab.research.google.com/drive/1ZfokjtGddSwZSEbt7J-oCqlxsPjjicCC?usp=sharing
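Once a model is loaded, autocompletion amounts to ranking the vocabulary by predicted probability for the last L tokens of context. The sketch below shows that step in a model-agnostic way; it is not taken from the notebook. The names `suggest_next` and `logits_fn` are hypothetical: `logits_fn` stands in for a trained model and is assumed to return one score per vocabulary entry.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def suggest_next(context_ids, logits_fn, id_to_token, seq_len, k=3):
    """Return the k most probable next tokens with their probabilities.

    Only the last seq_len ids are used; padding a shorter context is
    left out of this sketch.
    """
    window = context_ids[-seq_len:]
    probs = softmax(logits_fn(window))
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id_to_token[i], probs[i]) for i in ranked[:k]]
```

With a PyTorch model, `logits_fn` could wrap a forward pass that turns the window into a batch of one and returns the output row as a list. Showing the top k candidates rather than a single prediction is closer to how IDE completion menus behave.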

8. Code Attribution

Parts of code adapted from external sources:

Custom tokenizer: Built with help from ChatGPT and documentation [2]
SentencePiece tokenizer: Adapted from student presentations, motivated by the original research paper [3]
LSTM and GRU architecture: Based on course notebooks [4]
Metric calculations: Adapted with help from ChatGPT

References

[1] Py150 Dataset
[2] Python Regular Expressions Documentation
[3] SentencePiece Paper
[4] Deep Learning RNN Lecture Notebook
