
abdulrahim2002/matchskill


Match skills of contributors in open source projects to issues

Overview of the research project:

  1. Domain labels can be generated reliably for open issues [Santos et al.]; however, the onus is still on the developer to find a task they can work on
  2. This project aims to automatically match contributors to issues they are capable of solving, based on their contribution history

Objectives:

  • Generate domain labels for closed issues using existing models
  • Generate developer profiles from each developer's contribution history, using the issues they've solved in the past and the domain labels associated with them
  • Find open issues a particular developer is capable of solving, based on their profile
  • Alternatively, find developers who can solve a particular open issue

Current capabilities

  • We can generate ground truth using the spaCy library and use the random forest model to predict issues (both open and closed). Code is in this notebook. See example output.

  • We can compute how frequently each developer engaged with each domain label. That is, we can generate a mapping of:

    developer -> domain -> frequency of engagement

    for each developer and domain

    This mapping can help us determine whether a particular developer is fit for a particular task. Please see the example output (and the sketch below).
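
As an illustrative sketch (the names and data here are hypothetical, not from the codebase), such a mapping can be represented as a nested dictionary in Python:

from collections import defaultdict

# developer -> domain -> frequency of engagement
engagement = defaultdict(lambda: defaultdict(int))

# count one engagement per domain label on each issue a developer solved
def record_engagement(developer, domain_labels):
    for domain in domain_labels:
        engagement[developer][domain] += 1

record_engagement("alice", ["Database", "UI"])
record_engagement("alice", ["Database"])
print(engagement["alice"]["Database"])  # 2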

Installation and running the code

The code here is currently based on a modified version of ART-CoreEngine.

The following changes are made to the Core Engine:

  1. OpenAI API keys are no longer required to run the random forest model
  2. The abstract syntax tree can now be generated on the fly: instead of first downloading a file from GitHub and then running the AST generation function on the downloaded file, the file contents can now be passed directly to the AST generation function as a string (see the sketch below)
  3. Some functionality might not work as expected
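
Change 2 is conceptually similar to how Python's built-in ast module parses source passed as a string, with no intermediate file on disk. This is only an illustration of the idea; the Core Engine's own AST function and its signature may differ:

import ast

# before: download the file from GitHub, save it, then parse the saved file
# after: pass the file contents directly as a string
source = "def add(a, b):\n    return a + b\n"
tree = ast.parse(source)  # no intermediate file needed
print(ast.dump(tree, indent=2))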

Running the notebooks

Linux

Generate virtual environment using:

python -m venv {name_of_your_virtual_environment}

Activate the virtual environment:

source {path_to_your_virtual_environment}/bin/activate

Install ipykernel and JupyterLab (for running notebooks):

# run after activating your virtual environment
pip install ipykernel jupyterlab

Register your virtual environment as a kernel in jupyter:

# run after activating your virtual environment. Name it as you like
python -m ipykernel install --user --name={any_name} --display-name "{any_name}"

In the project root directory, run:

# run after activating your virtual environment
jupyter-lab

To generate developer statistics

# activate your virtual environment first
python get_developer_stats.py /path_to_your_config.json

Here's an example config:

{
    "github_token": "#github token",
    "repo_owner": "JabRef",
    "repo_name": "JabRef",
    "openai_key": "#open ai api key",
    "limit": "100"
}

or

  • Use it as an API from CoreEngine:
import src as CoreEngine

data_frame = CoreEngine.get_developer_stats(
    repo_owner="jabref",
    repo_name="jabref",
    access_token="github_token",
    openai_key="not required",
    limit=100,
)

# use the dataframe afterwards
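
Assuming get_developer_stats returns a pandas DataFrame (as the variable name data_frame suggests), it can be inspected or persisted like any other DataFrame:

# hypothetical follow-up; assumes data_frame is a pandas DataFrame
print(data_frame.head())                  # preview the first rows
data_frame.to_csv("developer_stats.csv")  # persist for later analysis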

To set up the core engine:

  1. Run poetry install -- this sets up the virtual environment
  2. Run poetry run python3 -m spacy download en_core_web_md to download the language file
  3. Create a GitHub Personal Access Token to use for downloading issues from GitHub and save it in a file
  4. Set up a configuration file for training like below (see example pre-filled configuration for default)
  5. Set an environment variable in a .env file with OPENAI_API_KEY set to an OpenAI key (Note: NOT REQUIRED for the RF model; see the example .env below)
  6. Place a GitHub key in a file located at auth_path as specified in the config.json. (Default: input/mp_auth.txt)
  7. Run poetry run python3 main.py path/to/config.json where the json is the configuration file set up in step 4. This will download the issues, analyze them, and train the model. It stores the results in a cache, preventing repeated calls.
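
For step 5, a minimal .env file in the project root might look like this (the value is a placeholder, not a real key):

OPENAI_API_KEY=#your openai api key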

SAVE ai_result_backup.db in the output directory as this keeps track of AI artifacts. Deleting this file can result in having to redo OpenAI calls, costing money!

ℹ️ Info

If you want to restart the analysis from a clean state, delete ONLY the main.db file in the output directory. You should rarely have to delete main.db, except when switching repositories. main.db caches all extracted data to prevent re-download.

Note: Some CSV files in the project might not open in your regular editor. This is because they are delimited by "\a" (the ASCII BEL character). It is recommended that you load them as a dataframe.
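
For example, with pandas (pandas accepts any single-character separator such as "\a"; the file path here is a placeholder):

import pandas as pd

df = pd.read_csv("path/to/file.csv", sep="\a")
print(df.head())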
