Overview of the research project:
- Domain labels can be generated reliably for open issues [Santos et al.]; however, the onus is still on the developer to find a task they can work on
- This project aims to automatically match contributors to issues they are capable of solving, based on their contribution history
- Generate domain labels for closed issues using existing models
- Generate developer profiles from each developer's contribution history, using the issues they have solved in the past and the domain labels associated with them
- Find open issues a particular developer is capable of solving, based on their profile
- Or alternatively, find developers who can solve a particular open issue (a toy sketch of this matching idea follows this list)
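As a toy illustration of the matching idea (not the project's actual code; all names and numbers below are made up), a developer profile can be a per-domain engagement count, and candidates for an open issue can be ranked by summing their counts over the issue's domain labels:

```python
from collections import Counter

# Hypothetical developer profiles: per-domain engagement counts derived
# from the domain labels of closed issues each developer has solved.
profiles = {
    "alice": Counter({"UI": 12, "Database": 3}),
    "bob": Counter({"Database": 9, "Networking": 4}),
}

def rank_developers(issue_domains, profiles):
    """Score each developer by how often they have engaged with the
    issue's domains, highest score first."""
    scores = {
        dev: sum(counts[domain] for domain in issue_domains)
        for dev, counts in profiles.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Rank candidates for an open issue labeled with two domains.
print(rank_developers(["Database", "UI"], profiles))
# [('alice', 15), ('bob', 9)]
```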
We can generate ground truth using the spaCy library, and use the random forest model to predict domain labels for issues (both open and closed). Code is in this notebook. See example output.
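As a rough, hypothetical sketch of that pipeline (the notebook is the authoritative version), ground-truth domain labels could be assigned via spaCy vector similarity and then learned by a scikit-learn random forest; the domain names and issue texts below are invented for illustration:

```python
import spacy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Assign a ground-truth domain by vector similarity between the issue
# text and the domain name (the notebook's actual rules may differ).
nlp = spacy.load("en_core_web_md")
domains = ["user interface", "database", "networking"]

def ground_truth_label(issue_text):
    doc = nlp(issue_text)
    return max(domains, key=lambda d: doc.similarity(nlp(d)))

issues = [
    "Fix rendering glitch in the entry editor panel",
    "Connection pool leaks when the SQL backend restarts",
]
labels = [ground_truth_label(text) for text in issues]

# Train a random forest on simple TF-IDF features of the issue text.
features = TfidfVectorizer().fit_transform(issues)
model = RandomForestClassifier(n_estimators=100).fit(features, labels)
```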
We can compute how frequently each developer has engaged with each domain label. That is, we can generate a mapping of:
developer -> domain -> frequency of engagement
for each developer and domain.
This mapping can help us determine whether a particular developer is a good fit for a particular task. Please see example output
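A minimal sketch of building such a mapping, assuming we already have one (developer, domain) pair per domain label on the issues a developer has solved (the data below is made up):

```python
from collections import defaultdict, Counter

# One (developer, domain) record per domain label on a closed issue
# that the developer solved.
solved = [
    ("alice", "UI"), ("alice", "UI"), ("alice", "Database"),
    ("bob", "Networking"),
]

# developer -> domain -> frequency of engagement
engagement = defaultdict(Counter)
for developer, domain in solved:
    engagement[developer][domain] += 1

print(engagement["alice"])  # Counter({'UI': 2, 'Database': 1})
```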
The code here is currently based on a modified version of ART-CoreEngine.
The following changes are made to the Core Engine:
- OpenAI API keys are no longer required to run the random forest model
- The abstract syntax tree can now be generated on the fly: instead of first downloading a file from GitHub and then running the AST generation function on the downloaded file, the file contents can now be passed to the AST generation function directly as a string (see the illustrative sketch after this list)
- Some of the functionalities might not work as expected
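The following sketch only illustrates the idea of parsing source held in memory instead of a downloaded file; it uses Python's built-in ast module and is not CoreEngine's actual AST API:

```python
import ast

# Conceptually: no need to write the file to disk first; the source
# text fetched from GitHub can be parsed directly as a string.
source = "def add(a, b):\n    return a + b\n"
tree = ast.parse(source)
print(ast.dump(tree, indent=2))
```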
Generate a virtual environment using:
python -m venv {name_of_your_virtual_environment}
Activate the virtual environment:
source {path_to_your_virtual_environment}/bin/activate
Install ipykernel and JupyterLab (for running notebooks):
# run after activating your virtual environment
pip install ipykernel jupyterlab
Register your virtual environment as a kernel in jupyter:
# run after activating your virtual environment. Name it as you like
python -m ipykernel install --user --name={any_name} --display-name "{any_name}"
In the project root directory, run:
# run after activating your virtual environment
jupyter-lab
- Either directly run the get_developer_stats.py file with
# activate your venv first
python get_developer_stats.py /path_to_your_config.json
Here's an example config:
{
    "github_token": "#github token",
    "repo_owner": "JabRef",
    "repo_name": "JabRef",
    "openai_key": "#open ai api key",
    "limit": "100"
}
- Or use it as an API from CoreEngine:
import src as CoreEngine

data_frame = CoreEngine.get_developer_stats(
    repo_owner="jabref",
    repo_name="jabref",
    access_token="github_token",
    openai_key="not required",
    limit=100,
)
# use the dataframe afterwards
To set up and train the model:
- Run poetry install -- this sets up the virtual environment
- Run poetry run python3 -m spacy download en_core_web_md to download the language file
- Create a GitHub Personal Access Token to use for downloading issues from GitHub and save it in a file
- Set up a configuration file for training like the one below (see the example pre-filled configuration for defaults)
- Set an environment variable in a .env file with OPENAI_API_KEY set to an OpenAI key (Note: NOT REQUIRED for the RF model)
- Place a GitHub key in a file located at auth_path as specified in the config.json (default: input/mp_auth.txt)
- Run poetry run python3 main.py path/to/config.json, where the JSON is the configuration file set up above. This will download, analyze, and train the model. It stores the results in a cache, preventing repeated calls.
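The pre-filled example configuration in the repository is the authoritative reference. Purely as a rough sketch of the shape such a file might take, it could look something like this; only auth_path is mentioned in the steps above, and the other keys are hypothetical placeholders modeled on the get_developer_stats config shown earlier:

```json
{
    "repo_owner": "JabRef",
    "repo_name": "jabref",
    "auth_path": "input/mp_auth.txt"
}
```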
⚠️ Warning
SAVE ai_result_backup.db in the output directory as this keeps track
of AI artifacts. Deleting this file can result in having to redo OpenAI
calls, costing money!
ℹ️ Info
If you want to restart the analysis from a clean state, delete ONLY
the main.db file in the output directory. You should rarely have to
delete main.db, except when switching repositories. main.db caches
all extracted data to prevent re-download.