[RFC]: Automated Code Reviews and Fixes via LLM-powered stdlib-bot

### Full name

Prajjwal Bajpai

### University status

Yes

### University name

Indian Institute of Technology (BHU) Varanasi

### University program

Mechanical Engineering

### Expected graduation

2027

### Short biography

I am a sophomore at IIT (BHU) Varanasi in India, pursuing a B.Eng. with a major in Mechanical Engineering. I have been keen on programming since school, have some knowledge of ML and AI, and am currently learning development. I have taken several college mathematics courses in linear algebra, statistics, and probability. I am comfortable working in Python, C++, and JS. I like to learn new things and challenge myself by taking on hard tasks. I have just started my open-source journey, and my first merged PR was with stdlib.

### Timezone

Indian Standard Time (UTC +5:30)

### Contact details

email:prajjwal8166@gmail.com, github:prajjwalbajpai

### Platform

Linux

### Editor

I use VSCode because of its popularity and highly customizable experience.

### Programming experience

My programming experience is mostly in machine learning with Python and includes the following:

- I have worked mainly in Python and have knowledge of data science and computer vision.
- I have worked with Python libraries such as PyTorch, NumPy, and more.
- I built a hackathon web scraper which uses Selenium and the Gemini API to scrape hackathon listings from the internet.
- I have knowledge of JavaScript and React.js.

### JavaScript experience

I have used JavaScript with React.js for frontend applications.

I like JavaScript for its large community, which means I am never stuck for long, and for the even larger number of libraries available. Also, when I transitioned from C++ to JS, I did not face any syntax issues because of its simplicity.

I don't like JavaScript's error handling. Some other languages handle errors better.

### Node.js experience

I am new to Node.js and learning as I go, but I have learned a lot while contributing to stdlib-js.

### C/Fortran experience

I have completed the `CSO-101: Introduction to Computer Programming` course at my college, which is taught in C. It focuses on C programming features such as pointers and data structures.

### Interest in stdlib

I find stdlib very important for JavaScript because it brings optimised computing to the language. Having worked in Python, where there are more than enough libraries for mathematical operations (SymPy being a great example), I realised that JavaScript has few libraries that support mathematical functions extensively.

I really like the data visualisation support and browser compatibility, which make it very easy to integrate with a frontend. I am planning to create a dashboard for an ML project with stdlib.

I also started my open-source journey with stdlib.

### Version control

Yes

### Contributions to stdlib

I have made PRs related to refactoring and adding accessor support to functions in `stats/base/`.
I have worked on the following functions:
`variancetk`, `nanvariancetk`, `nanmeanwd`, `nanvariancewd`, `nanrange` and `meanwd`, with 3 PRs merged.
I am currently working on issues related to C to get hands-on experience with C as well. I will keep contributing regularly to stdlib-js, as I really like its purpose and, even more, the maintainers, who are really helpful.



### stdlib showcase

My showcase for stdlib is [Neural Network Implementation in JS using stdlib-js](https://github.com/prajjwalbajpai/stdlib-showcase).

Implementing a neural network from scratch, without any deep learning libraries, is often the first project people build in their deep learning journey. I implemented a simple neural network with one hidden layer and did all the mathematical work (dot products, exponentials, logarithms, etc.) using stdlib. Although I found some features that were missing and could be added, it was a good project for using stdlib first-hand and getting to know it better.

I used a wine-quality dataset, which has no NaN values and no skewed distribution.
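
To make the computation concrete, here is a minimal sketch of a forward pass through a network with one hidden layer, built only from dot products, exponentials, and logarithms. It is written in Python purely for illustration (the showcase itself is in JS with stdlib), and the sigmoid activation, the binary log loss, and all function names are assumptions rather than a copy of the showcase code.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    """Logistic function built from a single exponential."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, w2, b2):
    """Forward pass: input -> one hidden layer -> scalar output in (0, 1)."""
    hidden = [sigmoid(dot(row, x) + b) for row, b in zip(W1, b1)]
    y_hat = sigmoid(dot(w2, hidden) + b2)
    return hidden, y_hat


def binary_cross_entropy(y, y_hat):
    """Log loss for a single example with label y in {0, 1}."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
```

With all weights and biases set to zero, every unit outputs 0.5 and the loss equals `log(2)`, which is a handy sanity check before training.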

### Goals

I want to complete and enable the new stdlib-bot, which can do the following:
- Automatically recommend changes for problems such as linting errors or unused functions, which are commonly introduced by newcomers and currently require maintainers to request changes. This can also include automatically pushing the changes to the pull request.
- Add details about the work needed to fix an issue, such as the files and functions involved. This would be really helpful for new contributors.
- Run the tests for the files changed in a PR and recommend changes to make the tests pass.
- Produce a confidence score for LLM-generated fixes and PR reviews, which can be used to decide whether human intervention is needed.
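
As a rough sketch of the last goal, the confidence score could gate what the bot is allowed to do with a suggestion. The `Suggestion` type, the `route` function, the action names, and both threshold values below are hypothetical; the real thresholds would need to be calibrated against maintainer feedback.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    path: str          # file the suggestion applies to
    message: str       # text of the proposed fix or review comment
    confidence: float  # model-reported score in [0, 1]


def route(suggestion, auto_apply_at=0.9, suggest_at=0.6):
    """Pick an action for a suggestion based on its confidence score.

    The thresholds are placeholder values, not tuned numbers.
    """
    if not 0.0 <= suggestion.confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if suggestion.confidence >= auto_apply_at:
        return "auto-apply"
    if suggestion.confidence >= suggest_at:
        return "suggest"
    return "needs-human-review"
```

Keeping the policy in one small function makes it easy to start conservatively (for example, never auto-applying) and loosen the thresholds only as the scores prove reliable.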

For the project to work, I have the following tasks in mind (not an exhaustive list):
- Fine-tune an LLM, or build a RAG pipeline around one, using older PRs and the conversations in issues so that it understands the project context and can do PR reviews.
- Use Cursor AI or a similar platform to integrate AI into the workflow for automation and efficiency.
- Integrate with GitHub Actions for seamless automation.
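
The retrieval half of a RAG setup can be sketched as follows: given a description of a change, find the most similar past review comments so they can be supplied to the LLM as context. This sketch swaps in plain token overlap (Jaccard similarity) for the embedding search a real pipeline would likely use, and the function names and sample comments are made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of word-like tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(query, corpus, k=2):
    """Return up to k documents from corpus that best overlap the query."""
    q = tokenize(query)
    scored = [(jaccard(q, tokenize(doc)), i) for i, doc in enumerate(corpus)]
    # Highest score first; ties keep the original corpus order.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [corpus[i] for score, i in scored[:k] if score > 0]


# Hypothetical past review comments.
PAST_COMMENTS = [
    "Please remove the unused function `isnan`.",
    "Fix indentation: use tabs instead of spaces.",
    "Add accessor support for array-like inputs.",
]
```

For example, `retrieve("unused function left in lib/main.js", PAST_COMMENTS, k=1)` returns only the first comment, since it is the only one sharing tokens with the query.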

### Why this project?

When I started contributing to stdlib, I found that I often had small doubts about the codebase and issues, and since maintainers are not always available to help, I would get stuck. This problem can be addressed with AI. Also, maintainers sometimes have to make small changes manually, such as removing unused functions or fixing spacing issues, which can easily be automated to improve efficiency. As my background is in ML, I would really like my skills to be of use to stdlib.

### Qualifications

I am fairly confident that I can work on this project. I have built multiple projects that use LLMs such as Gemini and RAG models. As a contributor, I know the problems one can face while contributing, as well as the problems that can arise while training RAG models.


### Prior art

Many GitHub organisations have implemented similar automation features. The [Ansys review bot](https://github.com/ansys/review-bot) caught my eye. It uses an OpenAI (ChatGPT) model to automatically generate suggestions for improving GitHub pull requests. It is implemented in Python, but I don't see any issue doing this in JS. It covers fewer features than this project requires, but it is a very good starting point.

[This tutorial](https://dev.to/sunilkumrdash/automate-github-pr-reviews-with-langchain-agents-444p) implements something similar with a detailed explanation.
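
The common first step in tools like these is turning a pull request diff into a prompt for the model. Below is a minimal sketch of that step, assuming the diff is in unified format; the function names and the prompt wording are hypothetical and not taken from either project above.

```python
def added_lines_by_file(diff_text):
    """Map each changed file to the lines a unified diff adds to it.

    Caveat: an added line whose own content starts with "++ " would be
    mistaken for a file header by this simple parser.
    """
    files = {}
    current = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # File header, e.g. "+++ b/lib/main.js"; "/dev/null" marks a deletion.
            path = line[4:].strip()
            if path == "/dev/null":
                current = None
            else:
                current = path[2:] if path.startswith("b/") else path
                files.setdefault(current, [])
        elif line.startswith("+") and current is not None:
            files[current].append(line[1:])
    return files


def build_review_prompt(diff_text):
    """Assemble a plain-text review prompt from the added lines only."""
    parts = ["Review the following added lines and suggest fixes:"]
    for path, lines in added_lines_by_file(diff_text).items():
        parts.append(f"File: {path}")
        parts.extend(f"  + {line}" for line in lines)
    return "\n".join(parts)


SAMPLE_DIFF = "\n".join([
    "diff --git a/lib/main.js b/lib/main.js",
    "--- a/lib/main.js",
    "+++ b/lib/main.js",
    "@@ -1,2 +1,3 @@",
    " var x = 1;",
    "+var unused = 2;",
    " module.exports = x;",
])
```

Sending only the added lines keeps the prompt short; a fuller version would probably include some surrounding context lines as well.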

### Commitment

I don't have any engagements after 9th May 2025, as my college summer break starts then. I will need a 3-4 day break around 15th May as I am planning a vacation, but I will make up for the lost time by working extra. I can comfortably dedicate 30-35 hours per week across weekdays and weekends, and I can increase my hours to complete any critical work and stay on track with the timeline.

### Schedule

Assuming a 12-week schedule:

- **Community Bonding Period**: Discussing the exact workflow with my mentor and finalising the plan. Getting to know tools that are new to me, such as GitHub Actions and the GitHub REST API.

- **Week 1**: Choosing the LLM/RAG model, setting up the fine-tuning environment, and defining hyperparameters.
The required libraries include Hugging Face Transformers and PEFT (Parameter-Efficient Fine-Tuning), with LoRA (Low-Rank Adaptation) to reduce memory usage.

- **Week 2**: Extracting the data required to fine-tune the model using the GitHub REST API.
I will mainly focus on the changed lines in previous pull requests and the comments made on them, so that the LLM understands the context.

- **Week 3**: Processing the extracted data for fine-tuning, including cleaning it (if this takes more time, it can be adjusted in later weeks).
This week will include steps such as removing stopwords, tokenizing sentences, and removing irrelevant data.

- **Week 4**: Training a model on a small part of the extracted data.
An important part of training will be teaching the model to provide a confidence score for the changes it recommends.

- **Week 5**: Evaluating the model and deploying it to a GitHub test repository (not the final model).
Deploying the model and integrating it with stdlib-bot is very important. I am considering Azure for model deployment and will confirm this after talking to my mentor.

- **Week 6**: (midterm) Testing the prototype model and submitting it for evaluation.
Testing will use both the test portion of the extracted dataset and manual testing of the model.

- **Week 7**: Scaling training to the whole dataset.

- **Week 8**: Adding more features, such as automating issue creation and adding comments to issues.
I have several features like these in mind and, if time permits, can add them after discussing with my mentor.

- **Week 9**: Continuing work on the additional features from Week 8.

- **Week 10**: Deploying the model with GitHub Actions.

- **Week 11**: Testing the model and fixing related issues.

- **Week 12**: Completing the documentation related to the project.

- **Final Week**: Intensive testing of the project to ensure no issues are faced.
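
The data preparation planned for Weeks 2 and 3 can be sketched as follows. The input dictionaries mirror a few fields of the GitHub REST API's pull request review comment objects (`path`, `diff_hunk`, `body`, `user`); the output record layout, the function names, and the `min_words` cutoff are assumptions made for illustration. No network request is made here; the comments are assumed to have been fetched already.

```python
import json
import re


def clean_text(text):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def to_training_records(comments, min_words=3):
    """Turn review comments into (diff hunk -> comment) training pairs.

    Skips bot comments, very short comments (e.g. "LGTM"), and duplicates.
    """
    seen = set()
    records = []
    for c in comments:
        user = c.get("user") or {}
        if user.get("type") == "Bot":
            continue
        body = clean_text(c.get("body") or "")
        if len(body.split()) < min_words:
            continue
        key = (c.get("path"), c.get("diff_hunk"), body)
        if key in seen:
            continue
        seen.add(key)
        records.append({
            "path": c.get("path") or "",
            "input": c.get("diff_hunk") or "",
            "target": body,
        })
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)
```

This sketch only normalizes whitespace and filters noise; whether to also strip stopwords is left open, since that depends on the tokenizer of the chosen model.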

Notes:

- The community bonding period is a 3 week period built into GSoC to help you get to know the project community and participate in project discussion. This is an opportunity for you to set up your local development environment, learn how the project's source control works, refine your project plan, read any necessary documentation, and otherwise prepare to execute on your project proposal.
- Usually, even week 1 deliverables include some code.
- By week 6, you need enough done at this point for your mentor to evaluate your progress and pass you. Usually, you want to be a bit more than halfway done.
- By week 11, you may want to "code freeze" and focus on completing any tests and/or documentation.
- During the final week, you'll be submitting your project.


### Related issues

[Project Idea](https://github.com/stdlib-js/google-summer-of-code/issues/103)

### Checklist

- [x] I have read and understood the [Code of Conduct](https://github.com/stdlib-js/stdlib/blob/develop/CODE_OF_CONDUCT.md).
- [x] I have read and understood the application materials found in this repository.
- [x] I understand that plagiarism will not be tolerated, and I have authored this application in my own words.
- [x] I have read and understood the [patch requirement](https://github.com/stdlib-js/google-summer-of-code/blob/main/README.md#patch-requirement) which is necessary for my application to be considered for acceptance.
- [x] I have read and understood the [stdlib showcase requirement](https://github.com/stdlib-js/google-summer-of-code/blob/main/README.md#showcase-requirement) which is necessary for my application to be considered for acceptance.
- [x] The issue name begins with `[RFC]:` and succinctly describes your proposal.
- [x] I understand that, in order to apply to be a GSoC contributor, I must submit my final application to <https://summerofcode.withgoogle.com/> **before** the submission deadline.
