48 changes: 47 additions & 1 deletion docs/2025/data-pipeline/index.md
@@ -2,4 +2,50 @@
sidebar_position: 2
title: Introduction
slug: /2025/data-pipeline/
---
---
<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <oyewaleabdulsobur@gmail.com>
-->

## Author

[Abdulsobur Oyewale](https://github.com/smilingprogrammer)

## Contact info

- [Email](mailto:oyewaleabdulsobur@gmail.com)

## Project title

Data Pipelining For Safaa

## What's the project about?

Currently, Safaa provides a strong framework for handling copyright notices. It focuses on identifying and reducing false positives and on decluttering notices to remove unnecessary content. Key features of Safaa include:
1. Model Flexibility
2. Integration with scikit-learn
3. spaCy Integration
4. Preprocessing Tools

However, data in the Safaa project is currently curated manually, and most of the surrounding workflow is manual as well.
This project will concentrate on creating a pipeline that automates this work, utilizing LLMs or deep learning techniques where required to improve accuracy.

It also involves writing scripts that automatically copy copyright data (a group's data or selected users' data) from a FOSSology instance to train the model.


## What should be done?

Here are the key tasks planned for the project:

1. Create scripts to fetch the copyright data from the FOSSology server's copyright table (localhost).
2. Clean and preprocess the fetched copyright data (utilize the prewritten processing functions).
- Preprocessed data should have a label and clean text.
3. Split the data for training/validation/test.
4. Train the false positive model as well as the declutter model (utilize the prewritten train functions).
5. Model evaluation (check for precision, recall, etc.).
6. Model versioning and release.
7. Should work for both GitLab and GitHub.
- Manual trigger.
- Should also be able to run as a cron job.
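
As an illustration of the evaluation step (task 5), the following is a minimal sketch of how precision, recall, and F1 could be computed for a binary false positive classifier. The function name `precision_recall_f1` and the label encoding (1 for a true copyright, 0 for a false positive) are assumptions made for this example and are not taken from the Safaa codebase.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 for one class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1
```

In practice, scikit-learn's metrics could be used instead, since Safaa already integrates with scikit-learn; the plain-Python version above only shows what the numbers mean.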
42 changes: 42 additions & 0 deletions docs/2025/data-pipeline/updates/2025-05-30.md
@@ -0,0 +1,42 @@
---
title: Community bonding
author: Abdulsobur Oyewale
---
<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <oyewaleabdulsobur@gmail.com>
-->
# **Meeting Summary for GSoC Community Bonding Period**

# Introduction Meeting
*(May 29, 2025)*

This was the inaugural meeting of the community bonding period for GSoC 2025.
* A general introduction of mentors and contributors took place.
* We were given an introduction to the FOSSology community.
* The time and platform for the weekly general meeting were discussed.
* We also went over the expectations for the GSoC program.
* The mentors emphasized the importance of communication in open source projects.
* At the end, there was a Q&A session to address any questions we had.

# Personal Meeting With The Mentors
*(May 30, 2025)*


* They emphasized the importance of documentation in this project.
* I was encouraged to post regular updates.
* We discussed the project and what the targets and expectations are.
* We also discussed timings for the weekly technical calls, but didn't make a final decision since one of the mentors wasn't available on the call.
* I also discussed reviewing last year's work with my mentor.
* We also discussed adding my documentation to the FOSSology GSoC page and submitting a pull request.
* Lastly, I talked with the mentors about how to start my coding period: installing FOSSology locally and trying out different tests to understand how it works.


### Engagements

* Explored the FOSSology local setup and installation process.
* Went over some crucial pipeline requirements essential for Safaa's automation efforts.


**This report summarizes my activities and interactions during the GSoC community bonding period.**
31 changes: 31 additions & 0 deletions docs/2025/data-pipeline/updates/2025-06-04.md
@@ -0,0 +1,31 @@
---
title: Week 1
author: Abdulsobur Oyewale
tags: [gsoc25, Data Pipeline for Safaa]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <oyewaleabdulsobur@gmail.com>
-->

# WEEK 1
*(June 4, 2025)*

## Attendees:
- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)
- [Ayush Kumar Bhardwaj](https://github.com/hastagAB)

### Engagements
* I installed FOSSology locally and worked around the obstacle of running it on Windows. Since the FOSSology installation guide works best with Linux, I completed the installation using WSL2.
* I also ran various examples through the Safaa agent to test its features and functionality, which gave me insight into how it currently works. You can find this here.

## Discussion:
* I described how I installed FOSSology using the link my mentors provided and how I familiarized myself with its features.
* Furthermore, I discussed the tests I ran with Safaa's current copyright detection agent, and how I then experimented with the false positive deactivation agent, trying it out on examples to assess its features and functionality.
* Lastly, Safaa's performance was critically evaluated, and strategies for acquiring data for my copyright script were discussed with me.


## Subsequent Steps
* I was asked to begin with the first task on the project list: creating a script to get copyright data from a FOSSology instance.
37 changes: 37 additions & 0 deletions docs/2025/data-pipeline/updates/2025-06-11.md
@@ -0,0 +1,37 @@
---
title: Week 2
author: Abdulsobur Oyewale
tags: [gsoc25, Data Pipeline for Safaa]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <oyewaleabdulsobur@gmail.com>
-->

# WEEK 2
*(June 11, 2025)*

## Attendees:
- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)
- [Ayush Kumar Bhardwaj](https://github.com/hastagAB)
- [Kaushlendra Pratap](https://github.com/Kaushl2208)

### Engagements
* This week I started working fully on this year's project. The first task on the list is to create a script that fetches copyright content from the FOSSology server.
* I started by writing SQL queries to fetch this content from the FOSSology server, and after several rounds of tweaking I got the result I needed.
* Once the SQL query was working, I wrote a Python program that embeds the PostgreSQL query and uses the psycopg library to connect to the Postgres database server.
* With this, I was able to automate the collection of copyright content from the FOSSology server running on localhost.
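
The fetch step described above can be sketched roughly as follows. This is not the project's actual script: the query text, the `copyright` table and `content` column names, and the `fetch_copyright_content` helper are assumptions for illustration and should be checked against the real FOSSology schema. The sketch assumes `psycopg2` as the driver and imports it lazily, so a different connect function can be passed in for testing.

```python
from typing import Callable, List, Optional

# Assumed query: table and column names are illustrative, not confirmed by the report.
COPYRIGHT_QUERY = "SELECT DISTINCT content FROM copyright WHERE content IS NOT NULL;"


def fetch_copyright_content(
    conn_params: dict, connect: Optional[Callable] = None
) -> List[str]:
    """Run the query and return the content column as a list of strings."""
    if connect is None:
        # Third-party dependency, imported only when a real connection is needed.
        import psycopg2

        connect = psycopg2.connect

    conn = connect(**conn_params)
    try:
        with conn.cursor() as cur:
            cur.execute(COPYRIGHT_QUERY)
            return [row[0] for row in cur.fetchall()]
    finally:
        # Always release the connection, even if the query fails.
        conn.close()
```

A call against a local instance would pass connection details such as `host`, `port`, `dbname`, `user`, and `password` in `conn_params`; the actual values depend on the FOSSology installation.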


## Meeting Discussion:
* I discussed the week's progress with the mentors, including how the project is going and whether there were any obstacles.
* We discussed the current progress, which is the script that fetches content from the FOSSology localhost server.
* I also gave them a demo of how it works and the expected output from the script.


## Subsequent Steps
* I was asked to include a timestamp with the generated data, so that the sequence of data updates can be tracked.
* I was also asked to change the script to accommodate various server configurations by placing the server configuration in a `.env` file.
* I will also continue with the preprocessing script, which will allow us to preprocess the data fetched from the FOSSology server.
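
The first two follow-up items can be sketched with the standard library alone. The helper names `parse_env` and `timestamped_name`, the variable names such as `DB_HOST`, and the output file naming are hypothetical; the actual script may well use a library such as python-dotenv instead.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    config: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        config[key.strip()] = value.strip().strip('"').strip("'")
    return config


def timestamped_name(
    prefix: str, now: Optional[datetime] = None, ext: str = "csv"
) -> str:
    """Build an output file name that records when the data was fetched (UTC)."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}_{now.strftime('%Y%m%dT%H%M%SZ')}.{ext}"
```

Keeping the timestamp in the file name makes successive fetches sortable and lets later pipeline stages pick the most recent export.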
36 changes: 36 additions & 0 deletions docs/2025/data-pipeline/updates/2025-06-18.md
@@ -0,0 +1,36 @@
---
title: Week 3
author: Abdulsobur Oyewale
tags: [gsoc25, Data Pipeline for Safaa]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <oyewaleabdulsobur@gmail.com>
-->

# WEEK 3
*(June 18, 2025)*

## Attendees:
- [Ayush Kumar Bhardwaj](https://github.com/hastagAB)
- [Kaushlendra Pratap](https://github.com/Kaushl2208)

### Engagements
* This week I began the second task on the list: creating a script to preprocess the copyright content from the FOSSology server.
* I was informed last week that a pre-written script is available in the Safaa codebase, which I can use to complete this task faster.
* I started by working through this pre-written script, reading and understanding the code before modifying it to suit our intent.
* After that, I modified the script to match our intent. With this, I was able to preprocess the data we retrieved from the FOSSology server running on localhost.
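
To make the target of this step concrete, here is a minimal sketch of producing records with a label and clean text. It does not reproduce Safaa's pre-written preprocessing functions; `clean_text`, `preprocess`, and the specific cleaning rules (control-character removal, whitespace normalization, de-duplication) are assumptions chosen for illustration.

```python
import re
from typing import Dict, Iterable, List, Tuple

_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Replace control characters with spaces and collapse runs of whitespace."""
    text = _CONTROL_CHARS.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()


def preprocess(rows: Iterable[Tuple[str, int]]) -> List[Dict[str, object]]:
    """Turn (raw_text, label) pairs into de-duplicated {text, label} records."""
    seen = set()
    records: List[Dict[str, object]] = []
    for raw, label in rows:
        cleaned = clean_text(raw)
        # Skip empty results and exact duplicates after cleaning.
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            records.append({"text": cleaned, "label": label})
    return records
```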


## Meeting Discussion:
* I discussed the week's progress with the mentors, including how the project is going and whether there were any obstacles.
* We discussed the current progress, which is the preprocessing of data fetched from the FOSSology localhost server using the available pre-written script.
* I also gave them a demo of how it works and the expected output from the script.
* I was told the task needs to be modified so that it can be triggered using GitHub Actions rather than by running the script manually.


## Subsequent Steps
* Given that we already have working preprocessing, I was asked to modify it so that it is triggered with GitHub Actions.
* I will continue with this task next week.
38 changes: 38 additions & 0 deletions docs/2025/data-pipeline/updates/2025.06.25.md
@@ -0,0 +1,38 @@
---
title: Week 4
author: Abdulsobur Oyewale
tags: [gsoc25, Data Pipeline for Safaa]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <oyewaleabdulsobur@gmail.com>
-->

# WEEK 4
*(June 25, 2025)*

## Attendees:
- [Ayush Kumar Bhardwaj](https://github.com/hastagAB)
- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)

### Engagements
* This week I began migrating the scripts into a pipeline that can be triggered through GitHub Actions. At the time of writing, the available (prepared) scripts are:
- FOSSology server fetching script
- Applied preprocessing script
- Data splitting script

* I created a `pipeline.yml` file, added the preprocessing and data splitting scripts above to the pipeline, and included the ability to download the output of each script from the logs while the triggered GitHub Actions run is in progress.
* I deployed this to my own GitHub repository so that I could test the feature and changes separately on my own GitHub Actions before creating a pull request.
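
For reference, the data splitting stage of the pipeline could look roughly like the sketch below. The function name `split_dataset`, the 80/10/10 ratios, and the fixed seed are assumptions for this example, not values taken from the project.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    records: Sequence[T], train: float = 0.8, val: float = 0.1, seed: int = 42
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle deterministically and split into train/validation/test lists."""
    items = list(records)
    # A seeded Random instance keeps the split reproducible across pipeline runs.
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],  # whatever remains becomes the test set
    )
```

A fixed seed matters in a CI pipeline: re-running the same workflow on the same input should give the same split, so that evaluation numbers stay comparable between runs.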

## Meeting Discussion:
* In the meeting, I discussed with the mentors how I approached the task to achieve this goal.
* I showed them the pipeline-related scripts, their code, their application in my repository, and how the code runs step by step.
* I also demoed the pipeline on GitHub, from starting the pipeline through GitHub Actions, to checking the logs and downloading the output artifacts, until all processes finished.
* We also talked about getting a dataset to train our model for modification and improvement.


## Subsequent Steps
* I will continue with the next task on the list, which is applying the pre-written script to the pipeline.
35 changes: 35 additions & 0 deletions docs/2025/data-pipeline/updates/2025.07.02.md
@@ -0,0 +1,35 @@
---
title: Week 5
author: Abdulsobur Oyewale
tags: [gsoc25, Data Pipeline for Safaa]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <oyewaleabdulsobur@gmail.com>
-->

# WEEK 5
*(July 02, 2025)*

## Attendees:
- [Ayush Kumar Bhardwaj](https://github.com/hastagAB)
- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)

### Engagements
* This week I went ahead with applying the pre-written decluttering script to the pipeline.
* I started by trying out the script to check what our expected output should be, as this gives us insight into the decluttered output.
* While experimenting with examples, I realized that the decluttering script and its features aren't very effective in general.
* I set out to add a few regex rules to the decluttering script to increase its effectiveness a little. This approach is constrained, however, because copyright text doesn't follow a fixed set of rules, so rules that work on the text we tried might not apply to other text.

* I also integrated the available pre-written script into the pipeline.
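
To give a sense of what regex-based decluttering rules look like, and why they generalize poorly, here is a small hypothetical sketch. These rules and the `declutter` function are invented for illustration; they are neither the rules added in this project nor Safaa's declutter model.

```python
import re

# Each rule is (pattern, replacement); all of them are illustrative assumptions.
_RULES = [
    (re.compile(r"^[\s/*#\-]+"), ""),  # leading comment markers such as "/*" or "#"
    (re.compile(r"[\s/*]+$"), ""),  # trailing comment markers such as "*/"
    (re.compile(r"\ball rights reserved\b\.?", re.IGNORECASE), ""),
    (re.compile(r"\s+"), " "),  # collapse leftover whitespace
]


def declutter(text: str) -> str:
    """Apply each rule in order and trim the result."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text.strip()
```

Every rule encodes one observed pattern, so a notice phrased differently (for example, rights text in another language) passes through unchanged, which matches the limitation noted above.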

## Meeting Discussion:
* In this week's meeting, I discussed the task with the mentors, along with what I discovered from applying the declutter script.
* I showed them the output I got from an example text, and also ran it live for confirmation.
* I also showed them the output from the modified script I wrote with additional regex rules, and we compared the two outputs.
* I was instructed to create a PR with the work I have done so far on the project and submit it to the main Safaa repository.

## Subsequent Steps
* I will proceed with getting some datasets from the mentors, so that I have data for training and can carry out experiments for our model modification and improvement task.
4 changes: 4 additions & 0 deletions docs/2025/data-pipeline/updates/_category_.json
@@ -0,0 +1,4 @@
{
"label": "Weekly Updates",
"position": 2
}