Commit f262b91

Merge pull request #307 from rajuljha/main
chore(docs): Add Atarashi week1 report
2 parents c9bf1a2 + b219481 commit f262b91

4 files changed

+184
-1
lines changed

docs/2025/atarashi-enhancement/index.md

Lines changed: 59 additions & 1 deletion
@@ -2,4 +2,62 @@
22
sidebar_position: 1
33
title: Introduction
44
slug: /2025/atarashi-enhancement/
5-
---
5+
---
6+
7+
<!--
8+
SPDX-License-Identifier: CC-BY-SA-4.0
9+
10+
SPDX-FileCopyrightText: 2025 Rajul Jha <rajuljha49@gmail.com>
11+
-->
12+
13+
## Author
14+
15+
[Rajul Jha](https://github.com/rajuljha)
16+
17+
## Contact info
18+
19+
- [Email](mailto:rajuljha49@gmail.com)
20+
- [LinkedIn](https://linkedin.com/in/rajuljha)
21+
22+
## Project title
23+
24+
Enhancing Atarashi License Scanner
25+
26+
## What's the project about?
27+
28+
[Atarashi](https://github.com/fossology/atarashi) is a modern, information-retrieval-based license scanner integrated into the FOSSology ecosystem. It utilizes statistical techniques such as TF-IDF, cosine similarity, Damerau-Levenshtein distance, and N-gram distance to identify licenses in source code files. While Atarashi demonstrates promising performance with an accuracy of around 80%, this project aims to significantly improve both the accuracy and robustness of its predictions.
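As a rough illustration of the TF-IDF and cosine-similarity idea mentioned above, the following minimal sketch scores a snippet against two toy license texts. It is not Atarashi's implementation: the tokenizer, the smoothed IDF formula, the function names, and the sample texts are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy stand-ins for license texts (assumed); real reference texts are far longer.
licenses = {
    "MIT": "permission is hereby granted free of charge to any person obtaining a copy",
    "GPL-2.0": "this program is free software you can redistribute it and or modify it",
}
snippet = "permission is hereby granted free of charge"

names = list(licenses)
vectors = tfidf_vectors([licenses[name] for name in names] + [snippet])
query = vectors[-1]
scores = {name: cosine_similarity(query, vec) for name, vec in zip(names, vectors)}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The highest-scoring license is the prediction; a real scanner would also compare that score against a confidence threshold before reporting it.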
29+
30+
The main objectives of this project include:
31+
- Adding a keyword-based pre-filtering mechanism to improve match precision and reduce the time the agents spend on redundant scanning.
32+
- Enhancing the existing classifier with better similarity metrics and model tuning.
33+
- Incorporating fallback logic to handle ambiguous or low-confidence license predictions.
34+
- Utilizing the Minerva license dataset to train and evaluate the model more effectively.
35+
- Ensuring seamless integration of improvements into the existing open pull request [#1634](https://github.com/fossology/fossology/pull/1634).
36+
37+
## What should be done?
38+
39+
### Integrating a keyword-based pre-filtering model
40+
- Develop a pre-filtering module that leverages a configurable keyword list.
41+
- The filter narrows the set of candidate licenses so that classification can focus on fewer options.
42+
- Document the keyword matching logic and make the keywords configurable.
43+
- Move towards an ML-based approach for keyword pre-filtering.
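One possible shape for a configurable keyword pre-filter is sketched below. The JSON layout, the keyword groups, and the function names `load_keywords` and `candidate_licenses` are hypothetical and only illustrate how a keyword list kept in configuration could narrow the candidate set.

```python
import json

# Hypothetical configuration; the keyword groups are illustrative only.
CONFIG_JSON = """
{
  "MIT": ["permission is hereby granted", "without restriction"],
  "GPL-2.0": ["gnu general public license", "free software foundation"],
  "Apache-2.0": ["apache license", "version 2.0"]
}
"""


def load_keywords(raw_json):
    """Parse the config and normalise every keyword to lower case."""
    config = json.loads(raw_json)
    return {lic: [kw.lower() for kw in kws] for lic, kws in config.items()}


def candidate_licenses(text, keywords):
    """Return the licenses that have at least one keyword present in the text."""
    # Collapse whitespace so multi-word keywords match across line breaks.
    normalised = " ".join(text.lower().split())
    return [
        lic for lic, kws in keywords.items()
        if any(kw in normalised for kw in kws)
    ]


keywords = load_keywords(CONFIG_JSON)
header = "Licensed under the Apache License, Version 2.0 (the License)"
print(candidate_licenses(header, keywords))
```

Keeping the keywords in a data file rather than in code means the list can be extended without touching the matching logic.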
44+
45+
### Improving the classifier
46+
- Analyze the current classifier’s performance using Minerva as a benchmark.
47+
- Explore enhancing the similarity metrics or switching to more robust statistical models.
48+
- Retrain and validate the model with improved datasets and parameters.
49+
50+
### Fallback mechanism for ambiguous predictions
51+
- Define thresholds for low-confidence matches.
52+
- In cases where confidence is below the threshold, add a secondary mechanism such as fuzzy match fallback or keyword-only fallback.
53+
- Clearly log fallback occurrences for later analysis.
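The fallback flow described above could look roughly like the sketch below. The threshold value, the logger name, and the helpers `keyword_fallback`, `classify_with_fallback`, and `weak_primary` are assumptions for illustration, not part of Atarashi.

```python
import logging

logger = logging.getLogger("atarashi.fallback")  # hypothetical logger name

CONFIDENCE_THRESHOLD = 0.6  # assumed value; would need tuning on real data


def keyword_fallback(text, keywords):
    """Secondary matcher: the first license with a keyword present in the text."""
    lowered = text.lower()
    for license_name, phrases in keywords.items():
        if any(phrase in lowered for phrase in phrases):
            return license_name
    return None


def classify_with_fallback(text, primary, keywords, threshold=CONFIDENCE_THRESHOLD):
    """Use the primary result if it is confident; otherwise log and fall back."""
    license_name, score = primary(text)
    if score >= threshold:
        return license_name, "primary"
    # Record every fallback so low-confidence cases can be analysed later.
    logger.info("fallback used: primary=%s score=%.2f", license_name, score)
    fallback = keyword_fallback(text, keywords)
    if fallback is not None:
        return fallback, "keyword-fallback"
    return None, "unknown"


def weak_primary(text):
    """Stand-in classifier that is always unsure."""
    return "GPL-2.0", 0.3


keywords = {"MIT": ["permission is hereby granted"]}
result = classify_with_fallback(
    "Permission is hereby granted, free of charge", weak_primary, keywords
)
print(result)
```

Returning the decision source alongside the label makes it easy to count how often the fallback path is taken.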
54+
55+
### Utilize Minerva dataset for training and evaluation
56+
- Integrate the Minerva dataset into the Atarashi pipeline for model refinement.
57+
- Apply data pre-processing and augmentation where necessary.
58+
- Compare performance with and without Minerva enhancement.
59+
60+
### Seamless integration with FOSSology pull request
61+
- All changes must be backward compatible and align with the architecture in [PR #1634](https://github.com/fossology/fossology/pull/1634).
62+
- Create an Atarashi wrapper for FOSSology and introduce it as a FOSSology agent.
63+
- Write tests with good test coverage.
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
1+
---
2+
title: Community bonding
3+
author: Rajul Jha
4+
---
5+
<!--
6+
SPDX-License-Identifier: CC-BY-SA-4.0
7+
8+
SPDX-FileCopyrightText: 2025 Rajul Jha <rajuljha49@gmail.com>
9+
-->
10+
11+
## Meeting 1 (Introductory Call)
12+
13+
*(May 12, 2025)*
14+
15+
### Attendees:
16+
17+
- Whole FOSSology community.
18+
19+
### Discussion:
20+
- Held the first GSoC community bonding meeting.
21+
- Everyone introduced themselves — org admins, mentors, and fellow contributors.
22+
- Discussed the general expectations, goals for the summer, communication platforms, and how we’ll collaborate during the project period.
23+
24+
25+
> *Note: I had my end-semester exams this week and the next, so my contributions were minimal during this period.*
26+
27+
28+
## Meeting 2
29+
30+
*(May 23, 2025)*
31+
32+
### Attendees:
33+
- [Rajul Jha](https://github.com/rajuljha)
34+
- [Kaushlendra](https://github.com/Kaushl2208)
35+
- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)
36+
37+
### Discussion:
38+
- Had a focused conversation with Shaheem and Kaushal regarding the Atarashi project.
39+
- Talked about expected outcomes, the scope of enhancements, and how to approach improvements in Atarashi.
40+
- Discussed potential integration challenges and goals related to the classifier and fallback logic.
41+
42+
43+
> *Exams ended on May 26, hurray!*
44+
45+
46+
## Meeting 3
47+
48+
*(May 30, 2025)*
49+
50+
### Attendees:
51+
- [Rajul Jha](https://github.com/rajuljha)
52+
- [Kaushlendra](https://github.com/Kaushl2208)
53+
- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)
54+
55+
### Discussion:
56+
- Had a follow-up discussion with Kaushal and Shaheem.
57+
- Cleared doubts regarding Atarashi and the Minerva dataset.
58+
- Shared initial challenges I faced during local setup, including bugs and configuration issues.
59+
- Highlighted parts of the codebase that may require fixes or improvements before full development begins.
60+
61+
### Work Done:
62+
- Started working on fixing the bugs in the codebase.
63+
- Started refactoring some parts of the code.
64+
65+
---
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
1+
---
2+
title: Week 1
3+
author: Rajul Jha
4+
tags: [gsoc25, Atarashi]
5+
---
6+
<!--
7+
SPDX-License-Identifier: CC-BY-SA-4.0
8+
9+
SPDX-FileCopyrightText: 2025 Rajul Jha <rajuljha49@gmail.com>
10+
-->
11+
12+
# Week 1
13+
14+
*(June 2, 2025 - June 9, 2025)*
15+
16+
## Meeting 1
17+
18+
*(June 4, 2025)*
19+
20+
## Attendees
21+
22+
* [Rajul Jha](https://github.com/rajuljha)
23+
* [Kaushlendra](https://github.com/Kaushl2208)
24+
* [Ayush](https://github.com/hastagAB)
25+
* [Sushant](https://github.com/its-sushant)
26+
27+
## Discussions
28+
29+
* Shared updates on implementing a **keyword-based prefiltering mechanism** similar to the Nomos scanner.
30+
* The goal of the approach is to reduce the candidate license set before passing it to the Atarashi similarity-based agents.
31+
* Discussed the limitations of keyword-based models and explored the need to move toward **ML-based pre-filtering**.
32+
* Talked about how this new KeywordAgent integrates with Atarashi’s architecture and potential enhancements going forward.
33+
34+
## Updates
35+
36+
* Implemented a new `KeywordAgent` which performs keyword-based filtering before running Atarashi's similarity-based scanners.
37+
* Created a **keyword set** of terms likely to appear in license texts, to act as early indicators.
38+
* Used **GPT-4o** to help generate a broad list of licenses and associated keyword groups.
39+
* Integrated the agent to mark a license candidate when **more than 75% of keywords** are found in a file’s content.
40+
* Forwarded positively matched files to Atarashi’s agents like:
41+
* `TfIdfAgent`
42+
* `DamerauLevenshteinDistance`
43+
* `WordFrequencySimilarity`
44+
* `NgramSimilarity`
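A minimal sketch of the gating step described in these updates follows. It is not the actual `KeywordAgent`: the keyword list, the `scan` function, and the stand-in agent callables are hypothetical. Only the "more than 75% of keywords" rule comes from the report.

```python
KEYWORDS = ["license", "copyright", "permission", "warranty"]  # illustrative set
THRESHOLD = 0.75  # from the report: more than 75% of keywords must be found


def keyword_ratio(text, keywords=KEYWORDS):
    """Fraction of keywords that occur at least once in the text."""
    lowered = text.lower()
    found = sum(1 for keyword in keywords if keyword in lowered)
    return found / len(keywords)


def scan(text, agents, threshold=THRESHOLD):
    """Run the downstream agents only when the keyword ratio exceeds the threshold."""
    if keyword_ratio(text) <= threshold:
        return {}  # filtered out: the similarity agents are never invoked
    return {name: agent(text) for name, agent in agents.items()}


# Stand-in callables; the real agents expose their own interfaces.
agents = {"tfidf": lambda text: "MIT", "ngram": lambda text: "MIT"}

licensed = "Copyright 2025. Permission is granted under this license, without warranty."
plain = "def add(a, b): return a + b"
print(scan(licensed, agents))
print(scan(plain, agents))
```

Because the comparison is strict, a file that matches exactly three of four keywords is not forwarded, which is one reading of "more than 75%".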
45+
46+
## Problems Identified
47+
48+
* The keyword list is still **static** — it must be updated manually as new licenses appear.
49+
* Since this is rule-based, it cannot generalize well to unseen licenses or variations in text.
50+
* Identified this phase as **exploratory**; will now begin transitioning toward an **ML-based prefiltering model** for robustness and better generalization.
51+
52+
## Planning for next week
53+
54+
* Start prototyping an ML-based model for prefiltering using the Minerva dataset.
55+
* Set up the Minerva dataset locally, run the augmentation steps, and create the database from scratch.
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
{
2+
"label": "Weekly Updates",
3+
"position": 2
4+
}
5+
