Merge pull request #307 from rajuljha/main

GMishx · web-flow · commit f262b9128dee · 2025-06-11T18:10:36.000+05:30
chore(docs): Add Atarashi week1 report
diff --git a/docs/2025/atarashi-enhancement/index.md b/docs/2025/atarashi-enhancement/index.md
@@ -2,4 +2,62 @@
 sidebar_position: 1
 title: Introduction
 slug: /2025/atarashi-enhancement/
----
+---
+
+<!--
+SPDX-License-Identifier: CC-BY-SA-4.0
+
+SPDX-FileCopyrightText: 2025 Rajul Jha <rajuljha49@gmail.com>
+-->
+
+## Author
+
+[Rajul Jha](https://github.com/rajuljha)
+
+## Contact info
+
+- [Email](mailto:rajuljha49@gmail.com)
+- [LinkedIn](https://linkedin.com/in/rajuljha)
+
+## Project title
+
+Enhancing Atarashi License Scanner
+
+## What's the project about?
+
+[Atarashi](https://github.com/fossology/atarashi) is a modern, information-retrieval-based license scanner integrated into the FOSSology ecosystem. It utilizes statistical techniques such as TF-IDF, cosine similarity, Damerau-Levenshtein distance, and N-gram distance to identify licenses in source code files. While Atarashi demonstrates promising performance with an accuracy of around 80%, this project aims to significantly improve both the accuracy and robustness of its predictions.
+
+The main objectives of this project include:
+- Adding a keyword-based pre-filtering mechanism to improve match precision and reduce the time the agents spend on redundant scanning.
+- Enhancing the existing classifier with better similarity metrics and model tuning.
+- Incorporating fallback logic to handle ambiguous or low-confidence license predictions.
+- Utilizing the Minerva license dataset to train and evaluate the model more effectively.
+- Ensuring seamless integration of improvements into the existing open pull request [#1634](https://github.com/fossology/fossology/pull/1634).
+
+## What should be done?
+
+### Integrating a keyword-based pre-filtering model
+- Develop a pre-filtering module that leverages a configurable keyword list.
+- Use the filter to narrow the set of candidate licenses so that classification can focus on fewer options.
+- Document the keyword matching logic and make the keywords configurable.
+- Move toward an ML-based approach to keyword pre-filtering.
+
+### Improving the classifier
+- Analyze the current classifier’s performance using Minerva as a benchmark.
+- Explore enhancing the similarity metrics or switching to more robust statistical models.
+- Retrain and validate the model with improved datasets and parameters.
+
+### Fallback mechanism for ambiguous predictions
+- Define thresholds for low-confidence matches.
+- In cases where confidence is below the threshold, add a secondary mechanism such as fuzzy match fallback or keyword-only fallback.
+- Clearly log fallback occurrences for later analysis.
+
+### Utilize Minerva dataset for training and evaluation
+- Integrate the Minerva dataset into the Atarashi pipeline for model refinement.
+- Apply data pre-processing and augmentation where necessary.
+- Compare performance with and without Minerva enhancement.
+
+### Seamless integration with FOSSology pull request
+- All changes must be backward compatible and align with the architecture in [PR #1634](https://github.com/fossology/fossology/pull/1634).
+- Create an Atarashi wrapper for FOSSology and introduce it as a FOSSology agent.
+- Write tests with good test coverage.
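The fallback mechanism outlined under "Fallback mechanism for ambiguous predictions" could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Atarashi's code: `classify_with_fallback`, the `Match` tuple, the logger name, and the 0.5 threshold are all hypothetical, and the primary and fallback scanners are passed in as plain callables.

```python
import logging
from typing import Callable, List, Optional, Tuple

# Hypothetical types and names, for illustration only.
Match = Tuple[str, float]               # (license short name, score in [0, 1])
Scanner = Callable[[str], List[Match]]  # any function that scores a text

logger = logging.getLogger("atarashi.fallback")  # assumed logger name


def best_match(matches: List[Match]) -> Optional[Match]:
    """Return the highest-scoring match, or None if there are no matches."""
    return max(matches, key=lambda m: m[1], default=None)


def classify_with_fallback(
    text: str,
    primary: Scanner,
    fallback: Scanner,
    threshold: float = 0.5,  # assumed value; the real threshold is still to be defined
) -> Optional[Match]:
    """Use the primary scanner unless its best score is below the threshold."""
    best = best_match(primary(text))
    if best is not None and best[1] >= threshold:
        return best
    # Log every fallback occurrence so it can be analyzed later.
    logger.info("fallback used: primary best=%s, threshold=%.2f", best, threshold)
    return best_match(fallback(text))
```

Passing the scanners in as callables keeps the sketch independent of any particular agent; a fuzzy-match or keyword-only scanner could be supplied as `fallback`.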
diff --git a/docs/2025/atarashi-enhancement/updates/2025-05-29.md b/docs/2025/atarashi-enhancement/updates/2025-05-29.md
@@ -0,0 +1,65 @@
+---
+title: Community bonding
+author: Rajul Jha
+---
+<!--
+SPDX-License-Identifier: CC-BY-SA-4.0
+
+SPDX-FileCopyrightText: 2025 Rajul Jha <rajuljha49@gmail.com>
+-->
+
+## Meeting 1 (Introductory Call)
+
+*(May 12, 2025)*
+
+### Attendees:
+
+- Whole FOSSology community.
+
+### Discussion:
+- Held the first GSoC community bonding meeting.
+- Everyone introduced themselves — org admins, mentors, and fellow contributors.
+- Discussed the general expectations, goals for the summer, communication platforms, and how we’ll collaborate during the project period.
+
+
+> *Note: I had my end-of-semester exams during this week and the next, so my contributions were minimal during this period.*
+
+
+## Meeting 2
+
+*(May 23, 2025)*
+
+### Attendees:
+- [Rajul Jha](https://github.com/rajuljha)
+- [Kaushlendra](https://github.com/Kaushl2208)
+- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)
+
+### Discussion:
+- Had a focused conversation with Shaheem and Kaushal regarding the Atarashi project.
+- Talked about expected outcomes, the scope of enhancements, and how to approach improvements in Atarashi.
+- Discussed potential integration challenges and goals related to the classifier and fallback logic.
+
+
+> *Exams ended on May 26, hurray!*
+
+
+## Meeting 3
+
+*(May 30, 2025)*
+
+### Attendees:
+- [Rajul Jha](https://github.com/rajuljha)
+- [Kaushlendra](https://github.com/Kaushl2208)
+- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)
+
+### Discussion:
+- Had a follow-up discussion with Kaushal and Shaheem.
+- Cleared doubts regarding Atarashi and the Minerva dataset.
+- Shared initial challenges I faced during local setup, including bugs and configuration issues.
+- Highlighted parts of the codebase that may require fixes or improvements before full development begins.
+
+### Work Done:
+- Started working on fixing the bugs in the codebase.
+- Started refactoring some parts of the code.
+
+---
diff --git a/docs/2025/atarashi-enhancement/updates/2025-06-04.md b/docs/2025/atarashi-enhancement/updates/2025-06-04.md
@@ -0,0 +1,55 @@
+---
+title: Week 1
+author: Rajul Jha
+tags: [gsoc25, Atarashi]
+---
+<!--
+SPDX-License-Identifier: CC-BY-SA-4.0
+
+SPDX-FileCopyrightText: 2025 Rajul Jha <rajuljha49@gmail.com>
+-->
+
+# Week 1
+
+*(June 2, 2025 - June 9, 2025)*
+
+## Meeting 1
+
+*(June 4, 2025)*
+
+## Attendees
+
+* [Rajul Jha](https://github.com/rajuljha)
+* [Kaushlendra](https://github.com/Kaushl2208)
+* [Ayush](https://github.com/hastagAB)
+* [Sushant](https://github.com/its-sushant)
+
+## Discussions
+
+* Shared updates on implementing a **keyword-based prefiltering mechanism** similar to the Nomos scanner.
+* The goal of the approach is to reduce the candidate license set before passing it to the Atarashi similarity-based agents.
+* Discussed the limitations of keyword-based models and explored the need to move toward **ML-based pre-filtering**.
+* Talked about how this new KeywordAgent integrates with Atarashi’s architecture and potential enhancements going forward.
+
+## Updates
+
+* Implemented a new `KeywordAgent` which performs keyword-based filtering before running Atarashi's similarity-based scanners.
+* Created a **keyword set** of terms that are likely to appear in licenses, to act as early indicators.
+* Used **GPT-4o** to help generate a broad list of licenses and associated keyword groups.
+* Integrated the agent to mark a license candidate when **more than 75% of keywords** are found in a file’s content.
+* Forwarded positively matched files to Atarashi’s agents, such as:
+  * `TfIdfAgent`
+  * `DamerauLevenshteinDistance`
+  * `WordFrequencySimilarity`
+  * `NgramSimilarity`
+
+## Problems Identified
+
+* The keyword list is still **static** — it must be updated manually as new licenses appear.
+* Since this is rule-based, it cannot generalize well to unseen licenses or variations in text.
+* Identified this phase as **exploratory**; will now begin transitioning toward an **ML-based prefiltering model** for robustness and better generalization.
+
+## Planning for next week
+
+* Start prototyping an ML-based model for prefiltering using the Minerva dataset.
+* Set up the Minerva dataset locally, run the augmentation steps, and create the database from scratch.
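The `KeywordAgent` rule described in the Week 1 update (mark a license as a candidate when more than 75% of its keywords are found in the file) might look something like the sketch below. The actual agent code and keyword list are not part of this PR, so `KEYWORDS`, `tokenize`, and `candidate_licenses` are hypothetical names, the keyword groups are made up for illustration, and whole-word matching is an assumption.

```python
import re
from typing import Dict, List, Set

# Hypothetical keyword groups; the real list is not shown in this PR.
KEYWORDS: Dict[str, List[str]] = {
    "MIT": ["permission", "hereby", "granted", "free", "charge"],
    "GPL-2.0": ["gnu", "general", "public", "license", "version"],
}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def candidate_licenses(
    text: str,
    keywords: Dict[str, List[str]] = KEYWORDS,
    threshold: float = 0.75,
) -> List[str]:
    """Return licenses whose share of matched keywords exceeds the threshold."""
    tokens = tokenize(text)
    candidates = []
    for license_id, words in keywords.items():
        if not words:
            continue  # skip empty keyword groups to avoid dividing by zero
        hits = sum(1 for word in words if word in tokens)
        if hits / len(words) > threshold:
            candidates.append(license_id)
    return candidates
```

Only files with at least one candidate would then be forwarded to the similarity-based agents; files with none could be skipped or routed to a fallback.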
diff --git a/docs/2025/atarashi-enhancement/updates/_category_.json b/docs/2025/atarashi-enhancement/updates/_category_.json
@@ -0,0 +1,5 @@
+{
+    "label": "Weekly Updates",
+    "position": 2
+}
+  
