diff --git a/docs/2025/atarashi-enhancement/updates/2025-06-25.md b/docs/2025/atarashi-enhancement/updates/2025-06-25.md new file mode 100644 index 0000000000..abee131ff0 --- /dev/null +++ b/docs/2025/atarashi-enhancement/updates/2025-06-25.md @@ -0,0 +1,94 @@ +--- +title: Week 4 +author: Rajul Jha +tags: [gsoc25, Atarashi] +--- + + + +# Week 4 + +*(June 17, 2025 - June 25, 2025)* + +## Meeting 1 + +*(June 25, 2025)* + +## Attendees + +* [Rajul Jha](https://github.com/rajuljha) +* [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd) +* [Ayush](https://github.com/hastagAB) +* [Sushant](https://github.com/its-sushant) + +## Discussions + +* Demonstrated the working **KeywordAgent**, built using `SRINGS.in` from the Nomos agent. +* Shared the **evaluator script results** tested on NomosTestFiles — yielded **~99.5% accuracy**. +* Debugged cases with true negatives and realized the need for **additional keyword sources**. +* Finalized a **two-stage filtering architecture** for license detection pre-check. +* Walked through **code improvements and refactoring** done on the Atarashi base code. + +## KeywordAgent Implementation + +Implemented a rule-based keyword detection agent using regex patterns derived from `SRINGS.in`. Keywords included: + +```yaml +acknowledg(e|ement|ements)? +agreement +as[\s-]is +copyright +damages +deriv(e|ed|ation|ative|es|ing) +redistribut(e|ion|able|ing)?|distribut(e|ion|able|ing)? +free software +grant +indemnif(i|y|ied|ication|ying)? +intellectual propert(y|ies)? +[^e]liabilit(y|ies)? +licencs? +mis[- ]?represent +open source +patent +permission +public[\s-]domain +require(s|d|ment|ments)? +same terms +see[\s:-]*(https?://|file://|www.|[A-Za-z0-9._/-]+) +source (and|or)? ?binary +source code +subject to +terms and conditions +warrant(y|ies|ed|ing)? +without (fee|restrict(ion|ed)?|limit(ation|ed)?) +severability clause +``` + + +## Evaluation Results + +* Ran the KeywordAgent on **NomosTestFiles**. +* Achieved **~99.5% accuracy**, confirming robustness of regex pattern matching. +* Detected minor edge cases (true negatives) which informed the next steps for keyword expansion. + +## Code Improvements + +* Refactored parts of the Atarashi codebase: + * Applied Python best practices (docstrings, function decomposition, consistent naming). + * Improved error handling and logging in preprocessing and agent workflows. + +- **Commit:** https://github.com/fossology/atarashi/compare/master...rajuljha:atarashi:feat/newagent/Keyword + +## Two-Stage Detection Plan + +![image](/img/atarashi/atarashi-decision-tree.png) + +Decided on a layered pre-check system for Atarashi Scanner: + +1. **Stage 1:** Match against the **initial keyword list** (from SRINGS.in). +2. **Stage 2:** If Stage 1 fails, match against **license shortnames** and **FOSSology license_ref** strings. +3. If neither stage matches, the file is **skipped** from scanning. +4. If either matches, it is **scanned using Atarashi**. diff --git a/static/img/atarashi/atarashi-decision-tree.png b/static/img/atarashi/atarashi-decision-tree.png new file mode 100644 index 0000000000..9b534155ca Binary files /dev/null and b/static/img/atarashi/atarashi-decision-tree.png differ