| 1 | +--- |
| 2 | +title: Week 4 |
| 3 | +author: Rajul Jha |
| 4 | +tags: [gsoc25, Atarashi] |
| 5 | +--- |
| 6 | + |
| 7 | +<!-- |
| 8 | +SPDX-License-Identifier: CC-BY-SA-4.0 |
| 9 | +SPDX-FileCopyrightText: 2025 Rajul Jha <rajuljha49@gmail.com> |
| 10 | +--> |
| 11 | + |
| 12 | +# Week 4 |
| 13 | + |
| 14 | +*(June 17, 2025 - June 25, 2025)* |
| 15 | + |
| 16 | +## Meeting 1 |
| 17 | + |
| 18 | +*(June 25, 2025)* |
| 19 | + |
| 20 | +## Attendees |
| 21 | + |
| 22 | +* [Rajul Jha](https://github.com/rajuljha) |
| 23 | +* [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd) |
| 24 | +* [Ayush](https://github.com/hastagAB) |
| 25 | +* [Sushant](https://github.com/its-sushant) |
| 26 | + |
| 27 | +## Discussions |
| 28 | + |
| 29 | +* Demonstrated the working **KeywordAgent**, built using keywords from the Nomos agent's `STRINGS.in`. |
| 30 | +* Shared the **evaluator script results** on NomosTestFiles, where the agent reached **~99.5% accuracy**. |
| 31 | +* Debugged the cases the agent missed (false negatives) and identified the need for **additional keyword sources**. |
| 32 | +* Finalized a **two-stage filtering architecture** for the license detection pre-check. |
| 33 | +* Walked through the **code improvements and refactoring** done in the Atarashi codebase. |
| 34 | + |
| 35 | +## KeywordAgent Implementation |
| 36 | + |
| 37 | +Implemented a rule-based keyword detection agent using regex patterns derived from `STRINGS.in`. The keyword patterns are: |
| 38 | + |
| 39 | +```text |
| 40 | +acknowledg(e|ement|ements)? |
| 41 | +agreement |
| 42 | +as[\s-]is |
| 43 | +copyright |
| 44 | +damages |
| 45 | +deriv(e|ed|ation|ative|es|ing) |
| 46 | +redistribut(e|ion|able|ing)?|distribut(e|ion|able|ing)? |
| 47 | +free software |
| 48 | +grant |
| 49 | +indemnif(i|y|ied|ication|ying)? |
| 50 | +intellectual propert(y|ies)? |
| 51 | +[^e]liabilit(y|ies)? |
| 52 | +licen[cs]e? |
| 53 | +mis[- ]?represent |
| 54 | +open source |
| 55 | +patent |
| 56 | +permission |
| 57 | +public[\s-]domain |
| 58 | +require(s|d|ment|ments)? |
| 59 | +same terms |
| 60 | +see[\s:-]*(https?://|file://|www\.|[A-Za-z0-9._/-]+) |
| 61 | +source (and|or)? ?binary |
| 62 | +source code |
| 63 | +subject to |
| 64 | +terms and conditions |
| 65 | +warrant(y|ies|ed|ing)? |
| 66 | +without (fee|restrict(ion|ed)?|limit(ation|ed)?) |
| 67 | +severability clause |
| 68 | +``` |
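
To illustrate how such a list can be applied, here is a minimal sketch that joins a few of the patterns into one case-insensitive regex. The names `KEYWORD_PATTERNS`, `KEYWORD_REGEX`, and `has_license_keyword` are hypothetical and only a subset of the patterns is used; the actual KeywordAgent may be structured differently.

```python
import re

# Illustrative subset of the keyword patterns; not the full list.
KEYWORD_PATTERNS = [
    r"as[\s-]is",
    r"copyright",
    r"permission",
    r"public[\s-]domain",
    r"warrant(y|ies|ed|ing)?",
    r"without (fee|restrict(ion|ed)?|limit(ation|ed)?)",
]

# Wrap each pattern in a non-capturing group and join them, so a file
# needs only one pass with a single compiled regex.
KEYWORD_REGEX = re.compile(
    "|".join(f"(?:{p})" for p in KEYWORD_PATTERNS),
    re.IGNORECASE,
)


def has_license_keyword(text: str) -> bool:
    """Return True if any keyword pattern occurs anywhere in the text."""
    return KEYWORD_REGEX.search(text) is not None
```

Using `search` rather than `match` matters here, because a keyword can appear anywhere in a file, not only at the start.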
| 69 | + |
| 71 | +## Evaluation Results |
| 72 | + |
| 73 | +* Ran the KeywordAgent on **NomosTestFiles**. |
| 74 | +* Achieved **~99.5% accuracy**, indicating that the regex patterns cover nearly all of the test files. |
| 75 | +* Found a few edge cases that the keywords missed (false negatives), which informed the next steps for keyword expansion. |
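
As a rough illustration of how an accuracy figure like this can be computed, the sketch below takes one row per file with the predicted and expected outcome. The function name `evaluate` and the row layout are assumptions for illustration and do not describe the actual evaluator script.

```python
from typing import Iterable, List, Tuple


def evaluate(results: Iterable[Tuple[str, bool, bool]]) -> Tuple[float, List[str]]:
    """Return (accuracy, missed_files) for rows of (file name, predicted, expected)."""
    rows = list(results)
    if not rows:
        raise ValueError("no results to evaluate")
    # A row is correct when the prediction agrees with the expected label.
    correct = sum(1 for _, predicted, expected in rows if predicted == expected)
    # Missed files: a license was expected but no keyword matched.
    missed = [name for name, predicted, expected in rows if expected and not predicted]
    return correct / len(rows), missed
```

Returning the missed file names alongside the score makes it easier to inspect the edge cases that motivate new keywords.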
| 76 | + |
| 77 | +## Code Improvements |
| 78 | + |
| 79 | +* Refactored parts of the Atarashi codebase: |
| 80 | + * Applied Python best practices (docstrings, function decomposition, consistent naming). |
| 81 | + * Improved error handling and logging in preprocessing and agent workflows. |
| 82 | + |
| 83 | +* **Commit:** https://github.com/fossology/atarashi/compare/master...rajuljha:atarashi:feat/newagent/Keyword |
| 84 | + |
| 85 | +## Two-Stage Detection Plan |
| 86 | + |
| 89 | +Decided on a layered pre-check for the Atarashi scanner: |
| 90 | + |
| 91 | +1. **Stage 1:** Match the file against the **initial keyword list** (from `STRINGS.in`). |
| 92 | +2. **Stage 2:** If Stage 1 finds no match, match against **license shortnames** and **FOSSology `license_ref`** strings. |
| 93 | +3. If neither stage matches, the file is **skipped**. |
| 94 | +4. If either stage matches, the file is **scanned using Atarashi**. |
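
The plan above could be sketched roughly as follows. This is only an outline under stated assumptions: `build_regex`, `STAGE1`, `STAGE2`, and `should_scan` are hypothetical names, and both pattern lists are tiny placeholders rather than the real keyword list or the real shortname and reference strings.

```python
import re
from typing import Iterable


def build_regex(patterns: Iterable[str]) -> re.Pattern:
    """Join patterns into one case-insensitive regex."""
    return re.compile("|".join(f"(?:{p})" for p in patterns), re.IGNORECASE)


# Stage 1: keyword patterns (placeholder subset).
STAGE1 = build_regex([r"copyright", r"permission", r"warrant(y|ies|ed|ing)?"])

# Stage 2: literal license names (placeholder examples), escaped so that
# characters such as "." are matched literally.
STAGE2 = build_regex(
    re.escape(name) for name in ["GPL-2.0", "Apache-2.0", "MIT License"]
)


def should_scan(text: str) -> bool:
    """Return True if the file should be passed on to the scanner."""
    if STAGE1.search(text):
        return True  # Stage 1 matched, so Stage 2 is not needed.
    return STAGE2.search(text) is not None  # Otherwise fall back to Stage 2.
```

Running Stage 2 only when Stage 1 fails keeps the common case cheap, since most license-bearing files are expected to match a keyword first.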