Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
94 changes: 94 additions & 0 deletions docs/2025/atarashi-enhancement/updates/2025-06-25.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
---
title: Week 4
author: Rajul Jha
tags: [gsoc25, Atarashi]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0
SPDX-FileCopyrightText: 2025 Rajul Jha <rajuljha49@gmail.com>
-->

# Week 4

*(June 17, 2025 - June 25, 2025)*

## Meeting 1

*(June 25, 2025)*

## Attendees

* [Rajul Jha](https://github.com/rajuljha)
* [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)
* [Ayush](https://github.com/hastagAB)
* [Sushant](https://github.com/its-sushant)

## Discussions

* Demonstrated the working **KeywordAgent**, built using `SRINGS.in` from the Nomos agent.
* Shared the **evaluator script results** tested on NomosTestFiles — yielded **~99.5% accuracy**.
* Debugged cases with true negatives and realized the need for **additional keyword sources**.
* Finalized a **two-stage filtering architecture** for license detection pre-check.
* Walked through **code improvements and refactoring** done on the Atarashi base code.

## KeywordAgent Implementation

Implemented a rule-based keyword detection agent using regex patterns derived from `SRINGS.in`. Keywords included:

```yaml
acknowledg(e|ement|ements)?
agreement
as[\s-]is
copyright
damages
deriv(e|ed|ation|ative|es|ing)
redistribut(e|ion|able|ing)?|distribut(e|ion|able|ing)?
free software
grant
indemnif(i|y|ied|ication|ying)?
intellectual propert(y|ies)?
[^e]liabilit(y|ies)?
licencs?
mis[- ]?represent
open source
patent
permission
public[\s-]domain
require(s|d|ment|ments)?
same terms
see[\s:-]*(https?://|file://|www.|[A-Za-z0-9._/-]+)
source (and|or)? ?binary
source code
subject to
terms and conditions
warrant(y|ies|ed|ing)?
without (fee|restrict(ion|ed)?|limit(ation|ed)?)
severability clause
```


## Evaluation Results

* Ran the KeywordAgent on **NomosTestFiles**.
* Achieved **~99.5% accuracy**, confirming robustness of regex pattern matching.
* Detected minor edge cases (true negatives) which informed the next steps for keyword expansion.

## Code Improvements

* Refactored parts of the Atarashi codebase:
* Applied Python best practices (docstrings, function decomposition, consistent naming).
* Improved error handling and logging in preprocessing and agent workflows.

- **Commit:** https://github.com/fossology/atarashi/compare/master...rajuljha:atarashi:feat/newagent/Keyword

## Two-Stage Detection Plan

![image](/img/atarashi/atarashi-decision-tree.png)

Decided on a layered pre-check system for Atarashi Scanner:

1. **Stage 1:** Match against the **initial keyword list** (from SRINGS.in).
2. **Stage 2:** If Stage 1 fails, match against **license shortnames** and **FOSSology license_ref** strings.
3. If neither stage matches, the file is **skipped** from scanning.
4. If either matches, it is **scanned using Atarashi**.
Binary file added static/img/atarashi/atarashi-decision-tree.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.