Commit b6fd036

Merge pull request #328 from rajuljha/updates/week4
chore(docs): Add atarashi week4 report Reviewed-by: shaheem.azmal@siemens.com
2 parents bffc0cd + 697beb6 commit b6fd036

2 files changed: 94 additions, 0 deletions
@@ -0,0 +1,94 @@
---
title: Week 4
author: Rajul Jha
tags: [gsoc25, Atarashi]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0
SPDX-FileCopyrightText: 2025 Rajul Jha <rajuljha49@gmail.com>
-->

# Week 4

*(June 17, 2025 - June 25, 2025)*

## Meeting 1

*(June 25, 2025)*

## Attendees

* [Rajul Jha](https://github.com/rajuljha)
* [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)
* [Ayush](https://github.com/hastagAB)
* [Sushant](https://github.com/its-sushant)

## Discussions

* Demonstrated the working **KeywordAgent**, built from the keyword patterns in the Nomos agent's `STRINGS.in`.
* Shared the **evaluator script results**: the agent reached **~99.5% accuracy** on NomosTestFiles.
* Debugged the false-negative cases and concluded that **additional keyword sources** are needed.
* Finalized a **two-stage filtering architecture** for the license detection pre-check.
* Walked through the **code improvements and refactoring** done on the Atarashi codebase.

## KeywordAgent Implementation

Implemented a rule-based keyword detection agent using regex patterns derived from `STRINGS.in`. The keywords include:

```yaml
acknowledg(e|ement|ements)?
agreement
as[\s-]is
copyright
damages
deriv(e|ed|ation|ative|es|ing)
redistribut(e|ion|able|ing)?|distribut(e|ion|able|ing)?
free software
grant
indemnif(i|y|ied|ication|ying)?
intellectual propert(y|ies)?
[^e]liabilit(y|ies)?
licen[cs]
mis[- ]?represent
open source
patent
permission
public[\s-]domain
require(s|d|ment|ments)?
same terms
see[\s:-]*(https?://|file://|www.|[A-Za-z0-9._/-]+)
source (and|or)? ?binary
source code
subject to
terms and conditions
warrant(y|ies|ed|ing)?
without (fee|restrict(ion|ed)?|limit(ation|ed)?)
severability clause
```
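
A minimal sketch of how such a keyword check could work, assuming the patterns are joined into a single case-insensitive regex. The names `KEYWORD_PATTERNS`, `has_license_keyword`, and `matched_keywords` are hypothetical and only a subset of the list is used, so this is not the actual Atarashi implementation.

```python
import re

# Illustrative subset of the patterns listed above (assumption: the real
# agent loads the full list; this selection is only for the example).
KEYWORD_PATTERNS = [
    r"acknowledg(e|ement|ements)?",
    r"as[\s-]is",
    r"copyright",
    r"public[\s-]domain",
    r"warrant(y|ies|ed|ing)?",
    r"without (fee|restrict(ion|ed)?|limit(ation|ed)?)",
]

# Hypothetical design choice: wrap each pattern in a non-capturing group and
# join them with "|" so a single search covers every keyword.
KEYWORD_REGEX = re.compile(
    "|".join(f"(?:{pattern})" for pattern in KEYWORD_PATTERNS),
    re.IGNORECASE,
)


def has_license_keyword(text: str) -> bool:
    """Return True if any keyword pattern occurs anywhere in the text."""
    return KEYWORD_REGEX.search(text) is not None


def matched_keywords(text: str) -> list:
    """Return the distinct matched substrings, lowercased and sorted."""
    return sorted({m.group(0).lower() for m in KEYWORD_REGEX.finditer(text)})


if __name__ == "__main__":
    sample = 'Copyright 2025. Provided "AS IS", without warranty.'
    print(has_license_keyword(sample))
    print(matched_keywords(sample))
```

Compiling one combined regex keeps the check to a single pass over the file; keeping the patterns in a plain list makes it easy to add further keyword sources later.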

## Evaluation Results

* Ran the KeywordAgent on **NomosTestFiles**.
* Achieved **~99.5% accuracy**, showing that the regex patterns classify nearly all of the test files correctly.
* The few misses (false negatives) informed the next step: expanding the keyword list.
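
To illustrate what an evaluator of this kind measures, here is a small sketch that computes accuracy and lists the files the pre-check missed. The `Sample` type, the `evaluate` function, and the sample data are all hypothetical; the data is made up for the example and is not taken from NomosTestFiles or from the actual evaluator script.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    path: str          # file name (hypothetical)
    has_license: bool  # ground truth: does the file contain license text?
    flagged: bool      # did the keyword pre-check flag the file?


def evaluate(samples: List[Sample]) -> Tuple[float, List[str], List[str]]:
    """Return (accuracy, missed files, falsely flagged files)."""
    if not samples:
        raise ValueError("no samples to evaluate")
    correct = sum(1 for s in samples if s.has_license == s.flagged)
    # License-bearing files that the pre-check did not flag.
    missed = [s.path for s in samples if s.has_license and not s.flagged]
    # Files flagged even though they carry no license text.
    false_alarms = [s.path for s in samples if s.flagged and not s.has_license]
    return correct / len(samples), missed, false_alarms


# Made-up data for illustration only.
samples = [
    Sample("a.c", has_license=True, flagged=True),
    Sample("b.py", has_license=True, flagged=False),
    Sample("c.txt", has_license=False, flagged=False),
    Sample("d.h", has_license=True, flagged=True),
]
accuracy, missed, false_alarms = evaluate(samples)
print(f"accuracy={accuracy:.2%}, missed={missed}, false_alarms={false_alarms}")
```

Listing the missed files next to the accuracy figure is what makes the result actionable: each missed file points at a keyword the list does not yet cover.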

## Code Improvements

* Refactored parts of the Atarashi codebase:
  * Applied Python best practices (docstrings, function decomposition, consistent naming).
  * Improved error handling and logging in the preprocessing and agent workflows.

- **Changes (branch comparison):** https://github.com/fossology/atarashi/compare/master...rajuljha:atarashi:feat/newagent/Keyword

## Two-Stage Detection Plan

![image](/img/atarashi/atarashi-decision-tree.png)

Decided on a layered pre-check for the Atarashi scanner:

1. **Stage 1:** Match the file against the **initial keyword list** (from `STRINGS.in`).
2. **Stage 2:** If Stage 1 finds no match, match against **license shortnames** and **FOSSology license_ref** strings.
3. If neither stage matches, the file is **skipped** and not scanned.
4. If either stage matches, the file is **scanned with Atarashi**.
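
The decision flow above could be sketched roughly as follows. This is an illustration under stated assumptions, not the planned implementation: `STAGE1_KEYWORDS`, `STAGE2_NAMES`, `precheck`, and `should_scan` are hypothetical names, the keyword regex is a small subset, and the three license names stand in for the full shortname and license_ref lists.

```python
import re
from typing import Optional

# Stage 1 (assumption): a small subset of the keyword patterns.
STAGE1_KEYWORDS = re.compile(
    r"copyright|permission|warrant(y|ies|ed|ing)?",
    re.IGNORECASE,
)

# Stage 2 (assumption): a few example license names standing in for the
# full list of shortnames and license_ref strings.
_EXAMPLE_NAMES = ["MIT", "GPL-2.0-only", "Apache-2.0"]
STAGE2_NAMES = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in _EXAMPLE_NAMES) + r")\b",
    re.IGNORECASE,
)


def precheck(text: str) -> Optional[str]:
    """Return which stage matched ("keyword" or "shortname"), or None."""
    if STAGE1_KEYWORDS.search(text):
        return "keyword"
    if STAGE2_NAMES.search(text):
        return "shortname"
    return None


def should_scan(text: str) -> bool:
    """A file is passed to the scanner only if one of the stages matched."""
    return precheck(text) is not None


if __name__ == "__main__":
    for snippet in (
        "Permission is hereby granted, free of charge",
        "SPDX-License-Identifier: MIT",
        "print('hello world')",
    ):
        print(repr(snippet), "->", precheck(snippet))
```

The word boundaries in the Stage 2 regex are a deliberate choice in this sketch: without them, a short name such as `MIT` would also match inside ordinary words like "submit".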