|
2 | 2 | sidebar_position: 1 |
3 | 3 | title: Introduction |
4 | 4 | slug: /2025/atarashi-enhancement/ |
5 | | ---- |
| 5 | +--- |
| 6 | + |
| 7 | +<!-- |
| 8 | +SPDX-License-Identifier: CC-BY-SA-4.0 |
| 9 | + |
| 10 | +SPDX-FileCopyrightText: 2025 Rajul Jha <rajuljha49@gmail.com> |
| 11 | +--> |
| 12 | + |
| 13 | +## Author |
| 14 | + |
| 15 | +[Rajul Jha](https://github.com/rajuljha) |
| 16 | + |
| 17 | +## Contact info |
| 18 | + |
| 19 | +- [Email](mailto:rajuljha49@gmail.com) |
| 20 | +- [LinkedIn](https://linkedin.com/in/rajuljha) |
| 21 | + |
| 22 | +## Project title |
| 23 | + |
| 24 | +Enhancing Atarashi License Scanner |
| 25 | + |
| 26 | +## What's the project about? |
| 27 | + |
| 28 | +[Atarashi](https://github.com/fossology/atarashi) is a modern, information-retrieval-based license scanner integrated into the FOSSology ecosystem. It uses statistical techniques such as TF-IDF, cosine similarity, Damerau-Levenshtein distance, and N-gram distance to identify licenses in source code files. Atarashi currently reaches an accuracy of around 80%; this project aims to significantly improve both the accuracy and the robustness of its predictions. |
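
To make the similarity-based matching concrete, here is a minimal sketch of TF-IDF weighting combined with cosine similarity, using only the Python standard library. It is not Atarashi's actual implementation: the function names (`tfidf_vectors`, `cosine`, `best_match`), the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Smoothed IDF (an illustrative choice) keeps every weight positive.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query, licenses):
    """Return (license name, score) of the reference text closest to query."""
    names = list(licenses)
    vectors = tfidf_vectors([licenses[n] for n in names] + [query])
    query_vec = vectors[-1]
    scores = {n: cosine(query_vec, v) for n, v in zip(names, vectors)}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Calling `best_match(file_text, {"MIT": mit_text, "GPL-2.0": gpl_text})` returns the closest reference license together with its score; that score is the kind of confidence value a threshold-based fallback could later inspect.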
| 29 | + |
| 30 | +The main objectives of this project include: |
| 31 | +- Adding a keyword-based pre-filtering mechanism to improve match precision and reduce the time agents spend on redundant scanning. |
| 32 | +- Enhancing the existing classifier with better similarity metrics and model tuning. |
| 33 | +- Incorporating fallback logic to handle ambiguous or low-confidence license predictions. |
| 34 | +- Utilizing the Minerva license dataset to train and evaluate the model more effectively. |
| 35 | +- Ensuring seamless integration of improvements into the existing open pull request [#1634](https://github.com/fossology/fossology/pull/1634). |
| 36 | + |
| 37 | +## What should be done? |
| 38 | + |
| 39 | +### Integrating a keyword-based pre-filtering model |
| 40 | +- Develop a pre-filtering module that leverages a configurable keyword list. |
| 41 | +- The filter narrows the set of candidate licenses so that the classifier only compares against the relevant ones. |
| 42 | +- Document the keyword matching logic and make the keywords configurable. |
| 43 | +- Move towards an ML-based approach for keyword pre-filtering. |
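
The pre-filtering steps above can be sketched roughly as follows. The keyword table, the `prefilter` function, and the choice of plain substring matching are hypothetical; a real module would load its keywords from a configuration file rather than hard-coding them.

```python
import re

# Hypothetical keyword configuration; in practice this would come from a
# user-editable file so the list stays configurable.
LICENSE_KEYWORDS = {
    "MIT": ["permission is hereby granted", "mit license"],
    "GPL-2.0": ["gnu general public license", "free software foundation"],
    "Apache-2.0": ["apache license", "www.apache.org/licenses"],
}


def normalize(text):
    # Lowercase and collapse whitespace so that line breaks inside a
    # license header do not prevent a phrase from matching.
    return re.sub(r"\s+", " ", text.lower())


def prefilter(text, keywords=LICENSE_KEYWORDS):
    """Return the candidate licenses whose keywords occur in text.

    An empty result suggests the file has no license-like content, so the
    more expensive classifier could be skipped for it.
    """
    haystack = normalize(text)
    return [
        name
        for name, phrases in keywords.items()
        if any(phrase in haystack for phrase in phrases)
    ]
```

Substring matching keeps the sketch short, but it can over-match (for example, "permit license" contains "mit license"), which is one reason a learned pre-filter may be worth exploring later.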
| 44 | + |
| 45 | +### Improving the classifier |
| 46 | +- Analyze the current classifier’s performance using Minerva as a benchmark. |
| 47 | +- Explore enhancing the similarity metrics or switching to more robust statistical models. |
| 48 | +- Retrain and validate the model with improved datasets and parameters. |
| 49 | + |
| 50 | +### Fallback mechanism for ambiguous predictions |
| 51 | +- Define thresholds for low-confidence matches. |
| 52 | +- In cases where confidence is below the threshold, add a secondary mechanism such as fuzzy match fallback or keyword-only fallback. |
| 53 | +- Clearly log fallback occurrences for later analysis. |
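
One possible shape for this fallback logic is sketched below. The threshold value, the logger name, and the `resolve` and `fuzzy_fallback` helpers are assumptions made for illustration; the fuzzy step uses `difflib.SequenceMatcher` from the standard library as a stand-in for whichever secondary matcher the project ends up choosing.

```python
import logging
from difflib import SequenceMatcher

logger = logging.getLogger("atarashi.fallback")  # hypothetical logger name

CONFIDENCE_THRESHOLD = 0.6  # assumed value; would need tuning on real data


def fuzzy_fallback(text, license_texts):
    """Return (name, score) of the reference text most similar to text.

    Assumes license_texts is a non-empty {name: reference text} mapping.
    """
    scored = [
        (SequenceMatcher(None, text.lower(), ref.lower()).ratio(), name)
        for name, ref in license_texts.items()
    ]
    score, name = max(scored)
    return name, score


def resolve(text, prediction, confidence, license_texts,
            threshold=CONFIDENCE_THRESHOLD):
    """Keep a confident classifier prediction, otherwise use the fallback."""
    if confidence >= threshold:
        return prediction, confidence, "classifier"
    name, score = fuzzy_fallback(text, license_texts)
    # Log every fallback so that the occurrences can be analyzed later.
    logger.info(
        "fallback used: classifier=%s (%.2f) -> fuzzy=%s (%.2f)",
        prediction, confidence, name, score,
    )
    return name, score, "fuzzy-fallback"
```

Returning the source of the decision (`"classifier"` or `"fuzzy-fallback"`) alongside the result makes it easy to count how often the fallback fires during evaluation.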
| 54 | + |
| 55 | +### Utilize the Minerva dataset for training and evaluation |
| 56 | +- Integrate the Minerva dataset into the Atarashi pipeline for model refinement. |
| 57 | +- Apply data pre-processing and augmentation where necessary. |
| 58 | +- Compare performance with and without Minerva enhancement. |
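
A before/after comparison could be driven by a small harness like the one below. It assumes each model is exposed as a callable that maps file text to a license name, and that labeled samples are available as `(text, expected_license)` pairs; both are assumptions for this sketch, and neither reflects the actual Minerva data format.

```python
def accuracy(predict, samples):
    """Fraction of (text, expected_license) samples predicted correctly."""
    if not samples:
        return 0.0
    correct = sum(1 for text, expected in samples if predict(text) == expected)
    return correct / len(samples)


def compare(models, samples):
    """Evaluate several models on the same samples.

    models maps a label (for example "baseline" or "with-minerva") to a
    prediction callable; the result maps each label to its accuracy.
    """
    return {label: accuracy(predict, samples) for label, predict in models.items()}
```

Running both the current and the retrained model through `compare` on the same held-out samples keeps the comparison fair, since any difference in accuracy then comes from the model rather than from the data split.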
| 59 | + |
| 60 | +### Seamless integration with the FOSSology pull request |
| 61 | +- All changes must be backward compatible and align with the architecture in [PR #1634](https://github.com/fossology/fossology/pull/1634). |
| 62 | +- Create an Atarashi wrapper for FOSSology and introduce it as a FOSSology agent. |
| 63 | +- Write tests that provide good coverage of the new code. |