fossology
diff --git a/‎docs/2025/atarashi-enhancement/updates/2025-06-11.md‎
Lines changed: 95 additions & 0 deletions b/‎docs/2025/atarashi-enhancement/updates/2025-06-11.md‎
diff --git a/‎static/img/atarashi/cumulative_distribution.png‎
27.1 KB b/‎static/img/atarashi/cumulative_distribution.png‎
diff --git a/‎static/img/atarashi/file_count_boxplot_by_source.png‎
16.2 KB b/‎static/img/atarashi/file_count_boxplot_by_source.png‎
diff --git a/‎static/img/atarashi/file_count_distribution.png‎
24.3 KB b/‎static/img/atarashi/file_count_distribution.png‎
diff --git a/‎static/img/atarashi/source_pie_chart.png‎
24.3 KB b/‎static/img/atarashi/source_pie_chart.png‎
diff --git a/‎static/img/atarashi/top_licenses.png‎
24.1 KB b/‎static/img/atarashi/top_licenses.png‎
@@ -0,0 +1,95 @@
+---
+title: Week 2  
+author: Rajul Jha  
+tags: [gsoc25, Atarashi]
+---
+
+<!--  
+SPDX-License-Identifier: CC-BY-SA-4.0  
+SPDX-FileCopyrightText: 2025 Rajul Jha <rajuljha49@gmail.com>  
+-->
+
+# Week 2
+
+*(June 10, 2025 - June 16, 2025)*
+
+## Meeting 1
+
+*(June 11, 2025)*
+
+## Attendees
+
+* [Rajul Jha](https://github.com/rajuljha)
+* [Kaushlendra](https://github.com/Kaushl2208)
+* [Ayush](https://github.com/hastagAB)
+* [Sushant](https://github.com/its-sushant)
+
+## Discussions
+
+* Shared findings from a **comprehensive analysis of the Minerva Dataset**.
+* Reviewed dataset characteristics like class imbalance, license frequency, and dataset composition by source.
+* Talked about integrating **negative samples** into the dataset, which are currently missing but critical for training robust ML models.
+* Presented a **proof of concept for Locality Sensitive Hashing (LSH)** and discussed its behavior with varying input lengths.
+
+## Minerva Dataset Analysis
+
+Analyzed multiple dimensions of the Minerva dataset using visualizations. Key insights:
+
+![image](/img/atarashi/cumulative_distribution.png)
+* **Long-tail Distribution**:
+  * The cumulative distribution shows that **a small subset of licenses accounts for the majority of files**.
+  * Around **200 licenses cover ~80% of the dataset**, confirming a significant skew in class distribution.
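The "N licenses cover ~80% of the dataset" figure can be reproduced from a per-license file count. The following is a minimal sketch, not the analysis script used for the report; the function name `licenses_needed_for_coverage` and the toy counts are hypothetical and are not Minerva data.

```python
from itertools import accumulate


def licenses_needed_for_coverage(file_counts, target=0.80):
    """Return how many of the most frequent licenses are needed so that
    their files make up at least `target` of all files."""
    counts = sorted(file_counts.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0
    # Walk the running total from the largest class downwards.
    for rank, running in enumerate(accumulate(counts), start=1):
        if running / total >= target:
            return rank
    return len(counts)


# Hypothetical toy counts for illustration only (license id -> number of files).
toy_counts = {"MIT": 50, "Apache-2.0": 30, "GPL-2.0-only": 10, "Zlib": 5, "ISC": 5}
print(licenses_needed_for_coverage(toy_counts, target=0.75))
```

On a long-tailed dataset the returned rank is small relative to the number of distinct licenses, which is the skew the cumulative plot shows.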
+
+---
+
+![image](/img/atarashi/file_count_boxplot_by_source.png)
+* **Boxplots by Source**:
+  * Both `Split-DB-Foss-Licenses` and `Split-SPDX-licenses` show similar distributions, though the median file counts differ slightly.
+  * Outliers are present in both, indicating a few licenses are over-represented.
+
+---
+
+![image](/img/atarashi/file_count_distribution.png)
+* **Heavy Tail in File Counts**:
+  * The histogram and KDE of file counts per license show **a large number of licenses with very few associated files**, while only a few have 500+ files.
+  * This reveals a severe **class imbalance**, which can bias any learning model.
+
+---
+
+![image](/img/atarashi/source_pie_chart.png)
+* **Source Composition**:
+  * The dataset is **split nearly evenly**: ~54% from `Split-DB-Foss-Licenses`, ~46% from `Split-SPDX-licenses`.
+
+---
+
+![image](/img/atarashi/top_licenses.png)
+* **Top 15 Licenses by File Count**:
+  * Some licenses (e.g., `Hacktivismo`, `Zimbra-1.2`) dominate the dataset.
+  * These must be considered while sampling or designing the pre-filtering ML models to **avoid model bias**.
+
+---
+
+## Updates
+
+* Spent time **exploring and analyzing the Minerva Dataset**.
+* Prepared and shared **insightful visualizations** for license frequency, source distribution, and class balance.
+* Identified key limitations:
+  * Severe **data imbalance** across license classes.
+  * **Lack of negative samples**: all samples are currently associated with known licenses, with no explicit “no license” or “non-license” examples.
+
+* Implemented a **proof of concept for Locality Sensitive Hashing (LSH)**:
+  * Works well when **input text length is similar** to license texts.
+  * Struggles when the input is **shorter than the license text or only a fragment of it (a subquery)**, highlighting the need for **preprocessing strategies** or **text padding**.
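One plausible reading of the length sensitivity: MinHash-style LSH approximates Jaccard similarity, which divides by the size of the union, so a short query is penalized even when every piece of it occurs in the license text. The sketch below is not the proof of concept from the meeting. It assumes 3-word shingles, uses salted BLAKE2b from the standard library as a stand-in for a family of random hash functions, and adds containment only as a comparison measure; all function names are hypothetical.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of `text` (lowercased, split on whitespace)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|: shrinks as the two sets differ in size."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(query, reference):
    """|Q ∩ R| / |Q|: the fraction of the query found in the reference."""
    return len(query & reference) / len(query) if query else 0.0


def minhash_signature(items, num_perm=256):
    """Keep the minimum hash value of `items` under each of `num_perm` salted hashes."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    signature = []
    for i in range(num_perm):
        salt = i.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Opening words of the MIT license, with a short prefix of it as the query.
license_text = (
    "permission is hereby granted free of charge to any person obtaining a copy "
    "of this software and associated documentation files to deal in the software "
    "without restriction including without limitation the rights to use copy "
    "modify merge publish distribute sublicense and sell copies of the software"
)
snippet = "permission is hereby granted free of charge to any person obtaining a copy"

full, part = shingles(license_text), shingles(snippet)
print("exact Jaccard:    ", jaccard(part, full))
print("MinHash estimate: ", estimate_jaccard(minhash_signature(part), minhash_signature(full)))
print("containment:      ", containment(part, full))
```

The snippet is fully contained in the license text, yet its Jaccard similarity stays low because the union is dominated by the longer text. Scoring by containment, or comparing the query against segments of similar length, are two possible ways to reduce that effect.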
+
+## Problems Identified
+
+* Minerva lacks **negative (non-license) samples**, which will hinder the performance of classifiers in real-world noisy environments.
+* LSH-based similarity is **length-sensitive** and needs improvement for **partial or paraphrased** inputs.
+* Dataset imbalance could lead to **model overfitting** on high-frequency licenses.
+
+## Planning for Next Week
+
+* Begin designing a **dataset augmentation strategy** to introduce negative samples.
+* Study **Locality Sensitive Hashing** in more detail.
+* Experiment with **segmenting license texts** and input queries to better suit LSH-based techniques.
+* Continue refining the dataset pipeline and class balancing.
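As a starting point for the segmentation experiment, license texts could be cut into overlapping word windows so that each segment is closer in length to a short query. This is only a sketch under assumptions: the function name `word_windows` and the `size` and `stride` defaults are hypothetical and have not been tuned on any data.

```python
def word_windows(text, size=50, stride=25):
    """Split `text` into overlapping windows of up to `size` words, moving
    `stride` words at a time. The last window may be shorter than `size`."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    words = text.split()
    if len(words) <= size:
        return [" ".join(words)] if words else []
    windows = []
    for start in range(0, len(words), stride):
        windows.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return windows


# Toy input: ten placeholder words, windows of four words with a stride of two.
toy_text = " ".join(f"w{i}" for i in range(10))
for window in word_windows(toy_text, size=4, stride=2):
    print(window)
```

Each window can then be indexed separately, so a query is compared against segments of comparable length rather than against the full license text.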