Commit 920685c

Merge pull request #311 from rajuljha/updates/2025/week2
chore(docs): Add Atarashi project week2 report
Reviewed-By: shaheem.azmal@siemens.com
2 parents d33e95e + 7ef5333 commit 920685c

6 files changed (+95, -0 lines)
@@ -0,0 +1,95 @@
---
title: Week 2
author: Rajul Jha
tags: [gsoc25, Atarashi]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0
SPDX-FileCopyrightText: 2025 Rajul Jha <rajuljha49@gmail.com>
-->

# Week 2

*(June 10, 2025 - June 16, 2025)*

## Meeting 1

*(June 11, 2025)*

## Attendees

* [Rajul Jha](https://github.com/rajuljha)
* [Kaushlendra](https://github.com/Kaushl2208)
* [Ayush](https://github.com/hastagAB)
* [Sushant](https://github.com/its-sushant)

## Discussions

* Shared findings from a **comprehensive analysis of the Minerva Dataset**.
* Reviewed dataset characteristics like class imbalance, license frequency, and dataset composition by source.
* Talked about integrating **negative samples** into the dataset, which are currently missing but critical for training robust ML models.
* Presented a **proof of concept for Locality Sensitive Hashing (LSH)** and discussed its behavior with varying input lengths.

## Minerva Dataset Analysis

Analyzed multiple dimensions of the Minerva dataset using visualizations. Key insights:

![image](/img/atarashi/cumulative_distribution.png)
* **Long-tail Distribution**:
  * The cumulative distribution shows that **a small subset of licenses accounts for the majority of files**.
  * Around **200 licenses cover ~80% of the dataset**, confirming a significant skew in class distribution.
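
A coverage figure like the one above can be derived from a per-license file count. The following is a minimal sketch of that calculation, not the analysis code behind the plot; the function name `licenses_needed_for_coverage` and the toy counts are hypothetical and chosen only for illustration.

```python
def licenses_needed_for_coverage(file_counts, target=0.8):
    """Return how many of the most frequent licenses cover `target` of all files."""
    total = sum(file_counts.values())
    if total == 0:
        return 0
    covered = 0
    # Walk licenses from most to least frequent, accumulating their file counts.
    for rank, count in enumerate(sorted(file_counts.values(), reverse=True), start=1):
        covered += count
        if covered / total >= target:
            return rank
    return len(file_counts)


# Toy counts, made up for illustration (not taken from Minerva).
toy_counts = {"A": 50, "B": 35, "C": 10, "D": 5}
needed = licenses_needed_for_coverage(toy_counts, target=0.8)
print(needed)
```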

---

![image](/img/atarashi/file_count_boxplot_by_source.png)
* **Boxplots by Source**:
  * Both `Split-DB-Foss-Licenses` and `Split-SPDX-licenses` show similar distributions, though the median file counts differ slightly.
  * Outliers are present in both, indicating a few licenses are over-represented.

---

![image](/img/atarashi/file_count_distribution.png)
* **Heavy Tail in File Counts**:
  * The histogram and KDE of file counts per license show **a large number of licenses with very few associated files**, while only a few have 500+ files.
  * This reveals severe **class imbalance**, which can bias any learning model.
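
One common way to counter this kind of imbalance during training is to weight each class inversely to its frequency. The sketch below uses the widely used "balanced" heuristic, `n_samples / (n_classes * class_count)`; it is an illustration of the idea, not a mitigation that was decided in the meeting, and the labels are placeholders.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy labels: one frequent class and one rare class (placeholder names).
toy_labels = ["frequent"] * 8 + ["rare"] * 2
weights = balanced_class_weights(toy_labels)
print(weights)  # the rare class receives the larger weight
```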

---

![image](/img/atarashi/source_pie_chart.png)
* **Source Composition**:
  * The dataset is **split nearly evenly**: ~54% from `Split-DB-Foss-Licenses`, ~46% from `Split-SPDX-licenses`.

---

![image](/img/atarashi/top_licenses.png)
* **Top 15 Licenses by File Count**:
  * Some licenses (e.g., `Hacktivismo`, `Zimbra-1.2`) dominate the dataset.
  * These must be taken into account when sampling or designing the pre-filtering ML models to **avoid model bias**.
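
As an illustration of sampling with dominant licenses in mind, the sketch below caps the number of files kept per license. This is one possible approach under stated assumptions, not the project's sampling strategy; `cap_per_license`, the cap value, and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def cap_per_license(samples, max_per_license, seed=0):
    """Keep at most `max_per_license` randomly chosen samples for each license."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    by_license = defaultdict(list)
    for text, license_id in samples:
        by_license[license_id].append(text)

    capped = []
    for license_id, texts in by_license.items():
        if len(texts) > max_per_license:
            texts = rng.sample(texts, max_per_license)
        capped.extend((text, license_id) for text in texts)
    return capped


# Toy data: one dominant class and one rare class (labels are placeholders).
toy_samples = [(f"file_{i}", "frequent") for i in range(100)]
toy_samples += [(f"file_{i}", "rare") for i in range(100, 103)]

capped = cap_per_license(toy_samples, max_per_license=10)
capped_counts = Counter(license_id for _, license_id in capped)
print(capped_counts)
```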

---

## Updates

* Spent time **exploring and analyzing the Minerva Dataset**.
* Prepared and shared **visualizations** of license frequency, source distribution, and class balance.
* Identified key limitations:
  * Severe **data imbalance** across license classes.
  * **Lack of negative samples**: currently all samples are associated with known licenses, with no explicit “no license” or “non-license” examples.

* Implemented a **proof of concept for Locality Sensitive Hashing (LSH)**:
  * Works well when the **input text length is similar** to that of the license texts.
  * Struggles when the input is **much shorter or only a fragment of a license (a subquery)**, highlighting the need for **preprocessing strategies** or **text padding**.
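
The report does not say which LSH variant the proof of concept uses. As a hedged illustration, the sketch below assumes MinHash over word shingles with banding, a common choice for text similarity; every name, the shingle size, the number of hash functions, and the band size are assumptions, and the sample text is made up rather than taken from a real license.

```python
import hashlib


def word_shingles(text, k=3):
    """Split text into a set of overlapping k-word shingles."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingle_set, num_perm=64):
    """Keep the minimum digest under each of `num_perm` salted hash functions."""
    if not shingle_set:
        raise ValueError("cannot hash an empty shingle set")
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest()
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two texts agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def share_a_band(sig_a, sig_b, rows_per_band=4):
    """LSH banding: two texts become candidates if any whole band matches."""
    for start in range(0, len(sig_a), rows_per_band):
        if sig_a[start:start + rows_per_band] == sig_b[start:start + rows_per_band]:
            return True
    return False


# Made-up, license-like text (not a real license).
full_text = (
    "permission is granted to any person obtaining a copy of this example "
    "package to use copy modify and share it without restriction provided "
    "that this notice is kept in every copy and that no warranty of any "
    "kind is given by the authors"
)
snippet = "permission is granted to any person obtaining a copy"

sig_full = minhash_signature(word_shingles(full_text))
sig_snippet = minhash_signature(word_shingles(snippet))

print(estimated_jaccard(sig_full, sig_full))     # identical input
print(estimated_jaccard(sig_full, sig_snippet))  # short fragment of the same text
print(share_a_band(sig_full, sig_snippet))
```

Because the fragment contributes only a small share of the full text's shingles, its true Jaccard similarity with the full text is low, so the two signatures will usually not share a band. That is consistent with the length sensitivity described above.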

## Problems Identified

* Minerva lacks **negative (non-license) samples**, which will hinder the performance of classifiers in real-world noisy environments.
* LSH-based similarity is **length-sensitive**; it needs improvement for **partial or paraphrased** inputs.
* Dataset imbalance could lead to **model overfitting** to high-frequency licenses.
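
The length sensitivity can be seen without any hashing by comparing two exact set measures. The sketch below is illustrative only: it contrasts Jaccard similarity, which is what MinHash-style LSH approximates, with containment, the share of the query found in the reference. The helper names and the sample text are made up.

```python
def word_shingles(text, k=3):
    """Split text into a set of overlapping k-word shingles."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(query, reference):
    """Share of the query's shingles that also occur in the reference."""
    return len(query & reference) / len(query) if query else 0.0


# Made-up, license-like text (not a real license) and a verbatim fragment of it.
reference_text = (
    "this example notice lets anyone use copy change and share the example "
    "files for any purpose as long as this notice stays with every copy"
)
query_text = "use copy change and share the example files"

reference = word_shingles(reference_text)
query = word_shingles(query_text)

print(jaccard(query, reference))      # penalized by the length difference
print(containment(query, reference))  # unaffected by the reference's extra text
```

Containment stays high for a verbatim fragment while Jaccard drops as the reference grows. Neither measure helps much with paraphrased input, since paraphrases share few exact shingles.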

## Planning for Next Week

* Begin designing a **dataset augmentation strategy** to introduce negative samples.
* Study **Locality Sensitive Hashing** in more detail.
* Experiment with **segmenting license texts** and input queries to better suit LSH-based techniques.
* Continue refining the dataset pipeline and class balancing.
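
One way the planned segmentation experiment could look is to cut each license text into overlapping windows about as long as the query and keep the best-matching window. The sketch below is a rough illustration under that assumption; the window size, step, word-level Jaccard, and all names are illustrative choices rather than decisions from the meeting, and the sample text is made up.

```python
def sliding_windows(words, size, step=1):
    """Yield overlapping word windows; the final window may be shorter."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if len(words) <= size:
        yield words
        return
    for start in range(0, len(words) - size + step, step):
        yield words[start:start + size]


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_segment_match(query, reference, step=1):
    """Compare the query against query-sized windows of the reference text."""
    query_words = query.lower().split()
    reference_words = reference.lower().split()
    if not query_words:
        return 0.0
    query_set = set(query_words)
    return max(
        jaccard(query_set, set(window))
        for window in sliding_windows(reference_words, len(query_words), step)
    )


# Made-up, license-like text (not a real license) and a verbatim fragment of it.
reference_text = (
    "this example notice lets anyone use copy change and share the example "
    "files for any purpose as long as this notice stays with every copy"
)
query_text = "use copy change and share the example files"

whole_text_score = jaccard(set(query_text.split()), set(reference_text.split()))
segment_score = best_segment_match(query_text, reference_text)
print(whole_text_score, segment_score)
```

Matching against windows removes most of the length mismatch, at the cost of indexing many more items per license. A larger step reduces that cost but can miss the best-aligned window.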