| 1 | +--- |
| 2 | +title: Week 2 |
| 3 | +author: Rajul Jha |
| 4 | +tags: [gsoc25, Atarashi] |
| 5 | +--- |
| 6 | + |
| 7 | +<!-- |
| 8 | +SPDX-License-Identifier: CC-BY-SA-4.0 |
| 9 | +SPDX-FileCopyrightText: 2025 Rajul Jha <rajuljha49@gmail.com> |
| 10 | +--> |
| 11 | + |
| 12 | +# Week 2 |
| 13 | + |
| 14 | +*(June 10, 2025 - June 16, 2025)* |
| 15 | + |
| 16 | +## Meeting 1 |
| 17 | + |
| 18 | +*(June 11, 2025)* |
| 19 | + |
| 20 | +## Attendees |
| 21 | + |
| 22 | +* [Rajul Jha](https://github.com/rajuljha) |
| 23 | +* [Kaushlendra](https://github.com/Kaushl2208) |
| 24 | +* [Ayush](https://github.com/hastagAB) |
| 25 | +* [Sushant](https://github.com/its-sushant) |
| 26 | + |
| 27 | +## Discussions |
| 28 | + |
| 29 | +* Shared findings from a **comprehensive analysis of the Minerva Dataset**. |
| 30 | +* Reviewed dataset characteristics such as class imbalance, license frequency, and dataset composition by source. |
| 31 | +* Discussed adding **negative samples** to the dataset; they are currently missing but are critical for training robust ML models. |
| 32 | +* Presented a **proof of concept for Locality Sensitive Hashing (LSH)** and discussed its behavior with varying input lengths. |
| 33 | + |
| 34 | +## Minerva Dataset Analysis |
| 35 | + |
| 36 | +Analyzed multiple dimensions of the Minerva dataset using visualizations. Key insights: |
| 37 | + |
| 38 | + |
| 39 | +* **Long-tail Distribution**: |
| 40 | + * The cumulative distribution shows that **a small subset of licenses accounts for the majority of files**. |
| 41 | + * Around **200 licenses cover ~80% of the dataset**, confirming a significant skew in class distribution. |
| 42 | + |
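The coverage figure above can be checked with a cumulative sum over per-license file counts sorted in descending order. The following is a minimal sketch, not the analysis code behind the plots; the function name `licenses_needed_for_coverage` and the toy counts are hypothetical and are not Minerva numbers.

```python
from itertools import accumulate


def licenses_needed_for_coverage(file_counts, target=0.80):
    """Return how many of the most frequent licenses cover `target` of all files."""
    if not file_counts:
        raise ValueError("file_counts must not be empty")
    # Most frequent licenses first, so the running total grows as fast as possible.
    counts = sorted(file_counts.values(), reverse=True)
    total = sum(counts)
    for rank, running_total in enumerate(accumulate(counts), start=1):
        if running_total >= target * total:
            return rank
    return len(counts)


# Toy counts invented for illustration only (hypothetical license names).
toy_counts = {"A": 500, "B": 250, "C": 100, "D": 80, "E": 40, "F": 30}
needed = licenses_needed_for_coverage(toy_counts, target=0.80)
print(f"{needed} of {len(toy_counts)} licenses cover 80% of the files")
```

Running the same function over the real per-license counts would give the number of licenses behind the ~80% mark, which is a quick way to quantify the skew.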
| 43 | +--- |
| 44 | + |
| 45 | + |
| 46 | +* **Boxplots by Source**: |
| 47 | + * Both `Split-DB-Foss-Licenses` and `Split-SPDX-licenses` show similar distributions, though the median file counts differ slightly. |
| 48 | + * Outliers are present in both, indicating a few licenses are over-represented. |
| 49 | + |
| 50 | +--- |
| 51 | + |
| 52 | + |
| 53 | +* **Heavy Tail in File Counts**: |
| 54 | + * The histogram and KDE of file counts per license show **a large number of licenses with very few associated files**, while only a few have 500+ files. |
| 55 | + * This reveals severe **class imbalance**, which can bias a learning model. |
| 56 | + |
| 57 | +--- |
| 58 | + |
| 59 | + |
| 60 | +* **Source Composition**: |
| 61 | + * The dataset is **split nearly evenly**: ~54% from `Split-DB-Foss-Licenses`, ~46% from `Split-SPDX-licenses`. |
| 62 | + |
| 63 | +--- |
| 64 | + |
| 65 | + |
| 66 | +* **Top 15 Licenses by File Count**: |
| 67 | + * Some licenses (e.g., `Hacktivismo`, `Zimbra-1.2`) dominate the dataset. |
| 68 | + * These must be considered while sampling or designing the pre-filtering ML models to **avoid model bias**. |
| 69 | + |
| 70 | +--- |
| 71 | + |
| 72 | +## Updates |
| 73 | + |
| 74 | +* Spent time **exploring and analyzing the Minerva Dataset**. |
| 75 | +* Prepared and shared **insightful visualizations** for license frequency, source distribution, and class balance. |
| 76 | +* Identified key limitations: |
| 77 | + * Severe **data imbalance** across license classes. |
| 78 | + * **Lack of negative samples**: all samples are currently associated with known licenses, with no explicit “no license” or “non-license” examples. |
| 79 | + |
| 80 | +* Implemented a **proof of concept for Locality Sensitive Hashing (LSH)**: |
| 81 | + * Works well when the **input text is similar in length** to the license texts. |
| 82 | + * Struggles when the input is **shorter or a subquery** of a license text, highlighting the need for **preprocessing strategies** or **text padding**. |
| 83 | + |
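One plausible explanation for the length sensitivity noted above, assuming the proof of concept uses a MinHash-style LSH that approximates Jaccard similarity over word shingles (the notes do not say which LSH family was used): Jaccard divides by the size of the union, so a short excerpt scores low against a long text even when every shingle of the excerpt matches. The sketch below compares exact Jaccard with containment on a toy sentence loosely modelled on a permissive license; all function names and inputs are illustrative.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of `text` (lower-cased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """|A & B| / |A | B|: penalises any difference in set size."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(query, reference):
    """|Q & R| / |Q|: fraction of the query that appears in the reference."""
    return len(query & reference) / len(query) if query else 0.0


# Toy inputs, invented for illustration only.
license_text = (
    "permission is hereby granted free of charge to any person "
    "obtaining a copy of this software and associated documentation files "
    "to deal in the software without restriction"
)
fragment = "permission is hereby granted free of charge"

full = shingles(license_text)
part = shingles(fragment)
print("jaccard:", jaccard(part, full))
print("containment:", containment(part, full))
```

The excerpt is fully contained in the longer text, yet its Jaccard score is low, which matches the behavior described for shorter inputs.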
| 84 | +## Problems Identified |
| 85 | + |
| 86 | +* Minerva lacks **negative (non-license) samples**, which will hinder the performance of classifiers in real-world noisy environments. |
| 87 | +* LSH-based similarity is **length-sensitive**; it needs improvement for **partial or paraphrased** inputs. |
| 88 | +* Dataset imbalance could lead to **model overfitting** on high-frequency licenses. |
| 89 | + |
| 90 | +## Planning for Next Week |
| 91 | + |
| 92 | +* Begin designing a **dataset augmentation strategy** to introduce negative samples. |
| 93 | +* Study **Locality Sensitive Hashing** in more detail. |
| 94 | +* Experiment with **segmenting license texts** and input queries to better suit LSH-based techniques. |
| 95 | +* Continue refining the dataset pipeline and class balancing. |
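The segmentation idea in the plan above could look roughly like the sketch below: the reference text is cut into overlapping word windows of about the query's length, and the best-scoring window is taken as the match. This is only one possible design under the assumption that similarity is Jaccard over word shingles; `windows`, `best_window_similarity`, and the toy texts are hypothetical and exact set similarity stands in for an actual LSH index.

```python
def shingles(words, k=3):
    """Return the set of k-word shingles of a list of words."""
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def windows(words, size, step=1):
    """Split `words` into overlapping windows of `size` words."""
    if len(words) <= size:
        return [words]
    segments = [words[s:s + size] for s in range(0, len(words) - size + 1, step)]
    # Make sure the tail of the text is covered when `step` skips over it.
    if (len(words) - size) % step != 0:
        segments.append(words[-size:])
    return segments


def best_window_similarity(query, reference, step=1):
    """Highest Jaccard score between the query and any reference window."""
    query_words = query.lower().split()
    reference_words = reference.lower().split()
    if not query_words:
        raise ValueError("query must contain at least one word")
    query_set = shingles(query_words)
    return max(
        jaccard(query_set, shingles(segment))
        for segment in windows(reference_words, len(query_words), step)
    )


# Toy inputs, invented for illustration only.
license_text = (
    "permission is hereby granted free of charge to any person "
    "obtaining a copy of this software and associated documentation files"
)
fragment = "a copy of this software"

whole = jaccard(shingles(fragment.split()), shingles(license_text.split()))
best = best_window_similarity(fragment, license_text)
print("whole-text jaccard:", whole)
print("best-window jaccard:", best)
```

Comparing against windows of similar length removes the size mismatch, at the cost of indexing many more segments per license.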