Commit 62d5bc3

Merge pull request #318 from rajuljha/updates/week3
chore(docs): Add week3 atarashi docs.
2 parents d27e518 + d729ca1 commit 62d5bc3

1 file changed, 98 additions and 0 deletions

@@ -0,0 +1,98 @@
---
title: Week 3
author: Rajul Jha
tags: [gsoc25, Atarashi]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0
SPDX-FileCopyrightText: 2025 Rajul Jha <rajuljha49@gmail.com>
-->

# Week 3

*(June 17, 2025 - June 18, 2025)*

## Meeting 1

*(June 18, 2025)*

## Attendees

* [Rajul Jha](https://github.com/rajuljha)
* [Kaushlendra](https://github.com/Kaushl2208)
* [Ayush](https://github.com/hastagAB)
* [Sushant](https://github.com/its-sushant)
* [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)

## Discussions

* Presented progress on improving the Locality Sensitive Hashing (LSH) approach for license detection.
* Compared MinHash (Jaccard-based) vs SimHash (cosine-based) algorithms.
* Shared insights from experimenting with different vectorization techniques (TF-IDF vs. Sentence Transformers).
* Discussed handling large-scale corpora with caching and sampling strategies.
* Mentors proposed a **3-step architecture for Atarashi** involving:
  1. Initial keyword detection using `STRINGS.in`.
  2. License prediction via an LSH-based classifier.
  3. Final license verification for correctness.
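
The three proposed stages can be pictured as a simple chain of checks. The sketch below is only an illustration of that control flow: `Detection`, `run_pipeline`, and the three callables passed in are hypothetical names, not part of Atarashi, and the real stages may exchange richer data.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Detection:
    license_name: Optional[str]  # None means no license was confirmed
    stage: str                   # which stage produced the decision


def run_pipeline(
    text: str,
    has_keywords: Callable[[str], bool],   # stage 1 (hypothetical callable)
    predict: Callable[[str], List[str]],   # stage 2 (hypothetical callable)
    verify: Callable[[str, str], bool],    # stage 3 (hypothetical callable)
) -> Detection:
    # Stage 1: cheap keyword filter; texts without license-like keywords stop here.
    if not has_keywords(text):
        return Detection(None, "keywords")
    # Stage 2: the classifier proposes candidate licenses, best first.
    for candidate in predict(text):
        # Stage 3: accept the first candidate that passes verification.
        if verify(text, candidate):
            return Detection(candidate, "verified")
    return Detection(None, "unverified")
```

Passing the stages in as callables keeps the sketch independent of how each stage is eventually implemented.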

## LSH Algorithm and Implementation Updates

### From MinHash to SimHash

* The initial implementation, using **MinHash with character shingles** and Jaccard similarity, yielded poor results: it lacked robustness against paraphrased or partial text.
* Switched to **SimHash**, which is better suited to high-dimensional dense vector spaces and performs well with cosine similarity.
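
For reference, the MinHash baseline described above can be sketched in a few lines. This is a minimal illustration, not the code from the experiments: the shingle size `k=5`, the `num_perm=128` signature length, and the use of seeded `blake2b` hashes in place of random permutations are all assumptions made for the example.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams of a lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + k] for i in range(max(len(normalized) - k + 1, 1))}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(seed, shingle):
    # A seeded 64-bit hash stands in for one random permutation of the shingle space.
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_perm=128):
    """For each seed, keep the minimum hash value over all shingles."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

Because the shingles are literal character sequences, a paraphrase that changes the wording shares few shingles with the original, which is consistent with the weakness noted above.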

### SimHash Overview

* SimHash projects high-dimensional vectors into binary hash codes based on weighted sign projections.
* Vectors with a **small cosine distance** tend to map to hash codes with **small Hamming distances**.
* This enables fast similarity search using **hash buckets**, significantly reducing lookup time.
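
The points above can be made concrete with a small random-hyperplane SimHash. This is a sketch under assumptions, not the project's implementation: the 64-bit code length, the Gaussian hyperplanes, and the 8-band bucketing scheme are illustrative choices, and all function names are hypothetical.

```python
import random


def make_hyperplanes(dim, num_bits=64, seed=0):
    """Draw one random Gaussian hyperplane per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def simhash(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    code = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two hash codes."""
    return bin(a ^ b).count("1")


def bucket_keys(code, num_bits=64, bands=8):
    """Split the code into bands; each (band index, band value) pair is a bucket key."""
    width = num_bits // bands
    mask = (1 << width) - 1
    return [(i, (code >> (i * width)) & mask) for i in range(bands)]
```

Only the sign of each projection is kept, so scaling a vector leaves its code unchanged, while the opposite vector flips every bit. Two documents become lookup candidates when they share at least one bucket key, and the Hamming distance threshold used to accept a candidate is a parameter to tune.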

### Vectorization Techniques

* A **TF-IDF vectorizer** was used initially, but it produced **sparse vectors**, which worked poorly with SimHash.
* Transitioned to **`sentence-transformers`**, using the **all-MiniLM-L6-v2** model, which generates **dense sentence embeddings** suitable for SimHash.
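
A minimal sketch of the embedding step, assuming the `sentence-transformers` package is installed and the model can be downloaded. The helper names `density` and `embed_texts` are hypothetical, and normalizing the embeddings is an assumption made here for convenience, not something the notes state.

```python
def density(vector):
    """Fraction of non-zero entries; close to 0 for a sparse TF-IDF row."""
    values = list(vector)
    return sum(1 for v in values if v != 0) / len(values) if values else 0.0


def embed_texts(texts, model_name="all-MiniLM-L6-v2"):
    """Return one dense embedding per input text."""
    # Imported lazily so that density() works without the package installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # downloads the model on first use
    # With normalize_embeddings=True the rows have unit length, so the dot
    # product of two rows equals their cosine similarity.
    return model.encode(texts, normalize_embeddings=True)
```

For `all-MiniLM-L6-v2` each embedding has 384 dimensions, and nearly all entries are non-zero, in contrast to a TF-IDF row whose density shrinks as the vocabulary grows.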

### Performance Optimizations

* Implemented **caching** to avoid repeated vector generation for the same files.
* Because of the dataset size (~162k files), vectorization was limited to a representative subset of **10,000 files** for faster experimentation.
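
Both optimizations can be sketched briefly. The example below is an assumption-laden illustration: it keys an in-memory cache on a SHA-256 digest of the file content and draws a seeded uniform random sample, whereas the notes do not say how the cache is stored or how the representative subset was chosen. `VectorCache` and `sample_subset` are hypothetical names.

```python
import hashlib
import random


class VectorCache:
    """Memoize vectors by content hash so identical files are vectorized once."""

    def __init__(self, vectorize):
        self._vectorize = vectorize  # any callable mapping text to a vector
        self._store = {}
        self.misses = 0

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._vectorize(text)
        return self._store[key]


def sample_subset(paths, size, seed=42):
    """Pick a reproducible uniform random subset of file paths."""
    ordered = sorted(paths)  # sort first so the sample does not depend on input order
    if size >= len(ordered):
        return ordered
    return random.Random(seed).sample(ordered, size)
```

Fixing the seed makes the 10,000-file subset reproducible across runs, which keeps cached vectors reusable between experiments.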

## Experimental Results

* **Combined all Minerva files into a single corpus** and indexed it using SimHash-based LSH.
* Indexed 10,000 sample files, including:
  * 46 unique licenses (out of 654 total)
  * 20 known non-license texts
  * Total: 674 queries (654 licenses + 20 non-license texts).

**Key Metrics:**

| Metric | Value |
|--------|-------|
| Indexed licenses | 46 / 654 |
| Correctly retrieved licenses | All 46 |
| Correctly rejected non-license text | 20 / 674 |
| Detected unseen licenses (not indexed) | 203 / 608 |
| Indexed file subset | 10,000 / 162,833 |
| Overall trend | Positive performance despite limited indexing |
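
The ratios in the table are easier to compare as percentages. The short calculation below uses only the counts reported above; the helper name `rate` is introduced here for illustration.

```python
def rate(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)


license_coverage = rate(46, 654)        # share of licenses present in the index
unseen_detected = rate(203, 608)        # unseen licenses still detected
files_indexed = rate(10_000, 162_833)   # share of the corpus that was indexed
unseen_total = 654 - 46                 # licenses that were queried but not indexed

print(f"license coverage: {license_coverage}%")
print(f"unseen licenses detected: {unseen_detected}%")
print(f"files indexed: {files_indexed}%")
```

So roughly 7.0% of the licenses and 6.1% of the files were indexed, and about 33.4% of the 608 unseen licenses were still detected.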

Code Repository: [atarashi-classifier](https://github.com/rajuljha/atarashi-classifer)

---

## Problems Identified

* **TF-IDF vectors** were too sparse, reducing effectiveness with SimHash.
* **Limited indexing** (only 10k files) restricts generalization for rare licenses.
* Not all licenses were present in the indexed corpus, so it needs **broader coverage**.
* **False negatives** among non-license texts indicate that further tuning is required.

## Planning for Next Week

* Start working on **Stage 1** of the proposed pipeline:
  * Extract and match keywords from `STRINGS.in` (used by Nomos) to identify candidate license regions.
* Expand the indexed dataset to include more diverse license types.
* Improve the non-license detection rate through better negative sampling and filtering.
* Continue tuning SimHash and embedding-based search thresholds.
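
As a starting point for the Stage 1 item above, candidate regions can be found by scanning for keywords line by line. This is a rough sketch under assumptions: it takes an already-extracted keyword list as input rather than parsing `STRINGS.in` (whose format is not covered in these notes), treats keywords as literal case-insensitive strings, and the function name and `context` parameter are hypothetical.

```python
import re


def find_candidate_regions(text, keywords, context=1):
    """Return 0-based inclusive (start_line, end_line) ranges around keyword hits."""
    if not keywords:
        return []
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    lines = text.splitlines()
    regions = []
    for i, line in enumerate(lines):
        if not pattern.search(line):
            continue
        start = max(i - context, 0)
        end = min(i + context, len(lines) - 1)
        if regions and start <= regions[-1][1] + 1:
            # Overlapping or adjacent to the previous region: extend it.
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions
```

Only the returned regions would then need to be passed on to the more expensive LSH-based stage.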
