Merge pull request #318 from rajuljha/updates/week3

GMishx · web-flow · commit 62d5bc3e9f68 · 2025-06-19T21:31:39.000+05:30
chore(docs): Add week3 atarashi docs.
diff --git a/docs/2025/atarashi-enhancement/updates/2025-06-18.md b/docs/2025/atarashi-enhancement/updates/2025-06-18.md
@@ -0,0 +1,98 @@
+---
+title: Week 3  
+author: Rajul Jha  
+tags: [gsoc25, Atarashi]
+---
+
+<!--  
+SPDX-License-Identifier: CC-BY-SA-4.0  
+SPDX-FileCopyrightText: 2025 Rajul Jha <rajuljha49@gmail.com>  
+-->
+
+# Week 3
+
+*(June 17, 2025 - June 18, 2025)*
+
+## Meeting 1
+
+*(June 18, 2025)*
+
+## Attendees
+
+* [Rajul Jha](https://github.com/rajuljha)
+* [Kaushlendra](https://github.com/Kaushl2208)
+* [Ayush](https://github.com/hastagAB)
+* [Sushant](https://github.com/its-sushant)
+* [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)
+
+## Discussions
+
+* Presented progress on improving the Locality Sensitive Hashing (LSH) approach for license detection.
+* Compared MinHash (Jaccard-based) vs SimHash (cosine-based) algorithms.
+* Shared insights from experimenting with different vectorization techniques (TF-IDF vs. Sentence Transformers).
+* Discussed handling large-scale corpora with caching and sampling strategies.
+* Mentors proposed a **3-step architecture for Atarashi** involving:
+  1. Initial keyword detection using `STRINGS.in`.
+  2. License prediction via LSH-based classifier.
+  3. Final license verification for correctness.
+
+## LSH Algorithm and Implementation Updates
+
+### From MinHash to SimHash
+
+* Initial implementation using **MinHash with character shingles** and Jaccard similarity yielded poor results — lacked robustness against paraphrased or partial text.
+* Switched to **SimHash**, which is more suitable for high-dimensional dense vector spaces and performs well with cosine similarity.
+
+### SimHash Overview
+
+* SimHash works by projecting high-dimensional vectors into binary hash codes based on weighted sign projections.
+* Vectors that are **closer in cosine distance** map to hash codes with **small Hamming distances**.
+* Enables fast similarity search using **hash buckets**, significantly reducing lookup time.
+
+### Vectorization Techniques
+
+* **TF-IDF Vectorizer** was initially used but resulted in **sparse vectors**, which are incompatible with SimHash.
+* Transitioned to **`sentence-transformers`** — used **all-MiniLM-L6-v2** model, which generates **dense sentence embeddings** suitable for SimHash.
+
+### Performance Optimizations
+
+* Implemented **caching** to avoid repeated vector generation for the same files.
+* Due to dataset size (~162k files), limited vectorization to a representative subset of **10,000 files** for faster experimentation.
+
+## Experimental Results
+
+* **Combined all Minerva files into a single corpus** and indexed using SimHash-based LSH.
+* Indexed 10,000 sample files, including:
+  * 46 unique licenses (out of 654 total)
+  * 20 known non-license texts
+  * Total -> 674 queries.
+
+**Key Metrics:**
+
+| Metric | Value |
+|--------|-------|
+| Indexed licenses | 46 / 654 |
+| Correctly retrieved licenses | All 46 |
+| Correctly rejected non-license text |  20 / 674 |
+| Detected unseen licenses (not indexed) |  203 / 608 |
+| Indexed file subset | 10,000 / 162,833 |
+| Overall trend | Positive performance despite limited indexing |
+
+Code Repository: [atarashi-classifier](https://github.com/rajuljha/atarashi-classifer)
+
+---
+
+## Problems Identified
+
+* **TF-IDF vectors** were too sparse, reducing effectiveness with SimHash.
+* **Limited indexing** (only 10k files) restricts generalization for rare licenses.
+* Not all licenses were present in the indexed corpus — needs **broader coverage**.
+* **False negatives** among non-license texts indicate further tuning is required.
+
+## Planning for Next Week
+
+* Start working on **Stage 1** of the proposed pipeline:
+  * Extract and match keywords from `STRINGS.in` (used by Nomos) to identify candidate license regions.
+* Expand indexed dataset to include more diverse license types.
+* Improve non-license detection rate through better negative sampling and filtering.
+* Continue tuning SimHash and embedding-based search thresholds.