| 1 | +--- |
| 2 | +title: Week 3 |
| 3 | +author: Rajul Jha |
| 4 | +tags: [gsoc25, Atarashi] |
| 5 | +--- |
| 6 | + |
| 7 | +<!-- |
| 8 | +SPDX-License-Identifier: CC-BY-SA-4.0 |
| 9 | +SPDX-FileCopyrightText: 2025 Rajul Jha <rajuljha49@gmail.com> |
| 10 | +--> |
| 11 | + |
| 12 | +# Week 3 |
| 13 | + |
| 14 | +*(June 17, 2025 - June 18, 2025)* |
| 15 | + |
| 16 | +## Meeting 1 |
| 17 | + |
| 18 | +*(June 18, 2025)* |
| 19 | + |
| 20 | +## Attendees |
| 21 | + |
| 22 | +* [Rajul Jha](https://github.com/rajuljha) |
| 23 | +* [Kaushlendra](https://github.com/Kaushl2208) |
| 24 | +* [Ayush](https://github.com/hastagAB) |
| 25 | +* [Sushant](https://github.com/its-sushant) |
| 26 | +* [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd) |
| 27 | + |
| 28 | +## Discussions |
| 29 | + |
| 30 | +* Presented progress on improving the Locality Sensitive Hashing (LSH) approach for license detection. |
| 31 | +* Compared MinHash (Jaccard-based) vs. SimHash (cosine-based) algorithms. |
| 32 | +* Shared insights from experimenting with different vectorization techniques (TF-IDF vs. Sentence Transformers). |
| 33 | +* Discussed handling large-scale corpora with caching and sampling strategies. |
| 34 | +* Mentors proposed a **3-step architecture for Atarashi** involving: |
| 35 | + 1. Initial keyword detection using `STRINGS.in`. |
| 36 | + 2. License prediction via LSH-based classifier. |
| 37 | + 3. Final license verification for correctness. |
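
The control flow of the three steps above could be sketched as follows. This is only an illustration: the function name `detect_license` and its three callables are hypothetical, and the short-circuit behaviour (returning `None` as soon as a stage fails) is an assumption, since the notes do not say how the stages hand off to each other.

```python
from typing import Callable, Optional


def detect_license(
    text: str,
    has_keywords: Callable[[str], bool],
    predict: Callable[[str], Optional[str]],
    verify: Callable[[str, str], bool],
) -> Optional[str]:
    """Run the three stages in order and stop at the first one that fails."""
    # Step 1: cheap keyword screen (e.g. keywords taken from STRINGS.in).
    if not has_keywords(text):
        return None
    # Step 2: LSH-based prediction of the most likely license.
    candidate = predict(text)
    if candidate is None:
        return None
    # Step 3: final verification of the predicted license.
    return candidate if verify(text, candidate) else None
```

Passing the stages in as callables keeps the sketch independent of how each stage is eventually implemented.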
| 38 | + |
| 39 | +## LSH Algorithm and Implementation Updates |
| 40 | + |
| 41 | +### From MinHash to SimHash |
| 42 | + |
| 43 | +* The initial implementation, using **MinHash with character shingles** and Jaccard similarity, yielded poor results: it lacked robustness against paraphrased or partial text. |
| 44 | +* Switched to **SimHash**, which is more suitable for high-dimensional dense vector spaces and performs well with cosine similarity. |
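
To make the MinHash limitation concrete, here is a minimal standard-library sketch of character shingling, exact Jaccard similarity, and a MinHash estimate of it. The shingle size `k=5`, the 64 hash functions, and the use of salted `blake2b` are illustrative assumptions, not the settings used in the project.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set[str]:
    """Return the set of overlapping k-character shingles of the normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """Keep the minimum hash value under each of num_perm salted hash functions."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Because the similarity is computed over literal character sequences, a paraphrase that keeps the meaning but changes the wording shares very few shingles with the original, which is consistent with the poor results noted above.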
| 45 | + |
| 46 | +### SimHash Overview |
| 47 | + |
| 48 | +* SimHash projects each high-dimensional vector onto a set of hyperplanes and keeps only the sign of each projection, producing a compact binary hash code. |
| 49 | +* Vectors with **high cosine similarity** tend to map to hash codes with **small Hamming distances**. |
| 50 | +* Enables fast similarity search using **hash buckets**, significantly reducing lookup time. |
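
The points above can be illustrated with a small standard-library sketch of the random-hyperplane form of SimHash. It assumes that this is the variant in use; the function names, the 64-bit code length, and the banding scheme for buckets are hypothetical choices made for illustration, and the real implementation in the repository may differ.

```python
import random


def make_hyperplanes(dim: int, num_bits: int, seed: int = 0) -> list[list[float]]:
    """Draw num_bits random Gaussian hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def simhash(vector: list[float], hyperplanes: list[list[float]]) -> int:
    """Set one bit per hyperplane: 1 if the projection is non-negative, else 0."""
    code = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hash codes differ."""
    return bin(a ^ b).count("1")


def band_keys(code: int, num_bits: int, bands: int) -> list[tuple[int, int]]:
    """Split a code into equal-width bands; each (band index, value) is a bucket key."""
    width = num_bits // bands
    mask = (1 << width) - 1
    return [(i, (code >> (i * width)) & mask) for i in range(bands)]
```

Only the direction of a vector affects its code, so rescaling an embedding leaves the hash unchanged. Two documents become lookup candidates for each other when they share at least one bucket key, which avoids comparing a query against every indexed file.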
| 51 | + |
| 52 | +### Vectorization Techniques |
| 53 | + |
| 54 | +* **TF-IDF Vectorizer** was used initially but produced **sparse vectors**, which worked poorly with SimHash. |
| 55 | +* Transitioned to **`sentence-transformers`** with the **all-MiniLM-L6-v2** model, which generates **dense sentence embeddings** suitable for SimHash. |
| 56 | + |
| 57 | +### Performance Optimizations |
| 58 | + |
| 59 | +* Implemented **caching** to avoid repeated vector generation for the same files. |
| 60 | +* Due to dataset size (~162k files), limited vectorization to a representative subset of **10,000 files** for faster experimentation. |
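
One way the caching and sampling described above could look is sketched below. The helper names `cached_vector` and `sample_files`, the on-disk JSON layout, and the content-hash cache key are all assumptions for illustration; `embed` stands in for whatever embedding function is used.

```python
import hashlib
import json
import random
from pathlib import Path
from typing import Callable, Sequence


def cached_vector(
    text: str,
    embed: Callable[[str], Sequence[float]],
    cache_dir: Path,
) -> list[float]:
    """Return the embedding of text, computing it only on the first request."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache on file content so identical files share one entry.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    vector = [float(x) for x in embed(text)]
    path.write_text(json.dumps(vector), encoding="utf-8")
    return vector


def sample_files(paths: Sequence[str], k: int, seed: int = 42) -> list[str]:
    """Pick a reproducible random subset of at most k paths."""
    ordered = sorted(paths)
    if k >= len(ordered):
        return ordered
    return random.Random(seed).sample(ordered, k)
```

Sorting before sampling with a fixed seed means the same subset is selected on every run, so experiments on the reduced corpus stay comparable.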
| 61 | + |
| 62 | +## Experimental Results |
| 63 | + |
| 64 | +* **Combined all Minerva files into a single corpus** and indexed using SimHash-based LSH. |
| 65 | +* Indexed 10,000 sample files, including: |
| 66 | + * 46 unique licenses (out of 654 total) |
| 67 | + * 20 known non-license texts |
| 68 | + * Total: 674 queries (654 licenses plus the 20 non-license texts). |
| 69 | + |
| 70 | +**Key Metrics:** |
| 71 | + |
| 72 | +| Metric | Value | |
| 73 | +|--------|-------| |
| 74 | +| Indexed licenses | 46 / 654 | |
| 75 | +| Correctly retrieved licenses | All 46 | |
| 76 | +| Correctly rejected non-license text | 20 / 674 | |
| 77 | +| Detected unseen licenses (not indexed) | 203 / 608 | |
| 78 | +| Indexed file subset | 10,000 / 162,833 | |
| 79 | +| Overall trend | Positive performance despite limited indexing | |
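
For readers who prefer percentages, the ratios in the table can be converted with the short calculation below. It only restates the figures already reported; the helper `pct` is a hypothetical name and no new measurements are introduced.

```python
def pct(part: int, whole: int) -> float:
    """Express part / whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Consistency checks on the counts reported in the table.
unseen_licenses = 654 - 46   # licenses that were not indexed -> 608
total_queries = 654 + 20     # all licenses plus non-license texts -> 674

index_coverage = pct(10_000, 162_833)   # share of files indexed -> 6.1
license_coverage = pct(46, 654)         # share of licenses indexed -> 7.0
unseen_detection = pct(203, 608)        # unseen licenses detected -> 33.4
```

In other words, with roughly 6% of the files and 7% of the licenses indexed, about a third of the unseen licenses were still detected.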
| 80 | + |
| 81 | +Code Repository: [atarashi-classifier](https://github.com/rajuljha/atarashi-classifer) |
| 82 | + |
| 83 | +--- |
| 84 | + |
| 85 | +## Problems Identified |
| 86 | + |
| 87 | +* **TF-IDF vectors** were too sparse, reducing effectiveness with SimHash. |
| 88 | +* **Limited indexing** (only 10k files) restricts generalization for rare licenses. |
| 89 | +* Not all licenses were present in the indexed corpus, so **broader coverage** is needed. |
| 90 | +* **False negatives** among non-license texts indicate further tuning is required. |
| 91 | + |
| 92 | +## Planning for Next Week |
| 93 | + |
| 94 | +* Start working on **Stage 1** of the proposed pipeline: |
| 95 | + * Extract and match keywords from `STRINGS.in` (used by Nomos) to identify candidate license regions. |
| 96 | +* Expand indexed dataset to include more diverse license types. |
| 97 | +* Improve non-license detection rate through better negative sampling and filtering. |
| 98 | +* Continue tuning SimHash and embedding-based search thresholds. |
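
As a starting point for Stage 1 in the plan above, keyword matching to find candidate license regions might be sketched as follows. The keyword list here is a hypothetical placeholder, not the contents of `STRINGS.in`, and parsing that file is deliberately left out; the `context` window of surrounding lines is likewise an assumption.

```python
import re

# Hypothetical keywords for illustration; the real list would come from STRINGS.in.
KEYWORDS = ["license", "copyright", "permission is hereby granted", "redistribution"]


def candidate_regions(
    text: str,
    keywords: list[str] = KEYWORDS,
    context: int = 2,
) -> list[tuple[int, int]]:
    """Return merged half-open (start, end) line ranges around keyword hits."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    lines = text.splitlines()
    regions: list[tuple[int, int]] = []
    for i, line in enumerate(lines):
        if not pattern.search(line):
            continue
        start = max(0, i - context)
        end = min(len(lines), i + context + 1)
        if regions and start <= regions[-1][1]:
            # Overlapping or touching windows are merged into one region.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions
```

Files that yield no regions could be skipped before the more expensive embedding and LSH lookup, which is the role the keyword stage plays in the proposed pipeline.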