-- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd): No Update Available.
-
-- [Gaurav Mishra](https://github.com/GMishx): No Update Available.
-
-- [Kaushlendra Pratap](https://github.com/Kaushl2208): No Update Available.
+- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd): General updates regarding the project. Discussed UX progress.
 
 ## Updates from contributors
 
 - [Rajul Jha](https://github.com/rajuljha)
 
-  - No Update Available.
+  - Spent time exploring and analyzing the Minerva Dataset.
+
+  - Prepared and shared visualizations of license frequency, source distribution, and class balance.
+
+  - Found severe data imbalance across license classes and a lack of negative samples: currently all samples are associated with known licenses, with no explicit “no license” or “non-license” examples.
+
+  - Implemented a proof of concept for Locality Sensitive Hashing (LSH).
+
+  - The proof of concept struggles when the input is short or is only a sub-query of a license text, highlighting the need for preprocessing strategies or text padding.
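The short-input problem noted above can be illustrated with a minimal MinHash-based LSH sketch. This is not the proof of concept from the meeting. The function names, the 3-word shingle size, the 128 hash functions, and the 32 bands are all assumptions chosen for illustration, and the sample text is a short MIT-style excerpt rather than data from the Minerva Dataset.

```python
import hashlib

# Illustrative reference text (an MIT-style excerpt) and a short sub-query of it.
FULL_TEXT = (
    "permission is hereby granted free of charge to any person obtaining a copy "
    "of this software and associated documentation files to deal in the software "
    "without restriction including without limitation the rights to use copy "
    "modify merge publish and distribute copies of the software"
)
SHORT_QUERY = " ".join(FULL_TEXT.split()[:10])


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle under a given seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_perm=128):
    """Keep the minimum hash value under each of num_perm seeded hash functions."""
    if not shingle_set:
        raise ValueError("cannot build a signature for an empty shingle set")
    return [min(_seeded_hash(seed, s) for s in shingle_set) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def band_keys(signature, bands=32):
    """Split a signature into bands; texts sharing any band key are LSH candidates."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)


def containment(query, reference):
    """Share of the query's shingles that also occur in the reference."""
    return len(query & reference) / len(query)


full_set, short_set = shingles(FULL_TEXT), shingles(SHORT_QUERY)
full_sig, short_sig = minhash_signature(full_set), minhash_signature(short_set)

print("exact Jaccard:     ", exact_jaccard(short_set, full_set))
print("estimated Jaccard: ", estimate_jaccard(short_sig, full_sig))
print("containment:       ", containment(short_set, full_set))
print("shared LSH bands:  ", len(band_keys(short_sig) & band_keys(full_sig)))
```

Every shingle of the short query occurs in the full text, so its containment is 1.0, yet its Jaccard similarity stays low because the union is dominated by the longer text. That gap is one plausible reason a Jaccard-based index misses sub-queries, and it is why padding, windowing the reference texts, or a containment-style score may be worth evaluating.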
-- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd): No Update Available.
-
-- [Gaurav Mishra](https://github.com/GMishx): No Update Available.
+- [Gaurav Mishra](https://github.com/GMishx): General updates regarding the project.
 
-- [Kaushlendra Pratap](https://github.com/Kaushl2208): No Update Available.
+- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd): Requested that contributors comment on the issues created for GSoC 2025 so that mentors/maintainers can assign the issues to them.
 
 ## Updates from contributors
 
 - [Rajul Jha](https://github.com/rajuljha)
 
-  - No Update Available.
+  - Presented progress on improving the Locality Sensitive Hashing (LSH) approach for license detection.
+
+  - Compared the MinHash (Jaccard-based) and SimHash (cosine-based) algorithms.
+
+  - Shared insights from experimenting with different vectorization techniques (TF-IDF vs. Sentence Transformers).
+
+  - Discussed handling large-scale corpora with caching and sampling strategies.
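To make the MinHash vs. SimHash distinction above concrete, here is a small sketch of a SimHash fingerprint next to exact Jaccard and cosine scores. It is an illustration only, not the code presented in the meeting. The helper names are hypothetical, raw term counts stand in for TF-IDF weights, and the 64-bit fingerprint size is an assumption.

```python
import hashlib
import math
from collections import Counter


def token_hash(token, bits=64):
    """Deterministic hash of a token, truncated to the requested number of bits."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=bits // 8).digest()
    return int.from_bytes(digest, "big")


def simhash(weights, bits=64):
    """Build a SimHash fingerprint from a token -> weight mapping."""
    totals = [0.0] * bits
    for token, weight in weights.items():
        h = token_hash(token, bits)
        for i in range(bits):
            # Each bit of the token hash votes +weight or -weight for that position.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def simhash_cosine(a, b, bits=64):
    """Approximate cosine similarity from the fraction of differing bits."""
    return math.cos(math.pi * hamming(a, b) / bits)


def cosine(wa, wb):
    """Exact cosine similarity of two weighted token vectors."""
    dot = sum(w * wb.get(t, 0) for t, w in wa.items())
    norm_a = math.sqrt(sum(w * w for w in wa.values()))
    norm_b = math.sqrt(sum(w * w for w in wb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Exact Jaccard similarity of two token collections, ignoring counts."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Same vocabulary, but the second text repeats one word many times.
text_a = "the software is provided as is without warranty of any kind"
text_b = text_a + " warranty" * 9
counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())

print("Jaccard (set-based):   ", jaccard(counts_a, counts_b))
print("cosine (count-based):  ", cosine(counts_a, counts_b))
print("SimHash cosine estimate:", simhash_cosine(simhash(counts_a), simhash(counts_b)))
```

Because Jaccard only looks at which tokens are present, the two texts score 1.0 there, while the cosine score drops once term weights differ. Which behavior suits license detection better depends on whether repeated or heavily weighted terms should matter, which is the kind of trade-off such a comparison would surface.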