Skip to content

gsoc final report

mikit edited this page Aug 21, 2021 · 6 revisions

Google Summer of Code 2021 Final Report

authored by @m1kit

Abstract

Detecting the contents of license documents is a major task for software that automatically analyzes software metadata, such as package managers. OSS licenses can be detected automatically by comparing the license document with templates of major OSS licenses managed by SPDX using a string algorithm. However, typical string comparison algorithms are vulnerable to small additions of malicious wording.

For this reason, SPDX has provided the guideline for matching license documents. In addition, several license detection algorithms have been implemented. Among them, spdx_python_licensematching aims to faithfully reproduce the guidelines.

I have been working on two major tasks: to improve the existing implementation, and to release it as a library YALM to the world.

Work Product

Terregex

Structured Normalization

Words-bag Pre-filtering

Random Testcases

YALM Resources

Publish on PyPI

Multiprocessing

Misc

Future Work

API Documentation

Accuracy Improvement

Differences Extraction

Better Testing

Merge into SPDX libraries

Porting

Acknowledgements

First of all, I would like to thank @anshuldutt21 san for creating the first code base. The idea of transpiling the templates into regular expressions to follow the guidelines was his, and his implementation was very well thought out. This project won't be like this without his efforts.

Also, this project would not have been possible without the mentoring of @goneall san. The SPDX community has been working on the problem of license matching for many years, and they have knowledge about issues and solutions that I did not know. They were kind enough to provide me with relevant knowledge as the situation required.

Finally, I would like to thank members of the community for discussing with me and the GSoC staffs for giving me this great opportunity.

References

Clone this wiki locally