
gsoc final report

mikit edited this page Aug 21, 2021 · 6 revisions

Google Summer of Code 2021 Final Report

SPDX YALM authored by @m1kit

Abstract

Identifying the license in a license document is a major task for software that automatically analyzes software metadata, such as package managers. OSS licenses can be detected automatically by comparing the license document against the templates of major OSS licenses maintained by SPDX, using a string-matching algorithm. However, typical string comparison algorithms are vulnerable to small additions of malicious wording.

For this reason, SPDX provides guidelines for matching license documents, and several license detection algorithms have been implemented. Among them, spdx_python_licensematching aims to reproduce the guidelines faithfully.

I worked on two major tasks: improving the existing implementation, and releasing it to the world as a library, YALM.

Work Product

Terregex

terregex is a library for transforming regex patterns. I developed it because YALM uses a regular expression engine as the backbone of its algorithms, and in some situations we need to convert a pattern according to certain rules.

For example, even if you want to convert all letters in a pattern to lowercase, you need to keep character classes such as \W in uppercase, because \w and \W have different meanings. terregex can take such semantic differences into account.
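The lowercase example can be illustrated with a minimal sketch. This is not terregex's API or implementation; `lowercase_pattern` is a hypothetical helper that only handles two-character backslash escapes and ignores other constructs such as named groups or hex escapes.

```python
import re


def lowercase_pattern(pattern: str) -> str:
    """Lowercase literal characters, leaving backslash escapes untouched.

    Hypothetical helper for illustration only (not part of terregex).
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Copy the escape (e.g. \W, \S, \.) verbatim: \W and \w differ in meaning.
            out.append(pattern[i:i + 2])
            i += 2
        else:
            out.append(ch.lower())
            i += 1
    return "".join(out)


original = r"MIT\W+License"
naive = original.lower()              # turns \W into \w and changes the meaning
careful = lowercase_pattern(original)  # keeps \W intact
```

With the naive `str.lower()`, the separator class `\W` (non-word characters) silently becomes `\w` (word characters), so the pattern no longer matches "mit license".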

Structured Normalization

To prevent false positives caused by small notational differences, YALM normalizes documents and templates according to specific rules. SPDX license templates contain expressions that mark which parts of the template are replaceable and which parts are omittable. Previously, all of those expressions were first converted to regular expressions and the result was then normalized. However, those expressions should go through a different normalization process from the surrounding text. YALM therefore normalizes the template first and only then converts it into a regular expression.
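The "normalize first, then convert" order can be sketched as follows. This is a simplified illustration, not YALM's code: it assumes the `<<var;name="...";original="...";match="...">>` and `<<beginOptional>>` / `<<endOptional>>` markup with the fields in exactly that order, and it uses a toy normalization (lowercasing and whitespace collapsing) in place of the full SPDX matching guidelines. `normalize_text` and `template_to_regex` are hypothetical names.

```python
import re

# Assumed markup shapes; real templates may order or quote fields differently.
TOKEN = re.compile(
    r'<<var;name="[^"]*";original="[^"]*";match="([^"]*)">>'
    r"|<<beginOptional>>|<<endOptional>>"
)


def normalize_text(text: str) -> str:
    """Toy normalization: lowercase and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower())


def template_to_regex(template: str) -> str:
    """Normalize only the literal text, then assemble a regex."""
    parts = []
    pos = 0
    for m in TOKEN.finditer(template):
        # Literal text between markup is normalized and escaped.
        parts.append(re.escape(normalize_text(template[pos:m.start()])))
        token = m.group(0)
        if token == "<<beginOptional>>":
            parts.append("(?:")
        elif token == "<<endOptional>>":
            parts.append(")?")
        else:
            # The match field is already a regex: insert it without normalizing.
            parts.append("(?:" + m.group(1) + ")")
        pos = m.end()
    parts.append(re.escape(normalize_text(template[pos:])))
    return "".join(parts)
```

Because the `match` field bypasses `normalize_text`, a class such as `\D` or `\W` inside it is never lowercased, which is the kind of damage that normalizing after conversion can cause.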

PR: #18

Words-bag Pre-filtering

Regular expression testing is accurate but slow. To reduce inference time, I introduced word-based testing as a first step. From each license template we can extract its essential words, and we can then quickly determine whether a document contains these words before running the regular expression.

[Figure: two_step_testing]
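The two-step test can be sketched roughly as below. The function names and the way words are tokenized are assumptions made for illustration; YALM's actual word extraction may differ.

```python
import re

WORD = re.compile(r"[a-z0-9]+")


def words(text: str) -> set:
    """Split lowercased text into a set of alphanumeric words."""
    return set(WORD.findall(text.lower()))


def match_license(document: str, required_words: set, pattern: "re.Pattern") -> bool:
    """Step 1: cheap word-set check. Step 2: regex test only if step 1 passes."""
    if not required_words <= words(document):
        # At least one essential word is missing, so the regex cannot match.
        return False
    return pattern.search(document.lower()) is not None


# Hypothetical miniature "template": its essential words and its regex.
required = words("Permission is hereby granted")
pattern = re.compile(r"permission is hereby granted")
```

The word-set check is only a necessary condition: a document that contains all the words in the wrong order passes the first step and is then rejected by the regular expression.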

PR: #18

Random Testcases

SPDX manages hundreds of different licenses, and the number is growing daily. Since it is difficult to manually set up test cases for all of these licenses, I generated random license documents from license templates as test cases.
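One simple way to generate such documents is sketched below. It is an assumption-laden illustration rather than the generator from the PR: each optional block is kept or dropped at random, every variable is replaced by its `original` text, and nested optional blocks are not handled.

```python
import random
import re

# Assumed markup shapes, with the var fields in this exact order.
OPTIONAL = re.compile(r"<<beginOptional>>(.*?)<<endOptional>>", re.S)
VAR = re.compile(r'<<var;name="[^"]*";original="([^"]*)";match="[^"]*">>')


def generate_sample(template: str, rng: random.Random) -> str:
    """Produce one random document that the template should accept."""
    # Keep or drop each optional block with probability 1/2.
    text = OPTIONAL.sub(
        lambda m: m.group(1) if rng.random() < 0.5 else "", template
    )
    # Replace each variable with its original text.
    return VAR.sub(lambda m: m.group(1), text)
```

Passing a seeded `random.Random` keeps the generated test cases reproducible between runs.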

PR: #22

YALM Resources

YALM uses a variety of resources, including license templates. These resources are updated from time to time, so they need to be regenerated. However, regenerating them at runtime leads to unnecessary overhead. In order to solve this problem, I have developed a framework for managing various resources.

The resources listed below are available at m1kit/yalm-resources and are updated periodically.

  • License List
  • Templates
  • Real Samples for Testing
  • Generated Samples for Testing
  • Pre-transpiled Regex
  • Pre-extracted Word-set
  • equivalentwords.txt
  • expected-duplicates.json

Keeping these resources separate also lets us reuse them when we port this library to another language.


Repo: m1kit/yalm-resources

Publish on PyPI

I released an alpha version of this library on PyPI. It can now be installed like this:

pip install yalm

Timeout / Multiprocessing

Due to catastrophic backtracking, the regex engine sometimes hangs. To prevent ReDoS, I introduced a timeout mechanism. Additionally, I added support for multiprocessing.
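A timeout of this kind can be sketched with the standard library alone by running the match in a child process and killing it if it takes too long. This is not the implementation from the PR: `match_with_timeout` is a hypothetical name, and the sketch assumes a Unix-like system where the `fork` start method is available.

```python
import multiprocessing as mp
import re

# Assumption: the "fork" start method exists (Unix-like systems only).
_CTX = mp.get_context("fork")


def _worker(pattern, text, conn):
    """Run the match in the child process and send back a boolean."""
    conn.send(re.fullmatch(pattern, text) is not None)
    conn.close()


def match_with_timeout(pattern, text, timeout=1.0):
    """Return True/False for a finished match, or None if it timed out."""
    receiver, sender = _CTX.Pipe(duplex=False)
    proc = _CTX.Process(target=_worker, args=(pattern, text, sender))
    proc.start()
    sender.close()  # the parent only reads
    try:
        if receiver.poll(timeout):
            return receiver.recv()
        return None
    finally:
        if proc.is_alive():
            # Python's re cannot be interrupted from outside, so kill the process.
            proc.kill()
        proc.join()
        receiver.close()
```

A pattern such as `(a+)+$` applied to a long run of `a` followed by `b` is a classic catastrophic-backtracking case: here it yields `None` after the timeout instead of blocking the caller.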

PR: #28

Misc

  • #6: refactoring
  • #25: minor accuracy improvement
  • #29: minor performance improvement

Future Work

API Documentation

Accuracy Improvement

Differences Extraction

Better Testing

Merge into SPDX libraries

Porting

Acknowledgements

First of all, I would like to thank @anshuldutt21 san for creating the first code base. The idea of transpiling the templates into regular expressions to follow the guidelines was his, and his implementation was very well thought out. This project would not be what it is without his efforts.

Also, this project would not have been possible without the mentoring of @goneall san. The SPDX community has been working on the problem of license matching for many years, and its members know about issues and solutions that I was not aware of. They were kind enough to share the relevant knowledge with me whenever the situation required it.

Finally, I would like to thank the members of the community for their discussions with me, and the GSoC staff for giving me this great opportunity.

