feat: add license matcher task #1453
CSV with the matches of licenses and feeds in the DEV environment
Pull Request Overview
This pull request introduces a comprehensive license matching system for the Mobility Feed API, enabling automatic resolution of feed license URLs to SPDX identifiers. The implementation includes database schema changes, matching algorithms with multiple strategies (exact, fuzzy, heuristic), and a new task executor for batch license processing.
Key Changes
- Implements multi-strategy license URL matching (exact match, Creative Commons resolver, heuristic patterns, fuzzy string matching) with special handling for localized/regional license variants
- Adds database schema for license tracking on feeds and audit trail for license changes
- Provides batch processing capability with CSV export support for license matching results
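The multi-strategy matching listed above can be pictured as a cascade that tries each strategy in turn and stops at the first hit. The following is only a minimal sketch under stated assumptions: the `Match` dataclass, the `KNOWN` catalogue, the strategy order, and the 0.9 fuzzy threshold are all hypothetical and not taken from this PR, and the Creative Commons resolver and heuristic patterns are omitted for brevity.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Optional, Sequence


@dataclass
class Match:
    """Hypothetical result type; the PR's own dataclass may differ."""
    spdx_id: str
    strategy: str
    confidence: float


# Hypothetical catalogue of already-normalized license URLs.
KNOWN = {
    "creativecommons.org/licenses/by/4.0": "CC-BY-4.0",
    "opendatacommons.org/licenses/odbl/1-0": "ODbL-1.0",
}


def exact(url: str) -> Optional[Match]:
    """Strategy 1: the URL is present in the catalogue as-is."""
    spdx = KNOWN.get(url)
    return Match(spdx, "exact", 1.0) if spdx else None


def fuzzy(url: str, threshold: float = 0.9) -> Optional[Match]:
    """Last-resort strategy: closest catalogue URL by similarity ratio."""
    def score(candidate: str) -> float:
        return SequenceMatcher(None, url, candidate).ratio()

    best = max(KNOWN, key=score)
    best_score = score(best)
    if best_score >= threshold:
        return Match(KNOWN[best], "fuzzy", best_score)
    return None


def resolve(
    url: str,
    strategies: Sequence[Callable[[str], Optional[Match]]] = (exact, fuzzy),
) -> Optional[Match]:
    """Run the strategies in order and return the first match, if any."""
    for strategy in strategies:
        match = strategy(url)
        if match is not None:
            return match
    return None
```

With this layout, `resolve("creativecommons.org/licences/by/4.0")` (note the misspelling) falls through the exact step and is picked up by the fuzzy step, while an unrelated URL returns `None` and would be left for manual review.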
Reviewed Changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| `liquibase/changes/feat_1433.sql` | Adds license_id foreign key to feed table and creates feed_license_change audit table with indexes |
| `liquibase/changelog.xml` | Includes the new database migration file |
| `scripts/populate-licenses.sh` | New script to populate license rules and licenses from GitHub repository |
| `scripts/docker-localdb-rebuild-data.sh` | Integrates license population into database rebuild workflow |
| `functions-python/tasks_executor/src/main.py` | Adds CSV response format support and match_licenses task handler |
| `functions-python/tasks_executor/src/tasks/licenses/license_matcher.py` | Implements core license matching task with batch processing and feed updates |
| `functions-python/tasks_executor/src/tasks/licenses/populate_licenses.py` | Updates license population logic to handle new vs existing licenses explicitly |
| `functions-python/tasks_executor/requirements.txt` | Adds fastapi_filter dependency for feed filtering functionality |
| `functions-python/tasks_executor/function_config.json` | Includes feed_filters module in deployment package |
| `functions-python/helpers/query_helper.py` | Adds URL normalization utilities (duplicated from api/src/shared/common/db_utils.py) |
| `api/src/shared/common/db_utils.py` | Adds URL normalization utilities and feed query helpers |
| `api/src/shared/common/license_utils.py` | Implements comprehensive license URL resolution with Creative Commons handling and fuzzy matching |
| `api/tests/utils/test_license_utils.py` | Comprehensive unit tests for license resolution utilities |
| `functions-python/tasks_executor/tests/test_main.py` | Updates tests to handle new accept content type parameter |
| `functions-python/tasks_executor/tests/tasks/populate_licenses_and_rules/test_populate_licenses.py` | Updates tests for modified license population logic |
| `functions-python/tasks_executor/tests/license_matcher/test_license_matcher.py` | New comprehensive test suite for license matching functionality |
| `functions-python/tasks_executor/README.md` | Documents CSV response format capability |
| `docs/LICENSES.md` | New documentation explaining license matching system and data sources |
…cher.py Co-authored-by: Copilot <[email protected]>
Total number of licenses per data_type and SPDX ID in the DEV environment, for all feeds and for only published feeds.
@davidgamez This is wonderful! We will likely want to re-run this once the TDG data is added. It looks like there will still be a lot of manual review necessary for GTFS, but at least this removes almost all work on the GBFS side and for our automatically imported regional data.
I added a parameter to go over the feeds that are not matched, so this can be run anytime in the future. We still need to think about how to integrate this when a new feed is created and from the "add feed form", now that the core building blocks are implemented.
@davidgamez This also makes me think that in the license filter in the UI, we should find some way to group by type of license (e.g. all the Creative Commons versions clustered together).
qcdyx
left a comment
My question is answered in a Slack call. LGTM
jcpitre
left a comment
MO-NU-MEN-TAL!
Summary:
Closes #1433
This PR introduces a license matcher task function.
The matching process follows these steps: Exact Match, Fuzzy Match, and Domain Match.
If a license URL points to a localized version (for example, a country-specific version of a Creative Commons license), the system identifies the corresponding standard SPDX license and adds a note explaining the regional variant. For example, some Japanese feeds whose license is CC-BY-2.1-jp are matched to CC-BY-2.0.
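To make the regional-variant handling concrete, here is a rough sketch of how a ported Creative Commons URL could be mapped to a standard SPDX ID with an explanatory note. It is an illustration under assumptions, not the code in `license_utils.py`: the `resolve_cc` name, the `KNOWN_VERSIONS` list, and the "nearest version not above the requested one" fallback rule are hypothetical, chosen only because they reproduce the CC-BY-2.1-jp to CC-BY-2.0 example above.

```python
import re
from typing import Optional, Tuple

# Hypothetical list of standard (unported) versions, in ascending order.
KNOWN_VERSIONS = ["1.0", "2.0", "2.5", "3.0", "4.0"]

# Captures the license kind, the version, and an optional 2-3 letter locale.
CC_URL = re.compile(
    r"creativecommons\.org/licenses/(?P<kind>[a-z-]+)/(?P<version>\d\.\d)"
    r"(?:/(?P<locale>[a-z]{2,3}))?(?:/|$)"
)


def resolve_cc(url: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (spdx_id, note) for a Creative Commons URL, or None."""
    m = CC_URL.search(url.lower())
    if m is None:
        return None
    kind, version, locale = m["kind"], m["version"], m["locale"]

    # Assumed rule: fall back to the nearest known version that is not
    # higher than the version found in the URL.
    candidates = [v for v in KNOWN_VERSIONS if float(v) <= float(version)]
    if not candidates:
        return None
    base = candidates[-1]
    spdx_id = f"CC-{kind.upper()}-{base}"

    note = None
    if locale or base != version:
        # Record why the SPDX ID differs from the literal URL.
        note = (
            f"Regional variant (locale={locale}, version={version}) "
            f"mapped to {spdx_id}"
        )
    return spdx_id, note
```

Under these assumptions, `resolve_cc("https://creativecommons.org/licenses/by/2.1/jp/")` yields `CC-BY-2.0` together with a note mentioning the `jp` locale, while an unported URL such as `.../by-sa/4.0/` resolves without a note.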
From our AI friend
This pull request introduces new utilities for normalizing URLs and resolving license URLs to SPDX identifiers, with a focus on Creative Commons and other common open data licenses. The changes enhance the ability to match license URLs in different formats using exact, heuristic, and fuzzy strategies. The most important changes are grouped below.
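As a rough illustration of the URL normalization mentioned in this summary, the sketch below strips the protocol, a `www.` prefix, the query string, the fragment, and any trailing slash so that two spellings of the same license URL compare equal. The function name `normalize_for_comparison` is hypothetical, and lowercasing the input is an added assumption; the actual helpers in `db_utils.py` may behave differently.

```python
import re


def normalize_for_comparison(url: str) -> str:
    """Return a comparison-friendly form of a URL (illustrative only)."""
    u = (url or "").strip().lower()           # assumption: compare case-insensitively
    u = re.sub(r"^https?://", "", u)          # drop the protocol
    u = re.sub(r"^www\.", "", u)              # drop a leading "www."
    u = re.split(r"[?#]", u, maxsplit=1)[0]   # drop query parameters and fragment
    return u.rstrip("/")                      # drop trailing slashes


a = normalize_for_comparison("https://www.Example.com/licenses/by/4.0/?lang=en#top")
b = normalize_for_comparison("http://example.com/licenses/by/4.0")
print(a == b)  # both reduce to "example.com/licenses/by/4.0"
```

Normalizing both the stored license URLs and the incoming feed URL in the same way is what lets the exact-match step succeed on URLs that differ only in scheme, prefix, or trailing punctuation.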
License URL Matching and Resolution:
- Added `license_utils.py` with functions to resolve license URLs to SPDX IDs using exact match, Creative Commons-specific logic, common URL patterns, and fuzzy string matching. Includes a comprehensive explanation for handling locale/jurisdiction ports and version normalization for Creative Commons licenses.
- Added the `MatchingLicense` dataclass to standardize the structure of license resolution results.

URL Normalization Utilities:
- Added `normalize_url`, `normalize_url_str`, and related helper functions to `db_utils.py` for normalizing URLs by removing protocols, `www.` prefixes, trailing slashes, fragments, and query parameters, both for SQL queries and Python string comparison.

Codebase Maintenance:
- Updated `db_utils.py` to use absolute imports for `GbfsFeedFilter` and `GbfsVersionFilter`, improving code clarity and reliability.
- Added the `re` import to `db_utils.py` for regular expression operations used in URL normalization.

Expected behavior:
Licenses are matched correctly by their URL to a correct spdx_id.
Testing tips:
`output.csv` file.

Please make sure these boxes are checked before submitting your pull request - thanks!
- Run `./scripts/api-tests.sh` to make sure you didn't break anything