@davidgamez
Member

@davidgamez davidgamez commented Nov 11, 2025

Summary:
Closes #1433
This PR introduces a license matcher task function.
The matching process runs three steps: Exact Match, Fuzzy Match, and Domain Match.

If a license URL points to a localized version (for example, a country-specific version of a Creative Commons license),
the system identifies the corresponding standard SPDX license and adds a note explaining the regional variant. For example, some Japanese feeds licensed under CC-BY-2.1-jp are matched to CC-BY-2.0.
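
To illustrate the regional-variant handling described above, here is a minimal sketch. It is not the code in this PR: the regex, the `STANDARD_VERSIONS` tuple, the `resolve_cc` name, and the "fall back to the closest lower standard version" rule are all assumptions, chosen only so that CC-BY-2.1-jp lands on CC-BY-2.0 as in the example. CC0 and public-domain URLs are deliberately out of scope.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for Creative Commons license URLs such as
# creativecommons.org/licenses/by/2.1/jp/ (not taken from the PR).
CC_URL = re.compile(
    r"creativecommons\.org/licenses/(?P<kind>[a-z-]+)/(?P<version>\d+\.\d+)"
    r"(?:/(?P<locale>[a-z]{2})(?=/|$))?",
    re.IGNORECASE,
)

# Assumed list of versions that exist for the standard (non-ported) licenses.
STANDARD_VERSIONS = ("1.0", "2.0", "2.5", "3.0", "4.0")


def resolve_cc(url: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (spdx_id, note) for a CC license URL, or None if it is not one."""
    match = CC_URL.search(url)
    if match is None:
        return None
    kind = match.group("kind").upper()
    version = match.group("version")
    locale = match.group("locale")

    # Assumption: a ported version with no standard counterpart falls back
    # to the closest lower standard version (2.1 -> 2.0).
    lower = [v for v in STANDARD_VERSIONS if float(v) <= float(version)]
    if not lower:
        return None
    standard = lower[-1]

    notes = []
    if locale:
        notes.append(f"jurisdiction port '{locale.lower()}'")
    if standard != version:
        notes.append(f"version {version} normalized to {standard}")
    return f"CC-{kind}-{standard}", "; ".join(notes) or None


print(resolve_cc("https://creativecommons.org/licenses/by/2.1/jp/"))
```

The note is kept as free text here; a real implementation would likely store it next to the match so reviewers can see why a regional license was mapped to a standard one.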

From our AI friend

This pull request introduces new utilities for normalizing URLs and resolving license URLs to SPDX identifiers, with a focus on Creative Commons and other common open data licenses. The changes enhance the ability to match license URLs in different formats using exact, heuristic, and fuzzy strategies. The most important changes are grouped below.

License URL Matching and Resolution:

  • Added license_utils.py with functions to resolve license URLs to SPDX IDs using exact match, Creative Commons-specific logic, common URL patterns, and fuzzy string matching. Includes a comprehensive explanation for handling locale/jurisdiction ports and version normalization for Creative Commons licenses.
  • Introduced the MatchingLicense dataclass to standardize the structure of license resolution results.
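
As a rough illustration of how a chain of matching strategies can feed one result type, here is a sketch. The field names on `MatchingLicense`, the `KNOWN` table, the `resolve` function, the 0.9 threshold, and the meaning given to the domain step are all assumptions for the example; the real dataclass and logic in license_utils.py may differ. Fuzzy matching uses `difflib.SequenceMatcher` from the standard library as a stand-in for whatever the PR uses.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class MatchingLicense:
    # Hypothetical fields, for illustration only.
    spdx_id: str
    match_type: str
    confidence: float


# Toy lookup table: normalized license URL -> SPDX ID.
KNOWN = {
    "creativecommons.org/licenses/by/4.0": "CC-BY-4.0",
    "creativecommons.org/publicdomain/zero/1.0": "CC0-1.0",
    "opendatacommons.org/licenses/odbl/1-0": "ODbL-1.0",
}


def _strip(url: str) -> str:
    """Drop the scheme, a leading www., and trailing slashes."""
    url = re.sub(r"^https?://", "", url.strip().lower())
    return re.sub(r"^www\.", "", url).rstrip("/")


def resolve(url: str, fuzzy_threshold: float = 0.9) -> Optional[MatchingLicense]:
    key = _strip(url)

    # 1. Exact match on the normalized URL.
    if key in KNOWN:
        return MatchingLicense(KNOWN[key], "exact", 1.0)

    # 2. Fuzzy match: keep the most similar known URL if it is close enough.
    best_url, best_score = None, 0.0
    for candidate in KNOWN:
        score = SequenceMatcher(None, key, candidate).ratio()
        if score > best_score:
            best_url, best_score = candidate, score
    if best_url is not None and best_score >= fuzzy_threshold:
        return MatchingLicense(KNOWN[best_url], "fuzzy", best_score)

    # 3. Domain match (assumed meaning): accept only when the domain hosts
    #    exactly one known license, and report a low confidence.
    domain = key.split("/", 1)[0]
    ids = {spdx for known, spdx in KNOWN.items() if known.split("/", 1)[0] == domain}
    if len(ids) == 1:
        return MatchingLicense(ids.pop(), "domain", 0.5)
    return None


print(resolve("https://opendatacommons.org/licenses/odbl/1.0/"))
```

One caveat with plain string similarity: URLs that differ only in the version segment (by/3.0 vs by/4.0) score very high, which is one reason a Creative Commons-specific step before the fuzzy step makes sense.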

URL Normalization Utilities:

  • Added normalize_url, normalize_url_str, and related helper functions to db_utils.py for normalizing URLs by removing protocols, www. prefixes, trailing slashes, fragments, and query parameters, both for SQL queries and Python string comparison.
  • Implemented utility functions to query feeds by normalized URL, ensuring deprecated feeds are excluded and normalization is consistent across queries.
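
For readers who want a feel for the string-side normalization, a minimal sketch follows. `normalize_url_sketch` is a hypothetical stand-in, not the `normalize_url_str` added in db_utils.py; it only applies the steps listed above, and the final lowercasing is an extra assumption that may not match the PR.

```python
import re


def normalize_url_sketch(url: str) -> str:
    """Illustrative URL normalization for string comparison."""
    value = url.strip()
    # Remove the protocol (http://, https://, ...).
    value = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", value)
    # Remove a leading www. prefix.
    value = re.sub(r"^www\.", "", value, flags=re.IGNORECASE)
    # Drop query parameters and fragments.
    value = re.split(r"[?#]", value, maxsplit=1)[0]
    # Drop trailing slashes; lowercasing is an assumption of this sketch.
    return value.rstrip("/").lower()


print(normalize_url_sketch("https://www.Example.com/licenses/by/4.0/?lang=en#top"))
# example.com/licenses/by/4.0
```

The SQL-side counterpart would need to apply the same steps in the same order, otherwise a URL normalized in Python may not match the value computed in the query.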

Codebase Maintenance:

  • Fixed import paths in db_utils.py to use absolute imports for GbfsFeedFilter and GbfsVersionFilter, improving code clarity and reliability.
  • Added missing re import to db_utils.py for regular expression operations used in URL normalization.

Expected behavior:

Each license URL is matched to the correct spdx_id.

Testing tips:

  • Populate your local DB with data using
./scripts/docker-localdb-rebuild-data.sh --populate-db
  • Run the task executor function locally
  • Use the following curl command
curl -X POST "https://tasks-executor-dev-978785769226.northamerica-northeast1.run.app" \
-H "Authorization: bearer $(gcloud auth print-identity-token)" \
-H "Content-Type: application/json" \
-H "Accept: text/csv" \
-d '{
  "task": "match_licenses",
  "payload": {
  "dry_run": true,
  "only_unmatched": false
  }
}' | sed 's/\\r//g; s/\\n/\n/g' > output.csv
  • Verify the content of the output.csv file.
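
If sed is not available, the same cleanup can be done in Python. This is a sketch under the assumption that the response body contains literal `\r` and `\n` escape sequences, which is what the sed expression above rewrites; the `unescape_csv_body` name and the column names in the sample are made up for illustration.

```python
import csv
import io
from typing import List


def unescape_csv_body(raw: str) -> List[List[str]]:
    """Mirror the sed step: drop literal \\r, turn literal \\n into newlines."""
    text = raw.replace("\\r", "").replace("\\n", "\n")
    return list(csv.reader(io.StringIO(text)))


# Hypothetical response body; the real columns may differ.
sample = "feed_id,spdx_id\\r\\nmdb-1,CC-BY-4.0\\r\\n"
for row in unescape_csv_body(sample):
    print(row)
```

Parsing with the csv module instead of splitting on commas keeps quoted fields intact, which matters if any column contains a URL with commas or a free-text note.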

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with ./scripts/api-tests.sh to make sure you didn't break anything
  • Add or update any needed documentation to the repo
  • Format the title like "feat: [new feature short description]". Title must follow the Conventional Commit Specification (https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • Include screenshot(s) showing how this pull request works and fixes the issue(s)

@davidgamez
Member Author

CSV with the license-to-feed matches in the DEV environment:
licenses.csv

Contributor

Copilot AI left a comment

Pull Request Overview

This pull request introduces a comprehensive license matching system for the Mobility Feed API, enabling automatic resolution of feed license URLs to SPDX identifiers. The implementation includes database schema changes, matching algorithms with multiple strategies (exact, fuzzy, heuristic), and a new task executor for batch license processing.

Key Changes

  • Implements multi-strategy license URL matching (exact match, Creative Commons resolver, heuristic patterns, fuzzy string matching) with special handling for localized/regional license variants
  • Adds database schema for license tracking on feeds and audit trail for license changes
  • Provides batch processing capability with CSV export support for license matching results

Reviewed Changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| liquibase/changes/feat_1433.sql | Adds license_id foreign key to feed table and creates feed_license_change audit table with indexes |
| liquibase/changelog.xml | Includes the new database migration file |
| scripts/populate-licenses.sh | New script to populate license rules and licenses from GitHub repository |
| scripts/docker-localdb-rebuild-data.sh | Integrates license population into database rebuild workflow |
| functions-python/tasks_executor/src/main.py | Adds CSV response format support and match_licenses task handler |
| functions-python/tasks_executor/src/tasks/licenses/license_matcher.py | Implements core license matching task with batch processing and feed updates |
| functions-python/tasks_executor/src/tasks/licenses/populate_licenses.py | Updates license population logic to handle new vs existing licenses explicitly |
| functions-python/tasks_executor/requirements.txt | Adds fastapi_filter dependency for feed filtering functionality |
| functions-python/tasks_executor/function_config.json | Includes feed_filters module in deployment package |
| functions-python/helpers/query_helper.py | Adds URL normalization utilities (duplicated from api/src/shared/common/db_utils.py) |
| api/src/shared/common/db_utils.py | Adds URL normalization utilities and feed query helpers |
| api/src/shared/common/license_utils.py | Implements comprehensive license URL resolution with Creative Commons handling and fuzzy matching |
| api/tests/utils/test_license_utils.py | Comprehensive unit tests for license resolution utilities |
| functions-python/tasks_executor/tests/test_main.py | Updates tests to handle new accept content type parameter |
| functions-python/tasks_executor/tests/tasks/populate_licenses_and_rules/test_populate_licenses.py | Updates tests for modified license population logic |
| functions-python/tasks_executor/tests/license_matcher/test_license_matcher.py | New comprehensive test suite for license matching functionality |
| functions-python/tasks_executor/README.md | Documents CSV response format capability |
| docs/LICENSES.md | New documentation explaining license matching system and data sources |

@davidgamez davidgamez marked this pull request as ready for review November 12, 2025 23:38
@davidgamez
Member Author

davidgamez commented Nov 12, 2025

Total number of licenses per data_type and SPDX ID in the DEV environment:

All feeds

license_id data_type total
CC-BY-4.0 gbfs 650
CC0-1.0 gbfs 205
CC-BY-4.0 gtfs 167
CC0-1.0 gtfs 67
CC0-1.0 gtfs_rt 59
OGL-UK-3.0 gtfs 30
CDLA-Permissive-2.0 gbfs 23
DL-DE-BY-2.0 gbfs 21
CC-BY-4.0 gtfs_rt 14
CC-BY-2.0 gtfs 7
ODbL-1.0 gtfs 5
CC-BY-SA-3.0 gbfs 1
DL-DE-BY-2.0 gtfs 1
CC-BY-NC-SA-4.0 gtfs 1
CC-BY-NC-4.0 gtfs 1
CC-BY-ND-2.0 gtfs 1
PDDL-1.0 gtfs 1

Only published feeds:

license_id data_type total
CC-BY-4.0 gbfs 650
CC0-1.0 gbfs 205
CC-BY-4.0 gtfs 31
OGL-UK-3.0 gtfs 30
CDLA-Permissive-2.0 gbfs 23
DL-DE-BY-2.0 gbfs 21
CC0-1.0 gtfs 14
ODbL-1.0 gtfs 5
CC0-1.0 gtfs_rt 3
PDDL-1.0 gtfs 1
CC-BY-3.0 gtfs 1
CC-BY-SA-3.0 gbfs 1
DL-DE-BY-2.0 gtfs 1
CC-BY-NC-SA-4.0 gtfs 1
CC-BY-NC-4.0 gtfs 1
CC-BY-ND-2.0 gtfs 1
CC-BY-4.0 gtfs_rt 1
CC-BY-2.0 gtfs 1

@emmambd
Collaborator

emmambd commented Nov 13, 2025

@davidgamez This is wonderful! We will likely want to re-run this once the TDG data is added. It looks like there will still be a lot of manual review necessary for GTFS, but at least this removes almost all work on the GBFS side and for our automatically imported regional data.

@davidgamez
Member Author

> @davidgamez This is wonderful! We will likely want to re-run this once the TDG data is added. It looks like there will still be a lot of manual review necessary for GTFS, but at least this removes almost all work on the GBFS side and for our automatically imported regional data.

I added a parameter to process only the feeds that are not yet matched, so this can be rerun at any time. We still need to think about how to integrate this when a new feed is created and from the "add feed form", now that the core building blocks are implemented.

@emmambd
Collaborator

emmambd commented Nov 13, 2025

@davidgamez This also makes me think that in the license filter in the UI, we should find some way to group by type of license (e.g. all the Creative Commons versions clustered together).
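
One possible way to do that grouping, sketched from the SPDX IDs seen in the DEV totals. The `FAMILY_PREFIXES` mapping, the family labels, and the `group_by_family` name are assumptions for discussion, not an agreed design.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical mapping from SPDX ID prefix to a display family.
FAMILY_PREFIXES = (
    ("CC0-", "Creative Commons"),
    ("CC-", "Creative Commons"),
    ("ODbL-", "Open Data Commons"),
    ("PDDL-", "Open Data Commons"),
)


def group_by_family(spdx_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Cluster SPDX IDs under a family label; unknown prefixes go to 'Other'."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for spdx_id in spdx_ids:
        family = next(
            (label for prefix, label in FAMILY_PREFIXES if spdx_id.startswith(prefix)),
            "Other",
        )
        groups[family].append(spdx_id)
    return dict(groups)


print(group_by_family(["CC-BY-4.0", "CC0-1.0", "ODbL-1.0", "OGL-UK-3.0"]))
```

A prefix table like this is easy to maintain by hand, but a family column on the license table would probably be more robust if the list of licenses keeps growing.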

Contributor

@qcdyx qcdyx left a comment

My question was answered in a Slack call. LGTM.

@davidgamez davidgamez requested a review from jcpitre November 19, 2025 19:41
Collaborator

@jcpitre jcpitre left a comment

MO-NU-MEN-TAL!

@davidgamez davidgamez merged commit 4820180 into main Nov 19, 2025
2 of 3 checks passed
@davidgamez davidgamez deleted the feat/license_matcher branch November 19, 2025 20:00
Development

Successfully merging this pull request may close these issues.

Create a script to link current licenses based on the license URL

5 participants