Conversation
Pull request overview
This PR adds lifecycle support for OpenStreetMap data, enabling filtering of features by their operational status (operating, construction, proposed, planned, disused, abandoned). The implementation extends the filtering pipeline to capture lifecycle-prefixed tags and adds an allowed_statuses parameter throughout the extraction pipeline.
Key changes include:
- Extension of the primary filter to include lifecycle prefixes (e.g., "construction:power", "proposed:power")
- Addition of the `allowed_statuses` parameter to filter features by operational status
- Integration of lifecycle processing across both legacy and streaming backends
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| earth_osm/filter.py | Extends the pre_filter to include lifecycle-prefixed tags for comprehensive OSM data extraction |
| earth_osm/eo.py | Adds allowed_statuses parameter to process_region and save_osm_data functions with inconsistent indentation |
| earth_osm/backends.py | Integrates lifecycle processing in all backends (geofabrik_legacy, geofabrik_stream, overpass) with missing module import |
| earth_osm/args.py | Adds CLI argument --statuses to enable status filtering from command line |
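
The `--statuses` flag described above could be wired up roughly like this; the exact option name is taken from the table, but the choices, default, and argparse details in `earth_osm/args.py` are assumptions:

```python
import argparse

# Hypothetical sketch of the --statuses flag; the choices mirror the
# lifecycle statuses listed in the PR overview.
parser = argparse.ArgumentParser(prog="earth_osm")
parser.add_argument(
    "--statuses",
    nargs="+",
    choices=["operating", "construction", "proposed", "planned", "disused", "abandoned"],
    default=None,
    help="keep only features whose lifecycle status is in this list",
)

args = parser.parse_args(["--statuses", "operating", "construction"])
# args.statuses can then be passed through as allowed_statuses
```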
Critical Issues Identified:
- The imported `earth_osm.lifecycle` module does not exist in the repository, which will cause runtime failures
- Missing parameter propagation in several code paths, causing inconsistent behavior
- Logger argument order mismatch
- Inconsistent code indentation deviating from Python standards
Comments suppressed due to low confidence (2)
earth_osm/eo.py:115
- The docstring for process_region does not document the new allowed_statuses parameter. According to the documentation pattern established in the codebase (e.g., save_osm_data at lines 219-233), parameters should be documented.
"""Process a single region for a feature.
When ``stream`` is ``True`` and the geofabrik backend is used the function
returns an iterator of flattened feature dictionaries. Otherwise it returns
a :class:`pandas.DataFrame` to preserve the historic API.
"""
earth_osm/eo.py:233
- The docstring for save_osm_data does not document the new allowed_statuses parameter. According to the documentation pattern established in this function (lines 219-233), all parameters should be documented in the args section.
"""
Get OSM Data for a list of regions and features
args:
region_list: list of regions to get data for
primary_name: primary feature to get data for
feature_list: list of features to get data for
update: update data
mp: use multiprocessing
stream_backend: when ``True`` (default) use the streaming pipeline for
GeoFabrik sources; set to ``False`` to revert to the legacy
in-memory pipeline (primarily for benchmarking)
target_date: optional target date for historical data
returns:
dict of dataframes
"""
```python
    stream_cached_primary_features,
    stream_pbf_features,
)
from earth_osm.lifecycle import add_lifecycle_columns, filter_by_status
```
The module 'earth_osm.lifecycle' is imported but does not exist in the repository. This import will cause a runtime error. You need to create the lifecycle.py file with the required functions: add_lifecycle_columns, filter_by_status, and determine_type_and_status.
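
A minimal sketch of what the missing module might contain, assuming the tag semantics described in the PR overview; every implementation detail below is a guess, only the three function names come from the review:

```python
# Hypothetical earth_osm/lifecycle.py; tag semantics are assumptions.
LIFECYCLE_PREFIXES = ("construction", "proposed", "planned", "disused", "abandoned")


def determine_type_and_status(tags, primary_name="power"):
    """Map a tag dict to (feature_type, status).

    A plain primary tag (power=tower) is treated as "operating"; a
    lifecycle-prefixed tag (construction:power=tower) carries its
    prefix as the status.
    """
    if tags.get(primary_name) is not None:
        return tags[primary_name], "operating"
    for prefix in LIFECYCLE_PREFIXES:
        value = tags.get(f"{prefix}:{primary_name}")
        if value is not None:
            return value, prefix
    return None, None


def add_lifecycle_columns(rows, primary_name="power"):
    """Return rows with 'feature_type' and 'status' keys attached."""
    out = []
    for row in rows:
        feature_type, status = determine_type_and_status(row.get("tags", {}), primary_name)
        out.append({**row, "feature_type": feature_type, "status": status})
    return out


def filter_by_status(rows, allowed_statuses=None):
    """Drop rows whose status is not in allowed_statuses (None keeps all)."""
    if not allowed_statuses:
        return list(rows)
    allowed = set(allowed_statuses)
    return [r for r in rows if r.get("status") in allowed]
```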
```python
    os.path.basename(pbf_url),
    primary_name,
    feature_name,
```
The logger.info arguments are in the wrong order. According to the pattern established in earth_osm/stream.py (lines 859-864, 1030-1034), the format should be: "Region %s (%s=%s): downloading %s" with arguments (region.short, primary_name, feature_name, os.path.basename(pbf_url)). Currently, the arguments are ordered as (region.short, os.path.basename(pbf_url), primary_name, feature_name), which doesn't match the format string.
Suggested change:
```diff
-    os.path.basename(pbf_url),
-    primary_name,
-    feature_name,
+    primary_name,
+    feature_name,
+    os.path.basename(pbf_url),
```
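
Because %-style logging fills placeholders strictly left to right, the argument order must mirror the format string; a small self-contained check (the region code and URL are made up for illustration):

```python
fmt = "Region %s (%s=%s): downloading %s"

# Wrong order (as in the PR): the filename lands in the primary-name slot
wrong = fmt % ("BJ", "benin-latest.osm.pbf", "power", "line")
# -> 'Region BJ (benin-latest.osm.pbf=power): downloading line'

# Correct order, matching the pattern cited from earth_osm/stream.py
right = fmt % ("BJ", "power", "line", "benin-latest.osm.pbf")
# -> 'Region BJ (power=line): downloading benin-latest.osm.pbf'
```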
```python
                        region_obj,
                        primary_name,
                        feature_name_obj,
                        mp,
                        update,
                        data_dir,
                        progress_bar=progress_bar,
                        data_source=data_source,
                        stream=False,
                        cache_primary=cache_primary,
                        allowed_statuses=allowed_statuses,  # PASS THROUGH
                    )
```
Inconsistent indentation: these lines use excessive indentation (24 spaces) compared to the standard 4-space indentation used throughout the codebase. This breaks Python's standard indentation conventions and makes the code harder to read.
Suggested change:
```python
        region_obj,
        primary_name,
        feature_name_obj,
        mp,
        update,
        data_dir,
        progress_bar=progress_bar,
        data_source=data_source,
        stream=False,
        cache_primary=cache_primary,
        allowed_statuses=allowed_statuses,  # PASS THROUGH
    )
```
```python
for row in rows:
    # Build tags dict from row
    tags = {k[5:]: v for k, v in row.items() if k.startswith('tags.') and v is not None}

    # Determine actual type and status
    actual_type, status = determine_type_and_status(tags)

    # Skip if wrong type
    if actual_type != feature_name:
        continue

    # Skip if status not allowed
    if allowed_statuses and status not in allowed_statuses:
        continue

    # Add status to row
    row['status'] = status
    row['feature_type'] = actual_type

    yield row
```
The lifecycle_filter_stream function recreates a tags dictionary from row items on every iteration by filtering items that start with 'tags.' This could be inefficient for large datasets. Consider whether the determine_type_and_status function could accept the row directly or if the tags extraction logic could be optimized.
Suggested change:
```python
# Precompute the set of tag keys once to avoid repeated prefix checks
tag_keys = None
for row in rows:
    if tag_keys is None:
        # Identify all columns that represent tags.*
        tag_keys = [k for k in row.keys() if k.startswith('tags.')]
    # Build tags dict from row using precomputed tag keys
    tags = {k[5:]: row[k] for k in tag_keys if row.get(k) is not None}
    # Determine actual type and status
    actual_type, status = determine_type_and_status(tags)
    # Skip if wrong type
    if actual_type != feature_name:
        continue
    # Skip if status not allowed
    if allowed_statuses and status not in allowed_statuses:
        continue
    # Add status to row
    row['status'] = status
    row['feature_type'] = actual_type
    yield row
```
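
The effect of precomputing the tag keys is easy to see on a toy stream; like the suggestion, this assumes every row carries the same columns:

```python
rows = [
    {"id": 1, "tags.power": "line", "tags.voltage": "110000"},
    {"id": 2, "tags.power": None, "tags.voltage": "220000"},
]

tag_keys = None
extracted = []
for row in rows:
    if tag_keys is None:
        # the 'tags.' prefix scan runs exactly once, on the first row
        tag_keys = [k for k in row if k.startswith("tags.")]
    extracted.append({k[5:]: row[k] for k in tag_keys if row.get(k) is not None})
# extracted == [{"power": "line", "voltage": "110000"}, {"voltage": "220000"}]
```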
```python
                    region,
                    primary_name,
                    feature_name,
                    data_source=data_source,
                    use_stream=use_stream,
                    mp=mp,
                    update=update,
                    data_dir=data_dir,
                    progress_bar=progress_bar,
                    cache_primary=cache_primary,
                    allowed_statuses=allowed_statuses,  # PASS THROUGH
```
Inconsistent indentation: these lines use excessive indentation (20 spaces) compared to the standard 4-space indentation used throughout the codebase. This breaks Python's standard indentation conventions and makes the code harder to read.
Suggested change:
```python
        region,
        primary_name,
        feature_name,
        data_source=data_source,
        use_stream=use_stream,
        mp=mp,
        update=update,
        data_dir=data_dir,
        progress_bar=progress_bar,
        cache_primary=cache_primary,
        allowed_statuses=allowed_statuses,  # PASS THROUGH
```
```python
# MODIFIED: Add lifecycle prefixes
lifecycle_prefixes = [
    primary_name,                    # e.g., "power"
    f"construction:{primary_name}",  # e.g., "construction:power"
    f"proposed:{primary_name}",      # e.g., "proposed:power"
    f"planned:{primary_name}",       # e.g., "planned:power"
    f"disused:{primary_name}",       # e.g., "disused:power"
    f"abandoned:{primary_name}",     # e.g., "abandoned:power"
]

# Build pre_filter with all lifecycle prefixes
pre_filter = {}
for element_type in [Node, Way, Relation]:
    pre_filter[element_type] = {}
    for prefix in lifecycle_prefixes:
        pre_filter[element_type][prefix] = feature_list
```
The new lifecycle filtering functionality (including the allowed_statuses parameter, lifecycle prefix filtering, and status determination) lacks test coverage. Given that the repository has comprehensive test coverage for similar features (as seen in tests/test_export.py, tests/test_overpass.py, etc.), tests should be added to verify: 1) lifecycle prefix filtering in run_primary_filter, 2) lifecycle column addition and status filtering in backends, and 3) the allowed_statuses parameter integration throughout the pipeline.
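
A starting point for that coverage might look like the following; `build_pre_filter` is a hypothetical stand-in mirroring the pre_filter construction shown in the diff, since the real function name is not visible here:

```python
LIFECYCLE = ("construction", "proposed", "planned", "disused", "abandoned")


def build_pre_filter(primary_name, feature_list, element_types):
    # Mirrors the loop in the diff: one entry per element type per prefix
    prefixes = [primary_name] + [f"{p}:{primary_name}" for p in LIFECYCLE]
    return {et: {prefix: feature_list for prefix in prefixes} for et in element_types}


def test_lifecycle_prefixes_in_pre_filter():
    pre_filter = build_pre_filter("power", ["line", "tower"], ["Node", "Way", "Relation"])
    for et in ("Node", "Way", "Relation"):
        assert "power" in pre_filter[et]
        assert "construction:power" in pre_filter[et]
        assert pre_filter[et]["proposed:power"] == ["line", "tower"]


test_lifecycle_prefixes_in_pre_filter()
```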
| """Select the appropriate backend and return a tagged payload.""" | ||
|
|
The detailed docstring for fetch_region_backend has been removed. According to the documentation pattern in the codebase, this function previously had a docstring describing the return value format (a tuple where the first element is either "stream" or "dataframe" indicating the payload type, and the second element is the payload itself). This information is valuable for understanding the function's contract and should be preserved or restored.
| """Select the appropriate backend and return a tagged payload.""" | |
| """ | |
| Select and execute the appropriate backend for fetching regional data. | |
| This function encapsulates the logic for choosing between the Geofabrik | |
| and Overpass backends, and between the streaming and legacy (dataframe) | |
| pipelines, based on the provided arguments. | |
| Returns | |
| ------- | |
| BackendResult | |
| A 2-tuple ``(kind, payload)`` where: | |
| * ``kind`` is a string indicating the payload type and will be either | |
| ``"stream"`` or ``"dataframe"``. | |
| * ``payload`` is the corresponding data: | |
| - if ``kind == "stream"``, ``payload`` is an iterator/stream of | |
| feature rows produced by the streaming backend; | |
| - if ``kind == "dataframe"``, ``payload`` is a | |
| :class:`pandas.DataFrame` containing the extracted features. | |
| Notes | |
| ----- | |
| Callers are expected to dispatch on the ``kind`` value to correctly handle | |
| the returned payload. | |
| """ |
What does this PR do?
Fixes Issue # ...
Type of Change
Checklist