lifecycle support#78

Open
craigl33 wants to merge 1 commit into pypsa-meets-earth:main from craigl33:dev_craig

Conversation

@craigl33

What does this PR do?

Fixes Issue # ...

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • Chore (refactoring code or workflow improvements)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Checklist

  • I have performed a self-review of my code
  • I have commented/documented my code
  • I have added/checked tests

Contributor

Copilot AI left a comment


Pull request overview

This PR adds lifecycle support for OpenStreetMap data, enabling filtering of features by their operational status (operating, construction, proposed, planned, disused, abandoned). The implementation extends the filtering pipeline to capture lifecycle-prefixed tags and adds an allowed_statuses parameter throughout the extraction pipeline.

Key changes include:

  • Extension of the primary filter to include lifecycle prefixes (e.g., "construction:power", "proposed:power")
  • Addition of the allowed_statuses parameter to filter features by operational status
  • Integration of lifecycle processing across both legacy and streaming backends
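
As an illustrative sketch (not the PR's actual helper), the lifecycle-prefix convention described above maps a tag key back to a status and a base key roughly like this; the function name and the "operating" default are assumptions:

```python
# Illustrative sketch: mapping a lifecycle-prefixed OSM tag key back to
# (status, base_key). LIFECYCLE_PREFIXES mirrors the statuses listed in
# this PR; split_lifecycle_key is a hypothetical name, not the PR's API.
LIFECYCLE_PREFIXES = ("construction", "proposed", "planned", "disused", "abandoned")

def split_lifecycle_key(key):
    """Return (status, base_key) for a possibly lifecycle-prefixed tag key."""
    prefix, sep, rest = key.partition(":")
    if sep and prefix in LIFECYCLE_PREFIXES:
        return prefix, rest
    # A bare key such as "power" implies an operating feature.
    return "operating", key

print(split_lifecycle_key("construction:power"))  # ('construction', 'power')
print(split_lifecycle_key("power"))               # ('operating', 'power')
```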

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

  • earth_osm/filter.py: Extends the pre_filter to include lifecycle-prefixed tags for comprehensive OSM data extraction
  • earth_osm/eo.py: Adds the allowed_statuses parameter to the process_region and save_osm_data functions, with inconsistent indentation
  • earth_osm/backends.py: Integrates lifecycle processing in all backends (geofabrik_legacy, geofabrik_stream, overpass), with a missing module import
  • earth_osm/args.py: Adds the CLI argument --statuses to enable status filtering from the command line
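
A hedged sketch of how the new --statuses flag might be wired; the actual parser layout, choices, and defaults in earth_osm/args.py may differ:

```python
import argparse

# Hypothetical wiring of the --statuses flag described in this PR; only the
# flag name comes from the review, the rest is an assumption for illustration.
parser = argparse.ArgumentParser(prog="earth_osm")
parser.add_argument(
    "--statuses",
    nargs="+",
    default=None,
    choices=["operating", "construction", "proposed", "planned", "disused", "abandoned"],
    help="Only keep features whose lifecycle status is in this list",
)

args = parser.parse_args(["--statuses", "operating", "construction"])
print(args.statuses)  # ['operating', 'construction']
```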

Critical Issues Identified:

  • The imported earth_osm.lifecycle module does not exist in the repository, which will cause runtime failures
  • Missing parameter propagation in several code paths, causing inconsistent behavior
  • Logger argument order mismatch
  • Inconsistent code indentation deviating from Python standards
Comments suppressed due to low confidence (2)

earth_osm/eo.py:115

  • The docstring for process_region does not document the new allowed_statuses parameter. According to the documentation pattern established in the codebase (e.g., save_osm_data at lines 219-233), parameters should be documented.
    """Process a single region for a feature.

    When ``stream`` is ``True`` and the geofabrik backend is used the function
    returns an iterator of flattened feature dictionaries. Otherwise it returns
    a :class:`pandas.DataFrame` to preserve the historic API.
    """

earth_osm/eo.py:233

  • The docstring for save_osm_data does not document the new allowed_statuses parameter. According to the documentation pattern established in this function (lines 219-233), all parameters should be documented in the args section.
    """
    Get OSM Data for a list of regions and features
    args:
        region_list: list of regions to get data for
        primary_name: primary feature to get data for
        feature_list: list of features to get data for
        update: update data
        mp: use multiprocessing
        stream_backend: when ``True`` (default) use the streaming pipeline for
            GeoFabrik sources; set to ``False`` to revert to the legacy
            in-memory pipeline (primarily for benchmarking)
        target_date: optional target date for historical data
    returns:
        dict of dataframes
    """


    stream_cached_primary_features,
    stream_pbf_features,
)
from earth_osm.lifecycle import add_lifecycle_columns, filter_by_status
Copilot AI Dec 19, 2025
The module 'earth_osm.lifecycle' is imported but does not exist in the repository. This import will cause a runtime error. You need to create the lifecycle.py file with the required functions: add_lifecycle_columns, filter_by_status, and determine_type_and_status.
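
A minimal sketch of what earth_osm/lifecycle.py might contain. The function names come from the imports in this PR; the signatures and internals here are assumptions, not the author's actual implementation (the primary_name default in particular is a guess to keep the one-argument call sites shown later working):

```python
# Hedged sketch of a possible earth_osm/lifecycle.py; names taken from this
# PR's imports, bodies are illustrative assumptions.
LIFECYCLE_STATUSES = ("construction", "proposed", "planned", "disused", "abandoned")

def determine_type_and_status(tags, primary_name="power"):
    """Return (feature_type, status) from a tags dict, or (None, None)."""
    # A plain "power=tower" tag means an operating feature.
    if primary_name in tags:
        return tags[primary_name], "operating"
    # "construction:power=line" etc. carry the status in the key prefix.
    for status in LIFECYCLE_STATUSES:
        value = tags.get(f"{status}:{primary_name}")
        if value is not None:
            return value, status
    return None, None

def filter_by_status(rows, allowed_statuses):
    """Keep rows whose 'status' is allowed; a falsy filter keeps everything."""
    if not allowed_statuses:
        return list(rows)
    allowed = set(allowed_statuses)
    return [row for row in rows if row.get("status") in allowed]

print(determine_type_and_status({"power": "tower"}))              # ('tower', 'operating')
print(determine_type_and_status({"construction:power": "line"}))  # ('line', 'construction')
```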

Comment on lines +97 to 99
os.path.basename(pbf_url),
primary_name,
feature_name,
Copilot AI Dec 19, 2025

The logger.info arguments are in the wrong order. According to the pattern established in earth_osm/stream.py (lines 859-864, 1030-1034), the format should be: "Region %s (%s=%s): downloading %s" with arguments (region.short, primary_name, feature_name, os.path.basename(pbf_url)). Currently, the arguments are ordered as (region.short, os.path.basename(pbf_url), primary_name, feature_name), which doesn't match the format string.

Suggested change
-    os.path.basename(pbf_url),
-    primary_name,
-    feature_name,
+    primary_name,
+    feature_name,
+    os.path.basename(pbf_url),
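
The ordering rule behind this suggestion can be demonstrated in isolation: %-style placeholders are filled strictly left to right, so the arguments must appear in the same order as the format string. The values below are made up purely to show the ordering:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("earth_osm")

# Placeholder values for illustration only.
region_short, primary_name, feature_name, filename = (
    "DE", "power", "tower", "germany-latest.osm.pbf",
)

# logging fills %s placeholders left to right from the extra arguments.
log.info("Region %s (%s=%s): downloading %s",
         region_short, primary_name, feature_name, filename)

# The same rule, shown with the % operator so the result is visible:
msg = "Region %s (%s=%s): downloading %s" % (
    region_short, primary_name, feature_name, filename,
)
print(msg)  # Region DE (power=tower): downloading germany-latest.osm.pbf
```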

Comment on lines +271 to +282
region_obj,
primary_name,
feature_name_obj,
mp,
update,
data_dir,
progress_bar=progress_bar,
data_source=data_source,
stream=False,
cache_primary=cache_primary,
allowed_statuses=allowed_statuses, # PASS THROUGH
)
Copilot AI Dec 19, 2025

Inconsistent indentation: these lines use excessive indentation (24 spaces) compared to the standard 4-space indentation used throughout the codebase. This breaks Python's standard indentation conventions and makes the code harder to read.

Suggested change (whitespace only: re-indent the argument list from 24-space to standard 4-space-based indentation; the arguments themselves are unchanged)
    region_obj,
    primary_name,
    feature_name_obj,
    mp,
    update,
    data_dir,
    progress_bar=progress_bar,
    data_source=data_source,
    stream=False,
    cache_primary=cache_primary,
    allowed_statuses=allowed_statuses,  # PASS THROUGH
)

Comment on lines +127 to +147
for row in rows:
    # Build tags dict from row
    tags = {k[5:]: v for k, v in row.items() if k.startswith('tags.') and v is not None}

    # Determine actual type and status
    actual_type, status = determine_type_and_status(tags)

    # Skip if wrong type
    if actual_type != feature_name:
        continue

    # Skip if status not allowed
    if allowed_statuses and status not in allowed_statuses:
        continue

    # Add status to row
    row['status'] = status
    row['feature_type'] = actual_type

    yield row

Copilot AI Dec 19, 2025

The lifecycle_filter_stream function recreates a tags dictionary from row items on every iteration by filtering items that start with 'tags.' This could be inefficient for large datasets. Consider whether the determine_type_and_status function could accept the row directly or if the tags extraction logic could be optimized.

Suggested change

Before:

for row in rows:
    # Build tags dict from row
    tags = {k[5:]: v for k, v in row.items() if k.startswith('tags.') and v is not None}
    # Determine actual type and status
    actual_type, status = determine_type_and_status(tags)
    # Skip if wrong type
    if actual_type != feature_name:
        continue
    # Skip if status not allowed
    if allowed_statuses and status not in allowed_statuses:
        continue
    # Add status to row
    row['status'] = status
    row['feature_type'] = actual_type
    yield row

After:

# Precompute the set of tag keys once to avoid repeated prefix checks
tag_keys = None
for row in rows:
    if tag_keys is None:
        # Identify all columns that represent tags.*
        tag_keys = [k for k in row.keys() if k.startswith('tags.')]
    # Build tags dict from row using precomputed tag keys
    tags = {k[5:]: row[k] for k in tag_keys if row.get(k) is not None}
    # Determine actual type and status
    actual_type, status = determine_type_and_status(tags)
    # Skip if wrong type
    if actual_type != feature_name:
        continue
    # Skip if status not allowed
    if allowed_statuses and status not in allowed_statuses:
        continue
    # Add status to row
    row['status'] = status
    row['feature_type'] = actual_type
    yield row

Comment on lines +134 to +144
region,
primary_name,
feature_name,
data_source=data_source,
use_stream=use_stream,
mp=mp,
update=update,
data_dir=data_dir,
progress_bar=progress_bar,
cache_primary=cache_primary,
allowed_statuses=allowed_statuses, # PASS THROUGH
Copilot AI Dec 19, 2025

Inconsistent indentation: these lines use excessive indentation (20 spaces) compared to the standard 4-space indentation used throughout the codebase. This breaks Python's standard indentation conventions and makes the code harder to read.

Suggested change (whitespace only: re-indent the argument list from 20-space to standard 4-space-based indentation; the arguments themselves are unchanged)
    region,
    primary_name,
    feature_name,
    data_source=data_source,
    use_stream=use_stream,
    mp=mp,
    update=update,
    data_dir=data_dir,
    progress_bar=progress_bar,
    cache_primary=cache_primary,
    allowed_statuses=allowed_statuses,  # PASS THROUGH

Comment on lines +66 to +81
# MODIFIED: Add lifecycle prefixes
lifecycle_prefixes = [
    primary_name,                    # e.g., "power"
    f"construction:{primary_name}",  # e.g., "construction:power"
    f"proposed:{primary_name}",      # e.g., "proposed:power"
    f"planned:{primary_name}",       # e.g., "planned:power"
    f"disused:{primary_name}",       # e.g., "disused:power"
    f"abandoned:{primary_name}",     # e.g., "abandoned:power"
]

# Build pre_filter with all lifecycle prefixes
pre_filter = {}
for element_type in [Node, Way, Relation]:
    pre_filter[element_type] = {}
    for prefix in lifecycle_prefixes:
        pre_filter[element_type][prefix] = feature_list
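
For reference, a self-contained sketch of the pre_filter structure this loop produces, using string placeholders for the Node/Way/Relation element classes and made-up feature values:

```python
# Sketch of the resulting pre_filter shape. "Node"/"Way"/"Relation" stand in
# for the real element classes; feature_list values are illustrative.
primary_name = "power"
feature_list = ["line", "tower", "substation"]

lifecycle_prefixes = [primary_name] + [
    f"{status}:{primary_name}"
    for status in ("construction", "proposed", "planned", "disused", "abandoned")
]

pre_filter = {
    element_type: {prefix: feature_list for prefix in lifecycle_prefixes}
    for element_type in ("Node", "Way", "Relation")
}

print(sorted(pre_filter["Way"]))
```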
Copilot AI Dec 19, 2025

The new lifecycle filtering functionality (including the allowed_statuses parameter, lifecycle prefix filtering, and status determination) lacks test coverage. Given that the repository has comprehensive test coverage for similar features (as seen in tests/test_export.py, tests/test_overpass.py, etc.), tests should be added to verify: 1) lifecycle prefix filtering in run_primary_filter, 2) lifecycle column addition and status filtering in backends, and 3) the allowed_statuses parameter integration throughout the pipeline.

Comment on lines +188 to 189
"""Select the appropriate backend and return a tagged payload."""

Copilot AI Dec 19, 2025

The detailed docstring for fetch_region_backend has been removed. According to the documentation pattern in the codebase, this function previously had a docstring describing the return value format (a tuple where the first element is either "stream" or "dataframe" indicating the payload type, and the second element is the payload itself). This information is valuable for understanding the function's contract and should be preserved or restored.

Suggested change
-    """Select the appropriate backend and return a tagged payload."""
+    """
+    Select and execute the appropriate backend for fetching regional data.
+
+    This function encapsulates the logic for choosing between the Geofabrik
+    and Overpass backends, and between the streaming and legacy (dataframe)
+    pipelines, based on the provided arguments.
+
+    Returns
+    -------
+    BackendResult
+        A 2-tuple ``(kind, payload)`` where:
+
+        * ``kind`` is a string indicating the payload type and will be either
+          ``"stream"`` or ``"dataframe"``.
+        * ``payload`` is the corresponding data:
+          - if ``kind == "stream"``, ``payload`` is an iterator/stream of
+            feature rows produced by the streaming backend;
+          - if ``kind == "dataframe"``, ``payload`` is a
+            :class:`pandas.DataFrame` containing the extracted features.
+
+    Notes
+    -----
+    Callers are expected to dispatch on the ``kind`` value to correctly handle
+    the returned payload.
+    """

