Conversation
Pull request overview
This PR adds lifecycle support for OpenStreetMap data, enabling filtering of features by their operational status (operating, construction, proposed, planned, disused, abandoned). The implementation extends the filtering pipeline to capture lifecycle-prefixed tags and adds an allowed_statuses parameter throughout the extraction pipeline.
Key changes include:
- Extension of the primary filter to include lifecycle prefixes (e.g., "construction:power", "proposed:power")
- Addition of the `allowed_statuses` parameter to filter features by operational status
- Integration of lifecycle processing across both legacy and streaming backends
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| earth_osm/filter.py | Extends the pre_filter to include lifecycle-prefixed tags for comprehensive OSM data extraction |
| earth_osm/eo.py | Adds allowed_statuses parameter to process_region and save_osm_data functions with inconsistent indentation |
| earth_osm/backends.py | Integrates lifecycle processing in all backends (geofabrik_legacy, geofabrik_stream, overpass) with missing module import |
| earth_osm/args.py | Adds CLI argument --statuses to enable status filtering from command line |
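
The `--statuses` flag described above could be wired up roughly like this; the exact option name is taken from the table, but the choices, default, and argparse details in `earth_osm/args.py` are assumptions:

```python
import argparse

# Hypothetical sketch of the --statuses flag; the choices mirror the
# lifecycle statuses listed in the PR overview.
parser = argparse.ArgumentParser(prog="earth_osm")
parser.add_argument(
    "--statuses",
    nargs="+",
    choices=["operating", "construction", "proposed", "planned", "disused", "abandoned"],
    default=None,
    help="keep only features whose lifecycle status is in this list",
)

args = parser.parse_args(["--statuses", "operating", "construction"])
# args.statuses can then be passed through as allowed_statuses
```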
Critical Issues Identified:
- The imported `earth_osm.lifecycle` module does not exist in the repository, which will cause runtime failures
- Missing parameter propagation in several code paths, causing inconsistent behavior
- Logger argument order mismatch
- Inconsistent code indentation deviating from Python standards
Comments suppressed due to low confidence (2)
earth_osm/eo.py:115
- The docstring for process_region does not document the new allowed_statuses parameter. According to the documentation pattern established in the codebase (e.g., save_osm_data at lines 219-233), parameters should be documented.
"""Process a single region for a feature.
When ``stream`` is ``True`` and the geofabrik backend is used the function
returns an iterator of flattened feature dictionaries. Otherwise it returns
a :class:`pandas.DataFrame` to preserve the historic API.
"""
earth_osm/eo.py:233
- The docstring for save_osm_data does not document the new allowed_statuses parameter. According to the documentation pattern established in this function (lines 219-233), all parameters should be documented in the args section.
"""
Get OSM Data for a list of regions and features
args:
region_list: list of regions to get data for
primary_name: primary feature to get data for
feature_list: list of features to get data for
update: update data
mp: use multiprocessing
stream_backend: when ``True`` (default) use the streaming pipeline for
GeoFabrik sources; set to ``False`` to revert to the legacy
in-memory pipeline (primarily for benchmarking)
target_date: optional target date for historical data
returns:
dict of dataframes
"""
```python
    stream_cached_primary_features,
    stream_pbf_features,
)
from earth_osm.lifecycle import add_lifecycle_columns, filter_by_status
```
The module 'earth_osm.lifecycle' is imported but does not exist in the repository. This import will cause a runtime error. You need to create the lifecycle.py file with the required functions: add_lifecycle_columns, filter_by_status, and determine_type_and_status.
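
A minimal sketch of what the missing module might contain, assuming the tag semantics described in the PR overview; every implementation detail below is a guess, only the three function names come from the review:

```python
# Hypothetical earth_osm/lifecycle.py; tag semantics are assumptions.
LIFECYCLE_PREFIXES = ("construction", "proposed", "planned", "disused", "abandoned")


def determine_type_and_status(tags, primary_name="power"):
    """Map a tag dict to (feature_type, status).

    A plain primary tag (power=tower) is treated as "operating"; a
    lifecycle-prefixed tag (construction:power=tower) carries its
    prefix as the status.
    """
    if tags.get(primary_name) is not None:
        return tags[primary_name], "operating"
    for prefix in LIFECYCLE_PREFIXES:
        value = tags.get(f"{prefix}:{primary_name}")
        if value is not None:
            return value, prefix
    return None, None


def add_lifecycle_columns(rows, primary_name="power"):
    """Return rows with 'feature_type' and 'status' keys attached."""
    out = []
    for row in rows:
        feature_type, status = determine_type_and_status(row.get("tags", {}), primary_name)
        out.append({**row, "feature_type": feature_type, "status": status})
    return out


def filter_by_status(rows, allowed_statuses=None):
    """Drop rows whose status is not in allowed_statuses (None keeps all)."""
    if not allowed_statuses:
        return list(rows)
    allowed = set(allowed_statuses)
    return [r for r in rows if r.get("status") in allowed]
```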
```python
    os.path.basename(pbf_url),
    primary_name,
    feature_name,
```
The logger.info arguments are in the wrong order. According to the pattern established in earth_osm/stream.py (lines 859-864, 1030-1034), the format should be: "Region %s (%s=%s): downloading %s" with arguments (region.short, primary_name, feature_name, os.path.basename(pbf_url)). Currently, the arguments are ordered as (region.short, os.path.basename(pbf_url), primary_name, feature_name), which doesn't match the format string.
Suggested change:
```diff
-    os.path.basename(pbf_url),
-    primary_name,
-    feature_name,
+    primary_name,
+    feature_name,
+    os.path.basename(pbf_url),
```
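
Because %-style logging fills placeholders strictly left to right, the argument order must mirror the format string; a small self-contained check (the region code and URL are made up for illustration):

```python
fmt = "Region %s (%s=%s): downloading %s"

# Wrong order (as in the PR): the filename lands in the primary-name slot
wrong = fmt % ("BJ", "benin-latest.osm.pbf", "power", "line")
# -> 'Region BJ (benin-latest.osm.pbf=power): downloading line'

# Correct order, matching the pattern cited from earth_osm/stream.py
right = fmt % ("BJ", "power", "line", "benin-latest.osm.pbf")
# -> 'Region BJ (power=line): downloading benin-latest.osm.pbf'
```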
```python
                        region_obj,
                        primary_name,
                        feature_name_obj,
                        mp,
                        update,
                        data_dir,
                        progress_bar=progress_bar,
                        data_source=data_source,
                        stream=False,
                        cache_primary=cache_primary,
                        allowed_statuses=allowed_statuses,  # PASS THROUGH
                    )
```
Inconsistent indentation: these lines use excessive indentation (24 spaces) compared to the standard 4-space indentation used throughout the codebase. This breaks Python's standard indentation conventions and makes the code harder to read.
Suggested change:
```python
        region_obj,
        primary_name,
        feature_name_obj,
        mp,
        update,
        data_dir,
        progress_bar=progress_bar,
        data_source=data_source,
        stream=False,
        cache_primary=cache_primary,
        allowed_statuses=allowed_statuses,  # PASS THROUGH
    )
```
```python
for row in rows:
    # Build tags dict from row
    tags = {k[5:]: v for k, v in row.items() if k.startswith('tags.') and v is not None}

    # Determine actual type and status
    actual_type, status = determine_type_and_status(tags)

    # Skip if wrong type
    if actual_type != feature_name:
        continue

    # Skip if status not allowed
    if allowed_statuses and status not in allowed_statuses:
        continue

    # Add status to row
    row['status'] = status
    row['feature_type'] = actual_type

    yield row
```
The lifecycle_filter_stream function recreates a tags dictionary from row items on every iteration by filtering items that start with 'tags.' This could be inefficient for large datasets. Consider whether the determine_type_and_status function could accept the row directly or if the tags extraction logic could be optimized.
Suggested change:
```python
# Precompute the set of tag keys once to avoid repeated prefix checks
tag_keys = None
for row in rows:
    if tag_keys is None:
        # Identify all columns that represent tags.*
        tag_keys = [k for k in row.keys() if k.startswith('tags.')]
    # Build tags dict from row using precomputed tag keys
    tags = {k[5:]: row[k] for k in tag_keys if row.get(k) is not None}
    # Determine actual type and status
    actual_type, status = determine_type_and_status(tags)
    # Skip if wrong type
    if actual_type != feature_name:
        continue
    # Skip if status not allowed
    if allowed_statuses and status not in allowed_statuses:
        continue
    # Add status to row
    row['status'] = status
    row['feature_type'] = actual_type
    yield row
```
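
The effect of precomputing the tag keys is easy to see on a toy stream; like the suggestion, this assumes every row carries the same columns:

```python
rows = [
    {"id": 1, "tags.power": "line", "tags.voltage": "110000"},
    {"id": 2, "tags.power": None, "tags.voltage": "220000"},
]

tag_keys = None
extracted = []
for row in rows:
    if tag_keys is None:
        # the 'tags.' prefix scan runs exactly once, on the first row
        tag_keys = [k for k in row if k.startswith("tags.")]
    extracted.append({k[5:]: row[k] for k in tag_keys if row.get(k) is not None})
# extracted == [{"power": "line", "voltage": "110000"}, {"voltage": "220000"}]
```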
```python
                    region,
                    primary_name,
                    feature_name,
                    data_source=data_source,
                    use_stream=use_stream,
                    mp=mp,
                    update=update,
                    data_dir=data_dir,
                    progress_bar=progress_bar,
                    cache_primary=cache_primary,
                    allowed_statuses=allowed_statuses,  # PASS THROUGH
```
Inconsistent indentation: these lines use excessive indentation (20 spaces) compared to the standard 4-space indentation used throughout the codebase. This breaks Python's standard indentation conventions and makes the code harder to read.
Suggested change:
```python
        region,
        primary_name,
        feature_name,
        data_source=data_source,
        use_stream=use_stream,
        mp=mp,
        update=update,
        data_dir=data_dir,
        progress_bar=progress_bar,
        cache_primary=cache_primary,
        allowed_statuses=allowed_statuses,  # PASS THROUGH
```
```python
# MODIFIED: Add lifecycle prefixes
lifecycle_prefixes = [
    primary_name,                    # e.g., "power"
    f"construction:{primary_name}",  # e.g., "construction:power"
    f"proposed:{primary_name}",      # e.g., "proposed:power"
    f"planned:{primary_name}",       # e.g., "planned:power"
    f"disused:{primary_name}",       # e.g., "disused:power"
    f"abandoned:{primary_name}",     # e.g., "abandoned:power"
]

# Build pre_filter with all lifecycle prefixes
pre_filter = {}
for element_type in [Node, Way, Relation]:
    pre_filter[element_type] = {}
    for prefix in lifecycle_prefixes:
        pre_filter[element_type][prefix] = feature_list
```
The new lifecycle filtering functionality (including the allowed_statuses parameter, lifecycle prefix filtering, and status determination) lacks test coverage. Given that the repository has comprehensive test coverage for similar features (as seen in tests/test_export.py, tests/test_overpass.py, etc.), tests should be added to verify: 1) lifecycle prefix filtering in run_primary_filter, 2) lifecycle column addition and status filtering in backends, and 3) the allowed_statuses parameter integration throughout the pipeline.
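
A starting point for that coverage might look like the following; `build_pre_filter` is a hypothetical stand-in mirroring the pre_filter construction shown in the diff, since the real function name is not visible here:

```python
LIFECYCLE = ("construction", "proposed", "planned", "disused", "abandoned")


def build_pre_filter(primary_name, feature_list, element_types):
    # Mirrors the loop in the diff: one entry per element type per prefix
    prefixes = [primary_name] + [f"{p}:{primary_name}" for p in LIFECYCLE]
    return {et: {prefix: feature_list for prefix in prefixes} for et in element_types}


def test_lifecycle_prefixes_in_pre_filter():
    pre_filter = build_pre_filter("power", ["line", "tower"], ["Node", "Way", "Relation"])
    for et in ("Node", "Way", "Relation"):
        assert "power" in pre_filter[et]
        assert "construction:power" in pre_filter[et]
        assert pre_filter[et]["proposed:power"] == ["line", "tower"]


test_lifecycle_prefixes_in_pre_filter()
```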
| """Select the appropriate backend and return a tagged payload.""" | ||
|
|
The detailed docstring for fetch_region_backend has been removed. According to the documentation pattern in the codebase, this function previously had a docstring describing the return value format (a tuple where the first element is either "stream" or "dataframe" indicating the payload type, and the second element is the payload itself). This information is valuable for understanding the function's contract and should be preserved or restored.
| """Select the appropriate backend and return a tagged payload.""" | |
| """ | |
| Select and execute the appropriate backend for fetching regional data. | |
| This function encapsulates the logic for choosing between the Geofabrik | |
| and Overpass backends, and between the streaming and legacy (dataframe) | |
| pipelines, based on the provided arguments. | |
| Returns | |
| ------- | |
| BackendResult | |
| A 2-tuple ``(kind, payload)`` where: | |
| * ``kind`` is a string indicating the payload type and will be either | |
| ``"stream"`` or ``"dataframe"``. | |
| * ``payload`` is the corresponding data: | |
| - if ``kind == "stream"``, ``payload`` is an iterator/stream of | |
| feature rows produced by the streaming backend; | |
| - if ``kind == "dataframe"``, ``payload`` is a | |
| :class:`pandas.DataFrame` containing the extracted features. | |
| Notes | |
| ----- | |
| Callers are expected to dispatch on the ``kind`` value to correctly handle | |
| the returned payload. | |
| """ |
What does this PR do?
Fixes Issue # ...
Type of Change
Checklist