
Conversation

@BENR0 (Collaborator) commented Jun 16, 2025

This implements the LI flash geometry product as proposed by ESSL as a composite, so it can easily be produced with Satpy.

This is a proof-of-concept PR. As discussed during the spring PCW, this is not really a composite but fits better with the concept of a "processor". It is currently still implemented under composites because, for a proof of concept, it did not make sense to add "processor" loading (similar to composites) to Satpy yet.

Things to note:

  • There are multiple things still to discuss, including:
    • How do we proceed with this new concept of "features" as geometries? Is it a good idea to introduce a "processor" concept as a more general approach to modifying data that does not fit into a composite output?
    • What does it imply for other functionality that might be needed (or nice to have) if Satpy is able to process/use geometries? Currently, to comply with Satpy's DataID and dataset access methods, this PR introduces a bare-bones geometry container that just wraps a geopandas GeoDataFrame.

I will add documentation later on, so just a few notes on how to use this in the meantime.
Together with the LI LGR files, you can simply load "acc_flash_geometry" and you will get the flash geometry product.
With the feature writer you can save this as an sqlite file or as GeoJSON, just by using the writer together with the corresponding file extension.
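
A minimal sketch of that workflow (the reader name "li_l2_nc" matches Satpy's existing LI L2 reader, but the "feature" writer name and the exact call pattern here are assumptions based on this description):

from satpy import Scene

scn = Scene(filenames=lgr_files, reader="li_l2_nc")
scn.load(["acc_flash_geometry"])
# the file extension selects the output format
scn.save_dataset("acc_flash_geometry", filename="flash_geom.json", writer="feature")    # GeoJSON
scn.save_dataset("acc_flash_geometry", filename="flash_geom.sqlite", writer="feature")  # sqlite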

The flash geometry algorithm implemented here already includes distance thresholding, with a default of 10 km (the value currently used by ESSL), to mitigate the issue that unreasonable connections between groups are sometimes made. The threshold can be changed through the Satpy config so it is easy to experiment with: either add a "flash_geom_distance_threshold" setting under "composites" in the Satpy config, or use a with-statement:

with satpy.config.set({"composites.flash_geom_distance_threshold": value}):
    ...
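
On the implementation side, the compositor can read the setting back through Satpy's donfig-based config, along these lines (a sketch; it assumes the 10 km default is stored in metres as 10000.0, which is not stated above):

import satpy

# fall back to the ESSL default if the user has not set anything
threshold = satpy.config.get("composites.flash_geom_distance_threshold", 10000.0)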

  • Tests added
  • Fully documented

@codecov (bot) commented Jun 16, 2025

Codecov Report

❌ Patch coverage is 16.90141% with 118 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.10%. Comparing base (6602e1e) to head (d7fb9f3).
⚠️ Report is 57 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| satpy/composites/lightning.py | 21.79% | 61 Missing ⚠️ |
| satpy/writers/oracle.py | 2.17% | 45 Missing ⚠️ |
| satpy/writers/feature.py | 40.00% | 9 Missing ⚠️ |
| satpy/vectorscene.py | 0.00% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3150      +/-   ##
==========================================
- Coverage   96.30%   96.10%   -0.21%     
==========================================
  Files         463      466       +3     
  Lines       58200    58828     +628     
==========================================
+ Hits        56048    56535     +487     
- Misses       2152     2293     +141     
| Flag | Coverage Δ |
| --- | --- |
| behaviourtests | 3.60% <0.00%> (-0.05%) ⬇️ |
| unittests | 96.19% <16.90%> (-0.21%) ⬇️ |


@coveralls (bot) commented

Pull Request Test Coverage Report for Build 15680470090

Details

  • 24 of 139 (17.27%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.2%) to 96.2%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| satpy/writers/feature.py | 6 | 15 | 40.0% |
| satpy/writers/oracle.py | 1 | 46 | 2.17% |
| satpy/composites/lightning.py | 17 | 78 | 21.79% |

Totals
Change from base Build 15633405589: -0.2%
Covered Lines: 56001
Relevant Lines: 58213

💛 - Coveralls

@gerritholl (Member) left a comment:


First small thoughts based on initial read-through. I know this is a draft PR so it's still being worked on. Biggest open tasks I see:

  • Is this dask-friendly or can it be made dask-friendly?
  • Unit tests.
  • Documentation.

import logging
from typing import Optional, Sequence

import geopandas as gpd
Member:

If this is an optional dependency, the import should probably be inside the processor implementation. Not sure how that would work for annotations, though.
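
A common pattern for exactly this (a sketch; the function name is hypothetical): defer the runtime import into the function body, and give type checkers the import behind TYPE_CHECKING, combined with from __future__ import annotations so the annotation is never evaluated at runtime:

from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import geopandas as gpd  # seen by type checkers only


def build_flash_geometry(points) -> gpd.GeoDataFrame:
    # deferred import keeps geopandas an optional dependency
    import geopandas as gpd
    return gpd.GeoDataFrame(geometry=points)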

import geopandas as gpd
import numpy as np
import pandas as pd
import pyproj as pro
Member:

In other modules this is simply imported as pyproj.

pandas.DataFrame
References:
[1] https://gist.github.com/emiliom/87a6aa137610bf59787868943b406e8f
Member:

Is this reference correct? It links to a gist on holoviz by Emilio Mayorga and does not seem to have anything to do with Pieter Groenemeijer's flash geometry algorithm.

Collaborator (Author):

Indeed. I don't know how the wrong link got into this. I will try and find the correct one.

super().__init__(name, filename, base_dir, **kwargs)

def save_dataset(self, dataset, filename=None, layer_name=None, **kwargs):
"""Save the geodataframe to sqlite."""
Member:

It seems it writes either to sqlite or to JSON, correct?

Collaborator (Author):

Correct

def write_gdf_to_oracle(gdf, table_name, user, password, host, port, service_name):
"""Write geodataframe to oracle database.
The connection data like user/pw/host/port/service_name are taken from the config.
Member:

I don't see that they are taken from the config; rather, they are passed by the user to this function.

Collaborator (Author):

Yes, that is correct. I had to refactor a little from our custom chain and did not adapt the documentation.

@mraspaud (Member) commented:

We just had a meeting on this topic; here are my notes:

Pytroll LI geometry meeting

At the last PCW, Ro started implementing it.
It's still a bit rough.

Should satpy support geometries?

No reason why not.

(Geometries could also be used to crop/filter data?)

Right now there is a PR that hacks the geometries into Satpy. It can be written to a database.
The algorithm's input is points (an xarray.DataArray) and its output is multilines (a GeoDataFrame).
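
Those in/out types, as a minimal sketch (the grouping of points into per-flash line segments is the actual algorithm and is elided here; lines_per_flash and flash_ids are hypothetical placeholders):

import geopandas as gpd
from shapely.geometry import MultiLineString

# one MultiLineString per flash, built from that flash's point coordinates
geoms = [MultiLineString(segments) for segments in lines_per_flash]
gdf = gpd.GeoDataFrame({"flash_id": flash_ids}, geometry=geoms, crs="EPSG:4326")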

Wind barbs and lightning are other examples of vector data that could use the same processing path.

Datatree (considered as a future internal data structure for the Scene) would not be compatible with a dataframe.

Ideally, the Scene should be split into container vs processor.

We could have a VectorScene as an alternative to a RasterScene.

If the readers know what kind of data they produce, we could decide what scene to use.

Could we use the existing Scene and just add the vector functionality to it, so that the Scene would know when there is a need to rasterise?

As long as there is no need to combine geometries with rasters for now, we can start with a separate vectorscene object.

At the moment, the LI reader returns a "swath" which should really be a geometry: could we have a "get_vector_dataset" that the yaml reader (subclass?) could use?

DataFrames have experimental attributes (attrs).
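
For reference, pandas' DataFrame.attrs (which GeoDataFrame inherits) is a plain dict, so Satpy-style metadata could be attached like this; pandas marks attrs as experimental, so propagation through operations is not guaranteed:

gdf.attrs["platform_name"] = "MTG-I1"
gdf.attrs["start_time"] = start_time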

We agree on using GeoDataFrame (check attrs) and trying this new container (VectorScene). Try a vector product coming out directly from the reader.
We should not use the composite class for geometries.

A modifier is a good match for what we want to do with flash geometries, but it's not exactly the same. The logic of putting points together could also be a composite.

Discussion about composites vs. an algorithm that could potentially generate multiple products.

Start with the existing PR and try to adapt it.

@pnuu (Member) commented Oct 16, 2025

Some thoughts, which I also partially wrote on Slack, follow.

What if the writer handled the geometry creation?

  1. Create a writer called, let's say, li_shapes_to_geojson that includes the conversion from LFL/LGR raster data to geometries and then saves the file as GeoJSON.
  2. Extract the shape/geometry creation code to a new module.
  3. Create more writers that use the same shape/geometry creation code but handle the data so that it can be saved to whichever database or format is necessary.

The usage would be straightforward:

scn = Scene(...)
scn.load([something])
scn.save_dataset(something, filename="li.json", writer="li_shapes_to_geojson")

and for databases:

...
# connection address, credentials, table, format string, and whatever else is needed
postgis_settings = {...}
scn.save_dataset(something, writer="li_shapes_to_postgis", writer_kwargs=postgis_settings)

To limit the data to a specific area, there might be a need for some data reduction. The simplest to implement would be in the "compositor" used for "something", with the area definition given in the composite YAML definition. The more correct way would be to have a "resampler" that masks out pixels outside of the given area (a modified version of crop, I guess?), with the writer ignoring NaNs in its handling.
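
A sketch of the point-filtering part (assuming lons and lats hold the point coordinates and adef is a pyresample AreaDefinition; the helper name is hypothetical):

from pyproj import Transformer

def mask_outside_area(lons, lats, adef):
    """Return a boolean mask that is True for points inside the area's extent."""
    to_area = Transformer.from_crs("EPSG:4326", adef.crs, always_xy=True)
    x, y = to_area.transform(lons, lats)
    xmin, ymin, xmax, ymax = adef.area_extent
    return (x >= xmin) & (x <= xmax) & (y >= ymin) & (y <= ymax)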

Usage together with rasters in the same Scene? No idea 😅

I do something like this in my Trollflow2 plugin (see example config here) that writes point data to our PostGIS database. The data can be anything that can be described as point data, but I've used it for LI/LFL.

@pnuu (Member) commented Oct 16, 2025

Possibilities for the "use geometries with rasters loaded in the same Scene" case, with the above approach that has no VectorScene and where the geometries are handled as-is where needed:

  • add an option to the compositor to call the geometry creator and another "rasterizer"
    • very hacky, confusing, and duplicates things
  • add an enhancement that creates the geometries and then rasterizes the data (a rasterization sketch follows after this list)
    • to which area to rasterize?
    • it should be the final desired output area and resolution without any additional resampling/scaling, but this might not be known at this stage, depending on the generate=False/True setting
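
The rasterization step itself could look like this (a sketch using rasterio; it assumes the geometries are already in the area's CRS, and the burn value and dtype are arbitrary choices):

from affine import Affine
from rasterio.features import rasterize

def rasterize_to_area(gdf, adef):
    """Burn geometries into a 2D array matching a pyresample AreaDefinition."""
    xmin, ymin, xmax, ymax = adef.area_extent
    transform = Affine(adef.pixel_size_x, 0.0, xmin,
                       0.0, -adef.pixel_size_y, ymax)
    return rasterize(
        ((geom, 1) for geom in gdf.geometry),
        out_shape=(adef.height, adef.width),
        transform=transform,
        fill=0,
        dtype="uint8",
    )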

@pnuu (Member) commented Oct 16, 2025

> At the moment, the LI reader returns a "swath" which should really be a geometry: could we have a "get_vector_dataset" that the yaml reader (subclass?) could use?

If the reader for LFL and LGR data returned a geometry instead of a swath, a lot of the more traditional processing capabilities would be lost or made unnecessarily complicated, like creating heatmaps with resampler="bucket_sum".
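
For reference, that heatmap workflow as a sketch ("li_l2_nc" is Satpy's LI L2 reader and "bucket_sum" is an existing Satpy resampler, but the dataset name and target area here are only illustrative):

from satpy import Scene

scn = Scene(filenames=lfl_files, reader="li_l2_nc")
scn.load(["flash_radiance"])
# sum the point values falling into each cell of the target grid
heatmap = scn.resample("euro4", resampler="bucket_sum")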

@pnuu (Member) commented Oct 16, 2025

Pseudo-code for my approach would be something like:

class LightningPathGeometry:
    """Create lightning path geometries."""
    def __init__(self, data, ..., adef=None):
        ...
        self.adef = adef
        self._filter_by_area(adef)
        ...
    def _filter_by_area(self, adef):
        if self.adef:
            ...
        ...
    def generate_geometries(self):
        ...
    def to_geojson(self):
        if not self.geometries:
            self.generate_geometries()
        ...
    def to_raster(self):
        if not self.geometries:
            self.generate_geometries()
        ...

class GeoJsonWriter:
    def __init__(self, ..., processor):
        ...
    def save(self, filename, data, ..., adef=None):
        proc = self.processor(data, ..., adef=adef)
        gjson = proc.to_geojson()
        self._write(filename, gjson)

class GeometryRasterCompositor:
    def __init__(self, ..., processor):
        self.processor = processor
    def __call__(self, datasets, ..., adef=None):
        proc = self.processor(datasets, ..., adef=adef)
        raster_data = proc.to_raster()
        ...

Using the writer is like in my first braindump. For the raster case, the processor class would be defined in the composite YAML. Area definition handling for filtering and rasterization is still to be figured out.

@mraspaud (Member) commented:

> At the moment, the LI reader returns a "swath" which should really be a geometry: could we have a "get_vector_dataset" that the yaml reader (subclass?) could use?

> If the reader for LFL and LGR data returned a geometry instead of a swath, a lot of the more traditional processing capabilities would be lost or made unnecessarily complicated, like creating heatmaps with resampler="bucket_sum".

Yes, the resampling today acts as an implicit rasterising method.
The idea here is to have an explicit way of rasterising a VectorScene to get the corresponding RasterScene. Resampling will surely be one implementation of it.

This will allow a better separation of concerns, where the VectorScene does vector things, easily coupling it to dedicated vector writers for example, and thus not encumbering the existing (Raster)Scene. A workflow could be for example:

vscene = VectorScene(...)
vscene.load(...some LI stuff...)
vscene.save_datasets_to_database(<database_url>)  # allows writing without converting from raster/dataarray

rscene = vscene.rasterise(method="accumulate", area="fci_1km")
# possibly combine rscene with an fci scene for example
rscene.save_datasets()  # to images

One reason for splitting into another scene class is also that the current Scene class is already doing so much that (at least) I am really reluctant to add more functionality to it if we can avoid it.

@gerritholl (Member) commented:

Thanks @mraspaud for the write-up. As for:

> If the readers know what kind of data they produce, we could decide what scene to use

I think that would be the FileHandler, wouldn't it? The word "reader" is used for multiple things in Satpy.

@pnuu (Member) commented Oct 30, 2025

I'd say the data should be read directly as vectors if and only if the data are already vectors in the files being handled. At the moment, the only such data I know of is NWC SAF GEO RDT (Rapidly Developing Thunderstorms), which we don't have a reader for.

I don't consider point data such as LI/LFL to be vector data, nor atmospheric motion vectors if they are given as u- and v-components or as speed and direction. These are data that could already be put in a database (as an example) as-is, without any vectorization. In fact, if the data are read into a vector format, they might require deconstruction before being usable.

BENR0 and others added 5 commits November 10, 2025 15:51

  • Started a module containing a VectorScene class. So far no implementation and no tests. In fact, this commit serves mostly to test if I can push to Benjamin's branch ☺
  • Split the functionality shared with VectorScene off from Scene into a BaseScene class. Leaves anything related to resampling, areas, and composites in Scene. Just a first attempt; will very likely be further adjusted in later commits.