
Scenario Failure: forest_fire_mapping #290

@github-actions

Description


Benchmark scenario ID: forest_fire_mapping
Benchmark scenario definition: https://github.com/ESA-APEx/apex_algorithms/blob/0f2661a4a7e5ad9554e0cba4759c37c83eb97982/algorithm_catalog/vito/random_forest_firemapping/benchmark_scenarios/random_forest_firemapping.json
openEO backend: openeo.vito.be

GitHub Actions workflow run: https://github.com/ESA-APEx/apex_algorithms/actions/runs/20304197450
Workflow artifacts: https://github.com/ESA-APEx/apex_algorithms/actions/runs/20304197450#artifacts

Test start: 2025-12-17 13:20:02.095034+00:00
Test duration: 0:04:48.349579
Test outcome: ❌ failed

Last successful test phase: create-job
Failure in test phase: run-job

Contact Information

Name: Pratichhya Sharma
Organization: VITO
Contact: via VITO (VITO Website, GitHub)

Process Graph

{
  "randomforestfiremapping1": {
    "arguments": {
      "padding_window_size": 33,
      "spatial_extent": {
        "coordinates": [
          [
            [
              -17.996638457335074,
              28.771993378019005
            ],
            [
              -17.960989271845406,
              28.822652746872745
            ],
            [
              -17.913144312372435,
              28.85454938652139
            ],
            [
              -17.842315009623224,
              28.83015783855478
            ],
            [
              -17.781805207936817,
              28.842353612538087
            ],
            [
              -17.728331429702315,
              28.74103487483061
            ],
            [
              -17.766795024572748,
              28.681932277834584
            ],
            [
              -17.75131577297855,
              28.624236885528937
            ],
            [
              -17.756944591740076,
              28.579206335436727
            ],
            [
              -17.838093395552082,
              28.451150708612
            ],
            [
              -17.871397239891113,
              28.480702007110015
            ],
            [
              -17.88969090086607,
              28.57404658490533
            ],
            [
              -17.957705794234517,
              28.658947934558352
            ],
            [
              -18.003674480786984,
              28.76167387695621
            ],
            [
              -18.003674480786984,
              28.76167387695621
            ],
            [
              -17.996638457335074,
              28.771993378019005
            ]
          ]
        ],
        "type": "Polygon"
      },
      "temporal_extent": [
        "2023-07-15",
        "2023-09-15"
      ]
    },
    "namespace": "https://raw.githubusercontent.com/ESA-APEx/apex_algorithms/0962bf79f836859e701fa7437307240ef689ff2e/algorithm_catalog/vito/random_forest_firemapping/openeo_udp/random_forest_firemapping.json",
    "process_id": "random_forest_firemapping",
    "result": true
  }
}
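
For reference, a minimal sketch of how this process graph could be submitted manually with the openeo Python client, mirroring the benchmark's create-job and run-job phases. The local file name and the OIDC authentication flow are assumptions; the backend URL, job id behaviour and exception type match the test output below.

import json
from pathlib import Path

import openeo

# Assumption: the process graph JSON shown above has been saved locally as process_graph.json
process_graph = json.loads(Path("process_graph.json").read_text())

connection = openeo.connect("openeo.vito.be").authenticate_oidc()
job = connection.create_job(
    process_graph=process_graph,
    title="APEx benchmark forest_fire_mapping",
)
# Raises openeo.rest.JobFailedException when the batch job ends in 'error' status,
# which is the failure reported in the logs below.
job.start_and_wait()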

Error Logs

scenario = BenchmarkScenario(id='forest_fire_mapping', description='Forest Fire Mapping using Random Forest based on Sentinel-2 a.../apex_algorithms/algorithm_catalog/vito/random_forest_firemapping/benchmark_scenarios/random_forest_firemapping.json'))
connection_factory = <function connection_factory.<locals>.get_connection at 0x7f6ae022d120>
tmp_path = PosixPath('/home/runner/work/apex_algorithms/apex_algorithms/qa/benchmarks/tmp_path_root/test_run_benchmark_forest_fire0')
track_metric = <function track_metric.<locals>.track at 0x7f6ae022d260>
track_phase = <function track_phase.<locals>.track at 0x7f6ae022d3a0>
upload_assets_on_fail = <function upload_assets_on_fail.<locals>.collect at 0x7f6ae022d440>
request = <FixtureRequest for <Function test_run_benchmark[forest_fire_mapping]>>

    @pytest.mark.parametrize(
        "scenario",
        [
            # Use scenario id as parameterization id to give nicer test names.
            pytest.param(uc, id=uc.id)
            for uc in get_benchmark_scenarios()
        ],
    )
    def test_run_benchmark(
        scenario: BenchmarkScenario,
        connection_factory,
        tmp_path: Path,
        track_metric,
        track_phase,
        upload_assets_on_fail,
        request,
    ):
        track_metric("scenario_id", scenario.id)

        with track_phase(phase="connect"):
            # Check if a backend override has been provided via cli options.
            override_backend = request.config.getoption("--override-backend")
            backend_filter = request.config.getoption("--backend-filter")
            if backend_filter and not re.match(backend_filter, scenario.backend):
                # TODO apply filter during scenario retrieval, but seems to be hard to retrieve cli param
                pytest.skip(
                    f"skipping scenario {scenario.id} because backend {scenario.backend} does not match filter {backend_filter!r}"
                )
            backend = scenario.backend
            if override_backend:
                _log.info(f"Overriding backend URL with {override_backend!r}")
                backend = override_backend

            connection: openeo.Connection = connection_factory(url=backend)

        with track_phase(phase="create-job"):
            # TODO #14 scenario option to use synchronous instead of batch job mode?
            job = connection.create_job(
                process_graph=scenario.process_graph,
                title=f"APEx benchmark {scenario.id}",
                additional=scenario.job_options,
            )
            track_metric("job_id", job.job_id)

        with track_phase(phase="run-job"):
            # TODO: monitor timing and progress
            # TODO: abort excessively long batch jobs? https://github.com/Open-EO/openeo-python-client/issues/589
>           job.start_and_wait()

tests/test_benchmarks.py:69:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <BatchJob job_id='j-25121713200447fa8249e4181bb8c138'>

    def start_and_wait(
        self,
        *,
        print=print,
        max_poll_interval: float = DEFAULT_JOB_STATUS_POLL_INTERVAL_MAX,
        connection_retry_interval: float = DEFAULT_JOB_STATUS_POLL_CONNECTION_RETRY_INTERVAL,
        soft_error_max: int = DEFAULT_JOB_STATUS_POLL_SOFT_ERROR_MAX,
        show_error_logs: bool = True,
        require_success: bool = True,
    ) -> BatchJob:
        """
        Start the batch job, poll its status and wait till it finishes (or fails)

        :param print: print/logging function to show progress/status
        :param max_poll_interval: maximum number of seconds to sleep between job status polls
        :param connection_retry_interval: how long to wait when status poll failed due to connection issue
        :param soft_error_max: maximum number of soft errors (e.g. temporary connection glitches) to allow
        :param show_error_logs: whether to automatically print error logs when the batch job failed.
        :param require_success: whether to raise an exception if the job did not finish successfully.

        :return: Handle to the job created at the backend.

        .. versionchanged:: 0.37.0
            Added argument ``show_error_logs``.

        .. versionchanged:: 0.42.0
            All arguments must be specified as keyword arguments,
            to eliminate the risk of positional mix-ups between heterogeneous arguments and flags.

        .. versionchanged:: 0.42.0
            Added argument ``require_success``.
        """
        # TODO rename `connection_retry_interval` to something more generic?
        start_time = time.time()

        def elapsed() -> str:
            return str(datetime.timedelta(seconds=time.time() - start_time)).rsplit(".")[0]

        def print_status(msg: str):
            print("{t} Job {i!r}: {m}".format(t=elapsed(), i=self.job_id, m=msg))

        # TODO: make `max_poll_interval`, `connection_retry_interval` class constants or instance properties?
        print_status("send 'start'")
        self.start()

        # TODO: also add  `wait` method so you can track a job that already has started explicitly
        #   or just rename this method to `wait` and automatically do start if not started yet?

        # Start with fast polling.
        poll_interval = min(5, max_poll_interval)
        status = None
        _soft_error_count = 0

        def soft_error(message: str):
            """Non breaking error (unless we had too much of them)"""
            nonlocal _soft_error_count
            _soft_error_count += 1
            if _soft_error_count > soft_error_max:
                raise OpenEoClientException("Excessive soft errors")
            print_status(message)
            time.sleep(connection_retry_interval)

        while True:
            # TODO: also allow a hard time limit on this infinite poll loop?
            try:
                job_info = self.describe()
            except requests.ConnectionError as e:
                soft_error("Connection error while polling job status: {e}".format(e=e))
                continue
            except OpenEoApiPlainError as e:
                if e.http_status_code in [HTTP_502_BAD_GATEWAY, HTTP_503_SERVICE_UNAVAILABLE]:
                    soft_error("Service availability error while polling job status: {e}".format(e=e))
                    continue
                else:
                    raise

            status = job_info.get("status", "N/A")

            progress = job_info.get("progress")
            if isinstance(progress, int):
                progress = f"{progress:d}%"
            elif isinstance(progress, float):
                progress = f"{progress:.1f}%"
            else:
                progress = "N/A"
            print_status(f"{status} (progress {progress})")
            if status not in ('submitted', 'created', 'queued', 'running'):
                break

            # Sleep for next poll (and adaptively make polling less frequent)
            time.sleep(poll_interval)
            poll_interval = min(1.25 * poll_interval, max_poll_interval)

        if require_success and status != "finished":
            # TODO: render logs jupyter-aware in a notebook context?
            if show_error_logs:
                print(f"Your batch job {self.job_id!r} failed. Error logs:")
                print(self.logs(level=logging.ERROR))
                print(
                    f"Full logs can be inspected in an openEO (web) editor or with `connection.job({self.job_id!r}).logs()`."
                )
>           raise JobFailedException(
                f"Batch job {self.job_id!r} didn't finish successfully. Status: {status} (after {elapsed()}).",
                job=self,
            )
E           openeo.rest.JobFailedException: Batch job 'j-25121713200447fa8249e4181bb8c138' didn't finish successfully. Status: error (after 0:04:46).

/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/openeo/rest/job.py:382: JobFailedException
----------------------------- Captured stdout call -----------------------------
0:00:00 Job 'j-25121713200447fa8249e4181bb8c138': send 'start'
0:00:09 Job 'j-25121713200447fa8249e4181bb8c138': created (progress 0%)
0:00:17 Job 'j-25121713200447fa8249e4181bb8c138': queued (progress 0%)
0:00:23 Job 'j-25121713200447fa8249e4181bb8c138': queued (progress 0%)
0:00:31 Job 'j-25121713200447fa8249e4181bb8c138': queued (progress 0%)
0:00:41 Job 'j-25121713200447fa8249e4181bb8c138': queued (progress 0%)
0:00:54 Job 'j-25121713200447fa8249e4181bb8c138': queued (progress 0%)
0:01:09 Job 'j-25121713200447fa8249e4181bb8c138': running (progress 9.3%)
0:01:29 Job 'j-25121713200447fa8249e4181bb8c138': running (progress 11.8%)
0:01:53 Job 'j-25121713200447fa8249e4181bb8c138': running (progress 14.8%)
0:02:23 Job 'j-25121713200447fa8249e4181bb8c138': running (progress 18.3%)
0:03:00 Job 'j-25121713200447fa8249e4181bb8c138': running (progress 22.3%)
0:03:47 Job 'j-25121713200447fa8249e4181bb8c138': running (progress 26.7%)
0:04:45 Job 'j-25121713200447fa8249e4181bb8c138': error (progress N/A)
Your batch job 'j-25121713200447fa8249e4181bb8c138' failed. Error logs:
[{'id': '[1765977863075, 49219]', 'time': '2025-12-17T13:24:23.075Z', 'level': 'error', 'message': 'Task 15 in stage 33.0 failed 4 times; aborting job'}, {'id': '[1765977863083, 49520]', 'time': '2025-12-17T13:24:23.083Z', 'level': 'error', 'message': 'Stage error: Job aborted due to stage failure: Task 15 in stage 33.0 failed 4 times, most recent failure: Lost task 15.3 in stage 33.0 (TID 179) (epod087.vgt.vito.be executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):\n  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1247, in main\n    process()\n  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1239, in process\n    serializer.dump_stream(out_iter, outfile)\n  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 274, in dump_stream\n    vs = list(itertools.islice(iterator, batch))\n         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/util.py", line 83, in wrapper\n    return f(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^\n  File "/opt/venv/lib64/python3.11/site-packages/epsel.py", line 44, in wrapper\n    return _FUNCTION_POINTERS[key](*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/opt/venv/lib64/python3.11/site-packages/epsel.py", line 37, in first_time\n    return f(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^\n  File "/opt/venv/lib64/python3.11/site-packages/openeogeotrellis/geopysparkdatacube.py", line 1187, in get_metadata\n    module = load_module_from_string(udf_code)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/opt/venv/lib64/python3.11/site-packages/openeo/udf/run_code.py", line 64, in load_module_from_string\n    exec(code, globals)\n  File "<string>", line 3, in <module>\nModuleNotFoundError: No module named \'skimage\'\n\n\tat org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:572)\n\tat org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:784)\n\tat org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)\n\tat org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)\n\tat org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)\n\tat scala.collection.mutable.Growable.addAll(Growable.scala:61)\n\tat scala.collection.mutable.Growable.addAll$(Growable.scala:57)\n\tat scala.collection.mutable.ArrayBuilder.addAll(ArrayBuilder.scala:66)\n\tat scala.collection.IterableOnceOps.toArray(IterableOnce.scala:1282)\n\tat scala.collection.IterableOnceOps.toArray$(IterableOnce.scala:1276)\n\tat org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)\n\tat org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1049)\n\tat org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2433)\n\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)\n\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:141)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:621)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:624)\n\tat 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\n\nDriver stacktrace:'}, {'id': '[1765977863817, 64691]', 'time': '2025-12-17T13:24:23.817Z', 'level': 'error', 'message': 'OpenEO batch job failed: OpenEOApiException(status_code=500, code=\'Internal\', message=\'Unexpected error during \\\'apply_neighborhood\\\': UDF exception while evaluating processing graph. Please check your user defined functions. stacktrace:\\n  File "<string>", line 3, in <module>\\nModuleNotFoundError: No module named \\\'skimage\\\'. The process had these arguments: {\\\'context\\\': {\\\'padding_window_size\\\': 33}, \\\'data\\\': GeopysparkDataCube(metadata=GeopysparkCubeMetadata(dimension_names=[\\\'x\\\', \\\'y\\\', \\\'bands\\\'], band_names=[\\\'B02\\\', \\\'B03\\\', \\\'B04\\\', \\\'B08\\\', \\\'B12\\\'])), \\\'overlap\\\': [{\\\'dimension\\\': \\\'x\\\', \\\'value\\\': 32, \\\'unit\\\': \\\'px\\\'}, {\\\'dimension\\\': \\\'y\\\', \\\'value\\\': 32, \\\'unit\\\': \\\'px\\\'}], \\\'process\\\': {\\\'process_graph\\\': {\\\'runudf1\\\': {\\\'process_id\\\': \\\'run_udf\\\', \\\'arguments\\\': {\\\'data\\\': {\\\'from_parameter\\\': \\\'data\\\'}, \\\'runtime\\\': \\\'Python\\\', \\\'udf\\\': \\\'import xarray\\\\nimport numpy as np\\\\nfrom skimage.feature import graycomatrix, graycoprops\\\\nfrom openeo.metadata import CollectionMetadata\\\\n\\\\n\\\\ndef apply_metadata(metadata: CollectionMetadata, context: dict) -> CollectionMetadata:\\\\n    return metadata.rename_labels(\\\\n        dimension = "bands",\\\\n        target = ["contrast","variance","NDFI"]\\\\n    )\\\\n\\\\n\\\\ndef apply_datacube(cube: xarray.DataArray, context: dict) -> xarray.DataArray:\\\\n    """\\\\n    Applies spatial texture analysis and spectral index computation to a Sentinel-2 data cube.\\\\n\\\\n    Computes:\\\\n    - NDFI (Normalized Difference Fraction Index) from bands B08 and B12\\\\n    - Texture features (contrast and variance) using Gray-Level Co-occurrence Matrix (GLCM)\\\\n\\\\n    Args:\\\\n        cube (xarray.DataArray): A 3D data cube with dimensions (bands, y, x) containing at least bands B08 and B12.\\\\n        context (dict): A context dictionary (currently unused, included for API compatibility).\\\\n\\\\n    Returns:\\\\n        xarray.DataArray: A new data cube with dimensions (bands, y, x) containing:\\\\n                          - \\\\\\\'contrast\\\\\\\': GLCM contrast\\\\n       ...'}]
Full logs can be inspected in an openEO (web) editor or with `connection.job('j-25121713200447fa8249e4181bb8c138').logs()`.
------------------------------ Captured log call -------------------------------
INFO     conftest:conftest.py:131 Connecting to 'openeo.vito.be'
INFO     openeo.config:config.py:193 Loaded openEO client config from sources: []
INFO     conftest:conftest.py:144 Checking for auth_env_var='OPENEO_AUTH_CLIENT_CREDENTIALS_TERRASCOPE' to drive auth against url='openeo.vito.be'.
INFO     conftest:conftest.py:148 Extracted provider_id='terrascope' client_id='openeo-apex-service-account' from auth_env_var='OPENEO_AUTH_CLIENT_CREDENTIALS_TERRASCOPE'
INFO     openeo.rest.connection:connection.py:255 Found OIDC providers: ['egi', 'terrascope', 'CDSE']
INFO     openeo.rest.auth.oidc:oidc.py:404 Doing 'client_credentials' token request 'https://sso.terrascope.be/auth/realms/terrascope/protocol/openid-connect/token' with post data fields ['grant_type', 'client_id', 'client_secret', 'scope'] (client_id 'openeo-apex-service-account')
INFO     openeo.rest.connection:connection.py:354 Obtained tokens: ['access_token', 'id_token']
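
The error logs above point to a missing `skimage` (scikit-image) dependency in the UDF runtime used by `apply_neighborhood` on the backend, not to a problem in the benchmark harness itself: the UDF starts with `from skimage.feature import graycomatrix, graycoprops` and fails at import time with `ModuleNotFoundError: No module named 'skimage'`. For triage, a minimal sketch of pulling the full error logs for this job with the openeo client; the job id is taken from the output above, and credentials with access to that job are assumed.

import logging

import openeo

connection = openeo.connect("openeo.vito.be").authenticate_oidc()
job = connection.job("j-25121713200447fa8249e4181bb8c138")

# Fetch only error-level log entries for the failed batch job
for entry in job.logs(level=logging.ERROR):
    print(entry.get("time"), entry.get("level"), entry.get("message"))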
