Pipeline breaking when PDF file is replaced by an HTML page #1210

@LVerneyEC


Hi,

We hit this exception this morning during the daily run on our set of declarations:

if (modifiedFilesInCommit.length > 1) {
  throw new Error(`Only one file should have been recorded in ${hash}, but all these files were recorded: ${modifiedFilesInCommit.join(', ')}`);
}

It seems this error is uncaught and crashes the whole pipeline with no recovery options. I get the following log:

2025-11-28T06:05:18+00:00 error Zalando — Data Catalogue for Vetted Researchers Error: Only one file should have been recorded in 693a560f39b6de4006a6219c3e97c8778dbe6bbb, but all these files were recorded: Zalando/Data Catalogue for Vetted Researchers.html, Zalando/Data Catalogue for Vetted Researchers.pdf

And then a traceback:

 at Module.toDomain (file:///home/pptruser/open-terms-archive/engine/src/archivist/recorder/repositories/git/dataMapper.js:57:11)
...
 at async Archivist.trackTermsChanges (file:///home/pptruser/open-terms-archive/engine/src/archivist/index.js:184:22)

The snapshot commit mentioned is the current HEAD of our snapshot Git repository: https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-snapshots/-/tree/693a560f39b6de4006a6219c3e97c8778dbe6bbb

As you can see in the "Zalando" folder, the "Data catalogue..." file is duplicated, once as (empty) HTML and once as PDF.

The relevant declaration is: https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-declarations/-/blob/main/declarations/Zalando.yml?ref_type=heads#L14-15

My understanding of the situation is that:

  • The Zalando declaration references a PDF file, which was fetched correctly over the past days/weeks.
  • At some point, some issue triggered an empty HTML reply (temporary issue on the webserver, antibot, whatever). The engine then recorded the HTML file alongside the PDF file.
  • The snapshot directory now contains both an HTML and a PDF file, which crashes the pipeline.
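
To make the situation above concrete, here is a minimal sketch of a check that flags snapshot files recorded under more than one extension. `findDuplicatedSnapshots` is a hypothetical helper written for this issue, not a function of the engine, and the third path in the example is made up for illustration.

```javascript
// Hypothetical helper, not part of the Open Terms Archive engine.
// Groups snapshot paths by their path without extension and returns
// the groups that were recorded under more than one extension.
function findDuplicatedSnapshots(filePaths) {
  const extensionsByBaseName = new Map();

  for (const filePath of filePaths) {
    const lastSlash = filePath.lastIndexOf('/');
    const lastDot = filePath.lastIndexOf('.');
    // A dot counts as an extension separator only if it is inside the
    // file name and is not its first character (dotfiles).
    const hasExtension = lastDot > lastSlash + 1;
    const baseName = hasExtension ? filePath.slice(0, lastDot) : filePath;
    const extension = hasExtension ? filePath.slice(lastDot + 1) : '';

    if (!extensionsByBaseName.has(baseName)) {
      extensionsByBaseName.set(baseName, []);
    }
    extensionsByBaseName.get(baseName).push(extension);
  }

  return [...extensionsByBaseName]
    .filter(([, extensions]) => extensions.length > 1)
    .map(([baseName, extensions]) => ({ baseName, extensions }));
}

// The first two paths come from the log above; the third is a made-up example.
const duplicates = findDuplicatedSnapshots([
  'Zalando/Data Catalogue for Vetted Researchers.html',
  'Zalando/Data Catalogue for Vetted Researchers.pdf',
  'Zalando/Some Other Terms.html',
]);
console.log(duplicates);
```

Such a check only detects the duplicated state; deciding which of the two files is the legitimate one (here, presumably the PDF) would still need a rule or a manual review.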

I can probably work around it by manually removing the faulty HTML file, but this issue will likely happen again on future runs.
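
Regarding the lack of recovery, here is a rough sketch of what I would expect: a failure on one terms entry is collected and reported, and the run continues with the other entries. `trackAll` and `trackOne` are hypothetical names used for illustration only; I have not checked how the engine actually structures this loop.

```javascript
// Hypothetical sketch, not the engine's actual code.
// `trackOne` stands in for whatever processes a single terms entry.
async function trackAll(termsList, trackOne) {
  const failures = [];

  for (const terms of termsList) {
    try {
      await trackOne(terms);
    } catch (error) {
      // Record the failure instead of letting it abort the whole run.
      failures.push({ terms, message: error.message });
    }
  }

  return failures;
}

// Example with a stand-in `trackOne` that fails for a single entry.
trackAll(['A', 'B', 'C'], async terms => {
  if (terms === 'B') {
    throw new Error('Only one file should have been recorded');
  }
}).then(failures => console.log(failures));
```

With something along these lines, the faulty Zalando entry would be reported as an error while the other declarations are still processed.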
