
What's the canonical way to ETL an incremental pipeline of this data? #2839

@ghost

Description

I work at DBT and have been improving an ETL pipeline we have for gov.uk content, driven by parameters the department needs. I'd like to configure it so it ingests and overwrites only data that has changed, rather than re-ingesting everything on every run.

My plan is:

  • Use the search API and the updated_at field to return results changed in the last few days
  • Use the content API to fetch the content, recursing through related pages to pick up collection children etc, again filtering on updated_at for new content
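The plan above, roughly sketched (stdlib only). The endpoint URLs match the public docs, but the filter names (`filter_organisations`, `filter_public_timestamp`) and the `from:` range syntax are assumptions I haven't verified against the live API:

```python
import json
from datetime import date, timedelta
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://www.gov.uk/api/search.json"


def search_params(organisation: str, days: int = 3) -> dict:
    """Query params for 'everything this org changed in the last N days'."""
    since = (date.today() - timedelta(days=days)).isoformat()
    return {
        "filter_organisations": organisation,        # assumed filter name
        "filter_public_timestamp": f"from:{since}",  # assumed range syntax
        "fields": "link,public_timestamp",
        "count": 100,
    }


def changed_links(organisation: str, days: int = 3) -> list[str]:
    """Base paths of pages the search API says changed recently."""
    url = f"{SEARCH_URL}?{urlencode(search_params(organisation, days))}"
    with urlopen(url) as resp:
        return [result["link"] for result in json.load(resp)["results"]]
```

Each `link` then becomes a content API fetch (`https://www.gov.uk/api/content<link>`), recursing through related pages as described.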

From the other side of the API, is that a good plan?

  • Is updated_at reliably updated? Is it safe to base a pipeline on?
  • Do I actually need to recurse through the children once this is incremental? It's there because we found that filtering on our department in the search API missed many documents our department had published on related pages
  • On testing I'll sometimes get JSONDecodeError for very new items, which makes me think I'm picking up drafts. Is there a field I'm missing to ignore these until they're ready?
