
What's the canonical way to ETL an incremental pipeline of this data? #2839

@ghost

Description

I work at DBT and have been improving an ETL pipeline we have for gov.uk content, driven by parameters the department needs. I'd like to configure it so it ingests and overwrites only data that has changed, rather than re-ingesting everything on every run.

My plan is:

  • Use the search API and the updated_at field to return results changed in the last few days
  • Use the content API to fetch the content, recursing through related pages to pick up collection children etc, again filtering on updated_at for new content
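The plan above, roughly sketched (stdlib only). The endpoint URLs match the public docs, but the filter names (`filter_organisations`, `filter_public_timestamp`) and the `from:` range syntax are assumptions I haven't verified against the live API:

```python
import json
from datetime import date, timedelta
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://www.gov.uk/api/search.json"


def search_params(organisation: str, days: int = 3) -> dict:
    """Query params for 'everything this org changed in the last N days'."""
    since = (date.today() - timedelta(days=days)).isoformat()
    return {
        "filter_organisations": organisation,        # assumed filter name
        "filter_public_timestamp": f"from:{since}",  # assumed range syntax
        "fields": "link,public_timestamp",
        "count": 100,
    }


def changed_links(organisation: str, days: int = 3) -> list[str]:
    """Base paths of pages the search API says changed recently."""
    url = f"{SEARCH_URL}?{urlencode(search_params(organisation, days))}"
    with urlopen(url) as resp:
        return [result["link"] for result in json.load(resp)["results"]]
```

Each `link` then becomes a content API fetch (`https://www.gov.uk/api/content<link>`), recursing through related pages as described.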

From the other side of the API, is that a good plan?

  • Is updated_at reliably updated? Is it safe to base a pipeline on?
  • Do I actually need to recurse through the children once this is incremental? It's there because we found that filtering on our department in the search API missed many documents our department had published on related pages
  • On testing I'll sometimes get JSONDecodeError for very new items, which makes me think I'm picking up drafts. Is there a field I'm missing to ignore these until they're ready?
