How should we handle data duplication, provenance, metadata? #499

clizbe · 2024-02-26T15:59:10Z

clizbe
Feb 26, 2024
Maintainer

@suvayu asked the question in the TIO repo, so I thought I'd post it here for discussion.

The thought is that while we're doing analyses, we'll have a tendency to save-off datasets, duplicating a lot of stuff.
Of course Julia+DuckDB can do database manipulations without duplicating data, but then we have to decide the workflow and when we WANT it to save.

I see creating a scenario as a building-up process. So when do we save and when do we leave something as a "run this script again" process?

1. In DuckDB we should have copies of (or references to) major data sources (e.g., ENTSO-E).
2. Merging those data sources, we get a few core databases (e.g., NLD, Europe).
3. To make a scenario we may want to alter (maybe aggregate) a core database to create the base case.
4. From the base case, we create alternatives.
5. We run the base case and alternatives, exporting the results.
6. We build graphs using Input+Output data.
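
A minimal sketch of the build-up steps above, where each stage is a view on the previous one so only the source rows are stored. Assumptions: it uses Python's built-in `sqlite3` purely as a stand-in so it runs without extra packages (DuckDB also supports `CREATE VIEW`), and the table, view, and column names are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical stand-in for an imported source table (e.g. from ENTSO-E).
con.execute("CREATE TABLE source_capacity (country TEXT, tech TEXT, mw REAL)")
con.executemany(
    "INSERT INTO source_capacity VALUES (?, ?, ?)",
    [("NLD", "wind", 100.0), ("NLD", "solar", 50.0), ("DEU", "wind", 300.0)],
)

# Core database: a filtered view over the source, no rows are copied.
con.execute(
    "CREATE VIEW core_nld AS "
    "SELECT * FROM source_capacity WHERE country = 'NLD'"
)

# Base case: an aggregation of the core data.
con.execute(
    "CREATE VIEW base_case AS "
    "SELECT country, SUM(mw) AS total_mw FROM core_nld GROUP BY country"
)

# Alternative: derived from the base case (here, 20% more capacity).
con.execute(
    "CREATE VIEW alt_high AS "
    "SELECT country, total_mw * 1.2 AS total_mw FROM base_case"
)

base = con.execute("SELECT total_mw FROM base_case").fetchone()[0]
alt = con.execute("SELECT total_mw FROM alt_high").fetchone()[0]

# Only the source table physically holds data; everything else is a query.
stored_tables = [
    row[0]
    for row in con.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
print(base, alt, stored_tables)
```

In this style, "saving" becomes an explicit decision (materialise a view into a table or file) rather than a side effect of each step.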

Issues to think about:

- Reproducibility (what happens if a database changes?)
- Minimal duplication
- Metadata when saving databases
- Different units in source data -> waiting for UnitsJuMP.jl

@gnawin @datejada @nope82 @g-moralesespana

clizbe · 2024-04-08T12:27:03Z

clizbe
Apr 8, 2024
Maintainer Author

We discussed this at lunch the other day. We thought for a given project we might have a folder containing:

- Project.toml - for version control of software
- Script with data manipulation pipeline
  - Any minor data files specific to the project (should be rare)
- Maybe saved-off copy of input data
  - Can we version-control the input data so this isn't necessary?
- Output data
- Graphs OR scripts to create graphs

5 replies

gnawin Apr 8, 2024
Collaborator

For me this depends on what we consider as input data, and what counts as a minor change. From my experience, even input data needs versioning when there are minor changes. So it would be great to have some kind of versioning system for the input data.

> Maybe saved-off copy of input data
>
> Can we version-control the input data so this isn't necessary?

suvayu Apr 8, 2024
Collaborator

This is a tricky problem. Is it possible to list out a few different ways input data changes? E.g.

- The original data repo changed what it returns when you enter the same query. In this scenario, does the data repo include a version number in its metadata? If so, we can record and use that.
- If, on the other hand, it's undocumented upstream, then we can calendar-version it every time we import, but that would also mean recording the schema/columns for that version on our end (maybe this is required for the above case as well).
- Is the input changing because the analyst is "fixing" something because of new understanding? Then it's not a new data version, it's a new step in the workflow.

These are a few possibilities off the top of my head.
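
To make the calendar-versioning idea concrete, here is a rough sketch of what could be recorded at each import of an unversioned source: a date-based version, the column names, and a content hash. Everything here is an assumption for illustration: the function names, the record fields, and the sample CSV content are hypothetical, and a real import would read a downloaded file instead of an inline string.

```python
import csv
import hashlib
import io
import json
from datetime import date


def snapshot_record(name, csv_text, imported_on):
    """Describe one import of an upstream dataset that has no version of its own."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return {
        "source": name,
        "version": imported_on.strftime("%Y.%m.%d"),  # calendar version
        "columns": header,  # schema as seen at import time
        "sha256": hashlib.sha256(csv_text.encode("utf-8")).hexdigest(),
    }


def schema_changed(old, new):
    """True if the column names or their order differ between two imports."""
    return old["columns"] != new["columns"]


# Two hypothetical imports of the same source, a couple of days apart.
first = snapshot_record(
    "demand",
    "timestamp,country,load_mw\n2024-01-01T00:00,NLD,11000\n",
    date(2024, 4, 8),
)
second = snapshot_record(
    "demand",
    "timestamp,zone,load_mw\n2024-01-01T00:00,NL,11000\n",
    date(2024, 4, 10),
)

print(json.dumps(first, indent=2))
print("schema changed:", schema_changed(first, second))
```

Comparing the hash tells us whether upstream returned different data for the same query; comparing the columns tells us whether the pipeline scripts are likely to break.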

gnawin Apr 8, 2024
Collaborator

I was mainly talking about the last situation, i.e., an input database can consist of different data sources that do not change, but how you combine these sources, and even adjust them, can mean an evolution of the input database. Experience shows that we don't finish discussing the input database until the last moment, and in the meantime, runs have to be done. These changes can be major or minor, but in any case, the database goes through iterations.

clizbe Apr 8, 2024
Maintainer Author

I don't think many of our data sources will have versioning, so we probably need to handle it. We also might need local storage of some sources, rather than just linking to them.

suvayu Apr 10, 2024
Collaborator

> We also might need local storage of some sources, rather than just linking to them.

How are these datasets typically available?

1. connect, query, get a result in a database client.
2. login to a website, browse through some taxonomy menu, and download a file.
3. same as (2), but before the download, you can put a filter, e.g. exclude columns, or select a region on a map, etc.

For 1, if the source is a relational database, "saving a copy" in DuckDB is quite likely simple. For 2 and 3, you just keep the file, or if the file is in a custom/irregular format, save a processed file as one of Parquet, CSV, or JSON-LD. If uniformity is preferred, we can also import these into DuckDB, as in 1.

clizbe · 2024-08-14T15:42:35Z

clizbe
Aug 14, 2024
Maintainer Author

In recent discussions, we've been thinking of having scripts for the processing from Raw to Model, and version-controlling those scripts. Depending on how long processing takes, we might also save the "Tulipa-friendly" versions of data sources, which users can manipulate/combine for scenarios. But this runs the risk of users permanently editing these "Tulipa-friendly" files without documenting what they did.
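
One way to at least notice undocumented edits is to store a checksum manifest when the processing script writes the files and compare against it later. This is only a sketch under assumptions: the helper names and the `assets.csv` file are hypothetical, and it does not handle files that were added or deleted after the manifest was built.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path):
    """SHA-256 of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_manifest(folder):
    """Map each file name in `folder` to its SHA-256 at processing time."""
    return {
        p.name: file_sha256(p)
        for p in sorted(Path(folder).iterdir())
        if p.is_file()
    }


def modified_files(folder, manifest):
    """Names of files whose content no longer matches the manifest."""
    return [
        name
        for name, digest in manifest.items()
        if file_sha256(Path(folder) / name) != digest
    ]


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "assets.csv"  # hypothetical processed file
    data.write_text("name,capacity\nwind,100\n")

    manifest = build_manifest(tmp)  # recorded right after processing
    untouched = modified_files(tmp, manifest)

    data.write_text("name,capacity\nwind,120\n")  # simulate a hand edit
    edited = modified_files(tmp, manifest)

print(untouched, edited)
```

The manifest itself could live next to the processing scripts under version control, so a mismatch points to a file that was changed outside the scripted pipeline.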

Data provenance is a hard problem. This (or something similar) might be a solution (thanks @suvayu!):
https://dvc.org/

@datejada @gnawin

2 replies

clizbe Aug 14, 2024
Maintainer Author

Also consider:
https://github.com/sentinel-energy/friendly_data
https://frictionlessdata.io/

g-moralesespana Aug 15, 2024
Maintainer

Agreed, there should also be a "simple way" where the user can manipulate the data by hand: because the example is very small, or because they just want to copy/paste from their own database without creating a script for it (because it might be done just once). Currently it can be done through CSVs, but they are really not friendly. We need an alternative that can persuade the Access/Excel lovers.
We can talk about this in the coming Tulipa day and also during the dedicated data management session in a couple of weeks.
