Replies: 3 comments 6 replies
-
Hi @ezwelty, I'll ask @roll to respond in detail, but I wanted to say that this is great to see & we're definitely excited to help/discuss!
-
@ezwelty And regarding performance, it would be great if we could start to identify and reduce bottlenecks, so real-world examples of performance problems will be very helpful. For example, if you use …
-
I see. Actually, Frictionless can "resolve" foreign keys on reading. I'm wondering if we can bring it to validation.
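For reference, a minimal sketch of how a cross-resource foreign key is declared in a Data Package descriptor and passed to frictionless-py's `validate` (the resource and field names are made up, and whether unmatched keys are reported as validation errors depends on the library version):

```python
from frictionless import Package, validate

# Hypothetical two-resource package: measurements.glacier_id must reference glaciers.id.
package = Package({
    "name": "example",
    "resources": [
        {
            "name": "glaciers",
            "path": "glaciers.csv",
            "schema": {
                "fields": [{"name": "id", "type": "string"}],
                "primaryKey": ["id"],
            },
        },
        {
            "name": "measurements",
            "path": "measurements.csv",
            "schema": {
                "fields": [
                    {"name": "glacier_id", "type": "string"},
                    {"name": "thickness", "type": "number"},
                ],
                "foreignKeys": [
                    {
                        "fields": ["glacier_id"],
                        "reference": {"resource": "glaciers", "fields": ["id"]},
                    }
                ],
            },
        },
    ],
})

report = validate(package)
print(report.valid)
print(report.flatten(["code", "message"]))  # error property names vary between versions
```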
-
I am the new "Database Manager" at the World Glacier Monitoring Service (WGMS). Previously, I restructured one of their side datasets as a continuously-validated Data Package (https://gitlab.com/wgms/glathida) and wrote a journal article about the benefits.
Now, over the next three years, I am working on their main data pipeline. Every year, they receive (tabular) data submissions (of glacier observations) from a network of observers scattered around the world. My main task is to build a system for validating these submissions with automated and human-supervised checks. My hope is to do so in part by improving existing open-source tools, potentially using the Frictionless-verse as my starting point. Although the data ultimately ends up in a SQL database, it seems that relying on SQL alone to generate a human-friendly validation report would require horrific acrobatics.
So in short, I am thinking of building a software stack (user interface, API, and validation engine) similar to the Validata / Etalab project, but with (i) a connection to a database for consistency checks (and tabular diffs) with existing data, and (ii) support for Tabular Data Packages (with relations between Resources) rather than just standalone Table Schema.
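To give a concrete flavor of the database-consistency piece (all table, column, and file names below are made up), the kind of tabular diff I have in mind could start with a simple keyed comparison of a submission against the existing table:

```python
import sqlite3

import pandas as pd

# Hypothetical setup: existing observations live in a SQL table; a new submission arrives as CSV.
con = sqlite3.connect("wgms.sqlite")
existing = pd.read_sql("SELECT glacier_id, year, mass_balance FROM observations", con)
submission = pd.read_csv("submission.csv")

# Outer-join on the natural key and compare the remaining columns to classify each row
# as added, removed, or changed relative to what is already in the database.
key = ["glacier_id", "year"]
diff = existing.merge(
    submission, on=key, how="outer", suffixes=("_old", "_new"), indicator=True
)
added = diff[diff["_merge"] == "right_only"]
removed = diff[diff["_merge"] == "left_only"]
changed = diff[
    (diff["_merge"] == "both") & (diff["mass_balance_old"] != diff["mass_balance_new"])
]
print(len(added), len(removed), len(changed))
```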
On the documentation side, I could, for example:

- Extend `table-schema-to-markdown` and add this to `frictionless-py` (see Auto-generate data docs - Convert table schema, data resource and data package to markdown #665, and Implement package/resource/schema.to_markdown frictionless-py#837).
- Extend `tableschema-to-template` to work with Tabular Data Package and add this to `frictionless-py` (see Integrate "Table Schema -> Excel Templates" library frictionless-py#584).
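As a toy illustration of the first item (a hand-written schema, not the actual API of `table-schema-to-markdown` or `frictionless-py`), the core of it is rendering a Table Schema's fields as a markdown table:

```python
# Toy sketch: render the fields of a Table Schema descriptor as a markdown table.
schema = {
    "fields": [
        {"name": "glacier_id", "type": "string", "description": "WGMS glacier identifier"},
        {"name": "thickness", "type": "number", "description": "Ice thickness (m)"},
    ]
}

lines = ["| name | type | description |", "| --- | --- | --- |"]
for field in schema["fields"]:
    lines.append(
        f"| {field['name']} | {field.get('type', 'any')} | {field.get('description', '')} |"
    )
print("\n".join(lines))
```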
On the validation side, it is less clear how I can plug into existing tooling.

- Checks that involve multiple Resources (e.g. foreign keys) do not seem to be covered by `frictionless-py`. Perhaps I could build a pre-processor that performs table joins to reduce Package checks to Resource checks for use in `frictionless-py`, or I could roll something new (a rough sketch of the join idea is at the end of this post).
- Row-by-row validation is slow with `frictionless.validate`, which is why I wrote a Pandas-based alternative (https://github.com/ezwelty/goodtables-pandas-py). Loading many rows into memory for vectorized operations does not play well with the Frictionless approach, but it is fast. At the cost of additional overhead for writing custom tests (which we will need to maintain over time), perhaps Cython (or Javascript) would be the answer to faster row-by-row validation?

Long story short, I would love to hear your thoughts about what I am trying to achieve and where you see opportunities for me to contribute to shared software tools.
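To make the pre-processor idea above a bit more concrete, here is a minimal sketch (pandas, with made-up resource and column names) of reducing a Package-level foreign-key check to a per-table check by joining the child resource onto its parent key:

```python
import pandas as pd

# Made-up resources: a parent table of glaciers and a child table of measurements.
glaciers = pd.DataFrame({"id": ["G1", "G2", "G3"]})
measurements = pd.DataFrame(
    {"glacier_id": ["G1", "G2", "G9"], "thickness": [120.0, 85.5, 40.0]}
)

# Left-join the child onto the parent key: rows without a match violate the foreign key,
# turning a cross-resource (Package) check into a per-row (Resource) check.
parent_keys = glaciers.rename(columns={"id": "glacier_id"}).assign(_matched=True)
joined = measurements.merge(parent_keys, on="glacier_id", how="left")
violations = joined[joined["_matched"].isna()]
print(violations[["glacier_id"]])  # the row with glacier_id "G9"
```

The unmatched rows could then be reported just like any other Resource-level error.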