Replies: 3 comments 6 replies
-
Hi @ezwelty, I'll ask @roll to respond in detail, but I wanted to say that this is great to see & we're definitely excited to help/discuss!
-
@ezwelty And regarding performance, it would be great if we could start to identify and reduce bottlenecks, so real-world examples of performance problems will be very helpful. For example, if you use …
-
I see. Actually, Frictionless can "resolve" foreign keys on reading. I'm wondering if we can bring it to validation.
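For reference, a minimal sketch of how a cross-resource foreign key is declared in a Data Package descriptor and passed to frictionless-py's `validate` (the resource and field names are made up, and whether unmatched keys are reported as validation errors depends on the library version):

```python
from frictionless import Package, validate

# Hypothetical two-resource package: measurements.glacier_id must reference glaciers.id.
package = Package({
    "name": "example",
    "resources": [
        {
            "name": "glaciers",
            "path": "glaciers.csv",
            "schema": {
                "fields": [{"name": "id", "type": "string"}],
                "primaryKey": ["id"],
            },
        },
        {
            "name": "measurements",
            "path": "measurements.csv",
            "schema": {
                "fields": [
                    {"name": "glacier_id", "type": "string"},
                    {"name": "thickness", "type": "number"},
                ],
                "foreignKeys": [
                    {
                        "fields": ["glacier_id"],
                        "reference": {"resource": "glaciers", "fields": ["id"]},
                    }
                ],
            },
        },
    ],
})

report = validate(package)
print(report.valid)
print(report.flatten(["code", "message"]))  # error property names vary between versions
```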
-
I am the new "Database Manager" at the World Glacier Monitoring Service (WGMS). Previously, I restructured one of their side datasets as a continuously-validated Data Package (https://gitlab.com/wgms/glathida) and wrote a journal article about the benefits.
Now, over the next three years, I am working on their main data pipeline. Every year, they receive (tabular) data submissions (of glacier observations) from a network of observers scattered around the world. My main task is to build a system for validating these submissions with automated and human-supervised checks. My hope is to do so in part by improving existing open-source tools, potentially using the Frictionless-verse as my starting point. Although the data ultimately ends up in a SQL database, it seems that relying on SQL alone to generate a human-friendly validation report would require horrific acrobatics.
So in short, I am thinking of building a software stack (user interface, API, and validation engine) similar to the Validata / Etalab project, but with (i) a connection to a database for consistency checks (and tabular diffs) with existing data, and (ii) support for Tabular Data Packages (with relations between Resources) rather than just standalone Table Schema.
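To give a concrete flavor of the database-consistency piece (all table, column, and file names below are made up), the kind of tabular diff I have in mind could start with a simple keyed comparison of a submission against the existing table:

```python
import sqlite3

import pandas as pd

# Hypothetical setup: existing observations live in a SQL table; a new submission arrives as CSV.
con = sqlite3.connect("wgms.sqlite")
existing = pd.read_sql("SELECT glacier_id, year, mass_balance FROM observations", con)
submission = pd.read_csv("submission.csv")

# Outer-join on the natural key and compare the remaining columns to classify each row
# as added, removed, or changed relative to what is already in the database.
key = ["glacier_id", "year"]
diff = existing.merge(
    submission, on=key, how="outer", suffixes=("_old", "_new"), indicator=True
)
added = diff[diff["_merge"] == "right_only"]
removed = diff[diff["_merge"] == "left_only"]
changed = diff[
    (diff["_merge"] == "both") & (diff["mass_balance_old"] != diff["mass_balance_new"])
]
print(len(added), len(removed), len(changed))
```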
On the documentation side, I could, for example:

- Extend `table-schema-to-markdown` and add this to `frictionless-py` (see Auto-generate data docs - Convert table schema, data resource and data package to markdown #665, and Implement package/resource/schema.to_markdown frictionless-py#837).
- Extend `tableschema-to-template` to work with Tabular Data Package and add this to `frictionless-py` (see Integrate "Table Schema -> Excel Templates" library frictionless-py#584).
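As a toy illustration of the first item (a hand-written schema, not the actual API of `table-schema-to-markdown` or `frictionless-py`), the core of it is rendering a Table Schema's fields as a markdown table:

```python
# Toy sketch: render the fields of a Table Schema descriptor as a markdown table.
schema = {
    "fields": [
        {"name": "glacier_id", "type": "string", "description": "WGMS glacier identifier"},
        {"name": "thickness", "type": "number", "description": "Ice thickness (m)"},
    ]
}

lines = ["| name | type | description |", "| --- | --- | --- |"]
for field in schema["fields"]:
    lines.append(
        f"| {field['name']} | {field.get('type', 'any')} | {field.get('description', '')} |"
    )
print("\n".join(lines))
```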
On the validation side, it is less clear how I can plug into existing tooling.

- Checks that involve multiple Resources (e.g. foreign keys) do not seem to be covered by `frictionless-py`. Perhaps I could build a pre-processor that performs table joins to reduce Package checks to Resource checks for use in `frictionless-py`, or I could roll something new (a rough sketch of the join idea is at the end of this post).
- Row-by-row validation is slow with `frictionless.validate`, which is why I wrote a Pandas-based alternative (https://github.com/ezwelty/goodtables-pandas-py). Loading many rows into memory for vectorized operations does not play well with the Frictionless approach, but it is fast. At the cost of additional overhead for writing custom tests (which we will need to maintain over time), perhaps Cython (or Javascript) would be the answer to faster row-by-row validation?

Long story short, I would love to hear your thoughts about what I am trying to achieve and where you see opportunities for me to contribute to shared software tools.
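To make the pre-processor idea above a bit more concrete, here is a minimal sketch (pandas, with made-up resource and column names) of reducing a Package-level foreign-key check to a per-table check by joining the child resource onto its parent key:

```python
import pandas as pd

# Made-up resources: a parent table of glaciers and a child table of measurements.
glaciers = pd.DataFrame({"id": ["G1", "G2", "G3"]})
measurements = pd.DataFrame(
    {"glacier_id": ["G1", "G2", "G9"], "thickness": [120.0, 85.5, 40.0]}
)

# Left-join the child onto the parent key: rows without a match violate the foreign key,
# turning a cross-resource (Package) check into a per-row (Resource) check.
parent_keys = glaciers.rename(columns={"id": "glacier_id"}).assign(_matched=True)
joined = measurements.merge(parent_keys, on="glacier_id", how="left")
violations = joined[joined["_matched"].isna()]
print(violations[["glacier_id"]])  # the row with glacier_id "G9"
```

The unmatched rows could then be reported just like any other Resource-level error.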