Dear developers / project owners,
Up front: this is not a bug report, but rather a question about how I can contribute to your project.
I created this account with my company email address; we are based in the south of Luxembourg and work in the energy business.
We are currently working on improving data quality along the flow of data items through our system (this might sound a bit like buzzword bingo ;)).
We have different data sources with different formats: CSV, JSON, CSV disguised as Excel, and some mythical XMLs. They all arrive in S3 or on some FTP infrastructure.
We have committed to using ODCS for our contracts, which model the expected format and quality at each step of the process.
We've run some test cases with the datacontract-cli framework via Python, integrated it into our Airflow infrastructure, ran tests, and generated notifications from the results.
I've run into a few issues while using the framework to validate files. Two of them are:
- Translation from ODCS to DCS format: not all keywords are translated.
Example: We have this definition for a field:
    logicalTypeOptions:
      minLength: 36
      maxLength: 36
      pattern: '^[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}$'
These elements are not yet handled in the odcs_v3_importer (don't get me wrong, that's no criticism ;)).
I've implemented three lines that handle this and work in our environment; it would be a shame to keep them to myself.
- Executing tests on CSV files with some dirt, such as # used as a comment symbol.
CSV reading is handled through DuckDB, and the "autosensing" of the format fails if a comment is present in the file.
DuckDB is able to handle this through some options of the read_csv function, but in this case something goes wrong; I haven't figured out what yet.
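For the first point, the idea can be sketched roughly as follows. This is not the actual patch and does not use the real odcs_v3_importer internals: the function name copy_logical_type_options and the plain-dict representation of the field are assumptions for illustration, as is the assumption that the target field uses the same key names (minLength, maxLength, pattern).

```python
import re

# Keys from logicalTypeOptions that should be carried over to the field.
CONSTRAINT_KEYS = ("minLength", "maxLength", "pattern")


def copy_logical_type_options(odcs_property: dict, field: dict) -> dict:
    """Hypothetical helper: return a copy of `field` with the supported
    logicalTypeOptions of an ODCS property copied onto it."""
    options = odcs_property.get("logicalTypeOptions") or {}
    result = dict(field)
    for key in CONSTRAINT_KEYS:
        if key in options:
            result[key] = options[key]
    return result


# The definition from the example above, written as a Python dict.
odcs_property = {
    "name": "id",
    "logicalType": "string",
    "logicalTypeOptions": {
        "minLength": 36,
        "maxLength": 36,
        "pattern": r"^[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}$",
    },
}

field = copy_logical_type_options(odcs_property, {"type": "string"})

# Sanity check: a UUID-shaped value satisfies the copied constraints.
sample_value = "123e4567-e89b-12d3-a456-426614174000"
is_valid = (
    field["minLength"] <= len(sample_value) <= field["maxLength"]
    and re.fullmatch(field["pattern"], sample_value) is not None
)
```

Keys that are absent from logicalTypeOptions are simply skipped, so properties without these options pass through unchanged.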
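For the second point, one possible stopgap until the DuckDB behaviour is understood would be to strip comment lines before the file reaches the CSV reader. The sketch below uses only the Python standard library and is independent of datacontract-cli and DuckDB; the function name read_csv_without_comments and the sample data are made up for illustration.

```python
import csv
import io


def read_csv_without_comments(text: str, comment_char: str = "#") -> list[dict]:
    """Hypothetical helper: drop blank lines and lines that start with
    `comment_char`, then parse the remainder as CSV with a header row."""
    lines = [
        line
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith(comment_char)
    ]
    return list(csv.DictReader(io.StringIO("\n".join(lines))))


# Made-up sample: comment lines before the header and between data rows.
sample = "# export from source system\nid,value\n1,10\n# trailing note\n2,20\n"
rows = read_csv_without_comments(sample)
```

This filters line by line, so it would wrongly drop a continuation line of a quoted multi-line field that happens to start with the comment character; it is only meant for simple files.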
I wanted to keep this short; that didn't work so well. To get to the point: I would like to contribute to your project and coordinate with you on how to do this without disrupting any of your processes ;).
I'm also quite new to GitHub and have not worked on any projects yet; I understand that I can't simply commit my changes to the main branch.
In case you would like to reach me by email, I've provided my company address in my profile. German is fine with me ;).
Thank you!
Greetings,
Roland