@EricBenschneider

This PR adds CSV schema detection based on data as described in issue #63.
This PR also enables reading a CSV file without manually adding a meta data file as discussed in issue #688.
When a CSV file is read and the MetaDataParser cannot find an appropriate meta data file, one is now generated automatically and stored at the expected location.
The changes are tested by the newly added generateMetaDataFileTest.cpp.
A new test case in MetaDataParserTest.cpp covers saving of the generated meta data file.
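
For illustration, reading a CSV file without a meta data file now works along these lines (a minimal sketch; the file name data.csv is hypothetical):

      // data.csv has no accompanying data.csv.meta file yet
      print(readFrame("data.csv"));
      // afterwards, the inferred schema is stored in data.csv.meta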

@auge (Contributor) commented Feb 4, 2025

Thanks a lot for this PR. I'm looking into the usage side of it (no real code review from my side).

Currently, I see a lot of debug output being generated; I guess it should rather be silenced? I have 1e9 rows in my CSV file, which produces lines like inferred valueType: ..., so I am aborting the execution as it would take a long time.

When trying it on a smaller file (10 rows), I get a segfault on my system (probably while writing the .meta file). When the meta file already exists, daphne continues normally.

As I have not looked into the code, I would like to inquire whether a failure to write the .meta file will be silent. I expect daphne to continue normally even when the meta file cannot be written, e.g., when the CSV file lies at a location where the user is not allowed to write.

@pdamme self-requested a review February 10, 2025 15:00
@pdamme (Collaborator) left a comment

Thanks for this contribution, @EricBenschneider! Being able to read a CSV file without a meta data file present is great for users (even though it will typically come with a certain performance penalty).

Thanks @auge for having a first look at this new feature. I can confirm that it crashes when used from a DaphneDSL script. The case of writing the meta data file to a non-writable location was most likely solved by @EricBenschneider's recent commit e481039, but would need additional testing.

So overall, this is a good first step towards solving #63, but a few points need to be addressed:

Required changes: (must be addressed before the merge)

  1. Make the CI checks pass. Several test cases failed; however, on my system all tests pass... Let's have a look again after you've addressed the other points. The clang-format check failed; please format all changed files with the clang-format specification in .clang-format (maybe you can use something like "format document" in your IDE).
  2. Extend the test cases. Please consider the following (see the combined sketch after this list):
    1. Employ a quoted/multi-line string somewhere in the test CSV files (DAPHNE's CSV reader supports such strings).
    2. Add at least one script-level test case to check if your feature works from DaphneDSL. Currently, it doesn't seem to work, as @auge found out. For instance, the following DaphneDSL script:
      print(readFrame("my-frame.csv"));
      
      reading the following CSV file my-frame.csv (with no meta data file present):
      1,1.1,one
      
      crashes with this error message:
      inferred valueType: 0, 1.
      inferred valueType: 6, 1.1.
      inferred valueType: 8, one.
      bin/daphne(+0x14a4422)[0x565112219422]
      /lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x7b512b129320]
      bin/daphne(+0x1e3abed)[0x565112bafbed]
      ...
      bin/daphne(+0x1557c67)[0x5651122ccc67]
      Segmentation fault (core dumped)
      
      
    3. Test if CSV values like "12+34" are correctly identified as strings, not as integers.
    4. Test if CSV values like "1.23abc" are correctly identified as strings.
  3. Make sure your changes to the read kernel are reflected in DaphneDSL and DaphneIR. You added an additional parameter specifying if a header/labels row is present. First, that parameter must not come after the DaphneContext (DCTX) in the kernel's parameter list; otherwise, there will be problems when lowering DaphneIR's ReadOp to the kernel call. The additional parameter must also be reflected in the registration of kernels (src/runtime/local/kernels/kernels.json), DaphneIR's ReadOp (src/ir/daphneir/DaphneOps.td), and the readMatrix()/readFrame() DaphneDSL built-in functions (src/parser/daphnedsl/DaphneDSLBuiltins.cpp). This point will be necessary for writing a script-level test case.
  4. Please remove all custom prints to stdout/stderr from your code. Consider using DAPHNE's logger (with an appropriate log level like info or debug) for important messages (but I think it's okay to just omit the ones you have). We generally want to avoid unnecessary output.
  5. Some of the newly added test cases assume that a meta data file doesn't exist in the beginning, but don't check if this is really the case. They do (try to) remove the (supposedly) created meta data file afterwards, but in case something goes wrong there, the next test run could still see the meta data file created by the previous test run.
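
As a combined sketch of points 2.1, 2.3, and 2.4, a test CSV file (the name schema-corner-cases.csv is hypothetical, not an existing test file) could place a quoted multi-line string next to integer-like and float-like prefixes:

      "a multi-line
      string",12+34,1.23abc

All three columns should be detected as strings, which a script-level test could check via:

      print(readFrame("schema-corner-cases.csv"));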

Optional changes: (can be addressed before the merge or later)

  1. Please make sure that the corner case of reading a CSV file with zero rows (and no meta data file) is handled gracefully. To this end, please add a respective test case. Most importantly, DAPHNE should not simply crash in such a case. There should either be an error (because types cannot be inferred when there is no data), or the most general value type (i.e., string) should be assumed for all columns. It would be worthwhile to find out how other systems/libraries (e.g., pandas) handle this case.
  2. Make sure that malformed CSV files are handled gracefully. Currently, you say numCols = std::max(numCols, colIndex); in generateFileMetaData(). To my mind, a different number of columns per row should raise an error (see the sketch after this list).
  3. Currently, the entire CSV file is read once for detecting the schema and once again for the actual read. That will likely result in a 2x slow-down for reading CSV files without a meta data file, which is quite expensive. It would be great to think about more efficient approaches, such as scanning the input file just once, interleaving reading and schema detection, or detecting the schema based on a sample (with the option of correcting it in case not all data fits).
  4. Furthermore, the code for detecting the schema implements a little CSV reader on its own (but doesn't seem to support all features of DAPHNE's existing CSV reader, e.g., multi-line strings). It would be great to avoid redundant code.
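
A minimal sketch of the error check suggested in optional point 2, assuming generateFileMetaData() tracks the current row index and per-row column count in variables like rowIdx and colIndex (these names are assumptions, not the actual code):

      // inside the per-row loop of generateFileMetaData() (hypothetical names);
      // requires <stdexcept> and <string>
      if (rowIdx == 0) {
          numCols = colIndex; // the first row defines the expected column count
      } else if (colIndex != numCols) {
          throw std::runtime_error(
              "malformed CSV file: row " + std::to_string(rowIdx) + " has " +
              std::to_string(colIndex) + " columns, but " +
              std::to_string(numCols) + " were expected");
      }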

Let me know if you have any comments/questions or need further advice to address these points.

@EricBenschneider force-pushed the 63-detection-of-csv-schema branch 2 times, most recently from 0003949 to 68ba270 on February 13, 2025
@EricBenschneider force-pushed the 63-detection-of-csv-schema branch from 68ba270 to fb85dfa on February 17, 2025
@EricBenschneider (Author) commented

Thank you both for the feedback. Most of it has been incorporated.

Please note that I decided to implement the meta data generation within the scope of MetaDataParser::readMetaData().
Having it directly in the read kernel would be a better fit for on-the-fly type inference, which would greatly reduce the current overhead introduced by parsing the file twice. On the other hand, we currently need meta data for shape, frame label, and type inference, which are all performed before the read kernel is called. Therefore, an implementation in the read kernel alone would not be sufficient.
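
A rough sketch of how this fall-back could look inside MetaDataParser::readMetaData() (the member names and error handling here are assumptions, not the exact implementation):

      // sketch; assumes <filesystem> and the existing FileMetaData type
      FileMetaData MetaDataParser::readMetaData(const std::string &filename) {
          const std::string metaFilename = filename + ".meta";
          if (!std::filesystem::exists(metaFilename)) {
              // no meta data file present: infer the schema from the data
              FileMetaData fmd = generateFileMetaData(filename);
              try {
                  // best-effort persistence; the location may be read-only
                  writeMetaData(filename, fmd);
              } catch (...) {
                  // proceed without persisting the meta data
              }
              return fmd;
          }
          // ... existing parsing of the JSON meta data file ...
      }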

More details about the resulting overhead can be found in the evaluation results:
[figure: create_read_breakdown]

@pdamme added the label "LDE winter 2024/25" (Student project in the course Large-scale Data Engineering at TU Berlin, winter 2024/25) on Mar 24, 2025