@EricBenschneider

This PR adds CSV schema detection based on data as described in issue #63.
This PR also enables reading a CSV file without manually adding a meta data file as discussed in issue #688.
When a CSV file is read and the MetaDataParser cannot find an appropriate meta data file, one is now generated automatically and stored at the expected location.
The changes are tested by the newly added generateMetaDataFileTest.cpp.
A new test case in MetaDataParserTest.cpp covers saving of the generated meta data file.
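
For illustration, reading a CSV file without a meta data file now works along these lines (a minimal sketch; the file name data.csv is hypothetical):

      // data.csv has no accompanying data.csv.meta file yet
      print(readFrame("data.csv"));
      // afterwards, the inferred schema is stored in data.csv.meta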

@auge (Contributor) commented Feb 4, 2025

Thanks a lot for this PR. I'm looking into the usage side of it (no real code review from my side).

Currently, I see a lot of debug output being generated; I guess it should rather be silenced? I have 1e9 rows in my CSV file, which produces lines like inferred valueType: ..., so I am aborting the execution as it would take a long time.

When trying it on a smaller file (10 rows), I get a segfault on my system (probably while writing the .meta file). When the meta file already exists, daphne continues normally.

As I have not looked into the code, I would like to inquire whether a failure to write the .meta file will be silent. I expect daphne to continue normally even when the meta file cannot be written, e.g., when the CSV file lies at a location where the user is not allowed to write.

@pdamme self-requested a review February 10, 2025 15:00
@pdamme (Collaborator) left a comment

Thanks for this contribution, @EricBenschneider! Being able to read a CSV file without a meta data file present is great for users (even though it will typically come with a certain performance penalty).

Thanks @auge for having a first look at this new feature. I can confirm that it crashes when used from a DaphneDSL script. The case of writing the meta data file to a non-writable location was most likely solved by @EricBenschneider's recent commit e481039, but would need additional testing.

So overall, this is a good first step towards solving #63, but a few points need to be addressed:

Required changes: (must be addressed before the merge)

  1. Make the CI checks pass. Several test cases failed; however, on my system all tests pass... Let's have a look again after you've addressed the other points. The clang-format check failed; please format all changed files with the clang-format specification in .clang-format (maybe you can use something like "format document" in your IDE).
  2. Extend the test cases. Please consider the following (see the combined sketch after this list):
    1. Employ a quoted/multi-line string somewhere in the test CSV files (DAPHNE's CSV reader supports such strings).
    2. Add at least one script-level test case to check if your feature works from DaphneDSL. Currently, it doesn't seem to work, as @auge found out. For instance, the following DaphneDSL script:
      print(readFrame("my-frame.csv"));
      
      reading the following CSV file my-frame.csv (with no meta data file present):
      1,1.1,one
      
      crashes with this error message:
      inferred valueType: 0, 1.
      inferred valueType: 6, 1.1.
      inferred valueType: 8, one.
      bin/daphne(+0x14a4422)[0x565112219422]
      /lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x7b512b129320]
      bin/daphne(+0x1e3abed)[0x565112bafbed]
      ...
      bin/daphne(+0x1557c67)[0x5651122ccc67]
      Segmentation fault (core dumped)
      
      
    3. Test if CSV values like "12+34" are correctly identified as strings, not as integers.
    4. Test if CSV values like "1.23abc" are correctly identified as strings.
  3. Make sure your changes to the read kernel are reflected in DaphneDSL and DaphneIR. You added an additional parameter specifying if a header/labels row is present. First, that parameter must not come after the DaphneContext (DCTX) in the kernel's parameter list; otherwise, there will be problems when lowering DaphneIR's ReadOp to the kernel call. The additional parameter must also be reflected in the registration of kernels (src/runtime/local/kernels/kernels.json), DaphneIR's ReadOp (src/ir/daphneir/DaphneOps.td), and the readMatrix()/readFrame() DaphneDSL built-in functions (src/parser/daphnedsl/DaphneDSLBuiltins.cpp). This point will be necessary for writing a script-level test case.
  4. Please remove all custom prints to stdout/stderr from your code. Consider using DAPHNE's logger (with an appropriate log level like info or debug) for important messages (but I think it's okay to just omit the ones you have). We generally want to avoid unnecessary output.
  5. Some of the newly added test cases assume that a meta data file doesn't exist in the beginning, but don't check if this is really the case. They do (try to) remove the (supposedly) created meta data file afterwards, but in case something goes wrong there, the next test run could still see the meta data file created by the previous test run.
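
As a combined sketch of points 2.1, 2.3, and 2.4, a test CSV file (the name schema-corner-cases.csv is hypothetical, not an existing test file) could place a quoted multi-line string next to integer-like and float-like prefixes:

      "a multi-line
      string",12+34,1.23abc

All three columns should be detected as strings, which a script-level test could check via:

      print(readFrame("schema-corner-cases.csv"));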

Optional changes: (can be addressed before the merge or later)

  1. Please make sure that the corner case of reading a CSV file with zero rows (and no meta data file) is handled gracefully. To this end, please add a respective test case. Most importantly, DAPHNE should not simply crash in such a case. There should either be an error (because types cannot be inferred when there is no data), or the most general value type (i.e., string) should be assumed for all columns. It would be worthwhile to find out how other systems/libraries (e.g., pandas) handle this case.
  2. Make sure that malformed CSV files are handled gracefully. Currently, you say numCols = std::max(numCols, colIndex); in generateFileMetaData(). To my mind, a different number of columns per row should raise an error (see the sketch after this list).
  3. Currently, the entire CSV file is read once for detecting the schema and once again for the actual read. That will likely result in a 2x slow-down for reading CSV files without a meta data file, which is quite expensive. It would be great to think about more efficient approaches, such as scanning the input file just once, interleaving reading and schema detection, or detecting the schema based on a sample (with the option of correcting it in case not all data fits).
  4. Furthermore, the code for detecting the schema implements a little CSV reader on its own (but doesn't seem to support all features of DAPHNE's existing CSV reader, e.g., multi-line strings). It would be great to avoid redundant code.
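
A minimal sketch of the error check suggested in optional point 2, assuming generateFileMetaData() tracks the current row index and per-row column count in variables like rowIdx and colIndex (these names are assumptions, not the actual code):

      // inside the per-row loop of generateFileMetaData() (hypothetical names);
      // requires <stdexcept> and <string>
      if (rowIdx == 0) {
          numCols = colIndex; // the first row defines the expected column count
      } else if (colIndex != numCols) {
          throw std::runtime_error(
              "malformed CSV file: row " + std::to_string(rowIdx) + " has " +
              std::to_string(colIndex) + " columns, but " +
              std::to_string(numCols) + " were expected");
      }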

Let me know if you have any comments/questions or need further advice to address these points.

@EricBenschneider force-pushed the 63-detection-of-csv-schema branch 2 times, most recently from 0003949 to 68ba270 on February 13, 2025
@EricBenschneider force-pushed the 63-detection-of-csv-schema branch from 68ba270 to fb85dfa on February 17, 2025
@EricBenschneider (Author) commented

Thank you both for the feedback. Most of it has been incorporated.

Please note that I decided to implement the meta data generation within the scope of MetaDataParser::readMetaData().
Having it directly in the read kernel would be a better fit for on-the-fly type inference, which would greatly reduce the current overhead introduced by parsing the file twice. On the other hand, we currently need meta data for shape, frame label, and type inference, which are all performed before the read kernel is called. Therefore, an implementation in the read kernel alone would not be sufficient.
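
A rough sketch of how this fall-back could look inside MetaDataParser::readMetaData() (the member names and error handling here are assumptions, not the exact implementation):

      // sketch; assumes <filesystem> and the existing FileMetaData type
      FileMetaData MetaDataParser::readMetaData(const std::string &filename) {
          const std::string metaFilename = filename + ".meta";
          if (!std::filesystem::exists(metaFilename)) {
              // no meta data file present: infer the schema from the data
              FileMetaData fmd = generateFileMetaData(filename);
              try {
                  // best-effort persistence; the location may be read-only
                  writeMetaData(filename, fmd);
              } catch (...) {
                  // proceed without persisting the meta data
              }
              return fmd;
          }
          // ... existing parsing of the JSON meta data file ...
      }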

More details about the resulting overhead can be found in the evaluation results:
[figure: create_read_breakdown]

@pdamme added the label "LDE winter 2024/25" (Student project in the course Large-scale Data Engineering at TU Berlin, winter 2024/25) on Mar 24, 2025