Generation of meta data files based on CSV data #942
Conversation
auge
left a comment
Thanks a lot for this PR, I'm looking into the usage part of this (no real code review from my side). Currently, I see a lot of debug statements generated; I guess they should rather be silenced? (I have 1e9 rows in my CSV file.) When trying a smaller file (10 rows), I get a segfault on my system (probably while writing the .meta file). When the meta data file already exists, daphne continues normally. As I have not looked into the code, I would like to inquire whether a failure to write the .meta file will be silent. I expect daphne to continue normally even when the meta data file cannot be written, e.g., when the CSV file lies at a location where the user is not allowed to write.
pdamme
left a comment
Thanks for this contribution, @EricBenschneider! Being able to read a CSV file without a meta data file present is great for users (even though it will typically come with a certain performance penalty).
Thanks @auge for having a first look at this new feature. I can confirm that it crashes when used from a DaphneDSL script. The case of writing the meta data file to a non-writable location was most likely solved by @EricBenschneider's recent commit e481039, but would need additional testing.
So overall, this is a good first step towards solving #63, but a few points need to be addressed:
Required changes: (must be addressed before the merge)
- Make the CI checks pass. Several test cases failed; however, on my system all tests pass... Let's have a look again after you've addressed the other points.
- The clang-format check failed; please format all changed files with the clang-format specification in .clang-format (maybe you can use something like "format document" in your IDE).
- Extend the test cases. Please consider the following:
- Employ a quoted/multi-line string somewhere in the test CSV files (DAPHNE's CSV reader supports such strings).
- Add at least one script-level test case to check if your feature works from DaphneDSL. Currently, it doesn't seem to work, as @auge found out. For instance, the following DaphneDSL script crashes with the following error message when reading the following CSV file (and no meta data file).
    print(readFrame("my-frame.csv"));
  with the CSV file my-frame.csv containing:
    1,1.1,one
  the output is:
    inferred valueType: 0, 1.
    inferred valueType: 6, 1.1.
    inferred valueType: 8, one.
    bin/daphne(+0x14a4422)[0x565112219422]
    /lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x7b512b129320]
    bin/daphne(+0x1e3abed)[0x565112bafbed]
    ...
    bin/daphne(+0x1557c67)[0x5651122ccc67]
    Segmentation fault (core dumped)
- Test if CSV values like "12+34" are correctly identified as strings, not as integers.
- Test if CSV values like "1.23abc" are correctly identified as strings.
- Make sure your changes to the read kernel are reflected in DaphneDSL and DaphneIR. You added an additional parameter specifying if a header/labels row is present. First, that must not come after the DaphneContext (DCTX) in the kernel's parameter list, otherwise there will be problems when lowering DaphneIR's ReadOp to the kernel call. The additional parameter must be reflected in the registration of kernels (src/runtime/local/kernels/kernels.json), DaphneIR's ReadOp (src/ir/daphneir/DaphneOps.td), and the readMatrix()/readFrame() DaphneDSL built-in functions (src/parser/daphnedsl/DaphneDSLBuiltins.cpp). This point will be necessary for writing a script-level test case.
- Please remove all custom prints to stdout/stderr from your code. Consider using DAPHNE's logger (with an appropriate log level like info or debug) for important messages (but I think it's okay to just omit the ones you have). We generally want to avoid unnecessary output.
- Some of the newly added test cases assume that a meta data file doesn't exist in the beginning, but don't check if this is really the case. They do (try to) remove the (supposedly) created meta data file afterwards, but in case something goes wrong there, the next test run could still see the meta data file created by the previous test run.
Optional changes: (can be addressed before the merge or later)
- Please make sure that the corner case of reading a CSV file with zero rows (and no meta data file) is handled gracefully. To this end, please add a respective test case. Most importantly, DAPHNE should not simply crash in such a case. There should either be an error (because types cannot be inferred when there is no data), or the most general value type (i.e., string) should be assumed for all columns. It would be worthwhile to find out how other systems/libraries (e.g., pandas) handle this case.
- Make sure that malformed CSV files are handled gracefully. Currently, you say numCols = std::max(numCols, colIndex); in generateFileMetaData(). To my mind, a different number of columns per row should raise an error.
- Currently, the entire CSV file is read once for detecting the schema and once again for the actual read. That will likely result in a 2x slow-down for reading CSV files without a meta data file, which is quite expensive. It would be great to think about more efficient ways, such as scanning the input file just once, interleaving read and schema detection, or detecting the schema based on a sample (with the option of correcting it in case not all data fits).
- Furthermore, the code for detecting the schema implements a little CSV reader on its own (but doesn't seem to support all features of DAPHNE's existing CSV reader, e.g., multi-line strings). It would be great to avoid redundant code.
Let me know if you have any comments/questions or need further advice to address these points.

This PR adds CSV schema detection based on data as described in issue #63.
This PR also enables reading a CSV file without manually adding a meta data file as discussed in issue #688.
Once a CSV file is read and the MetaDataParser cannot find an appropriate meta data file, it is now automatically generated and stored at the location where it was expected.
The changes are tested by the added generateMetaDataFileTest.cpp. A new test case in MetaDataParserTest.cpp is added to test saving of the generated meta data file.