857 speed up repeated csv reads using positional map #945
Open
EricBenschneider wants to merge 72 commits into daphne-project:main from EricBenschneider:857-speed-up-repeated-csv-reads-posmap
Conversation
This reverts commit 85ea77a.
Labels
LDE winter 2024/25
Student project in the course Large-scale Data Engineering at TU Berlin (winter 2024/25).
Description:
This PR introduces comprehensive positional map optimizations to speed up CSV parsing by reducing redundant string scanning when loading large files. The key changes include:
Positional Map Loading & Utilization:
When enabled (via the --use-positional-map flag or user config), the code attempts to read a precomputed positional map from disk.
If the map is available, the file is loaded into memory and row pointers are constructed using absolute offsets. Relative offsets stored in the map are then used to directly extract each field. This optimized “second” branch bypasses iterative scanning for delimiters, reducing processing overhead.
On-the-fly Map Generation:
If no positional map exists, the CSV is parsed using the “normal” branch; when the flag is set, the positional map is additionally constructed while reading.
After parsing, a new positional map is written to disk for future runs.
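The on-the-fly generation can be sketched as follows (again a simplified illustration with a hypothetical `buildPosMap` helper; quoted fields are ignored for brevity): since the normal branch scans the buffer for delimiters anyway, it can record each row's absolute offset and each field's relative offset as a by-product.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of on-the-fly map generation: while scanning for
// delimiters, record the absolute start of each row and the relative start
// of each field. (Delimiters inside quoted fields are not handled here;
// the real reader must skip them.)
void buildPosMap(const std::string &buf, char delim,
                 std::vector<size_t> &rowOffsets,
                 std::vector<std::vector<size_t>> &colOffsets) {
    size_t rowStart = 0;
    std::vector<size_t> cols = {0};
    for (size_t i = 0; i < buf.size(); ++i) {
        if (buf[i] == delim) {
            cols.push_back(i + 1 - rowStart); // next field starts after the delimiter
        } else if (buf[i] == '\n') {
            rowOffsets.push_back(rowStart);
            colOffsets.push_back(cols);
            rowStart = i + 1;
            cols = {0};
        }
    }
    if (rowStart < buf.size()) { // last row without a trailing newline
        rowOffsets.push_back(rowStart);
        colOffsets.push_back(cols);
    }
}
```

After this single pass, the two offset vectors can be serialized to a `.posmap` file on disk so that subsequent reads can skip the scan entirely.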
Support for Different Data Structures:
The optimizations apply for CSV files that are loaded into either a DenseMatrix (including those with numerical and string types like std::string and FixedStr16) or a Frame.
Performance Reporting & Generated Figures:
The attached figures show the read times of the normal, first (map creation), and second (optimized with posmap) branches across different dataset sizes (row count × column count). [Charts omitted here; captions below.]
Chart 1: These CSV files were generated using a Python script and were read as a Frame consisting of all value types currently supported by DAPHNE in equal variety.
Chart 2: The CSV files for this chart were generated using a Python script creating a Frame with all unsigned and signed integer types as well as both floating-point value types supported by DAPHNE.
Chart 3: The CSV files for this chart were generated using a Python script creating a matrix with only floating-point value types supported by DAPHNE.
Chart 4: The CSV files for this chart were generated using a Python script creating a matrix with random strings of length 1 to 20.
File Size Ratio Bar Chart: Compares the baseline CSV file size (100%) against the average positional map size for different subtypes, helping to understand the overhead introduced by the map.
Observations on Optimization Performance:
While the intended goal is to reduce CSV parsing overhead, performance measurements indicate that the “optimized” posmap branch performs slower than the normal read.
Based on the similar scaling of all performance measurements, the bottleneck is most likely incurred by memory operations rather than by algorithmic complexity. In particular:
Memory Allocation and Copying Overhead:
The optimized branch reads the entire file into a large buffer. This extra copying and allocation work has a significant performance cost.
Cache Inefficiencies:
Processing a huge memory buffer may not be as cache-friendly as sequentially reading lines, causing more cache misses that slow down the computation.
Offset Computation Overhead:
Even though the field extraction logic avoids repeated scanning in theory, the extra arithmetic and vector operations for computing row and column boundaries add a non-negligible per-line overhead that scales with the file size.
String Encoding:
While long strings promise to reduce the parsing overhead of searching for delimiters, having to scan the whole string for double-quote encoding limits the benefit significantly. Experiments on matrices with fixed-size strings showed that the positional map can be slightly ahead of the normal read. Omitting the double-quote handling increased the performance gain further, but it would no longer fulfill DAPHNE's parser guarantees.
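To illustrate why quote handling erodes the positional map's advantage (a sketch with a hypothetical `unquote` helper, following RFC-4180-style escaping where `""` inside a quoted field denotes a literal quote): even when the map pinpoints a field's exact boundaries, a quoted string field must still be traversed character by character to collapse escaped quotes, so string-heavy data pays a full pass over the field contents anyway.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: strip the surrounding quotes of an RFC-4180-style
// quoted field and collapse each escaped "" pair into a single ".
// Note that this requires visiting every character of the field, which is
// exactly the scan the positional map was meant to avoid.
std::string unquote(const std::string &field) {
    if (field.size() < 2 || field.front() != '"' || field.back() != '"')
        return field; // not quoted: return as-is
    std::string out;
    out.reserve(field.size() - 2);
    for (size_t i = 1; i + 1 < field.size(); ++i) {
        if (field[i] == '"' && i + 2 < field.size() && field[i + 1] == '"')
            ++i; // skip the first quote of an escaped "" pair
        out.push_back(field[i]);
    }
    return out;
}
```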
Testing:
This PR also adds unit and system level tests to validate the functionality of the positional map optimization.
Unit Tests:
The unit tests verify proper CSV parsing for all supported data types and structures:
Tests include parsing of positive/negative values, proper cast conversions (e.g., for uint8_t), and comparisons with expected floating‐point values (using Approx for doubles).
Test cases validate that each column is read correctly, including handling of special cases such as embedded commas, newlines, quotes, INF and NAN values.
For both DenseMatrix and Frame, tests confirm that when the positional map flag is enabled:
A “.posmap” file is created on the first read;
Subsequent reads reuse the existing positional map and yield identical cell values to the normal read path;
The file is cleaned up after the test.
System-Level Tests:
System-level tests have been integrated by running .daphne files that use the readFrame and readCsvFile (i.e., readMatrix) functions. This testing suite confirms that the new positional map optimizations work seamlessly and that system-level workflows (i.e., running a full .daphne file) produce the expected results.
Please review these changes and the attached figures. Everything regarding the experiments (e.g., for reproducibility) can be found in the evaluation folder.