@EricBenschneider EricBenschneider commented Feb 16, 2025

Description:

This PR introduces comprehensive positional map optimizations to speed up CSV parsing by reducing redundant string scanning when loading large files. The key changes include:

Positional Map Loading & Utilization:

When enabled (via the --use-positional-map flag or user config), the code attempts to read a precomputed positional map from disk.
If the map is available, the file is loaded into memory and row pointers are constructed from the absolute offsets. The relative offsets stored in the map are then used to extract each field directly. This optimized “second” branch bypasses iterative scanning for delimiters, reducing processing overhead.
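As a rough illustration (the struct layout and names below are hypothetical, not DAPHNE's actual data structures), field extraction with a loaded positional map reduces to offset arithmetic plus a single substring copy, with no delimiter scanning:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical positional map layout: one absolute offset per row, plus
// per-row relative offsets and lengths for each field.
struct PosMap {
    std::vector<uint64_t> rowOffsets;                 // absolute start of each row
    std::vector<std::vector<uint32_t>> fieldOffsets;  // per row: relative start of each field
    std::vector<std::vector<uint32_t>> fieldLengths;  // per row: length of each field
};

// Slice a field out of the in-memory file buffer without scanning for
// delimiters: absolute row offset + relative field offset gives the start.
std::string extractField(const std::string &buf, const PosMap &pm, size_t row, size_t col) {
    uint64_t start = pm.rowOffsets[row] + pm.fieldOffsets[row][col];
    return buf.substr(start, pm.fieldLengths[row][col]);
}
```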

On-the-fly Map Generation:

If no positional map exists, the CSV is parsed via the “normal” branch; when the flag is set, the positional map is additionally constructed while reading (the “first” branch).
After parsing, the new positional map is written to disk for future runs.
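The on-the-fly construction can be sketched as a single pass that records row and field start positions while scanning. This is a simplified, hypothetical version that ignores quoted fields (which, as discussed below, are the hard part); the names are illustrative, not DAPHNE's API:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One entry per row: absolute row start plus relative field starts.
struct RowEntry {
    uint64_t rowStart;
    std::vector<uint32_t> fieldStarts;
};

// Single-pass scan: while looking for delimiters anyway, record the
// positions so a positional map can be serialized for later runs.
// NOTE: does not handle quoted fields containing delimiters/newlines.
std::vector<RowEntry> buildPosMap(const std::string &buf, char delim = ',') {
    std::vector<RowEntry> map;
    uint64_t pos = 0;
    while (pos < buf.size()) {
        RowEntry e{pos, {0}};  // the first field starts at the row start
        while (pos < buf.size() && buf[pos] != '\n') {
            if (buf[pos] == delim)
                e.fieldStarts.push_back(static_cast<uint32_t>(pos + 1 - e.rowStart));
            ++pos;
        }
        ++pos;  // skip the '\n'
        map.push_back(std::move(e));
    }
    return map;
}
```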

Support for Different Data Structures:

The optimizations apply for CSV files that are loaded into either a DenseMatrix (including those with numerical and string types like std::string and FixedStr16) or a Frame.

Performance Reporting & Generated Figures:

The figures show the read times of the normal, first (map creation), and second (optimized with posmap) branches across different dataset sizes (row count times column count):

- overall_read_time_frame_mixed: These CSV files were generated using a Python script and were read as a frame consisting of all value types currently supported by DAPHNE in equal variety.
- overall_read_time_frame_number: The CSV files for this chart were generated using a Python script creating a frame with all unsigned and signed integer as well as both floating-point value types supported by DAPHNE.
- overall_read_time_matrix_float: The CSV files for this chart were generated using a Python script creating a matrix with only the floating-point value types supported by DAPHNE.
- overall_read_time_matrix_str: The CSV files for this chart were generated using a Python script creating a matrix with random strings of length 1 to 20.

- avg_ratio_bar_chart (File Size Ratio Bar Chart): Compares the baseline CSV file size (100%) against the average positional map size for different subtypes, helping to understand the overhead introduced by the map.

Observations on Optimization Performance:

While the intended goal is to reduce CSV parsing overhead, performance measurements indicate that the “optimized” posmap branch is slower than the normal read.
Based on the similar scaling of all performance measurements, the bottleneck is most likely incurred by memory operations rather than by algorithmic complexity. In particular:

  • Memory Allocation and Copying Overhead:
    The optimized branch reads the entire file into a large buffer. This extra copying and allocation work has a significant performance cost.

  • Cache Inefficiencies:
    Processing a huge memory buffer may not be as cache-friendly as sequentially reading lines, causing more cache misses that slow down the computation.

  • Offset Computation Overhead:
    Even though the field extraction logic avoids repeated scanning in theory, the extra arithmetic and vector operations for computing row and column boundaries add a non-negligible per-line overhead that scales with the file size.

  • String Encoding:
    While long strings are promising for reducing the overhead of searching for delimiters, having to scan the whole string for double-quote escaping limits the usability significantly. Experiments on matrices with fixed-size strings showed that the positional map can be slightly ahead of the normal read. Omitting the double-quote encoding increased the performance gain further, yet does not fulfill DAPHNE's parser claims.
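A minimal sketch of why quoting hurts: an RFC-4180-style quoted field must still be scanned character by character to collapse escaped quotes (`""` becomes `"`), so string fields cannot simply be copied byte-for-byte even when their exact boundaries are known from the positional map. (`unquoteField` below is an illustrative helper, not DAPHNE code.)

```cpp
#include <string>

// Decode one field that may use RFC-4180 quoting. Even with known
// boundaries, a quoted field requires a full character-by-character pass
// to collapse "" into ", which negates part of the posmap's advantage.
std::string unquoteField(const std::string &field) {
    if (field.size() < 2 || field.front() != '"')
        return field;  // unquoted: could be copied directly
    std::string out;
    out.reserve(field.size() - 2);
    for (size_t i = 1; i + 1 < field.size(); ++i) {
        if (field[i] == '"' && field[i + 1] == '"')
            ++i;  // skip the escaping quote, keep the escaped one
        out.push_back(field[i]);
    }
    return out;
}
```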

Testing:

This PR also adds unit and system level tests to validate the functionality of the positional map optimization.

Unit Tests:
The unit tests verify proper CSV parsing for all supported data types and structures:

  • DenseMatrix (numeric, string and fixed-length string types):
    Tests include parsing of positive/negative values, proper cast conversions (e.g., for uint8_t), and comparisons with expected floating‐point values (using Approx for doubles).
  • Frame (with mixed data types, including numbers and strings):
    Test cases validate that each column is read correctly, including handling of special cases such as embedded commas, newlines, quotes, INF and NAN values.
  • Positional Map Flag Behavior:
    For both DenseMatrix and Frame, tests confirm that when the positional map flag is enabled:
    - a “.posmap” file is created on the first read;
    - subsequent reads reuse the existing positional map and yield identical cell values to the normal read path;
    - the file is cleaned up after the test.
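The flag-behavior checks can be pictured roughly as follows. `readCsvWithPosMap` is a hypothetical stand-in that only mimics the observable side effect of the reader described above (creating a “.posmap” file next to the CSV on the first read); the real tests call DAPHNE's readers:

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

// Stand-in for the real reader: on the first call it writes the ".posmap"
// side file; later calls would reuse it. Returns the file contents so a
// test can compare the first and second read.
std::string readCsvWithPosMap(const fs::path &csv) {
    fs::path mapFile = csv;
    mapFile += ".posmap";  // path concatenation without a separator
    if (!fs::exists(mapFile))
        std::ofstream(mapFile) << "serialized map";  // first read writes the map
    std::ifstream in(csv);
    return {std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>()};
}
```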

System-Level Tests:
System-level tests have been integrated by running .daphne files that use the readFrame and readCsvFile (i.e. readMatrix) functions. These tests ensure:

  • The flag enabling positional map optimizations works as expected across the entire pipeline.
  • The results (for both matrix and frame data) match those obtained from standard CSV parsing.
  • All tests were run on the local system (Windows) and executed via the build/test automation, ensuring that both the new flag logic and the legacy code paths continue to function correctly.

This testing suite confirms that the new positional map optimizations work seamlessly and that the system-level workflows (i.e. running a full .daphne file) produce the expected results.

Please review these changes and the attached figures. Everything regarding the experiments (e.g., for reproducibility) can be found in the evaluation folder.

@EricBenschneider EricBenschneider force-pushed the 857-speed-up-repeated-csv-reads-posmap branch from d890d0c to 377a781 Compare February 24, 2025 00:06
@EricBenschneider EricBenschneider changed the title 857 speed up repeated csv reads posmap 857 speed up repeated csv reads using positional map Feb 24, 2025
@EricBenschneider EricBenschneider marked this pull request as ready for review February 24, 2025 01:15
@pdamme pdamme self-requested a review March 24, 2025 18:55
@pdamme pdamme added the LDE winter 2024/25 Student project in the course Large-scale Data Engineering at TU Berlin (winter 2024/25). label Mar 24, 2025