@EricBenschneider EricBenschneider commented Feb 16, 2025

Description:

This PR introduces comprehensive positional map optimizations to speed up CSV parsing by reducing redundant string scanning when loading large files. The key changes include:

Positional Map Loading & Utilization:

When enabled (via the --use-positional-map flag or user config), the code attempts to read a precomputed positional map from disk.
If the map is available, the file is loaded into memory and row pointers are constructed from the absolute offsets. The relative offsets stored in the map are then used to extract each field directly. This optimized “second” branch bypasses iterative scanning for delimiters, reducing processing overhead.
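As a rough illustration (the struct layout and names below are hypothetical, not DAPHNE's actual data structures), field extraction with a loaded positional map reduces to offset arithmetic plus a single substring copy, with no delimiter scanning:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical positional map layout: one absolute offset per row, plus
// per-row relative offsets and lengths for each field.
struct PosMap {
    std::vector<uint64_t> rowOffsets;                 // absolute start of each row
    std::vector<std::vector<uint32_t>> fieldOffsets;  // per row: relative start of each field
    std::vector<std::vector<uint32_t>> fieldLengths;  // per row: length of each field
};

// Slice a field out of the in-memory file buffer without scanning for
// delimiters: absolute row offset + relative field offset gives the start.
std::string extractField(const std::string &buf, const PosMap &pm, size_t row, size_t col) {
    uint64_t start = pm.rowOffsets[row] + pm.fieldOffsets[row][col];
    return buf.substr(start, pm.fieldLengths[row][col]);
}
```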

On-the-fly Map Generation:

If no positional map exists, the CSV is parsed via the “normal” branch; when the flag is set, the positional map is additionally constructed while reading (the “first” branch).
After parsing, the new positional map is written to disk for future runs.
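The on-the-fly construction can be sketched as a single pass that records row and field start positions while scanning. This is a simplified, hypothetical version that ignores quoted fields (which, as discussed below, are the hard part); the names are illustrative, not DAPHNE's API:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One entry per row: absolute row start plus relative field starts.
struct RowEntry {
    uint64_t rowStart;
    std::vector<uint32_t> fieldStarts;
};

// Single-pass scan: while looking for delimiters anyway, record the
// positions so a positional map can be serialized for later runs.
// NOTE: does not handle quoted fields containing delimiters/newlines.
std::vector<RowEntry> buildPosMap(const std::string &buf, char delim = ',') {
    std::vector<RowEntry> map;
    uint64_t pos = 0;
    while (pos < buf.size()) {
        RowEntry e{pos, {0}};  // the first field starts at the row start
        while (pos < buf.size() && buf[pos] != '\n') {
            if (buf[pos] == delim)
                e.fieldStarts.push_back(static_cast<uint32_t>(pos + 1 - e.rowStart));
            ++pos;
        }
        ++pos;  // skip the '\n'
        map.push_back(std::move(e));
    }
    return map;
}
```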

Support for Different Data Structures:

The optimizations apply for CSV files that are loaded into either a DenseMatrix (including those with numerical and string types like std::string and FixedStr16) or a Frame.

Performance Reporting & Generated Figures:

The figures show the read times of the normal, first (map creation), and second (optimized with posmap) branches across different dataset sizes (row count times column count):

- overall_read_time_frame_mixed: These CSV files were generated using a Python script and were read as a frame consisting of all value types currently supported by DAPHNE in equal variety.
- overall_read_time_frame_number: The CSV files for this chart were generated using a Python script creating a frame with all unsigned and signed integer as well as both floating-point value types supported by DAPHNE.
- overall_read_time_matrix_float: The CSV files for this chart were generated using a Python script creating a matrix with only the floating-point value types supported by DAPHNE.
- overall_read_time_matrix_str: The CSV files for this chart were generated using a Python script creating a matrix with random strings of length 1 to 20.

- avg_ratio_bar_chart (File Size Ratio Bar Chart): Compares the baseline CSV file size (100%) against the average positional map size for different subtypes, helping to understand the overhead introduced by the map.

Observations on Optimization Performance:

While the intended goal is to reduce CSV parsing overhead, performance measurements indicate that the “optimized” posmap branch is slower than the normal read.
Based on the similar scaling of all performance measurements, the bottleneck is most likely incurred by memory operations rather than by algorithmic complexity. In particular:

  • Memory Allocation and Copying Overhead:
    The optimized branch reads the entire file into a large buffer. This extra copying and allocation work has a significant performance cost.

  • Cache Inefficiencies:
    Processing a huge memory buffer may not be as cache-friendly as sequentially reading lines, causing more cache misses that slow down the computation.

  • Offset Computation Overhead:
    Even though the field extraction logic avoids repeated scanning in theory, the extra arithmetic and vector operations for computing row and column boundaries add a non-negligible per-line overhead that scales with the file size.

  • String Encoding:
    While long strings are promising for reducing the overhead of searching for delimiters, having to scan the whole string for double-quote escaping limits the usability significantly. Experiments on matrices with fixed-size strings showed that the positional map can be slightly ahead of the normal read. Omitting the double-quote encoding increased the performance gain further, yet does not fulfill DAPHNE's parser claims.
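A minimal sketch of why quoting hurts: an RFC-4180-style quoted field must still be scanned character by character to collapse escaped quotes (`""` becomes `"`), so string fields cannot simply be copied byte-for-byte even when their exact boundaries are known from the positional map. (`unquoteField` below is an illustrative helper, not DAPHNE code.)

```cpp
#include <string>

// Decode one field that may use RFC-4180 quoting. Even with known
// boundaries, a quoted field requires a full character-by-character pass
// to collapse "" into ", which negates part of the posmap's advantage.
std::string unquoteField(const std::string &field) {
    if (field.size() < 2 || field.front() != '"')
        return field;  // unquoted: could be copied directly
    std::string out;
    out.reserve(field.size() - 2);
    for (size_t i = 1; i + 1 < field.size(); ++i) {
        if (field[i] == '"' && field[i + 1] == '"')
            ++i;  // skip the escaping quote, keep the escaped one
        out.push_back(field[i]);
    }
    return out;
}
```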

Testing:

This PR also adds unit and system level tests to validate the functionality of the positional map optimization.

Unit Tests:
The unit tests verify proper CSV parsing for all supported data types and structures:

  • DenseMatrix (numeric, string and fixed-length string types):
    Tests include parsing of positive/negative values, proper cast conversions (e.g., for uint8_t), and comparisons with expected floating‐point values (using Approx for doubles).
  • Frame (with mixed data types, including numbers and strings):
    Test cases validate that each column is read correctly, including handling of special cases such as embedded commas, newlines, quotes, INF and NAN values.
  • Positional Map Flag Behavior:
    For both DenseMatrix and Frame, tests confirm that when the positional map flag is enabled:
    - a “.posmap” file is created on the first read;
    - subsequent reads reuse the existing positional map and yield identical cell values to the normal read path;
    - the file is cleaned up after the test.
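The flag-behavior checks can be pictured roughly as follows. `readCsvWithPosMap` is a hypothetical stand-in that only mimics the observable side effect of the reader described above (creating a “.posmap” file next to the CSV on the first read); the real tests call DAPHNE's readers:

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

// Stand-in for the real reader: on the first call it writes the ".posmap"
// side file; later calls would reuse it. Returns the file contents so a
// test can compare the first and second read.
std::string readCsvWithPosMap(const fs::path &csv) {
    fs::path mapFile = csv;
    mapFile += ".posmap";  // path concatenation without a separator
    if (!fs::exists(mapFile))
        std::ofstream(mapFile) << "serialized map";  // first read writes the map
    std::ifstream in(csv);
    return {std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>()};
}
```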

System-Level Tests:
System-level tests have been integrated by running .daphne files that use the readFrame and readCsvFile (i.e. readMatrix) functions. These tests ensure:

  • The flag enabling positional map optimizations works as expected across the entire pipeline.
  • The results (for both matrix and frame data) match those obtained from standard CSV parsing.
  • All tests were run on the local system (Windows) and executed via the build/test automation, ensuring that both the new flag logic and the legacy code paths continue to function correctly.

This testing suite confirms that the new positional map optimizations work seamlessly and that the system-level workflows (i.e. running a full .daphne file) produce the expected results.

Please review these changes and the attached figures. Everything regarding the experiments (e.g., for reproducibility) can be found in the evaluation folder.

@EricBenschneider EricBenschneider force-pushed the 857-speed-up-repeated-csv-reads-posmap branch from d890d0c to 377a781 Compare February 24, 2025 00:06
@EricBenschneider EricBenschneider changed the title 857 speed up repeated csv reads posmap 857 speed up repeated csv reads using positional map Feb 24, 2025
@EricBenschneider EricBenschneider marked this pull request as ready for review February 24, 2025 01:15
@pdamme pdamme self-requested a review March 24, 2025 18:55
@pdamme pdamme added the LDE winter 2024/25 Student project in the course Large-scale Data Engineering at TU Berlin (winter 2024/25). label Mar 24, 2025