Skip to content

[C++] arrow::schema construction performance degrades to O(n^2) on libc++ for duplicate/unnamed fields #48977

@arvidjonasson

Description

@arvidjonasson

Describe the bug, including details regarding any error messages, version, and platform.

arrow::schema construction performance degrades to quadratic time on libc++ when there's a lot of fields with duplicate or no names. This is most critical when creating wide schemas programmatically.

The performance degradation is caused by the function CreateNameToIndexMap. The function calls std::unordered_multimap::emplace once per field. On libc++, this function call has linear runtime (amortized) for duplicate keys because it iterates past every existing entry to emplace the new element at the end of the range (see LLVM bugzilla 16747 and 21275).

Some performance benchmarks creating a schema with N int32 fields:

N Time (s) Growth factor (vs previous input)
500 0.000464958s NA
5000 0.041429s 89.1x
50000 4.12813s 99.6x
500000 416.985s 101.0x

One fix is to use map.emplace_hint(map.find(key), ...) as recommended by libc++ maintainers. This avoids the linear scan but may reverse order for duplicates. Alternatively, using std::unordered_map<std::string_view, std::vector<int>> keeps the order but is a slightly larger refactor.

The same benchmark fixing the issue with the latter approach (unordered_map + vector):

N Time (s) Growth factor (vs previous input)
500 0.000040s NA
5000 0.000066s 1.65x
50000 0.000681s 10.3x
500000 0.007084s 10.4x

I can submit a PR if a fix is worth pursuing.

Component(s)

C++

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions