-
Notifications
You must be signed in to change notification settings - Fork 4k
Description
Describe the bug, including details regarding any error messages, version, and platform.
arrow::schema construction performance degrades to quadratic time on libc++ when there's a lot of fields with duplicate or no names. This is most critical when creating wide schemas programmatically.
The performance degradation is caused by the function CreateNameToIndexMap. The function calls std::unordered_multimap::emplace once per field. On libc++, this function call has linear runtime (amortized) for duplicate keys because it iterates past every existing entry to emplace the new element at the end of the range (see LLVM bugzilla 16747 and 21275).
Some performance benchmarks creating a schema with N int32 fields:
| N | Time (s) | Growth factor (vs previous input) |
|---|---|---|
| 500 | 0.000464958s | NA |
| 5000 | 0.041429s | 89.1x |
| 50000 | 4.12813s | 99.6x |
| 500000 | 416.985s | 101.0x |
One fix is to use map.emplace_hint(map.find(key), ...) as recommended by libc++ maintainers. This avoids the linear scan but may reverse order for duplicates. Alternatively, using std::unordered_map<std::string_view, std::vector<int>> keeps the order but is a slightly larger refactor.
The same benchmark fixing the issue with the latter approach (unordered_map + vector):
| N | Time (s) | Growth factor (vs previous input) |
|---|---|---|
| 500 | 0.000040s | NA |
| 5000 | 0.000066s | 1.65x |
| 50000 | 0.000681s | 10.3x |
| 500000 | 0.007084s | 10.4x |
I can submit a PR if a fix is worth pursuing.
Component(s)
C++