Commit 78f5d2e

Performance/9697143845/mvp arrow performance profiling and optimisation (#2573)
#### Reference Issues/PRs

[9697143845](https://man312219.monday.com/boards/7852509418/pulses/9697143845)

#### What does this implement or fix?

Adds basic ASV benchmarks for Arrow reads, and improves the performance of reading strings. Eventually we can parametrize all the benchmarks with Arrow as the input/output formats; this just makes sure we don't break anything egregiously until we get to that point.

#### Benchmarking findings summary

Timings:

- Plain read calls `lib.read(sym)` (see the usage sketch after this summary)
  - Numeric data. Tested with 10 columns / 1, 1k, 100k, 1m, and 100m rows of int64 data
    - Arrow performance is identical to returning Pandas when consolidation is disabled (the default with the V2 API for Pandas >= 2.0.0). This is expected: the only real difference in this case is that the output frame is made up of multiple buffers per column with Arrow and just one with Pandas, but the timing difference from this is negligible, and the time is dominated by decoding + memcpying in both cases.
    - Pandas with consolidation enabled (the default inside Man) is strictly slower, although this is not noticeable at very low row counts.
  - String data. Tested with 10 columns / 1, 100, 10k, and 1m rows / 1, 100, and 100k unique 10-character strings
    - No noticeable difference with <= 100 rows, as timings are dominated by other operations
    - On a single core, Arrow was ~2x faster than Pandas before this change, and ~3x faster after it
    - Using more cores increased the performance difference
- Date range read calls `lib.read(sym, date_range=blah)`
  - Tested with a variety of ranges to exercise the various Arrow truncation code paths and Pandas post-processing
  - Numeric data. Tested with 10 columns / 1m rows of int64 data
    - The additional memcpys are visible in the timings
    - In the pathological case of excluding just 1 row from each end of the symbol, this adds ~17% compared to Pandas with consolidation disabled, taking performance back to about where Pandas is with consolidation enabled
  - String data. Tested with 10 columns / 1m rows / 1, 100, and 100k unique 10-character strings
    - The same performance hit is observed as with numeric data
    - But because string performance is faster than with Pandas, overall it is still always faster to use Arrow

Peak memory usage:

- For reading 10 columns x 100m rows = 1B 64-bit integers (8GB of data), Arrow needs a few hundred MB in addition to the raw data size, and Pandas without consolidation is similar. Pandas with consolidation needs about double, as it memcpys all of the buffers we return into one larger buffer.
- Reading 10 columns x 1m rows = 10m strings with 1/100/100k unique strings:
  - Arrow needs 400MB/400MB/666MB
  - Pandas without consolidation needs 400MB/400MB/900MB
  - Pandas with consolidation needs 500MB/500MB/1GB
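A minimal usage sketch of the read path these numbers cover (illustrative only, not part of the commit: the URI, library, symbol, and tiny DataFrame are made up; the `output_format` flag is the one used by the new benchmarks below):

```python
import pandas as pd
from arcticdb import Arctic, OutputFormat

# Hypothetical LMDB-backed setup; any backend is read the same way
ac = Arctic("lmdb://arrow_demo", output_format=OutputFormat.EXPERIMENTAL_ARROW)
lib = ac.get_library("demo", create_if_missing=True)

df = pd.DataFrame(
    {"col0": range(5)},
    index=pd.date_range("1970-01-01", freq="ns", periods=5),
)
lib.write("sym", df)

# Plain read: .data comes back in the Arrow output format rather than as a Pandas DataFrame
table = lib.read("sym").data

# Date range read, which exercises the Arrow truncation code paths benchmarked above
table = lib.read("sym", date_range=(pd.Timestamp(1), pd.Timestamp(3))).data
```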
1 parent 2785b43 commit 78f5d2e

File tree

3 files changed: +245 -16 lines changed


cpp/arcticdb/arrow/arrow_handlers.cpp

Lines changed: 16 additions & 16 deletions
@@ -63,41 +63,41 @@ void ArrowStringHandler::convert_type(
     struct DictEntry {
         int32_t offset_buffer_pos_;
         int64_t string_buffer_pos_;
+        std::string_view strv;
     };
     std::vector<StringPool::offset_t> unique_offsets_in_order;
     ankerl::unordered_dense::map<StringPool::offset_t, DictEntry> unique_offsets;
+    // Trade some memory for more performance
+    // TODO: Use unique count column stat in V2 encoding
+    unique_offsets_in_order.reserve(source_column.row_count());
+    unique_offsets.reserve(source_column.row_count());
     int64_t bytes = 0;
+    int32_t unique_offset_count = 0;
     auto dest_ptr = reinterpret_cast<int32_t*>(dest_column.bytes_at(mapping.offset_bytes_, source_column.row_count() * sizeof(int32_t)));

     // First go through the source column once to compute the size of offset and string buffers.
     while(pos != end) {
-        auto [entry, is_emplaced] = unique_offsets.try_emplace(*pos, DictEntry{static_cast<int32_t>(unique_offsets_in_order.size()), bytes});
+        auto [entry, is_emplaced] = unique_offsets.try_emplace(*pos, DictEntry{unique_offset_count, bytes, string_pool->get_const_view(*pos)});
         if(is_emplaced) {
-            bytes += string_pool->get_const_view(*pos).size();
+            bytes += entry->second.strv.size();
             unique_offsets_in_order.push_back(*pos);
+            ++unique_offset_count;
         }
         ++pos;
-        *dest_ptr = entry->second.offset_buffer_pos_;
-        ++dest_ptr;
+        *dest_ptr++ = entry->second.offset_buffer_pos_;
     }
     auto& string_buffer = dest_column.create_extra_buffer(mapping.offset_bytes_, ExtraBufferType::STRING, bytes, AllocationType::DETACHABLE);
     auto& offsets_buffer = dest_column.create_extra_buffer(mapping.offset_bytes_, ExtraBufferType::OFFSET, (unique_offsets_in_order.size() + 1) * sizeof(int64_t), AllocationType::DETACHABLE);
-
     // Then go through unique_offsets to fill up the offset and string buffers.
     auto offsets_ptr = reinterpret_cast<int64_t*>(offsets_buffer.data());
     auto string_ptr = reinterpret_cast<char*>(string_buffer.data());
-    auto string_begin_ptr = string_ptr;
-    for(auto i=0u; i<unique_offsets_in_order.size(); ++i) {
-        auto string_pool_offset = unique_offsets_in_order[i];
-        auto& entry = unique_offsets[string_pool_offset];
-        util::check(static_cast<int32_t>(i) == entry.offset_buffer_pos_, "Mismatch in offset buffer pos");
-        util::check(string_ptr - string_begin_ptr == entry.string_buffer_pos_, "Mismatch in string buffer pos");
-        offsets_ptr[i] = entry.string_buffer_pos_;
-        const auto strv = string_pool->get_const_view(string_pool_offset);
-        memcpy(string_ptr, strv.data(), strv.size());
-        string_ptr += strv.size();
+    for (auto unique_offset: unique_offsets_in_order) {
+        const auto& entry = unique_offsets[unique_offset];
+        *offsets_ptr++ = entry.string_buffer_pos_;
+        memcpy(string_ptr, entry.strv.data(), entry.strv.size());
+        string_ptr += entry.strv.size();
     }
-    offsets_ptr[unique_offsets_in_order.size()] = bytes;
+    *offsets_ptr = bytes;
 }

 TypeDescriptor ArrowStringHandler::output_type(const TypeDescriptor&) const {
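The structural change above is that `DictEntry` now caches the `std::string_view` fetched from the string pool on the first pass, so the second pass that fills the offset and string buffers no longer calls `string_pool->get_const_view` (or the per-string checks) again, and both containers are reserved up front. As a rough illustration of the two-pass dictionary build the handler performs, here is a small Python sketch (plain Python strings stand in for string-pool offsets and raw Arrow buffers; this is not the ArcticDB implementation):

```python
# Illustrative sketch of a two-pass Arrow-style dictionary build.
def build_arrow_dictionary(values):
    entries = {}       # value -> (code, byte offset into the string buffer, cached string)
    order = []         # unique values in first-seen order
    codes = []         # int32-style codes buffer, one per row
    total_bytes = 0

    # Pass 1: assign each unique string a code and a byte position, caching the string itself.
    for value in values:
        if value not in entries:
            entries[value] = (len(order), total_bytes, value)
            order.append(value)
            total_bytes += len(value.encode())
        codes.append(entries[value][0])

    # Pass 2: emit the offsets buffer and the concatenated string buffer from the cached
    # strings, without looking anything up in the original source again.
    offsets = []
    data = bytearray()
    for value in order:
        _, offset, cached = entries[value]
        offsets.append(offset)
        data.extend(cached.encode())
    offsets.append(total_bytes)  # final offset closes the last string

    return codes, offsets, bytes(data)


codes, offsets, data = build_arrow_dictionary(["a", "bb", "a", "ccc"])
# codes == [0, 1, 0, 2], offsets == [0, 1, 3, 6], data == b"abbccc"
```

The first pass assigns codes and byte offsets while caching each unique string; the second pass only walks the insertion-ordered uniques and copies the cached bytes, which is what storing the string_view in `DictEntry` achieves in the C++ change.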

python/.asv/results/benchmarks.json

Lines changed: 116 additions & 0 deletions
@@ -1,4 +1,120 @@
 {
+    "arrow.ArrowReadNumeric.peakmem_read": {
+        "code": "class ArrowReadNumeric:\n def peakmem_read(self, rows, date_range):\n self.lib.read(self.symbol_name(rows), date_range=self.date_range)\n\n def setup(self, rows, date_range):\n self.ac = Arctic(self.connection_string, output_format=OutputFormat.EXPERIMENTAL_ARROW)\n self.lib = self.ac.get_library(self.lib_name)\n if date_range is None:\n self.date_range = None\n else:\n # Create a date range that excludes the first and last 10 rows of the data only\n self.date_range = (pd.Timestamp(10), pd.Timestamp(rows - 10))\n\n def setup_cache(self):\n self.ac = Arctic(self.connection_string, output_format=OutputFormat.EXPERIMENTAL_ARROW)\n num_rows, date_ranges = self.params\n num_cols = 9 # 10 including the index column\n self.ac.delete_library(self.lib_name)\n self.ac.create_library(self.lib_name)\n lib = self.ac.get_library(self.lib_name)\n for rows in num_rows:\n df = pd.DataFrame(\n {\n f\"col{idx}\": np.arange(idx * rows, (idx + 1) * rows, dtype=np.int64) for idx in range(num_cols)\n },\n index = pd.date_range(\"1970-01-01\", freq=\"ns\", periods=rows)\n )\n lib.write(self.symbol_name(rows), df)",
+        "name": "arrow.ArrowReadNumeric.peakmem_read",
+        "param_names": [
+            "rows",
+            "date_range"
+        ],
+        "params": [
+            [
+                "100000",
+                "100000000"
+            ],
+            [
+                "None",
+                "'middle'"
+            ]
+        ],
+        "setup_cache_key": "arrow:30",
+        "timeout": 6000,
+        "type": "peakmemory",
+        "unit": "bytes",
+        "version": "f41e907f991caa155f765981fb845308e6ac55ba912a18f44b73ec87587d3667"
+    },
+    "arrow.ArrowReadNumeric.time_read": {
+        "code": "class ArrowReadNumeric:\n def time_read(self, rows, date_range):\n self.lib.read(self.symbol_name(rows), date_range=self.date_range)\n\n def setup(self, rows, date_range):\n self.ac = Arctic(self.connection_string, output_format=OutputFormat.EXPERIMENTAL_ARROW)\n self.lib = self.ac.get_library(self.lib_name)\n if date_range is None:\n self.date_range = None\n else:\n # Create a date range that excludes the first and last 10 rows of the data only\n self.date_range = (pd.Timestamp(10), pd.Timestamp(rows - 10))\n\n def setup_cache(self):\n self.ac = Arctic(self.connection_string, output_format=OutputFormat.EXPERIMENTAL_ARROW)\n num_rows, date_ranges = self.params\n num_cols = 9 # 10 including the index column\n self.ac.delete_library(self.lib_name)\n self.ac.create_library(self.lib_name)\n lib = self.ac.get_library(self.lib_name)\n for rows in num_rows:\n df = pd.DataFrame(\n {\n f\"col{idx}\": np.arange(idx * rows, (idx + 1) * rows, dtype=np.int64) for idx in range(num_cols)\n },\n index = pd.date_range(\"1970-01-01\", freq=\"ns\", periods=rows)\n )\n lib.write(self.symbol_name(rows), df)",
+        "min_run_count": 2,
+        "name": "arrow.ArrowReadNumeric.time_read",
+        "number": 5,
+        "param_names": [
+            "rows",
+            "date_range"
+        ],
+        "params": [
+            [
+                "100000",
+                "100000000"
+            ],
+            [
+                "None",
+                "'middle'"
+            ]
+        ],
+        "repeat": 0,
+        "rounds": 1,
+        "sample_time": 0.01,
+        "setup_cache_key": "arrow:30",
+        "timeout": 6000,
+        "type": "time",
+        "unit": "seconds",
+        "version": "2e1ac87c8fa79349da6d59f4f8618a1fdb207b72b692092d0c9c2c69c26c297f",
+        "warmup_time": 0
+    },
+    "arrow.ArrowReadStrings.peakmem_read": {
+        "code": "class ArrowReadStrings:\n def peakmem_read(self, rows, date_range, unique_string_count):\n self.lib.read(self.symbol_name(rows, unique_string_count), date_range=self.date_range)\n\n def setup(self, rows, date_range, unique_string_count):\n self.ac = Arctic(self.connection_string, output_format=OutputFormat.EXPERIMENTAL_ARROW)\n self.lib = self.ac.get_library(self.lib_name)\n if date_range is None:\n self.date_range = None\n else:\n # Create a date range that excludes the first and last 10 rows of the data only\n self.date_range = (pd.Timestamp(10), pd.Timestamp(rows - 10))\n\n def setup_cache(self):\n rng = np.random.default_rng()\n self.ac = Arctic(self.connection_string, output_format=OutputFormat.EXPERIMENTAL_ARROW)\n num_rows, date_ranges, unique_string_counts = self.params\n num_cols = 10\n self.ac.delete_library(self.lib_name)\n self.ac.create_library(self.lib_name)\n lib = self.ac.get_library(self.lib_name)\n for unique_string_count in unique_string_counts:\n strings = np.array(random_strings_of_length(unique_string_count, 10, unique=True))\n for rows in num_rows:\n df = pd.DataFrame(\n {\n f\"col{idx}\": rng.choice(strings, rows) for idx in range(num_cols)\n },\n index = pd.date_range(\"1970-01-01\", freq=\"ns\", periods=rows)\n )\n lib.write(self.symbol_name(rows, unique_string_count), df)",
+        "name": "arrow.ArrowReadStrings.peakmem_read",
+        "param_names": [
+            "rows",
+            "date_range",
+            "unique_string_count"
+        ],
+        "params": [
+            [
+                "10000",
+                "1000000"
+            ],
+            [
+                "None",
+                "'middle'"
+            ],
+            [
+                "1",
+                "100",
+                "100000"
+            ]
+        ],
+        "setup_cache_key": "arrow:78",
+        "timeout": 6000,
+        "type": "peakmemory",
+        "unit": "bytes",
+        "version": "daf31f0aa67cce7ace0d495e3648983ba46ca8cb5a4184f5875421398de5a862"
+    },
+    "arrow.ArrowReadStrings.time_read": {
+        "code": "class ArrowReadStrings:\n def time_read(self, rows, date_range, unique_string_count):\n self.lib.read(self.symbol_name(rows, unique_string_count), date_range=self.date_range)\n\n def setup(self, rows, date_range, unique_string_count):\n self.ac = Arctic(self.connection_string, output_format=OutputFormat.EXPERIMENTAL_ARROW)\n self.lib = self.ac.get_library(self.lib_name)\n if date_range is None:\n self.date_range = None\n else:\n # Create a date range that excludes the first and last 10 rows of the data only\n self.date_range = (pd.Timestamp(10), pd.Timestamp(rows - 10))\n\n def setup_cache(self):\n rng = np.random.default_rng()\n self.ac = Arctic(self.connection_string, output_format=OutputFormat.EXPERIMENTAL_ARROW)\n num_rows, date_ranges, unique_string_counts = self.params\n num_cols = 10\n self.ac.delete_library(self.lib_name)\n self.ac.create_library(self.lib_name)\n lib = self.ac.get_library(self.lib_name)\n for unique_string_count in unique_string_counts:\n strings = np.array(random_strings_of_length(unique_string_count, 10, unique=True))\n for rows in num_rows:\n df = pd.DataFrame(\n {\n f\"col{idx}\": rng.choice(strings, rows) for idx in range(num_cols)\n },\n index = pd.date_range(\"1970-01-01\", freq=\"ns\", periods=rows)\n )\n lib.write(self.symbol_name(rows, unique_string_count), df)",
+        "min_run_count": 2,
+        "name": "arrow.ArrowReadStrings.time_read",
+        "number": 5,
+        "param_names": [
+            "rows",
+            "date_range",
+            "unique_string_count"
+        ],
+        "params": [
+            [
+                "10000",
+                "1000000"
+            ],
+            [
+                "None",
+                "'middle'"
+            ],
+            [
+                "1",
+                "100",
+                "100000"
+            ]
+        ],
+        "repeat": 0,
+        "rounds": 1,
+        "sample_time": 0.01,
+        "setup_cache_key": "arrow:78",
+        "timeout": 6000,
+        "type": "time",
+        "unit": "seconds",
+        "version": "5879913710f02f75634c0d679be8af1f8ec3e017af5a6300c3ee965779f5ffbb",
+        "warmup_time": 0
+    },
     "basic_functions.BasicFunctions.peakmem_read": {
         "code": "class BasicFunctions:\n def peakmem_read(self, rows):\n self.lib.read(f\"sym\").data\n\n def setup(self, rows):\n self.ac = Arctic(BasicFunctions.CONNECTION_STRING)\n \n self.df = generate_pseudo_random_dataframe(rows)\n self.df_short_wide = generate_random_floats_dataframe(BasicFunctions.WIDE_DF_ROWS, BasicFunctions.WIDE_DF_COLS)\n \n self.lib = self.ac[get_prewritten_lib_name(rows)]\n self.fresh_lib = self.get_fresh_lib()\n\n def setup_cache(self):\n self.ac = Arctic(BasicFunctions.CONNECTION_STRING)\n rows_values = BasicFunctions.params\n \n self.dfs = {rows: generate_pseudo_random_dataframe(rows) for rows in rows_values}\n for rows in rows_values:\n lib = get_prewritten_lib_name(rows)\n self.ac.delete_library(lib)\n self.ac.create_library(lib)\n lib = self.ac[lib]\n lib.write(f\"sym\", self.dfs[rows])\n \n lib_name = get_prewritten_lib_name(BasicFunctions.WIDE_DF_ROWS)\n self.ac.delete_library(lib_name)\n lib = self.ac.create_library(lib_name)\n lib.write(\n \"short_wide_sym\",\n generate_random_floats_dataframe(BasicFunctions.WIDE_DF_ROWS, BasicFunctions.WIDE_DF_COLS),\n )\n \n lib_name = get_prewritten_lib_name(BasicFunctions.ULTRA_SHORT_WIDE_DF_ROWS)\n self.ac.delete_library(lib_name)\n lib = self.ac.create_library(lib_name)\n lib.write(\n \"ultra_short_wide_sym\",\n generate_random_floats_dataframe(BasicFunctions.ULTRA_SHORT_WIDE_DF_ROWS, BasicFunctions.WIDE_DF_COLS),\n )",
         "name": "basic_functions.BasicFunctions.peakmem_read",

python/benchmarks/arrow.py

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
+"""
+Copyright 2025 Man Group Operations Limited
+
+Use of this software is governed by the Business Source License 1.1 included in the file licenses/BSL.txt.
+
+As of the Change Date specified in that file, in accordance with the Business Source License, use of this software will be governed by the Apache License, version 2.0.
+"""
+
+
+import numpy as np
+import pandas as pd
+
+from arcticdb import Arctic, OutputFormat
+from arcticdb.util.test import random_strings_of_length
+
+
+class ArrowReadNumeric:
+    number = 5
+    warmup_time = 0
+    timeout = 6000
+    rounds = 1
+    connection_string = "lmdb://arrow_read_numeric?map_size=20GB"
+    lib_name = "arrow_read_numeric"
+    params = ([100_000, 100_000_000], [None, "middle"])
+    param_names = ["rows", "date_range"]
+
+    def symbol_name(self, num_rows: int):
+        return f"numeric_{num_rows}_rows"
+
+    def setup_cache(self):
+        self.ac = Arctic(self.connection_string, output_format=OutputFormat.EXPERIMENTAL_ARROW)
+        num_rows, date_ranges = self.params
+        num_cols = 9 # 10 including the index column
+        self.ac.delete_library(self.lib_name)
+        self.ac.create_library(self.lib_name)
+        lib = self.ac.get_library(self.lib_name)
+        for rows in num_rows:
+            df = pd.DataFrame(
+                {
+                    f"col{idx}": np.arange(idx * rows, (idx + 1) * rows, dtype=np.int64) for idx in range(num_cols)
+                },
+                index = pd.date_range("1970-01-01", freq="ns", periods=rows)
+            )
+            lib.write(self.symbol_name(rows), df)
+
+    def teardown(self, rows, date_range):
+        del self.ac
+
+    def setup(self, rows, date_range):
+        self.ac = Arctic(self.connection_string, output_format=OutputFormat.EXPERIMENTAL_ARROW)
+        self.lib = self.ac.get_library(self.lib_name)
+        if date_range is None:
+            self.date_range = None
+        else:
+            # Create a date range that excludes the first and last 10 rows of the data only
+            self.date_range = (pd.Timestamp(10), pd.Timestamp(rows - 10))
+
+    def time_read(self, rows, date_range):
+        self.lib.read(self.symbol_name(rows), date_range=self.date_range)
+
+    def peakmem_read(self, rows, date_range):
+        self.lib.read(self.symbol_name(rows), date_range=self.date_range)
+
+
+class ArrowReadStrings:
+    number = 5
+    warmup_time = 0
+    timeout = 6000
+    rounds = 1
+    connection_string = "lmdb://arrow_read_strings?map_size=20GB"
+    lib_name = "arrow_read_strings"
+    params = ([10_000, 1_000_000], [None, "middle"], [1, 100, 100_000])
+    param_names = ["rows", "date_range", "unique_string_count"]
+
+    def symbol_name(self, num_rows: int, unique_strings: int):
+        return f"string_{num_rows}_rows_{unique_strings}_unique_strings"
+
+    def setup_cache(self):
+        rng = np.random.default_rng()
+        self.ac = Arctic(self.connection_string, output_format=OutputFormat.EXPERIMENTAL_ARROW)
+        num_rows, date_ranges, unique_string_counts = self.params
+        num_cols = 10
+        self.ac.delete_library(self.lib_name)
+        self.ac.create_library(self.lib_name)
+        lib = self.ac.get_library(self.lib_name)
+        for unique_string_count in unique_string_counts:
+            strings = np.array(random_strings_of_length(unique_string_count, 10, unique=True))
+            for rows in num_rows:
+                df = pd.DataFrame(
+                    {
+                        f"col{idx}": rng.choice(strings, rows) for idx in range(num_cols)
+                    },
+                    index = pd.date_range("1970-01-01", freq="ns", periods=rows)
+                )
+                lib.write(self.symbol_name(rows, unique_string_count), df)
+
+    def teardown(self, rows, date_range, unique_string_count):
+        del self.ac
+
+    def setup(self, rows, date_range, unique_string_count):
+        self.ac = Arctic(self.connection_string, output_format=OutputFormat.EXPERIMENTAL_ARROW)
+        self.lib = self.ac.get_library(self.lib_name)
+        if date_range is None:
+            self.date_range = None
+        else:
+            # Create a date range that excludes the first and last 10 rows of the data only
+            self.date_range = (pd.Timestamp(10), pd.Timestamp(rows - 10))
+
+    def time_read(self, rows, date_range, unique_string_count):
+        self.lib.read(self.symbol_name(rows, unique_string_count), date_range=self.date_range)
+
+    def peakmem_read(self, rows, date_range, unique_string_count):
+        self.lib.read(self.symbol_name(rows, unique_string_count), date_range=self.date_range)
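For reference, a quick standalone check (not part of the commit) of what the "middle" `date_range` above selects: `setup_cache` writes a nanosecond-frequency index starting at the epoch, so row `i` sits at `i` nanoseconds and `(pd.Timestamp(10), pd.Timestamp(rows - 10))` trims roughly ten rows from each end.

```python
import pandas as pd

rows = 1_000
index = pd.date_range("1970-01-01", freq="ns", periods=rows)

# pd.Timestamp(10) is 10ns after the epoch, i.e. the 11th row of the index
start, end = pd.Timestamp(10), pd.Timestamp(rows - 10)
selected = index[(index >= start) & (index <= end)]

print(len(index), len(selected))  # 1000 981: the first 10 and last 9 rows are excluded
```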
