Commit 3cc34ff

perf(cubestore): Reduce allocations in info schema tables (#9855)
- Replace intermediate Vec allocations with direct iterator consumption
- Use from_iter/from_iter_values instead of collect() followed by from()
- Remove unnecessary .as_str() conversions where &String is sufficient
- Simplify Option<String> handling with .as_deref() instead of .as_ref().map()

This optimization eliminates ~78 unnecessary heap allocations across all info schema table implementations, improving performance especially with large datasets.
1 parent e79d203 commit 3cc34ff
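The before/after shape the commit message describes can be sketched with plain std types (an illustrative analogy only; the actual code builds Arrow arrays such as `StringArray`, as shown in the diffs below):

```rust
fn main() {
    let names: Vec<String> = vec!["orders".into(), "users".into()];

    // Before: map to &str, collect an intermediate Vec, then build from it.
    let parts: Vec<&str> = names.iter().map(|n| n.as_str()).collect();
    let before = String::from_iter(parts);

    // After: feed the iterator straight into the FromIterator-style
    // constructor, skipping the intermediate Vec allocation entirely.
    let after = String::from_iter(names.iter().map(|n| n.as_str()));
    assert_eq!(before, after);

    // Option<String> -> Option<&str>: .as_deref() replaces the longer
    // .as_ref().map(|s| s.as_str()) chain.
    let expire: Option<String> = Some("never".into());
    assert_eq!(expire.as_deref(), expire.as_ref().map(|s| s.as_str()));
}
```

Arrow's `from_iter_values` applies the same idea: it consumes the iterator directly instead of requiring a pre-collected `Vec`.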

16 files changed, +391 −468 lines

rust/cubestore/CLAUDE.md

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Repository Overview

CubeStore is the Rust-based distributed OLAP storage engine for Cube.js, designed to store and serve pre-aggregations at scale. It's part of the larger Cube.js monorepo and serves as the materialized cache store for rollup tables.

## Architecture Overview

### Core Components

The codebase is organized as a Rust workspace with multiple crates:

- **`cubestore`**: Main CubeStore implementation with distributed storage, query execution, and API interfaces
- **`cubestore-sql-tests`**: SQL compatibility test suite and benchmarks
- **`cubehll`**: HyperLogLog implementation for approximate distinct counting
- **`cubedatasketches`**: DataSketches integration for advanced approximate algorithms
- **`cubezetasketch`**: Theta Sketch implementation for set operations
- **`cuberpc`**: RPC layer for distributed communication
- **`cuberockstore`**: RocksDB wrapper and storage abstraction

### Key Modules in `cubestore/src/`

- **`metastore/`**: Metadata management, table schemas, partitioning, and distributed coordination
- **`queryplanner/`**: Query planning, optimization, and physical execution planning using DataFusion
- **`store/`**: Core storage layer with compaction and data management
- **`cluster/`**: Distributed cluster management, worker pools, and inter-node communication
- **`table/`**: Table data handling, Parquet integration, and data redistribution
- **`cachestore/`**: Caching layer with eviction policies and queue management
- **`sql/`**: SQL parsing and execution layer
- **`streaming/`**: Kafka streaming support and traffic handling
- **`remotefs/`**: Cloud storage integration (S3, GCS, MinIO)
- **`config/`**: Dependency injection and configuration management

## Development Commands

### Building

```bash
# Build all crates in release mode
cargo build --release

# Build all crates in debug mode
cargo build

# Build specific crate
cargo build -p cubestore

# Check code without building
cargo check
```

### Testing

```bash
# Run all tests
cargo test

# Run tests for specific crate
cargo test -p cubestore
cargo test -p cubestore-sql-tests

# Run single test
cargo test test_name

# Run tests with output
cargo test -- --nocapture

# Run integration tests
cargo test --test '*'

# Run benchmarks
cargo bench
```

### Development

```bash
# Format code
cargo fmt

# Check formatting
cargo fmt -- --check

# Run clippy lints
cargo clippy

# Run with debug logging
RUST_LOG=debug cargo run

# Run specific binary
cargo run --bin cubestore

# Watch for changes (requires cargo-watch)
cargo watch -x check -x test
```

### JavaScript Wrapper Commands

```bash
# Build TypeScript wrapper
npm run build

# Run JavaScript tests
npm test

# Lint JavaScript code
npm run lint

# Fix linting issues
npm run lint:fix
```

## Key Dependencies and Technologies

- **DataFusion**: Apache Arrow-based query engine (using Cube's fork)
- **Apache Arrow/Parquet**: Columnar data format and processing
- **RocksDB**: Embedded key-value store for metadata
- **Tokio**: Async runtime for concurrent operations
- **sqlparser-rs**: SQL parsing (using Cube's fork)

## Configuration via Dependency Injection

The codebase uses a custom dependency injection system defined in `config/injection.rs`. Services are configured through the `Injector` and use `Arc<dyn ServiceTrait>` patterns for abstraction.
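The `Arc<dyn ServiceTrait>` pattern can be sketched as follows. The trait and struct names here are hypothetical stand-ins, not the real `Injector` API:

```rust
use std::sync::Arc;

// Hypothetical service trait: callers depend on this abstraction,
// never on a concrete implementation.
trait MetaStoreService {
    fn table_count(&self) -> usize;
}

// One possible implementation that an injector could register.
struct InMemoryMetaStore {
    tables: Vec<String>,
}

impl MetaStoreService for InMemoryMetaStore {
    fn table_count(&self) -> usize {
        self.tables.len()
    }
}

fn main() {
    // The injector hands out services as shared trait objects, so any
    // implementation satisfying the trait can be swapped in.
    let service: Arc<dyn MetaStoreService> = Arc::new(InMemoryMetaStore {
        tables: vec!["orders".to_string(), "users".to_string()],
    });
    assert_eq!(service.table_count(), 2);
}
```

Because the service is held behind `Arc`, it can be cloned cheaply and shared across Tokio tasks.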
## Testing Approach

- Unit tests are colocated with source files using `#[cfg(test)]` modules
- Integration tests are in `cubestore-sql-tests/tests/`
- SQL compatibility tests use fixtures in `cubestore-sql-tests/src/tests.rs`
- Benchmarks are in `benches/` directories
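A colocated `#[cfg(test)]` module of the kind described above looks like this; the function is a made-up example, not actual CubeStore code:

```rust
// Illustrative library function living in some src/ file.
fn partition_key(table_id: u64, row_id: u64) -> u64 {
    // Pack two 32-bit ids into one 64-bit key.
    (table_id << 32) | row_id
}

// Unit tests sit in the same file, compiled only under `cargo test`.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn packs_table_and_row_ids() {
        assert_eq!(partition_key(1, 2), (1u64 << 32) | 2);
    }
}

fn main() {
    // `cargo test` runs the module above; this call just demonstrates usage.
    assert_eq!(partition_key(1, 2), 4294967298);
}
```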
## Important Notes

- This is a Rust nightly project (see `rust-toolchain.toml`)
- Uses custom forks of Arrow/DataFusion and sqlparser-rs for Cube-specific features
- Distributed mode involves router and worker nodes communicating via RPC
- Heavy use of async/await patterns with Tokio runtime
- Parquet files are the primary storage format for data

rust/cubestore/Cargo.lock

Lines changed: 0 additions & 36 deletions

rust/cubestore/cubestore/Cargo.toml

Lines changed: 0 additions & 1 deletion
```diff
@@ -74,7 +74,6 @@ tokio-stream = { version = "0.1.15", features=["io-util"] }
 scopeguard = "1.1.0"
 async-compression = { version = "0.3.7", features = ["gzip", "tokio"] }
 tempfile = "3.10.1"
-tarpc = { version = "0.24", features = ["tokio1"] }
 pin-project-lite = "0.2.4"
 paste = "1.0.4"
 memchr = "2"
```

rust/cubestore/cubestore/src/queryplanner/info_schema/info_schema_columns.rs

Lines changed: 8 additions & 14 deletions
```diff
@@ -43,35 +43,29 @@ impl InfoSchemaTableDef for ColumnsInfoSchemaTableDef {
     fn columns(&self) -> Vec<Box<dyn Fn(Arc<Vec<(Column, TablePath)>>) -> ArrayRef>> {
         vec![
             Box::new(|tables| {
-                Arc::new(StringArray::from(
+                Arc::new(StringArray::from_iter_values(
                     tables
                         .iter()
-                        .map(|(_, row)| row.schema.get_row().get_name().as_str())
-                        .collect::<Vec<_>>(),
+                        .map(|(_, row)| row.schema.get_row().get_name()),
                 ))
             }),
             Box::new(|tables| {
-                Arc::new(StringArray::from(
+                Arc::new(StringArray::from_iter_values(
                     tables
                         .iter()
-                        .map(|(_, row)| row.table.get_row().get_table_name().as_str())
-                        .collect::<Vec<_>>(),
+                        .map(|(_, row)| row.table.get_row().get_table_name()),
                 ))
             }),
             Box::new(|tables| {
-                Arc::new(StringArray::from(
-                    tables
-                        .iter()
-                        .map(|(column, _)| column.get_name().as_str())
-                        .collect::<Vec<_>>(),
+                Arc::new(StringArray::from_iter_values(
+                    tables.iter().map(|(column, _)| column.get_name()),
                 ))
             }),
             Box::new(|tables| {
-                Arc::new(StringArray::from(
+                Arc::new(StringArray::from_iter_values(
                     tables
                         .iter()
-                        .map(|(column, _)| column.get_column_type().to_string())
-                        .collect::<Vec<_>>(),
+                        .map(|(column, _)| column.get_column_type().to_string()),
                 ))
             }),
         ]
```

rust/cubestore/cubestore/src/queryplanner/info_schema/info_schema_schemata.rs

Lines changed: 2 additions & 5 deletions
```diff
@@ -26,11 +26,8 @@ impl InfoSchemaTableDef for SchemataInfoSchemaTableDef {
 
     fn columns(&self) -> Vec<Box<dyn Fn(Arc<Vec<Self::T>>) -> ArrayRef>> {
         vec![Box::new(|tables| {
-            Arc::new(StringArray::from(
-                tables
-                    .iter()
-                    .map(|row| row.get_row().get_name().as_str())
-                    .collect::<Vec<_>>(),
+            Arc::new(StringArray::from_iter_values(
+                tables.iter().map(|row| row.get_row().get_name()),
             ))
         })]
     }
```

rust/cubestore/cubestore/src/queryplanner/info_schema/info_schema_tables.rs

Lines changed: 22 additions & 32 deletions
```diff
@@ -40,48 +40,38 @@ impl InfoSchemaTableDef for TablesInfoSchemaTableDef {
     fn columns(&self) -> Vec<Box<dyn Fn(Arc<Vec<Self::T>>) -> ArrayRef>> {
         vec![
             Box::new(|tables| {
-                Arc::new(StringArray::from(
-                    tables
-                        .iter()
-                        .map(|row| row.schema.get_row().get_name().as_str())
-                        .collect::<Vec<_>>(),
+                Arc::new(StringArray::from_iter_values(
+                    tables.iter().map(|row| row.schema.get_row().get_name()),
                 ))
             }),
             Box::new(|tables| {
-                Arc::new(StringArray::from(
+                Arc::new(StringArray::from_iter_values(
                     tables
                         .iter()
-                        .map(|row| row.table.get_row().get_table_name().as_str())
-                        .collect::<Vec<_>>(),
+                        .map(|row| row.table.get_row().get_table_name()),
                 ))
             }),
             Box::new(|tables| {
-                Arc::new(TimestampNanosecondArray::from(
-                    tables
-                        .iter()
-                        .map(|row| {
-                            row.table
-                                .get_row()
-                                .build_range_end()
-                                .as_ref()
-                                .map(|t| t.timestamp_nanos())
-                        })
-                        .collect::<Vec<_>>(),
-                ))
+                Arc::new(TimestampNanosecondArray::from_iter(tables.iter().map(
+                    |row| {
+                        row.table
+                            .get_row()
+                            .build_range_end()
+                            .as_ref()
+                            .map(|t| t.timestamp_nanos())
+                    },
+                )))
             }),
             Box::new(|tables| {
-                Arc::new(TimestampNanosecondArray::from(
-                    tables
-                        .iter()
-                        .map(|row| {
-                            row.table
-                                .get_row()
-                                .seal_at()
-                                .as_ref()
-                                .map(|t| t.timestamp_nanos())
-                        })
-                        .collect::<Vec<_>>(),
-                ))
+                Arc::new(TimestampNanosecondArray::from_iter(tables.iter().map(
+                    |row| {
+                        row.table
+                            .get_row()
+                            .seal_at()
+                            .as_ref()
+                            .map(|t| t.timestamp_nanos())
+                    },
+                )))
             }),
         ]
     }
```

rust/cubestore/cubestore/src/queryplanner/info_schema/system_cache.rs

Lines changed: 8 additions & 11 deletions
```diff
@@ -47,17 +47,14 @@ impl InfoSchemaTableDef for SystemCacheTableDef {
                 ))
             }),
             Box::new(|items| {
-                Arc::new(TimestampNanosecondArray::from(
-                    items
-                        .iter()
-                        .map(|row| {
-                            row.get_row()
-                                .get_expire()
-                                .as_ref()
-                                .map(|t| t.timestamp_nanos())
-                        })
-                        .collect::<Vec<_>>(),
-                ))
+                Arc::new(TimestampNanosecondArray::from_iter(items.iter().map(
+                    |row| {
+                        row.get_row()
+                            .get_expire()
+                            .as_ref()
+                            .map(|t| t.timestamp_nanos())
+                    },
+                )))
             }),
             Box::new(|items| {
                 Arc::new(StringArray::from_iter(
```
