RFC: Closure expression optimization for predicate pushdown

- Feature Name: closure_expression_optimization
- Start Date: 2024-12-08
- RFC PR: [nushell/rfcs#0000](https://github.com/nushell/rfcs/pull/0000)
- Nushell Issue: [nushell/nushell#0000](https://github.com/nushell/nushell/issues/0000)

# Summary

Enable nushell to analyze and optimize closures passed to commands like `where`, `select`, and `sort-by` by inspecting their AST structure. This allows optimizations such as predicate pushdown to data sources (databases, Polars DataFrames) and reordering of pipeline stages.

# Motivation

Currently, closures in nushell are opaque to the runtime. When you write:

```nushell
open data.db | query db "SELECT * FROM users" | where { $in.age > 30 }
```

Nushell must fetch all rows and filter them in memory, even though the filter could be pushed to the database as `WHERE age > 30`.

Similarly, with Polars:

```nushell
polars open large.parquet | polars into-nu | where { $in.status == "active" }
```

The entire parquet file is loaded into memory before filtering, rather than letting Polars apply the predicate during the scan.

SQL optimizers have done predicate pushdown for decades because `σ_p(A ⋈ B) ≡ σ_p(A) ⋈ B` when the predicate `p` only touches A's columns. Nushell could benefit from the same optimization if closures were analyzable.

The key insight is that nushell already parses closures into an AST; they are not compiled to opaque bytecode the way Python lambdas are. The infrastructure exists; we just need to build the analysis pass.

# Guide-level explanation

## What changes for users

Nothing changes syntactically. Users write the same closures they always have. The difference is that certain patterns will execute faster because nushell can optimize them.

For example, this pipeline:

```nushell
polars open sales.parquet | polars into-nu | where { $in.year == 2024 } | where { $in.amount > 1000 }
```

could be automatically optimized to push both predicates down to Polars, equivalent to:

```nushell
polars open sales.parquet | polars filter ((polars col year) == 2024) | polars filter ((polars col amount) > 1000) | polars into-nu
```

## Optimizable vs non-optimizable closures

Not all closures can be optimized. The optimizer recognizes specific patterns:

**Optimizable:**
```nushell
{ $in.age > 30 }                      # Simple comparison
{ $in.name == "Alice" }               # Equality check
{ $in.age > 30 and $in.active }       # Logical combinations
{ $in.score >= $threshold }           # Captured variables (evaluated first)
{ $in.name | str starts-with "A" }    # Known-pure string operations
```

**Not optimizable (falls back to normal execution):**
```nushell
{ $in.age > (expensive_computation) }   # Arbitrary subexpressions
{ $in | custom-command }                # Unknown command purity
{ mut x = 0; $x += $in.val; $x }        # Mutation
{ print $in.name; $in.age > 30 }        # Side effects
```

When a closure cannot be optimized, it executes exactly as it does today: no behavior change, just no optimization.

## Inspecting optimization decisions

A new flag could be added to show optimization decisions:

```nushell
> open db.sqlite | query db "SELECT * FROM users" | where { $in.age > 30 } --explain
# Predicate `$in.age > 30` pushed to SQL: WHERE age > 30
```

# Reference-level explanation

## Closure analysis

The optimizer walks the closure's AST and attempts to extract a "predicate expression" that can be translated to the target system. The analysis proceeds as follows:

1. **Entry point check**: Verify the closure body is a single expression (not a block with multiple statements).

2. **Expression classification**: Recursively classify each AST node (see the sketch after this list):
   - `BinaryOp(==, !=, <, >, <=, >=)` with field access and literal → `Translatable`
   - `BinaryOp(and, or)` with translatable children → `Translatable`
   - `UnaryOp(not)` with translatable child → `Translatable`
   - Field access on `$in` → `FieldRef`
   - Literal values → `Literal`
   - Captured variable reference → evaluate eagerly, treat as `Literal`
   - Known-pure commands (e.g., `str starts-with`) → `Translatable` if args are translatable
   - Anything else → `Opaque`

3. **Purity verification**: Confirm no side effects exist in the expression tree. This requires maintaining a registry of pure commands.

4. **Translation**: Convert the classified AST to the target representation (SQL WHERE clause, Polars expression, etc.).

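To make the classification step concrete, here is a minimal sketch. The `Expr` input type, the `Classification` result, and all names here are illustrative stand-ins rather than nushell's actual AST or engine types; a real pass would operate on nushell's own expression types (in `nu_protocol`) and would consult the purity registry for command calls.

```rust
// Illustrative sketch only: `Expr` is a simplified stand-in for nushell's AST.
#[derive(Clone, Debug)]
enum Expr {
    FieldAccess(String),                                      // $in.<field>
    Literal(String),                                          // literal value (simplified to a string)
    Captured(String),                                         // captured variable, already resolved
    BinaryOp { op: String, lhs: Box<Expr>, rhs: Box<Expr> },  // comparison or logical operator
    Other,                                                    // anything the optimizer does not recognize
}

#[derive(Clone, Debug)]
enum Classification {
    FieldRef(String),    // a reference to a column of the current row
    Literal(String),     // a concrete value known at optimization time
    Translatable(Expr),  // the whole subtree can be handed to the source
    Opaque,              // fall back to ordinary in-memory evaluation
}

fn classify(expr: &Expr) -> Classification {
    match expr {
        Expr::FieldAccess(name) => Classification::FieldRef(name.clone()),
        // Captured variables are evaluated eagerly and treated as literals.
        Expr::Literal(v) | Expr::Captured(v) => Classification::Literal(v.clone()),
        Expr::BinaryOp { op, lhs, rhs } => {
            let is_comparison = matches!(op.as_str(), "==" | "!=" | "<" | ">" | "<=" | ">=");
            let is_logical = matches!(op.as_str(), "and" | "or");
            match (is_comparison, is_logical, classify(lhs), classify(rhs)) {
                // field-vs-literal comparison translates directly
                (true, _, Classification::FieldRef(_), Classification::Literal(_)) => {
                    Classification::Translatable(expr.clone())
                }
                // and/or of two translatable children stays translatable
                (_, true, Classification::Translatable(_), Classification::Translatable(_)) => {
                    Classification::Translatable(expr.clone())
                }
                _ => Classification::Opaque,
            }
        }
        Expr::Other => Classification::Opaque,
    }
}
```
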
## Pure command registry

A new registry tracks which commands are pure (no side effects, deterministic output for the same input):

```rust
use std::collections::HashSet;

pub struct PurityRegistry {
    pure_commands: HashSet<String>,
}

impl PurityRegistry {
    pub fn is_pure(&self, cmd: &str) -> bool {
        self.pure_commands.contains(cmd)
    }
}
```

Initial pure commands would include: `str length`, `str starts-with`, `str ends-with`, `str contains`, `str trim`, `math abs`, `math round`, etc.

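As a sketch of how that initial set might be seeded (the `with_builtin_commands` constructor is illustrative, not an existing API), extending the struct above:

```rust
impl PurityRegistry {
    /// Sketch: seed the registry with the initial pure built-ins. In practice
    /// the list would more likely be declared alongside each command's
    /// signature rather than hard-coded here.
    pub fn with_builtin_commands() -> Self {
        let pure_commands = [
            "str length",
            "str starts-with",
            "str ends-with",
            "str contains",
            "str trim",
            "math abs",
            "math round",
        ]
        .iter()
        .map(|s| s.to_string())
        .collect();
        Self { pure_commands }
    }
}
```
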
## Integration points

### Database sources

When a SQL source (`open` on a database file, or `query db`) is followed by a `where` whose predicate is translatable, the pipeline is rewritten:

```
Pipeline: [SqlSource, Where(closure)]
        → [SqlSource(with_where_clause)]
```

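To illustrate the translation step, here is a minimal sketch of rendering an extracted predicate as a SQL `WHERE` clause. The `PredicateExpr` shape and `to_sql_where` function are assumptions for illustration, not a settled design; a real implementation would quote identifiers and bind values as parameters rather than splicing them into the query string.

```rust
// Illustrative intermediate representation and SQL rendering; sketch only.
pub enum PredicateExpr {
    Compare { column: String, op: String, value: String },
    And(Box<PredicateExpr>, Box<PredicateExpr>),
    Or(Box<PredicateExpr>, Box<PredicateExpr>),
    Not(Box<PredicateExpr>),
}

pub fn to_sql_where(pred: &PredicateExpr) -> String {
    match pred {
        PredicateExpr::Compare { column, op, value } => format!("{column} {op} {value}"),
        PredicateExpr::And(l, r) => format!("({} AND {})", to_sql_where(l), to_sql_where(r)),
        PredicateExpr::Or(l, r) => format!("({} OR {})", to_sql_where(l), to_sql_where(r)),
        PredicateExpr::Not(p) => format!("NOT ({})", to_sql_where(p)),
    }
}
```

For `where { $in.age > 30 }`, the rewritten source would then send `SELECT * FROM users WHERE age > 30` instead of fetching every row.
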
### Polars plugin

The Polars plugin could expose an optimization hook:

```rust
trait OptimizableSource {
    fn accepts_predicate(&self, pred: &PredicateExpr) -> bool;
    fn push_predicate(&mut self, pred: PredicateExpr);
}
```

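On the engine side, pipeline planning might drive this hook roughly as follows (a sketch, reusing the illustrative `PredicateExpr` from the SQL example above):

```rust
// Sketch of planning around the hook: if the source accepts the predicate,
// the downstream `where` stage can be dropped from the pipeline; otherwise
// nothing changes and the closure filters in memory exactly as it does today.
fn try_pushdown(source: &mut dyn OptimizableSource, pred: PredicateExpr) -> bool {
    if source.accepts_predicate(&pred) {
        source.push_predicate(pred);
        true
    } else {
        false
    }
}
```
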
### Pipeline reordering

Beyond pushdown, the optimizer could reorder operations:

```nushell
# Before: sort entire dataset, then filter
ls | sort-by size | where { $in.size > 1mb }

# After: filter first, sort smaller dataset
ls | where { $in.size > 1mb } | sort-by size
```

This reordering is valid when the predicate depends only on each row's own values, not on the row's position in the sorted output.

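A sketch of that rewrite rule over a hypothetical logical plan (the `Stage` and `Predicate` types here are illustrative, not nushell's real pipeline representation):

```rust
// Illustrative plan nodes; sketch only.
#[derive(Clone)]
enum Stage {
    Source(String),
    Where(Predicate),
    SortBy(String),
}

#[derive(Clone)]
struct Predicate {
    /// true when the predicate reads only the current row's own values
    /// (no dependence on position, rank, or neighbouring rows)
    row_local: bool,
}

/// Hoist a row-local `where` above an adjacent `sort-by` so the sort
/// operates on the already-filtered, smaller data.
fn reorder(stages: Vec<Stage>) -> Vec<Stage> {
    let mut out: Vec<Stage> = Vec::new();
    for stage in stages {
        let swap = matches!(
            (&stage, out.last()),
            (Stage::Where(pred), Some(Stage::SortBy(_))) if pred.row_local
        );
        if swap {
            let sort = out.pop().expect("checked by matches! above");
            out.push(stage); // where
            out.push(sort);  // sort-by
        } else {
            out.push(stage);
        }
    }
    out
}
```
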
## Captured variables

Captured variables must be evaluated before predicate translation:

```nushell
let threshold = 30
where { $in.age > $threshold }
# Translates to: WHERE age > 30 (not WHERE age > threshold)
```

The optimizer evaluates `$threshold` at optimization time and substitutes the concrete value.

## Escape hatch

If optimization causes issues, users can force opaque execution:

```nushell
where {|| $in.age > 30 }   # Double-pipe signals "don't optimize"
```

Or a command flag:

```nushell
where --no-optimize { $in.age > 30 }
```

# Drawbacks

1. **Complexity**: Adds significant complexity to the pipeline execution model. More code to maintain, more edge cases to handle.

2. **Semantic subtlety**: Optimization changes *when* code runs. A closure like `{ $in.age > (random float) }` would behave differently if the call were hoisted to optimization time rather than evaluated per row. The optimizer must be conservative.

3. **Debugging difficulty**: When predicates are pushed down, error messages and stack traces may be confusing: the error occurs in the database, not in nushell.

4. **Limited applicability**: Many nushell users work with small datasets where the optimization overhead exceeds the benefits.

5. **Pure command maintenance**: The purity registry must be maintained as commands are added or modified.

# Rationale and alternatives

## Why AST analysis over an expression sublanguage

**Alternative**: Create a restricted expression DSL separate from closures (like Polars does).

**Rationale against**: This fragments the language. Users would need to learn when to use closures vs expressions. AST analysis keeps the syntax unified: write closures everywhere, get optimization where possible.

## Why not JIT compilation

**Alternative**: JIT compile closures and use runtime profiling to guide optimization.

**Rationale against**: Massive implementation complexity. JIT requires platform-specific code generation and sophisticated runtime infrastructure. AST analysis is simpler and portable.

## Why not type-directed optimization

**Alternative**: Use a richer type system (algebraic effects, purity types) to determine what's optimizable.

**Rationale against**: Would require significant language changes. AST pattern matching is pragmatic and doesn't change the surface language.

## Impact of not doing this

Nushell remains slower than it could be for database and dataframe workloads. Users working with large datasets continue to need manual optimization (writing SQL directly, using Polars expressions explicitly).

# Prior art

## LINQ (C#)

LINQ's `Expression<Func<T, bool>>` captures lambdas as expression trees rather than compiled delegates. This enables Entity Framework to translate:

```csharp
users.Where(u => u.Age > 30)
```

to SQL. Nushell's situation is analogous: we already have ASTs, we just don't exploit them.

## Spark SQL

Spark DataFrames build logical plans from operations. User-defined functions (UDFs) break optimization because they're opaque. Spark's documentation explicitly warns about this; vectorized "Pandas UDFs" reduce the per-call overhead but remain opaque to the optimizer.

## jq

jq is a small, essentially side-effect-free language for JSON transformation. Its restricted semantics make many optimizations safe, because filters cannot touch external state. Nushell closures are more powerful, but the optimizer could identify a "jq-like subset" that is safe to analyze.

## Polars lazy expressions

Polars expressions (`pl.col("age") > 30`) are designed for optimization from the start. They are not general-purpose code; they are a DSL. Nushell's approach would be to recognize when closures happen to match this DSL's semantics.

# Unresolved questions

1. **Purity granularity**: Should purity be per-command or per-invocation? `http get` is impure, but `str length` is pure. What about `random int`, which is side-effect free but non-deterministic?

2. **Optimization visibility**: How do users know when optimization occurred? Silent optimization is convenient but makes debugging harder.

3. **Cross-plugin protocol**: How do plugins advertise optimization capabilities? A new protocol message type?

4. **Partial pushdown**: If a predicate is `$in.age > 30 and (complex_thing)`, can we push the first part and keep the second?

5. **Correctness testing**: How do we ensure optimized and unoptimized paths produce identical results?

# Future possibilities

## Projection pushdown

Beyond predicates, push column selection:

```nushell
open db.sqlite | query db "SELECT * FROM users" | select name age
# → SELECT name, age FROM users
```

## Join optimization

Analyze multiple data sources and optimize join order:

```nushell
$users | join $orders id | where { $in.total > 100 }
# Could push the predicate to the orders table before the join
```

## Cost-based optimization

With statistics about data sources (row counts, index availability), make smarter decisions about when pushdown helps vs hurts.

## User-defined purity annotations

Allow users to mark custom commands as pure:

```nushell
def my-transform [x] --pure {
    $x | str upcase | str trim
}
```

## Incremental/streaming optimization

For streaming data sources, maintain optimizer state across chunks to enable cross-chunk optimizations.
