Commit f82899e

Add dense_rank
Related to #710
1 parent 6e8735c commit f82899e

8 files changed: +133 -10 lines changed
CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@
 * Adding `aarch64-apple-darwin` and `aarch64-unknown-linux-gnu` to CI builds.
 * Adding `to_fixed` moonblade function.
 * Adding decimal places optional argument to `ratio` & `percentage` aggregation functions.
+* Adding `frac` & `dense_rank` aggregation functions to `xan window`.

 *Fixes*

docs/cmd/window.md

Lines changed: 13 additions & 1 deletion
@@ -3,12 +3,16 @@

 ```txt
 Compute window aggregations such as cumulative sums, rolling means, leading and
-lagging values etc.
+lagging values, rankings etc.

 This command is able to compute multiple aggregations in a single pass over the
 file, and never uses more memory that required to fit the largest desired window
 for rolling stats and leads/lags.

+Ranking aggregations however (such as `frac` or `dense_rank`), still require to
+buffer the whole file in memory (or at least whole groups when using -g/--groupby),
+since they cannot be computed otherwise.
+
 Computing a cumulative sum:

     $ xan window 'cumsum(n)' file.csv
@@ -21,6 +25,14 @@ Adding a lagged column:

     $ xan window 'lag(n) as "n-1"' file.csv

+Ranking numerical values:
+
+    $ xan window 'dense_rank(n) as rank' file.csv
+
+Computing fraction of cell wrt total sum of target column:
+
+    $ xan window 'frac(n) as frac' file.csv
+
 This command is also able to reset the statistics each time a new contiguous group
 of rows is encountered using the -g/--groupby flag. This means, however, that
 the file must be sorted by columns representing group identities beforehand:

docs/moonblade/aggs.md

Lines changed: 2 additions & 2 deletions
@@ -48,12 +48,12 @@ the number of nodes in a graph represented by a CSV edge list.
 - **mode**(*\<expr\>*) -> `string`: Value appearing the most, breaking ties arbitrarily in favor of the first value in lexicographical order.
 - **most_common**(*k*, *\<expr\>*, *separator?*) -> `string`: List of top k most common values returned by expression joined by a pipe character ('|') or by the provided separator. Ties will be broken by lexicographical order.
 - **most_common_counts**(*k*, *\<expr\>*, *separator?*) -> `string`: List of top k most common counts returned by expression joined by a pipe character ('|') or by the provided separator. Ties will be broken by lexicographical order.
-- **percentage**(*\<expr\>*) -> `string`: Return the percentage of truthy values returned by expression.
+- **percentage**(*\<expr\>*, *decimals?*) -> `string`: Return the percentage of truthy values returned by expression, up to an optional number of decimal places.
 - **quantile**(*\<expr\>*, *q*) -> `number`: Return the desired quantile of numerical values.
 - **q1**(*\<expr\>*) -> `number`: Return the first quartile of numerical values.
 - **q2**(*\<expr\>*) -> `number`: Return the second quartile of numerical values.
 - **q3**(*\<expr\>*) -> `number`: Return the third quartile of numerical values.
-- **ratio**(*\<expr\>*) -> `number`: Return the ratio of truthy values returned by expression.
+- **ratio**(*\<expr\>*, *decimals?*) -> `number`: Return the ratio of truthy values returned by expression, up to an optional number of decimal places.
 - **rms**(*\<expr\>*) -> `number`: Return the Root Mean Square of numerical values.
 - **stddev**(*\<expr\>*) -> `number`: Population standard deviation. Same as `stddev_pop`.
 - **stddev_pop**(*\<expr\>*) -> `number`: Population standard deviation. Same as `stddev`.

docs/moonblade/window.md

Lines changed: 2 additions & 0 deletions
@@ -3,6 +3,8 @@
 - **cummax**(*\<expr\>*) -> `number`: Returns the cumulative maximum of the numbers yielded by given expression.
 - **cummin**(*\<expr\>*) -> `number`: Returns the cumulative minimum of the numbers yielded by given expression.
 - **cumsum**(*\<expr\>*) -> `number`: Returns the cumulative sum of the numbers yielded by given expression.
+- **dense_rank**(*\<expr\>*) -> `number`: Returns the dense rank (there will be no gaps, but ties remain possible for a same rank) of numbers yielded by given expression. Beware, as this requires buffering whole file or group.
+- **frac**(*\<expr\>*, *decimals?*) -> `number`: Returns the fraction represented by numbers yielded by given expression over the total sum of them. Beware, as this requires buffering whole file or group.
 - **lag**(*\<expr\>*, *steps?*, *\<expr\>?*) -> `any`: Returns a value yielded by given expression, lagged by n steps or 1 step by default. Can take a second expression after the number of steps to return a default value for rows that come before first lagged value.
 - **lead**(*\<expr\>*, *steps?*, *\<expr\>?*) -> `any`: Returns a value yielded by given expression, leading by n steps or 1 step by default. Can take a second expression after the number of steps to return a default value for rows that come after last lead value.
 - **rolling_avg**(*window_size*, *\<expr\>*) -> `number`: Returns the rolling average in given window size of numbers yielded by given expression. Same as `rolling_mean`.
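
As a rough illustration of why `frac` needs the whole file or group buffered, here is a minimal standalone sketch in Rust. It is not the xan implementation: the free function `frac`, its `f64` inputs and its `String` outputs are assumptions made for the example. It only shows the two-pass shape (sum everything first, then divide each value) and the optional `decimals` formatting.

```rust
/// Hypothetical helper, not part of xan: returns each value divided by the
/// total sum of all values, formatted with an optional number of decimals.
fn frac(values: &[f64], decimals: Option<usize>) -> Vec<String> {
    // First pass: the total is only known once every value has been seen,
    // which is why the rows must be buffered.
    let total: f64 = values.iter().sum();

    // Second pass: divide each buffered value by the total.
    // Note: a zero total yields NaN or infinity here; how xan treats that
    // case is not shown in this diff.
    values
        .iter()
        .map(|v| {
            let f = v / total;
            match decimals {
                None => f.to_string(),
                Some(d) => format!("{:.p$}", f, p = d),
            }
        })
        .collect()
}

fn main() {
    for cell in frac(&[1.0, 1.0, 2.0], Some(2)) {
        println!("{}", cell);
    }
}
```
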

src/cmd/window.rs

Lines changed: 13 additions & 1 deletion
@@ -6,12 +6,16 @@ use crate::CliResult;

 static USAGE: &str = "
 Compute window aggregations such as cumulative sums, rolling means, leading and
-lagging values etc.
+lagging values, rankings etc.

 This command is able to compute multiple aggregations in a single pass over the
 file, and never uses more memory that required to fit the largest desired window
 for rolling stats and leads/lags.

+Ranking aggregations however (such as `frac` or `dense_rank`), still require to
+buffer the whole file in memory (or at least whole groups when using -g/--groupby),
+since they cannot be computed otherwise.
+
 Computing a cumulative sum:

     $ xan window 'cumsum(n)' file.csv
@@ -24,6 +28,14 @@ Adding a lagged column:

     $ xan window 'lag(n) as \"n-1\"' file.csv

+Ranking numerical values:
+
+    $ xan window 'dense_rank(n) as rank' file.csv
+
+Computing fraction of cell wrt total sum of target column:
+
+    $ xan window 'frac(n) as frac' file.csv
+
 This command is also able to reset the statistics each time a new contiguous group
 of rows is encountered using the -g/--groupby flag. This means, however, that
 the file must be sorted by columns representing group identities beforehand:

src/moonblade/agg/window.rs

Lines changed: 56 additions & 6 deletions
@@ -121,6 +121,7 @@ enum ConcreteWindowAggregation {
     RollingSum(ConcreteExpr, RollingSum),
     RollingWelford(ConcreteExpr, WelfordStat, RollingWelford),
     Frac(ConcreteExpr, Sum, Option<usize>),
+    DenseRank(ConcreteExpr, Vec<(DynamicNumber, usize)>, VecDeque<usize>),
 }

 fn eval_expression_to_number(
@@ -144,7 +145,7 @@ impl ConcreteWindowAggregation {
     }

     fn requires_total_buffer(&self) -> bool {
-        matches!(self, Self::Frac(_, _, _))
+        matches!(self, Self::Frac(_, _, _) | Self::DenseRank(_, _, _))
     }

     fn aggregate_total(
@@ -161,12 +162,46 @@ impl ConcreteWindowAggregation {
                     sum.add(value.try_as_number().map_err(|err| err.specify("frac"))?);
                 }
             }
+            Self::DenseRank(expr, numbers, _) => {
+                let value = eval_expression(expr, Some(index), record, headers_index)?;
+                let number = value
+                    .try_as_number()
+                    .map_err(|err| err.specify("dense_rank"))?;
+
+                numbers.push((number, numbers.len()));
+            }
             _ => (),
         };

         Ok(())
     }

+    fn finalize_total(&mut self) {
+        if let Self::DenseRank(_, numbers, ranks) = self {
+            numbers.sort();
+            ranks.resize(numbers.len(), 0);
+
+            let mut rank: usize = 0;
+            let mut last_number: Option<DynamicNumber> = None;
+
+            for (n, i) in numbers.iter() {
+                match last_number {
+                    None => {
+                        last_number = Some(*n);
+                        rank += 1;
+                    }
+                    Some(l) if l != *n => {
+                        last_number = Some(*n);
+                        rank += 1;
+                    }
+                    _ => {}
+                };
+
+                ranks[*i] = rank;
+            }
+        }
+    }
+
     fn run(
         &mut self,
         index: usize,
@@ -277,6 +312,7 @@ impl ConcreteWindowAggregation {
                     Some(d) => DynamicValue::from(frac.map(|f| format!("{:.p$}", f, p = d))),
                 })
             }
+            Self::DenseRank(_, _, ranks) => Ok(DynamicValue::from(ranks.pop_front().unwrap())),
         }
     }

@@ -304,6 +340,10 @@ impl ConcreteWindowAggregation {
             Self::Frac(_, sum, _) => {
                 sum.clear();
             }
+            Self::DenseRank(_, numbers, ranks) => {
+                numbers.clear();
+                ranks.clear();
+            }
         };
     }
 }
@@ -313,7 +353,7 @@ fn get_function(name: &str) -> Option<FunctionArguments> {
         "row_number" | "row_index" => FunctionArguments::nullary(),
         "frac" => FunctionArguments::with_range(1..=2),
         "lag" | "lead" => FunctionArguments::with_range(1..=3),
-        "cumsum" | "cummin" | "cummax" => FunctionArguments::unary(),
+        "cumsum" | "cummin" | "cummax" | "dense_rank" => FunctionArguments::unary(),
         "rolling_sum" | "rolling_mean" | "rolling_avg" | "rolling_var" | "rolling_stddev" => {
             FunctionArguments::binary()
         }
@@ -461,6 +501,14 @@ fn concretize_window_aggregations(
                     ConcreteWindowAggregation::Frac(expr, Sum::new(), decimals),
                 ));
             }
+            "dense_rank" => {
+                let expr = concretize_expression(agg.args.pop().unwrap(), headers, None)?;
+
+                concrete_aggs.push((
+                    agg.agg_name,
+                    ConcreteWindowAggregation::DenseRank(expr, Vec::new(), VecDeque::new()),
+                ));
+            }
             _ => unreachable!(),
         };
     }
@@ -607,8 +655,6 @@ impl WindowAggregationProgram {
         self.run_with_record_impl(index, record, false)
     }

-    // TODO: beware alignement issues, frac should work with holes, for instance
-    // TODO: avoid double expression evaluation (beware of unalignement when a value cannot hold)
     pub fn flush<F, E>(&mut self, mut from_index: usize, mut callback: F) -> Result<(), E>
     where
         F: FnMut(ByteRecord) -> Result<(), E>,
@@ -621,6 +667,10 @@ impl WindowAggregationProgram {
                 }
             }

+            for (_, agg) in self.aggs.iter_mut() {
+                agg.finalize_total();
+            }
+
             for (index, record) in total_buffer.iter() {
                 if let Some(output_record) = self.run_with_record(*index, record)? {
                     callback(output_record)?;
@@ -650,7 +700,7 @@ impl WindowAggregationProgram {
         F: FnMut(ByteRecord) -> Result<(), E>,
         E: From<SpecifiedEvaluationError>,
     {
-        let records = self.flush(from_index, callback)?;
+        self.flush(from_index, callback)?;

         if self.aggs.iter().any(|(_, agg)| agg.requires_total_buffer()) {
             self.total_buffer = Some(Vec::new());
@@ -668,6 +718,6 @@ impl WindowAggregationProgram {
             agg.clear();
         }

-        Ok(records)
+        Ok(())
     }
 }
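
The ranking logic added in `finalize_total` can be read in isolation as: tag each number with its input position, sort, walk the sorted list bumping the rank whenever the value changes, and write each rank back at the original position. The following is a minimal standalone sketch of that idea, not the code from this commit. The free function `dense_rank` is hypothetical, and it uses plain `f64` with `total_cmp` in place of the crate's `DynamicNumber`.

```rust
/// Hypothetical helper, not part of xan: assigns a 1-based dense rank
/// (no gaps, ties share a rank) to each value, returned in input order.
fn dense_rank(values: &[f64]) -> Vec<usize> {
    // Pair each value with its original position so that ranks can be
    // written back in input order after sorting.
    let mut indexed: Vec<(f64, usize)> = values
        .iter()
        .copied()
        .enumerate()
        .map(|(i, v)| (v, i))
        .collect();

    // Assumption: f64 ordering via `total_cmp`; the commit sorts
    // `(DynamicNumber, usize)` tuples instead.
    indexed.sort_by(|a, b| a.0.total_cmp(&b.0));

    let mut ranks = vec![0; values.len()];
    let mut rank = 0;
    let mut last: Option<f64> = None;

    for (value, i) in indexed {
        // Only bump the rank when the value differs from the previous one.
        // NaN never compares equal, so each NaN would get its own rank here.
        if last != Some(value) {
            rank += 1;
            last = Some(value);
        }
        ranks[i] = rank;
    }

    ranks
}

fn main() {
    // Same input column as the `window_dense_rank` test in this commit.
    let ranks = dense_rank(&[20.0, 10.0, 30.0, 10.0, 20.0, 20.0, 20.0]);
    println!("{:?}", ranks);
}
```

Because the rank of the first row depends on every later row, nothing can be emitted until the whole file or group has been read, which matches the buffering caveat in the docs.
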

src/moonblade/doc/window.json

Lines changed: 12 additions & 0 deletions
@@ -17,6 +17,18 @@
     "returns": "number",
     "help": "Returns the cumulative sum of the numbers yielded by given expression."
   },
+  {
+    "name": "dense_rank",
+    "arguments": ["<expr>"],
+    "returns": "number",
+    "help": "Returns the dense rank (there will be no gaps, but ties remain possible for a same rank) of numbers yielded by given expression. Beware, as this requires buffering whole file or group."
+  },
+  {
+    "name": "frac",
+    "arguments": ["<expr>", "decimals?"],
+    "returns": "number",
+    "help": "Returns the fraction represented by numbers yielded by given expression over the total sum of them. Beware, as this requires buffering whole file or group."
+  },
   {
     "name": "lag",
     "arguments": ["<expr>", "steps?", "<expr>?"],

tests/test_window.rs

Lines changed: 34 additions & 0 deletions
@@ -419,3 +419,37 @@ fn window_frac() {

     assert_eq!(got, expected);
 }
+
+#[test]
+fn window_dense_rank() {
+    let wrk = Workdir::new("window_dense_rank");
+    wrk.create(
+        "numbers.csv",
+        vec![
+            svec!["n"],
+            svec!["20"],
+            svec!["10"],
+            svec!["30"],
+            svec!["10"],
+            svec!["20"],
+            svec!["20"],
+            svec!["20"],
+        ],
+    );
+    let mut cmd = wrk.command("window");
+    cmd.arg("dense_rank(n) as rank").arg("numbers.csv");
+
+    let got: Vec<Vec<String>> = wrk.read_stdout(&mut cmd);
+    let expected = vec![
+        svec!["n", "rank"],
+        svec!["20", "2"],
+        svec!["10", "1"],
+        svec!["30", "3"],
+        svec!["10", "1"],
+        svec!["20", "2"],
+        svec!["20", "2"],
+        svec!["20", "2"],
+    ];
+
+    assert_eq!(got, expected);
+}
