2 files changed, +42 -1 lines changed
datafusion/core/tests/fuzz_cases/aggregation_fuzzer
@@ -100,7 +100,28 @@ impl DatasetGeneratorConfig {
 /// Dataset generator
 ///
-/// It will generate one random [`Dataset`] when `generate` function is called.
+/// It will generate random [`Dataset`]s when the `generate` function is called. For each
+/// sort key in `sort_keys_set`, an additional sorted dataset will be generated, and the
+/// dataset will be chunked into staggered batches.
+///
+/// # Example
+/// For `DatasetGenerator` with `sort_keys_set = [["a"], ["b"]]`, it will generate 2
+/// datasets. The first one will be sorted by column `a` and get randomly chunked
+/// into staggered batches. It might look like the following:
+/// ```text
+/// a b
+/// ----
+/// 1 2 <-- batch 1
+/// 1 1
+///
+/// 2 1 <-- batch 2
+///
+/// 3 3 <-- batch 3
+/// 4 3
+/// 4 1
+/// ```
+///
+/// # Implementation details:
 ///
 /// The generation logic in `generate`:
 ///
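
The "sorted, then randomly chunked into staggered batches" behavior described in the doc comment can be illustrated with a minimal standalone sketch. This is not DataFusion's implementation: `Row`, `Lcg`, `sorted_staggered_batches`, `max_batch_size`, and `seed` are hypothetical names introduced here, and a tiny linear congruential generator stands in for a real random number generator so the sketch needs no external crates.

```rust
/// A row with two integer columns, mirroring the `a`/`b` example in the doc comment.
/// (Hypothetical type, used only for this sketch.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Row {
    a: i64,
    b: i64,
}

/// Small deterministic pseudo-random generator (an assumption of this sketch,
/// standing in for whatever RNG a real fuzzer would use).
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Sorts `rows` by column `a`, then splits them into contiguous batches whose
/// sizes are drawn pseudo-randomly from `1..=max_batch_size`.
fn sorted_staggered_batches(
    mut rows: Vec<Row>,
    max_batch_size: usize,
    seed: u64,
) -> Vec<Vec<Row>> {
    assert!(max_batch_size > 0, "max_batch_size must be positive");
    // Stable sort: rows with equal `a` keep their relative order.
    rows.sort_by_key(|r| r.a);

    let mut rng = Lcg(seed);
    let mut batches = Vec::new();
    let mut start = 0;
    while start < rows.len() {
        let size = 1 + (rng.next_u64() as usize) % max_batch_size;
        let end = (start + size).min(rows.len());
        batches.push(rows[start..end].to_vec());
        start = end;
    }
    batches
}
```

Concatenating the returned batches yields the full dataset in sorted order, so a group may be split across a batch boundary, which is the situation the staggered batches are presumably meant to exercise.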

 // specific language governing permissions and limitations
 // under the License.
 
+//! Fuzzer for aggregation functions
+//!
+//! The main idea behind aggregate fuzzing is: for aggregation, DataFusion has many
+//! specialized implementations for performance. For example, when the group cardinality
+//! is high, DataFusion will skip the first stage of two-stage hash aggregation; when
+//! the input is ordered by the group key, there is a separate implementation to perform
+//! streaming group by.
+//! This fuzzer checks the results of different specialized implementations and
+//! ensures their results are consistent. The execution path can be controlled by
+//! changing the input ordering or by setting related configuration parameters in
+//! `SessionContext`.
+//!
+//! # Architecture
+//! - `aggregate_fuzz.rs` includes the entry point for fuzzer runs.
+//! - `QueryBuilder` is used to generate candidate queries.
+//! - `DatasetGenerator` is used to generate random datasets.
+//! - `SessionContextGenerator` is used to generate `SessionContext` with
+//!   different configuration parameters to control the execution path of aggregate
+//!   queries.
+
 use arrow::array::RecordBatch;
 use arrow::util::pretty::pretty_format_batches;
 use datafusion::prelude::SessionContext;
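
The core idea in the module docs, running the same aggregation through different specialized code paths and checking that the results agree, can be sketched without DataFusion at all. The example below is only an illustration under stated assumptions: `hash_group_sum`, `streaming_group_sum`, and `results_are_consistent` are hypothetical helpers, and a grouped `SUM` over `(key, value)` pairs stands in for the real aggregate queries and execution plans.

```rust
use std::collections::HashMap;

/// Hash-based grouped sum: works for any input order.
/// Output is sorted by key so it can be compared with other implementations.
fn hash_group_sum(rows: &[(i64, i64)]) -> Vec<(i64, i64)> {
    let mut acc: HashMap<i64, i64> = HashMap::new();
    for &(key, value) in rows {
        *acc.entry(key).or_insert(0) += value;
    }
    let mut out: Vec<(i64, i64)> = acc.into_iter().collect();
    out.sort_unstable();
    out
}

/// Streaming grouped sum: assumes `sorted_rows` is already ordered by key,
/// so a group is complete as soon as the key changes.
fn streaming_group_sum(sorted_rows: &[(i64, i64)]) -> Vec<(i64, i64)> {
    let mut out: Vec<(i64, i64)> = Vec::new();
    for &(key, value) in sorted_rows {
        if let Some(last) = out.last_mut() {
            if last.0 == key {
                last.1 += value;
                continue;
            }
        }
        out.push((key, value));
    }
    out
}

/// Runs both implementations on the same data (sorting a copy for the
/// streaming path) and reports whether their results match.
fn results_are_consistent(rows: &[(i64, i64)]) -> bool {
    let mut sorted = rows.to_vec();
    sorted.sort_by_key(|&(key, _)| key);
    hash_group_sum(rows) == streaming_group_sum(&sorted)
}
```

In the actual fuzzer the alternative paths are selected through input ordering and `SessionContext` configuration rather than by calling separate functions; the sketch only shows the comparison step.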