More comment to aggregation fuzzer (apache#15048)

2010YOUY01 · web-flow · commit 34efd1fbae39 · 2025-03-06T11:24:22.000-05:00
diff --git a/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/data_generator.rs b/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/data_generator.rs
@@ -100,7 +100,28 @@ impl DatasetGeneratorConfig {
 
 /// Dataset generator
 ///
-/// It will generate one random [`Dataset`] when `generate` function is called.
+/// It will generate random [`Dataset`]s when the `generate` function is called. For each
+/// sort key in `sort_keys_set`, an additional sorted dataset will be generated, and the
+/// dataset will be chunked into staggered batches.
+///
+/// # Example
+/// For `DatasetGenerator` with `sort_keys_set = [["a"], ["b"]]`, it will generate 2
+/// sorted datasets. The first one will be sorted by column `a` and randomly chunked
+/// into staggered batches. It might look like the following:
+/// ```text
+/// a b
+/// ----
+/// 1 2 <-- batch 1
+/// 1 1
+///
+/// 2 1 <-- batch 2
+///
+/// 3 3 <-- batch 3
+/// 4 3
+/// 4 1
+/// ```
+///
+/// # Implementation details
 ///
 /// The generation logic in `generate`:
 ///
diff --git a/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/mod.rs b/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/mod.rs
@@ -15,6 +15,26 @@
 // specific language governing permissions and limitations
 // under the License.
 
+//! Fuzzer for aggregation functions
+//!
+//! The main idea behind aggregate fuzzing is: for aggregation, DataFusion has many
+//! specialized implementations for performance. For example, when the group cardinality
+//! is high, DataFusion will skip the first stage of two-stage hash aggregation; when
+//! the input is ordered by the group key, there is a separate implementation to perform
+//! streaming group by.
+//! This fuzzer checks the results of different specialized implementations and
+//! ensures their results are consistent. The execution path can be controlled by
+//! changing the input ordering or by setting related configuration parameters in
+//! `SessionContext`.
+//!
+//! # Architecture
+//! - `aggregate_fuzz.rs` includes the entry point for fuzzer runs.
+//! - `QueryBuilder` is used to generate candidate queries.
+//! - `DatasetGenerator` is used to generate random datasets.
+//! - `SessionContextGenerator` is used to generate `SessionContext` with
+//!   different configuration parameters to control the execution path of aggregate
+//!   queries.
+
 use arrow::array::RecordBatch;
 use arrow::util::pretty::pretty_format_batches;
 use datafusion::prelude::SessionContext;
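
Editorial sketch (not part of the commit): the idea described in the doc comments above can be illustrated with a small, standard-library-only Rust program. It sorts rows by a key, cuts them into randomly sized "staggered" batches, and then checks that a hash-style aggregation over the unsorted input agrees with a streaming aggregation over the sorted batches. All names here (`Lcg`, `sort_and_stagger`, `sum_by_key_hash`, `sum_by_key_streaming`) are hypothetical and do not correspond to DataFusion's actual API or implementation.

```rust
use std::collections::BTreeMap;

/// Tiny linear congruential generator so the sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    /// Returns a pseudo-random value in `lo..=hi` (requires `lo <= hi`).
    fn next_in(&mut self, lo: usize, hi: usize) -> usize {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        lo + ((self.0 >> 33) as usize) % (hi - lo + 1)
    }
}

/// Sort rows by the key column, then cut them into batches of random size.
fn sort_and_stagger(
    mut rows: Vec<(i32, i64)>,
    max_batch_rows: usize,
    seed: u64,
) -> Vec<Vec<(i32, i64)>> {
    assert!(max_batch_rows > 0);
    rows.sort_by_key(|row| row.0);
    let mut rng = Lcg(seed);
    let mut batches = Vec::new();
    let mut remaining = rows.as_slice();
    while !remaining.is_empty() {
        let upper = max_batch_rows.min(remaining.len());
        let len = rng.next_in(1, upper);
        let (head, tail) = remaining.split_at(len);
        batches.push(head.to_vec());
        remaining = tail;
    }
    batches
}

/// Hash-style aggregation: correct for any input order.
fn sum_by_key_hash(rows: &[(i32, i64)]) -> BTreeMap<i32, i64> {
    let mut sums = BTreeMap::new();
    for &(key, value) in rows {
        *sums.entry(key).or_insert(0) += value;
    }
    sums
}

/// Streaming aggregation: assumes rows are ordered by key across all batches,
/// so a group is complete as soon as the key changes.
fn sum_by_key_streaming(batches: &[Vec<(i32, i64)>]) -> BTreeMap<i32, i64> {
    let mut out = BTreeMap::new();
    let mut current: Option<(i32, i64)> = None;
    for &(key, value) in batches.iter().flatten() {
        current = match current {
            Some((cur_key, sum)) if cur_key == key => Some((cur_key, sum + value)),
            Some((cur_key, sum)) => {
                out.insert(cur_key, sum);
                Some((key, value))
            }
            None => Some((key, value)),
        };
    }
    if let Some((cur_key, sum)) = current {
        out.insert(cur_key, sum);
    }
    out
}

fn main() {
    // Same rows as the doc-comment example, in shuffled order.
    let rows = vec![(4, 3), (1, 2), (3, 3), (2, 1), (4, 1), (1, 1)];
    let batches = sort_and_stagger(rows.clone(), 3, 42);
    for (i, batch) in batches.iter().enumerate() {
        println!("batch {}: {:?}", i + 1, batch);
    }
    // The two execution paths must agree, whatever the batch boundaries are.
    assert_eq!(sum_by_key_hash(&rows), sum_by_key_streaming(&batches));
}
```

The batch boundaries depend on the seed, which is the point: a group such as `a = 1` may span two batches, and the streaming path still has to produce the same sums as the hash path.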