DAGger is a testing framework for data-intensive computing (DISC) systems such as Apache Spark and Flink. It generates executable dataflow programs by constructing and progressively refining directed acyclic graphs (DAGs) of operators, enabling systematic testing of query optimizers and execution engines beyond SQL-only interfaces.
Unlike traditional SQL fuzzers, DAGger operates directly on logical dataflow graphs, aligning test generation with the internal representations used by modern DISC frameworks.
DAGger generates test programs in three stages:
- DAG Generation: Generate a context-free directed acyclic graph that captures only the shape of a dataflow pipeline, without committing to specific operators or parameters.
- Abstract Dataflow Graph (ADFG) Construction: Assign each node in the DAG a concrete operator (e.g., `read`, `filter`, `join`) using a machine-readable API specification, producing an abstract dataflow graph.
- State-Aware Parameterization: Fill in operator parameters (e.g., column names, predicates, join keys) using a propagated schema state, yielding a concrete, executable program.
This staged design separates structural generation, operator selection, and semantic validation, making DAGger modular and extensible across frameworks.
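
As a rough illustration of the first two stages, the following Python sketch generates a random DAG shape and then assigns each node an operator whose arity matches the node's fan-in. It is not DAGger's implementation (which lives under `src/main/scala`); the function names `generate_dag` and `assign_operators` and the `OPERATOR_SPEC` table are hypothetical stand-ins for the real generator and the machine-readable API specification.

```python
import random

def generate_dag(num_nodes, max_fan_in=2, seed=0):
    """Stage 1 (sketch): return {node: [parent nodes]}, shape only."""
    rng = random.Random(seed)
    parents = {0: []}  # node 0 is always a source
    for node in range(1, num_nodes):
        # Edges only run from lower to higher ids, so no cycle can form.
        fan_in = rng.randint(0, min(max_fan_in, node))
        parents[node] = sorted(rng.sample(range(node), fan_in))
    return parents

# Hypothetical operator spec keyed by fan-in; a real API spec would carry
# much more detail (parameter types, framework availability, and so on).
OPERATOR_SPEC = {
    0: ["read"],
    1: ["filter", "select"],
    2: ["join", "union"],
}

def assign_operators(parents, spec=OPERATOR_SPEC, seed=0):
    """Stage 2 (sketch): pick an operator matching each node's fan-in."""
    rng = random.Random(seed)
    return {node: rng.choice(spec[len(ps)]) for node, ps in parents.items()}

dag = generate_dag(num_nodes=6, seed=42)
adfg = assign_operators(dag, seed=42)
for node in dag:
    print(node, adfg[node], "<-", dag[node])
```

Keeping the shape and the operator choice in separate functions mirrors the staged design: the same DAG can be re-labelled with a different operator spec without regenerating its structure.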
At a high level, DAGger's pipeline proceeds through the following stages:

    +--------------------+
    |      Stage 1       |
    |   DAG Generation   |
    |  (Structure Only)  |
    +----------+---------+
               |
               v
    +--------------------+
    |      Stage 2       |
    |      Operator      |
    | Assignment (ADFG)  |
    +----------+---------+
               |
               v
    +--------------------+
    |      Stage 3       |
    |    State-Aware     |
    |  Parameterization  |
    +----------+---------+
               |
               v
    +--------------------+
    |  Code Generation   |
    |    & Execution     |
    +--------------------+

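
The state-aware parameterization stage can be sketched as a single pass over the graph in topological order, carrying each node's output schema forward so that later operators only reference columns that actually exist. The Python below is a minimal sketch under assumed names: the `ADFG` and `SOURCES` literals, the `parameterize` function, and the parameter dictionaries are all hypothetical and do not reflect DAGger's internal data structures.

```python
import random

# Hand-written ADFG for illustration: node -> (operator, parent nodes),
# listed in topological order.
ADFG = {
    0: ("read", []),
    1: ("read", []),
    2: ("filter", [0]),
    3: ("join", [2, 1]),
    4: ("select", [3]),
}

# Hypothetical source schemas (column name -> type) for the read nodes.
SOURCES = {
    0: {"id": "int", "price": "float"},
    1: {"id": "int", "region": "str"},
}

def parameterize(adfg, sources, seed=0):
    """Fill in operator parameters using the schema propagated so far."""
    rng = random.Random(seed)
    schemas, params = {}, {}
    for node, (op, parents) in adfg.items():
        if op == "read":
            schemas[node] = dict(sources[node])
            params[node] = {"table": f"t{node}"}
        elif op == "filter":
            schema = schemas[parents[0]]
            # Only numeric columns are valid for a "col > constant" predicate.
            numeric = sorted(c for c, t in schema.items() if t in ("int", "float"))
            params[node] = {"column": rng.choice(numeric), "value": rng.randint(0, 100)}
            schemas[node] = dict(schema)  # a filter does not change the schema
        elif op == "join":
            left, right = (schemas[p] for p in parents)
            # Join keys must exist on both sides with the same type.
            shared = sorted(c for c in left if right.get(c) == left[c])
            params[node] = {"on": rng.choice(shared)}
            schemas[node] = {**left, **right}
        elif op == "select":
            schema = schemas[parents[0]]
            cols = sorted(rng.sample(sorted(schema), rng.randint(1, len(schema))))
            params[node] = {"columns": cols}
            schemas[node] = {c: schema[c] for c in cols}
    return params, schemas

params, schemas = parameterize(ADFG, SOURCES, seed=1)
for node, p in params.items():
    print(node, ADFG[node][0], p)
```

This sketch assumes every operator has at least one valid candidate (for example, a shared join key); a fuller generator would need a fallback, such as choosing a different operator, when the propagated schema offers none.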
DAGger currently supports the following dataflow frameworks and language bindings:
- Apache Spark (Scala)
- Apache Flink (Python)
- Dask (Python)
- Polars (Python)
Support is backend-specific and implemented via separate execution and code generation modules.
Adding support for a new dataflow framework involves two main steps:
- Environment Integration: Framework-specific execution logic (e.g., job submission, runtime setup, result handling) should be added under `src/main/scala/fuzzer/adapters`.
- Code Generation: Lowering an abstract dataflow graph (ADFG) to executable code for a target framework is handled by adapter modules located in `src/main/scala/fuzzer/framework`. Each subdirectory corresponds to a supported framework and language binding.
This separation allows DAGger to remain modular: the core fuzzing logic is framework-agnostic, while backend-specific behavior is isolated in adapters and framework integration layers.
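
To make the separation concrete, the sketch below shows one way a framework-agnostic core could hand a parameterized ADFG to a backend that lowers it to target code. It is written in Python for brevity, whereas DAGger's own adapters are Scala modules; the `Backend` interface, its `emit` and `lower` methods, and the `PolarsBackend` class are hypothetical names chosen for illustration. The emitted strings use Polars calls (`pl.read_csv`, `DataFrame.filter`, `DataFrame.join`, `DataFrame.select`), and reading from CSV files is an assumption of this example.

```python
from abc import ABC, abstractmethod

class Backend(ABC):
    """Hypothetical adapter interface: one subclass per framework/binding."""

    header = ""  # import line(s) the generated program needs

    @abstractmethod
    def emit(self, node, op, parents, params):
        """Return one line of target-framework code for a single node."""

    def lower(self, adfg, params):
        # Nodes are assumed to be listed in topological order.
        body = [self.emit(n, op, ps, params[n]) for n, (op, ps) in adfg.items()]
        return "\n".join([self.header] + body)

class PolarsBackend(Backend):
    header = "import polars as pl"

    def emit(self, node, op, parents, params):
        out = f"df{node}"
        if op == "read":
            return f'{out} = pl.read_csv("{params["table"]}.csv")'
        src = f"df{parents[0]}"
        if op == "filter":
            return f'{out} = {src}.filter(pl.col("{params["column"]}") > {params["value"]})'
        if op == "join":
            return f'{out} = {src}.join(df{parents[1]}, on="{params["on"]}")'
        if op == "select":
            return f"{out} = {src}.select({params['columns']!r})"
        raise ValueError(f"unsupported operator: {op}")

# A tiny parameterized ADFG: node -> (operator, parent nodes).
adfg = {0: ("read", []), 1: ("filter", [0]), 2: ("select", [1])}
params = {
    0: {"table": "orders"},
    1: {"column": "price", "value": 10},
    2: {"columns": ["id", "price"]},
}
program = PolarsBackend().lower(adfg, params)
print(program)
```

Under this arrangement, supporting another framework would mean adding another `Backend` subclass, while the code that builds `adfg` and `params` stays unchanged.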
DAGger has uncovered multiple previously unknown bugs across widely used dataflow frameworks. Several of these issues have been confirmed by framework developers, demonstrating that DAGger’s generated workloads exercise optimizer and execution paths that are difficult to reach with existing testing tools.
| Framework | Issue ID | Status |
|---|---|---|
| Flink | FLINK-38397 | Confirmed |
| Polars | #25971 | Confirmed |
| Dask | #12257 | Confirmed |
| Polars | #26322 | Confirmed |
| Spark | SPARK-54196 | Confirmed by system, Pending response |
| Spark | SPARK-51798 | Pending response |
| Flink | FLINK-38366 | Pending response |
| Flink | FLINK-38446 | Pending response |
| Flink | FLINK-38637 | Pending response |