chore: bump version to 0.1.1 and fix publish workflow
- Bump version from 0.1.0.post1 to 0.1.1
- Remove direct path references (benchkit @ {root:uri}) that PyPI rejects
- Add verbose output to publish steps for better error diagnostics
- Add skip-existing for TestPyPI to handle re-runs gracefully
- Add TestPyPI trusted publisher setup instructions
README.md (+21 −11)
@@ -18,7 +18,7 @@
---
-**Datarax** (*Data + Array/JAX*) is a high-performance, extensible data pipeline framework specifically engineered for JAX-based machine learning workflows. It leverages JAX's JIT compilation, automatic differentiation, and hardware acceleration to build efficient, scalable data loading, preprocessing, and augmentation pipelines on CPUs, GPUs, and TPUs.
+**Datarax** (*Data + Array/JAX*) is an extensible data pipeline framework built for JAX-based machine learning workflows. It leverages JAX's JIT compilation, automatic differentiation, and hardware acceleration to build data loading, preprocessing, and augmentation pipelines that run on CPUs, GPUs, and TPUs.
## Key Features
@@ -33,18 +33,28 @@
## Why Datarax?
-Datarax's differentiable pipeline architecture enables optimization paradigms that are impossible with traditional data loaders. Here are three real-world examples:
+JAX has mature libraries for models (Flax), optimizers (Optax), and checkpointing (Orbax), but lacks a dedicated data pipeline framework that operates at the same level of abstraction. Existing options are either framework-agnostic loaders that return NumPy arrays (losing JIT/autodiff benefits) or wrappers around tf.data/PyTorch that introduce cross-framework overhead. Datarax aims to fill this gap. The framework is under active development with ongoing performance optimization — the architecture is functional, but throughput and API surface are still being refined.
-Traditional augmentation search (AutoAugment) requires 15,000 GPU-hours of RL. With datarax's differentiable operators, [DADA-style gradient-based search](examples/advanced/differentiable/01_dada_learned_augmentation_guide.py) achieves the same accuracy in **~0.1 GPU-hours** — because gradients flow through the augmentation pipeline.
+### JAX-Native from the Ground Up
+Every component — sources, operators, batchers, samplers, sharders — is a Flax NNX module. Pipeline state is managed through NNX's variable system, which means operators can hold learnable parameters, be serialized with Orbax, and participate in JAX transformations (`jit`, `vmap`, `grad`) without special handling.
-Camera ISPs are tuned for human perception, not AI tasks. Datarax's DAG executor lets you [build a differentiable ISP pipeline](examples/advanced/differentiable/02_learned_isp_guide.py) where detection loss backpropagates through every processing stage, automatically optimizing for **what the model actually needs**.
+### Differentiable Data Pipelines
+Because operators are NNX modules, gradients flow through the entire pipeline. This enables approaches that are not possible with standard data loaders:
-### Cross-Domain Extensibility (Audio Synthesis in 3 Operators)
-Datarax isn't just for images. By implementing [3 custom operators for DDSP audio synthesis](examples/advanced/differentiable/03_ddsp_audio_synthesis_guide.py), you get a complete differentiable audio pipeline — with **100x less training data** than neural audio models — proving the framework extends to any domain.
+- [Gradient-based augmentation search](examples/advanced/differentiable/01_dada_learned_augmentation_guide.py) — replacing RL-based methods like AutoAugment with direct optimization
+- [Task-optimized preprocessing](examples/advanced/differentiable/02_learned_isp_guide.py) — backpropagating task loss through every processing stage
+- [Differentiable audio synthesis](examples/advanced/differentiable/03_ddsp_audio_synthesis_guide.py) — extending the same pattern to non-vision domains
+See the [differentiable pipeline examples](docs/examples/advanced/differentiable/) for details.
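The gradient-flow idea behind these examples can be sketched in a few lines of plain JAX. The `augment` transform and its learnable `magnitude` below are hypothetical stand-ins for illustration, not datarax APIs:

```python
import jax
import jax.numpy as jnp

def augment(magnitude, image):
    # Hypothetical learnable augmentation: a brightness shift whose
    # strength is a trainable scalar.
    return jnp.clip(image + magnitude, 0.0, 1.0)

def loss(magnitude, image, target):
    # The task loss backpropagates straight through the augmentation step.
    return jnp.mean((augment(magnitude, image) - target) ** 2)

image = jnp.full((4,), 0.5)
target = jnp.full((4,), 0.7)

# d(loss)/d(magnitude): the augmentation parameter receives a gradient,
# so it can be optimized jointly with the model instead of searched by RL.
grad_mag = jax.grad(loss)(0.0, image, target)
```

Because the whole pipeline is traced by JAX, the same pattern extends to any differentiable transform, which is what the three linked examples exploit.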
+### DAG Execution Model
+Pipelines are directed acyclic graphs, not linear chains. The `>>` operator composes sequential steps, `|` creates parallel branches, and control-flow nodes (`Branch`, `Merge`, `SplitField`) handle conditional and multi-path logic. The DAG executor manages scheduling, caching, and rebatching across the graph.
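A minimal sketch of how operator overloading can express this composition style. The `Step` class and its step names are invented for illustration; they are not datarax's actual node types:

```python
class Step:
    """Toy pipeline step supporting `>>` (sequential) and `|` (parallel)."""

    def __init__(self, fn, name):
        self.fn, self.name = fn, name

    def __call__(self, x):
        return self.fn(x)

    def __rshift__(self, other):
        # Sequential edge: this step's output feeds the next step.
        return Step(lambda x: other(self(x)), f"({self.name} >> {other.name})")

    def __or__(self, other):
        # Parallel branches: both steps consume the same input.
        return Step(lambda x: (self(x), other(x)), f"({self.name} | {other.name})")

normalize = Step(lambda x: x / 255.0, "normalize")
negate = Step(lambda x: -x, "negate")
shift = Step(lambda x: x + 1.0, "shift")

# `>>` binds tighter than `|` in Python, so parenthesize the branch.
pipeline = normalize >> (negate | shift)
print(pipeline(255.0))  # (-1.0, 2.0)
```

A real executor would build an explicit graph of nodes rather than nested closures, which is what enables scheduling, caching, and rebatching across the DAG.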
+### Deterministic Reproducibility
+Shuffling uses Grain's Feistel cipher permutation, which generates a full-epoch permutation in O(1) memory without materializing the index array. Combined with explicit RNG key threading through every stochastic operator, pipelines produce identical output given the same seed — across restarts, devices, and host counts.
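The O(1)-memory idea can be illustrated with a generic Feistel index permutation using cycle-walking. This is a simplified sketch of the technique, not Grain's actual implementation (round count, round function, and bit-width handling are all illustrative choices):

```python
import hashlib

def _round_fn(value, key, round_idx):
    # Pseudorandom round function: hash the half-block together with the
    # key and round index, take 4 bytes as an integer.
    digest = hashlib.sha256(f"{key}:{round_idx}:{value}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def feistel_permute(index, n, key, rounds=4):
    """Map `index` in [0, n) to a unique position in [0, n) via a balanced
    Feistel network with cycle-walking. No index array is materialized,
    yet the mapping is a full permutation of [0, n)."""
    bits = max(2, n.bit_length() + (n.bit_length() % 2))  # even width >= log2(n)
    half = bits // 2
    mask = (1 << half) - 1
    x = index
    while True:
        left, right = x >> half, x & mask
        for r in range(rounds):
            # Standard Feistel round: swap halves, mix one half in.
            left, right = right, left ^ (_round_fn(right, key, r) & mask)
        x = (left << half) | right
        if x < n:  # cycle-walk: re-encrypt until we land inside [0, n)
            return x
```

Because a Feistel network is a bijection for any round function, the mapping is guaranteed to be a permutation, and the same `key` (seed) reproduces the same shuffle on every host.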
+### Built-in Competitive Benchmarking
+The benchmarking engine profiles datarax against 12+ frameworks (Grain, tf.data, PyTorch DataLoader, DALI, Ray Data, and others) across standardized scenarios. Results feed a regression guard that catches performance regressions in CI and a gap analysis that identifies optimization targets relative to the fastest framework per scenario. This benchmark-driven development loop is how datarax tracks its progress toward competitive throughput — current results and optimization status are tracked in the [benchmarking documentation](docs/benchmarks/index.md).
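At its core, a CI regression guard reduces to comparing a fresh measurement against a stored baseline with a tolerance. The function and threshold below are a hypothetical sketch, not the engine's real interface:

```python
def check_throughput(baseline, current, tolerance=0.10):
    """Pass when current throughput (samples/sec) is no more than
    `tolerance` below the stored baseline. Threshold is illustrative."""
    return current >= baseline * (1.0 - tolerance)

print(check_throughput(1200.0, 1150.0))  # ~4% drop, within tolerance: True
print(check_throughput(1200.0, 900.0))   # 25% drop, flagged: False
```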
## Installation
@@ -169,7 +179,7 @@ complex_pipeline = (
## Architecture
-```
+```text
src/datarax/
core/ # Base modules: DataSourceModule, OperatorModule, Element, Batcher, Sampler, Sharder
dag/ # DAG executor and node system (source, operator, batch, cache, control flow)
@@ -193,7 +203,7 @@ src/datarax/
## Benchmarking
-Datarax includes a benchmarking suite for competitive comparison against 12 data loading frameworks across 25 scenarios spanning vision, NLP, tabular, multimodal, I/O, distributed, and pipeline complexity workloads.
+Datarax includes a benchmarking suite for comparison against 12+ data loading frameworks across a range of workload scenarios (vision, NLP, tabular, multimodal, distributed).