
Commit 83d215b

melissawm authored and copybara-github committed
PR #263: Data flow edge ops documentation
Imported from GitHub PR #263

Here's a follow-up to #175. This PR fixes some broken links and typos/grammar/styling from the first PR, and also adds a section about data flow edge ops to the Propagation page. I didn't feel it would be much use to repeat what is already in the autogenerated docs, so I moved one example from the reference to the explanation page to keep the reference docs succinct.

Copybara import of the project:

-- d6aa27b by Melissa Weber Mendonça <[email protected]>: Data flow edge ops documentation

-- b7fb8e1 by Melissa Weber Mendonça <[email protected]>: Updates to sharding representation doc

-- 796f7ca by Melissa Weber Mendonça <[email protected]>: Fix broken links

-- a2e6f37 by Melissa Weber Mendonça <[email protected]>: Address review comments

-- 5430d58 by Melissa Weber Mendonça <[email protected]>: Add example back to ops.td

-- 6725dbc by Melissa Weber Mendonça <[email protected]>: Add suggestions from review

Merging this change closes #263

COPYBARA_INTEGRATE_REVIEW=#263 from melissawm:data-flow-docs 6725dbc
PiperOrigin-RevId: 707180112
1 parent 1651459 commit 83d215b

File tree

4 files changed (+99, -36 lines)


docs/getting_started_jax.ipynb

Lines changed: 1 addition & 1 deletion
@@ -609,7 +609,7 @@
      "\n",
      "#### What are split Axes in Shardy, aka \"x\":(2)2?\n",
      "\n",
-     "Refer to \"Axis splitting and sub-axes\" in [Axis splitting and sub-axes](https://github.com/openxla/shardy/tree/main/docs/sharding_representation.md#axis-splitting-and-sub-axes)\n"
+     "Refer to [Axis splitting and sub-axes](sharding_representation.md#axis_splitting_and_sub-axes).\n"
     ]
    },
    {

docs/propagation.md

Lines changed: 80 additions & 16 deletions
@@ -38,7 +38,7 @@ We compose multiple conflict resolution strategies in a hierarchy:
     and ignore all others. We also make sure that propagation won't override
     user defined shardings with lower priority (`>i`), even if they are ignored
     during previous iterations.
- 2. **Operation based priorities**. We propagate shardings, based on the
+ 2. **Operation based priorities**. We propagate shardings based on the
     operation type. The "pass-through" operations (e.g., element-wise operations
     and reshape) have the highest priority, while operations with shape
     transformation (e.g., dot and reduce) have lower priority.
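As a rough illustration of this tier of the hierarchy, a priority function along these lines could order ops for propagation (the op names and the fallback tier below are illustrative, not Shardy's actual code):

```c++
// Hypothetical sketch of operation based priorities: "pass-through" ops get
// the highest priority (lowest number), shape-transforming ops a lower one.
#include <string>

int OpPriority(const std::string& op_name) {
  // Pass-through ops, e.g. element-wise ops and reshape: highest priority.
  if (op_name == "add" || op_name == "reshape") return 0;
  // Ops with shape transformation, e.g. dot and reduce: lower priority.
  if (op_name == "dot" || op_name == "reduce") return 1;
  // Everything else (illustrative fallback tier).
  return 2;
}
```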
@@ -61,18 +61,18 @@ user priority, a full op-priority propagation is applied.

  The sharding rule introduces an abstraction of every operation that provides the
  actual propagation algorithm with the information it needs to propagate
- shardings from operands to results or across operands, etc., without having to
- reason about specific operation types and their attributes. This is essentially
+ shardings from operands to results or across operands without having to reason
+ about specific operation types and their attributes. This is essentially
  factoring out the op-specific logic and providing a shared representation (data
  structure) for all ops for the purpose of propagation only. In its simplest
  form, it just provides this function:

- ```c
+ ```c++
  GetOpShardingRule(Operation *) -> OpShardingRuleAttr
  ```

  The rule allows us to write the propagation algorithm only once in a generic way
- that is based on this data structure (OpShardingRule), instead of replicating
+ that is based on this data structure (`OpShardingRule`), instead of replicating
  similar pieces of code across many ops, vastly reducing the possibility for bugs
  or inconsistent behavior across ops.
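To make the shape of this abstraction concrete, here is a deliberately simplified sketch (the structs below are hypothetical and much smaller than the real `OpShardingRuleAttr`): the generic algorithm only ever sees factor sizes, dimension-to-factor mappings, and per-factor shardings, never the op itself.

```c++
// Hypothetical, stripped-down picture of what an op sharding rule carries.
// The real OpShardingRuleAttr is an MLIR attribute; this is only a sketch.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// For each tensor (operand or result), every dimension lists the factors that
// compose it, ordered from major to minor.
struct TensorFactorMapping {
  std::vector<std::vector<int>> dim_to_factors;
};

struct OpShardingRuleSketch {
  std::vector<int64_t> factor_sizes;        // e.g. {2, 4, 32} for (i, j, k)
  std::vector<TensorFactorMapping> operands;
  std::vector<TensorFactorMapping> results;
};

// FactorSharding: mesh axes assigned to each factor index.
using FactorSharding = std::map<int, std::vector<std::string>>;

// The generic step: copy axes from a factor that is sharded on one side to the
// same factor on the other side, if that factor isn't already constrained.
void PropagateAlongFactors(const FactorSharding& from, FactorSharding& to) {
  for (const auto& [factor, axes] : from) {
    if (!axes.empty() && to.find(factor) == to.end()) to[factor] = axes;
  }
}
```

Supporting a new op then amounts to writing its `GetOpShardingRule`, not new propagation logic.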
@@ -101,11 +101,11 @@ factor. However, it is not enough for reshapes.

  The following reshape merges two dimensions into one:

- ```
+ ```mlir
  %out = mhlo.reshape(%in) : (tensor<2x4x32xf32>) -> tensor<8x32xf32>
  ```

- Here both dimensions 0 and 1 of the input correspond to dimension 0 of the
+ Here, both dimensions 0 and 1 of the input correspond to dimension 0 of the
  output. Say we start by giving factors to the input:

  ```
@@ -121,18 +121,25 @@ need a single dimension to reference multiple factors:

  The same can be done if the reshape were to split a dimension:

+ ```mlir
+ %out = mhlo.reshape(%in) : (tensor<8x32xf32>) -> tensor<2x4x32xf32>
+ ```
+
+ Here,
+
  ```
- %out = mhlo.reshape(%in) : (tensor<8x32xf32>) -> tensor<2x4x32xf32> ((ij), k) -> (i,j,k) : i=2, j=4, k=32
+ ((ij), k) -> (i,j,k) : i=2, j=4, k=32
  ```

  The dimension of size 8 here is essentially composed of the factors 2 and 4,
- which is why we are calling the factors (i,j,k) factors.
+ which is why we are calling the factors `(i,j,k)` factors.

  These factors can also work with cases where there is no full dimension that
  corresponds to one of the factors:

- ```
- %out = mhlo.reshape(%in) : (tensor<8x4xf32>) -> tensor<2x16xf32> ((ij), k) -> (i,(jk)) : i=2, j=4, k=4
+ ```mlir
+ %out = mhlo.reshape(%in) : (tensor<8x4xf32>) -> tensor<2x16xf32>
+ // ((ij), k) -> (i,(jk)) : i=2, j=4, k=4
  ```

  This example also emphasizes why we need to store the factor sizes - since we
@@ -146,16 +153,16 @@ In Shardy, we have the hierarchy of tensors, dimensions, and factors. They
  represent data at different levels. A factor is a sub-dimension. It is an
  internal hierarchy used in sharding propagation. Each dimension may correspond
  to one or more factors. The mapping between dimension and factor is defined by
- OpShardingRule.
+ `OpShardingRule`.

  ![Schema showing the Shardy propagation algorithm.](images/propagation_algorithm.png)

  **Shardy propagates sharding axes along factors instead of dimensions**. To do
- that, we have three steps as shown in the figure below
+ that, we have three steps as shown in the figure below:

- 1. Project DimSharding to FactorSharding
- 2. Propagate sharding axes in the space of FactorSharding
- 3. Project the updated FactorSharding to get the updated DimSharding
+ 1. Project `DimSharding` to `FactorSharding`
+ 2. Propagate sharding axes in the space of `FactorSharding`
+ 3. Project the updated `FactorSharding` to get the updated `DimSharding`

  ![Schema showing sharding propagation across FactorSharding and DimSharding.](images/projected_sharding.png)
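Walking the doc's own reshape example through these three steps helps fix the idea. The following is a hand-worked, happy-path illustration only (real propagation also has to resolve conflicts): for `(tensor<2x4x32xf32>) -> tensor<8x32xf32>` with factors `i=2, j=4, k=32` and an input sharded `[{"x"}, {"y"}, {}]`:

```c++
// Hand-worked illustration (not Shardy code) of the three projection steps.
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
  // Step 1: project DimSharding to FactorSharding. Input dims map one-to-one
  // to factors (i), (j), (k), so the factor shardings are simply:
  std::map<char, std::vector<std::string>> factor_sharding = {
      {'i', {"x"}}, {'j', {"y"}}, {'k', {}}};

  // Step 2: propagate in factor space. The result uses the same factors, so
  // the factor shardings carry over unchanged (no conflicts in this example).

  // Step 3: project FactorSharding back to the result's DimSharding. Result
  // dim 0 is composed of factors (i, j) major to minor, so it collects their
  // axes in that order; result dim 1 is just factor (k).
  std::vector<std::vector<std::string>> result_dims(2);
  for (char f : {'i', 'j'})
    for (const std::string& axis : factor_sharding[f])
      result_dims[0].push_back(axis);
  result_dims[1] = factor_sharding['k'];

  // Prints "dim 0: x y" and an empty dim 1, i.e. result sharding [{"x","y"}, {}].
  std::cout << "dim 0:";
  for (const auto& a : result_dims[0]) std::cout << " " << a;
  std::cout << "\ndim 1:";
  for (const auto& a : result_dims[1]) std::cout << " " << a;
  std::cout << "\n";
  return 0;
}
```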

@@ -210,3 +217,60 @@ along F0, propagate `["c"]` along F1, and propagate nothing along F2.
  T0 | "a", **"b"** | **"c"** | "f" |
  T1 | "a", "b" | "c", "d" | "g" |
  T2 | **"a", "b"** | "c", "e" | |
+
+ ### Data flow ops
+
+ The above propagation step description applies to most ops. However, there are
+ cases where a sharding rule is not appropriate. For those cases, Shardy defines
+ *data flow* ops.
+
+ A data flow edge of some op X defines a bridge between a set of *sources* and a
+ set of *targets*, such that all sources and targets should be sharded in the
+ same way. Examples of such ops are `stablehlo::OptimizationBarrierOp`,
+ `stablehlo::WhileOp`, `stablehlo::CaseOp` and also
+ [`sdy::ManualComputationOp`](./sdy_dialect#sdymanual_computation_sdymanualcomputationop).
+ Ultimately, any op that implements
+ [ShardableDataFlowOpInterface](sdy_op_interfaces#shardabledataflowopinterface_shardabledataflowopinterface)
+ is considered a data flow op.
+
+ An op can have multiple data flow edges that are orthogonal to one another. For
+ example:
+
+ ```mlir
+ y_0, ..., y_n = while (x_0, ..., x_n)
+     ((pred_arg_0,... , pred_arg_n) { ... })
+     ((body_arg_0,..., body_arg_n) {
+       ...
+       return return_value_0, ..., return_value_n
+     })
+ ```
+
+ This while op has `n` data flow edges: the i-th data flow edges is between
+ sources `x_i`, `return_value_i` and targets `y_i`, `pred_arg_i`, `body_arg_i`.
+
+ Shardy will propagate shardings between all sources and targets of a data flow
+ edge as if it was a regular op with the sources as operands and targets as
+ results, and an identity `sdy.op_sharding_rule`. That means that forward
+ propagation is from sources to targets and backwards propagation is from targets
+ to sources.
+
+ Several methods must be implemented by the user describing how to get the
+ sources and targets of each data flow edge through their *owner*, and also how
+ to get and set the shardings of edge *owners*. An owner is a user-specified
+ target of the data flow edge used by Shardy's propagation. The user can choose
+ it arbitrarily but it needs to be static.
+
+ For example, given the `custom_op` defined below:
+
+ ```c
+ y_1, ..., y_n = custom_op (x_1, ..., x_n)
+     ((body_arg_1,..., body_arg_n) {
+       ...
+       return return_value_1, ..., return_value_n
+     })
+ ```
+
+ This custom_op has two types for data flow edges: `n` edges each between
+ `return_value_i` (sources) and `y_i` (targets) and `n` edges between `x_i`
+ (sources) and `body_arg_i` (targets). In this case, the edge owners are the same
+ as the targets.
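As a rough, hypothetical sketch of the per-edge information this implies for the `custom_op` above (the real `ShardableDataFlowOpInterface` methods and types differ; everything below is illustrative stand-in code):

```c++
// Illustrative only: a stand-in for the per-edge data a data flow op exposes.
#include <string>
#include <vector>

struct ValueRef { std::string name; };        // stand-in for an MLIR Value
struct TensorSharding { std::string attr; };  // stand-in for an sdy.sharding

struct DataFlowEdge {
  ValueRef owner;                  // static, user-chosen target of the edge
  std::vector<ValueRef> sources;   // e.g. {return_value_i} or {x_i}
  std::vector<ValueRef> targets;   // e.g. {y_i} or {body_arg_i}
  TensorSharding owner_sharding;   // what propagation reads and updates
};

// Edge family 1 for custom_op: return_value_i -> y_i, owner == target.
DataFlowEdge ResultEdge(int i) {
  std::string idx = std::to_string(i);
  return {{"y_" + idx}, {{"return_value_" + idx}}, {{"y_" + idx}}, {}};
}

// Edge family 2 for custom_op: x_i -> body_arg_i, owner == target.
DataFlowEdge BlockArgEdge(int i) {
  std::string idx = std::to_string(i);
  return {{"body_arg_" + idx}, {{"x_" + idx}}, {{"body_arg_" + idx}}, {}};
}
```

Propagation then treats each edge as if it were an op with the edge's sources as operands, its targets as results, and an identity `sdy.op_sharding_rule`, reading and writing the sharding through the owner.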

docs/sharding_representation.md

Lines changed: 17 additions & 19 deletions
@@ -20,7 +20,7 @@ names and sizes.

  The proposed sharding representation is bound to a specific logical mesh by its
  name, and can only reference axis names from that mesh. The sharding of a tensor
- specifies along which axes (of a specific logical mesh), each dimension of the
+ specifies along which axes (of a specific logical mesh) each dimension of the
  tensor is sharded, ordered from major to minor. The tensor is replicated along
  all other axes of the mesh.
@@ -47,7 +47,7 @@ We can then shard the following rank 2 tensor `[[a, b], [c, d]]` as follows:
     that are not used to shard a dimension are implicitly replicated, but the
     sharding can specify axes that are explicitly replicated and therefore
     cannot be used to shard a dimension later on.
- * [**Axis splitting and sub-axes**](#axis-splitting-and-sub-axes) - a (full)
+ * [**Axis splitting and sub-axes**](#axis_splitting_and_sub-axes) - a (full)
     mesh axis can be split into multiple sub-axes that can be individually used
     to shard a dimension or be explicitly replicated.
  * [**Multiple logical meshes**](#multiple-logical-meshes) - different
@@ -68,7 +68,7 @@ We expand the basic structure and each key component in this section.
  ### Basic structure

  The dimension shardings tell us for each dimension of the tensor, along which
- axes (or [sub-axes](#axis-splitting-and-sub-axes)) it is sharded from major to
+ axes (or [sub-axes](#axis_splitting_and_sub-axes)) it is sharded from major to
  minor. All other axes that don't shard a dimension are implicitly replicated (or
  [explicitly replicated](#explicitly-replicated-axes)).
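As a small, hand-worked illustration of what a dimension sharding means for the per-device (local) shape — an assumption-laden sketch using a made-up mesh `["x"=2, "y"=2]`, not anything from the diff — each dimension's local size is its global size divided by the product of the sizes of the axes that shard it:

```c++
// Illustrative only: local (per-device) shape of tensor<4x8xf32> sharded as
// [{"x"}, {"y"}] on a hypothetical mesh ["x"=2, "y"=2] -> tensor<2x4xf32>.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
  std::map<std::string, int64_t> mesh = {{"x", 2}, {"y", 2}};
  std::vector<int64_t> global_shape = {4, 8};
  // Axes sharding each dimension, ordered major to minor.
  std::vector<std::vector<std::string>> dim_shardings = {{"x"}, {"y"}};

  std::vector<int64_t> local_shape;
  for (size_t d = 0; d < global_shape.size(); ++d) {
    int64_t devices = 1;
    for (const std::string& axis : dim_shardings[d]) devices *= mesh[axis];
    local_shape.push_back(global_shape[d] / devices);
  }
  for (int64_t s : local_shape) std::cout << s << " ";  // prints: 2 4
  std::cout << "\n";
  return 0;
}
```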

@@ -100,9 +100,7 @@ Each dimension of a tensor can either be open or closed.
  An open dimension is open for propagation to further shard it along additional
  axes, i.e. the specified dimension sharding doesn't have to be the final
  sharding of that dimension. This is similar (but not exactly the same as) to
-
- * [`jax.sharding.PartitionSpec.UNCONSTRAINED`](https://jax.readthedocs.io/en/latest/jax.sharding.html#jax.sharding.PartitionSpec)
- * GSPMD's `unspecified_dims`
+ GSPMD's `unspecified_dims`.

  If a dimension is open we add a `?` following the axes that the dimension is
  already sharded on (see example below).
@@ -161,7 +159,7 @@ We can extend our example from above to have an explicitly replicated axis.
  // Since "y" is explicitly replicated, it can't be used to shard the 2nd
  // dimension that is open. However, "z" is implicitly replicated so it can be
  // used to shard that dimension. The local shape of this tensor (i.e. the
- // shape on a single device), would // be tensor<2x8xf32>.
+ // shape on a single device), would be tensor<2x8xf32>.
  sharding<@mesh_xyz, [{"x"}, {?}], replicated={"y"}> : tensor<4x8xf32>
  ```
@@ -213,13 +211,14 @@ We have a few options for dealing with such cases:
  * Disallow, and all-gather sub-axes that shard the input/output.

  Currently we allow sub-axes on the inputs/outputs in the propagation pipeline.
- Let us know if you want a way to disable this.
+ [Let us know](https://github.com/openxla/shardy/issues) if you want a way to
+ disable this.

  #### Representation

  In the same way that we can reference specific full axes from the mesh by their
  name, we can reference specific sub-axes by their size and the product of all
- sub-axis (of the same axis name) sizes to their left (that are major to them) .
+ sub-axis (of the same axis name) sizes to their left (that are major to them).

  To extract a specific sub-axis of size `k` from a full axis `"x"` of size `n`,
  we effectively reshape the size `n` (in the mesh) into `[m, k, n/(m*k)]` and use
@@ -301,7 +300,7 @@ sharding<@mesh_xyz, [{"x"}, {"y":(2)2}], replicated={"y":(1)2}> : tensor<4x8xf32
  Replicated sub-axis of the same full axis should be ordered in increasing order
  by their pre-size, for example:

- ```c
+ ```c++
  replicated={"y":(4)2, "x", "y":(1)2} ~> replicated={"x", "y":(1)2, "y":(4)2}
  ```
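A small illustrative sketch of the `"x":(m)k` notation and this ordering rule (the `SubAxis` struct and the axis size of 8 are ad hoc assumptions, not Shardy's representation):

```c++
// Illustrative only: a sub-axis of a full axis of size n is the middle piece
// of a conceptual reshape of n into [pre_size, size, n / (pre_size * size)].
// Replicated sub-axes of one full axis are listed by increasing pre-size.
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iostream>
#include <vector>

struct SubAxis {
  int64_t pre_size;  // product of the sizes of sub-axes that are major to it
  int64_t size;
};

int main() {
  const int64_t n = 8;  // assume a full axis "y" of size 8
  std::vector<SubAxis> replicated = {{4, 2}, {1, 2}};  // "y":(4)2, "y":(1)2

  // Each sub-axis must tile the full axis.
  for (const SubAxis& s : replicated) assert(n % (s.pre_size * s.size) == 0);

  // Canonical order: increasing pre-size, i.e. "y":(1)2 before "y":(4)2.
  std::sort(replicated.begin(), replicated.end(),
            [](const SubAxis& a, const SubAxis& b) {
              return a.pre_size < b.pre_size;
            });
  for (const SubAxis& s : replicated)
    std::cout << "\"y\":(" << s.pre_size << ")" << s.size << " ";
  std::cout << "\n";  // prints: "y":(1)2 "y":(4)2
  return 0;
}
```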
@@ -342,11 +341,9 @@ assigned to a different mesh, by naively resharding the tensor to match the
  destination mesh. In GSPMD this is what is usually done to resolve conflicting
  meshes.

- We provide two examples below:
-
  Users can specify multiple meshes with different named axes (e.g. via
- `jax.sharding.NamedSharding`), that have the same order of devices. In this
- example, `<@mesh_0, "b">` is identical to `<@mesh_1, "z">.`
+ `jax.sharding.NamedSharding`), that have the same order of devices. Consider
+ this example, `<@mesh_0, "b">` is identical to `<@mesh_1, "z">`:

  ```c++
  @mesh_0 = {<["a"=4, "b"=2]>, device_ids=[0, 1, 2, 3, 4, 5, 6, 7]}
@@ -358,8 +355,8 @@ moment (different being different axis names/sizes and `device_ids`).

  ### Priorities

- Priority is a way to prioritize certain partitioning+propagation decisions over
- others, and allows for incremental partitioning of a program.
+ Priority is a way to prioritize certain partitioning and propagation decisions
+ over others, and allows for incremental partitioning of a program.

  Priorities are values attached to some or all dimensions of a sharding
  representation (replicated axes don't have priorities).
@@ -374,9 +371,10 @@ For example:
  ```

  Priorities give users more fine grained control over propagation, e.g., batch
- parallelism first, then megatron, and finally ZeRO sharding. This allows for
- strong guarantees about what's partitioned and allows for better debuggability
- by having more fine grained sharding strategies (can see how the program looks
+ parallelism first, then [megatron](arxiv.org/abs/1909.08053), and finally
+ [ZeRO](https://arxiv.org/abs/1910.02054) sharding. This allows for strong
+ guarantees about what's partitioned and allows for better debuggability by
+ having more fine grained sharding strategies (can see how the program looks
  after just megatron in isolation).

  We allow attaching a priority to each dimension sharding (0 by default), which

shardy/dialect/sdy/ir/ops.td

Lines changed: 1 addition & 0 deletions
@@ -268,6 +268,7 @@ def Sdy_DataFlowEdgeOp : Sdy_Op<"data_flow_edge",

  For example:

+
  ```mlir
  y_0, ..., y_n = while (x_0, ..., x_n)
      ((pred_arg_0,... , pred_arg_n) { ... })

0 commit comments
