Commit c50db3b

Adding ptensor.extract_slice (#389)
* adding ExtractSliceOp
* adding CreateOp
* adding InsertSliceOp
* separating dist-business from PTensorType-attributes using DistTensorType; no more dist in PTensorType
* function boundary handling for Dist
* adding and using EasyValue
* restructuring Dist-Ops to largely accept ValueRanges instead of memrefs
* enabling n-dimensional tensors
1 parent 7e4a623 commit c50db3b

File tree

26 files changed

+2050
-680
lines changed


docs/rfcs/20220804-ptensor/README.md

Lines changed: 11 additions & 13 deletions
@@ -26,9 +26,7 @@ Additionally we propose appropriate passes
 3. Converting __intel-sycl.device_region__ to appropriate runtime calls
 
 ### ptensor Type
-Since operations are expected to execute in the same location as its input tensors, it is necessary to carry the tensor-location from the point of its allocation to the point of the operation. For this, we introduce a type which logically extends the `mlir::tensor` type with two boolean attributes:
-* `device`: indicates if it should live on a device
-* `dist`: indicates if it should be distributed
+Since operations are expected to execute in the same location as their input tensors, it is necessary to carry the tensor location from the point of its allocation to the point of the operation. For this, we introduce a type which logically extends `mlir::MemRefType` with a boolean attribute `device`, indicating whether the tensor should live on a device.
 
 The actual device and distributed team can be assigned by the appropriate operands of creation operations (see below).

@@ -37,7 +35,7 @@ The tensors themselves are assumed to eventually lower to memrefs.
 Notice: By default, device and distribution support is disabled, which renders conventional host operations.
 
 ### __PTensor__ Operations
-The initial set of operations matches the requirements of the core of [array-API](https://data-apis.org/array-api/latest/API_specification/index.html). The operations in the PTensor dialect operate on ptensors. To allow operations on standard tensors and memrefs the PTensor dialect provides the operation `from_ranked` to convert MemRefs and RankedTensors to ptensors with default `device` and `team`.
+The initial set of operations matches the requirements of the core of the [array-API](https://data-apis.org/array-api/latest/API_specification/index.html). The operations in the PTensor dialect operate on ptensors. To allow operations on standard memrefs, the PTensor dialect provides the operation `from_ranked`, which converts MemRefs to ptensors with default `device` and `team`.
 
 Notice: some of the operations mutate existing ptensors.

@@ -50,7 +48,7 @@ It constitutes an error if an operation has multiple (input and output) argument
 Similarly, it constitutes an error if an operation has multiple (input and output) arguments of type ptensor and their `team` attribute is not the same on all ptensor arguments.
 
 #### Broadcasting/Ranked Tensors
-PTensor operates on ranked tensors. In rare cases the shape of input tensor(s) needs to be known as well. Unranked tensors are not supported.
+PTensor operates on MemRefs. In rare cases the shape of the input tensor(s) needs to be known as well. Unranked memrefs are not supported.
 
 PTensor operations follow the [broadcasting semantics of the array-API](https://data-apis.org/array-api/latest/API_specification/broadcasting.html).
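As an aside, the array-API broadcasting rule referenced in this hunk can be sketched in a few lines. This is a plain Python illustration of the specification, not IMEX code; the function name `broadcast_shapes` is my own:

```python
from itertools import zip_longest

def broadcast_shapes(a, b):
    """Broadcast two shapes per the array-API rules: shapes are
    right-aligned; each dimension pair must be equal, or one of
    them must be 1 (the 1 is then stretched to the other size)."""
    result = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == y or x == 1 or y == 1:
            result.append(max(x, y))
        else:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
    return tuple(reversed(result))
```

For example, `broadcast_shapes((8, 1, 6, 1), (7, 1, 5))` yields `(8, 7, 6, 5)`, while `(3,)` against `(4,)` raises an error.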

@@ -76,7 +74,7 @@ The below set of operations accrues from the following rules:
 * `$side = ['lower', 'upper']`
 * `delete(tensor) : (ptensor) -> void`
 * `from_dlpack(obj) : (ptr) -> ptensor.ptensor`
-* `from_ranked(ranked) : (Memref|RankedTensor) -> ptensor.ptensor`
+* `from_ranked(ranked) : (MemRef) -> ptensor.ptensor`
 * Tensor attributes
 * `shape(rhs) : (ptensor.ptensor) -> shape.shape`
 * `rank(rhs) : (ptensor.ptensor) -> int64`
@@ -131,18 +129,18 @@ The below set of operations accrues from the following rules:
 * `test{$rop}(rhs, axis) : (ptensor.ptensor, int) -> ptensor.ptensor`
 * `$rop = ['any', 'all']`
 * Utility functions not part of the array-API
-  * Get the (local) ranked tensor from a ptensor:
-    `extract_rtensor(tensor) : (ptensor.ptensor) -> RankedTensor`
-  * Initialize a ptensor value from a RankedTensor, device, team and handle:
-    `init_ptensor(rtensor, device, team, handle) {onDevice : bool, dist : bool} : (RankedTensor, AnyType, AnyType, AnyType, AnyType -> ptensor.ptensor`
+  * Get the (local) memref from a ptensor:
+    `extract_memref(tensor) : (ptensor.ptensor) -> MemRef`
+  * Initialize a ptensor value from a MemRef, device, team and handle:
+    `init_ptensor(memref, device, team, handle) {onDevice : bool, dist : bool} : (MemRef, AnyType, AnyType, AnyType) -> ptensor.ptensor`
 
 ### __Dist__ Dialect
 The Dist dialect provides operations dealing with tensors which are partitioned and distributed across multiple processes. The operations assume some kind of runtime which handles aspects like communication and partitioning.
 - `register_ptensor(shape) : (tensor<?xi64>) -> (int64)`
 - `unregister_ptensor(dtensor_id) : (i64) -> void`
 - `local_shape(dtensor_id) : (i64) -> (tensor<?xi64>)`
 - `local_offsets(dtensor_id) : (i64) -> (tensor<?xi64>)`
-- `allreduce(team, op, ltensor) : (i64, i64, RankedTensor) -> void`
+- `allreduce(team, op, ltensor) : (i64, i64, MemRef) -> void`
 
 Details will follow in a separate RFC.

@@ -163,9 +161,9 @@ All passes which consume `ptensor`s and -operations comply to compute-follows-da
 
 #### --convert-ptensor-to-linalg
 This pass completely lowers ptensor operations:
-- __Tensor__: `ptensor.ptensor` will be type-converted to a RankedTensor
+- __Tensor__: `ptensor.ptensor` will be type-converted to a MemRef
 - Within the pass each PTensor gets "instantiated" by an `init_ptensor` which also accepts `team`, `handle` and `device`. This allows accessing device and distributed runtime attributes during lowering.
-- function boundaries are currently not handled explicitly. device and dist information will be lost and normal RankedTensors are exchanged.
+- Function boundaries are currently not handled explicitly; device and dist information will be lost and normal MemRefs are exchanged.
 - __Linalg__: The actual functionality will be represented by one or more operations of the Linalg dialect.
 - __intel_sycl__: Appropriate `intel_sycl.device_region` will be put around operations which have inputs of type `ptensor.ptensor` with a non-null `device` attribute.
 - utility dialects like __memref__, __shape__, __affine__, __func__ and __arith__
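The `local_shape`/`local_offsets` queries in the Dist dialect above assume some partitioning policy chosen by the runtime. A minimal Python sketch of one plausible policy (an even block partitioning of the first dimension; the function name `local_partition` and the remainder-handling are my own assumptions, mirroring the `nprocs`/`prank` ops):

```python
def local_partition(nprocs, prank, g_shape):
    """Block-partition the first dimension of a tensor with global
    shape g_shape across nprocs processes; return (l_offsets, l_shape)
    for process prank. The first `rest` ranks absorb the remainder."""
    extent = g_shape[0]
    base, rest = divmod(extent, nprocs)
    # first `rest` ranks get one extra element each
    l_sz = base + (1 if prank < rest else 0)
    l_off = prank * base + min(prank, rest)
    l_offsets = (l_off,) + (0,) * (len(g_shape) - 1)
    l_shape = (l_sz,) + tuple(g_shape[1:])
    return l_offsets, l_shape
```

For instance, splitting a `10x3` tensor across 4 processes gives rank 1 the offsets `(3, 0)` and the local shape `(3, 3)`; only the first dimension is cut, matching the "cut only the first N dimensions" rule of the old ops.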

include/imex/Conversion/Passes.td

Lines changed: 8 additions & 7 deletions
@@ -43,7 +43,7 @@ def ConvertPTensorToLinalg : Pass<"convert-ptensor-to-linalg", "::mlir::ModuleOp
     #### Output IR
 
     - a PTensorType will be lowered to an unrealized_conversion_cast to a tuple of
-      * rtensor: RankedTensor
+      * tensor: RankedTensor
       * device: device where the tensor lives (AnyType, default=None)
       * team: a group of processes among which the tensor is distributed
         (AnyType, default=None)
@@ -59,9 +59,10 @@ def ConvertPTensorToLinalg : Pass<"convert-ptensor-to-linalg", "::mlir::ModuleOp
   let dependentDialects = ["::mlir::linalg::LinalgDialect",
                            "::mlir::AffineDialect",
                            "::mlir::func::FuncDialect",
-                           "::mlir::tensor::TensorDialect",
                            "::mlir::arith::ArithDialect",
-                           "::mlir::shape::ShapeDialect"];
+                           "::mlir::tensor::TensorDialect",
+                           "::mlir::memref::MemRefDialect",
+                           "::mlir::bufferization::BufferizationDialect"];
   let options = [];
 }
 
@@ -78,11 +79,11 @@ def ConvertDistToStandard: Pass<"convert-dist-to-standard", "::mlir::ModuleOp">
     Necessary prototypes of runtime functions will be added.
   }];
   let constructor = "::imex::createConvertDistToStandardPass()";
-  let dependentDialects = ["::mlir::linalg::LinalgDialect",
+  let dependentDialects = ["::imex::ptensor::PTensorDialect",
+                           "::mlir::linalg::LinalgDialect",
                            "::mlir::func::FuncDialect",
-                           "::mlir::tensor::TensorDialect",
-                           "::mlir::arith::ArithDialect",
-                           "::mlir::shape::ShapeDialect"];
+                           "::mlir::memref::MemRefDialect",
+                           "::mlir::arith::ArithDialect"];
   let options = [];
 }

include/imex/Dialect/Dist/IR/DistOps.h

Lines changed: 8 additions & 1 deletion
@@ -18,11 +18,18 @@
 #include <mlir/IR/BuiltinTypes.h>
 #include <mlir/IR/Dialect.h>
 #include <mlir/IR/OpDefinition.h>
+#include <mlir/IR/OpImplementation.h>
 #include <mlir/IR/Types.h>
 #include <mlir/Interfaces/SideEffectInterfaces.h>
 
 namespace imex {
-namespace dist {} // namespace dist
+namespace ptensor {
+class PTensorType;
+} // namespace ptensor
+namespace dist {
+enum INFO : int { GSHAPE, LTENSOR, LOFFSETS, TEAM, INFO_LAST };
+extern ::imex::ptensor::PTensorType getPTensorType(::mlir::Value t);
+} // namespace dist
 } // namespace imex
 
 #include <imex/Dialect/Dist/IR/DistOpsDialect.h.inc>

include/imex/Dialect/Dist/IR/DistOps.td

Lines changed: 195 additions & 31 deletions
@@ -30,12 +30,34 @@ def Dist_Dialect : Dialect {
 
   // A longer description of our dialect.
   let description = [{
-      The dist dialect describes interfaces for interacting with 
+      The dist dialect describes interfaces for interacting with
       a runtime which handles distributed aspects of PTensor operations.
-    }];
+  }];
+
+  let dependentDialects = [
+    "::imex::ptensor::PTensorDialect"
+  ];
 
   // The C++ namespace that the dialect class definition resides in.
   let cppNamespace = "::imex::dist";
+  let useDefaultTypePrinterParser = 1;
+}
+
+// common base classes for types in Dist dialect
+class Dist_Type<string name, string typeMnemonic, list<Trait> traits = []>
+    : TypeDef<Dist_Dialect, name, traits> {
+  let mnemonic = typeMnemonic;
+}
+
+def Dist_Tensor : Dist_Type<"DistTensor", "dtensor">
+{
+  let summary = "A type used to bind distributed information to a PTensor";
+  let description = [{
+      A distributed PTensor needs information like the offset and shape of the local partition.
+      The DistTensor type is used to define operations to carry and extract such information.
+  }];
+  let parameters = (ins "::imex::ptensor::PTensorType":$p_tensor_type);
+  let assemblyFormat = "`<` $p_tensor_type `>`";
 }
 
 // Base class for dialect operations. This operation inherits from the base
@@ -46,48 +68,190 @@ def Dist_Dialect : Dialect {
 class Dist_Op<string mnemonic, list<Trait> traits = []> :
     Op<Dist_Dialect, mnemonic, traits>;
 
-// Add function prototypes used for calling into distributed runtime
 def RuntimePrototypesOp : Dist_Op<"runtime_prototypes"> {
+  let summary = "Add function prototypes used for calling into distributed runtime";
+}
+
+def NProcsOp : Dist_Op<"nprocs", [Pure]> {
+  let summary = "Number of processes for given team";
+  let arguments = (ins AnyType:$team);
+  let results = (outs Index);
+  let builders = [
+    // auto-deduce return type
+    OpBuilder<(ins "::mlir::Value":$team), [{
+      build($_builder, $_state, $_builder.getIndexType(), team);
+    }]>,
+  ];
 }
 
-// Register a ptensor of given shape with a (potentially distributed) runtime.
-// Returns an id to uniquely identify the tensor instance in future interactino with the runtime.
-// The runtime does not own or manage any PTensor memory. When needed by an operation,
-// (local) data needs to be provided.
-def RegisterPTensorOp : Dist_Op<"register_ptensor", []> {
-  // Global shape needed for initial registration. Views are handled by a separate op.
-  let arguments = (ins AnyType: $shape);
+def PRankOp : Dist_Op<"prank", [Pure]> {
+  let summary = "Process rank in team";
+  let arguments = (ins AnyType:$team);
+  let results = (outs Index);
+  let builders = [
+    // auto-deduce return type
+    OpBuilder<(ins "::mlir::Value":$team), [{
+      build($_builder, $_state, $_builder.getIndexType(), team);
+    }]>,
+  ];
+}
 
-  // result is an Integer Id
-  let results = (outs I64);
+def InitDistTensorOp : Dist_Op<"init_dist_tensor", [SameVariadicOperandSize, Pure]> {
+  let summary = "Bind a PTensor to distributed meta information";
+  let description = [{
+      The attached PTensor is the local partition of the distributed PTensor.
+      The distributed meta information about a new PTensor provides
+      - the global shape
+      - the process-local offsets
+      - the distributed team
+  }];
+  let arguments = (ins Variadic<Index>:$g_shape, AnyType:$p_tensor, Variadic<Index>:$l_offsets, AnyType:$team);
+  let results = (outs Dist_Tensor);
+  let builders = [
+    // auto-deduce return type
+    OpBuilder<(ins "::mlir::ValueRange":$g_shape, "::mlir::Value":$p_tensor, "::mlir::ValueRange":$l_offsets, "::mlir::Value":$team), [{
+      build($_builder, $_state,
+            ::imex::dist::DistTensorType::get($_builder.getContext(), p_tensor.getType().dyn_cast<::imex::ptensor::PTensorType>()),
+            g_shape, p_tensor, l_offsets, team);
+    }]>,
+  ];
 }
 
-// Get the offsets (one for each dimension) of the local partition of a distributed PTensor in number of elements.
-// Partitionings can be N-dimensional but must cut only the first N dimensions.
-def LocalOffsetsOp : Dist_Op<"local_offsets", []> {
-  // Id of tensor as returned by RegisterPTensorOp
-  let arguments = (ins I64Attr: $rank, I64: $ptensor);
+def GlobalShapeOfOp : Dist_Op<"global_shape_of", []> {
+  let summary = "Get global shape of distributed tensor.";
+  let arguments = (ins AnyType:$d_tensor);
+  let results = (outs Variadic<Index>:$g_shape);
+  let builders = [
+    // auto-deduce return type from operands
+    OpBuilder<(ins "::mlir::Value":$d_tensor), [{
+      auto rank = d_tensor.getType().dyn_cast<::imex::dist::DistTensorType>().getPTensorType().getRank();
+      auto IndexType = $_builder.getIndexType();
+      ::mlir::SmallVector<::mlir::Type> rt(rank, IndexType);
+      build($_builder, $_state, ::mlir::TypeRange(rt), d_tensor);
+    }]>,
+  ];
+}
 
-  // result is a 1d memref
-  let results = (outs AnyType);
+def LocalOffsetsOfOp : Dist_Op<"local_offsets_of", []> {
+  let summary = "Get local offsets of distributed tensor.";
+  let arguments = (ins AnyType:$d_tensor);
+  let results = (outs Variadic<Index>:$l_offsets);
+  let builders = [
+    // auto-deduce return type from operands
+    OpBuilder<(ins "::mlir::Value":$d_tensor), [{
+      auto rank = d_tensor.getType().dyn_cast<::imex::dist::DistTensorType>().getPTensorType().getRank();
+      auto IndexType = $_builder.getIndexType();
+      ::mlir::SmallVector<::mlir::Type> rt(rank, IndexType);
+      build($_builder, $_state, ::mlir::TypeRange(rt), d_tensor);
+    }]>,
+  ];
 }
 
-// Get the shape (one size for each dimension) of the local partition of a distributed PTensor in number of elements.
-// Partitionings can be N-dimensional but must cut only the first N dimensions.
-def LocalShapeOp : Dist_Op<"local_shape", []> {
-  // Id of tensor as returned by RegisterPTensorOp
-  let arguments = (ins I64Attr: $rank, I64: $ptensor);
+def LocalTensorOfOp : Dist_Op<"local_tensor_of", []> {
+  let summary = "Get local tensor of distributed tensor.";
+  let arguments = (ins AnyType:$d_tensor);
+  let results = (outs AnyType:$l_tensor);
+  let builders = [
+    // auto-deduce return type from operands
+    OpBuilder<(ins "::mlir::Value":$d_tensor), [{
+      auto ttype = d_tensor.getType().dyn_cast<::imex::dist::DistTensorType>();
+      build($_builder, $_state, ttype.getPTensorType(), d_tensor);
+    }]>,
+  ];
+}
 
-  // result is a 1d memref
-  let results = (outs AnyType);
+def TeamOfOp : Dist_Op<"team_of", []> {
+  let summary = "Get team of distributed tensor.";
+  let arguments = (ins AnyType:$d_tensor);
+  let results = (outs AnyType:$team);
+  let builders = [
+    // auto-deduce return type from operands
+    OpBuilder<(ins "::mlir::Value":$d_tensor), [{
+      build($_builder, $_state, $_builder.getIndexType(), d_tensor);
+    }]>,
+  ];
 }
 
-// Inplace allreduce
-def AllReduceOp : Dist_Op<"allreduce", []> {
-  // reduction operation and and local tensor
-  let arguments = (ins AnyAttr: $op, AnyTensor: $tensor);
+def LocalPartitionOp : Dist_Op<"local_partition", [SameVariadicResultSize, Pure]> {
+  let summary = "Compute the shape and offsets of the local partition in number of elements (one for each dimension).";
+  let arguments = (ins Index:$num_procs, Index:$p_rank, Variadic<Index>:$g_shape);
+  let results = (outs Variadic<Index>:$l_offsets, Variadic<Index>:$l_shape);
+  let builders = [
+    // auto-deduce return type
+    OpBuilder<(ins "::mlir::Value":$num_procs, "::mlir::Value":$prank, "::mlir::ValueRange":$gshape), [{
+      auto IndexType = $_builder.getIndexType();
+      ::mlir::SmallVector<::mlir::Type> rt(gshape.size()*2, IndexType);
+      build($_builder,
+            $_state,
+            ::mlir::TypeRange(rt),
+            num_procs,
+            prank,
+            gshape);
+    }]>,
+  ];
+}
+
+def LocalOfSliceOp : Dist_Op<"local_of_slice",
+                             [SameVariadicOperandSize, SameVariadicResultSize, Pure]> {
+  let summary = "Compute local overlap of a distributed tensor and slice";
+  let description = [{
+      Slice and tensor operate on the global index space. This operation computes the
+      local part of the slice as owned by the local partition of the tensor. The operation
+      returns local offsets and sizes (e.g. relative to the local memref). Additionally,
+      it computes and returns the offsets of the resulting local slice relative to the global input slice.
+  }];
+
+  let arguments = (ins
+    AnyType:$d_tensor,
+    Variadic<Index>:$offsets,
+    Variadic<Index>:$sizes,
+    Variadic<Index>:$strides
+  );
+  let results = (outs Variadic<Index>:$l_offsets, Variadic<Index>:$l_sizes, Variadic<Index>:$g_offsets);
+
+  let assemblyFormat = [{
+    $d_tensor `[` $offsets `]``[` $sizes `]``[` $strides `]` attr-dict `:` qualified(type($d_tensor)) `to` `(`qualified(type(results))`)`
+  }];
+
+  let builders = [
+    // auto-deduce return type
+    OpBuilder<(ins "::mlir::Value":$d_tensor, "::mlir::ValueRange":$offsets, "::mlir::ValueRange":$sizes, "::mlir::ValueRange":$strides), [{
+      auto IndexType = $_builder.getIndexType();
+      ::mlir::SmallVector<::mlir::Type> rt(offsets.size()*3, IndexType);
+      build($_builder, $_state, ::mlir::TypeRange(rt), d_tensor, offsets, sizes, strides);
+    }]>,
+  ];
+}
 
-  // result is allreduced input tensor
+def LocalToGlobalOp : Dist_Op<"local_to_global", [Pure]> {
+  let summary = "Translate local indices into global indices";
+  let description = [{
+      Input indices are interpreted as relative to the local part of the given DTensor.
+  }];
+
+  let arguments = (ins AnyType:$d_tensor, Variadic<Index>:$l_indices);
+  let results = (outs Variadic<Index>:$g_indices);
+
+  let builders = [
+    // auto-deduce return type
+    OpBuilder<(ins "::mlir::Value":$d_tensor, "::mlir::ValueRange":$lindices), [{
+      auto IndexType = $_builder.getIndexType();
+      ::mlir::SmallVector<::mlir::Type> rt(lindices.size(), IndexType);
+      build($_builder, $_state, ::mlir::TypeRange(rt), d_tensor, lindices);
+    }]>,
+  ];
+  // let assemblyFormat = [{
+  //   $d_tensor attr-dict `:` qualified(type($source)) `to` `(`qualified(type(results))`)`
+  // }];
+}
+
+def AllReduceOp : Dist_Op<"allreduce", []> {
+  let summary = "Inplace allreduce";
+  let description = [{
+      Result is the allreduced input tensor.
+  }];
+  // reduction operation and local tensor
+  let arguments = (ins AnyAttr:$op, AnyMemRef:$data);
   let results = (outs AnyType);
 }
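The semantics described for `local_of_slice` can be modeled for a single dimension in plain Python. This is my own illustrative sketch of the op's description (not the actual lowering); `local_of_slice_1d` and its argument names are hypothetical:

```python
def local_of_slice_1d(l_off, l_sz, off, size, stride):
    """Model of dist.local_of_slice for one dimension.

    The local partition owns global indices [l_off, l_off + l_sz).
    The slice selects global indices off + i*stride for i in [0, size).
    Returns (local_offset, local_size, global_offset), where
    local_offset is relative to the local memref and global_offset is
    the position of the first locally owned element within the slice."""
    hi = l_off + l_sz
    # first slice index i whose global index reaches the local block
    i0 = max(0, -(-(l_off - off) // stride))  # ceiling division
    # number of slice indices whose global index lies below the block's end
    i1 = min(size, max(0, -(-(hi - off) // stride)))
    count = max(0, i1 - i0)
    if count == 0:
        return 0, 0, 0
    return off + i0 * stride - l_off, count, i0
```

For example, with a local block owning global indices `[10, 20)` and a slice `off=3, size=12, stride=2` (global indices 3, 5, ..., 25), the locally owned slice elements are 11, 13, 15, 17, 19: local offset 1, local size 5, and offset 4 within the global slice.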
