fixup! fixup! fixup! Enhance DArray Distribution with Processor Assignment

jpsamaroo · jpsamaroo · commit 2be6ac1737af · 2025-07-09T20:46:27.000-07:00
diff --git a/docs/src/darray.md b/docs/src/darray.md
@@ -211,8 +211,6 @@ across the workers in the Julia cluster in a relatively even distribution;
 future operations on a `DArray` may produce a different distribution from the
 one chosen by previous calls.
 
-<!--  -->
-
 ### Explicit Processor Mapping of DArray Blocks
 
 This feature allows you to control how `DArray` blocks (chunks) are assigned to specific processors within the cluster. Controlling data locality is crucial for optimizing the performance of distributed algorithms.
@@ -227,31 +225,33 @@ The `assignment` argument accepts the following values:
 
 * `:blockrow`:
 
-    * Divides the matrix blocks row-wise (vertically in the terminal). Each processor gets a contiguous chunk of row blocks. 
+    * Divides the matrix blocks row-wise (vertically in the terminal). Each processor gets a contiguous chunk of row blocks.
 
 * `:blockcol`:
 
     * Divides the matrix blocks column-wise (horizontally in the terminal). Each processor gets a contiguous chunk of column blocks.
 
 * `:cyclicrow`:
+
   * Assigns row-blocks to processors in a round-robin fashion. Blocks are distributed one row-block at a time. Useful for parallel row-wise tasks.
 
 * `:cycliccol`:
+
   * Assigns column-blocks to processors in a round-robin fashion. Blocks are distributed one column-block at a time. Useful for parallel column-wise tasks.
 
 * Any other symbol used for `assignment` results in an error.
 
 * `AbstractArray{<:Int, N}`:
 
     * Provide an integer **N**-dimensional array of worker IDs. The dimension **N** must match the number of dimensions of the `DArray`.
-    * Dagger maps blocks to worker IDs in a block-cyclic manner according to this processor-array. The block at index `(i,j,...)` is assigned to the first thread of the processor with ID `assignment[mod1(i, size(assignment,1)), mod1(j, size(assignment,2)), ...]`. This pattern repeats block-cyclically across all dimensions.
+    * Dagger maps blocks to worker IDs in a block-cyclic manner according to this processor-array. The block at index `(i,j,...)` is assigned to the first CPU thread of the worker with ID `assignment[mod1(i, size(assignment,1)), mod1(j, size(assignment,2)), ...]`. This pattern repeats block-cyclically across all dimensions.
 
 * `AbstractArray{<:Processor, N}`:
 
     * Provide an **N**-dimensional array of `Processor` objects. The dimension **N** must match the number of dimensions of the `DArray` blocks.
     * Blocks are mapped in a block-cyclic manner according to the `Processor` objects in the assignment array. The block at index `(i,j,...)` is assigned to the processor at `assignment[mod1(i, size(assignment,1)), mod1(j, size(assignment,2)), ...]`. This pattern repeats block-cyclically across all dimensions.
 
-####   Examples and Usage
+#### Examples and Usage
 
 The `assignment` argument works similarly for `DArray`, `DVector`, and `DMatrix`, as well as the `distribute` function. The key difference lies in the dimensionality of the resulting distributed array. For functions like `rand`, `randn`, `sprand`, `ones`, and `zeros`, `assignment` is a keyword argument.
 
@@ -261,11 +261,11 @@ The `assignment` argument works similarly for `DArray`, `DVector`, and `DMatrix`
 
 * `DMatrix`: Specifically for 2-dimensional distributed arrays.
 
-* `distribute`: General function to distribute arrays.
+* `distribute`: General function to distribute arrays of any dimensionality.
 
 * `rand`, `randn`, `sprand`, `ones`, `zeros`: Functions to create DArrays with initial values, also supporting `assignment`.
 
-Here are some examples using a setup with one master processor and three worker processors.
+Here are some examples using a setup with one master process and three worker processes.
 
 First, let's create some sample arrays for `distribute` (and constructor functions):
 
@@ -281,10 +281,10 @@ M = zeros(5, 5, 5) # 3D array
     Ad = distribute(A, Blocks(2, 2), :arbitrary)
     # DMatrix(A, Blocks(2, 2), :arbitrary)
 
-    vd = distribute(v, Blocks(3), :arbitrary) 
+    vd = distribute(v, Blocks(3), :arbitrary)
     # DVector(v, Blocks(3), :arbitrary)
 
-    Md = distribute(M, Blocks(2, 2, 2), :arbitrary) 
+    Md = distribute(M, Blocks(2, 2, 2), :arbitrary)
     # DArray(M, Blocks(2,2,2), :arbitrary)
 
     Rd = rand(Blocks(2, 2), 7, 11; assignment=:arbitrary)
@@ -303,7 +303,7 @@ M = zeros(5, 5, 5) # 3D array
 
 2.  **Structured Assignments:**
 
-  * **`:blockrow` Assignment:** 
+  * **`:blockrow` Assignment:**
 
     ```julia
     Ad = distribute(A, Blocks(1, 2), :blockrow)
@@ -329,7 +329,7 @@ M = zeros(5, 5, 5) # 3D array
     ThreadProc(4, 1)  ThreadProc(4, 1)  ThreadProc(4, 1)  ThreadProc(4, 1)  ThreadProc(4, 1)  ThreadProc(4, 1) ThreadProc(4, 1)
     ```
 
-  * **`:blockcol` Assignment:** 
+  * **`:blockcol` Assignment:**
 
     ```julia
     Ad = distribute(A, Blocks(2, 2), :blockcol)
@@ -350,9 +350,9 @@ M = zeros(5, 5, 5) # 3D array
     ThreadProc(1, 1)  ThreadProc(1, 1)  ThreadProc(2, 1)  ThreadProc(2, 1)  ThreadProc(3, 1)  ThreadProc(4, 1)
     ThreadProc(1, 1)  ThreadProc(1, 1)  ThreadProc(2, 1)  ThreadProc(2, 1)  ThreadProc(3, 1)  ThreadProc(4, 1)
     ThreadProc(1, 1)  ThreadProc(1, 1)  ThreadProc(2, 1)  ThreadProc(2, 1)  ThreadProc(3, 1)  ThreadProc(4, 1)
-    ```  
+    ```
 
-* **`:cyclicrow` Assignment:** 
+* **`:cyclicrow` Assignment:**
 
     ```julia
     Ad = distribute(A, Blocks(1, 2), :cyclicrow)
@@ -378,7 +378,7 @@ M = zeros(5, 5, 5) # 3D array
     ThreadProc(3, 1)  ThreadProc(3, 1)  ThreadProc(3, 1)  ThreadProc(3, 1)  ThreadProc(3, 1)  ThreadProc(3, 1) ThreadProc(3, 1)
     ```
 
-* **`:cycliccol` Assignment:** 
+* **`:cycliccol` Assignment:**
 
     ```julia
     Ad = distribute(A, Blocks(2, 2), :cycliccol)
@@ -405,21 +405,21 @@ M = zeros(5, 5, 5) # 3D array
 
     ```julia
     assignment_2d = [2 1; 4 3]
-    Ad = distribute(A, Blocks(2, 2), assignment_2d) 
-    # DMatrix(A, Blocks(2, 2), [3 1; 4 2])
-    
+    Ad = distribute(A, Blocks(2, 2), assignment_2d)
+    # DMatrix(A, Blocks(2, 2), [2 1; 4 3])
+
     assignment_1d = [2,3,1,4]
-    vd = distribute(v, Blocks(3), assignment_1d) 
+    vd = distribute(v, Blocks(3), assignment_1d)
     # DVector(v, Blocks(3), [2,3,1,4])
-    
+
     assignment_3d = cat([1 2; 3 4], [4 3; 2 1], dims=3)
     Md = distribute(M, Blocks(2, 2, 2), assignment_3d) 
     # DArray(M, Blocks(2, 2, 2), cat([1 2; 3 4], [4 3; 2 1], dims=3))
     Rd = sprand(Blocks(2, 2), 7, 11, 0.2; assignment=assignment_2d)
     # distribute(sprand(7,11, 0.2), Blocks(2, 2), assignment_2d)
     ```
 
-    The assignment is a integer matrix of `Processor` ID’s,  the blocks are assigned in block-cyclic manner to first thread `Processor` ID’s. The assignment for `Ad` (and `Rd`) would be
+    The assignment is an integer matrix of worker IDs; the blocks are assigned in a block-cyclic manner to the first CPU thread of each worker. The assignment for `Ad` (and `Rd`) would be:
 
     ```julia
     4×6 Matrix{Dagger.ThreadProc}:
@@ -434,22 +434,22 @@ M = zeros(5, 5, 5) # 3D array
     ```julia
     assignment_2d = [Dagger.ThreadProc(3, 2) Dagger.ThreadProc(1, 1);
                      Dagger.ThreadProc(4, 3) Dagger.ThreadProc(2, 2)]
-    Ad = distribute(A, Blocks(2, 2), assignment_2d) 
+    Ad = distribute(A, Blocks(2, 2), assignment_2d)
     # DMatrix(A, Blocks(2, 2), assignment_2d)
-    
+
     assignment_1d = [Dagger.ThreadProc(2,1), Dagger.ThreadProc(3,1), Dagger.ThreadProc(1,1), Dagger.ThreadProc(4,1)]
-    vd = distribute(v, Blocks(3), assignment_1d) 
+    vd = distribute(v, Blocks(3), assignment_1d)
     # DVector(v, Blocks(3), assignment_1d)
-    
+
     assignment_3d = cat([Dagger.ThreadProc(1,1) Dagger.ThreadProc(2,1); Dagger.ThreadProc(3,1) Dagger.ThreadProc(4,1)],
                         [Dagger.ThreadProc(4,1) Dagger.ThreadProc(3,1); Dagger.ThreadProc(2,1) Dagger.ThreadProc(1,1)], dims=3)
-    Md = distribute(M, Blocks(2, 2, 2), assignment_3d) 
+    Md = distribute(M, Blocks(2, 2, 2), assignment_3d)
     # DArray(M, Blocks(2, 2, 2), assignment_3d)
     Rd = rand(Blocks(2, 2), 7, 11; assignment=assignment_2d)
     # distribute(rand(7,11), Blocks(2, 2), assignment_2d)
     ```
 
-    The assignment is a matrix of `Processor` objects, the blocks are assigned in block-cyclic manner to `Processor` objects. The assignment for `Ad` (and `Rd`) would be:
+    The assignment is a matrix of `Processor` objects; the blocks are assigned in a block-cyclic manner to each processor. The assignment for `Ad` (and `Rd`) would be:
 
     ```julia
     4×6 Matrix{Dagger.ThreadProc}:
@@ -459,8 +459,6 @@ M = zeros(5, 5, 5) # 3D array
       ThreadProc(4, 3)  ThreadProc(2, 2)  ThreadProc(4, 3)  ThreadProc(2, 2)  ThreadProc(4, 3)  ThreadProc(2, 2)
     ```
 
-<!--  -->
-
 ## Broadcasting
 
 As the `DArray` is a subtype of `AbstractArray` and generally satisfies Julia's
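Note on the docs hunk above: the block-cyclic rule it describes (`assignment[mod1(i, size(assignment,1)), mod1(j, size(assignment,2)), ...]`) can be checked without a cluster. The following is a minimal sketch in plain Julia; `expand_assignment` is a hypothetical helper written for illustration and is not part of Dagger. It assumes a 7×11 array split with `Blocks(2, 2)`, which gives the 4×6 block grid shown in the docs.

```julia
# Hypothetical helper (not a Dagger function): tile an assignment grid
# over a block grid using the mod1 rule described in the docs.
function expand_assignment(assignment::AbstractArray{T,N}, nblocks::NTuple{N,Int}) where {T,N}
    return [assignment[CartesianIndex(mod1.(Tuple(I), size(assignment)))]
            for I in CartesianIndices(nblocks)]
end

# A 7×11 array with 2×2 blocks has ceil(7/2) × ceil(11/2) = 4×6 blocks.
nblocks = cld.((7, 11), (2, 2))

# Worker IDs from the docs example.
assignment_2d = [2 1; 4 3]
grid = expand_assignment(assignment_2d, nblocks)

# Odd block rows cycle through workers 2, 1; even block rows through 4, 3.
display(grid)
```

The same helper works for a vector or a 3-dimensional assignment, as long as the assignment and the block grid have the same number of dimensions.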
diff --git a/src/array/alloc.jl b/src/array/alloc.jl
@@ -15,33 +15,37 @@ mutable struct AllocateArray{T,N} <: ArrayOp{T,N}
         sizeA = map(length, d.indexes)
         procgrid = nothing
         availprocs = collect(Dagger.compatible_processors())
-        sort!(availprocs, by = x -> (x.owner, x.tid))
+        if !(assignment isa AbstractArray{<:Processor, N})
+            filter!(p -> p isa ThreadProc, availprocs)
+            sort!(availprocs, by = x -> (x.owner, x.tid))
+        end
+        np = length(availprocs)
         if assignment isa Symbol
             if assignment == :arbitrary
                 procgrid = nothing
             elseif assignment == :blockrow
                 q = ntuple(i -> i == 1 ? Int(ceil(sizeA[1] / p.blocksize[1])) : 1, N)
-                rows_per_proc, extra = divrem(Int(ceil(sizeA[1] / p.blocksize[1])), num_processors())
-                counts = [rows_per_proc + (i <= extra ? 1 : 0) for i in 1:num_processors()]
+                rows_per_proc, extra = divrem(Int(ceil(sizeA[1] / p.blocksize[1])), np)
+                counts = [rows_per_proc + (i <= extra ? 1 : 0) for i in 1:np]
                 procgrid = reshape(vcat(fill.(availprocs, counts)...), q)
             elseif assignment == :blockcol
                 q = ntuple(i -> i == N ? Int(ceil(sizeA[N] / p.blocksize[N])) : 1, N)
-                cols_per_proc, extra = divrem(Int(ceil(sizeA[N] / p.blocksize[N])), num_processors())
-                counts = [cols_per_proc + (i <= extra ? 1 : 0) for i in 1:num_processors()]
+                cols_per_proc, extra = divrem(Int(ceil(sizeA[N] / p.blocksize[N])), np)
+                counts = [cols_per_proc + (i <= extra ? 1 : 0) for i in 1:np]
                 procgrid = reshape(vcat(fill.(availprocs, counts)...), q)
             elseif assignment == :cyclicrow
-                q = ntuple(i -> i == 1 ? num_processors() : 1, N)
+                q = ntuple(i -> i == 1 ? np : 1, N)
                 procgrid = reshape(availprocs, q)
             elseif assignment == :cycliccol
-                q = ntuple(i -> i == N ? num_processors() : 1, N)
+                q = ntuple(i -> i == N ? np : 1, N)
                 procgrid = reshape(availprocs, q)
             else
                 error("Unsupported assignment symbol: $assignment, use :arbitrary, :blockrow, :blockcol, :cyclicrow or :cycliccol")
             end
         elseif assignment isa AbstractArray{<:Int, N}
             missingprocs = filter(q -> q ∉ procs(), assignment)
             isempty(missingprocs) || error("Specified workers are not available: $missingprocs")
-            procgrid = [Dagger.ThreadProc(proc, 1) for proc in assignment]
+            procgrid = [ThreadProc(proc, 1) for proc in assignment]
         elseif assignment isa AbstractArray{<:Processor, N}
             missingprocs = filter(q -> q ∉ availprocs, assignment)
             isempty(missingprocs) || error("Specified processors are not available: $missingprocs")
@@ -86,7 +90,11 @@ function stage(ctx, a::AllocateArray)
         else
             scope = ExactScope(a.procgrid[CartesianIndex(mod1.(Tuple(I), size(a.procgrid))...)])
         end
-        Dagger.@spawn compute_scope=scope allocate_array(a.f, a.eltype, args...)
+        if a.want_index
+            Dagger.@spawn compute_scope=scope allocate_array(a.f, a.eltype, i, args...)
+        else
+            Dagger.@spawn compute_scope=scope allocate_array(a.f, a.eltype, args...)
+        end
     end
     return DArray(a.eltype, a.domain, a.domainchunks, chunks, a.partitioning)
 end
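Note on the `:blockrow` / `:blockcol` and `:cyclicrow` / `:cycliccol` branches above: the `divrem`-based counts decide how many consecutive blocks each processor receives, and the cyclic variants rely on the `mod1` wrap-around applied later in `stage`. Below is a minimal sketch of that arithmetic in plain Julia, assuming processors are represented by integers; `contiguous_owners` and `cyclic_owners` are hypothetical names used only for illustration.

```julia
# Hypothetical helper: contiguous ("block") ownership along one dimension,
# mirroring the divrem/counts/fill pattern in the hunk above.
function contiguous_owners(procs::AbstractVector, nblocks::Integer)
    np = length(procs)
    per_proc, extra = divrem(nblocks, np)
    # The first `extra` processors take one additional block each.
    counts = [per_proc + (i <= extra ? 1 : 0) for i in 1:np]
    return vcat(fill.(procs, counts)...)
end

# Hypothetical helper: round-robin ("cyclic") ownership along one dimension.
cyclic_owners(procs::AbstractVector, nblocks::Integer) =
    [procs[mod1(k, length(procs))] for k in 1:nblocks]

procs = [1, 2, 3, 4]

# 7 row blocks (7 rows with Blocks(1, 2)) over 4 processors.
blockrow = contiguous_owners(procs, 7)
# 6 column blocks (11 columns with Blocks(2, 2)) over 4 processors.
blockcol = contiguous_owners(procs, 6)
cyclicrow = cyclic_owners(procs, 7)

println(blockrow)
println(blockcol)
println(cyclicrow)
```

With four processors this gives the same ownership pattern as the `:blockrow`, `:blockcol`, and `:cyclicrow` listings in the documentation hunks.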
diff --git a/src/array/darray.jl b/src/array/darray.jl
@@ -513,33 +513,37 @@ distribute(A::AbstractArray, assignment::AssignmentType = :arbitrary) = distribu
 function distribute(A::AbstractArray{T,N}, dist::Blocks{N}, assignment::AssignmentType{N} = :arbitrary) where {T,N}
     procgrid = nothing
     availprocs = collect(Dagger.compatible_processors())
-    sort!(availprocs, by = x -> (x.owner, x.tid))
+    if !(assignment isa AbstractArray{<:Processor, N})
+        filter!(p -> p isa ThreadProc, availprocs)
+        sort!(availprocs, by = x -> (x.owner, x.tid))
+    end
+    np = length(availprocs)
     if assignment isa Symbol
         if assignment == :arbitrary
             procgrid = nothing
         elseif assignment == :blockrow
             p = ntuple(i -> i == 1 ? Int(ceil(size(A,1) / dist.blocksize[1])) : 1, N)
-            rows_per_proc, extra = divrem(Int(ceil(size(A,1) / dist.blocksize[1])), num_processors())
-            counts = [rows_per_proc + (i <= extra ? 1 : 0) for i in 1:num_processors()]
-            procgrid = reshape(vcat(fill.(availprocs, counts)...), p)   
+            rows_per_proc, extra = divrem(Int(ceil(size(A,1) / dist.blocksize[1])), np)
+            counts = [rows_per_proc + (i <= extra ? 1 : 0) for i in 1:np]
+            procgrid = reshape(vcat(fill.(availprocs, counts)...), p)
         elseif assignment == :blockcol
             p = ntuple(i -> i == N ? Int(ceil(size(A,N) / dist.blocksize[N])) : 1, N)
-            cols_per_proc, extra = divrem(Int(ceil(size(A,N) / dist.blocksize[N])), num_processors())
-            counts = [cols_per_proc + (i <= extra ? 1 : 0) for i in 1:num_processors()]
+            cols_per_proc, extra = divrem(Int(ceil(size(A,N) / dist.blocksize[N])), np)
+            counts = [cols_per_proc + (i <= extra ? 1 : 0) for i in 1:np]
             procgrid = reshape(vcat(fill.(availprocs, counts)...), p)
         elseif assignment == :cyclicrow
-            p = ntuple(i -> i == 1 ? num_processors() : 1, N)
+            p = ntuple(i -> i == 1 ? np : 1, N)
             procgrid = reshape(availprocs, p)
         elseif assignment == :cycliccol
-            p = ntuple(i -> i == N ? num_processors() : 1, N)
+            p = ntuple(i -> i == N ? np : 1, N)
             procgrid = reshape(availprocs, p)
         else
             error("Unsupported assignment symbol: $assignment, use :arbitrary, :blockrow, :blockcol, :cyclicrow or :cycliccol")
         end
     elseif assignment isa AbstractArray{<:Int, N}
         missingprocs = filter(p -> p ∉ procs(), assignment)
         isempty(missingprocs) || error("Specified workers are not available: $missingprocs")
-        procgrid = [Dagger.ThreadProc(proc, 1) for proc in assignment]
+        procgrid = [ThreadProc(proc, 1) for proc in assignment]
     elseif assignment isa AbstractArray{<:Processor, N}
         missingprocs = filter(p -> p ∉ availprocs, assignment)
         isempty(missingprocs) || error("Specified processors are not available: $missingprocs")
@@ -563,7 +567,7 @@ DArray(A::AbstractArray{T,N}, part::Blocks{N}, assignment::AssignmentType{N} = :
 
 DVector(A::AbstractVector{T}, assignment::AssignmentType{1} = :arbitrary) where T = DVector(A, AutoBlocks(), assignment)
 DMatrix(A::AbstractMatrix{T}, assignment::AssignmentType{2} = :arbitrary) where T = DMatrix(A, AutoBlocks(), assignment)
-DArray(A::AbstractArray, assignment::AssignmentType = :arbitrary) = DArray(A, AutoBlocks(), assignment) 
+DArray(A::AbstractArray, assignment::AssignmentType = :arbitrary) = DArray(A, AutoBlocks(), assignment)
 
 DVector(A::AbstractVector{T}, ::AutoBlocks, assignment::AssignmentType{1} = :arbitrary) where T = DVector(A, auto_blocks(A), assignment)
 DMatrix(A::AbstractMatrix{T}, ::AutoBlocks, assignment::AssignmentType{2} = :arbitrary) where T = DMatrix(A, auto_blocks(A), assignment)
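Note on the `availprocs` change in both `alloc.jl` and `darray.jl`: the list is now filtered to `ThreadProc` before being sorted by `(x.owner, x.tid)`, and `np` is taken from the filtered list instead of `num_processors()`. One plausible reading, stated here as an assumption, is that other processor types may not have a `tid` field, so sorting a mixed list could fail. The sketch below uses stand-in types (`FakeThreadProc` and `FakeGPUProc` are hypothetical and not Dagger types) to show the filter-then-sort pattern.

```julia
# Stand-in types for illustration only; these are not Dagger's processor types.
struct FakeThreadProc
    owner::Int
    tid::Int
end

struct FakeGPUProc
    owner::Int
    device::Int   # note: no `tid` field
end

availprocs = Any[FakeThreadProc(2, 1), FakeGPUProc(1, 0),
                 FakeThreadProc(1, 2), FakeThreadProc(1, 1)]

# Keep only thread processors, then order them by worker and thread ID.
filter!(p -> p isa FakeThreadProc, availprocs)
sort!(availprocs, by = x -> (x.owner, x.tid))

# The processor count used for the grid comes from the filtered list.
np = length(availprocs)

println([(p.owner, p.tid) for p in availprocs])
println(np)
```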