4 changes: 2 additions & 2 deletions .buildkite/pipeline.yml
@@ -162,8 +162,8 @@ steps:
codecov: true
agents:
queue: "juliaecosystem"
-os: linux
-arch: x86_64
+os: macos
+arch: aarch64
env:
CI_USE_OPENCL: "1"

5 changes: 4 additions & 1 deletion docs/make.jl
@@ -26,7 +26,10 @@ makedocs(;
"Scopes" => "scopes.md",
"Processors" => "processors.md",
"Task Queues" => "task-queues.md",
-"Datadeps" => "datadeps.md",
+"Datadeps" => [
+    "Basics" => "datadeps.md",
+    "Stencils" => "stencils.md",
+],
"GPUs" => "gpu.md",
"Option Propagation" => "propagation.md",
"Logging and Visualization" => [
33 changes: 33 additions & 0 deletions docs/src/index.md
@@ -361,6 +361,39 @@ DA = rand(Blocks(32, 32), 256, 128)
collect(DA) # returns a `Matrix{Float64}`
```

-----

## Quickstart: Stencil Operations

Dagger's `@stencil` macro allows for easy specification of stencil operations on `DArray`s, often used in simulations and image processing. These operations typically involve updating an element based on the values of its neighbors.

For more details: [Stencil Operations](@ref)

### Applying a Simple Stencil

Here's how to apply a stencil that averages each element with its immediate neighbors, using a `Wrap` boundary condition (where neighbor accesses beyond the array edges wrap around to the opposite side).

```julia
using Dagger
import Dagger: @stencil, Wrap

# Create a 5x5 DArray, partitioned into 2x2 blocks
A = rand(Blocks(2, 2), 5, 5)
B = zeros(Blocks(2, 2), 5, 5)

Dagger.spawn_datadeps() do
    @stencil begin
        # For each element in A, calculate the sum of its 3x3 neighborhood
        # (including itself) and store the average in B.
        # Values outside the array bounds are determined by Wrap().
        B[idx] = sum(@neighbors(A[idx], 1, Wrap())) / 9.0
    end
end

# B now contains the averaged values.
```
In this example, `idx` refers to the coordinates of each element being processed. `@neighbors(A[idx], 1, Wrap())` fetches the 3x3 neighborhood around `A[idx]`. The `1` indicates a neighborhood distance of 1 from the central element, and `Wrap()` specifies the boundary behavior.

## Quickstart: Datadeps

Datadeps is a feature in Dagger.jl that facilitates parallelism control within designated regions, allowing tasks to write to their arguments while ensuring dependencies are respected.
183 changes: 183 additions & 0 deletions docs/src/stencils.md
@@ -0,0 +1,183 @@
# Stencil Operations

The `@stencil` macro in Dagger.jl provides a convenient way to perform stencil computations on `DArray`s. It operates within a `Dagger.spawn_datadeps()` block and allows you to define operations that apply to each element of a `DArray`, potentially accessing values from each element's neighbors.

## Basic Usage

The fundamental structure of a `@stencil` block involves iterating over an implicit index, named `idx` in the following example, which represents the coordinates of an element in the processed `DArray`s.

```julia
using Dagger
import Dagger: @stencil, Wrap, Pad

# Initialize a DArray
A = zeros(Blocks(2, 2), Int, 4, 4)

Dagger.spawn_datadeps() do
    @stencil begin
        A[idx] = 1 # Assign 1 to every element of A
    end
end

@assert all(collect(A) .== 1)
```

In this example, `A[idx] = 1` is executed for each chunk of `A`. The `idx` variable corresponds to the indices within each chunk.

## Neighborhood Access with `@neighbors`

The true power of stencils comes from accessing neighboring elements. The `@neighbors` macro facilitates this.

`@neighbors(array[idx], distance, boundary_condition)`

- `array[idx]`: The array and current index from which to find neighbors.
- `distance`: An integer specifying the extent of the neighborhood (e.g., `1` for a 3x3 neighborhood in 2D).
- `boundary_condition`: Defines how to handle accesses beyond the array boundaries. Available conditions are:
  - `Wrap()`: Wraps around to the other side of the array.
  - `Pad(value)`: Pads with a specified `value`.
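
To make the two boundary conditions concrete, here is a minimal plain-Julia sketch of what a distance-1 neighborhood contains at a corner element. It does not use Dagger; `neighbors_wrap` and `neighbors_pad` are hypothetical helpers written only for illustration, and the layout of the returned window is an assumption rather than a description of what `@neighbors` returns internally.

```julia
# Illustration only: these helpers are hypothetical and not part of Dagger.

# Wrap-around: out-of-bounds indices re-enter from the opposite side.
function neighbors_wrap(A::AbstractMatrix, i::Int, j::Int, d::Int)
    m, n = size(A)
    return [A[mod1(i + di, m), mod1(j + dj, n)] for di in -d:d, dj in -d:d]
end

# Padding: out-of-bounds positions take the value `pad`.
function neighbors_pad(A::AbstractMatrix, i::Int, j::Int, d::Int, pad)
    at(r, c) = checkbounds(Bool, A, r, c) ? A[r, c] : pad
    return [at(i + di, j + dj) for di in -d:d, dj in -d:d]
end

A = reshape(1:9, 3, 3)            # [1 4 7; 2 5 8; 3 6 9]

w = neighbors_wrap(A, 1, 1, 1)    # window around the top-left corner
p = neighbors_pad(A, 1, 1, 1, 0)  # same window, zero-padded

println(w)
println(p)
```

With wrapping, the corner's window still holds all nine values of `A`; with zero padding, five of the nine positions fall outside the array and contribute the pad value.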

### Example: Averaging Neighbors with `Wrap`

```julia
import Dagger: Wrap

# Initialize a DArray
A = ones(Blocks(1, 1), Int, 3, 3)
A[2,2] = 10 # Central element has a different value
B = zeros(Blocks(1, 1), Float64, 3, 3)

Dagger.spawn_datadeps() do
    @stencil begin
        # Calculate the average of the 3x3 neighborhood (including the center)
        B[idx] = sum(@neighbors(A[idx], 1, Wrap())) / 9.0
    end
end

# Manually calculate expected B for verification
expected_B = zeros(Float64, 3, 3)
A_collected = collect(A)
for r in 1:3, c in 1:3
    local_sum = 0.0
    for dr in -1:1, dc in -1:1
        nr, nc = mod1(r+dr, 3), mod1(c+dc, 3)
        local_sum += A_collected[nr, nc]
    end
    expected_B[r,c] = local_sum / 9.0
end

@assert collect(B) ≈ expected_B
```

### Example: Convolution with `Pad`

```julia
import Dagger: Pad

# Initialize a DArray
A = ones(Blocks(2, 2), Int, 4, 4)
B = zeros(Blocks(2, 2), Int, 4, 4)

Dagger.spawn_datadeps() do
    @stencil begin
        B[idx] = sum(@neighbors(A[idx], 1, Pad(0))) # Pad with 0
    end
end

# Expected result for a 3x3 sum filter with zero padding
expected_B_padded = [
    4 6 6 4;
    6 9 9 6;
    6 9 9 6;
    4 6 6 4
]
@assert collect(B) == expected_B_padded
```
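
The expected matrix above can be derived by counting: because `A` is all ones and padded cells contribute zero, each output equals the number of cells in its 3x3 window that lie inside the array. A short plain-Julia check of that arithmetic follows (no Dagger involved; `cells_inside` is an illustrative name, not Dagger API).

```julia
# Illustration only: derive the zero-padded 3x3 sum of an all-ones 4x4 array.
n = 4

# Number of indices in k-1:k+1 that fall inside 1:n along one dimension.
cells_inside(k) = length(intersect(k-1:k+1, 1:n))

# The 2D window count is the product of the per-dimension counts.
derived = [cells_inside(i) * cells_inside(j) for i in 1:n, j in 1:n]

println(derived)
```

Corners see 2x2 = 4 in-bounds cells, edges see 2x3 = 6, and interior elements see all 9.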

## Sequential Semantics

Statements within a `@stencil` block take effect in order: each statement behaves as if it were applied across all indices at once before the next statement begins, so its results are visible to every subsequent statement.

```julia
A = zeros(Blocks(2, 2), Int, 4, 4)
B = zeros(Blocks(2, 2), Int, 4, 4)

Dagger.spawn_datadeps() do
    @stencil begin
        A[idx] = 1 # First, A is initialized
        B[idx] = A[idx] * 2 # Then, B is computed using the new values of A
    end
end

expected_A = [1 for r in 1:4, c in 1:4]
expected_B_seq = expected_A .* 2

@assert collect(A) == expected_A
@assert collect(B) == expected_B_seq
```
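
The distinction between "all at once" and element-by-element evaluation matters most when a statement reads neighbors of an array that is also being written. The following plain-Julia sketch does not use Dagger and makes no claim about how `@stencil` is implemented; `shift_sum` is a hypothetical helper. It contrasts an update where every read sees the values from before the statement with an in-place loop where later reads see earlier writes.

```julia
# Illustration only: `shift_sum` is hypothetical and not part of Dagger.

# One "statement": each element becomes itself plus its left neighbor (wrap-around).
# The comprehension builds a new vector, so every read sees the old values.
shift_sum(v) = [v[i] + v[mod1(i - 1, length(v))] for i in eachindex(v)]

v = [1, 2, 3, 4]

all_at_once = shift_sum(v)

# Element-by-element in-place update: reads can observe values written earlier
# in the same pass.
in_place = copy(v)
for i in eachindex(in_place)
    in_place[i] += in_place[mod1(i - 1, length(in_place))]
end

println(all_at_once)
println(in_place)
```

The two results differ from the second element onward, because the in-place loop reads values it has already overwritten.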

## Operations on Multiple `DArray`s

You can read from and write to multiple `DArray`s within a single `@stencil` block, provided they have compatible chunk structures.

```julia
A = ones(Blocks(1, 1), Int, 2, 2)
B = DArray(fill(3, 2, 2), Blocks(1, 1))
C = zeros(Blocks(1, 1), Int, 2, 2)

Dagger.spawn_datadeps() do
    @stencil begin
        C[idx] = A[idx] + B[idx]
    end
end
@assert all(collect(C) .== 4)
```

## Example: Game of Life

The following demonstrates a more complex example: Conway's Game of Life.

```julia
# Ensure Plots and other necessary packages are available for the example
using Dagger, Plots
import Dagger: @stencil, Wrap

N = 27 # Size of one dimension of a tile
nt = 3 # Number of tiles in each dimension (results in nt x nt grid of tiles)
niters = 10 # Number of iterations for the animation

tiles = zeros(Blocks(N, N), Bool, N*nt, N*nt)
outputs = zeros(Blocks(N, N), Bool, N*nt, N*nt)

# Create a fun initial state (e.g., a glider and some random noise)
tiles[13, 14] = true
tiles[14, 14] = true
tiles[15, 14] = true
tiles[15, 15] = true
tiles[14, 16] = true
# Add some random noise in one of the tiles
@view(tiles[(2N+1):3N, (2N+1):3N]) .= rand(Bool, N, N)



anim = @animate for _ in 1:niters
    Dagger.spawn_datadeps() do
        @stencil begin
            outputs[idx] = begin
                nhood = @neighbors(tiles[idx], 1, Wrap())
                neighs = sum(nhood) - tiles[idx] # Sum neighborhood, but subtract own value
                if tiles[idx] && neighs < 2
                    0 # Dies of underpopulation
                elseif tiles[idx] && neighs > 3
                    0 # Dies of overpopulation
                elseif !tiles[idx] && neighs == 3
                    1 # Becomes alive by reproduction
                else
                    tiles[idx] # Keeps its prior value
                end
            end
            tiles[idx] = outputs[idx] # Update tiles for the next iteration
        end
    end
    heatmap(Int.(collect(outputs))) # Generate a heatmap visualization
end
path = mp4(anim; fps=5, show_msg=true).filename # Create an animation of the heatmaps over time
```
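
For checking results on small grids, the same update rule can be written as a plain-Julia reference step. This is a sketch that assumes wrap-around boundaries as in the example above; `life_step` is a hypothetical helper, not part of Dagger, and it runs serially on an ordinary Boolean matrix.

```julia
# Illustration only: a serial reference step, not Dagger API.
function life_step(tiles::AbstractMatrix{Bool})
    m, n = size(tiles)
    out = similar(tiles)
    for j in 1:n, i in 1:m
        # Count live cells in the 3x3 window, then subtract the cell itself.
        window = sum(tiles[mod1(i + di, m), mod1(j + dj, n)] for di in -1:1, dj in -1:1)
        neighs = window - tiles[i, j]
        # Live cells survive with 2 or 3 neighbors; dead cells come alive with exactly 3.
        out[i, j] = tiles[i, j] ? (2 <= neighs <= 3) : (neighs == 3)
    end
    return out
end

# A blinker (three cells in a row) oscillates with period 2.
blinker = falses(5, 5)
blinker[3, 2:4] .= true

after_one = life_step(blinker)
after_two = life_step(after_one)

println(after_one)
```

Reads come from `tiles` and writes go to a separate `out`, mirroring the `tiles`/`outputs` pair in the example above.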
56 changes: 38 additions & 18 deletions ext/CUDAExt.jl
@@ -252,24 +252,6 @@ Dagger.move(from_proc::CPUProc, to_proc::CuArrayDeviceProc, x::Function) = x
Dagger.move(from_proc::CPUProc, to_proc::CuArrayDeviceProc, x::Chunk{T}) where {T<:Function} =
Dagger.move(from_proc, to_proc, fetch(x))

-# Adapt BLAS/LAPACK functions
-import LinearAlgebra: BLAS, LAPACK
-for lib in [BLAS, LAPACK]
-    for name in names(lib; all=true)
-        name == nameof(lib) && continue
-        startswith(string(name), '#') && continue
-        endswith(string(name), '!') || continue
-
-        for culib in [CUBLAS, CUSOLVER]
-            if name in names(culib; all=true)
-                fn = getproperty(lib, name)
-                cufn = getproperty(culib, name)
-                @eval Dagger.move(from_proc::CPUProc, to_proc::CuArrayDeviceProc, ::$(typeof(fn))) = $cufn
-            end
-        end
-    end
-end

# Task execution
function Dagger.execute!(proc::CuArrayDeviceProc, f, args...; kwargs...)
@nospecialize f args kwargs
@@ -291,6 +273,44 @@ function Dagger.execute!(proc::CuArrayDeviceProc, f, args...; kwargs...)
end
end

+# Adapt BLAS/LAPACK functions
+import LinearAlgebra: BLAS, LAPACK
+for lib in [BLAS, LAPACK]
+    for name in names(lib; all=true)
+        name == nameof(lib) && continue
+        startswith(string(name), '#') && continue
+        endswith(string(name), '!') || continue
+
+        for culib in [CUBLAS, CUSOLVER]
+            if name in names(culib; all=true)
+                fn = getproperty(lib, name)
+                cufn = getproperty(culib, name)
+                @eval Dagger.move(from_proc::CPUProc, to_proc::CuArrayDeviceProc, ::$(typeof(fn))) = $cufn
+            end
+        end
+    end
+end
+
+CuArray(H::Dagger.HaloArray) = convert(CuArray, H)
+Base.convert(::Type{C}, H::Dagger.HaloArray) where {C<:CuArray} =
+    Dagger.HaloArray(C(H.center),
+                     C.(H.edges),
+                     C.(H.corners),
+                     H.halo_width)
+Adapt.adapt_structure(to::CUDA.KernelAdaptor, H::Dagger.HaloArray) =
+    Dagger.HaloArray(adapt(to, H.center),
+                     adapt.(Ref(to), H.edges),
+                     adapt.(Ref(to), H.corners),
+                     H.halo_width)
+function Dagger.inner_stencil_proc!(::CuArrayDeviceProc, f, output, read_vars)
+    Dagger.Kernel(_inner_stencil!)(f, output, read_vars; ndrange=size(output))
+    return
+end
+@kernel function _inner_stencil!(f, output, read_vars)
+    idx = @index(Global, Cartesian)
+    f(idx, output, read_vars)
+end

Dagger.gpu_processor(::Val{:CUDA}) = CuArrayDeviceProc
Dagger.gpu_can_compute(::Val{:CUDA}) = CUDA.has_cuda()
Dagger.gpu_kernel_backend(::CuArrayDeviceProc) = CUDABackend()
20 changes: 20 additions & 0 deletions ext/IntelExt.jl
@@ -259,6 +259,26 @@ function Dagger.execute!(proc::oneArrayDeviceProc, f, args...; kwargs...)
end
end

+oneArray(H::Dagger.HaloArray) = convert(oneArray, H)
+Base.convert(::Type{C}, H::Dagger.HaloArray) where {C<:oneArray} =
+    Dagger.HaloArray(C(H.center),
+                     C.(H.edges),
+                     C.(H.corners),
+                     H.halo_width)
+Adapt.adapt_structure(to::oneAPI.KernelAdaptor, H::Dagger.HaloArray) =
+    Dagger.HaloArray(adapt(to, H.center),
+                     adapt.(Ref(to), H.edges),
+                     adapt.(Ref(to), H.corners),
+                     H.halo_width)
+function Dagger.inner_stencil_proc!(::oneArrayDeviceProc, f, output, read_vars)
+    Dagger.Kernel(_inner_stencil!)(f, output, read_vars; ndrange=size(output))
+    return
+end
+@kernel function _inner_stencil!(f, output, read_vars)
+    idx = @index(Global, Cartesian)
+    f(idx, output, read_vars)
+end

Dagger.gpu_processor(::Val{:oneAPI}) = oneArrayDeviceProc
Dagger.gpu_can_compute(::Val{:oneAPI}) = oneAPI.functional()
Dagger.gpu_kernel_backend(::oneArrayDeviceProc) = oneAPIBackend()
17 changes: 16 additions & 1 deletion ext/MetalExt.jl
@@ -274,6 +274,21 @@ function Dagger.execute!(proc::MtlArrayDeviceProc, f, args...; kwargs...)
end
end

+MtlArray(H::Dagger.HaloArray) = convert(MtlArray, H)
+Base.convert(::Type{C}, H::Dagger.HaloArray) where {C<:MtlArray} =
+    Dagger.HaloArray(C(H.center),
+                     C.(H.edges),
+                     C.(H.corners),
+                     H.halo_width)
+function Dagger.inner_stencil_proc!(::MtlArrayDeviceProc, f, output, read_vars)
+    Dagger.Kernel(_inner_stencil!)(f, output, read_vars; ndrange=size(output))
+    return
+end
+@kernel function _inner_stencil!(f, output, read_vars)
+    idx = @index(Global, Cartesian)
+    f(idx, output, read_vars)
+end

function Base.show(io::IO, proc::MtlArrayDeviceProc)
print(io, "MtlArrayDeviceProc(worker $(proc.owner), device $(something(_get_metal_device(proc)).name))")
end
@@ -284,7 +299,7 @@ Dagger.gpu_kernel_backend(proc::MtlArrayDeviceProc) = MetalBackend()
# TODO: Switch devices
Dagger.gpu_with_device(f, proc::MtlArrayDeviceProc) = f()

-function Dagger.gpu_synchronize(proc::MtlArrayDeviceProc)q
+function Dagger.gpu_synchronize(proc::MtlArrayDeviceProc)
with_context(proc) do
Metal.synchronize()
end