Conversation


@vtjnash vtjnash commented Aug 6, 2025

This commit implements a Task optimization by adding an `invoked` field to the Task definition, which optionally causes the task to perform an `invoke` call internally instead of a regular call.

Key changes:

  • Add an `invoked` field to the Task structure, supporting Type, Method, MethodInstance, and CodeInstance, just like `invoke`
  • Implement the `_task` builtin function to construct a Task
  • Create a `PartialTask` lattice element for precise tracking of a task's result and error types
  • Unify several CallInfo types into `IndirectCallInfo`
  • Optimize calls to the `_task` constructor to inject a CodeInstance. Calling `fully_covers` isn't necessary, as the runtime will check that.

In the future we can consider making this field user-inaccessible (like `scope`), which would let us optimize based on the type returned from `fetch`.

Also added documentation to `base/docs/basedocs.jl` on how to easily add more Builtin functions, using this one as an example.

🤖 Generated with help from Claude Code
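For readers skimming the thread, a rough sketch of how the pieces fit together (the exact `_task` argument conventions are internal to this PR, and `ci` below is just an illustrative name):

f() = 1 + 1

# Today: Task(f) stores `f` and performs a dynamic call when the task runs.
t = Task(f)

# With this PR, the `_task` builtin can additionally carry invoke information,
# roughly `Core._task(f, reserved_stack, ci)` with `ci::CodeInstance`, so that
# when the task runs it performs an `invoke`-style call into that CodeInstance
# instead of dispatching dynamically, and inference can model the task's result
# and error types through the new `PartialTask` lattice element.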

@vtjnash vtjnash requested a review from aviatesk August 6, 2025 15:31
- `modifyglobal!(mod, var, op, x, order)` where `op(getval(), x)` is called
- `memoryrefmodify!(memref, op, x, order, boundscheck)` where `op(getval(), x)` is called
- `Intrinsics.atomic_pointermodify(ptr, op, x, order)` where `op(getval(), x)` is called
- `Core._task(f, size)` where `f()` will be called when the task runs

Aren't task and finalizer different from the modify functions in that the modify functions eagerly call the argument at the call site, while task and finalizer are deferred? I thought that was important.
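A small plain-Julia sketch of that distinction (illustrative only; it uses an ordinary higher-order function and `Task`, not the builtins listed above):

calls = String[]

# "modify"-style: the operator runs eagerly, at the call site
modify_like(op, x) = op(x)                  # analogous to op(getval(), x)
modify_like(x -> (push!(calls, "modify"); x + 1), 1)

# task-style: the function is only stored now and runs later, when scheduled
t = Task(() -> push!(calls, "task"))
@assert calls == ["modify"]                 # the task body has not run yet
schedule(t); wait(t)
@assert calls == ["modify", "task"]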

@@ -498,6 +498,10 @@ function serialize(s::AbstractSerializer, t::Task)
if istaskstarted(t) && !istaskdone(t)
error("cannot serialize a running Task")
end
if isdefined(t, :invoked)
error("cannot serialize a Task constructed with invoke info")

I'm not sure this is helpful, since if you hit this you can't really control it. We could just skip this field and hope it works out.

@@ -2,7 +2,11 @@

## basic task functions and TLS

Core.Task(@nospecialize(f), reserved_stack::Int=0) = Core._Task(f, reserved_stack, ThreadSynchronizer())
Core.Task(@nospecialize(f), reserved_stack::Int=0) = begin

Suggested change
Core.Task(@nospecialize(f), reserved_stack::Int=0) = begin
function Core.Task(@nospecialize(f), reserved_stack::Int=0)

@@ -2160,6 +2160,18 @@ JL_CALLABLE(jl_f__svec_ref)
return jl_svecref(s, idx-1);
}

JL_CALLABLE(jl_f__task)

I wonder how much overhead this has. Might need codegen support.

@JeffBezanson

Any benchmarks?

end
# Check that size argument is an Int
size_arg = argtypes[3]
if !(widenconst(size_arg) ⊑ Int)

Suggested change
if !(widenconst(size_arg) ⊑ Int)
if !hasintersect(widenconst(size_arg), Int)
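For context, a minimal illustration of the difference (assuming `Core.Compiler` internals; simplified):

const CC = Core.Compiler
# `widenconst(size_arg) ⊑ Int` asks whether the argument is *provably* an Int,
# so a loosely inferred argument such as `Any` would be treated as a guaranteed
# error. `hasintersect` only rules out arguments that can never be an Int.
CC.hasintersect(Any, Int)      # true  -- might still be an Int at runtime
CC.hasintersect(String, Int)   # false -- definitely throws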

if la < 3 || la > 4
return Future(CallMeta(Bottom, Any, EFFECTS_THROWS, NoCallInfo()))
end
# Check that size argument is an Int

Suggested change
# Check that size argument is an Int

Comment on lines +3217 to +3220
# Check argument count: _task(func, size) or _task(func, size, ci)
if la < 3 || la > 4
return Future(CallMeta(Bottom, Any, EFFECTS_THROWS, NoCallInfo()))
end

Suggested change
# Check argument count: _task(func, size) or _task(func, size, ci)
if la < 3 || la > 4
return Future(CallMeta(Bottom, Any, EFFECTS_THROWS, NoCallInfo()))
end
if !isempty(argtypes) && !isvarargtype(argtypes[end])
if !(3 <= la <= 4)
return Future(CallMeta(Bottom, Any, EFFECTS_THROWS, NoCallInfo()))
end
elseif isempty(argtypes) || la > 5
return Future(CallMeta(Bottom, Any, EFFECTS_THROWS, NoCallInfo()))
end
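The extra branch matters because a `Vararg` tail hides the exact argument count; a rough illustration (assuming the usual `argtypes` convention where the first element is the callee itself):

argtypes = Any[Core.Builtin, Function, Vararg{Int}]   # e.g. `_task(f, sizes...)`
la = length(argtypes)                       # 3 here, but the real call may pass any count
Core.Compiler.isvarargtype(argtypes[end])   # true -> only argument counts that are
                                            # impossible either way can be rejected
                                            # as guaranteed errors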

@aviatesk aviatesk commented Aug 6, 2025

Any benchmarks?

Unfortunately, I was unable to confirm any optimization even in the simplest case on my end.
Perhaps task construction/scheduling is the bottleneck?

julia> func2(i) = fetch(Threads.@spawn sin(i));

julia> using BenchmarkTools

julia> @benchmark func2(42)

master:

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  709.000 ns …  34.583 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      12.667 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):    12.857 μs ± 976.688 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                            ▁▃▅▅█▇▄▄▅▄▃▂▁▁      ▂
  ▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄████████████████▇▆▇ █
  709 ns        Histogram: log(frequency) by time       16.2 μs <

 Memory estimate: 352 bytes, allocs estimate: 6.

this PR:

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  11.333 μs …  29.542 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.708 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.602 μs ± 587.109 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄▆▃         ▁   ▁▄▅▆▅▄▃▄▇▇██▆▆▅▂▁▁▂ ▁▁▁   ▁                  ▃
  ███▇▇▆▆▇█▇▇▇█▇▇████████████████████████▇▇▇██▇▇▇█▆▅▆█▆▆▆▇▆▄▆▅ █
  11.3 μs       Histogram: log(frequency) by time      14.4 μs <

 Memory estimate: 352 bytes, allocs estimate: 6.

@vtjnash vtjnash (Author) commented Aug 6, 2025

It was entirely process_events and scheduling issues for me, which are separately addressable. The main point is inference and trim, not performance at this time.
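Since the stated goal is inference and trim rather than wall-clock time, a more telling check than `@benchmark` might be the inferred return type (hypothetical session; the concrete-`fetch` refinement is listed as future work in the description):

julia> func2(i) = fetch(Threads.@spawn sin(i));

julia> Base.infer_return_type(func2, Tuple{Int})
Any

(still `Any` today; `PartialTask` lays the groundwork to eventually narrow this.)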
