Skip to content

Conversation

@GeorgeWeb
Copy link
Contributor

@GeorgeWeb GeorgeWeb commented Aug 30, 2024

intel/llvm: intel/llvm#15369

@github-actions github-actions bot added the hip HIP adapter specific issues label Aug 30, 2024
@GeorgeWeb GeorgeWeb added the opencl OpenCL adapter specific issues label Sep 4, 2024
@GeorgeWeb GeorgeWeb marked this pull request as ready for review September 11, 2024 14:59
@GeorgeWeb GeorgeWeb requested review from a team as code owners September 11, 2024 14:59
@GeorgeWeb GeorgeWeb requested a review from jchlanda September 11, 2024 14:59
Copy link
Contributor

@frasercrmck frasercrmck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me

@GeorgeWeb GeorgeWeb force-pushed the georgi/unsupported-max-coop-wgsize branch 3 times, most recently from fa515a0 to ed6fbad Compare September 27, 2024 15:20
…unsupported instead of returning misleading default value
@GeorgeWeb GeorgeWeb force-pushed the georgi/unsupported-max-coop-wgsize branch from ed6fbad to 55bd563 Compare October 2, 2024 11:20
@aarongreig aarongreig added the ready to merge Added to PR's which are ready to merge label Oct 2, 2024
@aarongreig aarongreig merged commit df6da35 into oneapi-src:main Oct 7, 2024
72 of 74 checks passed
martygrant pushed a commit to intel/llvm that referenced this pull request Oct 10, 2024
…xceeded launch limits on more backends (hip and opencl) (#15369)

The HIP and OpenCL backend implementations of the query had a default
return `1` group implementation, which is an incorrect assumptions. They
will now be marked as _Unsupported_ with the accompanying UR chagnes
(see oneapi-src/unified-runtime#2038), so for
these cases the `kernel_queue_specific::max_num_work_groups` launch
query will rely on the fallback that returns either `1` or `0` groups
based on hardware resource limitation checks for the kernel.

---------

Co-authored-by: Aaron Greig <[email protected]>
@jinz2014
Copy link

@GeorgeWeb
Could you please explain that it is not supported on the HIP backend ?

@jinz2014
Copy link

What about the following function for HIP ?

UR_APIEXPORT ur_result_t UR_APICALL urKernelSuggestMaxCooperativeGroupCountExp(
    ur_kernel_handle_t hKernel, ur_device_handle_t hDevice, uint32_t workDim,
    const size_t *pLocalWorkSize, size_t dynamicSharedMemorySize,
    uint32_t *pGroupCountRet) {
  UR_ASSERT(hKernel, UR_RESULT_ERROR_INVALID_KERNEL);

  std::ignore = hDevice;

  size_t localWorkSize = pLocalWorkSize[0];
  localWorkSize *= (workDim >= 2 ? pLocalWorkSize[1] : 1);
  localWorkSize *= (workDim == 3 ? pLocalWorkSize[2] : 1);

  // We need to set the active current device for this kernel explicitly here,
  // because the occupancy querying API does not take device parameter.
  ur_device_handle_t Device = hKernel->getProgram()->getDevice();
  ScopedContext Active(Device);
  try {
    // We need to calculate max num of work-groups using per-device semantics.

    int MaxNumActiveGroupsPerCU{0};
    UR_CHECK_ERROR(hipOccupancyMaxActiveBlocksPerMultiprocessor(
        &MaxNumActiveGroupsPerCU, hKernel->get(), localWorkSize,
        dynamicSharedMemorySize));
    detail::ur::assertion(MaxNumActiveGroupsPerCU >= 0);
    // Handle the case where we can't have all SMs active with at least 1 group
    // per SM. In that case, the device is still able to run 1 work-group, hence
    // we will manually check if it is possible with the available HW resources.
    if (MaxNumActiveGroupsPerCU == 0) {
      size_t MaxWorkGroupSize{};
      urKernelGetGroupInfo(
          hKernel, Device, UR_KERNEL_GROUP_INFO_WORK_GROUP_SIZE,
          sizeof(MaxWorkGroupSize), &MaxWorkGroupSize, nullptr);
      size_t MaxLocalSizeBytes{};
      urDeviceGetInfo(Device, UR_DEVICE_INFO_LOCAL_MEM_SIZE,
                      sizeof(MaxLocalSizeBytes), &MaxLocalSizeBytes, nullptr);
      if (localWorkSize > MaxWorkGroupSize ||
          dynamicSharedMemorySize > MaxLocalSizeBytes)
        *pGroupCountRet = 0;
      else
        *pGroupCountRet = 1;
    } else {
      // Multiply by the number of SMs (CUs = compute units) on the device in
      // order to retreive the total number of groups/blocks that can be
      // launched.
      *pGroupCountRet = Device->getNumComputeUnits() * MaxNumActiveGroupsPerCU;
    }
  } catch (ur_result_t Err) {
    return Err;
  }
  return UR_RESULT_SUCCESS;
}

@GeorgeWeb
Copy link
Contributor Author

Could you please explain that it is not supported on the HIP backend ?

Hi @jinz2014.
It isn't unsupported by the HIP runtime itself as the required API exists, but we just haven't implemented it in the HIP adapter of Unified Runtime yet. This is mostly due to that the functionality was requested for Cuda targets at the time.

You are more than welcome to open a Github issue for the team in case they may have the bandwidth for this or straight up a PR request for HIP with the implementation you suggested in your other comment.

Thank you for being proactive on the project!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hip HIP adapter specific issues opencl OpenCL adapter specific issues ready to merge Added to PR's which are ready to merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants