Revisit the necessity of the various control knobs in L0 adapter #1528

@pbalcer

As mentioned in #1454, L0 adapter uses a lot of environment variables to control its behavior. This results in the codebase having a lot of conditional execution flows, making it hard to follow at times. It also makes it difficult for the user to understand which variables need to be set for best performance on their systems.
Ideally, the number of knobs controllable through env variables would be reduced to a minimum set where there are unavoidable trade-offs related to the platform and the environment. As a first step towards that, we need to revisit the existing L0 adapter variables to see which ones can be deprecated or removed entirely.
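
As one possible shape for the deprecation step, here is a minimal sketch of a helper that reads an integer knob and warns when a deprecated variable is still set. The names `parseIntKnob` and `readKnob` are hypothetical and do not exist in the adapter; the warning text and the fall-back-to-default policy are assumptions for illustration only.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <iostream>

// Hypothetical helper: parse a raw environment string as an int.
// Falls back to Default when the variable is unset, empty, or malformed.
int parseIntKnob(const char *Raw, int Default) {
  if (Raw == nullptr || *Raw == '\0')
    return Default;
  char *End = nullptr;
  errno = 0;
  long Parsed = std::strtol(Raw, &End, 10);
  // Reject trailing garbage, overflow, and values outside the int range.
  if (*End != '\0' || errno == ERANGE || Parsed < INT_MIN || Parsed > INT_MAX)
    return Default;
  return static_cast<int>(Parsed);
}

// Hypothetical helper: read a knob from the environment and emit a warning
// if the user still sets a variable that is marked as deprecated.
int readKnob(const char *Name, int Default, bool Deprecated,
             std::ostream &Log) {
  assert(Name != nullptr);
  const char *Raw = std::getenv(Name);
  if (Raw != nullptr && Deprecated)
    Log << "warning: " << Name
        << " is deprecated and may be removed in a future release\n";
  return parseIntKnob(Raw, Default);
}
```

Routing every lookup through one function like this would also give a single place to list which variables are still honored.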

List of existing variables:

| Environment Variable | Comments |
| --- | --- |
| UR_L0_DEBUG | Used to enable ZE validation layers and some debug output. Most of the functionality can be supplanted by the new logger. |
| UR_L0_LEAKS_DEBUG | Used to enable leak tracking. Could be rolled into ZE_DEBUG and/or supplanted by the logger. |
| UR_L0_SERIALIZE | Controls call serialization. |
| UR_L0_TRACK_INDIRECT_ACCESS_MEMORY | Tracks all memory allocations live at the time of a kernel execution and defers memory deallocation until the kernel execution has finished. In its current form this has quite a bit of overhead, but it could possibly be optimized to use some form of epoch-based reclamation, at which point we could deprecate the variable. |
| UR_L0_EXPOSE_CSLICE_IN_AFFINITY_PARTITIONING | Is already marked as deprecated by SYCL. |
| UR_L0_USE_NATIVE_USM_MEMCPY2D | This was introduced as a workaround for a bug in the driver, see intel/llvm#9973. We should probably remove the variable and decide the behavior based on the automatically detected L0 driver version. |
| UR_L0_MAX_NUMBER_OF_EVENTS_PER_EVENT_POOL | Controls the number of events created per event pool. This should probably remain, with a sane default. |
| UR_L0_COMMANDLISTS_CLEANUP_THRESHOLD | Threshold for cleaning up events in a command list. Same as above. |
| UR_L0_USE_COPY_ENGINE | Controls which copy engines are used. This is sometimes useful for avoiding hardware limitations when SYCL/UR is running alongside something else. |
| UR_L0_USE_IMMEDIATE_COMMANDLISTS | Overrides the default type of command lists used for a queue. Should probably be removed at some point, but is currently useful for performance profiling. Workloads that want to use regular command lists should select them programmatically. |
| UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS | This is needed to get the best performance on some systems, but is a trade-off against correctness. |
| UR_L0_USE_DRIVER_INORDER_LISTS | A recently introduced feature flag. Should be removed once the feature is proven to be stable. |
| SYCL_USM_HOSTPTR_IMPORT | Controls whether host buffers are imported into USM on creation. This has been shown to increase performance across some workloads. More evaluation across a wider range of benchmarks is likely needed to see whether this can be enabled by default and the flag removed. |
| SYCL_PI_LEVEL_ZERO_USE_MULTIPLE_COMMANDLIST_BARRIERS | This is enabled by default and is the correct behavior. Should probably be deprecated and then removed. |
| UR_L0_IN_ORDER_BARRIER_BY_SIGNAL | This is an optimization that's enabled by default. Deprecate and remove. |
| UR_L0_DISABLE_EVENTS_CACHING | Disables event caching. I can't find where this is needed or even possibly useful (except maybe for some debugging). |
| UR_L0_REUSE_DISCARDED_EVENTS | This is enabled by default and probably could be deprecated. |
| SYCL_PI_LEVEL_ZERO_FILTER_EVENT_WAIT_LIST | This optimization is disabled by default. It checks whether an event has completed prior to appending an operation on a command list that has the event as a dependency. Performance profiling needs to be done to determine whether we can enable or remove this option. |
| UR_L0_DEVICE_SCOPE_EVENTS | Controls creation of proxy-host events to avoid host-visible events. Disabled by default, and might not have any tangible benefits. Candidate for deprecation. |
| UR_L0_USE_COPY_ENGINE_FOR_FILL | Controls whether the copy engine is used in memory fill operations. Defaults to off. It's fairly noninvasive, and likely useful for performance debugging. |
| SYCL_PI_LEVEL_ZERO_SINGLE_ROOT_DEVICE_BUFFER_MIGRATION | Controls page migration between subdevices. Defaults to enabled. This is probably tricky for end users to decide on. This should probably be a programmatic API. |
| UR_L0_USE_COPY_ENGINE_FOR_D2D_COPY | Controls whether the copy engine is used for device-to-device transfers. Defaults to off. This is again fairly noninvasive. Probably requires a lot of benchmarking to decide what the best option is for each platform. |
| UR_L0_EAGER_INIT | Controls whether command lists are created upfront. Defaults to off. I think we should just enable it by default and remove the option. I'm not sure lazy initialization saves anything beyond a negligible amount of memory. |
| UR_L0_QUEUE_FINISH_HOLD_LOCK | A recently introduced feature flag to avoid deadlocks between a queue and its events. To be removed once the feature is proven to be stable. |
| UR_L0_COPY_BATCH_SIZE, UR_L0_BATCH_SIZE | These variables control how batching is performed with normal command lists. Defaults to dynamic adjustment. This should probably be deprecated, but, on the other hand, might be useful for performance profiling. The code is fairly noninvasive. |
| UR_L0_USE_COMPUTE_ENGINE | Same as UR_L0_USE_COPY_ENGINE, this is often useful in debugging for hardcoding which compute engines are used. |
| UR_L0_USE_COPY_ENGINE_FOR_IN_ORDER_QUEUE | Controls whether the copy engine is used for in-order queues. Defaults to enabled. Should probably stay, like other similar options. |
| UR_L0_IMMEDIATE_COMMANDLISTS_EVENT_CLEANUP_THRESHOLD | The number of events collected for the immediate command list prior to cleanup. Defaults to 1024. This cleanup can be painful, especially for out-of-order command lists. This should probably be removed, as users have no visibility into how to set this properly. Upcoming optimizations should negate any benefit from this variable. |
| UR_L0_USM_ALLOCATOR | See below. |
| UR_L0_DISABLE_USM_ALLOCATOR | See below. |
| UR_L0_USM_ALLOCATOR_TRACE | Allocator-related options are eventually going to be moved to UMF, so these variables will be removed or mapped to UMF ones once that happens. |
| UR_L0_USM_RESIDENT | Controls whether allocations are forced to be resident for device/host/shared memory. Defaults to making device memory resident. This is something that software should ideally control through a programmatic API. |
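
For UR_L0_USE_NATIVE_USM_MEMCPY2D, the suggestion is to pick the behavior from the detected driver version instead of a variable. A minimal sketch of that gating follows, assuming the version can be expressed as a major/minor/build triple. `DriverVersion`, `isAtLeast`, `useNativeMemcpy2D`, and the threshold `HypotheticalFixedIn` are all made up for illustration; the issue does not say which driver version contains the fix or how the adapter represents versions.

```cpp
#include <tuple>

// Hypothetical representation of an L0 driver version.
struct DriverVersion {
  unsigned Major;
  unsigned Minor;
  unsigned Build;
};

// Lexicographic comparison: true if V is the same as or newer than Min.
inline bool isAtLeast(const DriverVersion &V, const DriverVersion &Min) {
  return std::tie(V.Major, V.Minor, V.Build) >=
         std::tie(Min.Major, Min.Minor, Min.Build);
}

// Placeholder threshold, NOT a real driver version. The first driver
// release that contains the fix would go here.
const DriverVersion HypotheticalFixedIn{1, 2, 3};

// Use the native 2D memcpy path only on drivers assumed to have the fix;
// older drivers keep the workaround path.
inline bool useNativeMemcpy2D(const DriverVersion &Detected) {
  return isAtLeast(Detected, HypotheticalFixedIn);
}
```

With a check like this, the choice is made once at adapter initialization and users never need to know the variable existed.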
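
To make the SYCL_PI_LEVEL_ZERO_FILTER_EVENT_WAIT_LIST description concrete, the sketch below drops already-completed events from a wait list before it would be attached to a new command. The `Event` struct and `filterWaitList` are simplified stand-ins invented for this example, not the adapter's types; in particular, the real check would have to query event status from the driver, which is where the profiling cost mentioned above would come from.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical stand-in for an adapter event; Completed models the result
// of a status query that the real code would have to perform.
struct Event {
  int Id;
  bool Completed;
};

// Return only the events that the new operation still has to wait on.
std::vector<Event> filterWaitList(const std::vector<Event> &WaitList) {
  std::vector<Event> Pending;
  std::copy_if(WaitList.begin(), WaitList.end(), std::back_inserter(Pending),
               [](const Event &E) { return !E.Completed; });
  return Pending;
}
```

The trade-off is one status check per dependency at submission time against a shorter wait list for the device to process.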

Labels: level-zero (L0 adapter specific issues)