Open
Labels
level-zero: L0 adapter specific issues
Description
As mentioned in #1454, the L0 adapter uses many environment variables to control its behavior. This results in a codebase with many conditional execution flows, which makes it hard to follow at times. It also makes it difficult for users to understand which variables need to be set for best performance on their systems.
Ideally, the number of knobs controllable through env variables would be reduced to a minimum set where there are unavoidable trade-offs related to the platform and the environment. As a first step towards that, we need to revisit the existing L0 adapter variables to see which ones can be deprecated or removed entirely.
List of existing variables:
| Environment Variable | Comments |
|---|---|
| UR_L0_DEBUG | Used to enable ZE validation layers and enable some debug output. Most of the functionality can be supplanted by the new logger. |
| UR_L0_LEAKS_DEBUG | Used to enable leak tracking. Could be rolled into ZE_DEBUG and/or supplanted by the logger. |
| UR_L0_SERIALIZE | Controls call serialization. |
| UR_L0_TRACK_INDIRECT_ACCESS_MEMORY | This tracks all memory allocations live at the time of a kernel execution and defers memory deallocation until the kernel execution has finished. In its current form this has quite a bit of overhead, but could possibly be optimized to use some form of epoch-based reclamation, at which point we could deprecate the variable. |
| UR_L0_EXPOSE_CSLICE_IN_AFFINITY_PARTITIONING | Is already marked as deprecated by SYCL. |
| UR_L0_USE_NATIVE_USM_MEMCPY2D | This was introduced as a workaround for a bug in the driver, see intel/llvm#9973. We should probably remove the variable and decide the behavior based on the automatically detected L0 driver version. |
| UR_L0_MAX_NUMBER_OF_EVENTS_PER_EVENT_POOL | Controls the number of events created per event pool. This should probably remain with a sane default. |
| UR_L0_COMMANDLISTS_CLEANUP_THRESHOLD | Threshold for cleaning up events in a command list. Same as above. |
| UR_L0_USE_COPY_ENGINE | Controls which copy engines are used. This is sometimes useful for avoiding hardware limitations when SYCL/UR is running alongside something else. |
| UR_L0_USE_IMMEDIATE_COMMANDLISTS | Overrides the default type of command lists used for a queue. Should probably be removed at some point, but is currently useful for performance profiling. Workloads that want to use regular command lists should select them programmatically. |
| UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS | This is needed to get best performance on some systems, but is a trade-off against correctness. |
| UR_L0_USE_DRIVER_INORDER_LISTS | A recently introduced feature flag. Should be removed at some point once the feature is proven to be stable. |
| SYCL_USM_HOSTPTR_IMPORT | This controls whether host buffers are imported into USM on creation. This has been shown to increase performance across some workloads. More evaluation is likely needed across a wider range of benchmarks to see whether this can be enabled by default and the flag removed. |
| SYCL_PI_LEVEL_ZERO_USE_MULTIPLE_COMMANDLIST_BARRIERS | This is enabled by default and is the correct behavior. Should probably be deprecated and then removed. |
| UR_L0_IN_ORDER_BARRIER_BY_SIGNAL | This is an optimization that's enabled by default. Deprecate and remove. |
| UR_L0_DISABLE_EVENTS_CACHING | Disables event caching. I can't find where this is needed or even possibly useful (except maybe for some debugging). |
| UR_L0_REUSE_DISCARDED_EVENTS | This is enabled by default and probably could be deprecated. |
| SYCL_PI_LEVEL_ZERO_FILTER_EVENT_WAIT_LIST | This is an optimization, disabled by default, that checks whether an event has completed prior to appending an operation on a command list that has the event as a dependency. Performance profiling needs to be done to determine whether we can enable or remove this option. |
| UR_L0_DEVICE_SCOPE_EVENTS | Controls creation of proxy-host events to avoid host visible events. Disabled by default, and might not have any tangible benefits. Candidate for deprecation. |
| UR_L0_USE_COPY_ENGINE_FOR_FILL | Controls whether the copy engine is used in memory fill operations. Defaults to off. It's fairly uninvasive, and likely useful for performance debugging. |
| SYCL_PI_LEVEL_ZERO_SINGLE_ROOT_DEVICE_BUFFER_MIGRATION | Controls page migration between subdevices. Defaults to enabled. This is probably tricky for the end-users to decide on. This should probably be a programmatic API. |
| UR_L0_USE_COPY_ENGINE_FOR_D2D_COPY | Controls whether the copy engine is used for device-to-device transfers. Defaults to off. This is again fairly uninvasive. Probably requires a lot of benchmarking to decide what's the best option for each platform. |
| UR_L0_EAGER_INIT | This controls whether command lists are created upfront. Defaults to off. I think we should just enable it by default and remove the option. I'm not sure whether we save anything here but some negligible amount of memory by doing lazy initialization. |
| UR_L0_QUEUE_FINISH_HOLD_LOCK | This is a recently introduced feature flag to avoid deadlocks between queue and its events. To be removed once the feature is proven to be stable. |
| UR_L0_COPY_BATCH_SIZE, UR_L0_BATCH_SIZE | These variables control how batching is performed with normal command lists. Defaults to dynamic adjustment. This should probably be deprecated, but, on the other hand, might be useful for performance profiling. The code is fairly uninvasive. |
| UR_L0_USE_COMPUTE_ENGINE | Same as UR_L0_USE_COPY_ENGINE, this is often useful in debugging for hardcoding which compute engines are used. |
| UR_L0_USE_COPY_ENGINE_FOR_IN_ORDER_QUEUE | Controls whether the copy engine is used for in-order queues. Defaults to enabled. Should probably stay, like other similar options. |
| UR_L0_IMMEDIATE_COMMANDLISTS_EVENT_CLEANUP_THRESHOLD | This is the number of events collected for the immediate command list prior to cleanup. Defaults to 1024. This cleanup can be painful, especially for out-of-order command lists. This should probably be removed, as users have no visibility on how to set this properly. Upcoming optimizations should negate any benefit from this variable. |
| UR_L0_USM_ALLOCATOR | see below. |
| UR_L0_DISABLE_USM_ALLOCATOR | see below. |
| UR_L0_USM_ALLOCATOR_TRACE | Allocator related options are going to be eventually moved to UMF, so these variables will be removed or mapped to UMF ones once that happens. |
| UR_L0_USM_RESIDENT | Controls whether allocations are forced to be resident for device/host/shared memory. Defaults to making device memory resident. This is something that ideally software should control through a programmatic API. |
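
For knobs such as UR_L0_USE_NATIVE_USM_MEMCPY2D, where the table suggests deciding the behavior from the detected driver version, the check could look roughly like the sketch below. Everything here is an assumption for illustration: the `DriverVersion` struct, its three-field layout, and the `useNativeUsmMemcpy2D` function are hypothetical, and the version in which the driver bug was fixed is not stated in this issue, so it is passed in as a parameter rather than hardcoded.

```cpp
#include <cstdint>
#include <tuple>

// Hypothetical decoded driver version. How the raw Level Zero driver version
// maps onto these fields is driver-specific and is not modeled here.
struct DriverVersion {
  std::uint32_t majorVer;
  std::uint32_t minorVer;
  std::uint32_t build;
};

// Lexicographic ordering: major first, then minor, then build.
inline bool operator<(const DriverVersion &a, const DriverVersion &b) {
  return std::tie(a.majorVer, a.minorVer, a.build) <
         std::tie(b.majorVer, b.minorVer, b.build);
}

// Returns true when the detected driver is at least the first version that
// contains the fix, so the native 2D USM memcpy path can be used without an
// environment variable. `firstFixed` is a placeholder supplied by the caller.
inline bool useNativeUsmMemcpy2D(const DriverVersion &detected,
                                 const DriverVersion &firstFixed) {
  return !(detected < firstFixed);
}
```

The adapter would evaluate this once per driver at initialization and cache the result, which removes both the variable and the need for users to know which driver they are running.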
MichalMrozek