amd bf16 gpu float support

c4aa5c4
Open

amd bf16 gpu float support #2859

ROCm Repo Management API / Jenkins failed Nov 14, 2025 in 2h 18m 23s

Test required TF and ROCm versions/Test required TF and ROCm versions/Run tests: error in 'error' step

Test required TF and ROCm versions / Test required TF and ROCm versions / Test required TF and ROCm versions / Run tests / Shell Script

Error in sh step, with arguments docker exec ff6473cbce5178b0b6a4e667ab6cca26cb7614603eff6a14eeb94f449be1f77a tensorflow/tools/ci_build/linux/rocm/run_gpu_multi.sh.

script returned exit code 1
Build log
[2025-11-14T12:30:38.914Z] + docker exec ff6473cbce5178b0b6a4e667ab6cca26cb7614603eff6a14eeb94f449be1f77a tensorflow/tools/ci_build/linux/rocm/run_gpu_multi.sh
[2025-11-14T12:30:38.914Z] ++ grep -c '^processor' /proc/cpuinfo
[2025-11-14T12:30:38.914Z] + N_BUILD_JOBS=128
[2025-11-14T12:30:38.914Z] + N_TEST_JOBS=1
[2025-11-14T12:30:38.914Z] + echo ''
[2025-11-14T12:30:38.914Z] + echo 'Bazel will use 128 concurrent build job(s) and 1 concurrent test job(s).'
[2025-11-14T12:30:38.914Z] + echo ''
[2025-11-14T12:30:38.914Z] + [[ -n '' ]]
[2025-11-14T12:30:38.914Z] + [[ -z /opt/rocm-7.0.2 ]]
[2025-11-14T12:30:38.914Z] + ROCM_INSTALL_DIR=/opt/rocm-7.0.2
[2025-11-14T12:30:38.914Z] 
[2025-11-14T12:30:38.914Z] Bazel will use 128 concurrent build job(s) and 1 concurrent test job(s).
[2025-11-14T12:30:38.914Z] 
[2025-11-14T12:30:38.914Z] ++ which python3
[2025-11-14T12:30:38.914Z] + export PYTHON_BIN_PATH=/usr/bin/python3
[2025-11-14T12:30:38.914Z] + PYTHON_BIN_PATH=/usr/bin/python3
[2025-11-14T12:30:38.914Z] ++ python3 -c 'import sys;print(f'\''{sys.version_info.major}.{sys.version_info.minor}'\'')'
[2025-11-14T12:30:38.914Z] + PYTHON_VERSION=3.11
[2025-11-14T12:30:38.914Z] + export TF_PYTHON_VERSION=3.11
[2025-11-14T12:30:38.914Z] + TF_PYTHON_VERSION=3.11
[2025-11-14T12:30:38.914Z] + export TF_NEED_ROCM=1
[2025-11-14T12:30:38.914Z] + TF_NEED_ROCM=1
[2025-11-14T12:30:38.914Z] + export ROCM_PATH=/opt/rocm-7.0.2
[2025-11-14T12:30:38.914Z] + ROCM_PATH=/opt/rocm-7.0.2
[2025-11-14T12:30:38.914Z] + '[' -f /usertools/rocm.bazelrc ']'
[2025-11-14T12:30:38.914Z] + bazel --bazelrc=/usertools/rocm.bazelrc test --local_test_jobs=1 --jobs=128 --config=sigbuild_local_cache --config=rocm --config=nonpip_multi_gpu --action_env=TF_PYTHON_VERSION=3.11
[2025-11-14T12:30:38.914Z] 2025/11/14 12:30:38 Downloading https://releases.bazel.build/6.5.0/release/bazel-6.5.0-linux-x86_64...
[2025-11-14T12:31:57.683Z] Extracting Bazel installation...
[2025-11-14T12:31:57.683Z] Starting local Bazel server and connecting to it...
[2025-11-14T12:31:57.683Z] INFO: Invocation ID: 64078f4c-b9c8-4773-b08b-cb12217f2168
[2025-11-14T12:31:57.683Z] INFO: Reading 'startup' options from /tf/tensorflow/.bazelrc: --windows_enable_symlinks
[2025-11-14T12:31:57.683Z] INFO: Options provided by the client:
[2025-11-14T12:31:57.683Z]   Inherited 'common' options: --isatty=0 --terminal_columns=80
[2025-11-14T12:31:57.683Z] INFO: Reading rc options for 'test' from /tf/tensorflow/.bazelrc:
[2025-11-14T12:31:57.683Z]   Inherited 'common' options: --experimental_repo_remote_exec
[2025-11-14T12:31:57.683Z] INFO: Reading rc options for 'test' from /etc/bazel.bazelrc:
[2025-11-14T12:31:57.683Z]   Inherited 'build' options: --action_env=DOCKER_CACHEBUSTER=1760616272700139632 --host_action_env=DOCKER_HOST_CACHEBUSTER=1760616272835767562
[2025-11-14T12:31:57.683Z] INFO: Reading rc options for 'test' from /tf/tensorflow/.bazelrc:
[2025-11-14T12:31:57.683Z]   Inherited 'build' options: --define framework_shared_object=true --define tsl_protobuf_header_only=true --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --features=-force_no_whole_archive --enable_platform_specific_config --define=with_xla_support=true --config=short_logs --config=v2 --define=no_aws_support=true --define=no_hdfs_support=true --experimental_cc_shared_library --experimental_link_static_libraries_once=false --incompatible_enforce_config_setting_visibility
[2025-11-14T12:31:57.683Z] INFO: Reading rc options for 'test' from /usertools/gpu.bazelrc:
[2025-11-14T12:31:57.683Z]   Inherited 'build' options: --action_env=CACHEBUSTER=565341047
[2025-11-14T12:31:57.683Z] INFO: Reading rc options for 'test' from /usertools/gpu.bazelrc:
[2025-11-14T12:31:57.683Z]   'test' options: --test_output=errors --test_timeout=920,2400,7200,9600 --local_test_jobs=4 --run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute
[2025-11-14T12:31:57.683Z] INFO: Found applicable config definition build:short_logs in file /tf/tensorflow/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
[2025-11-14T12:31:57.683Z] INFO: Found applicable config definition build:v2 in file /tf/tensorflow/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
[2025-11-14T12:31:57.683Z] INFO: Found applicable config definition build:sigbuild_local_cache in file /usertools/gpu.bazelrc: --disk_cache=/tf/cache
[2025-11-14T12:31:57.683Z] INFO: Found applicable config definition build:rocm in file /tf/tensorflow/.bazelrc: --config=rocm_base --config=release_cpu_linux_base --action_env=CLANG_COMPILER_PATH=/usr/lib/llvm-18/bin/clang --action_env=TF_ROCM_CLANG=1 --linkopt=-fuse-ld=lld --linkopt=-Wl,--undefined-version --copt=-Wno-gnu-offsetof-extensions --copt=-Wno-unused-result
[2025-11-14T12:31:57.683Z] INFO: Found applicable config definition build:rocm_base in file /tf/tensorflow/.bazelrc: --crosstool_top=@local_config_rocm//crosstool:toolchain --define=using_rocm_hipcc=true --define=tensorflow_mkldnn_contraction_kernel=0 --repo_env TF_NEED_ROCM=1 --config=no_tfrt
[2025-11-14T12:31:57.683Z] INFO: Found applicable config definition build:no_tfrt in file /tf/tensorflow/.bazelrc: --deleted_packages=tensorflow/compiler/mlir/tfrt,tensorflow/compiler/mlir/tfrt/benchmarks,tensorflow/compiler/mlir/tfrt/ir,tensorflow/compiler/mlir/tfrt/ir/mlrt,tensorflow/compiler/mlir/tfrt/jit/python_binding,tensorflow/compiler/mlir/tfrt/jit/transforms,tensorflow/compiler/mlir/tfrt/python_tests,tensorflow/compiler/mlir/tfrt/tests,tensorflow/compiler/mlir/tfrt/tests/ifrt,tensorflow/compiler/mlir/tfrt/tests/mlrt,tensorflow/compiler/mlir/tfrt/tests/ir,tensorflow/compiler/mlir/tfrt/tests/analysis,tensorflow/compiler/mlir/tfrt/tests/jit,tensorflow/compiler/mlir/tfrt/tests/lhlo_to_tfrt,tensorflow/compiler/mlir/tfrt/tests/lhlo_to_jitrt,tensorflow/compiler/mlir/tfrt/tests/tf_to_corert,tensorflow/compiler/mlir/tfrt/tests/tf_to_tfrt_data,tensorflow/compiler/mlir/tfrt/tests/saved_model,tensorflow/compiler/mlir/tfrt/transforms/lhlo_gpu_to_tfrt_gpu,tensorflow/compiler/mlir/tfrt/transforms/mlrt,tensorflow/core/runtime_fallback,tensorflow/core/runtime_fallback/conversion,tensorflow/core/runtime_fallback/kernel,tensorflow/core/runtime_fallback/opdefs,tensorflow/core/runtime_fallback/runtime,tensorflow/core/runtime_fallback/util,tensorflow/core/runtime_fallback/test,tensorflow/core/runtime_fallback/test/gpu,tensorflow/core/runtime_fallback/test/saved_model,tensorflow/core/runtime_fallback/test/testdata,tensorflow/core/tfrt/stubs,tensorflow/core/tfrt/tfrt_session,tensorflow/core/tfrt/mlrt,tensorflow/core/tfrt/mlrt/attribute,tensorflow/core/tfrt/mlrt/kernel,tensorflow/core/tfrt/mlrt/bytecode,tensorflow/core/tfrt/mlrt/interpreter,tensorflow/compiler/mlir/tfrt/translate/mlrt,tensorflow/compiler/mlir/tfrt/translate/mlrt/testdata,tensorflow/core/tfrt/gpu,tensorflow/core/tfrt/run_handler_thread_pool,tensorflow/core/tfrt/runtime,tensorflow/core/tfrt/saved_model,tensorflow/core/tfrt/graph_executor,tensorflow/core/tfrt/saved_model/tests,tensorflow/core/tfrt/tpu,tensorflow/core/tfrt/utils,tensorflow/core/tfrt/utils/debug,tensorflow/core/tfrt/saved_model/python,tensorflow/core/tfrt/graph_executor/python,tensorflow/core/tfrt/saved_model/utils
[2025-11-14T12:31:57.683Z] INFO: Found applicable config definition build:release_cpu_linux_base in file /tf/tensorflow/.bazelrc: --repo_env=CC=/usr/lib/llvm-18/bin/clang --repo_env=BAZEL_COMPILER=/usr/lib/llvm-18/bin/clang --action_env=CLANG_COMPILER_PATH=/usr/lib/llvm-18/bin/clang --linkopt=-fuse-ld=lld
[2025-11-14T12:31:57.683Z] INFO: Found applicable config definition test:rocm in file /usertools/gpu.bazelrc: --test_env=HSA_TOOLS_LIB=libroctracer64.so --test_sharding_strategy=disabled --action_env=TF_ENABLE_ONEDNN_OPTS=0 --action_env=OPENBLAS_CORETYPE=Haswell
[2025-11-14T12:31:57.683Z] INFO: Found applicable config definition test:nonpip_multi_gpu in file /usertools/gpu.bazelrc: --config=nonpip_filters_multi_gpu -- //tensorflow/core/nccl:nccl_manager_test_2gpu //tensorflow/python/distribute/integration_test:mwms_peer_failure_test_2gpu //tensorflow/python/distribute:checkpoint_utils_test_2gpu //tensorflow/python/distribute:checkpointing_test_2gpu //tensorflow/python/distribute:collective_all_reduce_strategy_test_2gpu //tensorflow/python/distribute:collective_all_reduce_strategy_test_xla_2gpu //tensorflow/python/distribute:custom_training_loop_gradient_test_2gpu //tensorflow/python/distribute:custom_training_loop_input_test_2gpu //tensorflow/python/distribute:distribute_utils_test_2gpu //tensorflow/python/distribute:input_lib_test_2gpu //tensorflow/python/distribute:input_lib_type_spec_test_2gpu //tensorflow/python/distribute:metrics_v1_test_2gpu //tensorflow/python/distribute:mirrored_variable_test_2gpu //tensorflow/python/distribute:parameter_server_strategy_test_2gpu //tensorflow/python/distribute:values_test_2gpu //tensorflow/python/distribute:ps_values_test_2gpu //tensorflow/python/distribute:random_generator_test_2gpu //tensorflow/python/distribute:test_util_test_2gpu //tensorflow/python/distribute:tf_function_test_2gpu //tensorflow/python/distribute:vars_test_2gpu //tensorflow/python/distribute:warm_starting_util_test_2gpu //tensorflow/python/training:saver_test_2gpu //tensorflow/python/distribute/v1:cross_device_ops_test_2gpu //tensorflow/python/distribute:cross_device_ops_test_2gpu //tensorflow/python/distribute:mirrored_strategy_test_2gpu //tensorflow/python/kernel_tests:collective_ops_test_2gpu //tensorflow/python/ops:collective_ops_gpu_test_2gpu //tensorflow/python/ops:nccl_ops_test_2gpu //tensorflow/dtensor/python/tests:multi_client_test_2gpus //tensorflow/dtensor/python/tests:multi_client_test_nccl_2gpus //tensorflow/dtensor/python/tests:multi_client_test_nccl_local_2gpus //tensorflow/python/distribute/experimental:multi_worker_mirrored_strategy_test_2gpus //tensorflow/python/distribute:strategy_common_test_2gpu //tensorflow/python/distribute:strategy_common_test_xla_2gpu //tensorflow/python/distribute:strategy_gather_test_2gpu //tensorflow/python/distribute:strategy_gather_test_xla_2gpu
[2025-11-14T12:31:57.683Z] INFO: Found applicable config definition test:nonpip_filters_multi_gpu in file /usertools/gpu.bazelrc: --test_tag_filters=-no_gpu,-cuda-only --build_tag_filters=-no_gpu,-cuda-only --test_lang_filters=py --flaky_test_attempts=2 --test_size_filters=small,medium,large --test_env=TF_PER_DEVICE_MEMORY_LIMIT_MB=2048
[2025-11-14T12:31:57.683Z] INFO: Found applicable config definition build:linux in file /tf/tensorflow/.bazelrc: --host_copt=-w --copt=-Wno-all --copt=-Wno-extra --copt=-Wno-deprecated --copt=-Wno-deprecated-declarations --copt=-Wno-ignored-attributes --copt=-Wno-array-bounds --copt=-Wunused-result --copt=-Werror=unused-result --copt=-Wswitch --copt=-Werror=switch --copt=-Wno-error=unused-but-set-variable --linkopt=-Wl,--undefined-version --host_linkopt=-Wl,--undefined-version --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --config=dynamic_kernels --experimental_guard_against_concurrent_changes
[2025-11-14T12:31:57.683Z] INFO: Found applicable config definition build:dynamic_kernels in file /tf/tensorflow/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
[2025-11-14T12:31:57.683Z] Loading: 
[2025-11-14T12:31:57.683Z] Loading: 
[2025-11-14T12:31:57.683Z] DEBUG: /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/external/local_xla/third_party/py/python_repo.bzl:110:10: Using hermetic Python 3.11
[2025-11-14T12:31:57.683Z] Loading: 0 packages loaded
[2025-11-14T12:31:57.683Z] ERROR: no such target '//tensorflow/python/distribute/v1:cross_device_ops_test_2gpu': target 'cross_device_ops_test_2gpu' not declared in package 'tensorflow/python/distribute/v1' defined by /tf/tensorflow/tensorflow/python/distribute/v1/BUILD (did you mean 'cross_device_ops_test_gpu'? Tip: use `query "//tensorflow/python/distribute/v1:*"` to see all the targets in that package)
[2025-11-14T12:31:57.683Z] INFO: Elapsed time: 34.809s
[2025-11-14T12:31:57.683Z] INFO: 0 processes.
[2025-11-14T12:31:57.683Z] FAILED: Build did NOT complete successfully (9 packages loaded)
[2025-11-14T12:31:57.683Z] ERROR: Couldn't start the build. Unable to run tests
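
The log above stops on a single unresolved label, and Bazel's own tip suggests querying the package to see which targets exist. As a rough sketch of how that triage could be scripted, the snippet below pulls the missing label out of a "no such target" line and prints the package-wide query from the tip. The helper names `missing_targets` and `package_query` are hypothetical and not part of the CI scripts; the sample line is shortened from the log.

```shell
#!/bin/sh
# Hypothetical helper: read log text on stdin and print the label from each
# Bazel "no such target" error, one per line.
missing_targets() {
  sed -n "s/.*ERROR: no such target '\([^']*\)'.*/\1/p"
}

# Hypothetical helper: turn //pkg/path:name into the package-wide query
# that Bazel's tip recommends (//pkg/path:*).
package_query() {
  printf 'bazel query "%s:*"\n' "${1%%:*}"
}

# Sample line shortened from the build log above.
sample="[2025-11-14T12:31:57.683Z] ERROR: no such target '//tensorflow/python/distribute/v1:cross_device_ops_test_2gpu': target 'cross_device_ops_test_2gpu' not declared in package 'tensorflow/python/distribute/v1'"

printf '%s\n' "$sample" | missing_targets | while read -r label; do
  echo "missing: $label"
  package_query "$label"
done
```

Running the printed query inside the container should show whether the package now declares `cross_device_ops_test_gpu`, as the error message hints, in place of the `_2gpu` name listed under `test:nonpip_multi_gpu` in /usertools/gpu.bazelrc.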

Test required TF and ROCm versions / Test required TF and ROCm versions / Test required TF and ROCm versions / Run tests / Error signal

Error in error step, with arguments Error detected when building or testing TensorFlow.

Error detected when building or testing TensorFlow

Details

  • Test required TF and ROCm versions (2 hr 18 min)
    • Test required TF and ROCm versions (2 hr 18 min)
      • Test required TF and ROCm versions (2 hr 18 min)
        • Clean up workspace on node (3 sec)
        • Initialization (1.8 sec)
        • Cloning repositories (3 min 57 sec)
        • Run tests (6 min 27 sec)
          Error: script returned exit code 1
          Error: Error detected when building or testing TensorFlow
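
The run failed while Bazel was still loading targets (0 processes, 9 packages loaded), because one label in the `test:nonpip_multi_gpu` list could not be resolved. One possible guard, sketched here under assumptions, is a pre-flight step that asks `bazel query` about each label and reports the ones that do not resolve before the test invocation starts. The function `check_targets` and the overridable `BAZEL` variable are illustrative names, not something the existing scripts are known to provide.

```shell
#!/bin/sh
# Assumption for illustration: BAZEL can be overridden, so the check can be
# exercised with a stub instead of a real Bazel binary.
BAZEL="${BAZEL:-bazel}"

# Print every label that `bazel query` cannot resolve.
# Returns non-zero if at least one label is unresolved.
check_targets() {
  status=0
  for label in "$@"; do
    if ! "$BAZEL" query "$label" >/dev/null 2>&1; then
      echo "unresolved target: $label"
      status=1
    fi
  done
  return "$status"
}
```

A call such as `check_targets //tensorflow/python/distribute/v1:cross_device_ops_test_2gpu || exit 1` ahead of the `bazel test` line would surface the stale label directly. Each query starts or contacts the Bazel server, so this adds some time per label.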