When I train a ResNet-50 model on Trainium and compile it with the following compiler options:
--target=trn1 --framework=XLA --optlevel=1
I am getting this compiler error:
2025-03-26T03:43:31Z ERROR 42759 [neuronxcc.driver.CommandDriver]: ***************************************************************
2025-03-26T03:43:31Z ERROR 42759 [neuronxcc.driver.CommandDriver]: An Internal Compiler Error has occurred
2025-03-26T03:43:31Z ERROR 42759 [neuronxcc.driver.CommandDriver]: ***************************************************************
2025-03-26T03:43:31Z ERROR 42759 [neuronxcc.driver.CommandDriver]:
2025-03-26T03:43:31Z USER 42759 [neuronxcc.driver.CommandDriver]: Warning: Non-output memory location with no reader: {bias_memset.719}@SB<0,0>(128x2)#Internal DebugInfo: <bias_memset.719||UNDEF||[128, 1, 1]>
[NLA001] Unhandled exception with message: === BIR error ===
Reason: Access pattern out of bound.
Instruction: I-6012-337-accel_sg0000
Opcode: Memset
Instruction Source: (bfloat16<27 x 460> $6012[i0_250_0_0_0, i0_250_0_0_1, i0_250_0_1, i2_250_0, i0_250_1_0, i2_163_2067_0_i2_163_2067_1_1_0_0_0, i2_163_2067_0_i2_163_2067_1_1_0_0_1, c0_2878_1_1_4745, c1_2879_4745_0_1_1, c1_2879_4745_1]:6012)0:
Argument AP:
Access Pattern: [[231,27],[1,232],[1,1]]
Offset: 0
Memory Location: {_convolution.13.4743_i90_sg0000}@SB<0,55840>(27x924)#Internal DebugInfo: <_convolution.13.4743||UNDEF||[27, 460, 1]>
- Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.
I have attached two files to help reproduce the issue. (The traffic.txt file is actually a Python file, but I could not upload it in its original form.)
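The error message points to the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables for more information. Below is a minimal sketch of turning them on from Python before the training script imports torch_xla. The value "1" and the helper name `enable_xla_debug` are my assumptions, not something taken from the log above.

```python
import os


def enable_xla_debug(env=None):
    """Set the two debug variables named in the compiler error message.

    Assumption: both variables are enabled with the value "1".
    `env` defaults to the real process environment; a plain dict can be
    passed instead to try the helper without touching os.environ.
    """
    env = os.environ if env is None else env
    env["XLA_IR_DEBUG"] = "1"
    env["XLA_HLO_DEBUG"] = "1"
    return env


if __name__ == "__main__":
    enable_xla_debug()
    # Import torch_xla and start training only after this point, so the
    # variables are already set when the XLA graph is traced and compiled.
    print({k: os.environ[k] for k in ("XLA_IR_DEBUG", "XLA_HLO_DEBUG")})
```

The same effect can be had by exporting both variables in the shell before launching the script, which avoids any import-order concerns.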