I ran the following steps on a trn1.2xlarge instance:
- Activate the Neuron environment:
source /opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/bin/activate
- Download the model:
huggingface-cli download --token hf_xxxxxxxxxxxxxxxxx meta-llama/Llama-3.1-8B-Instruct \
--local-dir ./model_hf/Llama-3.1-8B-Instruct/
- Run the inference command:
inference_demo \
--model-type llama \
--task-type causal-lm \
run \
--model-path /home/ubuntu/model_hf/Llama-3.1-8B-Instruct/ \
--compiled-model-path /home/ubuntu/traced_model/Llama-3.1-8B-Instruct/ \
--torch-dtype bfloat16 \
--tp-degree 1 \
--batch-size 2 \
--max-context-length 32 \
--seq-len 64 \
--on-device-sampling \
--enable-bucketing \
--top-k 1 \
--pad-token-id 2 \
--prompt "I believe the meaning of life is" \
--prompt "The color of the sky is" \
--check-accuracy-mode token-matching \
--benchmark
The instance has 32 GB of host memory, which should be sufficient to hold the Llama-3.1-8B-Instruct weights in bfloat16 (about 16 GB). However, the compilation process was unexpectedly killed.
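For reference, this is the back-of-the-envelope estimate I am basing that on. It is only a rough sketch, assuming ~8.03B parameters stored as bfloat16 (2 bytes each) and ignoring KV cache, activations, and any compiler or runtime overhead:

# Rough weight-memory estimate for Llama-3.1-8B in bfloat16.
# Ignores KV cache, activations, and compilation overhead, all of
# which add to the real footprint during tracing/compilation.
num_params = 8.03e9        # approximate parameter count
bytes_per_param = 2        # bfloat16
weights_gib = num_params * bytes_per_param / 1024**3
print(f"Estimated weight memory: {weights_gib:.1f} GiB")  # ~15 GiB (= ~16 GB)

The full output from the run is below; it ends with the process being killed: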
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/parallel_layers/layers.py:14: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
from .mappings import (
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/parallel_layers/layers.py:14: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
from .mappings import (
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/parallel_layers/layers.py:14: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
from .mappings import (
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/modules/moe/blockwise.py:42: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
component, error = import_nki(config)
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/modules/moe/blockwise.py:42: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
component, error = import_nki(config)
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/modules/moe/blockwise.py:42: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
component, error = import_nki(config)
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/modules/moe/blockwise.py:42: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
component, error = import_nki(config)
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/modules/moe/blockwise.py:42: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
component, error = import_nki(config)
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/modules/moe/blockwise.py:42: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
component, error = import_nki(config)
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/attention/utils.py:14: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
from neuronx_distributed_inference.modules.custom_calls import neuron_cumsum
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/lora_serving/lora_model.py:12: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
from neuronx_distributed_inference.modules.attention.gqa import GQA, GroupQueryAttention_QKV
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/lora_serving/lora_model.py:12: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
from neuronx_distributed_inference.modules.attention.gqa import GQA, GroupQueryAttention_QKV
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/lora_serving/lora_model.py:12: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
from neuronx_distributed_inference.modules.attention.gqa import GQA, GroupQueryAttention_QKV
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/models/dbrx/modeling_dbrx.py:38: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
from neuronx_distributed_inference.modules.attention.attention_base import NeuronAttentionBase
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/models/dbrx/modeling_dbrx.py:38: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
from neuronx_distributed_inference.modules.attention.attention_base import NeuronAttentionBase
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/models/dbrx/modeling_dbrx.py:38: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
from neuronx_distributed_inference.modules.attention.attention_base import NeuronAttentionBase
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/inference_demo.py:26: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
from neuronx_distributed_inference.models.dbrx.modeling_dbrx import NeuronDbrxForCausalLM
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/inference_demo.py:28: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
from neuronx_distributed_inference.models.mixtral.modeling_mixtral import NeuronMixtralForCausalLM
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/models/mllama/modeling_mllama.py:68: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
from .modeling_mllama_vision import NeuronMllamaVisionModel # noqa: E402
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/utils/accuracy.py:32: UserWarning: Intel extension for pytorch not found. For faster CPU references install `intel-extension-for-pytorch`.
warnings.warn(
Loading configs...
WARNING:root:NeuronConfig init: Unexpected keyword arguments: {'model_type': 'llama', 'task_type': 'causal-lm', 'model_path': '/home/ubuntu/model_hf/Llama-3.1-8B-Instruct/', 'compiled_model_path': '/home/ubuntu/traced_model/Llama-3.1-8B-Instruct/', 'benchmark': True, 'check_accuracy_mode': <CheckAccuracyMode.TOKEN_MATCHING: 'token-matching'>, 'divergence_difference_tol': 0.001, 'prompts': ['I believe the meaning of life is', 'The color of the sky is'], 'top_k': 1, 'top_p': 1.0, 'temperature': 1.0, 'do_sample': False, 'dynamic': False, 'pad_token_id': 2, 'top_k_kernel_enabled': False, 'on_device_sampling': True, 'early_expert_affinity_modulation': False, 'disable_normalize_top_k_affinities': False, 'fused_shared_experts': False, 'enable_torch_dist': False, 'is_chunked_prefill': False, 'enable_lora': False, 'max_loras': 1, 'max_lora_rank': 16, 'benchmark_report_path': './benchmark_report.json', 'skip_warmup': False, 'skip_compile': False, 'compile_only': False, 'compile_dry_run': False, 'hlo_debug': False, 'apply_seq_ids_mask': False, 'enable_output_completion_notifications': False}
Compiling and saving model...
INFO:Neuron:Saving the neuron_config to /home/ubuntu/traced_model/Llama-3.1-8B-Instruct/
INFO:Neuron:Generating HLOs for the following models: ['context_encoding_model', 'token_generation_model']
[2025-08-20 22:44:15.794: I neuronx_distributed/parallel_layers/parallel_state.py:628] > initializing tensor model parallel with size 1
[2025-08-20 22:44:15.794: I neuronx_distributed/parallel_layers/parallel_state.py:629] > initializing pipeline model parallel with size 1
[2025-08-20 22:44:15.794: I neuronx_distributed/parallel_layers/parallel_state.py:630] > initializing context model parallel with size 1
[2025-08-20 22:44:15.794: I neuronx_distributed/parallel_layers/parallel_state.py:631] > initializing data parallel with size 1
[2025-08-20 22:44:15.794: I neuronx_distributed/parallel_layers/parallel_state.py:632] > initializing world size to 1
[2025-08-20 22:44:15.795: I neuronx_distributed/parallel_layers/parallel_state.py:379] [rank_0_pp-1_tp-1_dp-1_cp-1] Chosen Logic for replica groups ret_logic=<PG_Group_Logic.LOGIC1: (<function ascending_ring_PG_group at 0x7aa344b52830>, 'Ascending Ring PG Group')>
[2025-08-20 22:44:15.795: I neuronx_distributed/parallel_layers/parallel_state.py:668] [rank_0_pp-1_tp-1_dp-1_cp-1] tp_groups: replica_groups.tp_groups=[[0]]
[2025-08-20 22:44:15.795: I neuronx_distributed/parallel_layers/parallel_state.py:669] [rank_0_pp-1_tp-1_dp-1_cp-1] dp_groups: replica_groups.dp_groups=[[0]]
[2025-08-20 22:44:15.795: I neuronx_distributed/parallel_layers/parallel_state.py:670] [rank_0_pp-1_tp-1_dp-1_cp-1] pp_groups: replica_groups.pp_groups=[[0]]
[2025-08-20 22:44:15.795: I neuronx_distributed/parallel_layers/parallel_state.py:671] [rank_0_pp-1_tp-1_dp-1_cp-1] cp_groups: replica_groups.cp_groups=[[0]]
[2025-08-20 22:44:15.795: I neuronx_distributed/parallel_layers/parallel_state.py:672] [rank_0_pp-1_tp-1_dp-1_cp-1] ep_model_groups: replica_groups.ep_model_groups=[[0]]
[2025-08-20 22:44:15.795: I neuronx_distributed/parallel_layers/parallel_state.py:673] [rank_0_pp-1_tp-1_dp-1_cp-1] ep_data_groups: replica_groups.ep_data_groups=[[0]]
INFO:Neuron:Generating 1 hlos for key: context_encoding_model
INFO:Neuron:Started loading module context_encoding_model
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
WARNING:Neuron:TP degree (1) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
INFO:Neuron:Finished loading module context_encoding_model in 0.05517268180847168 seconds
INFO:Neuron:generating HLO: context_encoding_model, input example shape = torch.Size([2, 32])
Killed
Is it possible to run Llama-3.1-8B-Instruct on a trn1.2xlarge instance?
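If it helps with triage, I can rerun the compilation while sampling host memory from a second shell to confirm whether the "Killed" message comes from the Linux OOM killer. A minimal sketch of the helper I would use (my own script, not part of the Neuron tooling; assumes psutil is installed in the same venv):

# Poll host memory every few seconds while inference_demo runs.
import time
import psutil

while True:
    mem = psutil.virtual_memory()
    print(f"used={mem.used / 1024**3:.1f} GiB  "
          f"available={mem.available / 1024**3:.1f} GiB")
    time.sleep(5)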