
Conversation

@xenova (Collaborator) commented on Jul 4, 2025

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@xenova merged commit 4b7a3aa into main on Jul 4, 2025
4 checks passed
@xenova deleted the add-ernie4_5 branch on July 4, 2025 at 17:07
kunal-vaishnavi added a commit to microsoft/onnxruntime-genai that referenced this pull request on Jul 7, 2025:
Enables exporting the new Ernie 4.5 models via onnxruntime-genai:
https://huggingface.co/baidu/ERNIE-4.5-0.3B-PT
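
For reference, the export goes through onnxruntime-genai's model builder. A minimal sketch of the invocation (the flag values here are assumptions; check the model builder docs for the precision and execution-provider options you need):

```sh
# Export the HF checkpoint to ONNX via the onnxruntime-genai model builder
python -m onnxruntime_genai.models.builder \
  -m baidu/ERNIE-4.5-0.3B-PT \
  -o ./ERNIE-4.5-0.3B-ONNX \
  -p fp32 \
  -e cpu
```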

I've uploaded the converted model to
https://huggingface.co/onnx-community/ERNIE-4.5-0.3B-ONNX.
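
To reproduce the validation below, one way to fetch the converted model locally is via `huggingface_hub` (a sketch; the repo's exact file layout is an assumption, so adjust the ONNX filename or subfolder as needed):

```py
from huggingface_hub import snapshot_download

# Download the converted repo; point `path_to_model` in the script below at it
path_to_model = snapshot_download("onnx-community/ERNIE-4.5-0.3B-ONNX")
```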

Currently, only the non-MoE version is supported, but perhaps someone can help with the MoE version:
https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-PT

---

Models were tested and validated with Python onnxruntime and
[transformers.js](huggingface/transformers.js#1354):

```py
from transformers import AutoConfig, AutoTokenizer
import onnxruntime
import numpy as np

# 1. Load config, tokenizer, and model
path_to_model = "./path/to/model"
config = AutoConfig.from_pretrained("baidu/ERNIE-4.5-0.3B-PT", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("baidu/ERNIE-4.5-0.3B-PT", trust_remote_code=True)
decoder_session = onnxruntime.InferenceSession(f"{path_to_model}/model.onnx")

## Set config values
num_key_value_heads = config.num_key_value_heads
head_dim = config.head_dim
num_hidden_layers = config.num_hidden_layers
eos_token_id = config.eos_token_id

# 2. Prepare inputs
## Create input messages
messages = [
  { "role": "system", "content": "You are a helpful assistant." },
  { "role": "user", "content": "Write me a poem about Machine Learning." },
]

## Apply tokenizer
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="np")

## Prepare decoder inputs
batch_size = inputs['input_ids'].shape[0]
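## Start with an empty KV cache (sequence length 0); each session run returns
## the updated cache as `present_key_values`, which is fed back in below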
past_key_values = {
    f'past_key_values.{layer}.{kv}': np.zeros([batch_size, num_key_value_heads, 0, head_dim], dtype=np.float32)
    for layer in range(num_hidden_layers)
    for kv in ('key', 'value')
}
input_ids = inputs['input_ids']
position_ids = np.tile(np.arange(1, input_ids.shape[-1] + 1), (batch_size, 1))
attention_mask = np.ones_like(input_ids, dtype=np.int64)

# 3. Generation loop
max_new_tokens = 1024
generated_tokens = np.array([[]], dtype=np.int64)
for i in range(max_new_tokens):
  logits, *present_key_values = decoder_session.run(None, dict(
      input_ids=input_ids,
      attention_mask=attention_mask,
      position_ids=position_ids,
      **past_key_values,
  ))

  ## Update values for next generation loop
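  ## Greedy decoding: take the arg max over the last position's logits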
  input_ids = logits[:, -1].argmax(-1, keepdims=True)
  attention_mask = np.concatenate([attention_mask, np.ones_like(input_ids, dtype=np.int64)], axis=-1)
  position_ids = position_ids[:, -1:] + 1
  for j, key in enumerate(past_key_values):
    past_key_values[key] = present_key_values[j]

  generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)
  if (input_ids == eos_token_id).all():
    break

  ## (Optional) Streaming
  print(tokenizer.decode(input_ids[0]), end='', flush=True)
print()

# 4. Output result
print(tokenizer.batch_decode(generated_tokens))
```
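
Since this lands in onnxruntime-genai, the exported model can also be driven through its own Python API instead of raw onnxruntime. A minimal sketch (the API has shifted between releases, so details like `append_tokens` are assumptions based on recent versions; adjust for yours):

```py
import onnxruntime_genai as og

# Folder produced by the model builder (config + ONNX weights)
model = og.Model("./path/to/model")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=1024)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Write me a poem about Machine Learning."))

# Generate token-by-token until the search hits EOS or max_length
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```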

---------

Co-authored-by: kunal-vaishnavi <[email protected]>