
Conversation

@gziz
Contributor

@gziz gziz commented Jan 3, 2026

Summary

Fixes #939
This PR updates the YaRN RoPE (Rotary Position Embedding) implementation in the OLMO 3 standalone notebooks to align with the HuggingFace reference implementation, resolving an issue where generation degraded into gibberish after a few sentences.

Background

The current implementation handles YaRN by scaling the position indices directly:

```python
# Current approach: divide positions by the scaling factor and clamp
if rope_type == "yarn":
    positions = positions / rope_factor
    positions = torch.clamp(positions, max=rope_orig_max - 1)
```

After reviewing the YaRN paper and the HuggingFace implementation, it appears YaRN should instead scale the inverse frequencies, which handles extended context lengths more gracefully.
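One way to see the problem with position scaling is that the clamp collapses all sufficiently large positions onto the same index, so tokens past that point become positionally indistinguishable. A minimal sketch with toy values (not the actual OLMO 3 config):

```python
import torch

# Illustrative values only: original context length 8, scaling factor 4
rope_orig_max = 8
rope_factor = 4.0

positions = torch.arange(0, 40, dtype=torch.float32)
scaled = torch.clamp(positions / rope_factor, max=rope_orig_max - 1)

# Every position >= rope_factor * (rope_orig_max - 1) = 28 collapses to 7.0
print(scaled[26:32])  # tensor([6.5000, 6.7500, 7.0000, 7.0000, 7.0000, 7.0000])
```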

Proposed Changes

Updated the YaRN algorithm to match the HuggingFace transformers implementation:

  1. Frequency-dependent scaling: different frequency components are scaled differently
  2. Added beta_fast and beta_slow parameters (32.0 and 1.0 respectively, matching OLMO 3's config)
  3. Linear ramp blending: Smoothly interpolates between:
    • High frequencies → unchanged (extrapolation)
    • Low frequencies → scaled by rope_factor (interpolation)
```python
# Updated approach: scale inverse frequencies instead of positions
inv_freq_extrapolation = 1.0 / pos_freqs
inv_freq_interpolation = 1.0 / (rope_factor * pos_freqs)

low, high = find_correction_range(beta_fast, beta_slow, dim, theta_base, rope_orig_max)
inv_freq_extrapolation_factor = 1 - linear_ramp_factor(low, high, dim // 2)

inv_freq = (
    inv_freq_interpolation * (1 - inv_freq_extrapolation_factor)
    + inv_freq_extrapolation * inv_freq_extrapolation_factor
)
```
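The snippet above calls two helpers, `find_correction_range` and `linear_ramp_factor`, that are not shown here. For context, a sketch of how they are defined in the HuggingFace transformers YaRN implementation (signatures chosen to match the call sites above; the notebook versions may differ slightly):

```python
import math
import torch

def find_correction_dim(num_rotations, dim, base, max_position_embeddings):
    # Dimension index at which a frequency completes `num_rotations` full
    # rotations over the original context length (inverse of the rotation count)
    return (dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) / (
        2 * math.log(base)
    )

def find_correction_range(low_rot, high_rot, dim, base, max_position_embeddings):
    # Dimension range over which interpolation and extrapolation are blended
    low = math.floor(find_correction_dim(low_rot, dim, base, max_position_embeddings))
    high = math.ceil(find_correction_dim(high_rot, dim, base, max_position_embeddings))
    return max(low, 0), min(high, dim - 1)  # clamp to valid dimension indices

def linear_ramp_factor(low, high, dim):
    # Ramp from 0 to 1 across [low, high], clamped to [0, 1] outside it
    if low == high:
        high += 0.001  # avoid division by zero
    linear_func = (torch.arange(dim, dtype=torch.float32) - low) / (high - low)
    return torch.clamp(linear_func, 0, 1)
```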

Key modifications in both notebooks

  • compute_rope_params function: Updated YaRN logic to use frequency-based scaling
  • OLMO3_CONFIG_7B: Added beta_fast: 32.0 and beta_slow: 1.0 parameters
  • OLMO3_CONFIG_32B: Added beta_fast: 32.0 and beta_slow: 1.0 parameters
  • Olmo3Model.__init__: Updated to pass beta_fast and beta_slow to compute_rope_params
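As an illustration of the config change, the rope-related entries might look like the following sketch (only the keys mentioned above are shown; the surrounding fields of the actual notebook dict, and the `rope_factor` value here, are placeholders):

```python
# Illustrative sketch of the rope-related entries in OLMO3_CONFIG_7B;
# the real notebook dict contains many more fields (vocab size, dims, etc.)
OLMO3_CONFIG_7B = {
    "rope_type": "yarn",
    "rope_factor": 8.0,  # placeholder value for illustration
    "beta_fast": 32.0,   # new: boundary for high-frequency (extrapolated) dims
    "beta_slow": 1.0,    # new: boundary for low-frequency (interpolated) dims
}
```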

Example Outputs

Prompt: "Tell me about large language models"

Response (current approach):

Large language models 
(LLMs are artificial intelligence models (LLMs) are a type of artificial neural networks trained to learn to predict the next word or next token (word given a sequence of text. They are trained on a text. The are trained on huge dataset of text. The dataset of all the internet texts. They can generate text. They can generate text. They can answer question, code, write code, story, poem, song, code, code, code, code, code, code, code, code, code, essay, code, code, code
...

Response (with proposed fix):

Certainly! Large Language Models (LLMs) are a type of artificial intelligence designed to understand, generate, and respond to human language. Here's an overview:

---

### **What are Large Language Models?**

LLMs are deep learning models trained on vast amounts of text data from books, websites, articles, and other sources. They use neural network architectures—most commonly transformer models (like the one in GPT or BERT)—that allow them to recognize patterns and relationships in language.

---

### **Key Features**

1. **Scale:**  
   - "Large" refers to both the size of the model (in billions or trillions of parameters) and the massive datasets used for training.

... (truncated — response continues coherently)

Testing

  • Verified coherent text generation with Olmo-3-7B-Instruct
  • Output quality now matches the official HuggingFace implementation
  • All 3 tests in the olmo3/tests directory pass as expected
[Screenshot, 2026-01-03]

Thank you for considering this contribution! Please let me know if you have any questions or would like me to make any adjustments.

@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



@rasbt
Owner

rasbt commented Jan 3, 2026

Thanks a lot for the PR. I added the original YaRN configs from HF to my layer debugger tool, and you were right: this is necessary to match the HF outputs. It all looks correct to me. Thanks for this great PR!

@rasbt rasbt merged commit 491fd58 into rasbt:main Jan 4, 2026
13 checks passed
@gziz
Contributor Author

gziz commented Jan 4, 2026

Glad I could help :)

Successfully merging this pull request may close these issues: Olmo 3 YaRN RoPE implementation bug (#939)