This repository is my attempt at converting a pretrained LLM like Qwen3-0.6B into a Recurrent Transformer that loops in the center blocks, similar to Huginn.
Recurrent Transformers offer several architectural advantages, such as lower memory usage (traded off against inference-time compute) and the ability to learn iterative fixed-point algorithms. However, there are almost no pretrained Recurrent Transformers at the scale of dense LLMs (Huginn being the only one). So I thought: let's try converting a pretrained dense LLM into a Recurrent Transformer.
With Qwen3-0.6B, I converted the model by treating layers 8-20 as a new recurrent block. I added an adapter that takes the output of the 20th layer, merges it with the 7th layer's output, and feeds the result back into the 8th layer, repeating for as many loops as desired. The adapter is initialized to pass the 7th layer's output through unchanged while zeroing out the 20th layer's contribution, so the converted model starts with the same output distribution as the original. Weights outside layers 8-20 were frozen, on one hand to minimize unintended harm to the base model, on the other to reduce training memory.
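To make the setup concrete, here is a minimal PyTorch sketch of the idea. All names (`RecurrentAdapter`, `recurrent_forward`) and the concatenate-then-project adapter design are my assumptions for illustration, not necessarily the repo's actual implementation; the key property shown is the identity-preserving initialization, which makes the looped model match the base model at step zero.

```python
import torch
import torch.nn as nn

class RecurrentAdapter(nn.Module):
    """Hypothetical adapter: merges the recurrent block's output (layer 20)
    with the pre-loop skip state (layer 7 output) before re-entering layer 8."""
    def __init__(self, hidden_size: int):
        super().__init__()
        # One linear map over the concatenation [h_loop ; h_skip].
        self.proj = nn.Linear(2 * hidden_size, hidden_size, bias=False)
        # Identity-preserving init: pass h_skip through unchanged and
        # zero out h_loop's contribution, so the converted model initially
        # computes the same function as the frozen base model.
        with torch.no_grad():
            self.proj.weight.zero_()
            self.proj.weight[:, hidden_size:] = torch.eye(hidden_size)

    def forward(self, h_loop: torch.Tensor, h_skip: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([h_loop, h_skip], dim=-1))

def recurrent_forward(blocks, adapter, h, num_loops: int = 4):
    """Run a simplified layer stack with layers 8-20 (0-indexed 7:20) looped.
    `blocks` stands in for the decoder layers; attention masks, positions,
    and residual details are omitted for clarity."""
    for blk in blocks[:7]:           # layers 1-7: run once
        h = blk(h)
    h_skip = h                       # 7th-layer output, reused every loop
    for i in range(num_loops):
        if i > 0:
            # merge the previous loop's layer-20 output back with the skip state
            h = adapter(h, h_skip)
        for blk in blocks[7:20]:     # layers 8-20: the recurrent block
            h = blk(h)
    for blk in blocks[20:]:          # remaining layers: run once
        h = blk(h)
    return h
```

With this initialization, every loop iteration initially sees the same input (`h_skip`), so the output for any `num_loops` matches a single pass through the block, i.e. the original model; training then learns to make the loop output actually contribute.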
Unfortunately, I was not able to get any meaningful conversion. The converted model functions normally, but it doesn't learn anything special: when trained on the same data, it behaves mostly the same as the base model. For example, consider this loss curve,
While the converted recurrent model outperforms the baseline with the same frozen weights, the difference is about what you'd expect from the extra FLOPs spent training it. And it underperforms the baseline with no frozen weights.
In short, there is no free lunch: you get as much performance as you train for, just like with any normal model.
Anyways, if you are reading this README, hopefully it provides you with some inspiration :p