TransformerDecoder: optional positional encoding and final matmul #93
Conversation
    num_output: int
    logits_bias: bool
    share_embedding: bool
    use_positional_encoding: bool = True
I wonder if, instead of being a flag, this should be a configurable module, which you simply replace with a no-op if you don't want any positional encoding. This would also allow using positional encoding schemes other than sinusoidal.
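A minimal sketch of what such a configurable module could look like (the module and field names here are hypothetical, not part of this PR):

```python
import torch
from torch import nn


class NoPositionalEncoding(nn.Module):
    """No-op stand-in for when no positional encoding is wanted."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x


# The config would then take any nn.Module mapping [B, T, F] -> [B, T, F],
# e.g. a sinusoidal module, a learned one, or the no-op above:
#   pos_enc: nn.Module = SinusoidalPositionalEncoding(model_dim)   # hypothetical
#   pos_enc: nn.Module = NoPositionalEncoding()
```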
Yes, agreed, it would be better to have this more dynamic.
ConformerMHSARelPosV1._sinusoidal_pe should maybe be moved to a separate function, and then you would have positional_encoding=absolute_sinusoidal_positional_encoding as the default, with None also allowed.
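A rough sketch of that idea, assuming an even model_dim; the function and config field names are only illustrative:

```python
import math

import torch


def absolute_sinusoidal_positional_encoding(seq_len: int, model_dim: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding, shape [seq_len, model_dim] (model_dim assumed even)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # [T, 1]
    inv_freq = torch.exp(
        torch.arange(0, model_dim, 2, dtype=torch.float32) * (-math.log(10000.0) / model_dim)
    )  # [F/2]
    pe = torch.zeros(seq_len, model_dim)
    pe[:, 0::2] = torch.sin(pos * inv_freq)
    pe[:, 1::2] = torch.cos(pos * inv_freq)
    return pe


# Possible config field, defaulting to the sinusoidal variant,
# with None meaning "no positional encoding":
#   positional_encoding: Optional[Callable[[int, int], torch.Tensor]] = absolute_sinusoidal_positional_encoding
```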
    logits_bias: bool
    share_embedding: bool
    use_positional_encoding: bool = True
    do_output_embedding_matmul: bool = True
Perhaps
    - do_output_embedding_matmul: bool = True
    + embed_outputs_to_vocab_dim: bool = True
is clearer naming-wise?
I don't think it's clearer. But I also don't like the original name. And I'm not sure whether I like the logic at all (see my separate comment on why to have the out_logits at all if they are not used).
As a first comment (I will try to comment in more detail later): the same questions have been thought about in the RF implementation, for the Transformer encoder, the decoder, and, closely related, also the Conformer encoder (to make the frontend optional, etc.). See the current RF TransformerDecoder implementation. It already has the
    @@ -190,13 +194,20 @@ def __init__(self, cfg: TransformerDecoderV1Config):
        else:
            self.out_logits = nn.Linear(self.model_dim, cfg.num_output, bias=cfg.logits_bias)
I just realized this sharing is weird. I would always set self.out_logits. If sharing, you can just do self.out_logits.weight = self.input_embedding.weight. That would simplify the other code.
Also, self.out_logits should always be set (None if not used), but with my suggestion you don't need to care about this anyway.
And then you would also allow logits_bias=True together with share_embedding=True.
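A minimal sketch of that simplification; the surrounding class and the cfg fields are simplified assumptions, not the PR's actual code:

```python
from torch import nn


class DecoderSketch(nn.Module):
    """Simplified __init__ logic illustrating the suggested weight tying."""

    def __init__(self, cfg):
        super().__init__()
        self.model_dim = cfg.model_dim
        self.input_embedding = nn.Embedding(cfg.num_output, self.model_dim)
        # Always create out_logits; optionally tie its weight to the input embedding.
        # The bias can still be enabled independently of the sharing.
        self.out_logits = nn.Linear(self.model_dim, cfg.num_output, bias=cfg.logits_bias)
        if cfg.share_embedding:
            self.out_logits.weight = self.input_embedding.weight
```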
    logits_bias: bool
    share_embedding: bool
    use_positional_encoding: bool = True
    do_output_embedding_matmul: bool = True
If this is False and cfg.share_embedding is also False, the out_logits are not used at all. Does it make sense to even have them then?
This PR makes both the positional encoding and the final matrix multiplication of the model output with the output embedding matrix optional.
This allows us to use the implementation for self-normalized LM Transformer training, where positional encoding is not required and the final matmul is replaced by another matmul in the sampling loss.
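For illustration only (not part of this PR), a rough sketch of that use case, assuming the decoder returns hidden states of shape [B, T, F] when do_output_embedding_matmul=False; the function name and signature are hypothetical:

```python
import torch
from torch import nn


def sampled_logits(hidden: torch.Tensor, embedding: nn.Embedding, sampled_ids: torch.Tensor) -> torch.Tensor:
    """Project hidden states [B, T, F] onto a sampled subset of the embedding matrix.

    This replaces the full-vocabulary matmul that the decoder skips
    when do_output_embedding_matmul=False.
    """
    sampled_weight = embedding.weight[sampled_ids]  # [num_sampled, F]
    return hidden @ sampled_weight.transpose(0, 1)  # [B, T, num_sampled]
```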
My only question is: should this be a TransformerDecoderV2 instead?