
Conversation

a2d8a4v (Contributor) commented Jan 2, 2026

Well, CTranslate2 already has a wav2vec 2.0 codebase, which can run wav2vec 2.0, MMS, parts of the omnilingual-asr models (the -SSL and -CTC branches), and HuBERT (which, to the best of my knowledge, only differs in training strategy; the backbone model is the same). However, WavLM has a gated relative attention mechanism: the gated relative position bias is computed in the first attention layer from the pre-layernorm hidden states. Once computed, the position bias is added to the query-key attention scores just before the softmax (i.e. when computing the attention matrix), and it is then passed on to the later attention layers without being recomputed.
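To make the mechanism concrete, below is a minimal NumPy sketch of the idea; the gating is deliberately reduced to a single sigmoid gate, so this is an illustration of the concept rather than the exact Hugging Face or CTranslate2 computation:

```python
# Minimal NumPy sketch of WavLM's gated relative position bias as described
# above. Deliberately simplified (single sigmoid gate, pre-expanded bias table);
# not the exact Hugging Face or CTranslate2 implementation.
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_position_bias(hidden_states, rel_bias, gate_w):
    # hidden_states: pre-layernorm input of the first layer, [seq_len, d_model]
    # rel_bias:      relative position bias expanded to [num_heads, seq_len, seq_len]
    # gate_w:        [d_model, num_heads] projection used only for the gate (assumed shape)
    gate = 1.0 / (1.0 + np.exp(-(hidden_states @ gate_w)))  # [seq_len, num_heads]
    gate = gate.transpose(1, 0)[:, :, None]                 # [num_heads, seq_len, 1]
    return gate * rel_bias                                   # gated bias, computed once


def attention_with_position_bias(q, k, v, position_bias):
    # q, k, v: [num_heads, seq_len, head_dim]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = scores + position_bias                          # added before the softmax
    return softmax(scores) @ v
```

In the real model the bias is produced once in the first layer and the same tensor is then threaded through the remaining layers, which is the position_bias object mentioned below.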

The major changes compared to the wav2vec 2.0 C++ codebase are in two files:
src/layers/attention.cc, where I modified the logic inside the dot_product_attention function, and
src/layers/wavlm.cc, where I pass an additional object called position_bias.

I've tested the code by extracting the last hidden state and computing its cosine similarity with the one produced by the Hugging Face WavLM. The result is 1.0, so I believe the logic of my code is correct.
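For reference, a hedged sketch of that comparison; the ctranslate2.models.WavLM class and its encode() call are assumptions about the Python API added in this PR (commented out below), not confirmed:

```python
# Hedged sketch of the validation described above: compare the converted model's
# last hidden state against Hugging Face WavLM with cosine similarity.
# The ctranslate2.models.WavLM class and its encode() call are assumptions about
# the Python API added in this PR; adapt to the actual names.
import numpy as np
import torch
from transformers import WavLMModel

hf_model = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()
audio = torch.randn(1, 16000)  # 1 s of dummy 16 kHz audio

with torch.no_grad():
    hf_hidden = hf_model(audio).last_hidden_state[0].numpy()

# Hypothetical CTranslate2 side (model converted beforehand with the converter):
# import ctranslate2
# ct2_model = ctranslate2.models.WavLM("wavlm_ct2/", device="cpu")
# ct2_hidden = np.asarray(ct2_model.encode(audio.numpy()))[0]


def cosine_similarity(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# print(cosine_similarity(hf_hidden, ct2_hidden))  # reported as 1.0 in this PR
```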

References:

  1. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, NeurIPS 2020.
  2. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, TASLP 2021.
  3. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing, JSTSP 2022.
  4. Scaling Speech Technology to 1,000+ Languages, JMLR 2024.
  5. Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages, arXiv preprint, 2025.

jordimas (Collaborator) commented Jan 3, 2026

Great work. I am looking forward to testing it.

Four quick comments:

  1. Would it be possible to add a small example of how to use it here: https://github.com/OpenNMT/CTranslate2/blob/master/docs/guides/transformers.md?

  2. Could you look at this test failure:

     =========================== short test summary info ============================
     FAILED python/tests/test_transformers.py::TestWavLM::test_transformers_wavlm[microsoft/wavlm-large-expected_transcription0-cpu] - TypeError: TransformerEncoderLayerSpec.__init__() got an unexpected keyword argument 'gated_relative_attention_bias'

  3. Run python -m black python/ to fix the check-python-style job that is not passing.

  4. Consider adding a test here: https://github.com/OpenNMT/CTranslate2/blob/master/python/tests/test_transformers.py Look at what is done for Whisper.

Thanks!

namespace models {

struct WavLMOptions {
// Maximum generation length.
jordimas (Collaborator) commented on the diff:

Are we planning to use the WavLMOptions structure?
It is not referenced at the moment.

a2d8a4v (Contributor, Author) commented Jan 4, 2026

Hmm, in fact it is not used at the moment.
I used microsoft/wavlm-large for the test case, which outputs only the last hidden state. The structure may be useful when someone runs WavLM plus a linear layer (language-model head) trained with CTC loss, which outputs tokens at inference time.

a2d8a4v (Contributor, Author) commented Jan 4, 2026

Hi @jordimas,
My replies to the four comments:

  1. Of course. Thank you for reminding me of this; I had missed that document, haha.
    I also noticed the document lacks a wav2vec 2.0 section, so I will add that part as well.
  2. OK, it seems I forgot to add the argument (a rough sketch of the fix follows below).
  3. Sure, let me check it again.
  4. Hmm, I think I've already added the TestWavLM class inside test_transformers.py.
    Can you take a look at: https://github.com/OpenNMT/CTranslate2/pull/1966/files#diff-87c343e816a510ee31b7408b49bf7da834849e5e62cf90b897ffc2485ccf91a1
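Regarding point 2, a hypothetical sketch of the shape of the fix implied by the TypeError; the real TransformerEncoderLayerSpec (presumably in python/ctranslate2/specs/transformer_spec.py) has many more parameters, and this is not the actual CTranslate2 code:

```python
# Hypothetical, simplified sketch: the layer spec has to accept (and store) the
# new keyword argument that the WavLM loader passes when building the model spec.
# All real parameters of TransformerEncoderLayerSpec are omitted here.
class TransformerEncoderLayerSpec:
    def __init__(self, gated_relative_attention_bias=False):
        # Flag telling the converter/runtime that this layer uses WavLM's
        # gated relative attention bias.
        self.gated_relative_attention_bias = gated_relative_attention_bias


# The loader can then construct the layer spec without raising the TypeError:
layer = TransformerEncoderLayerSpec(gated_relative_attention_bias=True)
```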

Thanks a lot!

a2d8a4v (Contributor, Author) commented Jan 4, 2026

By the way, @jordimas, I would like to ask for your advice.
Regarding wav2vec 2.0: although I said that wav2vec 2.0, HuBERT, MMS, and omnilingual-asr can all use the wav2vec 2.0 codebase because they share the same backbone architecture, the current codebase cannot load them directly because of how the converters are set up. For example, HubertConfig is not registered in the converter.

It would need some additional changes to support those models. I'm wondering whether I should create model templates for each of them, or just adapt the configs and converters.
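To illustrate the second option, here is a hedged sketch of what registering one of these configs on top of the existing wav2vec 2.0 loader might look like, assuming the register_loader / Wav2Vec2Loader pattern in python/ctranslate2/converters/transformers.py (the exact names and the "HubertForCTC" architecture name are assumptions):

```python
# Hedged sketch of the "adapt the configs and converters" option: register the
# existing wav2vec 2.0 loader for an additional Hugging Face config class.
# Assumes the register_loader / Wav2Vec2Loader pattern in
# python/ctranslate2/converters/transformers.py; exact names may differ.
from ctranslate2.converters.transformers import Wav2Vec2Loader, register_loader


@register_loader("HubertConfig")
class HubertLoader(Wav2Vec2Loader):
    @property
    def architecture_name(self):
        # HuBERT shares the wav2vec 2.0 backbone, so the weight mapping of the
        # existing loader is reused unchanged.
        return "HubertForCTC"
```

If only the config class names differ, subclassing like this may be enough; if the architectures diverge, separate model templates would be the cleaner option.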

Thank you for your attention.

jordimas (Collaborator) commented Jan 4, 2026


Would it be possible to add one of these models to the PR, so we can see exactly what the problem looks like?

a2d8a4v (Contributor, Author) commented Jan 7, 2026

Sure
