
Commit 824e2f1

Add support for Mimi
1 parent 3502ddb commit 824e2f1

File tree

6 files changed: +173 −3 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -359,6 +359,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
 1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
 1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao.
+1. **[Mimi](https://huggingface.co/docs/transformers/model_doc/mimi)** (from Kyutai) released with the paper [Moshi: a speech-text foundation model for real-time dialogue](https://kyutai.org/Moshi.pdf) by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour.
 1. **[Mistral](https://huggingface.co/docs/transformers/model_doc/mistral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.
 1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (from Facebook) released with the paper [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli.
 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.

docs/snippets/6_supported-models.snippet

Lines changed: 1 addition & 0 deletions
@@ -74,6 +74,7 @@
 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
 1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
 1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao.
+1. **[Mimi](https://huggingface.co/docs/transformers/model_doc/mimi)** (from Kyutai) released with the paper [Moshi: a speech-text foundation model for real-time dialogue](https://arxiv.org/abs/2410.00037) by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour.
 1. **[Mistral](https://huggingface.co/docs/transformers/model_doc/mistral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.
 1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (from Facebook) released with the paper [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli.
 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.

src/models.js

Lines changed: 93 additions & 3 deletions
@@ -134,6 +134,7 @@ const MODEL_TYPES = {
     MultiModality: 8,
     Phi3V: 9,
     AudioTextToText: 10,
+    AutoEncoder: 11,
 }
 //////////////////////////////////////////////////

@@ -554,6 +555,12 @@ async function encoderForward(self, model_inputs) {
     return await sessionRun(session, encoderFeeds);
 }

+async function autoEncoderForward(self, model_inputs) {
+    const encoded = await self.encode(model_inputs);
+    const decoded = await self.decode(encoded);
+    return decoded;
+}
+
 /**
  * Forward pass of a decoder model.
  * @param {Object} self The decoder model.

@@ -1009,12 +1016,13 @@ export class PreTrainedModel extends Callable {
                 this.can_generate = true;
                 this._prepare_inputs_for_generation = multimodal_text_to_text_prepare_inputs_for_generation;
                 break;
-
             case MODEL_TYPES.MultiModality:
                 this.can_generate = true;
                 this._prepare_inputs_for_generation = multimodality_prepare_inputs_for_generation;
                 break;
-
+            case MODEL_TYPES.AutoEncoder:
+                this._forward = autoEncoderForward;
+                break;
             default:
                 // should be MODEL_TYPES.EncoderOnly
                 this._forward = encoderForward;

@@ -1197,7 +1205,13 @@ export class PreTrainedModel extends Callable {
                     generation_config: 'generation_config.json',
                 }, options),
             ]);
-
+        } else if (modelType === MODEL_TYPES.AutoEncoder) {
+            info = await Promise.all([
+                constructSessions(pretrained_model_name_or_path, {
+                    encoder_model: 'encoder_model',
+                    decoder_model: 'decoder_model',
+                }, options),
+            ]);
         } else { // should be MODEL_TYPES.EncoderOnly
             if (modelType !== MODEL_TYPES.EncoderOnly) {
                 const type = modelName ?? config?.model_type;

@@ -7101,7 +7115,77 @@ export class UltravoxModel extends UltravoxPreTrainedModel {
 }
 //////////////////////////////////////////////////

+export class MimiPreTrainedModel extends PreTrainedModel {
+    main_input_name = 'input_values';
+    forward_params = ['input_values'];
+}
+
+export class MimiEncoderOutput extends ModelOutput {
+    /**
+     * @param {Object} output The output of the model.
+     * @param {Tensor} output.audio_codes Discrete code embeddings, of shape `(batch_size, num_quantizers, codes_length)`.
+     */
+    constructor({ audio_codes }) {
+        super();
+        this.audio_codes = audio_codes;
+    }
+}

+export class MimiDecoderOutput extends ModelOutput {
+    /**
+     * @param {Object} output The output of the model.
+     * @param {Tensor} output.audio_values Decoded audio values, of shape `(batch_size, num_channels, sequence_length)`.
+     */
+    constructor({ audio_values }) {
+        super();
+        this.audio_values = audio_values;
+    }
+}
+
+/**
+ * The Mimi neural audio codec model.
+ */
+export class MimiModel extends MimiPreTrainedModel {
+    /**
+     * Encodes the input audio waveform into discrete codes.
+     * @param {Object} inputs Model inputs
+     * @param {Tensor} [inputs.input_values] Float values of the input audio waveform, of shape `(batch_size, channels, sequence_length)`.
+     * @returns {Promise<MimiEncoderOutput>} The output tensor of shape `(batch_size, num_codebooks, sequence_length)`.
+     */
+    async encode(inputs) {
+        return new MimiEncoderOutput(await sessionRun(this.sessions['encoder_model'], inputs));
+    }
+
+    /**
+     * Decodes the given frames into an output audio waveform.
+     * @param {MimiEncoderOutput} inputs The encoded audio codes.
+     * @returns {Promise<MimiDecoderOutput>} The output tensor of shape `(batch_size, num_channels, sequence_length)`.
+     */
+    async decode(inputs) {
+        return new MimiDecoderOutput(await sessionRun(this.sessions['decoder_model'], inputs));
+    }
+}
+
+export class MimiEncoderModel extends MimiPreTrainedModel {
+    /** @type {typeof PreTrainedModel.from_pretrained} */
+    static async from_pretrained(pretrained_model_name_or_path, options = {}) {
+        return super.from_pretrained(pretrained_model_name_or_path, {
+            ...options,
+            // Update default model file name if not provided
+            model_file_name: options.model_file_name ?? 'encoder_model',
+        });
+    }
+}
+export class MimiDecoderModel extends MimiPreTrainedModel {
+    /** @type {typeof PreTrainedModel.from_pretrained} */
+    static async from_pretrained(pretrained_model_name_or_path, options = {}) {
+        return super.from_pretrained(pretrained_model_name_or_path, {
+            ...options,
+            // Update default model file name if not provided
+            model_file_name: options.model_file_name ?? 'decoder_model',
+        });
+    }
+}

 //////////////////////////////////////////////////
 // AutoModels, used to simplify construction of PreTrainedModels

@@ -7272,6 +7356,9 @@ const MODEL_MAPPING_NAMES_ENCODER_DECODER = new Map([
     ['blenderbot-small', ['BlenderbotSmallModel', BlenderbotSmallModel]],
 ]);

+const MODEL_MAPPING_NAMES_AUTO_ENCODER = new Map([
+    ['mimi', ['MimiModel', MimiModel]],
+]);

 const MODEL_MAPPING_NAMES_DECODER_ONLY = new Map([
     ['bloom', ['BloomModel', BloomModel]],

@@ -7603,9 +7690,12 @@ const MODEL_FOR_IMAGE_FEATURE_EXTRACTION_MAPPING_NAMES = new Map([
 ])

 const MODEL_CLASS_TYPE_MAPPING = [
+    // MODEL_MAPPING_NAMES:
     [MODEL_MAPPING_NAMES_ENCODER_ONLY, MODEL_TYPES.EncoderOnly],
     [MODEL_MAPPING_NAMES_ENCODER_DECODER, MODEL_TYPES.EncoderDecoder],
     [MODEL_MAPPING_NAMES_DECODER_ONLY, MODEL_TYPES.DecoderOnly],
+    [MODEL_MAPPING_NAMES_AUTO_ENCODER, MODEL_TYPES.AutoEncoder],
+
     [MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
     [MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
     [MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES, MODEL_TYPES.Seq2Seq],
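
Taken together, these changes wire `MimiModel` into the new `AutoEncoder` path: calling the model directly runs `autoEncoderForward`, i.e. an encode followed by a decode. A minimal usage sketch, mirroring the test added below — the checkpoint id `onnx-community/mimi` is a placeholder for any Mimi ONNX export, and the import assumes the published `@huggingface/transformers` package:

```js
import { EncodecFeatureExtractor, MimiModel } from '@huggingface/transformers';

// Placeholder checkpoint id; substitute any Mimi ONNX export.
const model_id = 'onnx-community/mimi';
const model = await MimiModel.from_pretrained(model_id);
const feature_extractor = await EncodecFeatureExtractor.from_pretrained(model_id);

// Half a second of mono audio (here, silence) at Mimi's 24kHz sampling rate.
const inputs = await feature_extractor(new Float32Array(12000));

// Compress the waveform into discrete codes...
const encoder_outputs = await model.encode(inputs);
// ...and reconstruct a waveform from those codes.
const { audio_values } = await model.decode(encoder_outputs);

// Equivalently, `await model(inputs)` performs encode + decode in one call
// via the new autoEncoderForward.
```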

src/models/encodec/feature_extraction_encodec.js

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+import { FeatureExtractor, validate_audio_inputs } from '../../base/feature_extraction_utils.js';
+import { Tensor } from '../../utils/tensor.js';
+
+
+export class EncodecFeatureExtractor extends FeatureExtractor {
+    /**
+     * Asynchronously extracts input values from a given audio using the provided configuration.
+     * @param {Float32Array|Float64Array} audio The audio data as a Float32Array/Float64Array.
+     * @returns {Promise<{ input_values: Tensor; }>} The extracted input values.
+     */
+    async _call(audio) {
+        validate_audio_inputs(audio, 'EncodecFeatureExtractor');
+
+        if (audio instanceof Float64Array) {
+            audio = new Float32Array(audio);
+        }
+
+        const num_channels = this.config.feature_size;
+        if (audio.length % num_channels !== 0) {
+            throw new Error(`The length of the audio data must be a multiple of the number of channels (${num_channels}).`);
+        }
+
+        const shape = [
+            1, /* batch_size */
+            num_channels, /* num_channels */
+            audio.length / num_channels, /* num_samples */
+        ];
+        return {
+            input_values: new Tensor('float32', audio, shape),
+        };
+    }
+}
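
Note that the reshape to `(1, num_channels, num_samples)` is a row-major view of the flat input array, so multi-channel audio must be supplied in planar (channel-major) order, not interleaved. A short sketch under that assumption, using a hypothetical stereo checkpoint (`config.feature_size === 2`) and an extractor instance loaded from it:

```js
// `stereo_extractor` is assumed to be an EncodecFeatureExtractor whose
// config.feature_size is 2 (stereo).
const left = new Float32Array(12000);  // all samples of channel 0
const right = new Float32Array(12000); // all samples of channel 1

// Planar concatenation: channel 0 first, then channel 1 (NOT interleaved L/R pairs).
const planar = new Float32Array(left.length + right.length);
planar.set(left, 0);
planar.set(right, left.length);

const { input_values } = await stereo_extractor(planar);
console.log(input_values.dims); // [1, 2, 12000]
```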

src/models/feature_extractors.js

Lines changed: 1 addition & 0 deletions
@@ -1,5 +1,6 @@

 export * from './audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.js';
+export * from './encodec/feature_extraction_encodec.js';
 export * from './clap/feature_extraction_clap.js';
 export * from './moonshine/feature_extraction_moonshine.js';
 export * from './pyannote/feature_extraction_pyannote.js';
tests/models/mimi/test_modeling_mimi.js

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+import { EncodecFeatureExtractor, MimiModel } from "../../../src/transformers.js";
+
+import { MAX_MODEL_LOAD_TIME, MAX_TEST_EXECUTION_TIME, MAX_MODEL_DISPOSE_TIME, DEFAULT_MODEL_OPTIONS } from "../../init.js";
+
+export default () => {
+    describe("MimiModel", () => {
+        const model_id = "hf-internal-testing/tiny-random-MimiModel";
+
+        /** @type {MimiModel} */
+        let model;
+        /** @type {EncodecFeatureExtractor} */
+        let feature_extractor;
+        let inputs;
+        beforeAll(async () => {
+            model = await MimiModel.from_pretrained(model_id, DEFAULT_MODEL_OPTIONS);
+            feature_extractor = await EncodecFeatureExtractor.from_pretrained(model_id);
+            inputs = await feature_extractor(new Float32Array(12000));
+        }, MAX_MODEL_LOAD_TIME);
+
+        it(
+            "forward",
+            async () => {
+                const { audio_values } = await model(inputs);
+                expect(audio_values.dims).toEqual([1, 1, 13440]);
+            },
+            MAX_TEST_EXECUTION_TIME,
+        );
+
+        it(
+            "encode & decode",
+            async () => {
+                const encoder_outputs = await model.encode(inputs);
+                expect(encoder_outputs.audio_codes.dims).toEqual([1, model.config.num_quantizers, 7]);
+
+                const { audio_values } = await model.decode(encoder_outputs);
+                expect(audio_values.dims).toEqual([1, 1, 13440]);
+            },
+            MAX_TEST_EXECUTION_TIME,
+        );
+
+        afterAll(async () => {
+            await model?.dispose();
+        }, MAX_MODEL_DISPOSE_TIME);
+    });
+};
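
The expected dims in these tests are internally consistent with a hop of 1920 samples per code frame — Mimi's 24kHz sampling rate at 12.5 frames per second, assuming the tiny test checkpoint keeps those defaults:

```js
const hop = 24000 / 12.5; // 1920 samples per code frame
Math.ceil(12000 / hop);   // 7     -> audio_codes dims  [1, num_quantizers, 7]
7 * hop;                  // 13440 -> audio_values dims [1, 1, 13440]
```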
