Commit dfbd9c7: Add support for SNAC
Parent: 29e1679
File tree: 5 files changed (+63, -0 lines)

README.md (1 addition, 0 deletions)
```diff
@@ -407,6 +407,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
 1. **[SigLIP](https://huggingface.co/docs/transformers/main/model_doc/siglip)** (from Google AI) released with the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer.
 1. **[SmolVLM](https://huggingface.co/docs/transformers/main/model_doc/smolvlm)** (from Hugging Face) released with the blog posts [SmolVLM - small yet mighty Vision Language Model](https://huggingface.co/blog/smolvlm) and [SmolVLM Grows Smaller – Introducing the 250M & 500M Models!](https://huggingface.co/blog/smolervlm) by the Hugging Face TB Research team.
+1. **SNAC** (from Papla Media, ETH Zurich) released with the paper [SNAC: Multi-Scale Neural Audio Codec](https://arxiv.org/abs/2410.14411) by Hubert Siuzdak, Florian Grötschla, Luca A. Lanzendörfer.
 1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
 1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
```

docs/snippets/6_supported-models.snippet (1 addition, 0 deletions)
```diff
@@ -121,6 +121,7 @@
 1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
 1. **[SigLIP](https://huggingface.co/docs/transformers/main/model_doc/siglip)** (from Google AI) released with the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer.
 1. **[SmolVLM](https://huggingface.co/docs/transformers/main/model_doc/smolvlm)** (from Hugging Face) released with the blog posts [SmolVLM - small yet mighty Vision Language Model](https://huggingface.co/blog/smolvlm) and [SmolVLM Grows Smaller – Introducing the 250M & 500M Models!](https://huggingface.co/blog/smolervlm) by the Hugging Face TB Research team.
+1. **SNAC** (from Papla Media, ETH Zurich) released with the paper [SNAC: Multi-Scale Neural Audio Codec](https://arxiv.org/abs/2410.14411) by Hubert Siuzdak, Florian Grötschla, Luca A. Lanzendörfer.
 1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
 1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
```

src/models.js (57 additions, 0 deletions)
```diff
@@ -7287,6 +7287,60 @@ export class DacDecoderModel extends DacPreTrainedModel {
 }
 //////////////////////////////////////////////////
 
+
+//////////////////////////////////////////////////
+// Snac models
+export class SnacPreTrainedModel extends PreTrainedModel {
+    main_input_name = 'input_values';
+    forward_params = ['input_values'];
+}
+
+/**
+ * The SNAC (Multi-Scale Neural Audio Codec) model.
+ */
+export class SnacModel extends SnacPreTrainedModel {
+    /**
+     * Encodes the input audio waveform into discrete codes.
+     * @param {Object} inputs Model inputs
+     * @param {Tensor} [inputs.input_values] Float values of the input audio waveform, of shape `(batch_size, channels, sequence_length)`.
+     * @returns {Promise<Record<string, Tensor>>} The output tensors of shape `(batch_size, num_codebooks, sequence_length)`.
+     */
+    async encode(inputs) {
+        return await sessionRun(this.sessions['encoder_model'], inputs);
+    }
+
+    /**
+     * Decodes the given frames into an output audio waveform.
+     * @param {Record<string, Tensor>} inputs The encoded audio codes.
+     * @returns {Promise<{audio_values: Tensor}>} The output tensor of shape `(batch_size, num_channels, sequence_length)`.
+     */
+    async decode(inputs) {
+        return await sessionRun(this.sessions['decoder_model'], inputs);
+    }
+}
+
+export class SnacEncoderModel extends SnacPreTrainedModel {
+    /** @type {typeof PreTrainedModel.from_pretrained} */
+    static async from_pretrained(pretrained_model_name_or_path, options = {}) {
+        return super.from_pretrained(pretrained_model_name_or_path, {
+            ...options,
+            // Update default model file name if not provided
+            model_file_name: options.model_file_name ?? 'encoder_model',
+        });
+    }
+}
+export class SnacDecoderModel extends SnacPreTrainedModel {
+    /** @type {typeof PreTrainedModel.from_pretrained} */
+    static async from_pretrained(pretrained_model_name_or_path, options = {}) {
+        return super.from_pretrained(pretrained_model_name_or_path, {
+            ...options,
+            // Update default model file name if not provided
+            model_file_name: options.model_file_name ?? 'decoder_model',
+        });
+    }
+}
+//////////////////////////////////////////////////
+
 //////////////////////////////////////////////////
 // AutoModels, used to simplify construction of PreTrainedModels
 // (uses config to instantiate correct class)
```
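The `encode`/`decode` pair above produces codes at several temporal scales rather than a single codebook stream, which is SNAC's defining feature. As a rough standalone illustration of that multi-scale layout (a hypothetical 1:2:4 rate hierarchy; the actual strides and number of scales depend on the checkpoint), here is a sketch of how many tokens each scale would contribute:

```javascript
// Sketch of SNAC's multi-scale token layout (illustrative only).
// Each finer scale runs at twice the temporal rate of the one above it,
// so a coarse-frame count of N yields [N, 2N, 4N, ...] codes per scale.
function snacCodeLengths(numCoarseFrames, numScales = 3) {
    return Array.from({ length: numScales }, (_, i) => numCoarseFrames * 2 ** i);
}

const lengths = snacCodeLengths(12); // [12, 24, 48]
const totalCodes = lengths.reduce((a, b) => a + b, 0); // 84
console.log(lengths, totalCodes);
```

Because most tokens sit at the coarsest (cheapest) scales only part of the time, this hierarchy is what lets SNAC reach lower bitrates than a flat residual codebook stack at the same quality.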
```diff
@@ -7468,6 +7522,7 @@ const MODEL_MAPPING_NAMES_ENCODER_DECODER = new Map([
 const MODEL_MAPPING_NAMES_AUTO_ENCODER = new Map([
     ['mimi', ['MimiModel', MimiModel]],
     ['dac', ['DacModel', DacModel]],
+    ['snac', ['SnacModel', SnacModel]],
 ]);
 
 const MODEL_MAPPING_NAMES_DECODER_ONLY = new Map([
```
```diff
@@ -7873,6 +7928,8 @@ const CUSTOM_MAPPING = [
     ['DacDecoderModel', DacDecoderModel, MODEL_TYPES.EncoderOnly],
     ['MimiEncoderModel', MimiEncoderModel, MODEL_TYPES.EncoderOnly],
     ['MimiDecoderModel', MimiDecoderModel, MODEL_TYPES.EncoderOnly],
+    ['SnacEncoderModel', SnacEncoderModel, MODEL_TYPES.EncoderOnly],
+    ['SnacDecoderModel', SnacDecoderModel, MODEL_TYPES.EncoderOnly],
 ]
 for (const [name, model, type] of CUSTOM_MAPPING) {
     MODEL_TYPE_MAPPING.set(name, type);
```
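The `['snac', ['SnacModel', SnacModel]]` entry is what lets the auto-classes resolve a config's `model_type` string to the concrete class. A minimal standalone sketch of that lookup pattern (stand-in classes for illustration, not the library's real implementation):

```javascript
// Stand-in classes; in transformers.js the map stores ['ClassName', Class]
// pairs and is consulted by the AutoModel machinery.
class MimiModel { }
class DacModel { }
class SnacModel { }

const AUTO_ENCODER = new Map([
    ['mimi', MimiModel],
    ['dac', DacModel],
    ['snac', SnacModel],
]);

// Resolve the concrete class from a parsed config.json.
function modelClassFor(config) {
    const cls = AUTO_ENCODER.get(config.model_type);
    if (!cls) throw new Error(`Unsupported model type: ${config.model_type}`);
    return cls;
}

console.log(modelClassFor({ model_type: 'snac' }).name); // 'SnacModel'
```

Registering the encoder- and decoder-only variants through `CUSTOM_MAPPING` works the same way, but keyed by class name rather than by `model_type`.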

src/models/feature_extractors.js (1 addition, 0 deletions)
```diff
@@ -6,6 +6,7 @@ export * from './dac/feature_extraction_dac.js';
 export * from './moonshine/feature_extraction_moonshine.js';
 export * from './pyannote/feature_extraction_pyannote.js';
 export * from './seamless_m4t/feature_extraction_seamless_m4t.js';
+export * from './snac/feature_extraction_snac.js';
 export * from './speecht5/feature_extraction_speecht5.js';
 export * from './wav2vec2/feature_extraction_wav2vec2.js';
 export * from './wespeaker/feature_extraction_wespeaker.js';
```
src/models/snac/feature_extraction_snac.js (new file, 3 additions)
```diff
@@ -0,0 +1,3 @@
+import { DacFeatureExtractor } from '../dac/feature_extraction_dac.js';
+
+export class SnacFeatureExtractor extends DacFeatureExtractor { }
```
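`SnacFeatureExtractor` has an empty body: SNAC reuses DAC's raw-waveform preprocessing, so the subclass exists only to give the shared behavior a SNAC-specific class name for registration. A small self-contained sketch of this alias-by-subclass pattern (hypothetical classes, not the library's own):

```javascript
// Hypothetical base standing in for DacFeatureExtractor.
class BaseExtractor {
    constructor(samplingRate = 24000) {
        this.samplingRate = samplingRate;
    }
    // Pad a waveform length up to the next multiple of the model's hop size.
    paddedLength(length, hop = 512) {
        return Math.ceil(length / hop) * hop;
    }
}

// The alias: no body needed, all behavior is inherited,
// but the class name differs for registry lookups.
class AliasExtractor extends BaseExtractor { }

const fe = new AliasExtractor();
console.log(fe.paddedLength(1000)); // 1024
console.log(fe instanceof BaseExtractor); // true
```

The design choice keeps the two codecs' preprocessing in one place while letting config-driven loading (`feature_extractor_type: "SnacFeatureExtractor"`) find a distinct class.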
