Releases: huggingface/transformers.js
2.5.4
What's new?
- Add support for 3 new vision architectures (Swin, DeiT, Yolos) in #262. Check out the Hugging Face Hub to see which models you can use!
- Swin for image classification. e.g.:
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/tiger.jpg';
let classifier = await pipeline('image-classification', 'Xenova/swin-base-patch4-window7-224-in22k');
let output = await classifier(url, { topk: null });
// [
//   { label: 'Bengal_tiger', score: 0.2258443683385849 },
//   { label: 'tiger, Panthera_tigris', score: 0.21161635220050812 },
//   { label: 'predator, predatory_animal', score: 0.09135803580284119 },
//   { label: 'tigress', score: 0.08038495481014252 },
//   // ... 21838 more items
// ]
- DeiT for image classification. e.g.:
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/tiger.jpg';
let classifier = await pipeline('image-classification', 'Xenova/deit-tiny-distilled-patch16-224');
let output = await classifier(url);
// [{ label: 'tiger, Panthera tigris', score: 0.9804046154022217 }]
- Yolos for object detection. e.g.:
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg';
let detector = await pipeline('object-detection', 'Xenova/yolos-small-300');
let output = await detector(url);
// [
//   { label: 'remote', score: 0.9837935566902161, box: { xmin: 331, ymin: 80, xmax: 367, ymax: 192 } },
//   { label: 'cat', score: 0.94994056224823, box: { xmin: 8, ymin: 57, xmax: 316, ymax: 470 } },
//   { label: 'couch', score: 0.9843178987503052, box: { xmin: 0, ymin: 0, xmax: 639, ymax: 474 } },
//   { label: 'remote', score: 0.9704685211181641, box: { xmin: 39, ymin: 71, xmax: 179, ymax: 114 } },
//   { label: 'cat', score: 0.9921762943267822, box: { xmin: 339, ymin: 17, xmax: 642, ymax: 380 } }
// ]
- Documentation improvements by @perborgen in #261
New contributors 🤗
- @perborgen made their first contribution in #261
Full Changelog: 2.5.3...2.5.4
2.5.3
What's new?
- Fix whisper timestamps for non-English languages in #253
- Fix caching for some LFS files from the Hugging Face Hub in #251
- Improve documentation (w/ example code and links) in #255 and #257. Thanks @josephrocca for helping with this!
New contributors 🤗
- @josephrocca made their first contribution in #257
Full Changelog: 2.5.2...2.5.3
2.5.2
What's new?
- Add audio-classification with MMS and Wav2Vec2 in #220. Example usage:
// npm i @xenova/transformers
import { pipeline } from '@xenova/transformers';

// Create audio classification pipeline
let classifier = await pipeline('audio-classification', 'Xenova/mms-lid-4017');

// Run inference
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jeanNL.wav';
let output = await classifier(url);
// [
//   { label: 'fra', score: 0.9995712041854858 },
//   { label: 'hat', score: 0.00003788191679632291 },
//   { label: 'lin', score: 0.00002646935718075838 },
//   { label: 'hun', score: 0.000015628289474989288 },
//   { label: 'bre', score: 0.000007014674793026643 }
// ]
- Add automatic-speech-recognition for Wav2Vec2 models in #220 (MMS coming soon); a minimal example sketch follows this list.
- Add support for the multi-label classification problem type in #249. Thanks @KiterWork for reporting!
- Add M2M100 tokenizer in #250. Thanks @AAnirudh07 for the feature request!
- Documentation improvements
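To illustrate the new automatic-speech-recognition support for Wav2Vec2, here is a minimal sketch. The model ID (Xenova/wav2vec2-base-960h) is an assumption, so substitute any Wav2Vec2 checkpoint converted for transformers.js; the audio URL reuses the JFK sample from the 2.4.0 Whisper example below.
// npm i @xenova/transformers
import { pipeline } from '@xenova/transformers';

// Create automatic speech recognition pipeline
// (model ID is an assumption; any Wav2Vec2 checkpoint converted for transformers.js should work)
let transcriber = await pipeline('automatic-speech-recognition', 'Xenova/wav2vec2-base-960h');

// Run inference on an audio URL
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
let output = await transcriber(url);
// { text: '...' } (transcription text; casing/punctuation depend on the checkpoint)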
New contributors 🤗
- @celsodias12 made their first contribution in #247
Full Changelog: 2.5.1...2.5.2
2.5.1
What's new?
- Add support for Llama/Llama2 models in #232 (a minimal example sketch follows this list)
- Tokenization performance improvements in #234 (+ The Tokenizer Playground example app)
- Add support for DeBERTa/DeBERTa-v2 models in #244
- Documentation improvements for zero-shot-classification pipeline (link)
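As a quick illustration of the new Llama support, here is a minimal text-generation sketch. The model ID (Xenova/llama2.c-stories15M) and generation parameters are assumptions; substitute any Llama-architecture checkpoint converted for transformers.js.
import { pipeline } from '@xenova/transformers';

// Create text generation pipeline (model ID is an assumption)
let generator = await pipeline('text-generation', 'Xenova/llama2.c-stories15M');

// Generate a continuation of the prompt
let output = await generator('Once upon a time,', { max_new_tokens: 50 });
// [{ generated_text: 'Once upon a time, ...' }]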
Full Changelog: 2.5.0...2.5.1
2.5.0
What's new?
Support for computing CLIP image and text embeddings separately (#227)
You can now compute CLIP text and vision embeddings separately, allowing for faster inference when you only need to query one of the modalities. We've also released a demo application for semantic image search to showcase this functionality.

Example: Compute text embeddings with CLIPTextModelWithProjection.
import { AutoTokenizer, CLIPTextModelWithProjection } from '@xenova/transformers';
// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/clip-vit-base-patch16');
const text_model = await CLIPTextModelWithProjection.from_pretrained('Xenova/clip-vit-base-patch16');
// Run tokenization
let texts = ['a photo of a car', 'a photo of a football match'];
let text_inputs = tokenizer(texts, { padding: true, truncation: true });
// Compute embeddings
const { text_embeds } = await text_model(text_inputs);
// Tensor {
// dims: [ 2, 512 ],
// type: 'float32',
// data: Float32Array(1024) [ ... ],
// size: 1024
// }
Example: Compute vision embeddings with CLIPVisionModelWithProjection.
import { AutoProcessor, CLIPVisionModelWithProjection, RawImage } from '@xenova/transformers';
// Load processor and vision model
const processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch16');
const vision_model = await CLIPVisionModelWithProjection.from_pretrained('Xenova/clip-vit-base-patch16');
// Read image and run processor
let image = await RawImage.read('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/football-match.jpg');
let image_inputs = await processor(image);
// Compute embeddings
const { image_embeds } = await vision_model(image_inputs);
// Tensor {
// dims: [ 1, 512 ],
// type: 'float32',
// data: Float32Array(512) [ ... ],
// size: 512
// }
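As a follow-up, here is a minimal sketch of how the two outputs above might be compared for semantic image search. It assumes the texts, text_embeds, and image_embeds variables from the two examples, and the cosineSimilarity helper is written inline here rather than taken from the library.
// Hypothetical helper: cosine similarity between two vectors (e.g. Float32Arrays)
function cosineSimilarity(a, b) {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; ++i) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Compare each text embedding (dims [2, 512]) against the single image embedding (dims [1, 512])
const dim = text_embeds.dims[1];
for (let i = 0; i < text_embeds.dims[0]; ++i) {
    const textVector = text_embeds.data.slice(i * dim, (i + 1) * dim);
    console.log(texts[i], cosineSimilarity(textVector, image_embeds.data));
}
// The football-match caption should score higher than the car caption for this image.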
Improved browser extension example/template (#196)
We've updated the source code for our example browser extension, making the following improvements:
- Custom model caching - meaning you don't need to ship the model weights with the extension. In addition to a smaller bundle size, users won't need to redownload the weights when the extension updates! (A minimal sketch follows this list.)
- Use ES6 module syntax (vs. CommonJS) - much cleaner code!
- Persistent service worker - fixed an issue where the service worker would go to sleep after a period of inactivity.
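For reference, here is a hedged sketch of what plugging in a custom model cache can look like. The env.useCustomCache and env.customCache property names are assumptions based on the extension template; the cache object just needs to expose match() and put() methods. An in-memory Map is used purely for illustration, whereas the extension persists downloads so they survive updates.
import { env, pipeline } from '@xenova/transformers';

// Route model file downloads through a custom cache instead of the default browser Cache API.
// NOTE: the property names below are assumptions based on the extension template.
env.useCustomCache = true;
env.customCache = {
    _store: new Map(),
    // Return a previously cached response for this key (or undefined on a cache miss)
    async match(key) {
        return this._store.get(key);
    },
    // Store the response; a real extension would persist it (e.g. to extension storage)
    async put(key, response) {
        this._store.set(key, response);
    },
};

// Model files fetched by the pipeline will now go through the custom cache.
let classifier = await pipeline('text-classification', 'Xenova/distilbert-base-uncased-finetuned-sst-2-english');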
Summary of updates since last minor release (2.4.0):
- (2.4.1) Improved documentation
- (2.4.2) Support for private/gated models (#202)
- (2.4.3) Example Next.js applications (#211) + MPNet model support (#221)
- (2.4.4) StarCoder models + example application (release; demo + source code)
Misc bug fixes and improvements
- Fixed floating-point-precision edge-case for resizing images
- Fixed RawImage.save()
- Fixed BPE tokenization for weird whitespace characters (#208)
2.4.4
What's new?
- New model: StarCoder (Xenova/starcoderbase-1b and Xenova/tiny_starcoder_py); a minimal example sketch follows this list.
- In-browser code completion example application (demo and source code)
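For completeness, here is a minimal sketch of running the smaller checkpoint above with the text-generation pipeline; the prompt and max_new_tokens value are arbitrary.
import { pipeline } from '@xenova/transformers';

// Create text generation pipeline with the tiny StarCoder model
let generator = await pipeline('text-generation', 'Xenova/tiny_starcoder_py');

// Complete some Python code
let output = await generator('def fibonacci(n):', { max_new_tokens: 40 });
// [{ generated_text: 'def fibonacci(n):\n    ...' }]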
Full Changelog: 2.4.3...2.4.4
2.4.3
What's new?
- Example Next.js applications in #211
  - Demo: client-side or server-side
  - Source code: client-side or server-side
Full Changelog: 2.4.2...2.4.3
2.4.2
What's new?
- Add support for private/gated model access by @xenova in #202
- Fix BPE tokenization for weird whitespace characters by @xenova in #208
- Thanks to @fozziethebeat for reporting and helping to debug
- Minor documentation improvements
Full Changelog: 2.4.1...2.4.2
2.4.1
What's new?
- Documentation improvements
Full Changelog: 2.4.0...2.4.1
2.4.0
What's new?
Word-level timestamps for Whisper automatic-speech-recognition 🤯
This release adds the ability to predict word-level timestamps for our Whisper automatic-speech-recognition models by analyzing the cross-attentions and applying dynamic time warping. Our implementation is adapted from this PR, which added this functionality to the 🤗 transformers Python library.
Example usage: (see docs)
import { pipeline } from '@xenova/transformers';
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
let transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en', {
revision: 'output_attentions',
});
let output = await transcriber(url, { return_timestamps: 'word' });
// {
// "text": " And so my fellow Americans ask not what your country can do for you ask what you can do for your country.",
// "chunks": [
// { "text": " And", "timestamp": [0, 0.78] },
// { "text": " so", "timestamp": [0.78, 1.06] },
// { "text": " my", "timestamp": [1.06, 1.46] },
// ...
// { "text": " for", "timestamp": [9.72, 9.92] },
// { "text": " your", "timestamp": [9.92, 10.22] },
// { "text": " country.", "timestamp": [10.22, 13.5] }
// ]
// }
Note: For now, you need to choose the output_attentions revision (see above). In the future, we may merge these models into the main branch. Also, we currently do not have exports for the medium and large models, simply because I don't have enough RAM to do the export myself (>25GB needed) 😅. So, if you would like to use our conversion script to do the conversion yourself, please make a PR on the Hub with these new models (under a new output_attentions branch)!
From our testing, the JS implementation exactly matches the output produced by the Python implementation (when using the same model of course)! 🥳
Python (left) vs. JavaScript (right)
I'm excited to see what you all build with this! Please tag me on Twitter if you use it in your project - I'd love to see it! I'm also planning on adding this as an option to whisper-web, so stay tuned! 🚀
Misc bug fixes and improvements
- Fix loading of grayscale images in node.js (#178)


