
Commit a938a56

Add support for grounding dino (#1137)
* Add sum, norm, normalize unit tests
* Add min/max unit tests
* make tests synchronous
* Cleanup
* Update mean op unit tests
* Add more tensor unit tests
* Update view unit test
* Add tensor construction unit tests
* Add more tensor op unit tests
* Add another squeeze unit test
* Multiple dims for squeeze unit test
* Refactor tensor reduce ops
* Add support for `gt` and `lt` tensor ops
* Add grounding dino implementation
* Allow grounding dino to be usable via the pipeline API
* Add listed support for grounding dino
* Add grounding dino unit tests
* Add zero-shot object detection pipeline unit test for grounding dino
1 parent f126091 · commit a938a56

15 files changed: +915 −274 lines

README.md

Lines changed: 1 addition & 0 deletions
@@ -335,6 +335,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
 1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra.
 1. **[Granite](https://huggingface.co/docs/transformers/main/model_doc/granite)** (from IBM) released with the paper [Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler](https://arxiv.org/abs/2408.13359) by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox, Rameswar Panda.
+1. **[Grounding DINO](https://huggingface.co/docs/transformers/model_doc/grounding-dino)** (from IDEA-Research) released with the paper [Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection](https://arxiv.org/abs/2303.05499) by Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang.
 1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
 1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (from Allegro.pl, AGH University of Science and Technology) released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik.
 1. **[Hiera](https://huggingface.co/docs/transformers/model_doc/hiera)** (from Meta) released with the paper [Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles](https://arxiv.org/pdf/2306.00989) by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer.

docs/snippets/6_supported-models.snippet

Lines changed: 1 addition & 0 deletions
@@ -50,6 +50,7 @@
 1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
 1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra.
 1. **[Granite](https://huggingface.co/docs/transformers/main/model_doc/granite)** (from IBM) released with the paper [Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler](https://arxiv.org/abs/2408.13359) by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox, Rameswar Panda.
+1. **[Grounding DINO](https://huggingface.co/docs/transformers/model_doc/grounding-dino)** (from IDEA-Research) released with the paper [Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection](https://arxiv.org/abs/2303.05499) by Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang.
 1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
 1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (from Allegro.pl, AGH University of Science and Technology) released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik.
 1. **[Hiera](https://huggingface.co/docs/transformers/model_doc/hiera)** (from Meta) released with the paper [Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles](https://arxiv.org/pdf/2306.00989) by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer.

src/base/image_processors_utils.js

Lines changed: 1 addition & 1 deletion
@@ -68,7 +68,7 @@ function enforce_size_divisibility([width, height], divisor) {
  * @param {number[]} arr The coordinate for the center of the box and its width, height dimensions (center_x, center_y, width, height)
  * @returns {number[]} The coodinates for the top-left and bottom-right corners of the box (top_left_x, top_left_y, bottom_right_x, bottom_right_y)
  */
-function center_to_corners_format([centerX, centerY, width, height]) {
+export function center_to_corners_format([centerX, centerY, width, height]) {
     return [
         centerX - width / 2,
         centerY - height / 2,
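For intuition, this helper maps a `(center_x, center_y, width, height)` box to its `(top_left_x, top_left_y, bottom_right_x, bottom_right_y)` corners; it is now exported so the new Grounding DINO processor can reuse it. A self-contained sketch, with the body completed beyond the truncated hunk according to the documented return format:

```js
// Sketch of the exported helper; the last two lines are inferred from the
// documented (top_left_x, top_left_y, bottom_right_x, bottom_right_y) format.
function center_to_corners_format([centerX, centerY, width, height]) {
    return [
        centerX - width / 2,  // top_left_x
        centerY - height / 2, // top_left_y
        centerX + width / 2,  // bottom_right_x
        centerY + height / 2, // bottom_right_y
    ];
}

console.log(center_to_corners_format([0.5, 0.5, 0.2, 0.4]));
// → [0.4, 0.3, 0.6, 0.7]
```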

src/base/processing_utils.js

Lines changed: 11 additions & 0 deletions
@@ -101,6 +101,17 @@ export class Processor extends Callable {
         return this.tokenizer.batch_decode(...args);
     }
 
+    /**
+     * @param {Parameters<PreTrainedTokenizer['decode']>} args
+     * @returns {ReturnType<PreTrainedTokenizer['decode']>}
+     */
+    decode(...args) {
+        if (!this.tokenizer) {
+            throw new Error('Unable to decode without a tokenizer.');
+        }
+        return this.tokenizer.decode(...args);
+    }
+
 
     /**
      * Calls the feature_extractor function with the given input.
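The new `decode` mirrors the `batch_decode` passthrough directly above it, for a single sequence. A minimal usage sketch; the checkpoint id and token ids below are illustrative, not taken from this diff:

```js
import { AutoProcessor } from '@huggingface/transformers';

// Checkpoint id and token ids are illustrative.
const processor = await AutoProcessor.from_pretrained('onnx-community/grounding-dino-tiny-ONNX');
const text = processor.decode([101, 1037, 4937, 1012, 102], { skip_special_tokens: true });
// Delegates to processor.tokenizer.decode; throws if the processor has no tokenizer.
```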

src/models.js

Lines changed: 17 additions & 5 deletions
@@ -532,14 +532,23 @@ async function encoderForward(self, model_inputs) {
         encoderFeeds.inputs_embeds = await self.encode_text({ input_ids: model_inputs.input_ids });
     }
     if (session.inputNames.includes('token_type_ids') && !encoderFeeds.token_type_ids) {
+        if (!encoderFeeds.input_ids) {
+            throw new Error('Both `input_ids` and `token_type_ids` are missing in the model inputs.');
+        }
         // Assign default `token_type_ids` (all zeroes) to the `encoderFeeds` if the model expects it,
         // but they weren't created by the tokenizer.
-        encoderFeeds.token_type_ids = new Tensor(
-            'int64',
-            new BigInt64Array(encoderFeeds.input_ids.data.length),
-            encoderFeeds.input_ids.dims
-        )
+        encoderFeeds.token_type_ids = zeros_like(encoderFeeds.input_ids);
+    }
+    if (session.inputNames.includes('pixel_mask') && !encoderFeeds.pixel_mask) {
+        if (!encoderFeeds.pixel_values) {
+            throw new Error('Both `pixel_values` and `pixel_mask` are missing in the model inputs.');
+        }
+        // Assign default `pixel_mask` (all ones) to the `encoderFeeds` if the model expects it,
+        // but they weren't created by the processor.
+        const dims = encoderFeeds.pixel_values.dims;
+        encoderFeeds.pixel_mask = ones([dims[0], dims[2], dims[3]]);
     }
+
     return await sessionRun(session, encoderFeeds);
 }
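In effect, `encoderForward` now backfills inputs that the ONNX session declares but the tokenizer or processor did not produce: all-zero `token_type_ids` and an all-ones `pixel_mask`, throwing if the tensors they derive from are also absent. A standalone sketch of the same defaulting logic, with illustrative shapes (this assumes `ones` and `zeros_like` are re-exported from the package root alongside `Tensor`, as other tensor utilities are):

```js
import { Tensor, ones, zeros_like } from '@huggingface/transformers';

// Illustrative inputs: a batch of 2 sequences of length 6,
// and a batch of 2 RGB images of size 64×64.
const input_ids = new Tensor('int64', new BigInt64Array(2 * 6).fill(1n), [2, 6]);
const pixel_values = ones([2, 3, 64, 64]);

// Default token_type_ids: all zeros, same dims as input_ids.
const token_type_ids = zeros_like(input_ids); // dims [2, 6]

// Default pixel_mask: all ones over batch and spatial dims (channel dim dropped).
const dims = pixel_values.dims;
const pixel_mask = ones([dims[0], dims[2], dims[3]]); // dims [2, 64, 64]
```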

@@ -5428,6 +5437,8 @@ export class Dinov2WithRegistersForImageClassification extends Dinov2WithRegistersPreTrainedModel {
     }
 }
 //////////////////////////////////////////////////
+export class GroundingDinoPreTrainedModel extends PreTrainedModel { }
+export class GroundingDinoForObjectDetection extends GroundingDinoPreTrainedModel { }
 
 //////////////////////////////////////////////////
 export class YolosPreTrainedModel extends PreTrainedModel { }

@@ -7338,6 +7349,7 @@ const MODEL_FOR_OBJECT_DETECTION_MAPPING_NAMES = new Map([
 const MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING_NAMES = new Map([
     ['owlvit', ['OwlViTForObjectDetection', OwlViTForObjectDetection]],
     ['owlv2', ['Owlv2ForObjectDetection', Owlv2ForObjectDetection]],
+    ['grounding-dino', ['GroundingDinoForObjectDetection', GroundingDinoForObjectDetection]],
 ]);
 
 const MODEL_FOR_IMAGE_SEGMENTATION_MAPPING_NAMES = new Map([
src/models/grounding_dino/image_processing_grounding_dino.js

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+
+import {
+    ImageProcessor,
+} from "../../base/image_processors_utils.js";
+import { ones } from '../../utils/tensor.js';
+
+
+/**
+ * @typedef {object} GroundingDinoFeatureExtractorResultProps
+ * @property {import('../../utils/tensor.js').Tensor} pixel_mask
+ * @typedef {import('../../base/image_processors_utils.js').ImageProcessorResult & GroundingDinoFeatureExtractorResultProps} GroundingDinoFeatureExtractorResult
+ */
+
+export class GroundingDinoImageProcessor extends ImageProcessor {
+    /**
+     * Calls the feature extraction process on an array of images, preprocesses
+     * each image, and concatenates the resulting features into a single Tensor.
+     * @param {import('../../utils/image.js').RawImage[]} images The image(s) to extract features from.
+     * @returns {Promise<GroundingDinoFeatureExtractorResult>} An object containing the concatenated pixel values of the preprocessed images.
+     */
+    async _call(images) {
+        const result = await super._call(images);
+
+        const dims = result.pixel_values.dims;
+        const pixel_mask = ones([dims[0], dims[2], dims[3]]);
+
+        return { ...result, pixel_mask };
+    }
+}
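Relative to the base `ImageProcessor`, the only addition is a `pixel_mask` of ones that tracks the batch and spatial dimensions of `pixel_values`. A usage sketch; the checkpoint id, image URL, and example dims are illustrative:

```js
import { AutoImageProcessor, RawImage } from '@huggingface/transformers';

// Checkpoint id and image URL are illustrative.
const image_processor = await AutoImageProcessor.from_pretrained('onnx-community/grounding-dino-tiny-ONNX');
const image = await RawImage.read('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png');

const { pixel_values, pixel_mask } = await image_processor(image);
console.log(pixel_values.dims); // [batch, channels, height, width], e.g. [1, 3, 800, 1200]
console.log(pixel_mask.dims);   // [batch, height, width], all ones
```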
src/models/grounding_dino/processing_grounding_dino.js

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
+import { Processor } from "../../base/processing_utils.js";
+import { AutoImageProcessor } from "../auto/image_processing_auto.js";
+import { AutoTokenizer } from "../../tokenizers.js";
+import { center_to_corners_format } from "../../base/image_processors_utils.js";
+
+/**
+ * Get token ids of phrases from posmaps and input_ids.
+ * @param {import('../../utils/tensor.js').Tensor} posmaps A boolean tensor of unbatched text-thresholded logits related to the detected bounding boxes of shape `(hidden_size, )`.
+ * @param {import('../../utils/tensor.js').Tensor} input_ids A tensor of token ids of shape `(sequence_length, )`.
+ */
+function get_phrases_from_posmap(posmaps, input_ids) {
+
+    const left_idx = 0;
+    const right_idx = posmaps.dims.at(-1) - 1;
+
+    const posmaps_list = posmaps.tolist();
+    posmaps_list.fill(false, 0, left_idx + 1);
+    posmaps_list.fill(false, right_idx);
+
+    const input_ids_list = input_ids.tolist();
+    return posmaps_list
+        .map((val, idx) => val ? idx : null)
+        .filter(idx => idx !== null)
+        .map(i => input_ids_list[i]);
+}
+
+export class GroundingDinoProcessor extends Processor {
+    static tokenizer_class = AutoTokenizer
+    static image_processor_class = AutoImageProcessor
+
+    /**
+     * @typedef {import('../../utils/image.js').RawImage} RawImage
+     */
+    /**
+     *
+     * @param {RawImage|RawImage[]|RawImage[][]} images
+     * @param {string|string[]} text
+     * @returns {Promise<any>}
+     */
+    async _call(images, text, options = {}) {
+
+        const image_inputs = images ? await this.image_processor(images, options) : {};
+        const text_inputs = text ? this.tokenizer(text, options) : {};
+
+        return {
+            ...text_inputs,
+            ...image_inputs,
+        }
+    }
+    post_process_grounded_object_detection(outputs, input_ids, {
+        box_threshold = 0.25,
+        text_threshold = 0.25,
+        target_sizes = null
+    } = {}) {
+        const { logits, pred_boxes } = outputs;
+        const batch_size = logits.dims[0];
+
+        if (target_sizes !== null && target_sizes.length !== batch_size) {
+            throw Error("Make sure that you pass in as many target sizes as the batch dimension of the logits")
+        }
+        const num_queries = logits.dims.at(1);
+
+        const probs = logits.sigmoid(); // (batch_size, num_queries, 256)
+        const scores = probs.max(-1).tolist(); // (batch_size, num_queries)
+
+        // Convert to [x0, y0, x1, y1] format
+        const boxes = pred_boxes.tolist() // (batch_size, num_queries, 4)
+            .map(batch => batch.map(box => center_to_corners_format(box)));
+
+        const results = [];
+        for (let i = 0; i < batch_size; ++i) {
+            const target_size = target_sizes !== null ? target_sizes[i] : null;
+
+            // Convert from relative [0, 1] to absolute [0, height] coordinates
+            if (target_size !== null) {
+                boxes[i] = boxes[i].map(box => box.map((x, j) => x * target_size[(j + 1) % 2]));
+            }
+
+            const batch_scores = scores[i];
+            const final_scores = [];
+            const final_phrases = [];
+            const final_boxes = [];
+            for (let j = 0; j < num_queries; ++j) {
+                const score = batch_scores[j];
+                if (score <= box_threshold) {
+                    continue;
+                }
+                const box = boxes[i][j];
+                const prob = probs[i][j];
+
+                final_scores.push(score);
+                final_boxes.push(box);
+
+                const phrases = get_phrases_from_posmap(prob.gt(text_threshold), input_ids[i]);
+                final_phrases.push(phrases);
+            }
+            results.push({ scores: final_scores, boxes: final_boxes, labels: this.batch_decode(final_phrases) });
+        }
+        return results;
+    }
+}
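Putting the pieces together, a hedged end-to-end sketch of the processor plus its post-processing (the checkpoint id, image URL, and thresholds are illustrative, not part of this diff). `post_process_grounded_object_detection` keeps boxes scoring above `box_threshold`, then decodes the tokens whose per-box logits exceed `text_threshold` into label phrases:

```js
import { AutoProcessor, AutoModelForZeroShotObjectDetection, RawImage } from '@huggingface/transformers';

// Checkpoint id is illustrative; any compatible Grounding DINO ONNX export should work.
const model_id = 'onnx-community/grounding-dino-tiny-ONNX';
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForZeroShotObjectDetection.from_pretrained(model_id);

const image = await RawImage.read('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png');
const text = 'a cat.'; // Grounding DINO queries are lowercased and end with a dot.

const inputs = await processor(image, text);
const outputs = await model(inputs);

const results = processor.post_process_grounded_object_detection(outputs, inputs.input_ids, {
    box_threshold: 0.3,
    text_threshold: 0.3,
    target_sizes: [[image.height, image.width]], // (height, width) per image
});
console.log(results[0]); // { scores: [...], boxes: [[x0, y0, x1, y1], ...], labels: [...] }
```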

src/models/image_processors.js

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ export * from './donut/image_processing_donut.js'
 export * from './dpt/image_processing_dpt.js'
 export * from './efficientnet/image_processing_efficientnet.js'
 export * from './glpn/image_processing_glpn.js'
+export * from './grounding_dino/image_processing_grounding_dino.js'
 export * from './idefics3/image_processing_idefics3.js'
 export * from './janus/image_processing_janus.js'
 export * from './jina_clip/image_processing_jina_clip.js'

src/models/processors.js

Lines changed: 3 additions & 2 deletions
@@ -1,9 +1,10 @@
 export * from './florence2/processing_florence2.js';
-export * from './mgp_str/processing_mgp_str.js';
-export * from './moonshine/processing_moonshine.js';
+export * from './grounding_dino/processing_grounding_dino.js';
 export * from './idefics3/processing_idefics3.js';
 export * from './janus/processing_janus.js';
 export * from './jina_clip/processing_jina_clip.js';
+export * from './mgp_str/processing_mgp_str.js';
+export * from './moonshine/processing_moonshine.js';
 export * from './owlvit/processing_owlvit.js';
 export * from './phi3_v/processing_phi3_v.js';
 export * from './paligemma/processing_paligemma.js';

src/pipelines.js

Lines changed: 29 additions & 7 deletions
@@ -2553,13 +2553,35 @@ export class ZeroShotObjectDetectionPipeline extends (/** @type {new (options: T
         // Run model with both text and pixel inputs
         const output = await this.model({ ...text_inputs, pixel_values });
 
-        // @ts-ignore
-        const processed = this.processor.image_processor.post_process_object_detection(output, threshold, imageSize, true)[0];
-        let result = processed.boxes.map((box, i) => ({
-            score: processed.scores[i],
-            label: candidate_labels[processed.classes[i]],
-            box: get_bounding_box(box, !percentage),
-        })).sort((a, b) => b.score - a.score);
+        let result;
+        if('post_process_grounded_object_detection' in this.processor) {
+            // @ts-ignore
+            const processed = this.processor.post_process_grounded_object_detection(
+                output,
+                text_inputs.input_ids,
+                {
+                    // TODO: support separate threshold values
+                    box_threshold: threshold,
+                    text_threshold: threshold,
+                    target_sizes: imageSize,
+                },
+            )[0];
+            result = processed.boxes.map((box, i) => ({
+                score: processed.scores[i],
+                label: processed.labels[i],
+                box: get_bounding_box(box, !percentage),
+            }))
+        } else {
+            // @ts-ignore
+            const processed = this.processor.image_processor.post_process_object_detection(output, threshold, imageSize, true)[0];
+            result = processed.boxes.map((box, i) => ({
+                score: processed.scores[i],
+                label: candidate_labels[processed.classes[i]],
+                box: get_bounding_box(box, !percentage),
+            }))
+        }
+        result.sort((a, b) => b.score - a.score);
 
         if (top_k !== null) {
             result = result.slice(0, top_k);
         }
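With the branch above in place, Grounding DINO is reachable through the high-level pipeline API as well. Note the TODO: a single `threshold` currently feeds both `box_threshold` and `text_threshold`. A usage sketch (the model id and image URL are illustrative):

```js
import { pipeline } from '@huggingface/transformers';

// Model id is illustrative.
const detector = await pipeline('zero-shot-object-detection', 'onnx-community/grounding-dino-tiny-ONNX');

const url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png';
const output = await detector(url, ['a cat.'], { threshold: 0.3 });
// → [{ score, label, box: { xmin, ymin, xmax, ymax } }, ...], sorted by score
```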
