Commit 57955cb

[DOCS] Adds DeBERTA v2 to the tokenizers list in API docs (#112752)
Co-authored-by: Max Hniebergall <[email protected]>
1 parent 763764c commit 57955cb

4 files changed: 151 additions, 0 deletions

docs/reference/ingest/processors/inference.asciidoc

Lines changed: 81 additions & 0 deletions
@@ -169,6 +169,18 @@ include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizatio
 include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-truncate]
 =======
 
+`deberta_v2`::::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-deberta-v2]
++
+.Properties of deberta_v2
+[%collapsible%open]
+=======
+`truncate`::::
+(Optional, string)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-truncate-deberta-v2]
+=======
+
 `roberta`::::
 (Optional, object)
 include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-roberta]
@@ -224,6 +236,18 @@ include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizatio
 include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-truncate]
 =======
 
+`deberta_v2`::::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-deberta-v2]
++
+.Properties of deberta_v2
+[%collapsible%open]
+=======
+`truncate`::::
+(Optional, string)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-truncate-deberta-v2]
+=======
+
 `roberta`::::
 (Optional, object)
 include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-roberta]
@@ -304,6 +328,23 @@ include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizatio
 include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-truncate]
 =======
 
+`deberta_v2`::::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-deberta-v2]
++
+.Properties of deberta_v2
+[%collapsible%open]
+=======
+`span`::::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-span]
+
+`truncate`::::
+(Optional, string)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-truncate-deberta-v2]
+=======
+
+
 `roberta`::::
 (Optional, object)
 include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-roberta]
@@ -363,6 +404,18 @@ include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizatio
 include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-truncate]
 =======
 
+`deberta_v2`::::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-deberta-v2]
++
+.Properties of deberta_v2
+[%collapsible%open]
+=======
+`truncate`::::
+(Optional, string)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-truncate-deberta-v2]
+=======
+
 `roberta`::::
 (Optional, object)
 include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-roberta]
@@ -424,6 +477,22 @@ include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizatio
 include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-truncate]
 =======
 
+`deberta_v2`::::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-deberta-v2]
++
+.Properties of deberta_v2
+[%collapsible%open]
+=======
+`span`::::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-span]
+
+`truncate`::::
+(Optional, string)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-truncate-deberta-v2]
+=======
+
 `roberta`::::
 (Optional, object)
 include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-roberta]
@@ -515,6 +584,18 @@ include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizatio
 include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-truncate]
 =======
 
+`deberta_v2`::::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-deberta-v2]
++
+.Properties of deberta_v2
+[%collapsible%open]
+=======
+`truncate`::::
+(Optional, string)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-truncate-deberta-v2]
+=======
+
 `roberta`::::
 (Optional, object)
 include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-roberta]

docs/reference/ml/ml-shared.asciidoc

Lines changed: 27 additions & 0 deletions
@@ -988,6 +988,7 @@ values are
 +
 --
 * `bert`: Use for BERT-style models
+* `deberta_v2`: Use for DeBERTa v2 and v3-style models
 * `mpnet`: Use for MPNet-style models
 * `roberta`: Use for RoBERTa-style and BART-style models
 * experimental:[] `xlm_roberta`: Use for XLMRoBERTa-style models
@@ -1037,6 +1038,19 @@ sequence. Therefore, do not use `second` in this case.
 
 end::inference-config-nlp-tokenization-truncate[]
 
+tag::inference-config-nlp-tokenization-truncate-deberta-v2[]
+Indicates how tokens are truncated when they exceed `max_sequence_length`.
+The default value is `first`.
++
+--
+* `balanced`: One or both of the first and second sequences may be truncated so as to balance the tokens included from both sequences.
+* `none`: No truncation occurs; the inference request receives an error.
+* `first`: Only the first sequence is truncated.
+* `second`: Only the second sequence is truncated. If there is just one sequence, that sequence is truncated.
+--
+
+end::inference-config-nlp-tokenization-truncate-deberta-v2[]
+
 tag::inference-config-nlp-tokenization-bert-with-special-tokens[]
 Tokenize with special tokens. The tokens typically included in BERT-style tokenization are:
 +
@@ -1050,10 +1064,23 @@ tag::inference-config-nlp-tokenization-bert-ja-with-special-tokens[]
 Tokenize with special tokens if `true`.
 end::inference-config-nlp-tokenization-bert-ja-with-special-tokens[]
 
+tag::inference-config-nlp-tokenization-deberta-v2[]
+DeBERTa-style tokenization is to be performed with the enclosed settings.
+end::inference-config-nlp-tokenization-deberta-v2[]
+
 tag::inference-config-nlp-tokenization-max-sequence-length[]
 Specifies the maximum number of tokens allowed to be output by the tokenizer.
 end::inference-config-nlp-tokenization-max-sequence-length[]
 
+tag::inference-config-nlp-tokenization-deberta-v2-with-special-tokens[]
+Tokenize with special tokens. The tokens typically included in DeBERTa-style tokenization are:
++
+--
+* `[CLS]`: The first token of the sequence being classified.
+* `[SEP]`: Indicates sequence separation and sequence end.
+--
+end::inference-config-nlp-tokenization-deberta-v2-with-special-tokens[]
+
 tag::inference-config-nlp-tokenization-roberta[]
 RoBERTa-style tokenization is to be performed with the enclosed settings.
 end::inference-config-nlp-tokenization-roberta[]
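
As a rough illustration of the four `truncate` values documented in the `inference-config-nlp-tokenization-truncate-deberta-v2` tag, the Python sketch below applies each mode to a pair of token lists. It is not Elasticsearch's implementation. The function name `truncate_tokens` is hypothetical, special tokens such as `[CLS]` and `[SEP]` are ignored, and the `balanced` rule (trim the longer sequence one token at a time) is an assumption, since the documentation only says that tokens from both sequences are balanced.

```python
from typing import List, Optional, Tuple


def truncate_tokens(
    first: List[str],
    second: Optional[List[str]],
    max_sequence_length: int,
    truncate: str = "first",  # documented default for deberta_v2
) -> Tuple[List[str], Optional[List[str]]]:
    """Illustrative only: mimic the documented `truncate` modes on token lists."""
    first = list(first)
    second = list(second) if second is not None else None
    total = len(first) + (len(second) if second else 0)
    overflow = total - max_sequence_length
    if overflow <= 0:
        return first, second  # already fits, nothing to do

    if truncate == "none":
        # Documented behaviour: no truncation, the request receives an error.
        raise ValueError("input exceeds max_sequence_length and truncate is 'none'")

    if truncate == "first":
        # Only the first sequence is shortened (a sequence shorter than the
        # overflow is not handled specially in this sketch).
        return first[: max(len(first) - overflow, 0)], second

    if truncate == "second":
        if not second:
            # Documented behaviour: with just one sequence, that one is truncated.
            return first[: max(len(first) - overflow, 0)], second
        return first, second[: max(len(second) - overflow, 0)]

    if truncate == "balanced":
        if not second:
            return first[:max_sequence_length], second
        # Assumption: drop tokens from whichever sequence is currently longer.
        while len(first) + len(second) > max_sequence_length:
            if len(first) >= len(second):
                first.pop()
            else:
                second.pop()
        return first, second

    raise ValueError(f"unknown truncate value: {truncate!r}")
```

For example, under these assumptions two five-token sequences with `max_sequence_length` of 6 and `truncate` set to `balanced` are each cut to three tokens, whereas `first` would cut only the first sequence.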

docs/reference/ml/trained-models/apis/infer-trained-model.asciidoc

Lines changed: 12 additions & 0 deletions
@@ -137,6 +137,18 @@ include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizatio
 (Optional, string)
 include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-truncate]
 =======
+`deberta_v2`::::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-deberta-v2]
++
+.Properties of deberta_v2
+[%collapsible%open]
+=======
+`truncate`::::
+(Optional, string)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-truncate-deberta-v2]
+=======
+
 `roberta`::::
 (Optional, object)
 include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-roberta]

docs/reference/ml/trained-models/apis/put-trained-models.asciidoc

Lines changed: 31 additions & 0 deletions
@@ -773,6 +773,37 @@ include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizatio
 (Optional, boolean)
 include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-with-special-tokens]
 ====
+`deberta_v2`::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-deberta-v2]
++
+.Properties of deberta_v2
+[%collapsible%open]
+====
+`do_lower_case`:::
+(Optional, boolean)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-do-lower-case]
++
+--
+Defaults to `false`.
+--
+
+`max_sequence_length`:::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-max-sequence-length]
+
+`span`:::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-span]
+
+`truncate`:::
+(Optional, string)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-truncate-deberta-v2]
+
+`with_special_tokens`:::
+(Optional, boolean)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-deberta-v2-with-special-tokens]
+====
 `roberta`::
 (Optional, object)
 include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-roberta]
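
To make the `deberta_v2` properties added to `put-trained-models.asciidoc` more concrete, the Python sketch below assembles a tokenization object with those property names and prints it as JSON. The helper `deberta_v2_tokenization` is hypothetical. The `text_classification` task and the surrounding `inference_config` nesting are chosen only for illustration, and the `max_sequence_length` and `with_special_tokens` values are assumptions, not defaults stated in this commit. Only the `truncate` default of `first` and the `do_lower_case` default of `false` come from the documented text.

```python
import json
from typing import Any, Dict, Optional

# The four values listed in the deberta_v2 truncate documentation.
ALLOWED_TRUNCATE = {"balanced", "first", "none", "second"}


def deberta_v2_tokenization(
    max_sequence_length: int = 512,      # assumption, not a documented default
    truncate: str = "first",             # documented default
    do_lower_case: bool = False,         # documented default
    with_special_tokens: bool = True,    # assumption
    span: Optional[int] = None,          # left out of the body unless given
) -> Dict[str, Any]:
    """Build a `deberta_v2` tokenization object (hypothetical helper)."""
    if truncate not in ALLOWED_TRUNCATE:
        raise ValueError(f"truncate must be one of {sorted(ALLOWED_TRUNCATE)}")
    settings: Dict[str, Any] = {
        "do_lower_case": do_lower_case,
        "max_sequence_length": max_sequence_length,
        "truncate": truncate,
        "with_special_tokens": with_special_tokens,
    }
    if span is not None:
        settings["span"] = span
    return {"deberta_v2": settings}


# Illustrative request-body fragment; the task type is an assumption.
body = {
    "inference_config": {
        "text_classification": {
            "tokenization": deberta_v2_tokenization(truncate="balanced"),
        }
    }
}

print(json.dumps(body, indent=2))
```

Validating `truncate` on the client side is a design choice of this sketch; it simply mirrors the four values the new shared tag enumerates.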
