
Commit c91248c

Add support for RoFormer models (#464)
* Add `RoFormerTokenizer`
* Use `clean_text` in bert normalizer config
* Add control characters test
* Add support for RoFormer models
* Use default label if id2label is not specified
* Update requirements.txt
* Skip roformer tokenizer tests
1 parent 7636a1c commit c91248c
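For context, a minimal sketch of what this commit enables from the user's side, assuming a RoFormer checkpoint that has been converted to ONNX; `your-org/roformer-base-onnx` below is a placeholder repo id, not a real model:

```js
// Hedged sketch: the repo id is a placeholder for any ONNX-converted RoFormer checkpoint.
import { pipeline } from '@xenova/transformers';

const unmasker = await pipeline('fill-mask', 'your-org/roformer-base-onnx');
const predictions = await unmasker('Paris is the [MASK] of France.');
console.log(predictions);
// => [{ token_str: '...', score: ..., sequence: '...' }, ...] (top candidates for the mask)
```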

File tree: 8 files changed (+183 −6 lines)

README.md

Lines changed: 1 addition & 0 deletions
@@ -324,6 +324,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[Phi](https://huggingface.co/docs/transformers/main/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.
 1. **[ResNet](https://huggingface.co/docs/transformers/model_doc/resnet)** (from Microsoft Research) released with the paper [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun.
 1. **[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
+1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
 1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
 1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.

docs/snippets/6_supported-models.snippet

Lines changed: 1 addition & 0 deletions
@@ -59,6 +59,7 @@
 1. **[Phi](https://huggingface.co/docs/transformers/main/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.
 1. **[ResNet](https://huggingface.co/docs/transformers/model_doc/resnet)** (from Microsoft Research) released with the paper [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun.
 1. **[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
+1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
 1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
 1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.

scripts/supported_models.py

Lines changed: 40 additions & 0 deletions
@@ -650,6 +650,44 @@
             'microsoft/resnet-152',
         ],
     },
+    'roformer': {
+        # Feature extraction
+        'feature-extraction': [
+            'hf-tiny-model-private/tiny-random-RoFormerModel',
+        ],
+
+        # Text classification
+        'text-classification': [
+            'hf-tiny-model-private/tiny-random-RoFormerForSequenceClassification',
+        ],
+
+        # Token classification
+        'token-classification': [
+            'hf-tiny-model-private/tiny-random-RoFormerForTokenClassification',
+        ],
+
+        # TODO
+        # # Text generation
+        # 'text-generation': [
+        #     'hf-tiny-model-private/tiny-random-RoFormerForCausalLM',
+        # ],
+
+        # Masked language modelling
+        'fill-mask': [
+            'alchemab/antiberta2',
+            'hf-tiny-model-private/tiny-random-RoFormerForMaskedLM',
+        ],
+
+        # Question answering
+        'question-answering': [
+            'hf-tiny-model-private/tiny-random-RoFormerForQuestionAnswering',
+        ],
+
+        # Multiple choice
+        'multiple-choice': [
+            'hf-tiny-model-private/tiny-random-RoFormerForMultipleChoice',
+        ],
+    },
     'phi': {
         # Text generation
         'text-generation': [
@@ -747,6 +785,8 @@
             'MBZUAI/LaMini-T5-61M',
             'MBZUAI/LaMini-T5-223M',
             'MBZUAI/LaMini-T5-738M',
+            'declare-lab/flan-alpaca-base',
+            'declare-lab/flan-alpaca-large',
         ],

         # Feature extraction
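These entries register which checkpoints the conversion and test scripts exercise for each newly supported RoFormer task. A hedged sketch of running one of those tasks from JavaScript, with a placeholder repo id standing in for a converted checkpoint (the tiny-random models listed above are test-only):

```js
// Hedged sketch: 'your-org/roformer-qa-onnx' is a placeholder for a converted
// RoFormer question-answering checkpoint.
import { pipeline } from '@xenova/transformers';

const answerer = await pipeline('question-answering', 'your-org/roformer-qa-onnx');
const result = await answerer(
    'Who proposed rotary position embeddings?',                                            // question
    'Rotary position embeddings were proposed in the RoFormer paper by Jianlin Su et al.', // context
);
console.log(result); // => { answer: '...', score: ... }
```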

src/models.js

Lines changed: 77 additions & 0 deletions
@@ -1464,6 +1464,78 @@ export class BertForQuestionAnswering extends BertPreTrainedModel {
 }
 //////////////////////////////////////////////////

+//////////////////////////////////////////////////
+// RoFormer models
+export class RoFormerPreTrainedModel extends PreTrainedModel { }
+
+/**
+ * The bare RoFormer Model transformer outputting raw hidden-states without any specific head on top.
+ */
+export class RoFormerModel extends RoFormerPreTrainedModel { }
+
+/**
+ * RoFormer Model with a `language modeling` head on top.
+ */
+export class RoFormerForMaskedLM extends RoFormerPreTrainedModel {
+    /**
+     * Calls the model on new inputs.
+     *
+     * @param {Object} model_inputs The inputs to the model.
+     * @returns {Promise<MaskedLMOutput>} An object containing the model's output logits for masked language modeling.
+     */
+    async _call(model_inputs) {
+        return new MaskedLMOutput(await super._call(model_inputs));
+    }
+}
+
+/**
+ * RoFormer Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output)
+ */
+export class RoFormerForSequenceClassification extends RoFormerPreTrainedModel {
+    /**
+     * Calls the model on new inputs.
+     *
+     * @param {Object} model_inputs The inputs to the model.
+     * @returns {Promise<SequenceClassifierOutput>} An object containing the model's output logits for sequence classification.
+     */
+    async _call(model_inputs) {
+        return new SequenceClassifierOutput(await super._call(model_inputs));
+    }
+}
+
+/**
+ * RoFormer Model with a token classification head on top (a linear layer on top of the hidden-states output)
+ * e.g. for Named-Entity-Recognition (NER) tasks.
+ */
+export class RoFormerForTokenClassification extends RoFormerPreTrainedModel {
+    /**
+     * Calls the model on new inputs.
+     *
+     * @param {Object} model_inputs The inputs to the model.
+     * @returns {Promise<TokenClassifierOutput>} An object containing the model's output logits for token classification.
+     */
+    async _call(model_inputs) {
+        return new TokenClassifierOutput(await super._call(model_inputs));
+    }
+}
+
+/**
+ * RoFormer Model with a span classification head on top for extractive question-answering tasks like SQuAD
+ * (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`).
+ */
+export class RoFormerForQuestionAnswering extends RoFormerPreTrainedModel {
+    /**
+     * Calls the model on new inputs.
+     *
+     * @param {Object} model_inputs The inputs to the model.
+     * @returns {Promise<QuestionAnsweringModelOutput>} An object containing the model's output logits for question answering.
+     */
+    async _call(model_inputs) {
+        return new QuestionAnsweringModelOutput(await super._call(model_inputs));
+    }
+}
+// TODO: Add RoFormerForCausalLM and RoFormerForMultipleChoice
+//////////////////////////////////////////////////

 //////////////////////////////////////////////////
 // ConvBert models
@@ -4671,6 +4743,7 @@ export class PretrainedMixin {

 const MODEL_MAPPING_NAMES_ENCODER_ONLY = new Map([
     ['bert', ['BertModel', BertModel]],
+    ['roformer', ['RoFormerModel', RoFormerModel]],
     ['electra', ['ElectraModel', ElectraModel]],
     ['esm', ['EsmModel', EsmModel]],
     ['convbert', ['ConvBertModel', ConvBertModel]],
@@ -4756,6 +4829,7 @@ const MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING_NAMES = new Map([

 const MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = new Map([
     ['bert', ['BertForSequenceClassification', BertForSequenceClassification]],
+    ['roformer', ['RoFormerForSequenceClassification', RoFormerForSequenceClassification]],
     ['electra', ['ElectraForSequenceClassification', ElectraForSequenceClassification]],
     ['esm', ['EsmForSequenceClassification', EsmForSequenceClassification]],
     ['convbert', ['ConvBertForSequenceClassification', ConvBertForSequenceClassification]],
@@ -4776,6 +4850,7 @@ const MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = new Map([

 const MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES = new Map([
     ['bert', ['BertForTokenClassification', BertForTokenClassification]],
+    ['roformer', ['RoFormerForTokenClassification', RoFormerForTokenClassification]],
     ['electra', ['ElectraForTokenClassification', ElectraForTokenClassification]],
     ['esm', ['EsmForTokenClassification', EsmForTokenClassification]],
     ['convbert', ['ConvBertForTokenClassification', ConvBertForTokenClassification]],
@@ -4821,6 +4896,7 @@ const MODEL_WITH_LM_HEAD_MAPPING_NAMES = new Map([

 const MODEL_FOR_MASKED_LM_MAPPING_NAMES = new Map([
     ['bert', ['BertForMaskedLM', BertForMaskedLM]],
+    ['roformer', ['RoFormerForMaskedLM', RoFormerForMaskedLM]],
     ['electra', ['ElectraForMaskedLM', ElectraForMaskedLM]],
     ['esm', ['EsmForMaskedLM', EsmForMaskedLM]],
     ['convbert', ['ConvBertForMaskedLM', ConvBertForMaskedLM]],
@@ -4839,6 +4915,7 @@ const MODEL_FOR_MASKED_LM_MAPPING_NAMES = new Map([

 const MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES = new Map([
     ['bert', ['BertForQuestionAnswering', BertForQuestionAnswering]],
+    ['roformer', ['RoFormerForQuestionAnswering', RoFormerForQuestionAnswering]],
     ['electra', ['ElectraForQuestionAnswering', ElectraForQuestionAnswering]],
     ['convbert', ['ConvBertForQuestionAnswering', ConvBertForQuestionAnswering]],
     ['camembert', ['CamembertForQuestionAnswering', CamembertForQuestionAnswering]],
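The new classes plug into these mapping tables, which dispatch on `config.model_type`. A rough sketch of how that resolution might look in practice, assuming a hypothetical ONNX-converted RoFormer masked-LM checkpoint (`your-org/roformer-base-onnx` is a placeholder):

```js
// Hedged sketch of the model_type → class dispatch.
import { AutoTokenizer, AutoModelForMaskedLM } from '@xenova/transformers';

const model_id = 'your-org/roformer-base-onnx';
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
// config.model_type === 'roformer', so the masked-LM mapping selects RoFormerForMaskedLM.
const model = await AutoModelForMaskedLM.from_pretrained(model_id);

const inputs = await tokenizer('The capital of France is [MASK].');
const { logits } = await model(inputs); // wrapped in a MaskedLMOutput
console.log(logits.dims);               // e.g. [1, num_tokens, vocab_size]
```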

src/pipelines.js

Lines changed: 1 addition & 1 deletion
@@ -280,7 +280,7 @@ export class TokenClassificationPipeline extends Pipeline {
                 let tokenData = batch[j];
                 let topScoreIndex = max(tokenData.data)[1];

-                let entity = id2label[topScoreIndex];
+                let entity = id2label ? id2label[topScoreIndex] : `LABEL_${topScoreIndex}`;
                 if (ignore_labels.includes(entity)) {
                     // We predicted a token that should be ignored. So, we skip it.
                     continue;
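A sketch of what this fallback means for users, with a hypothetical token-classification checkpoint whose `config.json` defines no `id2label` mapping:

```js
// Hedged sketch: the repo id is a placeholder for a token-classification model
// whose config does not define id2label.
import { pipeline } from '@xenova/transformers';

const classifier = await pipeline('token-classification', 'your-org/ner-without-id2label');
const output = await classifier('My name is Sarah and I live in London');
// With no id2label in the config, entities now fall back to index-based names
// instead of failing the lookup:
// [{ entity: 'LABEL_0', ... }, { entity: 'LABEL_3', ... }, ...]
```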

src/tokenizers.js

Lines changed: 56 additions & 5 deletions
@@ -1182,17 +1182,61 @@ class BertNormalizer extends Normalizer {
         return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
     }

+
+    /**
+     * Checks whether `char` is a control character.
+     * @param {string} char The character to check.
+     * @returns {boolean} Whether `char` is a control character.
+     * @private
+     */
+    _is_control(char) {
+        switch (char) {
+            case '\t':
+            case '\n':
+            case '\r':
+                // These are technically control characters but we count them as whitespace characters.
+                return false;
+
+            default:
+                // Check if unicode category starts with C:
+                // Cc - Control
+                // Cf - Format
+                // Co - Private Use
+                // Cs - Surrogate
+                return /^\p{Cc}|\p{Cf}|\p{Co}|\p{Cs}$/u.test(char);
+        }
+    }
+
+    /**
+     * Performs invalid character removal and whitespace cleanup on text.
+     * @param {string} text The text to clean.
+     * @returns {string} The cleaned text.
+     * @private
+     */
+    _clean_text(text) {
+        const output = [];
+        for (const char of text) {
+            const cp = char.charCodeAt(0);
+            if (cp === 0 || cp === 0xFFFD || this._is_control(char)) {
+                continue;
+            }
+            if (/^\s$/.test(char)) { // is whitespace
+                output.push(" ");
+            } else {
+                output.push(char);
+            }
+        }
+        return output.join("");
+    }
     /**
      * Normalizes the given text based on the configuration.
      * @param {string} text The text to normalize.
      * @returns {string} The normalized text.
      */
     normalize(text) {
-        // TODO use rest of config
-        // config.clean_text,
-        // config.handle_chinese_chars,
-        // config.strip_accents,
-        // config.lowercase,
+        if (this.config.clean_text) {
+            text = this._clean_text(text);
+        }

         if (this.config.handle_chinese_chars) {
             text = this._tokenize_chinese_chars(text);
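A standalone illustration of the cleanup rules implemented above (not the library API itself, which keeps this logic private inside `BertNormalizer`): NUL and U+FFFD are dropped, other control/format characters are removed, and remaining whitespace is normalized to a single space.

```js
// Re-implementation of the same rules, for illustration only.
function cleanText(text) {
    const output = [];
    for (const char of text) {
        const cp = char.charCodeAt(0);
        // '\t', '\n' and '\r' are technically control characters, but are kept as whitespace.
        const isControl = !'\t\n\r'.includes(char) && /^[\p{Cc}\p{Cf}\p{Co}\p{Cs}]$/u.test(char);
        if (cp === 0 || cp === 0xFFFD || isControl) continue; // drop NUL, U+FFFD and control chars
        output.push(/^\s$/.test(char) ? ' ' : char);          // map any whitespace to a plain space
    }
    return output.join('');
}

console.log(cleanText('1\u00002\uFFFD3')); // '123'   (matches the new control-characters test string)
console.log(cleanText('a\tb\nc'));         // 'a b c'
```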
@@ -2944,6 +2988,12 @@ export class ConvBertTokenizer extends PreTrainedTokenizer {
         return add_token_types(inputs);
     }
 }
+export class RoFormerTokenizer extends PreTrainedTokenizer {
+    /** @type {add_token_types} */
+    prepare_model_inputs(inputs) {
+        return add_token_types(inputs);
+    }
+}
 export class DistilBertTokenizer extends PreTrainedTokenizer { }
 export class CamembertTokenizer extends PreTrainedTokenizer { }
 export class XLMTokenizer extends PreTrainedTokenizer {
@@ -4136,6 +4186,7 @@ export class AutoTokenizer {
         BertTokenizer,
         HerbertTokenizer,
         ConvBertTokenizer,
+        RoFormerTokenizer,
         XLMTokenizer,
         ElectraTokenizer,
         MobileBertTokenizer,
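A sketch of loading the new tokenizer through `AutoTokenizer`, assuming a hypothetical converted RoFormer repo; like the other BERT-style tokenizers, `prepare_model_inputs` adds `token_type_ids`:

```js
// Hedged sketch: the repo id is a placeholder for a converted RoFormer repo.
import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('your-org/roformer-base-onnx');
const inputs = await tokenizer('RoFormer uses rotary position embeddings.');
console.log(Object.keys(inputs)); // includes 'token_type_ids' alongside 'input_ids' and 'attention_mask'
```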

tests/generate_tests.py

Lines changed: 6 additions & 0 deletions
@@ -39,6 +39,9 @@

     # TODO: remove when https://github.com/huggingface/transformers/issues/26547 is fixed
     'speecht5',
+
+    # TODO: remove when https://github.com/huggingface/transformers/issues/28164 is fixed
+    'roformer',
 ]

 TOKENIZERS_TO_IGNORE = [
@@ -80,6 +83,9 @@
         "<s>\n",
         " </s> test </s> ",
         "</s>test</s>",
+
+        # Control characters
+        "1\u00002\uFFFD3",
     ],
     "custom_by_model_type": {
         "llama": [

tests/requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@ transformers[torch]@git+https://github.com/huggingface/transformers
 sacremoses==0.0.53
 sentencepiece==0.1.99
 protobuf==4.24.3
+rjieba==0.1.11
