Loading dediacritic tool fails due to Emoji library dependency, and tokenizer model_max_length seems incorrect #5

@ghost

Description

Hello,

  1. The dediacritization tool does not work in Google Colab with Python 3.7. I tried manually modifying the Emoji library, but to no avail. Steps to reproduce:
import os  # needed for os.environ below

from google.colab import output
output.enable_custom_widget_manager()

from google.colab import drive
drive.mount('/content/drive/')

!pip install camel-tools==1.4.1 -f https://download.pytorch.org/whl/torch_stable.html
# Point CAMeL Tools at the data directory before downloading the datasets.
os.environ['CAMELTOOLS_DATA'] = '/content/drive/MyDrive/SAAL/EnAr/CAMeL'
!camel_data -i all

# Loading the dediacritization helper is the step that fails.
from camel_tools.utils.dediac import dediac_ar

  2. After loading the tokenizer with Hugging Face's AutoTokenizer, I have to set the tokenizer's model_max_length manually to 512; otherwise the value is an extremely large integer (> 1e10).
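
For reference, a minimal sketch of the workaround from point 2. Hugging Face tokenizers fall back to a very large sentinel value when the tokenizer config does not specify model_max_length, which would explain the number I am seeing. The helper name `cap_model_max_length` and the threshold below are hypothetical, chosen only for illustration, and the stand-in object replaces a real tokenizer so the snippet runs without downloading a model.

```python
from types import SimpleNamespace

# Assumption: 512 is the limit set manually in this issue.
DEFAULT_MAX_LENGTH = 512
# Hypothetical threshold: anything above this is treated as "unset".
UNSET_THRESHOLD = 1_000_000


def cap_model_max_length(tokenizer, max_length=DEFAULT_MAX_LENGTH):
    """Set tokenizer.model_max_length to max_length if it looks unset."""
    if tokenizer.model_max_length > UNSET_THRESHOLD:
        tokenizer.model_max_length = max_length
    return tokenizer


# Stand-in for a tokenizer whose config carries no length limit.
fake_tokenizer = SimpleNamespace(model_max_length=int(1e30))
cap_model_max_length(fake_tokenizer)

# A tokenizer that already has a sensible limit is left untouched.
short_tokenizer = SimpleNamespace(model_max_length=128)
cap_model_max_length(short_tokenizer)
```

With a real tokenizer, the same effect should be achievable by passing `model_max_length=512` directly to `AutoTokenizer.from_pretrained(...)`, or by assigning `tokenizer.model_max_length = 512` after loading, which is what I am doing now.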
