
Max length problem with bert-base-arabic-camelbert-mix-pos-msa #6

@dearden

Description


What I'm doing

I'm using CamelBERT PoS tagging to process modern standard Arabic text, and I'm doing so as follows.

# make the model using pipeline
from transformers import pipeline

model = pipeline("token-classification", model="CAMeL-Lab/bert-base-arabic-camelbert-ca")

# run the model on some text
model("SOME ARABIC TEXT")

The problem

When running the model on texts with >512 words, I get the following error.

RuntimeError: The size of tensor a (563) must match the size of tensor b (512) at non-singleton dimension 1

As mentioned in this issue over in Camel Tools, it's a known CamelBERT problem and the solution is to use the tokeniser as follows:

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca', max_length=512)

However, this does not fix the whole pipeline, and running pipeline with max_length=512 results in an error because the parameter does not exist.

What I've tried

I've tried doing the following...

from transformers import AutoTokenizer, pipeline

tokeniser = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca', truncation=True, max_length=512)
model = pipeline("token-classification", model='CAMeL-Lab/bert-base-arabic-camelbert-ca', tokenizer=tokeniser)

but that doesn't work either. There's this warning...

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

which suggests that the parameter is being ignored even though max_length is specified.

I've almost got it working by running the tokeniser and the model separately.

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokeniser = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca', truncation=True, max_length=512)
model = AutoModelForTokenClassification.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')

text = "SOME ARABIC TEXT"
tokenised = tokeniser(text, return_tensors="pt", max_length=512, truncation=True)
output = model(**tokenised)

But then I get the output as raw tensors, and I'm not sure how to decode them into human-readable tags.
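For reference, here's a sketch of what I assume the decoding step looks like, using plain-Python stand-ins for the model's outputs (the `id2label` mapping and the logits values below are made up for illustration, not the real CamelBERT tag set):

```python
# Stand-in for model.config.id2label; the real CamelBERT POS tag set differs.
id2label = {0: "noun", 1: "verb", 2: "prep"}

# Stand-in for output.logits (shape: seq_len x num_labels) and for
# tokeniser.convert_ids_to_tokens(tokenised["input_ids"][0]).
logits = [
    [2.0, 0.1, 0.3],
    [0.2, 3.0, 0.1],
    [0.5, 0.2, 1.9],
]
tokens = ["كتب", "الولد", "في"]

def argmax(row):
    # index of the highest-scoring label; in practice this would be
    # output.logits.argmax(-1) on the torch tensor
    return max(range(len(row)), key=row.__getitem__)

pred_ids = [argmax(row) for row in logits]
tagged = list(zip(tokens, (id2label[i] for i in pred_ids)))
print(tagged)
```

If that's roughly right, the remaining detail would be skipping the special tokens ([CLS], [SEP]) and merging subword pieces back into words.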

Question

Is there a known fix or workaround for this problem? The output from CamelBERT is super useful, but quite a lot of my texts have >512 tokens.
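In the meantime, one workaround I've been considering is chunking the text myself before calling the pipeline. A rough sketch (the 400-word window is my own guess at a margin that keeps the subword count under 512, not a documented value, and splitting on whitespace will break across sentence boundaries):

```python
def chunk_words(text, max_words=400):
    # Split on whitespace and yield windows small enough that the
    # subword expansion should stay under the model's 512-token limit.
    words = text.split()
    for i in range(0, len(words), max_words):
        yield " ".join(words[i:i + max_words])

# usage, with `model` being the pipeline from above:
# results = [model(chunk) for chunk in chunk_words(long_text)]
```

This loses context at chunk boundaries, so tags near the edges of each window may be less reliable.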

Thanks! And apologies if I'm just missing something obvious.
