`codellama.md` (21 additions, 19 deletions)
This is a specialized task particular to code models. The model is trained to generate code (including comments) that best matches an existing prefix and suffix.
This task is available in the **base** and **instruction** variants of the 7B and 13B models. It is _not_ available for any of the 34B models or the Python versions.
To use this feature successfully, you need to pay close attention to the format used to train the model for this task, as it uses special separators to identify the different parts of the prompt. Fortunately, transformers' `CodeLlamaTokenizer` makes this very easy, as demonstrated below:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16
).to("cuda")

prompt = '''def remove_non_ascii(s: str) -> str:
    """ <FILL_ME>
    return result
'''

input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to("cuda")
output = model.generate(
    input_ids,
    max_new_tokens=200,
)
output = output[0].to("cpu")

filling = tokenizer.decode(output[input_ids.shape[1]:], skip_special_tokens=True)
print(prompt.replace("<FILL_ME>", filling))
```
```Python
def remove_non_ascii(s: str) -> str:
    """ Remove non-ASCII characters from a string.

    Args:
        s: The string to remove non-ASCII characters from.

    Returns:
        The string with non-ASCII characters removed.
    """
    result = ""
    for c in s:
        if ord(c) < 128:
            result += c
    return result
```
Under the hood, the tokenizer [automatically splits by `<FILL_ME>`](https://huggingface.co/docs/transformers/main/model_doc/code_llama#transformers.CodeLlamaTokenizer.fill_token) to create a formatted input string that follows [the original training pattern](https://github.com/facebookresearch/codellama/blob/cb51c14ec761370ba2e2bc351374a79265d0465e/llama/generation.py#L402). This is more robust than preparing the pattern yourself: it avoids pitfalls, such as token glueing, that are very hard to debug.
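If you want to see the formatted input for yourself, you can decode the token ids the tokenizer produces. The snippet below is a minimal sketch (the prompt is a made-up example, not from the post): the printed tokens should include the special infilling markers, such as `▁<PRE>`, `▁<SUF>`, and `▁<MID>`, wrapped around the two halves of the prompt.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

# Hypothetical prompt: everything before <FILL_ME> becomes the prefix,
# everything after it becomes the suffix.
prompt = "def add_one(x: int) -> int:\n    <FILL_ME>\n"

ids = tokenizer(prompt)["input_ids"]
# Inspect the raw tokens to see the infilling pattern the model was trained on.
print(tokenizer.convert_ids_to_tokens(ids))
```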