Commit 2ad84c6

pcuenca authored and ArthurZucker committed
[CodeLlama]: simplify infilling with <FILL_ME> (huggingface#1424)
* CodeLlama: simplify infilling with `<FILL_ME>`.

  Co-authored-by: Arthur <[email protected]>

* Use `AutoTokenizer`. It now works after [these PRs](https://huggingface.co/codellama/CodeLlama-7b-hf/discussions/11) have been merged.

---------

Co-authored-by: Arthur <[email protected]>
1 parent dbca4a1 · commit 2ad84c6

File tree

1 file changed: +21 -19 lines changed


codellama.md

Lines changed: 21 additions & 19 deletions
@@ -151,7 +151,7 @@ This is a specialized task particular to code models. The model is trained to ge
 
 This task is available in the **base** and **instruction** variants of the 7B and 13B models. It is _not_ available for any of the 34B models or the Python versions.
 
-To use this feature successfully, you need to pay close attention to the format used to train the model for this task, as it uses special separators to identify the different parts of the prompt. Let's see an example:
+To use this feature successfully, you need to pay close attention to the format used to train the model for this task, as it uses special separators to identify the different parts of the prompt. Fortunately, transformers' `CodeLlamaTokenizer` makes this very easy, as demonstrated below:
 
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -165,38 +165,40 @@ model = AutoModelForCausalLM.from_pretrained(
     torch_dtype=torch.float16
 ).to("cuda")
 
-prefix = 'def remove_non_ascii(s: str) -> str:\n    """ '
-suffix = "\n    return result\n"
-
-prompt = f"<PRE> {prefix} <SUF>{suffix} <MID>"
-inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
+prompt = '''def remove_non_ascii(s: str) -> str:
+    """ <FILL_ME>
+    return result
+'''
 
+input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to("cuda")
 output = model.generate(
-    inputs["input_ids"],
+    input_ids,
     max_new_tokens=200,
-    do_sample=False,
 )
 output = output[0].to("cpu")
-print(tokenizer.decode(output))
-```
 
+filling = tokenizer.decode(output[input_ids.shape[1]:], skip_special_tokens=True)
+print(prompt.replace("<FILL_ME>", filling))
 ```
-<s> <PRE> def remove_non_ascii(s: str) -> str:
-    """ <SUF>
-    return result
-<MID>
-Remove non-ASCII characters from a string.
 
-:param s: The string to remove non-ASCII characters from.
-:return: The string with non-ASCII characters removed.
+```Python
+def remove_non_ascii(s: str) -> str:
+    """ Remove non-ASCII characters from a string.
+
+    Args:
+        s: The string to remove non-ASCII characters from.
+
+    Returns:
+        The string with non-ASCII characters removed.
     """
     result = ""
     for c in s:
         if ord(c) < 128:
-            result += c <EOT></s>
+            result += c
+    return result
 ```
 
-In order to use the completion, you’ll need to process the output to cut the text between the `<MID>` and `<EOT>` tokens – that’s what goes between the prefix and suffix we supplied.
+Under the hood, the tokenizer [automatically splits by `<FILL_ME>`](https://huggingface.co/docs/transformers/main/model_doc/code_llama#transformers.CodeLlamaTokenizer.fill_token) to create a formatted input string that follows [the original training pattern](https://github.com/facebookresearch/codellama/blob/cb51c14ec761370ba2e2bc351374a79265d0465e/llama/generation.py#L402). This is more robust than preparing the pattern yourself: it avoids pitfalls, such as token glueing, that are very hard to debug.
 
 #### Conversational Instructions
 
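For context on that last paragraph, here is a minimal sketch (not part of the commit) comparing the two ways of preparing the infilling input, assuming the `codellama/CodeLlama-7b-hf` checkpoint. The manual pattern is the one this commit deletes; the whitespace around its separators is precisely what is easy to get wrong:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

prompt = 'def remove_non_ascii(s: str) -> str:\n    """ <FILL_ME>\n    return result\n'

# New path: the tokenizer splits on <FILL_ME> and assembles the
# infilling pattern the model was trained on.
auto_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]

# Old path (removed by this commit): build the pattern by hand.
# A stray or missing space around the separators "glues" tokens
# together and silently degrades generations.
prefix, suffix = prompt.split("<FILL_ME>")
manual = f"<PRE> {prefix} <SUF>{suffix} <MID>"
manual_ids = tokenizer(manual, return_tensors="pt")["input_ids"]

# Decode both to inspect how each input was actually assembled.
print(tokenizer.decode(auto_ids[0]))
print(tokenizer.decode(manual_ids[0]))
```

If the tokenizer behaves as the linked `fill_token` docs describe, the two decoded strings should match; the point of `<FILL_ME>` is that the prompt can no longer drift out of sync with the training format.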