
Commit 29a67d2

Merge branch 'main' into plr_fixes
2 parents: 3c4749b + 8ef5cd6

16 files changed: +471 additions, -67 deletions
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+name: CI collated reports
+
+on:
+  workflow_call:
+    inputs:
+      job:
+        required: true
+        type: string
+      report_repo_id:
+        required: true
+        type: string
+      machine_type:
+        required: true
+        type: string
+      gpu_name:
+        description: Name of the GPU used for the job. Its enough that the value contains the name of the GPU, e.g. "noise-h100-more-noise". Case insensitive.
+        required: true
+        type: string
+
+jobs:
+  collated_reports:
+    name: Collated reports
+    runs-on: ubuntu-22.04
+    if: always()
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/download-artifact@v4
+
+      - name: Collated reports
+        shell: bash
+        env:
+          ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
+          CI_SHA: ${{ github.sha }}
+          TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN: ${{ secrets.TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN }}
+        run: |
+          pip install huggingface_hub
+          python3 utils/collated_reports.py \
+            --path /transformers/reports/ \
+            --machine-type ${{ inputs.machine_type }} \
+            --commit-hash ${{ env.CI_SHA }} \
+            --job ${{ inputs.job }} \
+            --report-repo-id ${{ inputs.report_repo_id }} \
+            --gpu-name ${{ inputs.gpu_name }}
+
+      - name: Upload collated reports
+        uses: actions/upload-artifact@v4
+        with:
+          name: collated_reports_${{ env.CI_SHA }}.json
+          path: collated_reports_${{ env.CI_SHA }}.json
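The `run` step above installs `huggingface_hub` and invokes `utils/collated_reports.py` with the workflow inputs; that script is not part of this diff. A hedged sketch of the command-line surface the workflow assumes (an illustrative argparse skeleton, not the real `utils/collated_reports.py`):

```python
# Hypothetical sketch of the CLI that the workflow above invokes. Only the flag
# names come from the workflow's `run` step; the report layout and output format
# are assumptions for illustration.
import argparse
import json
from pathlib import Path


def main() -> None:
    parser = argparse.ArgumentParser(description="Collate per-test reports into one JSON file.")
    parser.add_argument("--path", required=True, help="Directory containing individual report files.")
    parser.add_argument("--machine-type", required=True)
    parser.add_argument("--commit-hash", required=True)
    parser.add_argument("--job", required=True)
    parser.add_argument("--report-repo-id", required=True)
    parser.add_argument("--gpu-name", required=True)
    args = parser.parse_args()

    # Gather whatever report files exist under --path (assumed layout).
    reports = [p.read_text() for p in Path(args.path).glob("**/*.txt")]
    collated = {
        "machine_type": args.machine_type,
        "commit_hash": args.commit_hash,
        "job": args.job,
        "gpu_name": args.gpu_name,
        "reports": reports,
    }

    # The workflow later uploads collated_reports_<sha>.json as an artifact.
    Path(f"collated_reports_{args.commit_hash}.json").write_text(json.dumps(collated, indent=2))


if __name__ == "__main__":
    main()
```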

SECURITY.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ Models uploaded on the Hugging Face Hub come in different formats. We heavily re
 models in the [`safetensors`](https://github.com/huggingface/safetensors) format (which is the default prioritized
 by the transformers library), as developed specifically to prevent arbitrary code execution on your system.
 
-To avoid loading models from unsafe formats(e.g. [pickle](https://docs.python.org/3/library/pickle.html), you should use the `use_safetensors` parameter. If doing so, in the event that no .safetensors file is present, transformers will error when loading the model.
+To avoid loading models from unsafe formats (e.g. [pickle](https://docs.python.org/3/library/pickle.html), you should use the `use_safetensors` parameter. If doing so, in the event that no .safetensors file is present, transformers will error when loading the model.
 
 ### Remote code
 
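For reference, the `use_safetensors` argument described in the reworded sentence is passed to `from_pretrained`; a minimal usage sketch (the repository id below is a placeholder):

```python
from transformers import AutoModelForCausalLM

# With use_safetensors=True, loading raises an error if the checkpoint only
# ships pickle-based weights instead of silently falling back to them.
model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-model",  # placeholder repository id
    use_safetensors=True,
)
```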

docs/source/en/generation_strategies.md

Lines changed: 25 additions & 0 deletions
@@ -504,6 +504,31 @@ Recommended practices:
 - Add self-contained examples to enable quick experimentation.
 - Describe soft-requirements such as if the method only works well with a certain family of models.
 
+### Reusing `generate`’s input preparation
+
+If you're adding a new decoding loop, you might want to preserve the input preparation present in `generate` (batch expansion, attention masks, logits processors, stopping criteria, etc.). You can also pass a **callable** to `custom_generate` to reuse [`~GenerationMixin.generate`]’s full preparation pipeline while overriding only the decoding loop.
+
+```py
+def custom_loop(model, input_ids, attention_mask, logits_processor, stopping_criteria, generation_config, **model_kwargs):
+    next_tokens = input_ids
+    while input_ids.shape[1] < stopping_criteria[0].max_length:
+        logits = model(next_tokens, attention_mask=attention_mask, **model_kwargs).logits
+        next_token_logits = logits_processor(input_ids, logits[:, -1, :])
+        next_tokens = torch.argmax(next_token_logits, dim=-1)[:, None]
+        input_ids = torch.cat((input_ids, next_tokens), dim=-1)
+        attention_mask = torch.cat((attention_mask, torch.ones_like(next_tokens)), dim=-1)
+    return input_ids
+
+output = model.generate(
+    **inputs,
+    custom_generate=custom_loop,
+    max_new_tokens=10,
+)
+```
+
+> [!TIP]
+> If you publish a `custom_generate` repository, your `generate` implementation can itself define a callable and pass it to `model.generate()`. This lets you customize the decoding loop while still benefiting from Transformers’ built-in input preparation logic.
+
 ### Finding custom generation methods
 
 You can find all custom generation methods by [searching for their custom tag.](https://huggingface.co/models?other=custom_generate), `custom_generate`. In addition to the tag, we curate two collections of `custom_generate` methods:
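The new TIP says a published `custom_generate` repository can itself hand a callable back to `model.generate()`. A hedged sketch of that pattern, as an illustrative `custom_generate/generate.py` rather than an official one:

```python
# Illustrative sketch of a Hub repository's custom_generate/generate.py that
# follows the TIP above: define the decoding loop as a callable and re-enter
# model.generate() so Transformers still prepares attention masks, logits
# processors and stopping criteria. Not an official implementation.
import torch


def _greedy_loop(model, input_ids, attention_mask, logits_processor, stopping_criteria, generation_config, **model_kwargs):
    # **model_kwargs also absorbs extras such as synced_gpus/streamer, which this
    # simple loop ignores; it recomputes the full forward pass at every step.
    while input_ids.shape[1] < stopping_criteria[0].max_length:
        logits = model(input_ids, attention_mask=attention_mask).logits
        next_token_logits = logits_processor(input_ids, logits[:, -1, :])
        next_tokens = torch.argmax(next_token_logits, dim=-1)[:, None]
        input_ids = torch.cat((input_ids, next_tokens), dim=-1)
        attention_mask = torch.cat((attention_mask, torch.ones_like(next_tokens)), dim=-1)
    return input_ids


def generate(model, *args, **kwargs):
    # Entry point looked up when users pass custom_generate="<repo>": delegate
    # back to the standard generate() with the callable form of custom_generate.
    return model.generate(*args, custom_generate=_greedy_loop, **kwargs)
```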

docs/source/ko/_toctree.yml

Lines changed: 4 additions & 0 deletions
@@ -91,6 +91,10 @@
       - local: in_translation
         title: (번역중) Tools and RAG
     title: 모델을 사용해 대화하기
+- sections:
+  - local: tiny_agents
+    title: Tiny-Agents CLI 및 MCP 도구
+  title: 서빙(Serving)
 - sections:
   - local: in_translation
     title: (번역중) torch.compile

docs/source/ko/tiny_agents.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+### `tiny-agents` CLI 및 MCP 도구[[tiny-agents-cli-and-mcp-tools]]
+
+MCP 도구의 사용을 보여주기 위해 [`tiny-agents`](https://huggingface.co/blog/python-tiny-agents) CLI와 `transformers serve` 서버를 연동하는 방법을 살펴보겠습니다.
+
+> [!TIP]
+> 이 예시처럼 많은 Hugging Face Spaces를 MCP 서버로 활용할 수 있습니다. 호환 가능한 모든 Spaces는 [여기](https://huggingface.co/spaces?filter=mcp-server)에서 찾을 수 있습니다.
+
+MCP 도구를 사용하려면 먼저 모델에 사용 가능한 도구를 알려야 합니다. 예를 들어, [이미지 생성 MCP 서버](https://evalstate-flux1-schnell.hf.space/)를 참조하는 `tiny-agents` 설정 파일을 살펴보겠습니다.
+
+```json
+{
+  "model": "Menlo/Jan-nano",
+  "endpointUrl": "http://localhost:8000",
+  "servers": [
+    {
+      "type": "sse",
+      "url": "https://evalstate-flux1-schnell.hf.space/gradio_api/mcp/sse"
+    }
+  ]
+}
+```
+
+그런 다음 아래 명령어로 `tiny-agents` 채팅 인터페이스를 실행할 수 있습니다.
+
+```bash
+tiny-agents run path/to/your/config.json
+```
+
+백그라운드에서 `transformers serve`가 실행 중이라면, 이제 로컬 모델에서 MCP 도구를 사용할 수 있습니다. 다음은 `tiny-agents`와의 채팅 세션 예시입니다.
+
+```bash
+Agent loaded with 1 tools:
+• flux1_schnell_infer
+» Generate an image of a cat on the moon
+<Tool req_0_tool_call>flux1_schnell_infer {"prompt": "a cat on the moon", "seed": 42, "randomize_seed": true, "width": 1024, "height": 1024, "num_inference_steps": 4}
+
+Tool req_0_tool_call
+[Binary Content: Image image/webp, 57732 bytes]
+The task is complete and the content accessible to the User
+Image URL: https://evalstate-flux1-schnell.hf.space/gradio_api/file=/tmp/gradio/3dbddc0e53b5a865ed56a4e3dbdd30f3f61cf3b8aabf1b456f43e5241bd968b8/image.webp
+380576952
+
+Flux 1 Schnell 이미지 생성기를 사용하여 달 위의 고양이 이미지를 생성했습니다. 이미지는 1024x1024 픽셀이며 4번의 추론 단계를 거쳐 생성되었습니다. 변경 사항이 필요하거나 추가 도움이 필요하시면 알려주세요!
+```
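The configuration above points `tiny-agents` at a locally running `transformers serve` endpoint on `http://localhost:8000`. As a hedged illustration, such an endpoint can also be queried directly with an OpenAI-compatible client; the `/v1` base path, the dummy API key, and the served model name are assumptions for this sketch, not details stated by the diff:

```python
# Minimal sketch: query a locally running `transformers serve` endpoint with an
# OpenAI-compatible client. The /v1 base path, the placeholder API key, and the
# model name are assumptions for illustration; adjust them to your local setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Menlo/Jan-nano",  # assumed to be served locally, as in the config above
    messages=[{"role": "user", "content": "Generate an image of a cat on the moon"}],
)
print(response.choices[0].message.content)
```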

src/transformers/cache_utils.py

Lines changed: 7 additions & 6 deletions
@@ -16,9 +16,6 @@
 )
 
 
-if _is_quanto_greater_than_0_2_5 := is_quanto_greater("0.2.5", accept_dev=True):
-    from optimum.quanto import MaxOptimizer, qint2, qint4, quantize_weight
-
 if is_hqq_available():
     from hqq.core.quantize import Quantizer as HQQQuantizer
 
@@ -558,7 +555,7 @@ def __init__(
         q_group_size: int = 64,
         residual_length: int = 128,
     ):
-        super().__init__(self)
+        super().__init__()
         self.nbits = nbits
         self.axis_key = axis_key
         self.axis_value = axis_value
@@ -635,10 +632,12 @@ def __init__(
             residual_length=residual_length,
         )
 
-        if not _is_quanto_greater_than_0_2_5:
+        # We need to import quanto here to avoid circular imports due to optimum/quanto/models/transformers_models.py
+        if is_quanto_greater("0.2.5", accept_dev=True):
+            from optimum.quanto import MaxOptimizer, qint2, qint4
+        else:
             raise ImportError(
                 "You need optimum-quanto package version to be greater or equal than 0.2.5 to use `QuantoQuantizedCache`. "
-                "Detected version {optimum_quanto_version}."
             )
 
         if self.nbits not in [2, 4]:
@@ -656,6 +655,8 @@
         self.optimizer = MaxOptimizer()  # hardcode as it's the only one for per-channel quantization
 
     def _quantize(self, tensor, axis):
+        from optimum.quanto import quantize_weight
+
         scale, zeropoint = self.optimizer(tensor, self.qtype, axis, self.q_group_size)
         qtensor = quantize_weight(tensor, self.qtype, axis, scale, zeropoint, self.q_group_size)
         return qtensor
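The change above defers the `optimum.quanto` imports into `__init__` and `_quantize`, per the new comment, to avoid a circular import through `optimum/quanto/models/transformers_models.py`. A generic, hedged sketch of that deferred-import-plus-version-gate pattern (the module name `optional_backend` and its `quantize()` function are illustrative placeholders, not the optimum-quanto API):

```python
# Generic sketch of the pattern used above: defer an optional dependency's import
# into the function that needs it, and gate on a minimum version. The module name
# "optional_backend" and its quantize() function are illustrative placeholders.
from importlib import import_module
from importlib.metadata import version

from packaging.version import Version


def quantize_tensor(tensor, min_backend_version: str = "0.2.5"):
    # Importing here (not at module import time) means this module can be imported
    # even when the backend is absent, and cannot form an import cycle with it.
    try:
        backend = import_module("optional_backend")
    except ImportError as exc:
        raise ImportError("optional_backend is required for quantized caches.") from exc

    if Version(version("optional_backend")) < Version(min_backend_version):
        raise ImportError(f"optional_backend >= {min_backend_version} is required.")

    return backend.quantize(tensor)
```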

src/transformers/commands/serving.py

Lines changed: 14 additions & 5 deletions
@@ -830,13 +830,22 @@ def get_processor_inputs_from_inbound_messages(messages, modality: Modality):
         parsed_message = {"role": message["role"], "content": []}
 
         if modality == Modality.LLM:
-            # If we're working with LLMs, then "content" is a single string.
-            content = message["content"] if isinstance(message["content"], str) else message["content"]["text"]
-            parsed_message["content"] = content
+            # Input: `content` is a string or a list of dictionaries with a "text" key.
+            # Output: `content` is a string.
+            if isinstance(message["content"], str):
+                parsed_content = message["content"]
+            elif isinstance(message["content"], list):
+                parsed_content = []
+                for content in message["content"]:
+                    if content["type"] == "text":
+                        parsed_content.append(content["text"])
+                parsed_content = " ".join(parsed_content)
+            parsed_message["content"] = parsed_content
 
         elif modality == Modality.VLM:
-            # If we're working with VLMs, then "content" is a dictionary, containing a "type" key indicating
-            # which other key will be present and the type of the value of said key.
+            # Input: `content` is a string or a list of dictionaries with a "type" key (possible types: "text",
+            # "image_url").
+            # Output: `content` is a list of dictionaries with a "type" key
             if isinstance(message["content"], str):
                 parsed_message["content"].append({"type": "text", "text": message["content"]})
             else:
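The LLM branch now accepts either a plain string or a list of typed chunks and flattens the text chunks into a single string. A standalone, hedged sketch that mirrors the diff's normalization logic (not the actual `serving.py` helper):

```python
# Standalone sketch mirroring the LLM-content normalization added above: accept a
# plain string or a list of {"type": ..., ...} chunks and return a single string
# built from the text chunks. Illustrative only, not the serving.py code itself.
from typing import Union


def normalize_llm_content(content: Union[str, list]) -> str:
    if isinstance(content, str):
        return content
    # Keep only text chunks and join them with spaces, as in the diff above.
    return " ".join(chunk["text"] for chunk in content if chunk.get("type") == "text")


print(normalize_llm_content("hello"))  # -> hello
print(normalize_llm_content([
    {"type": "text", "text": "describe this image"},
    {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
    {"type": "text", "text": "briefly"},
]))  # -> describe this image briefly
```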

src/transformers/generation/utils.py

Lines changed: 34 additions & 41 deletions
@@ -2165,7 +2165,7 @@ def generate(
         negative_prompt_ids: Optional[torch.Tensor] = None,
         negative_prompt_attention_mask: Optional[torch.Tensor] = None,
         use_model_defaults: Optional[bool] = None,
-        custom_generate: Optional[str] = None,
+        custom_generate: Optional[Union[str, Callable]] = None,
         **kwargs,
     ) -> Union[GenerateOutput, torch.LongTensor]:
         r"""
@@ -2235,11 +2235,15 @@ def generate(
                 generation configuration (`model.generation_config`), as opposed to the global defaults
                 (`GenerationConfig()`). If unset, models saved starting from `v4.50` will consider this flag to be
                 `True`.
-            custom_generate (`str`, *optional*):
-                A string containing the name of a huggingface.co repository. If provided, the custom `generate`
-                function defined in that reposity's `custom_generate/generate.py` file will be executed instead of the
-                standard `generate` method. Note that the logic is for generation is entirely defined in that
-                repository, and the return type may be different from the standard `generate` method.
+            custom_generate (`str` or `Callable`, *optional*):
+                One of the following:
+                - `str` (Hugging Face Hub repository name): runs the custom `generate` function defined at
+                  `custom_generate/generate.py` in that repository instead of the standard `generate` method. The
+                  repository fully replaces the generation logic, and the return type may differ.
+                - `str` (local repository path): same as above but from a local path, `trust_remote_code` not required.
+                - `Callable`: `generate` will perform the usual input preparation steps, then call the provided callable to
+                  run the decoding loop.
+                For more information, see [the docs](../../generation_strategies#custom-generation-methods).
             kwargs (`dict[str, Any]`, *optional*):
                 Ad hoc parametrization of `generation_config` and/or additional model-specific kwargs that will be
                 forwarded to the `forward` function of the model. If the model is an encoder-decoder model, encoder
@@ -2263,7 +2267,7 @@ def generate(
         """
         # 0. If requested, load an arbitrary generation recipe from the Hub and run it instead
         trust_remote_code = kwargs.pop("trust_remote_code", None)
-        if custom_generate is not None:
+        if custom_generate is not None and isinstance(custom_generate, str):
            # Get all `generate` arguments in a single variable. Custom functions are responsible for handling them:
            # they receive the same inputs as `generate`, with `model` instead of `self` and excluding the arguments to
            # trigger the custom generation. They can access to methods from `GenerationMixin` through `model`.
@@ -2360,6 +2364,14 @@ def generate(
         else:
             input_ids = inputs_tensor if model_input_name == "input_ids" else model_kwargs.pop("input_ids")
 
+        # Expand inputs depending on the generation mode
+        input_ids, model_kwargs = self._expand_inputs_for_generation(
+            input_ids=input_ids,
+            expand_size=max(generation_config.num_beams, generation_config.num_return_sequences),
+            is_encoder_decoder=self.config.is_encoder_decoder,
+            **model_kwargs,
+        )
+
         if generation_config.token_healing:
             input_ids = self.heal_tokens(input_ids, tokenizer)
 
@@ -2441,7 +2453,18 @@ def generate(
         model_kwargs["use_cache"] = generation_config.use_cache
 
         # 10. go into different generation modes
-        if generation_mode == GenerationMode.ASSISTED_GENERATION:
+        if isinstance(custom_generate, Callable):
+            result = custom_generate(
+                self,
+                input_ids,
+                logits_processor=prepared_logits_processor,
+                stopping_criteria=prepared_stopping_criteria,
+                generation_config=generation_config,
+                synced_gpus=synced_gpus,
+                streamer=streamer,
+                **model_kwargs,
+            )
+        elif generation_mode == GenerationMode.ASSISTED_GENERATION:
             if generation_config.num_return_sequences > 1:
                 raise ValueError(
                     "num_return_sequences has to be 1 when doing assisted generate, "
@@ -2530,15 +2553,7 @@ def generate(
             )
 
         elif generation_mode in (GenerationMode.SAMPLE, GenerationMode.GREEDY_SEARCH):
-            # 11. expand input_ids with `num_return_sequences` additional sequences per batch
-            input_ids, model_kwargs = self._expand_inputs_for_generation(
-                input_ids=input_ids,
-                expand_size=generation_config.num_return_sequences,
-                is_encoder_decoder=self.config.is_encoder_decoder,
-                **model_kwargs,
-            )
-
-            # 12. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
+            # 11. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
             result = self._sample(
                 input_ids,
                 logits_processor=prepared_logits_processor,
@@ -2550,14 +2565,7 @@ def generate(
             )
 
         elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
-            # 11. interleave input_ids with `num_beams` additional sequences per batch
-            input_ids, model_kwargs = self._expand_inputs_for_generation(
-                input_ids=input_ids,
-                expand_size=generation_config.num_beams,
-                is_encoder_decoder=self.config.is_encoder_decoder,
-                **model_kwargs,
-            )
-            # 12. run beam sample
+            # 11. run beam sample
             result = self._beam_search(
                 input_ids,
                 logits_processor=prepared_logits_processor,
@@ -2583,14 +2591,6 @@ def generate(
                 num_beam_groups=generation_config.num_beam_groups,
                 max_length=generation_config.max_length,
             )
-            # 12. interleave input_ids with `num_beams` additional sequences per batch
-            input_ids, model_kwargs = self._expand_inputs_for_generation(
-                input_ids=input_ids,
-                expand_size=generation_config.num_beams,
-                is_encoder_decoder=self.config.is_encoder_decoder,
-                **model_kwargs,
-            )
-            # 13. run beam search
             result = self._group_beam_search(
                 input_ids,
                 beam_scorer,
@@ -2657,14 +2657,7 @@ def typeerror():
                 num_beam_hyps_to_keep=generation_config.num_return_sequences,
                 max_length=generation_config.max_length,
             )
-            # 12. interleave input_ids with `num_beams` additional sequences per batch
-            input_ids, model_kwargs = self._expand_inputs_for_generation(
-                input_ids=input_ids,
-                expand_size=generation_config.num_beams,
-                is_encoder_decoder=self.config.is_encoder_decoder,
-                **model_kwargs,
-            )
-            # 13. run beam search
+            # 12. run beam search
             result = self._constrained_beam_search(
                 input_ids,
                 constrained_beam_scorer=constrained_beam_scorer,
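The refactor above replaces the per-mode `_expand_inputs_for_generation` calls with a single early expansion sized by `max(num_beams, num_return_sequences)`. A hedged toy illustration of what that expansion amounts to, using plain `repeat_interleave` rather than the actual Transformers helper:

```python
# Toy illustration of the single up-front batch expansion: every prompt row is
# repeated expand_size times, where expand_size is num_beams for beam methods and
# num_return_sequences for sampling. This is not the actual Transformers helper.
import torch

input_ids = torch.tensor([[1, 2, 3], [4, 5, 6]])  # batch of 2 prompts
attention_mask = torch.ones_like(input_ids)

num_beams, num_return_sequences = 4, 1
expand_size = max(num_beams, num_return_sequences)

expanded_ids = input_ids.repeat_interleave(expand_size, dim=0)
expanded_mask = attention_mask.repeat_interleave(expand_size, dim=0)

print(expanded_ids.shape)  # torch.Size([8, 3]): 2 prompts * 4 beams
```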

src/transformers/models/idefics2/modeling_idefics2.py

Lines changed: 7 additions & 4 deletions
@@ -141,8 +141,12 @@ def forward(self, pixel_values: torch.FloatTensor, patch_attention_mask: torch.B
         embeddings = patch_embeds.flatten(2).transpose(1, 2)
 
         max_nb_patches_h, max_nb_patches_w = max_im_h // self.patch_size, max_im_w // self.patch_size
-        boundaries = torch.arange(1 / self.num_patches_per_side, 1.0, 1 / self.num_patches_per_side)
-        position_ids = torch.full(size=(batch_size, max_nb_patches_h * max_nb_patches_w), fill_value=0)
+        boundaries = torch.arange(
+            1 / self.num_patches_per_side, 1.0, 1 / self.num_patches_per_side, device=pixel_values.device
+        )
+        position_ids = torch.full(
+            size=(batch_size, max_nb_patches_h * max_nb_patches_w), fill_value=0, device=pixel_values.device
+        )
 
         for batch_idx, p_attn_mask in enumerate(patch_attention_mask):
             nb_patches_h = p_attn_mask[:, 0].sum()
@@ -158,9 +162,8 @@ def forward(self, pixel_values: torch.FloatTensor, patch_attention_mask: torch.B
             bucket_coords_w = torch.bucketize(fractional_coords_w, boundaries, right=True)
 
             pos_ids = (bucket_coords_h[:, None] * self.num_patches_per_side + bucket_coords_w).flatten()
-            position_ids[batch_idx][p_attn_mask.view(-1).cpu()] = pos_ids
+            position_ids[batch_idx][p_attn_mask.view(-1)] = pos_ids
 
-        position_ids = position_ids.to(self.position_embedding.weight.device)
         embeddings = embeddings + self.position_embedding(position_ids)
         return embeddings
 
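The fix above allocates `boundaries` and `position_ids` directly on `pixel_values.device` and drops both the `.cpu()` indexing and the later `.to(...)` hop, so the position-id computation stays on one device. A small hedged sketch of that device-aware pattern with toy shapes (not the model code itself):

```python
# Toy sketch of the device-aware pattern from the fix: allocate tensors on the
# inputs' device and index with an on-device boolean mask, rather than detouring
# through .cpu() and copying back afterwards. Shapes and values are placeholders.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

pixel_like = torch.randn(2, 16, device=device)   # stands in for pixel_values
mask = torch.rand(2, 16, device=device) > 0.5    # stands in for patch_attention_mask

position_ids = torch.full((2, 16), 0, device=pixel_like.device)
for batch_idx, row_mask in enumerate(mask):
    values = torch.arange(int(row_mask.sum()), device=pixel_like.device)
    position_ids[batch_idx][row_mask] = values   # boolean indexing stays on-device

print(position_ids.device)
```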
