Commit fd5c0d5

Merge branch 'master' into patch-3
2 parents: 6c2d700 + 166b731

32 files changed: +1456 -270 lines

.github/workflows/ci.yml

Lines changed: 1 addition & 7 deletions

@@ -170,7 +170,7 @@ jobs:
           CIBW_MANYLINUX_X86_64_IMAGE: manylinux2014
           CIBW_MANYLINUX_AARCH64_IMAGE: manylinux2014
           CIBW_ARCHS: ${{ matrix.arch }}
-          CIBW_SKIP: pp* *-musllinux_*
+          CIBW_SKIP: "*-musllinux_*"

       - name: Upload Python wheels
         uses: actions/upload-artifact@v4
@@ -195,10 +195,6 @@ jobs:
             artifact_pattern: python-wheels-Linux-aarch64
             wheel_pattern: "*cp310*manylinux*_aarch64.whl"

-          #- os: windows-2022
-          #  artifact_pattern: python-wheels-Windows-auto64
-          #  wheel_pattern: "*cp310*win*.whl"
-
           - os: macos-15
             artifact_pattern: python-wheels-macOS-arm64
             wheel_pattern: "*cp310*macosx*arm64.whl"
@@ -226,8 +222,6 @@ jobs:
       - name: Install wheel
         shell: bash
         run: |
-          ls -l
-          find .
           pip install ${{ matrix.wheel_pattern }}

       - name: Test Python wheel

CHANGELOG.md

Lines changed: 14 additions & 1 deletion

@@ -4,7 +4,20 @@

 ### Fixes and improvements

-## [v4.6.1](https://github.com/OpenNMT/CTranslate2/releases/tag/v4.6.1) (2025-10-07)
+## [v4.6.2](https://github.com/OpenNMT/CTranslate2/releases/tag/v4.6.2) (2025-12-05)
+
+### New features
+
+* Qwen 3 support (#1943) by [@jordimas](https://github.com/jordimas)
+* Gemma 3 text support (#1936) by [@jordimas](https://github.com/jordimas)
+
+### Fixes and improvements
+
+* Fixed pkg_resources Deprecated Warning (#1911) by [@thawancomt](https://github.com/thawancomt)
+* Disable INT8 for sm120 - Blackwell GPUs (#1937) by [@Purfview](https://github.com/Purfview)
+* FIX: package libctranslate2.so in wheel to avoid build fail (#1920) by [@yzewei](https://github.com/yzewei)
+
+## [v4.6.1](https://github.com/OpenNMT/CTranslate2/releases/tag/v4.6.1) (2025-11-07)

 ### New features

CMakeLists.txt

Lines changed: 4 additions & 2 deletions

@@ -170,6 +170,8 @@ set(SOURCES
   src/ops/mean.cc
   src/ops/mean_cpu.cc
   src/ops/median_filter.cc
+  src/ops/median_filter_cpu.cc
+  src/ops/median_filter_gpu.cu
   src/ops/min_max.cc
   src/ops/mul.cc
   src/ops/multinomial.cc
@@ -545,8 +547,9 @@ if (WITH_CUDA)
     list(APPEND PRIVATE_INCLUDE_DIRECTORIES ${CUDNN_INCLUDE_DIR})
     list(APPEND LIBRARIES ${CUDNN_LIBRARIES})
     add_definitions(-DCT2_WITH_CUDNN)
+    list(APPEND SOURCES src/ops/conv1d_cudnn_gpu.cu)
   else()
-    message(WARNING "cuDNN library is not enabled: convolution layers will not be supported on GPU")
+    list(APPEND SOURCES src/ops/conv1d_gpu.cu)
   endif()

   if(CUDA_DYNAMIC_LOADING)
@@ -636,7 +639,6 @@ if (WITH_CUDA)
     src/ops/alibi_add_gpu.cu
     src/ops/bias_add_gpu.cu
     src/ops/concat_split_slide_gpu.cu
-    src/ops/conv1d_gpu.cu
     src/ops/dequantize_gpu.cu
     src/ops/flash_attention_gpu.cu
     src/ops/gather_gpu.cu

CONTRIBUTING.md

Lines changed: 16 additions & 3 deletions

@@ -23,6 +23,19 @@ Do you think a feature is missing or would be a great addition to the project? P
 * look for GitHub issues marked with the *help wanted* label: these are developments that we find particularly suited for community contributions.
 * If you are planning to make a large change to the existing code, consider asking first on [the forum](https://forum.opennmt.net/) to confirm that it is welcome.

+## Contribution rules
+
+CTranslate2 is a low-level, performance-critical codebase. A single misplaced pointer or inefficient memory allocation (which LLMs often get wrong) can take hours to debug.
+
+To maintain code integrity and manage maintainer workload, we apply the following policy:
+
+* Use of AI tools for brainstorming or minor assistance is acceptable, but contributors must explicitly disclose how AI was used and remain fully responsible for correctness, performance, and design. Submissions that appear generated without deep understanding will be declined: verifying AI output for correctness and performance is more time-consuming than writing the code manually.
+
+* Mandatory deep understanding: contributors must fully understand their code and be prepared to justify the purpose of every part of their change.
+
+* Please contribute within your area of expertise. If you are not familiar with the core codebase, consider contributing to documentation, examples, or Hugging Face integrations.
+
 ### Building the sources

 See [Install from sources](https://opennmt.net/CTranslate2/installation.html#install-from-sources).
@@ -85,7 +98,7 @@ The list is ordered on 5. from the largest to smallest time.

 #### `StorageView` class

-CTranslate2 uses [row-major](https://en.wikipedia.org/wiki/Row-_and_column-major_order) storages, usually encapsulated in the `StorageView` class. This class acts like a tensor representation but without the mathematical semantics. It is convenience wrapper to view a buffer of data in a particular shape, and provides methods to resize, reshape, and copy data. The underlying storage has a type (e.g. `float`) and a location (e.g. GPU #1) which are both resolved at runtime.
+CTranslate2 uses [row-major](https://en.wikipedia.org/wiki/Row-_and_column-major_order) storages, usually encapsulated in the `StorageView` class. This class acts like a tensor representation but without the mathematical semantics. It is a convenience wrapper to view a buffer of data in a particular shape, and provides methods to resize, reshape, and copy data. The underlying storage has a type (e.g. `float`) and a location (e.g. GPU #1) which are both resolved at runtime.

 To maximize performance, the implementation avoid new allocations when possible:

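For reference, a thin binding of `StorageView` is also exposed in the Python API. A minimal sketch of wrapping a NumPy buffer (attribute names as in recent `ctranslate2` wheels; worth verifying against the installed version):

```python
# Minimal sketch: viewing an existing NumPy buffer through the Python
# binding of StorageView. from_array() wraps the buffer without copying;
# the dtype and device are resolved at runtime, as described above for
# the C++ class.
import numpy as np
import ctranslate2

data = np.arange(12, dtype=np.float32).reshape(3, 4)  # row-major buffer
view = ctranslate2.StorageView.from_array(data)

print(view.shape)   # e.g. [3, 4]
print(view.dtype)   # e.g. float32
print(view.device)  # e.g. "cpu"
```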
@@ -144,7 +157,7 @@ To limit the size of the packages pushed to PyPI, some libraries are not include

 One of the benefits of this dynamic loading is that multiple versions of cuBLAS and cuDNN are supported by the same binary. In particular, users can install any CUDA 12.x version as long as it provides `libcublas.so.12`.

-The Python library only support CUDA 12.x. C++ source code is always compatible with CUDA 11, possible to use CUDA 11 libraries during compilation to create CUDA 11.x support wheel.
+The Python library only supports CUDA 12.x. The C++ source code remains compatible with CUDA 11, so a CUDA 11.x wheel can be built by compiling against CUDA 11 libraries.

 ### Updating other dependencies

@@ -161,7 +174,7 @@ If a dependency needs an update, it is particularly important that it is updated

 ### Managing PyPI project size limit

-Projects on PyPI have a size limit. The default limit is 10GB and [we already requested](https://github.com/pypi/support/issues/1480) an increase to 20GB in the past. Because increase requests can take several months to be accepted, we now try to work with this 20GB limit.
+Projects on PyPI have a size limit. The default limit is 10GB; the CTranslate2 project currently [has a 50GB storage limit](https://github.com/pypi/support/issues/8119).

 So older releases need to be regularly deleted on PyPI to make room for new releases. **However, make sure to keep the latest release of each major version.**

README.md

Lines changed: 10 additions & 0 deletions

@@ -119,6 +119,16 @@ Executed with 4 threads on a [*c5.2xlarge*](https://aws.amazon.com/ec2/instance-

 Executed with CUDA 11 on a [*g5.xlarge*](https://aws.amazon.com/ec2/instance-types/g5/) Amazon EC2 instance equipped with a NVIDIA A10G GPU (driver version: 510.47.03).

+## Contributing
+
+CTranslate2 is a community-driven project. We welcome contributions of all kinds:
+* **New Model Support:** Help us implement more Transformer architectures.
+* **Performance:** Propose optimizations for CPU or GPU kernels.
+* **Bug Reports:** Open an issue if you find something not working as expected.
+* **Documentation:** Improve our guides or add new examples.
+
+Check out our [Contributing Guide](CONTRIBUTING.md) to learn how to set up your development environment.
+
 ## Additional resources

 * [Documentation](https://opennmt.net/CTranslate2)

docs/guides/transformers.md

Lines changed: 80 additions & 1 deletion

@@ -8,6 +8,8 @@ CTranslate2 supports selected models from Hugging Face's [Transformers](https://
 * CodeGen
 * DistilBERT
 * Falcon
+* Gemma 2
+* Gemma 3 (text only)
 * Llama
 * M2M100
 * MarianMT
@@ -20,6 +22,8 @@ CTranslate2 supports selected models from Hugging Face's [Transformers](https://
 * GPT-NeoX
 * OPT
 * Pegasus
+* Qwen 2.5
+* Qwen 3
 * T5
 * Whisper
 * XLM-RoBERTa
@@ -80,7 +84,7 @@ print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target), skip_special_tok

 ## BERT

-[BERT](https://huggingface.co/docs/transformers/model_doc/bert) is pretrained model on English language using a masked language modeling objective.
+[BERT](https://huggingface.co/docs/transformers/model_doc/bert) is a model pretrained on English text using a masked language modeling objective.

 CTranslate2 only implements the `BertModel` class from Transformers which includes the Transformer encoder and the pooling layer. Task-specific layers should be run with PyTorch as shown in the example below.

@@ -183,6 +187,43 @@
 print(output)
 ```

+## Gemma 3 (text only)
+
+[Gemma 3](https://ai.google.dev/gemma/docs/core) is Google's latest family of lightweight, open-weight AI models, built on the same technology as Gemini.
+
+Gemma models come in two flavors: instruction-tuned (it) models and base models.
+
+Instruction-tuned models expect a specific [prompt template format](https://ai.google.dev/gemma/docs/core/prompt-structure) which you should use.
+
+When converting an instruction-tuned model, CTranslate2 sets `<end_of_turn>` as the default end-of-sequence token.
+
+To convert a model:
+
+```bash
+ct2-transformers-converter --model google/gemma-3-1b-it --output_dir gemma-3-1b-it
+```
+
+Gemma 3 usage sample:
+
+```python
+from transformers import AutoTokenizer
+import ctranslate2
+
+tok = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")
+gen = ctranslate2.Generator("gemma-3-1b-it")
+
+prompt = "<start_of_turn>user\nGenerate a 200 word text talking about George Orwell.<end_of_turn>\n<start_of_turn>model\n"
+tokens = tok.convert_ids_to_tokens(tok.encode(prompt))
+
+res = gen.generate_batch([tokens], max_length=2048, sampling_temperature=0.1, include_prompt_in_result=False)
+print(tok.convert_tokens_to_string(res[0].sequences[0]))
+```
+
@@ -446,6 +487,44 @@
 print(output)
 ```

+## Qwen 3
+
+[Qwen 3](https://github.com/QwenLM/Qwen3) is a collection of large language models developed by the Alibaba Group. A key feature is the ability to switch between a "thinking mode" for complex reasoning and a "non-thinking mode" for efficient general chat.
+
+To convert a model:
+
+```bash
+ct2-transformers-converter --model Qwen/Qwen3-4B --quantization float16 --output_dir qwen3-4b-ct2
+```
+
+Usage sample:
+
+You can use the converted model for text generation with `ctranslate2.Generator`. For Qwen 3 instruction-tuned models, use the Hugging Face tokenizer's `apply_chat_template` method to correctly format your prompts, especially when dealing with the optional "thinking mode". MoE model variants are currently not supported.
+
+```python
+import ctranslate2
+import transformers
+
+generator = ctranslate2.Generator("qwen3-4b-ct2")
+tokenizer = transformers.AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
+
+def generate(prompt):
+    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt, add_special_tokens=False))
+    results = generator.generate_batch([tokens], max_length=2048, sampling_temperature=0.7, include_prompt_in_result=False)
+    return tokenizer.decode(results[0].sequences_ids[0])
+
+prompt_base = """<|im_start|>user
+A train leaves Station A at 60 mph heading towards Station B, 300 miles away. At the same time, another train leaves Station B at 40 mph heading towards Station A. When will they meet and how far from Station A?
+<|im_end|>
+<|im_start|>assistant"""
+
+print("Non-thinking:\n" + "-"*60)
+print(generate(prompt_base + "\n<think></think>\n"))
+
+print("\nThinking:\n" + "="*60)
+print(generate(prompt_base))
+```
+
 ## T5

 [T5](https://huggingface.co/docs/transformers/model_doc/t5) is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which each task is converted into a text-to-text format.

include/ctranslate2/batch_reader.h

Lines changed: 8 additions & 1 deletion

@@ -56,7 +56,8 @@ namespace ctranslate2 {

     std::vector<Example>
     get_next(const size_t max_batch_size,
-             const BatchType batch_type = BatchType::Examples);
+             const BatchType batch_type = BatchType::Examples,
+             const bool consider_padding = false);

     // Consumes and returns the next example.
     virtual Example get_next_example() = 0;
@@ -67,6 +68,12 @@
     }

   private:
+    std::vector<Example> fill_batch_with_fixed_increment(const size_t max_batch_size,
+                                                         const BatchType batch_type);
+
+    std::vector<Example> fill_batch_with_variable_increment(const size_t max_batch_size,
+                                                            const BatchType batch_type);
+
     bool _initialized = false;
     Example _next;
   };
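
The new `consider_padding` flag suggests that batches can be sized by their padded token count rather than by the raw sum of lengths. A rough Python sketch of that accounting (illustrative only, not the C++ implementation; `fill_batch` and its arguments are hypothetical names):

```python
# Illustrative sketch: when padding is considered, the effective cost of a
# batch is max_length * num_examples, because shorter examples are padded
# up to the longest example in the batch.
def fill_batch(examples, max_batch_tokens):
    batch, max_len = [], 0
    for example in examples:  # assumed already sorted by length
        new_max = max(max_len, len(example))
        if batch and new_max * (len(batch) + 1) > max_batch_tokens:
            yield batch
            batch, max_len = [], 0
            new_max = len(example)
        batch.append(example)
        max_len = new_max
    if batch:
        yield batch
```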

include/ctranslate2/layers/attention.h

Lines changed: 3 additions & 0 deletions

@@ -2,6 +2,7 @@

 #include "ctranslate2/layers/attention_layer.h"
 #include "ctranslate2/padder.h"
+#include "ctranslate2/layers/transformer.h"

 namespace ctranslate2 {
   namespace layers {
@@ -65,6 +66,8 @@ namespace ctranslate2 {
       dim_t _relative_right_max_position;
       const bool _merge_time_and_head_dims;
       const dim_t _cache_time_dim;
+      std::unique_ptr<const LayerNorm> _q_norm; // Query normalization
+      std::unique_ptr<const LayerNorm> _k_norm; // Key normalization
     };
   }
 }
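
The new `_q_norm`/`_k_norm` members point at query/key normalization applied before attention, as used by recent architectures such as the Qwen 3 and Gemma 3 models added in this release. A schematic NumPy sketch, assuming an RMS-style normalization over the head dimension (the exact norm type and weights come from the converted model):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Normalize over the last (head) dimension, then scale.
    variance = np.mean(x * x, axis=-1, keepdims=True)
    return x / np.sqrt(variance + eps) * weight

batch, heads, head_dim = 2, 4, 8
q = np.random.randn(batch, heads, head_dim)
k = np.random.randn(batch, heads, head_dim)
q_weight = np.ones(head_dim)
k_weight = np.ones(head_dim)

# Queries and keys are normalized before the scaled dot-product.
q = rms_norm(q, q_weight)
k = rms_norm(k, k_weight)
scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(head_dim)
```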

include/ctranslate2/ops/median_filter.h

Lines changed: 3 additions & 2 deletions

@@ -1,17 +1,18 @@
 #pragma once
-
 #include "op.h"

 namespace ctranslate2 {
   namespace ops {

     class MedianFilter : public Op {
     public:
-      MedianFilter(const dim_t width);
+      explicit MedianFilter(dim_t width);
       void operator()(const StorageView& input, StorageView& output) const;

     private:
       const dim_t _width;
+      template <Device D, typename T>
+      void compute(const StorageView& input, const dim_t axis_size, StorageView& output) const;
     };

   }
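
For context, `MedianFilter` computes a sliding median of the given `width` along the last axis. A NumPy sketch of the computation (edge handling here is replicate padding; the op's exact padding behavior may differ):

```python
import numpy as np

def median_filter(x, width):
    # Sliding median along the last axis; width is expected to be odd.
    assert width % 2 == 1
    half = width // 2
    padded = np.pad(x, (half, half), mode="edge")  # replicate edges
    windows = np.lib.stride_tricks.sliding_window_view(padded, width)
    return np.median(windows, axis=-1)

print(median_filter(np.array([1.0, 9.0, 2.0, 3.0, 8.0]), 3))
# -> [1. 2. 3. 3. 8.]
```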

python/cpp/generator.cc

Lines changed: 4 additions & 4 deletions

@@ -234,10 +234,10 @@ namespace ctranslate2 {
         Arguments:
           start_tokens: Batch of start tokens. If the decoder starts from a special
             start token like ``<s>``, this token should be added to this input.
-          max_batch_size: The maximum batch size. If the number of inputs is greater than
-            :obj:`max_batch_size`, the inputs are sorted by length and split by chunks of
-            :obj:`max_batch_size` examples so that the number of padding positions is
-            minimized.
+          max_batch_size: The maximum batch size. If the number of inputs is greater than :obj:`max_batch_size`,
+            the inputs are sorted by length and split by chunks of :obj:`max_batch_size` examples
+            (or tokens when :obj:`batch_type`="tokens") so that the number of padding positions
+            is minimized.
           batch_type: Whether :obj:`max_batch_size` is the number of "examples" or "tokens".
           asynchronous: Run the generation asynchronously.
           beam_size: Beam size (1 for greedy search).
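
The batching behavior documented above is easy to exercise from the Python API; a small sketch (the model directory is a placeholder for any converted generator model):

```python
import ctranslate2

generator = ctranslate2.Generator("qwen3-4b-ct2")  # placeholder model directory

# Prompts of different lengths: with batch_type="tokens", the inputs are
# sorted by length and split into chunks of at most 64 tokens each, which
# keeps the padding inside every chunk to a minimum.
prompts = [
    ["<|im_start|>", "user"],
    ["<|im_start|>", "user", "A", "much", "longer", "prompt"],
]
results = generator.generate_batch(
    prompts,
    max_batch_size=64,
    batch_type="tokens",
    max_length=32,
)
print(results[0].sequences[0])
```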
