
Commit 82c32eb

guoshengCS and ZeyuChen authored

Refine FasterTransformer (PaddlePaddle#1122)

* Expose diversity rate. Refine extension utility. Update topk_update in FT.
* Remove duplicate doc for diversity_rate.
* Fix FT jit compiling cmake args.
* Fix sources attribute of FasterTransformerExtension.
* Use UPDATE_COMMAND instead of PATCH_COMMAND to make re-run always use the latest patches. Fix beam_id_in_output calculation in topk_stage_1_opt3.
* Update FT BLEU report in README.
* Fix diversity in beam search when not fusing topK and softmax.

Co-authored-by: Zeyu Chen <[email protected]>

1 parent ec2333e, commit 82c32eb

File tree

8 files changed: +240 additions, −90 deletions

examples/machine_translation/transformer/configs/transformer.base.yaml

Lines changed: 6 additions & 1 deletion

```diff
@@ -69,14 +69,19 @@ label_smooth_eps: 0.1
 # decrease when meeting the end token. However, 'v2' always generates
 # longer results thus might do more calculation and be slower.
 beam_search_version: "v1"
-beam_size: 5
+beam_size: 4
 max_out_len: 256
 # Indicating whether max_out_len in configurations is the length relative to
 # that of source text. Only works in `v2` temporarily.
 use_rel_len: False
 # The power number in length penalty calculation. Only works in `v2` temporarily.
 # Please refer to GNMT <https://arxiv.org/pdf/1609.08144.pdf>.
 alpha: 0.6
+# Refer to `A Simple, Fast Diverse Decoding Algorithm for Neural Generation
+# <https://arxiv.org/abs/1611.08562>`_ for details. Bigger `diversity_rate`
+# would lead to more diversity. if `diversity_rate == 0` is equivalent to naive
+# BeamSearch. **NOTE**: Only works when using FasterTransformer temporarily.
+diversity_rate: 0.0
 # The number of decoded sentences to output.
 n_best: 1
```
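The `diversity_rate` knob exposed above comes from Li & Jurafsky's diverse decoding: within one parent beam, the k-th best candidate expansion is penalized in proportion to its sibling rank, so different parents win the global top-k more often. A minimal pure-Python sketch of the scoring rule (the function name and the rank-from-zero convention are illustrative, not from the repo; the real implementation lives in FasterTransformer's fused CUDA topK kernels):

```python
def diverse_sibling_scores(scores, diversity_rate=0.0):
    """Sibling-rank penalty from Li & Jurafsky (2016).

    `scores` holds one parent beam's candidate log-probs at the current
    step. Candidates are ranked best-to-worst; the k-th ranked candidate
    (k = 0 for the best) has diversity_rate * k subtracted before the
    beam-wide top-k selection.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    penalized = list(scores)
    for rank, idx in enumerate(order):
        penalized[idx] = scores[idx] - diversity_rate * rank
    return penalized

# diversity_rate == 0 leaves the scores untouched: naive beam search.
assert diverse_sibling_scores([-0.7, -1.2, -1.6], 0.0) == [-0.7, -1.2, -1.6]
```

With a positive rate, lower-ranked siblings of the same parent are pushed down, which is why larger values yield more diverse n-best lists.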

examples/machine_translation/transformer/configs/transformer.big.yaml

Lines changed: 6 additions & 1 deletion

```diff
@@ -69,14 +69,19 @@ label_smooth_eps: 0.1
 # decrease when meeting the end token. However, 'v2' always generates
 # longer results thus might do more calculation and be slower.
 beam_search_version: "v1"
-beam_size: 5
+beam_size: 4
 max_out_len: 1024
 # Indicating whether max_out_len in configurations is the length relative to
 # that of source text. Only works in `v2` temporarily.
 use_rel_len: False
 # The power number in length penalty calculation. Only works in `v2` temporarily.
 # Please refer to GNMT <https://arxiv.org/pdf/1609.08144.pdf>.
 alpha: 0.6
+# Refer to `A Simple, Fast Diverse Decoding Algorithm for Neural Generation
+# <https://arxiv.org/abs/1611.08562>`_ for details. Bigger `diversity_rate`
+# would lead to more diversity. if `diversity_rate == 0` is equivalent to naive
+# BeamSearch. **NOTE**: Only works when using FasterTransformer temporarily.
+diversity_rate: 0.0
 # The number of decoded sentences to output.
 n_best: 1
```

examples/machine_translation/transformer/faster_transformer/README.md

Lines changed: 2 additions & 2 deletions

````diff
@@ -178,7 +178,7 @@ git clone https://github.com/moses-smt/mosesdecoder.git
 perl mosesdecoder/scripts/generic/multi-bleu.perl ~/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data/newstest2014.tok.de < predict.tok.txt
 ```
 
-After running the above, you should see output like the following; this is the BLEU result of the base model on newstest2014:
+After running the above, you should see output like the following; this is the BLEU result of the base model on newstest2014 with beam_size 5:
 ```
 BLEU = 26.89, 58.4/32.6/20.5/13.4 (BP=1.000, ratio=1.010, hyp_len=65166, ref_len=64506)
 ```
@@ -300,7 +300,7 @@ git clone https://github.com/moses-smt/mosesdecoder.git
 perl mosesdecoder/scripts/generic/multi-bleu.perl ~/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data/newstest2014.tok.de < predict.tok.txt
 ```
 
-After running the above, you should see output like the following; this is the BLEU result of the base model on newstest2014:
+After running the above, you should see output like the following; this is the BLEU result of the base model on newstest2014 with beam_size 5:
 ```
 BLEU = 26.89, 58.4/32.6/20.5/13.4 (BP=1.000, ratio=1.010, hyp_len=65166, ref_len=64506)
 ```
````

examples/machine_translation/transformer/predict.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -110,6 +110,7 @@ def do_predict(args):
         beam_search_version=args.beam_search_version,
         rel_len=args.use_rel_len,  # only works when using FT or beam search v2
         alpha=args.alpha,  # only works when using beam search v2
+        diversity_rate=args.diversity_rate,  # only works when using FT
         use_fp16_decoding=False)  # only works when using FT
 
     # Load the trained model
```

paddlenlp/ops/CMakeLists.txt

Lines changed: 59 additions & 51 deletions

```diff
@@ -75,6 +75,59 @@ set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS}")
 set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}")
 set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler -Wall")
 
+######################################################################################
+# A function for automatic detection of GPUs installed (if autodetection is enabled)
+# Usage:
+#   detect_installed_gpus(out_variable)
+function(detect_installed_gpus out_variable)
+  if(NOT CUDA_gpu_detect_output)
+    set(cufile ${PROJECT_BINARY_DIR}/detect_cuda_archs.cu)
+
+    file(WRITE ${cufile} ""
+      "#include \"stdio.h\"\n"
+      "#include \"cuda.h\"\n"
+      "#include \"cuda_runtime.h\"\n"
+      "int main() {\n"
+      "  int count = 0;\n"
+      "  if (cudaSuccess != cudaGetDeviceCount(&count)) return -1;\n"
+      "  if (count == 0) return -1;\n"
+      "  for (int device = 0; device < count; ++device) {\n"
+      "    cudaDeviceProp prop;\n"
+      "    if (cudaSuccess == cudaGetDeviceProperties(&prop, device))\n"
+      "      printf(\"%d.%d \", prop.major, prop.minor);\n"
+      "  }\n"
+      "  return 0;\n"
+      "}\n")
+
+    execute_process(COMMAND "${CUDA_NVCC_EXECUTABLE}"
+                    "--run" "${cufile}"
+                    WORKING_DIRECTORY "${PROJECT_BINARY_DIR}/CMakeFiles/"
+                    RESULT_VARIABLE nvcc_res OUTPUT_VARIABLE nvcc_out
+                    ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE)
+
+    if(nvcc_res EQUAL 0)
+      # Only use last item of nvcc_out (the last device's compute capability).
+      string(REGEX REPLACE "\\." "" nvcc_out "${nvcc_out}")
+      string(REGEX MATCHALL "[0-9()]+" nvcc_out "${nvcc_out}")
+      list(GET nvcc_out -1 nvcc_out)
+      set(CUDA_gpu_detect_output ${nvcc_out} CACHE INTERNAL "Returned GPU architetures from detect_installed_gpus tool" FORCE)
+    endif()
+  endif()
+
+  if(NOT CUDA_gpu_detect_output)
+    message(STATUS "Automatic GPU detection failed. Building for all known architectures.")
+    set(${out_variable} ${paddle_known_gpu_archs} PARENT_SCOPE)
+  else()
+    set(${out_variable} ${CUDA_gpu_detect_output} PARENT_SCOPE)
+  endif()
+endfunction()
+
+if (NOT SM)
+  # TODO(guosheng): Remove it if `GetCUDAComputeCapability` is exposed by paddle.
+  # Currently, if `CUDA_gpu_detect_output` is not defined, use the detected arch.
+  detect_installed_gpus(SM)
+endif()
+
 if (SM STREQUAL 80 OR
     SM STREQUAL 86 OR
     SM STREQUAL 70 OR
@@ -217,64 +270,19 @@ set(FT_PATCH_COMMAND
     && ${MUTE_COMMAND}
 )
 
-######################################################################################
-# A function for automatic detection of GPUs installed (if autodetection is enabled)
-# Usage:
-#   detect_installed_gpus(out_variable)
-function(detect_installed_gpus out_variable)
-  if(NOT CUDA_gpu_detect_output)
-    set(cufile ${PROJECT_BINARY_DIR}/detect_cuda_archs.cu)
-
-    file(WRITE ${cufile} ""
-      "#include \"stdio.h\"\n"
-      "#include \"cuda.h\"\n"
-      "#include \"cuda_runtime.h\"\n"
-      "int main() {\n"
-      "  int count = 0;\n"
-      "  if (cudaSuccess != cudaGetDeviceCount(&count)) return -1;\n"
-      "  if (count == 0) return -1;\n"
-      "  for (int device = 0; device < count; ++device) {\n"
-      "    cudaDeviceProp prop;\n"
-      "    if (cudaSuccess == cudaGetDeviceProperties(&prop, device))\n"
-      "      printf(\"%d.%d \", prop.major, prop.minor);\n"
-      "  }\n"
-      "  return 0;\n"
-      "}\n")
-
-    execute_process(COMMAND "${CUDA_NVCC_EXECUTABLE}"
-                    "--run" "${cufile}"
-                    WORKING_DIRECTORY "${PROJECT_BINARY_DIR}/CMakeFiles/"
-                    RESULT_VARIABLE nvcc_res OUTPUT_VARIABLE nvcc_out
-                    ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE)
-
-    if(nvcc_res EQUAL 0)
-      # Only use last item of nvcc_out (the last device's compute capability).
-      string(REGEX REPLACE "\\." "" nvcc_out "${nvcc_out}")
-      string(REGEX MATCHALL "[0-9()]+" nvcc_out "${nvcc_out}")
-      list(GET nvcc_out -1 nvcc_out)
-      set(CUDA_gpu_detect_output ${nvcc_out} CACHE INTERNAL "Returned GPU architetures from detect_installed_gpus tool" FORCE)
-    endif()
-  endif()
-
-  if(NOT CUDA_gpu_detect_output)
-    message(STATUS "Automatic GPU detection failed. Building for all known architectures.")
-    set(${out_variable} ${paddle_known_gpu_archs} PARENT_SCOPE)
-  else()
-    set(${out_variable} ${CUDA_gpu_detect_output} PARENT_SCOPE)
-  endif()
-endfunction()
-
-# TODO(guosheng): Remove it if `GetCUDAComputeCapability` is exposed by paddle.
-# Currently, if `CUDA_gpu_detect_output` is not defined, use the detected arch.
-detect_installed_gpus(SM)
+# TODO(guosheng): Use UPDATE_COMMAND instead of PATCH_COMMAND to make cmake
+# re-run always use the latest patches when the developer changes FT patch codes,
+# all patches rather than the changes would re-build, any better way to do this.
+# Or maybe hidden this function for simplicity.
+set(FT_UPDATE_COMMAND git checkout v3.1 && git checkout . && ${FT_PATCH_COMMAND})
 
 ExternalProject_Add(
     extern_${THIRD_PARTY_NAME}
     GIT_REPOSITORY  https://github.com/NVIDIA/FasterTransformer.git
     GIT_TAG         v3.1
    PREFIX          ${THIRD_PATH}
     SOURCE_DIR      ${THIRD_PATH}/source/${THIRD_PARTY_NAME}
-    PATCH_COMMAND   ${FT_PATCH_COMMAND}
+    UPDATE_COMMAND  ${FT_UPDATE_COMMAND} # PATCH_COMMAND ${FT_PATCH_COMMAND}
     BINARY_DIR      ${THIRD_PATH}/build/${THIRD_PARTY_NAME}
     INSTALL_COMMAND ""
     CMAKE_ARGS      -DCMAKE_BUILD_TYPE=Release -DSM=${SM} -DBUILD_PD=ON -DPY_CMD=${PY_CMD} -DON_INFER=${ON_INFER} -DPADDLE_LIB=${PADDLE_LIB} -DWITH_MKL=${WITH_MKL} -DWITH_STATIC_LIB=${WITH_STATIC_LIB}
```
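The `detect_installed_gpus` function above compiles and runs a small CUDA probe via `nvcc --run`, then squeezes the probe's stdout (e.g. `7.0 8.0 `) down to a single SM value: strip the dots, keep only the last device's capability. The string post-processing can be sketched in Python (a hypothetical helper, shown only to make the CMake regex pipeline concrete):

```python
def parse_sm(nvcc_out: str) -> str:
    """Mimic the CMake pipeline: for probe output like "6.1 7.5 ",
    drop the dots from each compute capability and return the last
    device's value, e.g. "75"."""
    caps = nvcc_out.strip().split()
    if not caps:
        # Mirrors the CMake fallback branch: detection failed.
        raise ValueError("Automatic GPU detection failed.")
    return caps[-1].replace(".", "")

assert parse_sm("6.1 7.5 ") == "75"
```

Taking the last device only is a deliberate simplification in the commit (noted in its own comment); multi-GPU hosts with mixed architectures would build for just one of them.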

paddlenlp/ops/ext_utils.py

Lines changed: 25 additions & 6 deletions

```diff
@@ -35,6 +35,20 @@
     CUDA_HOME = None
 
 
+def _get_files(path):
+    """
+    Helps to list all files under the given path.
+    """
+    if os.path.isfile(path):
+        return [path]
+    all_files = []
+    for root, _dirs, files in os.walk(path, followlinks=True):
+        for file in files:
+            file = os.path.join(root, file)
+            all_files.append(file)
+    return all_files
+
+
 class CMakeExtension(Extension):
     def __init__(self, name, source_dir=None):
         # A CMakeExtension needs a source_dir instead of a file list.
@@ -43,10 +57,7 @@ def __init__(self, name, source_dir=None):
             self.source_dir = str(Path(__file__).parent.resolve())
         else:
             self.source_dir = os.path.abspath(os.path.expanduser(source_dir))
-        self.sources = [
-            os.path.join(self.source_dir, f)
-            for f in os.listdir(self.source_dir)
-        ]
+        self.sources = _get_files(self.source_dir)
 
     def build_with_command(self, ext_builder):
         """
@@ -95,6 +106,10 @@ def get_target_filename(self):
 class FasterTransformerExtension(CMakeExtension):
     def __init__(self, name, source_dir=None):
         super(FasterTransformerExtension, self).__init__(name, source_dir)
+        self.sources = _get_files(
+            os.path.
+            join(self.source_dir, "faster_transformer", "src")) + _get_files(
+                os.path.join(self.source_dir, "patches", "FasterTransformer"))
         self._std_out_handle = None
         # Env variable may not work as expected, since jit compile by `load`
         # would not re-built if source code is not update.
@@ -114,7 +129,7 @@ def build_with_command(self, ext_builder):
         # `GetCUDAComputeCapability` is not exposed yet, and detect CUDA/GPU
         # version in cmake file.
         # self.cmake_args += [f"-DSM={self.sm}"] if self.sm is not None else []
-        self.cmake_args = [f"-DWITH_GPT=ON"]
+        self.cmake_args += [f"-DWITH_GPT=ON"]
         try:
             super(FasterTransformerExtension,
                   self).build_with_command(ext_builder)
@@ -207,7 +222,11 @@ def load(name, build_dir=None, force=False, verbose=False, **kwargs):
                 name)
             raise NotImplementedError
         if build_dir is None:
-            build_dir = os.path.join(PPNLP_HOME, 'extenstions')
+            # Maybe under package dir is better to avoid cmake source path conflict
+            # with different source path.
+            # build_dir = os.path.join(PPNLP_HOME, 'extenstions')
+            build_dir = os.path.join(
+                str(Path(__file__).parent.resolve()), 'extenstions')
         build_base_dir = os.path.abspath(
             os.path.expanduser(os.path.join(build_dir, name)))
         if not os.path.exists(build_base_dir):
```

paddlenlp/ops/faster_transformer/transformer/faster_transformer.py

Lines changed: 19 additions & 1 deletion

```diff
@@ -92,7 +92,10 @@ class FasterTransformer(TransformerModel):
         max_out_len (int, optional):
             The maximum output length. Defaults to 256.
         diversity_rate (float, optional):
-            The diversity rate for beam search. Defaults to 0.0.
+            Refer to `A Simple, Fast Diverse Decoding Algorithm for Neural Generation <https://arxiv.org/abs/1611.08562>`_
+            for details. Bigger `diversity_rate` would lead to more diversity.
+            if `diversity_rate == 0` is equivalent to naive BeamSearch. Default
+            to 0 if not set.
         use_fp16_decoding(bool, optional): Whether to use fp16 for decoding.
         rel_len(bool, optional):
             Indicating whether `max_out_len` in is the length relative to that
@@ -458,6 +461,13 @@ class TransformerGenerator(paddle.nn.Layer):
         - `alpha(float, optional)`: The power number in length penalty
           calculation. Refer to `GNMT <https://arxiv.org/pdf/1609.08144.pdf>`_.
           Only works in `v2` temporarily. Default to 0.6 if not set.
+
+        - diversity_rate(float, optional): Refer to `A Simple, Fast Diverse
+          Decoding Algorithm for Neural Generation <https://arxiv.org/abs/1611.08562>`_
+          for details. Bigger `diversity_rate` would lead to more diversity.
+          if `diversity_rate == 0` is equivalent to naive BeamSearch. Default
+          to 0 if not set. **NOTE**: Only works when using FasterTransformer
+          temporarily.
     """
 
     def __init__(self,
@@ -524,6 +534,10 @@ def __init__(self,
                 logger.warning(
                     "Exception occurs when using Faster Transformer. " \
                     "The original forward will be involved. ")
+                if diversity_rate != 0:
+                    logger.warning(
+                        "diversity_rate would not work since it is only " \
+                        "supported by FasterTransformer temporarily.")
                 self.transformer = InferTransformerModel(
                     src_vocab_size=src_vocab_size,
                     trg_vocab_size=trg_vocab_size,
@@ -544,6 +558,10 @@ def __init__(self,
                     rel_len=rel_len,
                     alpha=alpha)
         else:
+            if diversity_rate != 0:
+                logger.warning(
+                    "diversity_rate would not work since it is only " \
+                    "supported by FasterTransformer temporarily.")
             self.transformer = InferTransformerModel(
                 src_vocab_size=src_vocab_size,
                 trg_vocab_size=trg_vocab_size,
```
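Both warning sites added here follow one pattern: when an option honored only by the fused FasterTransformer path is set but the pure-Paddle `InferTransformerModel` fallback will actually run, log a warning rather than ignore the option silently. A generic sketch of that guard (names are hypothetical, not from the repo):

```python
import logging

logger = logging.getLogger("example")

def warn_ft_only_option(using_ft: bool, diversity_rate: float = 0.0) -> bool:
    """Return True (after logging) if an FT-only option would be
    silently ignored by the fallback decoding path."""
    if not using_ft and diversity_rate != 0:
        logger.warning(
            "diversity_rate would not work since it is only "
            "supported by FasterTransformer temporarily.")
        return True
    return False

# Fallback path with a non-default diversity_rate triggers the warning.
assert warn_ft_only_option(using_ft=False, diversity_rate=0.1) is True
assert warn_ft_only_option(using_ft=True, diversity_rate=0.1) is False
```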
