Commit fa60078

Merge branch 'develop' into PET
2 parents 3ae883c + b6a7f8f commit fa60078

72 files changed: +3575 -610 lines changed

README.md

Lines changed: 2 additions & 2 deletions
@@ -53,7 +53,7 @@ pip install --upgrade paddlenlp

 ### Transformer API: A Powerful Foundation for the Pre-trained Model Ecosystem

-Covers **15** network architectures and **67** sets of pre-trained model parameters, including Baidu's own pre-trained models such as the ERNIE series, PLATO, and SKEP, as well as mainstream Chinese pre-trained models from across the industry. Developers are also welcome to contribute pre-trained models! 🤗
+Covers **15** network architectures and **67** sets of pre-trained model parameters, including Baidu's own pre-trained models such as the ERNIE series, PLATO, and SKEP, as well as mainstream Chinese pre-trained models from across the industry. Developers are also welcome to contribute pre-trained models! 🤗

 ```python
 from paddlenlp.transformers import *
@@ -78,7 +78,7 @@ text = tokenizer('自然语言处理')

 # Semantic representation
 model = ErnieModel.from_pretrained('ernie-1.0')
-pooled_output, sequence_output = model(input_ids=paddle.to_tensor([text['input_ids']]))
+sequence_output, pooled_output = model(input_ids=paddle.to_tensor([text['input_ids']]))
 # Text classification & sentence-pair matching
 model = ErnieForSequenceClassification.from_pretrained('ernie-1.0')
 # Sequence labeling
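
The swapped unpacking order in this hunk matters because the two outputs have different shapes. The following is a minimal sketch, assuming `ErnieModel` returns `(sequence_output, pooled_output)` with shapes `[batch, seq_len, hidden]` and `[batch, hidden]`. `fake_ernie_forward` and `shape_of` are hypothetical stand-ins so the snippet runs without PaddlePaddle installed; they are not part of PaddleNLP.

```python
def fake_ernie_forward(input_ids, hidden_size=4):
    """Hypothetical stand-in for calling ErnieModel.

    Returns nested lists shaped like (sequence_output, pooled_output).
    """
    batch = len(input_ids)
    seq_len = len(input_ids[0])
    # One hidden vector per token: [batch, seq_len, hidden]
    sequence_output = [
        [[0.0] * hidden_size for _ in range(seq_len)] for _ in range(batch)
    ]
    # One hidden vector per sequence: [batch, hidden]
    pooled_output = [[0.0] * hidden_size for _ in range(batch)]
    return sequence_output, pooled_output


def shape_of(nested):
    """Return the shape of a non-empty rectangular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


input_ids = [[1, 2, 3, 4, 5]]  # one sequence of five token ids
sequence_output, pooled_output = fake_ernie_forward(input_ids)
print(shape_of(sequence_output))  # (1, 5, 4)
print(shape_of(pooled_output))    # (1, 4)
```

With the old order, a variable named `pooled_output` would silently hold the per-token tensor, which is the kind of bug that only surfaces later as a shape mismatch.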

README_en.md

Lines changed: 2 additions & 2 deletions
@@ -20,7 +20,7 @@ English | [简体中文](./README.md)

 ## Introduction

-**PaddleNLP** is a powerful NLP library with **Awesome** pre-trained Transformer models and an easy-to-use interface, supporting a wide range of NLP tasks from research to industrial applications.
+**PaddleNLP** is a powerful NLP library with **Awesome** pre-trained Transformer models and an easy-to-use interface, supporting a wide range of NLP tasks from research to industrial applications.


 * **Easy-to-Use API**
@@ -76,7 +76,7 @@ text = tokenizer('natural language understanding')

 # Semantic Representation
 model = ErnieModel.from_pretrained('ernie-1.0')
-pooled_output, sequence_output = model(input_ids=paddle.to_tensor([text['input_ids']]))
+sequence_output, pooled_output = model(input_ids=paddle.to_tensor([text['input_ids']]))
 # Text Classification and Matching
 model = ErnieForSequenceClassification.from_pretrained('ernie-1.0')
 # Sequence Labeling

docs/data_prepare/dataset_list.md

Lines changed: 6 additions & 0 deletions
@@ -11,6 +11,9 @@ PaddleNLP provides quick-loading APIs for the following datasets; in actual use, please
 | [DuReader-robust](https://aistudio.baidu.com/aistudio/competition/detail/49) | Qianyan dataset: reading comprehension with answers extracted from the source text | `paddlenlp.datasets.load_dataset('dureader_robust')` |
 | [CMRC2018](http://hfl-rc.com/cmrc2018/) | Dataset of the 2nd "iFLYTEK Cup" Chinese machine reading comprehension evaluation | `paddlenlp.datasets.load_dataset('cmrc2018')` |
 | [DRCD](https://github.com/DRCKnowledgeTeam/DRCD) | Delta Reading Comprehension Dataset | `paddlenlp.datasets.load_dataset('drcd')` |
+| [TriviaQA](http://nlp.cs.washington.edu/triviaqa/) | Question answering dataset from the University of Washington | `paddlenlp.datasets.load_dataset('triviaqa')` |
+| [C3](https://dataset.org/c3/) | Multiple-choice reading comprehension | `paddlenlp.datasets.load_dataset('c3')` |
+

 ## Text Classification

@@ -48,6 +51,9 @@ PaddleNLP provides quick-loading APIs for the following datasets; in actual use, please
 | [THUCNews](https://github.com/gaussic/text-classification-cnn-rnn#%E6%95%B0%E6%8D%AE%E9%9B%86) | THUCNews Chinese news category classification | `paddlenlp.datasets.load_dataset('thucnews')` |
 | [HYP](https://pan.webis.de/semeval19/semeval19-web/) | English political news sentiment classification corpus | `paddlenlp.datasets.load_dataset('hyp')` |

+## Text Matching
+| [CAIL2019-SCM](https://github.com/china-ai-law-challenge/CAIL2019/tree/master/scm) | Similar legal case matching | `paddlenlp.datasets.load_dataset('cail2019_scm')` |
+
 ## Sequence Labeling

 | Dataset | Description | Usage |

docs/data_prepare/dataset_load.rst

Lines changed: 2 additions & 0 deletions
@@ -65,3 +65,5 @@

 >>> from paddlenlp.datasets import load_dataset
 >>> train_ds, test_ds = load_dataset("glue", "cola", splits=["train", "test"], data_files=["my_train_file.csv", "my_test_file.csv"])
+
+**Also note that datasets have no default loading option: at least one of** :attr:`splits` **and** :attr:`data_files` **must be specified.**
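
The rule added to the docs (at least one of `splits` and `data_files` must be given) can be illustrated with a small argument check. This is only a sketch of the documented behaviour: `check_load_args` is a hypothetical helper, not PaddleNLP code, and the error `load_dataset` actually raises may differ.

```python
def check_load_args(splits=None, data_files=None):
    """Hypothetical validator mirroring the documented rule.

    Raises ValueError when neither `splits` nor `data_files` is given.
    """
    if splits is None and data_files is None:
        raise ValueError("Specify at least one of `splits` or `data_files`.")
    return {"splits": splits, "data_files": data_files}


# Passing either argument on its own is accepted.
ok = check_load_args(splits=["train", "test"])

# Passing neither is rejected.
try:
    check_load_args()
    rejected = False
except ValueError:
    rejected = True

print(ok["splits"], rejected)
```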

examples/dialogue/plato-2/interaction.py

Lines changed: 1 addition & 0 deletions
@@ -76,6 +76,7 @@ def interact(args):
 example, is_infer=True)
 data = plato_reader._pad_batch_records([record], is_infer=True)
 inputs = gen_inputs(data, args.latent_type_size)
+inputs['tgt_ids'] = inputs['tgt_ids'].astype('int64')
 pred = model(inputs)[0]
 bot_response = pred["response"]
 print(

examples/dialogue/plato-2/utils/tokenization.py

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ def convert_to_unicode(text):
 def load_vocab(vocab_file):
     """Loads a vocabulary file into a dictionary."""
     vocab = collections.OrderedDict()
-    fin = open(vocab_file)
+    fin = open(vocab_file, 'r', encoding="UTF-8")
     for num, line in enumerate(fin):
         items = convert_to_unicode(line.rstrip()).split("\t")
         if len(items) > 2:
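
Passing `encoding` explicitly matters because `open()` otherwise uses the platform's default encoding, which is often not UTF-8 on Windows. The sketch below is a simplified variant of `load_vocab`, assuming one token per line and using the line number as the index; the original function's handling of extra tab-separated columns is not reproduced here.

```python
import collections
import os
import tempfile


def load_vocab(vocab_file):
    """Load a vocabulary file into an ordered token -> line-number dict.

    Simplified sketch: only the first tab-separated column is used.
    """
    vocab = collections.OrderedDict()
    # An explicit encoding makes the result independent of the OS locale.
    with open(vocab_file, "r", encoding="utf-8") as fin:
        for num, line in enumerate(fin):
            token = line.rstrip("\n").split("\t")[0]
            vocab[token] = num
    return vocab


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocab.txt")
    with open(path, "w", encoding="utf-8") as fout:
        fout.write("[PAD]\n你好\n世界\n")
    vocab = load_vocab(path)

print(vocab["世界"])  # 2
```

Using a `with` block also closes the file handle, which the bare `fin = open(...)` in the hunk leaves to the garbage collector.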

examples/dialogue/unified_transformer/README.md

Lines changed: 2 additions & 4 deletions
@@ -48,7 +48,7 @@ train_ds, dev_ds, test1_ds, test2_ds = load_dataset('duconv', splits=('train', '

 ### Model Training

-Run the following command to fine-tune on the raining set and validate on the validation set
+Run the following command to fine-tune on the training set and validate on the validation set

 ```shell
 # Launch on GPU; the `--gpus` argument specifies the GPU card IDs used for training, either a single card or multiple cards
@@ -81,7 +81,6 @@ python -m paddle.distributed.launch --gpus '0' --log_dir ./log finetune.py \
 |---------------------------------|
 | unified_transformer-12L-cn |
 | unified_transformer-12L-cn-luge |
-| plato-mini |

 - `save_dir`: the path where the model is saved.
 - `logging_steps`: the logging interval.
@@ -143,7 +142,6 @@ python infer.py \
 |---------------------------------|
 | unified_transformer-12L-cn |
 | unified_transformer-12L-cn-luge |
-| plato-mini |

 - `output_path`: the path where prediction results are saved.
 - `logging_steps`: the logging interval.
@@ -202,7 +200,7 @@ python interaction.py \
 - `top_k`: when the "sampling" decoding strategy is used, tokens are sorted by probability in descending order and the generated token is sampled only from the first `top_k` of them.
 - `device`: the device to use.

-**NOTE:** Enter "[EXIT]" to quit the interactive program, and enter "[NEXT]" to start a new round of dialogue.
+**NOTE:** Enter "[EXIT]" to quit the interactive program, and enter "[NEXT]" to start a new round of dialogue. Note that using backspace will cause an error.

 ## Reference
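
The `top_k` option described in this README restricts sampling to the `top_k` most probable tokens. Below is a minimal pure-Python sketch of that idea. `top_k_sample` is a hypothetical helper written for illustration, not the decoding code used by `model.generate`, and the probabilities are made up.

```python
import random


def top_k_sample(probs, top_k, rng=random):
    """Sample a token id from the `top_k` most probable entries of `probs`."""
    # Rank token ids by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    candidates = ranked[:top_k]
    # random.choices normalises the weights, so no explicit rescaling is needed.
    weights = [probs[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


probs = [0.05, 0.40, 0.10, 0.30, 0.15]  # made-up next-token distribution
rng = random.Random(0)
samples = {top_k_sample(probs, top_k=2, rng=rng) for _ in range(100)}
print(samples)  # only token ids 1 and 3 can appear
```

A smaller `top_k` makes replies more conservative, while a larger one admits lower-probability tokens and increases variety.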

examples/dialogue/unified_transformer/interaction.py

Lines changed: 1 addition & 0 deletions
@@ -48,6 +48,7 @@ def interaction(args, model, tokenizer):
 add_start_token_as_response=True,
 return_tensors=True,
 is_split_into_words=False)
+inputs['input_ids'] = inputs['input_ids'].astype('int64')
 ids, scores = model.generate(
 input_ids=inputs['input_ids'],
 token_type_ids=inputs['token_type_ids'],

examples/few_shot/p-tuning/predict.py

Lines changed: 1 addition & 1 deletion
@@ -340,7 +340,7 @@ def write_chid(task_name, output_file, pred_labels):
 args.task_name + ".json")

 label_norm_dict = None
-with open(label_normalize_json) as f:
+with open(label_normalize_json, encoding='utf-8') as f:
     label_norm_dict = json.load(f)

 convert_example_fn = convert_example if args.task_name != "chid" else convert_chid_example
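
The `encoding='utf-8'` fix matters when the label file contains non-ASCII text, as a Chinese label-normalization file would. The following sketch round-trips such a file with an explicit encoding; the file name and the label mapping are made up for illustration and are not taken from the repository.

```python
import json
import os
import tempfile

# Made-up label mapping standing in for the real label-normalization file.
label_map = {"体育": "sports", "财经": "finance"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "label_normalized.json")
    # ensure_ascii=False writes the raw characters, so the bytes on disk are UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(label_map, f, ensure_ascii=False)
    # Reading with the same explicit encoding works regardless of the OS locale.
    with open(path, encoding="utf-8") as f:
        label_norm_dict = json.load(f)

print(label_norm_dict == label_map)  # True
```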

examples/information_extraction/waybill_ie/README.md

Lines changed: 12 additions & 1 deletion
@@ -11,7 +11,7 @@
 Run the following command to download and extract the sample dataset:

 ```bash
-python download.py --data_dir ./
+python download.py --data_dir ./waybill_ie
 ```

 A sample of the data is shown below:
@@ -51,6 +51,17 @@ python run_bigru_crf.py
 export CUDA_VISIBLE_DEVICES=0
 python run_ernie.py
 ```
+##### Model Export
+After training in dynamic-graph mode, you can also export the dynamic-graph parameters as static-graph parameters; see export_model.py for the code. The static-graph parameters are saved to the path given by output_path. To run:
+
+`python export_model.py --params_path ernie_ckpt/model_80.pdparams --output_path=./output`
+
+Here `params_path` is the path to the parameters saved by dynamic-graph training, and `output_path` is the export path for the static-graph parameters.
+
+The exported model can then be used for deployment; deploy/python/predict.py provides a Python deployment and prediction example. To run:
+
+`python deploy/python/predict.py --model_dir ./output`
+

 #### Launch ERNIE + CRF Training
