Commit c715708 (1 parent: 0551569)

Refactor part of the code; add two new models.
46 files changed: +2676 −952 lines

MODELS.md — 29 additions & 5 deletions

@@ -2,6 +2,8 @@

 | Model     | Size                                            | SHA256                                                           |
 | :-------: | :---------------------------------------------: | :--------------------------------------------------------------: |
+| Base1(v3) | [583MB](http://39.96.43.154/ltp/v3/base1.tgz)   | 397c3893e39692ced5858930e0cf8556454747a7c76521d70423a147d6f8c6d7 |
+| Base2(v3) | [583MB](http://39.96.43.154/ltp/v3/base2.tgz)   | 685a195f09c1947231394ef1bb814e8608252888a9a6dcc1fa5080a5a186e096 |
 | Base(v3)  | [491.9MB](http://39.96.43.154/ltp/v3/base.tgz)  | 777a97d6770285e5ab3b0720923bc86781e3279508a72a30c2dd9140b09e5ec8 |
 | Small(v3) | [156.8MB](http://39.96.43.154/ltp/v3/small.tgz) | 0992d5037cd1c62779a3b5c6d45b883a46e4782c6bcc5850117faf69a9ee6c56 |
 | Tiny(v3)  | [31.3MB](http://39.96.43.154/ltp/v3/tiny.tgz)   | d0ab69f1493db232676423270d481080bf636bf8547e4297129b6a21c6f73612 |

@@ -16,14 +18,27 @@

 ## V2/V3 Metrics

-| Model           | Seg   | POS   | NER   | SRL   | DEP   | SDP   | Speed (sents/s) |
-| :-------------: | :---: | :---: | :---: | :---: | :---: | :---: | :-------------: |
-| LTP 4.0 (Base)  | 98.7  | 98.5  | 95.4  | 80.6  | 89.5  | 75.2  |                 |
-| LTP 4.0 (Small) | 98.4  | 98.2  | 94.3  | 78.4  | 88.3  | 74.7  | 12.58           |
-| LTP 4.0 (Tiny)  | 96.8  | 97.1  | 91.6  | 70.9  | 83.8  | 70.1  | 29.53           |
+| Model           | Seg   | POS   | NER   | SRL   | DEP   | SDP   | Speed (sents/s) |
+| :-------------: | :---: | :---: | :---: | :---: | :---: | :---: | :-------------: |
+| LTP 4.0 (Base)  | 98.7  | 98.5  | 95.4  | 80.6  | 89.5  | 75.2  | 39.12           |
+| LTP 4.0 (Base1) | 99.22 | 98.73 | 96.39 | 79.28 | 89.57 | 76.57 | --.--           |
+| LTP 4.0 (Base2) | 99.18 | 98.69 | 95.97 | 79.49 | 90.19 | 76.62 | --.--           |
+| LTP 4.0 (Small) | 98.4  | 98.2  | 94.3  | 78.4  | 88.3  | 74.7  | 43.13           |
+| LTP 4.0 (Tiny)  | 96.8  | 97.1  | 91.6  | 70.9  | 83.8  | 70.1  | 53.22           |

 **Note**: In this version, SDP uses the [CCL2020 Semantic Dependency Parsing](http://ir.hit.edu.cn/sdp2020ccl) corpus; the other corpora are the same as in V1.

+Test environment:
+
++ Python 3.8.5
++ LTP 4.1, Batch Size = 8
++ CentOS Linux release 8.3.2011
++ Tesla V100-SXM2-16GB
++ Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
+
+**Note**: The speed figures were obtained on the People's Daily NER test data, with all tasks executed sequentially. Also, the old and new releases use different corpora for semantic role labeling and semantic dependency parsing, so those columns are not directly comparable (the new SDP uses the SemEval 2016 corpus; SRL uses the CPB3.0 corpus).
+
 ## V1 Metrics

@@ -55,3 +70,12 @@

 | GSDSimp + OntoNotes ([GSDSimp](http://39.96.43.154/ltp/ud/gsds.tgz))         | 98.44 | 96.84 | 78.06 | 87.58 | d51b8508f290ab82d1c3844541eb774506213c1f6cf7d2b86fe7d69358d0d52a |
 | GSDSimp + OntoNotes ([GSDSimp+CRF](http://39.96.43.154/ltp/ud/gsds_crf.tgz)) | 98.4  | 96.47 | 79.69 | 86.39 | 1e9ea20cfbc1837bf5736f8b8502aaecb9343590e98b04bba18e15724d3092b2 |

+### UD System Performance
+
+| Model                                                                        | Seg   | POS (XPOS) | NER   | DEP   | SHA256                                                           |
+| :--------------------------------------------------------------------------: | :---: | :--------: | :---: | :---: | :--------------------------------------------------------------: |
+| GSD + OntoNotes ([GSD](http://39.96.43.154/ltp/ud/gsd.tgz))                  | 98.12 | 97.22      | 78.56 | 86.91 | e4fd41c6f2c6d84d6df2657f1e47078cb98364366d91e852f0980102c755592a |
+| GSD + OntoNotes ([GSD+CRF](http://39.96.43.154/ltp/ud/gsd_crf.tgz))          | 97.96 | 96.81      | 79.77 | 86.06 | 0264b4a92e34bb97054ff06f99068b884c54908d1ad265926b0983f2594e1e6a |
+| GSDSimp + OntoNotes ([GSDSimp](http://39.96.43.154/ltp/ud/gsds.tgz))         | 97.49 | 96.24      | 78.06 | 82.48 | d51b8508f290ab82d1c3844541eb774506213c1f6cf7d2b86fe7d69358d0d52a |
+| GSDSimp + OntoNotes ([GSDSimp+CRF](http://39.96.43.154/ltp/ud/gsds_crf.tgz)) | 97.25 | 96.22      | 79.69 | 82.92 | 1e9ea20cfbc1837bf5736f8b8502aaecb9343590e98b04bba18e15724d3092b2 |

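The SHA256 column above can be used to verify a downloaded model archive before unpacking it. A minimal sketch using only the standard library (the `sha256sum` helper name is ours, not part of LTP):

```python
import hashlib

def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Example: compare against the value published in the table, e.g. for small.tgz:
# sha256sum("small.tgz") == "0992d5037cd1c62779a3b5c6d45b883a46e4782c6bcc5850117faf69a9ee6c56"
```

Chunked reading keeps memory flat even for the 583MB Base archives.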
README.md — 95 additions & 93 deletions

@@ -1,93 +1,95 @@

The file was rewritten wholesale; apart from reflowed lines, the only substantive change is the metrics table, marked with -/+ below.

[![LTP](https://img.shields.io/pypi/v/ltp?label=LTP4%20ALPHA)](https://pypi.org/project/ltp/)
![VERSION](https://img.shields.io/pypi/pyversions/ltp)
![CODE SIZE](https://img.shields.io/github/languages/code-size/HIT-SCIR/ltp)
![CONTRIBUTORS](https://img.shields.io/github/contributors/HIT-SCIR/ltp)
![LAST COMMIT](https://img.shields.io/github/last-commit/HIT-SCIR/ltp)
[![Documentation Status](https://readthedocs.org/projects/ltp/badge/?version=latest)](https://ltp.readthedocs.io/zh_CN/latest/?badge=latest)
[![PyPI Downloads](https://img.shields.io/pypi/dm/ltp)](https://pypi.python.org/pypi/ltp)

# LTP 4

LTP (Language Technology Platform) provides a suite of Chinese natural-language-processing tools with which users can segment Chinese text, tag parts of speech, parse syntax, and more.

If you use any source code included in this toolkit in your work, please kindly cite the following paper. The BibTeX entry is listed below:

<pre>
@article{che2020n,
  title={N-LTP: A Open-source Neural Chinese Language Technology Platform with Pretrained Models},
  author={Che, Wanxiang and Feng, Yunlong and Qin, Libo and Liu, Ting},
  journal={arXiv preprint arXiv:2009.11616},
  year={2020}
}
</pre>

## Quick Start

```python
from ltp import LTP

ltp = LTP()  # loads the Small model by default
seg, hidden = ltp.seg(["他叫汤姆去拿外衣。"])
pos = ltp.pos(hidden)
ner = ltp.ner(hidden)
srl = ltp.srl(hidden)
dep = ltp.dep(hidden)
sdp = ltp.sdp(hidden)
```

**[Detailed documentation](docs/quickstart.rst)**

## Language Bindings

+ C++
+ Rust
+ Java
+ Python Rebinding

[libltp](https://github.com/HIT-SCIR/libltp)

## Metrics

-| Model           | Seg   | POS   | NER   | SRL   | DEP   | SDP   | Speed (sents/s) |
-| :-------------: | :---: | :---: | :---: | :---: | :---: | :---: | :-------------: |
-| LTP 4.0 (Base)  | 98.7  | 98.5  | 95.4  | 80.6  | 89.5  | 75.2  |                 |
-| LTP 4.0 (Small) | 98.4  | 98.2  | 94.3  | 78.4  | 88.3  | 74.7  | 12.58           |
-| LTP 4.0 (Tiny)  | 96.8  | 97.1  | 91.6  | 70.9  | 83.8  | 70.1  | 29.53           |
+| Model           | Seg   | POS   | NER   | SRL   | DEP   | SDP   | Speed (sents/s) |
+| :-------------: | :---: | :---: | :---: | :---: | :---: | :---: | :-------------: |
+| LTP 4.0 (Base)  | 98.7  | 98.5  | 95.4  | 80.6  | 89.5  | 75.2  | 39.12           |
+| LTP 4.0 (Base1) | 99.22 | 98.73 | 96.39 | 79.28 | 89.57 | 76.57 | --.--           |
+| LTP 4.0 (Base2) | 99.18 | 98.69 | 95.97 | 79.49 | 90.19 | 76.62 | --.--           |
+| LTP 4.0 (Small) | 98.4  | 98.2  | 94.3  | 78.4  | 88.3  | 74.7  | 43.13           |
+| LTP 4.0 (Tiny)  | 96.8  | 97.1  | 91.6  | 70.9  | 83.8  | 70.1  | 53.22           |

**[Model downloads](MODELS.md)**

## Model Architecture

+ Segmentation: Electra Small<sup>[1](#RELTRANS)</sup> + Linear
+ POS tagging: Electra Small + Linear
+ NER: Electra Small + Relative Transformer<sup>[2](#RELTRANS)</sup> + Linear
+ Dependency parsing: Electra Small + BiAffine + Eisner<sup>[3](#Eisner)</sup>
+ Semantic dependency parsing: Electra Small + BiAffine
+ Semantic role labeling: Electra Small + BiAffine + CRF

## Building the Wheel Package

```shell
python setup.py sdist bdist_wheel
python -m twine upload dist/*
```

## Authors

+ Yunlong Feng <<[ylfeng@ir.hit.edu.cn](mailto:ylfeng@ir.hit.edu.cn)>>

## License

1. The Language Technology Platform's source code is freely available to universities at home and abroad, institutes of the Chinese Academy of Sciences, and individual researchers, but the above institutions and individuals must pay to use the platform for commercial purposes (e.g., corporate cooperation projects).
2. Enterprises and institutions other than the above must pay to use the platform.
3. For any payment matters, please email car@ir.hit.edu.cn to discuss.
4. If you publish papers or obtain research results based on LTP, please state that "the Language Technology Platform (LTP) developed by the Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, was used" when publishing or filing, and email car@ir.hit.edu.cn with the title and venue of the paper or result.

## Footnotes

+ <a name="RELTRANS">1</a>: [Chinese-ELECTRA](https://github.com/ymcui/Chinese-ELECTRA)
+ <a name="RELTRANS">2</a>: [TENER: Adapting Transformer Encoder for Named Entity Recognition](https://arxiv.org/abs/1911.04474)
+ <a name="Eisner">3</a>: [A PyTorch implementation of "Deep Biaffine Attention for Neural Dependency Parsing"](https://github.com/yzhangcs/parser)

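The speed column above is sentences per second with all tasks run sequentially on batches of 8. A rough, hypothetical way to take such a measurement (the `measure_sents_per_second` helper and the `pipeline` callable are ours, not LTP API):

```python
import time
from typing import Callable, List

def measure_sents_per_second(pipeline: Callable[[List[str]], object],
                             sentences: List[str],
                             batch_size: int = 8) -> float:
    """Run the pipeline over all sentences in batches; return sentences/second."""
    start = time.perf_counter()
    for i in range(0, len(sentences), batch_size):
        pipeline(sentences[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(sentences) / elapsed
```

In a real benchmark, `pipeline` would run seg, pos, ner, srl, dep, and sdp back to back, as the MODELS.md note describes.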
ltp/algorithms/eisner.py — 1 addition & 0 deletions

@@ -10,6 +10,7 @@

 def eisner(scores, mask) -> torch.Tensor:
     lens = mask.sum(1)
     batch_size, seq_len, _ = scores.shape
+    # [batch_size, w, n]
     scores = scores.permute(2, 1, 0)
     s_i = torch.full_like(scores, float('-inf'))
     s_c = torch.full_like(scores, float('-inf'))

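The added comment documents the tensor layout around the transpose: `scores` arrives as `[batch_size, seq_len, seq_len]`, and `permute(2, 1, 0)` moves the batch axis to the back so the Eisner recurrences can index spans first. A small sketch of the same axis shuffle, using NumPy's `transpose` in place of `torch.Tensor.permute` (shapes are illustrative):

```python
import numpy as np

batch_size, seq_len = 4, 7
scores = np.zeros((batch_size, seq_len, seq_len))

# Equivalent of torch's scores.permute(2, 1, 0): axis i of the result
# is axis perm[i] of the input, so the batch axis ends up last.
permuted = np.transpose(scores, (2, 1, 0))
print(permuted.shape)  # (7, 7, 4)
```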
ltp/data/dataset/__init__.py — 1 addition & 0 deletions

@@ -8,6 +8,7 @@

 from typing import Optional, Union, Dict, List
 from datasets import DatasetBuilder, Features, Split, DatasetDict, Dataset
+from datasets import Sequence, ClassLabel, Value, Translation, TranslationVariableLanguages


 def load_dataset(

ltp/data/dataset/bio.py — 68 additions & 25 deletions

@@ -4,16 +4,41 @@

 import logging

+import os
+import itertools
+from collections import Counter
+
 import datasets
 from os.path import join
 from dataclasses import dataclass
-from ltp.data.utils import iter_blocks
+from ltp.data.utils import iter_blocks, vocab_builder

 _TRAINING_FILE = "train.bio"
 _DEV_FILE = "dev.bio"
 _TEST_FILE = "test.bio"


+@vocab_builder
+def build_vocabs(data_dir, *files):
+    counter = Counter()
+
+    if os.path.exists(os.path.join(data_dir, 'vocabs', 'bio.txt')):
+        return
+
+    if not os.path.exists(os.path.join(data_dir, 'vocabs')):
+        os.makedirs(os.path.join(data_dir, 'vocabs'))
+
+    for filename in files:
+        for line_num, block in iter_blocks(filename=filename):
+            values = [list(value) for value in zip(*block)]
+            counter.update(values[1])
+
+    with open(os.path.join(data_dir, 'vocabs', 'bio.txt'), mode='w') as f:
+        tags = sorted(counter.keys())
+        tags.remove('O')
+        f.write('\n'.join(['O'] + tags))
+
+
 def create_feature(file=None):
     if file:
         return datasets.ClassLabel(names_file=file)
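The new `build_vocabs` walks the tag column of the BIO files and writes a vocabulary with `'O'` pinned first. The same counting logic can be sketched standalone, without the `vocab_builder`/`iter_blocks` machinery (the `bio_tag_vocab` helper is ours, not LTP API; it also guards the case where `'O'` never occurs, which the committed code assumes):

```python
from collections import Counter
from typing import Iterable, List

def bio_tag_vocab(blocks: Iterable[List[List[str]]]) -> List[str]:
    """Collect the tag column (index 1) of each token block; 'O' first, rest sorted."""
    counter = Counter()
    for block in blocks:
        columns = [list(values) for values in zip(*block)]
        counter.update(columns[1])
    tags = sorted(counter.keys())
    if 'O' in tags:
        tags.remove('O')
    return ['O'] + tags

# One block = one sentence, one [token, tag] pair per line of the .bio file.
blocks = [[['他', 'O'], ['汤', 'B-Nh'], ['姆', 'I-Nh']]]
print(bio_tag_vocab(blocks))  # ['O', 'B-Nh', 'I-Nh']
```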
@@ -31,37 +56,55 @@ class BioConfig(datasets.BuilderConfig):

 class Bio(datasets.GeneratorBasedBuilder):
     BUILDER_CONFIG_CLASS = BioConfig

+    @staticmethod
+    def default_files(data_dir) -> dict:
+        return {
+            datasets.Split.TRAIN: join(data_dir, _TRAINING_FILE),
+            datasets.Split.VALIDATION: join(data_dir, _DEV_FILE),
+            datasets.Split.TEST: join(data_dir, _TEST_FILE),
+        }
+
     def _info(self):
+        build_vocabs(self.config)
+        feats = {'bio': self.config.bio}
+        for key in feats:
+            if feats[key] is None:
+                feats[key] = os.path.join(self.config.data_dir, 'vocabs', f'{key}.txt')
+
         return datasets.DatasetInfo(
             features=datasets.Features(
                 {
-                    "words": datasets.Sequence(datasets.Value("string")),
-                    "bio": datasets.Sequence(create_feature(self.config.bio))
+                    "form": datasets.Sequence(datasets.Value("string")),
+                    "bio": datasets.Sequence(create_feature(feats['bio']))
                 }
             ),
             supervised_keys=None,
         )

     def _split_generators(self, dl_manager):
-        data_files = {
-            "train": join(self.config.data_dir, _TRAINING_FILE),
-            "dev": join(self.config.data_dir, _DEV_FILE),
-            "test": join(self.config.data_dir, _TEST_FILE),
-        }
-        data_files = dl_manager.download_and_extract(data_files)
-
-        return [
-            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": data_files["train"]}),
-            datasets.SplitGenerator(name=datasets.Split.VALIDATION, gen_kwargs={"filepath": data_files["dev"]}),
-            datasets.SplitGenerator(name=datasets.Split.TEST, gen_kwargs={"filepath": data_files["test"]}),
-        ]
-
-    def _generate_examples(self, filepath):
-        logging.info("⏳ Generating examples from = %s", filepath)
-        for line_num, block in iter_blocks(filename=filepath):
-            # last example
-            words, bio = [list(value) for value in zip(*block)]
-
-            yield line_num, {
-                "words": words, "bio": bio
-            }
+        """We handle string, list and dicts in datafiles"""
+        if not self.config.data_files:
+            raise ValueError(f"At least one data file must be specified, but got data_files={self.config.data_files}")
+        data_files = dl_manager.download_and_extract(self.config.data_files)
+        if isinstance(data_files, (str, list, tuple)):
+            files = data_files
+            if isinstance(files, str):
+                files = [files]
+            return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"files": files})]
+        splits = []
+        for split_name, files in data_files.items():
+            if isinstance(files, str):
+                files = [files]
+            splits.append(datasets.SplitGenerator(name=split_name, gen_kwargs={"files": files}))
+        return splits
+
+    def _generate_examples(self, files):
+        for filename in files:
+            logging.info("⏳ Generating examples from = %s", filename)
+            for line_num, block in iter_blocks(filename=filename):
+                # last example
+                words, bio = [list(value) for value in zip(*block)]
+
+                yield line_num, {
+                    "form": words, "bio": bio
+                }