
Commit 6a1adf1

Merge branch 'master' into add_dcn_v2
2 parents 9e7fd9a + 08b6767


Showing 48 changed files with 1,821 additions and 151 deletions.

.gitignore

Lines changed: 2 additions & 1 deletion

@@ -4,4 +4,5 @@ output
 paddlerec.egg-info/
 *~
 *.pyc
-*.DS_Store
+*.DS_Store
+kernel_meta/

README_CN.md

Lines changed: 3 additions & 1 deletion

@@ -30,7 +30,7 @@
 
 - A **one-stop, out-of-the-box toolkit** of search and recommendation models from the PaddlePaddle ecosystem
 - A full-pipeline recommendation-system solution for beginners, developers, and researchers
-- A complete recommendation and search algorithm library covering content understanding, matching, recall, ranking, multi-task learning, re-ranking, and more
+- A complete recommendation and search algorithm library covering content understanding, matching, recall, ranking, multi-task learning, re-ranking, and more: [Supported Model List](#支持模型列表)
 
 <h2 align="center">Quick Start</h2>
 

@@ -107,6 +107,8 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml # static-graph training
 ### Acknowledgements
 * [External developer contribution list](contributor.md)
 
+### Supported Model List
+
 <h2 align="center">Supported Model List</h2>
 
 

README_EN.md

Lines changed: 12 additions & 10 deletions

@@ -25,7 +25,7 @@
 
 - A quick start tool of search & recommendation algorithm based on [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html)
 - A complete solution of recommendation system for beginners, developers and researchers.
-- Recommendation algorithm library including content-understanding, match, recall, rank, multi-task, re-rank etc.
+- Recommendation algorithm library including content-understanding, match, recall, rank, multi-task, re-rank etc. [Support model list](#Support_Model_List)
 
 <h2 align="center">Getting Started</h2>
 

@@ -73,31 +73,33 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml # Training wit
 
 ### Background
 * [Recommendation System](doc/rec_background.md)
-* [Distributed deep learning](doc/ps_background.md)
+* [Distributed deep Learning](doc/ps_background.md)
 
-### Introductory tutorial
-* [PaddleRec function introduction](doc/introduction.md)
+### Introductory Tutorial
+* [PaddleRec Function Introduction](doc/introduction.md)
 * [Dygraph Train](doc/dygraph_mode.md)
 * [Static Train](doc/static_mode.md)
 * [Distributed Train](doc/fleet_mode.md)
 
 
-### Advanced tutorial
+### Advanced Tutorial
+* [Submit Specification](doc/contribute.md)
 * [Custom Reader](doc/custom_reader.md)
 * [Custom Model](doc/model_develop.md)
-* [Configuration description of yaml](doc/yaml.md)
-* [Training visualization](doc/visualization.md)
+* [Configuration Description of Yaml](doc/yaml.md)
+* [Training Visualization](doc/visualization.md)
 * [Serving](doc/serving.md)
-* [Python inference](doc/inference.md)
+* [Python Inference](doc/inference.md)
 * [Benchmark](doc/benchmark.md)
 
 ### FAQ
 * [Common Problem FAQ](doc/faq.md)
 
 ### Acknowledgements
-* [Contributions from external developer](contributor.md)
+* [Contributions From External Developer](contributor.md)
 
-<h2 align="center">Support model list</h2>
+#### Support_Model_List
+<h2 align="center">Support Model List</h2>
 
 
 | Type | Algorithm | Online Environment | Parameter-Server | Multi-GPU | version | Paper |

contributor.md

Lines changed: 2 additions & 2 deletions

@@ -12,8 +12,8 @@
 | [BERT4REC](models/rank/bert4rec/) | [jinweiluo](https://github.com/jinweiluo) | https://github.com/PaddlePaddle/PaddleRec/pull/624 | Paper Reproduction Competition, round 4 |
 | [FAT_DeepFFM](models/rank/fat_deepffm/) | [LinJayan](https://github.com/LinJayan) | https://github.com/PaddlePaddle/PaddleRec/pull/651 | Paper Reproduction Competition, round 4 |
 | [DeepRec](models/rank/deeprec/) | [chenjiyan2001](https://github.com/chenjiyan2001) | https://github.com/PaddlePaddle/PaddleRec/pull/647 | Paper Reproduction Competition, round 5 |
-| [ENSFM](models/recal/ensfm/) | [renmada](https://github.com/renmada) | https://github.com/PaddlePaddle/PaddleRec/pull/618 | Paper Reproduction Competition, round 5 |
-| [TiSAS](models/recal/tisas/) | [renmada](https://github.com/renmada) | https://github.com/PaddlePaddle/PaddleRec/pull/625 | Paper Reproduction Competition, round 5 |
+| [ENSFM](models/recall/ensfm/) | [renmada](https://github.com/renmada) | https://github.com/PaddlePaddle/PaddleRec/pull/618 | Paper Reproduction Competition, round 5 |
+| [TiSAS](models/recall/tisas/) | [renmada](https://github.com/renmada) | https://github.com/PaddlePaddle/PaddleRec/pull/625 | Paper Reproduction Competition, round 5 |
 | [AutoFIS](models/rank/autofis/) | [renmada](https://github.com/renmada) | https://github.com/PaddlePaddle/PaddleRec/pull/660 | Paper Reproduction Competition, round 5 |
 | [Dselect_K](models/multitask/dselect_k/) | [Andy1314Chen](https://github.com/Andy1314Chen) | https://github.com/PaddlePaddle/PaddleRec/pull/671 | Paper Reproduction Competition, round 5 |
 | [MIND](models/recall/mind/) | [duyiqi17](https://github.com/duyiqi17) | https://github.com/PaddlePaddle/PaddleRec/pull/398 | Other |

datasets/Avazu_flen/data_config.yaml

Lines changed: 24 additions & 0 deletions

# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

runner:
  raw_file_dir: "path"                  # directory holding the raw Avazu data
  raw_filled_file_dir: "./raw_data"     # output directory for the missing-value-filled data
  train_data_dir: "./train_data_full"   # training-set output directory
  test_data_dir: "./test_data_full"     # test-set output directory
  rebuild_feature_map: True             # if False, reuse feature_map_cache instead of rebuilding
  min_threshold: 4                      # minimum occurrence count for a categorical feature
  feature_map_cache: '.feature_map'     # feature-map cache file
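The preprocess.py that follows reads these settings through dotted keys such as `config.get("runner.raw_file_dir")`, which suggests the YAML is flattened on load. As a rough illustration only (PaddleRec's actual `load_yaml` in `tools/utils/utils_single` may differ), a minimal flattening loader could look like this:

```python
# Illustrative sketch only: a minimal stand-in for PaddleRec's load_yaml,
# assuming it flattens nested YAML into dotted keys. Not the real implementation.
import yaml


def load_flat_yaml(path):
    with open(path) as f:
        nested = yaml.safe_load(f)
    flat = {}

    def _walk(prefix, node):
        for key, value in node.items():
            dotted = "{}.{}".format(prefix, key) if prefix else key
            if isinstance(value, dict):
                _walk(dotted, value)  # recurse into nested sections
            else:
                flat[dotted] = value

    _walk("", nested)
    return flat


config = load_flat_yaml("data_config.yaml")
print(config.get("runner.min_threshold"))  # -> 4
```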

datasets/Avazu_flen/preprocess.py

Lines changed: 260 additions & 0 deletions

# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file implements data preprocessing and dataset splitting.
"""

from __future__ import print_function
from tqdm import tqdm
from pathlib import Path
import shutil
import pickle
import csv
from collections import defaultdict
import logging
import argparse
import os
import sys

sys.path.append("../../")
from tools.utils.utils_single import load_yaml, get_abs_model


def _mkdir_if_not_exist(path):
    if not os.path.exists(path):
        os.makedirs(path)


class PreDataset(object):
    def __init__(self, config):
        super(PreDataset, self).__init__()
        self.config = config

        self.field_names = None
        self.target_name = None
        self.field_info = None
        self.idx_to_field_name = None
        self.feature_map = None
        self.train_cnt = 0
        self.test_cnt = 0
        self.sample_cnt = 0
        self.raw_file_dir = self.config.get("runner.raw_file_dir")
        self.raw_filled_file_dir = self.config.get(
            "runner.raw_filled_file_dir")

        self.rebuild_feature_map = self.config.get(
            "runner.rebuild_feature_map")
        self.min_threshold = self.config.get("runner.min_threshold")
        self.feature_map_cache = self.config.get("runner.feature_map_cache")

        # self.filled_raw()

        self.init()

    def init(self):
        self._get_field_name()
        self._get_feature_map()
        self._build_split()

    def filled_raw(self):
        """Fill missing values in the raw data with '-1' and regroup the
        columns into user, item, and context fields."""
        train_path = self.raw_file_dir
        _mkdir_if_not_exist(self.raw_filled_file_dir)
        self.file_object = self.raw_filled_file_dir + '/PreRaw_data.txt'

        file_object_ = open(self.file_object, 'w')
        with open(train_path, "r") as rf:
            m = -1
            for l in tqdm(rf):
                m += 1
                values = l.rstrip('\n').split(',')

                for i, v in enumerate(values):
                    if v == "":
                        values[i] = "-1"

                # Reorder the Avazu columns so that 'id' comes first and the
                # label ('click') comes last.
                fields_values = []
                fields_values.append(values[0])
                fields_values.append(values[3])
                fields_values.extend(values[16:])
                fields_values.extend(values[11:15])
                fields_values.extend(values[8:11])
                fields_values.extend(values[4:8])
                fields_values.append(values[15])
                fields_values.append(values[2])
                fields_values.append(values[1])

                if m == 0:
                    print(fields_values)
                file_object_.write(','.join(fields_values) + '\n')
        file_object_.close()
        logging.info('All Samples: %s ' % (m))

    def _get_field_name(self):
        self.file_object = self.raw_filled_file_dir + '/PreRaw_data.txt'
        with open(self.file_object) as csv_file:  # open the input file.
            data_file = csv.reader(csv_file)
            header = next(data_file)  # get the header line.
            self.field_info = {k: v for v, k in enumerate(header)}
            self.idx_to_field_name = {
                idx: name
                for idx, name in enumerate(header)
            }
            self.field_names = header[2:]  # list of feature names.
            self.field_names.append(header[0])
            self.target_name = header[1]  # target name.

    def _get_feature_map(self):
        if not self.rebuild_feature_map and Path(
                self.feature_map_cache).exists():
            with open(self.feature_map_cache, 'rb') as f:
                feature_mapper = pickle.load(f)
        else:
            feature_cnts = defaultdict(lambda: defaultdict(int))
            with open(self.file_object) as f:
                f.readline()
                pbar = tqdm(f, mininterval=1, smoothing=0.1)
                pbar.set_description('Create avazu dataset: counting features')
                for line in pbar:
                    values = line.rstrip('\n').split(',')
                    if len(values) != len(self.field_names) + 1:
                        continue
                    for k, v in self.field_info.items():
                        if k not in ['click']:
                            feature_cnts[k][values[v]] += 1
            # Keep only features seen at least min_threshold times ...
            feature_mapper = {
                field_name: {
                    feature_name
                    for feature_name, c in cnt.items()
                    if c >= self.min_threshold
                }
                for field_name, cnt in feature_cnts.items()
            }
            # ... except for 'id', which keeps every value.
            feature_mapper['id'] = {
                feature_name
                for feature_name, c in feature_cnts['id'].items()
            }
            # Re-index the surviving features of each field from 0.
            feature_mapper = {
                field_name:
                {feature_name: idx
                 for idx, feature_name in enumerate(cnt)}
                for field_name, cnt in feature_mapper.items()
            }

            shutil.rmtree(self.feature_map_cache, ignore_errors=True)
            with open(self.feature_map_cache, 'wb') as f:
                pickle.dump(feature_mapper, f)

        self.feature_map = feature_mapper

    def _build_split(self):
        _mkdir_if_not_exist(self.config.get("runner.train_data_dir"))
        _mkdir_if_not_exist(self.config.get("runner.test_data_dir"))

        train_file = open(
            os.path.join(
                self.config.get("runner.train_data_dir"), 'train_data.txt'),
            'w')
        test_file = open(
            os.path.join(
                self.config.get("runner.test_data_dir"), 'test_data.txt'), 'w')

        feature_mapper = self.feature_map
        sample_cnt = 0
        for file in [self.file_object]:
            with open(file, "r") as rf:
                train_cnt = 0
                test_cnt = 0
                rf.readline()
                pbar = tqdm(rf, mininterval=1, smoothing=0.1)
                pbar.set_description(
                    'Split avazu dataset: train_dataset and test_dataset')
                for line in pbar:
                    sample_cnt += 1

                    values = line.rstrip('\n').split(',')

                    if len(values) != len(self.field_names) + 1:
                        continue

                    # Map each known feature value to its integer id; values
                    # pruned from the feature map are silently skipped.
                    features = {
                        self.idx_to_field_name[idx]:
                        feature_mapper[self.idx_to_field_name[idx]][value]
                        for idx, value in enumerate(values)
                        if self.idx_to_field_name[idx] != 'click' and value in
                        feature_mapper[self.idx_to_field_name[idx]]
                    }
                    # 'target' is inserted last, so it is written as the
                    # final column of each output row.
                    features.update({'target': values[-1]})

                    # Rows whose hour field (column 22 after reordering) reads
                    # "14103000", i.e. 2014-10-30 hour 00, go to the test set;
                    # everything else becomes training data.
                    if "14103000" in values[22]:
                        test_cnt += 1
                        value_n = 0
                        for v in features.values():
                            value_n += 1
                            if value_n == len(features):
                                test_file.write(str(v) + '\n')
                            else:
                                test_file.write(str(v) + ',')
                    else:
                        train_cnt += 1
                        value_n = 0
                        for v in features.values():
                            value_n += 1
                            if value_n == len(features):
                                train_file.write(str(v) + '\n')
                            else:
                                train_file.write(str(v) + ',')

        train_file.close()
        test_file.close()

        self.train_cnt = train_cnt
        self.test_cnt = test_cnt
        self.sample_cnt = sample_cnt


def main(args):
    config = load_yaml(args.config_yaml)

    logging.info("Starting preprocess dataset ...")
    data = PreDataset(config)
    logging.info("Finished preprocess dataset!")
    train_cnt = data.train_cnt
    test_cnt = data.test_cnt
    samples = data.sample_cnt
    fields = len(data.field_names)

    logging.info('All Samples: %s ' % (samples))
    logging.info('Train Samples: %s ' % (train_cnt))
    logging.info('Test Samples: %s ' % (test_cnt))
    logging.info('Fields: %s ' % (fields))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)  # make the INFO logs visible
    # Commandline arguments
    parser = argparse.ArgumentParser(
        description="Parameter of preprocess data")
    parser.add_argument("-m", "--config_yaml", type=str)
    args = parser.parse_args()
    args.abs_dir = os.path.dirname(os.path.abspath(args.config_yaml))
    args.config_yaml = get_abs_model(args.config_yaml)

    main(args)
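The heart of the preprocessing is the two-pass feature map built in `_get_feature_map()`: count every categorical value, drop the rare ones, then re-index the survivors per field. A toy illustration of those two passes with made-up counts (hypothetical data, not from the dataset):

```python
# Toy sketch of _get_feature_map()'s pruning and re-indexing passes,
# using made-up counts; min_threshold mirrors the value in data_config.yaml.
min_threshold = 4
feature_cnts = {
    "banner_pos": {"0": 10, "1": 7, "7": 2},  # "7" seen only twice
    "device_type": {"1": 9, "4": 5},
}

# Pass 1: keep features seen at least min_threshold times.
kept = {
    field: {feat for feat, c in cnt.items() if c >= min_threshold}
    for field, cnt in feature_cnts.items()
}
# Pass 2: assign each surviving feature a dense integer id per field.
feature_mapper = {
    field: {feat: idx for idx, feat in enumerate(feats)}
    for field, feats in kept.items()
}
print(feature_mapper)
# e.g. {'banner_pos': {'0': 0, '1': 1}, 'device_type': {'1': 0, '4': 1}}
# (set iteration order, and hence the ids, may vary between runs)
```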

datasets/Avazu_flen/readme.md

Lines changed: 23 additions & 0 deletions

### Avazu dataset for FLEN models:
#### 1. Get raw datasets:
You can get the raw data at [https://www.kaggle.com/c/avazu-ctr-prediction/data](https://www.kaggle.com/c/avazu-ctr-prediction).

Set the directory of the downloaded raw data in data_config.yaml, then run the command below to build the full dataset.

| Name | Description |
| -------- | -------- |
| raw_file_dir | directory of the raw dataset |
| raw_filled_file_dir | output directory for the raw data after missing-value filling |
| train_data_dir | output directory for the training set |
| test_data_dir | output directory for the test set |
| rebuild_feature_map | whether to rebuild the categorical feature map; defaults to True |
| min_threshold | occurrence-count threshold for categorical features; defaults to 4 |
| feature_map_cache | feature-map cache file |


```bash
sh data_process.sh
```
#### 2. Get preprocessed datasets:
You can also get the preprocessed data from the [AiStudio dataset](https://aistudio.baidu.com/aistudio/datasetdetail/125200).
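data_process.sh itself is not included in this commit. Assuming it simply runs the preprocessor against the config in this directory (an unverified assumption), an equivalent direct invocation would be:

```python
# Hypothetical equivalent of `sh data_process.sh`, assuming the script just
# runs preprocess.py with data_config.yaml (the script is not in this diff).
import subprocess

subprocess.run(
    ["python", "-u", "preprocess.py", "-m", "data_config.yaml"],
    check=True,  # surface a non-zero exit code as an exception
)
```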
