Commit 2bfa462

Merge branch 'master' into master
2 parents: deb794b + 73c2277

File tree

38 files changed: +1331 -128 lines

README_CN.md

Lines changed: 61 additions & 61 deletions
Large diffs are not rendered by default.

README_EN.md

Lines changed: 3 additions & 2 deletions
@@ -159,8 +159,9 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml # Training wit
 | Rank | [FLEN](models/rank/flen/) | - | ✓ | ✓ | >=2.1.0 | [2019][FLEN: Leveraging Field for Scalable CTR Prediction](https://arxiv.org/pdf/1911.04690.pdf) |
 | Rank | [DeepRec](models/rank/deeprec/) | - | ✓ | ✓ | >=2.1.0 | [2017][Training Deep AutoEncoders for Collaborative Filtering](https://arxiv.org/pdf/1708.01715v3.pdf) |
 | Rank | [AutoFIS](models/rank/autofis/) | - | ✓ | ✓ | >=2.1.0 | [KDD 2020][AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction](https://arxiv.org/pdf/2003.11235v3.pdf) |
-| Rank | [DCN_V2](models/rank/dcn_v2/) | - | ✓ | ✓ | >=2.1.0 | [WWW 2021][DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems](https://arxiv.org/pdf/2008.13535v2.pdf)
-| Rank | [SIGN](models/rank/sign/)([doc](https://paddlerec.readthedocs.io/en/latest/models/rank/sign.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3869111) | ✓ | ✓ | >=2.1.0 | [AAAI 2021][Detecting Beneficial Feature Interactions for Recommender Systems](https://arxiv.org/pdf/2008.00404v6.pdf)
+| Rank | [DCN_V2](models/rank/dcn_v2/) | - | ✓ | ✓ | >=2.1.0 | [WWW 2021][DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems](https://arxiv.org/pdf/2008.13535v2.pdf)|
+| Rank | [AITM](models/rank/aitm/) | - | ✓ | ✓ | >=2.1.0 | [KDD 2021][Modeling the Sequential Dependence among Audience Multi-step Conversions with Multi-task Learning in Targeted Display Advertising](https://arxiv.org/pdf/2105.08489v2.pdf) |
+| Rank | [SIGN](models/rank/sign/)([doc](https://paddlerec.readthedocs.io/en/latest/models/rank/sign.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3869111) | ✓ | ✓ | >=2.1.0 | [AAAI 2021][Detecting Beneficial Feature Interactions for Recommender Systems](https://arxiv.org/pdf/2008.00404v6.pdf) |
 | Multi-Task | [PLE](models/multitask/ple/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/ple.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238938) | ✓ | ✓ | >=2.1.0 | [RecSys 2020][Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations](https://dl.acm.org/doi/abs/10.1145/3383313.3412236) |
 | Multi-Task | [ESMM](models/multitask/esmm/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/esmm.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238583) | ✓ | ✓ | >=2.1.0 | [SIGIR 2018][Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate](https://arxiv.org/abs/1804.07931) |
 | Multi-Task | [MMOE](models/multitask/mmoe/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/mmoe.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238934) | ✓ | ✓ | >=2.1.0 | [KDD 2018][Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/abs/10.1145/3219819.3220007) |
datasets/ali-cpp_aitm/process_public_data.py

Lines changed: 187 additions & 0 deletions
@@ -0,0 +1,187 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''
Process the Ali-CCP (Alibaba Click and Conversion Prediction) dataset.
https://tianchi.aliyun.com/datalab/dataSet.html?dataId=408

@The author:
Dongbo Xi ([email protected])
'''
import numpy as np
import joblib
import re
import random
random.seed(2020)
np.random.seed(2020)
data_path = 'data/sample_skeleton_{}.csv'
common_feat_path = 'data/common_features_{}.csv'
enum_path = 'data/ctrcvr_enum.pkl'
write_path = 'data/ctr_cvr'
# Feature fields kept from the raw logs.
use_columns = [
    '101', '121', '122', '124', '125', '126', '127', '128', '129', '205',
    '206', '207', '216', '508', '509', '702', '853', '301'
]


class process(object):
    def __init__(self):
        pass

    def process_train(self):
        c = 0
        common_feat_dict = {}
        # Build an index -> {field: value} map from the shared-feature file.
        with open(common_feat_path.format('train'), 'r') as fr:
            for line in fr:
                line_list = line.strip().split(',')
                # Features are packed into a single column, delimited by the
                # control characters \x01, \x02 and \x03; after splitting,
                # every third token (offset 0) is a field id and the token
                # following it (offset 1) is its value.
                kv = np.array(re.split('\x01|\x02|\x03', line_list[2]))
                key = kv[range(0, len(kv), 3)]
                value = kv[range(1, len(kv), 3)]
                feat_dict = dict(zip(key, value))
                common_feat_dict[line_list[0]] = feat_dict
                c += 1
                if c % 100000 == 0:
                    print(c)
        print('join feats...')
        c = 0
        vocabulary = dict(
            zip(use_columns, [{} for _ in range(len(use_columns))]))
        with open(data_path.format('train') + '.tmp', 'w') as fw:
            fw.write('click,purchase,' + ','.join(use_columns) + '\n')
            with open(data_path.format('train'), 'r') as fr:
                for line in fr:
                    line_list = line.strip().split(',')
                    # Drop inconsistent samples (purchase without click).
                    if line_list[1] == '0' and line_list[2] == '1':
                        continue
                    kv = np.array(re.split('\x01|\x02|\x03', line_list[5]))
                    key = kv[range(0, len(kv), 3)]
                    value = kv[range(1, len(kv), 3)]
                    feat_dict = dict(zip(key, value))
                    # Join the sample with its shared features.
                    feat_dict.update(common_feat_dict[line_list[3]])
                    feats = line_list[1:3]
                    for k in use_columns:
                        feats.append(feat_dict.get(k, '0'))
                    fw.write(','.join(feats) + '\n')
                    # Count feature-value occurrences (the first occurrence
                    # initializes the count to 0, so a value must appear at
                    # least 12 times to pass the > 10 filter below).
                    for k, v in feat_dict.items():
                        if k in use_columns:
                            if v in vocabulary[k]:
                                vocabulary[k][v] += 1
                            else:
                                vocabulary[k][v] = 0
                    c += 1
                    if c % 100000 == 0:
                        print(c)
        print('before filter low freq:')
        for k, v in vocabulary.items():
            print(k + ':' + str(len(v)))
        # Keep only feature values above the frequency threshold.
        new_vocabulary = dict(
            zip(use_columns, [set() for _ in range(len(use_columns))]))
        for k, v in vocabulary.items():
            for k1, v1 in v.items():
                if v1 > 10:
                    new_vocabulary[k].add(k1)
        vocabulary = new_vocabulary
        print('after filter low freq:')
        for k, v in vocabulary.items():
            print(k + ':' + str(len(v)))
        joblib.dump(vocabulary, enum_path, compress=3)

        print('encode feats...')
        vocabulary = joblib.load(enum_path)
        # Map each surviving feature value to a positive integer id;
        # 0 is reserved for out-of-vocabulary values.
        feat_map = {}
        for feat in use_columns:
            feat_map[feat] = dict(
                zip(vocabulary[feat], range(1, len(vocabulary[feat]) + 1)))
        c = 0
        with open(write_path + '.train', 'w') as fw1:
            with open(write_path + '.dev', 'w') as fw2:
                fw1.write('click,purchase,' + ','.join(use_columns) + '\n')
                fw2.write('click,purchase,' + ','.join(use_columns) + '\n')
                with open(data_path.format('train') + '.tmp', 'r') as fr:
                    fr.readline()  # skip header
                    for line in fr:
                        line_list = line.strip().split(',')
                        new_line = line_list[:2]
                        for value, feat in zip(line_list[2:], use_columns):
                            new_line.append(
                                str(feat_map[feat].get(value, '0')))
                        # Roughly 90/10 random train/dev split.
                        if random.random() >= 0.9:
                            fw2.write(','.join(new_line) + '\n')
                        else:
                            fw1.write(','.join(new_line) + '\n')
                        c += 1
                        if c % 100000 == 0:
                            print(c)

    def process_test(self):
        c = 0
        common_feat_dict = {}
        with open(common_feat_path.format('test'), 'r') as fr:
            for line in fr:
                line_list = line.strip().split(',')
                kv = np.array(re.split('\x01|\x02|\x03', line_list[2]))
                key = kv[range(0, len(kv), 3)]
                value = kv[range(1, len(kv), 3)]
                feat_dict = dict(zip(key, value))
                common_feat_dict[line_list[0]] = feat_dict
                c += 1
                if c % 100000 == 0:
                    print(c)
        print('join feats...')
        c = 0
        with open(data_path.format('test') + '.tmp', 'w') as fw:
            fw.write('click,purchase,' + ','.join(use_columns) + '\n')
            with open(data_path.format('test'), 'r') as fr:
                for line in fr:
                    line_list = line.strip().split(',')
                    if line_list[1] == '0' and line_list[2] == '1':
                        continue
                    kv = np.array(re.split('\x01|\x02|\x03', line_list[5]))
                    key = kv[range(0, len(kv), 3)]
                    value = kv[range(1, len(kv), 3)]
                    feat_dict = dict(zip(key, value))
                    feat_dict.update(common_feat_dict[line_list[3]])
                    feats = line_list[1:3]
                    for k in use_columns:
                        feats.append(str(feat_dict.get(k, '0')))
                    fw.write(','.join(feats) + '\n')
                    c += 1
                    if c % 100000 == 0:
                        print(c)

        print('encode feats...')
        # Reuse the vocabulary built from the training split.
        vocabulary = joblib.load(enum_path)
        feat_map = {}
        for feat in use_columns:
            feat_map[feat] = dict(
                zip(vocabulary[feat], range(1, len(vocabulary[feat]) + 1)))
        c = 0
        with open(write_path + '.test', 'w') as fw:
            fw.write('click,purchase,' + ','.join(use_columns) + '\n')
            with open(data_path.format('test') + '.tmp', 'r') as fr:
                fr.readline()  # skip header
                for line in fr:
                    line_list = line.strip().split(',')
                    new_line = line_list[:2]
                    for value, feat in zip(line_list[2:], use_columns):
                        new_line.append(str(feat_map[feat].get(value, '0')))
                    fw.write(','.join(new_line) + '\n')
                    c += 1
                    if c % 100000 == 0:
                        print(c)


if __name__ == "__main__":
    pros = process()
    pros.process_train()
    pros.process_test()
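
For readers unfamiliar with the Ali-CCP encoding, here is a minimal sketch of what the re.split call above produces. The packed string below is synthetic, not taken from the dataset; it only illustrates the delimiter layout the script assumes:

import re
import numpy as np

# Synthetic packed feature column: triples joined by '\x01', with the
# field id and its value separated by '\x02'/'\x03' inside each triple.
packed = '101\x02205\x031.0\x01121\x029\x031.0'

kv = np.array(re.split('\x01|\x02|\x03', packed))
keys = kv[range(0, len(kv), 3)].tolist()    # tokens 0, 3, ... -> field ids
values = kv[range(1, len(kv), 3)].tolist()  # tokens 1, 4, ... -> field values
print(dict(zip(keys, values)))              # -> {'101': '205', '121': '9'}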

datasets/ali-cpp_aitm/run.sh

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
# Download and unpack the Ali-CCP sample data, then run the preprocessing script.
mkdir data
mkdir data/whole_data && mkdir data/whole_data/train && mkdir data/whole_data/test
train_source_path="./data/sample_train.tar.gz"
train_target_path="train_data"
test_source_path="./data/sample_test.tar.gz"
test_target_path="test_data"
cd data
echo "downloading sample_train.tar.gz......"
curl -# 'http://jupter-oss.oss-cn-hangzhou.aliyuncs.com/file/opensearch/documents/408/sample_train.tar.gz?Expires=1586435769&OSSAccessKeyId=LTAIGx40tjZWxj6q&Signature=ahUDqhvKT1cGjC4%2FIER2EWtq7o4%3D&response-content-disposition=attachment%3B%20' -H 'Proxy-Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' -H 'Accept-Language: zh-CN,zh;q=0.9' --compressed --insecure -o sample_train.tar.gz
cd ..
echo "unzipping sample_train.tar.gz......"
tar -xzvf ${train_source_path} -C data && rm -rf ${train_source_path}
cd data
echo "downloading sample_test.tar.gz......"
curl -# 'http://jupter-oss.oss-cn-hangzhou.aliyuncs.com/file/opensearch/documents/408/sample_test.tar.gz?Expires=1586435821&OSSAccessKeyId=LTAIGx40tjZWxj6q&Signature=OwLMPjt1agByQtRVi8pazsAliNk%3D&response-content-disposition=attachment%3B%20' -H 'Proxy-Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' -H 'Accept-Language: zh-CN,zh;q=0.9' --compressed --insecure -o sample_test.tar.gz
cd ..
echo "unzipping sample_test.tar.gz......"
tar -xzvf ${test_source_path} -C data && rm -rf ${test_source_path}
echo "preprocessing data......"
python process_public_data.py
mv data/ctr_cvr.train data/whole_data/train
mv data/ctr_cvr.test data/whole_data/test
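
Once run.sh finishes, the encoded splits live under data/whole_data/. A minimal sanity check in Python (a sketch, assuming the script completed; the expected column count follows from the 18 fields in use_columns plus the two labels):

# Peek at the generated training file: header plus one encoded sample.
path = 'data/whole_data/train/ctr_cvr.train'
with open(path, 'r') as f:
    header = f.readline().strip().split(',')
    sample = f.readline().strip().split(',')

print(len(header))                 # expect 20: click, purchase, and 18 feature columns
print(header[:4])                  # ['click', 'purchase', '101', '121']
print(len(sample) == len(header))  # every row should match the header width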

datasets/readme.md

Lines changed: 2 additions & 2 deletions
@@ -23,7 +23,7 @@ sh data_process.sh
 |[Criteo](https://fleet.bj.bcebos.com/ctr_data.tar.gz)| [wide_deep](../models/rank/wide_deep/criteo_reader.py) |This dataset has two parts: a training set and a test set. The training set contains a sample of Criteo traffic over a period of time; the test set covers the ad click traffic on the day following the training period.|[kaggle](https://www.kaggle.com/c/criteo-display-ad-challenge/)|
 |[letor07](https://paddlerec.bj.bcebos.com/match_pyramid/match_pyramid_data.tar.gz)| [match-pyramid](../models/match/match-pyramid/letor_reader.py) |LETOR is a benchmark collection for learning-to-rank research, containing standard features, relevance judgments, data partitions, evaluation tools, and several baselines|[LETOR: Learning to Rank for Information Retrieval](https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fum%2Fbeijing%2Fprojects%2Fletor%2F)|
 |[senti_clas](https://baidu-nlp.bj.bcebos.com/sentiment_classification-dataset-1.0.0.tar.gz)| [textcnn](../models/contentunderstanding/textcnn/senti_clas_reader.py)|Sentiment classification (Senta) automatically determines the sentiment polarity (positive or negative) of subjective Chinese text and returns a confidence score. It helps businesses understand consumer habits, analyze trending topics, and monitor public opinion during crises, supporting decision making|--|
-|[one_billion](http://www.statmt.org/lm-benchmark/)| [word2vec](../models/recall/word2vec/word2vec_reader.py) |A one-billion-word benchmark providing standard training and test data for language-modeling experiments|[One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling](https://arxiv.org/abs/1312.3005)|
+|[one_billion](https://paddlerec.bj.bcebos.com/word2vec/1-billion-word-language-modeling-benchmark-r13output.tar)| [word2vec](../models/recall/word2vec/word2vec_reader.py) |A one-billion-word benchmark providing standard training and test data for language-modeling experiments|[One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling](https://arxiv.org/abs/1312.3005)|
 |[MIND](https://paddlerec.bj.bcebos.com/datasets/MIND/bigdata.zip)| [naml](../models/rank/naml/NAMLDataReader.py) |MIND (MIcrosoft News Dataset) is built from the behavior logs of Microsoft News users; it contains 1,000,000 users and their interactions with 160,000 articles.|[Microsoft(2020)](https://msnews.github.io)|
 |[movielens_pinterest_NCF](https://paddlerec.bj.bcebos.com/ncf/Data.zip)| [NCF](../models/recall/ncf/movielens_reader.py) |The MovieLens and Pinterest datasets as preprocessed by the paper's original authors, [github](https://github.com/hexiangnan/neural_collaborative_filtering)|[Neural Collaborative Filtering](https://arxiv.org/pdf/1708.05031.pdf)|
 |[Anime](https://paddlerec.bj.bcebos.com/datasets/Anime/archive.zip)| -- |Preference data from 73,516 users on 12,294 anime titles. Each user can add anime to their list and rate them; this dataset aggregates those ratings.|[Kaggle](https://www.kaggle.com/CooperUnion/anime-recommendations-database)|
@@ -40,7 +40,7 @@ sh data_process.sh
 |[Ali_Display_Ad_Click](https://paddlerec.bj.bcebos.com/datasets/dmr/dataset_full.zip)| [dmr](../models/rank/dmr/alimama_reader.py) |Preprocessed Alimama dataset |[Deep Match to Rank Model for Personalized Click-Through Rate Prediction](https://github.com/lvze92/DMR)|
 |[omniglot](https://paddlerec.bj.bcebos.com/datasets/omniglot/omniglot.tar)| [maml](../models/multitask/maml/omniglot_reader.py) |Preprocessed Omniglot dataset |[Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks](https://arxiv.org/pdf/1703.03400.pdf)|
 |[LastFM](https://paddlerec.bj.bcebos.com/datasets/LastFM/lastfm-2k.zip)| -- |A music recommendation dataset; for each user it contains their most-listened artists and the corresponding play counts |[HetRec 2011](https://grouplens.org/datasets/hetrec-2011/)|
-|[Epinions](https://paddlerec.bj.bcebos.com/datasets/Epinions/epinions_data.tar.gz)| -- |The Epinions dataset is built from the who-trusts-whom online social network of the consumer review site Epinions.com |[Epinions](https://snap.stanford.edu/data/soc-Epinions1.html)|
+|[Epinions](https://paddlerec.bj.bcebos.com/datasets/Epinions/soc-Epinions1.txt.gz)| -- |The Epinions dataset is built from the who-trusts-whom online social network of the consumer review site Epinions.com |[Epinions](https://snap.stanford.edu/data/soc-Epinions1.html)|
 |[Yelp](https://paddlerec.bj.bcebos.com/datasets/Epinions/soc-Epinions1.txt.gz)| -- |The Yelp dataset is a subset of Yelp's businesses, reviews, and user data for personal, educational, and academic use. Provided as JSON files, it can be used to teach students about databases, to learn NLP, or as sample production data while learning to build mobile applications. |[Yelp](https://www.yelp.com/dataset)|
 |[book-crossing](https://paddlerec.bj.bcebos.com/datasets/book-crossing/BX-CSV-Dump.zip)| -- |Book-Crossing is a book-rating dataset compiled by Cai-Nicolas Ziegler from bookcrossing.com data.|[IIF](http://www2.informatik.uni-freiburg.de/~cziegler/BX/)|
 |[Pinterest](https://paddlerec.bj.bcebos.com/datasets/Pinterest/pinterest-20.train.rating)| -- |The Pinterest dataset contains over one million images associated with Pinterest users.|[Learning Image and User Features for Recommendation in Social Networks](https://openaccess.thecvf.com/content_iccv_2015/papers/Geng_Learning_Image_and_ICCV_2015_paper.pdf)|
