PaddlePaddle
diff --git a/‎README_CN.md‎
Lines changed: 61 additions & 61 deletions b/‎README_CN.md‎
diff --git a/‎README_EN.md‎
Lines changed: 3 additions & 2 deletions b/‎README_EN.md‎
diff --git a/‎datasets/ali-cpp_aitm/process_public_data.py‎
Lines changed: 187 additions & 0 deletions b/‎datasets/ali-cpp_aitm/process_public_data.py‎
diff --git a/‎datasets/ali-cpp_aitm/run.sh‎
Lines changed: 22 additions & 0 deletions b/‎datasets/ali-cpp_aitm/run.sh‎
diff --git a/‎datasets/readme.md‎
Lines changed: 2 additions & 2 deletions b/‎datasets/readme.md‎
@@ -159,8 +159,9 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml #  Training wit
   |         Rank          |                     [FLEN](models/rank/flen/)                     |  -  |         ✓         |     ✓     |  >=2.1.0 | [2019][FLEN: Leveraging Field for Scalable CTR Prediction]( https://arxiv.org/pdf/1911.04690.pdf)                                                                                                           |
   |   Rank   |                     [DeepRec](models/rank/deeprec/)                     |  -  |       ✓     |     ✓     | >=2.1.0 | [2017][Training Deep AutoEncoders for Collaborative Filtering](https://arxiv.org/pdf/1708.01715v3.pdf)                                                                                                          |
   |   Rank   |                     [AutoFIS](models/rank/autofis/)                     |  -  |       ✓     |     ✓     | >=2.1.0 | [KDD 2020][AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction](https://arxiv.org/pdf/2003.11235v3.pdf)                                                                                                          |
-  |   Rank   |                     [DCN_V2](models/rank/dcn_v2/)                     |  -  |       ✓     |     ✓     | >=2.1.0 | [WWW 2021][DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems](https://arxiv.org/pdf/2008.13535v2.pdf) 
-  |   Rank   |                     [SIGN](models/rank/sign/)([doc](https://paddlerec.readthedocs.io/en/latest/models/rank/sign.html))                     |  [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3869111)  |       ✓     |     ✓     | >=2.1.0 | [AAAI 2021][Detecting Beneficial Feature Interactions for Recommender Systems](https://arxiv.org/pdf/2008.00404v6.pdf)
+  |   Rank   |                     [DCN_V2](models/rank/dcn_v2/)                     |  -  |       ✓     |     ✓     | >=2.1.0 | [WWW 2021][DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems](https://arxiv.org/pdf/2008.13535v2.pdf)|
+  |   Rank   |                     [AITM](models/rank/aitm/)                     |  -  |       ✓     |     ✓     | >=2.1.0 | [KDD 2021][Modeling the Sequential Dependence among Audience Multi-step Conversions with Multi-task Learning in Targeted Display Advertising](https://arxiv.org/pdf/2105.08489v2.pdf)  |
+  |   Rank   |                     [SIGN](models/rank/sign/)([doc](https://paddlerec.readthedocs.io/en/latest/models/rank/sign.html))                     |  [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3869111)  |       ✓     |     ✓     | >=2.1.0 | [AAAI 2021][Detecting Beneficial Feature Interactions for Recommender Systems](https://arxiv.org/pdf/2008.00404v6.pdf) |
   |      Multi-Task       |                  [PLE](models/multitask/ple/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/ple.html))                   |  [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238938)  |     ✓     |     ✓     |  >=2.1.0 | [RecSys 2020][Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations](https://dl.acm.org/doi/abs/10.1145/3383313.3412236)                                                              |
   |      Multi-Task       |                  [ESMM](models/multitask/esmm/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/esmm.html))                   |  [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238583)  |         ✓         |     ✓     |      >=2.1.0     | [SIGIR 2018][Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate](https://arxiv.org/abs/1804.07931)                                                              |
   |      Multi-Task       |                  [MMOE](models/multitask/mmoe/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/mmoe.html))                   |  [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238934)  |         ✓         |     ✓     |      >=2.1.0     | [KDD 2018][Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/abs/10.1145/3219819.3220007)                                                       |
 
@@ -0,0 +1,187 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+'''
+Process the Ali-CCP (Alibaba Click and Conversion Prediction) dataset.
+https://tianchi.aliyun.com/datalab/dataSet.html?dataId=408
+
+@The author:
+Dongbo Xi ([email protected])
+'''
+import numpy as np
+import joblib
+import re
+import random
+random.seed(2020)
+np.random.seed(2020)
+data_path = 'data/sample_skeleton_{}.csv'
+common_feat_path = 'data/common_features_{}.csv'
+enum_path = 'data/ctrcvr_enum.pkl'
+write_path = 'data/ctr_cvr'
+use_columns = [
+    '101', '121', '122', '124', '125', '126', '127', '128', '129', '205',
+    '206', '207', '216', '508', '509', '702', '853', '301'
+]
+
+
+class process(object):
+    def __init__(self):
+        pass
+
+    def process_train(self):
+        c = 0
+        common_feat_dict = {}
+        with open(common_feat_path.format('train'), 'r') as fr:
+            for line in fr:
+                line_list = line.strip().split(',')
+                kv = np.array(re.split('\x01|\x02|\x03', line_list[2]))
+                key = kv[range(0, len(kv), 3)]
+                value = kv[range(1, len(kv), 3)]
+                feat_dict = dict(zip(key, value))
+                common_feat_dict[line_list[0]] = feat_dict
+                c += 1
+                if c % 100000 == 0:
+                    print(c)
+        print('join feats...')
+        c = 0
+        vocabulary = dict(
+            zip(use_columns, [{} for _ in range(len(use_columns))]))
+        with open(data_path.format('train') + '.tmp', 'w') as fw:
+            fw.write('click,purchase,' + ','.join(use_columns) + '\n')
+            with open(data_path.format('train'), 'r') as fr:
+                for line in fr:
+                    line_list = line.strip().split(',')
+                    if line_list[1] == '0' and line_list[2] == '1':
+                        continue  # drop rows with purchase=1 but click=0
+                    kv = np.array(re.split('\x01|\x02|\x03', line_list[5]))
+                    key = kv[range(0, len(kv), 3)]
+                    value = kv[range(1, len(kv), 3)]
+                    feat_dict = dict(zip(key, value))
+                    feat_dict.update(common_feat_dict[line_list[3]])
+                    feats = line_list[1:3]
+                    for k in use_columns:
+                        feats.append(feat_dict.get(k, '0'))
+                    fw.write(','.join(feats) + '\n')
+                    for k, v in feat_dict.items():
+                        if k in use_columns:
+                            if v in vocabulary[k]:
+                                vocabulary[k][v] += 1
+                            else:
+                                vocabulary[k][v] = 1  # first occurrence
+                    c += 1
+                    if c % 100000 == 0:
+                        print(c)
+        print('before filter low freq:')
+        for k, v in vocabulary.items():
+            print(k + ':' + str(len(v)))
+        new_vocabulary = dict(
+            zip(use_columns, [set() for _ in range(len(use_columns))]))
+        for k, v in vocabulary.items():
+            for k1, v1 in v.items():
+                if v1 > 10:
+                    new_vocabulary[k].add(k1)
+        vocabulary = new_vocabulary
+        print('after filter low freq:')
+        for k, v in vocabulary.items():
+            print(k + ':' + str(len(v)))
+        joblib.dump(vocabulary, enum_path, compress=3)
+
+        print('encode feats...')
+        vocabulary = joblib.load(enum_path)
+        feat_map = {}
+        for feat in use_columns:
+            feat_map[feat] = dict(
+                zip(vocabulary[feat], range(1, len(vocabulary[feat]) + 1)))
+        c = 0
+        with open(write_path + '.train', 'w') as fw1:
+            with open(write_path + '.dev', 'w') as fw2:
+                fw1.write('click,purchase,' + ','.join(use_columns) + '\n')
+                fw2.write('click,purchase,' + ','.join(use_columns) + '\n')
+                with open(data_path.format('train') + '.tmp', 'r') as fr:
+                    fr.readline()  # remove header
+                    for line in fr:
+                        line_list = line.strip().split(',')
+                        new_line = line_list[:2]
+                        for value, feat in zip(line_list[2:], use_columns):
+                            new_line.append(
+                                str(feat_map[feat].get(value, '0')))
+                        if random.random() >= 0.9:
+                            fw2.write(','.join(new_line) + '\n')
+                        else:
+                            fw1.write(','.join(new_line) + '\n')
+                        c += 1
+                        if c % 100000 == 0:
+                            print(c)
+
+    def process_test(self):
+        c = 0
+        common_feat_dict = {}
+        with open(common_feat_path.format('test'), 'r') as fr:
+            for line in fr:
+                line_list = line.strip().split(',')
+                kv = np.array(re.split('\x01|\x02|\x03', line_list[2]))
+                key = kv[range(0, len(kv), 3)]
+                value = kv[range(1, len(kv), 3)]
+                feat_dict = dict(zip(key, value))
+                common_feat_dict[line_list[0]] = feat_dict
+                c += 1
+                if c % 100000 == 0:
+                    print(c)
+        print('join feats...')
+        c = 0
+        with open(data_path.format('test') + '.tmp', 'w') as fw:
+            fw.write('click,purchase,' + ','.join(use_columns) + '\n')
+            with open(data_path.format('test'), 'r') as fr:
+                for line in fr:
+                    line_list = line.strip().split(',')
+                    if line_list[1] == '0' and line_list[2] == '1':
+                        continue  # drop rows with purchase=1 but click=0
+                    kv = np.array(re.split('\x01|\x02|\x03', line_list[5]))
+                    key = kv[range(0, len(kv), 3)]
+                    value = kv[range(1, len(kv), 3)]
+                    feat_dict = dict(zip(key, value))
+                    feat_dict.update(common_feat_dict[line_list[3]])
+                    feats = line_list[1:3]
+                    for k in use_columns:
+                        feats.append(str(feat_dict.get(k, '0')))
+                    fw.write(','.join(feats) + '\n')
+                    c += 1
+                    if c % 100000 == 0:
+                        print(c)
+
+        print('encode feats...')
+        vocabulary = joblib.load(enum_path)
+        feat_map = {}
+        for feat in use_columns:
+            feat_map[feat] = dict(
+                zip(vocabulary[feat], range(1, len(vocabulary[feat]) + 1)))
+        c = 0
+        with open(write_path + '.test', 'w') as fw:
+            fw.write('click,purchase,' + ','.join(use_columns) + '\n')
+            with open(data_path.format('test') + '.tmp', 'r') as fr:
+                fr.readline()  # remove header
+                for line in fr:
+                    line_list = line.strip().split(',')
+                    new_line = line_list[:2]
+                    for value, feat in zip(line_list[2:], use_columns):
+                        new_line.append(str(feat_map[feat].get(value, '0')))
+                    fw.write(','.join(new_line) + '\n')
+                    c += 1
+                    if c % 100000 == 0:
+                        print(c)
+
+
+if __name__ == "__main__":
+    pros = process()
+    pros.process_train()
+    pros.process_test()
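The train path in process_public_data.py comes down to three steps: parse each feature string, count values per column and drop the rare ones, then map the surviving values to integer IDs, with 0 reserved for anything unseen. A minimal sketch of that flow follows, under stated assumptions: `USE_COLUMNS` is a shortened stand-in for `use_columns`, the helper names and sample strings are hypothetical, the threshold is lowered so the toy data survives filtering, and values are sorted before numbering (the script iterates over a set instead).

```python
import re
from collections import Counter

USE_COLUMNS = ['101', '121']  # hypothetical, shortened stand-in for use_columns
MIN_COUNT = 1  # toy threshold; the script keeps values whose count exceeds 10


def parse_features(raw):
    """Split on the three control characters; pair tokens 0, 3, 6, ... with 1, 4, 7, ..."""
    tokens = re.split('\x01|\x02|\x03', raw)
    return dict(zip(tokens[0::3], tokens[1::3]))


def build_feat_map(rows, min_count=MIN_COUNT):
    """Count values per column, drop rare ones, and assign IDs starting at 1."""
    counts = {col: Counter() for col in USE_COLUMNS}
    for feats in rows:
        for col in USE_COLUMNS:
            if col in feats:
                counts[col][feats[col]] += 1
    feat_map = {}
    for col, counter in counts.items():
        # Sorting is an assumption made here so that IDs are reproducible.
        kept = sorted(v for v, n in counter.items() if n > min_count)
        feat_map[col] = {v: i for i, v in enumerate(kept, start=1)}
    return feat_map


def encode(feats, feat_map):
    """Map each column's raw value to its ID; missing or rare values become 0."""
    return [feat_map[col].get(feats.get(col, '0'), 0) for col in USE_COLUMNS]


# Made-up strings: key \x02 value \x03 ignored-token, entries joined by \x01.
raw_rows = [
    '101\x02u1\x031.0\x01121\x02a\x031.0',
    '101\x02u1\x031.0\x01121\x02b\x031.0',
    '101\x02u2\x031.0\x01121\x02a\x031.0',
]
rows = [parse_features(r) for r in raw_rows]
feat_map = build_feat_map(rows)
encoded = [encode(r, feat_map) for r in rows]
print(feat_map)
print(encoded)
```

With these toy rows, `u1` and `a` each appear twice and keep an ID, while `u2` and `b` appear once and fall back to 0.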
@@ -0,0 +1,22 @@
+mkdir -p data
+mkdir -p data/whole_data/train data/whole_data/test
+train_source_path="./data/sample_train.tar.gz"
+train_target_path="train_data"
+test_source_path="./data/sample_test.tar.gz"
+test_target_path="test_data"
+cd data
+echo "downloading sample_train.tar.gz......"
+curl -# 'http://jupter-oss.oss-cn-hangzhou.aliyuncs.com/file/opensearch/documents/408/sample_train.tar.gz?Expires=1586435769&OSSAccessKeyId=LTAIGx40tjZWxj6q&Signature=ahUDqhvKT1cGjC4%2FIER2EWtq7o4%3D&response-content-disposition=attachment%3B%20' -H 'Proxy-Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' -H 'Accept-Language: zh-CN,zh;q=0.9' --compressed --insecure -o sample_train.tar.gz
+cd ..
+echo "unzipping sample_train.tar.gz......"
+tar -xzvf  ${train_source_path} -C data && rm -rf ${train_source_path}
+cd data
+echo "downloading sample_test.tar.gz......"
+curl -# 'http://jupter-oss.oss-cn-hangzhou.aliyuncs.com/file/opensearch/documents/408/sample_test.tar.gz?Expires=1586435821&OSSAccessKeyId=LTAIGx40tjZWxj6q&Signature=OwLMPjt1agByQtRVi8pazsAliNk%3D&response-content-disposition=attachment%3B%20' -H 'Proxy-Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' -H 'Accept-Language: zh-CN,zh;q=0.9' --compressed --insecure -o sample_test.tar.gz
+cd ..
+echo "unzipping sample_test.tar.gz......"
+tar -xzvf  ${test_source_path} -C data && rm -rf ${test_source_path}
+echo "preprocessing data......"
+python process_public_data.py
+mv data/ctr_cvr.train data/whole_data/train
+mv data/ctr_cvr.test data/whole_data/test
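After run.sh finishes, the processed files carry a `click,purchase,...` header followed by encoded feature columns. A quick sanity check on the labels might look like the sketch below. The function name `label_rates` and the demo rows are hypothetical; a real run would presumably pass an open handle on `data/whole_data/train/ctr_cvr.train` instead of the in-memory sample.

```python
import csv
import io


def label_rates(lines):
    """Return (n_rows, click_rate, purchase_rate) for a click,purchase,... CSV."""
    reader = csv.DictReader(lines)
    n = clicks = purchases = 0
    for row in reader:
        n += 1
        clicks += int(row['click'])
        purchases += int(row['purchase'])
    if n == 0:
        return 0, 0.0, 0.0
    return n, clicks / n, purchases / n


# Made-up rows in the same header layout the preprocessing script writes.
demo = io.StringIO(
    'click,purchase,101,121\n'
    '1,1,3,2\n'
    '1,0,3,0\n'
    '0,0,0,1\n'
    '0,0,5,2\n'
)
n_rows, click_rate, purchase_rate = label_rates(demo)
print(n_rows, click_rate, purchase_rate)
```

Because the script drops rows with a purchase but no click, the purchase rate should never exceed the click rate in its output.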
@@ -23,7 +23,7 @@ sh data_process.sh
  |[Criteo](https://fleet.bj.bcebos.com/ctr_data.tar.gz)| [wide_deep](../models/rank/wide_deep/criteo_reader.py) |The dataset has two parts: a training set and a test set. The training set contains a portion of Criteo's traffic over a period of time; the test set covers the ad-click traffic for the day after the training data.|[kaggle](https://www.kaggle.com/c/criteo-display-ad-challenge/)|
  |[letor07](https://paddlerec.bj.bcebos.com/match_pyramid/match_pyramid_data.tar.gz)| [match-pyramid](../models/match/match-pyramid/letor_reader.py) |LETOR is a collection of benchmark datasets for learning-to-rank research, including standard features, relevance judgments, data partitions, evaluation tools, and several baselines.|[LETOR: Learning to Rank for Information Retrieval](https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fum%2Fbeijing%2Fprojects%2Fletor%2F)|
  |[senti_clas](https://baidu-nlp.bj.bcebos.com/sentiment_classification-dataset-1.0.0.tar.gz)| [textcnn](../models/contentunderstanding/textcnn/senti_clas_reader.py)|Sentiment Classification (Senta) automatically determines the sentiment polarity of Chinese text containing subjective descriptions and reports a confidence score. Sentiment is classified as positive or negative. It can help companies understand consumer habits, analyze trending topics, and monitor public opinion during crises, supporting business decisions.|--|
- |[one_billion](http://www.statmt.org/lm-benchmark/)| [word2vec](../models/recall/word2vec/word2vec_reader.py) |The One Billion Word benchmark, providing a standard training and test setup for language modeling experiments.|[One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling](https://arxiv.org/abs/1312.3005)|
+ |[one_billion](https://paddlerec.bj.bcebos.com/word2vec/1-billion-word-language-modeling-benchmark-r13output.tar)| [word2vec](../models/recall/word2vec/word2vec_reader.py) |The One Billion Word benchmark, providing a standard training and test setup for language modeling experiments.|[One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling](https://arxiv.org/abs/1312.3005)|
  |[MIND](https://paddlerec.bj.bcebos.com/datasets/MIND/bigdata.zip)| [naml](../models/rank/naml/NAMLDataReader.py) |MIND is short for MIcrosoft News Dataset. Its data comes from the behavior logs of Microsoft News users and covers 1,000,000 users and their interactions with 160,000 articles.|[Microsoft(2020)](https://msnews.github.io)|
  |[movielens_pinterest_NCF](https://paddlerec.bj.bcebos.com/ncf/Data.zip)| [NCF](../models/recall/ncf/movielens_reader.py) |The movielens and pinterest datasets as preprocessed by the paper's original authors, [github](https://github.com/hexiangnan/neural_collaborative_filtering)|[Neural Collaborative Filtering](https://arxiv.org/pdf/1708.05031.pdf)|
  |[Anime](https://paddlerec.bj.bcebos.com/datasets/Anime/archive.zip)| -- |This dataset contains preference data from 73,516 users on 12,294 anime. Each user can add anime to a list and rate it; the dataset is a compilation of those ratings.|[Kaggle](https://www.kaggle.com/CooperUnion/anime-recommendations-database)|
@@ -40,7 +40,7 @@ sh data_process.sh
  |[Ali_Display_Ad_Click](https://paddlerec.bj.bcebos.com/datasets/dmr/dataset_full.zip)| [dmr](../models/rank/dmr/alimama_reader.py) |Preprocessed Alimama dataset |[Deep Match to Rank Model for Personalized Click-Through Rate Prediction](https://github.com/lvze92/DMR)|
  |[omniglot](https://paddlerec.bj.bcebos.com/datasets/omniglot/omniglot.tar)| [maml](../models/multitask/maml/omniglot_reader.py) |Preprocessed omniglot dataset |[Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks](https://arxiv.org/pdf/1703.03400.pdf)|
  |[LastFM](https://paddlerec.bj.bcebos.com/datasets/LastFM/lastfm-2k.zip)| -- |A music recommendation dataset. For each user it contains a list of their most popular artists along with play counts. |[HetRec 2011](https://grouplens.org/datasets/hetrec-2011/)|
- |[Epinions](https://paddlerec.bj.bcebos.com/datasets/Epinions/epinions_data.tar.gz)| -- |The Epinions dataset is built from the who-trusts-whom online social network of Epinions.com, a general consumer review site. |[Epinions](https://snap.stanford.edu/data/soc-Epinions1.html)|
+ |[Epinions](https://paddlerec.bj.bcebos.com/datasets/Epinions/soc-Epinions1.txt.gz)| -- |The Epinions dataset is built from the who-trusts-whom online social network of Epinions.com, a general consumer review site. |[Epinions](https://snap.stanford.edu/data/soc-Epinions1.html)|
  |[Yelp](https://paddlerec.bj.bcebos.com/datasets/Epinions/soc-Epinions1.txt.gz)| -- |The Yelp dataset is a subset of Yelp's business, review, and user data for personal, educational, and academic use. Provided as JSON files, it can be used to teach students about databases, to learn NLP, or as sample production data while learning to build mobile apps. |[Yelp](https://www.yelp.com/dataset)|
  |[book-crossing](https://paddlerec.bj.bcebos.com/datasets/book-crossing/BX-CSV-Dump.zip)| -- |Book-Crossings is a book rating dataset compiled by Cai-Nicolas Ziegler from bookcrossing.com data.|[IIF](http://www2.informatik.uni-freiburg.de/~cziegler/BX/)|
  |[Pinterest](https://paddlerec.bj.bcebos.com/datasets/Pinterest/pinterest-20.train.rating)| -- |The Pinterest dataset contains more than 1 million images associated with Pinterest users.|[Learning Image and User Features for Recommendation in Social Networks](https://openaccess.thecvf.com/content_iccv_2015/papers/Geng_Learning_Image_and_ICCV_2015_paper.pdf)|