Thanks for your nice work, but the data augmentation step appears to have a leakage problem. More precisely, the pseudo-prior items may be generated with access to the test information before inference.
def data_augment(model, dataset, args, sess, gen_num):
    [train, valid, test, original_train, usernum, itemnum] = copy.deepcopy(dataset)
    all_users = list(train.keys())
    cumulative_preds = defaultdict(list)
    for num_ind in range(gen_num):
        batch_seq = []
        batch_u = []
        batch_item_idx = []
        for u_ind, u in enumerate(all_users):
            u_data = train.get(u, []) + valid.get(u, []) + test.get(u, []) + cumulative_preds.get(u, [])
            if len(u_data) == 0 or len(u_data) >= args.M: continue
            seq = np.zeros([args.maxlen], dtype=np.int32)
            idx = args.maxlen - 1
            for i in reversed(u_data):
                if idx == -1: break
                seq[idx] = i
                idx -= 1
            rated = set(u_data)
            item_idx = list(set([i for i in range(itemnum)]) - rated)
            batch_seq.append(seq)
            batch_item_idx.append(item_idx)
            batch_u.append(u)

The user data (i.e. ‘u_data = train.get(u, []) + valid.get(u, []) + test.get(u, []) + cumulative_preds.get(u, [])’) includes the test data and is used to generate the prior data. The augmented data (i.e. prior data + train data + valid data) is then used to train the left-to-right model in the fine-tuning stage, and that model is used to infer the recommendation results. So both the augmented data and the left-to-right model see the test data before inference, which is a leakage of the test data.
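One possible way to avoid the leak, sketched here under assumptions rather than taken from the repository, is to build the generator input without the test split and to check that no held-out item ends up in it. The helpers `build_history` and `leaked_items` and the toy data are hypothetical; whether the validation items should also be excluded depends on the evaluation protocol.

```python
from collections import defaultdict


def build_history(u, train, valid, cumulative_preds):
    """Hypothetical helper: a user's history without the test split."""
    return train.get(u, []) + valid.get(u, []) + cumulative_preds.get(u, [])


def leaked_items(u, history, test):
    """Return the held-out test items that appear in the given history."""
    return set(history) & set(test.get(u, []))


# Toy data (assumed): user 1 has three train items, one valid item, one test item.
train = {1: [10, 11, 12]}
valid = {1: [13]}
test = {1: [14]}
cumulative_preds = defaultdict(list)

# History that excludes the test item.
safe = build_history(1, train, valid, cumulative_preds)

# Same content as the u_data line quoted above (cumulative_preds is empty here).
leaky = safe + test.get(1, [])

print(leaked_items(1, safe, test))   # no test items in the safe history
print(leaked_items(1, leaky, test))  # the test item is present
```

A check like `leaked_items` only compares item ids, so it would also flag a generated pseudo-prior item that happens to coincide with the test item; it is meant as a quick sanity check, not a full audit.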