A doubt about data augmentation #3

@nancheng58

Thanks for your nice work. However, the data augmentation step appears to have a leakage problem: the pseudo-prior items are generated from sequences that include the test items, so test information is exposed before inference.

def data_augment(model, dataset, args, sess, gen_num):

    [train, valid, test, original_train, usernum, itemnum] = copy.deepcopy(dataset)
    all_users = list(train.keys())

    cumulative_preds = defaultdict(list)
    for num_ind in range(gen_num):
        batch_seq = []
        batch_u = []
        batch_item_idx = []

        for u_ind, u in enumerate(all_users):
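            # Note: test.get(u, []) puts the held-out test items into the history used to generate pseudo-prior items.
            u_data = train.get(u, []) + valid.get(u, []) + test.get(u, []) + cumulative_preds.get(u, [])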

            if len(u_data) == 0 or len(u_data) >= args.M: continue

            seq = np.zeros([args.maxlen], dtype=np.int32)
            idx = args.maxlen - 1
            for i in reversed(u_data):
                if idx == -1: break
                seq[idx] = i
                idx -= 1
            rated = set(u_data)
            item_idx = list(set(range(itemnum)) - rated)

            batch_seq.append(seq)
            batch_item_idx.append(item_idx)
            batch_u.append(u)

The user data (i.e. ‘u_data = train.get(u, []) + valid.get(u, []) + test.get(u, []) + cumulative_preds.get(u, [])’) includes the test data and is used to generate the prior data. The augmented data (i.e. prior data + train data + valid data) is then used to train the left-to-right model in the fine-tuning stage, and that model is used to infer the recommendation results. So both the augmented data and the left-to-right model see the test data before inference, which is a leakage of the test data.
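To make the concern concrete, here is a minimal sketch of how the history could be built without the test items. It is only an illustration, not the authors' code: the helper name `build_augmentation_history` and the `include_valid` flag are hypothetical, and the toy dictionaries stand in for the `train`, `valid`, `test` and `cumulative_preds` objects in `data_augment`.

```python
from collections import defaultdict


def build_augmentation_history(u, train, valid, cumulative_preds, include_valid=True):
    """Hypothetical helper: history used to generate pseudo-prior items for user u.

    The test split is never passed in, so held-out test items cannot
    reach the generated pseudo-prior items.
    """
    history = list(train.get(u, []))  # copy so the train split is not mutated
    if include_valid:
        history += valid.get(u, [])
    history += cumulative_preds.get(u, [])
    return history


# Toy data (assumed shapes: user id -> list of item ids).
train = {1: [10, 11, 12], 2: [20]}
valid = {1: [13], 2: [21]}
test = {1: [14], 2: [22]}
cumulative_preds = defaultdict(list)
cumulative_preds[1].append(99)  # one pseudo-prior item generated so far

history = build_augmentation_history(1, train, valid, cumulative_preds)

# Sanity check: no test item appears in the history fed to the generator.
assert not set(history) & set(test[1])
```

In `data_augment`, this would correspond to dropping `test.get(u, [])` from the `u_data` expression. Whether `valid` should also be excluded (for example when tuning on validation metrics) is a separate choice, which is why the sketch exposes it as a flag.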
