
Implementation of the paper "Minimizing Data, Maximizing Performance: Generative Examples for Continual Task Learning".


EpochLoss_CL

This repository provides the implementation for the paper "Minimizing Data, Maximizing Performance: Generative Examples for Continual Task Learning". It includes code for removing training data, experiments on substituting natural training data with synthetic data, and tools for generating several CL datasets based on GenImage.

Method Overview

The accompanying paper has two primary goals: first, to propose a method for identifying and filtering redundant or harmful training data for adversarial training; second, to evaluate the usefulness of replacing natural training data with synthetic images. Together, these objectives aim to reduce the amount of natural training data required.

We run experiments on custom continual learning datasets derived from the GenImage dataset, which provides synthetic copies of ImageNet images from several generative models, including Stable Diffusion and Midjourney. We chose six of these generators and, for each one, derived six CL tasks of synthetic images produced by that generator, paired with an accompanying set of natural ImageNet images from the same classes. Details of this dataset are provided in the accompanying paper.

The underlying idea for EpochLoss is inspired by the Caper method, which aims to remove images near decision boundaries that may be more susceptible to adversarial attacks leading to misclassification. Accordingly, Caper removes the training samples whose activations are perturbed most when attacked. By contrast, EpochLoss takes a more direct approach to identifying samples near decision boundaries.

Data Removal

Overview

EpochLoss begins with a relatively small number of training epochs, T. This warm-up training is done without adversarial attacks, so it adds relatively little runtime overhead. During this period, the loss of each sample is recorded every epoch. The intuition is that samples near decision boundaries, which may be susceptible to adversarial attack, will be frequently misclassified while those boundaries are first being learned. After this training period, we remove a chosen percentage of the samples with the highest average loss. We then reset the model to its state at the beginning of the task and adversarially train on only the remaining data. Notably, we reset the model in this research setting to undo any impact of the removed samples; simply resuming training from epoch T + 1 could further reduce the runtime overhead.
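The selection step above can be sketched in a few lines. This is a minimal illustration, not the repository's actual API: the function name and the loss-history format are assumptions.

```python
def select_keep_indices(loss_history, remove_frac):
    """Keep the samples with the lowest mean warm-up loss.

    loss_history: {sample_index: [loss at epoch 1, ..., loss at epoch T]}
    remove_frac:  fraction of samples to drop (those with the highest mean loss).
    """
    mean_loss = {i: sum(ls) / len(ls) for i, ls in loss_history.items()}
    # Rank samples by ascending mean loss; high-loss samples sit near
    # still-forming decision boundaries and are the ones removed.
    ranked = sorted(mean_loss, key=mean_loss.get)
    n_keep = int(len(ranked) * (1.0 - remove_frac))
    return ranked[:n_keep]
```

Adversarial training then resumes from the task's initial weights using only the kept indices.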

Data Removal on Natural Images

We find that even when removing 40-60% of the training data, EpochLoss matches or outperforms training on all samples. It also performs significantly better than Caper and random removal, indicating that its results are not simply a consequence of an excess of data. Notably, the GenImage dataset from which we built our datasets already sought to minimize the number of training samples per ImageNet class; even after that prior reduction, EpochLoss can still identify and remove much of the data with minimal, or even beneficial, impact on accuracy.

Synthetic Training Data

Synthetic ImageNet Tasks

To determine the impact of replacing training data with synthetic images, we provide a simple CL approach where some variable number of the early tasks are entirely replaced with synthetic images from a given generative model. This allows us to observe how well the knowledge learned from these synthetic images generalizes to natural test data for the same task, or subsequent tasks.
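The substitution scheme can be expressed as a small helper that assigns each task a data source. The function name, tuple format, and default generator label are illustrative assumptions, not the repository's API:

```python
def build_task_sources(num_tasks, num_synthetic, generator="ADM"):
    """Replace the first num_synthetic tasks with synthetic images from one
    generator; the remaining tasks keep natural ImageNet training data."""
    return [
        ("synthetic", generator) if t < num_synthetic else ("natural", "ImageNet")
        for t in range(num_tasks)
    ]
```

Test accuracy is always measured on natural images, so this isolates how well knowledge learned from synthetic data transfers.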

Progressive Task Substitution

We compare the impact of progressively replacing more of the early tasks with synthetic variants. We found a slight improvement in accuracy when the first task was replaced with synthetic training data, while replacing subsequent tasks lowered the six-task average accuracy. To investigate why, we also compared the natural test accuracy when all six training tasks were synthesized from a single generator, measuring how useful each generator is for producing training data.

Generative Model Comparison

We found that training on data generated by ADM resulted in the best overall accuracy, while other generative models often produced lower accuracies than training on natural images. Since ADM supplied the first task in our experiments, this most likely explains why replacing the first task with synthetic training data improved accuracy while replacing subsequent tasks did not, and it highlights the importance of sourcing the best generative model when supplementing natural training images.

Code Structure

The src directory contains slurm scripts for:
1. Training without removal (Baseline.slurm)
2. Training with EpochLoss, Caper, or Random removal (train_steps.slurm)
3. Evaluation of previously trained checkpoints (Evaluate.slurm)

The main Python script is roughly divided into three sequential steps:
1. Train on all data for the current task
2. Select and remove a subset of the training data using the chosen method
3. Reload the model from the start of the task and train on the remaining subset
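The three steps can be sketched as a single per-task loop. Here `train_fn` and `select_fn` are placeholders standing in for the repository's training and removal routines, and the state-dict handling is simplified:

```python
def run_task(init_state, train_fn, select_fn, dataset, remove_frac):
    """Three-step per-task loop: warm-up train, select a subset, reset, retrain.

    train_fn(state, data) -> (trained_state, {sample_index: [per-epoch losses]})
    select_fn(history, f) -> indices of samples kept after removing fraction f
    """
    # Step 1: train on all data for the current task, recording per-sample losses.
    _, loss_history = train_fn(dict(init_state), dataset)
    # Step 2: select and remove a subset of the training data.
    keep = select_fn(loss_history, remove_frac)
    subset = [dataset[i] for i in keep]
    # Step 3: reload the model from the start of the task, train on the subset.
    final_state, _ = train_fn(dict(init_state), subset)
    return final_state, subset
```

Copying `init_state` before each call mirrors the reset described above: the second training run starts from the same weights the task began with.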

Dataset Generation and Availability

The experiments largely use CL variants of the GenImage dataset found at https://github.com/GenImage-Dataset/GenImage. Tools for processing the original GenImage dataset files into the CL variants outlined in our paper are provided in the "CL GenImage Generation" directory, with an accompanying README. To skip this process, the resulting CL datasets (~60 GB total) can be downloaded from the following Google Drive: https://drive.google.com/drive/folders/1b-81lrRKbQQ091JaZ_f4U7bZcR1LGZ0h?usp=drive_link. Each included generator is split into 6 tasks, while GenImage-Disjoint uses one task from each generator.
