However, since the essence of soft labels is **knowledge distillation**, we find that test accuracy is driven not only by the distilled data itself but also by the teacher knowledge carried in the soft labels.
This makes us wonder: **Can the test accuracy of the model trained on distilled data reflect the real informativeness of the distilled data?**
Additionally, we have discovered that using only test accuracy to demonstrate one's performance is unfair in the following three aspects:

1. Results of using hard and soft labels are not directly comparable since soft labels introduce teacher knowledge.
2. Strategies of using soft labels are diverse. For instance, different objective functions are used during evaluation, such as soft Cross-Entropy and Kullback–Leibler divergence. Also, one image may be mapped to one or multiple soft labels.
3. Different data augmentations are used during evaluation.
We summarize the evaluation configurations of existing works in the following table, with different colors highlighting different values for each configuration.
As can be easily seen, the evaluation configurations are highly diverse, which makes comparisons based on test accuracy alone unfair.
Among these inconsistencies, two critical factors significantly undermine the fairness of current evaluation protocols: label representation (including the corresponding loss function) and data augmentation techniques.
Motivated by this, we propose DD-Ranking, a new benchmark for DD evaluation. DD-Ranking provides a fair evaluation scheme for DD methods that can decouple the impacts from knowledge distillation and data augmentation to reflect the real informativeness of the distilled data.
Revisit the original goal of dataset distillation:
> The idea is to synthesize a small number of data points that do not need to come from the correct data distribution, but will, when given to the learning algorithm as training data, approximate the model trained on the original data. (Wang et al., 2020)
The evaluation method for DD-Ranking is grounded in the essence of dataset distillation, aiming to better reflect the informativeness of the synthesized data.

#### Label-Robust Score (LRS)

For the label representation, we introduce the Label-Robust Score (LRS) to evaluate the informativeness of the synthesized data from the following two aspects:

1. The degree to which the real dataset is recovered under hard labels (hard label recovery): $\text{HLR}=\text{Acc.}_{\text{real-hard}}-\text{Acc.}_{\text{syn-hard}}$.

2. The improvement over random selection when using personalized evaluation methods (improvement over random): $\text{IOR}=\text{Acc.}_{\text{syn-any}}-\text{Acc.}_{\text{rdm-any}}$.
$\text{Acc.}$ is the accuracy of models trained on different samples. Samples' markers are as follows:

- $\text{real-hard}$: Real dataset with hard labels;
- $\text{syn-hard}$: Synthetic dataset with hard labels;
- $\text{syn-any}$: Synthetic dataset with personalized evaluation methods (hard or soft labels);
- $\text{rdm-any}$: Randomly selected dataset (under the same compression ratio) with the same personalized evaluation methods.
To rank different methods, LRS combines $\text{IOR}$ and $-\text{HLR}$ through a weighted sum:

$\alpha = w\text{IOR}-(1-w)\text{HLR}, \quad w \in [0, 1]$.

The weighted sum $\alpha$ is then normalized to $[0, 1]$ to obtain the LRS:

$\text{LRS} = (e^{\alpha}-e^{-1}) / (e - e^{-1})$

By default, we set $w = 0.5$ on the leaderboard, meaning that $\text{IOR}$ and $\text{HLR}$ are weighted equally. Users can adjust $w$ to emphasize one aspect over the other.
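
To make the computation concrete, here is a minimal Python sketch of LRS based on the formulas above. The function and argument names are ours for illustration (they are not part of the `ddranking` API), and accuracies are assumed to be fractions in $[0, 1]$.

```python
import math

def label_robust_score(acc_real_hard, acc_syn_hard, acc_syn_any, acc_rdm_any, w=0.5):
    """Compute LRS from the four accuracies above (all given as fractions in [0, 1])."""
    hlr = acc_real_hard - acc_syn_hard   # hard label recovery: smaller is better
    ior = acc_syn_any - acc_rdm_any      # improvement over random: larger is better
    alpha = w * ior - (1 - w) * hlr      # weighted sum, w in [0, 1]
    # Map alpha from [-1, 1] to [0, 1] with the exponential normalization above.
    return (math.exp(alpha) - math.exp(-1)) / (math.e - math.exp(-1))

# Example: HLR = 0.10 and IOR = 0.05 give alpha = -0.025 and LRS ≈ 0.26.
print(round(label_robust_score(0.85, 0.75, 0.55, 0.50), 2))
```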
#### Augmentation-Robust Score (ARS)
To disentangle the impact of data augmentation, we introduce the Augmentation-Robust Score (ARS), which again leverages the relative improvement over randomly selected data. Specifically, we first evaluate the synthetic data and a randomly selected subset under the same setting to obtain $\text{Acc.}_{\text{syn-aug}}$ and $\text{Acc.}_{\text{rdm-aug}}$ (as in IOR). We then evaluate both again without the data augmentation, denoting the results as $\text{Acc.}_{\text{syn-naug}}$ and $\text{Acc.}_{\text{rdm-naug}}$.

Both differences, $\text{Acc.}_{\text{syn-aug}} - \text{Acc.}_{\text{rdm-aug}}$ and $\text{Acc.}_{\text{syn-naug}} - \text{Acc.}_{\text{rdm-naug}}$, are positively correlated with the real informativeness of the distilled dataset.
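
As a rough illustration only, the sketch below combines these two gaps with a weight `lam` and reuses the exponential normalization from LRS; this combination rule and all names are assumptions made here for illustration, not the official ARS definition.

```python
import math

def augmentation_robust_score(acc_syn_aug, acc_rdm_aug, acc_syn_naug, acc_rdm_naug, lam=0.5):
    """Illustrative ARS-style score built from the two synthetic-vs-random gaps.

    The weighting `lam` and the exponential normalization mirror LRS and are
    assumptions here rather than the official DD-Ranking formula.
    """
    gap_aug = acc_syn_aug - acc_rdm_aug      # gap under the method's data augmentation
    gap_naug = acc_syn_naug - acc_rdm_naug   # gap with data augmentation disabled
    beta = lam * gap_aug + (1 - lam) * gap_naug
    return (math.exp(beta) - math.exp(-1)) / (math.e - math.exp(-1))
```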
DD-Ranking currently includes the following datasets and methods (categorized by the type of labels they use):

| Dataset | Hard Label Methods | Soft Label Methods |
| --- | --- | --- |
| CIFAR-10 | DC | DATM |
| CIFAR-100 | DSA | SRe2L |
| TinyImageNet | DM | RDED |
| ImageNet1K | MTT | D4M |
| | DataDAM | EDF |
| | | CDA |
| | | DWA |
| | | EDC |
| | | G-VBSM |
```
python setup.py install
```
### Quickstart
Below is a step-by-step guide on how to use our `ddranking`. This demo is based on LRS with soft labels (source code in `demo_lrs_soft.py`). You can find the LRS demo with hard labels in `demo_lrs_hard.py` and the ARS demo in `demo_aug.py`.
DD-Ranking supports multi-GPU distributed evaluation. You can simply use `torchrun` to launch the evaluation.
**Step 1**: Initialize a soft-label metric evaluator object. Config files are recommended for specifying hyper-parameters. Sample config files are provided [here](https://github.com/NUS-HPC-AI-Lab/DD-Ranking/tree/main/configs).
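
A minimal sketch of what Step 1 might look like is shown below; the module, class, and config-file names are assumptions for illustration, so please consult `demo_lrs_soft.py` and the sample configs for the actual API.

```python
# Hypothetical Step 1 sketch; check demo_lrs_soft.py for the real class and config names.
from ddranking.config import Config                 # assumed config helper
from ddranking.metrics import LabelRobustScoreSoft  # assumed soft-label LRS evaluator

config = Config.from_file("./configs/Demo_LRS_Soft_Label.yaml")  # hypothetical sample config
evaluator = LabelRobustScoreSoft(config)
```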
The following results will be printed and saved to `save_path`:

- `HLR mean`: The mean of hard label recovery over `num_eval` runs.
- `HLR std`: The standard deviation of hard label recovery over `num_eval` runs.
- `IOR mean`: The mean of improvement over random over `num_eval` runs.
- `IOR std`: The standard deviation of improvement over random over `num_eval` runs.
- `LRS mean`: The mean of the Label-Robust Score over `num_eval` runs.
- `LRS std`: The standard deviation of the Label-Robust Score over `num_eval` runs.
Check out our <span style="color: #ff0000;">[documentation](https://nus-hpc-ai-lab.github.io/DD-Ranking/)</span> to learn more.
## Contributing
Feel free to submit scores to update the DD-Ranking list. We welcome and value any contributions and collaborations.
Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.
<!--## Technical Members:
- [Zekai Li*](https://lizekai-richard.github.io/) (National University of Singapore)
- [Xinhao Zhong*](https://ndhg1213.github.io/) (National University of Singapore)
- [Zhiyuan Liang](https://jerryliang24.github.io/) (University of Science and Technology of China)
- [Yang You](https://www.comp.nus.edu.sg/~youy/) (National University of Singapore)
- [Kai Wang](https://kaiwang960112.github.io/) (National University of Singapore)
\* *equal contribution*-->
## License
DD-Ranking is released under the MIT License. See [LICENSE](./LICENSE) for more details.
## Related Works

- [Dataset Distillation](https://arxiv.org/abs/1811.10959), Wang et al., arXiv 2018.
- [Dataset Condensation with Gradient Matching](https://arxiv.org/abs/2006.05929), Zhao et al., ICLR 2021.
- [Dataset Condensation with Differentiable Siamese Augmentation](https://arxiv.org/abs/2102.08259), Zhao & Bilen, ICML 2021.
- [Dataset Distillation by Matching Training Trajectories](https://arxiv.org/abs/2203.11932), Cazenavette et al., CVPR 2022.
- [Dataset Condensation with Distribution Matching](https://arxiv.org/abs/2110.04181), Zhao & Bilen, WACV 2023.
- [Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective](https://arxiv.org/abs/2306.13092), Yin et al., NeurIPS 2023.
- [Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching](https://arxiv.org/abs/2310.05773), Guo et al., ICLR 2024.
- [On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm](https://arxiv.org/abs/2312.03526), Sun et al., CVPR 2024.
- [D4M: Dataset Distillation via Disentangled Diffusion Model](https://arxiv.org/abs/2407.15138), Su et al., CVPR 2024.
## Reference
If you find DD-Ranking useful in your research, please consider citing the following paper:
```bibtex
@misc{li2024ddranking,
  title={DD-Ranking: Rethinking the Evaluation of Dataset Distillation},
  author={Zekai Li and Xinhao Zhong and Samir Khaki and Zhiyuan Liang and Yuhao Zhou and Mingjia Shi and Ziqiao Wang and Xuanlei Zhao and Wangbo Zhao and Ziheng Qin and Mengxuan Wu and Pengfei Zhou and Haonan Wang and David Junhao Zhang and Jia-Wei Liu and Shaobo Wang and Dai Liu and Linfeng Zhang and Guang Li and Kun Wang and Zheng Zhu and Zhiheng Ma and Joey Tianyi Zhou and Jiancheng Lv and Yaochu Jin and Peihao Wang and Kaipeng Zhang and Lingjuan Lyu and Yiran Huang and Zeynep Akata and Zhiwei Deng and Xindi Wu and George Cazenavette and Yuzhang Shang and Justin Cui and Jindong Gu and Qian Zheng and Hao Ye and Shuo Wang and Xiaobo Wang and Yan Yan and Angela Yao and Mike Zheng Shou and Tianlong Chen and Hakan Bilen and Baharan Mirzasoleiman and Manolis Kellis and Konstantinos N. Plataniotis and Zhangyang Wang and Bo Zhao and Yang You and Kai Wang},
}
```