
Commit d3825e8: Update metrics related content
Parent: 97e4e41

7 files changed, +26 -40 lines


README.md (12 additions, 10 deletions)

```diff
@@ -65,13 +65,13 @@ DD-Ranking (DD, *i.e.*, Dataset Distillation) is an integrated and easy-to-use b
 
 <!-- Hard label is tested -->
 <!-- Keep the same compression ratio, comparing with random selection -->
-**Performance benchmark**
+### Benchmark
 
 Revisit the original goal of dataset distillation:
-> The idea is to synthesize a small number of data points that do not need to come from the correct data distribution, but will, when given to the learning algorithm as training data, approximate the model trained on the original data.
+> The idea is to synthesize a small number of data points that do not need to come from the correct data distribution, but will, when given to the learning algorithm as training data, approximate the model trained on the original data. (Wang et al., 2020)
 >
 
-The evaluation method for DD-Ranking is grounded in the essence of dataset distillation, aiming to better reflect the information content of the synthesized data by assessing the following two aspects:
+The evaluation method for DD-Ranking is grounded in the essence of dataset distillation, aiming to better reflect the informativeness of the synthesized data by assessing the following two aspects:
 1. The degree to which the original dataset is recovered under hard labels (hard label recovery): $\text{HLR}=\text{Acc.}_{\text{full-hard}}-\text{Acc.}_{\text{syn-hard}}$.
 
 2. The improvement over random selection when using personalized evaluation methods (improvement over random): $\text{IOR}=\text{Acc.}_{\text{syn-any}}-\text{Acc.}_{\text{rdm-any}}$.
@@ -81,9 +81,13 @@ $\text{Acc.}$ is the accuracy of models trained on different samples. Samples' m
 - $\text{syn-any}$: Synthetic dataset with personalized evaluation methods (hard or soft labels);
 - $\text{rdm-any}$: Randomly selected dataset (under the same compression ratio) with the same personalized evaluation methods.
 
-To rank different methods, we combine the above two metrics as follows:
+<!-- To rank different methods, we combine the above two metrics as follows:
 
-$$\text{IOR}/\text{HLR} = \frac{(\text{Acc.}_{\text{syn-any}}-\text{Acc.}_{\text{rdm-any}})}{(\text{Acc.}_{\text{full-hard}}-\text{Acc.}_{\text{syn-hard}})}$$
+$$\text{IOR}/\text{HLR} = \frac{(\text{Acc.}_{\text{syn-any}}-\text{Acc.}_{\text{rdm-any}})}{(\text{Acc.}_{\text{full-hard}}-\text{Acc.}_{\text{syn-hard}})}$$ -->
+
+</details>
+
+## Overview
 
 DD-Ranking is integrated with:
 <!-- Uniform Fair Labels: loss on soft label -->
@@ -98,18 +102,15 @@ DD-Ranking has the following features:
 - **Extensible**: DD-Ranking supports various datasets and models.
 - **Customizable**: DD-Ranking supports various data augmentations and soft label strategies.
 
-</details>
-
-## Overview
-Included datasets and methods (categorized by hard/soft label).
+DD-Ranking currently includes the following datasets and methods (categorized by hard/soft label). Evaluation results can be found in the [leaderboard](https://huggingface.co/spaces/Soptq/DD-Ranking).
 |Supported Dataset|Evaluated Hard Label Methods|Evaluated Soft Label Methods|
 |:-|:-|:-|
 |CIFAR-10|DC|DATM|
 |CIFAR-100|DSA|SRe2L|
 |TinyImageNet|DM|RDED|
 ||MTT|D4M|
 
-Evaluation results can be found in the [leaderboard](https://huggingface.co/spaces/Soptq/DD-Ranking).
+
 
 ## Tutorial
 
@@ -221,6 +222,7 @@ The following results will be returned to you:
 ## Coming Soon
 - [ ] DD-Ranking scores that decouple the impacts from data augmentation.
 - [ ] Evaluation results on ImageNet subsets.
+- [ ] More baseline methods.
 
 ## Contributing
```
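For context on the two metrics this commit keeps: both are plain accuracy gaps. A minimal sketch in Python, with hypothetical accuracy values rather than measured results:

```python
# Minimal sketch of the two retained metrics; the accuracy values below are
# hypothetical placeholders, not results from the leaderboard.

def hard_label_recovery(acc_full_hard: float, acc_syn_hard: float) -> float:
    """HLR = Acc.(full-hard) - Acc.(syn-hard); a smaller gap means the
    synthetic data recovers more of the original dataset."""
    return acc_full_hard - acc_syn_hard

def improvement_over_random(acc_syn_any: float, acc_rdm_any: float) -> float:
    """IOR = Acc.(syn-any) - Acc.(rdm-any); larger is better."""
    return acc_syn_any - acc_rdm_any

hlr = hard_label_recovery(acc_full_hard=84.0, acc_syn_hard=60.0)   # 24.0
ior = improvement_over_random(acc_syn_any=55.0, acc_rdm_any=40.0)  # 15.0
print(f"HLR = {hlr:.2f}%, IOR = {ior:.2f}%")
```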

dd_ranking/metrics/hard_label.py (2 additions, 10 deletions)

```diff
@@ -186,7 +186,6 @@ def compute_metrics(self, image_tensor: Tensor=None, image_path: str=None, hard_
         if not hard_labels:
             hard_labels = torch.tensor(np.array([np.ones(self.ipc) * i for i in range(self.num_classes)]), dtype=torch.long, requires_grad=False).view(-1)
 
-        dd_ranking_scores = []
         hard_label_recovery = []
         improvement_over_random = []
         for i in range(self.num_eval):
@@ -256,30 +255,23 @@ def compute_metrics(self, image_tensor: Tensor=None, image_path: str=None, hard_
 
             hard_label_recovery.append(hlr)
             improvement_over_random.append(ior)
-            dd_ranking_scores.append(ior / hlr)
 
         results_to_save = {
             "hard_label_recovery": hard_label_recovery,
-            "improvement_over_random": improvement_over_random,
-            "dd_ranking_score": dd_ranking_scores
+            "improvement_over_random": improvement_over_random
         }
         save_results(results_to_save, self.save_path)
 
         hard_label_recovery_mean = np.mean(hard_label_recovery)
         hard_label_recovery_std = np.std(hard_label_recovery)
         improvement_over_random_mean = np.mean(improvement_over_random)
         improvement_over_random_std = np.std(improvement_over_random)
-        dd_ranking_score_mean = np.mean(dd_ranking_scores)
-        dd_ranking_score_std = np.std(dd_ranking_scores)
 
         print(f"Hard Label Recovery Mean: {hard_label_recovery_mean:.2f}% Std: {hard_label_recovery_std:.2f}")
         print(f"Improvement Over Random Mean: {improvement_over_random_mean:.2f}% Std: {improvement_over_random_std:.2f}")
-        print(f"DD-Ranking Score Mean: {dd_ranking_score_mean:.2f} Std: {dd_ranking_score_std:.2f}")
         return {
             "hard_label_recovery_mean": hard_label_recovery_mean,
             "hard_label_recovery_std": hard_label_recovery_std,
             "improvement_over_random_mean": improvement_over_random_mean,
-            "improvement_over_random_std": improvement_over_random_std,
-            "dd_ranking_score_mean": dd_ranking_score_mean,
-            "dd_ranking_score_std": dd_ranking_score_std
+            "improvement_over_random_std": improvement_over_random_std
         }
```
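After this change, `compute_metrics` aggregates only HLR and IOR across evaluation rounds. A simplified sketch of the surviving aggregation logic (not the repository's full method; the per-round values below are hypothetical):

```python
import numpy as np

def aggregate_rounds(hlr_per_round, ior_per_round):
    """Mirror of the surviving aggregation: mean/std over num_eval rounds,
    with the ior/hlr ratio (the old DD-Ranking score) no longer computed."""
    return {
        "hard_label_recovery_mean": float(np.mean(hlr_per_round)),
        "hard_label_recovery_std": float(np.std(hlr_per_round)),
        "improvement_over_random_mean": float(np.mean(ior_per_round)),
        "improvement_over_random_std": float(np.std(ior_per_round)),
    }

# Hypothetical values from num_eval = 3 evaluation rounds:
print(aggregate_rounds([24.1, 23.8, 24.5], [15.2, 14.9, 15.4]))
```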

dd_ranking/metrics/soft_label.py (3 additions, 10 deletions)

```diff
@@ -298,7 +298,7 @@ def compute_metrics(self, image_tensor: Tensor=None, image_path: str=None, soft_
 
         hard_labels = torch.tensor(np.array([np.ones(self.ipc) * i for i in range(self.num_classes)]),
                                    dtype=torch.long, requires_grad=False).view(-1)
-        dd_ranking_scores = []
+
         hard_label_recovery = []
         improvement_over_random = []
         for i in range(self.num_eval):
@@ -378,32 +378,25 @@ def compute_metrics(self, image_tensor: Tensor=None, image_path: str=None, soft_
 
             hard_label_recovery.append(hlr)
             improvement_over_random.append(ior)
-            dd_ranking_scores.append(ior / hlr)
 
         results_to_save = {
             "hard_label_recovery": hard_label_recovery,
-            "improvement_over_random": improvement_over_random,
-            "dd_ranking_score": dd_ranking_scores
+            "improvement_over_random": improvement_over_random
         }
         save_results(results_to_save, self.save_path)
 
         hard_label_recovery_mean = np.mean(hard_label_recovery)
         hard_label_recovery_std = np.std(hard_label_recovery)
         improvement_over_random_mean = np.mean(improvement_over_random)
         improvement_over_random_std = np.std(improvement_over_random)
-        dd_ranking_score_mean = np.mean(dd_ranking_scores)
-        dd_ranking_score_std = np.std(dd_ranking_scores)
 
         print(f"Hard Label Recovery Mean: {hard_label_recovery_mean:.2f}% Std: {hard_label_recovery_std:.2f}")
         print(f"Improvement Over Random Mean: {improvement_over_random_mean:.2f}% Std: {improvement_over_random_std:.2f}")
-        print(f"DD-Ranking Score Mean: {dd_ranking_score_mean:.2f} Std: {dd_ranking_score_std:.2f}")
         return {
             "hard_label_recovery_mean": hard_label_recovery_mean,
             "hard_label_recovery_std": hard_label_recovery_std,
             "improvement_over_random_mean": improvement_over_random_mean,
-            "improvement_over_random_std": improvement_over_random_std,
-            "dd_ranking_score_mean": dd_ranking_score_mean,
-            "dd_ranking_score_std": dd_ranking_score_std
+            "improvement_over_random_std": improvement_over_random_std
         }
 
 
```
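soft_label.py evaluates synthetic data under soft labels with knowledge distillation (see the `temperature` parameter in doc/metrics/soft-label.md). A generic sketch of a temperature-scaled distillation loss, assuming the common KL-divergence formulation; the repository's actual training loop may differ:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
            temperature: float = 4.0) -> torch.Tensor:
    """Standard knowledge-distillation loss: KL divergence between
    temperature-softened student predictions and soft labels (here assumed
    to be teacher logits), scaled by T^2 to keep gradient magnitudes
    comparable across temperatures."""
    log_p = F.log_softmax(student_logits / temperature, dim=1)
    q = F.softmax(teacher_logits / temperature, dim=1)
    return F.kl_div(log_p, q, reduction="batchmean") * temperature ** 2

# Hypothetical shapes: batch of 8, 10 classes.
loss = kd_loss(torch.randn(8, 10), torch.randn(8, 10))
print(loss.item())
```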

doc/getting-started/quick-start.md (2 additions, 2 deletions)

```diff
@@ -79,5 +79,5 @@ The following results will be returned to you:
 - `hard_label_recovery std`: The standard deviation of hard label recovery scores.
 - `improvement_over_random mean`: The mean of improvement over random scores.
 - `improvement_over_random std`: The standard deviation of improvement over random scores.
-- `dd_ranking_score mean`: The mean of dd ranking scores.
-- `dd_ranking_score std`: The standard deviation of dd ranking scores.
+<!-- - `dd_ranking_score mean`: The mean of dd ranking scores.
+- `dd_ranking_score std`: The standard deviation of dd ranking scores. -->
```
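Matching the code change, the dictionary returned by `compute_metrics` now carries exactly four statistics; the key names below follow the return dict in dd_ranking/metrics/*.py. A small sketch of consuming it, with hypothetical numbers:

```python
# Sketch: pretty-print the four statistics still returned by compute_metrics.
# The `results` dict below uses hypothetical numbers for illustration.

def report(results: dict) -> None:
    print(f"HLR: {results['hard_label_recovery_mean']:.2f}% "
          f"± {results['hard_label_recovery_std']:.2f}")
    print(f"IOR: {results['improvement_over_random_mean']:.2f}% "
          f"± {results['improvement_over_random_std']:.2f}")

report({
    "hard_label_recovery_mean": 24.13, "hard_label_recovery_std": 0.29,
    "improvement_over_random_mean": 15.17, "improvement_over_random_std": 0.21,
})
```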

doc/introduction.md (2 additions, 2 deletions)

```diff
@@ -56,7 +56,7 @@ The evaluation method for DD-Ranking is grounded in the essence of dataset disti
 - \\(\text{syn-any}\\): Synthetic dataset with personalized evaluation methods (hard or soft labels);
 - \\(\text{rdm-any}\\): Randomly selected dataset (under the same compression ratio) with the same personalized evaluation methods.
 
-To rank different methods, we combine the above two metrics as DD-Ranking Score:
+<!-- To rank different methods, we combine the above two metrics as DD-Ranking Score:
 
-\\[\text{DD-Ranking Score} = \frac{\text{IOR}}{\text{HLR}} = \frac{(\text{Acc.}_{\text{syn-any}}-\text{Acc.}_{\text{rdm-any}})}{(\text{Acc.}_{\text{full-hard}}-\text{Acc.}_{\text{syn-hard}})}\\]
+\\[\text{DD-Ranking Score} = \frac{\text{IOR}}{\text{HLR}} = \frac{(\text{Acc.}_{\text{syn-any}}-\text{Acc.}_{\text{rdm-any}})}{(\text{Acc.}_{\text{full-hard}}-\text{Acc.}_{\text{syn-hard}})}\\] -->
```
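With the combined score commented out, ranking rests on the two base metrics alone. For reference, written in the notation introduction.md already uses:

```latex
\[
\text{HLR} = \text{Acc.}_{\text{full-hard}} - \text{Acc.}_{\text{syn-hard}},
\qquad
\text{IOR} = \text{Acc.}_{\text{syn-any}} - \text{Acc.}_{\text{rdm-any}}
\]
```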

doc/metrics/hard-label.md (1 addition, 3 deletions)

```diff
@@ -71,7 +71,7 @@ This method computes the HLR, IOR, and DD-Ranking scores for the given image and
 1. Compute the test accuracy of the surrogate model on the synthetic dataset under hard labels. We tune the learning rate for the best performance if `syn_lr` is not provided.
 2. Compute the test accuracy of the surrogate model on the real dataset under the same setting as step 1.
 3. Compute the test accuracy of the surrogate model on the randomly selected dataset under the same setting as step 1.
-4. Compute the HLR, IOR, and DD-Ranking scores.
+4. Compute the HLR and IOR scores.
 
 The final scores are the average of the scores from `num_eval` rounds.
 
@@ -90,8 +90,6 @@ A dictionary with the following keys:
 - **hard_label_recovery_std**: Standard deviation of HLR scores from `num_eval` rounds.
 - **improvement_over_random_mean**: Mean of improvement over random scores from `num_eval` rounds.
 - **improvement_over_random_std**: Standard deviation of improvement over random scores from `num_eval` rounds.
-- **dd_ranking_mean**: Mean of DD-Ranking scores from `num_eval` rounds.
-- **dd_ranking_std**: Standard deviation of DD-Ranking scores from `num_eval` rounds.
 
 **Examples:**
 
```
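A generic sketch of the four-step protocol listed above, under the assumption that some `train_and_eval` routine trains a surrogate model and returns its test accuracy; this is an illustration, not the repository's implementation:

```python
import numpy as np

def evaluate_once(train_and_eval, syn_data, real_data, random_subset):
    acc_syn_hard = train_and_eval(syn_data)       # step 1: synthetic data, hard labels
    acc_full_hard = train_and_eval(real_data)     # step 2: full real dataset
    acc_rdm_hard = train_and_eval(random_subset)  # step 3: random subset, same setting
    hlr = acc_full_hard - acc_syn_hard            # step 4: HLR
    ior = acc_syn_hard - acc_rdm_hard             # step 4: IOR (hard-label case)
    return hlr, ior

# num_eval rounds, then mean/std. The stand-in below just adds noise to a
# hypothetical base accuracy; a real run would train a model each time.
rng = np.random.default_rng(0)
dummy = lambda base_acc: float(base_acc + rng.normal(0.0, 0.3))
rounds = [evaluate_once(dummy, 60.0, 84.0, 40.0) for _ in range(5)]
hlrs, iors = zip(*rounds)
print(f"HLR {np.mean(hlrs):.2f} ± {np.std(hlrs):.2f}, "
      f"IOR {np.mean(iors):.2f} ± {np.std(iors):.2f}")
```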

doc/metrics/soft-label.md (4 additions, 3 deletions)

```diff
@@ -50,6 +50,7 @@ A class for evaluating the performance of a dataset distillation method with sof
 - **temperature**(<span style="color:#FF6B00;">float</span>): Temperature for knowledge distillation.
 - **data_aug_func**(<span style="color:#FF6B00;">str</span>): Data augmentation function used during training. Currently supports `dsa`, `cutmix`, `mixup`. See [augmentations](../augmentations/overview.md) for more details.
 - **aug_params**(<span style="color:#FF6B00;">dict</span>): Parameters for the data augmentation function.
+- **use_aug_for_hard**(<span style="color:#FF6B00;">bool</span>): Whether to use the data augmentation specified in `data_aug_func` for hard label evaluation.
 - **optimizer**(<span style="color:#FF6B00;">str</span>): Name of the optimizer. Currently supports torch-based optimizers - `sgd`, `adam`, and `adamw`.
 - **lr_scheduler**(<span style="color:#FF6B00;">str</span>): Name of the learning rate scheduler. Currently supports torch-based schedulers - `step`, `cosine`, `lambda_step`, and `lambda_cos`.
 - **weight_decay**(<span style="color:#FF6B00;">float</span>): Weight decay for the optimizer.
@@ -83,7 +84,7 @@ This method computes the HLR, IOR, and DD-Ranking scores for the given image and
 2. Compute the test accuracy of the surrogate model on the real dataset under the same setting as step 1.
 3. Compute the test accuracy of the surrogate model on the synthetic dataset under soft labels.
 4. Compute the test accuracy of the surrogate model on the randomly selected dataset under the same setting as step 3.
-5. Compute the HLR, IOR, and DD-Ranking scores.
+5. Compute the HLR and IOR scores.
 
 The final scores are the average of the scores from `num_eval` rounds.
 
@@ -102,8 +103,8 @@ A dictionary with the following keys:
 - **hard_label_recovery_std**: Standard deviation of HLR scores from `num_eval` rounds.
 - **improvement_over_random_mean**: Mean of improvement over random scores from `num_eval` rounds.
 - **improvement_over_random_std**: Standard deviation of improvement over random scores from `num_eval` rounds.
-- **dd_ranking_mean**: Mean of DD-Ranking scores from `num_eval` rounds.
-- **dd_ranking_std**: Standard deviation of DD-Ranking scores from `num_eval` rounds.
+<!-- - **dd_ranking_mean**: Mean of DD-Ranking scores from `num_eval` rounds.
+- **dd_ranking_std**: Standard deviation of DD-Ranking scores from `num_eval` rounds. -->
 
 </div>
```
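The documented parameters, including the newly added `use_aug_for_hard`, map naturally onto a configuration like the following sketch; the values and the `aug_params` keys are assumptions for illustration, not the repository's defaults:

```python
# Values are illustrative; parameter names follow the documented list above.
soft_label_eval_config = {
    "temperature": 4.0,        # knowledge-distillation temperature
    "data_aug_func": "dsa",    # one of: dsa, cutmix, mixup
    "aug_params": {"strategy": "color_crop_flip"},  # assumed keys, see augmentations doc
    "use_aug_for_hard": True,  # newly documented: reuse augmentation for hard-label eval
    "optimizer": "sgd",        # sgd, adam, or adamw
    "lr_scheduler": "cosine",  # step, cosine, lambda_step, or lambda_cos
    "weight_decay": 5e-4,
}
print(soft_label_eval_config)
```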
