Hi, thanks for the interesting work!
I'm reading the code and there is a detail I couldn't understand.
The code that selects data from each cluster is:
size = (len(dataset) * portion) / K / round
exp_reward_diff = merged_df["exp_reward_diff"]
# random select from K clusters, with p as the weight.
select_new_iter = np.random.choice(
    K, size=int(size), p=exp_reward_diff, replace=True
)
# Count how many times a cluster is chosen
selected_clusters_size = Counter(select_new_iter)
remaining_dataset = dataset.select(set(range(len(dataset))) - set(selected_indices))
remaining_dataset_df = remaining_dataset.to_pandas()
new_indices = []
for i in range(K):
    # get current indices in the remaining dataset
    indices = remaining_dataset_df[remaining_dataset_df["cluster"] == i]["index"]
    # adjust size if the selected size exceeds the remaining size
    size = min(selected_clusters_size[i], len(indices))
    # pick real samples from each cluster
    indices = np.random.choice(indices, size=size, replace=False)
    new_indices.extend(indices)
new_indices = np.array(new_indices)
# update the selected samples
new_indices = np.concatenate([selected_indices, new_indices])
If I understand correctly, the size chosen in this iteration is (len(dataset) * portion) / K / round. The code then draws that many cluster ids, using exp_reward_diff as the weights, and Counter counts how many times each cluster was drawn. The subsequent for loop picks that many samples from each of the K clusters. This results in (len(dataset) * portion) / K / round samples in total. But in the paper, the size for each iteration should be
It would be great if you can help me understand this detail.
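To make my reading concrete, here is a minimal, self-contained sketch of the two-stage draw as I understand it. It is not the repository's code: the function name `sample_one_round`, the synthetic cluster labels, the random weights, and all the numbers are assumptions made up for illustration, and `rounds` stands in for `round`.

```python
from collections import Counter

import numpy as np


def sample_one_round(clusters, weights, budget, rng):
    """Two-stage draw: pick clusters by weight, then pick samples inside each.

    clusters: 1-D array with the cluster id of each remaining sample
    weights:  length-K probabilities (must sum to 1)
    budget:   number of cluster draws made in this round
    """
    K = len(weights)
    # Stage 1: draw `budget` cluster ids with replacement.
    drawn = rng.choice(K, size=budget, p=weights, replace=True)
    per_cluster = Counter(drawn.tolist())
    picked = []
    for k in range(K):
        members = np.flatnonzero(clusters == k)
        # Cap the request at what the cluster still holds.
        n = min(per_cluster[k], len(members))
        # Stage 2: pick n distinct samples from this cluster.
        picked.extend(rng.choice(members, size=n, replace=False).tolist())
    return np.array(picked, dtype=int)


# Hypothetical numbers, not taken from the repository or the paper.
rng = np.random.default_rng(0)
n_samples, K, portion, rounds = 10_000, 10, 0.25, 5
clusters = rng.integers(0, K, size=n_samples)
weights = rng.random(K)
weights /= weights.sum()

budget = int(n_samples * portion / K / rounds)  # same formula as `size` above
picked = sample_one_round(clusters, weights, budget, rng)
print(budget, len(picked))
```

With these numbers the budget is 50, and because no cluster runs out of samples, 50 samples are picked across all clusters combined. So, if my sketch matches the real code, the formula gives the total for the round and not a per-cluster count.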