Commit 06b0551

Create div_sampling.py
For DIV (diversity sampling for InternVid-10M-DIV), we aim to sample video clips across all available long videos to maximize data diversity. To do this, we count how many clips each long video contributes to the segmented clip pool and sample clips with probability inversely proportional to that count.
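The weighting can be illustrated on a toy pool. This is a minimal sketch, not part of the commit: the clip paths and the helper name `inverse_frequency_weights` are hypothetical, and the 11-character video ID prefix simply mirrors the `[:11]` slice used in the script.

```python
from collections import Counter

import numpy as np


def inverse_frequency_weights(video_ids):
    """Return one normalized sampling probability per clip (hypothetical helper)."""
    counts = Counter(video_ids)
    weights = np.array([1.0 / counts[v] for v in video_ids])
    return weights / weights.sum()


# Hypothetical pool: video "aaaaaaaaaaa" has 3 clips, "bbbbbbbbbbb" has 1.
clips = [
    "clips/aaaaaaaaaaa_0.mp4",
    "clips/aaaaaaaaaaa_1.mp4",
    "clips/aaaaaaaaaaa_2.mp4",
    "clips/bbbbbbbbbbb_0.mp4",
]
video_ids = [c.split("/")[-1][:11] for c in clips]
probs = inverse_frequency_weights(video_ids)

# Sum the per-clip probabilities back up per source video.
per_video = {}
for vid, p in zip(video_ids, probs):
    per_video[vid] = per_video.get(vid, 0.0) + p
print(per_video)
```

Each clip's weight is 1 divided by its video's clip count, so the weights of one video always sum to 1 and, after normalization, every long video carries the same total probability (about 0.5 each here). These are the probabilities of the first draw; when sampling without replacement, later draws are renormalized over the clips that remain.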
1 parent 8b52001 commit 06b0551

1 file changed

+14
-0
lines changed

Data/InternVid/div_sampling.py

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+from collections import Counter
+import json
+import random
+import numpy as np
+data = json.load(open("/path/to/to_sample"))
+video_id = set([x["video"].split("/")[-1][:11] for x in data])
+video_id_counter = Counter([x["video"].split("/")[-1][:11] for x in data])
+sampling_weights = [1.0 / video_id_counter[x["video"].split("/")[-1][:11]] for x in data]
+np.random.seed(42)
+sampling_weights = np.array(sampling_weights)
+sampling_weights = sampling_weights / sampling_weights.sum()
+sampled_index = np.random.choice(len(data), 10647458, replace=False, p=sampling_weights)
+data = [data[i] for i in sampled_index]
+json.dump(data, open("/path/to/sampled", "w"))
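
For reuse outside a one-off script, the same steps could be wrapped in a function. The following is a sketch under assumptions, not the committed code: the name `sample_diverse`, the size check, and the toy pool are hypothetical additions, and it uses a local `np.random.RandomState` instead of seeding NumPy's global generator. It also computes the video ID once per clip and omits the `random` import and the `video_id` set, which the script never uses.

```python
from collections import Counter

import numpy as np


def sample_diverse(data, k, seed=42):
    """Sample k clips without replacement, weighting each clip by
    1 / (number of clips from the same long video). Hypothetical helper."""
    if k > len(data):
        raise ValueError("cannot sample more clips than the pool contains")
    # Video ID = first 11 characters of the file name, as in the script.
    ids = [x["video"].split("/")[-1][:11] for x in data]
    counts = Counter(ids)
    weights = np.array([1.0 / counts[i] for i in ids])
    weights = weights / weights.sum()
    rng = np.random.RandomState(seed)  # local generator, no global seeding
    index = rng.choice(len(data), size=k, replace=False, p=weights)
    return [data[i] for i in index]


# Hypothetical pool: five clips from one video, one clip from another.
pool = [{"video": "clips/aaaaaaaaaaa_%d.mp4" % n} for n in range(5)]
pool.append({"video": "clips/bbbbbbbbbbb_0.mp4"})
picked = sample_diverse(pool, 3)
print(len(picked))  # 3
```

The size check matters because `choice` with `replace=False` raises an error when asked for more items than the pool holds, so a hard-coded sample size such as 10647458 only works if the input pool is at least that large.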
