Commit 99a2d53

Create div_sampling.py
For DIV (diversity sampling for InternVid-10M-DIV), we aim to sample video clips from all available long videos to maximize data diversity. To do this, we count how many clips each long video contributes to the segmented clip pool and sample clips with probability inversely proportional to that count.
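The weighting can be written out as a short derivation. The notation is introduced here for illustration and is not part of the commit: $n_v$ is the number of clips in the pool that come from long video $v$, $V$ is the number of distinct long videos, and $v(i)$ is the source video of clip $i$.

```latex
\[
p_i \;=\; \frac{1/n_{v(i)}}{\sum_{j} 1/n_{v(j)}} \;=\; \frac{1}{V\,n_{v(i)}},
\qquad
\sum_{i \,:\, v(i) = v} p_i \;=\; \frac{1}{V}.
\]
```

The denominator equals $V$ because each video contributes $n_v$ terms of size $1/n_v$. Each long video therefore carries the same total probability on a single draw, however many clips it was split into. Since the script samples without replacement, this holds exactly only for the first draw.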
1 parent f5ef1c7 commit 99a2d53

1 file changed: +14 -0 lines changed
Data/InternVid/div_sampling.py

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+from collections import Counter
+import json
+import random
+import numpy as np
+data = json.load(open("/path/to/to_sample"))
+video_id = set([x["video"].split("/")[-1][:11] for x in data])
+video_id_counter = Counter([x["video"].split("/")[-1][:11] for x in data])
+sampling_weights = [1.0 / video_id_counter[x["video"].split("/")[-1][:11]] for x in data]
+np.random.seed(42)
+sampling_weights = np.array(sampling_weights)
+sampling_weights = sampling_weights / sampling_weights.sum()
+sampled_index = np.random.choice(len(data), 10647458, replace=False, p=sampling_weights)
+data = [data[i] for i in sampled_index]
+json.dump(data, open("/path/to/sampled", "w"))
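As a way to check the weighting on a small input, here is a minimal sketch of the same idea wrapped in functions and run on a toy clip pool. The helper names (`video_key`, `inverse_frequency_weights`, `diversity_sample`), the toy file names, and the use of `np.random.default_rng` instead of the global seed are assumptions made for illustration. They are not part of the commit, and the sketch will not necessarily draw the same indices as the script.

```python
from collections import Counter

import numpy as np


def video_key(clip):
    # Mirrors the diff: the first 11 characters of the clip's file name
    # are taken as the ID of the source long video.
    return clip["video"].split("/")[-1][:11]


def inverse_frequency_weights(clips):
    # Weight each clip by 1 / (number of clips from the same long video),
    # then normalize so the weights sum to 1.
    keys = [video_key(c) for c in clips]
    counts = Counter(keys)
    weights = np.array([1.0 / counts[k] for k in keys])
    return weights / weights.sum()


def diversity_sample(clips, k, seed=42):
    # Weighted sampling without replacement, as in the diff.
    rng = np.random.default_rng(seed)
    p = inverse_frequency_weights(clips)
    idx = rng.choice(len(clips), size=k, replace=False, p=p)
    return [clips[i] for i in idx]


# Hypothetical toy pool: 6 clips from video A, 2 from B, 1 from C.
toy_clips = (
    [{"video": f"clips/AAAAAAAAAAA_{i}.mp4"} for i in range(6)]
    + [{"video": f"clips/BBBBBBBBBBB_{i}.mp4"} for i in range(2)]
    + [{"video": "clips/CCCCCCCCCCC_0.mp4"}]
)
toy_weights = inverse_frequency_weights(toy_clips)
toy_sample = diversity_sample(toy_clips, k=3)
```

In the toy pool, the six clips from video A share the same total weight as the single clip from video C, so heavily segmented videos do not dominate the sample. Requesting more clips than the pool contains raises a `ValueError` when `replace=False`, so the hard-coded sample size in the script presumes a pool of at least 10647458 clips.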
