In this section, you construct the final output score using features at two different scales: intra-scale and inter-scale.
For the intra-scale part, you feed multi-scale features from the feature pyramid into the class_head separately to obtain local scores within each scale. This part is clear to me.
However, in the inter-scale part, you first concatenate the multi-scale features and then feed them together into a convolution-based conf_head. My question is: why is it reasonable for features from different scales to interact with each other?
Let’s consider a simplified case with two stride settings. At these two scales, the feature sequences are [t0, t1, t2, ..., t7] and [T0, T1, T2, T3], respectively. After concatenation, the unified feature sequence becomes [t0, t1, ..., t7, T0, ..., T3].
I noticed that in your experiments, a convolution with kernel size 5 is used. When the kernel is centered at t0, it only captures features from the same scale, such as t0, t1, and t2. When it slides to T0, however, it may capture t6, t7, T0, T1, and T2 — but t6, t7, and T0 are not temporally aligned. What is the physical or semantic meaning of fusing such features? Could the resulting representations be confusing or inconsistent?
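To make the window contents concrete, here is a small pure-Python sketch of my understanding (illustrative only, not your code): a kernel-5 window with zero padding, mimicking a `Conv1d(kernel_size=5, padding=2)`, sliding over the concatenated sequence from the example above.

```python
# Concatenated feature sequence from the toy example: [t0..t7, T0..T3].
seq = ["t%d" % i for i in range(8)] + ["T%d" % i for i in range(4)]
k, pad = 5, 2  # kernel size 5 with zero padding 2, as in a Conv1d

def window(center):
    """Labels covered when the kernel is centered at `center` (padding positions dropped)."""
    return [seq[j] for j in range(center - pad, center + pad + 1) if 0 <= j < len(seq)]

print(window(0))  # ['t0', 't1', 't2'] -- same scale only
print(window(8))  # ['t6', 't7', 'T0', 'T1', 'T2'] -- mixes both scales
```

The window centered at T0 (index 8) is exactly the mixed-scale case described above.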
Moreover, cross-scale interactions only happen when the kernel is centered near the boundary between the two scales. In practice, the number of features at each scale is usually larger than the 8 and 4 used in this example, so inter-scale interactions would occur only in a small number of boundary positions.
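To quantify this in the toy example (scale lengths 8 and 4, kernel 5), here is a short sketch counting how many kernel positions actually straddle the boundary between the two scales; the names are mine, not from the repo.

```python
n_a, n_b, k = 8, 4, 5  # per-scale lengths and kernel size from the toy example
half = k // 2

# A window centered at c covers indices [c - half, c + half]; it mixes the
# two scales only if it reaches both sides of the boundary at index n_a.
mixed = [c for c in range(n_a + n_b)
         if c - half < n_a <= c + half]

print(mixed)                      # [6, 7, 8, 9]
print(len(mixed), "of", n_a + n_b)  # 4 of 12 positions mix scales
```

With realistic sequence lengths the boundary region stays at k - 1 positions per scale pair, so the mixed fraction shrinks further, which is what motivates my question.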
I am not sure whether I have misunderstood the intention of your paper or overlooked something in the code. Nevertheless, this part has been quite puzzling to me, and I would appreciate the opportunity to discuss it further with you.