This is a very insightful piece of work! It's impressive to see robust motion segmentation and quality enhancement for dynamic point clouds achieved in a training-free manner.
I have a question regarding the implementation details. I noticed that when the cross-attention maps for a single image are aggregated (in the _get_attn_k() function in model.py), the reduction is taken along the query dimension rather than the key dimension.
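To make sure I'm reading the code correctly, here is a minimal sketch of the two reductions I have in mind (the tensor shape convention and the use of a mean reduction are my own assumptions for illustration, not taken from the repo):

```python
import torch

# Assumed shape: attn is [heads, num_queries, num_keys],
# e.g. image patches (queries) attending to key tokens.
attn = torch.rand(8, 1024, 77)

# What I see in _get_attn_k(): reducing over the query dimension,
# which gives one aggregated score per key token.
per_key = attn.mean(dim=1)      # -> [heads, num_keys]

# The alternative I expected: reducing over the key dimension,
# which gives one aggregated score per query (per image patch).
per_query = attn.mean(dim=-1)   # -> [heads, num_queries]
```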
Is there a specific reason or intuition behind this design choice? I would really appreciate it if you could share your insights. Thanks!