Hey!
So the first stage is applied to each of the three images and, as you correctly said, produces a feature representation for each frame.
These feature representations, from all three frames, are then concatenated and forwarded to the second U-Net, so the information from each frame passes through both networks consecutively.

The loss is calculated using ground-truth positions from just a single frame, but the network has access to all three frames when making its prediction (each localization the network makes is a function of all three frames). This way the gradients for both U-Nets depend on all three images. I hope that makes sense; please clarify if I misunderstood your question.
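To make the data flow concrete, here is a toy numpy sketch of that wiring. The actual U-Nets are replaced by stand-in per-pixel linear maps, and all names and tensor shapes are illustrative assumptions, not the real implementation; the point is only that the first network is applied to each frame separately, the features are concatenated along the channel axis, and the loss on a single target frame still depends on all three inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def frame_net(x, w):
    # Stand-in for the first U-Net: a per-pixel linear map plus nonlinearity.
    # x: (C_in, H, W) -> (C_feat, H, W)
    return np.tanh(np.einsum('fc,chw->fhw', w, x))

def fusion_net(feats, w):
    # Stand-in for the second U-Net, operating on concatenated features.
    # feats: (3 * C_feat, H, W) -> (1, H, W)
    return np.einsum('oc,chw->ohw', w, feats)

C_in, C_feat, H, W = 1, 4, 8, 8
w1 = rng.normal(size=(C_feat, C_in))       # first-stage weights, shared across frames
w2 = rng.normal(size=(1, 3 * C_feat))      # second-stage weights

frames = [rng.normal(size=(C_in, H, W)) for _ in range(3)]

# Stage 1: the same network (same weights) is applied to each frame independently.
feats = [frame_net(f, w1) for f in frames]

# Stage 2: concatenate along the channel axis and run the second network.
fused = np.concatenate(feats, axis=0)      # (3 * C_feat, H, W)
pred = fusion_net(fused, w2)               # (1, H, W)

# The loss compares the prediction against ground truth for a single frame only,
# yet pred (and hence the gradient of the loss) depends on all three input frames.
target = rng.normal(size=(1, H, W))
loss = np.mean((pred - target) ** 2)
```

In a real training setup an autodiff framework would backpropagate this loss through both networks, so `w1` receives gradient contributions from all three frames even though only one frame is supervised.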

Answer selected by tsuijenk