More training improvements

- Check training with EMA (exponential moving average of the model weights)
- Revisit other loss functions for empty blocks
- Implement domain adaptation / semi-supervised learning
    - Implement a pseudo-labeler that applies the confidence mask from the first channel (the fg/bg prediction) to all channels
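
A rough sketch of how the pseudo-labeler idea could look, mostly to pin down the semantics. Everything here is an assumption rather than the actual implementation: the name `pseudo_label`, the thresholds `low`/`high`, per-channel probabilities in `[0, 1]`, hard labels cut at 0.5, and flat Python lists standing in for the real tensors.

```python
from typing import List, Tuple


def pseudo_label(
    probs: List[List[float]],
    low: float = 0.1,    # assumed threshold: at or below this, channel 0 is confidently background
    high: float = 0.9,   # assumed threshold: at or above this, channel 0 is confidently foreground
) -> Tuple[List[List[int]], List[bool]]:
    """Hypothetical helper: probs[c][i] is the predicted probability of
    channel c at element i, with channel 0 being the fg/bg prediction."""
    if not probs:
        raise ValueError("expected at least one channel")

    # An element counts as confident if the fg/bg channel is clearly one or the other.
    mask = [p <= low or p >= high for p in probs[0]]

    # Hard pseudo-labels for every channel (assumed cut at 0.5).
    labels = [[int(p >= 0.5) for p in channel] for channel in probs]

    # The single mask from channel 0 is meant to gate the loss of all channels.
    return labels, mask


probs = [
    [0.95, 0.55, 0.05, 0.40],  # channel 0: fg/bg prediction
    [0.80, 0.70, 0.10, 0.60],  # channel 1: some other output channel (made up)
]
labels, mask = pseudo_label(probs)
print(mask)    # [True, False, True, False]
print(labels)  # [[1, 1, 0, 0], [1, 1, 0, 1]]
```

The unlabeled-data loss would then only be evaluated where `mask` is true, for every channel, so uncertain fg/bg regions contribute nothing.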

(this is just a mental note for me)