Whenever context is passed to a block, as in:
h, res_samples = downsample_block(hidden_states=h, temb=emb, context=context)
the block's forward function does "del context",
so the context is discarded and the conditioning is never actually applied.
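One way to confirm this is to call a block twice with two different contexts and compare the outputs: if the context is deleted inside forward, the two outputs will be identical. The following is only a minimal, framework-free sketch of that check. `context_is_used`, `IgnoringBlock`, and `ConditionedBlock` are hypothetical names, and the toy blocks operate on plain floats rather than tensors.

```python
import operator


def context_is_used(block, hidden_states, temb, context_a, context_b, equal=operator.eq):
    """Return True if changing `context` changes the block's first output."""
    out_a, _ = block(hidden_states=hidden_states, temb=temb, context=context_a)
    out_b, _ = block(hidden_states=hidden_states, temb=temb, context=context_b)
    return not equal(out_a, out_b)


class IgnoringBlock:
    """Toy stand-in (assumption) mirroring the reported behaviour."""

    def __call__(self, hidden_states, temb, context=None):
        del context  # the argument is accepted but never used
        h = hidden_states + temb
        return h, (h,)


class ConditionedBlock:
    """Toy stand-in (assumption) where context does influence the output."""

    def __call__(self, hidden_states, temb, context=None):
        h = hidden_states + temb
        if context is not None:
            h = h + context
        return h, (h,)


ignored = context_is_used(IgnoringBlock(), 1.0, 0.5, context_a=0.0, context_b=2.0)
applied = context_is_used(ConditionedBlock(), 1.0, 0.5, context_a=0.0, context_b=2.0)
print(ignored, applied)  # prints: False True
```

With real tensors, the same helper should work by passing `equal=torch.equal`, since the default `==` comparison returns a tensor rather than a single boolean.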
Furthermore, if the context is an image, it should pass through an encoder before being added to or concatenated with the image embeddings.
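As a rough illustration of what that could look like, here is a small PyTorch sketch. It is not taken from the repository: `ImageContextEncoder` and `condition` are hypothetical names, and the two-layer convolutional encoder, the channel counts, and the nearest-neighbour resize are all assumptions chosen only to keep the example short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImageContextEncoder(nn.Module):
    """Hypothetical encoder: maps a context image to a feature map."""

    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, image, size):
        feats = self.net(image)
        # Resize so the context matches the spatial size of the hidden states.
        return F.interpolate(feats, size=size, mode="nearest")


def condition(hidden_states, context_feats, mode="add"):
    """Fuse encoded context into hidden states by addition or channel concat."""
    if mode == "add":
        # Addition requires the same number of channels on both sides.
        return hidden_states + context_feats
    if mode == "concat":
        return torch.cat([hidden_states, context_feats], dim=1)
    raise ValueError(f"unknown mode: {mode}")


encoder = ImageContextEncoder(in_channels=3, out_channels=64)
h = torch.randn(2, 64, 16, 16)     # hidden states inside the network (assumed shape)
image = torch.randn(2, 3, 32, 32)  # raw context image (assumed shape)
ctx = encoder(image, size=tuple(h.shape[-2:]))

added = condition(h, ctx, mode="add")
concatenated = condition(h, ctx, mode="concat")
```

Note that concatenation doubles the channel count here (64 to 128), so whatever layer consumes the result would need to be built for the larger input, whereas addition keeps the shape unchanged.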