I've noticed that load_masked and store_masked currently support only batch_bool_constant masks.
I think load_masked and store_masked are well suited to handling loop tails, but in that case the mask is only known at run time. Since most architectures that support masked loads and stores don't require the mask to be a compile-time constant, perhaps it would be better to also provide batch_bool versions of load_masked and store_masked.
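To illustrate the loop-tail case I have in mind, here is a rough, portable sketch in plain C++. It only emulates the behaviour with scalar loops and is not the library's API: kLanes, Mask, Vec, tail_mask, load_masked_dyn, store_masked_dyn and scale are all hypothetical names made up for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vector width; a real batch type would define this.
constexpr std::size_t kLanes = 8;

using Mask = std::array<bool, kLanes>;  // stand-in for a runtime mask
using Vec  = std::array<float, kLanes>; // stand-in for a SIMD batch

// Build a mask whose first `remaining` lanes are active.
// The value is only known at run time, so it cannot be a constant mask.
Mask tail_mask(std::size_t remaining) {
    Mask m{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        m[i] = i < remaining;
    }
    return m;
}

// Emulated masked load: inactive lanes are zero and never read memory.
Vec load_masked_dyn(const float* src, const Mask& m) {
    assert(src != nullptr);
    Vec v{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        if (m[i]) v[i] = src[i];
    }
    return v;
}

// Emulated masked store: inactive lanes never touch memory.
void store_masked_dyn(float* dst, const Vec& v, const Mask& m) {
    assert(dst != nullptr);
    for (std::size_t i = 0; i < kLanes; ++i) {
        if (m[i]) dst[i] = v[i];
    }
}

// Multiply every element by `factor`. Full blocks and the tail go through
// the same code path; the tail simply gets a partially active mask.
void scale(std::vector<float>& data, float factor) {
    const std::size_t n = data.size();
    for (std::size_t i = 0; i < n; i += kLanes) {
        const Mask m = tail_mask(n - i); // all lanes active for full blocks
        Vec v = load_masked_dyn(data.data() + i, m);
        for (float& x : v) x *= factor;
        store_masked_dyn(data.data() + i, v, m);
    }
}
```

With a batch_bool version, the two emulated helpers above could presumably map onto the hardware masked load and store, and the separate scalar tail loop would no longer be needed.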
Thank you very much.