I have a CLIP implementation

Hi there,

Thank you for the dataset.

I've implemented a CLIP benchmark for the dataset: [CLIP_visual-spatial-reasoning](https://github.com/Sohojoe/CLIP_visual-spatial-reasoning)

I found I was able to go from 50% to ~55% in a true zero-shot setting (i.e. no retraining at all) through prompt engineering alone. I'm implementing retraining now and will keep this thread updated with the results.
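
For anyone wondering what "zero shot with prompt engineering" could look like for a true/false caption task, here is a rough sketch. It is not the code from the repo above. The prompt templates, the function names, and the `encode_text` callable (a stand-in for a CLIP text encoder) are all hypothetical, and the image is assumed to be already embedded as a plain list of floats.

```python
import math

# Hypothetical prompt templates; the wording is an illustration only,
# not the templates used in the linked repo.
POSITIVE_TEMPLATE = "a photo where {caption}"
NEGATIVE_TEMPLATE = "a photo where it is false that {caption}"


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_prompts(caption):
    """Wrap a caption in a positive and a negative prompt."""
    return (
        POSITIVE_TEMPLATE.format(caption=caption),
        NEGATIVE_TEMPLATE.format(caption=caption),
    )


def zero_shot_predict(image_vec, caption, encode_text):
    """Predict True if the image is closer to the positive prompt.

    encode_text is assumed to map a string to an embedding vector,
    e.g. a wrapper around a CLIP text encoder. No weights are updated.
    """
    positive, negative = build_prompts(caption)
    pos_score = cosine(image_vec, encode_text(positive))
    neg_score = cosine(image_vec, encode_text(negative))
    return pos_score > neg_score


def accuracy(examples, encode_text):
    """Fraction of (image_vec, caption, label) triples predicted correctly."""
    correct = sum(
        zero_shot_predict(image_vec, caption, encode_text) == label
        for image_vec, caption, label in examples
    )
    return correct / len(examples)
```

In this setup, changing the two template strings is the whole "prompt engineering" step, since the encoders stay frozen and only the text the caption is wrapped in changes.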