Hi, and thank you for making this code available! What would the pipeline look like for training on my own dataset? What is required, and what are the pre-processing steps? Thanks!