- find out what the best training dataset should contain
- parallelize dataset creation
- build a Docker image for dataset creation
- publish the Docker images
- create a new 100 GB training dataset on RunPod
- upload the dataset to Hugging Face and handle versioning there as well
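
A rough sketch of how the parallelization and upload items might fit together, not a settled design. It assumes the dataset can be split into independent JSONL shards, that each shard can be built without talking to the others, and that versioning means tagging the dataset repo on the Hub. `build_shard`, `build_dataset`, `upload_dataset`, the shard naming scheme, and the placeholder record are all hypothetical; the upload part assumes the `huggingface_hub` package is installed and a token is already configured.

```python
import json
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def build_shard(shard_index: int, out_dir: str, samples_per_shard: int) -> str:
    """Write one JSONL shard and return its path."""
    path = Path(out_dir) / f"shard-{shard_index:05d}.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for i in range(samples_per_shard):
            # Hypothetical placeholder record; swap in the real sample generator.
            record = {"id": shard_index * samples_per_shard + i, "text": "placeholder"}
            f.write(json.dumps(record) + "\n")
    return str(path)


def build_dataset(out_dir, num_shards, samples_per_shard,
                  executor_cls=ProcessPoolExecutor, max_workers=None):
    """Build all shards concurrently and return the sorted shard paths.

    executor_cls is a parameter so a ThreadPoolExecutor can be swapped in
    for I/O-bound generation or for quick local tests.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with executor_cls(max_workers=max_workers) as pool:
        futures = [
            pool.submit(build_shard, i, out_dir, samples_per_shard)
            for i in range(num_shards)
        ]
        # result() re-raises any exception from a worker, so failures surface here.
        return sorted(f.result() for f in futures)


def upload_dataset(out_dir: str, repo_id: str, tag: str) -> None:
    """Upload the shard folder to a Hub dataset repo and tag it as a version."""
    # Imported lazily so shard creation works without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    api.upload_folder(folder_path=out_dir, repo_id=repo_id, repo_type="dataset")
    # A git tag on the dataset repo serves as the version label.
    api.create_tag(repo_id, tag=tag, repo_type="dataset")
```

Usage would look something like `build_dataset("out", num_shards=64, samples_per_shard=10_000)` followed by `upload_dataset("out", "<user>/<dataset>", "v0.1.0")`. With the default process pool, the call should sit under an `if __name__ == "__main__":` guard. Writing fixed-size shards keeps each worker independent, which also maps naturally onto running several containers side by side.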