Dump the raw training data for the LLM-jp-3 series. Each training instance should include at least the following fields:
token_ids: A list of token IDs for the training instance
training_step: Training step at which the training instance was processed
dataset: Name of the dataset from which the instance was sourced
document_ids: IDs of the documents associated with the training instance
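
As an illustration, a single training instance carrying the four fields above could be represented as one record. The sketch below is only one possible layout: it assumes a JSON Lines dump (one instance per line), integer token IDs, and string document IDs, none of which are fixed by the description. The class name, helper functions, and sample values are hypothetical.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class TrainingInstance:
    """Hypothetical record holding the minimum required fields."""
    token_ids: List[int]      # token IDs for the training instance
    training_step: int        # step at which the instance was processed
    dataset: str              # name of the source dataset
    document_ids: List[str]   # assumption: document IDs are strings


def to_jsonl_line(instance: TrainingInstance) -> str:
    """Serialize one instance as a single JSON line (assumed dump format)."""
    return json.dumps(asdict(instance), ensure_ascii=False)


def from_jsonl_line(line: str) -> TrainingInstance:
    """Parse one JSON line back into a TrainingInstance."""
    return TrainingInstance(**json.loads(line))


# Made-up sample values, for illustration only.
example = TrainingInstance(
    token_ids=[1, 42, 7, 2],
    training_step=0,
    dataset="example_dataset",
    document_ids=["doc-0001"],
)
line = to_jsonl_line(example)
restored = from_jsonl_line(line)
```

Additional fields can be added to the record without affecting readers that only rely on these four, since the list above is a minimum rather than a complete schema.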