Datasets miss extern data handling and other things #248

@albertz

I'm looking into how to convert my old DatasetConfig-based datasets to the new Dataset interface (#231).

What I'm missing:

  • I want to bundle train, dev, and devtrain together, since this is what I would want for all training jobs. Should we provide a common data structure for this, e.g. TrainingDatasets?
  • A dataset always comes with its extern data. Shouldn't this be part of the Dataset interface? Otherwise you have to define it manually, or somehow infer it from the dataset. Or I would need some extended structure such as DatasetWithExternData.
  • Extern data would use dimension tags, and it's important that they are shared among train/dev/devtrain. How would we do this?
  • I'm building a setup pipeline for standard supervised training, so somewhere in the pipeline I need to define which extern data key is the input and which is the output. This could also live in TrainingDatasets, or maybe in a more specialized variant, SupervisedTrainingDatasets?
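
To make the four points above more concrete, here is a rough sketch of what I have in mind. Everything in it is hypothetical: DimTag is a plain stand-in class rather than the real dimension tag type, the dataset dicts and file names are placeholders, the "data"/"classes" defaults and the extern data dict layout are only for illustration, and none of this is the existing Dataset interface from #231.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(eq=False)  # eq=False: tags compare by identity, so "shared" means "same object"
class DimTag:
    """Hypothetical stand-in for a dimension tag."""

    name: str
    size: Optional[int] = None  # None marks a dynamic dim such as time


@dataclass
class TrainingDatasets:
    """Hypothetical bundle: three dataset configs plus a single extern data dict."""

    train: Dict[str, Any]
    dev: Dict[str, Any]
    devtrain: Dict[str, Any]
    # Only one extern_data exists, so its dim tags are shared by construction.
    extern_data: Dict[str, Dict[str, Any]]


@dataclass
class SupervisedTrainingDatasets(TrainingDatasets):
    """Hypothetical variant that also names the input and target keys."""

    input_key: str = "data"
    target_key: str = "classes"

    def __post_init__(self) -> None:
        # Fail early if the pipeline would refer to a key that is not defined.
        for key in (self.input_key, self.target_key):
            if key not in self.extern_data:
                raise KeyError(f"{key!r} is not a key in extern_data")


# Placeholder dims and dataset options, purely illustrative.
time_dim = DimTag("time")
out_time_dim = DimTag("out-time")
feature_dim = DimTag("feature", 80)
classes_dim = DimTag("classes", 1000)

datasets = SupervisedTrainingDatasets(
    train={"class": "HDFDataset", "files": ["train.hdf"]},
    dev={"class": "HDFDataset", "files": ["dev.hdf"]},
    devtrain={"class": "HDFDataset", "files": ["devtrain.hdf"]},
    extern_data={
        "data": {"dim_tags": [time_dim, feature_dim]},
        "classes": {"dim_tags": [out_time_dim], "sparse_dim": classes_dim},
    },
)
```

Keeping a single extern_data next to the three datasets is one possible answer to the sharing question: train, dev, and devtrain cannot end up with different dim tag objects, because there is only one definition. The alternative, a DatasetWithExternData per dataset, would need an extra check that all three refer to the same tags.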
