update datapreprocessor adr

dushyantbehl · dushyantbehl · commit eca1d4483c47 · 2024-11-06T16:27:23.000+05:30
diff --git a/architecture_records/004-datapreprocessor.md b/architecture_records/004-datapreprocessor.md
@@ -20,32 +20,73 @@
 
 ## Summary and Objective
 
-The reason for motivating datapreprocessor design for fms-hf-tuning is to have a unified interface which supports many type of data formats, streaming and non streaming data, weight based data mixing and many others.
+<!-->
+Context goes here.
+Describe the forces at play, including technological, political, social, and project local. These forces are likely in tension, and should be called out as such. The language in this section is value-neutral. It is simply describing facts.
+<-->
 
-1. Support for different formats of data → Arrow, Parquet, CSV, etc.
+The primary objective of the `DataPreProcessor` design for fms-hf-tuning is to provide a unified yet powerful interface for handling diverse data formats and configurations.
+This interface should cater to various user expertise levels, enabling basic users to easily load and process data, while allowing advanced users to customize their data handling through pre-defined configuration files.
 
-1. Support for multiple files in a dataset.
+### Key Goals:
+1. **Broad Data Format Support**: Allow datasets in formats such as Arrow, Parquet, and CSV.
+1. **Compatibility with Multiple Datasets and Files**: Enable multiple files per dataset and interleaving or mixing of datasets.
+1. **Support for Different Data Modalities**: Include images, audio, and text data, along with modality-specific preprocessing options.
+1. **User-Focused Configurations**: Provide simple data loading for regular users, while enabling advanced configurations for expert users.
+1. **Template-Based Preprocessing**: Support chat templates and masking, where necessary, for chat-based or other template-dependent preprocessing requirements.
 
-1. Support for multiple datasets.
+### Motivation
 
-1. Support for different modalities of data -> Images, Audio etc.
+<!-->
+Why this is a valuable problem to solve? What background information is needed to show how this design addresses the problem?
+Which users are affected by the problem? Why is it a problem? What data supports this? What related work exists?
+<-->
 
-1. Support for mixing datasets based on static weights.
+The main motivation for this ADR is that many teams use fms-hf-tuning for a diverse set of use cases, some of which the library does not currently support.
 
-1. Support for streaming datasets (or Iterable Datasets in HuggingFace)
+For data preprocessing, the library currently takes two primary arguments, `training_data_path` and `validation_data_path`, each of which takes a single file location for a dataset. The library currently supports only JSON data, but it can handle both pretokenised and non-tokenised data by performing `input` masking and custom data formatting.
 
-1. Support for chat template based masking
+The first motivation for a change is that users are asking for multiple datasets, and even multiple data files per dataset. Some teams also train on
+Parquet and Arrow data, so they need support for additional data formats in the code.
+A further requirement from teams is a way to interleave datasets at run time by specifying static weights to mix different datasets, which the code does not support yet.
+Finally, teams need preprocessing support for multiple data modalities (starting with images) and for custom preprocessing, such as Jinja-based template rendering of the dataset before consumption.
 
-1. Tool Usage - for example using jinja templates in the datasets which require preprocessing of the data.
+All these requirements are new and not currently supported by the library, which motivated us to propose a change to the data preprocessing design that incorporates them, and potentially any future requirements, in one go.
 
-### Motivation
+### User Benefit
 
-The current design of the data processing in fms-hf-tuning contains predefined use cases and support for a simple interface
-to the users. While the simple interface provides ease of use over HuggingFace APIs it doesn't provide users any interface to define custom data preprocessing for e.g. in case of multi modal support. 
+<!-- How will users (or other contributors) benefit from this work? What would be the headline in the release notes or blog post? -->
 
-Ths motivation for this design is to implement the data preprocessor in a flexible way and expose to the users a powerful API which can allow them custom processing of data however they want, and at the same time have backwards compatibility retaining the simple interface of the existing API.
+Users will benefit from an additional argument that lets them pass a single `data_config` file specifying how to preprocess their dataset.
+In the config, users will be able to pass multiple data files and multiple datasets, and specify static weights to interleave datasets.
+Users will also be able to define which preprocessing routines to apply to the data and in which order. This makes it much easier to handle custom datasets
+that might require rendering a Jinja template or processing image data in a custom way.
+
+It is not mandatory for users to learn the specification of the additional `data_config`: the existing data arguments in `tuning.config.configs.DataArguments` will not be deprecated, and users can keep using them for the use cases the library currently serves.
+
+## Decision
+
+<!-->
+This is the meat of the document, where you explain the decision. If you have multiple alternatives, be sure to use sub-sections for better separation of the idea, and list pros/cons to each approach. If there are alternatives that you have eliminated, you should also list those here, and explain why you believe your chosen approach is superior. Make sure you’ve thought through and addressed the following sections. If a section is not relevant to your specific proposal, please explain why, e.g. your ADR addresses a convention or process, not an API.
+<-->
+
+The primary decision in this ADR is how to handle the incoming use cases. The current code, as explained above, handles a limited data format (only `json`) and limited preprocessing: `custom data formatting` with or without a data template, and `tokenization and input masking`.
+
+One way to handle a fixed number of use cases is to have a use-case-specific implementation of data preprocessing and let users choose which preprocessing to utilise via
+existing or new command-line arguments.
+
+### Alternatives Considered
+<!-->
+
+Make sure to discuss the relative merits of alternatives to your proposal.
+<-->
+
+## Consequences
+
+<!-->
+Describe the resulting context, after applying the decision. All consequences should be listed here, not just the "positive" ones. A particular decision may have positive, negative, and neutral consequences, but all of them affect the team and project in the future.
+<-->
 
-### User Benefit
 
 ### Simple User Perspective
 
@@ -111,34 +152,12 @@ datasets:
 Our perspective is that the advanced users will create config files for data preprocessing and the intermediate users can use these existing configs and modify them according to their preference to get the desired result.
 
 ## Detailed Design
+<!-->
+This section is optional. Elaborate on details if they're important to understanding the design, but would make it hard to read the proposal section above.
+<-->
 
 ### The proposed design to implement support for this spec is as follows,
 
-Config Representation in code.
-
-```
-@dataclass
-class DataHandlerConfig:
-    name: str
-    arguments: Optional[Dict]
-
-@dataclass
-class DatasetConfig:
-    name: str
-    sampling: Optional[Dict] = None
-    data_paths: List[str]
-    data_handlers: List[DataHandlerConfig] = None
-
-@dataclass
-class DataPreProcessorConfig:
-    streaming: Optional[bool] = None
-
-@dataclass
-class DataConfig:
-    datapreprocessor: Optional[DataPreProcessorConfig]
-    datasets: List[DatasetConfig]
-```
-
 Data Pre Processor abstract class
 
 ```
@@ -226,17 +245,7 @@ The functionality listed by HF in implementing the use of image and audio datase
 
 This means the image and audio multi modal datasets will be compatible with our data handler routines. Once we implement the data handler routine processing, we will allow users to train with multi modal datasets too.
 
-### Alternatives Considered
-
-## Consequences
-
-### Advantages
-
-### Impact on performance
-
-
 # Implementing stages.
-
 1. Stage 1: 
     * Refactoring the code in `fms-hf-tuning` into the abstract data class and adding support for preliminary data handling routines.
         This will automatically enable support for multi modal data which is our priority.