# Data Pre-Processor Design for fms-hf-tuning

**Decider(s)**: Sukriti Sharma (sukriti.sharma4@ibm.com), Will Johnson (Will.Johnson@ibm.com), Abhishek Maurya (maurya.abhishek@ibm.com), Yu Chin Fabian Lim (flim@sg.ibm.com), Dushyant Behl (dushyantbehl@in.ibm.com), Ashok Pon Kumar (ashokponkumar@in.ibm.com)

**Date (YYYY-MM-DD)**: 2024-03-06

**Obsoletes ADRs**: NA

**Modified By ADRs**: NA

**Relevant Issues**: [1]

- [Summary and Objective](#summary-and-objective)
  - [Motivation](#motivation)
  - [User Benefit](#user-benefit)
- [Decision](#decision)
  - [Alternatives Considered](#alternatives-considered)
- [Consequences](#consequences)
- [Detailed Design](#detailed-design)

## Summary and Objective

The motivation behind the data preprocessor design for fms-hf-tuning is to have a unified interface which supports many types of data formats, streaming and non-streaming data, weight-based dataset mixing, and more:

 1. Support for different formats of data → Arrow, Parquet, CSV, etc.

 1. Support for multiple files in a dataset.

 1. Support for multiple datasets.

 1. Support for different modalities of data → images, audio, etc.

 1. Support for mixing datasets based on static weights.

 1. Support for streaming datasets (Iterable Datasets in HuggingFace).

 1. Support for chat-template-based masking.

 1. Tool usage, for example applying Jinja templates to datasets which require preprocessing of the data.

### Motivation

The current data processing design in this library covers predefined use cases through a simple user-facing interface.
While the simple interface provides ease of use over the Hugging Face APIs, it does not give users any way to design
custom data preprocessing, e.g. in the case of multi-modal support.
The motivation for this design is to implement the data preprocessor in a flexible way and expose to users a powerful API
which allows them to process data however they want, while retaining the simplicity of the existing interface.

### User Benefit

#### Simple User Perspective

For simple and basic users of the library we want to retain the same functionality wherever possible, i.e.
allow users to pass in a single data file and perform simple preprocessing.
This means retaining the arguments the library currently accepts and ensuring that the appropriate
processing required on the data is handled internally.

#### Advanced User Perspective

For advanced users we want to open up an argument to our library, `data_config_file`, which takes as input
a data preprocessing config file specifying what preprocessing to apply on the data and in what order.
Here our goal is not to reimplement the functionality provided by Hugging Face but rather to have a clean interface,
via a config, where advanced users can use advanced HF functions like splitting a dataset or performing custom
preprocessing like applying Jinja templates.

The input spec through which users describe such preprocessing looks like this:

```
datapreprocessor:
  streaming: true
datasets:
  - name: dataset1
    sampling:
      ratio: 0.3
    data_paths:
      - /data/stackoverflow-kubectl_posts
      - /data/stackoverflow-kubernetes_posts
      - /data/stackoverflow-openshift_posts
    data_handlers:
      - name: render_template
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            jinja_template: "{<jinja-template>}"
  - name: dataset2
    sampling:
      ratio: 0.4
    data_paths:
      - /data/stackoverflow-kubectl_posts
      - /data/stackoverflow-kubernetes_posts
      - /data/stackoverflow-openshift_posts
    data_handlers:
      - name: render_template
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            jinja_template: "{<jinja-template>}"
  - name: dataset3
    sampling:
      ratio: 0.3
    data_handlers:
      - name: apply_tokenizer_chat_template
        arguments:
          remove_columns: all
          batched: false
    data_paths:
      - /data/stackoverflow-kubectl_posts.jsonl
      - /data/stackoverflow-kubernetes_posts.jsonl
```

#### Intermediate User Perspective

Our expectation is that advanced users will create config files for data preprocessing, and intermediate users can take these existing configs and modify them according to their preferences to get the desired result.

## Detailed Design

### Proposed Design

The proposed design to implement support for this spec is as follows.

Config representation in code:

```
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DataHandlerConfig:
    # Name of a registered data handler plus the arguments to call it with.
    name: str
    arguments: Optional[Dict] = None


@dataclass
class DatasetConfig:
    # Required fields come first so the dataclass definition is valid Python.
    name: str
    data_paths: List[str]
    sampling: Optional[Dict] = None
    splitter_arguments: Optional[Dict] = None
    data_handlers: Optional[List[DataHandlerConfig]] = None


@dataclass
class DataPreProcessorConfig:
    streaming: Optional[bool] = None


@dataclass
class DataConfig:
    datasets: List[DatasetConfig]
    datapreprocessor: Optional[DataPreProcessorConfig] = None
```

The data preprocessor abstract class:

```
from abc import ABC, abstractmethod
from typing import Callable, Dict


class DataPreProcessor(ABC):

    def __init__(self, dataconfig: DataConfig, tokenizer, model_name_or_path: str, block_size: int):
        self.data_config: DataConfig = dataconfig
        self.tokenizer = tokenizer
        self.model_name_or_path = model_name_or_path
        self.block_size = block_size
        # Registry mapping a handler name to its callable.
        self.data_handlers: Dict[str, Callable] = {}

    def register_data_handler(self, name: str, handler: Callable):
        self.data_handlers[name] = handler

    @abstractmethod
    def process_data_config(self, data_config: DataConfig):
        pass
```

At the top level we propose to have this `DataPreProcessor` class, which is abstract
and requires implementations to process the data config proposed above.

We also propose full config verification code which precedes the call to
`DataPreProcessor.process_data_config`, as that function expects a valid `DataConfig` object.
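
As a minimal sketch of what this verification step could look like (assuming the dataclasses above; the `load_and_validate_data_config` helper name and the exact checks are illustrative, not part of the proposal):

```
import yaml  # PyYAML


def load_and_validate_data_config(path: str) -> DataConfig:
    """Parse a data config YAML file into a validated DataConfig object."""
    with open(path, "r") as f:
        raw = yaml.safe_load(f)

    datasets = []
    for ds in raw.get("datasets", []):
        if "name" not in ds:
            raise ValueError("every dataset entry requires a 'name'")
        handlers = [
            DataHandlerConfig(name=h["name"], arguments=h.get("arguments"))
            for h in ds.get("data_handlers", [])
        ]
        datasets.append(
            DatasetConfig(
                name=ds["name"],
                data_paths=ds.get("data_paths", []),
                sampling=ds.get("sampling"),
                data_handlers=handlers,
            )
        )

    # Sampling ratios, when given, should form a valid probability distribution.
    ratios = [d.sampling["ratio"] for d in datasets if d.sampling]
    if ratios and abs(sum(ratios) - 1.0) > 1e-6:
        raise ValueError("sampling ratios across datasets must sum to 1")

    preproc = raw.get("datapreprocessor") or {}
    return DataConfig(
        datasets=datasets,
        datapreprocessor=DataPreProcessorConfig(streaming=preproc.get("streaming")),
    )
```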

The data preprocessor needs to support custom data handlers, provided by users of the library,
as well as predefined handlers; both need to be registered with the top-level class via
`DataPreProcessor.register_data_handler`.

## How handlers are provided and registered

Data handlers are Python callables which are called on single (or a few) samples of data and can perform
operations like applying a chat template, tokenizing the data, applying tools like Jinja templates, or even
encoding/decoding multi-modal formats like images and audio for processing by the model.

The abstract data preprocessor class provides a way to register a data handler against a `name`, which is a string.
The data handler configs (`DataHandlerConfig`) taken by `execute_data_handlers` represent a DAG of data handling
routines which are to be executed on the data.
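
As an illustration, a handler like the `render_template` handler referenced in the config above could be sketched as follows (the signature and the `formatted_text` column name are assumptions, not the final API):

```
from jinja2 import Template


def render_template(element: dict, jinja_template: str, **kwargs) -> dict:
    """Render a Jinja template against a single dataset element; with HF
    `map`, the returned keys become new dataset columns."""
    template = Template(jinja_template)
    return {"formatted_text": template.render(**element)}
```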

For the standard HF API you can think of these as HF processing routines, i.e. Map/Filter/Select operations.
We implement most of the routines as `map` operations; because of this, even the tokenization of data, which is done today
in fms-hf-tuning via `tuning/utils/preprocessing_utils.py::get_preprocessed_dataset`, can be retained as a data
handler which performs tokenization.
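
For instance, that tokenization step could be expressed as a `map`-based handler along these lines (a sketch; the column and argument names are assumptions):

```
def tokenize(element: dict, tokenizer, max_seq_length: int, **kwargs) -> dict:
    """Tokenize a single element; HF `map` adds the returned
    input_ids/attention_mask columns to the dataset."""
    return tokenizer(
        element["formatted_text"],
        truncation=True,
        max_length=max_seq_length,
    )


# Applied through the standard HF map API:
# dataset = dataset.map(tokenize, fn_kwargs={"tokenizer": tokenizer, "max_seq_length": 1024})
```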

The implementation is flexible enough for very advanced users to specify their own data handling routines by importing fms-hf-tuning and extending the preprocessing via a call to `register_data_handler` on the preprocessor. This is left for advanced users of the library, not for simple users.

To this end, one design option is to provide users an API like the one shown in the `DataPreProcessor` class,
which they can use to register custom data handlers; in this case, however, the user needs to use `fms-hf-tuning` as
a module rather than via its `main` entry point.
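
Usage might look like the following sketch, assuming a concrete subclass (the `HFDataPreProcessor` name is hypothetical) and an already-constructed config and tokenizer:

```
# Hypothetical custom handler supplied by an advanced user.
def lowercase_text(element: dict, **kwargs) -> dict:
    return {"formatted_text": element["formatted_text"].lower()}


preprocessor = HFDataPreProcessor(data_config, tokenizer, model_name_or_path, block_size)
preprocessor.register_data_handler("lowercase_text", lowercase_text)
preprocessor.process_data_config(data_config)
```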

Please note that our implementation needs to support certain predefined built-in handlers, like `apply_chat_template`
or `tokenize`, which users can request just by name.

For an example, see this implementation: https://github.ibm.com/ai4code-wisdom/platform/blob/main/modelops/modelops/train.py#L251

## Implementation of the default Data Preprocessor

The default data preprocessor, implemented as an instance of the `DataPreProcessor` class, uses HF APIs wherever possible
to minimize custom reimplementation of code.

The data preprocessor iterates through each `DatasetConfig`; the HF implementation processes different types of files via the
`load_dataset` factory. If a format is not supported automatically by this factory, we can look to extend it to another type of interest via the
`Dataset.from_generator(<generator>)` functionality.
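
A sketch of that dispatch (the `iter_custom_records` generator is a hypothetical user-supplied iterator over records of a custom format):

```
from datasets import Dataset, load_dataset


def load_file(path: str) -> Dataset:
    """Load a single file through the HF `load_dataset` factory based on its
    extension, falling back to `Dataset.from_generator` for custom formats."""
    if path.endswith((".json", ".jsonl")):
        return load_dataset("json", data_files=path, split="train")
    if path.endswith(".csv"):
        return load_dataset("csv", data_files=path, split="train")
    if path.endswith(".parquet"):
        return load_dataset("parquet", data_files=path, split="train")
    # Hypothetical fallback: wrap a user-supplied record generator.
    return Dataset.from_generator(lambda: iter_custom_records(path))
```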

This also means that any implementation like `get_json_object`, which loads `json(l)` and then returns a custom JSON dict,
can be implemented as a data handler.

### Interleaving datasets

In the case of multiple datasets, the user can specify how the datasets are to be interleaved.
The probabilities specified by users in the config (`sampling.ratio`) can be collected from the individual datasets and passed to
[`datasets.interleave_datasets`](https://huggingface.co/docs/datasets/v3.0.1/en/package_reference/main_classes#datasets.interleave_datasets).
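
For example (a sketch, assuming the three datasets from the config above have already been loaded):

```
from datasets import interleave_datasets

# Probabilities collected from each dataset's `sampling.ratio` in the config.
mixed = interleave_datasets(
    [dataset1, dataset2, dataset3],
    probabilities=[0.3, 0.4, 0.3],
    seed=42,
    stopping_strategy="all_exhausted",
)
```

Here `stopping_strategy="all_exhausted"` keeps sampling until every dataset has been fully seen; the default `"first_exhausted"` stops when the smallest dataset runs out.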

### Streaming datasets

In HuggingFace, the `streaming` argument can be handled by using `IterableDataset` instead of the standard `Dataset`.
HF provides the same APIs, like `datasets.interleave_datasets`, over `Iterable` datasets as well.

A further important thing to note is that, in the HF case, the `map` functionality which we use to implement data handling is applied lazily on iterable datasets, meaning we don't need to handle the data handlers in a different way for streaming data. [More information on the HF page.](https://huggingface.co/docs/datasets/en/about_mapstyle_vs_iterable#eager-data-processing-and-lazy-data-processing)
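
A sketch of the streaming path, reusing the `render_template` handler sketched earlier:

```
from datasets import load_dataset

# streaming=True yields an IterableDataset; `map` is applied lazily as
# samples are drawn, so the same handlers work unchanged.
stream = load_dataset(
    "json",
    data_files="/data/stackoverflow-kubectl_posts.jsonl",
    split="train",
    streaming=True,
)
stream = stream.map(render_template, fn_kwargs={"jinja_template": "{<jinja-template>}"})
```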

## Handling data collators

Data collators, specifically for TRL use cases like chat-based interactions which apply chat templates and proper attention masking on the tokenized data (as in the case of `DataCollatorForCompletionOnlyLM`), handle one specific piece of functionality on the data. In this design our approach is to pass data collators from the Hugging Face API directly to `SFTTrainer`.
Retaining the current code path, the collators are collected by the `get_data_collator` functionality and passed to `SFTTrainer`. We can retain the same functionality and keep the design simpler.
The job of the data preprocessor is to provide a single interface over the multiple datasets in the config; keeping a collator like this means the collator is the same across all datasets, but it keeps the design simpler.
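
A sketch of this retained code path (the `model`, `tokenizer`, and `processed_dataset` objects and the response template string are assumed to come from the surrounding training setup):

```
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

# The collator masks out everything before the response template so that
# the loss is computed on completions only.
collator = DataCollatorForCompletionOnlyLM(
    response_template="### Response:",
    tokenizer=tokenizer,
)
trainer = SFTTrainer(
    model=model,
    train_dataset=processed_dataset,
    tokenizer=tokenizer,
    data_collator=collator,
)
```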

## Simplification of code and user configuration

The flexibility provided by this design also simplifies the configuration requirements for various use cases.
If a chat template and chat-style data are requested, users can specify chat-specific data handlers and omit all configuration that is not required.
This can also simplify configuration handling in the code, as in the sketch below.
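
For example, a chat-style use case might need nothing more than a config like this hypothetical sketch:

```
datasets:
  - name: chat_dataset
    data_paths:
      - /data/chat_conversations.jsonl
    data_handlers:
      - name: apply_tokenizer_chat_template
        arguments:
          remove_columns: all
          batched: false
```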

## Handling Multi-Modal Data

HF does provide support for handling [image datasets](https://huggingface.co/docs/datasets/en/image_process) and [audio datasets](https://huggingface.co/docs/datasets/en/audio_load), which can be utilized in our HF data preprocessor.

The functionality HF describes for working with image and audio datasets is `map`-based functions which perform resizing, encoding and other such operations on the dataset (see the links above).

This means image and audio multi-modal datasets will be compatible with our data handler routines. Once we implement the data handler routine processing, we will allow users to train with multi-modal datasets too.
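
As an illustration, an image-resizing data handler following the `map`-based pattern from the HF docs linked above might look like this sketch (the `imagefolder` data directory is hypothetical):

```
from datasets import load_dataset


def resize_images(element: dict, size=(224, 224), **kwargs) -> dict:
    # Images are decoded to PIL objects by the HF image loader.
    element["image"] = element["image"].resize(size)
    return element


dataset = load_dataset("imagefolder", data_dir="/data/images", split="train")
dataset = dataset.map(resize_images)
```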

### Alternatives Considered

## Consequences

### Advantages

### Impact on performance

## Implementation stages

1. Stage 1:
    * Refactor the code in `fms-hf-tuning` into the abstract data class and add support for preliminary data handling routines.
      This will automatically enable support for multi-modal data, which is our priority.
      Note that at this stage it might be wise to have two side-by-side implementations, i.e. not delete the existing implementation.
1. Stage 2:
    * Implement `streaming` data, i.e. `iterable` dataset support, for the HF data preprocessor implementation.
    * Data handling support for streaming data.
1. Stage 3:
    * Identify and add any other required predefined data handlers.
    * Phase out the old implementation in favor of the new one.