Try all data deserializers before failing (#423)

sjmonson · web-flow · commit c31fdde1bb20 · 2025-10-20T13:53:27.000-04:00
## Summary

This pull request makes the code try every registered dataset
deserializer before failing. This ensures that a data format rejected by
one deserializer does not cause a premature failure.

## Details

The loop collects every unexpected error raised by a deserializer. If no
deserializer succeeds, it either raises a single `DataNotSupportedError`
listing the collected errors, if there are any, or falls through to the
existing code, which reports that there is no suitable deserializer for
the data.
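
The control flow described above can be sketched in isolation. This is a minimal illustration, not the GuideLLM implementation: `try_all`, `parse_int`, `parse_list`, and `registry` are hypothetical names, and the local `DataNotSupportedError` is a stand-in for the real exception class.

```python
from typing import Any, Callable


class DataNotSupportedError(Exception):
    """Stand-in for the error a deserializer raises to reject the data."""


def try_all(data: Any, deserializers: dict[str, Callable[[Any], Any]]) -> Any:
    """Try each deserializer in turn and return the first non-None result."""
    errors: list[Exception] = []
    for fn in deserializers.values():
        try:
            result = fn(data)
        except DataNotSupportedError:
            continue  # Expected rejection: not recorded as an error.
        except Exception as exc:
            errors.append(exc)  # Unexpected failure: keep it for the report.
            continue
        if result is not None:
            return result  # Stop at the first success so it is not overwritten.

    if errors:
        raise DataNotSupportedError(
            f"data deserialization failed; {len(errors)} errors occurred while "
            f"attempting to deserialize data {data!r}: {errors}"
        )
    raise DataNotSupportedError(f"no suitable deserializer for data {data!r}")


def parse_int(data: Any) -> int:
    # Hypothetical deserializer: accepts strings only.
    if not isinstance(data, str):
        raise DataNotSupportedError("not a string")
    return int(data)  # May raise ValueError, which try_all collects.


def parse_list(data: Any) -> list:
    # Hypothetical deserializer: accepts lists only.
    if not isinstance(data, list):
        raise DataNotSupportedError("not a list")
    return list(data)


registry = {"int": parse_int, "list": parse_list}
```

With this sketch, `try_all("42", registry)` succeeds through `parse_int`, while `try_all("abc", registry)` raises a `DataNotSupportedError` that wraps the collected `ValueError`.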

Here is an example of the error message raised if I force it to fail:
```
guidellm.data.deserializers.deserializer.DataNotSupportedError: data
deserialization failed; 2 errors occurred while attempting to
deserialize data {"prompt_tokens": 1, "output_tokens": 100}:
[HFValidationError('Repo id must use alphanumeric chars or \'-\', \'_\',
\'.\', \'--\' and \'..\' are forbidden, \'-\' and \'.\' cannot start or
end the name, max length is 96: \'{"prompt_tokens": 1, "output_tokens":
100}\'.'), TypeError('InMemoryDictDatasetDeserializer.__call__() takes 4
positional arguments but 5 were given')]
```

## Test Plan

Run GuideLLM with various formats of data to ensure the proper one is
used, and test invalid inputs, too.
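
One way to approach such a check is sketched below. Because the deserializers have no priority order, each input should be accepted by exactly one of them, and invalid input by none. Everything here is hypothetical (`accepting_deserializers`, `toy_registry`, and the two toy deserializers); it does not use the GuideLLM registry.

```python
from typing import Any, Callable


def accepting_deserializers(
    data: Any, registry: dict[str, Callable[[Any], Any]]
) -> list[str]:
    """Return the names of all deserializers that accept the given data."""
    accepted = []
    for name, fn in registry.items():
        try:
            if fn(data) is not None:
                accepted.append(name)
        except Exception:
            pass  # Any failure counts as "does not accept" for this check.
    return accepted


def from_dict(data: Any) -> dict:
    # Hypothetical deserializer: accepts dicts only.
    if not isinstance(data, dict):
        raise TypeError("expected a dict")
    return dict(data)


def from_list(data: Any) -> list:
    # Hypothetical deserializer: accepts lists only.
    if not isinstance(data, list):
        raise TypeError("expected a list")
    return list(data)


toy_registry = {"dict": from_dict, "list": from_list}

# Each case pairs an input with the deserializers expected to accept it.
cases = [
    ({"prompt_tokens": 1}, ["dict"]),
    ([1, 2, 3], ["list"]),
    (3.5, []),  # Invalid input: nothing should accept it.
]
for data, expected in cases:
    assert accepting_deserializers(data, toy_registry) == expected
```

A result with more than one name would indicate overlapping deserializers, in which case the outcome depends on registry iteration order.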

---

- [x] "I certify that all code in this PR is my own, except as noted
below."

## Use of AI

- [ ] Includes AI-assisted code completion
- [ ] Includes code generated by an AI application
- [ ] Includes AI-generated tests (NOTE: AI written tests should have a
docstring that includes `## WRITTEN BY AI ##`)
diff --git a/src/guidellm/data/deserializers/deserializer.py b/src/guidellm/data/deserializers/deserializer.py
@@ -50,31 +50,32 @@ def deserialize(
         dataset = None
 
         if type_ is None:
+            errors = []
+            # Note: There is no priority order for the deserializers, so all deserializers
+            #  must be mutually exclusive to ensure deterministic behavior.
             for name, deserializer in cls.registry.items():
-                if name == "huggingface":
-                    # Save Hugging Face til the end since it is a catch-all.
-                    continue
-
                 deserializer_fn: DatasetDeserializer = (
                     deserializer() if isinstance(deserializer, type) else deserializer
                 )
 
-                with contextlib.suppress(DataNotSupportedError):
-                    dataset = deserializer_fn(
-                        data=data,
-                        processor_factory=processor_factory,
-                        random_seed=random_seed,
-                        **data_kwargs,
-                    )
-
-            if dataset is None:
-                deserializer_fn = cls.get_registered_object("huggingface")()
-                dataset = deserializer_fn(
-                    data=data,
-                    processor_factory=processor_factory,
-                    random_seed=random_seed,
-                    **data_kwargs,
-                )
+                try:
+                    with contextlib.suppress(DataNotSupportedError):
+                        dataset = deserializer_fn(
+                            data=data,
+                            processor_factory=processor_factory,
+                            random_seed=random_seed,
+                            **data_kwargs,
+                        )
+                except Exception as e:
+                    errors.append(e)
+
+                if dataset is not None:
+                    break # Found one that works. Continuing could overwrite it.
+
+            if dataset is None and len(errors) > 0:
+                raise DataNotSupportedError(f"data deserialization failed; {len(errors)} errors occurred while "
+                                            f"attempting to deserialize data {data}: {errors}")
+
         elif deserializer := cls.get_registered_object(type_) is not None:
             deserializer_fn: DatasetDeserializer = (
                 deserializer() if isinstance(deserializer, type) else deserializer