feat: Support multiple seed dataset sources #21
Conversation
All contributors have signed the DCO ✍️ ✅

I have read the DCO document and I hereby sign the DCO.
    # TODO: Should this just log a warning and recommend re-running with_seed_dataset, or raise?
    raise BuilderConfigurationError("🛑 Found seed_config without seed_dataset_reference.")
I was confused because I didn't realize seed_dataset_reference is on the builder config. My vote would be to throw a warning instead. Probably would be helpful to mention that it is on the builder config in the message.
Yeah, seed_dataset_reference replaces datastore_settings in this PR. We do today raise an error in the equivalent scenario (datastore_settings not existing on the builder config), which is why I raise here. Logging and moving on would be more forgiving—you get "most of" the config builder rehydrated and only need to call with_seed_dataset again explicitly. However:
- what if they forget/ignore the warning and just proceed?
- will they be able to run with_seed_dataset themselves after the fact? How would they know what arguments to pass?

Hard to tell if warn-but-not-raise here is helpful, or a foot-gun.
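The warn-vs-raise trade-off being debated could be sketched like this (the helper name `check_seed_reference` and the `strict` flag are hypothetical, not from the PR):

```python
import warnings


class BuilderConfigurationError(Exception):
    """Raised when the builder config is missing required settings."""


def check_seed_reference(seed_config, seed_dataset_reference, strict=True):
    """Validate that a seed_config is paired with a seed_dataset_reference.

    In strict mode, raise (matching the current datastore_settings behavior);
    otherwise, warn and return False so the caller can skip seed rehydration
    and re-run with_seed_dataset explicitly later.
    """
    if seed_config is not None and seed_dataset_reference is None:
        message = (
            "Found seed_config without seed_dataset_reference on the builder "
            "config; re-run with_seed_dataset to register the seed dataset."
        )
        if strict:
            raise BuilderConfigurationError(message)
        warnings.warn(message)
        return False
    return True
```

Mentioning the builder config in the message, as suggested, addresses the "how would they know what to fix" concern either way.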
    def get_column_names(self) -> list[str]:
        file_type = Path(self.dataset).suffix.lower()[1:]
        return _get_file_column_names(self.dataset, file_type)
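As a standalone illustration of the suffix handling in the snippet above (the `file_type_of` wrapper is hypothetical):

```python
from pathlib import Path


def file_type_of(dataset: str) -> str:
    # Path.suffix includes the leading dot (".parquet"), so slicing with
    # [1:] yields the bare, lowercased extension ("parquet").
    return Path(dataset).suffix.lower()[1:]
```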
loving having these here on the reference objects!
    source: Optional source name if you are running in a context with pre-registered, named
        sources from which seed datasets can be used.
I don't have a better alternative at the moment, but this feels weird, which I know you know. "source" might be a bit confusing too. Somehow the argument name needs to clearly refer to an identifier / key that maps onto one of a few options.
    dataset: str
    endpoint: str = "https://huggingface.co"
    token: Optional[str] = None
    source_name: Optional[str] = None
probably okay for this to be the same arg name as whatever we come up with in the SeedConfig?
    matching_files = sorted(file_path.parent.glob(file_path.name))
    if not matching_files:
        raise InvalidFilePathError(f"🛑 No files found matching pattern: {str(file_path)!r}")
    logger.debug(f"0️⃣Using the first matching file in {str(file_path)!r} to determine column names in seed dataset")
I think you copied this directly, but what does that emoji mean? Also, can you add a space?
Weird copy/paste error, will fix
    if self.resource_provider.model_registry is None:
        raise DatasetProfilerConfigurationError("Model registry is required for column profiler configs")
Why did this get dropped? Redundant or something?
No longer needed since the attributes of the ResourceProvider are now required
    @functools.cached_property
    def duckdb_conn(self) -> duckdb.DuckDBPyConnection:
-       return self.resource_provider.datastore.create_duckdb_connection()
+       return self.resource_provider.seed_dataset_repository.create_duckdb_connection(self.config.source)
Does this work? create_duckdb_connection doesn't take an argument?
    blob_storage: ManagedBlobStorage
    seed_dataset_repository: SeedDatasetRepository
    model_registry: ModelRegistry
I like requiring all these now that I think about it. Can we remove the need to specify required resources in the base task object now?
Yes. I didn't go digging too far to find where to do that, but should clean that up. As noted in a separate comment I was able to remove at least one check for "if None"
    class MalformedFileIdError(Exception):
        """Raised when file_id format is invalid."""
Following the repo pattern, can we put this in engine/resources/errors.py?
    def get_dataset_uri(self, file_id: str) -> str: ...

    class LocalSeedDatasetSource(SeedDatasetSource):
references and sources are tangled in my mind
    def create_duckdb_connection(self) -> duckdb.DuckDBPyConnection: ...

    @abstractmethod
    def get_dataset_uri(self, file_id: str) -> str: ...
Do we think "file_id" is the best term to describe our new implementation? Feels like an artifact from the before times.
    class SeedDatasetSourceRegistry(BaseModel):
        sources: list[SeedDatasetSourceT]
        default: str | None = None
Seeing this here makes me wonder if any of this can / should live in the service instead. It seems like the library should just implement the local and HF interfaces. Storing and fetching information about different sources feels more like a service responsibility.
I am thinking more about the different datastore options rather than HF vs local. I think HF vs local is all the library needs, which makes all of this tooling for managing different sources feel super extra.
    DEFAULT_BUFFER_SIZE = 1000

    DEFAULT_SECRET_RESOLVER = CompositeResolver(resolvers=[EnvironmentResolver(), PlaintextResolver()])
NICE
    def _create_seed_dataset_source_registry(
        self, config_builder: DataDesignerConfigBuilder
looks like this can be an isolated helper function or your favorite staticmethod
    def create_duckdb_connection(self, source_name: str | None) -> duckdb.DuckDBPyConnection:
        return self._get_resolved_source(source_name).create_duckdb_connection()
ah, there it is. maybe something like create_duckdb_connection_with_source would be a helpful distinction.
    source: Optional source name if you are running in a context with pre-registered, named
        sources from which seed datasets can be used.
Is this ever a case when running only in library mode?
Maybe source_name? Just as is, I think both dataset and source could be interpreted as the same thing.
    def get_dataset(self) -> str:
        return self.dataset

    def get_source(self) -> Optional[str]:
        return self.source_name
nit: these getters are probably not needed since the things they return are public attributes themselves.
For a while I had renamed the LocalSeedDatasetReference dataset attribute to path, and the introduced getters allowed a consistent interface for clients without forcing the objects themselves to share attribute names. At some point I reverted that attribute name back to dataset, thinking "eh, dataset as the attribute name is good enough and minimizes the diff." Do you have a preference? I don't feel strongly about it.
    dataset: str
    endpoint: str = "https://huggingface.co"
    token: Optional[str] = None
Is this also an opportunity to use SecretStr if this will be provided in plain text?
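For illustration, this is roughly the behavior SecretStr provides: the value is hidden from repr/str and must be unwrapped explicitly. A minimal stand-in sketch, not pydantic's actual class:

```python
class SecretStrSketch:
    """Minimal stand-in for pydantic's SecretStr: masks the value in repr/str."""

    def __init__(self, value: str):
        self._value = value

    def get_secret_value(self) -> str:
        # The only way to read the raw value is this explicit accessor,
        # so accidental logging of the model won't leak the token.
        return self._value

    def __repr__(self) -> str:
        return "SecretStr('**********')"

    __str__ = __repr__
```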
    filename = self.dataset.split("/")[-1]
    repo_id = "/".join(self.dataset.split("/")[:-1])

    file_type = filename.split(".")[-1]
    if f".{file_type}" not in VALID_DATASET_FILE_EXTENSIONS:
        raise InvalidFileFormatError(f"🛑 Unsupported file type: {filename!r}")
Maybe for the next PR. It would be good to support a path/to/*.parquet glob pattern for the HF source as well. Perhaps we can add a feature request.
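A rough sketch of what glob matching for the HF source could look like: filter a repo file listing against the pattern. The repo_files input is assumed to come from a listing call such as huggingface_hub's HfApi.list_repo_files; this helper itself is hypothetical:

```python
import fnmatch


def match_repo_files(repo_files: list[str], pattern: str) -> list[str]:
    """Filter a repo file listing against a glob pattern, sorted for
    deterministic "first matching file" behavior like the local source."""
    return sorted(f for f in repo_files if fnmatch.fnmatch(f, pattern))
```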
    if self.token is not None:
        # Check if the value is an env var name and if so resolve it,
        # otherwise assume the value is the raw token string in plain text
        _token = os.environ.get(self.token, self.token)
super nit: It might look silly for this use case, but since we have EnvironmentResolver I'd recommend we use that resource. In case we ever want to tweak the behavior of how that works, we'd have one place to make the change. Just my 2cs, feel free to ignore.
Ah, I see this is in config...
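Based on the DEFAULT_SECRET_RESOLVER composition visible elsewhere in the diff (CompositeResolver over EnvironmentResolver and PlaintextResolver), the resolver-based version of the env-var-or-plaintext logic might look like this sketch. The resolve interface shown here is assumed, not confirmed from the source:

```python
import os
from typing import Optional


class EnvironmentResolver:
    """Treats the value as an environment variable name; None if unset."""

    def resolve(self, value: str) -> Optional[str]:
        return os.environ.get(value)


class PlaintextResolver:
    """Fallback: the value itself is the secret."""

    def resolve(self, value: str) -> Optional[str]:
        return value


class CompositeResolver:
    """Tries each resolver in order, returning the first non-None result."""

    def __init__(self, resolvers):
        self.resolvers = resolvers

    def resolve(self, value: str) -> Optional[str]:
        for resolver in self.resolvers:
            resolved = resolver.resolve(value)
            if resolved is not None:
                return resolved
        return None
```

With this shape, `os.environ.get(self.token, self.token)` becomes `DEFAULT_SECRET_RESOLVER.resolve(self.token)`, so any future change to resolution order happens in one place.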
    if dd_config.seed_config:
        if (seed_dataset_reference := builder_config.seed_dataset_reference) is None:
Could it be confusing to see the dataset: str | Path prop both at the ConfigBuilder and DataDesignerConfig.SeedConfig level?
    if self.resource_provider.model_registry is None:
        raise DatasetProfilerConfigurationError("Model registry is required for column profiler configs")
Just curious if this is intentional?
    from data_designer.engine.secret_resolver import SecretResolver
    from data_designer.logging import quiet_noisy_logger

    quiet_noisy_logger("httpx")
This is already done in:
    quiet_noisy_logger("httpx")
| """Extract repo_id and filename from identifier.""" | ||
| parts = identifier.split("/", 2) | ||
| if len(parts) < 3: | ||
| raise MalformedFileIdError( |
If this is specific to HF, would MalformedHuggingFaceFileIdError be clearer?
    try:
        return self._sources_dict[name]
    except KeyError:
        raise UnknownSeedDatasetSourceError(f"No seed dataset source named {name!r} registered")
It might be helpful to show the possible valid names here?
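A sketch of the suggested error message, listing the registered names (the get_source helper shown here is a hypothetical stand-in for the repository's lookup):

```python
class UnknownSeedDatasetSourceError(Exception):
    """Raised when a requested seed dataset source is not registered."""


def get_source(sources_dict: dict, name: str):
    """Look up a source by name, listing the registered names on failure."""
    try:
        return sources_dict[name]
    except KeyError:
        known = ", ".join(sorted(sources_dict)) or "<none>"
        raise UnknownSeedDatasetSourceError(
            f"No seed dataset source named {name!r} registered; "
            f"known sources: {known}"
        ) from None
```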
Closing this, we're going to take a different approach to this problem.
Adds support for registering multiple seed dataset sources.
engine changes
- Changes the DataStore suffix to Source, e.g. LocalSeedDatasetSource
- Adds a repository built from the source registry, in the spirit of ModelProviderRegistry / ModelRegistry. It takes the registry (which is just configuration of seed sources) and turns it into objects with the actual ability to fetch and use datasets. We stick this on the ResourceProvider to pass around to the generators that need it, in place of the previous singular "datastore" object that we'd pass around

config changes
- Adds a source field to the SeedConfig, so that you can specify "the dataset should be fetched from this particular source"
- Adds HfHubSeedDatasetReference
- Moves datastore.py into seed.py, where it's predominantly used. The upload_to_hf_hub method is only used by the NMP client's upload_seed_dataset; I'm going to move it there.
- Removes DatastoreSettings and just stores the SeedDatasetReferenceT on the BuilderConfig