
Conversation

@mikeknep (Contributor) commented Nov 7, 2025

Adds support for registering multiple seed dataset sources.

engine changes

  • Renames the DataStore suffix to Source, e.g. LocalSeedDatasetSource
  • Sources can be given names
  • Adds a registry for sources. This already existed in the microservice but is moved into the library; it is a collection of sources that functions very similarly to the existing ModelProviderRegistry
  • Adds a "repository" for sources (open to a better name here). This object is similar to the ModelRegistry: it takes the registry (which is just the configuration of seed sources) and turns it into objects with the actual ability to fetch and use datasets. We stick this on the ResourceProvider and pass it to the generators that need it, in place of the previous singular "datastore" object. A rough sketch of the shape is below.
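
Illustrative sketch of how the registry and repository relate; names and signatures approximate this PR rather than matching the final diff (in the real code the registry is a pydantic model and the repository lives on the ResourceProvider):

from abc import ABC, abstractmethod

import duckdb


class SeedDatasetSource(ABC):
    """A named place seed datasets can be fetched from (local disk, HF Hub, etc.)."""

    name: str

    @abstractmethod
    def create_duckdb_connection(self) -> duckdb.DuckDBPyConnection: ...


class SeedDatasetSourceRegistry:
    """Configuration only: the set of registered sources plus an optional default name."""

    def __init__(self, sources: list[SeedDatasetSource], default: str | None = None) -> None:
        self.sources = sources
        self.default = default


class SeedDatasetRepository:
    """Resolves registry configuration into objects that can actually fetch datasets."""

    def __init__(self, registry: SeedDatasetSourceRegistry) -> None:
        self._sources = {source.name: source for source in registry.sources}
        self._default = registry.default

    def create_duckdb_connection(self, source_name: str | None = None) -> duckdb.DuckDBPyConnection:
        name = source_name or self._default
        if name not in self._sources:
            raise KeyError(f"No seed dataset source named {name!r} registered")
        return self._sources[name].create_duckdb_connection()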

config changes

  • The main API change is adding an optional source field to the SeedConfig, so that you can specify "the dataset should be fetched from this particular source" (see the sketch after this list)
  • Similar rename away from "Datastore", e.g. we now have HfHubSeedDatasetReference
  • Moved most of datastore.py into seed.py, where it's predominantly used.
    • The upload_to_hf_hub method is only used by the NMP client's upload_seed_dataset; I'm going to move it there.
    • Drop DatastoreSettings and just store the SeedDatasetReferenceT on the BuilderConfig
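
Illustrative sketch of the config-side change (field names are approximate and other fields are omitted):

from pathlib import Path
from typing import Optional

from pydantic import BaseModel


class SeedConfig(BaseModel):
    # Existing field: path or identifier of the seed dataset.
    dataset: str | Path
    # New in this PR: optional name of a pre-registered seed dataset source.
    # When omitted, the default source (if one is configured) is used.
    source: Optional[str] = None


config = SeedConfig(dataset="seeds/train.parquet", source="local-files")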

github-actions bot (Contributor) commented Nov 7, 2025

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@mikeknep (Contributor Author) commented Nov 7, 2025

I have read the DCO document and I hereby sign the DCO.

Comment on lines +112 to +113
# TODO: Should this just log a warning and recommend re-running with_seed_dataset, or raise?
raise BuilderConfigurationError("🛑 Found seed_config without seed_dataset_reference.")
Contributor

I was confused because I didn't realize seed_dataset_reference is on the builder config. My vote would be to throw a warning instead. It would probably be helpful to mention in the message that it lives on the builder config.

Contributor Author

Yeah, seed_dataset_reference replaces datastore_settings in this PR. Today we do raise an error in the equivalent scenario (datastore_settings missing from the builder config), which is why I raise here. Logging and moving on would be more forgiving: you get "most of" the config builder rehydrated and only need to call with_seed_dataset again explicitly. However:

  • what if they forget/ignore the warning and just proceed?
  • will they be able to run with_seed_dataset themselves after the fact? How would they know what arguments to pass?

Hard to tell if warn-but-not-raise here is helpful, or a foot-gun.

Comment on lines +163 to +165
def get_column_names(self) -> list[str]:
file_type = Path(self.dataset).suffix.lower()[1:]
return _get_file_column_names(self.dataset, file_type)
Contributor

loving having these here on the reference objects!

Comment on lines +84 to +85
source: Optional source name if you are running in a context with pre-registered, named
sources from which seed datasets can be used.
Contributor

I don't have a better alternative at the moment, but this feels weird, which I know you know. "source" might be a bit confusing too. Somehow the argument name needs to clearly refer to an identifier / key that maps on to one of a few options.

dataset: str
endpoint: str = "https://huggingface.co"
token: Optional[str] = None
source_name: Optional[str] = None
Contributor

probably okay for this to be the same arg name as whatever we come up with in the SeedConfig?

matching_files = sorted(file_path.parent.glob(file_path.name))
if not matching_files:
raise InvalidFilePathError(f"🛑 No files found matching pattern: {str(file_path)!r}")
logger.debug(f"0️⃣Using the first matching file in {str(file_path)!r} to determine column names in seed dataset")
Contributor

I think you copied this directly, but what's that emoji mean? Also, can you add a space?

Contributor Author

Weird copy/paste error, will fix

Comment on lines -110 to -111
if self.resource_provider.model_registry is None:
raise DatasetProfilerConfigurationError("Model registry is required for column profiler configs")
Contributor

Why did this get dropped? Redundant or something?

Contributor Author

No longer needed since the attributes of the ResourceProvider are now required

@functools.cached_property
def duckdb_conn(self) -> duckdb.DuckDBPyConnection:
return self.resource_provider.datastore.create_duckdb_connection()
return self.resource_provider.seed_dataset_repository.create_duckdb_connection(self.config.source)
Contributor

Does this work? create_duckdb_connection doesn't take an argument?

Comment on lines +24 to +26
blob_storage: ManagedBlobStorage
seed_dataset_repository: SeedDatasetRepository
model_registry: ModelRegistry
Contributor

I like requiring all these now that I think about it. Can we remove the need to specify required resources in the base task object now?

Contributor Author

Yes. I didn't go digging too far to find where to do that, but I should clean that up. As noted in a separate comment, I was able to remove at least one check for "if None".

Comment on lines +21 to +22
class MalformedFileIdError(Exception):
"""Raised when file_id format is invalid."""
Contributor

Following the repo pattern, can we put this in engine/resources/errors.py?

def get_dataset_uri(self, file_id: str) -> str: ...


class LocalSeedDatasetSource(SeedDatasetSource):
Contributor

references and sources are tangled in my mind

def create_duckdb_connection(self) -> duckdb.DuckDBPyConnection: ...

@abstractmethod
def get_dataset_uri(self, file_id: str) -> str: ...
Contributor

Do we think "file_id" as a term is the best to describe our new implementation? Feels like an artifact from the before times.

Comment on lines +100 to +102
class SeedDatasetSourceRegistry(BaseModel):
sources: list[SeedDatasetSourceT]
default: str | None = None
Contributor

Seeing this here makes me wonder if any of this can / should live in the service instead. It seems like the library should just implement the local and HF interfaces. Storing and fetching information about different sources feels more like a service responsibility.

Contributor

I am thinking more about the different datastore options rather than HF vs local. I think HF vs local is all the library needs, which makes all of this tooling for managing different sources feel super extra.


DEFAULT_BUFFER_SIZE = 1000

DEFAULT_SECRET_RESOLVER = CompositeResolver(resolvers=[EnvironmentResolver(), PlaintextResolver()])
Contributor

NICE

Comment on lines +275 to +276
def _create_seed_dataset_source_registry(
self, config_builder: DataDesignerConfigBuilder
Contributor

looks like this can be an isolated helper function or your favorite staticmethod

Comment on lines +163 to +164
def create_duckdb_connection(self, source_name: str | None) -> duckdb.DuckDBPyConnection:
return self._get_resolved_source(source_name).create_duckdb_connection()
Contributor

ah, there it is. maybe something like create_duckdb_connection_with_source would be a helpful distinction.

Comment on lines +84 to +85
source: Optional source name if you are running in a context with pre-registered, named
sources from which seed datasets can be used.
Contributor

Is this ever a case when running only in library mode?

Contributor

Maybe source_name? As is, I think both dataset and source could be interpreted as the same thing.

Comment on lines +176 to +180
def get_dataset(self) -> str:
return self.dataset

def get_source(self) -> Optional[str]:
return self.source_name
@nabinchha (Contributor) Nov 10, 2025

nit: these getters are probably not needed since the things they return are public attributes themselves.

Contributor Author

For a while I had renamed the LocalSeedDatasetReference dataset attribute to path, and the getters allowed a consistent interface for clients without forcing the objects themselves to share attribute names. At some point I reverted that attribute name back to dataset, thinking "eh, dataset as the attribute name is good enough and minimizes the diff." Do you have a preference? I don't feel strongly about it.


dataset: str
endpoint: str = "https://huggingface.co"
token: Optional[str] = None
Contributor

Is this also an opportunity to use SecretStr if this will be provided in plain text?
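
A sketch of what that could look like with pydantic's SecretStr (field names copied from the snippet above; purely illustrative):

from typing import Optional

from pydantic import BaseModel, SecretStr


class HfHubSeedDatasetReference(BaseModel):
    dataset: str
    endpoint: str = "https://huggingface.co"
    # SecretStr keeps the raw token out of reprs and logs.
    token: Optional[SecretStr] = None


ref = HfHubSeedDatasetReference(dataset="org/repo/train.parquet", token="hf_xxx")
print(ref.token)                     # **********
print(ref.token.get_secret_value())  # hf_xxx, only when explicitly requested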

Comment on lines +183 to +188
filename = self.dataset.split("/")[-1]
repo_id = "/".join(self.dataset.split("/")[:-1])

file_type = filename.split(".")[-1]
if f".{file_type}" not in VALID_DATASET_FILE_EXTENSIONS:
raise InvalidFileFormatError(f"🛑 Unsupported file type: {filename!r}")
Contributor

Maybe for the next PR: it would be good to support a path/to/*.parquet glob pattern for the HF source as well. Perhaps we can add a feature request. A rough idea is sketched below.
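
A rough sketch of what glob support could look like using huggingface_hub's file listing (feature-request material, not something in this PR; the helper name is made up):

import fnmatch

from huggingface_hub import HfApi


def resolve_hf_glob(repo_id: str, pattern: str, token: str | None = None) -> list[str]:
    """Return files in an HF repo matching a glob pattern like 'path/to/*.parquet'."""
    files = HfApi(token=token).list_repo_files(repo_id=repo_id)
    return sorted(fnmatch.filter(files, pattern))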

if self.token is not None:
# Check if the value is an env var name and if so resolve it,
# otherwise assume the value is the raw token string in plain text
_token = os.environ.get(self.token, self.token)
Contributor

super nit: It might look silly for this use case, but since we have EnvironmentResolver I'd recommend we use that resource, something like the sketch below. In case we ever want to tweak how that resolution works, we'd have one place to make the change. Just my 2¢, feel free to ignore.
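
Roughly along these lines; the import path and resolve() method name are guesses, so treat this as a sketch of the idea rather than the actual secret_resolver interface:

# Assumed import path and resolve() method; the real API may differ.
from data_designer.engine.secret_resolver import (
    CompositeResolver,
    EnvironmentResolver,
    PlaintextResolver,
)

resolver = CompositeResolver(resolvers=[EnvironmentResolver(), PlaintextResolver()])
_token = resolver.resolve(self.token) if self.token is not None else None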

Contributor

Ah, I see this is in config...

Comment on lines +110 to +111
if dd_config.seed_config:
if (seed_dataset_reference := builder_config.seed_dataset_reference) is None:
Contributor

Could it be confusing to see the dataset: str | Path prop both at the ConfigBuilder and DataDesignerConfig.SeedConfig level?

Comment on lines -110 to -111
if self.resource_provider.model_registry is None:
raise DatasetProfilerConfigurationError("Model registry is required for column profiler configs")
Contributor

Just curious if this is intentional?

from data_designer.engine.secret_resolver import SecretResolver
from data_designer.logging import quiet_noisy_logger

quiet_noisy_logger("httpx")
Contributor

This is already done in

"""Extract repo_id and filename from identifier."""
parts = identifier.split("/", 2)
if len(parts) < 3:
raise MalformedFileIdError(
Contributor

If this is specific to HF, MalformedHuggingFaceFileIdError is clearer?

try:
return self._sources_dict[name]
except KeyError:
raise UnknownSeedDatasetSourceError(f"No seed dataset source named {name!r} registered")
Contributor

It might be helpful to show the possible valid names here, something like the sketch below.
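
For example (just a sketch of the idea; this assumes the registered sources live in self._sources_dict as in the snippet above):

try:
    return self._sources_dict[name]
except KeyError:
    known = ", ".join(sorted(self._sources_dict)) or "<none>"
    raise UnknownSeedDatasetSourceError(
        f"No seed dataset source named {name!r} registered. Known sources: {known}"
    )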

@mikeknep (Contributor Author)

Closing this, we're going to take a different approach to this problem.

@mikeknep mikeknep closed this Nov 13, 2025