feat: Support multiple seed dataset sources #21
Conversation
All contributors have signed the DCO ✍️ ✅

I have read the DCO document and I hereby sign the DCO.
    # TODO: Should this just log a warning and recommend re-running with_seed_dataset, or raise?
    raise BuilderConfigurationError("🛑 Found seed_config without seed_dataset_reference.")
I was confused because I didn't realize seed_dataset_reference is on the builder config. My vote would be to throw a warning instead. Probably would be helpful to mention that it is on the builder config in the message.
Yeah, seed_dataset_reference replaces datastore_settings in this PR. We do today raise an error in the equivalent scenario (datastore_settings not existing on the builder config), which is why I raise here. Logging and moving on would be more forgiving—you get "most of" the config builder rehydrated and only need to call with_seed_dataset again explicitly. However:
- what if they forget/ignore the warning and just proceed?
- will they be able to run with_seed_dataset themselves after the fact? How would they know what arguments to pass?

Hard to tell if warn-but-not-raise here is helpful, or a foot-gun.
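The warn-vs-raise trade-off being debated could be sketched like this (the helper name `check_seed_reference` and the `strict` flag are hypothetical, not from the PR):

```python
import warnings


class BuilderConfigurationError(Exception):
    """Raised when the builder config is missing required settings."""


def check_seed_reference(seed_config, seed_dataset_reference, strict=True):
    """Validate that a seed_config is paired with a seed_dataset_reference.

    In strict mode, raise (matching the current datastore_settings behavior);
    otherwise, warn and return False so the caller can skip seed rehydration
    and re-run with_seed_dataset explicitly later.
    """
    if seed_config is not None and seed_dataset_reference is None:
        message = (
            "Found seed_config without seed_dataset_reference on the builder "
            "config; re-run with_seed_dataset to register the seed dataset."
        )
        if strict:
            raise BuilderConfigurationError(message)
        warnings.warn(message)
        return False
    return True
```

Mentioning the builder config in the message, as suggested, addresses the "how would they know what to fix" concern either way.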
    def get_column_names(self) -> list[str]:
        file_type = Path(self.dataset).suffix.lower()[1:]
        return _get_file_column_names(self.dataset, file_type)
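As a standalone illustration of the suffix handling in the snippet above (the `file_type_of` wrapper is hypothetical):

```python
from pathlib import Path


def file_type_of(dataset: str) -> str:
    # Path.suffix includes the leading dot (".parquet"), so slicing with
    # [1:] yields the bare, lowercased extension ("parquet").
    return Path(dataset).suffix.lower()[1:]
```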
loving having these here on the reference objects!
    source: Optional source name if you are running in a context with pre-registered, named
        sources from which seed datasets can be used.
I don't have a better alternative at the moment, but this feels weird, which I know you know. "source" might be a bit confusing too. Somehow the argument name needs to clearly refer to an identifier / key that maps onto one of a few options.
    dataset: str
    endpoint: str = "https://huggingface.co"
    token: Optional[str] = None
    source_name: Optional[str] = None
probably okay for this to be the same arg name as whatever we come up with in the SeedConfig?
    matching_files = sorted(file_path.parent.glob(file_path.name))
    if not matching_files:
        raise InvalidFilePathError(f"🛑 No files found matching pattern: {str(file_path)!r}")
    logger.debug(f"0️⃣Using the first matching file in {str(file_path)!r} to determine column names in seed dataset")
I think you copied this directly, but what does that emoji mean? Also, can you add a space?
Weird copy/paste error, will fix
    if self.resource_provider.model_registry is None:
        raise DatasetProfilerConfigurationError("Model registry is required for column profiler configs")
Why did this get dropped? Redundant or something?
No longer needed since the attributes of the ResourceProvider are now required
    @functools.cached_property
    def duckdb_conn(self) -> duckdb.DuckDBPyConnection:
-       return self.resource_provider.datastore.create_duckdb_connection()
+       return self.resource_provider.seed_dataset_repository.create_duckdb_connection(self.config.source)
Does this work? create_duckdb_connection doesn't take an argument?
    blob_storage: ManagedBlobStorage
    seed_dataset_repository: SeedDatasetRepository
    model_registry: ModelRegistry
I like requiring all these now that I think about it. Can we remove the need to specify required resources in the base task object now?
Yes. I didn't go digging too far to find where to do that, but should clean that up. As noted in a separate comment I was able to remove at least one check for "if None"
    class MalformedFileIdError(Exception):
        """Raised when file_id format is invalid."""
Following the repo pattern, can we put this in engine/resources/errors.py?
    def get_dataset_uri(self, file_id: str) -> str: ...

    class LocalSeedDatasetSource(SeedDatasetSource):
references and sources are tangled in my mind
    def create_duckdb_connection(self) -> duckdb.DuckDBPyConnection: ...

    @abstractmethod
    def get_dataset_uri(self, file_id: str) -> str: ...
Do we think "file_id" is the best term to describe our new implementation? Feels like an artifact from the before times.
    class SeedDatasetSourceRegistry(BaseModel):
        sources: list[SeedDatasetSourceT]
        default: str | None = None
Seeing this here makes me wonder if any of this can / should live in the service instead. It seems like the library should just implement the local and HF interfaces. Storing and fetching information about different sources feels more like a service responsibility.
I am thinking more about the different datastore options rather than HF vs local. I think HF vs local is all the library needs, which makes all of this tooling for managing different sources feel super extra.
    DEFAULT_BUFFER_SIZE = 1000

    DEFAULT_SECRET_RESOLVER = CompositeResolver(resolvers=[EnvironmentResolver(), PlaintextResolver()])
NICE
    def _create_seed_dataset_source_registry(
        self, config_builder: DataDesignerConfigBuilder
looks like this can be an isolated helper function or your favorite staticmethod
    def create_duckdb_connection(self, source_name: str | None) -> duckdb.DuckDBPyConnection:
        return self._get_resolved_source(source_name).create_duckdb_connection()
ah, there it is. maybe something like create_duckdb_connection_with_source would be a helpful distinction.
    source: Optional source name if you are running in a context with pre-registered, named
        sources from which seed datasets can be used.
Is this ever a case when running only in library mode?
Maybe source_name? Just as is, I think both dataset and source could be interpreted as the same thing.
    def get_dataset(self) -> str:
        return self.dataset

    def get_source(self) -> Optional[str]:
        return self.source_name
nit: these getters are probably not needed since the things they return are public attributes themselves.
For a while I had renamed the LocalSeedDatasetReference dataset attribute to path, and the introduced getters allowed a consistent interface for clients without forcing the objects themselves to share attribute names. At some point I reverted that attribute name back to dataset, thinking "eh, dataset as the attribute name is good enough and minimizes the diff." Do you have a preference? I don't feel strongly about it.
    dataset: str
    endpoint: str = "https://huggingface.co"
    token: Optional[str] = None
Is this also an opportunity to use SecretStr if this will be provided in plain text?
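For illustration, this is roughly the behavior SecretStr provides: the value is hidden from repr/str and must be unwrapped explicitly. A minimal stand-in sketch, not pydantic's actual class:

```python
class SecretStrSketch:
    """Minimal stand-in for pydantic's SecretStr: masks the value in repr/str."""

    def __init__(self, value: str):
        self._value = value

    def get_secret_value(self) -> str:
        # The only way to read the raw value is this explicit accessor,
        # so accidental logging of the model won't leak the token.
        return self._value

    def __repr__(self) -> str:
        return "SecretStr('**********')"

    __str__ = __repr__
```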
    filename = self.dataset.split("/")[-1]
    repo_id = "/".join(self.dataset.split("/")[:-1])

    file_type = filename.split(".")[-1]
    if f".{file_type}" not in VALID_DATASET_FILE_EXTENSIONS:
        raise InvalidFileFormatError(f"🛑 Unsupported file type: {filename!r}")
Maybe for the next PR. It would be good to support a path/to/*.parquet glob pattern for the HF source as well. Perhaps we can add a feature request.
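A rough sketch of what glob matching for the HF source could look like: filter a repo file listing against the pattern. The repo_files input is assumed to come from a listing call such as huggingface_hub's HfApi.list_repo_files; this helper itself is hypothetical:

```python
import fnmatch


def match_repo_files(repo_files: list[str], pattern: str) -> list[str]:
    """Filter a repo file listing against a glob pattern, sorted for
    deterministic "first matching file" behavior like the local source."""
    return sorted(f for f in repo_files if fnmatch.fnmatch(f, pattern))
```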
    if self.token is not None:
        # Check if the value is an env var name and if so resolve it,
        # otherwise assume the value is the raw token string in plain text
        _token = os.environ.get(self.token, self.token)
super nit: It might look silly for this use case, but since we have EnvironmentResolver I'd recommend we use that resource. In case we ever want to tweak the behavior of how that works, we'd have one place to make the change. Just my 2cs, feel free to ignore.
Ah, I see this is in config...
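Based on the DEFAULT_SECRET_RESOLVER composition visible elsewhere in the diff (CompositeResolver over EnvironmentResolver and PlaintextResolver), the resolver-based version of the env-var-or-plaintext logic might look like this sketch. The resolve interface shown here is assumed, not confirmed from the source:

```python
import os
from typing import Optional


class EnvironmentResolver:
    """Treats the value as an environment variable name; None if unset."""

    def resolve(self, value: str) -> Optional[str]:
        return os.environ.get(value)


class PlaintextResolver:
    """Fallback: the value itself is the secret."""

    def resolve(self, value: str) -> Optional[str]:
        return value


class CompositeResolver:
    """Tries each resolver in order, returning the first non-None result."""

    def __init__(self, resolvers):
        self.resolvers = resolvers

    def resolve(self, value: str) -> Optional[str]:
        for resolver in self.resolvers:
            resolved = resolver.resolve(value)
            if resolved is not None:
                return resolved
        return None
```

With this shape, `os.environ.get(self.token, self.token)` becomes `DEFAULT_SECRET_RESOLVER.resolve(self.token)`, so any future change to resolution order happens in one place.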
    if dd_config.seed_config:
        if (seed_dataset_reference := builder_config.seed_dataset_reference) is None:
Could it be confusing to see the dataset: str | Path prop both at the ConfigBuilder and DataDesignerConfig.SeedConfig level?
    if self.resource_provider.model_registry is None:
        raise DatasetProfilerConfigurationError("Model registry is required for column profiler configs")
Just curious if this is intentional?
    from data_designer.engine.secret_resolver import SecretResolver
    from data_designer.logging import quiet_noisy_logger

    quiet_noisy_logger("httpx")
This is already done in:
    quiet_noisy_logger("httpx")
| """Extract repo_id and filename from identifier.""" | ||
| parts = identifier.split("/", 2) | ||
| if len(parts) < 3: | ||
| raise MalformedFileIdError( |
If this is specific to HF, would MalformedHuggingFaceFileIdError be clearer?
    try:
        return self._sources_dict[name]
    except KeyError:
        raise UnknownSeedDatasetSourceError(f"No seed dataset source named {name!r} registered")
It might be helpful to show the possible valid names here?
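A sketch of the suggested error message, listing the registered names (the get_source helper shown here is a hypothetical stand-in for the repository's lookup):

```python
class UnknownSeedDatasetSourceError(Exception):
    """Raised when a requested seed dataset source is not registered."""


def get_source(sources_dict: dict, name: str):
    """Look up a source by name, listing the registered names on failure."""
    try:
        return sources_dict[name]
    except KeyError:
        known = ", ".join(sorted(sources_dict)) or "<none>"
        raise UnknownSeedDatasetSourceError(
            f"No seed dataset source named {name!r} registered; "
            f"known sources: {known}"
        ) from None
```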
Closing this, we're going to take a different approach to this problem.
Adds support for registering multiple seed dataset sources.
engine changes
- Changes the DataStore suffix to Source, e.g. LocalSeedDatasetSource
- Adds a repository built from the source registry, in the spirit of ModelProviderRegistry / ModelRegistry. It takes the registry (which is just configuration of seed sources) and turns it into objects with the actual ability to fetch and use datasets. We stick this on the ResourceProvider to pass around to the generators that need it, in place of the previous singular "datastore" object that we'd pass around

config changes
- Adds a source field to the SeedConfig, so that you can specify "the dataset should be fetched from this particular source"
- Adds HfHubSeedDatasetReference
- Moves datastore.py into seed.py, where it's predominantly used. The upload_to_hf_hub method is only used by the NMP client's upload_seed_dataset; I'm going to move it there.
- Removes DatastoreSettings and just stores the SeedDatasetReferenceT on the BuilderConfig