Skip to content

Conversation

@anupkalburgi
Copy link
Collaborator

Changes

Linked issues

Resolves #..

Requirements

  • manually tested
  • updated documentation
  • updated demos
  • updated tests

@codecov
Copy link

codecov bot commented Nov 6, 2025

Codecov Report

❌ Patch coverage is 53.44262% with 142 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (master@3aca7f3). Learn more about missing BASE report.
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
dbldatagen/spec/generator_spec_impl.py 0.00% 111 Missing ⚠️
dbldatagen/spec/generator_spec.py 84.21% 13 Missing and 2 partials ⚠️
dbldatagen/spec/__init__.py 28.57% 10 Missing ⚠️
dbldatagen/spec/output_targets.py 85.18% 3 Missing and 1 partial ⚠️
dbldatagen/spec/compat.py 60.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master     #372   +/-   ##
=========================================
  Coverage          ?   89.41%           
=========================================
  Files             ?       53           
  Lines             ?     4450           
  Branches          ?      809           
=========================================
  Hits              ?     3979           
  Misses            ?      325           
  Partials          ?      146           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@anupkalburgi anupkalburgi marked this pull request as ready for review December 2, 2025 14:31
@anupkalburgi anupkalburgi requested review from a team as code owners December 2, 2025 14:31
@anupkalburgi anupkalburgi requested review from nfx and renardeinside and removed request for a team December 2, 2025 14:31
@ghanse ghanse requested a review from Copilot December 4, 2025 18:04
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new Pydantic-based specification API for dbldatagen, providing a declarative, type-safe approach to synthetic data generation. The changes add comprehensive validation, test coverage, and example specifications while updating documentation and build configuration to support both Pydantic V1 and V2.

Key Changes:

  • New spec-based API with Pydantic models for defining data generation configurations
  • Comprehensive validation framework with error collection and reporting
  • Pydantic V1/V2 compatibility layer for broad environment support

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/test_specs.py Comprehensive test suite for ValidationResult, ColumnDefinition, DatagenSpec validation, and target configurations
tests/test_datasets_with_specs.py Tests for Pydantic model validation with BasicUser and BasicStockTicker specifications
tests/test_datagen_specs.py Tests for DatagenSpec creation, validation, and generator options
pyproject.toml Added ipython dependency, test matrix for Pydantic versions, and disabled warn_unused_ignores
makefile Updated to use Pydantic version-specific test environments and removed .venv target
examples/datagen_from_specs/basic_user_datagen_spec.py Example DatagenSpec factory for generating basic user data with pre-configured specs
examples/datagen_from_specs/basic_stock_ticker_datagen_spec.py Complex example with OHLC stock data generation including time-series and volatility modeling
examples/datagen_from_specs/README.md Documentation for Pydantic-based dataset specifications with usage examples
dbldatagen/spec/validation.py ValidationResult class for collecting and reporting validation errors and warnings
dbldatagen/spec/output_targets.py Pydantic models for UCSchemaTarget and FilePathTarget output destinations
dbldatagen/spec/generator_spec_impl.py Generator class implementing the spec-to-DataFrame conversion logic
dbldatagen/spec/generator_spec.py Core DatagenSpec and TableDefinition models with comprehensive validation
dbldatagen/spec/compat.py Pydantic V1/V2 compatibility layer enabling cross-version support
dbldatagen/spec/column_spec.py ColumnDefinition model with validation for primary keys and constraints
dbldatagen/spec/init.py Module initialization with lazy imports to avoid heavy dependencies
README.md Updated feature list and formatting to mention new Pydantic-based API
CHANGELOG.md Added entry for Pydantic-based specification API feature

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 104 to 112
conflicting_opts_for_pk = [
"distribution", "template", "dataRange", "random", "omit",
"min", "max", "uniqueValues", "values", "expr"
]

for opt_key in conflicting_opts_for_pk:
if opt_key in kwargs:
logger.warning(
f"Primary key '{col_name}': Option '{opt_key}' may be ignored")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we disallow min, max, and dataRange for primary keys?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the primary key has a min, max it might end up conflicting with num_of_rows. Eg if the number of rows is 100, and min is 10, max is 50.


# Process each column
for col_def in table_spec.columns:
kwargs = self._columnSpecToDatagenColumnSpec(col_def)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use column_options instead to avoid any conflicts.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is conflict you are referring to here?

Comment on lines +263 to +272
# Write data based on destination type
if isinstance(output_destination, FilePathTarget):
output_path = posixpath.join(output_destination.base_path, table_name)
df.write.format(output_destination.output_format).mode("overwrite").save(output_path)
logger.info(f"Wrote table '{table_name}' to file path: {output_path}")

elif isinstance(output_destination, UCSchemaTarget):
output_table = f"{output_destination.catalog}.{output_destination.schema_}.{table_name}"
df.write.mode("overwrite").saveAsTable(output_table)
logger.info(f"Wrote table '{table_name}' to Unity Catalog: {output_table}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use utils.write_data_to_output for this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method in utils relies on OutputDataset. See the other comments above.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have OutputDataset in config.py. I think we can reuse it here instead of creating new classes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo, The output target in spec would be better suited for the spec as it does validations for UC and file paths volumes. This config.OutputDataset is generic, and getting specific errors would mean essientially copying this over to the new spec folder or other way round. Is there something I am missing here ?

ghanse and others added 2 commits December 10, 2025 14:36
* added use of ABC to mark TextGenerator as abstract

* Lint text generators module

* Add persistence methods

* Add tests and docs; Update PR template

* Update hatch installation for push action

* Refactor

* Update method names and signatures

---------

Co-authored-by: ronanstokes-db <[email protected]>
Co-authored-by: Ronan Stokes <[email protected]>
brettaulbaugh-db and others added 6 commits December 10, 2025 14:36
* Add files via upload

Adding an example notebook on how to leverage DBLDATAGEN in the creation of oil and gas datasets/. Currently placing this in the examples/notebooks file

* Update and rename [DBLDATAGEN]Oil&GasWellHeaderDailyProductionTypeCurve.py to oil_gas_data_generation.py

implemented changes Greg suggested aorund function formatting

* Update oil_gas_data_generation.py

added documentaiton link and install clarification

---------

Co-authored-by: Greg Hansen <[email protected]>
* Add files via upload

A notebook that generates synthetic log in data for various gamers. The gamers can have multiple devices that are consistent across them and are located in different areas.

* Update VideoGameLoginSyntheticDataGeneration.py

Fixed code with comments from PR

* Update and rename VideoGameLoginSyntheticDataGeneration.py to gaming_data_generation.py
* Format data_analyzer.py and text_generator_plugins.py

* Update formatting and fix docstrings

* Update inferred max value for DecimalType columns

* Update docstrings and inferred max value for DecimalType columns
* Added CPG supply chain dbldatagen notebook

* Deleting original CPG_supply_chain_datagen.py notebook

* Adding new file with the changes

Adding new file in the correct location in the repo.

* Cleaned up comments

---------

Co-authored-by: Greg Hansen <[email protected]>
"""

from dbldatagen.spec.generator_spec import DatagenSpec, TableDefinition
from dbldatagen.spec.generator_spec import DatagenSpec, DatasetDefinition
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the param usage on line# 87 is from a previous version of the class? Would it be "datasets =" instead of "tables ="?

spec = DatagenSpec(
tables={"users": table_def},

import random

from dbldatagen.spec.generator_spec import DatagenSpec, TableDefinition
from dbldatagen.spec.generator_spec import DatagenSpec, DatasetDefinition
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here as the other example:

Looks like the param usage on line# 275 is from a previous version of the class? Would it be "datasets =" instead of "tables ="?

spec = DatagenSpec(
tables={"users": table_def},

for col_def in table_spec.columns:
kwargs = _columnSpecToDatagenColumnSpec(col_def)
data_gen = data_gen.withColumn(colName=col_def.name, **kwargs)
# Has performance implications.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convert to TODO? What are the performance implications here ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants