Duckdb hardening #217

Closed
tim-band wants to merge 110 commits into alan-turing-institute:main from SAFEHR-data:duckdb-hardening

Conversation

@tim-band
Collaborator

DuckDB can be prevented from accessing Parquet files.
We should use this technique to stop Datafaker from accessing Parquet files during its generation phases: those files might hold sensitive data, and DuckDB can sometimes access them when you don't expect it to.
The hardest part is writing tests that confirm Datafaker cannot access Parquet files while it is also accessing the destination database.

Tim Band added 30 commits November 27, 2024 18:27
...with a small config file
Added WITHOUT TIME ZONE.
Removed sqlacodegen dependency.
including union column sets
Tim Band and others added 29 commits May 20, 2025 09:47
Version 0.15 knows that click 8.2 breaks it.
'dsn' no longer contains password.
'schema' is taken from SRC_SCHEMA directly.
Tables no longer have schema.
Also documentation fixes.
* Fixes #33, #34, #31

* documentation normal->generate

---------

Co-authored-by: Tim Band <t.b@ucl>
* Fixes #33, #34, #31

* documentation normal->generate

* Don't unnecessarily create schema

* dump-data command #38

* Initial documentation of orm.yaml

---------

Co-authored-by: Tim Band <t.b@ucl>
* Initial change of name from sqlsynthgen to datafaker

* Refactoring of interactive tests

* `remove-` commands tests replaced

---------

Co-authored-by: Tim Band <t.b@ucl>
* src-stats gain query and date

* Queries gain comments that get copied into src-stats.yaml
Fixed stories

* single letter command synonyms in configure-generators

---------

Co-authored-by: Tim Band <t.b@ucl>
* test_make updated
test_settings fixed

* Fixed main and create tests

* Tests all pass individually now. Many fixes:
* create-vocab actually runs
* create-data reports correct count of story rows
* -f can be used as well as --force
* the logger is called "datafaker" not "utils"
* row generators can fully exhaust unique constraints
* row/story generator modules can just be files (this may just have been broken for tests)
* turning on max constraint retries doesn't break create-data
* unique constraint failure does not blow up datafaker

* More test robustness

* All tests pass together!
Row generators can be instantiated objects

---------

Co-authored-by: Tim Band <t.b@ucl>
* Tests all pass individually now. Many fixes:
* create-vocab actually runs
* create-data reports correct count of story rows
* -f can be used as well as --force
* the logger is called "datafaker" not "utils"
* row generators can fully exhaust unique constraints
* row/story generator modules can just be files (this may just have been broken for tests)
* turning on max constraint retries doesn't break create-data
* unique constraint failure does not blow up datafaker

* More test robustness

* #44 create-generators figures out if it needs a stats file

* bump version to 0.2.1

---------

Co-authored-by: Tim Band <t.b@ucl>
* sampled and suppressed choice generators

* Fixed problems found trying this out for real.

---------

Co-authored-by: Tim Band <t.b@ucl>
and --spec=table-column-gen.csv

Co-authored-by: Tim Band <t.b@ucl>
Co-authored-by: Tim Band <t.b@ucl>
* test_prompts added for configure-generators
* Refactored GeneratorCmd to allow multi-column generators: Fixes #54
* configure-generators merge and unmerge commands
* multivariate normal and lognormal generator
* Added (univariate) lognormal generator
* Weighted choice generator
* null-partitioned grouped lognormal plus sampled and suppressed
* VARCHAR(N) generators truncate results
* Updated health_data documentation
* #59 Foreign Keys to ignored tables supported
Co-authored-by: Tim Band <t.b@ucl>
* configure-generators --spec now allows fallbacks and multi-column generators
* null-partitioned grouped sampled generators
* SUPPRESS_COUNT is now 7
* Automatic pre-commit fixes
* Fixed variances in tests
* precommit cleanup, NullPartitionedGrouped fix
* Moved DistributionGenerator to providers.py
Co-authored-by: Tim Band <t.b@ucl>
* Refactoring query construction
out of the generator proposer

* Removed _get_row_partition

* A bit more refactoring

* Fixed #73 Grouped generators query results overlap

* FKs to concept table named, initial implementation

* Many extra comments output with partitions

* Named column fetched from config.yaml

* configure-tables allows the setting of naming columns

---------

Co-authored-by: Tim Band <t.b@ucl>
* DuckDB support via `duckdb_engine` with compiler hooks to work round bugs
* Parquet output for dump-data
* DuckDB usage documentation
* dump-data --output as-directory
* dump-data tests now run with DuckDB and PostgreSQL
* some bugs fixed:
  - Generator writers `go_to` can cope with table names that have dots.
  - `dump-data --parquet` can cope with `TIMESTAMP`s
  - Foreign Keys to ignored tables fixed.
* Added --parquet-dir to make-tables
* Fixed for opiates example
* Dockerfile needs build tools for DuckDB support
* New Overview page
  - Mermaid diagrams can be embedded in docs
  - Table CSS fixed
* First attempt at a documentation publishing workflow
* Manual dispatch of documentation publish action
* Add Mermaid to the RST lint test
* Fixed backslashes in docstrings
* Worked around sphinx-rtd-theme CSS: tables should be 100% wide
* Fixes #78: select command causes crash
* Don't skip DuckDB tests even if DuckDB isn't installed
@tim-band tim-band closed this Feb 26, 2026