Description
Describe the bug
netquack’s extract_domain() performance appears to degrade sharply on DuckDB v1.3.2. Even on simple synthetic emails in-memory, runtime grows to ~12.7s for 10k rows, which is unexpectedly slow for basic domain extraction.
| Number of Domains | Time to Run (avg of 5 runs) |
|---|---|
| 100 | 0.133s |
| 1,000 | 1.332s |
| 10,000 | 12.700s |
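As a quick check on how these timings scale, the per-row cost can be computed directly from the reported numbers. This is a small sketch that uses only the values in the table; the variable names are illustrative.

```python
# Reported timings from the table above: row count -> average seconds.
timings = {100: 0.133, 1_000: 1.332, 10_000: 12.700}

# Per-row cost in milliseconds for each input size.
per_row_ms = {n: seconds / n * 1000 for n, seconds in timings.items()}

for n, ms in per_row_ms.items():
    print(f"{n:>6} rows: {ms:.2f} ms per row")
```

The per-row cost stays close to 1.3 ms at every size, so runtime grows roughly linearly with row count. That suggests a high fixed cost per extract_domain() call rather than super-linear behavior.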
To Reproduce
Steps to reproduce the behavior (pure SQL, DuckDB CLI):
- Start DuckDB v1.3.2 (any environment is fine; the issue reproduces in :memory:).
- Enable timing in the DuckDB CLI:
  .timer on
- Install and load the extension:
  SET allow_community_extensions = true;
  INSTALL netquack FROM community;
  LOAD netquack;
- Confirm DuckDB version:
  SELECT version() AS duckdb_version;
  (Optional) If netquack exposes a version function, please advise the canonical query (e.g. SELECT * FROM netquack_version();).
- Run the benchmark queries and record the timings shown by .timer on:
  100 rows
  SELECT count(*) AS n,
         sum(length(NULLIF(extract_domain(email), ''))) AS total_domain_chars
  FROM (
    SELECT printf('user%08d@sub%d.domain%d.example.com', i, i % 10, i % 100) AS email
    FROM range(100) t(i)
  ) s;
  1,000 rows
  SELECT count(*) AS n,
         sum(length(NULLIF(extract_domain(email), ''))) AS total_domain_chars
  FROM (
    SELECT printf('user%08d@sub%d.domain%d.example.com', i, i % 10, i % 100) AS email
    FROM range(1000) t(i)
  ) s;
  10,000 rows
  SELECT count(*) AS n,
         sum(length(NULLIF(extract_domain(email), ''))) AS total_domain_chars
  FROM (
    SELECT printf('user%08d@sub%d.domain%d.example.com', i, i % 10, i % 100) AS email
    FROM range(10000) t(i)
  ) s;
- Observe that runtime increases substantially with row count (e.g., ~12s at 10k rows on DuckDB v1.3.2).
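The three benchmark queries differ only in the argument to range(). For anyone scripting the reproduction, a hypothetical helper (the name benchmark_sql is not part of netquack or DuckDB) could generate the same SQL text for any row count:

```python
def benchmark_sql(n_rows: int) -> str:
    """Return the benchmark query from the steps above for n_rows synthetic emails."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    return (
        "SELECT count(*) AS n, "
        "sum(length(NULLIF(extract_domain(email), ''))) AS total_domain_chars "
        "FROM ( "
        "SELECT printf('user%08d@sub%d.domain%d.example.com', i, i % 10, i % 100) AS email "
        f"FROM range({int(n_rows)}) t(i) "  # only this line varies between runs
        ") s;"
    )


# Print the three queries used in this report.
for n in (100, 1_000, 10_000):
    print(benchmark_sql(n))
```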
Expected behavior
Domain extraction from simple emails should be much faster (sub-second at 10k rows), especially in-memory on synthetic data. Current performance makes extract_domain() a bottleneck in downstream dbt models.
Versions
Netquack Version: 6a0d651
DuckDB Version: 1.3.2
Additional context
- Reproduces in :memory: and does not rely on file IO.
- Query forces evaluation by aggregating the extract_domain() output.
- Timings reported above were measured from the same workload using DuckDB v1.3.2 (Python package), averaged over 5 runs after warm-up.
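The exact timing script is not included in this report. A minimal sketch of the "average of 5 runs after warm-up" method might look like the following; the function names are illustrative, and time_duckdb_query assumes the duckdb Python package is installed and that the community repository is reachable.

```python
import statistics
import time
from typing import Callable


def average_runtime(run: Callable[[], object], repeats: int = 5, warmup: int = 1) -> float:
    """Call run() `warmup` times untimed, then return mean seconds over `repeats` timed calls."""
    for _ in range(warmup):
        run()  # warm-up: result and duration are discarded
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


def time_duckdb_query(sql: str, repeats: int = 5, warmup: int = 1) -> float:
    """Time `sql` on an in-memory DuckDB connection with netquack loaded (assumed setup)."""
    import duckdb  # imported lazily so average_runtime works without duckdb installed

    con = duckdb.connect(":memory:")
    con.execute("INSTALL netquack FROM community")
    con.execute("LOAD netquack")
    # fetchall() forces the query to run to completion inside the timed region.
    return average_runtime(lambda: con.execute(sql).fetchall(), repeats, warmup)
```

Calling time_duckdb_query() with one of the benchmark queries from the reproduction steps returns the mean wall-clock time in seconds. Numbers from this sketch will include a small amount of Python call overhead on top of what .timer on reports in the CLI.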