Skip to content

[BUG] extract_domain is extremely slow on DuckDB 1.3.2 #15

@ctoprimer

Description

@ctoprimer

Describe the bug

netquack’s extract_domain() performance appears to degrade sharply on DuckDB v1.3.2. Even on simple synthetic emails in-memory, runtime grows to ~12.7s for 10k rows, which is unexpectedly slow for basic domain extraction.

Number of Domains Time to Run (avg of 5 runs)
100 0.133s
1,000 1.332s
10,000 12.700s

To Reproduce

Steps to reproduce the behavior (pure SQL, DuckDB CLI):

  1. Start DuckDB v1.3.2 (any environment is fine; the issue reproduces in :memory:).

  2. Enable timing in the DuckDB CLI:

    .timer on
  3. Install and load the extension:

    SET allow_community_extensions = true;
    INSTALL netquack FROM community;
    LOAD netquack;
  4. Confirm DuckDB version:

    SELECT version() AS duckdb_version;

    (Optional) If netquack exposes a version function, please advise the canonical query (e.g. SELECT * FROM netquack_version();).

  5. Run the benchmark queries and record the timings shown by .timer on:

    100 rows

    SELECT
      count(*) AS n,
      sum(length(NULLIF(extract_domain(email), ''))) AS total_domain_chars
    FROM (
      SELECT printf('user%08d@sub%d.domain%d.example.com', i, i % 10, i % 100) AS email
      FROM range(100) t(i)
    ) s;

    1,000 rows

    SELECT
      count(*) AS n,
      sum(length(NULLIF(extract_domain(email), ''))) AS total_domain_chars
    FROM (
      SELECT printf('user%08d@sub%d.domain%d.example.com', i, i % 10, i % 100) AS email
      FROM range(1000) t(i)
    ) s;

    10,000 rows

    SELECT
      count(*) AS n,
      sum(length(NULLIF(extract_domain(email), ''))) AS total_domain_chars
    FROM (
      SELECT printf('user%08d@sub%d.domain%d.example.com', i, i % 10, i % 100) AS email
      FROM range(10000) t(i)
    ) s;
  6. Observe that runtime increases substantially with row count (e.g., ~12s at 10k rows on DuckDB v1.3.2).

Expected behavior

Domain extraction from simple emails should be much faster (sub-second at 10k rows), especially in-memory on synthetic data. Current performance makes extract_domain() a bottleneck in downstream dbt models.

Versions
Netquack Version: 6a0d651
DuckDB Version: 1.3.2

Additional context

  • Reproduces in :memory: and does not rely on file IO.
  • Query forces evaluation by aggregating the extract_domain() output.
  • Timings reported above were measured from the same workload using DuckDB v1.3.2 (Python package), averaged over 5 runs after warm-up.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions