Optimize URL parsing with ClickHouse-inspired architecture and gperf TLD lookup #13
Conversation
Ensure the `CURLOPT_WRITEFUNCTION` knows how to write to `string` or `File`.
Force-pushed from `f94aab5` to `d2d5690`:
Optimize URL parsing with ClickHouse-inspired architecture and gperf TLD lookup

Performance optimization overhaul replacing regex-based parsing with character-by-character processing, and database-dependent TLD lookups with compile-time perfect hash functions.

- Replace `std::regex` with optimized character parsing for all URL functions (see the sketch below)
- Implement gperf-generated perfect hash for ~9,700 TLD lookups (O(1) with zero collisions)
- Use `std::string_view` for zero-copy string operations
- Eliminate database dependencies for TLD resolution

- **ClickHouse-inspired parsing**: character-by-character URL component extraction
- **Static TLD lookup**: Mozilla Public Suffix List compiled into the extension binary
- **Perfect hash generation**: gperf-based collision-free hash function (like ClickHouse)
- **Compile-time optimization**: all TLD data embedded at build time

- Remove `update_suffixes()` function and database PSL loading
- Remove CURL-based PSL downloads and table creation
- Clean up unused includes and dependencies
- Eliminate ~100 lines of legacy database interaction code

- **No runtime PSL downloads**: faster cold starts, no network dependencies
- **No database queries**: TLD lookups resolved via in-memory perfect hash
- **Predictable performance**: consistent O(n) parsing regardless of URL complexity
- **Reduced memory footprint**: perfect hash more efficient than hash tables

- gperf now required for optimal TLD lookup generation
- Automatic fallback removed to ensure consistent performance

- Update README with performance optimization details
- Document ClickHouse-inspired architecture approach
- Remove outdated database-dependent function documentation

📊 All tests passing with significant performance improvements expected:

| Function | Before | After | Improvement |
|----------|--------|-------|-------------|
| `extract_schema` | 0.029s | 0.003s | 9.6x faster |
| `extract_host` | 0.029s | 0.004s | 7.2x faster |
| `extract_port` | 0.180s | 0.003s | 60.0x faster |
| `extract_path` | 0.044s | 0.003s | 14.6x faster |
| `extract_query_string` | 0.069s | 0.002s | 34.5x faster |
| `extract_domain` | 14.646s | 0.006s | 2441.0x faster |
| `extract_subdomain` | 13.158s | 0.003s | 4386.0x faster |
| `extract_tld` | 12.987s | 0.003s | 4329.0x faster |
| `extract_extension` | 0.108s | 0.002s | 54.0x faster |
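To make the character-by-character claim concrete, here is a minimal sketch of the technique, assuming hypothetical function names (`ExtractSchema`, `ExtractHost`); it is not the PR's actual code, and a real implementation would also need to handle edge cases such as userinfo and IPv6 host literals.

```cpp
#include <cctype>
#include <string_view>

// Scheme ends at "://"; validate characters as we scan instead of running a regex.
std::string_view ExtractSchema (std::string_view url) {
    for (size_t i = 0; i + 2 < url.size (); ++i) {
        if (url[i] == ':' && url[i + 1] == '/' && url[i + 2] == '/')
            return url.substr (0, i);
        unsigned char c = static_cast<unsigned char> (url[i]);
        if (!std::isalnum (c) && c != '+' && c != '-' && c != '.')
            break; // not a legal scheme character, so there is no scheme
    }
    return {};
}

// Host starts right after "://" and ends at the first ':', '/', '?' or '#'.
// std::string_view keeps both functions zero-copy: they return views into `url`.
std::string_view ExtractHost (std::string_view url) {
    auto scheme = ExtractSchema (url);
    size_t start = scheme.empty () ? 0 : scheme.size () + 3; // skip "://"
    size_t end   = start;
    while (end < url.size () && url[end] != ':' && url[end] != '/' &&
           url[end] != '?' && url[end] != '#')
        ++end;
    return url.substr (start, end - start);
}
```

No regex state machine is built and nothing is allocated, so each call is a single O(n) pass over the input, which is consistent with the speedups reported in the table above.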
```cpp
// Review context (bind phase of the IP-calculation table function):
// declare an extra VARCHAR output column named "ipClass",
// then return the function's bind data.
return_types.emplace_back (LogicalType(LogicalTypeId::VARCHAR));
names.emplace_back ("ipClass");

return make_uniq<IPCalcData> ();
```
Those changes were made to fix a linking issue where those symbols were duplicated on some platforms with C++17.
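The diff above doesn't show the fix itself, so the following is a generic, hypothetical illustration of the kind of C++17 duplicate-symbol (ODR) fix being described; `IPClassName` is an invented name, not code from this PR.

```cpp
#include <cstdint>
#include <string>

// A function *defined* (not just declared) in a header that several translation
// units include yields one strong symbol per object file, and on some
// platforms/toolchains the linker rejects the duplicates. Since C++17, marking
// the definition `inline` (or using `static` / an anonymous namespace) lets the
// definitions be merged into a single symbol.
inline std::string IPClassName (uint8_t first_octet) {
    if (first_octet < 128) return "A"; // first octet 0..127
    if (first_octet < 192) return "B";
    if (first_octet < 224) return "C";
    return first_octet < 240 ? "D" : "E"; // multicast / reserved
}
```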
First of all, thank you for this PR. I'm not at all protective about my own code being rewritten. I'm definitely not the best programmer, and there's no bias here; anyone can make mistakes 😄. I don't think there's a need for a new extension, and we can solve the problems with this one. Let me also take a look at the ClickHouse architecture before we merge; I've been using ClickHouse for the past few years but have never looked at its code.
Pleasure!
Absolutely!
Great to read!
Sure thing, let me know if you need any help understanding this or that.
@hatamiarash7 do you think we could merge and release this as 2.0.0 or 1.6.0?
Hello @hatamiarash7,

As you might have seen, @utay and I have started to be fairly active on https://github.com/hatamiarash7/duckdb-netquack, as we use DuckDB more and more at Altertable.

Knowing what ClickHouse & other state-of-the-art engines can deliver in terms of performance, we were fairly disappointed by the performance of netquack. Looking more closely at the code, we quickly understood the reasons: the extensive usage of `std::regex` and the nested queries to `public_suffix_list` to validate TLDs were definitely the culprits 😅. Inspired by how ClickHouse does it, and after a few hours working collaboratively with Claude Sonnet 4, I ended up rewriting most of the extension with an optimized version of each function.

I forced myself not to touch any of your tests and to make sure they were all still passing. I also spent some time on a benchmarking script, which I hope provides numbers that will convince you this is the right way to go.

I understand I'm pretty much rewriting all of your code, so if you'd prefer that I create a new extension instead of upstreaming this to yours, I'll gladly do it. Similarly, if you want us to take over the maintenance of netquack, we'll be happy to help.
PS: this PR includes the 2 commits of #12
TL;DR:
- Replaced `std::regex` with optimized character parsing for all URL functions.
- gperf-generated perfect hash for TLD lookups (O(1) with zero collisions) and no more database queries (see the TLD lookup sketch below).
- Removed the `update_suffixes()` function and database PSL loading (breaking change).
- All tests passing with significant performance improvements (`./scripts/benchmark.sh`); see the benchmark table in the commit message above.

PS: I committed the gperf-generated code so that CI and anyone without gperf can compile. It's the largest part of the 65k lines of diff 😅
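As a closing illustration, here is a minimal sketch of how a perfect-hash suffix lookup can back `extract_tld`: walk the host label by label and return the first (longest) suffix found in the Public Suffix List. `IsPublicSuffix` and its tiny demo set stand in for the gperf-generated table; gperf's stock C++ output exposes `Perfect_Hash::in_word_set (str, len)`, but the exact names used in this PR are an assumption.

```cpp
#include <string>
#include <string_view>
#include <unordered_set>

// Stand-in for the gperf perfect hash; the real table holds ~9,700 PSL entries
// and does no allocation at all.
static bool IsPublicSuffix (std::string_view s) {
    static const std::unordered_set<std::string> kDemo {"com", "org", "uk", "co.uk"};
    return kDemo.count (std::string (s)) > 0;
}

// "news.bbc.co.uk" -> tries "news.bbc.co.uk", "bbc.co.uk", "co.uk" -> "co.uk".
std::string_view ExtractTLD (std::string_view host) {
    for (size_t pos = 0;;) {
        std::string_view suffix = host.substr (pos);
        if (IsPublicSuffix (suffix))
            return suffix; // first hit while scanning left-to-right is the longest
        size_t dot = host.find ('.', pos);
        if (dot == std::string_view::npos)
            return {}; // no known suffix
        pos = dot + 1;
    }
}
```

`extract_domain` then follows directly from the same walk: the registrable domain is the matched suffix plus one more label to its left (here, `bbc.co.uk`).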