
Conversation

@redox redox commented Sep 26, 2025

Hello @hatamiarash7,

as you might have seen, @utay and I have started to be fairly active on https://github.com/hatamiarash7/duckdb-netquack, as we use DuckDB more and more at Altertable.

Knowing what ClickHouse and other state-of-the-art engines can deliver in terms of performance, we were fairly disappointed by the performance of netquack. Looking more closely at the code, we quickly understood why: the extensive use of std::regex and the nested queries to public_suffix_list to validate TLDs were definitely the culprits 😅.

Inspired by how ClickHouse does it, and after a few hours working collaboratively with Claude Sonnet 4, I ended up rewriting most of the extension with an optimized version of each function.

I forced myself not to touch any of your tests and ensured they were all still passing. I also spent some time working on a benchmarking script, which I hope provides numbers that will convince you this is the right way to go.

I understand I'm pretty much re-writing all your code so if you prefer that I create a new extension instead of upstreaming this to yours, I'll gladly do it. Similarly, if you want us to take over the maintenance of netquack we'll be happy to help.

PS: this PR includes the 2 commits of #12


TL;DR

  • ClickHouse-inspired parsing: std::regex replaced with optimized character-by-character parsing for all URL functions
  • Static TLD lookup: gperf-generated perfect hash (O(1), zero collisions), so no more database queries
  • Removed the now-useless update_suffixes() function and database PSL loading (breaking change)
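To illustrate the first bullet, here's a minimal sketch (not the extension's actual code; the function name is illustrative) of regex-free scheme extraction that returns a zero-copy std::string_view into the input:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Sketch of character-by-character scheme extraction replacing a std::regex.
// Returns a zero-copy view ("https" for "https://...") or an empty view.
static std::string_view ExtractScheme(std::string_view url) {
    for (size_t i = 0; i < url.size(); ++i) {
        const char c = url[i];
        if (c == ':') {
            // Only accept a scheme when it is followed by "://".
            if (url.substr(i, 3) == "://") {
                return url.substr(0, i);
            }
            return {};
        }
        // RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
        const bool valid =
            (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
            (i > 0 && ((c >= '0' && c <= '9') || c == '+' || c == '-' || c == '.'));
        if (!valid) {
            return {};
        }
    }
    return {}; // no ':' found, so no scheme
}
```

A single left-to-right scan touches each byte once and allocates nothing, which is where most of the speedup over std::regex comes from.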

All tests passing with significant performance improvements (`./scripts/benchmark.sh`):

| Function | Before | After | Improvement |
|----------|--------|-------|-------------|
| `extract_schema` | 0.069s | 0.010s | 6.9x faster |
| `extract_host` | 0.086s | 0.008s | 10.7x faster |
| `extract_port` | 0.533s | 0.009s | 59.2x faster |
| `extract_path` | 0.127s | 0.008s | 15.8x faster |
| `extract_query_string` | 0.202s | 0.004s | 50.5x faster |
| `extract_domain` | 37.503s | 0.012s | 3125.2x faster |
| `extract_subdomain` | 35.831s | 0.010s | 3583.1x faster |
| `extract_tld` | 36.003s | 0.009s | 4000.3x faster |
| `extract_extension` | 0.329s | 0.007s | 47.0x faster |

PS: I committed the gperf-generated code so CI, and anyone without gperf, can compile. It's the largest part of the 65k-line diff 😅

Ensure the `CURLOPT_WRITEFUNCTION` knows how to write to `string` or `File`.
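For context on this commit, a typical `CURLOPT_WRITEFUNCTION` callback that appends to a `std::string` looks like the following hedged sketch (the `WriteToString` name is illustrative, and the same signature can target a `FILE*` by changing the userdata type):

```cpp
#include <cstddef>
#include <string>

// Illustrative libcurl write callback: libcurl invokes it with chunks of the
// response body, and `userdata` is whatever was set via CURLOPT_WRITEDATA.
// Returning fewer than size * nmemb bytes tells libcurl to abort the transfer.
static size_t WriteToString(char *ptr, size_t size, size_t nmemb, void *userdata) {
    auto *out = static_cast<std::string *>(userdata);
    out->append(ptr, size * nmemb);
    return size * nmemb;
}

// Typical wiring (requires <curl/curl.h>):
//   std::string body;
//   curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteToString);
//   curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
```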
@redox redox requested a review from hatamiarash7 as a code owner September 26, 2025 20:30
@redox redox force-pushed the su/optim branch 2 times, most recently from f94aab5 to d2d5690 on September 26, 2025 at 21:18
…TLD lookup

Performance optimization overhaul replacing regex-based parsing with character-by-character
processing and database-dependent TLD lookups with compile-time perfect hash functions.

- Replace std::regex with optimized character parsing for all URL functions
- Implement gperf-generated perfect hash for ~9,700 TLD lookups (O(1) with zero collisions)
- Use std::string_view for zero-copy string operations
- Eliminate database dependencies for TLD resolution

- **ClickHouse-inspired parsing**: Character-by-character URL component extraction
- **Static TLD lookup**: Mozilla Public Suffix List compiled into extension binary
- **Perfect hash generation**: gperf-based collision-free hash function (like ClickHouse)
- **Compile-time optimization**: All TLD data embedded at build time
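The static PSL lookup drives the registered-domain walk behind extract_domain. Below is a self-contained sketch of that walk, with a small std::unordered_set standing in for the gperf-generated perfect hash (the real extension compiles ~9,700 suffixes into the binary; all names here are illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <unordered_set>

// Stand-in for the gperf-generated perfect-hash lookup: in the real build
// this would be a collision-free hash over the full Public Suffix List,
// embedded in the binary at compile time.
static const std::unordered_set<std::string_view> kPublicSuffixes = {
    "com", "uk", "co.uk",
};

// Walk the host's labels left to right; the first candidate found in the
// suffix table is the longest matching public suffix, and the registered
// domain is that suffix plus the single label to its left.
static std::string_view ExtractDomain(std::string_view host) {
    size_t pos = 0;
    size_t prev = std::string_view::npos; // start of the label before `pos`
    while (true) {
        if (kPublicSuffixes.count(host.substr(pos)) != 0) {
            return prev == std::string_view::npos ? std::string_view{}
                                                  : host.substr(prev);
        }
        const size_t dot = host.find('.', pos);
        if (dot == std::string_view::npos) {
            return {}; // unknown TLD
        }
        prev = pos;
        pos = dot + 1;
    }
}
```

Because every candidate check is a single in-memory hash probe instead of a SQL query against public_suffix_list, the per-row cost becomes constant, which is what the extract_domain / extract_tld rows in the benchmark table reflect.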

- Remove update_suffixes() function and database PSL loading
- Remove CURL-based PSL downloads and table creation
- Clean up unused includes and dependencies
- Eliminate ~100 lines of legacy database interaction code

- **No runtime PSL downloads**: Faster cold starts, no network dependencies
- **No database queries**: TLD lookups resolved via in-memory perfect hash
- **Predictable performance**: Consistent O(n) parsing regardless of URL complexity
- **Reduced memory footprint**: Perfect hash more efficient than hash tables

- gperf now required for optimal TLD lookup generation
- Automatic fallback removed to ensure consistent performance

- Update README with performance optimization details
- Document ClickHouse-inspired architecture approach
- Remove outdated database-dependent function documentation

📊 All tests passing with significant performance improvements expected:

| Function | Before | After | Improvement |
|----------|----------------|----------------|-------------|
| `extract_schema` | 0.029s | 0.003s | 9.6x faster |
| `extract_host` | 0.029s | 0.004s | 7.2x faster |
| `extract_port` | 0.180s | 0.003s | 60.0x faster |
| `extract_path` | 0.044s | 0.003s | 14.6x faster |
| `extract_query_string` | 0.069s | 0.002s | 34.5x faster |
| `extract_domain` | 14.646s | 0.006s | 2441.0x faster |
| `extract_subdomain` | 13.158s | 0.003s | 4386.0x faster |
| `extract_tld` | 12.987s | 0.003s | 4329.0x faster |
| `extract_extension` | 0.108s | 0.002s | 54.0x faster |
return_types.emplace_back (LogicalType(LogicalTypeId::VARCHAR));
names.emplace_back ("ipClass");

return make_uniq<IPCalcData> ();
@redox (Contributor, Author):

Those changes were done to fix a linking issue: these symbols were being duplicated on some platforms with C++17.

@hatamiarash7 (Owner) commented:

First of all, thank you for this PR.

I'm not at all protective about my own code being rewritten. I'm definitely not the best programmer, and I hold no bias here; anyone can make mistakes 😄. I don't think there's a need for a new extension, and we can solve the problems with this one.

Let me also take a look at the ClickHouse architecture before we merge. I've been using ClickHouse for the past few years but have never looked at its code.


redox commented Sep 28, 2025

> First of all, thank you for this PR.

Pleasure!

> I'm definitely not the best programmer, and I hold no bias here; anyone can make mistakes 😄.

Absolutely!

> I don't think there's a need for a new extension, and we can solve the problems with this one.

Great to read!

> Let me also take a look at the ClickHouse architecture before we merge. I've been using ClickHouse for the past few years but have never looked at its code.

Sure thing; let me know if you need any help understanding this or that.


redox commented Oct 1, 2025

@hatamiarash7 do you think we could merge and release this as a 2.0.0 or 1.6.0?

@hatamiarash7 hatamiarash7 merged commit 3081a3c into hatamiarash7:main Oct 2, 2025
23 checks passed
@redox redox deleted the su/optim branch October 2, 2025 07:50