Optimize URL parsing with ClickHouse-inspired architecture and gperf TLD lookup #13
Conversation
Ensure the `CURLOPT_WRITEFUNCTION` knows how to write to `string` or `File`.
Force-pushed from `f94aab5` to `d2d5690`:
Optimize URL parsing with ClickHouse-inspired architecture and gperf TLD lookup

Performance optimization overhaul replacing regex-based parsing with character-by-character processing, and database-dependent TLD lookups with compile-time perfect hash functions.

- Replace `std::regex` with optimized character parsing for all URL functions (see the sketch below)
- Implement gperf-generated perfect hash for ~9,700 TLD lookups (O(1) with zero collisions)
- Use `std::string_view` for zero-copy string operations
- Eliminate database dependencies for TLD resolution

- **ClickHouse-inspired parsing**: character-by-character URL component extraction
- **Static TLD lookup**: Mozilla Public Suffix List compiled into the extension binary
- **Perfect hash generation**: gperf-based collision-free hash function (like ClickHouse)
- **Compile-time optimization**: all TLD data embedded at build time

- Remove `update_suffixes()` function and database PSL loading
- Remove CURL-based PSL downloads and table creation
- Clean up unused includes and dependencies
- Eliminate ~100 lines of legacy database interaction code

- **No runtime PSL downloads**: faster cold starts, no network dependencies
- **No database queries**: TLD lookups resolved via in-memory perfect hash
- **Predictable performance**: consistent O(n) parsing regardless of URL complexity
- **Reduced memory footprint**: perfect hash more efficient than hash tables

- gperf now required for optimal TLD lookup generation
- Automatic fallback removed to ensure consistent performance

- Update README with performance optimization details
- Document ClickHouse-inspired architecture approach
- Remove outdated database-dependent function documentation

📊 All tests passing with significant performance improvements expected:

| Function | Before | After | Improvement |
|----------|--------|-------|-------------|
| `extract_schema` | 0.029s | 0.003s | 9.6x faster |
| `extract_host` | 0.029s | 0.004s | 7.2x faster |
| `extract_port` | 0.180s | 0.003s | 60.0x faster |
| `extract_path` | 0.044s | 0.003s | 14.6x faster |
| `extract_query_string` | 0.069s | 0.002s | 34.5x faster |
| `extract_domain` | 14.646s | 0.006s | 2441.0x faster |
| `extract_subdomain` | 13.158s | 0.003s | 4386.0x faster |
| `extract_tld` | 12.987s | 0.003s | 4329.0x faster |
| `extract_extension` | 0.108s | 0.002s | 54.0x faster |
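To make the character-by-character claim concrete, here is a minimal sketch of the technique, assuming hypothetical function names (`ExtractSchema`, `ExtractHost`); it is not the PR's actual code, and a real implementation would also need to handle edge cases such as userinfo and IPv6 host literals.

```cpp
#include <cctype>
#include <string_view>

// Scheme ends at "://"; validate characters as we scan instead of running a regex.
std::string_view ExtractSchema (std::string_view url) {
    for (size_t i = 0; i + 2 < url.size (); ++i) {
        if (url[i] == ':' && url[i + 1] == '/' && url[i + 2] == '/')
            return url.substr (0, i);
        unsigned char c = static_cast<unsigned char> (url[i]);
        if (!std::isalnum (c) && c != '+' && c != '-' && c != '.')
            break; // not a legal scheme character, so there is no scheme
    }
    return {};
}

// Host starts right after "://" and ends at the first ':', '/', '?' or '#'.
// std::string_view keeps both functions zero-copy: they return views into `url`.
std::string_view ExtractHost (std::string_view url) {
    auto scheme = ExtractSchema (url);
    size_t start = scheme.empty () ? 0 : scheme.size () + 3; // skip "://"
    size_t end   = start;
    while (end < url.size () && url[end] != ':' && url[end] != '/' &&
           url[end] != '?' && url[end] != '#')
        ++end;
    return url.substr (start, end - start);
}
```

No regex state machine is built and nothing is allocated, so each call is a single O(n) pass over the input, which is consistent with the speedups reported in the table above.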
```cpp
// Review context (bind phase of the IP-calculation table function):
// declare an extra VARCHAR output column named "ipClass",
// then return the function's bind data.
return_types.emplace_back (LogicalType(LogicalTypeId::VARCHAR));
names.emplace_back ("ipClass");

return make_uniq<IPCalcData> ();
```
Those changes were made to fix a linking issue where those symbols were duplicated on some platforms with C++17.
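The diff above doesn't show the fix itself, so the following is a generic, hypothetical illustration of the kind of C++17 duplicate-symbol (ODR) fix being described; `IPClassName` is an invented name, not code from this PR.

```cpp
#include <cstdint>
#include <string>

// A function *defined* (not just declared) in a header that several translation
// units include yields one strong symbol per object file, and on some
// platforms/toolchains the linker rejects the duplicates. Since C++17, marking
// the definition `inline` (or using `static` / an anonymous namespace) lets the
// definitions be merged into a single symbol.
inline std::string IPClassName (uint8_t first_octet) {
    if (first_octet < 128) return "A"; // first octet 0..127
    if (first_octet < 192) return "B";
    if (first_octet < 224) return "C";
    return first_octet < 240 ? "D" : "E"; // multicast / reserved
}
```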
First of all, thank you for this PR. I'm not at all protective about my own code being rewritten. I'm definitely not the best programmer, and there's no bias here; anyone can make mistakes 😄. I don't think there's a need for a new extension, and we can solve the problems with this one. Let me also take a look at the ClickHouse architecture before we merge; I've been using ClickHouse for the past few years but have never looked at its code.
Pleasure!
Absolutely!
Great to read!
Sure thing, let me know if you need any help understanding this or that.
@hatamiarash7 do you think we could merge and release this as 2.0.0 or 1.6.0?
Hello @hatamiarash7,

As you might have seen, @utay and I have started to be fairly active on https://github.com/hatamiarash7/duckdb-netquack, as we use DuckDB more and more at Altertable.

Knowing what ClickHouse & other state-of-the-art engines can deliver in terms of performance, we were fairly disappointed by the performance of netquack. Looking more closely at the code, we quickly understood the reasons: the extensive usage of `std::regex` and the nested queries to `public_suffix_list` to validate TLDs were definitely the culprits 😅. Inspired by how ClickHouse does it, and after a few hours working collaboratively with Claude Sonnet 4, I ended up rewriting most of the extension with an optimized version of each function.

I forced myself not to touch any of your tests and to make sure they were all still passing. I also spent some time on a benchmarking script, which I hope provides numbers that will convince you this is the right way to go.

I understand I'm pretty much rewriting all of your code, so if you'd prefer that I create a new extension instead of upstreaming this to yours, I'll gladly do it. Similarly, if you want us to take over the maintenance of netquack, we'll be happy to help.
PS: this PR includes the 2 commits of #12
TL;DR:
- Replaced `std::regex` with optimized character parsing for all URL functions.
- gperf-generated perfect hash for TLD lookups (O(1) with zero collisions) and no more database queries (see the TLD lookup sketch below).
- Removed the `update_suffixes()` function and database PSL loading (breaking change).
- All tests passing with significant performance improvements (`./scripts/benchmark.sh`); see the benchmark table in the commit message above.

PS: I committed the gperf-generated code so that CI and anyone without gperf can compile. It's the largest part of the 65k lines of diff 😅
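As a closing illustration, here is a minimal sketch of how a perfect-hash suffix lookup can back `extract_tld`: walk the host label by label and return the first (longest) suffix found in the Public Suffix List. `IsPublicSuffix` and its tiny demo set stand in for the gperf-generated table; gperf's stock C++ output exposes `Perfect_Hash::in_word_set (str, len)`, but the exact names used in this PR are an assumption.

```cpp
#include <string>
#include <string_view>
#include <unordered_set>

// Stand-in for the gperf perfect hash; the real table holds ~9,700 PSL entries
// and does no allocation at all.
static bool IsPublicSuffix (std::string_view s) {
    static const std::unordered_set<std::string> kDemo {"com", "org", "uk", "co.uk"};
    return kDemo.count (std::string (s)) > 0;
}

// "news.bbc.co.uk" -> tries "news.bbc.co.uk", "bbc.co.uk", "co.uk" -> "co.uk".
std::string_view ExtractTLD (std::string_view host) {
    for (size_t pos = 0;;) {
        std::string_view suffix = host.substr (pos);
        if (IsPublicSuffix (suffix))
            return suffix; // first hit while scanning left-to-right is the longest
        size_t dot = host.find ('.', pos);
        if (dot == std::string_view::npos)
            return {}; // no known suffix
        pos = dot + 1;
    }
}
```

`extract_domain` then follows directly from the same walk: the registrable domain is the matched suffix plus one more label to its left (here, `bbc.co.uk`).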