
Commit 7f0f36e

Optimize URL parsing with ClickHouse-inspired architecture and gperf TLD lookup
Performance optimization overhaul replacing regex-based parsing with character-by-character processing, and database-dependent TLD lookups with compile-time perfect hash functions.

- Replace std::regex with optimized character parsing for all URL functions
- Implement gperf-generated perfect hash for ~9,700 TLD lookups (O(1) with zero collisions)
- Use std::string_view for zero-copy string operations
- Eliminate database dependencies for TLD resolution

- **ClickHouse-inspired parsing**: Character-by-character URL component extraction
- **Static TLD lookup**: Mozilla Public Suffix List compiled into extension binary
- **Perfect hash generation**: gperf-based collision-free hash function (like ClickHouse)
- **Compile-time optimization**: All TLD data embedded at build time

- Remove update_suffixes() function and database PSL loading
- Remove CURL-based PSL downloads and table creation
- Clean up unused includes and dependencies
- Eliminate ~100 lines of legacy database interaction code

- **No runtime PSL downloads**: Faster cold starts, no network dependencies
- **No database queries**: TLD lookups resolved via in-memory perfect hash
- **Predictable performance**: Consistent O(n) parsing regardless of URL complexity
- **Reduced memory footprint**: Perfect hash more efficient than hash tables

- gperf now required for optimal TLD lookup generation
- Automatic fallback removed to ensure consistent performance

- Update README with performance optimization details
- Document ClickHouse-inspired architecture approach
- Remove outdated database-dependent function documentation

📊 All tests passing with significant performance improvements expected:

| Function | Before | After | Improvement |
|----------|--------|-------|-------------|
| `extract_schema` | 0.029s | 0.003s | 9.6x faster |
| `extract_host` | 0.029s | 0.004s | 7.2x faster |
| `extract_port` | 0.180s | 0.003s | 60.0x faster |
| `extract_path` | 0.044s | 0.003s | 14.6x faster |
| `extract_query_string` | 0.069s | 0.002s | 34.5x faster |
| `extract_domain` | 14.646s | 0.006s | 2441.0x faster |
| `extract_subdomain` | 13.158s | 0.003s | 4386.0x faster |
| `extract_tld` | 12.987s | 0.003s | 4329.0x faster |
| `extract_extension` | 0.108s | 0.002s | 54.0x faster |
1 parent 98f9885 commit 7f0f36e

27 files changed: +65873 −505 lines

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -8,4 +8,5 @@ test/python/__pycache__/
 .Rhistory
 *.log
 *.csv
+public_suffix_list.dat
 !test/data/*.csv

.vscode/settings.json

Lines changed: 13 additions & 0 deletions
@@ -1,4 +1,17 @@
 {
+    "cmake.generator": "Ninja",
+    "cmake.sourceDirectory": "${workspaceFolder}",
+    "cmake.buildDirectory": "${workspaceFolder}/build/release",
+    "cmake.configureOnOpen": false,
+    // Use compile_commands.json generated by Ninja for accurate include paths and flags
+    "C_Cpp.default.compileCommands": "${workspaceFolder}/build/release/compile_commands.json",
+
+    "clangd.arguments": [
+        "--compile-commands-dir=${workspaceFolder}/build/release",
+        "--background-index",
+        "--enable-config",
+        "--suggest-missing-includes"
+    ],
     "cSpell.words": [
         "duckdb",
         "Hostroute",

README.md

Lines changed: 10 additions & 22 deletions
@@ -8,6 +8,8 @@ This extension is designed to simplify working with domains, URIs, and web paths

 With Netquack, you can unlock deeper insights from your web-related datasets without the need for external tools or complex workflows.

+NetQuack uses ClickHouse-inspired character-by-character parsing and gperf-generated perfect hash functions for optimal performance.
+
 Table of Contents

 - [DuckDB Netquack Extension](#duckdb-netquack-extension)
@@ -55,9 +57,7 @@ Once installed, the [macro functions](https://duckdb.org/community_extensions/ex

 ### Extracting The Main Domain

-This function extracts the main domain from a URL. For this purpose, the extension will get all public suffixes from the [publicsuffix.org](https://publicsuffix.org/) list and extract the main domain from the URL.
-
-The download process of the public suffix list is done automatically when the function is called for the first time. After that, the list is stored in the `public_suffix_list` table to avoid downloading it again.
+This function extracts the main domain from a URL using an optimized static TLD lookup system. The extension uses Mozilla's Public Suffix List compiled into a gperf-generated perfect hash function for O(1) TLD lookups with zero collisions.

 ```sql
 D SELECT extract_domain('a.example.com') AS domain;
@@ -77,23 +77,7 @@ D SELECT extract_domain('https://b.a.example.com/path') AS domain;
 └─────────────┘
 ```

-You can use the `update_suffixes` function to update the public suffix list manually.
-
-```sql
-D SELECT update_suffixes();
-┌───────────────────┐
-│ update_suffixes() │
-│      varchar      │
-├───────────────────┤
-│      updated      │
-└───────────────────┘
-```
-
-> [!WARNING]
-> This is a public service with a limited number of requests. If you call the function too many times, you may get a 403 error.
-> `<?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message></Error>`
-> The list usually changes a few times per week; more frequent downloading will cause rate limiting.
-> In this case, you can download the list manually from [publicsuffix.org](https://publicsuffix.org/) and save it in the `public_suffix_list` table.
+The TLD lookup is built into the extension at compile time using the latest Mozilla Public Suffix List. No runtime downloads or database operations are required.

 ### Extracting The Path

@@ -234,7 +218,7 @@ D SELECT extract_extension('http://example.com/image.jpg') AS ext;

 ### Extracting The TLD (Top-Level Domain)

-This function extracts the top-level domain from a URL. This function will use the public suffix list to extract the TLD. Check the [Extracting The Main Domain](#extracting-the-main-domain) section for more information about the public suffix list.
+This function extracts the top-level domain from a URL using the optimized gperf-based public suffix lookup system. The function correctly handles multi-part TLDs (like `com.au`) using the longest-match algorithm from Mozilla's Public Suffix List.

 ```sql
 D SELECT extract_tld('https://example.com.ac/path/path') AS tld;
@@ -256,7 +240,7 @@ D SELECT extract_tld('a.example.com') AS tld;

 ### Extracting The Sub Domain

-This function extracts the sub-domain from a URL. This function will use the public suffix list to extract the TLD. Check the [Extracting The Main Domain](#extracting-the-main-domain) section for more information about the public suffix list.
+This function extracts the sub-domain from a URL using the optimized public suffix lookup system to correctly identify the domain boundary and extract everything before it.

 ```sql
 D SELECT extract_subdomain('http://a.b.example.com/path') AS dns_record;
@@ -398,6 +382,10 @@ D SELECT * FROM netquack_version();
 └─────────┘
 ```

+### 🛠 **Build Requirements**
+
+- **gperf required**: Perfect hash generation requires `gperf` (install via `brew install gperf` or `apt-get install gperf`)
+
 ## Debugging

 The debugging process for DuckDB extensions is not an easy job. For Netquack, we have created a log file in the current directory. The log file is named `netquack.log` and contains all the logs for the extension. You can use this file to debug your code.

docs/functions/extract-domain.md

Lines changed: 2 additions & 2 deletions
@@ -14,9 +14,9 @@ layout:

 # Extract Domain

-This function extracts the main domain from a URL. For this purpose, the extension will get all public suffixes from the [publicsuffix.org](https://publicsuffix.org/) list and extract the main domain from the URL.
+This function extracts the main domain from a URL using an optimized static TLD lookup system. The extension uses Mozilla's Public Suffix List compiled into a gperf-generated perfect hash function for O(1) TLD lookups with zero collisions.

-The download process of the public suffix list is done automatically when the function is called for the first time. After that, the list is stored in the `public_suffix_list` table to avoid downloading it again.
+The TLD lookup is built into the extension at compile time using the latest Mozilla Public Suffix List. No runtime downloads or database operations are required.

 ```sql
 D SELECT extract_domain('a.example.com') AS domain;

scripts/benchmark.sh

Lines changed: 126 additions & 0 deletions
New file (shown clean, without the diff viewer's line-number columns). The loop bound is corrected from `{0..9}` to `{0..8}`, since `functions` holds nine entries (indices 0–8); the original upper bound produced one empty iteration.

```bash
#!/usr/bin/env bash

echo "=== NetQuack Performance Comparison ==="
echo ""

# Change to the project root directory (parent of script location)
CWD="$(cd "$(dirname "$0")/.." && pwd)"
cd "$CWD"

# Create results directory in build (gitignored)
mkdir -p build/benchmark_results

# Function to create benchmark SQL with specified load command
create_benchmark_sql() {
    local load_command="$1"
    local version_label="$2"
    local output_file="$3"

    cat > "$output_file" << EOF
${load_command}

-- Create test data
CREATE TABLE benchmark_urls AS SELECT * FROM (VALUES
    ('https://www.example.com/path?query=value#fragment'),
    ('http://subdomain.example.co.uk:8080/long/path/to/file.html?param1=value1&param2=value2'),
    ('ftp://user:[email protected]:21/directory/file.zip'),
    ('https://blog.example.org/2023/12/article-title.html'),
    ('http://api.service.example.net:3000/v1/users/123?format=json'),
    ('https://cdn.assets.example.com/images/logo.png'),
    ('mailto:[email protected]'),
    ('http://localhost:8080/debug'),
    ('https://secure.payment.example.gov:8443/transaction?id=abc123'),
    ('https://example.com')
) AS t(url);

-- Expand dataset
CREATE TABLE large_benchmark_urls AS
WITH RECURSIVE series(i) AS (
    SELECT 1
    UNION ALL
    SELECT i + 1 FROM series WHERE i <= 3000
)
SELECT url FROM benchmark_urls CROSS JOIN series;

.timer on
.print '=== ${version_label} BENCHMARKS ==='

.print 'extract_schema:'
SELECT extract_schema(url) FROM large_benchmark_urls;

.print 'extract_host:'
SELECT extract_host(url) FROM large_benchmark_urls;

.print 'extract_port:'
SELECT extract_port(url) FROM large_benchmark_urls;

.print 'extract_path:'
SELECT extract_path(url) FROM large_benchmark_urls;

.print 'extract_query_string:'
SELECT extract_query_string(url) FROM large_benchmark_urls;

.print 'extract_domain:'
SELECT extract_domain(url) FROM large_benchmark_urls;

.print 'extract_subdomain:'
SELECT extract_subdomain(url) FROM large_benchmark_urls;

.print 'extract_tld:'
SELECT extract_tld(url) FROM large_benchmark_urls;

.print 'extract_extension:'
SELECT extract_extension(url) FROM large_benchmark_urls;
EOF
}

# Create temporary directory
mkdir -p build/tmp

# Create benchmark SQL files using the DRY function
create_benchmark_sql "FORCE INSTALL netquack FROM community; LOAD netquack;" "PUBLISHED VERSION" "build/tmp/published_benchmark.sql"
create_benchmark_sql "LOAD './build/release/extension/netquack/netquack.duckdb_extension';" "LOCAL VERSION" "build/tmp/local_benchmark.sql"

echo "Step 1: Installing and benchmarking PUBLISHED NetQuack extension..."
duckdb < build/tmp/published_benchmark.sql > build/benchmark_results/published_full_output.txt 2>&1

echo "Step 2: Running benchmarks on LOCAL implementation..."
./build/release/duckdb < build/tmp/local_benchmark.sql > build/benchmark_results/local_full_output.txt 2>&1

echo "Step 3: Generating comparison analysis..."

# Extract times and calculate improvements
echo "📊 RESULTS SUMMARY:" > build/benchmark_results/analysis.txt
echo "" >> build/benchmark_results/analysis.txt

echo "| Function | Published Time | Local Time | Improvement |" >> build/benchmark_results/analysis.txt
echo "|----------|----------------|----------------|-------------|" >> build/benchmark_results/analysis.txt

# Extract timing data using a more robust approach
functions=("extract_schema" "extract_host" "extract_port" "extract_path" "extract_query_string" "extract_domain" "extract_subdomain" "extract_tld" "extract_extension")

# Extract all times into temporary files
grep "Run Time" build/benchmark_results/published_full_output.txt | grep -o "real [0-9.]*" | cut -d' ' -f2 > build/tmp/published_times.txt
grep "Run Time" build/benchmark_results/local_full_output.txt | grep -o "real [0-9.]*" | cut -d' ' -f2 > build/tmp/local_times.txt

# Process each of the nine functions (indices 0..8)
for i in {0..8}; do
    func=${functions[$i]}
    line_num=$((i + 1))

    published_time=$(sed -n "${line_num}p" build/tmp/published_times.txt)
    local_time=$(sed -n "${line_num}p" build/tmp/local_times.txt)

    if [ -n "$published_time" ] && [ -n "$local_time" ]; then
        improvement=$(echo "scale=1; $published_time / $local_time" | bc -l)
        echo "| ${func} | ${published_time}s | ${local_time}s | ${improvement}x faster |" >> build/benchmark_results/analysis.txt
    fi
done

# Clean up timing files
rm -f build/tmp/published_times.txt build/tmp/local_times.txt

echo ""
echo "✅ Benchmark comparison complete!"
echo ""
cat build/benchmark_results/analysis.txt
```

scripts/generate_tld_lookup.sh

Lines changed: 154 additions & 0 deletions
New file (shown clean, without the diff viewer's line-number columns):

```bash
#!/usr/bin/env bash

# Download the public suffix list if not present
if [ ! -f public_suffix_list.dat ]; then
    echo "Downloading public suffix list..."
    wget -nv -O public_suffix_list.dat https://publicsuffix.org/list/public_suffix_list.dat
fi

# Generate C++ header with TLD lookup
echo "Generating TLD lookup header..."

# Generate gperf file like ClickHouse does
echo '%language=C++
%define lookup-function-name isValidTLD
%define class-name TLDLookupHash
%readonly-tables
%includes
%compare-strncmp
%{
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wimplicit-fallthrough"
#pragma GCC diagnostic ignored "-Wzero-as-null-pointer-constant"
#pragma GCC diagnostic ignored "-Wunused-macros"
// NOLINTBEGIN(modernize-macro-to-enum)
%}
# List generated using https://publicsuffix.org/list/public_suffix_list.dat
%%' > src/utils/tld_lookup.gperf

# Extract ALL TLDs from the public suffix list (both single and multi-part)
# Remove comments, empty lines, exception rules, and wildcards
grep -v "^//" public_suffix_list.dat | \
    grep -v "^$" | \
    grep -v "^\!" | \
    grep -v "^\*" | \
    sed 's/^[[:space:]]*//' | \
    sort -u >> src/utils/tld_lookup.gperf

echo "%%" >> src/utils/tld_lookup.gperf
echo '// NOLINTEND(modernize-macro-to-enum)
#pragma GCC diagnostic pop' >> src/utils/tld_lookup.gperf

# Generate perfect hash function using gperf (required)
if ! command -v gperf >/dev/null 2>&1; then
    echo "Error: gperf is required for optimal TLD lookup generation"
    echo "Please install gperf: brew install gperf (macOS) or apt-get install gperf (Linux)"
    exit 1
fi

echo "Generating perfect hash function with gperf..."
gperf src/utils/tld_lookup.gperf > src/utils/tld_lookup_generated.hpp

# Remove deprecated 'register' storage class specifier for C++17 compatibility
echo "Removing deprecated 'register' keywords for C++17 compatibility..."
sed -i.bak 's/register const char \*/const char */g; s/register unsigned int/unsigned int/g' src/utils/tld_lookup_generated.hpp
rm -f src/utils/tld_lookup_generated.hpp.bak

# Create header file
cat > src/utils/tld_lookup.hpp << 'EOF'
// Auto-generated from Mozilla Public Suffix List using gperf

#pragma once

#include <string>
#include <cstring>

namespace duckdb
{
    namespace netquack
    {
        // Check if a suffix is a valid public suffix (TLD) using perfect hash
        bool isValidTLD(const char* str, size_t len);
        bool isValidTLD(const std::string& suffix);

        // Get the effective TLD for a hostname
        std::string getEffectiveTLD(const std::string& hostname);
    }
}
EOF

# Create implementation file
cat > src/utils/tld_lookup.cpp << 'EOF'
// Auto-generated from Mozilla Public Suffix List using gperf

#include "tld_lookup.hpp"
#include "tld_lookup_generated.hpp"

namespace duckdb
{
    namespace netquack
    {
        bool isValidTLD(const char* str, size_t len)
        {
            return TLDLookupHash::isValidTLD(str, len) != nullptr;
        }

        bool isValidTLD(const std::string& suffix)
        {
            return isValidTLD(suffix.c_str(), suffix.length());
        }

        std::string getEffectiveTLD(const std::string& hostname)
        {
            if (hostname.empty())
                return "";

            // Implement proper public suffix algorithm:
            // find the longest matching public suffix.

            // First check if the entire hostname is a TLD
            if (isValidTLD(hostname))
            {
                return hostname;
            }

            // Try all possible suffixes and find the longest match
            std::string longest_tld;

            for (size_t pos = 0; pos < hostname.length(); ++pos)
            {
                if (hostname[pos] == '.')
                {
                    std::string candidate = hostname.substr(pos + 1);
                    if (isValidTLD(candidate))
                    {
                        // Keep the longest match
                        if (candidate.length() > longest_tld.length())
                        {
                            longest_tld = candidate;
                        }
                    }
                }
            }

            // If we found a valid TLD, return it
            if (!longest_tld.empty())
            {
                return longest_tld;
            }

            // If no valid TLD found, return the last part after the last dot
            size_t last_dot = hostname.find_last_of('.');
            if (last_dot != std::string::npos)
            {
                return hostname.substr(last_dot + 1);
            }

            // No dots, return entire hostname
            return hostname;
        }
    }
}
EOF

echo "TLD lookup files generated successfully!"
```
