Commit eb17894
Optimize URL parsing with ClickHouse-inspired architecture and gperf TLD lookup
Performance optimization overhaul replacing regex-based parsing with character-by-character processing and database-dependent TLD lookups with compile-time perfect hash functions.

- Replace std::regex with optimized character parsing for all URL functions
- Implement gperf-generated perfect hash for ~9,700 TLD lookups (O(1) with zero collisions)
- Use std::string_view for zero-copy string operations
- Eliminate database dependencies for TLD resolution

- **ClickHouse-inspired parsing**: Character-by-character URL component extraction
- **Static TLD lookup**: Mozilla Public Suffix List compiled into extension binary
- **Perfect hash generation**: gperf-based collision-free hash function (like ClickHouse)
- **Compile-time optimization**: All TLD data embedded at build time

- Remove update_suffixes() function and database PSL loading
- Remove CURL-based PSL downloads and table creation
- Clean up unused includes and dependencies
- Eliminate ~100 lines of legacy database interaction code

- **No runtime PSL downloads**: Faster cold starts, no network dependencies
- **No database queries**: TLD lookups resolved via in-memory perfect hash
- **Predictable performance**: Consistent O(n) parsing regardless of URL complexity
- **Reduced memory footprint**: Perfect hash more efficient than hash tables

- gperf now required for optimal TLD lookup generation
- Automatic fallback removed to ensure consistent performance

- Update README with performance optimization details
- Document ClickHouse-inspired architecture approach
- Remove outdated database-dependent function documentation

📊 All tests passing with significant performance improvements expected:

| Function | Before | After | Improvement |
|----------|--------|-------|-------------|
| `extract_schema` | 0.029s | 0.003s | 9.6x faster |
| `extract_host` | 0.029s | 0.004s | 7.2x faster |
| `extract_port` | 0.180s | 0.003s | 60.0x faster |
| `extract_path` | 0.044s | 0.003s | 14.6x faster |
| `extract_query_string` | 0.069s | 0.002s | 34.5x faster |
| `extract_domain` | 14.646s | 0.006s | 2441.0x faster |
| `extract_subdomain` | 13.158s | 0.003s | 4386.0x faster |
| `extract_tld` | 12.987s | 0.003s | 4329.0x faster |
| `extract_extension` | 0.108s | 0.002s | 54.0x faster |
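For illustration, a minimal sketch of the character-by-character, `std::string_view`-based parsing style described above. The function name and details here are hypothetical, not the extension's actual code:

```cpp
#include <cctype>
#include <cstdio>
#include <string_view>

// Illustrative sketch only. Returns the scheme ("https" in "https://...")
// as a zero-copy view into the caller's buffer, or an empty view if the
// URL has no scheme. One forward scan: no regex, no allocations.
static std::string_view extract_schema_sketch(std::string_view url) {
    // A scheme is [A-Za-z][A-Za-z0-9+.-]* followed by "://".
    if (url.empty() || !std::isalpha(static_cast<unsigned char>(url[0]))) return {};
    size_t i = 1;
    while (i < url.size()) {
        unsigned char c = static_cast<unsigned char>(url[i]);
        if (!std::isalnum(c) && c != '+' && c != '.' && c != '-') break;
        ++i;
    }
    // Require the "://" delimiter right after the scheme characters.
    if (url.substr(i).rfind("://", 0) == 0) return url.substr(0, i);
    return {};
}

int main() {
    std::string_view s = extract_schema_sketch("https://example.com/path?q=1");
    std::printf("%.*s\n", static_cast<int>(s.size()), s.data()); // prints: https
}
```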
1 parent 98f9885 commit eb17894

30 files changed (+65,887 / −515 lines)

.gitignore

Lines changed: 1 addition & 0 deletions

@@ -8,4 +8,5 @@ test/python/__pycache__/
 .Rhistory
 *.log
 *.csv
+public_suffix_list.dat
 !test/data/*.csv

.vscode/settings.json

Lines changed: 13 additions & 0 deletions

@@ -1,4 +1,17 @@
 {
+    "cmake.generator": "Ninja",
+    "cmake.sourceDirectory": "${workspaceFolder}",
+    "cmake.buildDirectory": "${workspaceFolder}/build/release",
+    "cmake.configureOnOpen": false,
+    // Use compile_commands.json generated by Ninja for accurate include paths and flags
+    "C_Cpp.default.compileCommands": "${workspaceFolder}/build/release/compile_commands.json",
+
+    "clangd.arguments": [
+        "--compile-commands-dir=${workspaceFolder}/build/release",
+        "--background-index",
+        "--enable-config",
+        "--suggest-missing-includes"
+    ],
     "cSpell.words": [
         "duckdb",
         "Hostroute",

CMakeLists.txt

Lines changed: 4 additions & 0 deletions

@@ -7,6 +7,10 @@ set(LOADABLE_EXTENSION_NAME ${TARGET_NAME}_loadable_extension)

 project(${TARGET_NAME})

+# Force C++17 standard for string_view support
+set(CMAKE_CXX_STANDARD 17)
+set(CMAKE_CXX_STANDARD_REQUIRED ON)
+
 set(EXTENSION_SOURCES ${EXTENSION_SOURCES})
 include_directories(src/include)
 file(GLOB_RECURSE EXTENSION_SOURCES src/*.cpp)
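The C++17 requirement exists because `std::string_view` arrived in C++17. A minimal sketch of the zero-copy benefit the commit message cites; the function names are illustrative, not from the extension:

```cpp
#include <string>
#include <string_view>

// Pre-C++17 style: substr on std::string allocates and copies a new
// string for every component extracted from the URL.
std::string path_copy(const std::string &url, size_t slash_pos) {
    return url.substr(slash_pos);
}

// C++17 style: substr on std::string_view returns a (pointer, length)
// pair into the original buffer — O(1), no allocation, no copy.
std::string_view path_view(std::string_view url, size_t slash_pos) {
    return url.substr(slash_pos);
}
```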

README.md

Lines changed: 10 additions & 22 deletions

@@ -8,6 +8,8 @@ This extension is designed to simplify working with domains, URIs, and web paths

 With Netquack, you can unlock deeper insights from your web-related datasets without the need for external tools or complex workflows.

+NetQuack uses ClickHouse-inspired character-by-character parsing and gperf-generated perfect hash functions for optimal performance.
+
 Table of Contents

 - [DuckDB Netquack Extension](#duckdb-netquack-extension)

@@ -55,9 +57,7 @@ Once installed, the [macro functions](https://duckdb.org/community_extensions/ex

 ### Extracting The Main Domain

-This function extracts the main domain from a URL. For this purpose, the extension will get all public suffixes from the [publicsuffix.org](https://publicsuffix.org/) list and extract the main domain from the URL.
-
-The download process of the public suffix list is done automatically when the function is called for the first time. After that, the list is stored in the `public_suffix_list` table to avoid downloading it again.
+This function extracts the main domain from a URL using an optimized static TLD lookup system. The extension uses Mozilla's Public Suffix List compiled into a gperf-generated perfect hash function for O(1) TLD lookups with zero collisions.

 ```sql
 D SELECT extract_domain('a.example.com') AS domain;

@@ -77,23 +77,7 @@ D SELECT extract_domain('https://b.a.example.com/path') AS domain;
 └─────────────┘
 ```

-You can use the `update_suffixes` function to update the public suffix list manually.
-
-```sql
-D SELECT update_suffixes();
-┌───────────────────┐
-│ update_suffixes() │
-│      varchar      │
-├───────────────────┤
-│      updated      │
-└───────────────────┘
-```
-
-> [!WARNING]
-> This a public service with a limited number of requests. If you call the function too many times, you may get a 403 error.
-> `<?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message></Error>`
-> The list usually changes a few times per week; more frequent downloading will cause rate limiting.
-> In this case, you can download the list manually from [publicsuffix.org](https://publicsuffix.org/) and save it in the `public_suffix_list` table.
+The TLD lookup is built into the extension at compile time using the latest Mozilla Public Suffix List. No runtime downloads or database operations are required.

 ### Extracting The Path

@@ -234,7 +218,7 @@ D SELECT extract_extension('http://example.com/image.jpg') AS ext;

 ### Extracting The TLD (Top-Level Domain)

-This function extracts the top-level domain from a URL. This function will use the public suffix list to extract the TLD. Check the [Extracting The Main Domain](#extracting-the-main-domain) section for more information about the public suffix list.
+This function extracts the top-level domain from a URL using the optimized gperf-based public suffix lookup system. The function correctly handles multi-part TLDs (like `com.au`) using the longest-match algorithm from Mozilla's Public Suffix List.

 ```sql
 D SELECT extract_tld('https://example.com.ac/path/path') AS tld;

@@ -256,7 +240,7 @@ D SELECT extract_tld('a.example.com') AS tld;

 ### Extracting The Sub Domain

-This function extracts the sub-domain from a URL. This function will use the public suffix list to extract the TLD. Check the [Extracting The Main Domain](#extracting-the-main-domain) section for more information about the public suffix list.
+This function extracts the sub-domain from a URL using the optimized public suffix lookup system to correctly identify the domain boundary and extract everything before it.

 ```sql
 D SELECT extract_subdomain('http://a.b.example.com/path') AS dns_record;

@@ -398,6 +382,10 @@ D SELECT * FROM netquack_version();
 └─────────┘
 ```

+### 🛠 **Build Requirements**
+
+- **gperf required**: Perfect hash generation requires `gperf` (install via `brew install gperf` or `apt-get install gperf`)
+
 ## Debugging

 The debugging process for DuckDB extensions is not an easy job. For Netquack, we have created a log file in the current directory. The log file is named `netquack.log` and contains all the logs for the extension. You can use this file to debug your code.
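To make the longest-match behaviour concrete, here is a small illustrative sketch. `is_public_suffix` stands in for the gperf-generated perfect-hash lookup (gperf conventionally emits a function named `in_word_set`); the tiny suffix set and the driver logic are assumptions for the example, not the extension's actual code:

```cpp
#include <cstdio>
#include <string_view>

// Stand-in for the gperf-generated perfect-hash membership test.
// Hypothetical: the real table holds ~9,700 PSL entries.
static bool is_public_suffix(std::string_view s) {
    return s == "com" || s == "au" || s == "com.au" || s == "ac" || s == "uk";
}

// Longest-match sketch: try the suffix starting at each dot-separated
// label, left to right; the first hit is the longest matching suffix.
static std::string_view extract_tld_sketch(std::string_view host) {
    size_t pos = 0;
    while (pos < host.size()) {
        std::string_view candidate = host.substr(pos);
        if (is_public_suffix(candidate)) return candidate;
        size_t dot = host.find('.', pos);
        if (dot == std::string_view::npos) break;
        pos = dot + 1;
    }
    return {};
}

int main() {
    std::string_view tld = extract_tld_sketch("shop.example.com.au");
    std::printf("%.*s\n", static_cast<int>(tld.size()), tld.data()); // prints: com.au
}
```

Scanning candidates from the leftmost label rightward means the first suffix found in the list is automatically the longest one, which is why `com.au` wins over `au`.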

docs/functions/extract-domain.md

Lines changed: 2 additions & 2 deletions

@@ -14,9 +14,9 @@ layout:

 # Extract Domain

-This function extracts the main domain from a URL. For this purpose, the extension will get all public suffixes from the [publicsuffix.org](https://publicsuffix.org/) list and extract the main domain from the URL.
+This function extracts the main domain from a URL using an optimized static TLD lookup system. The extension uses Mozilla's Public Suffix List compiled into a gperf-generated perfect hash function for O(1) TLD lookups with zero collisions.

-The download process of the public suffix list is done automatically when the function is called for the first time. After that, the list is stored in the `public_suffix_list` table to avoid downloading it again.
+The TLD lookup is built into the extension at compile time using the latest Mozilla Public Suffix List. No runtime downloads or database operations are required.

 ```sql
 D SELECT extract_domain('a.example.com') AS domain;

scripts/benchmark.sh

Lines changed: 126 additions & 0 deletions

@@ -0,0 +1,126 @@ (new file)

#!/usr/bin/env bash

echo "=== NetQuack Performance Comparison ==="
echo ""

# Change to the project root directory (parent of script location)
CWD="$(cd "$(dirname "$0")/.." && pwd)"
cd "$CWD"

# Create results directory in build (gitignored)
mkdir -p build/benchmark_results

# Function to create benchmark SQL with specified load command
create_benchmark_sql() {
    local load_command="$1"
    local version_label="$2"
    local output_file="$3"

    cat > "$output_file" << EOF
${load_command}

-- Create test data
CREATE TABLE benchmark_urls AS SELECT * FROM (VALUES
    ('https://www.example.com/path?query=value#fragment'),
    ('http://subdomain.example.co.uk:8080/long/path/to/file.html?param1=value1&param2=value2'),
    ('ftp://user:[email protected]:21/directory/file.zip'),
    ('https://blog.example.org/2023/12/article-title.html'),
    ('http://api.service.example.net:3000/v1/users/123?format=json'),
    ('https://cdn.assets.example.com/images/logo.png'),
    ('mailto:[email protected]'),
    ('http://localhost:8080/debug'),
    ('https://secure.payment.example.gov:8443/transaction?id=abc123'),
    ('https://example.com')
) AS t(url);

-- Expand dataset
CREATE TABLE large_benchmark_urls AS
WITH RECURSIVE series(i) AS (
    SELECT 1
    UNION ALL
    SELECT i + 1 FROM series WHERE i <= 3000
)
SELECT url FROM benchmark_urls CROSS JOIN series;

.timer on
.print '=== ${version_label} BENCHMARKS ==='

.print 'extract_schema:'
SELECT extract_schema(url) FROM large_benchmark_urls;

.print 'extract_host:'
SELECT extract_host(url) FROM large_benchmark_urls;

.print 'extract_port:'
SELECT extract_port(url) FROM large_benchmark_urls;

.print 'extract_path:'
SELECT extract_path(url) FROM large_benchmark_urls;

.print 'extract_query_string:'
SELECT extract_query_string(url) FROM large_benchmark_urls;

.print 'extract_domain:'
SELECT extract_domain(url) FROM large_benchmark_urls;

.print 'extract_subdomain:'
SELECT extract_subdomain(url) FROM large_benchmark_urls;

.print 'extract_tld:'
SELECT extract_tld(url) FROM large_benchmark_urls;

.print 'extract_extension:'
SELECT extract_extension(url) FROM large_benchmark_urls;
EOF
}

# Create temporary directory
mkdir -p build/tmp

# Create benchmark SQL files using the DRY function
create_benchmark_sql "FORCE INSTALL netquack FROM community; LOAD netquack;" "PUBLISHED VERSION" "build/tmp/published_benchmark.sql"
create_benchmark_sql "LOAD './build/release/extension/netquack/netquack.duckdb_extension';" "LOCAL VERSION" "build/tmp/local_benchmark.sql"

echo "Step 1: Installing and benchmarking PUBLISHED NetQuack extension..."
duckdb < build/tmp/published_benchmark.sql > build/benchmark_results/published_full_output.txt 2>&1

echo "Step 2: Running benchmarks on LOCAL implementation..."
./build/release/duckdb < build/tmp/local_benchmark.sql > build/benchmark_results/local_full_output.txt 2>&1

echo "Step 3: Generating comparison analysis..."

# Extract times and calculate improvements
echo "📊 RESULTS SUMMARY:" > build/benchmark_results/analysis.txt
echo "" >> build/benchmark_results/analysis.txt

echo "| Function | Published Time | Local Time | Improvement |" >> build/benchmark_results/analysis.txt
echo "|----------|----------------|------------|-------------|" >> build/benchmark_results/analysis.txt

# Functions benchmarked above, in execution order
functions=("extract_schema" "extract_host" "extract_port" "extract_path" "extract_query_string" "extract_domain" "extract_subdomain" "extract_tld" "extract_extension")

# Extract all times into temporary files
grep "Run Time" build/benchmark_results/published_full_output.txt | grep -o "real [0-9.]*" | cut -d' ' -f2 > build/tmp/published_times.txt
grep "Run Time" build/benchmark_results/local_full_output.txt | grep -o "real [0-9.]*" | cut -d' ' -f2 > build/tmp/local_times.txt

# Process each function (nine functions: indices 0..8)
for i in {0..8}; do
    func=${functions[$i]}
    line_num=$((i + 1))

    published_time=$(sed -n "${line_num}p" build/tmp/published_times.txt)
    local_time=$(sed -n "${line_num}p" build/tmp/local_times.txt)

    if [ -n "$published_time" ] && [ -n "$local_time" ]; then
        improvement=$(echo "scale=1; $published_time / $local_time" | bc -l)
        echo "| ${func} | ${published_time}s | ${local_time}s | ${improvement}x faster |" >> build/benchmark_results/analysis.txt
    fi
done

# Clean up timing files
rm -f build/tmp/published_times.txt build/tmp/local_times.txt

echo ""
echo "✅ Benchmark comparison complete!"
echo ""
cat build/benchmark_results/analysis.txt
