
Commit 7f0f36e

Optimize URL parsing with ClickHouse-inspired architecture and gperf TLD lookup
Performance optimization overhaul replacing regex-based parsing with character-by-character processing, and database-dependent TLD lookups with compile-time perfect hash functions.

- Replace std::regex with optimized character parsing for all URL functions
- Implement gperf-generated perfect hash for ~9,700 TLD lookups (O(1) with zero collisions)
- Use std::string_view for zero-copy string operations
- Eliminate database dependencies for TLD resolution

- **ClickHouse-inspired parsing**: Character-by-character URL component extraction
- **Static TLD lookup**: Mozilla Public Suffix List compiled into extension binary
- **Perfect hash generation**: gperf-based collision-free hash function (like ClickHouse)
- **Compile-time optimization**: All TLD data embedded at build time

- Remove update_suffixes() function and database PSL loading
- Remove CURL-based PSL downloads and table creation
- Clean up unused includes and dependencies
- Eliminate ~100 lines of legacy database interaction code

- **No runtime PSL downloads**: Faster cold starts, no network dependencies
- **No database queries**: TLD lookups resolved via in-memory perfect hash
- **Predictable performance**: Consistent O(n) parsing regardless of URL complexity
- **Reduced memory footprint**: Perfect hash more efficient than hash tables

- gperf now required for optimal TLD lookup generation
- Automatic fallback removed to ensure consistent performance

- Update README with performance optimization details
- Document ClickHouse-inspired architecture approach
- Remove outdated database-dependent function documentation

📊 All tests passing with significant performance improvements expected:

| Function | Before | After | Improvement |
|----------|--------|-------|-------------|
| `extract_schema` | 0.029s | 0.003s | 9.6x faster |
| `extract_host` | 0.029s | 0.004s | 7.2x faster |
| `extract_port` | 0.180s | 0.003s | 60.0x faster |
| `extract_path` | 0.044s | 0.003s | 14.6x faster |
| `extract_query_string` | 0.069s | 0.002s | 34.5x faster |
| `extract_domain` | 14.646s | 0.006s | 2441.0x faster |
| `extract_subdomain` | 13.158s | 0.003s | 4386.0x faster |
| `extract_tld` | 12.987s | 0.003s | 4329.0x faster |
| `extract_extension` | 0.108s | 0.002s | 54.0x faster |
1 parent 98f9885 commit 7f0f36e

27 files changed: +65873 −505 lines

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -8,4 +8,5 @@ test/python/__pycache__/
 .Rhistory
 *.log
 *.csv
+public_suffix_list.dat
 !test/data/*.csv

.vscode/settings.json

Lines changed: 13 additions & 0 deletions
@@ -1,4 +1,17 @@
 {
+    "cmake.generator": "Ninja",
+    "cmake.sourceDirectory": "${workspaceFolder}",
+    "cmake.buildDirectory": "${workspaceFolder}/build/release",
+    "cmake.configureOnOpen": false,
+    // Use compile_commands.json generated by Ninja for accurate include paths and flags
+    "C_Cpp.default.compileCommands": "${workspaceFolder}/build/release/compile_commands.json",
+
+    "clangd.arguments": [
+        "--compile-commands-dir=${workspaceFolder}/build/release",
+        "--background-index",
+        "--enable-config",
+        "--suggest-missing-includes"
+    ],
     "cSpell.words": [
         "duckdb",
         "Hostroute",

README.md

Lines changed: 10 additions & 22 deletions
@@ -8,6 +8,8 @@ This extension is designed to simplify working with domains, URIs, and web paths

 With Netquack, you can unlock deeper insights from your web-related datasets without the need for external tools or complex workflows.

+NetQuack uses ClickHouse-inspired character-by-character parsing and gperf-generated perfect hash functions for optimal performance.
+
 Table of Contents

 - [DuckDB Netquack Extension](#duckdb-netquack-extension)
@@ -55,9 +57,7 @@ Once installed, the [macro functions](https://duckdb.org/community_extensions/ex

 ### Extracting The Main Domain

-This function extracts the main domain from a URL. For this purpose, the extension will get all public suffixes from the [publicsuffix.org](https://publicsuffix.org/) list and extract the main domain from the URL.
-
-The download process of the public suffix list is done automatically when the function is called for the first time. After that, the list is stored in the `public_suffix_list` table to avoid downloading it again.
+This function extracts the main domain from a URL using an optimized static TLD lookup system. The extension uses Mozilla's Public Suffix List compiled into a gperf-generated perfect hash function for O(1) TLD lookups with zero collisions.

 ```sql
 D SELECT extract_domain('a.example.com') AS domain;
@@ -77,23 +77,7 @@ D SELECT extract_domain('https://b.a.example.com/path') AS domain;
 └─────────────┘
 ```

-You can use the `update_suffixes` function to update the public suffix list manually.
-
-```sql
-D SELECT update_suffixes();
-┌───────────────────┐
-│ update_suffixes() │
-│      varchar      │
-├───────────────────┤
-│      updated      │
-└───────────────────┘
-```
-
-> [!WARNING]
-> This is a public service with a limited number of requests. If you call the function too many times, you may get a 403 error.
-> `<?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message></Error>`
-> The list usually changes a few times per week; more frequent downloading will cause rate limiting.
-> In this case, you can download the list manually from [publicsuffix.org](https://publicsuffix.org/) and save it in the `public_suffix_list` table.
+The TLD lookup is built into the extension at compile time using the latest Mozilla Public Suffix List. No runtime downloads or database operations are required.

 ### Extracting The Path

@@ -234,7 +218,7 @@ D SELECT extract_extension('http://example.com/image.jpg') AS ext;

 ### Extracting The TLD (Top-Level Domain)

-This function extracts the top-level domain from a URL. This function will use the public suffix list to extract the TLD. Check the [Extracting The Main Domain](#extracting-the-main-domain) section for more information about the public suffix list.
+This function extracts the top-level domain from a URL using the optimized gperf-based public suffix lookup system. The function correctly handles multi-part TLDs (like `com.au`) using the longest-match algorithm from Mozilla's Public Suffix List.

 ```sql
 D SELECT extract_tld('https://example.com.ac/path/path') AS tld;
@@ -256,7 +240,7 @@ D SELECT extract_tld('a.example.com') AS tld;

 ### Extracting The Sub Domain

-This function extracts the sub-domain from a URL. This function will use the public suffix list to extract the TLD. Check the [Extracting The Main Domain](#extracting-the-main-domain) section for more information about the public suffix list.
+This function extracts the sub-domain from a URL using the optimized public suffix lookup system to correctly identify the domain boundary and extract everything before it.

 ```sql
 D SELECT extract_subdomain('http://a.b.example.com/path') AS dns_record;
@@ -398,6 +382,10 @@ D SELECT * FROM netquack_version();
 └─────────┘
 ```

+### 🛠 **Build Requirements**
+
+- **gperf required**: Perfect hash generation requires `gperf` (install via `brew install gperf` or `apt-get install gperf`)
+
 ## Debugging

 The debugging process for DuckDB extensions is not an easy job. For Netquack, we have created a log file in the current directory. The log file is named `netquack.log` and contains all the logs for the extension. You can use this file to debug your code.

docs/functions/extract-domain.md

Lines changed: 2 additions & 2 deletions
@@ -14,9 +14,9 @@ layout:

 # Extract Domain

-This function extracts the main domain from a URL. For this purpose, the extension will get all public suffixes from the [publicsuffix.org](https://publicsuffix.org/) list and extract the main domain from the URL.
+This function extracts the main domain from a URL using an optimized static TLD lookup system. The extension uses Mozilla's Public Suffix List compiled into a gperf-generated perfect hash function for O(1) TLD lookups with zero collisions.

-The download process of the public suffix list is done automatically when the function is called for the first time. After that, the list is stored in the `public_suffix_list` table to avoid downloading it again.
+The TLD lookup is built into the extension at compile time using the latest Mozilla Public Suffix List. No runtime downloads or database operations are required.

 ```sql
 D SELECT extract_domain('a.example.com') AS domain;

scripts/benchmark.sh

Lines changed: 126 additions & 0 deletions
New file (shown clean, without the diff viewer's line-number columns). The loop bound is corrected from `{0..9}` to `{0..8}`, since `functions` holds nine entries (indices 0–8); the original upper bound produced one empty iteration.

```bash
#!/usr/bin/env bash

echo "=== NetQuack Performance Comparison ==="
echo ""

# Change to the project root directory (parent of script location)
CWD="$(cd "$(dirname "$0")/.." && pwd)"
cd "$CWD"

# Create results directory in build (gitignored)
mkdir -p build/benchmark_results

# Function to create benchmark SQL with specified load command
create_benchmark_sql() {
    local load_command="$1"
    local version_label="$2"
    local output_file="$3"

    cat > "$output_file" << EOF
${load_command}

-- Create test data
CREATE TABLE benchmark_urls AS SELECT * FROM (VALUES
    ('https://www.example.com/path?query=value#fragment'),
    ('http://subdomain.example.co.uk:8080/long/path/to/file.html?param1=value1&param2=value2'),
    ('ftp://user:[email protected]:21/directory/file.zip'),
    ('https://blog.example.org/2023/12/article-title.html'),
    ('http://api.service.example.net:3000/v1/users/123?format=json'),
    ('https://cdn.assets.example.com/images/logo.png'),
    ('mailto:[email protected]'),
    ('http://localhost:8080/debug'),
    ('https://secure.payment.example.gov:8443/transaction?id=abc123'),
    ('https://example.com')
) AS t(url);

-- Expand dataset
CREATE TABLE large_benchmark_urls AS
WITH RECURSIVE series(i) AS (
    SELECT 1
    UNION ALL
    SELECT i + 1 FROM series WHERE i <= 3000
)
SELECT url FROM benchmark_urls CROSS JOIN series;

.timer on
.print '=== ${version_label} BENCHMARKS ==='

.print 'extract_schema:'
SELECT extract_schema(url) FROM large_benchmark_urls;

.print 'extract_host:'
SELECT extract_host(url) FROM large_benchmark_urls;

.print 'extract_port:'
SELECT extract_port(url) FROM large_benchmark_urls;

.print 'extract_path:'
SELECT extract_path(url) FROM large_benchmark_urls;

.print 'extract_query_string:'
SELECT extract_query_string(url) FROM large_benchmark_urls;

.print 'extract_domain:'
SELECT extract_domain(url) FROM large_benchmark_urls;

.print 'extract_subdomain:'
SELECT extract_subdomain(url) FROM large_benchmark_urls;

.print 'extract_tld:'
SELECT extract_tld(url) FROM large_benchmark_urls;

.print 'extract_extension:'
SELECT extract_extension(url) FROM large_benchmark_urls;
EOF
}

# Create temporary directory
mkdir -p build/tmp

# Create benchmark SQL files using the DRY function
create_benchmark_sql "FORCE INSTALL netquack FROM community; LOAD netquack;" "PUBLISHED VERSION" "build/tmp/published_benchmark.sql"
create_benchmark_sql "LOAD './build/release/extension/netquack/netquack.duckdb_extension';" "LOCAL VERSION" "build/tmp/local_benchmark.sql"

echo "Step 1: Installing and benchmarking PUBLISHED NetQuack extension..."
duckdb < build/tmp/published_benchmark.sql > build/benchmark_results/published_full_output.txt 2>&1

echo "Step 2: Running benchmarks on LOCAL implementation..."
./build/release/duckdb < build/tmp/local_benchmark.sql > build/benchmark_results/local_full_output.txt 2>&1

echo "Step 3: Generating comparison analysis..."

# Extract times and calculate improvements
echo "📊 RESULTS SUMMARY:" > build/benchmark_results/analysis.txt
echo "" >> build/benchmark_results/analysis.txt

echo "| Function | Published Time | Local Time | Improvement |" >> build/benchmark_results/analysis.txt
echo "|----------|----------------|----------------|-------------|" >> build/benchmark_results/analysis.txt

# Extract timing data using a more robust approach
functions=("extract_schema" "extract_host" "extract_port" "extract_path" "extract_query_string" "extract_domain" "extract_subdomain" "extract_tld" "extract_extension")

# Extract all times into temporary files
grep "Run Time" build/benchmark_results/published_full_output.txt | grep -o "real [0-9.]*" | cut -d' ' -f2 > build/tmp/published_times.txt
grep "Run Time" build/benchmark_results/local_full_output.txt | grep -o "real [0-9.]*" | cut -d' ' -f2 > build/tmp/local_times.txt

# Process each of the nine functions (indices 0..8)
for i in {0..8}; do
    func=${functions[$i]}
    line_num=$((i + 1))

    published_time=$(sed -n "${line_num}p" build/tmp/published_times.txt)
    local_time=$(sed -n "${line_num}p" build/tmp/local_times.txt)

    if [ -n "$published_time" ] && [ -n "$local_time" ]; then
        improvement=$(echo "scale=1; $published_time / $local_time" | bc -l)
        echo "| ${func} | ${published_time}s | ${local_time}s | ${improvement}x faster |" >> build/benchmark_results/analysis.txt
    fi
done

# Clean up timing files
rm -f build/tmp/published_times.txt build/tmp/local_times.txt

echo ""
echo "✅ Benchmark comparison complete!"
echo ""
cat build/benchmark_results/analysis.txt
```

scripts/generate_tld_lookup.sh

Lines changed: 154 additions & 0 deletions
New file (shown clean, without the diff viewer's line-number columns):

```bash
#!/usr/bin/env bash

# Download the public suffix list if not present
if [ ! -f public_suffix_list.dat ]; then
    echo "Downloading public suffix list..."
    wget -nv -O public_suffix_list.dat https://publicsuffix.org/list/public_suffix_list.dat
fi

# Generate C++ header with TLD lookup
echo "Generating TLD lookup header..."

# Generate gperf file like ClickHouse does
echo '%language=C++
%define lookup-function-name isValidTLD
%define class-name TLDLookupHash
%readonly-tables
%includes
%compare-strncmp
%{
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wimplicit-fallthrough"
#pragma GCC diagnostic ignored "-Wzero-as-null-pointer-constant"
#pragma GCC diagnostic ignored "-Wunused-macros"
// NOLINTBEGIN(modernize-macro-to-enum)
%}
# List generated using https://publicsuffix.org/list/public_suffix_list.dat
%%' > src/utils/tld_lookup.gperf

# Extract ALL TLDs from the public suffix list (both single and multi-part)
# Remove comments, empty lines, exception rules, and wildcards
grep -v "^//" public_suffix_list.dat | \
    grep -v "^$" | \
    grep -v "^\!" | \
    grep -v "^\*" | \
    sed 's/^[[:space:]]*//' | \
    sort -u >> src/utils/tld_lookup.gperf

echo "%%" >> src/utils/tld_lookup.gperf
echo '// NOLINTEND(modernize-macro-to-enum)
#pragma GCC diagnostic pop' >> src/utils/tld_lookup.gperf

# Generate perfect hash function using gperf (required)
if ! command -v gperf >/dev/null 2>&1; then
    echo "Error: gperf is required for optimal TLD lookup generation"
    echo "Please install gperf: brew install gperf (macOS) or apt-get install gperf (Linux)"
    exit 1
fi

echo "Generating perfect hash function with gperf..."
gperf src/utils/tld_lookup.gperf > src/utils/tld_lookup_generated.hpp

# Remove deprecated 'register' storage class specifier for C++17 compatibility
echo "Removing deprecated 'register' keywords for C++17 compatibility..."
sed -i.bak 's/register const char \*/const char */g; s/register unsigned int/unsigned int/g' src/utils/tld_lookup_generated.hpp
rm -f src/utils/tld_lookup_generated.hpp.bak

# Create header file
cat > src/utils/tld_lookup.hpp << 'EOF'
// Auto-generated from Mozilla Public Suffix List using gperf

#pragma once

#include <string>
#include <cstring>

namespace duckdb
{
    namespace netquack
    {
        // Check if a suffix is a valid public suffix (TLD) using perfect hash
        bool isValidTLD(const char* str, size_t len);
        bool isValidTLD(const std::string& suffix);

        // Get the effective TLD for a hostname
        std::string getEffectiveTLD(const std::string& hostname);
    }
}
EOF

# Create implementation file
cat > src/utils/tld_lookup.cpp << 'EOF'
// Auto-generated from Mozilla Public Suffix List using gperf

#include "tld_lookup.hpp"
#include "tld_lookup_generated.hpp"

namespace duckdb
{
    namespace netquack
    {
        bool isValidTLD(const char* str, size_t len)
        {
            return TLDLookupHash::isValidTLD(str, len) != nullptr;
        }

        bool isValidTLD(const std::string& suffix)
        {
            return isValidTLD(suffix.c_str(), suffix.length());
        }

        std::string getEffectiveTLD(const std::string& hostname)
        {
            if (hostname.empty())
                return "";

            // Implement proper public suffix algorithm:
            // find the longest matching public suffix.

            // First check if the entire hostname is a TLD
            if (isValidTLD(hostname))
            {
                return hostname;
            }

            // Try all possible suffixes and find the longest match
            std::string longest_tld;

            for (size_t pos = 0; pos < hostname.length(); ++pos)
            {
                if (hostname[pos] == '.')
                {
                    std::string candidate = hostname.substr(pos + 1);
                    if (isValidTLD(candidate))
                    {
                        // Keep the longest match
                        if (candidate.length() > longest_tld.length())
                        {
                            longest_tld = candidate;
                        }
                    }
                }
            }

            // If we found a valid TLD, return it
            if (!longest_tld.empty())
            {
                return longest_tld;
            }

            // If no valid TLD found, return the last part after the last dot
            size_t last_dot = hostname.find_last_of('.');
            if (last_dot != std::string::npos)
            {
                return hostname.substr(last_dot + 1);
            }

            // No dots, return entire hostname
            return hostname;
        }
    }
}
EOF

echo "TLD lookup files generated successfully!"
```
