High-performance fuzzy matching for Polars DataFrames that intelligently combines exact fuzzy matching with approximate joins for optimal performance on datasets of any size.
This library automatically selects the best matching strategy based on your data:
- Small datasets (< 100M comparisons): Uses exact fuzzy matching with a full cross join
- Large datasets (≥ 100M comparisons): Automatically switches to approximate nearest neighbor joins using `polars-simed`
- Intelligent optimization: Pre-filters candidates using approximate methods, then applies exact fuzzy scoring
This hybrid approach means you get:
- ✅ Best-in-class performance regardless of data size
- ✅ High accuracy with configurable similarity thresholds
- ✅ Memory efficiency through chunked processing
- ✅ No manual optimization needed - the library handles it automatically
- 🚀 Dual-Mode Performance: Combines exact fuzzy matching with approximate joins
- 🎯 Multiple Algorithms: Support for Levenshtein, Jaro, Jaro-Winkler, Hamming, Damerau-Levenshtein, and Indel
- 🔧 Smart Optimization: Automatic query optimization based on data uniqueness and size
- 💾 Memory Efficient: Chunked processing and intelligent caching for massive datasets
- 🔄 Incremental Matching: Support for multi-column fuzzy matching with result filtering
- ⚡ Automatic Strategy Selection: No configuration needed - automatically picks the fastest approach
```bash
pip install pl-fuzzy-frame-match
```

Or using Poetry:

```bash
poetry add pl-fuzzy-frame-match
```

Performance comparison on commodity hardware (M3 Mac, 36GB RAM):
| Dataset Size | Cartesian Product | Standard Cross Join Fuzzy match | Automatic Selection | Speedup |
|---|---|---|---|---|
| 500 × 400 | 200K | 0.04s | 0.03s | 1.3x |
| 3K × 2K | 6M | 0.39s | 0.39s | 1x |
| 10K × 8K | 80M | 18.67s | 18.79s | 1x |
| 15K × 10K | 150M | 40.82s | 1.45s | 28x |
| 40K × 30K | 1.2B | 363.50s | 4.75s | 76x |
| 400K × 10K | 4B | Skipped* | 34.52s | ∞ |
*Skipped due to prohibitive runtime
Key Observations:
- Small to medium datasets (< 100M comparisons): Automatic selection uses a standard cross join for optimal speed and accuracy
- Large datasets (≥ 100M comparisons): Automatic selection pre-filters candidates with approximate matching, then applies exact fuzzy scoring
- Memory efficiency: Can handle billions of potential comparisons without running out of memory
```python
import polars as pl
from pl_fuzzy_frame_match import fuzzy_match_dfs, FuzzyMapping

# Create sample dataframes
left_df = pl.DataFrame({
    "name": ["John Smith", "Jane Doe", "Bob Johnson"],
    "id": [1, 2, 3]
}).lazy()

right_df = pl.DataFrame({
    "customer": ["Jon Smith", "Jane Does", "Robert Johnson"],
    "customer_id": [101, 102, 103]
}).lazy()

# Define fuzzy matching configuration
fuzzy_maps = [
    FuzzyMapping(
        left_col="name",
        right_col="customer",
        threshold_score=80.0,  # 80% similarity threshold
        fuzzy_type="levenshtein"
    )
]

# Perform fuzzy matching
result = fuzzy_match_dfs(
    left_df=left_df,
    right_df=right_df,
    fuzzy_maps=fuzzy_maps,
    logger=your_logger  # Pass your logger instance
)
print(result)
```

Match on multiple columns with different algorithms:

```python
fuzzy_maps = [
    FuzzyMapping(
        left_col="name",
        right_col="customer_name",
        threshold_score=85.0,
        fuzzy_type="jaro_winkler"
    ),
    FuzzyMapping(
        left_col="address",
        right_col="customer_address",
        threshold_score=75.0,
        fuzzy_type="levenshtein"
    )
]

result = fuzzy_match_dfs(left_df, right_df, fuzzy_maps, logger)
```

For complex matching scenarios, use `FuzzyMapExpr` to combine conditions with AND (`&`) and OR (`|`) operators, similar to Polars expressions:
```python
from pl_fuzzy_frame_match import FuzzyMapExpr, fuzzy_match_dfs

# Define individual match conditions
name_match = FuzzyMapExpr(
    left_col="name",
    right_col="customer_name",
    threshold_score=85.0,
    fuzzy_type="jaro_winkler"
)

city_match = FuzzyMapExpr(
    left_col="city",
    right_col="customer_city",
    threshold_score=90.0
)

email_match = FuzzyMapExpr(
    left_col="email",
    right_col="customer_email",
    threshold_score=95.0
)

# Combine with AND/OR logic:
# Match if (name AND city match) OR (email matches perfectly)
expr = (name_match & city_match) | email_match

result = fuzzy_match_dfs(left_df, right_df, expr, logger)
```

Key features:
- `&` (AND): Both conditions must match
- `|` (OR): At least one condition must match
- Operator precedence follows Python rules: `&` binds tighter than `|`, so `a | b & c` evaluates as `a | (b & c)`
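The precedence rule is plain Python operator behaviour. A tiny stand-in class (not part of the library, purely for illustration) makes the grouping visible:

```python
class Cond:
    """Minimal stand-in that records how & and | group expressions."""
    def __init__(self, name: str):
        self.name = name

    def __and__(self, other: "Cond") -> "Cond":
        return Cond(f"({self.name} & {other.name})")

    def __or__(self, other: "Cond") -> "Cond":
        return Cond(f"({self.name} | {other.name})")

a, b, c = Cond("a"), Cond("b"), Cond("c")
print((a | b & c).name)    # & binds tighter: (a | (b & c))
print(((a | b) & c).name)  # parentheses override: ((a | b) & c)
```

Use explicit parentheses whenever the intended grouping differs from Python's default precedence.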
Use cases:
- Address matching: `(street & city & zip) | (street & city)` matches on the full address or falls back to a partial one
- Identity resolution: `(name & dob) | ssn | email` matches on multiple identity signals
- Flexible deduplication: Define fallback matching strategies in a single expression
- `levenshtein`: Edit distance between two strings
- `jaro`: Jaro similarity
- `jaro_winkler`: Jaro-Winkler similarity (good for name matching)
- `hamming`: Hamming distance (requires equal-length strings)
- `damerau_levenshtein`: Like Levenshtein but includes transpositions
- `indel`: Insertion/deletion distance
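To see how the metrics differ in practice, here is a minimal pure-Python sketch contrasting Levenshtein with Damerau-Levenshtein on a transposition. This is for illustration only; the library computes its distances via `polars-distance`:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def damerau_levenshtein(a: str, b: str) -> int:
    # Restricted Damerau-Levenshtein: also counts an adjacent transposition as one edit
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

print(levenshtein("form", "from"))          # swapped letters cost 2 edits
print(damerau_levenshtein("form", "from"))  # ...but only 1 with transpositions
```

This is why `damerau_levenshtein` is the better choice for typo-heavy data where adjacent characters are often swapped.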
The library intelligently combines two approaches based on your data size.

Exact matching (smaller datasets):
- Preprocessing: Analyzes column uniqueness to optimize the join strategy
- Cross Join: Creates all possible combinations
- Exact Scoring: Calculates precise similarity scores using your chosen algorithm
- Filtering: Returns only matches above the threshold
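The cross-join-and-score steps can be sketched in plain Python, using `difflib.SequenceMatcher` from the standard library as a stand-in similarity metric (the library itself scores with `polars-distance`):

```python
from difflib import SequenceMatcher

left = ["John Smith", "Jane Doe"]
right = ["Jon Smith", "Jane Does", "Robert Johnson"]
threshold = 0.8

# Cross join: score every (left, right) pair, keep only those above the threshold
matches = [
    (l, r, round(SequenceMatcher(None, l, r).ratio(), 3))
    for l in left
    for r in right
    if SequenceMatcher(None, l, r).ratio() >= threshold
]
print(matches)
```

Every pair is scored, which is exactly why this path becomes prohibitive once the cartesian product grows into the hundreds of millions.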
Approximate matching (larger datasets):
- Approximate Candidate Selection: Uses `polars-simed` to quickly find likely matches
- Chunked Processing: Processes large datasets in memory-efficient chunks
- Reduced Comparisons: Only scores the most promising pairs instead of all combinations
- Final Scoring: Applies exact fuzzy matching to the reduced candidate set
```python
# The library automatically determines the best approach:
if cartesian_product_size >= 100_000_000 and has_polars_simed:
    # Use approximate join for initial candidate selection
    # This reduces a 1B comparison problem to ~1M comparisons
    use_approximate_matching()
else:
    # Use traditional cross join for smaller datasets
    use_exact_matching()
```

This means you can use the same API whether matching 1,000 or 100 million records!
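The selection rule boils down to a small, testable helper. This is a sketch; the function and constant names are illustrative, not the library's internals:

```python
CROSS_JOIN_LIMIT = 100_000_000  # threshold documented in the benchmarks above

def choose_strategy(left_rows: int, right_rows: int, simed_available: bool) -> str:
    """Pick the matching strategy from the estimated cartesian product size."""
    if left_rows * right_rows >= CROSS_JOIN_LIMIT and simed_available:
        return "approximate"  # pre-filter candidates with polars-simed
    return "exact"  # a full cross join is still affordable

print(choose_strategy(3_000, 2_000, True))    # 6M comparisons -> exact
print(choose_strategy(15_000, 10_000, True))  # 150M comparisons -> approximate
```

Note that without `polars-simed` installed, the library has no approximate path to fall back on, so large inputs go through the (much slower) exact cross join.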
- Large dataset matching: Install `polars-simed` to enable approximate matching: `pip install polars-simed`
- Optimal threshold: Start with higher thresholds (80-90%) for better performance
- Column selection: Use columns with high uniqueness for better candidate reduction
- Algorithm choice:
  - `jaro_winkler`: Best for names and short strings
  - `levenshtein`: Best for general text and typos
  - `damerau_levenshtein`: Best when transpositions are common
- Memory management: The library automatically chunks large datasets, but you can monitor memory usage with logging
- Python >= 3.9
- Polars >= 1.8.2, < 2.0.0
- polars-distance ~= 0.4.3
- polars-simed >= 0.3.4 (optional, for large datasets)
MIT License - see LICENSE file for details
Contributions are welcome! Please feel free to submit a Pull Request.
Built on top of the excellent Polars DataFrame library and polars-distance for string similarity calculations.