pl-fuzzy-frame-match

High-performance fuzzy matching for Polars DataFrames that intelligently combines exact fuzzy matching with approximate joins for optimal performance on datasets of any size.

🚀 Key Innovation: Hybrid Matching Approach

This library automatically selects the best matching strategy based on your data:

Small datasets (< 100M comparisons): Uses exact fuzzy matching with full cross-join
Large datasets (≥ 100M comparisons): Automatically switches to approximate nearest neighbor joins using polars-simed
Intelligent optimization: Pre-filters candidates using approximate methods, then applies exact fuzzy scoring

This hybrid approach means you get:

✅ Best-in-class performance regardless of data size
✅ High accuracy with configurable similarity thresholds
✅ Memory efficiency through chunked processing
✅ No manual optimization needed - the library handles it automatically

Features

🚀 Dual-Mode Performance: Combines exact fuzzy matching with approximate joins
🎯 Multiple Algorithms: Support for Levenshtein, Jaro, Jaro-Winkler, Hamming, Damerau-Levenshtein, and Indel
🔧 Smart Optimization: Automatic query optimization based on data uniqueness and size
💾 Memory Efficient: Chunked processing and intelligent caching for massive datasets
🔄 Incremental Matching: Support for multi-column fuzzy matching with result filtering
⚡ Automatic Strategy Selection: No configuration needed - automatically picks the fastest approach

Installation

pip install pl-fuzzy-frame-match

Or using Poetry:

poetry add pl-fuzzy-frame-match

Performance Benchmarks

Performance comparison on commodity hardware (M3 Mac, 36GB RAM):

Dataset Size	Cartesian Product	Standard Cross Join Fuzzy match	Automatic Selection	Speedup
500 × 400	200K	0.04s	0.03s	1.3x
3K × 2K	6M	0.39s	0.39s	1x
10K × 8K	80M	18.67s	18.79s	1x
15K × 10K	150M	40.82s	1.45s	28x
40K × 30K	1.2B	363.50s	4.75s	76x
400K × 10K	4B	Skipped*	34.52s	∞

*Skipped due to prohibitive runtime

Key Observations:

Small to Medium datasets (< 100M): Automatic selection uses standard cross join for optimal speed and accuracy
Large datasets (≥ 100M): Automatic selection switches to approximate matching first and then matches the dataframes
Memory efficiency: Can handle billions of potential comparisons without running out of memory

Quick Start

import polars as pl
from pl_fuzzy_frame_match import fuzzy_match_dfs, FuzzyMapping

# Create sample dataframes
left_df = pl.DataFrame({
    "name": ["John Smith", "Jane Doe", "Bob Johnson"],
    "id": [1, 2, 3]
}).lazy()

right_df = pl.DataFrame({
    "customer": ["Jon Smith", "Jane Does", "Robert Johnson"],
    "customer_id": [101, 102, 103]
}).lazy()

# Define fuzzy matching configuration
fuzzy_maps = [
    FuzzyMapping(
        left_col="name",
        right_col="customer",
        threshold_score=80.0,  # 80% similarity threshold
        fuzzy_type="levenshtein"
    )
]

# Perform fuzzy matching
result = fuzzy_match_dfs(
    left_df=left_df,
    right_df=right_df,
    fuzzy_maps=fuzzy_maps,
    logger=your_logger  # Pass your logger instance
)

print(result)

Advanced Usage

Multiple Column Matching

# Match on multiple columns with different algorithms
fuzzy_maps = [
    FuzzyMapping(
        left_col="name",
        right_col="customer_name",
        threshold_score=85.0,
        fuzzy_type="jaro_winkler"
    ),
    FuzzyMapping(
        left_col="address",
        right_col="customer_address",
        threshold_score=75.0,
        fuzzy_type="levenshtein"
    )
]

result = fuzzy_match_dfs(left_df, right_df, fuzzy_maps, logger)

AND/OR Logic with FuzzyMapExpr

For complex matching scenarios, use FuzzyMapExpr to combine conditions with AND (&) and OR (|) operators - similar to Polars expressions:

from pl_fuzzy_frame_match import FuzzyMapExpr, fuzzy_match_dfs

# Define individual match conditions
name_match = FuzzyMapExpr(
    left_col="name",
    right_col="customer_name",
    threshold_score=85.0,
    fuzzy_type="jaro_winkler"
)

city_match = FuzzyMapExpr(
    left_col="city",
    right_col="customer_city",
    threshold_score=90.0
)

email_match = FuzzyMapExpr(
    left_col="email",
    right_col="customer_email",
    threshold_score=95.0
)

# Combine with AND/OR logic:
# Match if (name AND city match) OR (email matches perfectly)
expr = (name_match & city_match) | email_match

result = fuzzy_match_dfs(left_df, right_df, expr, logger)

Key features:

& (AND): Both conditions must match
| (OR): At least one condition must match
Operator precedence follows Python rules: & binds tighter than |
a | b & c evaluates as a | (b & c)

Use cases:

Address matching: (street & city & zip) | (street & city) - match on full address or partial
Identity resolution: (name & dob) | ssn | email - match on multiple identity signals
Flexible deduplication: Define fallback matching strategies in a single expression

Supported Algorithms

levenshtein: Edit distance between two strings
jaro: Jaro similarity
jaro_winkler: Jaro-Winkler similarity (good for name matching)
hamming: Hamming distance (requires equal length strings)
damerau_levenshtein: Like Levenshtein but includes transpositions
indel: Insertion/deletion distance

How It Works: The Best of Both Worlds

The library intelligently combines two approaches based on your data size:

For Regular Datasets (< 100M potential matches)

Preprocessing: Analyzes column uniqueness to optimize join strategy
Cross Join: Creates all possible combinations
Exact Scoring: Calculates precise similarity scores using your chosen algorithm
Filtering: Returns only matches above the threshold

For Large Datasets (≥ 100M potential matches)

Approximate Candidate Selection: Uses polars-simed to quickly find likely matches
Chunked Processing: Processes large datasets in memory-efficient chunks
Reduced Comparisons: Only scores the most promising pairs instead of all combinations
Final Scoring: Applies exact fuzzy matching to the reduced candidate set

The Magic: Automatic Strategy Selection

# The library automatically determines the best approach:
if cartesian_product_size >= 100_000_000 and has_polars_simed:
    # Use approximate join for initial candidate selection
    # This reduces a 1B comparison problem to ~1M comparisons
    use_approximate_matching()
else:
    # Use traditional cross join for smaller datasets
    use_exact_matching()

This means you can use the same API whether matching 1,000 or 100 million records!

Performance Tips

Large dataset matching: Install polars-simed to enable approximate matching:
```
pip install polars-simed
```
Optimal threshold: Start with higher thresholds (80-90%) for better performance
Column selection: Use columns with high uniqueness for better candidate reduction
Algorithm choice:
- jaro_winkler: Best for names and short strings
- levenshtein: Best for general text and typos
- damerau_levenshtein: Best when transpositions are common
Memory management: The library automatically chunks large datasets, but you can monitor memory usage with logging

Requirements

Python >= 3.9
Polars >= 1.8.2, < 2.0.0
polars-distance ~= 0.4.3
polars-simed >= 0.3.4 (optional, for large datasets)

License

MIT License - see LICENSE file for details

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments

Built on top of the excellent Polars DataFrame library and polars-distance for string similarity calculations.

Name		Name	Last commit message	Last commit date
Latest commit History 68 Commits
.github/workflows		.github/workflows
docs		docs
performance_test		performance_test
pl_fuzzy_frame_match		pl_fuzzy_frame_match
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
pyproject.toml		pyproject.toml
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pl-fuzzy-frame-match

🚀 Key Innovation: Hybrid Matching Approach

Features

Installation

Performance Benchmarks

Quick Start

Advanced Usage

Multiple Column Matching

AND/OR Logic with FuzzyMapExpr

Supported Algorithms

How It Works: The Best of Both Worlds

For Regular Datasets (< 100M potential matches)

For Large Datasets (≥ 100M potential matches)

The Magic: Automatic Strategy Selection

Performance Tips

Requirements

License

Contributing

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

pl-fuzzy-frame-match

🚀 Key Innovation: Hybrid Matching Approach

Features

Installation

Performance Benchmarks

Quick Start

Advanced Usage

Multiple Column Matching

AND/OR Logic with FuzzyMapExpr

Supported Algorithms

How It Works: The Best of Both Worlds

For Regular Datasets (< 100M potential matches)

For Large Datasets (≥ 100M potential matches)

The Magic: Automatic Strategy Selection

Performance Tips

Requirements

License

Contributing

Acknowledgments

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages