feat: add SimilarityFilterBlock for near-duplicate filtering #723

Closed
UgaTheDev wants to merge 1 commit into instructlab:main from UgaTheDev:add-similarity-filter-block
@UgaTheDev
Summary

Adds a standalone SimilarityFilterBlock to instructlab-sdg that removes near-duplicate rows from a HuggingFace Dataset based on text similarity.

This is a follow-up to Red-Hat-AI-Innovation-Team/sdg_hub#652, where this block was originally part of a larger InstructLab Q&A pipeline PR. Maintainer feedback on that PR identified SimilarityFilterBlock as a useful, framework-agnostic addition and welcomed it as a separate PR.

Design

The block follows the existing FilterByValueBlock pattern exactly:

  • Inherits from Block (ABC base class)
  • Registered via @BlockRegistry.register("SimilarityFilterBlock")
  • Module-level helper functions (_similarity, _deduplicate_group) to avoid pickling issues with multiprocessing
  • Exported from instructlab.sdg public API

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| filter_column | str | required | Column containing the text to compare |
| threshold | float | 0.85 | Similarity ratio (0.0–1.0). A row is dropped if its similarity to any already-kept row exceeds this value |
| group_by | str \| None | None | If set, deduplication is scoped to groups of rows sharing the same value in this column |
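
To give a feel for the threshold parameter, the sketch below scores a few string pairs with difflib.SequenceMatcher.ratio(), the measure this PR names. The helper name `similarity` and the sample questions are illustrative assumptions, not code taken from the PR.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return the SequenceMatcher ratio in [0.0, 1.0]; 1.0 means identical."""
    return SequenceMatcher(None, a, b).ratio()


THRESHOLD = 0.85  # default value of the threshold parameter

# (already-kept text, candidate text) pairs; sample data is made up.
pairs = [
    ("What is the capital of France?", "What is the capital of France?"),
    ("What is the capital of France?", "What's the capital of France?"),
    ("What is the capital of France?", "How do plants make energy?"),
]

for kept, candidate in pairs:
    score = similarity(kept, candidate)
    # A candidate is dropped only when its score is strictly above the threshold.
    verdict = "drop" if score > THRESHOLD else "keep"
    print(f"{score:.2f} {verdict}: {candidate!r}")
```

With these samples, the exact copy and the contracted variant score above 0.85 and would be dropped, while the unrelated question stays well below it.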

Algorithm

  1. Convert the Dataset to a pandas DataFrame
  2. If group_by is set and that column exists, group the rows by it
  3. Within each group (or globally), iterate over the rows and compare each one against the rows kept so far using difflib.SequenceMatcher.ratio()
  4. Drop any row whose similarity to a kept row exceeds the threshold
  5. Convert the result back to a HuggingFace Dataset via dataset_from_pandas_dataframe
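
The steps above can be sketched in plain Python. This is a minimal illustration, not the code in similarityfilterblock.py: it assumes rows are plain dicts instead of a Dataset or DataFrame (so steps 1 and 5 are omitted), the function bodies are guesses that only reuse the helper names `_similarity` and `_deduplicate_group` from the Design section, and `deduplicate` is a hypothetical stand-in for the block's generate method.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def _similarity(a: str, b: str) -> float:
    """Similarity ratio in [0.0, 1.0] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def _deduplicate_group(rows, filter_column, threshold):
    """Greedy pass: keep a row only if no already-kept row is too similar."""
    kept = []
    for row in rows:
        text = str(row[filter_column])
        if all(_similarity(text, str(k[filter_column])) <= threshold for k in kept):
            kept.append(row)
    return kept


def deduplicate(rows, filter_column, threshold=0.85, group_by=None):
    """Hypothetical stand-in for the block: dedupe globally or per group."""
    rows = list(rows)
    # Fall back to global deduplication if group_by is unset or missing.
    if group_by is None or not all(group_by in r for r in rows):
        return _deduplicate_group(rows, filter_column, threshold)
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row)
    result = []
    # Groups are emitted in order of first appearance, not original row order.
    for group_rows in groups.values():
        result.extend(_deduplicate_group(group_rows, filter_column, threshold))
    return result


# Made-up sample data for illustration only.
rows = [
    {"document_id": "a", "question": "What is the capital of France?"},
    {"document_id": "a", "question": "What's the capital of France?"},
    {"document_id": "b", "question": "What is the capital of France?"},
    {"document_id": "b", "question": "How do plants make energy?"},
]

print(len(deduplicate(rows, "question")))
print(len(deduplicate(rows, "question", group_by="document_id")))
```

In this sample, global deduplication keeps two rows, while scoping by document_id keeps three, because the identical question in document "b" is never compared with the one in document "a". Each row is compared against every kept row in its group, so the pass is quadratic in group size in the worst case.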

Dependencies

Zero new dependencies. Uses only difflib.SequenceMatcher (stdlib) and pandas/datasets (already required).

Files Changed

| File | Change |
| --- | --- |
| src/instructlab/sdg/blocks/similarityfilterblock.py | New block implementation (111 lines) |
| src/instructlab/sdg/__init__.py | Added to __all__ and imports |
| tests/unit/test_similarityfilterblock.py | 7 unit tests |

Usage Example

from instructlab.sdg import SimilarityFilterBlock

block = SimilarityFilterBlock(
    ctx=pipeline_context,
    pipe=pipeline,
    block_name="deduplicate_questions",
    filter_column="question",
    threshold=0.85,
    group_by="document_id",  # optional
)
filtered_dataset = block.generate(dataset)

In pipeline YAML:

- block_type: SimilarityFilterBlock
  block_name: deduplicate_questions
  filter_column: question
  threshold: 0.85
  group_by: document_id

Test Plan

  • 7 unit tests covering:
    • Unique rows preserved
    • Exact duplicates removed
    • Near-duplicates caught at threshold
    • group_by isolates groups correctly
    • group_by deduplicates within same group
    • Empty dataset returns empty
    • Lower threshold is more aggressive than higher
  • All pre-existing unit tests are unaffected: 103 passed, and the 7 failures predate this change (caused by missing optional dependencies tesserocr and submodlib-py)
  • Public API import verified: from instructlab.sdg import SimilarityFilterBlock

Add a new SimilarityFilterBlock that removes near-duplicate rows from a
Dataset based on text similarity using difflib.SequenceMatcher.

Supports configurable similarity threshold and optional group_by column
to scope deduplication within groups. Zero new dependencies.
@UgaTheDev
Author

Closing — this was submitted to the wrong repository. The SimilarityFilterBlock belongs in Red-Hat-AI-Innovation-Team/sdg_hub, not instructlab/sdg. Will resubmit there.

@UgaTheDev UgaTheDev closed this Apr 5, 2026
Labels

ci-failure, testing
