Commit 99dd9b2

SpamScorer: normalise whitespace in comparable form
Collapse Unicode whitespace runs to a single space and trim in SpamScorer::RichText#to_comparable_form to improve SpammyPhrase matching across newlines/tabs.
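
As a rough illustration of why this helps phrase matching, here is a minimal standalone sketch of the same normalisation chain. The names `comparable_form` and `phrase_match?` are hypothetical helpers written for this example. They are not part of SpamScorer, and the substring check is only an assumption about how a phrase might be compared against text.

```ruby
# Hypothetical standalone version of the normalisation chain in this commit.
def comparable_form(str)
  folded     = str.downcase(:fold)             # Unicode case folding
  normalised = folded.unicode_normalize(:nfkc) # compatibility normalisation
  collapsed  = normalised.gsub(/\s+/u, " ")    # whitespace runs become one space
  collapsed.strip                              # trim leading/trailing whitespace
end

# Assumed matching strategy: plain substring check on the comparable forms.
def phrase_match?(text, phrase)
  comparable_form(text).include?(comparable_form(phrase))
end

puts phrase_match?("Foo\n\tBAR", "foo bar")    # => true
puts "foo\nbar".downcase.include?("foo bar")   # => false (no whitespace collapsing)
```

Without the `gsub` step, a phrase split over a line break or tab would not line up with the single-space form of the stored phrase.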
1 parent 9456eb0 commit 99dd9b2

2 files changed: +16 -1 lines changed

lib/spam_scorer/rich_text.rb

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ def score
     attr_reader :text
 
     def to_comparable_form(str)
-      str.downcase(:fold).unicode_normalize(:nfkc)
+      str.downcase(:fold).unicode_normalize(:nfkc).gsub(/\s+/u, " ").strip
     end
   end
 end
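
One detail about the changed line may be worth checking. In Ruby, `\s` matches only ASCII whitespace, even with the `/u` flag, and `String#strip` also trims only ASCII whitespace and null. The no-break space used in the new test is still handled, because NFKC maps U+00A0 to a plain space before the `gsub` runs. Characters with no compatibility mapping, such as U+2028 LINE SEPARATOR, would presumably pass through unchanged. The sketch below illustrates that behaviour and shows the POSIX bracket class `[[:space:]]` as one possible alternative. `ascii_form` and `posix_form` are hypothetical names used only for this comparison.

```ruby
# Same chain as the commit: \s covers ASCII whitespace only.
def ascii_form(str)
  str.downcase(:fold).unicode_normalize(:nfkc).gsub(/\s+/u, " ").strip
end

# Assumed alternative: [[:space:]] is Unicode-aware in Ruby regexps.
def posix_form(str)
  str.downcase(:fold).unicode_normalize(:nfkc).gsub(/[[:space:]]+/, " ").strip
end

nbsp     = "foo\u00A0bar"  # NO-BREAK SPACE, folded to " " by NFKC
line_sep = "foo\u2028bar"  # LINE SEPARATOR, untouched by NFKC

puts ascii_form(nbsp) == "foo bar"      # => true
puts ascii_form(line_sep) == "foo bar"  # => false
puts posix_form(line_sep) == "foo bar"  # => true
```

Whether the remaining characters matter in practice depends on the input SpamScorer actually sees. Note that `strip` would still leave such characters in place at the ends of the string.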

test/lib/spam_scorer_test.rb

Lines changed: 15 additions & 0 deletions
@@ -44,4 +44,19 @@ def test_spammy_phrases
     scorer = SpamScorer.new_from_rich_text(r)
     assert_equal 160, scorer.score.round
   end
+
+  def test_to_comparable_form_collapses_unicode_whitespace_and_trims
+    r = RichText.new("text", "x")
+    scorer = SpamScorer.new_from_rich_text(r)
+
+    input = " A\u00A0\tB\n\nC "
+    assert_equal "a b c", scorer.send(:to_comparable_form, input)
+  end
+
+  def test_spammy_phrase_can_match_across_newlines_after_normalization
+    create(:spammy_phrase, :phrase => "foo bar")
+    r = RichText.new("markdown", "foo\nbar")
+    scorer = SpamScorer.new_from_rich_text(r)
+    assert_equal 40, scorer.score.round
+  end
 end
