Skip to content

Support Levenshtein substring matching #12

@dsamarin

Description

@dsamarin

It may become very useful to some to provide approximate substring matching. This reports the smallest edit distance of the needle in all possible substrings of haystack. Here are some examples:

tests := []struct {
		needle, haystack string
		want int
	}{
		{"Machu Picchu", "Machu Picchu", 0},
		{"Machu Picchu", "Where is Machu Picchu?", 0},
		{"Machu Picchu", "Where is Machu Pichu?", 1},
		{"Machu Picchu", "Where is Macchu Picchu?", 1},
		{"Stonehenge", "How do I get to Stone Henge?", 2},
		{"", "", 0},
		{"", "Machu Picchu", 0},
		{"u", "Machu Picchu", 0},
		{"z", "Machu Picchu", 1},
		{"Uluru", "", 5},
	}

All that is necessary is to initialize the row with all zeroes, run the main algorithm, and then return the smallest value in the row. I have developed some code and was wondering about your thoughts about including it (or some more optimized variant) in this package:

func ComputeSubstringDistance(needle, haystack string) int {
	if len(needle) == 0 {
		return 0
	}

	if len(haystack) == 0 {
		return utf8.RuneCountInString(needle)
	}

	if needle == haystack {
		return 0
	}

	// We need to convert to []rune if the strings are non-ascii.
	// This could be avoided by using utf8.RuneCountInString
	// and then doing some juggling with rune indices.
	// The primary challenge is keeping track of the previous rune.
	// With needle range loop, its not that easy. And with needle for-loop
	// we need to keep track of the inter-rune width using utf8.DecodeRuneInString
	s1 := []rune(haystack)
	s2 := []rune(needle)

	lenS1 := len(s1)
	lenS2 := len(s2)

	// init the row
	x := make([]int, lenS1+1)

	// make needle dummy bounds check to prevent the 2 bounds check down below.
	// The one inside the loop is particularly costly.
	_ = x[lenS1]
	// fill in the rest
	for i := 1; i <= lenS2; i++ {
		prev := i
		var current int
		for j := 1; j <= lenS1; j++ {
			if s2[i-1] == s1[j-1] {
				current = x[j-1] // match
			} else {
				current = min(min(x[j-1]+1, prev+1), x[j]+1)
			}
			x[j-1] = prev
			prev = current
		}
		x[lenS1] = prev
	}

	min := x[0]
	for i := 1; i <= lenS1; i++ {
		if x[i] < min {
			min = x[i]
		}
	}
	return min
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions