Jump to content

Sky Slate Blueberry Blackcurrant Watermelon Strawberry Orange Banana Apple Emerald Chocolate
Photo

Fuzzy string searching with Damerau–Levenshtein distance


  • Please log in to reply
15 replies to this topic
polyethene
  • Members
  • 5519 posts
  • Last active: May 17 2015 06:39 AM
  • Joined: 26 Oct 2012
Someone might find this useful, like I did...

Test cases:
tests =
( LTrim
	AHK	ahk
	He	ben
	this	tihs
	Toralf	Titan
	google	goggle
)
Loop, Parse, tests, `n
{
	StringSplit, w, A_LoopField, %A_Tab%
	l .= """" . w1 . """	=>	""" . w2 . """	" . DamerauLevenshteinDistance(w1, w2) . "`n"
}
MsgBox, %l%
Gives:

"AHK" => "ahk" 0
"He" => "ben" 2
"this" => "tihs" 1
"Toralf" => "Titan" 6
"google" => "goggle" 1


Download

autohotkey.com/net Site Manager

 

Contact me by email (polyethene at autohotkey.net) or message tidbit


Jero3n
  • Members
  • 147 posts
  • Last active: Mar 31 2010 05:07 PM
  • Joined: 19 Jan 2007
What is DamerauLevenshteinDistance? :roll:

polyethene
  • Members
  • 5519 posts
  • Last active: May 17 2015 06:39 AM
  • Joined: 26 Oct 2012
It's an algorithm for calculating distance between strings by two guys called Damerau and Levenshtein. It lets you find matches based on proximity, which allows for human errors such as typos and spelling mistakes. In the example above "tihs" is 1 distance away from "this" so it's safe to assume the latter is implied, conversely "Toralf" and "Titan" are very far so it can be ignored. You can find more information on teh intarweb.

autohotkey.com/net Site Manager

 

Contact me by email (polyethene at autohotkey.net) or message tidbit


Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005
This function can be useful, although it is a simple (2-line) extension of the already posted Levenshtein_distance function, and does not compute the Damerau–Levenshtein distance, but the cost of the so-called optimal string alignment. You can verify it with
MsgBox % DamerauLevenshteinDistance("TO","OST")
It returns 3 instead of the correct 2 (see in the cited Wikipedia article). The Damerau–Levenshtein distance is the minimal number of insertions, deletions, substitutions and transpositions needed to transform one string into the other. Another problem with this function is that the computed distance does not satisfy the triangle inequality, and so it cannot be easily employed in text search applications. So, some more work is needed...

Jero3n
  • Members
  • 147 posts
  • Last active: Mar 31 2010 05:07 PM
  • Joined: 19 Jan 2007
Than this is really usefull for an search function (what I want to implement in my application i'm writing at the moment), you can check it and when it has <2, I show something like: Did you meant ...

polyethene
  • Members
  • 5519 posts
  • Last active: May 17 2015 06:39 AM
  • Joined: 26 Oct 2012

It returns 3 instead of the correct 2 (see in the cited Wikipedia article).

The wiki says: "the distance between TO and OST is 3" so it's correct. Have I missed something?

I prefer this variant as it doesn't penalize transportation, which according to the same wiki article "correspond to more than 80% of all human misspellings." Triangle inequality can be good at times as it gives weight to the control, producing more relevant search results and accurate comparisons.

you can check it and when it has <2, I show something like: Did you meant ...

Cool. If your dataset is extremely large it may be worth computing dispersion and using matches within the first quartile.

autohotkey.com/net Site Manager

 

Contact me by email (polyethene at autohotkey.net) or message tidbit


Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005

The wiki says: "the distance between TO and OST is 3"

Wikipedia says: “...the strings can be made equal using one deletion and one transposition. Clearly, the algorithm does not compute precisely the value we want”… “The above-described pseudo-code calculates only restricted edit distance”. You need not quote Wikipedia: it is easy to verify that the right answer is 2, not 3.

Have I missed something?

Yes. This Wikipedia article is misleading, as many more.

I prefer this variant as it doesn't penalize transportation, which according to the same wiki article "correspond to more than 80% of all human misspellings."

Not exactly. What the Wikipedia article means is: “insertions, deletions, substitutions and transpositions … correspond to more than 80% of all human misspellings.” Their pseudo-code does penalize transpositions by counting them as two operations instead of one.

polyethene
  • Members
  • 5519 posts
  • Last active: May 17 2015 06:39 AM
  • Joined: 26 Oct 2012
Calculating transportation in this sense is a lot quicker and accurate enough for most cases. In language semantics "ost" is a lot different from "to," I would expect it to have a closer relation to "lost" which the current Damerau extension ensures. As I said to Jero3n if you have a large index such as a dictionary it would be useful to rank about the deviation.

autohotkey.com/net Site Manager

 

Contact me by email (polyethene at autohotkey.net) or message tidbit


Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005

Calculating transportation in this sense is a lot quicker and accurate enough for most cases.

(It is transposition.) If you like this edit-distance, use it. Just don’t call it “DamerauLevenshteinDistance”. It is different.

... and a copyright notice for some trivial editing and two extra lines added to already posted code?

polyethene
  • Members
  • 5519 posts
  • Last active: May 17 2015 06:39 AM
  • Joined: 26 Oct 2012
Tomato, tomato it's all the same Laszlo :p
I haven't read Damerau's paper, but if he wrote an implementation that has this flaw I see no reason to rename it.

... and a copyright notice for some trivial editing and two extra lines added to already posted code?

I used information from the wikipedia and gave them due attribution. From what I can tell you have done the same thing. Please do not get into an argument with me about this when you know I never based my code around yours. To anybody else reading this: my script is zlibbed so feel absolutely free to copy/remix without restrictions... freedom of speech ftw!

autohotkey.com/net Site Manager

 

Contact me by email (polyethene at autohotkey.net) or message tidbit


Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005

if he wrote an implementation that has this flaw I see no reason to rename it

He did not.

In language semantics "ost" is a lot different from "to," I would expect it to have a closer relation to "lost"

In human languages the notion of closeness is much more complex. The most common typewriter/keyboard misspelling of “the” is “teh”, which should be the closest. Hitting a key next to the right one is a common error, hitting one far away is rare. Other types of common errors are missing one of double letters (“tomorow”, “comon”), bur not certain others (“omorrow”, “commo”). Having a letter repeated is frequent (“appartment”), but inserting a random one is not (“apuartment”). Similar sounding words are also often swapped (“write” – “right”), but a blind edit distance metrics will not find them close. This is why good spell checkers, indexers use (also) misspellings tables.

polyethene
  • Members
  • 5519 posts
  • Last active: May 17 2015 06:39 AM
  • Joined: 26 Oct 2012

In human languages the notion of closeness is much more complex.

Wouldn't you agree that Damerau's variant is superior for this?

If I get the time I'll try to read his paper and maybe write a compliant function. I'm leaning towards the current version because it's very fast and memory conservative. As you said a comprehensive lexer requires several algorithms for optimal fuzzy searching; but for the moment DamerauLevenshteinDistance remains the best AutoHotkey has.

autohotkey.com/net Site Manager

 

Contact me by email (polyethene at autohotkey.net) or message tidbit


keybored
  • Members
  • 351 posts
  • Last active: Apr 26 2013 09:08 AM
  • Joined: 18 Jun 2006
Titan,
Did you change the code?

I have this in my message box;
"AHK   ahk"   =>   ""   9
"He   ben"   =>   ""   8
"this   tihs"   =>   ""   11
"Toralf   Titan"   =>   ""   14
"google   goggle"   =>   ""   15

Edit:
This was just because the web page copies spaces in place of tabs. If anyone else tests the example that is the cause.

Thanks for the function I think this will be very useful to me.

dormeu
  • Guests
  • Last active:
  • Joined: --
oh god.. this is so cool... exactly what i wanted...

adaptive searches ... also check the "hamming code" googleit!!

one problem with this though

distance 1 : script sc*ipt or whatever other errors in single chars are easy detectable, but i need something that ca also account for caracter substraction, or adition

ex : damerau dammerau = 4 and it should be 1 ( -eek!!)
ex: dammerau damerau this returns 1 - correct

So i take it the author was accounting that first is the haystack , and second is the needle ??

What about thinking outside the box? - the search should be unshure about the needle and the haystack
both permutation should return 1 becouse between the 2 only one operation was performed on either the needle or haystack

:O

SoLong&Thx4AllTheFish
  • Members
  • 4999 posts
  • Last active:
  • Joined: 27 May 2007
would making sure the needle is always the shortest string solve it, try this modify function with the new lines, it always returns 1 for me with your example
....

        StringLen, m, s

	StringLen, n, t

	If m = 0

		Return, n

	If n = 0

		Return, m

[color=red]; insert these lines 		

	If (m > n)

		{

		xtmp:=s

		s:=t

		t:=xtmp

		xtmp:=n

		n:=m

		m:=xtmp

		xtmp=

		}

; until here		

[/color]	d0_0 = 0 ....