Jump to content

Sky Slate Blueberry Blackcurrant Watermelon Strawberry Orange Banana Apple Emerald Chocolate
Photo

SIFT3 : Super Fast and Accurate string distance algorithm


  • Please log in to reply
11 replies to this topic
toralf
  • Moderators
  • 4035 posts
  • Last active: Aug 20 2014 04:23 PM
  • Joined: 31 Jan 2005
The original algorithm can be found here:
<!-- m -->http://siderite.blog... ... tance.html<!-- m -->

It is about 10 times faster then the Levenshtein distance.

It translates to this
MsgBox % Distance( "A H K", "A H Kn" )
MsgBox % Distance( "A H K", "A H K" )
MsgBox % Distance( "A H K", "A h K" )
MsgBox % Distance( "AHK", "" )
MsgBox % Distance( "He", "Ben" )
MsgBox % Distance( "Toralf", "ToRalf" )
MsgBox % Distance( "Toralf", "esromneb" )
MsgBox % Distance( "Toralf", "RalfLaDuce" )

Distance(string1, string2, maxOffset=5) {
  StringSplit, n, string1
  StringSplit, m, string2
  IfEqual n0,0, Return m0
  IfEqual m0,0, Return n0
  c = 1
  offset1 = 0
  offset2 = 0
  lcs = 0
  While((c + offset1 <= n0) && (c + offset2 <= m0)) {
    ni := c + offset1
    mi := c + offset2
    If (n%ni% == m%mi%)
      lcs++
    else{
      offset1 = 0
      offset2 = 0
      Loop, % maxOffset  {
        oi := c + A_Index
        if ((oi < n0) && (n%oi% == m%c%)) {
            offset1 := A_Index
            break
        }
        if ((oi < m0) && (n%c% == m%oi%)) {
            offset2 := A_Index
            break
        }
      }
    }
    c++
  }
  Return (n0 + m0)/2 - lcs
}
I haven't checked it completly yet, might have bugs.

EDIT: Found one bug: fixed, last char wasn't considered

Is there a way to make it faster?
Ciao
toralf
 
I use the latest AHK version (1.1.15+)
Please ask questions in forum on ahkscript.org. Why?
For online reference please use these Docs.

toralf
  • Moderators
  • 4035 posts
  • Last active: Aug 20 2014 04:23 PM
  • Joined: 31 Jan 2005
I changed the code a little bit:

It now returns the difference between the strings as a float between 0 and 1. 0 means strings are identical. 1 means they have nothing in common.
MsgBox % Difference( "A H K", "A H Kn" )       ;0.083333
MsgBox % Difference( "A H K", "A H K" )        ;0.000000
MsgBox % Difference( "A H K", "A h K" )        ;0.040000
MsgBox % Difference( "AHK", "" )               ;1.000000
MsgBox % Difference( "He", "Ben" )             ;0.500000
MsgBox % Difference( "Toralf", "ToRalf" )      ;0.033333
MsgBox % Difference( "Toralf", "esromneb" )    ;0.750000
MsgBox % Difference( "Toralf", "RalfLaDuce" )  ;0.420000

;basic SIFT3 code by Siderite Zackwehdex
;http://siderite.blogspot.com/2007/04/super-fast-and-accurate-string-distance.html
;took idea to normalize it to longest string from Brad Wood
;http://www.bradwood.com/string_compare/
;Own idea: when character only differ in case, LSC is a 0.8 match for this character
Difference(string1, string2, maxOffset=5) {    ;returns a float: from 0 = identical to 1 = nothing in common
  StringSplit, n, string1
  StringSplit, m, string2
  IfEqual n0,0, Return 1/1
  IfEqual m0,0, Return 1/1
  c = 1
  offset1 = 0
  offset2 = 0
  lcs = 0
  While((c + offset1 <= n0) && (c + offset2 <= m0)) {
    ni := c + offset1
    mi := c + offset2
    If (n%ni% == m%mi%)
      lcs++
    Else If (n%ni% = m%mi%)
      lcs += 0.8
    Else{
      offset1 = 0
      offset2 = 0
      Loop, %maxOffset%  {
        oi := c + A_Index
        if ((oi < n0) && (n%oi% = m%c%)) {
            offset1 := A_Index
            lcs += (n%oi% == m%c% ? 1 : 0.8)
            break
        }
        if ((oi < m0) && (n%c% = m%oi%)) {
            offset2 := A_Index
            lcs += (n%c% == m%oi% ? 1 : 0.8)
            break
        }
      }
    }
    c++
  }
  Return ((n0 + m0)/2 - lcs) / (n0 > m0 ? n0 : m0)
}

But e.g. how could I find "é" compared with "e" to be a nearly match? Do these char have a fixed offset from their base character?

Edit: 2010-06-21: Improved code to catch different case
Ciao
toralf
 
I use the latest AHK version (1.1.15+)
Please ask questions in forum on ahkscript.org. Why?
For online reference please use these Docs.

fred
  • Members
  • 263 posts
  • Last active: Oct 24 2013 02:02 PM
  • Joined: 01 Feb 2010
Nice, string matching the human way (more or less).

But e.g. how could I find "é" compared with "e" to be a nearly match?
Do these char have a fixed offset from their base character?

I don't think so, for e dec. 101 are there are 4 variations dec. 232-235 but for a dec. 96 you have 6 variations dec. 224-229.
Perhaps replacing the special chars to normal chars before calculating?

toralf
  • Moderators
  • 4035 posts
  • Last active: Aug 20 2014 04:23 PM
  • Joined: 31 Jan 2005
I changed and optimized the code quite a bit.
It is now 30% faster then original SIFT3 and 13.3 times faster than basic Levenshtein distance.
Even though that it might lead to different results then the SIFT3 code, it is accurate enough I hope to find similar strings.

;basic idea for SIFT3 code by Siderite Zackwehdex
;http://siderite.blogspot.com/2007/04/super-fast-and-accurate-string-distance.html
;took idea to normalize it to longest string from Brad Wood
;http://www.bradwood.com/string_compare/
;Own work: 
; - when character only differ in case, LSC is a 0.8 match for this character
; - modified code for speed, might lead to different results compared to original code
; - optimized for speed (30% faster then original SIFT3 and 13.3 times faster than basic Levenshtein distance) 
;http://www.autohotkey.com/forum/topic59407.html
Difference(string1, string2, maxOffset=5) {    ;returns a float: between "0.0 = identical" and "1.0 = nothing in common"
  If (string1 = string2)
    Return (string1 == string2 ? 0/1 : 0.2/StrLen(string1))    ;either identical or (assumption:) "only one" char with different case
  If (string1 = "" OR string2 = "")
    Return (string1 = string2 ? 0/1 : 1/1)
  StringSplit, n, string1
  StringSplit, m, string2
  ni := 1, mi := 1, lcs := 0
  While((ni <= n0) AND (mi <= m0)) {
    If (n%ni% == m%mi%)
      EnvAdd, lcs, 1
    Else If (n%ni% = m%mi%)
      EnvAdd, lcs, 0.8
    Else{
      Loop, %maxOffset%  {
        oi := ni + A_Index, pi := mi + A_Index
        If ((n%oi% = m%mi%) AND (oi <= n0)){
            ni := oi, lcs += (n%oi% == m%mi% ? 1 : 0.8)
            Break
        }
        If ((n%ni% = m%pi%) AND (pi <= m0)){
            mi := pi, lcs += (n%ni% == m%pi% ? 1 : 0.8)
            Break
        }
      }
    }
    EnvAdd, ni, 1
    EnvAdd, mi, 1
  }
  Return ((n0 + m0)/2 - lcs) / (n0 > m0 ? n0 : m0)
}

Ciao
toralf
 
I use the latest AHK version (1.1.15+)
Please ask questions in forum on ahkscript.org. Why?
For online reference please use these Docs.

SFdude
  • Members
  • 4 posts
  • Last active: Apr 28 2011 11:52 AM
  • Joined: 02 Oct 2010
Hi Toralf!

Tried your latest version of SIFT3 in AutoHotkey,
(you posted on: Mon Jun 21, 2010, see above my post).

But when executing the AHK script,
I get this error message from AHK:

" Error at Line 27.
Line Text: While((ni <= n0) AND (mi <= m0))
Error: Functions cannot contain Functions.
Program will exit "


...and it does indeed exit immediately!.

I must be missing something super-simple in AHK....
Included your AHK script, below my signature.

Help! Hilfe! (is there a complete working example?)
SFdude

MsgBox % Difference( "A H K", "A H K" )        ;0.000000
MsgBox % Difference( "A H K", "A h K" )        ;0.040000
MsgBox % Difference( "AHK", "" )               ;1.000000
MsgBox % Difference( "He", "Ben" )             ;0.500000
MsgBox % Difference( "Toralf", "ToRalf" )      ;0.033333
MsgBox % Difference( "Toralf", "esromneb" )    ;0.750000
MsgBox % Difference( "Toralf", "RalfLaDuce" )  ;0.420000


;basic idea for SIFT3 code by Siderite Zackwehdex
;http://siderite.blogspot.com/2007/04/super-fast-and-accurate-string-distance.html
;took idea to normalize it to longest string from Brad Wood
;http://www.bradwood.com/string_compare/
;Own work:
; - when character only differ in case, LSC is a 0.8 match for this character
; - modified code for speed, might lead to different results compared to original code
; - optimized for speed (30% faster then original SIFT3 and 13.3 times faster than basic Levenshtein distance)
;http://www.autohotkey.com/forum/topic59407.html
Difference(string1, string2, maxOffset=5) {    ;returns a float: between "0.0 = identical" and "1.0 = nothing in common"
  If (string1 = string2)
    Return (string1 == string2 ? 0/1 : 0.2/StrLen(string1))    ;either identical or (assumption:) "only one" char with different case
  If (string1 = "" OR string2 = "")
    Return (string1 = string2 ? 0/1 : 1/1)
  StringSplit, n, string1
  StringSplit, m, string2
  ni := 1, mi := 1, lcs := 0
  While((ni <= n0) AND (mi <= m0)) {
    If (n%ni% == m%mi%)
      EnvAdd, lcs, 1
    Else If (n%ni% = m%mi%)
      EnvAdd, lcs, 0.8
    Else{
      Loop, %maxOffset%  {
        oi := ni + A_Index, pi := mi + A_Index
        If ((n%oi% = m%mi%) AND (oi <= n0)){
            ni := oi, lcs += (n%oi% == m%mi% ? 1 : 0.8)
            Break
        }
        If ((n%ni% = m%pi%) AND (pi <= m0)){
            mi := pi, lcs += (n%ni% == m%pi% ? 1 : 0.8)
            Break
        }
      }
    }
    EnvAdd, ni, 1
    EnvAdd, mi, 1
  }
  Return ((n0 + m0)/2 - lcs) / (n0 > m0 ? n0 : m0)
}


Tuncay
  • Members
  • 1945 posts
  • Last active: Feb 08 2015 03:49 PM
  • Joined: 07 Nov 2006
Ok simple questions first... Do you have the newest AutoHotkey version?

What does
#NoEnv
SendMode Input
SetWorkingDir %A_ScriptDir%
MsgBox % A_AhkVersion
report?

No signature.


SFdude
  • Members
  • 4 posts
  • Last active: Apr 28 2011 11:52 AM
  • Joined: 02 Oct 2010
Thanks Tuncay for that quick reply.

You are right!!
I d/l the latest version of AHK...now the script works perfectly.

btw:
Do you know of any JS implementation
of SIFT3?

thanks!
SFdude

Tuncay
  • Members
  • 1945 posts
  • Last active: Feb 08 2015 03:49 PM
  • Joined: 07 Nov 2006

Thanks Tuncay for that quick reply.

No problem. In this community you get in general fast answers (if its not too specific or late).

Do you know of any JS implementation of SIFT3?

No sorry, I did not search for it. Never needed.

No signature.


SFdude
  • Members
  • 4 posts
  • Last active: Apr 28 2011 11:52 AM
  • Joined: 02 Oct 2010
Thanks again Tuncay!
best -

SFdude

fragman
  • Members
  • 1591 posts
  • Last active: Nov 12 2012 08:51 PM
  • Joined: 13 Oct 2009
Thanks for this nice algorithm, I now use it for a system that features program launcher like launchy, window switcher, uninstaller,... etc.

I had to write a wrapper around it to make it do fuzzy search:
FuzzySearch(string1, string2)
{
	lenl := StrLen(string1)
	lens := StrLen(string2)
	if(lenl > lens)
	{
		shorter := string2
		longer := string1
	}
	else if(lens > lenl)
	{
		shorter := string1
		longer := string2
		lens := lenl
		lenl := StrLen(string2)
	}
	else
		return StringDifference(string1, string2)
	min := 1
	Loop % lenl - lens + 1
	{
		distance := StringDifference(shorter, SubStr(longer, A_Index, lens))
		if(distance < min)
			min := distance
	}
	return min
}
;By Toralf:
;basic idea for SIFT3 code by Siderite Zackwehdex 
;http://siderite.blogspot.com/2007/04/super-fast-and-accurate-string-distance.html 
;took idea to normalize it to longest string from Brad Wood 
;http://www.bradwood.com/string_compare/ 
;Own work: 
; - when character only differ in case, LSC is a 0.8 match for this character 
; - modified code for speed, might lead to different results compared to original code 
; - optimized for speed (30% faster then original SIFT3 and 13.3 times faster than basic Levenshtein distance) 
;http://www.autohotkey.com/forum/topic59407.html 
StringDifference(string1, string2, maxOffset=1) {    ;returns a float: between "0.0 = identical" and "1.0 = nothing in common" 
  If (string1 = string2) 
    Return (string1 == string2 ? 0/1 : 0.2/StrLen(string1))    ;either identical or (assumption:) "only one" char with different case 
  If (string1 = "" OR string2 = "") 
    Return (string1 = string2 ? 0/1 : 1/1) 
  StringSplit, n, string1 
  StringSplit, m, string2 
  ni := 1, mi := 1, lcs := 0 
  While((ni <= n0) AND (mi <= m0)) { 
    If (n%ni% == m%mi%) 
      EnvAdd, lcs, 1 
    Else If (n%ni% = m%mi%) 
      EnvAdd, lcs, 0.8 
    Else{ 
      Loop, %maxOffset%  { 
        oi := ni + A_Index, pi := mi + A_Index 
        If ((n%oi% = m%mi%) AND (oi <= n0)){ 
            ni := oi, lcs += (n%oi% == m%mi% ? 1 : 0.8) 
            Break 
        } 
        If ((n%ni% = m%pi%) AND (pi <= m0)){ 
            mi := pi, lcs += (n%ni% == m%pi% ? 1 : 0.8) 
            Break 
        } 
      } 
    } 
    EnvAdd, ni, 1 
    EnvAdd, mi, 1 
  } 
  Return ((n0 + m0)/2 - lcs) / (n0 > m0 ? n0 : m0) 
}

As you can see, I renamed your function because I think it's more fitting and unique.

  • Guests
  • Last active:
  • Joined: --
after reading the speed comment in the node thread I was wondering if this could be done with mCode?

fragman
  • Members
  • 1591 posts
  • Last active: Nov 12 2012 08:51 PM
  • Joined: 13 Oct 2009
Good idea! It seems likely, but I haven't looked into generating MCode yet. I'd need Unicode x86/x64 versions of it.