I've been studying up on string matching (fuzzy matching). I found a few articles discussing various approaches like
I found this video from two guys which took a process of checking to see if a name was on a terrorist watch lists which originally took 14 days to compute down to 5 minutes
What's in a Name? Fast Fuzzy String Matching - Seth Verrinder & Kyle Putnam - Midwest.io 2015
Below are my notes from watching the video (it is ~40 minutes long but very interesting)
1) throw more hardware
2) use another variable/field (zip code / country etc.)
3) n-grams
4) metric trees (example: Lowenstein distance)
5) Brute force (Jaro Winkler is pretty fast already) (5X down to 70hrs )
6) Filtering- estimate similarity first then filter (7x down to 50 hrs 18 minutes in video)
· Length of strings (name length often is not normally distributed so doesn't rule out too much) Probably still look at 70%
· 26 Character filter- search for character that isn't shared- This dropped out quite a bit but was slow (300x down to 65 minutes)
o Bitmap filter- use bitwise operations to get unmatched count- very fast! (340X down to 60 minutes 20 minutes in video)
o 64 character filter (used all bits)- checked for multiple occurrences of a given letter
7) Minimize recalculation (4,000x down to 5 minutes - 28 minutes in video)
· sort names and groups into segments
· common length and first character
· used WolframAlpha to help show formula
Learnings-
· Measure performance and focus on bottleneck
· Order of magnitude doesn't always tell you about actual performance
· Favor simplicity
Various fuzzy string matching algorithms and Excellent video
- Joe Glines
- Posts: 770
- Joined: 30 Sep 2013, 20:49
- Location: Dallas
- Contact:
Various fuzzy string matching algorithms and Excellent video
Sign-up for the HK Newsletter
AHK Tutorials:Web Scraping | | Webservice APIs | AHK and Excel | Chrome | RegEx | Functions
Training: AHK Webinars Courses on AutoHotkey
YouTube
Quick Access Popup, the powerful Windows folders, apps and documents launcher!
AHK Tutorials:Web Scraping | | Webservice APIs | AHK and Excel | Chrome | RegEx | Functions
Training: AHK Webinars Courses on AutoHotkey
YouTube
Quick Access Popup, the powerful Windows folders, apps and documents launcher!
-
- Posts: 934
- Joined: 30 Sep 2017, 03:59
- Location: Romania
- Contact:
Re: Various fuzzy string matching algorithms and Excellent video
Where can I find implementation in AHK of such algorithms?
-------------------------
KeyPress OSD v4: GitHub or forum. (presentation video)
Quick Picto Viewer: GitHub or forum.
AHK GDI+ expanded / compilation library (on GitHub)
My home page.
KeyPress OSD v4: GitHub or forum. (presentation video)
Quick Picto Viewer: GitHub or forum.
AHK GDI+ expanded / compilation library (on GitHub)
My home page.
-
- Posts: 934
- Joined: 30 Sep 2017, 03:59
- Location: Romania
- Contact:
Re: Various fuzzy string matching algorithms and Excellent video
Helgef... I did so... I came across only on the SIFT3 implementation. Can you help, please? thanks
-------------------------
KeyPress OSD v4: GitHub or forum. (presentation video)
Quick Picto Viewer: GitHub or forum.
AHK GDI+ expanded / compilation library (on GitHub)
My home page.
KeyPress OSD v4: GitHub or forum. (presentation video)
Quick Picto Viewer: GitHub or forum.
AHK GDI+ expanded / compilation library (on GitHub)
My home page.
Re: Various fuzzy string matching algorithms and Excellent video
I do not think I can help you. Good luck.
-
- Posts: 934
- Joined: 30 Sep 2017, 03:59
- Location: Romania
- Contact:
Re: Various fuzzy string matching algorithms and Excellent video
Lolguest3456 wrote::lol:
-------------------------
KeyPress OSD v4: GitHub or forum. (presentation video)
Quick Picto Viewer: GitHub or forum.
AHK GDI+ expanded / compilation library (on GitHub)
My home page.
KeyPress OSD v4: GitHub or forum. (presentation video)
Quick Picto Viewer: GitHub or forum.
AHK GDI+ expanded / compilation library (on GitHub)
My home page.
Re: Various fuzzy string matching algorithms and Excellent video
interesting topic. I am very curious what else is coming. the scripts here are not mine:
some links:
Damerau-Levenshtein Distance - Fuzzy searches https://autohotkey.com/boards/viewtopic ... zy#p182567
[Library] Sift - Fuzzy Search by RegEx and Ngram https://autohotkey.com/boards/viewtopic.php?t=7302
Code: Select all
FuzzySearch(string1, string2)
{
lenl := StrLen(string1)
lens := StrLen(string2)
if(lenl > lens)
{
shorter := string2
longer := string1
}
else if(lens > lenl)
{
shorter := string1
longer := string2
lens := lenl
lenl := StrLen(string2)
}
else
return StringDifference(string1, string2)
min := 1
Loop % lenl - lens + 1
{
distance := StringDifference(shorter, SubStr(longer, A_Index, lens))
if(distance < min)
min := distance
}
return min
}
;By Toralf:
;basic idea for SIFT3 code by Siderite Zackwehdex
;http://siderite.blogspot.com/2007/04/super-fast-and-accurate-string-distance.html
;took idea to normalize it to longest string from Brad Wood
;http://www.bradwood.com/string_compare/
;Own work:
; - when character only differ in case, LSC is a 0.8 match for this character
; - modified code for speed, might lead to different results compared to original code
; - optimized for speed (30% faster then original SIFT3 and 13.3 times faster than basic Levenshtein distance)
;http://www.autohotkey.com/forum/topic59407.html
StringDifference(string1, string2, maxOffset=1) { ;returns a float: between "0.0 = identical" and "1.0 = nothing in common"
If (string1 = string2)
Return (string1 == string2 ? 0/1 : 0.2/StrLen(string1)) ;either identical or (assumption:) "only one" char with different case
If (string1 = "" OR string2 = "")
Return (string1 = string2 ? 0/1 : 1/1)
StringSplit, n, string1
StringSplit, m, string2
ni := 1, mi := 1, lcs := 0
While((ni <= n0) AND (mi <= m0)) {
If (n%ni% == m%mi%)
EnvAdd, lcs, 1
Else If (n%ni% = m%mi%)
EnvAdd, lcs, 0.8
Else{
Loop, %maxOffset% {
oi := ni + A_Index, pi := mi + A_Index
If ((n%oi% = m%mi%) AND (oi <= n0)){
ni := oi, lcs += (n%oi% == m%mi% ? 1 : 0.8)
Break
}
If ((n%ni% = m%pi%) AND (pi <= m0)){
mi := pi, lcs += (n%ni% == m%pi% ? 1 : 0.8)
Break
}
}
}
EnvAdd, ni, 1
EnvAdd, mi, 1
}
Return ((n0 + m0)/2 - lcs) / (n0 > m0 ? n0 : m0)
}
Damerau-Levenshtein Distance - Fuzzy searches https://autohotkey.com/boards/viewtopic ... zy#p182567
[Library] Sift - Fuzzy Search by RegEx and Ngram https://autohotkey.com/boards/viewtopic.php?t=7302
github>g_IntelliSense + next improvements + forum
ahk..org>onlineAHKprettyfy, ahk..com>Refactoring
ahk..com>newposts Unanswrd myposts, Donations are appreciated if I could help you
WARNING: copy your posts/messages before hitting Submit as you may lose them due to CAPTCHA
ahk..org>onlineAHKprettyfy, ahk..com>Refactoring
ahk..com>newposts Unanswrd myposts, Donations are appreciated if I could help you
WARNING: copy your posts/messages before hitting Submit as you may lose them due to CAPTCHA
Return to “Off-topic Discussion”
Who is online
Users browsing this forum: No registered users and 84 guests