[Library] Sift - Fuzzy Search by RegEx and Ngram

Post your working scripts, libraries and tools
FanaticGuru
Posts: 778
Joined: 30 Sep 2013, 22:25

[Library] Sift - Fuzzy Search by RegEx and Ngram

30 Apr 2015, 19:03

[Library] Sift

Sift library is really just two functions designed for sifting or searching through data for items that match a certain criteria.

The two functions are:
Sift_Regex
Sift_Ngram

Code: [Select all] [Expand] [Download] (Script.ahk)GeSHi © Codebox Plus


The Sift_Regex is not too hard to understand. It is just a wrapper with predefined Needles for doing RegExMatch searches. Hopefully the example at the end with demonstrate how its options work and affect the results.

Sift_Ngram is a little more complicated because it is a more fuzzy type search that breaks Haystacks and Needles up into little chunks and then compares how many little chunks they have in common.

Here is a simple Sift_Ngram example code:

Code: [Select all] [Expand] [Download] (Script.ahk)GeSHi © Codebox Plus



Here is a Gui interface that allows you to see more how each type Sift works and the effects of options.

This is not about the Gui it is just to help see how the functions work. Most of the example code is just to get the Gui to collect the user input.

All the meat of the example occurs in the Query subroutine in just a few calls of the functions.

Code: [Select all] [Expand] [Download] (Script.ahk)GeSHi © Codebox Plus



FG
Last edited by FanaticGuru on 11 May 2015, 13:41, edited 1 time in total.
Hotkey Help - Help Dialog for Currently Running AHK Scripts

AHK Startup - Consolidate Multiply AHK Scripts with one Tray Icon

Google Search, Dictionary, Thesaurus - Quickly Get Information from Specific Web Resources

[Function] Timer - Create and Manage Timers
toralf
Posts: 432
Joined: 27 Apr 2014, 21:08
Location: Germany

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

01 May 2015, 14:56

There is actually a SIFT3 code on the other forum to allow to detect similarities between strings. Similarly to the levanstein distance.
ciao
toralf
FanaticGuru
Posts: 778
Joined: 30 Sep 2013, 22:25

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

01 May 2015, 16:14

toralf wrote:There is actually a SIFT3 code on the other forum to allow to detect similarities between strings. Similarly to the levanstein distance.

I looked at Sift3 and studied it pretty carefully but Levenshtein Distance is more about finding how close two words are to each other and misspelling detection.

Levenshtein Distance did not seem suitable for searching a text file and looking for a line about how to make applesauce and get a result of a line that contained ...to make many different kinds of fruit sauces including apple, peach, and apricot... .
A simple search for applesauce will fail, also a line like Applesauce taste terrible will rate lower because although it has the exact word applesauce there is nothing about how to make.

Or another example searching for the line:
Star Wars Returns and finding Star Wars: Episode VI - The Return of the Jedi is a classic blockbuster space movie that is fun for the whole family.
Even a RegEx search for a line that contains all three words Star Wars Returns in any order will fail because the result does not contain Returns. It only contains Return.

From what I have read Ngram searches are a pretty good way of doing these type searches. Depending on what you are searching and how, you need to tinker with the size Ngram and detection threshold to get satisfactory results.

I just created the function and am in the process of tinkering with it to search contact information made up of stuff like a name, company, title, address, phone number, email address, etc. to try to quickly locate the proper contact when the search information entered might be limited or partially wrong.

The Regex function is just a bonus for those that do not have the knowledge or will to create the necessary Needle from scratch. Searching for a line that contains multiple words in any order is a very common RegEx need but is still a fairly complex Needle to design.

I thought about including a Levenshtein sift also but did not know exactly how I would go about doing an Levenshtein of a small string against a large string. I thought about dividing the large string into as many small length string as possible and then comparing the small string to each of them. This seemed very slow though.

A line 100 characters long with a 10 character search string would require like 90 Levenshtein calculations (each of which is fairly intensive) and this is just one line of what might be a 10,000 line text file.

I am going to play around with Levenshtein some more.

FG
Last edited by FanaticGuru on 01 May 2015, 16:44, edited 1 time in total.
Hotkey Help - Help Dialog for Currently Running AHK Scripts

AHK Startup - Consolidate Multiply AHK Scripts with one Tray Icon

Google Search, Dictionary, Thesaurus - Quickly Get Information from Specific Web Resources

[Function] Timer - Create and Manage Timers
toralf
Posts: 432
Joined: 27 Apr 2014, 21:08
Location: Germany

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

01 May 2015, 16:33

I agree with your explanation.
My remark was not meant to say that SIFT3 could replace ngram or the other features but maybe could complement the functions. Depending on what is being searched for all approaches have their pros and cons
ciao
toralf
rommmcek
Posts: 268
Joined: 15 Aug 2014, 15:18

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

03 May 2015, 09:52

Very sophisticated code, but a bit slow on large data! Probably because of it's versatility & complexity! However why don't you include:

Code: [Select all] [Download] (Script.ahk)GeSHi © Codebox Plus

SetBachLines, -1
which would render the script more then twice as fast!

bye
FanaticGuru
Posts: 778
Joined: 30 Sep 2013, 22:25

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

04 May 2015, 16:36

rommmcek wrote:Very sophisticated code, but a bit slow on large data! Probably because of it's versatility & complexity! However why don't you include:

Code: [Select all] [Download] (Script.ahk)GeSHi © Codebox Plus

SetBachLines, -1
which would render the script more then twice as fast!

bye

SetBatchLines, -1 will double the speed of most scripts because it tells the computer to have the script use all available CPU processing time as opposed to the default of only use up to half of the available CPU processing time.

While an important option to keep in mind, this is really a command that is best set by the user depending on the environment they are going to be running the script in because setting it to -1 could have very negative issues with other programs the user has running at the same time.

Sift_Ngram is relatively slow though just due to the amount of comparisons that must be done in an interpretative language type way.

A typical search will take about 1.5 seconds per 1 meg searched the first time. Most of that time is creating the Ngram matrix of the search file. If the Ngram matrix of an unchanged search file is passed back to the function then each subsequent search will take about .4 seconds.

I would not recommend doing a search as you type like I did in the Gui example for large files. 4 seconds to search a 10 meg file when you push enter might be acceptable but not 4 seconds each time you type a character.

The Sift_Regex is much faster. While a RegEx comparison is a pretty complicated process it is using a command in AHK for the bulk of the work which is operating at the speed of C code. A Sift_Regex comparison depending on the options happens at about .01 to .05 seconds per meg.

These speeds are based on my computer with is fairly typically. And these times can be cut in half if you give the script priority to use more than 50% CPU processing time through SetBatchLines.

FG
Hotkey Help - Help Dialog for Currently Running AHK Scripts

AHK Startup - Consolidate Multiply AHK Scripts with one Tray Icon

Google Search, Dictionary, Thesaurus - Quickly Get Information from Specific Web Resources

[Function] Timer - Create and Manage Timers
guest3456
Posts: 2057
Joined: 09 Oct 2013, 10:31

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

04 May 2015, 17:09

FanaticGuru wrote: SetBatchLines, -1 will double the speed of most scripts because it tells the computer to have the script use all available CPU processing time as opposed to the default of only use up to half of the available CPU processing time.

While an important option to keep in mind, this is really a command that is best set by the user depending on the environment they are going to be running the script in because setting it to -1 could have very negative issues with other programs the user has running at the same time.


I used to worry about this too, but I've used that line in a very CPU intensive script (multiple gdip image searches on each of 20+ different windows per second) without ill effects. And its very important to still have CPU available for the other programs in this setup

FanaticGuru
Posts: 778
Joined: 30 Sep 2013, 22:25

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

04 May 2015, 19:31

guest3456 wrote:
FanaticGuru wrote: SetBatchLines, -1 will double the speed of most scripts because it tells the computer to have the script use all available CPU processing time as opposed to the default of only use up to half of the available CPU processing time.

While an important option to keep in mind, this is really a command that is best set by the user depending on the environment they are going to be running the script in because setting it to -1 could have very negative issues with other programs the user has running at the same time.


I used to worry about this too, but I've used that line in a very CPU intensive script (multiple gdip image searches on each of 20+ different windows per second) without ill effects. And its very important to still have CPU available for the other programs in this setup

Yea, most of the time it is not an issue to use SetBatchLines, -1. You typically have multiple CPU's that are not running at 100% capacity. And then even if they are, time critical processes will generally set their priority higher than AHK scripts' default. Windows is pretty smart about its CPU utilization. Generally it will keep all the processes running fairly smoothly no matter who is yelling they are the most important. I can definitely see why it might be considered a waste to set aside 50% idle CPU just in case a process that is lower priority than normal might need it.

But there is a reason AHK scripts default to a normal priority and a 50% idle CPU usage. An AHK script is generally a secondary process. If the CPUs are doing all they can do and something needs to take a backseat to more important things going on then AHK is normally the one you want to take a backseat. If the AHK script really is more important then commands like SetBatchlines, Process, and Critical provide a way to let Windows know that.

FG
Hotkey Help - Help Dialog for Currently Running AHK Scripts

AHK Startup - Consolidate Multiply AHK Scripts with one Tray Icon

Google Search, Dictionary, Thesaurus - Quickly Get Information from Specific Web Resources

[Function] Timer - Create and Manage Timers
rommmcek
Posts: 268
Joined: 15 Aug 2014, 15:18

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

08 May 2015, 16:52

Thanks for thorough explanation! Obviously I don't have as profound knowledge as you do, however I intuitively use speed up features only for the time I really need them then I switch back to normal!
I merely compared simple search of Sift_RegEx with IfInString command!
Anyhow, I'm already planning to use your functions!
Do you know if Uberi in his Autocomplte dealing with huge data uses the same method as you are?

Thanks!
FanaticGuru
Posts: 778
Joined: 30 Sep 2013, 22:25

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

11 May 2015, 13:17

rommmcek wrote:Do you know if Uberi in his Autocomplte dealing with huge data uses the same method as you are?

I only glanced at Uberi's code for AutoComplete but at its core it also uses RegExMatch the same as Sift_Regex but it is a little different because in that code the data being searched is alphabetized which allows for the possibility of some optimization.

AutoComplete is basically using similar code as Sift_Regex with the OC option of Ordered Characters. It is also different and faster to look for one word that matches one word and then return only that matching word than to look for a phrase that is in a line and then return not only the matching phrase but the entire line it was found within.

FG
Hotkey Help - Help Dialog for Currently Running AHK Scripts

AHK Startup - Consolidate Multiply AHK Scripts with one Tray Icon

Google Search, Dictionary, Thesaurus - Quickly Get Information from Specific Web Resources

[Function] Timer - Create and Manage Timers
rommmcek
Posts: 268
Joined: 15 Aug 2014, 15:18

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

13 May 2015, 11:44

Hi,

It's very kind of you to give me very detailed answer! I will certainly try it out and hopefully deepen my knowledge!
I'm using this kind of searches for a long time! At first with batch files, then 4Dos which become TakeCommand and even with Pascal and later Delphi. All were truly fast, but I deliberately gave up some speed for versatility and adaptability of AutoHotkey, besides nowadays we have much faster computers and your functions are pretty much all, one can wish, sifting through data!

Thanks again!

Return to “Scripts and Functions”

Who is online

Users browsing this forum: RiseUp and 20 guests