[Library] Sift - Fuzzy Search by RegEx and Ngram

Post your working scripts, libraries and tools
FanaticGuru
Posts: 1208
Joined: 30 Sep 2013, 22:25

[Library] Sift - Fuzzy Search by RegEx and Ngram

30 Apr 2015, 19:03

[Library] Sift

Sift library is really just two functions designed for sifting or searching through data for items that match a certain criteria.

The two functions are:
Sift_Regex
Sift_Ngram

Code: Select all

;{ Sift
; Fanatic Guru
; 2015 04 30
; Version 1.00
;
; LIBRARY to sift through a string or array and return items that match sift criteria.
;
; ===================================================================================================================================================
;
; Functions:
; 
; ===================================================================================================================================================
; Sift_Regex(Haystack, Needle, Options, Delimiter)
;
;   Parameters:
;   1) {Haystack}	String or array of information to search, ByRef for efficiency but Haystack is not changed by function
;
;   2) {Needle}		String providing search text or criteria, ByRef for efficiency but Needle is not changed by function
;
;	3) {Options}
;			IN		Needle anywhere IN Haystack item (Default = IN)
;			LEFT	Needle is to LEFT or beginning of Haystack item
;			RIGHT	Needle is to RIGHT or end of Haystack item
;			EXACT	Needle is an EXACT match to Haystack item
;			REGEX	Needle is an REGEX expression to check against Haystack item
;			OC		Needle is ORDERED CHARACTERS to be searched for even non-consecutively but in the given order in Haystack item 
;			OW		Needle is ORDERED WORDS to be searched for even non-consecutively but in the given order in Haystack item
;			UC		Needle is UNORDERED CHARACTERS to be search for even non-consecutively and in any order in Haystack item
;			UW		Needle is UNORDERED WORDS to be search for even non-consecutively and in any order in Haystack item
;
;			If an Option is all lower case then the search will be case insensitive
;
;	4)  {Delimiter}	Single character Delimiter of each item in a Haystack string (Default = `n)
;
;	Returns: 
;		If Haystack is string then a string is returned of found Haystack items delimited by the Delimiter
; 		If Haystack is an array then an array is returned of found Haystack items
;
; 	Note:
;		Sift_Regex searchs are all RegExMatch seaches with Needles crafted based on the options chosen
;
; ===================================================================================================================================================
; Sift_Ngram(Haystack, Needle, Delta, Haystack_Matrix, Ngram Size, Format)
;
;	Parameters:
;	1) {Haystack}		String or array of information to search, ByRef for efficiency but Haystack is not changed by function
;
;   2) {Needle}			String providing search text or criteria, ByRef for efficiency but Needle is not changed by function
;
;	3) {Delta}			(Default = .7) Fuzzy match coefficient, 1 is a prefect match, 0 is no match at all, only results above the Delta are returned
;
;	4) {Haystack_Matrix} (Default = false)	
;			An object containing the preprocessing of the Haystack for Ngrams content
;			If a non-object is passed the Haystack is processed for Ngram content and the results are returned by ByRef
;			If an object is passed then that is used as the processed Ngram content of Haystack
;			If multiply calls to the function are made with no change to the Haystack then a previous processing of Haystack for Ngram content 
;				can be passed back to the function to avoid reprocessing the same Haystack again in order to increase efficiency.
;
;	5) {Ngram Size}		(Default = 3) The length of Ngram used.  Generally Ngrams made of 3 letters called a Trigram is good
;
;	6) {Format}			(Default = S`n)
;			S				Return Object with results Sorted
;			U				Return Object with results Unsorted
;			S%%%			Return Sorted string delimited by characters after S
;			U%%%			Return Unsorted string delimited by characters after U
;								Sorted results are by best match first
;
;	Returns:
;		A string or array depending on Format parameter.
;		If string then it is delimited based on Format parameter.
;		If array then an array of object is returned where each element is of the structure: {Object}.Delta and {Object}.Data
;			Example Code to access object returned:
;				for key, element in Sift_Ngram(Data, QueryText, NgramLimit, Data_Ngram_Matrix, NgramSize)
;						Display .= element.delta "`t" element.data "`n"
;
;	Dependencies: Sift_Ngram_Get, Sift_Ngram_Compare, Sift_Ngram_Matrix, Sift_SortResults
;		These are helper functions that are generally not called directly.  Although Sift_Ngram_Matrix could be useful to call directly to preprocess a large static Haystack
;
; 	Note:
;		The string "dog house" would produce these Trigrams: dog|og |g h| ho|hou|ous|use
;		Sift_Ngram breaks the needle and each item of the Haystack up into Ngrams.
;		Then all the Needle Ngrams are looked for in the Haystack items Ngrams resulting in a percentage of Needle Ngrams found
;
; ===================================================================================================================================================
;
Sift_Regex(ByRef Haystack, ByRef Needle, Options := "IN", Delimit := "`n")
{
	Sifted := {}
	if (Options = "IN")		
		Needle_Temp := "\Q" Needle "\E"
	else if (Options = "LEFT")
		Needle_Temp := "^\Q" Needle "\E"
	else if (Options = "RIGHT")
		Needle_Temp := "\Q" Needle "\E$"
	else if (Options = "EXACT")		
		Needle_Temp := "^\Q" Needle "\E$"
	else if (Options = "REGEX")
		Needle_Temp := Needle
	else if (Options = "OC")
		Needle_Temp := RegExReplace(Needle,"(.)","\Q$1\E.*")
	else if (Options = "OW")
		Needle_Temp := RegExReplace(Needle,"( )","\Q$1\E.*")
	else if (Options = "UW")
		Loop, Parse, Needle, " "
			Needle_Temp .= "(?=.*\Q" A_LoopField "\E)"
	else if (Options = "UC")
		Loop, Parse, Needle
			Needle_Temp .= "(?=.*\Q" A_LoopField "\E)"

	if Options is lower
		Needle_Temp := "i)" Needle_Temp
	
	if IsObject(Haystack)
	{
		for key, Hay in Haystack
			if RegExMatch(Hay, Needle_Temp)
				Sifted.Insert(Hay)
	}
	else
	{
		Loop, Parse, Haystack, %Delimit%
			if RegExMatch(A_LoopField, Needle_Temp)
				Sifted .= A_LoopField Delimit
		Sifted := SubStr(Sifted,1,-1)
	}
	return Sifted
}

Sift_Ngram(ByRef Haystack, ByRef Needle, Delta := .7, ByRef Haystack_Matrix := false, n := 3, Format := "S`n" )
{
	if !IsObject(Haystack_Matrix)
		Haystack_Matrix := Sift_Ngram_Matrix(Haystack, n)
	Needle_Ngram := Sift_Ngram_Get(Needle, n)
	if IsObject(Haystack)
	{
		Search_Results := {}
		for key, Hay_Ngram in Haystack_Matrix
		{
			Result := Sift_Ngram_Compare(Hay_Ngram, Needle_Ngram)
			if !(Result < Delta)
				Search_Results[key,"Delta"] := Result, Search_Results[key,"Data"] := Haystack[key]
		}
	}
	else
	{
		Search_Results := {}
		Loop, Parse, Haystack, `n, `r
		{
			Result := Sift_Ngram_Compare(Haystack_Matrix[A_Index], Needle_Ngram)
			if !(Result < Delta)
				Search_Results[A_Index,"Delta"] := Result, Search_Results[A_Index,"Data"] := A_LoopField
		}
	}
	if (Format ~= "i)^S")
		Sift_SortResults(Search_Results)
	if RegExMatch(Format, "i)^(S|U)(.+)$", Match)
	{
		for key, element in Search_Results
			String_Results .= element.data Match2
		return SubStr(String_Results,1,-StrLen(Match2))
	}
	else
		return Search_Results
}

Sift_Ngram_Get(ByRef String, n := 3)
{
	Pos := 1, Grams := {}
	Loop, % (1 + StrLen(String) - n)
		gram := SubStr(String, A_Index, n), Grams[gram] ? Grams[gram] ++ : Grams[gram] := 1
	return Grams
} 

Sift_Ngram_Compare(ByRef Hay, ByRef Needle)
{
	for gram, Needle_Count in Needle
	{
		Needle_Total += Needle_Count
		Match += (Hay[gram] > Needle_Count ? Needle_Count : Hay[gram])
	}
	return Match / Needle_Total
}

Sift_Ngram_Matrix(ByRef Data, n := 3)
{
	if IsObject(Data)
	{
		Matrix := {}
		for key, string in Data
			Matrix.Insert(Sift_Ngram_Get(string, n))
	}
	else
	{
		Matrix := {}
		Loop, Parse, Data, `n
			Matrix.Insert(Sift_Ngram_Get(A_LoopField, n))
	}
	return Matrix
}

Sift_SortResults(ByRef Data)
{
	Data_Temp := {}
	for key, element in Data
		Data_Temp[element.Delta SubStr("0000000000" key, -9)] := element
	Data := {}
	for key, element in Data_Temp
		Data.InsertAt(1,element)
	return
}
The Sift_Regex is not too hard to understand. It is just a wrapper with predefined Needles for doing RegExMatch searches. Hopefully the example at the end with demonstrate how its options work and affect the results.

Sift_Ngram is a little more complicated because it is a more fuzzy type search that breaks Haystacks and Needles up into little chunks and then compares how many little chunks they have in common.

Here is a simple Sift_Ngram example code:

Code: Select all

#Include Sift.ahk

Data =
(
Big Apple trees are great.
Pear trees are not great.
Tree beetles kill trees.
A song Bird is in the tree.
An Apple is in the tree.
)

MsgBox % Sift_Ngram(Data, "The Apple tree")

MsgBox % Sift_Ngram(Data, "The Aple tree")

MsgBox % Sift_Ngram(Data, "The Apple tree",.3,,,"S`n=========`n")

Esc::ExitApp
Here is a Gui interface that allows you to see more how each type Sift works and the effects of options.

This is not about the Gui it is just to help see how the functions work. Most of the example code is just to get the Gui to collect the user input.

All the meat of the example occurs in the Query subroutine in just a few calls of the functions.

Code: Select all

#Include Sift.ahk

Data =
(
Where can I find the official build, or older releases?
Why do some lines in my script never execute?
Why does my script run fine in XP but not in Vista or Windows 7 or 8?
I can't edit my script via tray icon because it won't start due to an error. Can I find my script somewhere else?
How can I find and fix errors in my code?
Can I run AHK from a USB drive?
Why is the Run command unable to launch my game or program?
How can the output of a command line operation be retrieved?
How can a script close, pause, or suspend other script(s)?
How can a repeating action be stopped without exiting the script?
How can performance be improved for games or at other times when the CPU is under heavy load?
How can context sensitive help for AutoHotkey commands be used in any editor?
How to detect when a web page is finished loading?
How can dates and times be compared or manipulated?
How can I send the current Date and/or Time?
How can I send text to a window which isn't active or isn't visible?
Why don't Hotstrings, Send, and MouseClick work in certain games?
How can Winamp be controlled even when it isn't active?
How can MsgBox's button names be changed?
How can I change the default editor, which is accessible via context menu or tray icon?
How can I save the contents of my GUI associated variables?
Can I draw something with AHK?
How can I start an action when a window appears, closes or becomes [in]active?
My antivirus program flagged AHK as malware. Does it really contain a virus?
How do I put my hotkeys and hotstrings into effect automatically every time I start my PC?
I'm having trouble getting my mouse buttons working as hotkeys. Any advice?
How can tab and space be defined as hotkeys?
How can keys or mouse buttons be remapped so that they become different keys?
How do I detect the double press of a key or button?
How can a hotkey or hotstring be made exclusive to certain program(s)? In other words, I want a certain key to act as it normally does except when a specific window is active.
How can a prefix key be made to perform its native function rather than doing nothing?
How can the built-in Windows shortcut keys, such as Win+U (Utility Manager) and Win+R (Run), be changed or disabled?
Can I use wildcards or regular expressions in Hotstrings?
How can I use a hotkey that is not in my keyboard layout?
My keypad has a special 000 key. Is it possible to turn it into a hotkey?
)

Display := Data
Options := "in"
NgramSize := 3
NgramLimit := .50
DisplayLimit := 0

Gui, Font, s10 bold
Gui, Add, Text, x78 y11 w120 h20, Query?
Gui, Font, s10 bold underline
Gui, Add, Text, x8 y50 w120 h20, Search Options
Gui, Font, norm
Gui, Add, Edit, x130 y10 w600 h20 vQueryText gQuery
Gui, Add, Radio, x5 yp+60 w120 h15 +Center vRadio gRadio Checked, IN
Gui, Add, Radio, x5 w120 h20 +Center gRadio, LEFT
Gui, Add, Radio, x5 w120 h20 +Center gRadio, RIGHT
Gui, Add, Radio, x5 w120 h20 +Center gRadio, EXACT
Gui, Add, Radio, x5 w120 h20 +Center gRadio, REGEX
Gui, Add, Radio, x5 w120 h40 +Center gRadio, ORDERED`nCHARACTERS
Gui, Add, Radio, x5 w120 h40 +Center gRadio, UNORDERED`nCHARACTERS
Gui, Add, Radio, x5 w120 h40 +Center gRadio, ORDERED`nWORDS
Gui, Add, Radio, x5 w120 h40 +Center gRadio, UNORDERED`nWORDS
Gui, Add, Radio, x5 w120 h40 +Center gRadio, Ngram
Gui, Add, ComboBox, yp40 w40 vNgramSize gNgramSize Choose2, 2|3|4|5
Gui, Add, Text, x48 yp+3, Size
Gui, Add, ComboBox, x5 w40 vNgramLimit gNgramLimit Choose3, 1.00|.70|.50|.30|.10|0
Gui, Add, Text, x48 yp+3, Result Limit
Gui, Add, ComboBox, x5 w40 vDisplayLimit gDisplayLimit Choose1, 0|1|2|3|4|5
Gui, Add, Text, x48 yp+3, Result #
Gui, Add, Checkbox, x5 w120 h40 +Center vNgramResult gNgramResult, SHOW NGRAM RESULT
Gui, Add, Text, x5 w120 h2 0x7 ; Line
Gui, Add, Checkbox, x5 w120 h40 +Center vCase gCheckboxCase, CASE SENSITIVE
Gui, Add, Text, x5 yp-190 w120 h2 0x7 ; Line
Gui, Add, Edit, w600 h570 x130 y40 vGui_Display ReadOnly +0x300000 -wrap, %Display%
Gui, Show, w740 h620

Esc::ExitApp

Query:
	Gui, Submit, NoHide
	if (Options = "NGRAM")
	{
		if (StrLen(QueryText)<NgramSize)
			Display := Sift_Regex(Data,QueryText, "in")
		else
		{
			Display := ""
			if NgramResult
			{
				for key, element in Sift_Ngram(Data, QueryText, NgramLimit, Data_Ngram_Matrix, NgramSize, "S")
					Display .= element.delta "`t" element.data "`n"
				Display := SubStr(Display,1,-1)
			}
			else
				Display := Sift_Ngram(Data, QueryText, NgramLimit, Data_Ngram_Matrix, NgramSize)
			If DisplayLimit 
				Display := SubStr(Display, 1, InStr(Display,"`n",,, DisplayLimit))
		}
	}
	else
		Display := Sift_Regex(Data, QueryText, Options)
	GuiControl,, Gui_Display, %Display%
return

CheckboxCase:
	Gui, Submit, NoHide
	if (Options = "NGRAM")
	{
		Case := 0
		GuiControl,, Case, 0
	}
	if Case
		StringUpper, Options, Options
	else
		StringLower, Options, Options
	gosub Query
return

Radio:
	Gui, Submit, NoHide
	if (Radio = 1)
		Options := "IN"
	else if (Radio = 2)
		Options := "LEFT"
	else if (Radio = 3)
		Options := "RIGHT"
	else if (Radio = 4)
		Options := "EXACT"
	else if (Radio = 5)
		Options := "REGEX"
	else if (Radio = 6)
		Options := "OC"
	else if (Radio = 7)
		Options := "UC"
	else if (Radio = 8)
		Options := "OW"
	else if (Radio = 9)
		Options := "UW"
	else if (Radio = 10)
	{
		if Case
		{
			Case := 0
			GuiControl,, Case, 0
		}
		Options := "NGRAM"
	}
	gosub CheckboxCase
return

NgramSize:
	Gui, Submit, NoHide
	Data_Ngram_Matrix := ""
	gosub Query
return

NgramLimit:
	Gui, Submit, NoHide
	gosub Query
return

DisplayLimit:
	Gui, Submit, NoHide
	gosub Query
return

NgramResult:
	Gui, Submit, NoHide
	gosub Query
return
FG
Last edited by FanaticGuru on 11 May 2015, 13:41, edited 1 time in total.
Hotkey Help - Help Dialog for Currently Running AHK Scripts

AHK Startup - Consolidate Multiply AHK Scripts with one Tray Icon

[Function] Timer - Create and Manage Timers
toralf
Posts: 511
Joined: 27 Apr 2014, 21:08
Location: Germany

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

01 May 2015, 14:56

There is actually a SIFT3 code on the other forum to allow to detect similarities between strings. Similarly to the levanstein distance.
ciao
toralf
FanaticGuru
Posts: 1208
Joined: 30 Sep 2013, 22:25

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

01 May 2015, 16:14

There is actually a SIFT3 code on the other forum to allow to detect similarities between strings. Similarly to the levanstein distance.
I looked at Sift3 and studied it pretty carefully but Levenshtein Distance is more about finding how close two words are to each other and misspelling detection.

Levenshtein Distance did not seem suitable for searching a text file and looking for a line about how to make applesauce and get a result of a line that contained ...to make many different kinds of fruit sauces including apple, peach, and apricot... .
A simple search for applesauce will fail, also a line like Applesauce taste terrible will rate lower because although it has the exact word applesauce there is nothing about how to make.

Or another example searching for the line:
Star Wars Returns and finding Star Wars: Episode VI - The Return of the Jedi is a classic blockbuster space movie that is fun for the whole family.
Even a RegEx search for a line that contains all three words Star Wars Returns in any order will fail because the result does not contain Returns. It only contains Return.

From what I have read Ngram searches are a pretty good way of doing these type searches. Depending on what you are searching and how, you need to tinker with the size Ngram and detection threshold to get satisfactory results.

I just created the function and am in the process of tinkering with it to search contact information made up of stuff like a name, company, title, address, phone number, email address, etc. to try to quickly locate the proper contact when the search information entered might be limited or partially wrong.

The Regex function is just a bonus for those that do not have the knowledge or will to create the necessary Needle from scratch. Searching for a line that contains multiple words in any order is a very common RegEx need but is still a fairly complex Needle to design.

I thought about including a Levenshtein sift also but did not know exactly how I would go about doing an Levenshtein of a small string against a large string. I thought about dividing the large string into as many small length string as possible and then comparing the small string to each of them. This seemed very slow though.

A line 100 characters long with a 10 character search string would require like 90 Levenshtein calculations (each of which is fairly intensive) and this is just one line of what might be a 10,000 line text file.

I am going to play around with Levenshtein some more.

FG
Last edited by FanaticGuru on 01 May 2015, 16:44, edited 1 time in total.
Hotkey Help - Help Dialog for Currently Running AHK Scripts

AHK Startup - Consolidate Multiply AHK Scripts with one Tray Icon

[Function] Timer - Create and Manage Timers
toralf
Posts: 511
Joined: 27 Apr 2014, 21:08
Location: Germany

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

01 May 2015, 16:33

I agree with your explanation.
My remark was not meant to say that SIFT3 could replace ngram or the other features but maybe could complement the functions. Depending on what is being searched for all approaches have their pros and cons
ciao
toralf
rommmcek
Posts: 375
Joined: 15 Aug 2014, 15:18

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

03 May 2015, 09:52

Very sophisticated code, but a bit slow on large data! Probably because of it's versatility & complexity! However why don't you include:

Code: Select all

SetBachLines, -1
which would render the script more then twice as fast!

bye
FanaticGuru
Posts: 1208
Joined: 30 Sep 2013, 22:25

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

04 May 2015, 16:36

Very sophisticated code, but a bit slow on large data! Probably because of it's versatility & complexity! However why don't you include:

Code: Select all

SetBachLines, -1
which would render the script more then twice as fast!

bye
SetBatchLines, -1 will double the speed of most scripts because it tells the computer to have the script use all available CPU processing time as opposed to the default of only use up to half of the available CPU processing time.

While an important option to keep in mind, this is really a command that is best set by the user depending on the environment they are going to be running the script in because setting it to -1 could have very negative issues with other programs the user has running at the same time.

Sift_Ngram is relatively slow though just due to the amount of comparisons that must be done in an interpretative language type way.

A typical search will take about 1.5 seconds per 1 meg searched the first time. Most of that time is creating the Ngram matrix of the search file. If the Ngram matrix of an unchanged search file is passed back to the function then each subsequent search will take about .4 seconds.

I would not recommend doing a search as you type like I did in the Gui example for large files. 4 seconds to search a 10 meg file when you push enter might be acceptable but not 4 seconds each time you type a character.

The Sift_Regex is much faster. While a RegEx comparison is a pretty complicated process it is using a command in AHK for the bulk of the work which is operating at the speed of C code. A Sift_Regex comparison depending on the options happens at about .01 to .05 seconds per meg.

These speeds are based on my computer with is fairly typically. And these times can be cut in half if you give the script priority to use more than 50% CPU processing time through SetBatchLines.

FG
Hotkey Help - Help Dialog for Currently Running AHK Scripts

AHK Startup - Consolidate Multiply AHK Scripts with one Tray Icon

[Function] Timer - Create and Manage Timers
guest3456
Posts: 2438
Joined: 09 Oct 2013, 10:31

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

04 May 2015, 17:09

SetBatchLines, -1 will double the speed of most scripts because it tells the computer to have the script use all available CPU processing time as opposed to the default of only use up to half of the available CPU processing time.

While an important option to keep in mind, this is really a command that is best set by the user depending on the environment they are going to be running the script in because setting it to -1 could have very negative issues with other programs the user has running at the same time.
I used to worry about this too, but I've used that line in a very CPU intensive script (multiple gdip image searches on each of 20+ different windows per second) without ill effects. And its very important to still have CPU available for the other programs in this setup

FanaticGuru
Posts: 1208
Joined: 30 Sep 2013, 22:25

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

04 May 2015, 19:31

SetBatchLines, -1 will double the speed of most scripts because it tells the computer to have the script use all available CPU processing time as opposed to the default of only use up to half of the available CPU processing time.

While an important option to keep in mind, this is really a command that is best set by the user depending on the environment they are going to be running the script in because setting it to -1 could have very negative issues with other programs the user has running at the same time.
I used to worry about this too, but I've used that line in a very CPU intensive script (multiple gdip image searches on each of 20+ different windows per second) without ill effects. And its very important to still have CPU available for the other programs in this setup
Yea, most of the time it is not an issue to use SetBatchLines, -1. You typically have multiple CPU's that are not running at 100% capacity. And then even if they are, time critical processes will generally set their priority higher than AHK scripts' default. Windows is pretty smart about its CPU utilization. Generally it will keep all the processes running fairly smoothly no matter who is yelling they are the most important. I can definitely see why it might be considered a waste to set aside 50% idle CPU just in case a process that is lower priority than normal might need it.

But there is a reason AHK scripts default to a normal priority and a 50% idle CPU usage. An AHK script is generally a secondary process. If the CPUs are doing all they can do and something needs to take a backseat to more important things going on then AHK is normally the one you want to take a backseat. If the AHK script really is more important then commands like SetBatchlines, Process, and Critical provide a way to let Windows know that.

FG
Hotkey Help - Help Dialog for Currently Running AHK Scripts

AHK Startup - Consolidate Multiply AHK Scripts with one Tray Icon

[Function] Timer - Create and Manage Timers
rommmcek
Posts: 375
Joined: 15 Aug 2014, 15:18

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

08 May 2015, 16:52

Thanks for thorough explanation! Obviously I don't have as profound knowledge as you do, however I intuitively use speed up features only for the time I really need them then I switch back to normal!
I merely compared simple search of Sift_RegEx with IfInString command!
Anyhow, I'm already planning to use your functions!
Do you know if Uberi in his Autocomplte dealing with huge data uses the same method as you are?

Thanks!
FanaticGuru
Posts: 1208
Joined: 30 Sep 2013, 22:25

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

11 May 2015, 13:17

Do you know if Uberi in his Autocomplte dealing with huge data uses the same method as you are?
I only glanced at Uberi's code for AutoComplete but at its core it also uses RegExMatch the same as Sift_Regex but it is a little different because in that code the data being searched is alphabetized which allows for the possibility of some optimization.

AutoComplete is basically using similar code as Sift_Regex with the OC option of Ordered Characters. It is also different and faster to look for one word that matches one word and then return only that matching word than to look for a phrase that is in a line and then return not only the matching phrase but the entire line it was found within.

FG
Hotkey Help - Help Dialog for Currently Running AHK Scripts

AHK Startup - Consolidate Multiply AHK Scripts with one Tray Icon

[Function] Timer - Create and Manage Timers
rommmcek
Posts: 375
Joined: 15 Aug 2014, 15:18

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

13 May 2015, 11:44

Hi,

It's very kind of you to give me very detailed answer! I will certainly try it out and hopefully deepen my knowledge!
I'm using this kind of searches for a long time! At first with batch files, then 4Dos which become TakeCommand and even with Pascal and later Delphi. All were truly fast, but I deliberately gave up some speed for versatility and adaptability of AutoHotkey, besides nowadays we have much faster computers and your functions are pretty much all, one can wish, sifting through data!

Thanks again!
User avatar
Chunjee
Posts: 63
Joined: 18 Apr 2014, 19:05
GitHub: Chunjee

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

16 Oct 2018, 15:12

SetBatchLines, -1 will double the speed of most scripts because it tells the computer to have the script use all available CPU processing time as opposed to the default of only use up to half of the available CPU processing time.

While an important option to keep in mind, this is really a command that is best set by the user depending on the environment they are going to be running the script in because setting it to -1 could have very negative issues with other programs the user has running at the same time.
I don't think this matches the documentation: https://autohotkey.com/docs/commands/SetBatchLines.htm

"it indicates how often the script should sleep (each sleep is 10 ms long). In the following example, the script will sleep for 10ms every time it has run for 20ms: SetBatchLines, 20ms"

There's no interference with other programs and certainly no negative consequences.

FG
FanaticGuru
Posts: 1208
Joined: 30 Sep 2013, 22:25

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

17 Oct 2018, 12:52

There's no interference with other programs and certainly no negative consequences.
That is not true.

Two quotes from the documentation:
  • Determines how fast a script will run (affects CPU utilization).
  • For example, on most systems a setting of 10ms will prevent the script from using any more than 50% of an idle CPU's time. This allows scripts to run quickly while still maintaining a high level of cooperation with CPU sensitive tasks such as games and video capture/playback.
It affects CPU utilization and cooperation with other tasks.

The CPU only has so much processing power. SetBatchLines, -1 tells a script to use more CPU time which means other threads get less processing time. That means other threads (programs) can run slower depending on their priority level. You are basically telling the CPU that this thread is more important than any other thread of equal or less priority. Even just within Autohotkey, if you had two scripts that were in a loop calculating pi at the same time, by default they would run at close to the same speed. You put SetBatchLines, -1 in one script the other script would slow down.

Now all that said, Windows is very smart about CPU utilization. Hundreds of smart people have worked for years to try to make a hundred threads by different developers all run smoothly at the same time. Things that really matter will generally get the CPU time they need. Windows is reluctant to let one program take a huge percentage of the processing time. So, generally you do not notice major negative consequences when you use SetBatchLines, -1. But there are consequences and that is the reason -1 is not the default setting.

Now where you could really get in trouble though is combining SetBatchLines, -1 with Critical and Process. Between those three commands you can tell Windows that your program is the highest priority, use as much processing time as possible, and don't let anything interrupt you. You can get your computer to lag badly if not lock up completely.

So if your AutoHotkey script really is more important than other processes then that is what SetBatchLines, -1 for. For me, in general, my AutoHotkey scripts are less important than other programs running and if something is going to slow down I generally want it to be my scripts. In most cases I am happy with my scripts being good neighbors and using only their fair share of processing time.

If ever developer told Windows that their process was the most important then Windows would have a hard time running smoothly. Luckily ever developer does not use something equivalent to SetBatchLines, -1 in their programs but if just one developer does then the system with adjust and handle it.

I don't think it is an appropriate setting to put in a function for distribution. SetBatchLines, -1 and Process, Priority are settings that should be decided by the overall script author based on what other scripts and programs are going to be running at the same time.

In short most users will not notice the consequences of using SetBatchLines, -1 so use it in ever script if you like, and this is a good one to use it in, but it does have an effect on other programs even if you don't notice it.

FG
Hotkey Help - Help Dialog for Currently Running AHK Scripts

AHK Startup - Consolidate Multiply AHK Scripts with one Tray Icon

[Function] Timer - Create and Manage Timers
Kellyzkorner_NJ
Posts: 12
Joined: 20 Oct 2017, 18:33

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

29 Oct 2018, 12:26

Code: Select all


#Include Sift.ahk
:*:searchword::

InputBox, SearchText, Search for, Search for Keyword
;Loop, Parse, Data, `n, `r

;FileRead, Data, C:\myfile.ahk

;Data = "C:\myfile.ahk"

FileRead, Data, C:\myfile.ahk

MsgBox % Sift_Ngram(Data, %SearchText%)

Return
Hello,

I'm wondering if you could help me a bit with this. I'm trying to search a file(s) (more than one is even better) with your function. My test file would include for example

dupe
fantastic to revisit
revisit
copying to
physician name to copy

Obviously I'm doing something wrong. I get a blank box. I have been using the CodeQuickTester to do this, including the whole function in the tester window. Thanks in advance for any help.

Kelly
FanaticGuru
Posts: 1208
Joined: 30 Sep 2013, 22:25

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

29 Oct 2018, 12:54

Code: Select all


#Include Sift.ahk
:*:searchword::

InputBox, SearchText, Search for, Search for Keyword
;Loop, Parse, Data, `n, `r

;FileRead, Data, C:\myfile.ahk

;Data = "C:\myfile.ahk"

FileRead, Data, C:\myfile.ahk

MsgBox % Sift_Ngram(Data, %SearchText%)

Return
I'm trying to search a file(s) (more than one is even better) with your function. My test file would include for example

dupe
fantastic to revisit
revisit
copying to
physician name to copy

Obviously I'm doing something wrong. I get a blank box. I have been using the CodeQuickTester to do this, including the whole function in the tester window.
Rarely do you need %% around a variable in a function.

This works as expected for me.

Code: Select all

#Include Sift.ahk

:*:searchword::
InputBox, SearchText, Search for, Search for Keyword
FileRead, Data, C:\myfile.ahk
MsgBox % Sift_Ngram(Data, SearchText)
Return
You need to make sure you have the Sift.ahk file where expected. The myfile.ahk needs to be where expected. Also I would not name this file with an 'ahk' extension as that signifies it is an AHK script. I would probably use 'txt' as it is a text file but you can use any extension you want.

FG
Hotkey Help - Help Dialog for Currently Running AHK Scripts

AHK Startup - Consolidate Multiply AHK Scripts with one Tray Icon

[Function] Timer - Create and Manage Timers
Kellyzkorner_NJ
Posts: 12
Joined: 20 Oct 2017, 18:33

Re: [Library] Sift - Fuzzy Search by RegEx and Ngram

29 Oct 2018, 13:22

Thank you so much FanaticGuru, it worked for me now too. I knew I was pretty close, I appreciate your help.

Return to “Scripts and Functions”

Who is online

Users browsing this forum: Bing [Bot] and 66 guests