
Regular Expressions (RegEx) for AutoHotkey
UPDATE (older): I've added a new poll to help choose the name of the RegEx functions. I think the advantage of InStrRE() or InStrReg() is that they emphasize that the function is like InStr (haystack comes before needle, and the return value is the found position [0 if not found]). However, RegMatch() or RegExMatch() might be more familiar to people used to PHP and other languages (in which case, perhaps the needle parameter should come before haystack to match PHP). Thanks for voting.
I'd like to get advice from you all on the feasibility of including RegEx in AutoHotkey simply by distributing a RegEx DLL with the installer. To support the new DLL, some new built-in functions would be created in AutoHotkey (i.e. wrapper functions).
First let me ask whether the plan described above is basically sound:
- Any major licensing issues?
- Any technical hurdles of adding code to AutoHotkey that directly accesses such a DLL (such as Unicode vs. ANSI and memory/string management issues)?
- Any other issues you can think of?
For the flavor of RegEx, it's my understanding that the only serious contender is Perl-compatible regular expressions (PCRE). Please let me know if there's anything else that should be considered, and also which variant of PCRE you would recommend (possible considerations are DLL size, fastest performance, the most friendly license, etc.)
I think the main advantage of distributing a separate DLL for RegEx rather than building it into the program is savings in code size and memory utilization (since the RegEx code would be loaded into memory only when the script actually uses it for the first time). The main drawback of a DLL is that compiled scripts would need a copy of the DLL present if they use any RegEx functions (the script should probably display an error upon launch if the DLL isn't found in the current directory/PATH somewhere).
Finally (assuming the above seems feasible), I'd appreciate some input on the quantity and naming of the wrapper functions (and their parameters in cases where it isn't obvious). The names should be logical but should also consider the naming used in popular languages such as Perl, Python, and PHP. In other words, even if a name isn't optimal, the fact that it is very common in other languages might be enough to make it the winning choice.
Thanks in advance for your advice.
Related:
- Regular expressions: a wrapper around the PCRE DLL (by PhiLho)
- Regular expressions (RegEx): library and wrapper (by Thomas Lauer)
- Regular expressions (Wikipedia article/explanation)
- Regular expression tutorial (by PhiLho)
Edit: Added link for Regular expressions: a wrapper around the PCRE DLL.

I voted option 2.
Sorry, I can't help you on most of the questions you have due to lack of knowledge on my side.
The only thing that came to my mind is that it should be possible (optionally) to specify the place/path of the DLL. This is especially important for compiled scripts:
1) AHK might not be installed
2) The exe might be placed in a path where the user has no write permission (installed by admin)
So the only way would be to place the DLL in some dir that the user has access to and add that path to the PATH.
I would prefer to FileInstall the DLL to a place the user has access to and specify that place when I call the DLL in the script, so that the script doesn't have to change the PATH.

toralf
Quote: The main drawback of a DLL is that compiled scripts would need a copy of the DLL present if they use any RegEx functions
Can't the compiler be made to detect this and automatically FileInstall/merge the DLL?
Quote: Finally (assuming the above seems feasible), I'd appreciate some input on the quantity and naming of the wrapper functions (and their parameters in cases where it isn't obvious). The names should be logical but should also consider the naming used in popular languages such as Perl, Python, and PHP. In other words, even if a name isn't optimal, the fact that it is very common in other languages might be enough to make it the winning choice.
Since AutoHotkey is not an object-oriented language, using the PHP PCRE syntax will be the easiest:
- Instead of the 'preg_' prefix, use 'regex_' or no prefix at all
- Similarity with StringGetPos, StringReplace and StringSplit:
  - regex_match(exp, string) - returns the index of the first match (blank if none)
  - regex_replace(exp, new, string) - returns string with the matches of exp replaced by new
  - regex_split(exp, string) - splits string into an array by exp in global mode
- Keep PCRE regex syntax (/pattern/modifiers), e.g. /gr[ea]y/ig = global case-insensitive match of grey or gray
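To make the proposal above concrete, here is a rough sketch of how the three functions might behave, using Python's `re` module as a stand-in for PCRE. The function names and parameter order come from the list above; the `/pattern/modifiers` parser, the 1-based index, and the way `g` switches between first-match and all-match replacement are assumptions for illustration only, not a settled design.

```python
import re

# Single-letter modifiers mapped to Python flags (assumed set: i, m, s, x).
_FLAGS = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL, "x": re.VERBOSE}

def _parse(exp):
    """Split '/pattern/modifiers' into (compiled regex, is_global)."""
    if not exp.startswith("/") or "/" not in exp[1:]:
        raise ValueError("expected /pattern/modifiers")
    # The last slash separates the pattern from its modifiers.
    pattern, _, mods = exp[1:].rpartition("/")
    flags = 0
    for letter in mods:
        if letter != "g":          # 'g' is handled by the caller, not the engine
            flags |= _FLAGS[letter]
    return re.compile(pattern, flags), "g" in mods

def regex_match(exp, string):
    """Return the 1-based index of the first match, or '' (blank) if none."""
    rx, _ = _parse(exp)
    m = rx.search(string)
    return m.start() + 1 if m else ""

def regex_replace(exp, new, string):
    """Replace the first match, or every match when the g modifier is set."""
    rx, is_global = _parse(exp)
    return rx.sub(new, string, count=0 if is_global else 1)

def regex_split(exp, string):
    """Split string at every match of exp."""
    rx, _ = _parse(exp)
    return rx.split(string)
```

With this sketch, `regex_match("/gr[ea]y/ig", "The GREY cat")` reports the position of `GREY`, and dropping the `g` from the same expression makes `regex_replace` stop after the first hit.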

As to the syntactic flavour, I agree with Chris. Perl compatible REs are the way to go: they are extremely well-documented and there are dozens, if not hundreds of tutorials and examples out there. They are powerful and not too resource hungry.
There are a few free implementations of this standard; the one that I have used to good effect and the one that is probably more thoroughly tested than anything else is Phil Hazel's PCRE (used, among others, in Apache, Exim and PHP). Current version is 6.7, I think; however, I am not 100% convinced that this is actually the best version for AHK's purposes. But I could have a look into this matter, once the dust has settled.
I don't know enough about the AHK internals to say anything about the actual implementation. If there is to be a DLL anyway, I would do as much of the code as possible in C or C++, because this is more efficient than AHK, and then hardwire the Regex..() calls to the DLL.
However, I am not sure whether integrating an optimised version of the PCRE code into AHK itself wouldn't be a better strategy. In a best case scenario this would add about 22 to 25 KB (upx'ed), but it would mean a good few cans of worms firmly closed.
There need only be one function for replacing stuff:
RegexReplace(str,search,repl,options). I would not try to mimic the Perl or PHP way (s/// etc.). This syntax is not particularly obvious and it's not easy to understand for newbies.
As to matching, this is more difficult. I would probably opt for simplicity and do a "one in all" set. (I.e. I would not actually have functions to compile and reuse REs, as this can be a pretty complicated business and I am not sure that AHK users are interested in this level of complexity.)
One function to get the number of matches, another to collect the actual matches should be sufficient (I have done no AHK for a while, so I am not sure about whether a function can be made to return an array of strings.) If, at some later point, this is not deemed efficient enough, there is always the possibility to add further functions during the development cycle.
The smaller the number of functions, the better.

Don't reinvent the wheel. People who learn regex with AHK will more easily be able to use it in other languages (using PCRE), and those who come from other languages (using PCRE) will easily learn how to use the regex facilities in AHK.

Quote: ...the place/path of the DLL should be able (optional) to be specified. Especially for compiled scripts this is important. 1) AHK might not be installed 2) The exe might be placed in a path the user has no write permission (installed by admin)
I'd like to avoid such an option if possible. If an admin installed the compiled script, he would normally have included the DLL if it's required by the script (otherwise, the script wouldn't launch). Also, see below.
Quote: I would prefer to FileInstall the DLL to a place user has access to and specify the place when I call the DLL in the script. So that the script doesn't have to change the path.
If the script calls LoadLibrary on the explicit path of the DLL, I think the program could detect this and avoid searching for the DLL. In addition, it might be desirable to create a separate version of AutoHotkeySC.bin that has built-in RegEx so that the DLL wouldn't be needed. Thanks for the ideas.
Quote: Can't the compiler be made to detect this and automatically FileInstall/merge the DLL?
Possibly, but it would complicate the compiler and require that it extract the DLL to some specific path such as %A_Temp%. I'm not sure it would be worth it. In fact, if there's actually a high demand for this, it might be better to avoid the DLL and have RegEx built-in as thomasl suggested.
Quote: Since AutoHotkey is not an object orientated language using the PHP pcre syntax will be the easiest:
- Instead of 'preg_' prefix use 'regex_' or no prefix at all
- Similarity with StringGetPos, StringReplace and StringSplit:
  - regex_match(exp, string) - returns the index of the first match (blank if none)
  - regex_replace(exp, new, string) - returns replacement string of exp with new in string
  - regex_split(exp, string) - splits string into array by exp in global mode
- Keep pcre regex syntax (/pattern/modifiers), e.g. /gr[ea]y/ig = global case insensitive match of grey or gray
Thanks.
Quote: There are a few free implementations of this [PCRE] standard; the one that I have used to good effect and the one that is probably more thoroughly tested than anything else is Phil Hazel's PCRE (used, among others, in Apache, Exim and PHP). Current version is 6.7, I think; however, I am not 100% convinced that this is actually the best version for AHK's purposes. But I could have a look into this matter, once the dust has settled.
If you happen to find some other PCRE variant that's more suitable -- due to licensing, code size, performance, etc. -- please let me know.
Quote: I don't know enough about the AHK internals to say anything about the actual implementation. If there is to be a DLL anyway, I would do as much of the code as possible in C or C++, because this is more efficient than AHK, and then hardwire the Regex..() calls to the DLL.
Yes, it was my intent that the only thing interpreted about RegEx is the script's actual call to the function. Everything after that would be C code.
Quote: However, I am not sure whether integrating an optimised version of the PCRE code into AHK itself wouldn't be a better strategy. In a best case scenario this would add about 22 to 25 KB (upx'ed), but it would mean a good few cans of worms firmly closed.
That is a good point. If nothing else, a separate version of AutoHotkeySC.bin could be created that contains RegEx (for those who want it). In addition, AutoHotkey.exe itself could include the code directly to make it more portable.
However, I wonder if there are any licensing issues. The GPL is notorious for its viral nature, and it's my understanding that a non-GPL project that wishes to incorporate GPL code must make its license GPL-compatible. But what about the converse: Can a GPL project like AutoHotkey incorporate code from a less restrictive license that isn't GPL compatible? If not, the DLL approach might be preferable if not mandatory.
Quote: There need only be one function for replacing stuff: RegexReplace(str,search,repl,options). I would not try to mimic the Perl or PHP way (s/// etc.). This syntax is not particularly obvious and it's not easy to understand for newbies.
Thanks for the advice because this is an area of RegEx I'm not familiar with.
Quote: As to matching, this is more difficult. I would probably opt for simplicity and do a "one in all" set. (I.e. I would not actually have functions to compile and reuse REs, as this can be a pretty complicated business and I am not sure that AHK users are interested in this level of complexity.)
Good point. Although I'd be interested in some kind of automatic caching or auto-compiling (if worthwhile), there might be no such code available. If not, coding it from scratch would seem to be a low priority.
Quote: One function to get the number of matches, another to collect the actual matches should be sufficient (I have done no AHK for a while, so I am not sure about whether a function can be made to return an array of strings.) If, at some later point, this is not deemed efficient enough, there is always the possibility to add further functions during the development cycle.
It could create an array of strings, but not return an array per se (since in AutoHotkey, an array isn't a single object but instead a collection of variables). So I think you're right that this feature could be deferred until better arrays are added to AHK.
Quote: The smaller the number of functions, the better.
I used to think that way, but lately I've been leaning more toward having more functions if it improves clarity and usability. For example, LV_Add() and LV_Insert() are separate functions but share almost exactly the same code internally (which helps cut down on code size).
Thanks to everyone for all the advice. More comments are definitely welcome.

Quote: If you happen to find some other PCRE variant that's more suitable -- due to licensing, code size, performance, etc. -- please let me know.
For PCREAHK.DLL I used 5.0 of PCRE and I used that version for a reason: it was smaller and easier to compile. The newer versions have some bugs fixed, but the code size has grown as well. It's possible to cut down the code but that would involve some work.
Quote: However, I wonder if there are any licensing issues. The GPL is notorious for its viral nature, and it's my understanding that a non-GPL project that wishes to incorporate GPL code must make its license GPL-compatible. But what about the converse: Can a GPL project like AutoHotkey incorporate code from a less restrictive license that isn't GPL compatible? If not, the DLL approach might be preferable if not mandatory.
PCRE is under the BSD license; I can't see any problems as long as you include the copyright notices. Then again, I am not a lawyer and these guys see things we mortals simply do not see.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* Neither the name of the University of Cambridge nor the name of Google Inc. nor the names of their contributors may be used to endorse or promote products derived from this software without specific prior written permission.
Quote: Although I'd be interested in some kind of automatic caching or auto-compiling (if worthwhile), there might be no such code available. If not, coding it from scratch would seem to be a low priority.
You can have a fully automatic mode; that's relatively easy to do. The problem with non-automatic modes is that they shift the burden to the user of the PCRE library (i.e. the caller). This gives flexibility at the price of added complexity. Given that many people struggle with REs I would concentrate on making the interface as simple as possible, at least for the first release. Further releases can add complexity, as long as it's understood that using this is completely optional.
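The "fully automatic mode" mentioned above could be as small as a lookup table keyed by pattern text, so the caller never sees a compile step. A minimal sketch in Python (standing in for the C code that would wrap PCRE); the names `compiled` and `regex_count`, the cache limit, and the crude eviction policy are all hypothetical choices made for illustration:

```python
import re

_cache = {}          # (pattern, flags) -> compiled regex
_MAX_ENTRIES = 100   # assumed limit, chosen arbitrarily for this sketch

def compiled(pattern, flags=0):
    """Return a compiled regex, compiling only the first time a pattern is seen."""
    key = (pattern, flags)
    rx = _cache.get(key)
    if rx is None:
        if len(_cache) >= _MAX_ENTRIES:
            _cache.clear()   # crude eviction: drop everything and start over
        rx = re.compile(pattern, flags)
        _cache[key] = rx
    return rx

def regex_count(string, pattern):
    """Count non-overlapping matches; the caller passes plain pattern text."""
    return sum(1 for _ in compiled(pattern).finditer(string))
```

A script that calls `regex_count` in a loop with the same pattern pays the compilation cost once, which is the benefit of explicit compile/reuse functions without exposing them.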

Of course, I voted for the "frequent" use.
Thomas deserves to be mentioned first, but I would like my own "pure" AHK wrapper to be mentioned in the first article too, thank you. Of course, I appreciate that out of hundreds of good tutorials on the Net, you chose mine. :-)
A point of interest: PCRE (the only one, by Phil Hazel) has shrunk in size between 6.4 and 6.5, because of optimization of a big Unicode table. And actually we can compile this library without any Unicode (UTF-8) support, since AHK doesn't support it anyway, thus getting a smaller size and a small speed increase.
PCRE is a good choice, made by a lot of languages and software: robust, complete, well tested and maintained.
I have read the source code of PCRE recently, and it is highly optimized for speed.
I believe there is no problem to link a library under a modified BSD licence (or any other GPL-compatible licence) to a GPL program. For example, the Hypermail program mentioned in the PCRE home page is GPL.
Frankly, I am not overly convinced this must be kept outside AHK. I understand the argument, but adding 20 or even 40 more KB shouldn't be overkill. And solving the issues of compiled scripts will not be obvious, nor will managing the fact that some official commands might not be available (although that's already the case in Win98).
So far AHK is monolithic, and that's a good thing.
Plus I don't see the advantage over using a simple wrapper of the official DLL, like I did (except you don't have to load the wrapper, and the API is official).
Another reason will be explained below.
I disagree with Titan on the choice of PHP syntax. First, because these names don't fit the current naming choices of AHK. Second, why PHP? JS could be another choice. Or Java, etc. Last, the functions he mentions lack an important parameter that PCRE supports: an offset into the string. This bit us in the JS coloring of comments (but I found a workaround, should the need arise: .{48} for a 48-char offset), and its absence can be a real gap.
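To show what the offset parameter buys over the `.{48}` trick, here is a small comparison sketched in Python's `re` rather than PCRE. Both helper names are hypothetical, and the workaround is written in one plausible form (skip a fixed number of characters, then find the first match after them); the original workaround may have been used differently.

```python
import re

def find_with_workaround(text, pattern, offset):
    """Skip `offset` characters inside the regex itself, then find the first match."""
    # (?s:...) lets the skipping part cross newlines without changing `pattern`.
    # Drawback: wrapping `pattern` in a group shifts its own group numbers by one.
    rx = re.compile(r"\A(?s:.{" + str(offset) + r"}.*?)(" + pattern + ")")
    m = rx.match(text)
    return m.start(1) if m else -1

def find_with_offset(text, pattern, offset):
    """Let the engine start scanning at `offset`; the pattern stays untouched."""
    m = re.compile(pattern).search(text, offset)
    return m.start() if m else -1

# Two C-style comments; the second one starts after the first 48 characters.
sample = "/* a */ " + "x" * 40 + " /* b */"
comment = r"/\*.*?\*/"
print(find_with_workaround(sample, comment, 48), find_with_offset(sample, comment, 48))
```

Both calls locate the second comment (0-based positions here), but only the offset version leaves the pattern and its capture groups as the script author wrote them.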
I started to rewrite my PCRE wrapper more in the AHK spirit, avoiding a separate compilation phase, but I can try to advance it to show what I would like for syntax, along the lines of my signature:
result := RegExMatch(bigString, regex[, offset, options])
result := RegExReplace(bigString, regex, replaceExpr[, offset, options])
result := RegExSplit(bigString, regex[, options])
The prefix can be shortened or omitted. It probably needs helper functions to access the results (at least for Match).
Did that in hurry, needs more thinking for a friendly interface.
Options would be a string, with the classic g i m x (more?), easier to understand and use than big numeric constants to OR.
Perhaps we could add a function (or constant to overwrite) to set the options for the next calls.
Perhaps we can integrate offset to options, and either add a parameter to RegExReplace to indicate the number of changes ("A" for all) or integrate this to options too.
Actually, I propose to drop these functions, or don't rely only on them.
I saw from time to time requests to support REs in IfWinActive and related commands. Better than SetTitleMatchMode...
I believe that regular expressions should be pervasive in AutoHotkey, exactly like expressions are now!
In other words, everywhere a match can be done against a string, it should be possible to do it against a regular expression...
Perhaps the easiest way to do this is to create a special operator, like %(A_Space) (ie. like MsgBox % GetData(x)).
For consistency with other languages, it could be / or // (because we often write /[a-b]+/ and such). Of course, a better choice can be made, for readability, intelligibility and compatibility.
We might need an escape, for rare cases where a title starts with this symbol.
I am aware this is a bigger change than just adding some functions, but I believe that is what users (at least the RE-savvy ones) will expect.
And if treated consistently, it shouldn't be so hard (easy to write, I know...).
Whatever way you choose, I will support you, it will be always an improvement anyway.


If it works in AHK I think I would use them!
Thalon

I checked the links, and the PCRE stuff seems to support the things I need (greedy/non-greedy, backreferences, etc.) :-)
I'd prefer not to have to deal with a separate DLL for distribution purposes (that's a vote for including it by default in AHK and a compiled script if needed).
So I'm with PhiLho: I believe that regular expressions should be pervasive in AutoHotkey, exactly like expressions are now! I also like his syntax examples. The closer a replace operation is to s/ ... the easier it will be to use.
Thanks,
John B.

Quote: I believe that regular expressions should be pervasive in AutoHotkey, exactly like expressions are now!
Although this would be helpful to advanced users, newbies will find it harder to understand.
Quote: I disagree with Titan on the choice of PHP syntax. First because these names don't fit in the current naming choices of AHK. Second, why PHP? JS could be another choice. Or Java, etc. Last, the functions he mentions lack an important parameter that PCRE supports: offset in the string.
PHP is similar to AutoHotkey. You can't copy JavaScript's var myregex = new RegExp(...) or /regex/m.exec(str) syntax, but you can recreate PHP's regexp functions. The offset parameter would be great - my proposals were just basic examples so I didn't think to include this, the order of the params can be changed around as well.

Quote: I believe that regular expressions should be pervasive in AutoHotkey, exactly like expressions are now!
Quote: Although this would be helpful to advanced users, newbies will find it harder to understand.
Why? If not using the special operator, the commands remain unchanged.
StringSplit a, line, :;
vs.
StringSplit a, line, // [:;]
StringReplace newHTML, HTMLString,
, ``, All
StringReplace newHTML, newHTML,
, ``, All
StringSplit a, newHTML, ``
vs.
StringSplit a, HTMLString, // (
|
)
Both syntaxes and ways can peacefully coexist.
The only drawback is that the doc. will grow again. ;-)
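The coexistence described above amounts to a one-line check on the delimiter parameter. A rough sketch in Python of how a StringSplit-like command could dispatch; the `//` marker is taken from the examples above, while the function name, the handling of the space after the marker, and the blank-delimiter behaviour are assumptions for illustration:

```python
import re

REGEX_PREFIX = "//"   # marker used in the examples above; the final spelling was still open

def string_split(text, delimiters):
    """Split on a regex when `delimiters` starts with the marker, else on literal chars."""
    if delimiters.startswith(REGEX_PREFIX):
        # Assumption: whitespace right after the marker is not part of the pattern.
        pattern = delimiters[len(REGEX_PREFIX):].lstrip()
        return re.split(pattern, text)
    if not delimiters:
        # Assumed plain-mode behaviour for a blank delimiter list: one item per character.
        return list(text)
    # Plain mode: every character in `delimiters` is a separate literal delimiter.
    return re.split("[" + re.escape(delimiters) + "]", text)
```

Under these assumptions `string_split(line, ":;")` and `string_split(line, "// [:;]")` give the same result, and scripts that never use the marker never reach the regex branch.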
Quote: The offset parameter would be great - my proposals were just basic examples so I didn't think to include this, the order of the params can be changed around as well.
Indeed. I chose to put the "bigString" first, to be consistent with InStr. But in this case, offset needs to go after options.
result := InStr(Haystack, Needle [, CaseSensitive?, StartingPos])
result := RegExMatch(bigString, regex[, options, offset])
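A sketch of that InStr-like signature, written in Python with `re` standing in for PCRE. The parameter order follows the line above; the 1-based offset, the return value of 0 when nothing is found (as InStr does), and the accepted option letters are assumptions, and `reg_ex_match` is only a Python-style spelling of the proposed name:

```python
import re

# Option letters as a plain string, as suggested earlier in the thread.
# 'g' is accepted but ignored here, since a single match has no global mode.
_OPTION_FLAGS = {"i": re.IGNORECASE, "m": re.MULTILINE, "x": re.VERBOSE, "g": 0}

def reg_ex_match(big_string, regex, options="", offset=1):
    """Return the 1-based position of the first match at or after offset, else 0."""
    flags = 0
    for letter in options:
        flags |= _OPTION_FLAGS[letter]
    m = re.compile(regex, flags).search(big_string, offset - 1)
    return m.start() + 1 if m else 0

print(reg_ex_match("grey Gray", "gr[ea]y"))          # search from the start
print(reg_ex_match("grey Gray", "gr[ea]y", "i", 2))  # case-insensitive, from position 2
```

Keeping offset last means the common case (no offset) needs no placeholder argument, which mirrors how StartingPos sits at the end of InStr.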


Quote: If not using the special operator, the commands remain unchanged.
This means that certain parameters of string commands have to detect // prior to execution and parse the quantifiers through the regexp engine. This extra overhead will affect performance and add unneeded complexity. Most commands already accept expressions, which confuses newbies, as it can be difficult to tell at a glance whether an expression, a literal or a variable is being used. Adding regexps to the mix would only cause more confusion. Dedicated regexp functions like RegExMatch()/regex_replace()/etc. will be much easier to understand due to consistency with other languages like JavaScript/PHP, which have their own regexp methods and functions.

Basically, I don't feel an urgent need for RE support in string commands, as functions work well. But I feel that users will wonder why REs are not allowed everywhere there is a WinTitle or WinText parameter or their exclude variants.
Not that I will use this feature very often, but I guess such questions are pending...
Now, if Chris replies once and for all "that's too hard to support" (or too confusing, etc.), OK, no problem.
Say it is just a suggestion. :-)


Quote: I believe that regular expressions should be pervasive in AutoHotkey, exactly like expressions are now!
This is the way to go. This is the true power. About the n00bs and performance, I think you worry too much. 8)
