Regular Expressions (RegEx) for AutoHotkey


112 replies to this topic

Poll: What should the names of the RegEx functions be (if you HAD to pick one of these)? (42 member(s) have cast votes)

What should the names of the RegEx functions be (if you HAD to pick one of these)?

  1. RegExMatch() and RegExReplace() (43 votes [84.31%])


  2. RegMatch() and RegReplace() (8 votes [15.69%])


Chris
  • Administrators
  • 10727 posts
  • Last active:
  • Joined: 02 Mar 2004
UPDATE: Another poll to help narrow down the name.

UPDATE (older): I've added a new poll to help choose the name of the RegEx functions. I think the advantage of InStrRE() or InStrReg() is that they emphasize that the function is like InStr (haystack comes before needle, and the return value is the found position [0 if not found]). However, RegMatch() or RegExMatch() might be more familiar to people used to PHP and other languages (in which case, perhaps the needle parameter should come before haystack to match PHP). Thanks for voting.


I'd like to get advice from you all on the feasibility of including RegEx in AutoHotkey simply by distributing a RegEx DLL with the installer. To support the new DLL, some new built-in functions would be created in AutoHotkey (i.e. wrapper functions).

First let me ask whether the plan described above is basically sound:
- Any major licensing issues?
- Any technical hurdles of adding code to AutoHotkey that directly accesses such a DLL (such as Unicode vs. ANSI and memory/string management issues)?
- Any other issues you can think of?

For the flavor of RegEx, it's my understanding that the only serious contender is Perl-compatible regular expressions (PCRE). Please let me know if there's anything else that should be considered, and also which variant of PCRE you would recommend (possible considerations are DLL size, fastest performance, the most friendly license, etc.)

I think the main advantage of distributing a separate DLL for RegEx rather than building it into the program is savings in code size and memory utilization (since the RegEx code would be loaded into memory only when the script actually uses it for the first time). The main drawback of a DLL is that compiled scripts would need a copy of the DLL present if they use any RegEx functions (the script should probably display an error upon launch if the DLL isn't found in the current directory/PATH somewhere).

Finally (assuming the above seems feasible), I'd appreciate some input on the quantity and naming of the wrapper functions (and their parameters in cases where it isn't obvious). The names should be logical but should also consider the naming used in popular languages such as Perl, Python, and PHP. In other words, even if a name isn't optimal, the fact that it is very common in other languages might be enough to make it the winning choice.

Thanks in advance for your advice.

Related:
- Regular expressions: a wrapper around the PCRE DLL (by PhiLho)
- Regular expressions (RegEx): library and wrapper (by Thomas Lauer)
- Regular expressions (Wikipedia article/explanation)
- Regular expression tutorial (by PhiLho)

Edit: Added link for Regular expressions: a wrapper around the PCRE DLL.

toralf
  • Moderators
  • 4035 posts
  • Last active: Aug 20 2014 04:23 PM
  • Joined: 31 Jan 2005
I haven't used it yet, but I guess there will be a time when it will be handy, e.g. I could have used it for my EPG script to parse the HTML files.
I voted option 2.

Sorry, I can't help you on most of the questions you have due to lack of knowledge on my side.

The only thing that came to my mind is that it should be possible (optionally) to specify the path of the DLL. This is especially important for compiled scripts:
1) AHK might not be installed
2) The exe might be placed in a path where the user has no write permission (installed by admin)
So the only way would be to place the DLL in some dir that the user has access to and add that path to the PATH.

I would prefer to FileInstall the DLL to a place the user has access to and specify that place when I call the DLL in the script, so that the script doesn't have to change the path.
Ciao
toralf
 

polyethene
  • Members
  • 5519 posts
  • Last active: May 17 2015 06:39 AM
  • Joined: 26 Oct 2012

The main drawback of a DLL is that compiled scripts would need a copy of the DLL present if they use any RegEx functions

Can't the compiler be made to detect this and automatically FileInstall/merge the DLL?

Finally (assuming the above seems feasible), I'd appreciate some input on the quantity and naming of the wrapper functions (and their parameters in cases where it isn't obvious). The names should be logical but should also consider the naming used in popular languages such as Perl, Python, and PHP. In other words, even if a name isn't optimal, the fact that it is very common in other languages might be enough to make it the winning choice.

Since AutoHotkey is not an object orientated language, using the PHP pcre syntax will be the easiest:
- Instead of 'preg_' prefix use 'regex_' or no prefix at all
- Similarity with StringGetPos, StringReplace and StringSplit:
    - regex_match(exp, string) - returns the index of the first match (blank if none)
    - regex_replace(exp, new, string) - returns replacement string of exp with new in string
    - regex_split(exp, string) - splits string into array by exp in global mode
- Keep pcre regex syntax (/pattern/modifiers), e.g. /gr[ea]y/ig = global case insensitive match of grey or gray
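
To make the three proposed functions concrete, here is a minimal sketch in Python (one of the languages named earlier in the thread), with its standard re module standing in for PCRE. The function names and parameter order come from the proposal above. The 0-based index (as with StringGetPos) is an assumption, since the proposal does not say, and the /pattern/modifiers delimiters are left out.

```python
import re

def regex_match(exp, string):
    """Return the index of the first match, or "" (blank) if there is none.

    Assumption: the index is 0-based, which is why "no match" is blank
    rather than 0.
    """
    m = re.search(exp, string)
    return m.start() if m else ""

def regex_replace(exp, new, string):
    """Return string with every match of exp replaced by new."""
    return re.sub(exp, new, string)

def regex_split(exp, string):
    """Split string into a list at every match of exp."""
    return re.split(exp, string)

print(regex_match(r"gr[ea]y", "a grey cat"))               # 2
print(regex_replace(r"gr[ea]y", "black", "grey or gray"))  # black or black
print(regex_split(r"[:;]", "a:b;c"))                       # ['a', 'b', 'c']
```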

thomasl
  • Members
  • 92 posts
  • Last active: Sep 28 2006 09:55 AM
  • Joined: 16 Jun 2005
The idea is feasible and sound. In fact, I think a utility that can process text and does not support regexes (or can be extended to do so) is not a serious tool. I know that people shrink back from REs but I have yet to meet someone who, after the inevitable initial shock, didn't appreciate their power and relative simplicity. (No, I am not kidding.)

As to the syntactic flavour, I agree with Chris. Perl compatible REs are the way to go: they are extremely well-documented and there are dozens, if not hundreds of tutorials and examples out there. They are powerful and not too resource hungry.

There are a few free implementations of this standard; the one that I have used to good effect and the one that is probably more thoroughly tested than anything else is Phil Hazel's PCRE (used, among others, in Apache, EXIM and PHP). Current version is 6.7, I think; however, I am not 100% convinced that this is actually the best version for AHK's purposes. But I would have to look into this matter, once the dust has settled.

I don't know enough about the AHK internals to say anything about the actual implementation. If there is to be a DLL anyway, I would do as much of the code as possible in C or C++, because this is more efficient than AHK, and then hardwire the Regex..() calls to the DLL.

However, I am not sure whether integrating an optimised version of the PCRE code into AHK itself wouldn't be a better strategy. In a best case scenario this would add about 22 to 25 KB (upx'ed), but it would mean a good few cans of worms firmly closed.

There need only be one function for replacing stuff:
RegexReplace(str,search,repl,options). I would not try to mimic the Perl or PHP way (s/// etc.). This syntax is not particularly obvious and it's not easy to understand for newbies.

As to matching, this is more difficult. I would probably opt for simplicity and do a "one in all" set. (I.e. I would not actually have functions to compile and reuse REs, as this can be a pretty complicated business and I am not sure that AHK users are interested in this level of complexity.)

One function to get the number of matches, another to collect the actual matches should be sufficient (I have done no AHK for a while, so I am not sure about whether a function can be made to return an array of strings.) If, at some later point, this is not deemed efficient enough, there is always the possibility to add further functions during the development cycle.
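
A rough sketch of that two-function idea, written in Python with its standard re module as a stand-in for PCRE. The names regex_count and regex_collect are hypothetical, chosen only for illustration, and returning a list is a Python convenience rather than something AHK could do at the time.

```python
import re

def regex_count(string, search):
    """Return the number of non-overlapping matches of search in string."""
    return sum(1 for _ in re.finditer(search, string))

def regex_collect(string, search):
    """Return the text of every non-overlapping match, in order."""
    return [m.group(0) for m in re.finditer(search, string)]

print(regex_count("a1b22c333", r"\d+"))    # 3
print(regex_collect("a1b22c333", r"\d+"))  # ['1', '22', '333']
```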

The smaller the number of functions, the better.

foom
  • Members
  • 386 posts
  • Last active: Jul 04 2007 04:53 PM
  • Joined: 19 Apr 2006
I'd second what Titan said.

Don't reinvent the wheel. People who learn regex with AHK will more easily be able to use it in other languages (using PCRE), and those who come from other languages (using PCRE) will easily learn how to use the regex facilities in AHK.

Chris
  • Administrators
  • 10727 posts
  • Last active:
  • Joined: 02 Mar 2004

...it should be possible (optionally) to specify the path of the DLL. This is especially important for compiled scripts:
1) AHK might not be installed
2) The exe might be placed in a path where the user has no write permission (installed by admin)

I'd like to avoid such an option if possible. If an admin installed the compiled script, he would normally have included the DLL if it's required by the script (otherwise, the script wouldn't launch). Also, see below.

I would prefer to FileInstall the DLL to a place the user has access to and specify that place when I call the DLL in the script, so that the script doesn't have to change the path.

If the script calls LoadLibrary on the explicit path of the DLL, I think the program could detect this and avoid searching for the DLL. In addition, it might be desirable to create a separate version of AutoHotkeySC.bin that has built-in RegEx so that the DLL wouldn't be needed. Thanks for the ideas.

Can't the compiler be made to detect this and automatically FileInstall/merge the DLL?

Possibly, but it would complicate the compiler and require that it extract the DLL to some specific path such as %A_Temp%. I'm not sure it would be worth it. In fact, if there's actually a high demand for this, it might be better to avoid the DLL and have RegEx built-in as thomasl suggested.

Since AutoHotkey is not an object orientated language, using the PHP pcre syntax will be the easiest:
- Instead of 'preg_' prefix use 'regex_' or no prefix at all
- Similarity with StringGetPos, StringReplace and StringSplit:
    - regex_match(exp, string) - returns the index of the first match (blank if none)
    - regex_replace(exp, new, string) - returns replacement string of exp with new in string
    - regex_split(exp, string) - splits string into array by exp in global mode
- Keep pcre regex syntax (/pattern/modifiers), e.g. /gr[ea]y/ig = global case insensitive match of grey or gray

Thanks.

There are a few free implementations of this [PCRE] standard; the one that I have used to good effect and the one that is probably more thoroughly tested than anything else is Phil Hazel's PCRE (used, among others, in Apache, EXIM and PHP). Current version is 6.7, I think; however, I am not 100% convinced that this is actually the best version for AHK's purposes. But I would have to look into this matter, once the dust has settled.

If you happen to find some other PCRE variant that's more suitable -- due to licensing, code size, performance, etc. -- please let me know.

I don't know enough about the AHK internals to say anything about the actual implementation. If there is to be a DLL anyway, I would do as much of the code as possible in C or C++, because this is more efficient than AHK, and then hardwire the Regex..() calls to the DLL.

Yes, it was my intent that the only thing interpreted about RegEx is the script's actual call to the function. Everything after that would be C code.

However, I am not sure whether integrating an optimised version of the PCRE code into AHK itself wouldn't be a better strategy. In a best case scenario this would add about 22 to 25 KB (upx'ed), but it would mean a good few cans of worms firmly closed.

That is a good point. If nothing else, a separate version of AutoHotkeySC.bin could be created that contains RegEx (for those who want it). In addition, AutoHotkey.exe itself could include the code directly to make it more portable.

However, I wonder if there are any licensing issues. The GPL is notorious for its viral nature, and it's my understanding that a non-GPL project that wishes to incorporate GPL code must make its license GPL-compatible. But what about the converse: Can a GPL project like AutoHotkey incorporate code from a less restrictive license that isn't GPL compatible? If not, the DLL approach might be preferable if not mandatory.

There need only be one function for replacing stuff:
RegexReplace(str,search,repl,options). I would not try to mimic the Perl or PHP way (s/// etc.). This syntax is not particularly obvious and it's not easy to understand for newbies.

Thanks for the advice because this is an area of RegEx I'm not familiar with.

As to matching, this is more difficult. I would probably opt for simplicity and do a "one in all" set. (I.e. I would not actually have functions to compile and reuse REs, as this can be a pretty complicated business and I am not sure that AHK users are interested in this level of complexity.)

Good point. Although I'd be interested in some kind of automatic caching or auto-compiling (if worthwhile), there might be no such code available. If not, coding it from scratch would seem to be a low priority.

One function to get the number of matches, another to collect the actual matches should be sufficient (I have done no AHK for a while, so I am not sure about whether a function can be made to return an array of strings.) If, at some later point, this is not deemed efficient enough, there is always the possibility to add further functions during the development cycle.

It could create an array of strings, but not return an array per se (since in AutoHotkey, an array isn't a single object but instead a collection of variables). So I think you're right that this feature could be deferred until better arrays are added to AHK.

The smaller the number of functions, the better.

I used to think that way, but lately I've been leaning more toward having more functions if it improves clarity and usability. For example, LV_Add() and LV_Insert() are separate functions but share almost exactly the same code internally (which helps cut down on code size).

Thanks to everyone for all the advice. More comments are definitely welcome.

thomasl
  • Members
  • 92 posts
  • Last active: Sep 28 2006 09:55 AM
  • Joined: 16 Jun 2005

If you happen to find some other PCRE variant that's more suitable -- due to licensing, code size, performance, etc. -- please let me know.

For PCREAHK.DLL I used 5.0 of PCRE and I used that version for a reason: it was smaller and easier to compile. The newer versions have some bugs fixed, but the code size has grown as well. It's possible to cut down the code but that would involve some work.

However, I wonder if there are any licensing issues. The GPL is notorious for its viral nature, and it's my understanding that a non-GPL project that wishes to incorporate GPL code must make its license GPL-compatible. But what about the converse: Can a GPL project like AutoHotkey incorporate code from a less restrictive license that isn't GPL compatible? If not, the DLL approach might be preferable if not mandatory.

PCRE is under the BSD license; I can't see any problems as long as you include the copyright notices. Then again, I am not a lawyer and these guys see things we mortals simply do not see.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

    * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

    * Neither the name of the University of Cambridge nor the name of Google Inc. nor the names of their contributors may be used to endorse or promote products derived from this software without specific prior written permission.

Although I'd be interested in some kind of automatic caching or auto-compiling (if worthwhile), there might be no such code available. If not, coding it from scratch would seem to be a low priority.

You can have a fully automatic mode; that's relatively easy to do. The problem with non-automatic modes is that they shift the burden to the user of the PCRE library (i.e. the caller). This gives flexibility at the price of added complexity. Given that many people struggle with REs I would concentrate on making the interface as simple as possible, at least for the first release. Further releases can add complexity, as long as it's understood that using this is completely optional.

PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005
Hey, that's good news! :-)
Of course, I voted for the "frequent" use.
Thomas deserves to be the first mentioned, but I would like to have my own "pure" AHK wrapper mentioned in the first article, thank you. Of course, I appreciate that out of hundreds of good tutorials on the Net, you chose mine. :-)

A point of interest: PCRE (the only one, by Phil Hazel) has shrunk in size between 6.4 and 6.5, because of optimization of a big Unicode table. And actually we can compile this library without any Unicode (UTF-8) support, since AHK doesn't support it anyway, thus getting a smaller size and a small speed increase.

PCRE is a good choice, made by a lot of languages and software: robust, complete, well tested and maintained.
I have read the source code of PCRE recently, and it is highly optimized for speed.
I believe there is no problem to link a library under a modified BSD licence (or any other GPL-compatible licence) to a GPL program. For example, the Hypermail program mentioned in the PCRE home page is GPL.

Frankly, I am not overly convinced this must be kept outside AHK. I understand the argument, but adding 20 or even 40 more KB shouldn't be overkill. And solving the issues of compiled scripts will not be obvious. Nor is it obvious how to manage the fact that some official commands might not be available (although that's already the case in Win98).
So far AHK is monolithic, and that's a good thing.
Plus I don't see the advantage over using a simple wrapper of the official DLL, like I did (except you don't have to load the wrapper, and the API is official).
Another reason will be explained below.

I disagree with Titan on the choice of PHP syntax. First because these names don't fit in the current naming choices of AHK. Second, why PHP? JS could be another choice. Or Java, etc. Last, the functions he mentions lack an important parameter that PCRE supports: offset in the string. This bit us in the JS coloring of comments (but I found a workaround, should the need be: .{48} for a 48 char offset) and this can be a miss.

I started to rewrite my PCRE wrapper more in the AHK spirit, avoiding a separate compilation phase, but I can try to advance it to show what I would like for syntax, on the lines of my signature:
result := RegExMatch(bigString, regex[, offset, options])
result := RegExReplace(bigString, regex, replaceExpr[, offset, options])
result := RegExSplit(bigString, regex[, options])
The prefix can be shortened or omitted. It probably needs helper functions to access the results (at least for Match).
Did that in a hurry, needs more thinking for a friendly interface.

Options would be a string, with the classic g i m x (more?), easier to understand and use than big numeric constants to OR.
Perhaps we could add a function (or constant to overwrite) to set the options for the next calls.
Perhaps we can integrate offset to options, and either add a parameter to RegExReplace to indicate the number of changes ("A" for all) or integrate this to options too.
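
To show how a letter-based options string might be handled, here is a minimal Python sketch with the standard re module standing in for PCRE. The helper names are hypothetical, and so is the reading of "g": it is assumed to mean "replace all matches", with a single replacement when it is absent. The offset parameter is left out for brevity.

```python
import re

# Assumed mapping from option letters to flags of the stand-in engine.
_FLAG_LETTERS = {
    "i": re.IGNORECASE,  # case-insensitive
    "m": re.MULTILINE,   # ^ and $ match at line breaks
    "x": re.VERBOSE,     # ignore whitespace and allow comments in the pattern
}

def parse_options(options):
    """Turn an options string such as "ig" into (flags, replace_all)."""
    flags = 0
    replace_all = False
    for letter in options:
        if letter == "g":
            replace_all = True
        elif letter in _FLAG_LETTERS:
            flags |= _FLAG_LETTERS[letter]
        else:
            raise ValueError(f"unknown option letter: {letter!r}")
    return flags, replace_all

def regex_replace_with_options(big_string, regex, replace_expr, options=""):
    """Replace the first match, or every match when "g" is given."""
    flags, replace_all = parse_options(options)
    count = 0 if replace_all else 1  # for re.sub, count=0 means "all"
    return re.sub(regex, replace_expr, big_string, count=count, flags=flags)

print(regex_replace_with_options("Grey gray", "gr[ea]y", "X", "ig"))  # X X
print(regex_replace_with_options("Grey gray", "gr[ea]y", "X", "i"))   # X gray
```

A string of letters keeps the call readable in a script; the cost is that a typo in the options is only caught at run time, which is why the sketch raises an error on unknown letters.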

Actually, I propose to drop these functions, or don't rely only on them.
I saw from time to time requests to support REs in IfWinActive and related commands. Better than SetTitleMatchMode...

I believe that regular expressions should be pervasive in AutoHotkey, exactly like expressions are now!
In other words, everywhere a match can be done against a string, it should be possible to do it against a regular expression...
Perhaps the easiest way to do this is to create a special operator, like %(A_Space) (ie. like MsgBox % GetData(x)).
For consistency with other languages, it could be / or // (because we often write /[a-b]+/ and such). Of course, a better choice can be made, for readability, intelligibility and compatibility.
We might need an escape, for rare cases where a title starts with this symbol.
I am aware this is a bigger change than just adding some functions, but yet I believe that what users (at least those RE-savvy) will expect.
And if treated consistently, it shouldn't be so hard (easy to write, I know...).

Whatever way you choose, I will support you, it will be always an improvement anyway.
vPhiLho := RegExReplace("Philippe Lhoste", "^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2")

Thalon
  • Members
  • 641 posts
  • Last active: Jan 02 2017 12:17 PM
  • Joined: 12 Jul 2005
I did not use them now (lacking of support), but it would be a nice feature.
If it works in AHK I think I would use them!

Thalon

John B
  • Guests
  • Last active:
  • Joined: --
New user of AHK here. I'm all in favor of full RegExp support. Every time I see a question about RegExp in the forum, I think, "Gee, that would be easy with SED". Right now, I do a lot from the command line with SED and DOS batch files (modifying hundreds of HTML files in a directory tree in seconds). If it would be easy to do the same in AHK, I could simplify a number of things I do every week ("Frequently"). I was really surprised when I discovered "regular expressions" is not in the AHK help index!

I checked the links, and the PCRE stuff seems to support the things I need (greedy/non-greedy, backreferences, etc.) :-)

I'd prefer not to have to deal with a separate DLL for distribution purposes (that's a vote for including it by default in AHK and a compiled script if needed).

So I'm with PhiLho: I believe that regular expressions should be pervasive in AutoHotkey, exactly like expressions are now! I also like his syntax examples. The closer a replace operation is to
s///
... the easier it will be to use.

Thanks,
John B.

polyethene
  • Members
  • 5519 posts
  • Last active: May 17 2015 06:39 AM
  • Joined: 26 Oct 2012

I believe that regular expressions should be pervasive in AutoHotkey, exactly like expressions are now!

Although this would be helpful to advanced users, newbies will find it harder to understand.

I disagree with Titan on the choice of PHP syntax. First because these names don't fit in the current naming choices of AHK. Second, why PHP? JS could be another choice. Or Java, etc. Last, the functions he mentions lack an important parameter that PCRE supports: offset in the string.

PHP is similar to AutoHotkey. You can't copy JavaScript's var myregex = new RegExp(...) or /regex/m/.exec(str) syntax but you can create PHP's regexp functions. The offset parameter would be great - my proposals were just basic examples so I didn't think to include this, the order of the params can be changed around as well.

PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005

I believe that regular expressions should be pervasive in AutoHotkey, exactly like expressions are now!

Although this would be helpful to advanced users, newbies will find it harder to understand.

Why? If not using the special operator, the commands remain unchanged.
StringSplit a, line, :;
vs.
StringSplit a, line, // [:;]

StringReplace newHTML, HTMLString,
, ``, All
StringReplace newHTML, newHTML,
, ``, All
StringSplit a, newHTML, ``
vs.
StringSplit a, HTMLString, // (
|
)

Both syntaxes and ways can pacifically coexist.
The only drawback is that the doc. will grow again. ;-)

The offset parameter would be great - my proposals were just basic examples so I didn't think to include this, the order of the params can be changed around as well.

Indeed. I chose to put the "bigString" first, to be consistent with InStr. But in this case, offset needs to go after options.

result := InStr(Haystack, Needle [, CaseSensitive?, StartingPos])
result := RegExMatch(bigString, regex[, options, offset])
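
The InStr-style signature above could behave roughly like the following Python sketch (standard re module as a stand-in for PCRE). The snake_case name, the 1-based offset, and the handling of only the "i" option are assumptions made to keep the example short.

```python
import re

def regex_match(big_string, regex, options="", offset=1):
    """Return the 1-based position of the first match at or after offset.

    Returns 0 when nothing matches, mirroring InStr's "0 if not found".
    Assumption: offset is 1-based, like InStr's StartingPos.
    """
    flags = re.IGNORECASE if "i" in options else 0
    m = re.compile(regex, flags).search(big_string, offset - 1)
    return m.start() + 1 if m else 0

print(regex_match("abcabc", "b"))         # 2
print(regex_match("abcabc", "b", "", 3))  # 5
print(regex_match("ABC", "b", "i"))       # 2
print(regex_match("abc", "z"))            # 0
```

Because 0 is never a valid 1-based position, the result can be used directly in an if test, which is the same convenience InStr offers.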

vPhiLho := RegExReplace("Philippe Lhoste", "^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2")

polyethene
  • Members
  • 5519 posts
  • Last active: May 17 2015 06:39 AM
  • Joined: 26 Oct 2012

If not using the special operator, the commands remain unchanged.

This means that certain parameters of string commands have to detect // prior to execution and parse the quantifiers through the regexp engine. This extra overhead will affect performance and add unneeded complexity. Most commands already accept expressions which confuses newbies as it can be difficult to tell at a glance whether this, a literal or a variable is being used. Adding regexps to the mix would only cause more confusion. Dedicated regexp functions like RegExMatch()/regex_replace()/etc. will be much easier to understand due to consistency with other languages like javascript/PHP which have their own regexp methods and functions.

PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005
I don't know, but I trust Chris to make the right choice (lot of burden! :-)).
Basically, I don't feel an urge of RE support in string commands, as functions work well. But I feel that users will wonder why everywhere there is a WinTitle or WinText or exclude variants, REs are not allowed.
Not that I will use this feature very often, but I guess such questions are pending...
Now, if Chris replies once and for all "that's too hard to support" (or too confusing, etc.), OK, no problem.
Say it is just a suggestion. :-)
vPhiLho := RegExReplace("Philippe Lhoste", "^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2")

majkinetor
  • Moderators
  • 4512 posts
  • Last active: Jul 29 2016 12:40 AM
  • Joined: 24 May 2006

I believe that regular expressions should be pervasive in AutoHotkey, exactly like expressions are now!

This is the way to go. This is the true power. About the n00bs and performance, I think you worry too much. 8)