Regular Expressions (RegEx) for AutoHotkey


112 replies to this topic

Poll: What should the names of the RegEx functions be (if you HAD to pick one of these)? (42 member(s) have cast votes)

  1. RegExMatch() and RegExReplace() (43 votes [84.31%])
  2. RegMatch() and RegReplace() (8 votes [15.69%])

Chris
  • Administrators
  • 10727 posts
  • Last active:
  • Joined: 02 Mar 2004

PCRE is under the BSD license; I can't see any problems [including it in a GPL project] as long as you include the copyright notices. Then again, I am not a lawyer and these guys see things we mortals simply do not see.

Thanks. It might be moot if I decide to go the DLL route. If anyone knows more about the above, please let me know.

I believe there is no problem to link a library under a modified BSD licence (or any other GPL-compatible licence) to a GPL program. For example, the Hypermail program mentioned in the PCRE home page is GPL.

Ah ha! I'd forgotten about the library/linking permission. That's a great option to have in case it's not legal to directly include the source code inside a GPL project.

I would like to have my own "pure" AHK wrapper to be mentioned in the first article, thank you.

I'd forgotten about that! I've added a link up top to Regular expressions: a wrapper around the PCRE DLL. I intend to study it to find out how close it is to a complete solution (maybe all I have to do is convert your approach to C code :)).

A point of interest: PCRE (the only one, by Philip Hazel) has shrunk in size between 6.4 and 6.5, because of the optimization of a big Unicode table. And actually we can compile this library without any Unicode (UTF-8) support, since AHK doesn't support it anyway, thus getting a smaller size and a small speed increase.

Great. If you have any more details about compiling the DLL that way -- or maybe even a ready-made, optimized DLL and/or LIB -- please let me know.

...lacks an important parameter that PCRE supports: offset in the string.

I do try to minimize the number of parameters, or at least move seldom-used parameters to the end of the parameter list so that they can be optional. Another way is to sneak in new parameters by combining them (like you mention below).

Perhaps we could add a function (or constant to overwrite) to set the options for the next calls.
Perhaps we can integrate offset to options, and either add a parameter to RegExReplace to indicate the number of changes ("A" for all) or integrate this to options too.

Good idea (especially the options merging). You can tell from the GUI commands that I like having a combined options parameter (though it has its drawbacks).

I started to rewrite my PCRE wrapper more in the AHK spirit, avoiding a separate compilation phase, but I can try to advance it to show what I would like for syntax, along the lines of my signature:
result := RegExMatch(bigString, regex[, offset, options])
result := RegExReplace(bigString, regex, replaceExpr[, offset, options])
result := RegExSplit(bigString, regex[, options])
The prefix can be shortened or omitted. It probably needs helper functions to access the results (at least for Match).
I did that in a hurry; it needs more thinking for a friendly interface.
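To make the proposed signatures concrete, here is a minimal sketch in Python, used as a stand-in because the AHK functions were still being designed at this point in the thread. The function names, the single-letter option codes, and the return conventions are all assumptions for illustration, not the eventual AutoHotkey behavior. It shows an optional offset as a trailing parameter and "$1"-style backreferences in the replacement.

```python
import re

# Hypothetical option letters; the thread does not define an option syntax.
_OPTION_FLAGS = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL, "x": re.VERBOSE}


def _flags(options):
    """Combine single-letter options into one flags value."""
    flags = 0
    for letter in options:
        flags |= _OPTION_FLAGS[letter]
    return flags


def _template(replace_expr):
    """Turn "$1"-style references into Python's "\\g<1>" form, keeping backslashes literal."""
    escaped = replace_expr.replace("\\", "\\\\")
    return re.sub(r"\$(\d+)", r"\\g<\1>", escaped)


def regex_match(big_string, regex, offset=0, options=""):
    """Return (position, text) of the first match at or after offset, or (-1, "")."""
    m = re.compile(regex, _flags(options)).search(big_string, offset)
    return (m.start(), m.group(0)) if m else (-1, "")


def regex_replace(big_string, regex, replace_expr, offset=0, options=""):
    """Replace every match at or after offset; text before offset is left untouched."""
    pattern = re.compile(regex, _flags(options))
    template = _template(replace_expr)
    pieces, last = [big_string[:offset]], offset
    for m in pattern.finditer(big_string, offset):
        pieces.append(big_string[last:m.start()])
        pieces.append(m.expand(template))
        last = m.end()
    pieces.append(big_string[last:])
    return "".join(pieces)


def regex_split(big_string, regex, options=""):
    """Split big_string wherever regex matches."""
    return re.compile(regex, _flags(options)).split(big_string)


short_name = regex_replace("Philippe Lhoste", r"^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2")
```

Passing the offset to the engine, rather than slicing the string first, keeps anchors such as ^ tied to the real start of the string, which is one reason an offset parameter is more than a convenience.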

I really appreciate how detailed that is. I think it will be a great help.

I saw from time to time requests to support REs in IfWinActive and related commands. Better than SetTitleMatchMode...
...
I feel that users will wonder why everywhere there is a WinTitle or WinText or exclude variants, REs are not allowed. Not that I will use this feature very often, but I guess such questions are pending...

I think that would require either a new SetTitleMatchMode or something like IfWinExist ahk_regex. Your // idea is also a possibility.
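As a rough illustration of what RegEx title matching would buy, here is a small Python sketch. The function name and the sample window titles are made up for the example; nothing here reflects how AutoHotkey enumerates windows.

```python
import re


def find_windows(titles, title_regex):
    """Return the titles that contain a match for title_regex, in their original order."""
    pattern = re.compile(title_regex)
    return [title for title in titles if pattern.search(title)]


# Hypothetical list of open window titles.
titles = ["Untitled - Notepad", "report.txt - Notepad", "Calculator"]

# Only Notepad windows showing a .txt file.
txt_windows = find_windows(titles, r"\.txt - Notepad$")
```

A single pattern can cover what would otherwise need a particular SetTitleMatchMode plus extra checks, such as "ends with" combined with "contains".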

I believe that regular expressions should be pervasive in AutoHotkey, exactly like expressions are now!
In other words, everywhere a match can be done against a string, it should be possible to do it against a regular expression...
Perhaps the easiest way to do this is to create a special operator, like %(A_Space) (i.e. like MsgBox % GetData(x)).
For consistency with other languages, it could be / or // (because we often write /[a-b]+/ and such). Of course, a better choice can be made, for readability, intelligibility and compatibility.

This is the way to go. This is the true power. About the n00bs and performance, I think you worry too much. 8)

Ambitious yet attractive. However, it might help to create a more complete/interesting list of where this proposed RegEx extension would be used -- such as WinTitle, WinText, some functions/commands that you mention below. As it stands now, the benefit vs. cost doesn't seem compelling.

If not using the special operator, the commands remain unchanged.
StringSplit a, line, // [:;]
Both syntaxes and ways can peacefully coexist.
The only drawback is that the doc. will grow again. ;-)
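A minimal Python sketch of how such a prefix could coexist with the literal form. The "// " marker comes from the StringSplit example above; the function name and dispatch logic are assumptions.

```python
import re

REGEX_PREFIX = "// "  # marker taken from the StringSplit example above


def string_split(text, delimiter):
    """Split on a regex if delimiter starts with the marker, otherwise on the literal text."""
    if delimiter.startswith(REGEX_PREFIX):
        return re.split(delimiter[len(REGEX_PREFIX):], text)
    return text.split(delimiter)


by_regex = string_split("a:b;c", "// [:;]")   # regex form: split on ':' or ';'
by_literal = string_split("a:b;c", ":")       # unchanged literal form
```

Note that a script already using a literal delimiter that happens to begin with "// " would change behavior under this scheme, so the marker would have to be something no existing script can contain.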

Now that I think about it, I'm leaning away from retrofitting old commands to support RegEx. For one thing, it would be a lot of coding, at least in some cases. For another, spending time extending commands would be a waste if some of them will someday be better expressed/popular as a function (StringReplace might be one example of this). Finally, users of RegEx tend to have a technical background and thus might generally prefer functions instead of commands.

Most commands already accept expressions, which confuses newbies because it can be difficult to tell at a glance whether an expression, a literal, or a variable is being used. Adding regexps to the mix would only cause more confusion. Dedicated regexp functions like RegExMatch()/regex_replace()/etc. will be much easier to understand due to consistency with other languages like JavaScript and PHP, which have their own regexp methods and functions.

For now, I tend to agree with this. But more discussion is welcome.

I'd prefer not to have to deal with a separate DLL for distribution purposes (that's a vote for including it by default in AHK and a compiled script if needed).

From what you and everyone else have said, I think there should at least be an alternate version of AutoHotkeySC.bin that contains built-in RegEx (if it's not too much work).

The closer a replace operation is to
s///
... the easier it will be to use.

Although not very readable, its brevity is admirable and probably addictive to veterans. However, given AHK's current emphasis on readability, perhaps something longer yet intuitive would be better.

Thank you all for sharing your ideas and experience. Please post more ideas as they occur to you.

thomasl
  • Members
  • 92 posts
  • Last active: Sep 28 2006 09:55 AM
  • Joined: 16 Jun 2005
Some quick remarks.

@Chris: whatever you decide in the way of functions, interfaces etc., remember that some people will use REs not only for short and sweet match/replace games. I regularly do checks on the tags of my MP3 collection that require pretty complex REs and work on a "string" of ~1400 kb (even if I currently do this in Perl, I could just as well use AHK). That's probably a perverse case, but a good RE engine has to be able to handle that sort of thing. Mostly, PCRE (or whatever you'll use) should take care of that. Still, large strings will require an efficient interface between AHK strings and C/C++ strings. There's also the code that handles the actual replacement functionality.

I believe that regular expressions should be pervasive in AutoHotkey, exactly like expressions are now!
In other words, everywhere a match can be done against a string, it should be possible to do it against a regular expression...

The idea is good; if implemented in a clear and unobtrusive manner I think it could give many string-related functions a boost.

Still, I would refrain from doing this for the time being. Changing too much in one go means the risk of introducing subtle bugs and losing people along the way. And, as Titan has pointed out, things in the expression/string/variable arena are already confusing enough. KISS :)

Last but not least there's the holy grail of backward compatibility: doing this in such a way that it breaks no existing scripts might well be impossible.

PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005

Great. If you have any more details about compiling the DLL that way -- or maybe even a ready-made, optimized DLL and/or LIB -- please let me know.

OK, I will look into it further. I planned to compile a DLL anyway, since the semi-official GNU ones, made with GCC, are quite large.

(about offset in the string) I do try to minimize the number of parameters, or at least move seldom-used parameters to the end of the parameter list so that they can be optional. Another way is to sneak in new parameters by combining them (like you mention below).

I see this as essential, and you already did it in InStr() anyway... In the prototype I am working on right now, I put this offset as the last parameter, so it is completely optional.

Good idea (especially the options merging). You can tell from the GUI commands that I like having a combined options parameter (though it has its drawbacks).

Yes, I try to throw out ideas consistent with the current design. ;-)

Now that I think about it, I'm leaning away from retrofitting old commands to support RegEx.

OK, as I wrote, I wouldn't have used it much anyway.

The closer a replace operation is to
s///
... the easier it will be to use.

Although not very readable, its brevity is admirable and probably addictive to veterans. However, given AHK's current emphasis on readability, perhaps something longer yet intuitive would be better.

I agree. This syntax, coming from Perl, has been copied in JavaScript, but only partially: in Perl, you can write s![xyz]!__!g to avoid using escapes. In JS, you cannot.
And I agree on the familiar and intuitive interface.
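To show why a free choice of delimiter avoids escapes, here is a simplified Python sketch of a Perl-like s<delim>pattern<delim>replacement<delim>flags parser. It is an illustration only: the function name is made up, escaped delimiters inside the pattern are not handled, only the g and i flags are recognized, and the replacement is treated as literal text.

```python
import re


def apply_substitution(expr, text):
    """Apply a Perl-style substitution such as 's![xyz]!__!g' to text."""
    if len(expr) < 4 or expr[0] != "s":
        raise ValueError("expected s<delim>pattern<delim>replacement<delim>flags")
    delim = expr[1]                      # whatever follows 's' is the delimiter
    parts = expr[2:].split(delim)
    if len(parts) != 3:
        raise ValueError("expected exactly three delimiters")
    pattern, replacement, flags = parts
    count = 0 if "g" in flags else 1     # count=0 means "replace all" in re.sub
    re_flags = re.IGNORECASE if "i" in flags else 0
    # A function replacement keeps the replacement text literal (no backreferences).
    return re.sub(pattern, lambda m: replacement, text, count=count, flags=re_flags)


all_replaced = apply_substitution("s![xyz]!__!g", "x-y-z")
slashes = apply_substitution("s!/!|!g", "a/b/c")   # '/' needs no escaping with '!' delimiters
```

With a fixed / delimiter, the second call would need the slash in the pattern escaped, which is exactly the case the alternate delimiter avoids.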

I still vote for full integration of the library in AHK. ;-)
vPhiLho := RegExReplace("Philippe Lhoste", "^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2")

majkinetor
  • Moderators
  • 4512 posts
  • Last active: Jul 29 2016 12:40 AM
  • Joined: 24 May 2006

PhiLho wrote:
I believe that regular expressions should be pervasive in AutoHotkey, exactly like expressions are now!
In other words, everywhere a match can be done against a string, it should be possible to do it against a regular expression...
Perhaps the easiest way to do this is to create a special operator, like %(A_Space) (i.e. like MsgBox % GetData(x)).
For consistency with other languages, it could be / or // (because we often write /[a-b]+/ and such). Of course, a better choice can be made, for readability, intelligibility and compatibility.

majkinetor wrote:
This is the way to go. This is the true power. About the n00bs and performance, I think you worry too much.

Ambitious yet attractive. However, it might help to create a more complete/interesting list of where this proposed RegEx extension would be used -- such as WinTitle, WinText, some functions/commands that you mention below. As it stands now, the benefit vs. cost doesn't seem compelling.

Regular expressions are one of the most important things ever. I use them every day: to rename files, to take care of MP3 tags, to format my XML files, to process the clipboard before pasting it, etc. The fact that some of you don't use them, or rarely use them, is a serious hole in your computer skills (no offense, just friendly advice). That said, my opinion is that if this can be done, there should be no doubt about whether it should be done. This will also summon another population of IT users here, those working with REs every day: a new testing and script-supplying ground.

Also, the standard syntax should not be changed, that is obvious.

Chris
  • Administrators
  • 10727 posts
  • Last active:
  • Joined: 02 Mar 2004

...large strings will require an efficient interface between AHK strings and C/C++ strings.

AHK variables are essentially the same as C strings. Each one is a struct containing a C-string along with some attributes (like capacity). I intend to exploit this to make the interface to PCRE as pure and unencumbered as possible (I'm acutely aware of how costly string operations can be, such as unnecessary calls to strlen).

Last but not least there's the holy grail of backward compatibility: doing this in such a way that it breaks no existing scripts might well be impossible.

That's a good point. I think the only way to do it with perfect compatibility would be to overload the existing "% " prefix (or something else involving %).

In the prototype I am working on right now, I put this offset as the last parameter, so it is completely optional.

Cool, a prototype!

in Perl, you can write s![xyz]!__!g to avoid using escapes. In JS, you cannot.

Maybe there's some way to offer both the short syntax and the long/function syntax. However, I'm wary of introducing anything that would cause ambiguity during script parsing, so I suspect it won't be feasible.

I still vote for full integration of the library in AHK. ;-)

I'll definitely check how much code size it adds if put directly into the EXE. To further reduce the size of PCRE, there are probably parts of the PCRE code that can be omitted (as thomasl said).

Thanks.

John B.
  • Guests
  • Last active:
  • Joined: --
majkinetor wrote

The fact that some of you don't use them, or rarely use them, is a serious hole in your computer skills (no offense, just friendly advice). That said, my opinion is that if this can be done, there should be no doubt about whether it should be done.

I very much agree with this. One of the make-or-break issues with any new tool is whether it supports regexp, and it's critical for any tool that modifies text files. I deal with a lot of HTML and XML. I'm not a programmer (in fact, I'm a technical writer), but I use regexps daily to maximize my productivity. Otherwise, I'd have died on my first XML project ("Say, could you add a few thousand Help IDs to 30 Mb of XML files by the end of the week?").

Thanks,
John B.

PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005
A lot of programmers I know don't use them, or only at a very primitive level.

it's critical for any tool that modifies text files.

Well, AutoHotkey wasn't designed for this, but since it has some powerful tools (Loop Parse, StringSplit, etc.), I see more and more requests on processing text files.
I started to give two answers: one with a one-line regex, then one with a more complex hand-parsing... ;-)

majkinetor
  • Moderators
  • 4512 posts
  • Last active: Jul 29 2016 12:40 AM
  • Joined: 24 May 2006

A lot of programmers I know don't use them, or only at a very primitive level.

So ?

A lot of people listen to Madonna, but compared to that, very few people listen to John Coltrane. Now, if that is going to be how I measure things, then to hell with this...

You care oh so much about a 300K addon, but the famous "typical user" will never notice the difference. I am tired of this double standard where some things are considered typical and some not.

This thing you call AHK is good; I don't have to tell you this. If you want it to be much better, add regular expressions directly to every command that deals with strings. Period.

PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005

A lot of programmers I know don't use them, or only at a very primitive level.

So ?

So, that's too bad for them! :-P

majkinetor
  • Moderators
  • 4512 posts
  • Last active: Jul 29 2016 12:40 AM
  • Joined: 24 May 2006
Correct!

You win a trip to a secret Hawaiian island. :D
The name of the island is dynamic (depending on the local climate), so you can only pronounce it using a RegExp. This pleasure comes as a free bonus.

PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005
This morning, I managed to compile PCRE 6.7 as a Windows DLL.
It was easier than I thought... except it just strikes me that I forgot to add the .def file to the project! Argh, no function is exported, I have a nice DLL with no entry point... :lol:
OK, I will give a link when it is functional, then.

Meanwhile, I will give the results:
I get a 40KB DLL when compiled with Minimize size option and no UTF-8 support, and a 68KB DLL with UTF-8 & Unicode support. Not too bad.
PS.: Compiled with Visual C++ 6... Will try with Visual Studio Express.

Compiling isn't so hard: you have to copy/rename a file, change two settings inside, compile a little program and run it so it generates a C file with the current locale (actually using C locale on Windows), then just compile all the pcre_xxx.c files, excluding three if not using UTF-8.
I wrote down the steps, so it will be easy to reproduce them.

majkinetor
  • Moderators
  • 4512 posts
  • Last active: Jul 29 2016 12:40 AM
  • Joined: 24 May 2006
Good job, PhiLho. Compiling can be a nightmare.

Now we have no excuse not to implement REs directly into the language. :D

Chris
  • Administrators
  • 10727 posts
  • Last active:
  • Joined: 02 Mar 2004

I get a 40KB DLL when compiled with Minimize size option and no UTF-8 support, and a 68KB DLL with UTF-8 & Unicode support. Not too bad.

That's impressive. That could easily shrink to 20 KB or less with UPX, and might even drop further if some of the code in the DLL/LIB overlaps with code already in AutoHotkey.exe (such as standard C-library code). I wonder: if you link to a lib that contains C-library code, does that code get omitted if the project already has the same code... hopefully so, or maybe C-library code doesn't get added into LIBs the way it does into EXEs and DLLs; instead it just makes references to the C-Library itself. That would be great because then the linker can include common code only once rather than twice.

By the way, I still need to verify that the PCRE license allows a LIB built from the code to be linked to a GPL project. Maybe someone already knows the answer.

...it generates a C file with the current locale (actually using C locale on Windows)

If it happens to use the C library's locale functions, I noticed they add a considerable amount of code size. For example, I seem to remember that setlocale() is around 15 KB by itself (at least in VC++ 7.1)!

So if it does use C-locale, we might be able to change it to use Windows API locale equivalents, which add close to zero code size.

Thanks for your work on this so far. I hadn't expected so much interest in RegEx, much less someone willing to do so much of the research and development.

majkinetor
  • Moderators
  • 4512 posts
  • Last active: Jul 29 2016 12:40 AM
  • Joined: 24 May 2006

I wonder: if you link to a lib that contains C-library code, does that code get omitted if the project already has the same code..

Hm..... hm... hm....
I think not, based on what I once heard: some programmers told me that Delphi is good because it includes shared code only once, and even if you use a library with plenty of functions, it includes only what you have used. They were comparing it to C. This is of course not reliable information, and I would also like to know whether it is true.

Chris
  • Administrators
  • 10727 posts
  • Last active:
  • Joined: 02 Mar 2004
It looks like it doesn't matter now because the PCRE code can be directly merged with the AutoHotkey code. ThomasL and PhiLho: You were right about the 3-clause BSD license being fully compatible with the GPL. Wikipedia is such a gift because it often clarifies murky issues like this. It says, "[BSD-licensed] code can be combined with a GPLed program without conflict (the new combination would have the GPL applied to the whole)."

My only remaining doubt about licensing is where to include the PCRE/BSD copyright/license in the binary distribution of AutoHotkey. It seems mandatory, so perhaps the best place is in the installer underneath the GPL, and also in the ZIP file version's license.txt.