Regular expressions: a wrapper around the PCRE DLL


20 replies to this topic
PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005
This code is obsolete as of version 1.0.45, which embeds PCRE!
It can still be of interest for educational purposes... ;-)

OK, I see half of the readers (uh? two-thirds? ninety percent?) asking "What are regular expressions?".

Well, I won't provide a full explanation here. I started to hand-write a tutorial, which I still have to type up and finish...
But in a few words, regular expressions, or regexp or regex or RE, are a powerful (but a bit geeky) way to manipulate text.
With them, you can see if a generic string (eg. "5 letters followed by 2 digits") is inside a text, you can extract this string (eg. getting the current version number from the AutoHotkey download page), check if a string meets some criteria (has the user typed a date in the right format?), transform a text (morph a list of C's #defines to a list of AHK's variable assignments), split a string with complex requirements (eg. get all words of a natural text, separated by spaces or punctuation signs), etc.
The drawback is the syntax, a bit cryptic for the uninitiated (and sometimes for the initiated...), but with practice, it appears that most tasks use rather simple expressions.
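To make the tasks listed above concrete, here is a minimal sketch. It uses Python's standard `re` module as a stand-in, not the PCRE_DLL.ahk wrapper, because the pattern syntax is largely Perl-compatible. The sample strings and variable names are invented for illustration.

```python
import re

# "5 letters followed by 2 digits", the example pattern from the text.
pattern = re.compile(r"[A-Za-z]{5}\d{2}")
text = "Order codes: hello42 and world07."

# Test: is such a string inside the text?
found = pattern.search(text) is not None

# Extract: pull out every occurrence.
codes = pattern.findall(text)

# Validate: does the whole input look like a YYYY-MM-DD date?
is_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2006-05-14") is not None

# Transform: turn a C #define into an AHK-style assignment.
assignment = re.sub(r"^#define\s+(\w+)\s+(\S+)$", r"\1 := \2",
                    "#define MAX_LEN 260")

# Split: words separated by spaces or punctuation signs.
words = [w for w in re.split(r"[\s.,;:!?]+", "Hello, world! How are you?") if w]

print(found, codes, is_date, assignment, words)
```

Each of the five operations is one call, which illustrates the point that most everyday tasks need only simple expressions.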

Currently, AutoHotkey doesn't support regular expressions, so we have to rely on some external DLL. One of the most used is PCRE (Perl Compatible Regular Expressions), which is powerful and can be compiled to a rather small DLL.

Thomas Lauer already provided a wrapper DLL for PCRE 5.0.
It has the advantage of being small and implementing a replace algorithm, since PCRE does only searches.
It has the drawbacks of relying on an old version of this library (but the latest ones are big!), of using only the Posix version of the library, of needing a supplementary wrapper (in AHK) around this DLL, of being rather inefficient because it recompiles an RE on each use, of being difficult to change (you need a C compiler), etc.

So, I tried to make my own implementation in pure AutoHotkey using only the official DLL.
Thus, if a new version comes out, you can use it. Or, possibly with some changes, you can use an older, smaller version. You can customize the wrapper to your tastes, since it is pure script.

The replace algorithm might be a bit slower, because I had to write it all in AHK, but you can compensate by adding extra power by hacking these routines.

I provide no split function, because it is inconvenient to write in AutoHotkey, as we cannot return arrays. So either the result would be global, or hard to fetch. But implementing a split with the provided functions should be quite trivial.
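As a rough illustration of how a split can be built from nothing more than successive match positions, here is a sketch in Python's `re` module (a stand-in for the wrapper, not its actual API). The function name `split_with_matches` is hypothetical.

```python
import re

def split_with_matches(pattern, text):
    """Split text around every match of pattern, using only match positions."""
    pieces = []
    start = 0  # where the current piece begins
    for m in pattern.finditer(text):
        if m.end() == m.start():
            continue  # ignore empty matches, they separate nothing
        pieces.append(text[start:m.start()])  # text before the separator
        start = m.end()                       # resume after the separator
    pieces.append(text[start:])               # trailing piece
    return pieces

separator = re.compile(r"[\s,;]+")
parts = split_with_matches(separator, "one, two;three  four")
print(parts)
```

The same loop shape (find a match, copy the text before it, continue after it) should carry over to any API that reports the position and length of each match.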

The version I release today is a bit geeky, in the sense you don't use the RE strings directly, but you have to compile them before using them.
The advantage is performance: you compile a regular expression once, then reuse it as many times as you like, and the library won't need to recompile it.
The disadvantage is that it is not very intuitive, not in the spirit of AutoHotkey.
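The compile-once, reuse-many pattern described above looks like this in Python's `re` module, used here only as an analogy for the handle returned by PCRE_RegisterRegExp. The sample lines and names are made up.

```python
import re

# Compile once, outside the loop...
version_re = re.compile(r"\b(\d+)\.(\d+)\.(\d+)\b")

lines = [
    "AutoHotkey 1.0.44 released",
    "no version on this line",
    "PCRE 6.4 has only two parts",
    "upgrade to 1.0.45 soon",
]

# ...then reuse the compiled object for every line.
versions = []
for line in lines:
    m = version_re.search(line)
    if m:
        versions.append(m.group(0))

print(versions)
```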

So I am planning to do another version more in the spirit of my signature... or of AHK.
The trade-off will be lower performance, but it probably won't be noticeable except when parsing a very big file line by line... And it will be perfect, for example, for a quick validation of a formatted edit field.
Plus, this version might serve as a prototype for a future integration of regular expressions in AHK... Note that such an implementation can be faster, perhaps by caching the expressions. If caching (hashing) is much faster than compiling, there might be an advantage. That's the way Perl manages REs too: the only expressions it does not cache are dynamic ones (ie. resulting from concatenation or variable expansion, etc.).
It should implement also friendlier options (letters instead of big constant names).

Now it is time to take a look:
PCRE_DLL.ahk
TestPCRE_DLL.ahk
PCRE-6.4.zip, (only) the DLL. You can get other compiled DLLs at GnuWin32 or at Psyon site (untested yet, may be smaller).

As you can see, the test script is becoming big, but it only touches the surface of the library, with simple expressions, no options, no offset.
So there can be bugs there. If you find any, please report them here.

An overview of the usage of the library:
stringToSearch = You can do /Regular Expressions/ in AutoHotkey too!

; Compile regular expression and get a reference to the result
hRE := PCRE_RegisterRegExp("R(A|H)(u|o)**")
; There is an error, the handle is null, we can use the provided mini-GUI
; that points out where the error is in the expression (if single line).
if (hRE = 0)
	PCRE_ShowLastError()

; Compile a correct RE
hRE := PCRE_RegisterRegExp("([A-Z])([a-z])")

; Get the position of a match on the given string.
pos := PCRE_GetMatch(hRE, stringToSearch)
; Get both position and length of the match, in a string, separated by a pipe (|)
pos@len := PCRE_GetMatch(hRE, stringToSearch, 0, #PCRE_GETLENGTH)
; Get the matched string
match := PCRE_GetMatch(hRE, stringToSearch, 0, #PCRE_GETSTRING)

; Get the first match of this RE on the given string, as a reference for use in further calls
hMatch := PCRE_Match(hRE, stringToSearch)
If (ErrorLevel = #PCRE_ERROR_NOMATCH)
{
	MsgBox No match!
	ExitApp
}
; Get how many captured strings there were in this match:
; number of matched captures, plus the implicit capture of the whole match.
n := PCRE_GetMatchedCaptureNumber(hMatch)

; Get position and length of the captures
PCRE_GetMatchVals(hMatch, 0, pos0, len0) ; Whole match
PCRE_GetMatchVals(hMatch, 1, pos1, len1) ; First capture
PCRE_GetMatchVals(hMatch, 2, pos2, len2) ; Second capture

; Get strings of captures
s0 := PCRE_GetMatchStr(hMatch, 0) ; Whole match
s1 := PCRE_GetMatchStr(hMatch, 1) ; First capture
s2 := PCRE_GetMatchStr(hMatch, 2) ; Second capture

; Find next match and update the reference
PCRE_MatchNext(hRE, hMatch)
; Similar to:
hMatch := PCRE_Match(hRE, stringToSearch, pos0 + len0)
; but the latter is less efficient, creating another reference instead of reusing it.

; Replace the whole match(es) by the given string,
; with $n replaced by the nth capture.
hRS1 := PCRE_RegisterReplaceString("$2-$1!")
; Idem with user-defined symbol.
hRS2 := PCRE_RegisterReplaceString("\_2-\_1!", "\_")
; Idem, two-parts symbol, to avoid ambiguity
hRS3 := PCRE_RegisterReplaceStringEx("${2}1-${1}0!")
; Idem, with user-defined symbols
hRS4 := PCRE_RegisterReplaceStringEx("\_2_/-\_1_/!", "\_", "_/")
; Note that unlike Perl, you cannot mix both notations. See the test file for more explanations.

; "A" is to replace all occurrences (default), can be a maximum number of replacements.
resultString := PCRE_Replace(hRE, hRS1, stringToReplace, "A")
resultString := PCRE_Replace(hRE, hRS3, stringToReplace, 1)

; This call is optional, it will unload the library (automatically loaded on first use)
; and free the data allocated by the DLL.
; If not called, Windows will free all this on script exit.
PCRE_End()
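The ambiguity that the two-part notation avoids (a capture reference immediately followed by a digit) also exists in other engines. As an analogy only, here is how Python's `re` module handles it: `\1` is the short form and `\g<1>` is the delimited form. This is not the wrapper's syntax.

```python
import re

pattern = re.compile(r"([A-Z])([a-z])")
subject = "Regular"

# Short notation: second capture, dash, first capture.
swapped = pattern.sub(r"\2-\1!", subject, count=1)

# Delimited notation: needed when a literal digit follows the reference,
# otherwise "\21" would be read as a reference to capture 21.
with_digits = pattern.sub(r"\g<2>1-\g<1>0!", subject, count=1)

print(swapped)
print(with_digits)
```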

Posted Image vPhiLho := RegExReplace("Philippe Lhoste", "^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2")

evl
  • Members
  • 1237 posts
  • Last active: Oct 20 2010 11:41 AM
  • Joined: 24 Aug 2005
I've bookmarked this page for future reference as I'm still trying to learn about regular expressions.

In case it helps anyone else, these are a couple of useful regular expression resources:

Regular Expression Laboratory (freeware):
http://www.silverage...re.com/rxl.html

Regular Expression Laboratory is a simple-to-use assistant tool to help you learn and prepare regular expressions.

Also has a quick reference to the syntax in the help file.


Quite a bit of info on syntax and examples:
http://www.regular-e...quickstart.html
http://www.regular-e.../reference.html

foom
  • Members
  • 386 posts
  • Last active: Jul 04 2007 04:53 PM
  • Joined: 19 Apr 2006
Also a nice app which helps in understanding regular expressions.
http://www.weitz.de/regex-coach/
I recommend doing the quick start tutorial to get used to the Gui.

Chris
  • Administrators
  • 10727 posts
  • Last active:
  • Joined: 02 Mar 2004
Great presentation. I used to be a RegEx novice but have learned a lot more about it in conjunction with phpBB and .htaccess modifications. Armed with this understanding and your work in this and other topics, RegEx is getting closer to being integrated with AHK.

Thanks.

BoBo
  • Guests
  • Last active:
  • Joined: --

Regular Expressions DLL for Win32 Programs
121010 Kb
1999-03-16 00:00:00

gnuregex.dll: Regular Expressions for Win32 Programs
----------------------------------------------------
If you've ever wanted to add regular expressions to
a Win32 program, here's your chance.

This DLL is under the GNU General Public License
(almost all the source for it comes from the regex
library 0.12), so if you distribute a program that
uses it, you must follow the terms detailed in
COPYING.

[Download]

No idea if this is worth a look, stumbled over it while checking for other things ... :roll:

corrupt
  • Members
  • 2558 posts
  • Last active: Nov 01 2014 03:23 PM
  • Joined: 29 Dec 2004
Thanks for the link BoBo :)

121010 Kb

typo? unpacked size 375,860 bytes, .dll size 41.5 KB

PhiLho
Thank you.

The RegEx Coach is an excellent tool, I use it to verify complex expressions.
There is also the PCRE Workbench which, as the name implies, uses the same library as I do. Its weakness is that it limits test strings to one line.
The Regular Expression Laboratory looks OK too, so is the JRegExpTester, a Java application, ie. targeted at Java syntax. It has the advantage of using the RE library of regular-expressions.info.
Of course, there are more similar tools, I might even write one in AutoHotkey... With the weakness that currently, it is hard to colorize the various parts of a string.

Note also, for those wanting to do quick tests without downloading a program, two online programs to test REs (there are others...):
- Java syntax: RegEx
- JavaScript and both PHP syntaxes: REGex TESTER. This one is impressive because it uses Ajax to transmit the strings to test to the PHP program and retrieve the results, so you never leave or refresh the page.
Well, there is also regular-expressions.info's JavaScript tester, which isn't bad either.

I hope this will whet your appetite for REs! :-)

SKAN
  • Administrators
  • 9115 posts
  • Last active:
  • Joined: 26 Dec 2005
Dear PhiLho, :)

I do not know anything about RegEx, but just heard of it when I gave a try to sed sometime ago. Since it is you, I think this is something that I should try & learn. I thank you for contributing such a nice thing to the community.

Regards, :)

olfen
  • Members
  • 115 posts
  • Last active: Dec 25 2012 09:48 AM
  • Joined: 04 Jun 2005
Thanks, PhiLho, for your superbly commented work. Very useful!

Rajat
  • Members
  • 1904 posts
  • Last active: Jul 17 2015 07:45 AM
  • Joined: 28 Mar 2004
Thanks PhiLho, firstly for the in-depth knowledge of regex that I didn't have earlier, and then for making regex easily available to us folks.


olfen
Hello PhiLho,
I just did a couple of tests. Test1 and Test3 don't work as expected. All 3 RegExes give the expected match in Regex Coach.
Am I doing something wrong?

#SingleInstance Force
#NoEnv

#Include PCRE_DLL.ahk

stringToSearch = Test123

;Test1 - No match found. Expected: "Test1" 
hRE := PCRE_RegisterRegExp("^.{5}")
if (hRe = 0)
	PCRE_ShowLastError()

; Get the position of a match on the given string.
pos := PCRE_GetMatch(hRE, stringToSearch)
res := PCRE_GetMatch(hRE, stringToSearch, 0, #PCRE_GETLENGTH)
match := PCRE_GetMatch(hRE, stringToSearch, 0, #PCRE_GETSTRING)

MsgBox,
(
hRE: %hRE% (%ErrorLevel%)
Pos: %pos%
Pos&Len: %res%
Match: %match%
)

;Test2 - Working as expected, match: "123"
hRE := PCRE_RegisterRegExp(".{3}$")
if (hRe = 0)
	PCRE_ShowLastError()

; Get the position of a match on the given string.
pos := PCRE_GetMatch(hRE, stringToSearch)
res := PCRE_GetMatch(hRE, stringToSearch, 0, #PCRE_GETLENGTH)
match := PCRE_GetMatch(hRE, stringToSearch, 0, #PCRE_GETSTRING)

MsgBox,
(
hRE: %hRE% (%ErrorLevel%)
Pos: %pos%
Pos&Len: %res%
Match: %match%
)

;Test3 - No match found. Expected: "Test123" 
hRE := PCRE_RegisterRegExp("^\D*\d*$")
if (hRe = 0)
	PCRE_ShowLastError()

; Get the position of a match on the given string.
pos := PCRE_GetMatch(hRE, stringToSearch)
res := PCRE_GetMatch(hRE, stringToSearch, 0, #PCRE_GETLENGTH)
match := PCRE_GetMatch(hRE, stringToSearch, 0, #PCRE_GETSTRING)

MsgBox,
(
hRE: %hRE% (%ErrorLevel%)
Pos: %pos%
Pos&Len: %res%
Match: %match%
)


PhiLho
It is my fault, I changed the API of PCRE_GetMatch so _startOffset defaulted to 1, ie. start of string, first char, in AutoHotkey tradition, instead of 0, in C tradition... But my test code, which you took as base, still used 0, so I gave -1 to the DLL... Garbage in, garbage out...
So, please, change the calls to:
res := PCRE_GetMatch(hRE, stringToSearch, 1, #PCRE_GETLENGTH)
match := PCRE_GetMatch(hRE, stringToSearch, 1, #PCRE_GETSTRING)
(file is updated)
Thank you for reporting the problem, and sorry for the confusion.
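The off-by-one described above comes from converting a 1-based offset to the 0-based offset the underlying engine expects. A small Python sketch of that conversion, with a guard that rejects 0 instead of silently passing -1 along. The function `get_match_pos` is hypothetical and only mirrors the idea, not the wrapper's code.

```python
import re

def get_match_pos(compiled, subject, start_offset=1):
    """Return the 1-based position of the first match, or 0 if none.

    start_offset is 1-based, so 1 means the first character.
    """
    if start_offset < 1:
        raise ValueError("start_offset is 1-based; the first character is 1")
    m = compiled.search(subject, start_offset - 1)  # engine wants 0-based
    return m.start() + 1 if m else 0                # report 1-based

subject = "Test123"
pos1 = get_match_pos(re.compile(r"^.{5}"), subject)       # Test1
pos2 = get_match_pos(re.compile(r".{3}$"), subject)       # Test2
pos3 = get_match_pos(re.compile(r"^\D*\d*$"), subject)    # Test3
print(pos1, pos2, pos3)
```

The three patterns are the ones from the failing test script. With a valid offset, all three find a match.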

olfen
Thanks for the explanation and solution, seems to work fine now.

PhiLho
OK, after some days of "rest" (doing something else), I finally finished my regular expression tutorial!

You can find it on my site.

As I explain at the start, I tried to make a tutorial with concrete, real examples, yet avoiding forward references ("This expression uses some features we will see later"...), and trying to add some levity to this otherwise rather arid subject...

If you are curious and adventurous enough to read it, don't hesitate to give me feedback.
Feedback on the content (is it clear enough, should I insist on some point, etc.) is welcome in the specific topic I created in the Utilities & Resources section.
Feedback on the form (syntax, phrasing, sentences sounding too "Frenchy", etc.) should be private (PhiLho(a)GMX.net) to avoid adding noise to these topics.

Note that the page is printer friendly: I use a specific stylesheet for the printer (if your browser is smart enough) and you should be able to print in two columns, for example (if your driver is smart enough...).

Now, I should go back to work on my EasyRegEx wrapper...

PhiLho
About my (current) signature:
Posted Image vPhiLho := RegExReplace("Philippe Lhoste", "^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2")

I explained the image in another topic, but I re-explain it here:
It is a drawing I made (much larger!) inspired by Celtic knots. I named it KnotMan. The previous link points to a larger version of the picture (500x500 pixels, 28KB) for the curious...

The text explains the origin of the nickname with a regular expression, in a pseudo-AutoHotkey expression (pseudo because REs are not built in, but I will create someday the RegExReplace function using the above library), with syntax coloring.

This signature shows three of my points of interest: drawing (and celtic knots!), regular expressions, and my work on Scintilla, the syntax highlighting editor component, and SciTE, my editor of choice using this component.

If you apply the ^(\w{3})\w*\s+\b(\w{3})\w*$ expression to my real name (Philippe Lhoste) to replace it with the given substitution string ($1$2), you will get "PhiLho", which I show as the chosen variable name.
The expression, a bit more convoluted than necessary to make it more cryptic ;-) means "match, at the start of the string, three word chars (captured) followed by any number of word chars, then at least one space (or tab, blank char), then at a start of a word (the redundant part), match and capture three word chars, then match the remainder of the word up to the end of the string".
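The substitution can be checked with any Perl-compatible engine. Here it is in Python's `re` module, where the replacement is written `\1\2` instead of `$1$2`. The second call drops the `\b` to show that it is indeed redundant for this input.

```python
import re

name = "Philippe Lhoste"

# The expression from the signature, unchanged.
nickname = re.sub(r"^(\w{3})\w*\s+\b(\w{3})\w*$", r"\1\2", name)

# Same expression without the redundant word boundary.
simpler = re.sub(r"^(\w{3})\w*\s+(\w{3})\w*$", r"\1\2", name)

print(nickname, simpler)
```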

Several people asked me, privately or not, what it means, so I thought this was an appropriate place to explain it. I hope it is clearer now. :-)