
Regular Expressions (RegEx) for AutoHotkey


112 replies to this topic

Poll: What should the names of the RegEx functions be (if you HAD to pick one of these)? (42 member(s) have cast votes)

  1. RegExMatch() and RegExReplace() (43 votes [84.31%])
  2. RegMatch() and RegReplace() (8 votes [15.69%])
thomasl
  • Members
  • 92 posts
  • Last active: Sep 28 2006 09:55 AM
  • Joined: 16 Jun 2005

I wonder: if you link to a lib that contains C-library code, does that code get omitted if the project already has the same code... hopefully so, or maybe C-library code doesn't get added into LIBs the way it does into EXEs and DLLs; instead it just makes references to the C-Library itself. That would be great because then the linker can include common code only once rather than twice.

A .lib is either a stub for the code in a DLL or a sort of glorified .obj. In other words, this is pre-linked code.

An EXE or DLL, OTOH, is post-linked: the linker puts all that is required into the EXE or DLL... but only once. This means that you will have only one copy of, say, the strlen() function in the EXE image (if the C runtime was statically linked with the EXE). It is, however, entirely possible to have another copy of strlen() in a DLL that is loaded with that EXE (once again, if the C runtime was statically linked with the DLL).

(That is one reason why the C runtime DLL can be a big space saver.)

The meaning of all that is that if PhiLho linked his DLL statically and you link AHK statically, there will be some overlap that will go away once everything is linked into one EXE.

PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005
Exactly!
I generated 4 DLLs, this time with the .def included in the projects...
Curiously, they are much bigger, even though I remembered to add the /opt:nowin98 option (perhaps obsolete in higher versions of VC++).
But they are still rather small.
I don't recall exactly now, but since I took no step to exclude it, these DLLs might include the C runtime. That works for an EXE (i.e. we specify a dependency on msvcrt.dll); I don't know for sure for DLLs.
Here are the sizes:
 60928   PCRE6ac.dll
 71680   PCRE6af.dll
103424   PCRE6wc.dll
118784   PCRE6wf.dll
Explanation: c is for compact, optimized for size; f is for fast, optimized for speed; a is for ASCII, excluding the UTF-8/UCP code & tables; w is for wide, including the latter.
This still seems manageable, and real measurements must be done on a real integration into AutoHotkey.exe...

I have to test these DLLs, to see if they are usable, and to improve a bit the How To, then I will give the link to the archive.

If it happens to use the C library's locale functions, I noticed they add a considerable amount of code size.

I could be wrong, but I would say no. We specifically generate and run a program that can depend on a locale, using isascii and similar functions, relying on the local system to tell what is ASCII, punctuation, etc. (for 8-bit ASCII, not for UTF-8, which has its own tables). This program generates a C file with tables full of this data. It is these tables that are exclusively used in the PCRE engine: it is faster, and developers can hack these tables before including them.
vPhiLho := RegExReplace("Philippe Lhoste", "^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2")
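
The table-generation idea described in this post can be sketched roughly as follows. This is only an illustration in Python, not the actual dftables.c: the flag values and the helper names build_ctype_table and emit_c_table are made up for the example, and the classification uses Python's fixed ASCII constants rather than the C locale functions.

```python
import string

# Hypothetical flag bits for this sketch (not PCRE's real ctype constants).
CT_SPACE = 0x01
CT_LETTER = 0x02
CT_DIGIT = 0x04
CT_PUNCT = 0x08


def build_ctype_table():
    """Classify each of the 256 byte values once, up front."""
    table = []
    for code in range(256):
        ch = chr(code)
        flags = 0
        if ch in string.whitespace:
            flags |= CT_SPACE
        if ch in string.ascii_letters:
            flags |= CT_LETTER
        if ch in string.digits:
            flags |= CT_DIGIT
        if ch in string.punctuation:
            flags |= CT_PUNCT
        table.append(flags)
    return table


def emit_c_table(table, name="ctype_table"):
    """Render the table as C source text that could be compiled in or hand-edited."""
    rows = []
    for start in range(0, 256, 16):
        row = ",".join("0x%02x" % value for value in table[start:start + 16])
        rows.append("  " + row)
    return "static const unsigned char %s[256] = {\n%s\n};\n" % (name, ",\n".join(rows))


table = build_ctype_table()
c_source = emit_c_table(table)
```

At match time the engine only indexes the table, so no locale function is called in the hot path, and the generated source can be tweaked by hand before it is compiled in.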

Chris
  • Administrators
  • 10727 posts
  • Last active:
  • Joined: 02 Mar 2004

A .lib is either a stub for the code in a DLL or a sort of glorified .obj. In other words, this is pre-linked code.

Good to know. Thanks.

It is, however, entirely possible to have another copy of strlen() in a DLL that is loaded with that EXE (once again, if the C runtime was statically linked with the DLL).

(That is one reason why the C runtime DLL can be a big space saver.)

Benchmarks show the C runtime DLL is quite a bit slower when used with AutoHotkey (I think by nearly 20%). Part of the reason is probably that it's multi-threaded; AutoHotkey can currently get away with using the single-threaded, static CRT because although AHK has two threads, both of them are designed to tolerate the single-threaded CRT.

However, due to the phasing out of the single-threaded CRT by Visual Studio and perhaps other compilers, AHK will probably eventually be compiled with the multi-threaded CRT (especially for porting to 64-bit).

The meaning of all that is that if PhiLho linked his DLL statically and you link AHK statically, there will be some overlap that will go away once everything is linked into one EXE.

Good. I'm looking forward to trying it.

c is for compact, optimized for size; f is for fast, optimized for speed; a is for ASCII, excluding the UTF-8/UCP code & tables; w is for wide, including the latter.
This still seems manageable, and real measurements must be done on a real integration into AutoHotkey.exe...

These are useful results. Hopefully even more ways can be found to prune the code.

[Rather than using C locale functions] This program generates a C file with tables full of this data. It is these tables that are exclusively used in the PCRE engine: it is faster, and developers can hack these tables before including them.

Thanks for your research.

PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005

These are useful results. Hopefully even more ways can be found to prune the code.

Not much, at least by removing files and using some macros. Otherwise, you need to change the source, which is risky...

I think we can get rid of the pcre_dfa_exec.c source, which implements an alternative engine... Mostly for the most geeky people... :-D

Looking around, I found that I don't need the .def: I just have to define the macros PCRE_DEFINITION and DLL_EXPORT, and it will use the right __declspec. I will try that to simplify the build process.

majkinetor
  • Moderators
  • 4512 posts
  • Last active: May 20 2019 07:41 AM
  • Joined: 24 May 2006

Looking around, I found that I don't need the .def: I just have to define the macros PCRE_DEFINITION and DLL_EXPORT, and it will use the right __declspec. I will try that to simplify the build process.

It is not enough, at least in my experience. I spent days figuring out what was wrong here while writing a TC plugin.

In VS you must also use extern "C" if you want to see the exported functions. Remember that if your compiled DLL contains no exported functions.

I really don't know why people still use .def files.

JSLover
  • Members
  • 920 posts
  • Last active: Nov 02 2012 09:54 PM
  • Joined: 20 Dec 2004
BSD vs GPL

...it's my understanding only 33 functions/commands are GPL'd AU3 code (well that is alot but) (searching has you saying 40 about 5 times, where'd I get 33?)...the rest is 100% pure Grade A Chris...right? I believe some projects are multi-licensed...for example all the code you wrote you can 're-release' under BSD & only leave GPL on the already GPL-infested code...I know the GPL is annoying in its Virus nature...but if you own 100% rights to some of the code...who is going to sue you for re-releasing your own code under BSD...as you wanted to anyway (right?)...the next step would be getting permission from the GPL-owners of those 33 functions to issue you a license under BSD for you to use them...or completely re-write those functions & say to hell with the AU3 code...now I'm not sure about the re-licensing of already GPL'd code...but you only released the 100% Chris code as GPL cuz the rest was GPL (& you didn't know about/think of multi-licensing)...I'd say try to multi-license it as half-BSD half-GPL (or 95% BSD 5% GPL)...then the new PCRE code is already BSD. Can we get any BSD C writers to re-write those GPL-infested functions?...but the way I see it, code is code...there aren't that many ways to write something...so I don't even see why most code is copyrightable...if that's the case ALL your code are belong to Dennis Ritchie (& the other guy)...who wrote C to begin with...

msgbox, hi
...there I just copyrighted that AHK code, anyone who uses it in their script has to pay me 1 million dollars...see what I mean...

GPL is good in only 1 way...it keeps stuff free...BSD is good cuz it is more liberal...but I want to write my own license with the spirit of the BSD, but with some "commercial-use" clauses...something like "if you are a mom n pop shop...you can use it...if you are Microsoft...pay me 5 billion dollars & you can use it"...for example I think I've seen some BSD-licensed projects be stolen, re-compiled by a company & sold...(with or) without any real modification of code...but the customers don't know where the code came from & happily pay for something they could get free if they googled...so both GPL & BSD fail in that regard, anybody can compile any GPL/BSD program & make money on it...in the case of GPL, the company can charge, but they need to provide source code if asked...BSD means they can charge & don't need to give out the source..."my license" would be...companies can USE it if they make less than some amount per year...or they have to pay me...& NO COMPANY can resell it, no matter how much they pay me...

AutoIt 2 / AutoIt 3 code???

...I wasn't around then, but it WAS my understanding that AutoIt 2 was GPL'd (I've found places where you said it never was, but why does everyone think it was?)...on the other hand I've also heard AutoIt 3 code was never released or isn't available on their site...so where'd you get the 33 functions code?...I've never seen that explained before...OK...now I just went to the AutoIt entry on Wikipedia & it says the code is available on their website, before when I looked it was nowhere...& I just looked again & it don't appear to REALLY be there, just some compression code...

OMFG, the AutoHotkey forum is IP.board now (yuck!)...I may not be able to continue coming here (& I love AutoHotkey)...I liked phpBB, but not this...ugh...

Note...
I may not reply to any topics (specifically ones I was previously involved in), mostly cuz I can't find the ones I replied to, to continue helping, but also just cuz I can't stand the new forum...phpBB was soo perfect. This is 100% the opposite of "perfect".

I also semi-plan to start my own, phpBB-based AutoHotkey forum (or take over the old one, if he'll let me)
PM me if you're interested in a new phpBB-based forum (I need to know if anyone would use it)
How (or why) did they create the Neil Armstrong memorial site (neilarmstronginfo.com) BEFORE he died?

PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005
If I understood correctly, AutoIt2 source was never released. AutoIt3 source was released as GPL in the early stages of the project, but closed again because they didn't like that Chris advertised the new AutoHotkey on their forum (as if there was commercial rivalry! hey, due to language differences, they don't even attract the same users). At least, that's my understanding of the grudge they hold against Chris.
So, I guess that even if we asked nicely, they would refuse to change the licence.
The solution is, as you wrote, to rewrite the GPL parts. Not so hard, probably; some parts would even gain from it (the OS detection is (was?) outdated, and could be replaced by sample code provided by Microsoft).
As you wrote, some things are just close to public domain: there aren't hundreds of ways to fill a LOGFONT structure and display a text, for example.
Now, if there is a special algorithm trying various API functions to get a job done (hard Send cases), it is another thing.

Chris
  • Administrators
  • 10727 posts
  • Last active:
  • Joined: 02 Mar 2004

If I understood correctly, AutoIt2 source was never released. AutoIt3 source was released as GPL...

That's correct. I believe the first non-beta release of AutoIt v3 was GPL, but sometime after that it was changed to be the current license.

The solution is, as you wrote, to rewrite the GPL parts.

That will probably happen someday; but it doesn't seem to be a high priority at the moment (it would probably take several weeks to develop and test properly). Although I'm somewhat turned off by GPL's viral nature, I see the reasons for it more clearly now: basically, GPL people are saying, "we want to push the world toward freedom, because we can get there faster that way (or the world can't be trusted to make the right choice)"; and BSD-like people are saying, "we want to maximize total benefit; let the world choose freedom on its own."

PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005
I didn't have time to play more with compile options and such, so I cleaned up my compile how-to and give it here for immediate consumption...
I also give my little prototype of RegExMatch as I feel it should be implemented in AutoHotkey. Of course, these are only suggestions and ideas. I didn't have time to do RegExReplace and RegExSplit. The former is more or less implemented in my wrapper; the latter has yet to be done.
Will there be a Loop RegExParse (or Loop Parse with regex option)?

How to compile PCRE on Windows

And on other systems as well, but Unices have makefiles.

Download the latest version of PCRE -- v.6.7 at time of writing, a 558KB .tar.bz2 file (also available as .tar.gz, more than 800KB).
http://www.pcre.org/
ftp://ftp.csx.cam.ac.uk/pub/software/pr ... .7.tar.bz2

Unzip the file in a directory.
You will find a NON-UNIX-USE file with some instructions for compiling on Windows.
Since it is slightly outdated (config.in is now named config.h.in), I give here new instructions.

Copy config.h.in to config.h and edit it.
As explained: "change the macros that define HAVE_STRERROR and HAVE_MEMMOVE to define them as 1 rather than 0."
I am not sure about NEWLINE, I leave it as it is.
Other default values seem OK.
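
The config.h edit above can also be scripted. The following is a minimal sketch, assuming the two macros appear in config.h.in as plain "#define NAME 0" lines; the function name enable_macros and the sample text are invented for the illustration, and the real file has more entries than shown.

```python
import re


def enable_macros(config_text, names=("HAVE_STRERROR", "HAVE_MEMMOVE")):
    """Return config_text with '#define NAME 0' changed to '#define NAME 1'."""
    for name in names:
        # Match only the listed macro names, and only when their value is exactly 0.
        pattern = r"^([ \t]*#[ \t]*define[ \t]+%s[ \t]+)0\b" % re.escape(name)
        config_text = re.sub(pattern, r"\g<1>1", config_text, flags=re.MULTILINE)
    return config_text


# Stand-in for the contents of config.h.in (assumed layout, shortened).
sample = "#define HAVE_STRERROR 0\n#define HAVE_MEMMOVE 0\n#define NEWLINE '\\n'\n"
edited = enable_macros(sample)
```

In practice one would read config.h.in, pass its text through the function and write the result to config.h; the NEWLINE entry is left untouched, as in the manual steps.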

Next step is to compile dftables.c and run dftables.exe to generate a pcre_chartables.c file, using the system (or user) locale.
On Windows, it seems to use the C locale (the default); perhaps it needs an additional explicit call to set the locale within the program.
The pcre_chartables.c file can then be manually tweaked to meet any requirement.

The remainder of the steps are (almost) trivial: just add the following files to a project.

pcre_chartables.c -- Generated
pcre_compile.c
pcre_config.c
pcre_exec.c
pcre_fullinfo.c
pcre_get.c
pcre_globals.c
pcre_info.c
pcre_maketables.c
pcre_refcount.c
pcre_study.c
pcre_tables.c
pcre_try_flipped.c
pcre_version.c
pcre_xclass.c
pcre_dfa_exec.c -- Can be omitted if this method isn't used (for specialists!)
pcre_ord2utf8.c -- Probably not needed if not using UTF-8
pcre_ucp_searchfuncs.c -- Idem
pcre_valid_utf8.c -- Idem

These are just all the pcre_xxx.c files.

To make a small project, you can omit the pcre_dfa_exec.c file, and perhaps remove the pcre_dfa_exec function declaration from pcre.h.
And if UTF-8 support isn't needed, also skip the last three files.

If you need UTF-8 support, add SUPPORT_UTF8 preprocessor definition to the C compile options.
If you need UCP support (Unicode character property: escape sequences \p{..}, \P{..}, and \X), add SUPPORT_UCP preprocessor definition (and SUPPORT_UTF8 too, of course).
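
The file and macro choices above can be summarized in a small helper. This is a sketch only: the function pcre_build_plan and its parameters are hypothetical, the file names are the ones listed above for PCRE 6.7, and treating the last three files as UTF-8-only follows the "probably not needed" remark rather than anything verified.

```python
# Files needed in every configuration, as listed above.
CORE_SOURCES = [
    "pcre_chartables.c", "pcre_compile.c", "pcre_config.c", "pcre_exec.c",
    "pcre_fullinfo.c", "pcre_get.c", "pcre_globals.c", "pcre_info.c",
    "pcre_maketables.c", "pcre_refcount.c", "pcre_study.c", "pcre_tables.c",
    "pcre_try_flipped.c", "pcre_version.c", "pcre_xclass.c",
]
DFA_SOURCES = ["pcre_dfa_exec.c"]
UTF8_SOURCES = ["pcre_ord2utf8.c", "pcre_ucp_searchfuncs.c", "pcre_valid_utf8.c"]


def pcre_build_plan(utf8=False, ucp=False, dfa=False):
    """Return (source files, preprocessor definitions) for one configuration."""
    if ucp:
        utf8 = True  # SUPPORT_UCP requires SUPPORT_UTF8 as well
    sources = list(CORE_SOURCES)
    defines = []
    if dfa:
        sources += DFA_SOURCES
    if utf8:
        sources += UTF8_SOURCES
        defines.append("SUPPORT_UTF8")
    if ucp:
        defines.append("SUPPORT_UCP")
    return sources, defines


compact_sources, compact_defines = pcre_build_plan()
full_sources, full_defines = pcre_build_plan(ucp=True, dfa=True)
```

The default call corresponds to the smallest build (no DFA engine, no UTF-8), while ucp=True pulls in SUPPORT_UTF8 automatically.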

If you want the DFA function without UTF-8/UCP, you need to edit pcre_dfa_exec.c, as I had a link error: unresolved external symbol __pcre_ucp_findprop
It seems the author forgot to protect some parts with the #ifdef SUPPORT_UCP test.
I added it between
case OP_PROP_EXTRA + OP_TYPEPLUS:
and the break of
case OP_EXTUNI_EXTRA + OP_TYPEEXACT:
hoping these opcodes are generated only in UCP mode...

If you need to make a DLL from PCRE, either copy libpcre.def to pcredll.def, edit it to remove or rename the library name (which can conflict with the project name), and add this file to the project;
or define (for Visual C++ only?) the preprocessor macros PCRE_DEFINITION and DLL_EXPORT, and it will use the right __declspec in front of the exported functions.

/*
RegEx.ahk

Wrapper routines to ease the use of the functions in PCRE3.dll.
The functions here should be easier to use than those in PCRE_DLL.ahk,
at the cost of some performance loss.
To compensate a bit that, I cache the latest compiled string,
so repetitive uses of the same expression are a bit optimized.
I cache only one string, because searching the cache for several
strings in pure AHK would be slower than compiling the expression...
And for difficult cases, there is still the first wrapper, with explicit compilation.
If (when) REs will be integrated to AutoHotkey, it will be able to manage
a bigger cache. I doubt caching more than 5 REs is necessary:
the problem arises mostly when using different regexes in a loop.

// by Philippe Lhoste <PhiLho(a)GMX.net> http://Phi.Lho.free.fr
// File/Project history:
 1.00.000 -- 2006/09/25 (PL) -- First release.
 0.01.000 -- 2006/06/23 (PL) -- Creation from PCRE_DLL.ahk.
*/
/* Copyright notice: For details, see the following file:
http://Phi.Lho.free.fr/softwares/PhiLhoSoft/PhiLhoSoftLicence.txt
This program is distributed under the zlib/libpng license.
Copyright (c) 2006 Philippe Lhoste / PhiLhoSoft
*/
#hPCREModule = 0
; Provide full path or put it in the path (or the working dir).
#PCRE_DLL = PCRE3.dll
#RegExCache_CompRERef = 0

;/* Options */

#PCRE_CASELESS        := 0x00000001
#PCRE_MULTILINE       := 0x00000002
#PCRE_DOTALL          := 0x00000004
#PCRE_EXTENDED        := 0x00000008
; Non-PCRE options
#PCRE_HIDENONSTDOPT   := 0x00FFFFFF
#PCRE_GLOBAL          := 0x01000000

;/* Request types for pcre_fullinfo() */

#PCRE_INFO_CAPTURECOUNT   :=  2

;/* Exec-time and get/set-time error codes */

#PCRE_ERROR_NOMATCH        :=  (-1)
#PCRE_ERROR_NULL           :=  (-2)
#PCRE_ERROR_BADOPTION      :=  (-3)
#PCRE_ERROR_BADMAGIC       :=  (-4)
#PCRE_ERROR_UNKNOWN_NODE   :=  (-5)
#PCRE_ERROR_NOMEMORY       :=  (-6)
#PCRE_ERROR_NOSUBSTRING    :=  (-7)
#PCRE_ERROR_MATCHLIMIT     :=  (-8)
#PCRE_ERROR_BADUTF8        := (-10)
#PCRE_ERROR_BADUTF8_OFFSET := (-11)
#PCRE_ERROR_PARTIAL        := (-12)
#PCRE_ERROR_BADPARTIAL     := (-13)
#PCRE_ERROR_INTERNAL       := (-14)
#PCRE_ERROR_BADCOUNT       := (-15)
#PCRE_ERROR_DFA_UITEM      := (-16)
#PCRE_ERROR_DFA_UCOND      := (-17)
#PCRE_ERROR_DFA_UMLIMIT    := (-18)
#PCRE_ERROR_DFA_WSSIZE     := (-19)
#PCRE_ERROR_DFA_RECURSE    := (-20)

OnExit RegEx_CleanUp
; Skip internal code and continue to auto-exec section of including code
Goto PCRE=>ContinueAutoExec


/*
// Like InStr(), returns the position of the first occurrence of the regular expression
// _regEx in the string _stringToSearch.
// Returns 0 if not found, or found position starting at 1.
// If not found, ErrorLevel can be checked to see what is the problem.
// It can contain an error code from DllCall (followed by a pipe and the name of the called function)
// or an error code from PCRE followed by a pipe and the offset of the error in the regex.
//
// Unlike InStr(), there is no reverse search.
//
// This function sets global variables:
// A_RegExPos, A_RegExLength, A_RegExString, (global match)
// A_RegExPos1, A_RegExLength1, A_RegExString1, (capture 1, etc.)
// A_RegExNextPos, A_RegExCaptureCount, A_RegExError

Note on the above global variables:
Somehow, they follow the same logic than Loop FilePattern which also creates
lot of built-in variables to avoid using extra commands to get results.
And actually, the same logic is used in Perl REs...

If that's too much, we can skip the capture variables (numbered) and add
a function to fetch the capture #n.
*/
RegExMatch(_stringToSearch, _regEx, _options="", _startingPos=1)
{
	local options
	local errorCode, errorOffset, p_errorMsg, errorMsg
	local hPCRE, captureCount
	local offsetTableSize, compRegExp, resCode, pos, len

	If (#hPCREModule = 0)
	{
		#hPCREModule := DllCall("LoadLibrary", "Str", #PCRE_DLL)
		If (#hPCREModule = 0)
		{
			MsgBox 16, RegEx, You need the %#PCRE_DLL% in your path!
			ExitApp
		}
	}
OutputDebug RegExMatch: %#hPCREModule% for %_regEx%

	options := RegEx_ParseOptions(_options)
OutputDebug Options: %_options% -> %options%

	;--- Compilation phase
	If (#RegExCache_RE = _regEx)
		; We just compiled it, skip this step
		Goto RegExMatch_MatchStep

	; Compile the RE
	hPCRE := DllCall(#PCRE_DLL "\pcre_compile2"
			, "Str", _regEx
			, "Int", options
			, "Int *", errorCode
			, "UInt *", p_errorMsg
			, "Int *", errorOffset
			, "UInt", 0
			, CDecl)
	If (ErrorLevel != 0)
	{
		ErrorLevel = %ErrorLevel%|pcre_compile2
		Return 0
	}
OutputDebug Handle: %hPCRE%

	if (hPCRE = 0)
	{
		ErrorLevel = %errorCode%|%errorOffset%
		VarSetCapacity(errorMsg, 100)
		DllCall("lstrcpy", "Str", errorMsg, "UInt", p_errorMsg)
		A_RegExError = Error compiling pattern /%_regEx%/%_options%:`n(%errorCode%) %errorMsg%
		Return 0
	}
	#RegExCache_CompRERef := hPCRE
	#RegExCache_RE := _regEx

	DllCall(#PCRE_DLL "\pcre_fullinfo"
			, "UInt", hPCRE
			, "UInt", 0
			, "UInt", #PCRE_INFO_CAPTURECOUNT
			, "UInt *", captureCount
			, CDecl)
	If (ErrorLevel != 0)
	{
		ErrorLevel = %ErrorLevel%|pcre_fullinfo
		Return 0
	}
	; This is the number of capturing parenthesis!
	; It can be different of the number of real captures when matching
	; but it is used as maximum size of capture buffer
	#RegExCache_captureCount := captureCount
OutputDebug Capture Count: %captureCount%

	;--- Matching phase
RegExMatch_MatchStep:

	offsetTableSize := 3 * (#RegExCache_captureCount + 1)
	VarSetCapacity(#PCRECache_offsetTable, offsetTableSize * 4)

	resCode := DllCall(#PCRE_DLL "\pcre_exec"
			, "UInt", #RegExCache_CompRERef
			, "UInt", 0
			, "Str", _stringToSearch
			, "Int", StrLen(_stringToSearch)
			, "Int", _startingPos - 1
			, "Int", 0	; Can be ANCHORED, NOTBOL, NOTEOL, NOTEMPTY, PARTIAL
			, "UInt", &#PCRECache_offsetTable
			, "Int", offsetTableSize
			, CDecl)
	If (ErrorLevel != 0)
	{
		ErrorLevel = %ErrorLevel%|pcre_exec
		Return 0
	}
	If (resCode < 0)
	{
		ErrorLevel = %resCode%|pcre_exec
		; pcre_exec reports only a numeric code, so there is no message string to copy here
		A_RegExError = Error matching pattern /%_regEx%/%_options%: %resCode%
		Return 0
	}
OutputDebug Exec: %resCode%

	resCode--	; It counts the global capture (whole match)
	A_RegExCaptureCount := resCode
	; Whole match
	pos := RegEx_GetOffset(#PCRECache_offsetTable, 0)
	; Given positions start at 1
	A_RegExPos := pos + 1
	A_RegExLength := RegEx_GetOffset(#PCRECache_offsetTable, 1) - pos
	StringMid A_RegExString, _stringToSearch, A_RegExPos, A_RegExLength
	A_RegExNextPos := A_RegExPos + A_RegExLength
	; Captures
	Loop %resCode%
	{
		pos := RegEx_GetOffset(#PCRECache_offsetTable, A_Index * 2)
		A_RegExPos%A_Index% := pos + 1
		len := RegEx_GetOffset(#PCRECache_offsetTable, A_Index * 2 + 1) - pos
		A_RegExLength%A_Index% := len
		pos++
		StringMid A_RegExString%A_Index%, _stringToSearch, pos, len
	}

	Return A_RegExPos
}

;===== Private section =====

RegEx_ParseOptions(_options)
{
	local options

	options := 0
	Loop Parse, _options
	{
		If (A_LoopField = "i")
			options := options | #PCRE_CASELESS
		Else If (A_LoopField = "m")
			options := options | #PCRE_MULTILINE
		Else If (A_LoopField = "d")
			options := options | #PCRE_DOTALL
		Else If (A_LoopField = "x")
			options := options | #PCRE_EXTENDED
		Else If (A_LoopField = "g")
			options := options | #PCRE_GLOBAL
	}
	Return options
}

RegEx_GetOffset(ByRef @offsetTable, _index)
{
	local addr

	addr := &@offsetTable + _index * 4

	Return *addr + (*(addr + 1) << 8) +  (*(addr + 2) << 16) + (*(addr + 3) << 24)
}

RegEx_CleanUp:
	; Remove cached compiled RE
	If (#RegExCache_CompRERef != 0)
	{
		DllCall(#PCRE_DLL "\pcre_free"
				, "UInt", #RegExCache_CompRERef)
	}

	DllCall("FreeLibrary", "UInt", #hPCREModule)
	#hPCREModule := 0
ExitApp

PCRE=>ContinueAutoExec:

#Include RegEx.ahk

Test:
variable = 9a89F87x21Beef This is a Test string containing today's Date : 14-03-2006 and Day : Tuesday, and more: 25-09-2006 or 06-10-1961 is good too.

pos := RegExMatch(variable, "([A-F\d]+)", "i")
Gosub GetResult
res = %result%
pos := RegExMatch(variable, "([A-F\d]+)", "i", A_RegExNextPos)
Gosub GetResult
res = %res%`n`n%result%
MsgBox %res%

nextPos := 1	; Start at beginning of string
Loop
{
	pos := RegExMatch(variable, "(\d+)-(\d+)-(\d+)", "", nextPos)
	If (pos = 0)
		Break
	nextPos := A_RegExNextPos
	Gosub GetResult
	res = Main match:`n%result%`n`nSub-captures:
	Loop %A_RegExCaptureCount%
	{
		pos := A_RegExPos%A_Index%
		len := A_RegExLength%A_Index%
		str := A_RegExString%A_Index%
		res = %res%`n`n
		( LTrim
			A_RegExPos: %pos%
			A_RegExLength: %len%
			A_RegExString: %str%
		)
	}
	MsgBox %res%
}
Return

GetResult:
result =
(
A_RegExPos: %A_RegExPos%
A_RegExLength: %A_RegExLength%
A_RegExString: %A_RegExString%
A_RegExNextPos: %A_RegExNextPos%
A_RegExCaptureCount: %A_RegExCaptureCount%
A_RegExError: %A_RegExError%
)
Return
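
For readers who want to see what RegEx_GetOffset and the capture loop do with the offset table, here is a rough Python sketch. It assumes the table holds 32-bit little-endian integers laid out as (start, end) pairs with 0-based offsets, pair 0 being the whole match, as the wrapper above treats it; the function names and the hand-built table are invented for the illustration and no PCRE call is made.

```python
import struct


def get_offset(offset_table, index):
    """Read the index-th little-endian 32-bit integer from a bytes buffer."""
    return struct.unpack_from("<i", offset_table, index * 4)[0]


def decode_captures(subject, offset_table, result_code):
    """Turn (start, end) pairs into 1-based (pos, length, text) tuples.

    result_code is the number of filled pairs, whole match included.
    """
    captures = []
    for n in range(result_code):
        start = get_offset(offset_table, 2 * n)
        end = get_offset(offset_table, 2 * n + 1)
        captures.append((start + 1, end - start, subject[start:end]))
    return captures


subject = "Date : 14-03-2006"
# Hand-built table for the pattern (\d+)-(\d+)-(\d+): whole match, then three captures.
raw = struct.pack("<8i", 7, 17, 7, 9, 10, 12, 13, 17)
captures = decode_captures(subject, raw, 4)
```

Reading the value as a signed integer is a deliberate difference from the byte-by-byte sum in the script: an unset capture is reported as -1, which the unsigned sum would turn into a very large number.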


Chris
  • Administrators
  • 10727 posts
  • Last active:
  • Joined: 02 Mar 2004
Adding RegEx to AutoHotkey is going well (thanks to Philip Hazel for making PCRE so easy to integrate). The code size seems even smaller than hoped for.

Based on PhiLho's posts above -- as well as those of ThomasL, Titan, and others -- I’ve come up with the following design for your consideration, along with some more questions.
FoundPos := InStrRE(Haystack, NeedleRegEx [, Matches, Options, StartingPos])
The name “InStrRE” is tentative; there are some alternatives near the bottom. And although the ordering of the parameters above is tentative, their nature seems clear:

  • FoundPos (the return value): It seems most intuitive and useful to return the position of the first full-pattern match (or 0 if no match), just like InStr(). If there's an error during RegEx parsing, the return value can be blank and ErrorLevel can be set to some descriptive code like PhiLho’s approach. The variable A_LastError is also available if another output is needed.
  • Haystack: The subject string.
  • NeedleRegEx: The pattern to search for (RegEx).
  • Matches: If present, this parameter is an output variable that will receive the first substring from haystack that matches the complete pattern. If the pattern contains any subpatterns, the substrings that match them would be stored as array elements. There could also be an option that changes what gets stored in Matches to be offsets and lengths rather than the substrings themselves.
  • Options: A list of option letters such as Case-sensitive, DotAll, etc. (see http://php.net/manua... ... ifiers.php)
  • StartingPos: The position in haystack at which to start the search (default 1).

Questions and Considerations

Where to put the options: My understanding is that PHP’s requirement for delimiters at the beginning and end of the pattern exists solely to support options inside the RegEx itself. If so, I agree that this makes RegEx seem even more complicated than it already is (as some of you said earlier). Therefore, having options as a separate parameter seems more friendly while also protecting delicate RegEx's from typos.

Default options: Should InStrRE() be case insensitive by default (like InStr) or should it follow the more conventional approach of being case sensitive? The answer to this depends on how often insensitive searches are done, and whether it’s often enough to justify departing from tradition. Also, are there any other options that should deviate from PHP's defaults?

Caching: To improve performance, there will probably be a very simple cache of the 10 to 100 most recently executed RegEx’s (i.e. in compiled form). Later on, the cache size can be increased by means of hashing or binary search.

Find-All (global mode): Should such an option be a high priority or can it be postponed? I realize the performance would be better than calling InStrRE() in a loop with varying offset, but perhaps not by much. There may also be technical differences between a loop and a built-in implementation like PHP’s, such as avoiding infinite loops caused by empty strings. As an alternative, PhiLho had suggested adding a "Loop ParseRegEx, String, Pattern" capability, which for most purposes might be superior to a built-in find-all mode (since it would avoid creating arrays).

Option Names: I'll probably wind up using the same option-letters as PHP since they seem easy to remember. But comments/alternatives are welcome.

Alternate names for InStrRE() (if there's no clear consensus, we could have a poll): InStrReg, InStrRegEx, (or one of these but with an underscore), RegMatch (like PHP), RegExMatch, RE_Match, RE_InStr, RegInStr, etc.

RegExReplace(): I haven’t put much thought into RegExReplace() yet, but any comments are welcome. Also, I think RegExSplit() should be postponed until true arrays are implemented. In fact, PhiLho's "Loop ParseRegex" idea might be superior to it anyway.
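
To make the Find-All question more concrete, here is a rough sketch of the "loop with varying offset" approach, including one way to avoid the infinite loop on empty matches. It uses Python's re module purely as a stand-in for the proposed InStrRE(); the function name find_all and the 1-based positions are assumptions made for the illustration.

```python
import re


def find_all(haystack, needle_regex, flags=0):
    """Collect (pos, text) for every match, with 1-based positions as in InStr()."""
    pattern = re.compile(needle_regex, flags)
    results = []
    start = 0
    while start <= len(haystack):
        match = pattern.search(haystack, start)
        if match is None:
            break
        results.append((match.start() + 1, match.group(0)))
        # An empty match does not advance the position, so step one character
        # forward in that case; otherwise the loop would never terminate.
        start = match.end() if match.end() > match.start() else match.end() + 1
    return results


dates = find_all("and more: 25-09-2006 or 06-10-1961", r"\d+-\d+-\d+")
empties = find_all("ab", r"x*")
```

The only subtle part is the empty-match step; everything else is the same as calling the match function repeatedly with StartingPos set to the end of the previous match.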

PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005
Thanks for the ideas. The post being complex, I will skip the quotes...
The syntax seems basically OK. I like the InStrRE name, being short yet explicit.

Matches: is it a string or a real variable name? As I understand it, it will replace my A_RegExPos, A_RegExLength, A_RegExString, which is a good idea (no additional built-in vars, flexibility).
Instead of an option, perhaps you can add a suffix and a number, i.e. if var is "capture", we get data in capturePosition (or capturePos), captureLength (or captureLen), captureString (or captureStr), and the same, numbered, for sub-captures. Otherwise, if we want both string and position for some reason, we would need to do two searches. We might then add options to select which names are generated.
Once we get true arrays (hopefully associative ones, I started to write a proposal for this), this could be more flexible.

Options & Option Names: copying the PHP options can indeed be a good choice: it will be familiar to some people (the first four are almost universal) and easy to learn for the others. As an additional advantage, it is rather compact.

Where to put the options: Yes, keep them separate! We are lucky enough to have a special escape char, so we don't need to double all backslashes as in most languages; let's not add the trouble of handling special delimiters.

Default options: Here, I vote for case-sensitive. It is less trouble for those used to REs, even if it won't follow AHK rules. Or should we have a way to set default options for next searches?

Caching: unless somebody voices a different PoV, I believe we don't need large caches, I think we rarely use more than a dozen REs in a loop.

Find-All (global mode): I am not too sure what you mean here. Anyway, indeed, a loop on InStrRE with computed StartingPos (must be FoundPos + length of found match) is enough.

Alternate names for InStrRE(): As I wrote, this name is OK. Avoid underscores, IIRC, you don't use them anywhere else.
Or reuse RegExMatch (or RegExFind), so a RegExReplace and RegExSplit will come naturally to the family. This would be consistent with your other name conventions (FileXxx, GuiXxx, etc.) which is nice in the CHM index...

RegExReplace(): Although that's a must-have, it can come later, as it is more work to implement. My implementation can be used as a prototype. I defaulted to using the traditional $, but since it was easy, I allowed any custom tag. I suggest you see my notes on the topic in my TestPCRE_DLL.ahk. There is also some test code there that can be reused.
I agree that the Split isn't easy to do, even harder to do from the "outside", that's why I skipped it.
vPhiLho := RegExReplace("Philippe Lhoste", "^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2")
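For readers who want to trace what that one-liner does, here is a small Python sketch of a replace function that expands $n back-references and also accepts a custom tag, as described above. It is not PhiLho's implementation; `regex_replace` and its `tag` parameter are illustrative names only.

```python
import re

def regex_replace(haystack, pattern, replacement, tag="$"):
    """Replace every match, expanding tag+number (e.g. $1) into the n-th sub-capture."""
    ref = re.compile(re.escape(tag) + r"(\d+)")

    def expand(match):
        def group_text(r):
            n = int(r.group(1))
            if n > match.re.groups:
                return r.group(0)        # unknown group number: leave the text as typed
            return match.group(n) or ""  # unmatched sub-capture expands to nothing
        return ref.sub(group_text, replacement)

    return re.sub(pattern, expand, haystack)

print(regex_replace("Philippe Lhoste", r"^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2"))
```

The pattern keeps the first three word characters of each of the two words and drops the rest, so the call prints PhiLho.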

JSLover
  • Members
  • 920 posts
  • Last active: Nov 02 2012 09:54 PM
  • Joined: 20 Dec 2004

Where to put the options

...I like s///g notation...or s@@@g when parsing urls...can you support both options in the regex & a separate param?...they are "regexs" & should be advanced, like regexs are.

Default options

...case sensitive...it's easy to add i to the flags, but I don't know of a flag to turn ON case sensitivity...perhaps capital I, but I dunno...default regexs are normally case sensitive.

Find-All (global mode)

...by find all do you mean the g regex flag?...yes it should be supported...somehow...

Option Names

...options, as in flags...like g in s///g should stay the same, but in the separate "options" param, couldn't you support both?...g or the word global...?...i/I or the word case0/case1 for insensitive/sensitive.

Alternate names for InStrRE()

RegExMatch sounds good, preg_match for the perl among us (me)...or just match for the JavaScript in me...if we can write our own wrapper functions for the new regex stuff, then I'd say make it any descriptive name & we can wrapper our own name, but InStrRE rubs me the wrong way...maybe InStrRegEx...tho...?

RegExReplace()

...what are you supporting in regex?...when I think regex I think FULL regex...what will/won't we be able to do?...can you support ahk_regex in all string params?

polyethene
  • Members
  • 5519 posts
  • Last active: May 17 2015 06:39 AM
  • Joined: 26 Oct 2012
Sorry if I missed something but what about backreferences? Will you output the traditional $1 .. $9/$n variables?


foom
  • Members
  • 386 posts
  • Last active: Jul 04 2007 04:53 PM
  • Joined: 19 Apr 2006

Matches: If present, this parameter is an output variable that will receive the first substring from haystack that matches the complete pattern. If the pattern contains any subpatterns, the substrings that match them would be stored as array elements. There could also be an option that changes what gets stored in Matches to be offsets and lengths rather than the substrings themselves.

First subpattern = Match1, nth subpattern = Matchn, total number of matches = Match0, i.e. create an AHK array like in StringSplit. Maybe also whole regexp = Match ($& in Perl).
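A quick Python sketch of that layout, using a dictionary to imitate the StringSplit-style pseudo-array. The helper name `match_array` is made up, and reading "total number of matches" as the number of subpatterns is an assumption.

```python
import re

def match_array(haystack, pattern, name="Match"):
    """Build Match, Match0, Match1..Matchn in the style of StringSplit's output."""
    m = re.search(pattern, haystack)
    if m is None:
        return {name + "0": 0}  # nothing found: only the count, set to zero
    out = {
        name: m.group(0),        # whole match, like $& in Perl
        name + "0": m.re.groups, # assumed meaning: number of subpatterns
    }
    for i in range(1, m.re.groups + 1):
        out[name + str(i)] = m.group(i) or ""  # unmatched subpattern becomes empty
    return out

print(match_array("key=value", r"(\w+)=(\w+)"))
```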

Where to put the options: My understanding is that PHP’s requirement for delimiters at the beginning and end of the pattern exists solely to support options inside the RegEx itself. If so, I agree that this makes RegEx seem even more complicated than it already is (as some of you said earlier). Therefore, having options as a separate parameter seems more friendly while also protecting delicate RegEx's from typos.

Omitting //gmsxi will not magically make newbies understand regular expressions better, nor will readability be drastically improved. It will make simple RegExps look clearer, e.g. "\bsomeword[0-9]+\b".
But in case of "/(\+|\-|\*|\/|!|~|&|\||\^|(:|\-|\+|<|>|!)?=)/gi" it's six of one and half a dozen of another. And RegExp's can get very complicated very quickly, meaning such simple RegExp's like the first example will be rare.
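The two styles are not far apart in practice. As a sketch, this Python snippet peels the options off a delimited expression and then compiles the pattern with the options passed separately. The helper names are hypothetical, only the i, m, s and x letters are mapped, and g is skipped because it describes how the caller loops rather than how the pattern compiles.

```python
import re

# Option letters that correspond directly to compile-time flags in Python's re.
FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL, "x": re.VERBOSE}

def split_delimited(expr):
    """Split '/pattern/opts' (any delimiter character) into (pattern, opts)."""
    delim = expr[0]
    end = expr.rfind(delim)  # options are letters only, so the last delimiter closes the pattern
    if end == 0:
        raise ValueError("missing closing delimiter")
    return expr[1:end], expr[end + 1:]

def compile_with_options(pattern, options=""):
    """Compile a bare pattern with options given as a separate string."""
    flags = 0
    for letter in options:
        if letter == "g":
            continue  # global mode is handled by the caller, not the compiler
        flags |= FLAG_MAP[letter]  # an unknown letter raises KeyError
    return re.compile(pattern, flags)

pattern, options = split_delimited(r"/(\+|\-|\*|\/|!|~|&|\||\^|(:|\-|\+|<|>|!)?=)/gi")
print(options)
print(compile_with_options(r"\bsomeword[0-9]+\b", "i").search("SomeWord42 here").group(0))
```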

Default options: Should InStrRE() be case insensitive by default (like InStr) or should it follow the more conventional approach of being case sensitive? The answer to this depends on how often insensitive searches are done, and whether it’s often enough to justify departing from tradition. Also, are there any other options that should deviate from PHP's defaults?
Option Names: I'll probably wind up using the same option-letters as PHP since they seem easy to remember. But comments/alternatives are welcome.

Keep things as they are.

Find-All (global mode): Should such an option be a high priority or can it be postponed? I realize the performance would be better than calling InStrRE() in a loop with varying offset, but perhaps not by much. There may also be technical differences between a loop and a built-in implementation like PHP’s, such as avoiding infinite loops caused by empty strings. As an alternative, PhiLho had suggested adding a "Loop ParseRegEx, String, Pattern" capability, which for most purposes might be superior to a built-in find-all mode (since it would avoid creating arrays).
RegExReplace(): I haven’t put much thought into RegExReplace() yet, but any comments are welcome. Also, I think RegExSplit() should be postponed until true arrays are implemented. In fact, PhiLho's "Loop ParseRegex" idea might be superior to it anyway.

As PhiLho said, replace() is a must. And with it, the g modifier is a must.

John B.
  • Guests
  • Last active:
  • Joined: --
This seems to be very promising :-).

In my own work, making substitutions using regular expressions in hundreds of HTML files, I may have over a hundred substitutions to test/perform on a file. I don't know if this would be a consideration for caching (mentioned by PhiLho). In general, there are three ways that I use regular expressions now (using Grep, Sed, and TextPad):
* Look for a pattern in each line, and if it occurs, make a substitution.
* Test for one pattern in a line and, if it occurs, make a (possibly unrelated) substitution on that line ("addressing").
* Merge two or more lines together (or possibly all the lines in the file) and search for a multiline pattern. If the pattern occurs, make a substitution.
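The three usage patterns can be sketched as follows in Python. The function names are made up for illustration, and compiling each pattern once before the loop is just one simple way to address the caching concern when a hundred substitutions run against every line.

```python
import re

def substitute_lines(text, rules):
    """Apply every (pattern, replacement) rule to each line in turn."""
    compiled = [(re.compile(p), r) for p, r in rules]  # compile once, reuse on every line
    out = []
    for line in text.split("\n"):
        for regex, repl in compiled:
            line = regex.sub(repl, line)
        out.append(line)
    return "\n".join(out)

def substitute_addressed(text, address, pattern, replacement):
    """sed-style addressing: substitute only on lines that match the address pattern."""
    addr, regex = re.compile(address), re.compile(pattern)
    return "\n".join(
        regex.sub(replacement, line) if addr.search(line) else line
        for line in text.split("\n")
    )

def substitute_multiline(text, pattern, replacement):
    """Treat the whole text as one string so a pattern can span line breaks."""
    return re.sub(pattern, replacement, text, flags=re.DOTALL)

print(substitute_addressed("<h1>Old</h1>\n<p>Old</p>", r"^<h1>", r"Old", "New"))
```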

I agree with JSLover that the simplest, most familiar syntax for substitutions is s///, with the option of using some other delimiter instead of "/". In this, I'm drawing on my experience with UNIX and UNIX utilities (not Perl). I also agree with the rest of his points.

One issue that will have to be addressed is what standard you use for escaped characters such as newlines and tabs. I was surprised to discover that AutoHotkey uses `n and `t instead of the familiar \n and \t. If you use the AutoHotkey escape sequences, it will be confusing to anyone who already knows regular expressions. If you use the standard regular expression escape sequences, it will be confusing to anyone who already knows AutoHotkey. Of course, I vote for using the standard regular expression escape sequences (backslashes). Perhaps there could be a flag set at the beginning of a script to define the regular expression escape sequence. I don't know if #EscapeChar will do the job, or if there needs to be a separate flag just for regular expressions.
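The two conventions need not conflict, because a regex engine can interpret backslash sequences on its own. This Python sketch shows the general idea only and makes no claim about how AutoHotkey will behave: one pattern hands the engine the two characters backslash and n, the other hands it a real newline that the host language has already produced.

```python
import re

haystack = "first\tcol\nsecond"

# Raw string: the pattern text holds backslash-t and backslash-n as typed,
# and the regex engine itself turns them into TAB and LF.
by_engine = re.compile(r"\tcol\n")

# Ordinary string: the host language has already produced TAB and LF
# before the regex engine ever sees the pattern.
by_host = re.compile("\tcol\n")

print(len(r"\n"), len("\n"))
print(by_engine.search(haystack) is not None, by_host.search(haystack) is not None)
```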

If I understand the Split issue correctly, would you be able to duplicate that functionality using a regexp substitution, by inserting a newline character as part of the replace expression?
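That idea does work as long as the marker character cannot already occur in the text. A small Python sketch, with `split_via_replace` as a made-up name:

```python
import re

def split_via_replace(haystack, pattern, marker="\n"):
    """Emulate a regex split: replace each match with a marker, then split on the marker."""
    if marker in haystack:
        # The trick breaks down if the marker is already present in the input.
        raise ValueError("marker already occurs in the text")
    # A function is used as the replacement so the marker is inserted literally.
    return re.sub(pattern, lambda m: marker, haystack).split(marker)

print(split_via_replace("a, b;c", r"\s*[,;]\s*"))
```

The caveat is the reason a dedicated split (or a parsing loop) is still attractive: for arbitrary input there may be no character that is guaranteed to be absent.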

Thanks
John B.