Sorry for the late reply. Here are my comments.
Matches: is it a string or a real variable name? ...
Instead of an option, perhaps you can add a suffix and a number, i.e. if var is "capture", we get data in capturePosition (or capturePos), captureLength (or captureLen), captureString (or captureStr), and the same, numbered, for sub-captures. Otherwise, if we want, for some reason, both string and pos, we would need to do two searches. Then we might add options to select which names are generated.
Yes, I thought I might stick to the way PHP does it: have a separate option that says, "I want the positions instead of (or in addition to) the substrings themselves." When that option is in effect, different things would be stored in the array (or two arrays would be created).
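As a rough illustration of the "positions in addition to substrings" idea, here is a minimal sketch using Python's re module rather than AutoHotkey or PCRE. The function name match_with_positions and the keys "str", "pos" and "len" are hypothetical, loosely mirroring the captureStr/capturePos/captureLen suffixes suggested above. The point is only that a single search can yield both the substrings and their offsets.

```python
import re

def match_with_positions(pattern, text):
    """Return string, position and length for the whole match and each subpattern.

    Hypothetical helper: element 0 is the whole match, element n is subpattern n.
    Positions are 0-based offsets into text; a subpattern that did not
    participate in the match gets str None, pos -1 and len 0.
    """
    m = re.search(pattern, text)
    if m is None:
        return None
    result = []
    for n in range(m.re.groups + 1):
        result.append({
            "str": m.group(n),
            "pos": m.start(n),
            "len": m.end(n) - m.start(n),
        })
    return result

info = match_with_positions(r"(\d+)-(\d+)", "range 10-25 ok")
print(info[0])  # whole match
print(info[1])  # first subpattern
```

Storing everything from one search avoids the two-search problem raised above; an option would then only control which of the three pieces are kept.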
I vote for case-sensitive [as the default].
From what others have said, that seems to be the consensus. Thanks.
Or should we have a way to set default options for next searches?
It might be best to avoid that because it hurts script maintainability and portability (e.g. it makes copy & paste of script sections more error-prone if you forget what options were in effect).
Omitting //gmsxi will not magically make newbies understand regular expressions better, nor will readability be drastically improved. It will make simple RegExp's such as "\bsomeword[0-9]+\b" look clearer.
But in case of "/(\+|\-|\*|\/|!|~|&|\||\^|(:|\-|\+|<|>|!)?=)/gi" it's six of one and half a dozen of the other. And RegExp's can get very complicated very quickly, meaning such simple RegExp's like the first example will be rare.
I agree with JSLover that the simplest, most familiar syntax for substitutions is s///, with the option of using some other delimiter instead of "/". In this, I'm drawing on my experience with UNIX and UNIX utilities (not Perl).
...I like s///g notation...or s@@@g when parsing urls...can you support both options in the regex & a separate param?...they are "regexs" & should be advanced, like regexs are.
That's a good point, but I think I'd prefer to implement only one approach, at least initially. For one thing, it makes the documentation a lot simpler. As someone who has learned a lot about RegEx's in the past year, I can tell you that PHP's requirement for delimiters at the beginning and end of RegEx strings was a source of considerable confusion for me (perhaps because it is poorly documented at php.net).
[s///] is OK if the language accepts this syntax from the start, but it is too much trouble to add it afterwards. A lot of languages dropped this syntax. I kind of like it (when we have a choice of delimiter), but I prefer to skip it in AHK.
I tend to agree, at least for the initial release. Extensions can be added later; so the important thing is to get it as close to "best" as we can on the first release.
And with [Replace()] the g modifier is a must.
...by find-all do you mean the g regex flag?...yes it should be supported...somehow...
It will definitely be supported by Replace(), but perhaps not initially by Match (InStrRE).
couldn't you support both?...g or the word global...?...i/I or the word case0/case1 for insensitive/sensitive.
That would boost readability and is a good thing to consider. Thanks.
It's a constant battle between readability and brevity. For functions, I tend to prefer brevity because they're more often used by advanced users who prefer shorter names. We could have a quick poll to decide.
can you support ahk_regex in all string params?
For now it's just Match() and Replace(); but eventually in the windowing commands and perhaps in a split/loop-parsing capability.
Sorry if I missed something but what about backreferences? Will you output the traditional $1 .. $9/$n variables?
Yes, as foom confirmed, the substring that matches each subpattern (backreference) would be stored in an array element.
One issue that will have to be addressed is what standard you use for escaped characters such as new-line and tabs. I was surprised to discover that AutoHotkey uses `n and `t instead of the familiar \n and \t. If you use the AutoHotkey escape sequence, it will be confusing to anyone who already knows regular expressions. If you use standard regular expressions escape sequence, it will be confusing to anyone who already knows AutoHotkey.
That's a good point. Assuming PCRE expects linefeeds, tabs, and other special characters to be sent in raw, I think we should stick with the AutoHotkey way of escaping because it will reduce code complexity and increase performance. This is because by then, AutoHotkey has already resolved `n to be a literal linefeed, `t to be a literal tab, etc.
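To make the "sent in raw" assumption concrete, here is a small sketch that uses Python's re module (not PCRE, and not AutoHotkey). It compares a pattern that already contains a real linefeed with one that contains the two-character sequence backslash-n. In Python both match the same text in normal mode, which is the behaviour the paragraph above assumes for PCRE. The caveat about verbose (x) mode is also Python's behaviour; whether PCRE's x option treats raw linefeeds the same way would need checking.

```python
import re

text = "first\nsecond"

raw_pattern = "first\nsecond"       # the host language already resolved \n to a real linefeed
escaped_pattern = r"first\nsecond"  # the regex engine itself interprets backslash-n

raw_hit = re.search(raw_pattern, text) is not None
escaped_hit = re.search(escaped_pattern, text) is not None

# Caveat in Python's verbose (x) mode: raw whitespace inside the pattern is
# ignored, so only the escaped form still matches a linefeed.
raw_hit_verbose = re.search(raw_pattern, text, re.VERBOSE) is not None
escaped_hit_verbose = re.search(escaped_pattern, text, re.VERBOSE) is not None

print(raw_hit, escaped_hit, raw_hit_verbose, escaped_hit_verbose)
```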
If I understand the Split issue correctly, would you be able to duplicate that functionality using a regexp substitution, by inserting a newline character as part of the replace expression?
Possibly. In any case, I'm pretty sure that Split's functionality can be easily achieved in most cases without a built-in Split function (though eventually there will probably be one).
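The substitution-based approach to splitting can be sketched as follows, again in Python's re module rather than AutoHotkey. The helper name split_via_replace is hypothetical, and the sketch assumes the marker character (a linefeed by default) does not already occur in the input.

```python
import re

def split_via_replace(text, delimiter_pattern, marker="\n"):
    """Emulate a regex split: replace each delimiter with a marker, then cut on it.

    Assumes the marker does not already occur in text.
    """
    marked = re.sub(delimiter_pattern, marker, text)
    return marked.split(marker)

fields = split_via_replace("red, green;blue ,  yellow", r"\s*[,;]\s*")
print(fields)
```

In a script, the final cut on the marker would be done by whatever plain-string parsing is already available, so only the replace step needs regex support.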
RegExReplace(): My implementation can be used as a prototype. ...I suggest you see my notes on the topic in my TestPCRE_DLL.ahk. There is also some test code there that can be reused.
Replace feature: facts and ideas
Great stuff! When the time comes, I might have a few questions for you about these.
A quick check of the Regular Expression Pocket Reference from O'Reilly indicates the following:
The $n notation is used in Perl, Java, .NET, C#
For simplicity, I'm leaning toward supporting $ only (no backslash) for backreferences. PhiLho gave a lot of great references about how other languages do this, which should help in choosing a good method.
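A "$ only" replacement syntax could behave roughly like the sketch below. It is written in Python, whose re module natively uses backslash-style backreferences, so the helper dollar_replace is a hypothetical translation layer and not any existing API. Two choices in it are assumptions of the sketch, not decisions from this thread: "$$" produces a literal dollar sign, and only single digits $0 to $9 are recognized.

```python
import re

# A dollar sign followed by another dollar sign or by one ASCII digit.
_TOKEN = re.compile(r"\$(\$|[0-9])")

def dollar_replace(pattern, replacement, text):
    """Replace every match of pattern, expanding $0..$9 in replacement.

    Hypothetical helper. Backslashes in replacement are kept literally,
    "$$" yields one "$", and a reference to a missing or unmatched
    subpattern expands to the empty string.
    """
    def expand(m):
        def token(t):
            if t.group(1) == "$":
                return "$"
            n = int(t.group(1))
            if n > m.re.groups:
                return ""
            return m.group(n) or ""
        return _TOKEN.sub(token, replacement)
    return re.sub(pattern, expand, text)

print(dollar_replace(r"(\w+)@(\w+)", "$2 at $1", "user@host"))
```

Because the backslash has no special meaning in the replacement string here, it does not collide with either style of escape sequence discussed earlier.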
I agree that this would be way cool, and could save a lot of time. Writing an RE to validate input may be good for the soul, but can be difficult to do in practice. It could also be a big help when transforming the input as PhiLho explained.
... a powerful and cool feature in the replace function: we can provide a function as replace string.
If this refers to PCRE's callout/callback feature, I agree it would be useful. Certainly not in the initial release, but perhaps down the road.
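For readers unfamiliar with the idea, this is what "a function as replace string" looks like in a language that already has it. The sketch uses Python's re.sub, which accepts a callable in place of the replacement text; it is not PCRE's callout mechanism and says nothing about how AutoHotkey would implement it. The function name to_fahrenheit and the sample text are made up for illustration.

```python
import re

def to_fahrenheit(m):
    """Replacement callback: receives the match object, returns the replacement text."""
    celsius = float(m.group(1))
    return f"{celsius * 9 / 5 + 32:.0f}F"

converted = re.sub(r"(-?\d+(?:\.\d+)?)C\b", to_fahrenheit, "Low 0C, high 25C")
print(converted)
```

The callback runs once per match, so the replacement can depend on a computation over the matched text, something a fixed string with backreferences cannot express.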
I think it is about time AHK had an & operator for functions. This will allow us to use functions in reg exp replace, to use subclassing for advanced automation, and many other things that are currently not possible. This single update will open an entire world of options.
You probably guessed that it would be non-trivial to implement. However, your proposal of using the address operator with an AHK function and have the callback automatically set up properly inside AHK is the ultimate in elegance and simplicity. Hopefully it will be feasible to implement someday.
[CRLF] is important in our Windows universe...
More letters to add to the available options, unless Chris allows some +0x00008000 form (but this one is important enough to have its shortcut).
I assume you mean that there should be an option to switch between CRLF mode and LF mode. I'm assuming there's no easy way to auto-detect or auto-adapt, so the question becomes: should LF or CRLF be the default. AutoHotkey uses plain LF a lot in its internal strings, and encourages scripts to do the same. But Windows itself uses physical CRLF more often (such as in text files), and so does FileRead by default (for performance reasons).
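The practical difference between the two modes can be seen in a small experiment. The sketch below uses Python's re module, where a multiline $ matches only just before a linefeed; it illustrates the problem and is not a statement about PCRE's own newline options. The second pattern is just one possible way to emulate a CRLF mode.

```python
import re

crlf_text = "alpha\r\nbeta\r\ngamma"

# LF-style line ends: $ matches just before "\n", so the "\r" stays in each captured line.
lf_mode = re.findall(r"^(.*)$", crlf_text, re.MULTILINE)

# Emulated CRLF mode: allow an optional "\r" before the line end, outside the capture.
crlf_mode = re.findall(r"^(.*?)\r?$", crlf_text, re.MULTILINE)

print(lf_mode)
print(crlf_mode)
```

With an LF default, text read straight from a Windows file would carry stray carriage returns into captures unless the script normalizes line endings first or the pattern accounts for them.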
Thanks for all the comments. More are welcome.