
AutoHotkey_L v1.1.05 beta: faster RegEx and more


37 replies to this topic
Lexikos
  • Administrators
  • 9844 posts
  • AutoHotkey Foundation
  • Last active:
  • Joined: 17 Oct 2006
As some of you may be aware, RegExMatch and RegExReplace are slower in Unicode versions of AutoHotkey_L due to the need to convert strings and offsets between the format used by the script (UTF-16) and the format used by PCRE (UTF-8). See my post here if you want a detailed explanation.

For this beta release, I have made wide-sweeping changes to PCRE so that it can accept UTF-16 strings in Unicode builds and ANSI strings in ANSI builds. This should improve performance considerably for Unicode builds. However, there's an unfortunately real possibility that I've broken something along the way, so it could use some real-world testing.

Testing is needed for the Unicode builds with ASCII, non-ASCII and mixed text; and to a lesser degree for the ANSI build. Feel free to also run benchmarks if you have the inclination.

If you think you've found a bug, please confirm the buggy behaviour does not appear in v1.1.04.01 before reporting it.

PCRE has also been updated from 8.10 to 8.13. See changelog.txt for details. It's mostly bug fixes and other changes which won't affect the majority of AutoHotkey users.

------

In addition to these changes to PCRE, two new v2-related features have been added. If you have any comments regarding v2, please post them in the v2 thread and not here. The new features are:

  • CoordMode now has a "Client" mode, which makes the coordinates relative to the top-left of the window's client area. The client area excludes the window's title bar and borders; the Gui commands already used client coordinates. For v1.x, "Relative" will remain the default, but "Client" will become the default in v2. "Window" has been added as a synonym for "Relative" to reduce ambiguity. The string "Relative" won't be recognized in v2.

  • If the capital letter "O" is present in the regex options (e.g. "O)pattern"), RegExMatch stores a match object in its output var instead of creating a pseudo-array. Similarly, regex callout functions receive a match object instead of a pseudo-array. This object has the following properties:

Match.Pos(): Same as the return value of RegExMatch.
Match.Len(): Length of overall match.
Match.Value(): Value of overall match.
Match.Count(): Number of captured sub-patterns.
Match[N]: Value of named or numbered sub-pattern. Note that each named sub-pattern also has a number. Number 0 is the overall match. () can be used instead of []. If N is "Pos", "Len" or "Value" AND there is no sub-pattern with that name, the appropriate property is returned. For instance Match["Pos"] and Match.Pos are usually equivalent to Match.Pos().
Match.Pos[N]: Position of sub-pattern.
Match.Len[N]: Length of sub-pattern.
Match.Name[N]: Name of sub-pattern, if it has one.
Match.XYZ: Equivalent to Match["XYZ"].
This will be made the default behaviour in v2, and the O option will be removed.

Additionally, #include has been changed to show a more useful error message if it fails. See this thread for details.
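As a rough illustration of the match object described above, here is a minimal sketch. It assumes AutoHotkey_L v1.1.05 or later; the haystack, pattern and variable names are made up for the example and do not come from the original post.

```autohotkey
; Sketch only: assumes AutoHotkey_L v1.1.05 or later.
; Haystack, the pattern and Info are illustrative names, not from the post.
Haystack := "abc123xyz"

; The "O)" option asks RegExMatch to store a match object in Match.
FoundPos := RegExMatch(Haystack, "O)(?<digits>\d+)(?<rest>[a-z]+)", Match)

if (FoundPos)
{
    ; Overall match: position, length and text.
    Info := "Pos: " Match.Pos() "  Len: " Match.Len() "  Value: " Match.Value()
    ; Sub-patterns can be addressed by number or by name.
    Info .= "`nSub-pattern 1: " Match[1] " (name: " Match.Name(1) ")"
    Info .= "`ndigits starts at " Match.Pos("digits") ", length " Match.Len("digits")
    Info .= "`nSub-pattern count: " Match.Count()
    MsgBox % Info
}
```

Without the "O)" prefix the same call would fill the pseudo-array variables (Match, Match1, Matchdigits and so on) instead.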

Further updates (beta 3):
  • Super-globals: When used outside of any function, global varname makes varname available by default in every function except where overridden by a parameter or local declaration. This even affects variable references in functions defined prior to the global declaration, but performance may be slightly lower in those cases (by about the same as a ByRef parameter).

  • class classname automatically makes classname available in functions by default, using the super-global mechanism.

Downloads (exe only):
Unicode 32-bit
ANSI 32-bit
Unicode 64-bit
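
The super-global behaviour listed under "Further updates (beta 3)" might look like this in practice. This is only a sketch, assuming v1.1.05.00-beta3 or later; AppName and the function names are hypothetical.

```autohotkey
; Sketch only: assumes AutoHotkey_L v1.1.05.00-beta3 or later.
; AppName, ShowGlobal and ShowParam are hypothetical names.

; Declared outside any function, so AppName becomes super-global.
global AppName
AppName := "Demo"

ShowGlobal()
ShowParam("Override")

ShowGlobal() {
    ; No "global AppName" declaration is needed inside the function.
    MsgBox % AppName
}

ShowParam(AppName) {
    ; Here the parameter overrides the super-global of the same name.
    MsgBox % AppName
}
```

Before this change, ShowGlobal would have needed its own global declaration to see the script-level variable.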


fincs
  • Moderators
  • 1662 posts
  • Last active:
  • Joined: 05 May 2007
:shock:

I'm testing this ASAP. Keep up the hard work! :)

guest3456
  • Members
  • 1704 posts
  • Last active: Nov 19 2015 11:58 AM
  • Joined: 10 Mar 2011
i'm curious why you chose UTF-16 instead of UTF-8 for AHKL

jethrow
  • Moderators
  • 2854 posts
  • Last active: May 17 2017 01:57 AM
  • Joined: 24 May 2009
Thank you for the updates :)

Feel free to also run benchmarks if you have the inclination.

I ran the following on 64-bit 1.1.05.00-beta & 1.1.04.00 simultaneously. I figured One Hundred Million loop iterations should cover most needs:
SetBatchLines, -1
Unicode := "šđčćžΔπραβγθω"
ANSI := "abcdefghijklm"

output := "Loop`tUnicode`t`tANSI`r`n---------------------------------"
Loop, 10 {
  QPX(1)
  Loop, 100000000
    RegExMatch(Unicode, "Δ|ω", U)
  R1 := QPX(0)
  QPX(1)
  Loop, 100000000
    RegExMatch(ANSI, "f|m", A)
  R2 := QPX(0)
  output .= "`r`n" A_Index "`t" R1 "`t" R2
}
FileAppend, %output%, %A_AHKversion% RegExMatch Results.txt
; 1.1.05.00-beta1 RegExMatch Results.txt

Loop	Unicode		ANSI
---------------------------------
1		90.615568	89.645735
2		89.719277	89.496797
3		89.766385	89.188127
4		90.757741	90.223635
5		89.373248	90.082754
6		89.788517	90.504121
7		89.851908	89.949456
8		90.032936	89.894111
9		89.889044	88.996904
10		89.478873	88.624140
; 1.1.04.00 RegExMatch Results.txt

Loop	Unicode		ANSI
---------------------------------
1		249.841795	189.305598
2		249.799293	189.980255
3		249.280614	189.480450
4		249.384238	189.079531
5		250.617913	189.736234
6		250.928567	189.727809
7		250.943764	189.745889
8		250.888681	189.704941
9		250.983004	189.726732
10		251.664441	190.867723
... results are in seconds. Then, for reference, I ran the same code for AHK Basic, but changed the first RegExMatch to duplicate the second ANSI one:
; 1.0.48.05 RegExMatch Results.txt

Loop	ANSI		ANSI
---------------------------------
1		78.011639	78.172204
2		77.967096	78.189184
3		77.983800	78.213207
4		77.959545	78.219603
5		77.992335	78.149763
6		77.996710	78.232177
7		77.951734	78.258539
8		77.984742	78.262056
9		77.950316	78.598541
10		78.006811	78.095530

Also, I'm running the BBCode Parser source code (which uses a fair amount of RegEx) - using 1.1.05.00-beta1 - on this post to test for RegEx Errors.

EDIT - I don't see any RegEx Errors :)

Lexikos
  • Administrators
  • 9844 posts
  • AutoHotkey Foundation
  • Last active:
  • Joined: 17 Oct 2006
v1.1.05.00-beta2:
  • Fixed thread not exiting in some common cases when an assignment fails.
  • Fixed #MaxMem being applied even if the variable was already large enough.
  • Fixed COM errors not throwing exceptions when wrapped in a TRY block.
  • Fixed heap corruption in PCRE caused by insufficient allocation when named sub-patterns are present.

See first post for updated downloads.

i'm curious why you chose UTF-16 instead of UTF-8 for AHKL

Windows uses UTF-16. Aside from that, Unicode support was added by jackieku via AutoHotkeyU.

jethrow
  • Moderators
  • 2854 posts
  • Last active: May 17 2017 01:57 AM
  • Joined: 24 May 2009

v1.1.05.00-beta2:
  • Fixed COM errors not throwing exceptions when wrapped in a TRY block.

Excellent :D . Also, I was pleasantly surprised that try/catch will pick up DllCall errors - specifically with Com vTable calls. This is a nice alternative to checking the HResult each time. Thanks fincs & Lexikos!

One thought: why is a parameter required for throw? What if you don't care about the message and just want to exit the try-block? I realize you can simply pass a blank string, though.

Lexikos
  • Administrators
  • 9844 posts
  • AutoHotkey Foundation
  • Last active:
  • Joined: 17 Oct 2006
v1.1.05.00-beta3:
  • Back-ported super-global support from v2.
  • Fixed super-global declarations having no effect if the variable already existed.
  • Added a hack to allow referencing super-globals prior to declaration.
  • Changed class definitions to create super-global variables.

See my first post for a better explanation.

Also, I was pleasantly surprised that try/catch will pick up DllCall errors - specifically with Com vTable calls. This is a nice alternative to checking the HResult each time.

Not really. If DllCall failed and it's not in a try block, you don't get a HRESULT, just an empty string. If DllCall succeeded, you still need to check the HRESULT. Wrapping a DllCall and having the wrapper throw an exception could simplify things - which was the purpose of try/catch - but it doesn't seem like that's what you were talking about.
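
One way the wrapper idea mentioned above could look is sketched below. CheckHR is a hypothetical helper, not part of AutoHotkey, and the sketch assumes the DllCall uses the default "Int" return type, so a failing HRESULT comes back as a negative number.

```autohotkey
; Sketch only: CheckHR is a hypothetical helper, not a built-in function.
; Assumes DllCall's default "Int" return type, so failure HRESULTs are negative.
CheckHR(hr, what = "DllCall") {
    ; Outside a try block, a failed DllCall returns an empty string.
    if (hr = "")
        throw Exception(what " could not be called", -1, "ErrorLevel " ErrorLevel)
    ; The call itself ran, but the function reported a failure HRESULT.
    if (hr < 0)
        throw Exception(what " failed", -1, "HRESULT " hr)
    return hr
}

; Example use: an invalid CLSID string should make CLSIDFromString fail.
VarSetCapacity(clsid, 16, 0)
try
{
    CheckHR(DllCall("ole32\CLSIDFromString", "WStr", "not a clsid", "Ptr", &clsid), "CLSIDFromString")
}
catch e
{
    MsgBox % e.Message "`n" e.Extra
}
```

With a wrapper like this, both kinds of failure (the call not happening at all, and the call returning an error code) end up in the same catch block.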

One thought: why is a parameter required for throw?

Good question. Catch doesn't require a parameter (and try doesn't require a catch), so throw probably shouldn't require a parameter. What say we have it throw an Exception() object?

fincs
  • Moderators
  • 1662 posts
  • Last active:
  • Joined: 05 May 2007
Bug: this works:
pcre_callout = RegExCallout
test = This is \*some test *which might or \*might not\* parse correctly*.

RegExMatch(test, "\*.+?\*(?C)")

RegExCallout(m)
{
	msgbox % m
	return 0
}
while this doesn't:
pcre_callout = RegExCallout
test = This is \*some test *which might or \*might not\* parse correctly*.

RegExMatch(test, "O)\*.+?\*(?C)")

RegExCallout(m)
{
	msgbox % m.Value()
	return 0
}


Tuncay
  • Members
  • 1945 posts
  • Last active: Feb 08 2015 03:49 PM
  • Joined: 07 Nov 2006
Lexikos, thanks for the update with super globals for classes! That was one minor thing, which bugged me since I learned about it.

I am building a small unit-test environment for RegEx (for consistent testing with each new version of AHK), but I don't know what to test here. I also still have to learn about UTF-encoded files in order to write the tests correctly. It runs all test units with their associated test files against a list of AutoHotkey executables (currently only 32-bit on my test system). I will upload it later, once I have integrated some test cases. I just need pointing in the right direction (for the regex and haystack, and probably for the encoding…).

No signature.


Lexikos
  • Administrators
  • 9844 posts
  • AutoHotkey Foundation
  • Last active:
  • Joined: 17 Oct 2006
fincs: The values returned by the object are based solely on the offset vector. Since the overall pattern hasn't been matched yet, the corresponding slots in the offset vector are set to "no match". I'll have to change it to derive the value from the start_match and current_position fields of the pcre_callout_block structure, as before. (Btw, the offset vector and those other fields are accessible via A_EventInfo, as documented.)

Edit: When I tried to remove the FoundPos parameter (for v2), I realized that Match.Pos was always 0 since the match wasn't complete yet; but it didn't occur to me that Match.Value would be blank as a result. Once this is fixed, the FoundPos parameter will be redundant. I think it should be removed in v2; what do you think about omitting it in v1 when O) is present?
--- end edit ---

Tuncay: You could test the individual pattern elements mentioned in the help file or in pcre.txt and random combinations of these patterns, or samples from real-world scripts. Haystack can be anything, but particularly non-ASCII text needs testing. Anything you come up with is probably going to be different to the basic test cases I already have (and different is better in this case).

Tuncay
  • Members
  • 1945 posts
  • Last active: Feb 08 2015 03:49 PM
  • Joined: 07 Nov 2006
Here is my current test environment for this. Currently I have only 3 ANSI test cases for one HTML file (taken from the AHK documentation). Later, if I have time, I will add UTF test files and cases too, and it will grow. Included are 8 AutoHotkey executables. The archive is in 7z format, because it compresses very well (5.44 MB > 1.4 MB) and it's a free and well-known format today.

In case someone is interested in it, I'll share:
https://ahknet.autoh...hkRegexTests.7z

Usage: "test_all.ahk" runs every "test_unit_*.ahk" file from the subfolder "tests" against all executables found in the subfolder "interpreter". A summary is displayed after the job is done. Alternatively, you can double-click any "test_unit_*.ahk" file manually to execute it against the currently installed version. If no error is found, nothing is reported.

Psshh: It can be used to test any other library.

No signature.


  • Guests
  • Last active:
  • Joined: --
It must be so hard to simply provide test results like jethrow did.

fincs
  • Moderators
  • 1662 posts
  • Last active:
  • Joined: 05 May 2007

I'll have to change it to derive the value from the start_match and current_position fields of the pcre_callout_block structure, as before.

:)

By the way, it would be nice to support this:
pcre_callout := Func("SomeFunc")


Tuncay <mobile>
  • Guests
  • Last active:
  • Joined: --
Are you trolling again? If you knew why such tests are done, you would not write such an ironic thing, Mr. Unnamed Guest. jethrow provided benchmarks. I am trying to provide many different test scenarios to check that it's correctly implemented. I bet you are the guest who again and again shows the weak sides of AHK. From now on I'll try to ignore you, because you are just trolling around. See my sig for more info...

jethrow
  • Moderators
  • 2854 posts
  • Last active: May 17 2017 01:57 AM
  • Joined: 24 May 2009

v1.1.05.00-beta3:
  • Back-ported super-global support from v2.

Wow - quick turnaround :) . Now to nitpick ... is there any way to create a Super-global variable while inside a function?

What say we have it throw an Exception() object?

Sounds good.

It must be so hard to simply provide test results like jethrow did.

Actually, with the computer fan working in the background, it was challenging to get a nap in ... but I'm an overcomer.