Complex Regular Expressions fail when input exceeds a certain length.

Report problems with documented functionality
iseahound
Posts: 307
Joined: 13 Aug 2016, 21:04
GitHub: iseahound

Complex Regular Expressions fail when input exceeds a certain length.

Post by iseahound » 20 Apr 2018, 11:14

Code: Select all

#a::
if (clipboard ~= "([A-Za-z0-9+/=])*")
   MsgBox success
else
   MsgBox either this is not base 64 or the input is too long.
return
Steps:
1) Highlight the below text, and control c to clipboard.
2) press windows a to check if it is base64
3) it fails

try again with a smaller portion of the below text. It will succeed.

Code: Select all

/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAFaAWIDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD3SiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKiurhLS0muZAxSGNpGCjJIAyce9S1DdxzTWc8VtP9nneNljm2B/LYjhtp4ODzik720GrX1OW8PeNW19Fu44NNFj5ZllMGpCa4t0wSpliCDaTjGAxwfWrK/EDwy8PmpqEjJ5H2lSLSY74u7KNnzAfxYztwc4waox+AhJrJ1W8uNPW7EM0YbTtP+zeY0q4Z5fnYyHuBkck0lr8P/s32H/iZ7vsmjPpX/Hvjdu/5afe4/wB39aG30X9a/wDA+8S8/wCtv+CS6542S0sNQm0xrOdodJ/tK3LNIS6k4BK7QNv/AAPPsOtXYfFtpG90uoTW0C2tpBcysrOWHmDgbdmOSMAKzE+g4zkN8ON1k9t/auN2hrpG77P0wc+Z97/x39afqnw5g1ZdSWfUGAvLS2gTEIPlvCSVY5OGBJ5XjjvTej0/r4v+AJX/AK+X/BNoeMdCNncXLXciLbzJbyxSW0qTCRsbV8oqHJORjC8/ga1bK9i1C2FxCk6oSRie3khbj/ZcBv0rkn8B+bo13p0ieHhFclCyQ6KYo/l3c4SYNv5GGDDGOBzW34W0A+GtCj003kl0EZmVmztQE8IgZmIUDAALH60+4a6G1RRRSGFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFAHL6l4hH2z7Ml29pGZmhjeCLzZJCuA7Y2sEjVjtZmGAcZIBG6j/wAJU2l6nJbz3sl5FHgzCeHy2RMgNJGyqqyKu4bgM44yQeDTeMw+ILuwv9dvdJgWSSZZYFgWPEjs6lmljbg5Kg8DchHJYVR1myiklj07SvEl9rEl6GgbebZ4lLAjGYo1O4Dc554CnI+YV87L6/7d1lNcqduW/norcvVefW90dsI07crXTf8ApnqFYGq6vci6ls7FJT5RVZJIVVn3kbtq7vlXC4Ys3HzADJPG/Xnni2NkvdRsrnT4L63v3juPKmLBWjVURjwrFijIrEAZwwxk4B9LM6lanQvR3ul6L5tfmvVbnPRgpTszSutX1TSLncJbnUYghZkljjAbGThWQAq2B/ECp6ZBIrrbe4iu7WK5gcPDMgkjYfxKRkH8q8olv3k1T+0f7Nt5LnyfI8+Pz1kwWGyICSFC25zxjp3xxn07R7N9O0SwspGDPbW0cLEdCVUA/wAq5snqYlxlDEO9ra6PXrqm/I2xNOEVFx6lPWNWmtphaWiFpjH5kjKu5kBbaoC9CWO7BJ2jaS3A5znutVtVjnTUJpXwCbe7SLaf9kmNQQfcEj2NReJYlTWCt1GslnqNoLUhlypZS5KH/eVzx/smuce8ska3WGxEcGnIyiWaBoxbxhcYQsOc4A+XjAOT0B8vNMZjIYtwpyaS200en6vQ3wuGhOCkz0bStQj1XTIL2NSglX5kbqjA4ZT7ggj8Kg1jVf7OWKNDEJZQ7b5mwkSKMtI3TIGVGMj7w5AyRF4WtZ7Tw7bJcoY5pGknZCOU8x2fafcbsfhWd4vhxdaVeygm2ilCPgdGMkTpn/ZLRBT/ALwr6LEVKkcM5rSVvW3fTrY4YxTqcqK9zq7W1ml+nidDA7YjebyDC55+XIUE9D0bPBroNF1UatYtK0YimikaGaINu2MPQ4GQQQw9mFeeW7fZ7yKdYtS3Dy1NzKUJk2+RywB3fP8AZ1DZA4d87e3WeArI2nh5nG7ybiUPBu6mNY0jVj/vCPd9CK8zKK9Sc5RdRzWu621019N0dOJo8kE2rM1tc1iLRNMa7kAbBwqkkdAWY8AnAUMxwM4U45rlp/EN2kcrS6jdwXSFs24+yeWNvmbjuyyqq+TNkNICPLIPOM7fjCLOlQ3LxrJb2s3mXCMu4GIo6MSPQB8n2BrzzUNfsLXxjY+FzpXmi8tXkNw0qlcETFgyshZy2ZMkuM+Y2c5ObzLEVI1VBXSS5tGloviW3YnD0VNX87ffsekeHNebWYrqK4hEN7aSGKeMdMgkZHX+JXXqeVOCRgm/quox6TpdxfSI0giX5Y0BLSOThUUDkszEKAOpIrnfBVufP1K9QEW8hSBGYkmRkLl2JPJJaQgk8kq1aviqznvfDtxHalxc
joefiesta
Posts: 154
Joined: 24 Jan 2016, 13:54

Re: Complex Regular Expressions fail when input exceeds a certain length.

Post by joefiesta » 20 Apr 2018, 11:24

So, ~= is documented as a shortcut for RegExMatch. That's all the doc says. How then is one supposed to translate

clipboard ~= "([A-Za-z0-9+/=])*"

into

FoundPos := RegExMatch(Haystack, NeedleRegEx [, OutputVar, StartingPosition := 1])

???
iseahound
Posts: 307
Joined: 13 Aug 2016, 21:04
GitHub: iseahound

Re: Complex Regular Expressions fail when input exceeds a certain length.

Post by iseahound » 20 Apr 2018, 11:42

you don't. (x ~= "x") returns true/false. (If var x matches string "x"). Of course I could be wrong, as I was about the bug report. It's a case of un-optimized regex.

Original

Code: Select all

^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$
New

Code: Select all

^(?:[A-Za-z0-9+/]{4})*?(?:[A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$
I added ?: after the left parenthesis to denote the non-capturing group, but even then it would still crash (although it would take up to 10x the input). Changing the asterisk in (?:[A-Za-z0-9+/]{4})* to (?:[A-Za-z0-9+/]{4})*? changed the algorithm from greedy to lazy, preventing backtracking. Honestly, this stuff is pretty arcane. I think a warning that Regular Expressions can crash without error is needed in the documentation however.
swagfag
Posts: 1211
Joined: 11 Jan 2017, 17:59

Re: Complex Regular Expressions fail when input exceeds a certain length.

Post by swagfag » 20 Apr 2018, 14:12

the regex doesnt crash without error, check the error level: -21 "recursion too deep"

as per the docs: A negative number, which means an error occurred during the execution of the regular expression. Although such errors are rare, the ones most likely to occur are "too many possible empty-string matches" (-22), "recursion too deep" (-21), and "reached match limit" (-8). If these happen, try to redesign the pattern to be more restrictive, such as replacing each * with a ?, +, or a limit like {0,3} wherever feasible.
https://autohotkey.com/docs/commands/Re ... ErrorLevel

your sample regex

Code: Select all

needle := "([A-Za-z0-9+/=])*"
for the provided 3k char long string, executes in, well, 3k steps. i couldnt find out whether there's a limit on the amount of capturing groups that you could make. there is a limit on recursion depth:

Code: Select all

/* The above limit applies to all calls of match(), whether or not they
   increase the recursion depth. In some environments it is desirable to limit
   the depth of recursive calls of match() more strictly, in order to restrict
   the maximum amount of stack (or heap, if NO_RECURSE is defined) that is
   used. The value of MATCH_LIMIT_RECURSION applies only to recursive calls of
   match(). To have any useful effect, it must be less than the value of
   MATCH_LIMIT. The default is to use the same value as MATCH_LIMIT. There is
   a runtime method for setting a different limit. On systems that support it,
   "configure" can be used to override the default. */
/* AutoHotkey: Original value was MATCH_LIMIT (i.e. 10000000).  It was lowered to 6000 so that
the program's current stack limit of 4 MB won't be exceeded. The limit was computed from the
following info in PCRE docs somewhere: "As a very rough rule of thumb, you should reckon on
about 500 bytes per recursion. Thus, if you want to limit your stack usage to 8Mb, you should
set the limit at 16000 recursions. A 64Mb stack, on the other hand, can support around
128000 recursions."
*/
#ifndef MATCH_LIMIT_RECURSION
#define MATCH_LIMIT_RECURSION 6000
#endif
...
md->match_limit_recursion = MATCH_LIMIT_RECURSION;
...
if (rdepth >= md->match_limit_recursion) RRETURN(PCRE_ERROR_RECURSIONLIMIT);
that said, i havent looked into how it works under the hood, but i suspect the problem to be linked to the ()* construct. what is the regex supposed to do exactly? if u put the star inside the capturing group, the regex executes in 2 steps
iseahound
Posts: 307
Joined: 13 Aug 2016, 21:04
GitHub: iseahound

Re: Complex Regular Expressions fail when input exceeds a certain length.

Post by iseahound » 20 Apr 2018, 14:39

The main problem was that it was greedy, using ()*? Fixed it. It's supposed to match in groups of 4 ([a-z]{4})*?. It matches base64 data, which can get extremely large, on the order of several megabytes of string data. This is image data converted into text.
guest3456
Posts: 2419
Joined: 09 Oct 2013, 10:31

Re: Complex Regular Expressions fail when input exceeds a certain length.

Post by guest3456 » 20 Apr 2018, 18:19

good stuff swagfag

alessiomarkz
Posts: 1
Joined: 23 Apr 2018, 05:41
Contact:

Re: Complex Regular Expressions fail when input exceeds a certain length.

Post by alessiomarkz » 23 Apr 2018, 08:41

This is image data converted into text.
iseahound
Posts: 307
Joined: 13 Aug 2016, 21:04
GitHub: iseahound

Re: Complex Regular Expressions fail when input exceeds a certain length.

Post by iseahound » 23 Apr 2018, 17:08

Actually I did some further testing:

Problematic:
https://regex101.com/r/fnDsNx/1

"Fixed" Version
https://regex101.com/r/0ocazu/1

The amount of time it takes to compute the regex increases by a factor of 3 in the fixed version, yet it doesn't crash in AHK. However, the original RegEx is more efficient but crashes, despite it being better written! I think this is actually a bug.

If you click on Regex debugger on the left hand side, you can clearly see that the problematic RegEx works in O(n) while the fixed RegEx works in O(4n). I can tentatively conclude this is an AHK implementation problem, since the Perl/Python engines work fine.
Last edited by iseahound on 24 Apr 2018, 04:53, edited 1 time in total.
iseahound
Posts: 307
Joined: 13 Aug 2016, 21:04
GitHub: iseahound

Re: Complex Regular Expressions fail when input exceeds a certain length.

Post by iseahound » 23 Apr 2018, 17:34

Fastest AHK Implementation, O(n) using zero-width assertions. https://regex101.com/r/XfKTM9/1

Code: Select all

^(?:[A-Za-z0-9+\/]{4})*?$(?<=[A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}==)$
EDIT: Actually this crashes 50% of the time. only the slow but fixed version works.
User avatar
nnnik
Posts: 3330
Joined: 30 Sep 2013, 01:01
Location: Germany

Re: Complex Regular Expressions fail when input exceeds a certain length.

Post by nnnik » 24 Apr 2018, 01:16

It seems to me like you are making this more complex than it actually is.
You are using the RegEx to acertain that:

The string is in 4 byte groups of characters
There are only characters that are allowed to go inside the groups are inside the group
At the end the = sign is used to fill up the last group

For the first job I'd use modulo since all the characters across the groups are the same and there is no specific structure inside them:

Code: Select all

if ( mod(strLen( data ), 4 )  ;if the strLen of the data is not deviseable by 4
For the second requirement I would use RegEx - since all the characters across the groups are the same and there is no specific structure inside them:

Code: Select all

RegExMatch( data,  "^[A-Za-z0-9+\/]+
Lastly the last group can be filled up with = meaning that at the end of the string there are 0 to 3 = but never more since then the whole group would get removed:

Code: Select all

RegExMatch( data,  "^[A-Za-z0-9+\/]+={0,3}$")
Since you want speed we can put it in an atomic subgroup since there is not a single point where we backtrace:

Code: Select all

RegExMatch( data,  "(?>^[A-Za-z0-9+\/]+={0,3}$)")
Putting it together with what we had previously we get:

Code: Select all

if ( mod(strLen( data ), 4 ) && RegExMatch( data,  "(?>^[A-Za-z0-9+\/]+={0,3}$)" ) )
This is probably the fastest approach for what you were doing.
You can toss the complexity notation in the trash bin for comparing our 2 versions though.
Recommends AHK Studio
iseahound
Posts: 307
Joined: 13 Aug 2016, 21:04
GitHub: iseahound

Re: Complex Regular Expressions fail when input exceeds a certain length.

Post by iseahound » 24 Apr 2018, 04:52

You realize that the complexity notation is for RegEx analysis, not for real world performance? In all my benchmarks, both codes have excelled at an exceedingly fast 0ms. What I want to know is why ([a-z]{4})* will crash yet ([a-z]{4})*? will not. Also base64 code with 3 trailing equal signs is invalid. Also the fastest method is to separate the data into two parts, then run a simple [a-z\dA-Z+/]*, followed by ^([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$

The main frustration is writing efficient RegEx that crashes in AutoHotkey, but not in any other language.

EDIT: I can disable backtracking by making it possessive, and it will still crash. ([a-z]{4})*+ Nevermind, possessive modifiers work fine and even fix my original RegEx. Nope, just tested a possessive modifier with working code, and it crashes. ^(?:[A-Za-z0-9+\/]{4})*+$(?<=[A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}==)$
just me
Posts: 5510
Joined: 02 Oct 2013, 08:51
Location: Germany

Re: Complex Regular Expressions fail when input exceeds a certain length.

Post by just me » 24 Apr 2018, 09:18

Hi nnnik,

thanks for the hint about "atomic groups". I didn't know that. They extremely increase performance in case of mismatches at the end of the haystack.
User avatar
nnnik
Posts: 3330
Joined: 30 Sep 2013, 01:01
Location: Germany

Re: Complex Regular Expressions fail when input exceeds a certain length.

Post by nnnik » 24 Apr 2018, 09:30

iseahound have you tried running the regexmatch repeadetly?

@just me:
I learned them from regEx101.com one time I went ahead and learned all of the regex tags they have their - it was worth it.
Recommends AHK Studio
User avatar
jeeswg
Posts: 5134
Joined: 19 Dec 2016, 01:58
Location: UK

Re: Complex Regular Expressions fail when input exceeds a certain length.

Post by jeeswg » 24 Apr 2018, 11:13

@nnnik: Any other good finds?

[click Show Sidebar, View Regex Quick Reference]
Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript
https://regex101.com/
User avatar
nnnik
Posts: 3330
Joined: 30 Sep 2013, 01:01
Location: Germany

Re: Complex Regular Expressions fail when input exceeds a certain length.

Post by nnnik » 24 Apr 2018, 15:45

Yeah the stuff for recursion seemed pretty powerful.
Other than that I can't quite remember everything.
I still do reccomend reading it - it was a nice experience
Recommends AHK Studio
User avatar
jeeswg
Posts: 5134
Joined: 19 Dec 2016, 01:58
Location: UK

Re: Complex Regular Expressions fail when input exceeds a certain length.

Post by jeeswg » 24 Apr 2018, 15:54

- Thanks nnnik.
- I've got my tutorial here, and I'm always on the lookout for additional RegEx techniques not mentioned in the AHK documentation.
jeeswg's RegEx tutorial (RegExMatch, RegExReplace) - AutoHotkey Community
https://autohotkey.com/boards/viewtopic.php?f=7&t=28031
- The thing that would be really useful is if you are or anyone else is able to come up with a really good example/examples of where certain bits of syntax are useful, especially those not mentioned in the AHK documentation.
- Someone posted here re. atomic groups, I sense that they are useful, but I haven't seen a really good/classic example yet.
RegEx Atomic Group / Possesive Quantifier - AutoHotkey Community
https://autohotkey.com/boards/viewtopic.php?f=5&t=47267
iseahound
Posts: 307
Joined: 13 Aug 2016, 21:04
GitHub: iseahound

Re: Complex Regular Expressions fail when input exceeds a certain length.

Post by iseahound » 24 Apr 2018, 17:00

Sure. Here's an example of an atomic group that matches the word "coccyx" if and only if it is not enclosed in parenthesis. Due to restraints, parenthesis are only supported up to two nested layers, ((coccyx)) and not (((coccyx))).
https://regex101.com/r/N5LJUr/1

However, if you remove the atomic group ?> and make it into a non-capturing group ?:, it will fail if no parenthesis are present in the test string. https://regex101.com/r/k59yLV/1

EDIT: Here's an example from production used in my Subtitle class https://regex101.com/r/6m8kis/1
Post Reply

Return to “Bug Reports”