Complex Regular Expressions fail when input exceeds a certain length.

Report problems with documented functionality
iseahound
Posts: 231
Joined: 13 Aug 2016, 21:04
GitHub: iseahound

Complex Regular Expressions fail when input exceeds a certain length.

20 Apr 2018, 11:14

Code: [Select all] [Download] GeSHi © Codebox Plus

#a::
if (clipboard ~= "([A-Za-z0-9+/=])*")
MsgBox success
else
MsgBox either this is not base 64 or the input is too long.
return


Steps:
1) Highlight the below text, and control c to clipboard.
2) press windows a to check if it is base64
3) it fails

try again with a smaller portion of the below text. It will succeed.

Code: [Select all] [Download] GeSHi © Codebox Plus

/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAFaAWIDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD3SiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKiurhLS0muZAxSGNpGCjJIAyce9S1DdxzTWc8VtP9nneNljm2B/LYjhtp4ODzik720GrX1OW8PeNW19Fu44NNFj5ZllMGpCa4t0wSpliCDaTjGAxwfWrK/EDwy8PmpqEjJ5H2lSLSY74u7KNnzAfxYztwc4waox+AhJrJ1W8uNPW7EM0YbTtP+zeY0q4Z5fnYyHuBkck0lr8P/s32H/iZ7vsmjPpX/Hvjdu/5afe4/wB39aG30X9a/wDA+8S8/wCtv+CS6542S0sNQm0xrOdodJ/tK3LNIS6k4BK7QNv/AAPPsOtXYfFtpG90uoTW0C2tpBcysrOWHmDgbdmOSMAKzE+g4zkN8ON1k9t/auN2hrpG77P0wc+Z97/x39afqnw5g1ZdSWfUGAvLS2gTEIPlvCSVY5OGBJ5XjjvTej0/r4v+AJX/AK+X/BNoeMdCNncXLXciLbzJbyxSW0qTCRsbV8oqHJORjC8/ga1bK9i1C2FxCk6oSRie3khbj/ZcBv0rkn8B+bo13p0ieHhFclCyQ6KYo/l3c4SYNv5GGDDGOBzW34W0A+GtCj003kl0EZmVmztQE8IgZmIUDAALH60+4a6G1RRRSGFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFAHL6l4hH2z7Ml29pGZmhjeCLzZJCuA7Y2sEjVjtZmGAcZIBG6j/wAJU2l6nJbz3sl5FHgzCeHy2RMgNJGyqqyKu4bgM44yQeDTeMw+ILuwv9dvdJgWSSZZYFgWPEjs6lmljbg5Kg8DchHJYVR1myiklj07SvEl9rEl6GgbebZ4lLAjGYo1O4Dc554CnI+YV87L6/7d1lNcqduW/norcvVefW90dsI07crXTf8ApnqFYGq6vci6ls7FJT5RVZJIVVn3kbtq7vlXC4Ys3HzADJPG/Xnni2NkvdRsrnT4L63v3juPKmLBWjVURjwrFijIrEAZwwxk4B9LM6lanQvR3ul6L5tfmvVbnPRgpTszSutX1TSLncJbnUYghZkljjAbGThWQAq2B/ECp6ZBIrrbe4iu7WK5gcPDMgkjYfxKRkH8q8olv3k1T+0f7Nt5LnyfI8+Pz1kwWGyICSFC25zxjp3xxn07R7N9O0SwspGDPbW0cLEdCVUA/wAq5snqYlxlDEO9ra6PXrqm/I2xNOEVFx6lPWNWmtphaWiFpjH5kjKu5kBbaoC9CWO7BJ2jaS3A5znutVtVjnTUJpXwCbe7SLaf9kmNQQfcEj2NReJYlTWCt1GslnqNoLUhlypZS5KH/eVzx/smuce8ska3WGxEcGnIyiWaBoxbxhcYQsOc4A+XjAOT0B8vNMZjIYtwpyaS200en6vQ3wuGhOCkz0bStQj1XTIL2NSglX5kbqjA4ZT7ggj8Kg1jVf7OWKNDEJZQ7b5mwkSKMtI3TIGVGMj7w5AyRF4WtZ7Tw7bJcoY5pGknZCOU8x2fafcbsfhWd4vhxdaVeygm2ilCPgdGMkTpn/ZLRBT/ALwr6LEVKkcM5rSVvW3fTrY4YxTqcqK9zq7W1ml+nidDA7YjebyDC55+XIUE9D0bPBroNF1UatYtK0YimikaGaINu2MPQ4GQQQw9mFeeW7fZ7yKdYtS3Dy1NzKUJk2+RywB3fP8AZ1DZA4d87e3WeArI2nh5nG7ybiUPBu6mNY0jVj/vCPd9CK8zKK9Sc5RdRzWu621019N0dOJo8kE2rM1tc1iLRNMa7kAbBwqkkdAWY8AnAUMxwM4U45rlp/EN2kcrS6jdwXSFs24+yeWNvmbjuyyqq+TNkNICPLIPOM7fjCLOlQ3LxrJb2s3mXCMu4GIo6MSPQB8n2BrzzUNfsLXxjY+FzpXmi8tXkNw0qlcETFgyshZy2ZMkuM+Y2c5ObzLEVI1VBXSS5tGloviW3YnD0VNX87ffsekeHNebWYrqK4hEN7aSGKeMdMgkZHX+JXXqeVOCRgm/quox6TpdxfSI0giX5Y0BLSOThUUDkszEKAOpIrnfBVufP1K9QEW8hSBGYkmRkLl2JPJJaQgk8kq1aviqznvfDtxHalxc
joefiesta
Posts: 141
Joined: 24 Jan 2016, 13:54

Re: Complex Regular Expressions fail when input exceeds a certain length.

20 Apr 2018, 11:24

So, ~= is documented as a shortcut for RegExMatch. That's all the doc says. How then is one supposed to translate

clipboard ~= "([A-Za-z0-9+/=])*"

into

FoundPos := RegExMatch(Haystack, NeedleRegEx [, OutputVar, StartingPosition := 1])

???
iseahound
Posts: 231
Joined: 13 Aug 2016, 21:04
GitHub: iseahound

Re: Complex Regular Expressions fail when input exceeds a certain length.

20 Apr 2018, 11:42

you don't. (x ~= "x") returns true/false. (If var x matches string "x"). Of course I could be wrong, as I was about the bug report. It's a case of un-optimized regex.

Original

Code: [Select all] [Download] GeSHi © Codebox Plus

^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$


New

Code: [Select all] [Download] GeSHi © Codebox Plus

^(?:[A-Za-z0-9+/]{4})*?(?:[A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$


I added ?: after the left parenthesis to denote the non-capturing group, but even then it would still crash (although it would take up to 10x the input). Changing the asterisk in (?:[A-Za-z0-9+/]{4})* to (?:[A-Za-z0-9+/]{4})*? changed the algorithm from greedy to lazy, preventing backtracking. Honestly, this stuff is pretty arcane. I think a warning that Regular Expressions can crash without error is needed in the documentation however.
swagfag
Posts: 710
Joined: 11 Jan 2017, 17:59

Re: Complex Regular Expressions fail when input exceeds a certain length.

20 Apr 2018, 14:12

the regex doesnt crash without error, check the error level: -21 "recursion too deep"

as per the docs: A negative number, which means an error occurred during the execution of the regular expression. Although such errors are rare, the ones most likely to occur are "too many possible empty-string matches" (-22), "recursion too deep" (-21), and "reached match limit" (-8). If these happen, try to redesign the pattern to be more restrictive, such as replacing each * with a ?, +, or a limit like {0,3} wherever feasible.
https://autohotkey.com/docs/commands/Re ... ErrorLevel

your sample regex

Code: [Select all] [Download] GeSHi © Codebox Plus

needle := "([A-Za-z0-9+/=])*"

for the provided 3k char long string, executes in, well, 3k steps. i couldnt find out whether there's a limit on the amount of capturing groups that you could make. there is a limit on recursion depth:

Code: [Select all] [Expand] [Download] GeSHi © Codebox Plus



that said, i havent looked into how it works under the hood, but i suspect the problem to be linked to the ()* construct. what is the regex supposed to do exactly? if u put the star inside the capturing group, the regex executes in 2 steps
iseahound
Posts: 231
Joined: 13 Aug 2016, 21:04
GitHub: iseahound

Re: Complex Regular Expressions fail when input exceeds a certain length.

20 Apr 2018, 14:39

The main problem was that it was greedy, using ()*? Fixed it. It's supposed to match in groups of 4 ([a-z]{4})*?. It matches base64 data, which can get extremely large, on the order of several megabytes of string data. This is image data converted into text.
iseahound
Posts: 231
Joined: 13 Aug 2016, 21:04
GitHub: iseahound

Re: Complex Regular Expressions fail when input exceeds a certain length.

23 Apr 2018, 17:08

Actually I did some further testing:

Problematic:
https://regex101.com/r/fnDsNx/1

"Fixed" Version
https://regex101.com/r/0ocazu/1

The amount of time it takes to compute the regex increases by a factor of 3 in the fixed version, yet it doesn't crash in AHK. However, the original RegEx is more efficient but crashes, despite it being better written! I think this is actually a bug.

If you click on Regex debugger on the left hand side, you can clearly see that the problematic RegEx works in O(n) while the fixed RegEx works in O(4n). I can tentatively conclude this is an AHK implementation problem, since the Perl/Python engines work fine.
Last edited by iseahound on 24 Apr 2018, 04:53, edited 1 time in total.
iseahound
Posts: 231
Joined: 13 Aug 2016, 21:04
GitHub: iseahound

Re: Complex Regular Expressions fail when input exceeds a certain length.

23 Apr 2018, 17:34

Fastest AHK Implementation, O(n) using zero-width assertions. https://regex101.com/r/XfKTM9/1

Code: [Select all] [Download] GeSHi © Codebox Plus

^(?:[A-Za-z0-9+\/]{4})*?$(?<=[A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}==)$


EDIT: Actually this crashes 50% of the time. only the slow but fixed version works.
User avatar
nnnik
Posts: 2899
Joined: 30 Sep 2013, 01:01
Location: Germany

Re: Complex Regular Expressions fail when input exceeds a certain length.

24 Apr 2018, 01:16

It seems to me like you are making this more complex than it actually is.
You are using the RegEx to acertain that:

The string is in 4 byte groups of characters
There are only characters that are allowed to go inside the groups are inside the group
At the end the = sign is used to fill up the last group

For the first job I'd use modulo since all the characters across the groups are the same and there is no specific structure inside them:

Code: [Select all] [Download] GeSHi © Codebox Plus

if ( mod(strLen( data ), 4 )  ;if the strLen of the data is not deviseable by 4


For the second requirement I would use RegEx - since all the characters across the groups are the same and there is no specific structure inside them:

Code: [Select all] [Download] GeSHi © Codebox Plus

RegExMatch( data,  "^[A-Za-z0-9+\/]+


Lastly the last group can be filled up with = meaning that at the end of the string there are 0 to 3 = but never more since then the whole group would get removed:

Code: [Select all] [Download] GeSHi © Codebox Plus

RegExMatch( data,  "^[A-Za-z0-9+\/]+={0,3}$")


Since you want speed we can put it in an atomic subgroup since there is not a single point where we backtrace:

Code: [Select all] [Download] GeSHi © Codebox Plus

RegExMatch( data,  "(?>^[A-Za-z0-9+\/]+={0,3}$)")


Putting it together with what we had previously we get:

Code: [Select all] [Download] GeSHi © Codebox Plus

if ( mod(strLen( data ), 4 ) && RegExMatch( data,  "(?>^[A-Za-z0-9+\/]+={0,3}$)" ) )

This is probably the fastest approach for what you were doing.
You can toss the complexity notation in the trash bin for comparing our 2 versions though.
Recommends AHK Studio
iseahound
Posts: 231
Joined: 13 Aug 2016, 21:04
GitHub: iseahound

Re: Complex Regular Expressions fail when input exceeds a certain length.

24 Apr 2018, 04:52

You realize that the complexity notation is for RegEx analysis, not for real world performance? In all my benchmarks, both codes have excelled at an exceedingly fast 0ms. What I want to know is why ([a-z]{4})* will crash yet ([a-z]{4})*? will not. Also base64 code with 3 trailing equal signs is invalid. Also the fastest method is to separate the data into two parts, then run a simple [a-z\dA-Z+/]*, followed by ^([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$

The main frustration is writing efficient RegEx that crashes in AutoHotkey, but not in any other language.

EDIT: I can disable backtracking by making it possessive, and it will still crash. ([a-z]{4})*+ Nevermind, possessive modifiers work fine and even fix my original RegEx. Nope, just tested a possessive modifier with working code, and it crashes. ^(?:[A-Za-z0-9+\/]{4})*+$(?<=[A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}==)$
just me
Posts: 5396
Joined: 02 Oct 2013, 08:51
Location: Germany

Re: Complex Regular Expressions fail when input exceeds a certain length.

24 Apr 2018, 09:18

Hi nnnik,

thanks for the hint about "atomic groups". I didn't know that. They extremely increase performance in case of mismatches at the end of the haystack.
User avatar
nnnik
Posts: 2899
Joined: 30 Sep 2013, 01:01
Location: Germany

Re: Complex Regular Expressions fail when input exceeds a certain length.

24 Apr 2018, 09:30

iseahound have you tried running the regexmatch repeadetly?

@just me:
I learned them from regEx101.com one time I went ahead and learned all of the regex tags they have their - it was worth it.
Recommends AHK Studio
User avatar
jeeswg
Posts: 4483
Joined: 19 Dec 2016, 01:58
Location: UK

Re: Complex Regular Expressions fail when input exceeds a certain length.

24 Apr 2018, 11:13

@nnnik: Any other good finds?

[click Show Sidebar, View Regex Quick Reference]
Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript
https://regex101.com/
User avatar
nnnik
Posts: 2899
Joined: 30 Sep 2013, 01:01
Location: Germany

Re: Complex Regular Expressions fail when input exceeds a certain length.

24 Apr 2018, 15:45

Yeah the stuff for recursion seemed pretty powerful.
Other than that I can't quite remember everything.
I still do reccomend reading it - it was a nice experience
Recommends AHK Studio
User avatar
jeeswg
Posts: 4483
Joined: 19 Dec 2016, 01:58
Location: UK

Re: Complex Regular Expressions fail when input exceeds a certain length.

24 Apr 2018, 15:54

- Thanks nnnik.
- I've got my tutorial here, and I'm always on the lookout for additional RegEx techniques not mentioned in the AHK documentation.
jeeswg's RegEx tutorial (RegExMatch, RegExReplace) - AutoHotkey Community
https://autohotkey.com/boards/viewtopic.php?f=7&t=28031
- The thing that would be really useful is if you are or anyone else is able to come up with a really good example/examples of where certain bits of syntax are useful, especially those not mentioned in the AHK documentation.
- Someone posted here re. atomic groups, I sense that they are useful, but I haven't seen a really good/classic example yet.
RegEx Atomic Group / Possesive Quantifier - AutoHotkey Community
https://autohotkey.com/boards/viewtopic.php?f=5&t=47267
iseahound
Posts: 231
Joined: 13 Aug 2016, 21:04
GitHub: iseahound

Re: Complex Regular Expressions fail when input exceeds a certain length.

24 Apr 2018, 17:00

Sure. Here's an example of an atomic group that matches the word "coccyx" if and only if it is not enclosed in parenthesis. Due to restraints, parenthesis are only supported up to two nested layers, ((coccyx)) and not (((coccyx))).
https://regex101.com/r/N5LJUr/1

However, if you remove the atomic group ?> and make it into a non-capturing group ?:, it will fail if no parenthesis are present in the test string. https://regex101.com/r/k59yLV/1

EDIT: Here's an example from production used in my Subtitle class https://regex101.com/r/6m8kis/1

Return to “Bug Reports”

Who is online

Users browsing this forum: No registered users and 2 guests