Machine code binary buffer searching regardless of NULL


52 replies to this topic
polyethene
  • Members
  • 5519 posts
  • Last active: May 17 2015 06:39 AM
  • Joined: 26 Oct 2012
No, I didn't do any work; I just knew that \0 is the escape for null and that patterns in pcre can always be matched regardless of what precedes them. The problem you discovered with FileRead has, as you noticed, nothing to do with RegEx but with how AutoHotkey stores data.

Your example indicates that in the search string we have to replace each special character ([]()"\'.*?<>^$|null…) with a hex (or octal, as in your example) escape sequence (or precede them with ""), which is not a very elegant solution.

You misunderstood, I used \x61 as a control in the benchmark test so you could easily replace it with any other hex value to observe an identical result. In the previous script I had written expressions that used '\d' and '.' to a similar effect, any valid regex pattern can be used. Again, it doesn't hurt to try yourself before coming to such presumptions.
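
As a rough illustration of the point about \x61 and nulls, here is a minimal sketch in Python rather than AutoHotkey, with Python's `re` module standing in for pcre (an assumption; the two engines are similar here but not identical). The haystack contents are made up for the example.

```python
import re

# Haystack with embedded NUL bytes, standing in for a binary buffer.
haystack = b"MZ\x00\x00header\x00a42\x00tail"

# A hex escape matches a byte by value, so NUL (\x00) is just another byte.
m_nul = re.search(rb"\x00\x61", haystack)    # NUL followed by 'a' (\x61)

# Ordinary constructs such as . and \d keep working after a NUL.
m_digits = re.search(rb"\x00.\d+", haystack)

print(m_nul.start(), m_digits.group())
```

Swapping \x61 for any other hex value changes only which byte is searched for, not how the match proceeds.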

autohotkey.com/net Site Manager

 

Contact me by email (polyethene at autohotkey.net) or message tidbit


wOxxOm
  • Members
  • 371 posts
  • Last active: Feb 20 2015 12:10 PM
  • Joined: 09 Feb 2006
So try searching for a 4 MB binary needle inside a 60 MB binary buffer (or bigger: say a 1 GB memory block produced by CreateFileMapping, is that enough? :-D) using RegExMatch, and compare that to InBuf, hehe.

polyethene
Care to post your findings? Don't forget expressions are compiled and cached first time they're used, so it's best to exclude the first call from your test.
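
A benchmark that follows this advice might look like the sketch below. It is written in Python as a stand-in (an assumption, since AutoHotkey timing code would differ); Python's `re` also caches compiled patterns, so the same warm-up idea applies. The function name `time_search` and the buffer size are hypothetical.

```python
import re
import time

def time_search(pattern: bytes, haystack: bytes, runs: int = 5) -> float:
    """Return the average seconds per search, excluding a warm-up call."""
    re.search(pattern, haystack)      # warm-up: compiles and caches the pattern
    start = time.perf_counter()
    for _ in range(runs):
        re.search(pattern, haystack)
    return (time.perf_counter() - start) / runs

# One million NUL bytes followed by the byte being searched for.
haystack = b"\x00" * 1_000_000 + b"\x61"
avg = time_search(rb"\x61", haystack)
print(f"average per search: {avg:.6f} s")
```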


wOxxOm
First, how do you think you will read a 1 GB file? Using a chunked read cycle? Nah, that's bad! Is there any way to feed a 1 GB mapped memory space to RegExMatch? If not, then there is not much sense in comparing, IMO, since CreateFileMapping + MapViewOfFile + InBuf would be the most suitable and *elegant* solution, not requiring pre-escaping of a large binary buffer. BTW, it would take a long time to convert a large binary buffer to a format RegExMatch accepts using pure AHK, wouldn't it :-) ?
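
For comparison only, here is how the mapped-memory idea looks in Python, where the `mmap` module plays the role of CreateFileMapping + MapViewOfFile and both a plain byte search and a regex can run over the same view. This is a sketch under that assumption and says nothing about whether RegExMatch in AHK can accept such a view. The needle bytes and file contents are invented.

```python
import mmap
import os
import re
import tempfile

needle = b"\x00\xde\xad\x00\xbe\xef"

# Build a small stand-in for a large binary file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * 4096 + needle + b"\xff" * 4096)
    path = f.name

try:
    with open(path, "rb") as f:
        # Length 0 maps the whole file; nothing is read in chunks by hand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            pos_find = view.find(needle)                              # plain byte search
            pos_regex = re.search(re.escape(needle), view).start()    # regex over the same view
finally:
    os.remove(path)

print(pos_find, pos_regex)
```

Both searches work on the mapped view directly, so the file never has to be copied into an ordinary string first.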

polyethene
Different algorithms are suited for specific situations. Even if your function is specialized to parse extremely large sets of data it lacks the features of regular expressions. Conversely, regex was designed for manipulating complex strings and requires additional steps to read large variables.

I don't mean to depreciate your work, it's certainly most impressive to see assembly used in AutoHotkey. I just find that regex can satisfy most requirements, including searching for and past null characters.


wOxxOm
yes, tastes differ :-D yet I also like regexps

Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005

Your example indicates that in the search string we have to replace each special character ([]()"\'.*?<>^$|null…) with a hex (or octal, as in your example) escape sequence (or precede them with ""), which is not a very elegant solution.

You misunderstood, I used \x61 as a control in the benchmark test so you could easily replace it with any other hex value to observe an identical result. In the previous script I had written expressions that used '\d' and '.' to a similar effect, any valid regex pattern can be used. Again, it doesn't hurt to try yourself before coming to such presumptions.

I did try myself. I was speaking about the scenario where you search for a binary pattern, like a piece of code in a program, an embedded MD5 signature, or descriptors of images and videos. These patterns are normally taken from other files. You cannot use those patterns directly as the search string, can you? You have to "replace each special character" and convert the non-printables to escape sequences; that is, first you have to process the search string. If I misunderstood, and you can use a binary file or buffer directly as the search string in another buffer, please tell me the trick. I tried the following:
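; Read AutoHotkey.exe itself as raw binary data
FileRead a, *c %A_AhkPath%
; Copy it into a second buffer b
VarSetCapacity(b,StrLen(a),1)
DllCall("RtlMoveMemory", UInt,&b, UInt,&a, Uint,StrLen(a))
VarSetCapacity(y,4)
Loop % StrLen(a) {
   x := NumGet(b,A_Index)   ; 4 bytes taken from the buffer at this offset
   NumPut(x,y)              ; use them, unmodified, as the search string
   r := RegExMatch(b, y)
   If ErrorLevel <> 0       ; RegExMatch reported an error for this pattern
      MsgBox %A_Index% : %ErrorLevel%
}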
I got the following msg:
---------------------------
test.ahk
---------------------------
1006 : Compile error 14 at offset 4: missing )
---------------------------
OK
---------------------------
It shows that some 4-byte binary search strings lead to errors.
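
The same effect can be reproduced outside AHK. The sketch below uses Python's `re` as a stand-in for pcre (an assumption; the error text differs between engines) with an invented 4-byte needle containing an unbalanced parenthesis, and then shows the pre-processing step, here done with `re.escape`.

```python
import re

haystack = b"\x01\x02(ab\x00\x03"
needle = b"(ab\x00"           # raw bytes that happen to contain '('

try:
    re.search(needle, haystack)
    raw_error = None
except re.error as exc:
    raw_error = str(exc)      # the unbalanced '(' is parsed as a group opener

# Backslash-escape the metacharacters first, then search again.
escaped = re.escape(needle)
pos = re.search(escaped, haystack).start()
print(raw_error, pos)
```

The raw needle fails to compile, while the escaped one is found, which is exactly the extra processing of the search string described above.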

polyethene
Try with \Q\E... and before you ask it's in the manual.


wOxxOm
does that mean that if needle contains null char then this will work:
L:=binNeedleLength
VarSetCapacity( binNeedleSpoof, L+4+1, 0 )
binNeedleSpoof=\Q
binNeedleSpoof2=\E
DllCall("RtlMoveMemory", "uint",&binNeedleSpoof+L+2, "str", binNeedleSpoof2, "uint",2)
DllCall("RtlMoveMemory", "uint",&binNeedleSpoof+2, "uint", &binNeedle, "uint",L)
res:=RegExMatch( buf, binNeedleSpoof)

hmm looks pretty terrrrrifying :-D

polyethene

looks pretty terrrrrifying

Only for the way you do it.
Your method can be condensed to one line:

res := RegExMatch(buf, "\Q" . binNeedle . "\E")


wOxxOm
hehe, so why is this showing 1 as found pos when it should be 4 :-D ?
varsetcapacity( hay, 100, 0 )
hay=aaa
varsetcapacity( n, 10, 0 )
msgbox % "hay: " hay ". pos=" regExMatch( hay, "\Q" . n . "\E" )


polyethene
Are you trying to dispute pcre's abilities?
RtlMoveMemory or something to update the StrLen value is probably needed in your script due to AutoHotkey's limitations with binary variables.


wOxxOm
This: "\Q" . binneedle . "\E" is supposed to fail on the very first null encountered, isn't it? What does pcre have to do with it if AHK won't pass the correct parameter anyway?

Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005

Try with \Q\E... and before you ask it's in the manual.

You should know that \Q...\E does not help. The binary search string could contain "\E", which is still interpreted as a control sequence. There could be other forbidden substrings, too.
y = \E)) ; binary string containing "\E"
r := RegExMatch(b, "\Q" . y . "\E")
MsgBox Error = %ErrorLevel%

If the search string contains \0, other tricks are needed in AHK (as I noted earlier): VarSetCapacity plus a DllCall to RtlMoveMemory, or NumPut calls, to get the desired StrLen. You can make a wrapper function, but it is still ugly.
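
To make the wrapper idea concrete, here is a sketch of the string transformation such a function would have to perform, written in Python only so that it can be run and checked. `pcre_quote` is a hypothetical helper, not part of AutoHotkey or pcre. It assumes the usual workaround of closing the quoted run before an embedded \E and reopening it afterwards, and of writing a null as the text escape \x00 outside the quoted run. It only builds the pattern text; it does not address how AHK stores the result.

```python
def pcre_quote(needle: bytes) -> bytes:
    r"""Wrap raw bytes in \Q...\E so the regex engine treats them literally.

    Hypothetical helper for illustration only.
    """
    # Close the quote before an embedded \E, emit it as an escaped
    # backslash plus a plain E, then reopen the quote.
    body = needle.replace(b"\\E", b"\\E\\\\E\\Q")
    # A NUL would end a C-style pattern string, so write it as the
    # four-character text \x00 outside the quoted run.
    body = body.replace(b"\x00", b"\\E\\x00\\Q")
    return b"\\Q" + body + b"\\E"

print(pcre_quote(b"\\E))"))     # the needle from the example above
print(pcre_quote(b"a\x00b"))    # a needle with an embedded NUL
```

The resulting pattern contains no raw null and no stray \E, at the cost of one pass over the needle.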

polyethene

\Q...\E does not help

That is only true if your program is so poorly designed that it lacks fault tolerance and control. This is often a cause of bugs and security holes, which AutoHotkey's uniquely simplistic syntax aims to prevent in the first place. It's ironic that you find the need to replace a single \E so mundane and 'ugly', knowing the overheads of a machine code function.

I never expected that you would be so adamant about suppressing alternative techniques. Raw buffer searching and regex have their trade-offs, and each is suited to different applications.
