- I've been interested in having a function for case-insensitive searching in binary buffers. E.g. search a binary file for a string.
- AFAIK there are no existing Winapi functions for searching for ANSI/UTF-8/UTF-16 strings, which can handle null characters. I.e. they stop at the first null character.
- One way to do this would be to repeatedly use NumGet, but this is very very slow, so instead, writing a function in C++ and compiling it, and using it as a machine code function, would be better. E.g. something like this:
GitHub - HelgeffegleH/buf: Functions for searching in and writing to buffers
https://github.com/HelgeffegleH/buf
C++: C++ to machine code via TDM-GCC - AutoHotkey Community
https://autohotkey.com/boards/viewtopic ... 23&t=49554
- (Generally speaking I write everything in AHK, but a few times, machine code functions that can be used in AHK are necessary.)
- (Note: case-sensitive searching is easier, because you only have to search for an exact stream of bytes. You can search for 'AutoHotkey', you don't have to also check for 'a'/'U'/'T' etc.)
- To do a case-insensitive search for 'AutoHotkey'. I check each character until I find 'A' or 'a', then I check if the next character is 'U' or 'u', then I check for 'T'/'t' etc. Either I've found the entire word, else, I start looking for 'A'/'a' again.
- The question is how to most efficiently handle checking multiple byte blocks.
- E.g. to do a case-sensitive search for 'résumé' in UTF-8:
R É S U M É
r é s u m é
- Also, for some letters, the upper case and lower case forms do not have the same number of bytes.
- Actually, this case is so rare that I would suggest any code use 2 separate algorithms. One algorithm for searching strings where each Unicode character has the same number of bytes for upper/lower case forms in UTF-8, and one algorithm to handle the general case where the number can differ.
- This script sought to find any characters where the upper case and lower case forms had a different number of bytes when stored as UTF-8.
- I found only 7 pairs of characters.
Code: Select all
;==================================================
q:: ;characters where upper/lower case have different sizes in UTF-8
vOutput := "`r`n"
Loop, 1114111
{
vCharU := Format("{:U}", Chr(A_Index))
vCharL := Format("{:L}", Chr(A_Index))
vSizeU := StrPut(vCharU, "UTF-8") - 1
vSizeL := StrPut(vCharL, "UTF-8") - 1
vOrdU := Ord(vCharU)
vOrdL := Ord(vCharL)
vTemp := vCharU " " vCharL " " vSizeU " " vSizeL " " vOrdU " " vOrdL
if !(vSizeU = vSizeL)
&& !InStr(vOutput, "`r`n" vTemp "`r`n")
vOutput .= vTemp "`r`n"
}
vOutput := SubStr(vOutput, 3)
Clipboard := vOutput
MsgBox, % vOutput
return
;==================================================
;Ⱥ ⱥ 2 3 570 11365
;Ⱦ ⱦ 2 3 574 11366
;Ɐ ɐ 3 2 11375 592
;Ɑ ɑ 3 2 11373 593
;Ɫ ɫ 3 2 11362 619
;Ɱ ɱ 3 2 11374 625
;Ɽ ɽ 3 2 11364 637
;Ⱥⱥ Ⱦⱦ Ɐɐ Ɑɑ Ɫɫ Ɱɱ Ɽɽ
;==================================================
A BC D EF
a bc d ef
- We split the needle into byte groups, one byte group for each character.
- The length of each split-up needle is 9 (including spaces). Let's say when searching for 'A'/'a', we've found 'ABCdef'.
- We check if 'A'/'a' matches. 'A' matches. The next byte is a pad byte so we skip to the next byte. We check if 'B'/'b' matches. 'B' matches. The next byte is not the pad byte, (so we are still within the byte group for a character,) so we only check if 'C' matches, we do not check for 'c'. 'C' matches. Etc.
- The more general algorithm, where byte groups could vary in length, could ignore pad characters, and wait until there is a pad character in both byte lists (to signal a new byte group).
- Another algorithm could make use of a list. E.g. using '110110' to represent 'A BC D EF'. I.e. 1 marks the start of a new byte group.
- [EDIT:] This approach is probably better. One problem with having a pad byte, is that you need to specify a pad byte number between 0 and 255. Your needle might include all 256 different bytes.
- [EDIT:] You could use a ternary (base 3) pattern for each string. 2 for the start of a byte group, 1 for the continuation of a byte group, 0 for a pad byte.
- [EDIT:] Perhaps the best system. The upper and lower case versions of the needle each have a binary 'guide' number that represents them. While checking a lower case byte group, you move the pointer along on the lower case 'guide' until you hit 1, at which point you move the pointer along on the upper case 'guide' until it hits 1. (That way you can have byte groups of differing lengths, but avoid using pad bytes. You keep the pointers on each binary 'guide' in sync.)
- Or perhaps another idea.
- The question is, which algorithm is best?
- [EDIT:] I realised that the UTF-8 forms of upper and lower case characters, can start with the same bytes, but then differ, so ... if one 'character' (byte group) fails to match, you need to backtrack and try the other one. Or you check the bytes from both byte groups simultaneously.