How to detect file is binary or ascii?

panofish
Posts: 11
Joined: 04 Oct 2013, 09:20

Post by panofish » 04 Oct 2013, 09:27

What is the best way to determine if a file is binary or ascii?
Preferably a fast and simple technique that is not dependent on file extension.
I need to process many files.

Thanks
joedf
Posts: 6487
Joined: 29 Sep 2013, 17:08
Facebook: J0EDF
Google: +joedf
GitHub: joedf
Location: Canada, Quebec

Post by joedf » 04 Oct 2013, 12:35

ASCII, like everything else, can be and is represented in binary.
I believe you mean executable vs. text files?

Post by panofish » 04 Oct 2013, 13:21

Here is what I have currently. It appears to work well, but may not be efficient. However, in my case it is plenty fast enough.

Code: Select all

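folder = c:\temp
recurse = 1  ; 0 = no recursion, 1 = recursion
numeric = 1,2,3,4,5,6,7,8,9,.  ; match list for "contains"; set once, outside the loop

Loop, %folder%\*,, %recurse%
{
    ; Most string commands only see the FileRead result up to the first NUL byte,
    ; so a binary file usually exposes just a few leading characters.
    FileRead, recs, %A_LoopFileLongPath%

    ; Heuristic: content with no digit (1-9) and no period is treated as binary.
    if recs not contains %numeric%
    {
        ;outputdebug binary - %a_loopfilename%
    } else {
        outputdebug ascii - %a_loopfilename%
    }
}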
MilesAhead
Posts: 230
Joined: 03 Oct 2013, 09:44

Post by MilesAhead » 04 Oct 2013, 13:21

I would look at the source of the Linux "file" command. It's pretty good at catching text files and some printer-format file types. For executables, I suspect it uses file attribute info that is built into Linux file systems but not into NTFS.
"My plan is to ghostwrite my biography. Then hire another writer to put his
name on it and take the blame."

- MilesAhead

Post by joedf » 04 Oct 2013, 14:27

I have something better, but I'm not at home right now, so I don't have my "setup".

Post by joedf » 04 Oct 2013, 18:18

I have made a function: isBinFile
It reads the first few bytes (default: 5) and checks whether each byte is within the ASCII printable range (9-13, 32-126). If any byte falls outside that range, the file is reported as binary.
It seems to work well...

Code: Select all

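folder = c:\tools
recurse = 1  ; 0 = no recursion, 1 = recursion

Loop, %folder%\*,, %recurse%
	MsgBox % A_LoopFileLongPath "`n==>" (isBinFile(A_LoopFileLongPath) ? "Binary" : "ASCII")

; Returns 1 if any of the first <tolerance> bytes falls outside the ASCII
; ranges 9-13 and 32-126, otherwise 0.
isBinFile(Filename,tolerance=5) {
	file:=FileOpen(Filename,"r")
	loop, %tolerance% {
		if (!file.RawRead(a,1)) ; nothing left to read: stop at end of file
			break
		byte:=NumGet(&a,0,"UChar") ; unsigned, so bytes above 127 are not read as negative
		if (byte<9) or (byte>126) or ((byte<32) and (byte>13)) {
			file.Close()
			return 1
		}
	}
	file.Close()
	return 0
}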
cheers!
Guest10
Posts: 578
Joined: 01 Oct 2013, 02:50

Post by Guest10 » 06 Oct 2013, 04:11

Tested and works great. I'll be sure to find some applications for this in my scripts! :lol:

Post by joedf » 06 Oct 2013, 09:01

Thanks :) If at some point it does not work, increase the "tolerance". If it still doesn't work for a certain file,
report it here and I'll fix it. ;)

Post by panofish » 07 Oct 2013, 16:07

If the ASCII file is not encoded as ANSI (for example, UCS-2 Big Endian), then isBinFile will report Binary because of the 0 bytes.

Post by joedf » 07 Oct 2013, 16:17

panofish wrote: If the ASCII file is not encoded as ANSI (for example, UCS-2 Big Endian), then isBinFile will report Binary because of the 0 bytes.
I knew about that, but what you're asking about is actually Unicode.
ASCII and Unicode are two different character sets.
The original question was specifically about ASCII.

I will update the function so that it works with Unicode as well.
Will post it soon!

Cheers! ;)

Post by panofish » 07 Oct 2013, 16:49

Sorry about that joedf. You are correct. What you created works great for what I need. I just thought I'd point that out for anyone else that may need this. THANKS!

Post by joedf » 08 Oct 2013, 22:01

HotKeyIt wrote: Probably IsBOM() will help?
Yes, thank you, it has helped as an example :)
I have done some research on Unicode on Wikipedia, the official website, Unicode tables, etc.
Here is what I have. It seems to work well ;)

Code: Select all

/* Version 2 relies on the BOM to identify Unicode files
;BOM ("Byte Order Mark")
;Table from : http://www.unicode.org/faq/utf_bom.html#bom4
-----------------------------------------    ;Woohoo! ASCII Art.. get it? lol..
|    Bytes    |      Encoding Form      |    ;if you don't well, we're trying
|----------------------------------------    ;to detect ASCII here... :P
|00 00 FE FF  |  UTF-32, big-endian     |    ;... and Unicode, of course! ;)
|FF FE 00 00  |  UTF-32, little-endian  |
|FE FF        |  UTF-16, big-endian     |
|FF FE        |  UTF-16, little-endian  |
|EF BB BF     |  UTF-8                  |    ;I know we can not rely on this...
-----------------------------------------
*/

isBinFile(Filename,tolerance=5,asumetext=4,detectunicode=1) {
	file:=FileOpen(Filename,"r")
	file.Position:=0 ;force position to 0 (zero), in case FileOpen skipped a BOM
	nbytes:=file.RawRead(a,tolerance)
	file.Close() ;everything needed is now in the buffer
	if (nbytes < asumetext) ;recommended 4 minimum for unicode detection
		return 0 ;assume text file, if too short

	if (detectunicode) {
		;read first 4 bytes
		byteA:=Numget(&a,0,"UChar")
		byteB:=Numget(&a,1,"UChar")
		byteC:=Numget(&a,2,"UChar")
		byteD:=Numget(&a,3,"UChar")

		;determine BOM if possible/existent
		if (byteA=0xFE && byteB=0xFF)
			or (byteA=0xFF && byteB=0xFE)
			return 0 ;text UTF-16 BE/LE file (FF FE also matches the UTF-32 LE BOM)
		if (byteA=0xEF && byteB=0xBB && byteC=0xBF)
			return 0 ;text UTF-8 file
		if (byteA=0x00 && byteB=0x00
			&& byteC=0xFE && byteD=0xFF)
			or (byteA=0xFF && byteB=0xFE
			&& byteC=0x00 && byteD=0x00)
			return 0 ;text UTF-32 BE/LE file
	}
	;otherwise continue with the traditional method: detect ASCII (printable ranges)
	loop, %nbytes% {
		byte:=NumGet(&a,A_index-1,"UChar") ;offsets start at 0 (zero)
		if (byte<9) or (byte>126) or (byte=11) or (byte=12) or ((byte<32) and (byte>13))
			return 1
	}
	return 0
}
UTF-8 without a BOM is a known flaw. Working on it :P
When this flaw is fixed, I will add the function to the functions topic.
Don't worry, I know how to fix it, I just need to sleep first :lol:

cheers! ;)
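
For readers wondering what a check for UTF-8 without a BOM could look like, here is a minimal sketch. It is not joedf's fix (that was never posted in this thread). It only assumes that BOM-less UTF-8 can be guessed by testing whether the leading bytes form well-formed UTF-8 sequences. The function name looksLikeUtf8 and the maxBytes parameter are made up for illustration, and the check is simplified: it does not reject overlong encodings or surrogates.

```autohotkey
; Hypothetical helper, not part of isBinFile: returns 1 if the first maxBytes
; bytes of the file form well-formed UTF-8 sequences, otherwise 0.
looksLikeUtf8(Filename, maxBytes=4096) {
	file := FileOpen(Filename, "r")
	if (!IsObject(file))
		return 0 ; the file could not be opened
	file.Position := 0 ; start at the very first byte
	nbytes := file.RawRead(buf, maxBytes)
	file.Close()
	i := 0
	while (i < nbytes) {
		byte := NumGet(&buf, i, "UChar")
		if (byte < 0x80)
			extra := 0 ; plain ASCII byte
		else if (byte >= 0xC2 && byte <= 0xDF)
			extra := 1 ; lead byte of a 2-byte sequence
		else if (byte >= 0xE0 && byte <= 0xEF)
			extra := 2 ; lead byte of a 3-byte sequence
		else if (byte >= 0xF0 && byte <= 0xF4)
			extra := 3 ; lead byte of a 4-byte sequence
		else
			return 0 ; stray continuation byte or invalid lead byte
		if (i + extra >= nbytes)
			break ; sequence is cut off by the end of the buffer: stop checking
		loop, %extra% {
			cont := NumGet(&buf, i + A_Index, "UChar")
			if (cont < 0x80 || cont > 0xBF)
				return 0 ; continuation bytes must look like 10xxxxxx
		}
		i += extra + 1
	}
	return 1
}
```

Pure ASCII also passes this test, and so do control characters such as NUL, so it would have to be combined with the printable-range check in isBinFile rather than replace it.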
just me
Posts: 5510
Joined: 02 Oct 2013, 08:51
Location: Germany

Post by just me » 09 Oct 2013, 00:49

Hi joedf,

You might consider that extended ASCII codes like "Ü" (154 in code page 850, 220 in Windows-1252) are valid in some European languages.

A file without a BOM might be considered to be binary if you find a NULL byte within the first nnn bytes, though it's still a guess.
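
The NULL-byte idea can be sketched roughly as follows. This is only an illustration of the suggestion above, not code from any poster. The function name hasNullByte and the default of 512 bytes are assumptions, since "nnn" is left open in the post.

```autohotkey
; Hypothetical helper: returns 1 if a NUL byte occurs within the first
; maxBytes bytes of the file, otherwise 0.
hasNullByte(Filename, maxBytes=512) {
	file := FileOpen(Filename, "r")
	if (!IsObject(file))
		return 0 ; the file could not be opened
	file.Position := 0 ; start at the very first byte
	nbytes := file.RawRead(buf, maxBytes)
	file.Close()
	loop, %nbytes% {
		if (NumGet(&buf, A_Index-1, "UChar") = 0)
			return 1 ; found a NUL byte: probably binary
	}
	return 0
}
```

UTF-16 and UTF-32 text normally contains NUL bytes as well, so a BOM check would have to run before this one. As the post says, the result is still a guess.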

Post by joedf » 09 Oct 2013, 09:20

I know about that, but I didn't think those characters would be needed...
Oh well! I'll add support for that too! Thanks for your feedback ;)

Post by MilesAhead » 09 Oct 2013, 13:07

Hmmm, I'm curious how "file" does it. But I can't look at tar.gz files on this library Windows PC. If anyone is curious, here's the link to the source archive: ftp://ftp.astron.com/pub/file/

I believe it's a bash shell script.

Post by joedf » 10 Oct 2013, 13:44

The file command actually only checks for the "signature"... For an exe it's "MZ", and for a bmp it's something like "BM".
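
A signature check of that kind could look roughly like the sketch below. It covers only the two signatures named in this post ("MZ" and "BM"). The function name fileSignature and its return values are made up for illustration, and this is not how the file command itself is implemented.

```autohotkey
; Hypothetical helper: returns "exe" for an "MZ" signature, "bmp" for "BM",
; and an empty string if neither matches or the file cannot be read.
fileSignature(Filename) {
	file := FileOpen(Filename, "r")
	if (!IsObject(file))
		return "" ; the file could not be opened
	file.Position := 0 ; start at the very first byte
	nbytes := file.RawRead(buf, 2)
	file.Close()
	if (nbytes < 2)
		return "" ; too short to carry a 2-byte signature
	b1 := NumGet(&buf, 0, "UChar")
	b2 := NumGet(&buf, 1, "UChar")
	if (b1 = 0x4D && b2 = 0x5A) ; "MZ"
		return "exe"
	if (b1 = 0x42 && b2 = 0x4D) ; "BM"
		return "bmp"
	return ""
}
```

A text file that happens to start with the letters "MZ" or "BM" would be misreported, so a two-byte match is only a hint.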

Post by MilesAhead » 10 Oct 2013, 13:58

joedf wrote: The file command actually only checks for the "signature"... For an exe it's "MZ", and for a bmp it's something like "BM".
I assumed it did so for things like image files and printer-format files such as PDF, PostScript, etc., but I thought that for text it might be able to detect ASCII/Unicode types. But you're saying "text" is the fall-through if nothing else is found?

Dang! I wish I could just look at the script. Hopefully soon I'll have a machine instead of using a library loaner. :)

Post by joedf » 10 Oct 2013, 14:02

Well, I'll work on it when I arrive home. Don't worry, I can detect UTF-8 without a BOM.
I just need to get home :P

Post by MilesAhead » 10 Oct 2013, 15:11

It's no biggie for me. Just a matter of curiosity. I know I looked through the 'file' script. But it was years ago. I was probably running Mandrake 9.1 then. But I bet it does fall through to text as last resort. The frustration is just generally dealing with these super restricted library computers. Not anything to do with this thread. :)