iniRead UTF-8

Azev · 11 Oct 2018, 16:02

Hi all!
The iniRead function is not reading UTF-8 files properly. It returns trash data instead of chars like ä é ü Ó etc.

Can it be fixed?

jeeswg · 11 Oct 2018, 16:07

Cheers.
UTF-8 ini files - AutoHotkey Community
https://autohotkey.com/boards/viewtopic.php?f=6&t=38511

Azev · 11 Oct 2018, 16:20

Thanks!

Oh boy, I thought ahk was meant for quick coding. I always have to go hunting for patches.
How hard would it be to include it natively in the main code?
Nowadays everything is UTF-8. sigh...

jeeswg · 11 Oct 2018, 16:46

- AutoHotkey uses the Winapi functions for ini handling, which support ANSI or UTF-16 LE.
- So the question becomes: when will Windows become more UTF-8-friendly?
- I suppose that UTF-8 versions of numerous existing functions would be necessary for a UTF-8-friendly Windows. That would be a massive undertaking. (Or maybe just a few new UTF-8 function variants would be needed ... for handling ini files.)
- UTF-16 is generally easier to play about with (e.g. string length measurement is easy), which is why I'm normally not too concerned for more UTF-8 support in Windows. I.e. UTF-16 is good for manipulation and UTF-8 is good for storage for Latin-based languages.

User · 11 Oct 2018, 17:25

Azev wrote: The iniRead function is not reading UTF-8 files properly. It returns trash data instead of chars like ä é ü Ó etc.

I'm really sorry about my ignorance on this matter, but why not save the file as "Unicode" instead of "UTF-8"?

[Attachment: Unicode.png]

Using "UTF-8", iniRead returns "Error"!

Using "Unicode", iniRead works fine!

Again, sorry for my ignorance! (I just want to know why "UTF-8" instead of "Unicode".)

jeeswg · 11 Oct 2018, 17:28

- What Notepad calls 'Unicode' is UTF-16 LE. And that is a possibility if you want a Unicode ini file.
- However, if you use mainly ASCII characters, a UTF-16 file will be approximately double the size of the equivalent UTF-8 file.
- If you use IniRead on a UTF-8 file, it will read the raw UTF-8 bytes, but note: you would need a comment at the top of the ini file, otherwise the first section couldn't be read from/written to. (Using a comment at the top of the file is a hack.)

Azev · 11 Oct 2018, 18:02

I can't change the ini encoding. I'll write my own ini read function then.

jeeswg · 11 Oct 2018, 18:19

- Huh? AHK handles ANSI and UTF-16 LE. My functions handle UTF-8 (if you have a comment starting with a semicolon as the first line). (Having a blank line as the first line, or starting the file with a dummy section that you never use, also works.)
- You could experiment with FileRead/FileAppend or AutoHotkey's File object, and implement UTF-8 ini files yourself. It wouldn't be too difficult.
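As a rough illustration of the FileRead route, here is a minimal sketch, assuming AutoHotkey v1.1 (Unicode build). `IniReadUtf8` is a hypothetical name, not a built-in and not the function from the linked thread. It only reads one key, treats lines starting with a semicolon as comments, and returns "ERROR" by default like IniRead does.

```autohotkey
;Hypothetical helper (not a built-in): read one key from a UTF-8 ini file.
;Assumes AutoHotkey v1.1 Unicode. Section and key names are matched case-insensitively.
IniReadUtf8(vPath, vSection, vKey, vDefault := "ERROR")
{
	FileRead, vText, *P65001 %vPath% ;decode the file as UTF-8 (code page 65001)
	if (ErrorLevel)
		return vDefault
	if (Ord(vText) = 0xFEFF) ;drop a leading byte order mark if one survived decoding
		vText := SubStr(vText, 2)
	vInSection := 0
	Loop, Parse, vText, `n, `r
	{
		vLine := Trim(A_LoopField)
		if (vLine = "" || SubStr(vLine, 1, 1) = ";") ;skip blank lines and comments
			continue
		if RegExMatch(vLine, "^\[(.*)\]$", vMatch) ;section header such as [Main]
		{
			vInSection := (vMatch1 = vSection)
			continue
		}
		if (!vInSection)
			continue
		vPos := InStr(vLine, "=")
		if (vPos && Trim(SubStr(vLine, 1, vPos - 1)) = vKey)
			return Trim(SubStr(vLine, vPos + 1))
	}
	return vDefault
}
```

It could be called as `vName := IniReadUtf8(A_ScriptDir "\settings.ini", "Main", "Name")`, where the file, section and key names are made up for the example. Because the whole file is decoded as text first, no comment line is needed at the top of the file.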

User · 11 Oct 2018, 19:55

jeeswg wrote:- What Notepad calls 'Unicode' is UTF-16 LE. And that is a possibility if you want a Unicode ini file.
- However, if you use mainly ASCII characters, a UTF-16 file will be approximately double the size of the equivalent UTF-8 file.
- If you use IniRead on a UTF-8 file, it will read the raw UTF-8 bytes, but note: you would need a comment at the top of the ini file, otherwise the first section couldn't be read from/written to. (Using a comment at the top of the file is a hack.)

I always preferred "Unicode\UTF-16\whatever" over "UTF-8" because it prevents unpaired Unicode surrogate characters from being replaced and saved as "�" (Unicode character 65533)!

Code: Select all

	;Unicode high "surrogate" characters (from 55296 to 56319); low surrogates run from 56320 to 57343

String := Chr(55296) "." Chr(55297) "." Chr(55298) "." Chr(55299) "." Chr(55300)

FileAppend , % String, #_ UTF-8 test.txt, UTF-8		;unpaired Unicode surrogate characters are replaced with "�" (Unicode character 65533) when saved
FileAppend , % String, #_ UTF-16 test.txt, UTF-16	;unpaired Unicode surrogate characters are saved as-is, not replaced with "�" (Unicode character 65533)
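One way to check the result is to read a saved file back and count the replacement characters in it. This is only a sketch, assuming AutoHotkey v1.1.21+ (Unicode build); `CountReplacementChars` is a hypothetical helper name, and the count you get for the test files above depends on how the file was written.

```autohotkey
;Hypothetical helper: count "�" (Unicode character 65533) in a UTF-8 file.
CountReplacementChars(vPath)
{
	FileRead, vText, *P65001 %vPath% ;decode the file as UTF-8 (code page 65001)
	if (ErrorLevel)
		return -1 ;the file could not be read
	StrReplace(vText, Chr(65533), "", vCount) ;vCount receives the number of matches
	return vCount
}
```

For example, `MsgBox, % CountReplacementChars("#_ UTF-8 test.txt")` would report how many replacement characters ended up in the UTF-8 test file.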

jeeswg · 11 Oct 2018, 20:14

Great example User. Cheers. Btw, out of interest, does that situation happen to you often, or just when you create a list of the first 65535 Unicode characters? Either way, it's an absolutely valid concern, thanks for pointing it out.

User · 11 Oct 2018, 21:35

jeeswg wrote:Great example User. Cheers. Btw, out of interest, does that situation happen to you often, or just when you create a list of the first 65535 Unicode characters? Either way, it's an absolutely valid concern, thanks for pointing it out.

Look @jeeswg, I know that you already know about what I'm going to write below, but I will write it anyway:

Any Unicode character above 65535 is stored as a pair of Unicode "surrogate" characters!

I don't remember very well, but I think that Unicode "surrogate" characters run from 55296 to 57343 (high surrogates from 55296 to 56319, low surrogates from 56320 to 57343)!

Code: Select all

	;above "65535", "msgbox" below returns paired numbers of Unicode "Surrogate" characters
	;for example, chr(65536) = pair of 55296 + 56320 Unicode "Surrogate" characters
	;for example, chr(1114111) = pair of 56319 + 57343 Unicode "Surrogate" characters

string := chr(65536)		;Max allowed "1114111"

msgbox, % ord(SubStr(String, 1, 1)) " + " ord(SubStr(String, 2, 1))
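The pairing can also be worked out by arithmetic rather than by inspecting the string. The sketch below follows the standard UTF-16 rule (subtract 65536, then split the remaining 20 bits into two groups of 10); `SurrogatePair` is a hypothetical helper name, and it assumes the input is in the range 65536 to 1114111.

```autohotkey
;Hypothetical helper: split a code point from 65536 to 1114111 into its surrogate pair.
SurrogatePair(vCode)
{
	vOffset := vCode - 65536
	vHigh := 55296 + (vOffset >> 10) ;high surrogate carries the top 10 bits
	vLow := 56320 + (vOffset & 1023) ;low surrogate carries the bottom 10 bits
	return vHigh " + " vLow
}
```

`SurrogatePair(65536)` gives "55296 + 56320" and `SurrogatePair(1114111)` gives "56319 + 57343", matching the comments in the code above.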

When saving to "UTF-8" files, only unpaired or wrongly paired Unicode "surrogate" characters are replaced and saved as "�" (Unicode character 65533).

Code: Select all

	;Unicode high "surrogate" characters (from 55296 to 56319); low surrogates run from 56320 to 57343

String := ""
. "chr(66565) = " Chr(55297) Chr(56325) " / Unpaired = " Chr(55297) "." Chr(56325)  "`n"
. "`n"
. "chr(66566) = " Chr(55297) Chr(56326) " / Unpaired = " Chr(55297) "." Chr(56326)  "`n"
. "`n"
. "chr(66561) = " Chr(55297) Chr(56321) " / Unpaired = " Chr(55297) "." Chr(56321)  "`n"
. "`n"

FileAppend , % String, #_ UTF-8 test.txt, UTF-8		;unpaired Unicode surrogate characters are replaced with "�" (Unicode character 65533) when saved
FileAppend , % String, #_ UTF-16 test.txt, UTF-16	;unpaired Unicode surrogate characters are saved as-is, not replaced with "�" (Unicode character 65533)

Well, if you ask me, I think this Unicode "surrogate" characters thing is absolutely absurd!

Using pairs of 2 Unicode characters to represent another Unicode character? Haha, sorry, but this really makes me Lol!

Anyway, it is what it is, so ...!

jeeswg · 11 Oct 2018, 21:52

- Well some system or other was needed!
- I thought it was alright.
- I liked the brevity of UTF-8: it lets you store Unicode characters whilst hardly increasing the file size of an ASCII/mostly ASCII text file (instead of doubling it like UTF-16 does).
- I liked the simplicity of UTF-16.
- Reserving 2048 characters gave you just over a million more.

65536
1024*1024=1048576 [using chars 55296 to 57343]
65536+1048576=1114112

The question is ...
Are there 1114112 characters?
Or 1114112-2048=1112064 characters?
1114112 for the win.

### Btw it seems you use one Unicode character a lot. #ThisOne ###

User · 11 Oct 2018, 22:39

jeeswg wrote:Are there 1114112 characters?

Technically, yes! We already have 1114111 unicode characters!

But most of them don't yet have a "drawing\symbol\whatever" associated with them!

Anyway, why limit ourselves to 1114111 if we can go unlimited?

As a matter of fact, I think there are only 65535 Unicode characters! The same way a pair of "`r`n" returns a new line, different pairs of 2 Unicode surrogate characters return different symbols!

jeeswg wrote:### Btw it seems you use one Unicode character a lot. #ThisOne ###

I like to use it as delimiter, that's it!

User · 19 Oct 2018, 00:01

jeeswg wrote: ↑
11 Oct 2018, 21:52
.

I found something interesting!

The ".rar" file below contains all the possible combinations of 1 to 5 hexadecimal symbols (0-9 and A-F), 1118480 in total!

http://www.mediafire.com/file/dm6w5d5h3 ... 480%29.rar

The thing is, any surrogate character consumes 4 hex symbols, and since every Unicode character from 65536 to 1114111 is a pair of Unicode surrogate characters, each one of them consumes 8 hex symbols!

On the other hand, the ".rar" file above suggests that Unicode characters from 65536 to 1114111 would consume only 5 hex symbols!

So, why in the world did they opt for this surrogate thing instead of using all the possible combinations of 5 hexadecimal symbols?

By the way, what about all the possible combinations of 8 hexadecimal symbols? 16^8 = 4,294,967,296 (approximately 4 billion Unicode characters, easily!)

SOTE · 19 Oct 2018, 03:46

jeeswg wrote: ↑
11 Oct 2018, 16:46
- AutoHotkey uses the Winapi functions for ini handling, which support ANSI or UTF-16 LE.
- So the question becomes: when will Windows become more UTF-8-friendly?
- I suppose that UTF-8 versions of numerous existing functions would be necessary for a UTF-8-friendly Windows. That would be a massive undertaking. (Or maybe just a few new UTF-8 function variants would be needed ... for handling ini files.)
- UTF-16 is generally easier to play about with (e.g. string length measurement is easy), which is why I'm normally not too concerned for more UTF-8 support in Windows. I.e. UTF-16 is good for manipulation and UTF-8 is good for storage for Latin-based languages.

UTF-8 appears to have no issues with many Asian-based languages (CJK characters), in addition to Latin-based ones.

jeeswg · 19 Oct 2018, 08:46

- Here is some code that works out how many bytes each Unicode character requires in each system.
- It appears that for characters above 65535 in both UTF-8 and UTF-16, 4 bytes are required.
- @SOTE: The point I made was that if you want to store Latin-based text, using UTF-8 will generally take up less space than UTF-16, because in UTF-8, ASCII characters require 1 byte, whereas in UTF-16, ASCII characters require 2 bytes.
- UTF-16 can only use characters 0 to 1114111; however, UTF-8 could be extended beyond 1114111 in future, if desired.
- @User: UTF-32 gives a simple system where each character is stored using 4 bytes. Four bytes could hold 256**4 = 4294967296 values, although Unicode itself currently stops at 1114111.
- I would count sizes using bytes rather than hex symbols, as this is easier. (Note: you get 2 hex symbols (nibbles) per 1 byte.)

Code: Select all

;UTF-16
;2 0-65535
;4 65536-1114111

;UTF-8
;1 0-127 (ASCII)
;2 128-2047
;3 2048-65535
;4 65536-1114111

;256**2 = 65536
;Unicode: 1114112
;256**4 = 4294967296

q:: ;UTF-16 size
vOutput := "", vSizeLast := 0
Loop, 1114111
{
	vSize := StrLen(Chr(A_Index)) * 2 ;StrLen counts UTF-16 code units (2 bytes each)
	if !(vSize = vSizeLast) ;only log the code points where the size changes
		vOutput .= vSize " " A_Index "`r`n"
	vSizeLast := vSize
}
Clipboard := vOutput
MsgBox, % vOutput
return

w:: ;UTF-8 size
vOutput := "", vSizeLast := 0
Loop, 1114111
{
	vSize := StrPut(Chr(A_Index), "UTF-8") - 1 ;StrPut's result includes the null terminator, hence - 1
	if !(vSize = vSizeLast) ;only log the code points where the size changes
		vOutput .= vSize " " A_Index "`r`n"
	vSizeLast := vSize
}
Clipboard := vOutput
MsgBox, % vOutput
return
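The ranges in the comments at the top of the code can also be written as simple lookups, without looping over every character. This is a sketch only; `Utf8Size` and `Utf16Size` are hypothetical helper names, and both assume a code point from 0 to 1114111.

```autohotkey
;Hypothetical helper: bytes needed to store one code point in UTF-8.
Utf8Size(vCode)
{
	if (vCode < 128) ;0-127 (ASCII)
		return 1
	if (vCode < 2048) ;128-2047
		return 2
	if (vCode < 65536) ;2048-65535
		return 3
	return 4 ;65536-1114111
}

;Hypothetical helper: bytes needed to store one code point in UTF-16.
Utf16Size(vCode)
{
	return (vCode < 65536) ? 2 : 4 ;above 65535 a surrogate pair is needed
}
```

For example, `Utf8Size(228)` (the character ä) is 2 and `Utf16Size(228)` is also 2, whereas `Utf8Size(65)` is 1 and `Utf16Size(65)` is 2.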

- Btw I didn't find the quote below too clear. In both UTF-8 and UTF-16 the range 65536-1114111 uses 4 bytes (8 hex symbols).
- For me the advantage of UTF-8 is that you get smaller text files for Latin-based text, avoid using the null character, and can use UTF-8 strings where ANSI strings are supported (e.g. AHK's IniRead/IniWrite and the Winapi's OutputDebugString).
- And the advantage of UTF-16 is simpler text manipulations when using programming languages, and you get smaller text files for characters in the range 2048-65535.
- For people regularly using characters above 65535, UTF-32 might be easier.

On the other hand, the ".rar" file above suggests that Unicode characters from 65536 to 1114111 would consume only 5 hex symbols!

So, why in the world did they opt for this surrogate thing instead of using all the possible combinations of 5 hexadecimal symbols?

User · 19 Oct 2018, 10:04

jeeswg wrote: ↑
19 Oct 2018, 08:46
- Btw I didn't find the quote below too clear. In both UTF-8 and UTF-16 the range 65536-1114111 uses 4 bytes (8 hex symbols).

Unicode currently uses groups of 4 hex symbols, so we have 16^4 = 65536 characters!

Since the above is limited to 65536 characters, the Unicode surrogate pair system was invented:
Each character from 65536 to 1114111 consumes 4 bytes (8 hex symbols = a pair of 2 Unicode surrogate characters)

With a "Unicode Base 5" (groups of 5 hex symbols), we would have 16^5 = 1048576 characters!
So, it means that Unicode characters from 65536 to 1048575 would consume only 5 hex symbols = 2.5 bytes

Let's imagine that in the future we will need more than 1048576 characters, so:
With a "Unicode Base 10" (groups of 10 hex symbols), we would have 16^10 = 1,099,511,627,776 characters! (approximately 1 trillion)
So, it would mean that in a "Unicode Base 10" system, a character would consume at most 5 bytes = a group of 10 hex symbols!

How many bytes or hex symbols do you think would be necessary to represent Unicode character "1 trillion" in the "Unicode surrogate pair system"?
