Unidecode port for AHK

Post your working scripts, libraries and tools for AHK v1.1 and older
haichen
Posts: 631
Joined: 09 Feb 2014, 08:24

15 Jun 2015, 11:54

What is it for:
It tries to transliterate Unicode characters to ASCII.
Why:
When I saw the topic about removing letter accents, I remembered that I had done something similar. I needed to transliterate names to ASCII: Swedish names, Russian names and others.
Credits:
After searching the web I found a Perl module named Unidecode. Mr. Burke did the whole transliteration of Unicode symbols to ASCII.
Not all of them, of course, but a lot.

What I did:
I took his translation files and merged them into one big text file of ~400 kB.
You can load this file into an array and then easily translate a lot of Unicode characters to ASCII.
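
As a rough sketch of how such a table can be addressed: the scripts in this post write one line per .pm file and 256 entries per line, so a code point splits into a line number and a position within that line. TablePosition() is a hypothetical helper for illustration only and is not part of the script.

```autohotkey
; Hypothetical helper for illustration; not part of the original script.
; Maps a code point in the range 0..65535 to its position in unidecode.tbl.
TablePosition(codePoint) {
	block  := codePoint >> 8     ; table line, counted from 0 (one line per .pm file)
	offset := codePoint & 0xFF   ; entry within that line, counted from 0
	return {block: block, offset: offset}
}

pos := TablePosition(0x20AC)   ; U+20AC EURO SIGN
MsgBox % "block " pos.block ", offset " pos.offset   ; block 32, offset 172
```

The flat array index used by the loader further down is then simply block*256 + offset, which is the code point itself.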


First you have to download his files and extract them to a directory. You can delete everything except the unidecode directory.
Put the following script in the directory above it and run it to create the text file unidecode.tbl.

Code:

makeUniDecodetablefile()
return

makeUniDecodetablefile(pathToUnidecodeDir="unidecode",tablename="unidecode.tbl"){
	b:=[]
	i:=0
	Loop, Files, %pathToUnidecodeDir%\*.pm
	{
		FileRead, OutputVar, %A_LoopFileFullPath%
		index:= PerlfilePMToVar(OutputVar)
		SetFormat, IntegerFast, d
		i := index +1
		;for debugging
		;c:=count(OutputVar)
		;msgbox, % i " " index ", " c ", "OutputVar
		b[i]:=OutputVar
	}
	for i, element in b
		FileAppend , %element% `, `n,%tablename%
}



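; Strips the Perl comments from the contents of one .pm file, leaves the file's
; table entries in haystack (ByRef) and returns the block index captured from
; the "[...] = [" line of that file.
PerlfilePMToVar(ByRef haystack){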
	result:=haystack
	Pattern1:= "i),\s+(#.*?`n)"
	Pattern2:= "i)(#\s+BLOCK.*?`n)"
	pos:=1
	while pos
	{
		pos := RegExMatch(Haystack, pattern1, match, pos + strlen(match))
		result:=strReplace(result,match1,"")
	}
	pos:=1
	
	while pos
	{
		pos := RegExMatch(Haystack, pattern2, matcher, pos + strlen(matcher))
		result:=strReplace(result,matcher1,"")
	}
	match2:=""
	result:=strReplace(result,"`n")
	Pattern:= "i)\[(.*)\]\s+=\s+\[(.*),\]"
	Pattern2:= "i)\[(.*)\]\s+=\s+Text"
	if !InStr(result,"make_placeholder_map")
		pos := RegExMatch(result, pattern, match)
	else
	{
		pos := RegExMatch(result, pattern2, match)
		loop,255
			match2 .= """"","
		match2 .= """"""
	}
	
	match2:=strReplace(match2,"`n")
	haystack:=match2
	return match1
}



count(Text){
	; only for debugging
	count:=0
	Text:= strreplace(text,"qq{,,}","qq{zweikomma}")
	text:= strreplace(Text,"qq{,}","qq{einkomma}")
	text:= strreplace(Text,"qq{, }","qq{dreikomma}")
	Loop, Parse,Text,`,
	{
		x:=trim(A_LoopField)
		if (x = "qq{zweikomma}")
			count++
		else if (x = "qq{einkomma}")
			count++
		else if  (x = "qq{dreikomma}")
			count++
		else if (x = """""")
			count++
		else if  (InStr(x,"qq{"))
			count++
		else
			count++
		
	}
	return count
}
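
The comma handling in count() can be isolated as follows. Some table entries contain commas themselves (written as qq{,}, qq{,,} or qq{, }), so they are swapped for placeholders before the line is split on commas. This is only a sketch: SplitTableLine() is a hypothetical name, and Push() needs AHK v1.1.21 or later.

```autohotkey
; Hypothetical helper for illustration; mirrors the placeholder trick used by count().
SplitTableLine(line) {
	; Shield the entries that contain commas before splitting.
	line := StrReplace(line, "qq{,,}", "qq{zweikomma}")
	line := StrReplace(line, "qq{,}", "qq{einkomma}")
	line := StrReplace(line, "qq{, }", "qq{dreikomma}")
	fields := []
	Loop, Parse, line, `,
		fields.Push(Trim(A_LoopField))
	return fields
}

fields := SplitTableLine("""a"", qq{,}, ""b""")   ; the line is: "a", qq{,}, "b"
MsgBox % fields.Length()   ; 3, the comma inside qq{,} did not split the entry
```
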
And here is an example of how to use it:
(put the file unidecode.tbl in the same directory)
Try it in the Chinese forum. I don't know whether the ASCII output means anything. :D

Code:

;text= €€€€€@@@ßäÄüÜö
;msgbox, % unidecode(text)

!^u::msgbox, % unidecode(clipboard,"äÄ")

return

unidecode(text, donotdecode=""){
	static a
	Transform, text, HTML, %text% ,2
	if (donotdecode<>"")
		Loop, Parse,donotdecode
		{
			Transform, dn, HTML, %A_Loopfield% ,2
			text := strReplace(text,dn,A_Loopfield)
		}
	
	u:=getDecUnicode(text)
	
	Sort u, N D, U
	if !(a.length()=65536){
		a:=[]
		FileRead, tbl, unidecode.tbl
		a:=unidecodeTable2Array(tbl,a)
	}
	if !(a.length()=65536){
		msgbox, % "Error loading unidecode.tbl. Array length is " a.length() " instead of 65536."
		exitapp
	}
	
	Loop, Parse,u,`,
		text := strReplace(text,"&#" . A_Loopfield ";", a[A_Loopfield])
	return text
}

unidecodeTable2Array(Text,array){
	loop, parse, Text,`n
	{
		usatz:= strreplace(A_LoopField,"qq{,,}","qq{zweikomma}")
		usatz:= strreplace(usatz,"qq{,}","qq{einkomma}")
		usatz:= strreplace(usatz,"qq{, }","qq{dreikomma}")
		
		i:=(a_index-1)*256
		;msgbox, % "i" i
		index:=0+i
		Loop, Parse,usatz,`,
		{
			
			x:=trim(A_LoopField)
			if (x = "qq{zweikomma}")
				array[Index]:=",,"
			else if (x = "qq{einkomma}")
				array[Index]:=","
			else if  (x = "qq{dreikomma}")
				array[Index]:=", "
			else if (x = """""")
				array[Index]:=""
			else if  (InStr(x,"qq{"))
				array[Index]:=substr(x,4,strlen(x)-4)
			else
				array[Index]:=substr(x,2,strlen(x)-2)
			index++
		}
	}
	return array
}


getDecUnicode(haystack){
	Pattern:= "i)&#(\d+)?;"
	pos:=1
	while pos
	{
		pos := RegExMatch(Haystack, pattern, match, pos + strlen(match))
		if (result ="") and (match1<>"")
			result := match1
		else if (match1<>"")
			result .= "," . match1
	}
	return result
}
Edit: the unidecode.tbl file is now loaded only once.
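
The lookup in unidecode() depends on Transform's HTML mode with flag 2, which turns non-ASCII characters into numbered expressions; the AutoHotkey documentation gives € becoming &#8364; as its example. The following is a minimal sketch of that step on its own, assuming a Unicode build of AHK v1.1. The variable names are made up for illustration.

```autohotkey
; Sketch for a Unicode build of AutoHotkey v1.1.
sample := "5 " Chr(0x20AC)              ; "5 €", built with Chr() so the script's file encoding does not matter
Transform, encoded, HTML, %sample%, 2   ; flag 2: numbered expressions for non-ASCII characters
MsgBox % encoded

; Pull the decimal code points back out, much as getDecUnicode() does.
pos := 1
while pos := RegExMatch(encoded, "&#(\d+);", m, pos)
{
	MsgBox % "code point " m1
	pos += StrLen(m)
}
```

The same documentation states that <, >, ", & and line feeds are always converted in this mode, so text containing those characters may come back with entities in it.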

⠁⠥⠞⠕⠓⠕⠞⠅⠑⠽ makes it possible!
haichen

Re: Unidecode port for AHK

17 Jun 2015, 10:37

Some examples.

Unicode Input:
François Truffaut
András Faragó B ä ß a € äöüß ÄÖÜ søgning
Ascii-Output:
Francois Truffaut
Andras Farago B a ss a EUR aouss AOU sogning

Unicode Input:
⠁⠥⠞⠕⠓⠕⠞⠅⠑⠽
Ascii-Output:
autohotkey

Unicode Input:
Русский (Russian)
Ascii-Output:
Russkii (Russian)

Unicode Input:
中文 (Chinese)
这里中文用户可以用自己熟悉的语言交流和分享(包括简
Ascii-Output:
Zhong Wen (Chinese)
Zhe Li Zhong Wen Yong Hu Ke Yi Yong Zi Ji Shou Xi De Yu Yan Jiao Liu He Fen Xiang (Bao Gua Jian

Unicode Input:
ÁáÀàÂâǍǎĂăÃãẢảẠạÄäÅåĀāĄąẤấẦầẪẫẨẩẬậẮắẰằẴẵẲẳẶặǺǻĆćĈĉČčĊċÇçĎďĐđÐÉéÈèÊêĚěĔĕẼẽẺẻĖėËëĒēĘęẾếỀềỄễỂểẸẹỆệĞğĜĝĠġĢģĤĥĦħÍíÌìĬĭÎîǏǐÏïĨĩĮįĪīỈỉỊịĴĵĶķĹĺĽľĻļŁłĿŀŃńŇňÑñŅņÓóÒòŎŏÔôỐốỒồỖỗỔổǑǒÖöŐőÕõØøǾǿŌōỎỏƠơỚớỜờỠỡỞởỢợỌọỘộṔṕṖṗŔŕŘřŖŗŚśŜŝŠšŞşŤťŢţŦŧÚúÙùŬŭÛûǓǔŮůÜüǗǘǛǜǙǚǕǖŰűŨũŲųŪūỦủƯưỨứỪừỮữỬửỰựỤụẂẃẀẁŴŵẄẅÝýỲỳŶŷŸÿỸỹỶỷỴỵŹźŽžŻż
Ascii-Output:
AaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaCcCcCcCcCcDdDdDEeEeEeEeEeEeEeEeEeEeEeEeEeEeEeEeEeGgGgGgGgHhHhIiIiIiIiIiIiIiIiIiIiIiJjKkLlLlLlLlLlNnNnNnNnOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoPpPpRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuUuUuUuUuUuUuUuUuUuUuUuUuUuUuUuUuUuWwWwWwWwYyYyYyYyYyYyYyZzZzZz
haichen

Re: Unidecode port for AHK

21 Feb 2016, 10:05

You can now use unidecode() via #include.
I put the transliteration tables and the function into one file.
I now also use just me's RegExMatchGlobal().

download unidecode() source (236 KB).

This example uses it to paste the clipboard contents as plain ASCII text without Unicode characters:

Code:

#SingleInstance ignore
;@Ahk2Exe-SetName CopyPlainText
;@Ahk2Exe-SetDescription CTRL+SHIFT+V releases Clipboard as ascii-plaintext
;@Ahk2Exe-SetVersion 1.0
;@Ahk2Exe-SetOrigFilename CopyPlainText.ahk
;@Ahk2Exe-SetMainIcon CopyPlainText.ico
;mark a text with unicode chars and use CTRL+C and CTRL+SHIFT+V (Deutsch STRG+SHIFT+V) to get an ascii representation of the string
; I use fincs Ahk2Exe (https://autohotkey.com/boards/viewtopic.php?f=24&t=521) as compiler but it should work without it.

Tiptext = 
(
STRG+SHIFT+V
changes the clipboard to plaintext
)	

menutitle:="STRG+SHIFT+V - Plain ascii text"
Menu, tray,add,  info
Menu, tray,Disable,  info
Menu, tray,Rename, info , %menutitle%
Menu, tray, add  
Menu, Tray, Tip, %Tiptext%
Menu, tray, add, Exit
Menu, tray, default, Exit
Menu, tray, NoStandard

Unidecode("init")

^+v:: 
Clipboard:= Unidecode(Clipboard,"Ä ä Ö ö Ü ü ß ¥Yen €Eur „"" ")
; "Ä ä Ö ö Ü ü ß ¥Yen €Eur „"" " a typical german setting :-)
send, ^v
return

return

Exit:
exitapp

info:
return


#include unidecode.ahk

Edit: Download path corrected
Last edited by haichen on 25 Feb 2018, 10:27, edited 1 time in total.
haichen

Re: Unidecode port for AHK

23 Feb 2016, 12:40

I found some small errors and fixed them.
For those who want to give it a try, I put the compiled example in my Dropbox.

CopyPlainText.exe
You can copy some text with CTRL+C (STRG+C) and paste it with CTRL+SHIFT+V (STRG+SHIFT+V) to get rid of all non-ASCII characters (in this example the German umlauts remain).

I also uploaded the CopyPlainText.exe to Virustotal:
SHA256: 7475ada1d1edef8e3f910c262b420252f8581d18658257a4f124189ba5fadad3
File name: CopyPlainText.exe
Detection ratio: 0 / 55
Qriist
Posts: 82
Joined: 11 Sep 2016, 04:02

Re: Unidecode port for AHK

17 Feb 2018, 04:12

Would you mind reposting this? None of the links work anymore.
haichenatwork

Re: Unidecode port for AHK

21 Feb 2018, 03:27

The path changed from:
https://github.com/haichen/unidecode/bl ... decode.ahk
to:
https://github.com/haichen/unidecode/ra ... decode.ahk
I will correct this in the next few days.

I also ran a short test. It seems to work with the latest AHK, 1.1.28.

Hopefully this helps
haichen
haichen

Re: Unidecode port for AHK

25 Feb 2018, 10:40

Sorry! I was in a hurry and took the wrong URL.
Next try:
https://github.com/haichen
https://raw.githubusercontent.com/haich ... decode.ahk

Put both files (CopyPlainText.ahk and unidecode.ahk) in the same directory and run CopyPlainText.ahk.
