
filter non English lines

Posted: 14 Jun 2018, 10:24
by DataLife
Attachment: NetworkList.PNG
I am trying to filter out all lines that contain non-English words. The code below still lets some of them through. How can I get rid of the remaining non-English lines?

Code:

Loop 50
{
	FileReadLine, var, textfile.txt, %A_Index% ; dynamic contents
	; Pattern from https://autohotkey.com/board/topic/149454-how-do-i-identify-non-english-letters-in-a-string/#entry732502
	; It matches as soon as the line contains at least one of these characters anywhere.
	FoundPos := RegExMatch(var, "[a-zA-Z0-9,.!?]")
	if FoundPos <> 0
		List = %List%`n%var%
}
MsgBox %List%

Re: filter non English lines  Topic is solved

Posted: 14 Jun 2018, 12:46
by Helgef
Hello, maybe,

Code:

filterNonEnglish(str){
	return regexreplace(trim(regexreplace(str, "`nm)^.*[^[:ascii:]].*$"),  "`n"), "\R{2,}",  "`n")
}
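
To see what the function does, here is a minimal usage sketch. It assumes a Unicode build of AutoHotkey v1.1. The sample network names are made up, and Chr(233) stands in for "é" so the result does not depend on how the script file is encoded. The function body is repeated from the post above so the snippet runs on its own.

```autohotkey
; Hypothetical sample: three network names, the second contains "é" (code 233).
sample := "HomeNet`nCaf" . Chr(233) . " WiFi`nOffice-5G"

; Expected to show HomeNet and Office-5G on separate lines.
MsgBox % filterNonEnglish(sample)

; Function copied unchanged from the post above.
filterNonEnglish(str){
	return regexreplace(trim(regexreplace(str, "`nm)^.*[^[:ascii:]].*$"),  "`n"), "\R{2,}",  "`n")
}
```

The inner RegExReplace blanks every line that contains at least one character above code 127, Trim removes leading and trailing line feeds, and the outer call collapses the leftover empty lines. Lines that were already empty in the input are collapsed as well.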

Re: filter non English lines

Posted: 14 Jun 2018, 13:55
by DataLife
Helgef wrote:Hello, maybe,

Code:

filterNonEnglish(str){
	return strreplace(regexreplace(str, "`nm)^.*[^[:ascii:]].*$"), "`n`n")
}
That appears to work perfectly. I will know for sure when my user in Sweden is able to run it on his computer.

Regex looks like magic to me.

Thank you very much
DataLife
Attachment: English only variables.PNG

Re: filter non English lines

Posted: 14 Jun 2018, 14:25
by Helgef
:thumbup:
I edited it and added trim.

Cheers.

Edit: It doesn't work :thumbdown:
Edit2: I think it works now :thumbup: .

Re: filter non English lines

Posted: 15 Jun 2018, 14:10
by DataLife
Yes, it works, thanks very much

Re: filter non English lines

Posted: 16 Jun 2018, 03:38
by Helgef
DataLife wrote:my user in Sweden
pcre.txt wrote:ascii character codes 0 - 127
Wikipedia wrote:The Swedish alphabet is the writing system used for the Swedish language. The 29 letters of this alphabet are the modern 26-letter basic Latin alphabet ('A' through 'Z') plus 'Å', 'Ä', and 'Ö'
Wikipedia - Å wrote:Unicode 197
Maybe this is more appropriate if the user uses any of 'Å', 'Ä', or 'Ö':

Code:

removeNonSwedishLines(str){
	return regexreplace(trim(regexreplace(str, "`nm)^.*[^\x{0}-\x{ff}].*$"),  "`n"), "\R{2,}",  "`n")
}
I do not know, just guessing.

Cheers.
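
The difference between the two character classes can be checked with a small sketch. It assumes a Unicode build of AutoHotkey v1.1, and the sample names are hypothetical. Characters are built with Chr() so the script's file encoding does not matter.

```autohotkey
; Hypothetical sample names.
swedish := "Sj" . Chr(0xF6) . "stad"   ; "Sjöstad"; ö is code 246
chinese := Chr(0x4E2D) . Chr(0x6587)   ; two CJK characters, both above code 255

asciiOnly := "[^[:ascii:]]"            ; matches any character above code 127
latin1    := "[^\x{0}-\x{ff}]"         ; matches any character above code 255

; Expected: rejected, kept, rejected.
MsgBox % "ASCII test on Swedish name: " . (RegExMatch(swedish, asciiOnly) ? "rejected" : "kept") . "`n"
	. "0-255 test on Swedish name: " . (RegExMatch(swedish, latin1) ? "rejected" : "kept") . "`n"
	. "0-255 test on Chinese name: " . (RegExMatch(chinese, latin1) ? "rejected" : "kept")
```

The wider range keeps 'Å' (197), 'Ä' (196), and 'Ö' (214) together with their lowercase forms, but it also keeps every other character up to code 255, so it is broader than "Swedish only".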

Re: filter non English lines

Posted: 16 Jun 2018, 18:04
by DataLife
Yes, I was concerned about that, but I am not able to test until he comes back from vacation. Your changes appear to fix this issue before it even occurs.

A_Language returns 0409 on his system. I don't know how all that works, but if his language code is 0409, does that mean that he would only be using English characters?
Thanks for your help
DataLife

Re: filter non English lines

Posted: 17 Jun 2018, 05:50
by jeeswg
For something like this I would create a list of all the strings. I would then build a list of all the unique characters that appear in it and decide, character by character, whether to allow each one. I would assess each character manually, and might then write a RegEx line based on my conclusions.
E.g. see 'LIST EVERY CHARACTER THAT APPEARS IN A STRING' in:
jeeswg's characters tutorial - AutoHotkey Community
https://autohotkey.com/boards/viewtopic.php?f=7&t=26486
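
A minimal sketch of that idea, not the code from the linked tutorial. It assumes a Unicode build of AutoHotkey v1.1.21 or later (for Ord), and the sample list is made up.

```autohotkey
; Hypothetical sample list of strings, one per line.
list := "HomeNet`nCaf" . Chr(233) . " WiFi`nOffice-5G"

seen := {}      ; character codes that were already reported
report := ""
Loop, Parse, list
{
	code := Ord(A_LoopField)
	; Skip line breaks and characters that were already listed.
	if (code = 10 || code = 13 || seen.HasKey(code))
		continue
	seen[code] := true
	report .= A_LoopField . "`t" . code . "`n"
}
MsgBox % report   ; one line per unique character: the character, a tab, its code
```

The codes are used as keys instead of the characters themselves because string keys in AutoHotkey v1 objects are case-insensitive, so "a" and "A" would otherwise be merged.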

Re: filter non English lines

Posted: 17 Jun 2018, 06:09
by Helgef
0x0409 is English (US).
DataLife wrote:does that mean that he would only be using English characters?
Not really. Even if the system language is English, the user may set his keyboard layout to any language, e.g. Swedish, and he may join a network with a non-English name.

Cheers.