StrReplace Chinese characters help

robinson · 13 May 2024, 16:57

Hi, noob here.
I'm editing subtitle (.srt) files
and I'm finding them looking a little cramped where double quotation marks are beside Chinese characters.
How can I use two lines of StrReplace()
to replace [“][Chinese character] with [space][“][Chinese character]
and then [Chinese character][”] with [Chinese character][”][space] ?
Every instance at once, I mean.
Thanks!

Seven0528 · 13 May 2024, 18:08

Code: Select all

haystack := "“你好”`r`n“祝你有个美好的一天”"
msgbox A_Clipboard := editSubtitles(haystack)

editSubtitles(haystack)    {
    newStr := haystack
    newStr := regExReplace(newStr, "“(?=\p{Han})", " ${0}")
    newStr := regExReplace(newStr, "(?<=\p{Han})”", "${0} ")
    return newStr
}

robinson · 15 May 2024, 18:02

@Seven0528
Thanks!
Are there any other ways to specify a Chinese character?
Also what's the difference between ${0} and $1 ?
And the difference between \p{Han} and ?=\p{Han} ?

Seven0528 · 15 May 2024, 18:58

robinson wrote: ↑
15 May 2024, 18:02
Are there any other ways to specify a Chinese character?

Yes, there are other methods, but they all come down to specifying Unicode ranges directly.
The most widely known range is the CJK Unified Ideographs block, which spans U+4E00 to U+9FFF.
As a regex character class, that would be something like [一-鿿].
There are further blocks, such as CJK Unified Ideographs Extension A, that hold additional characters.
Add to that characters outside those blocks, like U+3007 (〇), and pinpointing Chinese characters by range becomes quite a challenging task.
I once tried to achieve this by examining the entire Unicode chart, but it wasn't easy.
Moreover, Chinese characters are not only used in China but also in Japan and Korea, so specifying only those used in China would require a lot of effort.
Distinguishing between simplified and traditional characters adds even more complexity.
Personally, I think \p{Han} is sufficient. Since it can be utilized in AHK regular expressions implemented with PCRE, it's quite handy.
For more detailed information, please refer to the document below.
How to detect Chinese characters with punctuation in regex?
[〇一-鿿㐀-䶿豈-﫿𠀀-𪛟𪜀-𫜿𫝀-𫠟丽-𯨟⼀-⿕⺀-⻳＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､　、〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·]*[！？｡。][」﹂”』’》）］｝〕〗〙〛〉】]*
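
To make the trade-off concrete, here is a minimal sketch comparing the basic range with \p{Han}. It assumes AutoHotkey v2; the variable names are only illustrative, and the last result assumes the PCRE build in AHK lists U+3007 under the Han script.

```autohotkey
#Requires AutoHotkey v2.0

; Build the class from code points so the script does not depend on file encoding.
; This is the same class as [一-鿿] (U+4E00 to U+9FFF).
basicRange := "[" . Chr(0x4E00) . "-" . Chr(0x9FFF) . "]"

ni   := Chr(0x4F60)  ; 你, inside the basic block
zero := Chr(0x3007)  ; 〇, outside the basic block

; RegExMatch returns the position of the first match, or 0 if there is none.
rangeNi   := RegExMatch(ni, basicRange)    ; found
rangeZero := RegExMatch(zero, basicRange)  ; not found: 〇 is not in U+4E00..U+9FFF
hanNi     := RegExMatch(ni, "\p{Han}")     ; found
hanZero   := RegExMatch(zero, "\p{Han}")   ; expected to be found (assumption noted above)

report := "range vs 你: " . rangeNi . "`n"
report .= "range vs 〇: " . rangeZero . "`n"
report .= "\p{Han} vs 你: " . hanNi . "`n"
report .= "\p{Han} vs 〇: " . hanZero
MsgBox report
```

A hand-written range has to be extended block by block, while \p{Han} leaves that bookkeeping to the regex engine.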

robinson wrote: ↑
15 May 2024, 18:02
Also, what's the difference between ${0} and $1 ?

${0} refers to the text matched by the entire pattern, while $1 (or ${1}) refers to the text matched by the first subpattern.
For single-digit numbers the curly braces are merely a stylistic choice, but they are required once the number has two or more digits, as in ${10}.
A subpattern usually refers to a part of the pattern enclosed in parentheses.
(The parentheses in my patterns, (?=...) and (?<=...), are lookaround assertions rather than capturing subpatterns, so there is no $1 there; that is why I used ${0}.)
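
A small sketch of the backreference forms, assuming AutoHotkey v2. The pattern with a capturing group is made up for illustration and is not the one from the solution above.

```autohotkey
#Requires AutoHotkey v2.0

line := "“你好”"
pattern := "“(\p{Han})"  ; one capturing subpattern around the Han character

; $0 is everything the pattern matched: the quotation mark plus the character.
whole := RegExReplace(line, pattern, "[$0]")

; $1 is only what the first subpattern captured: the character alone.
first := RegExReplace(line, pattern, "[$1]")

; ${1} is the same reference as $1, written with braces.
braced := RegExReplace(line, pattern, "[${1}]")

MsgBox "$0: " . whole . "`n$1: " . first . "`n${1}: " . braced
```

Here $0 should give [“你]好” and both $1 forms should give [你]好”, since the opening quotation mark is matched but not captured.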

robinson wrote: ↑
15 May 2024, 18:02
And the difference between \p{Han} and ?=\p{Han} ?

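\p{Han} on its own matches a Chinese character and makes it part of the match, whereas (?=\p{Han}) and (?<=\p{Han}) only check that a Chinese character is next to the current position without including it in the match.
Either form would work for this replacement; the reason for using assertions instead of directly matching the characters is efficiency.
Regular expressions typically evaluate patterns from left to right, sequentially checking for matches.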

Code: Select all

“祝你有个美好的一天”

In this string, the pattern (?<=\p{Han})” is simply searching for the character ”, but with the condition that there must be a preceding Chinese character. This process requires only 5 steps.
If the pattern \p{Han}” were used instead, it would find all Chinese characters and then check if there is a ” following them. This would require 20 steps.

Code: Select all

“发布这一世界人权宣言，作为所有人民和所有国家努力实现的共同标准，以期每一个人和社会机构经常铭念本宣言，努力通过教诲和教育促进对权利和自由的尊重，并通过国家的和国际的渐进措施，使这些权利和自由在各会员国本身人民及在其管辖下领土的人民中得到普遍和有效的承认和遵行”

Efficiency gains become more significant as the length of the text increases, especially in longer texts composed primarily of Chinese characters.
For instance, in a longer text such as the one provided, the pattern \p{Han}” would need to perform 255 steps because it has to examine all Chinese characters. On the other hand, (?<=\p{Han})” still only requires 5 steps because it only needs to find the character ” and then check if there is a Chinese character before it.
Using appropriate assertions like this can significantly enhance the efficiency of regular expressions.
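
The following sketch shows what each pattern actually matches in the short example, assuming AutoHotkey v2. The variable names are illustrative, and it does not measure step counts.

```autohotkey
#Requires AutoHotkey v2.0

line := "“祝你有个美好的一天”"

; Lookbehind: the Han character is checked but stays outside the match.
RegExMatch(line, "(?<=\p{Han})”", &viaAssertion)

; Plain pattern: the Han character is consumed and becomes part of the match.
RegExMatch(line, "\p{Han}”", &viaConsume)

; ${0} is the whole match, so both replacements produce the same text here.
resultA := RegExReplace(line, "(?<=\p{Han})”", "${0} ")
resultB := RegExReplace(line, "\p{Han}”", "${0} ")

report := "Assertion matched: " . viaAssertion[0] . "`n"
report .= "Plain pattern matched: " . viaConsume[0] . "`n"
report .= "Same replacement result: " . (resultA == resultB ? "yes" : "no")
MsgBox report
```

The assertion version matches only ”, while the plain version matches 天”. The output is identical in this case, so the choice between them is about how much work the engine does, not about the result.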

robinson · 16 May 2024, 13:42

@Seven0528
Wow that's amazing man, thanks!
