Tutorial: An AHK Introduction to RegEx



Poll: Did you find this tutorial helpful? (45 member(s) have cast votes)

  1. Yes, I found it helpful. (48 votes [90.57%])
  2. No, it wasn't helpful. (2 votes [3.77%])
  3. Who taught you how to write, e.e. cummings? (3 votes [5.66%])

Sean
  • Members
  • 2462 posts
  • Last active: Feb 07 2012 04:00 AM
  • Joined: 12 Feb 2007
I don't use RegEx frequently, because it's a pain to work out how to make it perform faster (and especially fail faster), but sometimes it becomes a good, or unavoidable, challenge. Recently I got curious about what RegEx would correspond to AHK's Loop, Parse, Haystack, CSV, and I'd like to share the challenge with the members.

So, my challenge is to parse CSV using RegEx.

In case it helps: I used features that are not documented in AHK's help file.

sinkfaze
  • Moderators
  • 6367 posts
  • Last active: Nov 30 2018 08:50 PM
  • Joined: 18 Mar 2008
@ Sean

I think I've found it, stealing your matching technique and jaco0646's RegEx while loop:

; The contents of my CSVList.csv file are as follows:
; 1,2,3,4,5
; this,that,and,the,other
; a,b,c,d,e
; why,won't,this,just,parse
; a1,b2,c3,d4,e5
; )(*@,)@&%,)!$&,(@!^,(#&%

Pos = 1
Loop, read, CSVList.csv
{

	While Pos := RegExMatch(A_LoopReadLine,"(?P<String>[^`,]+`,?)",Sub,Pos+StrLen(Sub))
	{
		Sub0++
		FileAppend % RegExReplace(SubString,"([^`,]+)`,?","$1")"`n", Output.txt
	}
	FileAppend, `nDone!`n`n, Output.txt
	Sub0 = 0
	Sub = 0

}

I don't know if that's the optimal solution, seems a tad slow for the small amount of work to do but it gets the job done.

sinkfaze
  • Moderators
  • 6367 posts
  • Last active: Nov 30 2018 08:50 PM
  • Joined: 18 Mar 2008
Well, the RegEx parsing loop is slower, but not by nearly as much as I thought it would be:

QPC()
Pos = 1
Loop, read, CSVList.csv
{

	While Pos := RegExMatch(A_LoopReadLine,"(?P<String>[^`,]+`,?)",Sub,Pos+StrLen(Sub))
	{
		Sub0++
		FileAppend % RegExReplace(SubString,"([^`,]+)`,?","$1")"`n", Output.txt
	}
	FileAppend, `nDone!`n`n, Output.txt
	Sub0 = 0
	Sub = 0

}
T1 := QPC()
QPC()
Loop, read, CSVList.csv
{
	
	Loop, parse, A_LoopReadLine, `,
	{
		FileAppend % "" A_LoopField "`n", Output2.txt
	}
	FileAppend, `nDone!`n`n, Output2.txt
	
}
T2 := QPC()
MsgBox % "RegEx parsing loop: " T1 " seconds`nRegular parsing loop: " T2 " seconds"
FileAppend % "" T1 "`," T2 "`n", ParseTimes.txt

QPC() {
   Static Freq, LastCount
   If !Freq
      DllCall("QueryPerformanceFrequency", "Int64*", Freq)
   DllCall("QueryPerformanceCounter", "Int64*", Count)
   Return (Count-LastCount)/Freq, LastCount:=Count
}

Averaged over 50 runs (excluding aberrant results):

Regular parsing loop: 0.003374 sec
RegEx parsing loop: 0.003942 sec



Sean
  • Members
  • 2462 posts
  • Last active: Feb 07 2012 04:00 AM
  • Joined: 12 Feb 2007

RegExMatch(A_LoopReadLine,"(?P<String>[^`,]+`,?)",Sub,Pos+StrLen(Sub))

One thing first: you don't have to use `, inside an expression; a plain , is fine. BTW, I linked to the wiki page on CSV to make clear what CSV is, as I also got it wrong at first.
Haystack="We've got ""123,456,789""!",123,456,789
Oh, one more thing. I didn't mean for worked-out answers to be posted; it was mainly meant as motivation to extend knowledge and usage of RegEx beyond what's already documented in AHK's docs.
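
To make the difference concrete, here is a minimal sketch (AHK v1 syntax, as used elsewhere in this thread) that runs the Haystack above through a plain comma-splitting RegEx loop and through the built-in CSV parsing loop. The variable names Naive and Proper are only illustrative, and the simplified needle is an assumption for the demo, not anyone's posted solution.

```autohotkey
; Traditional assignment: everything after "=" is taken literally, quotes included.
Haystack = "We've got ""123,456,789""!",123,456,789

; Naive approach: treat every comma as a delimiter, ignoring the quotes.
Pos := 1, Sub := "", Naive := 0
While Pos := RegExMatch(Haystack, "[^,]+,?", Sub, Pos + StrLen(Sub))
	Naive++

; Built-in CSV parsing loop, which honors quoted fields and doubled quotes.
Proper := 0
Loop, Parse, Haystack, CSV
	Proper++

; The naive loop cuts the quoted field at its embedded commas (6 pieces);
; the CSV loop reports the 4 real fields.
MsgBox % "Naive comma split: " Naive " pieces`nLoop, Parse, ..., CSV: " Proper " fields"
```

The gap between the two counts is what a CSV-aware RegEx has to close.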

rulfzid
  • Members
  • 62 posts
  • Last active: Mar 11 2011 08:31 PM
  • Joined: 27 Nov 2008

Oh, one more thing. I didn't mean for worked-out answers to be posted; it was mainly meant as motivation to extend knowledge and usage of RegEx beyond what's already documented in AHK's docs.


There is already a pretty good implementation of "correct" CSV handling using RegEx here: http://www.autohotke... ... 280#203280

The one issue I have there is that it's a little brittle, depending on strict usage of linefeeds vs carriage returns to indicate the difference between multiline fields and the ends of rows.

Unless somebody else gets to it, I'll post an attempt pretty soon.

rulfzid
  • Members
  • 62 posts
  • Last active: Mar 11 2011 08:31 PM
  • Joined: 27 Nov 2008
So it turns out that in thinking about it more, the way I really wanted to do it was with a parsing loop and state machine. It seemed the easiest way to handle the quotes and multiline stuff correctly. Also, it's probably pretty fast.

At any rate, here's what I've got so far - doesn't have options for handling whitespace, but does everything else correctly (alas, sans regex).

csv = 
(
"We've got ""123,456,789""!",123,456,789
abc,def,ghi,"testing the
""line breaks"" here"
finalrow,,finalfield
)

CSVParse(csv)

CSVParse(csv, d=",", e="""")
{
	inquotes := 0
	row := 1
	col := 1
	prev_char_is_quote := 0
	current_field := ""
	
	; Add final comma (if none exists) to catch the final field
	If (SubStr(csv, 0) != d )
		csv .= d
		
	; Make all linebreaks `n only - easier to handle when parsing
	; one character at a time
	StringReplace, csv, csv, `r`n, `n, All
	
	Loop, parse, csv
	{
		if prev_char_is_quote and (A_LoopField != e)
			prev_char_is_quote := !prev_char_is_quote
		
		; current character is comma or newline
		if (A_LoopField = d) or (A_LoopField = "`n")
		{
			if inquotes ; if quoted, add to current field
				current_field .= A_LoopField
			else ; otherwise, here's where you do stuff with the info:
			{    ; you've got the row #, col #, and field value to work with
				msgbox % "row: " . row . "`tcol: " . col 
					   . "`n--------------------------------`n" 
					   . current_field 
					   . "`n--------------------------------"
				
				; reset current_field and set row/col counters as necessary				
				current_field = 
				if (A_LoopField = d) 
					col++
				else
					col:=1, row++
			}	
		}
		; current character is quote
		else if (A_LoopField = e) ; 
		{
			if inquotes
				prev_char_is_quote := 1
			else 
				if prev_char_is_quote
					current_field .= e
			
			inquotes := !inquotes
		}
		; current character is everything else (incl. tabs and spaces)
		else 
			current_field .= A_LoopField
	}
}


Sean
  • Members
  • 2462 posts
  • Last active: Feb 07 2012 04:00 AM
  • Joined: 12 Feb 2007
Your CSV string is not CSV-Compliant. Anyway, as you posted, here is mine.
sData="We've got ""123,456,789""!",123,456,789
nPos:=1, s:=""
While	nPos := RegExMatch(sData, "\G("")?(?<Field>(?(1)(?:""""|[^""])*|[^,""\s]+))(?(1)"")(?:,|$)", s, nPos+StrLen(s))
	MsgBox % sField := RegExReplace(sField,"""""","""")


rulfzid
  • Members
  • 62 posts
  • Last active: Mar 11 2011 08:31 PM
  • Joined: 27 Nov 2008

Your CSV string is not CSV-Compliant.


It's not? I based it off the multiline examples from the wikipedia page. Oh, nm - it's got a completely empty field, which should be "" according to the standard.

rulfzid
  • Members
  • 62 posts
  • Last active: Mar 11 2011 08:31 PM
  • Joined: 27 Nov 2008

Your CSV string is not CSV-Compliant. Anyway, as you posted, here is mine.

sData="We've got ""123,456,789""!",123,456,789
nPos:=1, s:=""
While	nPos := RegExMatch(sData, "\G("")?(?<Field>(?(1)(?:""""|[^""])*|[^,""\s]+))(?(1)"")(?:,|$)", s, nPos+StrLen(s))
	MsgBox % sField := RegExReplace(sField,"""""","""")


Also, this is very cool.

For those who don't know, the
(?(1)(?:""""|[^""])*|[^,""\s]+)
is called a conditional subpattern, and you can read more about it here

From the pcre docs:

The two possible forms of conditional subpattern are

(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)

If the condition is satisfied, the yes-pattern is used; otherwise the no-pattern (if present) is used. If there are more than two alternatives in the subpattern, a compile-time error occurs.


So breaking the conditional down, we get:

(?(1)(?:""""|[^""])*|[^,""\s]+)

condition:   (1)
yes-pattern: (?:""""|[^""])*
no-pattern:  [^,""\s]+

The conditional asks whether subpattern #1 has been captured (the opening double-quote mark).

If so, it matches any run of doubled double-quotes and characters that are not a double quote.

If not, it matches one or more characters that are not a comma, whitespace, or a quote mark.
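
A small way to watch both branches fire is sketched below (AHK v1 syntax). It assumes an anchored variant of the pattern, with ^ and $ added and the named group dropped, so that each test string has to match as a whole field; the names Needle and Quoted are only illustrative.

```autohotkey
; Anchored variant of the conditional: optional opening quote, then the
; yes- or no-pattern, then a closing quote only if an opening one was captured.
Needle := "^("")?(?(1)(?:""""|[^""])*|[^,""\s]+)(?(1)"")$"

; Traditional assignment keeps the quotes literally: "a, ""b"""
Quoted = "a, ""b"""

MsgBox % RegExMatch(Quoted, Needle)  ; 1: quoted field, yes-pattern allows comma, space and ""
MsgBox % RegExMatch("abc", Needle)   ; 1: unquoted field, no-pattern is used
MsgBox % RegExMatch("a b", Needle)   ; 0: whitespace is not allowed in an unquoted field
```

The second (?(1)"") has no no-pattern, so for an unquoted field it matches nothing and the closing quote is simply not required.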

Sorry if I've gone overkill with the explanation, but I think it's a powerful tool (that I just read/learned about) and thought it warranted some explication.

kiwijunglist
  • Members
  • 61 posts
  • Last active: Aug 14 2013 05:56 AM
  • Joined: 26 May 2009

The tutorial is a bit confusing because it contains BB formatting.

 

Can someone help me with this expression

 

Str := "isfeniefsiefseifsefsiefnefs<name>first.name</name>ssefiefhseiufhe<name>second.name</name>sifjsidfsdifsdfjfd<name>third.name</name>sudnfsidfnsdiufsdunf"

 

Loop

{

Msgbox % A_Index "=" RegExMatch(Str, "*<name>(*)</name>", "$" . A_Index)

} until at the end of the string



emmanuel d
  • Members
  • 519 posts
  • Last active: Jul 15 2017 12:04 PM
  • Joined: 29 Jan 2009

That is an endless loop; use While:

Str := "isfeniefsiefseifsefsiefnefs<name>first.name</name>ssefiefhseiufhe<name>second.name</name>sifjsidfsdifsdfjfd<name>third.name</name>sudnfsidfnsdiufsdunf"

 
found := 1, Match := ""
; Resume right after the previous match; adding an extra 1 would skip a character.
while found := RegExMatch(Str, "U).*<name>(.*)</name>", Match, found + StrLen(Match)) ; get the names, even empty ones
	MsgBox, %Match1%




Stopwatch emdkplayer
the code i post falls under the: WTFYW-WTFPL license

http://www.ahkscript.org/ the new forum