Help with a regex that can solve a lot of HTML matching Topic is solved

AHKStudent
Posts: 1472
Joined: 05 May 2018, 12:23

Help with a regex that can solve a lot of HTML matching

19 May 2018, 14:27

Earlier I came across some code by swagfag https://autohotkey.com/boards/viewtopic.php?f=5&t=49293

Very often a website will have code that looks something like one of the following

Sample1

Code:

<span class="field field--type-string hidden" property="schema:name">headline</span>
or
Sample2

Code:

<span class="field field--type-string hidden">headline</span>
This regex works perfectly on the second sample:

Code:

"<span class=""field field[^>]+>(?P<Name>[^<]+)"
However, it returns nothing on the first.

It would be great if someone could share a regex that covers the following:

Match something that starts with literal text; in the example above that would be <span class=

Next comes some more literal text, in the example above "field field", so the regex so far is "<span class=""field field

The next part is the key: keep going until you hit a >, meaning any characters at all (not just letters or numbers) up to the >

Then grab the text after that until you hit a </span> or </a> (in the regex you would write the literal closing tag you expect to hit).

I tried playing with (+.*) and must have tried 30+ things, but did not succeed.
swagfag
Posts: 6222
Joined: 11 Jan 2017, 17:59

Re: Help with a regex that can solve a lot of HTML matching  Topic is solved

19 May 2018, 14:56

the regex unaltered works as expected:

Code:

Needle := "<span class=""field field[^>]+>(?P<Name>[^<]+)"

Haystack = <span class="field field--type-string hidden" property="schema:name">headlineSCHEMA</span> ; legacy, cause too lazy to do the escaping
RegExMatch(Haystack, Needle, Match)
MsgBox, % MatchName

Haystack = <span class="field field--type-string hidden">headlineNORMAL</span> ; legacy, cause too lazy to do the escaping
RegExMatch(Haystack, Needle, Match)
MsgBox, % MatchName
as for the rest of it, you've pretty much figured it out already:

[^>]+ - negated character set, a construct that comes up often: "match any character that isn't in the set" (you can put more than one character in the set if you want, it's not limited to just one)
  • <span class=""field field - literal text; the double quotes are doubled because they need to be escaped, since the regex is inside a quoted string
  • [^>]+ - negated char set, one or more characters that aren't a >
  • > - literal >
  • ( - start capturing group
  • ?P<Name> - named capturing group: the captured text is saved in a variable named after your RegExMatch OutputVar with Name appended to it (here Match + Name = MatchName)
  • [^<]+ - negated char set, one or more characters that aren't a <
  • ) - end capturing group
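
The original question also asked about grabbing text until a literal closing tag such as </span>. A minimal sketch of one way to do that, assuming the captured text may itself contain inline tags (the sample markup with <b> is made up for illustration): swap the [^<]+ for a lazy .*? followed by the literal closing tag.

```autohotkey
; Hypothetical sample: the headline contains an inline <b> tag,
; so [^<]+ would stop early at "head".
Haystack = <span class="field field--type-string hidden" property="schema:name">head<b>line</b></span>

; s)   - option that lets . also match newlines
; .*?  - lazy match, stops at the first </span> that follows
Needle := "s)<span class=""field field[^>]*>(?P<Name>.*?)</span>"

if RegExMatch(Haystack, Needle, Match)
    MsgBox, % MatchName   ; everything between the opening tag and the first </span>
```

The trade-off is that the lazy version depends on the closing tag being spelled exactly as written in the needle, while [^<]+ simply stops at the next tag of any kind.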
Last edited by swagfag on 19 May 2018, 15:10, edited 2 times in total.
AHKStudent
Posts: 1472
Joined: 05 May 2018, 12:23

Re: Help with a regex that can solve a lot of HTML matching

19 May 2018, 15:08

The IWB learner tool showed me <span class= and when I checked the actual HTML I did see <span class=, but I also saw one with just class=

When I changed the regex to "class=""field field[^>]+>(?P<Name>[^<]+)" it worked. When you keep it as "<span class=""field field[^>]+>(?P<Name>[^<]+)" it only returns the first few headlines.
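
A small sketch of why dropping the <span prefix can pick up more matches, assuming some headlines sit in a different element that carries the same class. The two-line markup below is hypothetical and only stands in for the real page.

```autohotkey
; Hypothetical markup: one headline in a <span>, one in an <a>
HTML =
(
<span class="field field--type-string">first headline</span>
<a class="field field--type-string" href="/node/1">second headline</a>
)

Needles := ["<span class=""field field[^>]+>(?P<Name>[^<]+)", "class=""field field[^>]+>(?P<Name>[^<]+)"]

for Index, Needle in Needles
{
    Count := 0, Pos := 1, Match := ""
    ; Same loop idiom as in the thread: resume right after the previous match
    while (Pos := RegExMatch(HTML, Needle, Match, Pos + StrLen(Match)))
        Count++
    MsgBox, % Needle "`nmatches: " Count
}
```

With this sample the needle anchored on <span can only see the first element, while the shorter needle matches any tag that carries the class.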

You can see what I mean in this code if curious

Code:

FileDelete, TempFile96.txt
Output := ""
UrlDownloadToFile, % "https://www.zerohedge.com/", TempFile96.txt
FileRead, HTML, TempFile96.txt
Needle := "class=""field field[^>]+>(?P<Name>[^<]+)"

Pos := 1
While (Pos := RegExMatch(HTML, Needle, Match, Pos + StrLen(Match)))
	Output .= MatchName "`r`n"
msgbox, % output
ExitApp
Only issue left is figuring out why quotes, for example, are returned as &quot; instead of ". I never really had this issue with other sites.

Thanks again you cleared up so much confusion with the explanation
swagfag
Posts: 6222
Joined: 11 Jan 2017, 17:59

Re: Help with a regex that can solve a lot of HTML matching

19 May 2018, 15:21

AHKStudent wrote:Only issue left is figuring out why quotes for example is returned as &quot; instead of " I never really had this issue with sites
that's a question for the site owner
regardless, you could decode those after you're done grabbing the headlines:

Code:

FileDelete, TempFile96.txt
Output := ""
UrlDownloadToFile, % "https://www.zerohedge.com/", TempFile96.txt
FileRead, HTML, TempFile96.txt
Needle := "class=""field field[^>]+>(?P<Name>[^<]+)"

Pos := 1
While (Pos := RegExMatch(HTML, Needle, Match, Pos + StrLen(Match)))
	Output .= MatchName "`r`n"

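for encodedChar, decodedChar in {"&quot;": """", "&#039;": "'"} ; entity spellings reconstructed; &#039; for the apostrophe is an assumption, check the page source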
	Output := StrReplace(Output, encodedChar, decodedChar)

msgbox, % output
ExitApp
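
If the page uses more numeric entities than the two listed above, a lookup table gets long. A rough sketch of a more general approach, assuming a Unicode build of AutoHotkey v1.1 and decimal entities only; DecodeNumericEntities is a made-up helper name, not a built-in, and the sample string is invented.

```autohotkey
; Sketch: decode decimal numeric entities such as &#039; or &#8220;
DecodeNumericEntities(Text)
{
    Pos := 1
    while (Pos := RegExMatch(Text, "&#(\d+);", Entity, Pos))
    {
        ; Entity holds the whole match (e.g. &#039;), Entity1 holds the digits
        Text := StrReplace(Text, Entity, Chr(Entity1 + 0))
        Pos += 1   ; step past the decoded character so it is not rescanned
    }
    return Text
}

MsgBox, % DecodeNumericEntities("It&#039;s a &#8220;headline&#8221;")
```

Named entities such as &quot; or &amp; are not covered by this sketch and would still need a small lookup table like the one in the code above.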
AHKStudent
Posts: 1472
Joined: 05 May 2018, 12:23

Re: Help with a regex that can solve a lot of HTML matching

19 May 2018, 15:32

thanks a lot, this solves a lot of my regex html issues
