Help using regex to get the title and link together code included

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
AHKStudent

20 May 2018, 12:28

First, big thanks to swagfag, who helped me come a long way with regex for my project.

The HTML on the site is as follows:

Code: Select all

<a class="title" href="http://www.businessinsider.com/ap-the-latest-mnuchin-says-us-china-putting-trade-war-on-hold-2018-5">MNUCHIN: The US-China trade war is 'on hold'</a>
I was able to create two separate loops: one gets me the link, the other the title.

The issue is that I need to get both at once so I can store them in my db.

So I'm hoping for a loop that runs once and gives me one variable with the title and another with the URL.

I suspect this has to be one of the most common things people do, but I tried hard to find an AHK solution and couldn't.

Thanks for your time

Code: Select all

FileDelete, TempFile96.txt
Output := ""
UrlDownloadToFile, % "http://www.businessinsider.com/", TempFile96.txt
FileRead, HTML, TempFile96.txt
Needle := "<a class=""title[^>]+>(?P<Name>[^<]+)"
Pos := 1
While (Pos := RegExMatch(HTML, Needle, Match, Pos + StrLen(Match)))
	Output .= MatchName "`r`n"
msgbox, % output

Needle2 := "<a class=""title"" href=""(?P<Name2>[^""]+)"
Pos2 := 1
While (Pos2 := RegExMatch(HTML, Needle2, Match2, Pos2 + StrLen(Match2)))
	Output2 .= Match2Name2 "`r`n"
msgbox, % output2

ExitApp
AHKStudent

20 May 2018, 14:57

So far this is what I came up with: I first count how many articles there are, then loop based on that. I won't be using this anyway, as I must use an actual browser and page down, or else only some of the articles show. I am sharing it in case someone has a use for it or wants to show another way to do this.

Code: Select all

FileDelete, TempFile96.txt
Output := ""
UrlDownloadToFile, % "http://www.businessinsider.com/", TempFile96.txt
FileRead, HTML, TempFile96.txt

Pos := 1
Pos2 := 1
Pos3 := 1
Needle := "<a class=""title[^>]+>(?P<Name>[^<]+)"
Needle2 := "<a class=""title"" href=""(?P<Name2>[^""]+)"

While (Pos3 := RegExMatch(HTML, Needle2, Match3, Pos3 + StrLen(Match3)))
	count++

Loop, %count%
{
	Pos := RegExMatch(HTML, Needle, Match, Pos + StrLen(Match))
	Pos2 := RegExMatch(HTML, Needle2, Match2, Pos2 + StrLen(Match2))
	Output .= MatchName " " Match2Name2 "`r`n"
}
msgbox, % output
ExitApp
TLM

20 May 2018, 15:24

I'm against parsing/searching full markup with RegEx as it can be very unpredictable.
Using the DOM object is a much more reliable approach:

Code: Select all

url 	= http://www.businessinsider.com

reqObj 	:= ComObjCreate( "WinHttp.WinHttpRequest.5.1" )
reqObj.Open( "GET", url, false ), reqObj.Send()
htmObj 	:= ComObjCreate( "HTMLfile" ), htmObj.Write( reqObj.ResponseText )

aTagObj := htmObj.getElementsByTagName( "a" )

While ( a_index-1 < aTagObj.length )
{
	if( aTagObj[ a_index-1 ].className = "title" )
	{
		str .= "Title: " 	aTagObj[ a_index-1 ].innerText
			.  "`nLink: " 	aTagObj[ a_index-1 ].href . "`n---------`n"
	}
}

msgbox % str
You should get a message box listing the title and link of each article.
You can then easily fill an associative array with title and link keys for use elsewhere ( i'll let you do that part tho ;) )
Also remember if you need to call the http request and html file objects more than once, you can wrap them into a function.
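For anyone following along, here is a minimal sketch of that associative-array step. It assumes the same "title" class as above. So that it runs without a network connection, it writes a small made-up HTML string into the HTMLfile object instead of the live page; the example.com links and the articles variable name are placeholders, not anything from the site.

```autohotkey
; Made-up markup standing in for reqObj.ResponseText (assumption for illustration)
html := "<a class=""title"" href=""http://example.com/a"">First headline</a>"
	. "<a class=""nav"" href=""http://example.com/x"">Not an article</a>"
	. "<a class=""title"" href=""http://example.com/b"">Second headline</a>"

htmObj := ComObjCreate("HTMLfile")
htmObj.Write(html)

articles := []	; each item is an object with Title and Link keys
aTagObj := htmObj.getElementsByTagName("a")

Loop % aTagObj.length
{
	tag := aTagObj[A_Index - 1]	; DOM collections are zero-based
	if (tag.className = "title")
		articles.Push({"Title": tag.innerText, "Link": tag.href})
}

; Walk the collected entries; this is where a database insert could go instead
for index, article in articles
	str .= index ". " article.Title "`n   " article.Link "`n"

MsgBox % str
```

Keeping title and link together in one entry means a later step never has to line up two separate lists.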
swagfag

20 May 2018, 15:39

u can match it with one regex

Code: Select all

FileDelete, TempFile96.txt
UrlDownloadToFile, % "http://www.businessinsider.com/", TempFile96.txt
FileRead, HTML, TempFile96.txt

Output := {}
Needle := "<a class=""title"" href=""(?P<URL>[^""]+)"">(?P<Title>[^<]+)"
Pos := 1
While (Pos := RegExMatch(HTML, Needle, Match, Pos + StrLen(Match)))
	Output.Push({"URL": MatchURL, "Title": MatchTitle})

for each, Entry in Output
	Result .= "----`n" Entry.URL "`n" Entry.Title "`n----`n`n"

msgbox, % Result
ExitApp
AHKStudent

20 May 2018, 15:52

TLM wrote:I'm against parsing/searching full markup with RegEx as it can be very unpredictable.
Yes, I did get that result, thank you.

I did another test on a different site and it failed. I suspect the reason is that the site does not have http:// in its links?

Code: Select all

url 	= https://www.zerohedge.com

reqObj 	:= ComObjCreate( "WinHttp.WinHttpRequest.5.1" )
reqObj.Open( "GET", url, false ), reqObj.Send()
htmObj 	:= ComObjCreate( "HTMLfile" ), htmObj.Write( reqObj.ResponseText )

aTagObj := htmObj.getElementsByTagName( "H2" )

While ( a_index-1 < aTagObj.length )
{
	if( aTagObj[ a_index-1 ].className = "teaser-title" )
	{
		str .= "Title: " 	aTagObj[ a_index-1 ].innerText
			.  "`nLink: " 	aTagObj[ a_index-1 ].href . "`n---------`n"
	}
}

msgbox % str
TLM

20 May 2018, 16:40

AHKStudent wrote:I did another test on a different site, I suspect the reason it failed is because that site does not have http:// in their links?
Every site's markup is most likely going to be written differently. You'll have to inspect the markup first on a per-site basis (or even better, use their API).

Oh, I forgot to mention that the added benefit of using an HTTP request is that it returns the HTML directly in the HttpRequestObject.ResponseText property.
You don't need to download the site to a file first.

edit:
In the case of zerohedge, I would first add an h2 object, just like the anchor object I previously used: aTagObj := htmObj.getElementsByTagName( "a" )
Loop through the h2TagObj object looking for the class teaser-title and making sure it has the parent article element, then grab the data.
I know it might seem "harder", but the problem with the RegEx approach is that anomalous data can be injected into the page in many ways, throwing it off.
Regardless, the regular expression is going to be different per site anyway. Just my 2 cents.
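A rough sketch of that idea, untested against the live site. It assumes each headline looks roughly like an h2 with class teaser-title wrapping an anchor; the sample string below is made up, so inspect the real markup first. Note that an h2 element has no href of its own, which would explain blank links when reading .href from the h2; the link lives on the anchor inside it. The parent article check is left out here because I'm not sure how the HTMLfile object handles the article tag. The second argument to getAttribute is, as far as I know, an Internet Explorer extension that returns the attribute as written in the source rather than a resolved URL.

```autohotkey
; Made-up sample of the assumed markup; the real page may be structured differently
html := "<h2 class=""teaser-title""><a href=""/news/sample-story"">Sample headline</a></h2>"
	. "<h2 class=""other""><a href=""/about"">Not a teaser</a></h2>"

htmObj := ComObjCreate("HTMLfile")
htmObj.Write(html)

h2TagObj := htmObj.getElementsByTagName("h2")

Loop % h2TagObj.length
{
	h2 := h2TagObj[A_Index - 1]
	if (h2.className != "teaser-title")
		continue

	; The href belongs to the anchor inside the h2, not to the h2 itself
	links := h2.getElementsByTagName("a")
	if (links.length = 0)
		continue

	str .= "Title: " h2.innerText
		. "`nLink: " links[0].getAttribute("href", 2) "`n---------`n"
}

MsgBox % str
```

If the site uses relative links like the one in the sample, you would still need to prepend the site's address before storing them.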
AHKStudent

20 May 2018, 17:36

swagfag wrote:u can match it with one regex

That did work! It brought back all the articles that show up if you load the page without scrolling down, and that's fine.

A mystery I encountered: when I send the data to the database, only 7 items get inserted, but when I ran the test as you posted it I got 16.

You can review my code here:

Code: Select all

#Include Class_SQLiteDB.ahk ; incase you need it https://autohotkey.com/boards/viewtopic.php?t=1064
FormatTime, TimeS,, yyyy-MM-dd
MyDB := New SQLiteDB
MyDB.OpenDB("newstest2.sqlite")
MyDB.Exec("CREATE TABLE IF NOT EXISTS News (date TEXT, headline TEXT, link TEXT)")
FileDelete, TempFile96.txt
UrlDownloadToFile, % "http://www.businessinsider.com/", TempFile96.txt
FileRead, HTML, TempFile96.txt

Output := {}
Needle := "<a class=""title"" href=""(?P<URL>[^""]+)"">(?P<Title>[^<]+)"
Pos := 1
While (Pos := RegExMatch(HTML, Needle, Match, Pos + StrLen(Match)))
	Output.Push({"URL": MatchURL, "Title": MatchTitle})

for each, Entry in Output
	MyDB.Exec("INSERT INTO News VALUES ('" . TimeS . "', '" . Entry.Title . "','" . Entry.URL . "');") 
	;Result .= "----`n" Entry.URL "`n" Entry.Title "`n----`n`n"     

;msgbox, % Result
MyDB.CloseDB()
ExitApp
swagfag

20 May 2018, 18:13

tl;dr: apostrophes are messing the SQL statement up.
Use EscapeStr() or escape them yourself manually.

Code: Select all

#Include Class_SQLiteDB.ahk ; incase you need it https://autohotkey.com/boards/viewtopic.php?t=1064
FormatTime, TimeS,, yyyy-MM-dd
MyDB := New SQLiteDB
MyDB.OpenDB("newstest2.sqlite")
MyDB.Exec("CREATE TABLE IF NOT EXISTS News (date TEXT, headline TEXT, link TEXT)")
FileDelete, TempFile96.txt
UrlDownloadToFile, % "http://www.businessinsider.com/", TempFile96.txt
FileRead, HTML, TempFile96.txt

Output := {}
Needle := "<a class=""title"" href=""(?P<URL>[^""]+)"">(?P<Title>[^<]+)"
Pos := 1
While (Pos := RegExMatch(HTML, Needle, Match, Pos + StrLen(Match)))
	Output.Push({"URL": MatchURL, "Title": MatchTitle})

for each, Entry in Output
{
	msgBox % "PLAIN TEXT`n" Result .= "----`n" Entry.URL "`n" Entry.Title "`n----`n`n"
	msgBox % "Exec RetVal: " MyDB.Exec("INSERT INTO News VALUES ('" . TimeS . "', '" . Entry.Title . "','" . Entry.URL . "');")
			. "`nErrorCode: " MyDB.ErrorCode "`nErrorMsg: " MyDB.ErrorMsg
}

MyDB.CloseDB()
msgbox, % Result
ExitApp
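To illustrate the manual route: in SQLite, an apostrophe inside a string literal is written as two apostrophes. Here is a minimal sketch, assuming you build the statement by hand as above. SqlQuote is a hypothetical helper name, not part of Class_SQLiteDB, and the date is a sample value.

```autohotkey
; SqlQuote is a hypothetical helper, not part of Class_SQLiteDB
SqlQuote(text)
{
	; Double every apostrophe, then wrap the whole value in single quotes
	return "'" . StrReplace(text, "'", "''") . "'"
}

date     := "2018-05-20"	; sample value
headline := "MNUCHIN: The US-China trade war is 'on hold'"
link     := "http://www.businessinsider.com/ap-the-latest-mnuchin-says-us-china-putting-trade-war-on-hold-2018-5"

SQL := "INSERT INTO News VALUES (" SqlQuote(date) ", " SqlQuote(headline) ", " SqlQuote(link) ");"
MsgBox % SQL
```

The headline comes out as 'MNUCHIN: The US-China trade war is ''on hold''', which SQLite reads back as the original text. Running every value through the same helper, the link included, avoids having to guess which fields might contain an apostrophe.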
AHKStudent

20 May 2018, 18:35

TLM wrote:
I know it might seem "harder" but the problem with the RegEx approach is anomalous data can be injected into page in many ways throwing it off.
Yeah, I did have regex inject a wrong URL last week. Glad I caught it. I have lots to learn. I'm trying your method for zerohedge, so far no luck, but I will continue trying.
AHKStudent

20 May 2018, 19:27

swagfag wrote: tl;dr: apostrophes are messing the SQL statement up.
Use EscapeStr() or escape them yourself manually.

I used what you showed me here to escape and it worked! https://autohotkey.com/boards/viewtopic ... 77#p218377
thank you

Code: Select all

#Include Class_SQLiteDB.ahk ; incase you need it https://autohotkey.com/boards/viewtopic.php?t=1064
FormatTime, TimeS,, yyyy-MM-dd
MyDB := New SQLiteDB
MyDB.OpenDB("newstest3.sqlite")
MyDB.Exec("CREATE TABLE IF NOT EXISTS News (date TEXT, headline TEXT, link TEXT)")
FileDelete, TempFile96.txt
UrlDownloadToFile, % "http://www.businessinsider.com/", TempFile96.txt
FileRead, HTML, TempFile96.txt

Output := {}
Needle := "<a class=""title"" href=""(?P<URL>[^""]+)"">(?P<Title>[^<]+)"
Pos := 1
While (Pos := RegExMatch(HTML, Needle, Match, Pos + StrLen(Match)))
{
	escapedHeadline := StrReplace(MatchTitle, "'", "''")
	Output.Push({"URL": MatchURL, "Title": escapedHeadline})
}

for each, Entry in Output
{
	MyDB.Exec("INSERT INTO News VALUES ('" . TimeS . "', '" . Entry.Title . "','" . Entry.URL . "');")
	
}

MyDB.CloseDB()

ExitApp
TLM

20 May 2018, 20:29

AHKStudent wrote: Yeah, I did have regex inject a wrong URL last week. Glad I caught it. I have lots to learn. I'm trying your method for zerohedge, so far no luck, but I will continue trying.
A simple http request will grab any page(s), so at least you don't have to download them.

Code: Select all

; the URL needs its scheme (http:// or https://) for WinHttp to accept it
url 	= http://anysite.com

html := GetHtml( url )

msgbox % html

return

GetHtml( url )
{
	reqObj 	:= ComObjCreate( "WinHttp.WinHttpRequest.5.1" )
	reqObj.Open( "GET", url, false ), reqObj.Send()
		
	return reqObj.ResponseText
}
This way you can use any method you like to grab the data you want ;)
( note: you may need to declare the reqObj 1st sorry I'm on my cell right now and can't test it lol )
