Help using regex to get the title and link together code included

AHKStudent
Posts: 213
Joined: 05 May 2018, 12:23

Help using regex to get the title and link together code included

20 May 2018, 12:28

First, a big thanks to swagfag; he helped me come a long way with regex for my project.

The HTML on the site is as follows:

Code:

<a class="title" href="http://www.businessinsider.com/ap-the-latest-mnuchin-says-us-china-putting-trade-war-on-hold-2018-5">MNUCHIN: The US-China trade war is 'on hold'</a>


I was able to create two different processes: one gets me the link, the other the title.

The issue is that I need to get both at once so I can store them in my db,

so I'm hoping the loop runs and gives me one variable that has the title and another that has the URL.

I suspect this has to be one of the most common things people do, but I tried hard to find an AHK solution and couldn't.

Thanks for your time

Code: [not preserved]

AHKStudent
Posts: 213
Joined: 05 May 2018, 12:23

Re: Help using regex to get the title and link together code included

20 May 2018, 14:57

So far this is what I came up with: I first check how many articles there are, then loop based on that. I won't be using this anyway, as I must use an actual browser and page down, or else only some of the articles show. I am sharing it in case someone has a use for it or wants to show another way to do this.

Code: [not preserved]

TLM
Posts: 1289
Joined: 01 Oct 2013, 07:52

Re: Help using regex to get the title and link together code included

20 May 2018, 15:24

I'm against parsing/searching full markup with RegEx as it can be very unpredictable.
Using the DOM object is a much more reliable approach:

Code: [not preserved]

You should get this: [spoiler content not preserved]

You can then easily fill an associative array with title and link keys for use elsewhere ( I'll let you do that part tho ;) )
Also remember that if you need to call the HTTP request and HTML file objects more than once, you can wrap them in a function.
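TLM's DOM snippet is not reproduced in this copy of the thread. As a rough, language-neutral sketch of the same idea (walk the parsed markup instead of regex-matching it, and fill a collection keyed by title and link), here is a minimal version in Python rather than AHK. The class name `TitleLinkParser` is made up for illustration, and the assumption that articles are `<a class="title">` anchors comes only from the one sample line in the first post.

```python
from html.parser import HTMLParser

class TitleLinkParser(HTMLParser):
    """Collect {"title": ..., "link": ...} for every <a class="title"> anchor."""

    def __init__(self):
        super().__init__()
        self.articles = []    # finished entries
        self._current = None  # entry being filled while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "title" in classes:
            self._current = {"title": "", "link": attrs.get("href") or ""}

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.articles.append(self._current)
            self._current = None

# Sample anchor copied from the first post.
sample = ('<a class="title" href="http://www.businessinsider.com/'
          'ap-the-latest-mnuchin-says-us-china-putting-trade-war-on-hold-2018-5">'
          "MNUCHIN: The US-China trade war is 'on hold'</a>")

parser = TitleLinkParser()
parser.feed(sample)
parser.close()
articles = parser.articles
```

Each entry keeps the title and the link together, which is what the original question asked for.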
exams.. :headwall:
swagfag
Posts: 740
Joined: 11 Jan 2017, 17:59

Re: Help using regex to get the title and link together code included

20 May 2018, 15:39

You can match it with one regex:

Code: [not preserved]
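swagfag's pattern is not reproduced in this copy of the thread. The general idea of one regex with two capture groups can be sketched as follows, in Python rather than AHK. The pattern is an assumption built from the single sample anchor in the first post, and it will miss anchors whose attributes appear in a different order.

```python
import re

# Sample anchor copied from the first post.
html = ('<a class="title" href="http://www.businessinsider.com/'
        'ap-the-latest-mnuchin-says-us-china-putting-trade-war-on-hold-2018-5">'
        "MNUCHIN: The US-China trade war is 'on hold'</a>")

# One pattern, two named groups: "url" is the href value, "title" the anchor text.
ANCHOR = re.compile(
    r'<a class="title" href="(?P<url>[^"]+)">(?P<title>.*?)</a>',
    re.S,
)

def extract_articles(page):
    """Return a list of (title, url) pairs, one per matching anchor."""
    return [(m.group("title"), m.group("url")) for m in ANCHOR.finditer(page)]

articles = extract_articles(html)
```

Because both groups come from the same match, the title and URL can never get out of step, which is the risk when they are collected by two separate passes.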

AHKStudent
Posts: 213
Joined: 05 May 2018, 12:23

Re: Help using regex to get the title and link together code included

20 May 2018, 15:52

TLM wrote: I'm against parsing/searching full markup with RegEx as it can be very unpredictable.
Using the DOM object is a much more reliable approach:

Code: [not preserved]

You should get this: [spoiler content not preserved]

You can then easily fill an associative array with title and link keys for use elsewhere ( I'll let you do that part tho ;) )
Also remember that if you need to call the HTTP request and HTML file objects more than once, you can wrap them in a function.


Yes, I did get that result, thank you.

I did another test on a different site. I suspect it failed because that site does not have http:// in its links?

Code: [not preserved]

TLM
Posts: 1289
Joined: 01 Oct 2013, 07:52

Re: Help using regex to get the title and link together code included

20 May 2018, 16:40

AHKStudent wrote: I did another test on a different site. I suspect it failed because that site does not have http:// in its links?

Every site's markup will most likely be written differently. You'll have to inspect the markup first on a per-site basis ( or even better, use their API ).
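On the http:// question specifically: if a site writes its hrefs as site-relative paths, a pattern that expects `http://` will skip them, and the captured value needs to be resolved against the page URL before it is stored. Whether zerohedge actually does this is not confirmed in the thread; the sketch below (in Python, with a made-up path `/news/example-article` and helper name `absolutize`) only shows the resolution step.

```python
from urllib.parse import urljoin

# Page the hrefs were scraped from (assumed base URL).
PAGE_URL = "https://www.zerohedge.com/"

def absolutize(href, base=PAGE_URL):
    """Resolve a relative href against the page URL; absolute URLs pass through."""
    return urljoin(base, href)

links = [
    absolutize("/news/example-article"),                      # hypothetical relative href
    absolutize("http://www.businessinsider.com/some-story"),  # already absolute
]
```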

Oh, I forgot to mention that the added benefit of using an HTTP request is that it returns the HTML directly in the HttpRequestObject.ResponseText property.
You don't need to download the site to a file first.

edit:
In the case of zerohedge, I would first add an h2 object, just like the anchor object I previously used: aTagObj := htmObj.getElementsByTagName( "a" )
Loop through the h2TagObj object, looking for the class teaser-title and making sure it has the parent article element, then grab the data.
I know it might seem "harder", but the problem with the RegEx approach is that anomalous data can be injected into the page in many ways, throwing it off.
Regardless, the regular expression is going to be different per site anyway. Just my 2 cents.
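The h2 walk described above could look roughly like this. It is a sketch in Python rather than AHK, and the sample markup is invented: only the tag names and the teaser-title class come from the post. It also accepts any enclosing article element, not strictly the direct parent.

```python
from html.parser import HTMLParser

class TeaserParser(HTMLParser):
    """Collect title/link from <h2 class="teaser-title"> found inside <article>."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0  # how many <article> elements are currently open
        self.current = None     # teaser being filled while inside a matching <h2>
        self.teasers = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article":
            self.article_depth += 1
        elif tag == "h2" and self.article_depth > 0:
            if "teaser-title" in (attrs.get("class") or "").split():
                self.current = {"title": "", "link": ""}
        elif tag == "a" and self.current is not None:
            self.current["link"] = attrs.get("href") or ""

    def handle_data(self, data):
        if self.current is not None:
            self.current["title"] += data

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1
        elif tag == "h2" and self.current is not None:
            self.current["title"] = self.current["title"].strip()
            self.teasers.append(self.current)
            self.current = None

# Hypothetical markup: one teaser inside an article, one outside.
sample = (
    '<article><h2 class="teaser-title">'
    '<a href="/news/example-one">Example headline one</a></h2></article>'
    '<div><h2 class="teaser-title">'
    '<a href="/news/ignored">Outside any article</a></h2></div>'
)

parser = TeaserParser()
parser.feed(sample)
parser.close()
teasers = parser.teasers
```

Only the heading inside the article element is kept; the look-alike heading outside it is ignored, which is the kind of guard a flat regex does not give you.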
AHKStudent
Posts: 213
Joined: 05 May 2018, 12:23

Re: Help using regex to get the title and link together code included

20 May 2018, 17:36

swagfag wrote: You can match it with one regex:

Code: [not preserved]



That did work! It brought back all the articles that show up if you load the page without scrolling down, and that's fine.

A mystery I encountered: when I sent the data to the database, only 7 items were stored, but when I ran the test as you posted it I got 16.

You can review my code here:

Code: [not preserved]

swagfag
Posts: 740
Joined: 11 Jan 2017, 17:59

Re: Help using regex to get the title and link together code included

20 May 2018, 18:13

tl;dr: apostrophes are messing up the SQL statement.
Use EscapeStr() or escape them yourself manually:

Code: [not preserved]
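The escaping code from this post is not reproduced in this copy of the thread. To illustrate why titles with apostrophes go missing and two ways around it, here is a small sketch in Python with an in-memory SQLite database. The table name, columns, and the helper `escape_sql_literal` are made up for the example; the poster's actual schema and database library are not shown in the thread.

```python
import sqlite3

def escape_sql_literal(text):
    """Double every single quote so the text is safe inside a '...' SQL literal."""
    return text.replace("'", "''")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, url TEXT)")  # hypothetical schema

title = "MNUCHIN: The US-China trade war is 'on hold'"
url = ("http://www.businessinsider.com/"
       "ap-the-latest-mnuchin-says-us-china-putting-trade-war-on-hold-2018-5")

# 1) Unescaped: the apostrophe ends the string literal early and the INSERT fails.
try:
    conn.execute("INSERT INTO articles VALUES ('%s', '%s')" % (title, url))
    unescaped_failed = False
except sqlite3.OperationalError:
    unescaped_failed = True

# 2) Manual escaping: double the apostrophes before building the statement.
conn.execute(
    "INSERT INTO articles VALUES ('%s', '%s')"
    % (escape_sql_literal(title), escape_sql_literal(url))
)

# 3) Bound parameters: the driver handles quoting, so no escaping is needed.
conn.execute("INSERT INTO articles VALUES (?, ?)", (title, url))

rows = conn.execute("SELECT title, url FROM articles").fetchall()
```

If the failing inserts are silently skipped, only the titles without apostrophes reach the table, which would be consistent with seeing 7 rows instead of 16.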

AHKStudent
Posts: 213
Joined: 05 May 2018, 12:23

Re: Help using regex to get the title and link together code included

20 May 2018, 18:35

TLM wrote:
AHKStudent wrote: I did another test on a different site. I suspect it failed because that site does not have http:// in its links?

Every site's markup will most likely be written differently. You'll have to inspect the markup first on a per-site basis ( or even better, use their API ).

Oh, I forgot to mention that the added benefit of using an HTTP request is that it returns the HTML directly in the HttpRequestObject.ResponseText property.
You don't need to download the site to a file first.

edit:
In the case of zerohedge, I would first add an h2 object, just like the anchor object I previously used: aTagObj := htmObj.getElementsByTagName( "a" )
Loop through the h2TagObj object, looking for the class teaser-title and making sure it has the parent article element, then grab the data.
I know it might seem "harder", but the problem with the RegEx approach is that anomalous data can be injected into the page in many ways, throwing it off.
Regardless, the regular expression is going to be different per site anyway. Just my 2 cents.


Yeah, I did have regex inject a wrong URL last week; glad I caught it. I have lots to learn. I'm trying your method for zerohedge, so far with no luck, but I will continue trying.
AHKStudent
Posts: 213
Joined: 05 May 2018, 12:23

Re: Help using regex to get the title and link together code included

20 May 2018, 19:27

swagfag wrote: tl;dr: apostrophes are messing up the SQL statement.
Use EscapeStr() or escape them yourself manually:

Code: [not preserved]


I used what you showed me here to escape them and it worked! https://autohotkey.com/boards/viewtopic ... 77#p218377
Thank you.

Code: [not preserved]

TLM
Posts: 1289
Joined: 01 Oct 2013, 07:52

Re: Help using regex to get the title and link together code included

20 May 2018, 20:29

AHKStudent wrote: Yeah, I did have regex inject a wrong URL last week; glad I caught it. I have lots to learn. I'm trying your method for zerohedge, so far with no luck, but I will continue trying.

A simple HTTP request will grab any page(s), so at least you don't have to download them.

Code: [not preserved]

This way you can use any method you like to grab the data you want ;)
( Note: you may need to declare the reqObj first. Sorry, I'm on my cell right now and can't test it lol )
