Put here requests of problems with regular expressions


1074 replies to this topic
wooho
  • Members
  • 45 posts
  • Last active: Aug 10 2015 05:59 AM
  • Joined: 27 Dec 2013

OK, how can I insert a space (\s) after specific characters? For example, given 45dog, I want a space after the 45 so that it reads 45 dog. In other words, how do I insert a space after digits?



Alpha Bravo
  • Members
  • 1687 posts
  • Last active: Nov 07 2015 03:06 PM
  • Joined: 01 Sep 2011

To insert a space between a digit and a letter that immediately follows it:

h := "45dog 45 cat 60% 2.1 20, 40 1/2"
MsgBox % RegExReplace(H, "i)(?<=\d)(?=[a-z])", " ")
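
Both parts of that pattern are zero-width: the look-behind (?<=\d) and the look-ahead (?=[a-z]) only test the neighbouring characters, so the match is the empty position between them and the replacement inserts a space without removing anything. As a rough comparison, here is a sketch that runs the look-around version next to a capture-group version; the variable names are only illustrative.

```autohotkey
h := "45dog 45 cat 60% 2.1 20, 40 1/2"

; Look-around version: matches the empty position between a digit and a letter.
withLookaround := RegExReplace(h, "i)(?<=\d)(?=[a-z])", " ")

; Capture-group version: consumes the digit and the letter, then writes both back.
withCaptures := RegExReplace(h, "i)(\d)([a-z])", "$1 $2")

; For this input both lines should read: 45 dog 45 cat 60% 2.1 20, 40 1/2
MsgBox % withLookaround "`n" withCaptures
```

Only "45dog" changes, because it is the only place where a digit is directly followed by a letter.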


Joe Glines
  • Members
  • 118 posts
  • Last active: Jan 24 2016 03:08 PM
  • Joined: 23 Dec 2009

I'm trying to make a cheat sheet for extracting info on drop-downs from a website. In the example below I'm close, but I can't seem to lose the trailing "></select>". (I understand why it is there; I just can't grasp how to exclude it within this RegExReplace.)

 

 

OuterHTML=
(
<select name="st" id="st"><option selected="selected" value="0">All results</option><option value="1">1 day</option><option value="7">7 days</option><option value="14">2 weeks</option><option value="30">1 month</option><option value="90">3 months</option><option value="180">6 months</option><option value="365">1 year</option></select>
)


MsgBox % Result := RegExReplace(OuterHTML,".*?option\s+(?:selected=""selected""\s+)?value=""(.*?)"">(.*?)</option(.*?)", "$1`t$2`n") 

Automating the mundane 1 script at a time...
https://www.linkedin.com/in/joeglines
The-Automator

kon
  • Members
  • 1652 posts
  • Last active:
  • Joined: 04 Mar 2013

MsgBox % Result := RegExReplace(OuterHTML,"U).*<option.*value=""(.+)"">(.+)</option>((?=<option)|.*$)", "$1`t$2`n")



Joe Glines
  • Members
  • 118 posts
  • Last active: Jan 24 2016 03:08 PM
  • Joined: 23 Dec 2009

Thank you, Kon! I definitely never would have got that! What does the additional pattern at the end do?

((?=<option)|.*$)

 

I believe part of it is saying "find <option or the end of the line", the end of the line being the dollar sign. And I was just reading about ?=, which is a "positive look-ahead". The documentation mentions two things: 1) Positive look-aheads do not consume any characters (I get this part). 2) It requires the entire pattern to match or it fails.

 

This second part is throwing me. I would have thought a failed look-ahead would cancel the entire match, but does it only cancel the section within the parens where you used it? Thank you so much for your time! I've run into the above issue before, and I usually do some stupid manipulation of my result, so I'd love to understand how your pattern match works!

Regards,Joe



Alpha Bravo
  • Members
  • 1687 posts
  • Last active: Nov 07 2015 03:06 PM
  • Joined: 01 Sep 2011
This pattern: U).*<option.*value=""(.+)"">(.+)</option>((?=<option)|.*$)
is interpreted as follows:
 
First, try to match this pattern:
U).*<option.*value=""(.+)"">(.+)</option>(?=<option) ; look-ahead for "<option" but don't consume it
If that fails to match, use this pattern instead:
U).*<option.*value=""(.+)"">(.+)</option>.*$ ; consume everything to the end
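
To see what the final group contributes, here is a small sketch that runs the pattern with and without it. The shortened html string is a made-up sample, not the original page. It also illustrates that a failed look-ahead only fails its own branch of the alternation: the engine then tries the .*$ branch instead of abandoning the whole match.

```autohotkey
; Hypothetical two-option sample, shortened from the <select> posted above.
html := "<select><option value=""0"">All results</option><option value=""1"">1 day</option></select>"

; Without the final group, nothing consumes the text after the last </option>.
noTail := RegExReplace(html, "U).*<option.*value=""(.+)"">(.+)</option>", "$1`t$2`n")

; With it: the look-ahead branch succeeds while another <option follows;
; after the last option it fails, so the .*$ branch consumes "</select>".
withTail := RegExReplace(html, "U).*<option.*value=""(.+)"">(.+)</option>((?=<option)|.*$)", "$1`t$2`n")

; The brackets make the leftover text easy to spot.
MsgBox % "[" noTail "]`n[" withTail "]"
```

For this sample the first result should still end in </select>, while the second should contain only the two value/text lines.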


Joe Glines
  • Members
  • 118 posts
  • Last active: Jan 24 2016 03:08 PM
  • Joined: 23 Dec 2009

Alpha Bravo, thank you for clarifying. I'm scratching my head a little, but I believe your comments will help the fog lift as I study it over time. I haven't used look-ahead functionality, nor have I built regexes with this kind of logic (look for this, but if it isn't there, take that). I'm looking forward to taking another step forward.   :)

 

I made some tweaks to make it a bit more flexible depending upon the form being ripped. Mainly:

1) making it case insensitive

2) using named subpatterns

3) adding some allowance for whitespace

4) making the quotes optional

BTW, I removed the "U" from the RegExReplace because I couldn't figure out a way to keep the whole thing "ungreedy" while making the quotes optional. Is there a way to make the quotes optional without removing the U?

OuterHTML:= RegExReplace(OuterHTML,"i).*?<option.*?value=""?(?P<Key>.+?)""?>(?P<Value>.+?)</option>(\s+)?((?=<option)|.*$)", "${Key}`t${Value}`n") ;named
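
On keeping the U option: while U is in effect, a question mark after a quantifier makes that one quantifier greedy again, so ""?? is a greedy optional quote. The following is only a sketch under that rule, not a tested drop-in replacement. The sample string is invented, and \s*? stands in for the (\s+)? group, since under U a lazy whitespace group could let the .*$ branch swallow the remaining options.

```autohotkey
; Hypothetical sample: one quoted value, one unquoted value, and a space between the options.
html := "<select><option selected=""selected"" value=""0"">All results</option> <option value=1>1 day</option></select>"

; Under U) every quantifier is lazy by default; a trailing ? makes that one greedy.
;   ""??  -> greedy optional quote, consumed whenever a quote is present
;   \s*?  -> greedy run of whitespace between options
pattern := "iU).*<option.*value=""??(?P<Key>.+)""??>(?P<Value>.+)</option>\s*?((?=<option)|.*$)"

MsgBox % RegExReplace(html, pattern, "${Key}`t${Value}`n")
```

For this sample the result should be two lines, 0 and All results separated by a tab, then 1 and 1 day, with no stray quotes in the keys.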


TLM
  • Administrators
  • 3864 posts
  • Last active:
  • Joined: 21 Aug 2006
@Joetazz, you can also simply use an HTML file object:
OuterHTML=
(
<select name="st" id="st"><option selected="selected" value="0">All results</option><option value="1">1 day</option><option value="7">7 days</option><option value="14">2 weeks</option><option value="30">1 month</option><option value="90">3 months</option><option value="180">6 months</option><option value="365">1 year</option></select>
)

docObj := ComObjCreate("HTMLfile"), docObj.write(OuterHTML)

Loop % ( optionObj:=docObj.getElementsByTagName( "option" ) ).length
    Result .= optionObj[ a_index-1 ].value . a_tab . optionObj[ a_index-1 ].innerText "`n" 

msgbox % Result
You can also create an option object like this: optionObj:=docObj.getElementById("st").options
Just a thought...
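
For reference, the getElementById variant mentioned above might look roughly like this. It is a sketch that assumes a Windows system where the "HTMLfile" COM object is available and uses a shortened, made-up copy of the select element. The item() call is used here for indexing, on the assumption that the options collection exposes it like other DOM collections.

```autohotkey
; Hypothetical shortened copy of the <select> element from the earlier posts.
html := "<select name=""st"" id=""st""><option selected=""selected"" value=""0"">All results</option><option value=""1"">1 day</option></select>"

docObj := ComObjCreate("HTMLfile")
docObj.write(html)

; Fetch the option collection through the <select> element's id.
optionObj := docObj.getElementById("st").options

Result := ""
Loop % optionObj.length
{
    opt := optionObj.item(A_Index - 1)    ; the collection is zero-based
    Result .= opt.value . A_Tab . opt.innerText . "`n"
}
MsgBox % Result
```

Going through the id picks up only the options of that one drop-down, which may matter on a page that contains several select elements.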

don't duplicate, iterate!


Joe Glines
  • Members
  • 118 posts
  • Last active: Jan 24 2016 03:08 PM
  • Joined: 23 Dec 2009

You just blew up my brain! I had a thought about this a while back but discounted it because I didn't think I'd be able to access them from the DOM! Thank you for proving me wrong! This is a much sexier approach in the long run!



bbint
  • Members
  • 43 posts
  • Last active: Nov 23 2015 04:21 AM
  • Joined: 17 Sep 2008

**Genetic algorithm regular expression generator**

Check out the new, free regex generator released several weeks ago by the Machine Learning Lab (http://regex.inginf.units.it/).

http://www.reddit.co...rating_regular/

It's based on genetic algorithms.

 

E.g., from regular-expressions.info:

Find all IP addresses:

\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b

Note that this also matches invalid addresses such as 999.999.999.999.
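
As a quick illustration of that limitation, this sketch compares the simple pattern (with the dots escaped as \. so they match only a literal dot) against a stricter variant that limits each octet to 0-255. The variable names and the two test addresses are only examples.

```autohotkey
; Simple pattern: any 1-3 digits per octet, so 999 is accepted.
loose  := "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"

; Stricter pattern: each octet must be 250-255, 200-249, or 0-199.
strict := "\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b"

report := ""
for index, ip in ["192.168.1.1", "999.999.999.999"]
{
    looseResult  := RegExMatch(ip, loose)  ? "match" : "no match"
    strictResult := RegExMatch(ip, strict) ? "match" : "no match"
    report .= ip . "`tloose: " . looseResult . "`tstrict: " . strictResult . "`n"
}
MsgBox % report
```

Both patterns should accept 192.168.1.1, while only the loose one should accept 999.999.999.999.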

Many times, you have to come up with the pattern yourself.

With the new generator, you submit a string, highlight what you want to match (in this case, several IP addresses), wait for the program to run, and it generates a regular expression pattern for you.
It takes some time, as it has to try many different combinations to meet your goal.
It learns and optimizes every time.