How to remove duplicate lines with RegEx?


35 replies to this topic
Klark92
  • Members
  • 870 posts
  • Last active: Dec 29 2015 09:47 PM
  • Joined: 19 Feb 2012

Does anybody have an example of that? (I have a large file.)


I CAN PROTECT YOUR SCRIPT (ANTI-DECOMPILER by Klark92) (AHK_L*)(PM)
Klark92's Script2Exe Wizard
AHK_L / AHK COMPILED EXE / BIN ICON CHANGER


Learning one
  • Members
  • 1483 posts
  • Last active: Jan 02 2016 02:30 PM
  • Joined: 04 Apr 2009

I think RegEx is not the right tool for this. Here's how I would do it:

String=
(
mango
banana
mango
apple
)

MsgBox %  RemoveDuplicates(String)


;=== Function ===
RemoveDuplicates(String, Delimiter="`n") {
	oUniques := []
	Loop, parse, String, % Delimiter
	{
		For k,v in oUniques
		{
			if (v = A_LoopField)		; duplicate
				continue 2
		}
		; unique
		NewString .= Delimiter A_LoopField
		oUniques.Insert(A_LoopField)
	}
	return LTrim(NewString, Delimiter)
}

My Website • Recommended: AutoHotkey Unicode 32-bit • Join DropBox, Copy


rbrtryn
  • Members
  • 1177 posts
  • Last active: Sep 11 2013 08:04 PM
  • Joined: 22 Jun 2011

A shorter way:

 

String=
(
mango
banana
mango
apple
)
MsgBox %  RemoveDuplicates(String)

;=== Function ===
RemoveDuplicates(String, Delimiter="`n")
{
Loop, parse, String, % Delimiter
  if not InStr(out, A_LoopField) {
   out .= out ? "`n" : ""
   out .= A_LoopField
  }
return out
}
 

My Scripts are written for the latest released version of AutoHotkey.

Need a secure, accessible place to backup your stuff? Use Dropbox!


sinkfaze
  • Moderators
  • 6367 posts
  • Last active: Nov 30 2018 08:50 PM
  • Joined: 18 Mar 2008
If the order's not important, just use the Sort command:
list=
(
here's a line of stuff
here's another line
and yet another line
and yet another line
here's a line of stuff
here's another line
)
Sort, list, U
MsgBox %    list
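
A note on the Sort approach above: by default Sort compares items case-insensitively, so with the U option two lines that differ only in case are treated as duplicates. The sketch below is a minimal illustration with made-up sample data (not from the thread). It also reads ErrorLevel, which Sort sets to the number of removed items when U is used. If case must match for lines to count as identical, the C option can be combined with U (Sort, list, C U).

```autohotkey
; Sample data is hypothetical, chosen only to show the case-insensitive default.
list := "Mango`nbanana`nmango`napple"
Sort, list, U			; default comparison: Mango and mango count as one item
removed := ErrorLevel	; with U, ErrorLevel holds the number of items removed
MsgBox % list "`n`nRemoved: " removed
```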


Learning one
  • Members
  • 1483 posts
  • Last active: Jan 02 2016 02:30 PM
  • Joined: 04 Apr 2009

rbrtryn, it is shorter, but does it work properly? Test this:

String=
(
mango
banana
mango
pineapple
apple
)

Oh, what happened to apple? ;)




rbrtryn
  • Members
  • 1177 posts
  • Last active: Sep 11 2013 08:04 PM
  • Joined: 22 Jun 2011

Oh, what happened to apple? ;)

 
Good point, I forgot about substrings.
Fortunately, the fix is easy:
 
String=
(
mango
banana
mango
pineapple
apple
)

MsgBox %  RemoveDuplicates(String)

;=== Function ===
RemoveDuplicates(String, Delimiter="`n")
{
Loop, parse, String, % Delimiter
  if not RegExMatch(out, "\b" A_LoopField "\b") {
     out .= out ? "`n" : ""
     out .= A_LoopField
  }
return out
}
 
@sinkfaze: You're right, Sort is probably better in this case.
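
A caveat on the word-boundary version above: \b matches at any word boundary, so once "bla bla" is in the output a later line "bla" would be seen as already present, and a line containing RegEx metacharacters would be read as a pattern rather than literal text. One way around both, sketched here as an assumption rather than anything posted in the thread, is to wrap the haystack and the needle in the delimiter so that only whole lines can match. The sketch assumes a single-character Delimiter (Loop, Parse treats each character of its delimiter string separately) and keeps InStr's default case-insensitive comparison.

```autohotkey
RemoveDuplicates(String, Delimiter="`n")
{
	out := ""
	Loop, Parse, String, % Delimiter
	{
		; Every stored line is prefixed with Delimiter, so "out Delimiter"
		; has each line fenced on both sides; only whole lines can match.
		if !InStr(out Delimiter, Delimiter A_LoopField Delimiter)
			out .= Delimiter A_LoopField
	}
	; Drop the leading delimiter that was added before the first line.
	return SubStr(out, StrLen(Delimiter) + 1)
}
```

Called as MsgBox % RemoveDuplicates(String) with any of the test strings in this thread, it keeps the first occurrence of each line in its original position.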



sinkfaze
  • Moderators
  • 6367 posts
  • Last active: Nov 30 2018 08:50 PM
  • Joined: 18 Mar 2008

@sinkfaze: You're right, Sort is probably better in this case.

 
The fact that the Sort command sorts the contents in addition to removing the duplicates is the potentially troublesome part. There's probably a feature request in there somewhere. ;)

@Learning one, if you're going to use an object, wouldn't this be just as simple to do?

RemoveDuplicates(String, Delimiter="`n") {
	oUniques := []
	Loop, parse, String, % Delimiter
		if	!oUniques[A_LoopField]
			oUniques[A_LoopField] :=	1, NewString .=	Delimiter A_LoopField
	return	LTrim(NewString, Delimiter)
}


Jackie Sztuk _Blackholyman
  • Spam Officer
  • 3757 posts
  • Last active: Apr 03 2016 08:47 PM
  • Joined: 28 Feb 2012
Here's something from 2007 by polyethene that seems to work too...
String = ; Join with same line ending as if read from a text file
(Join`r`n
mango
banana
mango
pineapple
apple
)



Loop, Parse, String, `n
	If not InStr(list, new := RegExReplace(A_LoopField, "[\d\-:. ]+(.*?)[/ ].*", "$1") . "`n")
		Str .= A_LoopField . "`n", list .= new

MsgBox, %Str%


[AHK] Version. 1.1+ [CLOUD] DropBox ; Copy [WEBSITE] Blog ; About

strobo
  • Members
  • 359 posts
  • Last active: Mar 10 2015 08:13 PM
  • Joined: 19 Jun 2012
String=
(
mango
banana
mango
banana
pineapple
apple
bla bla
bla
bla bli
bla bla
bla bla
)

delim:=instr(String,"`r`n") ? "`r`n" : "`n"
ndl:=(delim="`n" ? "`n" : "")  "ms)^(?:(.*?)" delim ")(?=.*^\1(" delim "|$))"
msgbox,% regexreplace(String,ndl)

Regards,
Babba

sinkfaze
  • Moderators
  • 6367 posts
  • Last active: Nov 30 2018 08:50 PM
  • Joined: 18 Mar 2008

Excellent piece of code, Babba!  But it seems to fall into the same area as using the Sort command where we can only hope that the order after sorting is unimportant.  Can that code be modified to preserve the original order in case that's important?



strobo
  • Members
  • 359 posts
  • Last active: Mar 10 2015 08:13 PM
  • Joined: 19 Jun 2012

Thanks, sinkfaze!

The output follows the rule "last man standing": of each set of duplicates, only the last occurrence is kept. Consider a .log file where new events are appended; this code then keeps the newest "events", so this order can be meaningful.
Getting the oldest "events" seems to be hard to accomplish:
(1) look-behinds are not as powerful as look-aheads
(2) when one also uses some replacetext:="$1", regexreplace starts right behind the last match (afaik), and this simply didn't work with my needles.
But maybe there is some needle...
Of course, one can get the other behaviour with a LinewiseReverse function... I'm just kidding, there are obviously better (more efficient) methods.
 


Regards,
Babba

rbrtryn
  • Members
  • 1177 posts
  • Last active: Sep 11 2013 08:04 PM
  • Joined: 22 Jun 2011

How about this?

 

String=
(
mango
banana
mango
banana
pineapple
apple
bla bla
bla
bla bli
bla bla
bla bla
)

Result := ""
Loop Parse, String, `n, `r
{
   TestStr := A_LoopField
   Loop Parse, Result, `n, `r
      if (TestStr = A_LoopField)
         continue 2
   Result .= Result ? "`n" : ""
   Result .= A_LoopField
}

MsgBox % Result



Klark92
  • Members
  • 870 posts
  • Last active: Dec 29 2015 09:47 PM
  • Joined: 19 Feb 2012

Wow, there are many solutions... but RegEx is my favourite. Thanks, all. :)




sinkfaze
  • Moderators
  • 6367 posts
  • Last active: Nov 30 2018 08:50 PM
  • Joined: 18 Mar 2008

The Output follows the rule: "last man standing."

 

Then in theory reversing the order of the items should result in a raw list of items in their original order once your replace is run, no?

String=
(
mango
banana
mango
banana
pineapple
apple
bla bla
bla
bla bli
bla bla
bla bla
)
Loop, parse, String, `r, `n
	reverse :=	A_LoopField (!reverse ? "" : "`n") reverse
MsgBox %	regexreplace(reverse,"`ams)^(?:(.*?)\v+)(?=.*^\1(\v+|$))")


rbrtryn
  • Members
  • 1177 posts
  • Last active: Sep 11 2013 08:04 PM
  • Joined: 22 Jun 2011

Then in theory reversing the order of the items should result in a raw list of items in their original order once your replace is run, no?

Loop, parse, String, `r, `n
	reverse :=	A_LoopField (!reverse ? "" : "`n") reverse
MsgBox %	regexreplace(reverse,"`ams)^(?:(.*?)\v+)(?=.*^\1(\v+|$))")

 

Gives this result:


mango
banana
pineapple
apple
bla
bla bli
bla bla
 

Why is bla bli before bla bla?

 

This script here not only gives the correct result, it runs about twice as fast.


RegEx: 0.000183 ms
Loop:    0.000095 ms
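
On the question of why bla bli ends up before bla bla: one reading of the quoted code, offered as an interpretation rather than something stated in the thread, is that the parsing loop has its delimiter and omit characters swapped (`r as delimiter, `n as omit character). A continuation section joins its lines with `n only, so the loop sees the whole string as a single field and nothing is actually reversed. The RegEx then runs on the original order and keeps the last occurrences, which matches the result shown above. The sketch below splits on `n instead and reverses a second time after the replace, so the surviving lines come back in their original order. ReverseLines is a hypothetical helper name; the needle is the one from the quoted post, unchanged.

```autohotkey
String=
(
mango
banana
mango
banana
pineapple
apple
bla bla
bla
bla bli
bla bla
bla bla
)

needle := "`ams)^(?:(.*?)\v+)(?=.*^\1(\v+|$))"	; needle from the quoted post
kept := RegExReplace(ReverseLines(String), needle)	; "last man standing" on the reversed list
MsgBox % ReverseLines(kept)	; reverse again to restore the original order

; Hypothetical helper (not from the thread): returns Text with its lines in reverse order.
ReverseLines(Text) {
	reversed := ""
	Loop, Parse, Text, `n, `r	; split on `n, strip any stray `r
		reversed := A_LoopField (A_Index = 1 ? "" : "`n") reversed
	return reversed
}
```

This only addresses the ordering; it still makes two extra passes over the text, so the plain loop is likely the simpler choice for large files.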

 

Spoiler
