[SOLVED] Remove both duplicates in a text file



18 replies to this topic
mikek
  • Members
  • 161 posts
  • Last active: Nov 09 2015 05:02 PM
  • Joined: 21 Nov 2008
Hi all, I'm looking for suggestions on how to write a script that will remove both duplicate entries in a text file.

So if this is the content of the text file:
1
1
2
2
3
4
4
5

The result would be:
3
5

I found a great post here which shows how to remove duplicates, but I can't figure out how to remove both duplicates. Any ideas or suggestions would be appreciated.

- Mike

TheDewd
  • Members
  • 842 posts
  • Last active: Jun 10 2016 06:55 PM
  • Joined: 28 Mar 2010
Tested and worked for me: http://www.autohotke... ... 8365#98365

mikek
  • Members
  • 161 posts
  • Last active: Nov 09 2015 05:02 PM
  • Joined: 21 Nov 2008

Thank you, westoncampbell, but that is just doing a standard de-duping, and results in this:
1
2
3
4
5

I would like to remove both sets of dupes, so if it sees the same line twice, it removes both lines. Do you have any suggestions on how to do that?

- Mike

jaco0646
  • Moderators
  • 3165 posts
  • Last active: Apr 01 2014 01:46 AM
  • Joined: 07 Oct 2006
✓  Best Answer
This assumes the list is sorted.
var =
(
1
1
2
2
3
4
4
5
)
MsgBox, % RegExReplace(var, "m`a)^(.+)\R(\1(\R|$))+")
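
For readers new to the pattern, here is an annotated sketch of the same call, with a Sort added so that unsorted input also works. The sample data, the variable name `result`, and the Sort step are illustrative assumptions and are not part of the original answer.

```autohotkey
; Illustrative input (assumed): the same values as the example, but unsorted.
var := "1`n2`n2`n1`n3`n4`n4`n5"

; Sort uses `n as its default delimiter, so equal lines end up adjacent.
Sort, var

; Pattern breakdown:
;   m`a)         options: multiline anchors, accept any newline type
;   ^(.+)        capture one whole non-empty line
;   \R           the line break that follows it
;   (\1(\R|$))+  one or more further copies of that line, each ending
;                in a line break or at the end of the text
; The replacement parameter is omitted, so each matched run is deleted.
result := RegExReplace(var, "m`a)^(.+)\R(\1(\R|$))+")
MsgBox, % result
```

Because of the trailing `+`, a line that appears three or more times in a row is removed together with all of its copies, not just the first pair.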


mikek
  • Members
  • 161 posts
  • Last active: Nov 09 2015 05:02 PM
  • Joined: 21 Nov 2008
This works, even on large files. Thank you, jaco0646!

- Mike

  • Guests
  • Last active:
  • Joined: --
I know it's solved, but I had this ready to post before, so I still have to post it...

...so if it sees the same line twice...

...in a row?...or any duplicate lines at all?

What about this input?...
1
2
2
1
3
4
4
5
...should the result still be...
3
5
...or...
1
1
3
5
...the 1's were not next to each other in the input, so are they both not removed? Also, should it recursively remove duplicate lines?...so that now 1 & 1 are next to each other, another cycle would remove them?

mikek
  • Members
  • 161 posts
  • Last active: Nov 09 2015 05:02 PM
  • Joined: 21 Nov 2008
Ideally they would not need to be next to each other, but a sort command may be run prior to the duplicate removal.

- Mike
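
If sorting is not wanted, one alternative is to count how often each line occurs and keep only the lines seen exactly once, which also preserves the original order. This is a minimal sketch, not something posted in the thread: the variable names are made up for illustration, and the nested loop compares every line with every other line, so it will get slow on large files.

```autohotkey
; Illustrative input (assumed): duplicates that are not next to each other.
var := "1`n2`n2`n1`n3`n4`n4`n5"

result := ""
Loop, Parse, var, `n, `r
{
    line := A_LoopField
    count := 0
    ; Count how many lines in the whole text equal the current one.
    Loop, Parse, var, `n, `r
    {
        ; The "x" prefix forces a string comparison, so "01" and "1"
        ; are not treated as the same number. == is case-sensitive.
        if ("x" A_LoopField == "x" line)
            count++
    }
    ; Keep the line only if it occurred exactly once.
    if (count = 1)
        result .= line "`n"
}
MsgBox, % result
```

With the sample input above, only 3 and 5 occur once, so those are the two lines that remain.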

mikek
  • Members
  • 161 posts
  • Last active: Nov 09 2015 05:02 PM
  • Joined: 21 Nov 2008
The regex command has been working very well for me; however, I have found a couple of lines that are not being removed. When the two lines shown below appear in both text files, they are not removed. Any idea why?

all1.txt:
GRA 17226 - 1 E-Mail Design Fee         -         75 - Pixel code change…        -                  
YEL 14651 - 0 Creative - Text Change    -         75 - Pixel code Change…        -                  
all2.txt:
GRA 17226 - 1 E-Mail Design Fee         -         75 - Pixel code change…        -                  
YEL 14651 - 0 Creative - Text Change    -         75 - Pixel code Change…        -                  
ahk script:
;
; ---[ Merge old (all1.txt) and new (all2.txt) job lists into one file, all3.txt ]---
;
FileRead, Text, all1.txt
FileAppend, %Text%, all3.txt
FileRead, Text, all2.txt
FileAppend, %Text%, all3.txt

;
; ---[ Sort the merged job list ]---
;
FileRead, Text, all3.txt
Sort, Text
FileAppend, %Text%, all4.txt

;
; ---[ Remove all sets of dupes ]---
;
FileRead, var, all4.txt
var2 := RegExReplace(var,"m`a)^(.+)\R(\1(\R|$))+")
FileAppend, %var2%, all5.txt
all5.txt (the resulting file, which should NOT contain these four lines):
GRA 17226 - 1 E-Mail Design Fee         -         75 - Pixel code change…        -                  
GRA 17226 - 1 E-Mail Design Fee         -         75 - Pixel code change…        -                  
YEL 14651 - 0 Creative - Text Change    -         75 - Pixel code Change…        -                  
YEL 14651 - 0 Creative - Text Change    -         75 - Pixel code Change…        -                  


TheDewd
  • Members
  • 842 posts
  • Last active: Jun 10 2016 06:55 PM
  • Joined: 28 Mar 2010
There might be an invisible character in your text files not being detected by RegEx.

  • Guests
  • Last active:
  • Joined: --

...perhaps not invisible...I think it might be the "…" ellipsis character. The ellipsis character is a single char containing an ellipsis (which is 3 dots).
ellipsis, using 3 dots/periods: "..."
ellipsis character: "…"
...try adding an ellipsis character (copy/paste it) to one of the other lines (that is duplicated, but not a problem) & see if it becomes a "problem line", then remove the ellipsis character from all lines...or replace it with 3 dots...& see if it fixes the problem.

If that is the problem, then either the RegEx needs tweaked...or it's cuz of some other issue with weird chars (Unicode?).

mikek
  • Members
  • 161 posts
  • Last active: Nov 09 2015 05:02 PM
  • Joined: 21 Nov 2008

Yes! That is the problem! I added the ellipsis character to the same line in both text files, and that line item started showing up in the final list too. Also, those two jobs were the only two which had the ellipsis character.

So I guess the next question is, how can I handle these characters? I do not know how to write RegEx strings.

- Mike

  • Guests
  • Last active:
  • Joined: --

So I guess the next question is, how can I handle these characters?

...well since I hate the "ellipsis character" (but love ellipses)...I would StringReplace all ellipsis chars into 3 dots.

I do not know how to write RegEx strings.

...I didn't write this RegEx, but I will test & see what I can come up with...there is nothing obviously wrong with the RegEx...it should support any char. My guess is that the "ellipsis character" might be considered either high ASCII or even Unicode...& AutoHotkey or the RegEx library AutoHotkey uses can't handle it.

Are you testing in AutoHotkey 1.0.48.05 or some version of AHK_L? I would love to know if this works in AHK_L.

  • Guests
  • Last active:
  • Joined: --
Simple fix!!! - Change the script file's encoding to UTF-8.

In a quick test...this script...

var=
(LTrim
	1
	1
	2
	2
	testing1...
	testing1...
	3
	testing2…
	testing2…
	4
	4
	5
)

;//StringReplace, var, var, …, ..., a

var:=RegExReplace(var,"m`a)^(.+)\R(\1(\R|$))+")

msgbox, 64, , %var%
...will FAIL (leaves both "testing2" lines) when the file is saved with ANSI encoding...but will SUCCEED if the file encoding is changed to UTF-8.

In Notepad2, you can change the encoding by clicking...File -> Encoding -> UTF-8. I highly recommend Notepad2, it's much better than the default Windows Notepad & less intense than other Notepad replacements.

I also included that StringReplace (commented out), just in case you wanna rid yourself of ellipsis chars.
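
A likely explanation for the ANSI failure, going by the AutoHotkey RegEx documentation: the `a option treats Chr(0x85) (NEL, "next line") as a newline, and 0x85 is also the byte used for "…" in the Windows ANSI code page. Under `a the dot then stops at the ellipsis, so the captured line and its copy never line up. In a UTF-8 file the ellipsis is stored as different bytes, which would explain why changing the encoding helps. The sketch below uses Chr(133) as a stand-in for the ANSI ellipsis (an assumption, so that the demo does not depend on how the script file is saved) and compares the original options with a variant that treats only `n as a newline. The `n variant is a suggestion and was not tested in this thread.

```autohotkey
; Assumption: Chr(133) (0x85) stands in for the ANSI-encoded ellipsis.
nel := Chr(133)
var := "1`n1`ntesting2" nel "`ntesting2" nel "`n5"

; Original options: `a counts Chr(133) as a line break, so (.+) stops
; in front of it and the "testing2" pair is not recognised as a duplicate.
withAny := RegExReplace(var, "m`a)^(.+)\R(\1(\R|$))+")

; Variant: `n limits line breaks to linefeeds, so the dot can match
; Chr(133); \r?\n also tolerates files with CRLF line endings.
withLf := RegExReplace(var, "m`n)^(.+)\r?\n(\1(\r?\n|$))+")

MsgBox, % "With ``a:`n" withAny "`n`nWith ``n:`n" withLf
```

If the variant behaves as expected, it avoids having to change file encodings or replace the ellipsis character, at the cost of no longer recognising lone `r line endings.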

  • Guests
  • Last active:
  • Joined: --
I did a little more testing & in your case, instead of changing the encoding on the script file itself, you need to change the encoding on the 2 source files.

I also cleaned up your script, you were reading from & writing to files more than necessary...

old.txt:
1
2
testing1_unique...
testing1_dupe...
testing2_dupe…
4
5

new.txt:
1
2
testing1_dupe...
3
testing2_unique…
testing2_dupe…
4

file_old=old.txt
file_new=new.txt
file_output=unique.txt

;//
;// ---[ Read old (%file_old%) & new (%file_new%) job lists into vars ]---
;//
FileRead, old, %file_old%
FileRead, new, %file_new%

;//
;// ---[ Sort the merged job list ]---
;//
text:=old "`n" new
Sort, text

;//
;// ---[ Remove all sets of dupes ]---
;//
text:=RegExReplace(text, "m`a)^(.+)\R(\1(\R|$))+")
FileDelete, %file_output%
FileAppend, %text%, %file_output%
...& the expected output is...

3
5
testing1_unique...
testing2_unique…


mikek
  • Members
  • 161 posts
  • Last active: Nov 09 2015 05:02 PM
  • Joined: 21 Nov 2008
Thank you very much for your help; I really appreciate it. The StringReplace solution appears to be an easy way to solve the problem, with no noticeable change in processing speed.

I'm using AHK 1.0.47.6 (standard version).

Thank you also for the tip on Notepad2. It does look much better than the standard notepad, and great for working with scripts.

The read/write improvements are great too. Do you have any suggestions for improving the "Mark each line as old or new" part of the code, shown below?
FileRead, old, all1.txt
FileRead, new, all2.txt

both := old "`n" new ; Merge old and new job lists
Sort, both ; Sort the merged job list by job number
StringReplace, both, both, …, ..., a
var2 := RegExReplace(both,"m`a)^(.+)\R(\1(\R|$))+")
FileAppend, %var2%, all5.txt

;
; ---[ Mark each line as old or new ]---
;
Loop, Read, all5.txt
	{
	all5 := A_LoopReadLine
	Loop, Read, all1.txt
		{
		If (all5 = A_LoopReadLine)
			{
			all5 := all5 " - Old"
			FileAppend, %all5%`n, all6.txt
			}
		}
	Loop, Read, all2.txt
		{
		If (all5 = A_LoopReadLine)
			{
			all5 := all5 " - New"
			FileAppend, %all5%`n, all6.txt
			}
		}
	}

- Mike
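
One possible simplification of the marking step, offered as a sketch rather than a tested answer: after the dupes are removed, every remaining line should come from exactly one of the two files, so it is enough to check each line against all1.txt and treat everything else as new. The file names are taken from the script above; the variable names (`oldHay`, `out`, `tag`) are made up for illustration. Like the `=` comparison in the original loops, InStr is case-insensitive by default.

```autohotkey
FileRead, old, all1.txt
FileRead, unique, all5.txt

; Normalise line endings so whole-line lookups are reliable.
StringReplace, old, old, `r`n, `n, All
StringReplace, unique, unique, `r`n, `n, All

; Same ellipsis replacement that was applied before de-duping, so lines
; from all1.txt still match their counterparts in all5.txt.
StringReplace, old, old, …, ..., All

; Surround the old list with newlines so only whole lines can match.
oldHay := "`n" old "`n"

out := ""
Loop, Parse, unique, `n
{
    if (A_LoopField = "")
        continue  ; skip empty lines
    ; Assumption: a line not found in all1.txt must have come from all2.txt.
    tag := InStr(oldHay, "`n" A_LoopField "`n") ? " - Old" : " - New"
    out .= A_LoopField tag "`n"
}

; Write the result once instead of appending line by line.
FileDelete, all6.txt
FileAppend, %out%, all6.txt
```

This reads each file once instead of re-reading all1.txt and all2.txt for every line of all5.txt, and deleting all6.txt first keeps repeated runs from piling up duplicate output.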