Remove duplicates from text file that matches a given string

Question

AD 1

Hi,
I'm hoping to use PowerShell to save time removing duplicates from some files. The answer below, which I found on the old Microsoft site, is exactly what I want to do, and it works with the example text given there. But when I change the text that it should match, it doesn't work and just removes both instances from the file.

https://social.technet.microsoft.com/Forums/lync/en-US/13c69268-da2f-4ecb-bfd4-f98c9f5170ab/remove-duplicates-from-text-file-that-matches-a-given-string?forum=winserverpowershell

I've created the example below just to demonstrate the issue. What I need is a new file with all of this data in it but only one instance of " <credits>A Human</credits>". As you can see, this line appears twice below; everything else should stay exactly as it is.

<actor>
<name>Actor</name>
</actor>
<actor>
<name>Person</name>
</actor>
<actor>
<name>A Human</name>
</actor>
<credits>Person</credits>
<credits>A Human</credits>
<credits>A Human</credits>
<credits>Another Human</credits>
<credits>Another totally different human</credits>

Ideally the location of each instance shouldn't matter, and duplicates like the ones below would also be spotted (in the example on that Microsoft site both duplicate lines were consecutive, but that may not be the case here). I have no way of knowing how many characters may appear between <credits> and </credits>; that will vary, but the line will always start with <credits> and end with </credits>.

<credits>Person</credits>
<credits>A Human</credits>
<credits>Another Human</credits>
<credits>Another totally different human</credits>
<credits>A Human</credits>

If I use <credits> as the match string I was expecting it to find all lines containing <credits> and put just the unique entries in the new file like below:

<credits>Person</credits>
<credits>A Human</credits>
<credits>Another Human</credits>
<credits>Another totally different human</credits>

Instead the two duplicates containing "A Human" are removed completely from the credits section and the credits section of the new file shows:

<credits>Person</credits>
<credits>Another Human</credits>
<credits>Another totally different human</credits>

Can anyone suggest how I can correct the below so it works to achieve what I'm looking to do?

$lines = get-content "c:\test\2.txt"
$count = 1;
$matchString = "<credits>";
foreach($line in $lines)
{
    if($line.ToString().IndexOf($matchString) -gt 0)
    {
        if($count -eq 1)
        {
            $line | out-string | add-content "c:\test\new.txt";
            $count = $count + 1;
        }
    }
    else
    {
        $line | out-string | add-content "c:\test\new.txt";
    }
}

Thanks
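
Two details of the script above may be worth checking. This is a small illustration, not a diagnosis of the exact output reported. `IndexOf` returns 0 when the match is at the very start of the line and -1 when there is no match, so `-gt 0` only catches lines where something (indentation, for instance) precedes `<credits>`. Separately, `$count` lets through only the first matching line, whatever its text. The variable names below are made up for the illustration.

```powershell
$matchString = '<credits>'

# Match at the very start of the line: IndexOf returns 0
$atStart  = '<credits>A Human</credits>'.IndexOf($matchString)

# Match after two leading spaces: IndexOf returns 2
$indented = '  <credits>A Human</credits>'.IndexOf($matchString)

# No match at all: IndexOf returns -1
$noMatch  = '<name>A Human</name>'.IndexOf($matchString)

# "-ge 0" is true for a match anywhere in the line, including position 0;
# "-gt 0" is false for a line that begins with the tag
$matchedWithGe = $atStart -ge 0
$matchedWithGt = $atStart -gt 0
```

A test such as `-ge 0` matches the tag anywhere in the line. Telling different credits apart additionally needs a record of the lines already written, for example a hashtable keyed on the line.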

Rich Matheisen 47,901 Reputation points

2022-05-04T02:00:12.947+00:00
The format of the data you supplied looks very much like part of a larger XML file. Is the file a valid XML file?

How is the code going to handle data that looks like this? It's a perfectly valid format that represents the same data as "<credits>A Human</credits>":

<credits> A Human </credits>

How would your code deal with duplicates of the same credits when they appear in some other actor's data in the same file?

Yitzhak Khabinsky 26,586 Reputation points

2022-05-04T12:29:57.4+00:00

@AD ,

If your file in question is a well-formed XML with a root tag, it is better to use XSLT transformation for your task.

AD 1 Reputation point

2022-05-04T13:16:58.743+00:00

Hi, this is just a basic text file, not XML despite what the example text I listed looks like.

Yes, it would always be a single line starting with <credits> and ending with </credits>, like:
<credits>Totally new example</credits>

Rather than:
<credits>
Totally new example
</credits>

If it makes things easier to look for lines starting with <credits> and ending with </credits> when deduplicating, rather than lines just containing some matching text like <credits>, then that would be OK.

There should be no other matches within any other actor's data, as none of those lines would contain the text "<credits>". Remember, absolutely everything in the file should be left untouched unless the line contains a specific section of text (in this case "<credits>", or starting with <credits> and ending with </credits> if that's easier to work with).

AD 1 Reputation point

2022-05-04T13:17:42.857+00:00

Hi, this is just a basic text file, not XML despite what the example text I listed looks like. thanks

Answer 1

Andreas Baumgarten 123.6K MVP Volunteer Moderator

Hi @AD ,

maybe this helps:

Content of File1.txt:

<credits>Person</credits>  
<credits>A Human</credits>  
<credits>Another Human</credits>  
<credits>Another Human</credits>  
<credits>Another totally different human</credits>  
<credits>A Human</credits>

Script:

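$in = Get-Content -Path .\Junk\File1.txt
$out = @()   # start empty so no blank first element ends up in the result
foreach ($line in $in) {
  Write-Host "Processing $line"   # progress message on the host, not part of $out
  # Note: every line is compared, not only the <credits> lines
  if ($out -notcontains $line) {
    $out = $out + $line
  }
}
$out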

The output will be like this:

<credits>Person</credits>  
<credits>A Human</credits>  
<credits>Another Human</credits>  
<credits>Another totally different human</credits>

----------

(If the reply was helpful please don't forget to upvote and/or accept as answer, thank you)

Regards
Andreas Baumgarten

AD 1 Reputation point

2022-05-04T12:54:49.407+00:00

Hi,
Thanks, that output does indeed show only one instance of each <credits> entry. However, when run on the full file it creates a much bigger problem by also removing duplicates of the <actor> and </actor> lines. I need to check for duplicates only on lines containing the text "<credits>" or "</credits>" (or both, if that helps; the potentially duplicated lines will always start with <credits> and end with </credits>). The rest of the file should be left completely untouched, which is what the answer at that MSDN link achieved. Sadly, it doesn't seem to work for me when I change the match text to "<credits>", or I'd have already solved this.

Output from your script on the example above would be:

<actor>
<name>Actor</name>
</actor>
<name>Person</name>
<name>A Human</name>
<credits>Person</credits>
<credits>A Human</credits>
<credits>Another Human</credits>
<credits>Another totally different human</credits>

Thanks

Andreas Baumgarten 123.6K Reputation points MVP Volunteer Moderator

2022-05-04T19:29:15.227+00:00
Hi @AD ,

just to understand:
Is the format of all the text files the same?
Do they always start with this section:

<actor>
<name>Actor</name>
</actor>
<name>Person</name>
<name>A Human</name>

Is the credits section always at the end, with no tags other than <credits>*something*</credits> in between?

Answer 2

Rich Matheisen 47,901

If the assumption is that the "<credits>.....</credits>" always appears on one line, and that the value between the two tags doesn't vary in format (e.g., you won't see both "<credits>MyCredit</credits>" and "<credits>[space]MyCredit</credits>"), then this will quickly eliminate the duplicates:

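# Hashtable of the <credits> lines already written
$credits = @{ }
Get-Content "c:\test\2.txt" |
    ForEach-Object {
        # Lines that do not start with <credits>...</credits> pass through untouched
        if ($_ -notmatch '^<credits>[^<]+?</credits>') {
            $_
        }
        else{
            # Emit a <credits> line only the first time it is seen
            if (-NOT $credits.ContainsKey($_)){
                $credits[$_] = ""
                $_
            }
        }
    } | Out-File "c:\test\new.txt";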
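
If the whitespace around the value can vary (the caveat stated above), one option is to key on the trimmed text between the tags rather than on the whole line. The following is only a sketch under that assumption; `Get-CreditKey` is a hypothetical helper, not something from the thread.

```powershell
# Hypothetical helper: returns the trimmed text between <credits> and </credits>,
# or nothing when the line is not a single-line <credits> entry.
function Get-CreditKey {
    param([string]$Line)

    if ($Line -match '^\s*<credits>(.*)</credits>\s*$') {
        # $Matches[1] holds the first capture group of the last successful -match
        $Matches[1].Trim()
    }
}

$keyPadded = Get-CreditKey '<credits> A Human </credits>'   # A Human
$keyPlain  = Get-CreditKey '<credits>A Human</credits>'     # A Human
$keyOther  = Get-CreditKey '<name>A Human</name>'           # nothing returned
```

The key could then stand in for `$_` in the `$credits.ContainsKey(...)` check, so that both spellings count as the same credit.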

AD 1 Reputation point

2022-05-04T13:09:37.793+00:00

Hi, this looks to have all the data from that one file added into the script to process. I would instead need to read it from a file (as the data would change every time) and then output it to a second file that's identical to the first, except that the lines that start with <credits> and end with </credits> would not have any duplicates. I have no way of knowing what would be between <credits> and </credits>, and the format would likely differ, although it should always start with a letter of the alphabet after <credits>, so I can't see any scenario where something like "<credits> Human" would appear with a space at the front. There may be 1, 2, 3, 4... spaces in total between <credits> and </credits> though; there's no way to know. For instance, as further examples, I'd need it to handle:

<credits>H u m a n</credits>
<credits>Never seen this example before</credits>

Yes, it would always be a single line starting with <credits> and ending with </credits>, like:
<credits>Totally new example</credits>

Rather than:
<credits>
Totally new example
</credits>

Hopefully that explains things better

Rich Matheisen 47,901 Reputation points

2022-05-04T15:16:59.393+00:00

The data's in the code because I didn't want to create a file to hold it. I'm guessing you don't know PowerShell very well or you'd be able to make the minor changes to the code to substitute "Get-Content" on line 19 for the split operation.

In any case, I updated the code sample I submitted earlier. It now works with a file and eliminates the need for the last few lines of code that produced the "de-duped" data.

AD 1 Reputation point

2022-05-04T16:06:42.757+00:00

Sorry, I'm still at work so I only had time for a very brief check of what was posted. I'm certainly not a PowerShell expert, but I'm trying to learn. I tried to run what you posted last time and PowerShell just threw an error; I didn't have time to check why, and didn't try to adjust it to point at a file since it wasn't working. I was hoping to come back to it later tonight after work if I can get some free time. I had a brief check with this new one and there are no errors this time, but sadly the output is exactly the same as the input: it did not remove any duplicates. For example, a test file that contains just the lines below gives an output file containing exactly the same (the <credits>A Human</credits> line still appears twice).

<credits>NonHuman</credits>
<credits>A Human</credits>
<credits>A Human</credits>

Thanks

Rich Matheisen 47,901 Reputation points

2022-05-04T19:18:54.467+00:00

Here are the input and output I used with the code that uses files:
198972-2.txt
198918-new.txt

There are NO duplicate "<credits>" tags in the output.
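
The differing results reported above might come from indentation. This is a guess, not something confirmed in the thread. The pattern `^<credits>` only matches lines that begin with the tag, and the sample line quoted in the question has a leading space. Below is a sketch of a variant that tolerates leading and trailing whitespace; `Remove-DuplicateCreditLine` is a hypothetical name, and the comparison is case-sensitive here, unlike a PowerShell hashtable literal.

```powershell
# Hypothetical filter: passes every line through unchanged, except that a
# single-line <credits>...</credits> entry is emitted only the first time it is seen.
function Remove-DuplicateCreditLine {
    param(
        [Parameter(ValueFromPipeline = $true)]
        [AllowEmptyString()]
        [string]$Line
    )
    begin {
        # Trimmed <credits> lines already written (case-sensitive comparison)
        $seen = New-Object 'System.Collections.Generic.HashSet[string]'
    }
    process {
        if ($Line -match '^\s*<credits>.*</credits>\s*$') {
            # HashSet.Add returns $true only when the value was not already present
            if ($seen.Add($Line.Trim())) { $Line }
        }
        else {
            $Line
        }
    }
}

# Possible usage with the paths from the thread:
# Get-Content 'c:\test\2.txt' | Remove-DuplicateCreditLine | Out-File 'c:\test\new.txt'
```

Because only lines matching the `<credits>` pattern are ever compared, repeated `<actor>` and `</actor>` lines are left alone.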
