Remove duplicates from text file that matches a given string

AD 1 Reputation point
2022-05-03T19:45:08.69+00:00

Hi,
I'm hoping to use PowerShell to save time removing duplicates from some files. The answer I found on the old Microsoft site (linked below) is exactly what I want to do, and it works with the example text given there. But when I change the text that the line should match, it doesn't work and removes both instances from the file.

https://social.technet.microsoft.com/Forums/lync/en-US/13c69268-da2f-4ecb-bfd4-f98c9f5170ab/remove-duplicates-from-text-file-that-matches-a-given-string?forum=winserverpowershell

I've created the example below to demonstrate the issue. I need to create a new file containing all of this data but only one instance of " <credits>A Human</credits>". As you can see, this line appears twice below; everything else should stay exactly as it is.

<actor>
<name>Actor</name>
</actor>
<actor>
<name>Person</name>
</actor>
<actor>
<name>A Human</name>
</actor>
<credits>Person</credits>
<credits>A Human</credits>
<credits>A Human</credits>
<credits>Another Human</credits>
<credits>Another totally different human</credits>

Ideally the location of each instance shouldn't matter, and duplicates like the ones below would also be spotted (in the example on that Microsoft site both duplicate lines were consecutive, but that may not be the case). I have no way of knowing how many characters appear between <credits> and </credits>; that will vary, but the line will always start with <credits> and end with </credits>.

<credits>Person</credits>
<credits>A Human</credits>
<credits>Another Human</credits>
<credits>Another totally different human</credits>
<credits>A Human</credits>

If I use <credits> as the match string, I expected it to find all lines containing <credits> and put only the unique entries in the new file, like this:

<credits>Person</credits>
<credits>A Human</credits>
<credits>Another Human</credits>
<credits>Another totally different human</credits>

Instead, both duplicates containing "A Human" are removed completely, and the credits section of the new file shows:

<credits>Person</credits>
<credits>Another Human</credits>
<credits>Another totally different human</credits>

Can anyone suggest how I can correct the script below so that it does what I'm looking for?

$lines = get-content "c:\test\2.txt"
$count = 1;
$matchString = "<credits>";
foreach($line in $lines)
{
    if($line.ToString().IndexOf($matchString) -gt 0)
    {
        if($count -eq 1)
        {
            $line | out-string | add-content "c:\test\new.txt";
            $count = $count + 1;
        }
    }
    else
    {
        $line | out-string | add-content "c:\test\new.txt";
    }
}

Thanks

Windows for business Windows Server User experience PowerShell

2 answers

  1. Andreas Baumgarten 123.4K Reputation points MVP Volunteer Moderator
    2022-05-03T21:49:58.563+00:00

    Hi @AD ,

    maybe this helps:

    Content of File1.txt:

    <credits>Person</credits>  
    <credits>A Human</credits>  
    <credits>Another Human</credits>  
    <credits>Another Human</credits>  
    <credits>Another totally different human</credits>  
    <credits>A Human</credits>  
    

    Script:

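    $in = Get-Content -Path .\Junk\File1.txt
    # Start with an empty array; seeding it with "" would add a blank first line
    $out = @()
    foreach ($line in $in) {
      # Write-Host keeps the progress message out of the script's output
      Write-Host "Processing $line"
      # Keep the line only the first time it is seen
      if ($out -notcontains $line) {
        $out += $line
      }
    }
    $out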
    

    The output will be like this:

    <credits>Person</credits>  
    <credits>A Human</credits>  
    <credits>Another Human</credits>  
    <credits>Another totally different human</credits>  
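    One caveat: the script above compares every line, so on the full sample from the question the repeated <actor> and </actor> lines would be collapsed as well. A minimal sketch of one way to restrict the check to <credits> lines follows. The function name Remove-DuplicateMatch and its parameters are hypothetical, not something from this thread, and the HashSet comparison is case-sensitive, unlike -notcontains.

    ```powershell
    # Hypothetical helper (not from the thread): drops repeated lines that match
    # $Pattern, keeps the first occurrence, and passes every other line through.
    function Remove-DuplicateMatch {
        param(
            [string[]]$Line,
            [string]$Pattern = '<credits>'
        )
        $seen = New-Object 'System.Collections.Generic.HashSet[string]'
        foreach ($l in $Line) {
            # HashSet.Add returns $false when the value is already present,
            # so a matching line is emitted only the first time it is seen.
            if ($l -notmatch $Pattern -or $seen.Add($l)) {
                $l
            }
        }
    }

    # Sample data modelled on the question's file
    $sample = @(
        '<actor>', '<name>Actor</name>', '</actor>',
        '<actor>', '<name>A Human</name>', '</actor>',
        '<credits>Person</credits>',
        '<credits>A Human</credits>',
        '<credits>Another Human</credits>',
        '<credits>A Human</credits>'
    )
    $result = Remove-DuplicateMatch -Line $sample
    $result
    ```

    With this sample, both <actor> blocks stay intact and only the second "A Human" credits line is dropped. To run it against a file, the lines from Get-Content can be passed to -Line in place of $sample.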
    

    ----------

    (If the reply was helpful please don't forget to upvote and/or accept as answer, thank you)

    Regards
    Andreas Baumgarten


  2. Rich Matheisen 47,901 Reputation points
    2022-05-04T02:36:47.183+00:00

    If the assumption is that the "<credits>.....</credits>" element always appears on one line, and that the value between the two tags doesn't vary in format (e.g., you won't see both "<credits>MyCredit</credits>" and "<credits>[space]MyCredit</credits>"), then this will quickly eliminate the duplicates:

    $credits = @{ }
    Get-Content "c:\test\2.txt" |
        ForEach-Object {
            if ($_ -notmatch '^<credits>[^<]+?</credits>') {
                $_
            }
            else{
                if (-NOT $credits.ContainsKey($_)){
                    $credits[$_] = ""
                    $_
                }
            }
        } | Out-File "c:\test\new.txt";
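    If those assumptions can't be guaranteed, for instance when the <credits> lines are indented or carry stray spaces inside the tags, a whitespace-tolerant variant is possible. The sketch below is only an illustration: Select-UniqueCredit is a hypothetical name, and it assumes that two credits lines count as duplicates when their inner text is equal after trimming. As with the hashtable above, keys are compared case-insensitively.

    ```powershell
    # Hypothetical helper (not from the thread): treats two <credits> lines as
    # duplicates when their inner text matches after trimming whitespace.
    function Select-UniqueCredit {
        param([string[]]$Line)
        $seen = @{}
        foreach ($l in $Line) {
            # Allow leading/trailing whitespace around the tags and the value
            if ($l -match '^\s*<credits>\s*(.*?)\s*</credits>\s*$') {
                $key = $Matches[1]
                if (-not $seen.ContainsKey($key)) {
                    $seen[$key] = $true
                    $l    # the first occurrence is emitted exactly as written
                }
            }
            else {
                $l        # non-credits lines pass through unchanged
            }
        }
    }

    $sample = @(
        '<credits>A Human</credits>',
        '  <credits> A Human </credits>',
        '<credits>Another Human</credits>'
    )
    $unique = Select-UniqueCredit -Line $sample
    $unique
    ```

    Here the indented, space-padded "A Human" line is treated as a repeat of the first line and dropped. For a file, the result of Get-Content "c:\test\2.txt" can be passed to -Line and the output piped to Out-File "c:\test\new.txt", as in the script above.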
    
