PowerShell: Select and delete everything between two HTML comment tags, with some exceptions

Suzana Eree 811 Reputation points
2021-05-13T07:54:27.263+00:00

I have the HTML code below, which appears in multiple pages in the folder c:\Folder2.

I want to select and delete everything between the comments <!-- ARTICOL START --> and <!-- ARTICOL FINAL -->, except the <p class=...>...</p> lines. Can this be done with PowerShell?

  <!-- ARTICOL START -->

<div align="justify">
        <table width="682" border="0">
          <tr>
            <td><h1 class="den_articol" itemprop="sfe">My text here</h1></td>
          </tr>
          <tr>
            <td class="text_dreapta">On Ianuarie 14, 2014, in <a href="https://neculaifantanaru.com/en/qualities-of-a-leader.html" title="See al articles from  Qualities of a leader" class="external" rel="category tag">Qualities of a leader</a>, by Author</td>
          </tr>
        </table>
      <h2 class="text_obisnuit2"><img src="index_files/sfa.jpg" width="718" height="605" id="sfs" usemap="#m_dgrnt" alt="hip" /><map name="tfAbonament" id="m_34">
<area shape="rect" coords="259,545,457,582" href="#plata" alt="" />
</map></h2>
        <p class="den_articol">Why this text text?</p>
<p class="text_obisnuit">test text text</p>
        <p class="text_obisnuit">test text text</p>
  <p class="text_obisnuit2">test text text</p>
    </div>
    <p align="justify" class="text_obisnuit style3">&nbsp;</p>

       <!-- ARTICOL FINAL -->

The output should be:

       <!-- ARTICOL START -->

        <p class="den_articol">Why this text text?</p>
<p class="text_obisnuit">test text text</p>
        <p class="text_obisnuit">test text text</p>
  <p class="text_obisnuit2">test text text</p>

       <!-- ARTICOL FINAL -->

Accepted answer
  1. Ian Xue (Shanghai Wicresoft Co., Ltd.) 34,271 Reputation points Microsoft Vendor
    2021-05-13T11:37:41.53+00:00

    Hi,

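    Please see if this works. The script processes every .html file in C:\Folder2 and writes the filtered copies to C:\Output, which must already exist.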

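    $sourcedir = "C:\Folder2\"
    $resultsdir = "C:\Output\"
    Get-ChildItem -Path $sourcedir -Filter *.html | ForEach-Object {
        $output = @()
        $content = Get-Content -Path $_.FullName
        # Find the marker lines (assumes each marker occurs exactly once per file)
        $start = $content | Where-Object {$_ -match '<!-- ARTICOL START -->'}
        $final = $content | Where-Object {$_ -match '<!-- ARTICOL FINAL -->'}
        # Look up the marker positions once instead of on every loop iteration
        $startIndex = $content.IndexOf($start)
        $finalIndex = $content.IndexOf($final)
        for ($i = 0; $i -lt $content.Count; $i++) {
            # Between the markers, skip every line that is not a <p class=...> line
            if (($i -gt $startIndex) -and ($i -lt $finalIndex)) {
                if ($content[$i] -notmatch '<p class=') {
                    continue
                }
            }
            $output += $content[$i]
        }
        # Save the filtered page under the same file name in the output folder
        $output | Out-File -FilePath (Join-Path $resultsdir $_.Name)
    }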
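
The script above saves each page with Out-File. In Windows PowerShell 5.1 that cmdlet writes UTF-16LE unless told otherwise, while PowerShell 6 and later default to UTF-8 without a BOM, so it may be worth naming the encoding explicitly. The following is only a sketch: it assumes the pages are UTF-8, and $output and $target here are stand-ins (a few sample lines and a hypothetical temp-file path), not values taken from the script.

```powershell
# Stand-in for the filtered lines that the script collects in $output
$output = @(
    '<!-- ARTICOL START -->',
    '<p class="text_obisnuit">test text text</p>',
    '<!-- ARTICOL FINAL -->'
)

# Hypothetical target path; a temp file keeps this sketch harmless to run
$target = Join-Path ([System.IO.Path]::GetTempPath()) 'file-filtered.html'

# Name the encoding explicitly instead of relying on the version-specific default
# (assumption: the pages are UTF-8)
$output | Set-Content -Path $target -Encoding UTF8

# Read the file back with the same encoding to confirm the lines round-trip
$roundTrip = @(Get-Content -Path $target -Encoding UTF8)
$roundTrip
```

In the script itself, the equivalent change would be adding -Encoding UTF8 to the Out-File call, and to Get-Content if the pages contain non-ASCII text. Note that Windows PowerShell 5.1 writes a BOM when UTF8 is requested.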
    

    Best Regards,
    Ian Xue

    ============================================

    If the Answer is helpful, please click "Accept Answer" and upvote it.
    Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.


1 additional answer

  1. Ian Xue (Shanghai Wicresoft Co., Ltd.) 34,271 Reputation points Microsoft Vendor
    2021-05-13T09:26:28.33+00:00

    Hi,

    You can try something like the script below, which handles a single file and prints the result.

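    $file = 'c:\Folder2\file.html'
    $output = @()
    $content = Get-Content -Path $file
    # Find the marker lines (assumes each marker occurs exactly once in the file)
    $start = $content | Where-Object {$_ -match '<!-- ARTICOL START -->'}
    $final = $content | Where-Object {$_ -match '<!-- ARTICOL FINAL -->'}
    # Look up the marker positions once instead of on every loop iteration
    $startIndex = $content.IndexOf($start)
    $finalIndex = $content.IndexOf($final)
    for ($i = 0; $i -lt $content.Count; $i++) {
        # Between the markers, skip every line that is not a <p class=...> line
        if (($i -gt $startIndex) -and ($i -lt $finalIndex)) {
            if ($content[$i] -notmatch '<p class=') {
                continue
            }
        }
        $output += $content[$i]
    }
    $output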
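
The loop above finds the markers by line content, which works when each marker occurs exactly once in the file. As an alternative, here is a rough single-pass sketch that tracks whether the current line is inside the block; it should also cope with several START/FINAL pairs. The function name Select-ArticleLines and its parameters are made up for this illustration, and the sample lines are a cut-down version of the page in the question.

```powershell
# Hypothetical helper: keeps every line outside the markers, and only the
# lines matching $KeepPattern between them. The marker lines are kept.
function Select-ArticleLines {
    param(
        [string[]]$Lines,
        [string]$StartMarker = '<!-- ARTICOL START -->',
        [string]$FinalMarker = '<!-- ARTICOL FINAL -->',
        [string]$KeepPattern = '<p class='
    )
    $inside = $false   # true while we are between a START and a FINAL marker
    foreach ($line in $Lines) {
        # Leave the block before emitting, so the FINAL marker line is kept
        if ($line.Contains($FinalMarker)) { $inside = $false }
        # Outside the block keep every line; inside keep only matching lines
        if (-not $inside -or $line -match $KeepPattern) { $line }
        # Enter the block after emitting, so the START marker line is kept
        if ($line.Contains($StartMarker)) { $inside = $true }
    }
}

# Usage on an in-memory sample, so no files are touched
$sample = @(
    '<body>',
    '<!-- ARTICOL START -->',
    '<div align="justify">',
    '<p class="text_obisnuit">test text text</p>',
    '</div>',
    '<!-- ARTICOL FINAL -->',
    '</body>'
)
$result = Select-ArticleLines -Lines $sample
$result
```

To run it against a real page, pass in the lines from Get-Content, for example Select-ArticleLines -Lines (Get-Content -Path $file).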
    

    Best Regards,
    Ian Xue
