How to remove non-printable characters from an XML file using PowerShell

Syed Manzar Abbas Zaidi 20 Reputation points
2024-08-24T20:17:08.49+00:00

Hi all,

I have an XML file that contains non-printable characters, which cause a job failure when the file is loaded into a SQL table using SSIS.

Non-printable characters: STX, SOH, GS, FF, ZWSP, DC3, SUB, ENQ, US, DC4, NAK, ACK, BEL, VT, ESC, CAN

Ref: https://en.wikipedia.org/wiki/ASCII#Control_characters

Any help?

Thanks

PS: I can open the file in Notepad++ and remove them with a regex, but I want a PowerShell script to automate the process.


Accepted answer
  1. Rich Matheisen 46,811 Reputation points
    2024-08-31T21:18:11.3633333+00:00

    -creplace '\P{IsBasicLatin}' doesn't accomplish what you say you want. :-)

    That will remove all characters in the string whose value is greater than 127 (decimal). It will also leave non-printable characters that exist in the ASCII character set in your string.

    There's no need to resort to using Unicode groups, though. What you want is something as simple as this:

    -creplace "[^\x20-\x7E]", ""
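To make the difference between the two patterns concrete, here is a minimal sketch run against a made-up sample string. The sample and the variable names are illustrative assumptions, not data from the question. Note that the range pattern also strips tabs and any non-ASCII letters (such as é), so it is only suitable when the data is expected to be plain printable ASCII.

```powershell
# Sample: 'A', STX (0x02), 'B', ZWSP (U+200B), 'C', 'é' (U+00E9)
$sample = "A$([char]0x02)B$([char]0x200B)C$([char]0xE9)"

# Removes only characters outside U+0000-U+007F, so the STX control character survives
$blockResult = $sample -creplace '\P{IsBasicLatin}', ''

# Removes everything outside the printable ASCII range (space through tilde)
$rangeResult = $sample -creplace '[^\x20-\x7E]', ''

"Block pattern length: $($blockResult.Length)"   # 4 characters: A, STX, B, C
"Range pattern result: $rangeResult"             # ABC
```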


3 additional answers

  1. Rich Matheisen 46,811 Reputation points
    2024-08-24T21:43:44.9133333+00:00

    If you're seeing ZWSP characters in your data then you're not dealing with ASCII. That's a Unicode character with a hex value of 0x200B.

    Your file is likely written with UTF-8 encoding.
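One way to confirm which unexpected characters are actually present is to list their code points. The following is a rough sketch on an in-memory sample string; for a real file you would presumably load the text first (for example with `Get-Content -Raw -Encoding UTF8`), and the variable names here are hypothetical.

```powershell
# Sample text containing a zero-width space (U+200B) and an STX control character (0x02)
$text = "ok$([char]0x200B)$([char]0x02)"

# Find every character outside printable ASCII and format its code point as U+XXXX
$found = [regex]::Matches($text, '[^\x20-\x7E]') |
    ForEach-Object { 'U+{0:X4}' -f [int]$_.Value[0] } |
    Sort-Object -Unique

$found
```

Any value above U+007F in the output indicates the data is not pure ASCII.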

    If you have a regex that works you should be able to use something as simple as this:

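    # Read the file as an array of lines; -Encoding UTF8 matches the likely encoding noted above
    $x = Get-Content -Path c:\junk\xxx.txt -Encoding UTF8
    # Placeholder pattern: replaces every 'a' or 'b' with 'X'
    $y = $x -replace '[ab]', 'X'
    $y | Set-Content -Path c:\junk\xxxNew.txt -Encoding UTF8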
    

    Replace the '[ab]','X' with your regex.
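As an illustration of what such a regex might look like for the characters listed in the question, here is a sketch that removes the C0 control characters (except tab, line feed, and carriage return), DEL, and the zero-width space, while leaving all other text intact. The function name and the exact character class are assumptions made for this example, not something given in the thread.

```powershell
# Hypothetical helper: name and character class are illustrative assumptions
function Remove-ControlCharacter {
    param(
        [Parameter(Mandatory)]
        [AllowEmptyString()]
        [string]$Text
    )
    # 0x00-0x08, VT (0x0B), FF (0x0C), 0x0E-0x1F, DEL (0x7F), and ZWSP (U+200B);
    # tab (0x09), LF (0x0A), and CR (0x0D) are deliberately kept
    $pattern = '[\x00-\x08\x0B\x0C\x0E-\x1F\x7F\u200B]'
    return $Text -creplace $pattern, ''
}

# Sample with STX, ZWSP, and a tab that should be preserved
$dirty = "<a>x$([char]0x02)y$([char]0x200B)`tz</a>"
$clean = Remove-ControlCharacter -Text $dirty
$clean

# Applied to a file, reusing the placeholder paths from the snippet above:
# Get-Content c:\junk\xxx.txt -Encoding UTF8 |
#     ForEach-Object { Remove-ControlCharacter -Text $_ } |
#     Set-Content c:\junk\xxxNew.txt -Encoding UTF8
```

Unlike a printable-ASCII-only range, this approach keeps accented letters and other legitimate Unicode text.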


  2. Syed Manzar Abbas Zaidi 20 Reputation points
    2024-08-31T04:43:38.5066667+00:00

    I was able to resolve the issue using:

    -creplace '\P{IsBasicLatin}'
    

    Ref: https://stackoverflow.com/questions/68326094/removing-unicode-character-using-powershell


  3. Rich Matheisen 46,811 Reputation points
    2024-08-31T19:56:45.1733333+00:00

    -creplace '\P{IsBasicLatin}' doesn't accomplish what you say you want. :-)

    That will remove all characters in the string whose value is greater than 127 (decimal). It will also leave non-printable characters that exist in the ASCII character set in your string.

    There's no need to resort to using Unicode groups, though. What you want is something as simple as this:

    -creplace "[^\x20-\x7E]", ""

    EDIT: Corrected regex to include space character. Also, this answer will be seen twice. While trying to edit it earlier I kept receiving a 404 Page Not Found error.

