
translating unusual characters back to normal characters

Anonymous
2014-09-23T04:43:44+00:00

I receive a file over which I have no control and I need to process the data in it with Excel. The file comes to me as a comma delimited file (.csv).

Apparently the system that produces the file uses UTF-8 encoding, but by the time it gets to me some characters have been converted into sequences of odd characters.

Examples of individual characters:  â   €   ™   “   ¢   œ

I can use the CODE function in Excel to get the underlying number for the individual characters (226, 128, 153, 147, 162 and 156 for the characters above).

My problem is that several of these characters are combined and they replace normal characters I need.  For example,   â€™   evidently represents the apostrophe character;   â€“   appears to represent the hyphen character. Other combined characters include:  â€¢    â€œ and †, but I don't know what normal characters they represent.

If I know that   â€“   should be a hyphen I can use Excel's Find and Replace to fix the data in my spreadsheets.  But I don't always know what the correct normal character is.

Is there a function or other Excel tool that will tell me the normal character that    â€œ    and  â€¢  correspond to?


Locked Question. This question was migrated from the Microsoft Support Community.

Answer accepted by question author
  1. Anonymous
    2014-09-25T22:29:25+00:00

The thing is that, by definition, the multi-byte groups are not ASCII.  When a byte (as you read the file in sequence, one byte at a time, from start to finish) has a value of less than decimal 128, it IS an ASCII character.  But if a byte is anything other than an ASCII character, it is either a byte in the middle of a multi-byte sequence or the first byte of a multi-byte sequence.

    To even attempt a direct conversion you'd almost have to know the language (code page) in use on the computer that created the file.  It may be using Turkish while on your machine you're trying to translate into Italian, so the same characters wouldn't even appear properly - but at least they should appear improperly in a consistent manner.

    I think you're just going to have to sit down and spend a lot of time 'decoding' what you're getting and create your own table.  Either that, or get with whoever owns the system building the files, tell them that they are NOT sending out pure ASCII comma-separated files, and ask for their assistance in deciphering what you are seeing at your end.
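(Editorial note: the byte-value rules described above can be sketched as follows - Python is used here purely for illustration; the thresholds come from the UTF-8 encoding scheme:)

```python
def utf8_byte_kind(b: int) -> str:
    """Classify a single byte the way the answer describes."""
    if b < 0x80:        # 0-127: plain ASCII
        return "ascii"
    if b < 0xC0:        # 128-191: continuation byte inside a sequence
        return "continuation"
    if b < 0xE0:        # 192-223: first byte of a 2-byte sequence
        return "lead-2"
    if b < 0xF0:        # 224-239: first byte of a 3-byte sequence
        return "lead-3"
    return "lead-4"     # 240 and up: first byte of a 4-byte sequence

print([utf8_byte_kind(b) for b in "é".encode("utf-8")])
```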

If you check out the write-up on this page http://www.joelonsoftware.com/articles/Unicode.html you'll find that the author is of the opinion that lots of people (developers) THINK they know about character sets, but are actually about as clueless as I generally am about them.  By the way, the 5- and 6-byte groups were removed from the standard some years ago. Did you try running a test file through my code and looking at the output to see if it even looked reasonably close? Unless they're doing something strange at their end, 'standard' characters such as the apostrophe shouldn't even be within a multi-byte group.  An apostrophe ' has an ASCII decimal value of 39, while the grave accent ` has an ASCII decimal value of 96.

Here's the printable part of the ASCII character set.  Values below decimal 32 are non-printable "control" codes - for example 7 (bell), 10 (line feed) and 13 (carriage return) - and 127 (delete) is also non-printable.  The printable range runs from 32 (a true space character) through 126 (the tilde):

        32-47:    (space) ! " # $ % & ' ( ) * + , - . /
        48-63:   0 1 2 3 4 5 6 7 8 9 : ; < = > ?
        64-79:   @ A B C D E F G H I J K L M N O
        80-95:   P Q R S T U V W X Y Z [ \ ] ^ _
        96-111:  ` a b c d e f g h i j k l m n o
        112-126: p q r s t u v w x y z { | } ~
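(Editorial note: the printable range is easy to regenerate and check with a couple of lines of Python:)

```python
# Printable ASCII runs from 32 (space) through 126 (tilde): 95 characters.
printable = "".join(chr(code) for code in range(32, 127))
print(len(printable))  # 95
print(printable)
```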

    20+ people found this answer helpful.

3 additional answers

  1. Anonymous
    2014-09-26T20:06:38+00:00

    Hi again and thanks for the additional reply,

    I think the best solution is to go upstream in the process and try to get the people at the source data end to send only ASCII comma delimited files, as you suggest.  Problem is that those same guys who tend to be clueless about character sets are also often clueless about quality and customer service!

    My fallback will be to create my own translation table manually, as you also suggest.

    I did run your code and it seemed to work as expected.  It removed the offending characters just fine.  I think I can adapt it to use my manually created translation table and, instead of eliminating the offending characters, replace them with the correct ones.

    Many thanks for your help.

    3 people found this answer helpful.
  2. Anonymous
    2014-09-23T07:15:48+00:00

UTF-8 uses multi-byte sequences for many characters; to represent some characters in some languages I believe it can take up to 6 bytes to do so.

But in UTF-8, the first set of ASCII characters can be represented as a single byte, and they are.  See this page: http://www.fileformat.info/info/unicode/utf8.htm Note that this page contradicts my statement about up-to-6 bytes; for the moment I'll accept it as factual that only 4-byte groups are supported (the current standard, RFC 3629, does restrict UTF-8 to 4 bytes), although I can cite at least one reference to the 6-byte potential in the original design.
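(Editorial note: the variable width is easy to demonstrate - a Python sketch showing the encoded length of characters from different ranges; per RFC 3629 the maximum today is 4 bytes:)

```python
# How many bytes UTF-8 needs for characters from different ranges:
# ASCII letter, accented Latin letter, currency symbol, musical symbol.
for ch in ("A", "é", "€", "\U0001D11E"):
    print(repr(ch), len(ch.encode("utf-8")), "byte(s)")
```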

So the problem becomes one of Excel, when reading a CSV file, not knowing that it's in UTF-8: it assumes ASCII and treats each byte as a separate character instead of determining how many bytes make up each character.

My only suggested solution would be a routine that pre-processes the CSV file and simply drops out all the multi-byte sequences, leaving the ASCII characters behind.  The key is identifying the multi-byte groups.  At the bottom of it all this means first opening the file with the pre-processor and reading it byte by byte, writing the ASCII characters to another file while identifying and skipping over the multi-byte groups.  Then you import that new file as your real CSV file.

    Here is what is probably a crude attempt to do that.  Drop the code into a regular code module and give it a trial run.  It is non-destructive, so it shouldn't corrupt your source file - and then you can try bringing in the converted file to see if it actually worked as intended.

[EDIT] New code provided for a more robust conversion that takes into account the 'first byte of' values; it should prevent misinterpreting a mid-group value as something to keep.  That shouldn't have been an issue with the original version, but better safe than sorry.

    Sub UTF8ToASCII()
      'Takes a file presumed to be in UTF-8 format and
      'reduces it to the ASCII-only characters within it.
      'It works through the source file byte by byte,
      'ignoring all bytes with a decimal value greater than 127;
      'bytes with a value from 0 to 127, inclusive,
      'are written to a new file.
      '
      'It asks the user to identify the file to convert;
      'the output file will have a similar name and be
      'placed in the same folder, but the filename will
      'have '_UTF8-2-ASCII' added to it.
      'Example: source file name CrypticMessage.csv
      'output file: CrypticMessage_UTF8-2-ASCII.csv
      Dim sourceFile As String
      Dim destFile As String
      Dim inBuff As Integer
      Dim outBuff As Integer
      Dim rawData As Byte  'raw byte, so no ANSI/Unicode translation occurs

      sourceFile = Application.GetOpenFilename( _
          "Text Data Files (*.csv; *.txt; *.dat),*.csv;*.txt;*.dat", , _
          "Select Source File")
      If sourceFile = "False" Then
        Exit Sub ' user cancelled operation
      End If
      destFile = Left(sourceFile, InStrRev(sourceFile, ".") - 1) & _
          "_UTF8-2-ASCII" & Mid(sourceFile, InStrRev(sourceFile, "."))
      'delete any previous output file so stale bytes can't remain
      '(Binary mode does not truncate an existing file)
      If Dir(destFile) <> "" Then
        Kill destFile
      End If
      'set up to read the file one byte at a time
      inBuff = FreeFile()
      Open sourceFile For Binary As #inBuff
      'set up the output file
      outBuff = FreeFile()
      Open destFile For Binary As #outBuff
      Do While Not EOF(inBuff)
        Get #inBuff, , rawData
        Select Case rawData
          Case Is < 128
            'single-byte, ASCII-compatible character: save it
            Put #outBuff, , rawData
          Case Is < 194
            'continuation byte (or invalid lead byte): ignore it
          Case Is < 224
            'first byte of a 2-byte sequence:
            'throw it away and throw away the next byte also
            Get #inBuff, , rawData
          Case Is < 240
            'first byte of a 3-byte sequence:
            'throw it and the next 2 bytes away
            Get #inBuff, , rawData
            Get #inBuff, , rawData
          Case Else
            'value is 240-255 decimal:
            'first byte of a 4-byte sequence;
            'throw it and the next 3 bytes away
            Get #inBuff, , rawData
            Get #inBuff, , rawData
            Get #inBuff, , rawData
        End Select
      Loop
      Close #inBuff
      Close #outBuff
      MsgBox "File has been processed."
    End Sub
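(Editorial note: for readers who don't use VBA, the same filtering reduces to almost a one-liner, because every byte of every multi-byte UTF-8 sequence has a value of 128 or more - so keeping only bytes below 128 is equivalent to skipping the sequences. A sketch; the file names are placeholders:)

```python
def strip_non_ascii(data: bytes) -> bytes:
    """Keep only ASCII bytes (values 0-127).  Every byte of a
    multi-byte UTF-8 sequence is >= 128, so this drops them all."""
    return bytes(b for b in data if b < 0x80)

# Usage against a file, mirroring the VBA routine:
# with open("CrypticMessage.csv", "rb") as f:
#     cleaned = strip_non_ascii(f.read())
# with open("CrypticMessage_UTF8-2-ASCII.csv", "wb") as f:
#     f.write(cleaned)
```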

    3 people found this answer helpful.
  3. Anonymous
    2014-09-25T20:12:44+00:00

    Thanks JLLatham for the reply.  While simply eliminating the offending characters as your code does might be useful in some situations, I really need those characters to be translated into normal ASCII characters.  I need a translator solution rather than an elimination solution.

Is there a table somewhere that can translate the multi-byte groups into normal ASCII characters (example:   â€™  =  ' (apostrophe))?  Then I could either use Find & Replace in the Excel user interface or modify the code you generously provided to do the substitutions.
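(Editorial note: lacking an official table, a small hand-built substitution map covers the punctuation that usually causes this. A sketch; the entries below are the common Windows-1252 "smart punctuation" cases and are assumptions about this particular file:)

```python
# Hand-built translation table: mojibake sequence -> plain ASCII.
MOJIBAKE_TO_ASCII = {
    "â€™": "'",   # right single quote -> apostrophe
    "â€˜": "'",   # left single quote  -> apostrophe
    "â€œ": '"',   # left double quote  -> straight quote
    "â€“": "-",   # en dash            -> hyphen
    "â€”": "-",   # em dash            -> hyphen
    "â€¢": "*",   # bullet             -> asterisk
}

def to_ascii(text: str) -> str:
    """Apply every substitution in the table."""
    for bad, good in MOJIBAKE_TO_ASCII.items():
        text = text.replace(bad, good)
    return text

print(to_ascii("donâ€™t â€“ see"))  # don't - see
```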

    1 person found this answer helpful.