Store data with accented characters

Mohamed B 101 Reputation points
2021-04-05T17:21:36.25+00:00

Hi, I am facing an issue storing Unicode characters with accents in Cosmos DB. After loading data from a CSV file with the Azure Cosmos DB data migration tool, é, for example, is converted to �, so Héctor becomes H�ctor!

At first I thought it was an Azure portal rendering issue, but I was wrong: I get the same result in Azure Storage Explorer and in a Xamarin app. Any idea how to fix this? There seems to be no concept of collation in Azure Cosmos DB.
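For reference, a minimal sketch (not specific to Cosmos DB) of how this substitution typically arises: � is U+FFFD, the character a UTF-8 decoder emits when it hits bytes that are not valid UTF-8, which is exactly what happens if the file stored é as the single ANSI byte 0xE9 but was read back as UTF-8:

```python
# Hypothetical reproduction: "Héctor" saved as ANSI/Windows-1252 (é = 0xE9),
# then read back by a tool that assumes the bytes are UTF-8.
ansi_bytes = "Héctor".encode("cp1252")               # b'H\xe9ctor'
print(ansi_bytes.decode("utf-8", errors="replace"))  # -> H�ctor
```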

Best regards

Azure Cosmos DB
An Azure NoSQL database service for app development.

Accepted answer
  1. Navtej Singh Saini 4,216 Reputation points Microsoft Employee
    2021-04-13T00:27:07.043+00:00

    @Mohamed B

    Our team has suggested a few things for you to test:

    "There is no “team responsible for the data migration tool”. It is provided as a free, open-source example with community contributions and no built-in support. Customers with a Premier contract can lean on their Premier support contacts for paid assistance with custom code (including this tool).

    It may not have come through quite how complicated this is at the byte level if they are relying on screenshots. LibreOffice has its own encoding habits, expectations, and bugs. A screenshot does not tell us what bytes are landing on disk to then be interpreted by the migration tool; instead, you can dump the raw bytes with this small PowerShell command: Get-Content -Encoding Byte -Path "path:\to\file.csv"

    To put some concrete examples to this, I can produce four files with different bytes on disk that all open in Notepad appearing to show “Héctor”:

    1. The simplest case (new file, paste the string, default save encoding): Notepad writes no header, encodes the é (e with acute accent) as the bytes 0xC3 0xA9, and encodes the other five letters as their single ANSI/ASCII bytes, so the full file is 0x48C3A963746F72.
    2. The same string, saved with the ANSI encoding setting: still no header, but é uses its ANSI code point 233 (0xE9), for a full file of 0x48E963746F72.
    3. The same string, saved as UTF-8 with BOM: header 0xEFBBBF, é encoded as 0xC3 0xA9, and the other five letters as single bytes, resulting in 0xEFBBBF48C3A963746F72.
    4. The same string pasted into PowerShell and written using Out-File: header 0xFFFE (the UTF-16 little-endian BOM), all six letters padded to 2 bytes, little endian, for a full file of 0xFFFE4800E900630074006F007200.

    I will reiterate that all four files look identical when opened in a GUI like Notepad. The only step forward here is to use a tool like the PowerShell command I mentioned above to extract the raw bytes, which may be able to pin down where the tools are disagreeing."
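    The four byte layouts described above can be reproduced with Python's standard codecs (a sketch; the codec names are Python's, with "cp1252" standing in for the Windows "ANSI" code page):

    ```python
    # Reproducing the four on-disk encodings of "Héctor" discussed above.
    s = "Héctor"

    utf8_no_bom = s.encode("utf-8")       # é -> 0xC3 0xA9, no header
    ansi        = s.encode("cp1252")      # é -> 0xE9 ("ANSI" on Western Windows)
    utf8_bom    = s.encode("utf-8-sig")   # 0xEF 0xBB 0xBF header, then UTF-8
    utf16_bom   = b"\xff\xfe" + s.encode("utf-16-le")  # BOM + 2 bytes per char

    print(utf8_no_bom.hex())  # 48c3a963746f72
    print(ansi.hex())         # 48e963746f72
    print(utf8_bom.hex())     # efbbbf48c3a963746f72
    print(utf16_bom.hex())    # fffe4800e900630074006f007200
    ```

    All four byte strings decode back to the same six visible characters when read with the matching codec, which is why the files look identical in Notepad.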

    Please check this and get back to us with your questions.

    Regards
    Navtej S


1 additional answer

  1. Navtej Singh Saini 4,216 Reputation points Microsoft Employee
    2021-04-08T02:13:43.74+00:00

    @Mohamed B

    We raised this issue with our internal team. Here is their response:

    Cosmos DB will emit whatever bytes were sent to it. What those bytes mean depends on the client application(s); if two client applications have different encoding expectations, you will see issues like this.

    Storing Unicode in flat files has a long and sordid history of problems and incompatible implementation tricks, and that is where I suspect your issue lies. The migration tool you are referring to uses a .NET System.IO.StreamReader to do most of the heavy lifting of reading file bytes and interpreting them into .NET characters and strings; it relies on System.Text.Encoding to guess at the intended encoding of the file, which may or may not match the encoding used by whatever tool created your test file. To make things even more confusing, some text editors understand several of these encodings and will preserve whatever encoding an existing file already uses while editing it, but will use a different (preferred) encoding when creating a new file from scratch.
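    As an illustration of that guessing step (a Python sketch, not the actual System.Text.Encoding logic): sniffing a byte-order mark can only identify files that carry a header, so a headerless ANSI file falls through to whatever default the reader assumes:

    ```python
    def sniff_encoding(data: bytes) -> str:
        """Guess an encoding from a leading byte-order mark, if any."""
        if data.startswith(b"\xef\xbb\xbf"):
            return "utf-8-sig"   # UTF-8 with BOM
        if data.startswith(b"\xff\xfe") or data.startswith(b"\xfe\xff"):
            return "utf-16"      # the BOM tells the codec the byte order
        return "utf-8"           # no header: fall back to a default guess

    utf16_file = b"\xff\xfe\x48\x00\xe9\x00"  # "Hé" as UTF-16 LE with BOM
    ansi_file  = b"\x48\xe9"                  # "Hé" as ANSI, no header

    print(utf16_file.decode(sniff_encoding(utf16_file)))                  # Hé
    print(ansi_file.decode(sniff_encoding(ansi_file), errors="replace"))  # H�
    ```

    The second line shows the failure mode from the question: with no header to go on, the reader's fallback guess (UTF-8 here) disagrees with the bytes actually on disk, and the accented character becomes U+FFFD.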

    Please let us know if you need any further info.

    Regards
    Navtej S