How to fix errors when displaying long UTF-8 character sequences in files using "type" from the Windows command line?
I noticed some display errors while trying to view the content of some UTF-8 text files using the "type" command from inside the Windows command line.
I included 3 files to make the problem easier to replicate. Just remove the ".txt" extension from the attachment and extract the files to produce the output seen in the screenshot - if this type of attachment is against the policy: sorry.
There are 3 files inside: 2 example text files (example1.txt = file without error, example2.txt = file with error) and a batch file that changes the code page and displays the files using the "type" command.
In case there is a problem with the attachment, here are the steps to reproduce the issue manually:
- create a text file (e.g. "example.txt") in UTF-8 format without a BOM that contains at least 171 '█' characters (U+2588, Full Block). Each of these characters takes 3 bytes, so the file must be at least 513 bytes long (171*3 = 513) - the error does not appear with 170 characters (170*3 = 510).
- open the command line "cmd.exe" in the directory where your text file is located
- change the code page to UTF-8 (65001) with the command "mode con cp select=65001"
- display the file with the command "type example.txt"
The error should look like the one in the screenshot, in this case the two square boxes before the EOF marker.
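For anyone who prefers to generate the test files instead of using the attachment, here is a minimal Python sketch. The function names and the file names "example_ok.txt" / "example_bad.txt" are my own choices for illustration and are not part of the attachment; the script only writes the files, the "type" test still has to be done by hand in cmd.exe.

```python
from pathlib import Path

FULL_BLOCK = "\u2588"  # U+2588 FULL BLOCK, 3 bytes in UTF-8


def block_bytes(count):
    """Return `count` full block characters encoded as UTF-8 without a BOM."""
    # The "utf-8" codec never writes a BOM (unlike "utf-8-sig").
    return (FULL_BLOCK * count).encode("utf-8")


def write_example(path, count):
    """Write the test file and return its size in bytes."""
    data = block_bytes(count)
    Path(path).write_bytes(data)
    return len(data)


if __name__ == "__main__":
    # 170 characters = 510 bytes (no error seen), 171 characters = 513 bytes (error seen)
    print(write_example("example_ok.txt", 170))
    print(write_example("example_bad.txt", 171))
```

After running it, "type example_ok.txt" and "type example_bad.txt" under code page 65001 should correspond to the two cases described above.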
From some further tests it seems that the problem occurs whenever a multibyte character (3 bytes in this case) crosses the 512-byte boundary of an internal buffer while the character sequence is being decoded. I also reproduced it using a sequence of 2-byte characters preceded by a single 1-byte character - the error then occurs every 512 bytes.
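To check which characters of a given text would be affected under this hypothesis, a small Python sketch like the following can list every character whose UTF-8 bytes span a multiple of 512. The function name `straddling_chars` and the `chunk` parameter are made up for illustration, and the 512-byte buffer is only my assumption from the tests above, not documented behaviour of "type".

```python
def straddling_chars(text, chunk=512):
    """Return (char_index, byte_offset) for every character whose UTF-8
    encoding crosses a multiple of `chunk` bytes.

    Assumption: the text is read in fixed 512-byte chunks.
    """
    hits = []
    offset = 0  # byte offset of the current character
    for index, ch in enumerate(text):
        size = len(ch.encode("utf-8"))
        first_chunk = offset // chunk
        last_chunk = (offset + size - 1) // chunk
        if first_chunk != last_chunk:
            hits.append((index, offset))
        offset += size
    return hits


if __name__ == "__main__":
    print(straddling_chars("\u2588" * 170))        # 510 bytes
    print(straddling_chars("\u2588" * 171))        # 513 bytes
    print(straddling_chars("a" + "\u00e9" * 600))  # 1 byte + 2-byte characters
```

For the 171 character file it reports the last character (starting at byte 510), and for the 2-byte test it reports one character per 512 bytes, which matches what I see on screen.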
A bit of background to how I encountered this problem:
I wrote a program to rasterize an image and display it as a "text" version in the console, using primarily block element characters (U+2580 - U+259F). The console output from the program itself was as intended, but the content of a created file, when displayed, showed inexplicable errors.
The workaround for my problem will probably be to pad the line ends with single-byte space characters, so that no multibyte character ends up crossing a 512-byte boundary.
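A rough Python sketch of that padding idea, again assuming the 512-byte buffer hypothesis is correct. The names `straddles` and `pad_line_ends` and the limit of 3 padding spaces are my own choices; I have not verified the result against "type" itself, and a single line longer than 512 bytes may not be fixable by line-end padding alone.

```python
def straddles(text, start, chunk=512):
    """True if any character of `text`, placed at byte offset `start`,
    has UTF-8 bytes on both sides of a multiple of `chunk`."""
    offset = start
    for ch in text:
        size = len(ch.encode("utf-8"))
        if offset // chunk != (offset + size - 1) // chunk:
            return True
        offset += size
    return False


def pad_line_ends(lines, chunk=512, newline="\r\n", max_pad=3):
    """Join `lines`, appending spaces to the end of the preceding line
    so that the following line does not straddle a chunk boundary.

    Assumption: `newline` is pure ASCII. If no padding up to `max_pad`
    helps, the line is written unpadded.
    """
    out = []
    offset = 0  # bytes emitted so far
    for i, line in enumerate(lines):
        if i > 0:
            pad = 0
            for candidate in range(max_pad + 1):
                if not straddles(line, offset + candidate + len(newline), chunk):
                    pad = candidate
                    break
            out.append(" " * pad + newline)
            offset += pad + len(newline)
        out.append(line)
        offset += len(line.encode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    rows = ["\u2588" * 100] * 4  # four rows of 300 bytes each
    padded = pad_line_ends(rows)
    print(straddles(padded, 0))
```

The padding only shifts the following line by whole bytes, so it changes the visible output by a few trailing spaces per affected line.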