When reading a UTF-8 file with CreateFileW and ReadFile, do I need to convert it to wchar_t*?

Question

chw 271

I'm trying to read a file that contains various paths. I'm unsure of the languages used in these paths. After reading about the Windows API, I thought using wchar_t* might be a good idea to ensure all characters are handled correctly.

Here's what I'm doing:

1. Open the file using CreateFileW.
2. Read the content into a char buffer[1024] with ReadFile.
3. Parse the buffer until I find a newline, then copy the content into a char lineBuffer[1024].
4. Convert it from char* to wchar_t* using MultiByteToWideChar.
5. Print the wchar_t* buffer using wprintf(L"%ls\n", ...).
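
The buffer-reading and line-splitting steps above can be sketched roughly as follows. This is a minimal sketch, not the original code: LineAccumulator, take_line, and LINE_CAPACITY are hypothetical names, and the chunks are assumed to come from successive ReadFile calls. Splitting on '\n' at the byte level is safe for UTF-8 because the byte 0x0A never occurs inside a multi-byte sequence, so each completed line can be handed to MultiByteToWideChar as a whole, even when a line spans two reads.

```c
#include <stddef.h>

#define LINE_CAPACITY 1024  /* hypothetical limit, matching lineBuffer[1024] */

/* Hypothetical accumulator holding the bytes of one (possibly partial) line. */
typedef struct {
    char   bytes[LINE_CAPACITY];
    size_t length;
} LineAccumulator;

/* Appends bytes from chunk to acc until a '\n' is seen.
   Returns 1 if acc now holds a complete, null-terminated line,
   or 0 if the chunk ran out first (the partial line stays in acc).
   *consumed receives the number of chunk bytes that were used. */
int take_line(LineAccumulator *acc, const char *chunk, size_t chunkLength,
              size_t *consumed)
{
    size_t i = 0;
    while (i < chunkLength) {
        char c = chunk[i++];
        if (c == '\n') {
            /* Tolerate CRLF line endings by dropping the trailing '\r'. */
            if (acc->length > 0 && acc->bytes[acc->length - 1] == '\r')
                acc->length--;
            acc->bytes[acc->length] = '\0';
            *consumed = i;
            return 1;
        }
        /* Bytes beyond the capacity are dropped in this sketch; real code
           should grow the buffer instead, since dropping bytes can cut a
           multi-byte UTF-8 sequence in half. */
        if (acc->length < LINE_CAPACITY - 1)
            acc->bytes[acc->length++] = c;
    }
    *consumed = i;
    return 0;
}
```

A caller would convert acc.bytes after each return value of 1, reset acc.length to 0, advance the chunk pointer by consumed, and call take_line again until it returns 0. Whatever is left in the accumulator after the last ReadFile is a final line without a trailing newline.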

I've looked at various code examples online and thought this should work. However, the output in the terminal only shows ??.

I've tried this on Windows Terminal using both cmd and pwsh, and the issue persists. I've also tried changing the codepage with chcp from 850 to 65001 without success.

Here's a minimal example that reproduces my issue:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h> // malloc, free
#include <string.h> // strlen

int main() {
    const char* utf8String = "Hello, 世界!";
    int utf8Length = strlen(utf8String); // this is 14

    // Calculate the size needed for the wide character buffer
    int wideCharSize = MultiByteToWideChar(CP_UTF8, 0, utf8String, utf8Length, NULL, 0);

    // Allocate the wide character buffer
    wchar_t* wideBuffer = (wchar_t*)malloc((wideCharSize + 1) * sizeof(wchar_t));
    if (wideBuffer == NULL) {
        printf("Memory allocation failed.\n");
        return 1;
    }

    // Perform the conversion
    MultiByteToWideChar(CP_UTF8, 0, utf8String, utf8Length, wideBuffer, wideCharSize); // return value is not 0

    // Null-terminate the wide character string
    wideBuffer[wideCharSize] = L'\0';

    // Use the wide character string (for demonstration, print it)
    wprintf(L"%ls\n", wideBuffer);

    // Free the allocated memory
    free(wideBuffer);

    return 0;
}

What am I missing? Is such a conversion even necessary?

Accepted answer

Answer 1

Viorel 122.6K

To display the string without conversion, try this example:

const char* utf8String = u8"Hello, 世界!";

SetConsoleOutputCP( CP_UTF8 );
printf( "%s\n", utf8String );

To display the converted string:

DWORD n;
WriteConsoleW( GetStdHandle( STD_OUTPUT_HANDLE ), wideBuffer, lstrlenW( wideBuffer ), &n, NULL );
WriteConsoleW( GetStdHandle( STD_OUTPUT_HANDLE ), L"\r\n", 2, &n, NULL );

To inspect the value while debugging, enter this expression in the Visual Studio Watch or Immediate window: utf8String,s8.

chw 271 Reputation points

2023-09-30T09:29:17.47+00:00

Your first example only works when I remove the u8 prefix from the string literal.

The second example works. So I guess the conversion succeeds, but wprintf does not behave as I expected?
Should I just use WriteConsoleW instead of wprintf when printing?

chw 271

Now the first one works even without the SetConsoleOutputCP call. I am not sure what I changed.

#include <windows.h>
#include <stdio.h>
#include <stdlib.h> // malloc, free
#include <string.h> // strlen

int main() {
    const char* utf8String = "Hello, 世界!";
    int utf8Length = strlen(utf8String);
    printf("%s\n", utf8String);

    // Calculate the size needed for the wide character buffer
    int wideCharSize = MultiByteToWideChar(CP_UTF8, 0, utf8String, utf8Length, NULL, 0);

    // Allocate the wide character buffer
    wchar_t* wideBuffer = (wchar_t*)malloc((wideCharSize + 1) * sizeof(wchar_t));
    if (wideBuffer == NULL) {
        printf("Memory allocation failed.\n");
        return 1;
    }

    // Perform the conversion
    int succeed = MultiByteToWideChar(CP_UTF8, 0, utf8String, utf8Length, wideBuffer, wideCharSize);
    if (succeed == 0) {
        printf("Conversion failed.\n");
    }

    // Null-terminate the wide character string
    wideBuffer[wideCharSize] = L'\0';

    // Use the wide character string (for demonstration, print it)
    //wprintf(L"%ls\n", wideBuffer);

    DWORD n;
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), wideBuffer, lstrlenW(wideBuffer), &n, NULL);
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), L"\r\n", 2, &n, NULL);

    // Free the allocated memory
    free(wideBuffer);

    return 0;
}

This now works as expected. But the question is still relevant: should I ditch wprintf and just use WriteConsoleW instead?

Viorel 122.6K Reputation points

2023-09-30T09:52:55.17+00:00

According to https://social.msdn.microsoft.com/Forums/vstudio/en-US/b941b3c4-7c00-4f77-9d2f-acb9e9085449, try calling _setmode(_fileno(stdout), _O_U16TEXT) before wprintf. Note, however, that this seems to affect std::cout and printf too.

As a workaround, you can write your own Wprintf function using WriteConsoleW and possibly StringCchVPrintfW.

Answer 2

David Lowndes 2,640 MVP

If you want to output UTF-16, you can do this (_setmode is declared in <io.h> and _O_U16TEXT in <fcntl.h>):

    _setmode( _fileno( stdout ), _O_U16TEXT );

    const char* utf8String = u8"Hello, 世界!";

You also need to make sure that the console uses a font that contains the characters. NSimSun or SimSun-ExtB work for me with your example.
