When reading a UTF-8 file with CreateFileW and ReadFile, do I need to convert it to wchar_t*?

chw 266 Reputation points
2023-09-30T08:21:36.5866667+00:00

I'm trying to read a file that contains various paths. I'm unsure of the languages used in these paths. After reading about the Windows API, I thought using wchar_t* might be a good idea to ensure all characters are handled correctly.

Here's what I'm doing:

  1. Open the file using CreateFileW.
  2. Read the content into a char buffer[1024] using ReadFile.
  3. Scan the buffer for newlines and copy each line into a char lineBuffer[1024].
  4. Convert each line from char* to wchar_t* using MultiByteToWideChar.
  5. Print the wchar_t* buffer using wprintf(L"%ls\n", wideBuffer);.
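
One thing to watch in steps 2 and 3: a fixed 1024-byte read can end in the middle of a multi-byte UTF-8 sequence. Splitting on '\n' is safe, because the byte 0x0A never occurs inside a multi-byte sequence, but a chunk boundary can still cut a character in half. The sketch below is one possible way to find how many bytes of a chunk are safe to process. The helper name utf8_complete_prefix is hypothetical (it is not a Windows API), and the code assumes well-formed UTF-8 and does no validation.

```c
#include <stddef.h>

/* Hypothetical helper, not part of any Windows API.
 * Returns how many leading bytes of buf[0..len) can be processed now
 * without cutting a UTF-8 sequence in half. The remaining bytes (at most 3)
 * belong to a sequence that continues in the next chunk.
 * Assumption: the input is well-formed UTF-8; nothing is validated. */
size_t utf8_complete_prefix(const unsigned char *buf, size_t len)
{
    size_t i = len;
    size_t back = 0;

    /* Step back over at most 3 trailing continuation bytes (10xxxxxx). */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        --i;
        ++back;
    }
    if (i == 0)
        return len; /* no lead byte in view: leave the chunk as it is */

    unsigned char lead = buf[i - 1];
    size_t need;
    if (lead < 0x80)
        need = 1;                       /* ASCII */
    else if ((lead & 0xE0) == 0xC0)
        need = 2;                       /* 110xxxxx */
    else if ((lead & 0xF0) == 0xE0)
        need = 3;                       /* 1110xxxx */
    else if ((lead & 0xF8) == 0xF0)
        need = 4;                       /* 11110xxx */
    else
        return len; /* malformed input: not handled in this sketch */

    /* The last sequence is incomplete if fewer bytes are present than needed. */
    if (back + 1 < need)
        return i - 1;
    return len;
}
```

For the 14-byte example string used further down, cutting the data after 9 bytes would make the helper return 7: the last 2 bytes are the start of an unfinished 3-byte character and should be carried over and prepended to the next chunk before calling MultiByteToWideChar.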

I've looked at various code examples online and thought this should work. However, the output in the terminal only shows ??.

I've tried this on Windows Terminal using both cmd and pwsh, and the issue persists. I've also tried changing the codepage with chcp from 850 to 65001 without success.

Here's a minimal example that reproduces my issue:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h> // malloc, free
#include <string.h> // strlen

int main() {
    const char* utf8String = "Hello, 世界!";
    int utf8Length = strlen(utf8String); // this is 14

    // Calculate the size needed for the wide character buffer
    int wideCharSize = MultiByteToWideChar(CP_UTF8, 0, utf8String, utf8Length, NULL, 0);

    // Allocate the wide character buffer
    wchar_t* wideBuffer = (wchar_t*)malloc((wideCharSize + 1) * sizeof(wchar_t));
    if (wideBuffer == NULL) {
        printf("Memory allocation failed.\n");
        return 1;
    }

    // Perform the conversion
    MultiByteToWideChar(CP_UTF8, 0, utf8String, utf8Length, wideBuffer, wideCharSize); // return value is not 0

    // Null-terminate the wide character string
    wideBuffer[wideCharSize] = L'\0';

    // Use the wide character string (for demonstration, print it)
    wprintf(L"%ls\n", wideBuffer);

    // Free the allocated memory
    free(wideBuffer);

    return 0;
}

What am I missing? Is such a conversion even necessary?
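
As a sanity check on the conversion itself, the size returned by the first MultiByteToWideChar call in the example above can be compared against a hand count of UTF-16 code units. The sketch below is only an illustration for well-formed UTF-8 input; the helper name utf16_units_for_utf8 is hypothetical, and this is not how MultiByteToWideChar is implemented.

```c
#include <stddef.h>

/* Hypothetical helper for illustration only.
 * Counts the UTF-16 code units needed to represent len bytes of UTF-8.
 * Assumption: the input is well-formed UTF-8; nothing is validated. */
size_t utf16_units_for_utf8(const unsigned char *s, size_t len)
{
    size_t units = 0;

    for (size_t i = 0; i < len; ++i) {
        unsigned char b = s[i];

        if ((b & 0xC0) == 0x80)
            continue; /* continuation byte: already counted with its lead byte */

        /* A 4-byte sequence (lead byte 0xF0 or above) encodes a code point
         * outside the BMP and needs a surrogate pair; everything else is
         * a single UTF-16 code unit. */
        units += (b >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

For the 14 bytes of "Hello, 世界!" this gives 10 (7 ASCII characters, 2 CJK characters, and the exclamation mark), which is what wideCharSize should hold. If the numbers match, the conversion is probably fine and the problem is more likely in how the text is printed.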

Developer technologies | C++

Accepted answer
  1. Viorel 122.6K Reputation points
    2023-09-30T09:20:38.04+00:00

    To display the string without conversion, try this example:

    const char* utf8String = u8"Hello, 世界!";
    
    SetConsoleOutputCP( CP_UTF8 );
    printf( "%s\n", utf8String );
    

    To display the converted string:

    DWORD n;
    WriteConsoleW( GetStdHandle( STD_OUTPUT_HANDLE ), wideBuffer, lstrlenW( wideBuffer ), &n, NULL );
    WriteConsoleW( GetStdHandle( STD_OUTPUT_HANDLE ), L"\r\n", 2, &n, NULL );
    

    To view the value during debugging in Watch or Immediate windows, enter this expression: utf8String,s8.


1 additional answer

  1. David Lowndes 2,640 Reputation points MVP
    2023-09-30T09:49:56.0366667+00:00

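    If you want to output UTF-16, you can do this (_setmode is declared in <io.h> and _O_U16TEXT in <fcntl.h>; after this call, write to stdout only with wide functions such as wprintf):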

        _setmode( _fileno( stdout ), _O_U16TEXT );
    
        const char* utf8String = u8"Hello, 世界!";
    
    

    You also need to make sure that the console uses a font that contains the characters. NSimSun or SimSun-ExtB work for me with your example.
