C++: UTF-8 in Win32

thebluetropics 1,046 Reputation points
2022-09-28T09:13:21.307+00:00

I have been programming in C++ with the Win32 API.

Now I need to understand how UTF-8 works in Win32/C++.

These are the types I know as "UNICODE" types: wchar_t, LPCWSTR, std::wstring

However, which Unicode encoding do these types actually use? UTF-8, UTF-16, or even UTF-32?

   PCWSTR text = L"テスト";  

With the code above, I can properly output the string to the debug log using OutputDebugStringW() and draw the text in the window's client area using Direct2D's DrawTextW().

This doesn't work:

   wchar_t singleUtf8Character = L'私';  

I assumed wchar_t was UTF-8, but I realized a few moments ago that it is not. So the big question is: what is wchar_t, if it is not UTF-8?

   wchar_t text[] = L"私";  
   int textSize = sizeof(text); // I thought it was 1, but it is 4!  

I need a type that is capable of storing UTF-8 characters, either from the Win32 API or from standard C++. This includes:

  • UTF-8 version of char
  • UTF-8 version of char array
  • UTF-8 version of string (probably LPCWSTR?)

With these types, I can use them for functions with LPCWSTR parameters.

I know that LPCWSTR and PCWSTR are the same thing today.

Regards from a new C++ programmer, @thebluetropics


2 additional answers

  1. RLWA32 41,046 Reputation points
    2022-09-28T09:39:45.287+00:00

    You may also find this helpful -- https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page.

    Also, the default Unicode encoding on Windows is UTF-16LE.
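Since the wide (W) APIs expect UTF-16, UTF-8 text is usually kept in a plain char buffer or std::string and converted at the API boundary. On Windows that conversion is normally done with MultiByteToWideChar and CP_UTF8. The portable sketch below only illustrates what such a conversion does: utf8_to_utf16 is a hypothetical helper, char16_t stands in for the 16-bit Windows wchar_t, and the input is assumed to be well-formed UTF-8 (no validation is performed).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a Win32 or standard function): decodes
// well-formed UTF-8 into UTF-16 code units. Malformed input is not
// detected and produces unspecified code units.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;      // decoded Unicode code point
        std::size_t len;  // number of bytes in this UTF-8 sequence
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        // Each continuation byte contributes 6 payload bits.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above U+FFFF need a surrogate pair in UTF-16.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Small self-check: the three UTF-8 bytes of U+79C1 become one code unit.
inline void check_utf8_to_utf16() {
    assert(utf8_to_utf16("\xE7\xA7\x81") == std::u16string(u"\u79C1"));
}
```

In a real Win32 program the result would be a std::wstring whose c_str() can be passed wherever an LPCWSTR is expected.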


  2. Tong Xu - MSFT 2,116 Reputation points Microsoft Vendor
    2022-09-29T03:51:06.93+00:00

    Hi, @thebluetropics
    Welcome to Microsoft Q&A!
    The documentation on storing UTF-8 text with Win32 is what the previous respondent linked. However, there are some details that I need to correct and add.
    First of all, if characters come out wrong when the program runs, you can change the character set.
    [Screenshot: character set property]

    The code page used when saving source files can be set in the advanced save options.
    [Screenshot: advanced save options]

    Much as hexadecimal numbers carry the prefix 0x, UTF-8 and UTF-16 files can begin with a byte order mark (BOM) that identifies the encoding: the UTF-8 BOM is EF BB BF, the UTF-16LE BOM is FF FE, and the UTF-16BE BOM is FE FF. A BOM appears only at the start of a file or stream; it is not stored with a string literal in memory. UTF-8 encodes each character as one to four 1-byte code units, while UTF-16 uses one or two 2-byte code units. So for the character "私", sizeof(text) is 4 rather than 1 because the array holds two wchar_t elements, the character itself and the terminating null, and each wchar_t is 2 bytes on Windows.
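The size of 4 in the question can be reproduced with a small compile-time check. This is a minimal sketch, not Windows-specific code: char16_t is used in place of wchar_t on the assumption that wchar_t is 2 bytes on Windows, so the numbers come out the same on any platform. The constant names are made up for illustration, and "私" is written with escapes (U+79C1, UTF-8 bytes E7 A7 81) so the result does not depend on the source file's encoding.

```cpp
#include <cassert>
#include <cstddef>

// char16_t stands in for the 16-bit Windows wchar_t (assumption stated above).
constexpr char16_t kWide[] = u"\u79C1";      // "私" as UTF-16: one code unit + null
constexpr char kUtf8[] = "\xE7\xA7\x81";      // "私" as UTF-8: three bytes + null

// sizeof reports bytes and includes the terminating null element.
constexpr std::size_t kWideBytes = sizeof(kWide);
// Dividing by the element size and dropping the null gives the code unit count.
constexpr std::size_t kWideUnits = sizeof(kWide) / sizeof(kWide[0]) - 1;
constexpr std::size_t kUtf8Bytes = sizeof(kUtf8);

static_assert(kWideBytes == 4, "2 elements * 2 bytes each");
static_assert(kWideUnits == 1, "one UTF-16 code unit for U+79C1");
static_assert(kUtf8Bytes == 4, "3 UTF-8 bytes + 1 null byte");
```

Both arrays happen to occupy 4 bytes, but for different reasons: the UTF-16 array is two 2-byte elements, while the UTF-8 array is four 1-byte elements.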

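    Now, let's talk about what wchar_t is.
    wchar_t is a wide character type. On Windows it is 16 bits wide and holds one UTF-16 code unit, so LPCWSTR and std::wstring carry UTF-16 text, not UTF-8. Chinese, Japanese, Korean, and many other scripts have far more characters than fit in a single byte, which is why wide and multibyte character types are used in programming.
    I hope my answer helps.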
    Tong