Some MSDN articles explain this in detail, such as "C++ Unicode Encoding Conversions with STL Strings and Win32 APIs".
C++: UTF-8 in Win32
I have been programming in C++ with the Win32 API.
Now I need to understand how UTF-8 works in Win32/C++.
These are the types I know as "UNICODE" types: wchar_t, LPCWSTR, std::wstring
However, which Unicode encoding do these types actually use? UTF-8, UTF-16, or even UTF-32?
PCWSTR text = L"テスト";
With the code above, I can properly output the string to the debug log using OutputDebugStringW() and write text in the window's client area using Direct2D's DrawTextW().
This doesn't work:
wchar_t singleUtf8Character = L'私';
As far as I knew, wchar_t was UTF-8. But it's not, as I realized a few moments ago. So the big question is: what is wchar_t, if it is not UTF-8?
wchar_t text[] = L"私";
int textSize = sizeof(text); // I thought it was 1, but it is 4!
I need types that are capable of storing UTF-8 characters, from the Win32 API or existing standard C++. This includes:
- UTF-8 version of char
- UTF-8 version of char array
- UTF-8 version of string (probably LPCWSTR?)
I want to be able to use these types with functions that take LPCWSTR parameters.
I know that LPCWSTR and PCWSTR are the same thing today.
Regards from a new C++ programmer, @thebluetropics
-
Castorix31 86,311 Reputation points
2022-09-28T09:23:18.2+00:00
-
RLWA32 45,941 Reputation points
2022-09-28T09:39:45.287+00:00
You may also find this helpful: https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page.
Also, the default Unicode encoding on Windows is UTF-16LE.
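Because the wide (W) Win32 functions expect UTF-16, UTF-8 text has to be converted before it can be passed as LPCWSTR. In Win32 code this is normally done with MultiByteToWideChar and CP_UTF8. The sketch below is only a portable stand-in that shows what such a conversion does. Utf8ToUtf16 is a hypothetical helper name, and its validation is deliberately minimal (overlong encodings are not rejected).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Win32 or the standard library):
// decodes UTF-8 bytes and re-encodes them as UTF-16 code units.
std::u16string Utf8ToUtf16(const std::string& utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;   // decoded code point
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        // The lead byte plus its continuation bytes must fit in the input.
        if (utf8.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            throw std::invalid_argument("not a valid Unicode scalar value");

        if (cp < 0x10000) {
            // Basic Multilingual Plane: a single UTF-16 code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Everything else: a surrogate pair (two code units).
            cp -= 0x10000;
            assert(cp <= 0xFFFFF);  // guaranteed by the range check above
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For example, Utf8ToUtf16("\xE7\xA7\x81") yields the single code unit U+79C1 ("私"). On Windows, where wchar_t is also 16 bits, the result can be copied into a std::wstring and passed via c_str() to functions that take LPCWSTR.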
-
Tong Xu - MSFT 2,471 Reputation points Microsoft Vendor
2022-09-29T03:51:06.93+00:00
Hi, @thebluetropics
Welcome to Microsoft Q&A!
The documentation on storing UTF-8 characters with Win32 is what the respondents above have already linked. However, there are some details that I need to correct and add.
First of all, if characters come out wrong when the program runs, you can change the project's character set.
The code page used when saving source files can be set in Advanced Save Options.
Just as hexadecimal literals have the prefix 0x, text files saved as UTF-8 or UTF-16 can start with a prefix that identifies the encoding, the byte order mark (BOM): EF BB BF for UTF-8, FF FE for UTF-16LE, and FE FF for UTF-16BE. The BOM belongs to the file; it is not stored with string literals in memory. UTF-8 uses 8-bit code units and needs one to four of them per character, while UTF-16 uses 16-bit code units and needs one or two. So for the array holding L"私", you thought sizeof was 1, but it is 4: on Windows a wchar_t is 2 bytes, "私" (U+79C1) takes one UTF-16 code unit, and the array also holds the terminating null, which gives 2 × 2 = 4 bytes.
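As a quick check of where the 4 bytes come from, here is a minimal sketch. It uses char16_t instead of wchar_t so that it behaves the same on platforms where wchar_t is 4 bytes, and it writes "私" as escapes so the result does not depend on how the source file is saved. The names kUtf16Text, kUtf8Text and CodeUnitCount are made up for this illustration.

```cpp
#include <cstddef>

// U+79C1 ("私") as UTF-16: one 16-bit code unit plus the terminating null.
constexpr char16_t kUtf16Text[] = u"\u79C1";

// The same character as UTF-8: three 8-bit code units (E7 A7 81) plus the null.
constexpr char kUtf8Text[] = "\xE7\xA7\x81";

// sizeof reports the bytes of the whole array, terminator included.
static_assert(sizeof(kUtf16Text) == 2 * sizeof(char16_t),
              "one code unit + terminating null");
static_assert(sizeof(kUtf8Text) == 4,
              "three bytes + terminating null");

// Number of code units before the terminating null of a string array.
template <typename CharT, std::size_t N>
constexpr std::size_t CodeUnitCount(const CharT (&)[N])
{
    return N - 1;
}
```

Both arrays happen to occupy 4 bytes where char16_t is 2 bytes wide, but for different reasons: the UTF-16 array is 2 code units of 2 bytes each, the UTF-8 array is 4 code units of 1 byte each. Neither contains a BOM.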
Now, let's talk about what wchar_t is.
wchar_t is a wide-character type. On Windows it is 16 bits wide and holds one UTF-16 code unit. Languages such as Chinese, Japanese, and Korean have far more characters than fit in a single 8-bit char, which is why wide or multibyte character types are used in programming.
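One consequence is that a 16-bit wide character is a code unit, not necessarily a whole character: characters outside the Basic Multilingual Plane take two UTF-16 code units (a surrogate pair). The sketch below illustrates this with char16_t, which has the same width as wchar_t on Windows. CountCodePoints is a hypothetical helper, assumes well-formed UTF-16, and needs C++14 or later.

```cpp
#include <cstddef>

// Hypothetical helper: counts Unicode code points in a UTF-16 string array
// by skipping the low (trailing) half of each surrogate pair.
template <std::size_t N>
constexpr std::size_t CountCodePoints(const char16_t (&text)[N])
{
    std::size_t count = 0;
    for (std::size_t i = 0; i + 1 < N; ++i) {  // stop before the terminating null
        const bool isLowSurrogate = text[i] >= 0xDC00 && text[i] <= 0xDFFF;
        if (!isLowSurrogate) {
            ++count;
        }
    }
    return count;
}

// U+79C1 ("私") is in the Basic Multilingual Plane: one code unit, one code point.
static_assert(CountCodePoints(u"\u79C1") == 1, "BMP character");

// U+1F600 is outside it: two code units, but still one code point.
static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) - 1 == 2, "surrogate pair");
static_assert(CountCodePoints(u"\U0001F600") == 1, "one code point");
```

This is why a single wchar_t can hold "私" on Windows but cannot hold every Unicode character.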
I hope my answer helps.
Tong