c++ – Why can a Windows console set to a Chinese code page display UTF-16 encoded characters?


MSDN

“For the Microsoft C/C++ compiler, the source and execution character sets are both ASCII.”

C++03

2.1 Phases of translation

“..Any source file character not in the basic source character set
(2.2) is replaced by the universal-character-name that designates that
character. (An implementation may use any internal encoding, so long
as an actual extended character encountered in the source file, and
the same extended character expressed in the source file as a
universal-character-name (i.e. using the \uXXXX notation), are handled
equivalently.)”

2.13.2 Character literals

“A universal-character-name is translated to the encoding, in the
execution character set, of the character named. If there is no such
encoding, the universal-character-name is translated to an
implementation-defined encoding.”

To test which execution character set MSVC uses, I wrote the following code:

const wchar_t *str = L"中";
const unsigned char *p = reinterpret_cast<const unsigned char*>(str);
// Dump every byte of the literal, including the terminating null character.
for (size_t i = 0; i < sizeof(L"中"); ++i)
{
   printf ("%x ", *(p + i));
}

The output is 2d 4e 0 0, and 0x4e2d is the UTF-16 encoding of this Chinese character. So my conclusion is that UTF-16 is used as MSVC's execution character set (my version: 2012 4.5.50709).

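After that, I tried to print this character to the Windows console. Since the default locale is "C", I set the locale to code page 936, which represents Simplified Chinese characters.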

// use the execution environment locale setting, which is 936
const wchar_t *str = L"中";
char* locale = setlocale(LC_ALL, "");
wprintf (L"%ls\n", str);

This prints the character 中 correctly.

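What puzzles me is this: how can a character encoded in UTF-16 be decoded by a Windows console whose decoder is set to something other than UTF-16 (MS code page 936)? How does this happen?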

Best answer: I think I figured it out.

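In Microsoft C++ 2008 (and possibly 2005), CRT facilities such as wprintf and wcout are implemented so that they convert wide strings, such as L"中" encoded in UTF-16, to match the current locale/code page setting. So what happens here is that L"中" is converted to the bytes D6 D0, its encoding in code page 936 for Simplified Chinese.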

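I was wrong in thinking that setlocale sets the console code page. It only sets the current program code page that CRT functions use during that conversion. To change the console code page, use the chcp command or the Win API function SetConsoleOutputCP().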

Since my console's default code page is 936, the character is displayed correctly without any problem.
