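RFC 4627 describes a method for identifying the Unicode encoding of a JSON text when no BOM is present. It relies on the first two characters of a JSON text always being ASCII characters. RFC 7159, however, defines a JSON text as "ws value ws", which implies that a lone string value is also valid. The first character would then be the opening quotation mark, but any Unicode character permitted in a string may follow it.

Given that RFC 7159 also discourages the use of a BOM and no longer describes the procedure for detecting the encoding from the first four octets (bytes), how should the encoding be detected? UTF-32 should still work as described in RFC 4627, because the first character occupies four bytes and must still be ASCII. But what about UTF-16? The second (2-byte) character may not contain a zero byte to help identify the correct encoding.

Best answer

After looking at an implementation I wrote a few years ago, I can say that the Unicode scheme can be detected unambiguously from a single character, given the following assumptions: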
>The input must be Unicode
>The first character must be ASCII
>There must be no BOM
Consider the following.
Assume the first character is "[" (0x5B), an ASCII character.
Then we may get these byte patterns:
UTF_32LE: 5B 00 00 00
UTF_32BE: 00 00 00 5B
UTF_16LE: 5B 00 xx xx
UTF_16BE: 00 5B xx xx
UTF_8: 5B xx xx xx
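where "xx" is either EOF or any other byte.
We should also note that, according to RFC 7159, the shortest valid JSON text can be a single character. That is, it may be 1, 2, or 4 bytes long, depending on the Unicode scheme.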
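To make the table above concrete, the following is a minimal sketch that encodes a single ASCII character in each of the five schemes without a BOM. The names `Scheme` and `encode_ascii` are hypothetical and exist only for this illustration; they are not part of the answer's code.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical names, introduced only for this illustration.
enum class Scheme { UTF_8, UTF_16LE, UTF_16BE, UTF_32LE, UTF_32BE };

// Encode one ASCII character (code point below 0x80) in the given
// scheme, without a BOM. The caller is assumed to pass ASCII only.
std::vector<std::uint8_t> encode_ascii(char ch, Scheme s)
{
    const std::uint8_t b = static_cast<std::uint8_t>(ch);
    switch (s) {
    case Scheme::UTF_8:    return {b};           // 1 byte
    case Scheme::UTF_16LE: return {b, 0};        // 2 bytes, low byte first
    case Scheme::UTF_16BE: return {0, b};        // 2 bytes, high byte first
    case Scheme::UTF_32LE: return {b, 0, 0, 0};  // 4 bytes, low byte first
    case Scheme::UTF_32BE: return {0, 0, 0, b};  // 4 bytes, high byte first
    }
    return {};
}
```

Calling `encode_ascii('[', s)` for each scheme gives sequences of 1, 2, or 4 bytes whose zero bytes sit in exactly the positions listed in the table.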
So, if it helps, here is an implementation in C++:
#include <cstdint>  // std::uint8_t, std::uint32_t
#include <cstdio>   // EOF

// The json::unicode::UNICODE_ENCODING_* constants are assumed to be
// declared elsewhere; their definitions are not shown here.

namespace json {

    //
    //  Detect Encoding
    //
    //  Tries to determine the Unicode encoding of the input starting at
    //  first. A BOM must not be present (you might check with the function
    //  json::unicode::detect_bom() whether there is a BOM; if there is,
    //  there is no need to call this function).
    //
    //  Return values:
    //
    //      json::unicode::UNICODE_ENCODING_UTF_8
    //      json::unicode::UNICODE_ENCODING_UTF_16LE
    //      json::unicode::UNICODE_ENCODING_UTF_16BE
    //      json::unicode::UNICODE_ENCODING_UTF_32LE
    //      json::unicode::UNICODE_ENCODING_UTF_32BE
    //
    //      -1:     unexpected EOF
    //      -2:     unknown encoding
    //
    //  Note:
    //  detect_encoding() has to read ahead a few bytes in order to determine
    //  the encoding. With InputIterators, this has the consequence that the
    //  iterators cannot be reused afterwards, for example by a parser.
    //  Usually, this requires resetting the stream buffer, that is, using
    //  the member functions pubseekpos() or pubseekoff() to reset the get
    //  pointer of the stream buffer to its initial position.
    //  However, certain stream buffer implementations may not be able to
    //  set the stream position at arbitrary positions. In this case, this
    //  method cannot be used and other educated guesses to determine the
    //  encoding may be needed.

    template <typename Iterator>
    inline int
    detect_encoding(Iterator first, Iterator last)
    {
        // Assuming the input is Unicode!
        // Assuming the first character is ASCII!

        // The first character must be an ASCII character, say a "[" (0x5B):
        //   UTF_32LE:    5B 00 00 00
        //   UTF_32BE:    00 00 00 5B
        //   UTF_16LE:    5B 00 xx xx
        //   UTF_16BE:    00 5B xx xx
        //   UTF_8:       5B xx xx xx

        // c records the class of every byte read so far, one octet per
        // byte: 00 = zero byte, 01 = ASCII byte, 02 = non-ASCII byte.
        std::uint32_t c = 0xFFFFFF00;

        while (first != last) {
            std::uint32_t ascii;
            if (static_cast<std::uint8_t>(*first) == 0)
                ascii = 0;      // zero byte
            else if (static_cast<std::uint8_t>(*first) < 0x80)
                ascii = 0x01;   // ascii byte
            else if (*first == EOF)
                break;
            else
                ascii = 0x02;   // non-ascii byte, that is a lead or trail byte
            c = c << 8 | ascii;
            switch (c) {
                // reading first byte
                case 0xFFFF0000:  // first byte was 0
                case 0xFFFF0001:  // first byte was ASCII
                    ++first;
                    continue;
                case 0xFFFF0002:
                    return -2;    // this is bogus

                // reading second byte
                case 0xFF000000:  // 00 00
                    ++first;
                    continue;
                case 0xFF000001:  // 00 01
                    return json::unicode::UNICODE_ENCODING_UTF_16BE;
                case 0xFF000100:  // 01 00
                    ++first;
                    continue;
                case 0xFF000101:  // 01 01
                case 0xFF000102:  // 01 02  ASCII followed by a UTF-8 lead byte
                    return json::unicode::UNICODE_ENCODING_UTF_8;

                // reading third byte:
                case 0x00000000:  // 00 00 00
                case 0x00010000:  // 01 00 00
                    ++first;
                    continue;
                //case 0x00000001:  // 00 00 01  bogus
                //case 0x00000100:  // 00 01 00  na
                //case 0x00000101:  // 00 01 01  na
                case 0x00010001:  // 01 00 01
                case 0x00010002:  // 01 00 02  second code unit has a non-ASCII low byte
                    return json::unicode::UNICODE_ENCODING_UTF_16LE;

                // reading fourth byte
                case 0x01000000:  // 01 00 00 00
                    return json::unicode::UNICODE_ENCODING_UTF_32LE;
                case 0x01000001:  // 01 00 00 01  second code unit is xx00 with xx != 00
                case 0x01000002:  // 01 00 00 02
                    return json::unicode::UNICODE_ENCODING_UTF_16LE;
                case 0x00000001:  // 00 00 00 01 (the bogus three-byte pattern
                                  // 00 00 01 yields the same state)
                    return json::unicode::UNICODE_ENCODING_UTF_32BE;

                default:
                    return -2;  // could not determine encoding, that is,
                                // assuming the first byte is an ASCII.
            } // switch
        } // while

        // The input ended before a decision was made. A text consisting of
        // a single ASCII character is still unambiguous:
        if (c == 0xFFFF0001)    // 01
            return json::unicode::UNICODE_ENCODING_UTF_8;
        if (c == 0xFF000100)    // 01 00
            return json::unicode::UNICODE_ENCODING_UTF_16LE;

        // premature EOF
        return -1;
    }
}