准确的JSON文本编码检测

2019年7月22日 206次阅读

在RFC4627中描述了一种用于在不存在BOM时识别Unicode编码的方法.这依赖于
JSON文本中的前2个字符始终为ASCII字符.但在RFC7159中,规范将JSON文本定义为“ws value ws”;暗示单个字符串值也有效.因此第一个字符将是开头引号,但随后字符串中允许的任何Unicode字符都可以跟随.考虑到RFC7159也不鼓励使用BOM;并且不再描述从前4个八位字节(字节)检测编码的过程,应该如何检测它？ UTF-32应该仍然可以正常工作,如RFC4627中所述,因为第一个字符是四个字节,仍然应该是ASCII,但是UTF-16呢？第二个(2字节)字符可能不包含零字节,以帮助识别正确的编码. 最佳答案在看了几年前我做过的实现之后,我可以告诉我们可以从一个字符中明确地检测给定的Unicode Scheme,给出以下假设：

>输入必须是Unicode
>第一个字符必须是ASCII
>必须没有BOM

考虑一下：

假设第一个字符是“[”(0x5B) – 一个ASCII.
然后,我们可能得到这些字节模式：

UTF_32LE:    5B 00 00 00  
UTF_32BE:    00 00 00 5B
UTF_16LE:    5B 00 xx xx
UTF_16BE:    00 5B xx xx
UTF_8:       5B xx xx xx

其中“xx”是EOF或任何其他字节.

我们还应该注意,根据RFC7159,最短的有效JSON可以只是一个字符.也就是说,它可能是1,2或4字节 – 取决于Unicode方案.

所以,如果它有帮助,这里是C中的一个实现：

namespace json {

    //
    //  Detect Encoding
    //
    // Tries to determine the Unicode encoding of the input starting at 
    // first. A BOM shall not be present (you might check with function 
    // json::unicode::detect_bom() whether there is a BOM, in which case 
    // you don't need to call this function when a BOM is present).
    //
    // Return values:
    // 
    //   json::unicode::UNICODE_ENCODING_UTF_8
    //   json::unicode::UNICODE_ENCODING_UTF_16LE
    //   json::unicode::UNICODE_ENCODING_UTF_16BE
    //   json::unicode::UNICODE_ENCODING_UTF_32LE
    //   json::unicode::UNICODE_ENCODING_UTF_32BE
    //
    //  -1:     unexpected EOF
    //  -2:     unknown encoding
    //
    // Note:
    // detect_encoding() requires to read ahead a few bytes in order to deter-
    // mine the encoding. In case of InputIterators, this has the consequences
    // that these iterators cannot be reused, for example for a parser.
    // Usually, this requires to reset the istreambuff, that is using the 
    // member functions pubseekpos() or pupseekoff() in order to reset the get 
    // pointer of the stream buffer to its initial position.
    // However, certain istreambuf implementations may not be able to set the    
    // stream pos at arbitrary positions. In this case, this method cannot be
    // used and other edjucated guesses to determine the encoding may be
    // needed.

    template <typename Iterator>    
    inline int 
    detect_encoding(Iterator first, Iterator last) 
    {
        // Assuming the input is Unicode!
        // Assuming first character is ASCII!

        // The first character must be an ASCII character, say a "[" (0x5B)

        // UTF_32LE:    5B 00 00 00
        // UTF_32BE:    00 00 00 5B
        // UTF_16LE:    5B 00 xx xx
        // UTF_16BE:    00 5B xx xx
        // UTF_8:       5B xx xx xx

        uint32_t c = 0xFFFFFF00;

        while (first != last) {
            uint32_t ascii;
            if (static_cast<uint8_t>(*first) == 0)
                ascii = 0; // zero byte
            else if (static_cast<uint8_t>(*first) < 0x80)
                ascii = 0x01;  // ascii byte
            else if (*first == EOF)
                break;
            else
                ascii = 0x02; // non-ascii byte, that is a lead or trail byte
            c = c << 8 | ascii;
            switch (c) {
                    // reading first byte
                case 0xFFFF0000:  // first byte was 0
                case 0xFFFF0001:  // first byte was ASCII
                    ++first;
                    continue;
                case 0xFFFF0002:
                    return -2;  // this is bogus

                    // reading second byte
                case 0xFF000000:    // 00 00 
                    ++first;
                    continue;
                case 0xFF000001:    // 00 01
                    return json::unicode::UNICODE_ENCODING_UTF_16BE;
                case 0xFF000100:    // 01 00
                    ++first;
                    continue;
                case 0xFF000101:    // 01 01
                    return json::unicode::UNICODE_ENCODING_UTF_8;

                    // reading third byte:    
                case 0x00000000:  // 00 00 00
                case 0x00010000:  // 01 00 00  
                    ++first;
                    continue;                    
                    //case 0x00000001:  // 00 00 01  bogus
                    //case 0x00000100:  // 00 01 00  na
                    //case 0x00000101:  // 00 01 01  na
                case 0x00010001:  // 01 00 01 
                    return json::unicode::UNICODE_ENCODING_UTF_16LE;

                    // reading fourth byte    
                case 0x01000000:
                    return json::unicode::UNICODE_ENCODING_UTF_32LE;
                case 0x00000001:
                    return json::unicode::UNICODE_ENCODING_UTF_32BE;

                default:
                    return -2;  // could not determine encoding, that is,
                                // assuming the first byte is an ASCII.
            } // switch
        }  // while 

        // premature EOF
        return -1;
    }
}