java – 程序如何决定xml文件的编码？

2023年4月14日 254次阅读

我在处理(Unmarshall)xml文件时对xml编码有疑问.

我们在文件的开头指定xml文件的编码,如下所示.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

我的问题是在程序读取此行之后,它决定以下内容以UTF-8编码.但要阅读第一行,程序如何确定它是以UTF-8编码的？我的意思是在读取字节流时,程序如何知道它需要使用哪个编码用于第一行？

问候,
Mayuran

最佳答案它写在F.1节中. xml规范：

F.1 Detection Without External Encoding Information
Because each XML entity not accompanied by external encoding
information and not in UTF-8 or UTF-16 encoding must begin with an XML
encoding declaration, in which the first characters must be <?xml,
any conforming processor can detect, after two to four octets of
input, which of the following cases apply. In reading this list, it
may help to know that in UCS-4, < is #x0000003C and ? is
#x0000003F, and the Byte Order Mark required of UTF-16 data streams is #xFEFF. The notation ## is used to denote any byte value except
that two consecutive ##s cannot be both 00.

基本上,有两种选择：

>有一个字节顺序标记(BOM)
>没有BOM.

然后,specification清楚地记录了特定八位字节流的表,处理器应该使用这些表来确定用于查看编码声明的编码.