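I have a question about XML encoding that came up while processing (unmarshalling) an XML file.
We specify the encoding of an XML file at the very beginning of the file, as shown below.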
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
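My question is this: once the program has read this line, it knows that the content that follows is encoded in UTF-8. But to read that first line, how does the program determine that it is encoded in UTF-8? In other words, when reading the byte stream, how does the program know which encoding to use for the first line?
Regards,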
Mayuran
Best answer: This is covered in section F.1 of the XML specification:
F.1 Detection Without External Encoding Information
Because each XML entity not accompanied by external encoding information and not in UTF-8 or UTF-16 encoding must begin with an XML encoding declaration, in which the first characters must be '<?xml', any conforming processor can detect, after two to four octets of input, which of the following cases apply. In reading this list, it may help to know that in UCS-4, '<' is "#x0000003C" and '?' is "#x0000003F", and the Byte Order Mark required of UTF-16 data streams is "#xFEFF". The notation ## is used to denote any byte value except that two consecutive ##s cannot be both 00.
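Basically, there are two cases:
- There is a byte order mark (BOM).
- There is no BOM.
The specification then clearly documents tables of specific octet sequences, which the processor should use to determine the encoding for reading the encoding declaration.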
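
To make the two-step idea concrete, here is a minimal Python sketch of that kind of detection. It is not how any particular parser implements it. The function names (guess_family, detect_xml_encoding), the 400-byte lookahead, the regular expression, and the choice of cp037 as a stand-in for "EBCDIC" are all assumptions made for illustration, and the unusual UCS-4 byte orders listed in the specification are not handled.

```python
import re

# Step 1 tables: leading octets -> provisional encoding, following the
# patterns described in Appendix F.1. The UCS-4 entries come first so that
# FF FE 00 00 is not mistaken for a UTF-16LE BOM.
_BOM_TABLE = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xef\xbb\xbf", "utf-8"),
]
_NO_BOM_TABLE = [
    (b"\x00\x00\x00\x3c", "utf-32-be"),  # '<' in UCS-4, big-endian
    (b"\x3c\x00\x00\x00", "utf-32-le"),  # '<' in UCS-4, little-endian
    (b"\x00\x3c\x00\x3f", "utf-16-be"),  # '<?' in UTF-16BE
    (b"\x3c\x00\x3f\x00", "utf-16-le"),  # '<?' in UTF-16LE
    (b"\x3c\x3f\x78\x6d", "ascii"),      # '<?xm' in any ASCII-compatible encoding
    (b"\x4c\x6f\xa7\x94", "cp037"),      # '<?xm' in EBCDIC (cp037 is an assumption)
]

# Matches the encoding name inside the XML declaration once it is decoded.
_DECL = re.compile(
    r"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def guess_family(head: bytes):
    """Return (provisional_encoding, bom_length) from the first four octets."""
    for prefix, enc in _BOM_TABLE:
        if head.startswith(prefix):
            return enc, len(prefix)
    for prefix, enc in _NO_BOM_TABLE:
        if head.startswith(prefix):
            return enc, 0
    # No BOM and no recognizable '<?xml': the default is UTF-8.
    return "utf-8", 0


def detect_xml_encoding(data: bytes) -> str:
    """Guess the encoding family, then read the declared encoding if present."""
    provisional, skip = guess_family(data[:4])
    # Step 2: the declaration uses only ASCII characters, so the provisional
    # encoding is enough to decode it even if later bytes would not decode.
    text = data[skip:skip + 400].decode(provisional, errors="replace")
    match = _DECL.match(text)
    if match:
        return match.group(1)
    # Without a declaration, an ASCII-compatible document is taken as UTF-8.
    return "utf-8" if provisional == "ascii" else provisional


if __name__ == "__main__":
    sample = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root/>'
    print(detect_xml_encoding(sample))
```

The point of the sketch is that the first line never has to be read in its "real" encoding: the first few octets narrow the encoding down to a family, and that is enough to decode the ASCII-only declaration and find the exact encoding name.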