java - XML document read in as Latin1 but half converted to UTF-8

August 3, 2019, 243 views

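I've hit a brick wall with a weird problem. I know there will be an obvious answer, but I can't see it for the life of me. It all has to do with encoding. Before the code, a simple description: I want to take in an XML document that is Latin1 (ISO-8859-1) encoded and then send it, completely unchanged, over an HttpURLConnection.

I have a small test class and raw XML that show my problem. The XML file contains the Latin1 byte 0xa2 (the cent sign), which on its own is not valid UTF-8; I'm using it as my test case on purpose. The XML declaration says ISO-8859-1. I can read the file in with no trouble, but when I convert the org.w3c.dom.Document to a byte[] array to send down the HttpURLConnection, the 0xa2 character is converted to the UTF-8 encoded cent sign (0xc2 0xa2), while the declaration stays as ISO-8859-1. In other words, one byte becomes two bytes and the output no longer matches its own declaration, which is totally wrong.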
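To make the byte-level symptom concrete, here is a minimal sketch (not part of the original test class; the class name `CentSignBytes` and the helper `hex` are made up for illustration) showing how the same cent sign is encoded in each charset:

```java
import java.nio.charset.StandardCharsets;

public class CentSignBytes {
    // Hypothetical helper: formats bytes as space-separated lowercase hex, e.g. "c2 a2".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String cent = "\u00a2"; // the cent sign, code point U+00A2

        // One byte in Latin1, two bytes in UTF-8.
        System.out.println(hex(cent.getBytes(StandardCharsets.ISO_8859_1))); // prints "a2"
        System.out.println(hex(cent.getBytes(StandardCharsets.UTF_8)));      // prints "c2 a2"
    }
}
```

So an output file containing 0xc2 0xa2 means the character itself survived parsing, and only the serialization step chose a different charset than the one in the declaration.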

The code that performs the read and the conversion:

FileInputStream input = new FileInputStream( "input-file" );
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware( true );
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse( input );

Source source = new DOMSource( document );
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Result result = new StreamResult( baos );
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform( source, result );
byte[] bytes = baos.toByteArray();

FileOutputStream fos = new FileOutputStream( "output-file" );
fos.write( bytes );

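I'm just writing it to a file for now, while I figure out what exactly is converting this character. The input file has 0xa2; the output file contains 0xc2 0xa2. One way to fix this is to put this line in the second-to-last block: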

transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");

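However, not all the XML documents I'll be dealing with will be Latin1; in fact, most of them will be UTF-8 when they come in. I'm assuming I shouldn't have to work out what the encoding is just so I can feed it to the transformer? Surely it should work that out for itself, and I'm just doing something else wrong?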

A thought occurred to me: I could query the document to find out the encoding, so the extra line could become:

transformer.setOutputProperty(OutputKeys.ENCODING, document.getInputEncoding());

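However, I've determined that this is not the answer, because document.getInputEncoding() returns a different String when I run it in a terminal on a Linux box than when I run it in Eclipse on my Mac.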

Any hints would be appreciated. I fully accept that I'm missing something obvious.

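Best answer

Yes, by default the Transformer writes XML documents as UTF-8, so you need to tell it explicitly to use a different encoding. Your last edit is the "trick" for doing this so that the output always matches the input XML's declared encoding: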

transformer.setOutputProperty(OutputKeys.ENCODING, document.getXmlEncoding());

The only question is: do you really need to preserve the input encoding?
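For reference, the getXmlEncoding() approach can be sketched end to end as below. This is a minimal illustration rather than a definitive implementation: the class name `PreserveXmlEncoding` and the helpers `roundTrip` and `hasByte` are made up for this example, the input is an in-memory byte array instead of a file, and the fallback to UTF-8 is an assumption added here because getXmlEncoding() returns null when the document has no encoding declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class PreserveXmlEncoding {
    // Hypothetical helper: parses XML bytes and serializes the DOM back
    // using the encoding named in the XML declaration.
    static byte[] roundTrip(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(new ByteArrayInputStream(xml));

        // getXmlEncoding() is null when there is no encoding declaration;
        // falling back to UTF-8 is an assumption made for this sketch.
        String encoding = document.getXmlEncoding();
        if (encoding == null) {
            encoding = "UTF-8";
        }

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(document), new StreamResult(baos));
        return baos.toByteArray();
    }

    // Hypothetical helper: true if the array contains the given unsigned byte value.
    static boolean hasByte(byte[] data, int value) {
        for (byte b : data) {
            if ((b & 0xff) == value) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Latin1 document containing the cent sign as the single byte 0xa2.
        byte[] input = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><price>5\u00a2</price>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] output = roundTrip(input);

        System.out.println(new String(output, StandardCharsets.ISO_8859_1));
        System.out.println("contains 0xc2: " + hasByte(output, 0xc2));
    }
}
```

Note that even with the encoding preserved, the serialized bytes are not guaranteed to be identical to the input: the Transformer may rewrite the XML declaration, quoting, or whitespace.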