java – ZipEntry.STORED for files that are already compressed?

I am zipping up a bunch of files with ZipOutputStream. The files are a mix of formats that are already compressed, plus many large, highly compressible formats such as plain text.

Most of the already-compressed formats are large files, and since they never get any smaller (and in rare cases get slightly larger), it makes no sense to spend CPU and memory re-compressing them.

When I detect a pre-compressed file I try to use .setMethod(ZipEntry.STORED), but then it complains that I need to supply the size, compressedSize and crc for those files.

I can get it to work with the approach below, but it requires reading the file twice: once to calculate the CRC32, and again to actually copy the file into the ZipOutputStream.

// code that determines the value of method omitted for brevity
if (STORED == method)
{
    fze.setMethod(STORED);
    // STORED entries must declare their size and CRC before putNextEntry()
    fze.setSize(fe.attributes.size());
    fze.setCompressedSize(fe.attributes.size());
    // first pass: read the whole file just to compute the CRC32
    final HashingInputStream his = new HashingInputStream(Hashing.crc32(), fis);
    ByteStreams.copy(his, ByteStreams.nullOutputStream());
    fze.setCrc(his.hash().padToLong());
}
else
{
    fze.setMethod(DEFLATED);
}
zos.putNextEntry(fze);
// second pass: re-open the file and copy its contents into the archive
try (final InputStream in = new FileInputStream(fe.path.toFile()))
{
    ByteStreams.copy(in, zos);
}
zos.closeEntry();

Is there a way to provide this information without having to read the input stream twice?

Best answer

Short answer:

In the time I had to solve this problem, I could not find a way to read each file only once and still calculate the CRC with the standard library.

I did find an optimization that reduced the time by about 50% on average.

I pre-calculate the CRCs of the files to be STORED concurrently, using an ExecutorCompletionService limited to Runtime.getRuntime().availableProcessors(), and wait for them all to finish. How effective this is depends on the number of files that need a CRC calculation: the more files, the bigger the benefit.
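As a rough sketch of that pre-calculation step (the method name precomputeCrcs and the storedPaths list are made up for illustration, and Guava's Files.asByteSource stands in for however the files are actually read):

import com.google.common.hash.Hashing;
import com.google.common.io.Files;

import java.nio.file.Path;
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// storedPaths is a hypothetical list of the pre-compressed files that will be STORED
static Map<Path, Long> precomputeCrcs(final List<Path> storedPaths) throws Exception
{
    final ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    final ExecutorCompletionService<Map.Entry<Path, Long>> ecs =
            new ExecutorCompletionService<>(pool);
    try
    {
        for (final Path p : storedPaths)
        {
            // each task reads one file and returns its CRC32
            ecs.submit(() -> new AbstractMap.SimpleImmutableEntry<>(
                    p, Files.asByteSource(p.toFile()).hash(Hashing.crc32()).padToLong()));
        }
        final Map<Path, Long> crcByPath = new HashMap<>();
        for (int i = 0; i < storedPaths.size(); i++)
        {
            // take() hands back results as they complete, so slow files don't block fast ones
            final Map.Entry<Path, Long> done = ecs.take().get();
            crcByPath.put(done.getKey(), done.getValue());
        }
        return crcByPath;
    }
    finally
    {
        pool.shutdown();
    }
}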

Then in .postVisitDirectories() I wrap a ZipOutputStream around the PipedOutputStream half of a PipedInputStream/PipedOutputStream pair running on a temporary thread. That converts the ZipOutputStream into an InputStream I can pass into an HttpRequest to upload the result to a remote server, while all the pre-calculated ZipEntry/Path objects are written out sequentially.
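A minimal sketch of that piping arrangement, assuming a list of pre-calculated ZipEntry/Path pairs and a placeholder uploadToServer method in place of the real HttpRequest upload:

import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

static void zipAndUpload(final List<Map.Entry<ZipEntry, Path>> entries) throws Exception
{
    final PipedInputStream pis = new PipedInputStream(64 * 1024);
    final PipedOutputStream pos = new PipedOutputStream(pis);

    // a temporary thread produces the zip; the caller consumes it as an InputStream
    final Thread writer = new Thread(() ->
    {
        try (final ZipOutputStream zos = new ZipOutputStream(pos))
        {
            for (final Map.Entry<ZipEntry, Path> e : entries)
            {
                zos.putNextEntry(e.getKey());
                Files.copy(e.getValue(), zos);
                zos.closeEntry();
            }
        }
        catch (final IOException ex)
        {
            throw new UncheckedIOException(ex);
        }
    });
    writer.start();

    uploadToServer(pis); // streams the zip to the remote server while it is being produced
    writer.join();
}

// placeholder for the real HttpRequest-based upload
static void uploadToServer(final InputStream zipStream) throws IOException
{
    // e.g. hand zipStream to the HTTP client as the request body
}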

This was good enough for the immediate need of processing about 300 GB, but when I get to the 10 TB job I will look at addressing it and try to find more gains without adding too much complexity.

If I come up with something substantially better time-wise, I will update this answer with the new implementation.

Long answer:

I ended up writing a clean-room ZipOutputStream that supports multi-part zip files and intelligent switching between compression and STORE, and that can calculate the CRC as it reads, then write out the metadata at the end of the stream.
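The core single-pass trick, stripped of all the actual zip bookkeeping, looks roughly like this; copyAndCrc is an illustrative helper name, not part of the real implementation:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.CRC32;

// Copies the entry data exactly once, updating a CRC32 on the fly.
// The returned CRC (together with the byte count) can then be written
// as trailing metadata after the entry data instead of before it.
static long copyAndCrc(final InputStream in, final OutputStream out) throws IOException
{
    final CRC32 crc = new CRC32();
    final byte[] buf = new byte[8192];
    int n;
    while ((n = in.read(buf)) != -1)
    {
        crc.update(buf, 0, n);
        out.write(buf, 0, n);
    }
    return crc.getValue();
}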

Why swapping ZipOutputStream.setLevel() does not work:

The ZipOutputStream.setLevel(NO_COMPRESSION/DEFAULT_COMPRESSION)
hack is not a viable approach. I did extensive tests on hundreds of
gigabytes of data, across thousands of folders and files, and the
measurements were conclusive. Compressing the already-compressed files
at NO_COMPRESSION gains nothing over calculating the CRC for them as
STORED entries. It is actually slower by a large margin!

In my tests the files are on a network-mounted drive, so reading the
already-compressed files twice over the network (once to calculate the
CRC and again to add them to the ZipOutputStream) was as fast as, or
faster than, processing all the files once as DEFLATED and changing
the .setLevel() on the ZipOutputStream.

There is no local filesystem caching going on with the network access.
This is a worst-case scenario; processing files on the local disk will
be much, much faster because of local filesystem caching.

So this hack is a naive approach based on false assumptions: the data
is still pushed through the compression algorithm even at NO_COMPRESSION
level, and that overhead is higher than reading the files twice.
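For reference, the hack being ruled out looks roughly like this (zos, alreadyCompressed, name and path are assumed to exist in the surrounding code); the entry stays DEFLATED, so no CRC is needed up front, but every byte still goes through the Deflater:

import java.nio.file.Files;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;

// The rejected approach: keep every entry DEFLATED and just lower the
// compression level for files that are already compressed.
if (alreadyCompressed)
{
    zos.setLevel(Deflater.NO_COMPRESSION);
}
else
{
    zos.setLevel(Deflater.DEFAULT_COMPRESSION);
}
zos.putNextEntry(new ZipEntry(name)); // defaults to DEFLATED, no size/CRC required up front
Files.copy(path, zos);
zos.closeEntry();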
