While working on a small command-line Java application (which will appear in a later post) I found myself needing to concurrently save lots and lots of files to a ZIP archive. Prior to the NIO.2 file APIs that debuted in Java 7 this wasn't easy to do. In this post we'll use some of the fundamental APIs from the java.nio.file package, along with the concurrency utilities, to quickly traverse a directory tree and copy file contents into a ZIP archive. However, to truly appreciate the speed boost NIO gives us when paired with an ExecutorService, we'll first visit and benchmark a more "vintage" approach leveraging the java.util.zip and java.io packages.

A Quick Tour Of The Project

The source code for this project can be found on GitHub, and contains four class files along with the Maven pom.xml file. A simple abstract base class named Archiver holds the boilerplate shared by the two implementations, and the Main class times the execution of either implementation. If you inspect the pom.xml file you'll notice I'm using the org.apache.commons.io third-party library.
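Archiver and Main aren't reproduced in this post, so here's a rough sketch of what the base class might look like, just to make the two subclasses easier to follow. The field names, the Runnable interface, and the counter are my assumptions; the real version lives in the repository.

public abstract class Archiver implements Runnable {
    private final String inputDir;   // directory to archive (assumed field name)
    private final String outputFile; // ZIP file to create (assumed field name)
    private int count;               // number of files archived so far

    Archiver(String inputDir, String outputFile) {
        this.inputDir = inputDir;
        this.outputFile = outputFile;
    }

    String getInputDir() { return inputDir; }

    String getOutputFile() { return outputFile; }

    int getCount() { return count; }

    // called by subclasses each time a file is added to the archive
    void incrementCount() { count++; }
}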

The Classic ZipOutputStream

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import org.apache.commons.io.IOUtils;

public class SimpleArchiver extends Archiver {
    private ZipOutputStream zos;

    SimpleArchiver(String inputDir, String outputFile) {
        super(inputDir, outputFile);
    }

    private void zipFile(File file) throws IOException {
        // each ZIP entry is named after the source file
        ZipEntry ze = new ZipEntry(file.getName());
        zos.putNextEntry(ze);
        // try-with-resources closes the input stream even if the copy fails
        try (FileInputStream fis = new FileInputStream(file)) {
            IOUtils.copy(fis, zos);
        }
        zos.closeEntry();
        incrementCount();
    }

    @Override
    public void run() {
        try {
            // FileOutputStream creates the output file if it doesn't exist
            zos = new ZipOutputStream(new FileOutputStream(getOutputFile()));

            // iterate through input directory entries
            // and copy anything that's a file
            File[] fileList = new File(getInputDir()).listFiles();
            if (fileList != null) {
                for (File file : fileList) {
                    if (!file.isDirectory()) {
                        zipFile(file);
                    }
                }
            }
        }
        catch (IOException e) {
            e.printStackTrace();
        }
        finally {
            // closing the stream finishes writing the archive
            IOUtils.closeQuietly(zos);
        }
    }
}

Here we iterate over the list of files within inputDir, and anything that isn't a directory gets passed along to the zipFile method. Before we can copy a file into the ZipOutputStream we call the required putNextEntry method, passing it a ZipEntry named after the file. Finally, when we're done, we clean up after ourselves by closing the output stream.

Using the SimpleArchiver class above I copied a sequence of 1000 low resolution images in 3.515416 seconds.
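For reference, the timings in this post come from the Main class, which wraps the run in a simple stopwatch. A minimal sketch of that idea, with placeholder paths rather than the exact code from the repository, looks like this:

Archiver archiver = new SimpleArchiver("images/", "archive.zip"); // placeholder paths
long start = System.nanoTime();
archiver.run();
double seconds = (System.nanoTime() - start) / 1_000_000_000.0;
System.out.printf("finished in %f seconds%n", seconds);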

ZIPping It Up NIO-Style

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.commons.io.IOUtils;

public class ThreadedArchiver extends Archiver {
    private FileSystem zipfs;
    private ExecutorService es = Executors.newFixedThreadPool(4);
    private Visitor visitor = new Visitor();

    class Callable implements java.util.concurrent.Callable<Integer> {
        private Path file;

        Callable(Path file) {
            this.file = file;
        }

        @Override
        public Integer call() throws Exception {
            // copy the input file into the ZipFileSystem;
            // try-with-resources closes both streams even if the copy fails
            try (InputStream in = Files.newInputStream(file);
                 OutputStream out = Files.newOutputStream(zipfs.getPath(file.getFileName().toString()))) {
                IOUtils.copy(in, out);
            }
            return 0;
        }
    }

    class Visitor extends SimpleFileVisitor<Path> {
        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
            // ignore directories and symbolic links
            if (attrs.isRegularFile()) {
                es.submit(new Callable(file));
                incrementCount();
            }
            return FileVisitResult.CONTINUE;
        }
    }

    ThreadedArchiver(String inputDir, String outputFile) {
        super(inputDir, outputFile);
    }

    private void createZipFileSystem() throws IOException {
        // open (or create) the output file as a ZipFileSystem via the jar: URI scheme
        Map<String, String> env = new HashMap<>();
        env.put("create", "true");
        URI zipURI = URI.create("jar:" + Paths.get(getOutputFile()).toUri());
        zipfs = FileSystems.newFileSystem(zipURI, env);
    }

    public void setEs(ExecutorService es) {
        this.es = es;
    }

    @Override
    public void run() {
        try {
            this.createZipFileSystem();
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to copy into if the ZipFileSystem couldn't be created
        }

        // walk the input directory using our visitor class
        FileSystem fs = FileSystems.getDefault();
        try {
            Files.walkFileTree(fs.getPath(this.getInputDir()), visitor);
        } catch (IOException e) {
            e.printStackTrace();
        }

        // shut down the ExecutorService and block until the copy tasks complete
        es.shutdown();
        try {
            es.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        // closing the ZipFileSystem writes the finished archive to disk
        try {
            zipfs.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

There are several new APIs being used in the ThreadedArchiver. First, we use the jar: URI scheme to create a ZipFileSystem, which we'll use in place of the old ZipOutputStream. Using Files.walkFileTree with a path from the default FileSystem we can traverse the inputDir and delegate each entry to the Visitor inner class. Similar to the previous example, we only want to copy regular files; directories and symbolic links are skipped. This time around, the file copying is performed by the Callable implementation and handled concurrently by a thread pool ExecutorService. Each task copies its file into the ZipFileSystem, which is safe for concurrent access.
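Running the archiver is just a matter of constructing it and calling run(). If you want to experiment with the pool size, the setEs setter shown above lets you swap in a different ExecutorService before the run starts. The paths and pool size below are just placeholders:

ThreadedArchiver archiver = new ThreadedArchiver("images/", "archive.zip"); // placeholder paths
archiver.setEs(Executors.newFixedThreadPool(8)); // replace the default 4-thread pool
archiver.run();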

Putting this class to the same test as above, I compressed the same 1000 images in 1.989857 seconds.

Analyzing The Results

Our ThreadedArchiver finished in roughly 57% of the time taken by the synchronous SimpleArchiver, about a 1.8x speedup. This may not seem that notable, but keep in mind we're dealing with a relatively small set of files that average only about 25KB in size. By tweaking the number of threads (the setEs setter makes this easy) I was able to increase the speed even more, though with too many threads performance eventually plateaued or even decreased. That said, I think it's clear that NIO's concurrency-safe APIs can offer a notable performance boost over the old-school java.io package and are worth considering if Java 7 is an option in your next project.