Monday, 16 January 2017

Automatic computation reordering

libvips 8.5 (due out in March 2017) has just gained an interesting new feature thanks to the developer of PhotoFlow, a non-destructive RAW editor that uses libvips. It can now automatically reorder calculations to reduce recomputation. In some cases this can produce a dramatic speedup.

The problem

Let's have an example first. Consider this tiny Python program:

#!/usr/bin/env python

import sys
import time

import gi
gi.require_version('Vips', '8.0')
from gi.repository import Vips

kernel = Vips.Image.new_from_array([[1, 1, 1]], scale = 3)

for j in range(0, 20): 
    start = time.time()
    
    im = Vips.Image.new_from_file(sys.argv[1])

    for i in range(0, j):
        im = im.convsep(kernel)

    im.write_to_file(sys.argv[2])
    
    end = time.time()
    
    print('%d, %g' % (j, end - start))

It makes 20 image processing pipelines which repeatedly blur an image, each pipeline longer than the one before. If I run it on my laptop I see what looks like a simple linear relationship between pipeline length and run time:

[Graph: run time against pipeline length, showing linear growth]
The kink in the middle was just my web browser doing something, I think.

But look what happens if we make what ought to be a simple change to the program. If we change the line:

im = im.convsep(kernel)

to be:

im = im - im.convsep(kernel)

i.e. change the blur into a high-pass filter, our graph changes to this horror:

[Graph: run time against pipeline length, showing explosive growth]
It looks like exponential growth. By the time you reach length 20, it's taking more than 100x longer than before, and using more than 100x as much memory too. Even stranger, change that line again to the equivalent:

im = -im.convsep(kernel) + im

And we see:

[Graph: run time against pipeline length, linear again but a little slower]
A bit slower than the initial version, but we have nice linear behaviour back again. What is going on?

The answer lies in exactly what happens as libvips evaluates a binary arithmetic operator like - or +. Each time a new region of pixels is needed from the operator, it first computes the matching region on the left-hand side of the operator, then the matching region on the right, then loops over the two sets of pixels computing the result region.

Suppose libvips is asked to compute an M x N region of pixels from the subtract in the line:

im - im.convsep(kernel)

It'll first calculate M x N pixels of im (which will in turn require evaluating all of the earlier pipeline stages), then it'll try to get M x N pixels from convsep. Convolution is an area operation which (in this case) needs a one-pixel margin on every edge, so convsep will in turn request (M + 2) x (N + 2) pixels from im. libvips will spot that it has just computed an M x N region for im, but sadly that region is two pixels too small in each direction for convsep, so it'll throw the result away and recompute the larger area.
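To make the numbers concrete, here's a back-of-the-envelope calculation (plain Python, not libvips): each 3-tap separable convolution adds a one-pixel margin on every edge, so margins accumulate as requests travel back down the pipeline.

```python
def upstream_request(m, n, stages):
    """Size of the source region needed to compute an m x n output
    region through `stages` stacked 3-tap separable convolutions:
    each stage adds one pixel of margin on every edge."""
    return (m + 2 * stages, n + 2 * stages)

print(upstream_request(128, 128, 1))   # (130, 130)
print(upstream_request(128, 128, 20))  # (168, 168)
```

The growth per stage is small, which is why simply losing the margin pixels would be cheap; it's the wholesale recomputation that hurts.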

This means that at every stage in the pipeline, libvips is unable to reuse the pixels it computed for the left-hand side when it evaluates the right. Instead of each extra stage simply adding a fixed amount of work, it doubles the work of everything before it. The pipeline is O(2 ** N), not O(N)!
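You can see the blow-up with a toy cost model (plain Python, nothing to do with the libvips internals): count how many times pipeline stages get evaluated when a stage of the form `im - im.convsep(kernel)` cannot reuse the left-hand pixels for the right.

```python
# Toy cost model: each stage is "im - conv(im)". Without reuse, both
# sides of the subtract re-evaluate the previous stage from scratch;
# with reuse, the previous stage is evaluated only once.

def cost_no_reuse(n):
    if n == 0:
        return 1                        # the source image itself
    return 1 + 2 * cost_no_reuse(n - 1) # left AND right recompute stage n-1

def cost_with_reuse(n):
    if n == 0:
        return 1
    return 1 + cost_with_reuse(n - 1)   # previous stage computed once

for n in (1, 5, 10, 20):
    print(n, cost_no_reuse(n), cost_with_reuse(n))
# at n = 20, no-reuse does 2,097,151 stage evaluations against just 21
```

The no-reuse count is 2 ** (n + 1) - 1, which matches the 100x-plus slowdown seen in the graph.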

Now it should be clear why reversing the order of the operation speeds things up so much. If the convsep is evaluated first, it requests the larger region from im; the plain im on the other side then needs only a subset of those pixels, so simple reuse works.

The lower speed compared to the first version is just down to the arithmetic type. libvips has a very fast vector path for convolution on 8-bit int data. The - operator forces libvips to use 32-bit int data for most of the pipeline, causing a slowdown.

How to fix this

One solution is obviously to be careful about the order in which you do computations. This is pretty easy for simple cases, but in complex programs, especially ones which construct pipelines dynamically in response to input, it can be hard to decide in advance what the best order will be.

PhotoFlow has a neat trick for automatic reordering, which is now in libvips 8.5. libvips keeps an extra data structure on every image. It tracks all of the original source images (i.e. images loaded from files, or created fresh by operations) that the image depends on, along with the margin that has accumulated on each source at this point in the pipeline. When an operation has more than one input image, it merges the source tables of its inputs and removes duplicates. For every source the inputs share, it scores the inputs by the difference in cumulative margin. Finally, it reorders the inputs by score, evaluating the input with the largest accumulated margin first.
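Here's a sketch of that idea in plain Python (hypothetical names, much simplified from the real libvips code): each image carries a table mapping every original source to the margin accumulated so far, and a binary operation evaluates the input with the larger cumulative margin first, so the smaller request is a subset of pixels already computed.

```python
class Image:
    """Toy stand-in for a libvips image: just tracks, for every
    original source, the cumulative margin accumulated so far."""
    def __init__(self, sources):
        self.sources = dict(sources)

def conv(im, margin):
    # an area operation grows the cumulative margin on every source
    return Image({s: m + margin for s, m in im.sources.items()})

def eval_order(left, right):
    # score each input by its cumulative margins on the sources the
    # two inputs share, then evaluate the larger-margin input first
    shared = left.sources.keys() & right.sources.keys()
    left_score = sum(left.sources[s] for s in shared)
    right_score = sum(right.sources[s] for s in shared)
    if left_score >= right_score:
        return ("left", "right")
    return ("right", "left")

im = Image({"k2.jpg": 0})
blurred = conv(im, 1)              # like im.convsep(kernel)
print(eval_order(im, blurred))     # ('right', 'left'): do convsep first
```

With this ordering, `im - im.convsep(kernel)` behaves like the hand-reversed `-im.convsep(kernel) + im`.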

It's not always possible to avoid recomputation, but this system will pick the order that has a good chance of minimising it. And it's a relatively simple change, just a few hundred lines of code internal to libvips, and no changes to the API.

Benchmarks

Here's a more complex example, abstracted from PhotoFlow. Here two images are loaded, then blurred and blended in a loop. This is roughly what happens when you combine and mask a set of five layers.

#!/usr/bin/python

import sys

import gi
gi.require_version('Vips', '8.0')
from gi.repository import Vips

if len(sys.argv) != 4:
    print("usage: %s in1 in2 out" % sys.argv[0])
    sys.exit(1)

Vips.leak_set(True)

a = Vips.Image.new_from_file(sys.argv[1])
b = Vips.Image.new_from_file(sys.argv[2])

for i in range(0, 5):
    a_prime = a.gaussblur(2)
    b_prime = b.gaussblur(1)

    c = 0.9 * b + 0.1 * a_prime

    d = 0.9 * b_prime + 0.1 * c

    a = d
    b = c

b.write_to_file(sys.argv[3])

If I run this with reordering disabled and a couple of large images, I see:

$ time ./reorder.py ~/pics/k2.jpg ~/pics/k4.jpg x.jpg
memory: high-water mark 270.29 MB
real 0m4.076s
user 0m14.504s
sys 0m0.152s

Now enable reordering and I see:

$ time ./reorder.py ~/pics/k2.jpg ~/pics/k4.jpg x.jpg
memory: high-water mark 155.93 MB
real 0m1.497s
user 0m4.804s
sys 0m0.076s

Almost three times faster and more than 100MB less memory use: a very useful improvement.
