libvips used to have a pretty simple linear speedup as you added more threads. We measured a 27 x speedup with 32 threads a few years ago when we had some time on a large computer. Recently this seems to have dropped off a bit, for example when running vips sharpen on my two-core i5 laptop I see about a 1.75 x speedup, not very impressive. And, depressingly, this is despite quite a bit of work on the threading system to improve efficiency and scalability.
Looking back at some earlier benchmarks, libvips was able to get over 1.9 on similar tests. What's going on? If you're an expert on this you've probably already guessed. If not, read on.
I first wondered if there was some over-computation, that is, if threads were not cooperating correctly and some pixels were being calculated more than once. I added a little instrumentation to count pixels at various points and ruled that out: exactly the same number of calculations were being performed in both modes. That left threading overhead as a prime culprit. I needed to get a really detailed, accurate picture of exactly what was happening inside the code as it ran.
It's surprisingly difficult to get good thread timing information on a running program. You want to gather a lot of data and you want to affect the running program as little as possible. Most debuggers and profilers add too much overhead, so the usual solution is to add a small amount of instrumentation to the code.
Years ago libvips had a simple thread profiling system, and it was still sort-of there, but it needed some work to get it functional again. I decided to throw it away and try something better.
The idea is to use glib's thread-private storage system to attach a set of timing arrays to each thread. Threads can get to their private storage very quickly and without a mutex, so other threads won't be disturbed.
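As a rough illustration of the mechanism (this is a sketch, not the libvips code; ThreadTiming and thread_timing_get() are invented names), a per-thread timing block might be attached like this:

#include <glib.h>

/* One of these per thread. Invented for illustration.
 */
typedef struct {
    int n_events;          /* number of timestamps recorded */
    gint64 when[1024];     /* the timestamps, in microseconds */
} ThreadTiming;

static GPrivate timing_key = G_PRIVATE_INIT( g_free );

static ThreadTiming *
thread_timing_get( void )
{
    ThreadTiming *timing = g_private_get( &timing_key );

    if( !timing ) {
        /* First use in this thread: attach a fresh block. glib
         * frees it (via g_free above) when the thread exits.
         */
        timing = g_new0( ThreadTiming, 1 );
        g_private_set( &timing_key, timing );
    }

    return( timing );
}

Because each thread only ever touches its own block, there's no locking on this path.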
glib also has a very nice microsecond monotonic timer. This returns a time in microseconds, and guarantees that time will never go backwards. This is more difficult than it sounds: microsecond timers are registers run by the CPU. If a thread is suspended by the operating system and restarted on another core it will see a different timing register. If the clocks on the CPUs in your computer are not exactly synchronised (which they never are on large systems) time can appear to suddenly jump backwards or forwards on a reschedule. The monotonic timer contains code to automatically detect and fix this.
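Here's a tiny sketch (not from libvips) of reading the clock at two probe points. g_get_monotonic_time() returns microseconds as a gint64, and the difference between two readings can never be negative:

#include <stdio.h>
#include <glib.h>

int
main( void )
{
    gint64 before = g_get_monotonic_time();
    gint64 after = g_get_monotonic_time();

    /* after >= before always holds with the monotonic clock,
     * even if we were rescheduled onto another core in between.
     */
    printf( "delta = %" G_GINT64_FORMAT " us\n", after - before );

    return( 0 );
}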
I wanted to be able to time events, not just get the time at probe points. The new code in libvips lets you annotate the library like this:
VIPS_GATE_START( "vips_sequential_generate: wait" );This code is part of the system vips uses to keep a set of threads running in order. Something like a JPEG file will supply pixels strictly top-to-bottom. Threads arrive at the JPEG reader asking for chunks of image and they may not be in the correct order: the request for line 500 might arrive before the request for line 400. This code will stall an ahead-thread until a certain amount of time passes or until the line it needs becomes available.
/* Exit the loop on timeout or condition passes. We have to
* be wary of spurious wakeups.
*/
while( r->top > sequential->y_pos )
if( !g_cond_wait_until( sequential->ready, sequential->lock, time ) )
break;
VIPS_GATE_STOP( "vips_sequential_generate: wait" );
This kind of waiting is obviously a bad thing -- this is time we could have spent computing some useful result. The start and stop gates here surround the wait and mean that the gate system will measure the length of time spent in this section of code. The "wait" at the end of the gate name means something; we'll come to it later.
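For context, here's a self-contained sketch of the waiting pattern in the excerpt above. The names (Queue, queue->ready, queue->y_pos and so on) are invented for illustration and are not the libvips structures; the glib parts are real: g_cond_wait_until() takes an absolute deadline on the monotonic clock and returns FALSE on timeout, and the while loop guards against spurious wakeups.

#include <glib.h>

typedef struct {
    GMutex lock;
    GCond ready;
    int y_pos;          /* highest line produced so far */
} Queue;

/* Wait until line `top` has been produced, or until timeout_us
 * microseconds have passed.
 */
static void
queue_wait_for_line( Queue *queue, int top, gint64 timeout_us )
{
    /* g_cond_wait_until() wants an absolute deadline on the
     * monotonic clock, in microseconds.
     */
    gint64 deadline = g_get_monotonic_time() + timeout_us;

    g_mutex_lock( &queue->lock );

    /* Loop, since the wait can wake spuriously; FALSE means the
     * deadline passed.
     */
    while( top > queue->y_pos )
        if( !g_cond_wait_until( &queue->ready,
            &queue->lock, deadline ) )
            break;

    g_mutex_unlock( &queue->lock );
}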
The gate start and stop macros expand to something like:
if( vips__profile )
    vips__thread_gate_start( gate-name );

that is, a test of a global variable and a function call. The gate timing function itself fetches a thread-private pointer with a hash-table lookup, fetches a timing block with another hash-table lookup on the gate name (using the address of the string as the hash, we rely on the linker commoning-up static strings), gets the monotonic time, and stores it in a linked list of timing buffers. It's fast enough to have no effect on thread timing, as long as you keep the gates out of inner loops.
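To make that description concrete, here's a sketch of roughly how such a gate-start function could be structured. This is not the real libvips implementation -- it accumulates a single total per gate rather than keeping the full list of timing buffers -- and ThreadProfile, Gate and my_gate_start() are invented names.

#include <glib.h>

typedef struct {
    const char *name;   /* the static gate name string */
    gint64 start;       /* time of the most recent start */
    gint64 total;       /* accumulated time inside this gate */
} Gate;

typedef struct {
    GHashTable *gates;  /* keyed on the gate name pointer */
} ThreadProfile;

static GPrivate profile_key = G_PRIVATE_INIT( NULL );

static void
my_gate_start( const char *gate_name )
{
    ThreadProfile *profile;
    Gate *gate;

    /* Thread-private fetch: fast, and no mutex needed.
     */
    profile = g_private_get( &profile_key );
    if( !profile ) {
        profile = g_new0( ThreadProfile, 1 );

        /* Key on the address of the (static) name string, so
         * hashing is just pointer hashing.
         */
        profile->gates = g_hash_table_new( g_direct_hash,
            g_direct_equal );

        g_private_set( &profile_key, profile );
    }

    gate = g_hash_table_lookup( profile->gates, gate_name );
    if( !gate ) {
        gate = g_new0( Gate, 1 );
        gate->name = gate_name;
        g_hash_table_insert( profile->gates,
            (gpointer) gate_name, gate );
    }

    gate->start = g_get_monotonic_time();
}

Because the per-thread table is only ever touched by its own thread, there's no locking anywhere on the hot path.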
The gate system keeps a separate stack for start and for stop and a separate set of buffers for every gate name, so gates can be nested and can even overlap. The only requirement is that every start must have a matching stop.
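For example, nested gates might look like this. my_operation_build() is a made-up function, and the snippet assumes the VIPS_GATE_START / VIPS_GATE_STOP macros shown above are visible at this point in the library:

static void
my_operation_build( void )
{
    VIPS_GATE_START( "my_operation_build: work" );

    VIPS_GATE_START( "my_operation_build: phase one" );
    /* ... first chunk of work ... */
    VIPS_GATE_STOP( "my_operation_build: phase one" );

    VIPS_GATE_START( "my_operation_build: phase two" );
    /* ... second chunk of work ... */
    VIPS_GATE_STOP( "my_operation_build: phase two" );

    VIPS_GATE_STOP( "my_operation_build: work" );
}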
I've also instrumented periods when a thread is doing useful work. For example:
static int
vips_colour_gen( VipsRegion *or,
    void *seq, void *a, void *b, gboolean *stop )
{
    VipsRegion **ir = (VipsRegion **) seq;
    VipsColour *colour = VIPS_COLOUR( b );
    VipsColourClass *class = VIPS_COLOUR_GET_CLASS( colour );
    VipsRect *r = &or->valid;

    int i, y;
    VipsPel *p[MAX_INPUT_IMAGES], *q;

    for( i = 0; ir[i]; i++ )
        if( vips_region_prepare( ir[i], r ) )
            return( -1 );

    VIPS_GATE_START( "vips_colour_gen: work" );

    for( y = 0; y < r->height; y++ ) {
        for( i = 0; ir[i]; i++ )
            p[i] = VIPS_REGION_ADDR( ir[i], r->left, r->top + y );
        p[i] = NULL;

        q = VIPS_REGION_ADDR( or, r->left, r->top + y );

        class->process_line( colour, q, p, r->width );
    }

    VIPS_GATE_STOP( "vips_colour_gen: work" );

    return( 0 );
}

This is the work function for the base class of all the colourspace conversions. It comes in two main parts: first, it requests the necessary set of input pixels from the source images, then it loops over the output region calculating it a line at a time using a simple buffer-processing function supplied by the derived class.
The setup phase is just housekeeping, so we don't time that, but the loop that makes the pixels is doing useful work, so we put a gate around it. Again, the "work" at the end of the gate name has a meaning we'll come to.
I've added a third type of gate: one that covers long work periods. For example, vips opens a gate called main during vips_init() and closes it during vips_shutdown(). These are handy for following the bigger structure of a program run.
Running vips with the flag --vips-profile turns on profiling and dumps the timing information to a file called vips-profile.txt on program exit. There's a new Python program called vipsprofile which parses one of these things, calculates some simple statistics, and generates an SVG diagram of program execution using cairo.
Here's what it looks like during a run.
$ time vips sharpen k2.jpg x.jpg --radius 20 --vips-profile
recording profile in vips-profile.txt
real 0m0.546s
user 0m1.921s
sys 0m0.039s
$ vipsprofile
reading from vips-profile.txt
loaded 8537 events
total time = 0.530361
name        alive   wait%   work%   unkn%     memory   peakm
worker 20   0.48    1.14    96.1    2.77      3.93     7.08
worker 21   0.49    0.542   94.5    4.97      3.5      6.64
worker 22   0.49    2.23    94.5    3.28      3.83     6.98
worker 23   0.49    1.36    95.9    2.73      4.56     7.71
wbuffer 24  0.49    97      2.95    0.00367   0        0
wbuffer 25  0.49    97.2    2.82    0.00387   0        0
main 26     0.53    92.1    0       7.9       -15.8    1.06
peak memory = 29.5 MB
writing to vips-profile.svg
The four workers at the top are the workers in the thread pool that is doing the sharpen operation. The two wbuffer threads manage the buffers of pixels being calculated and write those buffers to the output file in the background. main is obviously the main thread of execution.

alive is how many seconds each thread was alive for, wait% is the percentage of time the thread was blocked in a wait gate, work% is the percentage of time it was calculating a useful result, and unknown% (unkn% in the table) is un-marked-up time: housekeeping, cache management and other general overheads.
Here's the diagram generated by vipsprofile:
Time runs left-to-right. The red/green bar at the top of each thread shows work and wait, with work in green, wait in red, and memory allocation as a white tick. Blue lines show other gates. U is a work unit in a thread pool, so in this case a single tile being processed. The green blocks in the two wbuffer threads are the background writes of complete tile buffers.
It all looks rather good, I think. The four threads fought a little during startup: you can see larger red bars for their first work unit as they queue up on the JPEG reader to fetch some pixels. There are some slight hiccoughs, but for most of the processing time they were all producing computed tiles at a fairly steady rate. This is reflected in the utilisation statistics calculated by vipsprofile: all threads were doing useful work around 95% of the time, and spent only 1 - 2% of the run waiting.
The first 0.04s are spent in startup, then the caches rapidly fill to maximum size and malloc activity slows. By about 0.2s, memory is stable and there are no further allocations.
This is for a relatively simple operation on a JPEG image, where read and write have to be serialised. If you do a more CPU-intensive operation or use a more thread-friendly image format, you get even better utilisation.
It seems to scale well to larger systems. Here's a 12-thread desktop PC. I had to increase the radius of the sharpen to stop it bottle-necking horribly on JPEG read:

$ time vips sharpen k2.jpg x.jpg --radius 50 --vips-profile
recording profile in vips-profile.txt
real 0m0.657s
user 0m6.808s
sys 0m0.124s
$ vipsprofile
reading from vips-profile.txt
loaded 9056 events
last time = 0.622665
name        alive   wait%    work%   unknown%
worker 60   0.57    1.44     97.3    1.29
worker 61   0.57    1.01     97.5    1.48
worker 62   0.57    0.7      97.9    1.41
worker 63   0.58    0.0926   98.5    1.43
worker 64   0.58    1.32     97.5    1.18
worker 65   0.58    0.765    97.8    1.43
worker 66   0.59    0.926    97.7    1.36
worker 67   0.59    0.719    98      1.3
worker 68   0.59    0.471    98.1    1.47
worker 69   0.59    0.477    98.2    1.3
worker 70   0.6     1.12     97.5    1.38
worker 71   0.6     1.15     97.5    1.39
wbuffer 72  0.61    98.3     1.67    0.00114
wbuffer 73  0.61    97.9     2.14    0.0018
main 74     0.65    92.8     0       7.23
writing to vips-profile.svg
So if the libvips caching and synchronisation system is working well (thank goodness), why aren't we seeing more than a 1.75 x speedup?
Looking at the traces, we spend about 40ms in startup. If we knock 40ms off the two times we had right at the start, we get about 1.8 x speedup with two threads, which is a bit better.
Another factor which I feel very foolish for not having thought of before is Turbo Boost. For a while now most many-core CPUs have been able to adjust their clock speed up and down with load. The idea is that if one core is idle, the other can speed up past the usual peak frequency while keeping the overall heat output of the chip within the allowed maximum. Intel brand this idea as Turbo Boost. My laptop i5 has a 20% clock boost for single-threaded operation.
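If you want to watch the boost happening, one option (assuming a Linux box with the cpufreq sysfs interface, the same mechanism cpufreq-set uses below) is to read each core's current frequency while a benchmark runs. This is just an illustrative snippet, not part of libvips:

#include <stdio.h>

int
main( void )
{
    int cpu;

    /* Four logical CPUs, to match the {0..3} loop below. */
    for( cpu = 0; cpu < 4; cpu++ ) {
        char path[256];
        FILE *fp;
        long khz;

        snprintf( path, sizeof( path ),
            "/sys/devices/system/cpu/cpu%d/cpufreq/"
            "scaling_cur_freq", cpu );

        if( !(fp = fopen( path, "r" )) )
            continue;
        /* scaling_cur_freq is in kHz. */
        if( fscanf( fp, "%ld", &khz ) == 1 )
            printf( "cpu%d: %.2f GHz\n", cpu, khz / 1e6 );
        fclose( fp );
    }

    return( 0 );
}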
If we lock all cores at 2.5GHz, the rated peak speed for the i5-3210M, we see this:
$ for i in {0..3}; do sudo cpufreq-set -f 2.5GHz -c $i; done
$ time vips sharpen k2.jpg x.jpg --radius 20 --vips-concurrency=1
real 0m1.301s
user 0m1.330s
sys 0m0.008s
$ time vips sharpen k2.jpg x.jpg --radius 20 --vips-concurrency=2
real 0m0.687s
user 0m1.331s
sys 0m0.020s
Taking the 40ms startup time off the two times here gives (1.301 - 0.04) / (0.687 - 0.04), about 1.95 x scaling, a bit better still.

Another factor is processor caching. The i5-3210M has a single 3MB L3 cache shared between the two cores. If one core is idle, the other can use the entire cache. This will give a further boost to single-threaded performance, though it's hard to quantify.
So in conclusion ... it's a hardware problem.