Archive for August, 2008

Speeding Ticket

August 15, 2008 - 9:43 pm No Comments

Computers today have ridiculous computational power. Not only do modern processors have high clock speeds, they often have multiple cores (multiprocessor architecture). Every one of the cores on a processor is also optimized to perform multiple operations at once (vector instructions) and to intelligently order execution order (instruction level parallelism). The net effect of these features makes processors too fast.

There is a problem getting information to the processor quick enough. Most processors can chug along enormous amounts of computation, but they can only compute if they have the data in front of them. Processor manufacturers try to always keep the processor busy by providing blazingly fast memory directly on the processor: this is not system memory (RAM). However, when the data you’re working with does not fit in the one or two megabytes of processor cache — the processor has to sit and bite its thumbs while waiting for data to come in from the sloth like system memory. This causes slowdowns in some DSP (Digital Signal Processing) algorithms when relatively large amounts of data are moved around. I spent three weeks at Drexel University researching the effects of certain design choices (programming language, compiler optimizations, iterative versus recursive control, loop unrolling) on the performance of the Walsh Hadamard Transform (WHT).

All that the fancy name of the WHT means is that you’re multiplying a vector (a single column matrix) by another matrix to get a resultant vector. Simple multiplication: (Walsh Matrix) * (Data) = Output. This is the bottom line. If you desire more detail, take a look at the poster.

The most illustrative example of slowdown caused by moving data is between iterative and recursive algorithms. In a perfect, theoretical world both iterative (going straight down a list) and recursive (splitting the list into smaller parts, then doing it) methods would work exactly alike. They’re both performing the same amount of work, just in a different order. Both algorithms perform in O (n log n) time: this means it will take n log n operations (and time) for n operations. The graph below was normalized to an n log n scale. This means that both algorithms should be straight, horizontal lines and should be as high as the theoretical maximum number of operations that the processor can perform: in our case it was 8 billion operations per second (8 GFLOPS). Reality is a bit different.

The higher the points on the graph, the quicker the processor is working. Ideally you would want the processor to be crunching data as fast as it theoretically can (eight billion operations per second) but the memory access bottlenecks processor speed. The upward slope for small amounts of data suggests that the processor does not get “warmed up” quickly enough. It wastes time doing housekeeping chores in anticipation that there will be lots of processing to do, which significantly slows down speed.

The most important trend in the graph is the huge decrease in performance at a transform size of 2^19. What’s curious is that 2^19 corresponds to a memory usage of one megabyte, which fits in the processor L2 cache (2 megabytes in our case). What’s even more interesting is that 2^20 corresponds to a size of 2 megabytes, which no longer fits in the processor’s 2 megabyte cache memory. Suspense rises even more as you realize that one of the algorithms (both of them should’ve been IDENTICAL!) suffers a lot more from running out of cache memory, but is faster initially. This suggests that the iterative (blue) algorithm is effected the most by cache size. The recursive method, on the other hand, works only on small chunks of the data and has a high overhead because of the functional calls inherent to recursion — but is resilient to cache misses because it doesn’t need to store the whole transform.
So how do you combine the best of both worlds — the quickness of the iterative algorithm for small transforms, and the resilience to cache misses (when you run out of processor cache and need to use system RAM) of the recursive algorithm? This is something that the SPIRAL project that is based in Carnegie Mellon University is researching. They’ve already developed a method that combines the properties of both the iterative and recursive algorithms that does not suffer from cache limits.

If you’re interested in finding out more, take a look at my poster.