Make your CUDA development fly

The other day I came across Thrust, and I immediately became a fan – without even trying it out first!

The idea of this library is to provide CUDA developers with the most important set of tools for those simple but tricky parallel operations. The premise of CUDA is that the modern GPU is actually more capable than the modern CPU at computations involving large data sets in practically all cases that matter, and it is even quite easy to use – until you have to run some operator over the dataset that has global dependencies between the data elements.

For example, finding the maximum of a large dataset is surprisingly difficult to do efficiently in CUDA – Mark Harris of NVIDIA has written a nice paper explaining how to get good performance on the same algorithm with the maximum operator replaced by the sum operator (the algorithm itself is exactly the same, and therefore lends itself to being abstracted over the binary operator using templates).

Even more difficult is sorting a dataset efficiently on the GPU – here is a paper about it by Nadathur Satish, Michael Garland and Mark Harris. (By the way, the paper shows that the GPU is not always superior at this kind of task: for fairly short lists (fewer than one million entries) the CPU can be slightly faster than the GPU – but keep in mind that sorting is normally just a small piece of a bigger problem, and that this benchmark was run on quite old hardware: an Intel 4-core CPU against a GTX 280.)

The point I’m trying to make is that implementing these auxiliary functions alongside the “real work” can be notoriously difficult, and doing it wrong can bottleneck your whole computation over a silly detail – so what to do? Code them from scratch every time, then benchmark, test and improve? That can take days, if not weeks, even for good programmers!

And TADAA – in comes the Thrust library to the rescue! The idea of this library is to give you versatile, well-performing versions of all these algorithms, much as the STL does on the CPU. And, as far as I can tell, it just works!

Here is a blog post about a benchmark in which thrust::sort() beats std::sort() with ten times better performance! And that even includes copying the data from main system memory to the GPU and back, which is just silliness: when you do GPU compute, you keep the data on the GPU!

The great thing about Thrust is that the algorithms are completely general: you can replace the data types with any types, and the operators are of course customizable through standard template techniques.

This example is from the Thrust quickstart guide; it computes Y ← A·X + Y for vectors, the SAXPY operation from BLAS:

struct saxpy_functor
{
    const float a;

    saxpy_functor(float _a) : a(_a) {}

    __host__ __device__
    float operator()(const float& x, const float& y) const {
        return a * x + y;
    }
};

void saxpy_fast(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y)
{
    // Y <- A * X + Y
    thrust::transform(X.begin(), X.end(), Y.begin(), Y.begin(), saxpy_functor(A));
}

All I need to do now is put it to good use!
