I started learning real-time 3D graphics as soon as I started programming – that is, about 12 years ago. At the time, OpenGL 3 was still new, and most online tutorials covered the (now ancient)
immediate mode commands (glBegin/glEnd
and friends). Over the years, I've learned a lot: VBOs & VAOs, shaders, deferred shading, various optimization techniques. Here's a shameless plug of what I got from one of those scope-is-too-large-for-me-to-ever-finish-it projects with my own pure C++ & OpenGL engine:
OpenGL 4.3, which appeared in 2012, introduced a shiny new thing: compute shaders. Despite that being 10 years ago, I've still never actually done anything with them. However, they've become an essential part of modern rendering: all modern solutions for overdraw (visibility buffer, software rasterization), order-independent transparency (per-pixel linked lists), fast convolutions, etc. use compute shaders, while I'm still stuck with OpenGL 3.3!
In a recent Twitter post I asked for some beginner-friendly, graphics-oriented resources for learning compute shaders, and unfortunately there aren't many. Most resources are either too advanced or dedicated to general-purpose computing (GPGPU), which isn't a problem by itself, but is less helpful when one's main interest is graphics.
So, I decided to document my own attempts at learning this stuff in the form of a blog post series, in the hope that it helps someone like me. In this post, I'll try to implement Gaussian blur using compute shaders and compare the performance to a classic fragment-shader-based implementation (spoiler: the results aren't great at all). All the code for this post is available at this repository. I'm using my pet C++ engine, but the OpenGL-related stuff should be clear for the most part. I'm using CMake as the build system, so you should be able to easily clone the repo and run the examples.
I'm testing everything on my GeForce GTX 1060, with a screen resolution of 1920x1080, without any downsampling.
Disclaimer: as I've said, I'm only learning all this stuff, and this is literally the first time I'm writing serious compute shaders and analyzing their performance. Beware of dramatic ignorance ahead :)
First, let's render something cheap & animated that we will blur later. It doesn't matter much what we blur, since the algorithm is exactly the same whatever the pixel values are. I think a bunch of shiny rotating icosahedra will do:
The first (and the simplest) way to implement blur is a full-screen post-process single-pass effect:
I'll take a 33×33 Gaussian kernel with σ=10. The kernel size is chosen to be nice for future compute implementations (33 = one pixel + 16 on the left + 16 on the right).
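As an aside, the coeffs arrays in the shaders below are pre-generated rather than computed at runtime. A minimal sketch of how such coefficients could be produced on the CPU (the function name and the normalize-to-unit-sum step are my own choices here, not necessarily what the accompanying repository does):

#include <cmath>
#include <vector>

// Generate the 2*M+1 one-dimensional Gaussian weights for a given sigma,
// normalized so that they sum to 1 (hypothetical helper, for illustration only)
std::vector<float> gaussian_coeffs(int M, float sigma)
{
    std::vector<float> coeffs(2 * M + 1);
    float sum = 0.f;
    for (int i = -M; i <= M; ++i)
    {
        float w = std::exp(-float(i * i) / (2.f * sigma * sigma));
        coeffs[i + M] = w;
        sum += w;
    }
    for (float & w : coeffs)
        w /= sum; // keep the overall brightness unchanged
    return coeffs;
}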
The shader looks relatively simple: we loop over all pixels in a 33×33 square centered at the current pixel and average them with appropriate weights:
uniform sampler2D u_input_texture;
uniform vec2 u_texture_size_inv;
layout (location = 0) out vec4 out_color;
in vec2 texcoord;
const int M = 16;
const int N = 2 * M + 1;
const float coeffs[N] = float[N](...); // generated kernel coefficients
void main()
{
    vec4 sum = vec4(0.0);
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            vec2 tc = texcoord + u_texture_size_inv * vec2(float(i - M), float(j - M));
            sum += coeffs[i] * coeffs[j] * texture(u_input_texture, tc);
        }
    }
    out_color = sum;
}
and this is what the result looks like:
(the full code is available here).
Doesn't look that impressive, but it doesn't need to – making the original image less interesting is exactly what blur does :)
Note that we don't do anything special about transparency here. For transparency-aware blur, you might want to additionally weight the colors by their alpha channel and normalize (divide by total accumulated alpha).
Let's measure how well it performs. The cheap and dirty way to measure your rendering performance is just to disable vsync and look at the time you spend from frame to frame. A better way is to use OpenGL timer queries, which are a little tricky due to the asynchronous nature of the GPU – you don't know when a particular query result will be ready. Thankfully, this can be easily dealt with by using a pool of queries that creates a new query object if all others are still busy. I'll use both methods: timer queries for measurement, and frame durations for sanity checks (e.g. to make sure that I'm measuring the right thing and there's nothing else occupying half the frame time).
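Here's a minimal sketch of such a query pool, just to illustrate the idea (the struct and member names are made up; this isn't the engine's actual implementation):

#include <deque>
#include <vector>

// Assumes an OpenGL loader providing glGenQueries, glBeginQuery, etc. is already included
struct gpu_timer_pool
{
    std::vector<GLuint> free_queries;   // queries ready to be reused
    std::deque<GLuint> pending_queries; // queries whose results haven't been read yet

    void begin()
    {
        if (free_queries.empty())
        {
            GLuint q;
            glGenQueries(1, &q); // grow the pool if all queries are still in flight
            free_queries.push_back(q);
        }
        GLuint q = free_queries.back();
        free_queries.pop_back();
        glBeginQuery(GL_TIME_ELAPSED, q);
        pending_queries.push_back(q);
    }

    void end()
    {
        glEndQuery(GL_TIME_ELAPSED);
    }

    // Collect the results (in milliseconds) of all queries that are already available
    std::vector<double> poll()
    {
        std::vector<double> results;
        while (!pending_queries.empty())
        {
            GLuint q = pending_queries.front();
            GLint available = 0;
            glGetQueryObjectiv(q, GL_QUERY_RESULT_AVAILABLE, &available);
            if (!available)
                break; // results become available in submission order
            GLuint64 nanoseconds = 0;
            glGetQueryObjectui64v(q, GL_QUERY_RESULT, &nanoseconds);
            results.push_back(nanoseconds * 1e-6);
            pending_queries.pop_front();
            free_queries.push_back(q);
        }
        return results;
    }
};

Usage: call begin() and end() around the draw calls you want to time, and poll() once per frame to print or accumulate the timings.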
The performance turns out to be pretty sad: 15ms for the whole blur. This is clearly unsatisfactory (we typically have just 16ms for the whole frame!), so let's move on to a better approach.
If you've ever read any resources on Gaussian blur, you know that this single-pass implementation is the wrong way of doing it. It so happens that Gaussian filters are separable, meaning that the corresponding weight matrix has rank one and can be expressed as an outer product of two vectors:
\[ C = x \otimes y \] \[ C_{ij} = x_i \cdot y_j \] (see also this amazing post about low-rank filter approximations).
In the case of a Gaussian filter, the corresponding decomposition is
\[ \exp\left(-\frac{i^2+j^2}{2\sigma^2}\right)=\exp\left(-\frac{i^2}{2\sigma^2}\right)\cdot\exp\left(-\frac{j^2}{2\sigma^2}\right) \] (I'm ignoring normalization here).
In fact, this is exactly how the weights were defined in the naive implementation: a single array of N coefficients is enough to reconstruct the full N×N coefficient matrix.
What it means is that we can replace our single-pass algorithm with a two-pass algorithm: instead of replacing every pixel with a weighted average of all N×N neighbouring pixels, we do
1. a horizontal pass, replacing every pixel with a weighted average of the N pixels in the same row (itself plus M neighbours on each side), and
2. a vertical pass, doing the same for the N pixels in the same column of the horizontally blurred image.
This way, vertical averaging deals with pixels that have already been averaged horizontally, thus producing a full blur. Specifically, say W[i]
is the one-dimensional weight vector, meaning that
W[i] * W[j]
is the coefficient of some pixel's P[x,y]
contribution to its neighbour P[x+i,y+j]
when performing the full blur. Then the horizontal pass adds W[i] * P[x,y] to the value of P[x+i,y], and the vertical pass then adds W[j] times that (already horizontally averaged) value of P[x+i,y] to P[x+i,y+j], so the contribution of P[x,y] ends up multiplied by W[i] * W[j] by the time it reaches P[x+i,y+j], which is exactly what we want. Not sure who needs yet another explanation of separable filters, but, hey, it never hurts.
The point of this separation is that, compared to the naive version which uses N² texture accesses (per fragment), this version only uses 2N texture accesses (2 passes, each having N accesses per pixel). This sounds like a drastic increase in performance: assuming everything else is negligible compared to texture access, for N=33 we'll get about N²/2N = 33/2 = 16.5 times better performance! Of course, things aren't that simple, but we still can hope for a huge improvement.
On the OpenGL side, we'll have to use two framebuffers this time. The algorithm is as follows:
1. Render the scene into framebuffer 1.
2. Render a full-screen quad with the horizontal blur shader, sampling framebuffer 1's color texture, into framebuffer 2.
3. Render a full-screen quad with the vertical blur shader, sampling framebuffer 2's color texture, directly to the screen (the default framebuffer).
Note that we don't need a depth attachment for the second framebuffer.
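A rough sketch of this per-frame sequence (the helper functions and variable names here are placeholders, not the repository's actual API):

// 1. Render the scene into framebuffer 1 (color + depth)
glBindFramebuffer(GL_FRAMEBUFFER, scene_fbo);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
draw_scene(); // hypothetical

// 2. Horizontal blur: sample framebuffer 1's color texture, write into framebuffer 2
glBindFramebuffer(GL_FRAMEBUFFER, blur_fbo);
glUseProgram(blur_program);
glBindTexture(GL_TEXTURE_2D, scene_color_texture);
glUniform2f(u_direction_location, 1.f / screen_width, 0.f); // one-texel step along x
draw_fullscreen_quad(); // hypothetical

// 3. Vertical blur: sample framebuffer 2's color texture, write to the screen
glBindFramebuffer(GL_FRAMEBUFFER, 0);
glBindTexture(GL_TEXTURE_2D, blur_color_texture);
glUniform2f(u_direction_location, 0.f, 1.f / screen_height); // one-texel step along y
draw_fullscreen_quad();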
Implementing a separable kernel is even easier than the full blur: we only need a single loop over neighbouring pixels, and the direction of the blur can be controlled by a uniform variable:
uniform sampler2D u_input_texture;
uniform vec2 u_direction;
layout (location = 0) out vec4 out_color;
in vec2 texcoord;
const int M = 16;
const int N = 2 * M + 1;
// sigma = 10
const float coeffs[N] = float[N](...); // generated kernel coefficients
void main()
{
    vec4 sum = vec4(0.0);
    for (int i = 0; i < N; ++i)
    {
        vec2 tc = texcoord + u_direction * float(i - M);
        sum += coeffs[i] * texture(u_input_texture, tc);
    }
    out_color = sum;
}
(the full code is here).
The result is significantly better: 1ms for the whole blur. A 15x improvement, meaning our 16.5x estimate wasn't that bad after all :)
There's a neat trick that I learnt long ago but never had the opportunity to try. See, we're still bound by memory accesses: all the shader does is fetch texture data and do a pitiful amount of arithmetic. Modern GPUs famously have far more compute throughput than memory bandwidth, so anything that decreases our texture access count is a win.
What we've been ignoring up to now is that GPUs have a whole zoo of special hardware dedicated to various graphical needs. In particular, they can do linear filtering on textures almost for free! Linear filtering is just averaging neighbouring pixels – which sounds a lot like what we're trying to do!
So, the idea goes like this: when doing a weighted average of neighbouring pixels, instead of computing W[i] * P[i] + W[i+1] * P[i+1]
, let's ask the GPU to mix P[i]
and P[i+1]
with appropriate weights and return the result to us via a single texture fetch. What happens when we read a texture with linear filtering at coordinate i+t
(where t
is in the range [0..1)
) is that it returns lerp(P[i], P[i+1], t) = P[i] * (1-t) + P[i+1] * t, which we'll have to scale by some weight W. This gives us a system of equations:
\[ W \cdot (1 - t) = W_i \] \[ W \cdot t = W_{i+1} \]
which leads to
\[ W = W_i + W_{i+1} \] \[ t = W_{i+1} / W \] Given these values of W and t, we can compute the full contribution of two neighbouring pixels with a single texture fetch! The only change happens in the shader. Since the kernel size is odd (N=33), I chose to always fetch the pixel at offset 0 separately, and group the other pixels into pairs to apply the aforementioned trick:
uniform sampler2D u_input_texture;
uniform vec2 u_direction;
layout (location = 0) out vec4 out_color;
in vec2 texcoord;
const int M = 16;
// sigma = 10
const float coeffs[M + 1] = float[M + 1](...); // generated kernel coefficients
void main()
{
    vec4 sum = coeffs[0] * texture(u_input_texture, texcoord);
    for (int i = 1; i < M; i += 2)
    {
        float w0 = coeffs[i];
        float w1 = coeffs[i + 1];
        float w = w0 + w1;
        float t = w1 / w;
        sum += w * texture(u_input_texture, texcoord + u_direction * (float(i) + t));
        sum += w * texture(u_input_texture, texcoord - u_direction * (float(i) + t));
    }
    out_color = sum;
}
So, instead of N=33 texture fetches we now have 1+(N-1)/2=17 of them, leading to a theoretical ~1.94x performance increase. Indeed, what I got was 0.54ms for the whole blur, about a 1.85x increase! It always fascinates me when theoretical performance predictions match reality. I wish this happened more often :)
Now that we've settled on our baseline OpenGL 3.3 implementation, let's start doing compute shaders! For these, you need either OpenGL 4.3 or the ARB_compute_shader
extension (I'm using the latter since I want the engine to work on older devices that only support OpenGL 3.3).
I won't dive deep into explaining how compute shaders work, but the TL;DR is:
1. A compute shader is a standalone shader that isn't attached to the usual rasterization pipeline: no vertices, no fragments, it just runs some computation on the GPU.
2. Its invocations are organized into workgroups; the workgroup size is declared in the shader itself, and invocations within a workgroup can share data via shared variables and synchronize with barriers.
3. You launch a whole grid of workgroups with a single glDispatchCompute call, and the shader reads and writes whatever resources (images, buffers) you explicitly bind to it.
First, let's write a naive implementation: a full N×N loop that does averaging with Gaussian weights. Instead of reading a texture via the texture
GLSL function, we'll use the imageLoad
function. Instead of writing pixels to the framebuffer as part of the usual rendering pipeline, we'll manually write pixels to the output texture with the imageStore
function.
Note that these functions have image
in their name, not texture
. OpenGL makes a clear distinction between images and textures: an image is just that, an array of pixels, while a texture is roughly a set of images (e.g. mipmap levels) together with a whole bunch of sampling options (linear/nearest/trilinear filtering, swizzling, clamping/repeating, anisotropy, etc). One can use textures in compute shaders, but we simply don't need that now since we're going to read & write single pixels without any filtering or other special effects. Binding images to shaders is done using glBindImageTexture
.
So, we're going to select some workgroup size (say, 16×16), and have each shader invocation write exactly one pixel. For a screen resolution W×H
we'll dispatch a grid of ceil(W/16) × ceil(H/16)
workgroups to make sure that these workgroups cover the entire screen. Additionally, if, say, the screen height is not a multiple of 16, we'll insert some checks in the shader so that the shader invocations corresponding to off-screen pixels won't do anything (most importantly, they won't try to write to the output image).
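In code, the dispatch might look roughly like this (the variable names are placeholders):

// Bind the input & output images to the units declared in the shader,
// then dispatch enough 16x16 workgroups to cover the whole screen
glUseProgram(blur_compute_program);
glBindImageTexture(0, input_texture, 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA8);
glBindImageTexture(1, output_texture, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8);
GLuint groups_x = (screen_width + 15) / 16;  // integer ceil(W / 16)
GLuint groups_y = (screen_height + 15) / 16; // integer ceil(H / 16)
glDispatchCompute(groups_x, groups_y, 1);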
So, here's the full compute shader:
layout(local_size_x = 16, local_size_y = 16) in;
layout(rgba8, binding = 0) uniform restrict readonly image2D u_input_image;
layout(rgba8, binding = 1) uniform restrict writeonly image2D u_output_image;
const int M = 16;
const int N = 2 * M + 1;
// sigma = 10
const float coeffs[N] = float[N](...); // generated kernel coefficients
void main()
{
    ivec2 size = imageSize(u_input_image);
    ivec2 pixel_coord = ivec2(gl_GlobalInvocationID.xy);
    if (pixel_coord.x < size.x && pixel_coord.y < size.y)
    {
        vec4 sum = vec4(0.0);
        for (int i = 0; i < N; ++i)
        {
            for (int j = 0; j < N; ++j)
            {
                ivec2 pc = pixel_coord + ivec2(i - M, j - M);
                if (pc.x < 0) pc.x = 0;
                if (pc.y < 0) pc.y = 0;
                if (pc.x >= size.x) pc.x = size.x - 1;
                if (pc.y >= size.y) pc.y = size.y - 1;
                sum += coeffs[i] * coeffs[j] * imageLoad(u_input_image, pc);
            }
        }
        imageStore(u_output_image, pixel_coord, sum);
    }
}
(the full code is here).
On the CPU side, we'll do the following:
1. Render the scene into framebuffer 1 (its color texture will be the input image for the compute shader).
2. Call glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT).
3. Bind the input and output images and dispatch the compute shader.
4. Call glMemoryBarrier(GL_FRAMEBUFFER_BARRIER_BIT).
5. Blit the output texture (attached to framebuffer 2) onto the screen.
As far as I know, we can't use the compute shader to write directly onto the screen, so we write to a framebuffer-attached texture, and then use glBlitFramebuffer
to copy the result onto the screen.
Let's talk about memory barriers. Normally, OpenGL is designed in such a way that it's pretty obvious to the driver what commands read/write what data (e.g. a draw command reads vertex buffers & bound textures and writes to currently bound framebuffer, etc). Even if the GPU decides to shuffle around user's commands (and it certainly will!), it can use that knowledge about data dependence to prevent erroneous ordering of commands. It basically means that OpenGL guarantees that all data written by a command will be seen by any following command that tries to read that data, and it happens magically, and you don't even need to think about that.
Unfortunately, compute shaders are a bit trickier: they can read and write arbitrary portions of arbitrary bound objects, so the GPU has a hard time trying to guess the exact data dependencies. That's what memory barriers are for: they tell the GPU what data depends on what. For example, we want all the data rendered at step 1 to be visible to the image load-store operations performed at step 3. This is what GL_SHADER_IMAGE_ACCESS_BARRIER_BIT is for. Next, we want the pixels written at step 3 to be visible to the framebuffer blit operation at step 5; this is what GL_FRAMEBUFFER_BARRIER_BIT
does. I hope I didn't mess that up :)
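Put together, steps 1–5 above might look roughly like this in code (a sketch with made-up names, not the repository's exact code):

// 1. Render the scene into a framebuffer; its color texture is the compute shader's input
glBindFramebuffer(GL_FRAMEBUFFER, scene_fbo);
draw_scene(); // hypothetical

// 2. Make the rendered pixels visible to imageLoad in the compute shader
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);

// 3. Run the blur compute shader (the glBindImageTexture + glDispatchCompute calls shown earlier)
dispatch_blur(); // hypothetical wrapper

// 4. Make the pixels written by imageStore visible to the framebuffer blit
glMemoryBarrier(GL_FRAMEBUFFER_BARRIER_BIT);

// 5. Blit the output texture (attached to another framebuffer) onto the screen
glBindFramebuffer(GL_READ_FRAMEBUFFER, output_fbo);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);
glBlitFramebuffer(0, 0, width, height, 0, 0, width, height, GL_COLOR_BUFFER_BIT, GL_NEAREST);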
Combining all this, I got the following performance for various workgroup sizes:
4x4: 40.0ms | 4x8: 25.3ms | 4x16: 25.7ms | 4x32: 26.0ms | 4x64: 26.5ms
8x8: 24.2ms | 8x16: 24.7ms | 8x32: 25.2ms | 8x64: 25.4ms
16x16: 25.0ms | 16x32: 25.5ms | 16x64: 26.0ms
32x32: 30.7ms
(there's no 32x64 or 64x64 workgroup size since this exceeds the maximum workgroup size for my GPU).
Performance is bad for 4x4, but it stays pretty much the same for all other workgroup sizes, although there are some trends. This correlates nicely with my understanding that Nvidia GPUs typically execute shaders in groups of 32 (so-called warps): any workgroup size which is a multiple of 32 threads is fine (it gets spread across several warps), while 4x4 occupies only half of a warp and thus a lot of computational power is wasted (serious handwaving here!).
Anyway, that's still about 1.6 times worse than the non-compute naive implementation. Let's dig deeper.
My first instinct for improving the naive compute implementation was to use workgroup shared memory. After all, that's one of the unique features of compute shaders!
The general idea of using LDS (local data share, i.e. workgroup shared memory) is to exploit data locality: many threads in the same workgroup need the same data. Instead of each thread reading the memory it needs on its own, we make the threads cooperatively read all the data they need into shared memory, and then proceed with the computations as usual, reading from this shared memory instead (which is supposedly very fast!).
Specifically for a convolution filter with size N = 2*M+1
(like Gaussian blur), a single thread accesses NxN
pixels, while a whole workgroup of size GxG
accesses (G+2*M)x(G+2*M)
pixels:
(here, M=2
, N=5
and G=4
; red is a single pixel or a 4x4 workgroup, blue are the accessed pixels outside it).
So, the new algorithm is:
1. The threads of a workgroup cooperatively load all the (G+2*M)x(G+2*M) input pixels the workgroup will need into a shared array.
2. Synchronize the workgroup with barrier().
3. Each thread computes its output pixel as before, but reads the input values from the shared array instead of the input image.
There are a lot of details here (distributing the loads between threads, handling the image borders, getting the synchronization right) which I won't go into – the full code is here. Sad news: this variant has incredibly poor performance:
4x4: 175ms | 8x8: 134ms | 16x16: 117ms
It didn't let me create a 32x32 workgroup, since the shared array size would exceed the available workgroup shared memory.
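This is easy to check on paper: a GxG workgroup needs a (G+2M)x(G+2M) tile of RGBA pixels, i.e. 16 bytes per pixel if stored as vec4s, while the guaranteed minimum for GL_MAX_COMPUTE_SHARED_MEMORY_SIZE is only 32 KB (48 KB is a common value on NVIDIA hardware). A quick back-of-the-envelope calculation, assuming vec4 storage:

#include <cstdio>

int main()
{
    const int M = 16; // kernel radius
    for (int G : {8, 16, 32})
    {
        int tile = G + 2 * M;
        int bytes = tile * tile * 16; // one vec4 (4 floats) per pixel
        std::printf("%2dx%-2d workgroup: %2dx%-2d tile = %5d bytes of shared memory\n",
                    G, G, tile, tile, bytes);
    }
    // 16x16 -> 48x48 tile = 36864 bytes (fits in a 48 KB limit),
    // 32x32 -> 64x64 tile = 65536 bytes (doesn't)
}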
Fine, probably compute shaders aren't as miraculously fast as I thought! Let's try the first optimization we did – the two-pass blur – with compute shaders. Nothing essentially new here, except that this time we need three framebuffers:
1. Framebuffer 1: the scene is rendered into its color texture.
2. Framebuffer 2: its color texture receives the result of the horizontal blur pass.
3. Framebuffer 3: its color texture receives the result of the vertical blur pass and is then blitted onto the screen.
In fact, framebuffer 2 is optional, since we don't use it as a framebuffer – only write to and read from its color texture.
The shader is relatively simple:
layout(local_size_x = 16, local_size_y = 16) in;
layout(rgba8, binding = 0) uniform restrict readonly image2D u_input_image;
layout(rgba8, binding = 1) uniform restrict writeonly image2D u_output_image;
uniform ivec2 u_direction;
const int M = 16;
const int N = 2 * M + 1;
// sigma = 10
const float coeffs[N] = float[N](...); // generated kernel coefficients
void main()
{
    ivec2 size = imageSize(u_input_image);
    ivec2 pixel_coord = ivec2(gl_GlobalInvocationID.xy);
    if (pixel_coord.x < size.x && pixel_coord.y < size.y)
    {
        vec4 sum = vec4(0.0);
        for (int i = 0; i < N; ++i)
        {
            ivec2 pc = pixel_coord + u_direction * (i - M);
            if (pc.x < 0) pc.x = 0;
            if (pc.y < 0) pc.y = 0;
            if (pc.x >= size.x) pc.x = size.x - 1;
            if (pc.y >= size.y) pc.y = size.y - 1;
            sum += coeffs[i] * imageLoad(u_input_image, pc);
        }
        imageStore(u_output_image, pixel_coord, sum);
    }
}
(the full code is here).
And here are the timings for different workgroup sizes:
4x4: 2.79ms | 8x8: 1.59ms | 16x16: 1.66ms | 32x32: 2.30ms
We're finally close to the performance we had with the non-compute implementation, although still about 3x slower.
Let's combine both optimizations! A two-pass blur, with each pass using a shared array to prefetch all the accessed input pixels. This time, I'm using Gx1 workgroups for the horizontal blur, and 1xG for the vertical pass, since e.g. during a horizontal pass a pixel shares a lot of accessed input pixels with its horizontal neighbours, but none with its vertical neighbours.
The code is here and the results are reasonably satisfying:
4x1: 10.8ms | 8x1: 4.5ms | 16x1: 2.15ms | 32x1: 1.08ms
64x1: 1.07ms | 128x1: 1.06ms | 256x1: 1.06ms
This is still slower than our non-compute implementation, though :)
The final idea I wanted to try is to get rid of the intermediate buffer used to store the result of the horizontal blur. We have workgroup shared memory, why not use it? The algorithm is:
1. Each workgroup computes the horizontal blur for all the pixels it will later need, storing the results in a shared array.
2. Synchronize with barrier().
3. Each thread computes the vertical blur of its pixel from the horizontally blurred values in the shared array and writes the result to the output image.
The code is here and here are the numbers:
4x4: 75ms | 8x8: 29ms | 16x16: 16ms
Well, either I did something really wrong, or it was a bad idea.
So, what's happening here? The problem is I genuinely have no idea. I'm still learning, remember? :)
Maybe I'm doing something really wrong in my compute shaders. Maybe the texture cache is actually so good that it outperforms clever manual shared-memory optimizations. Maybe the rasterizer orders its own fragment shader workgroups in some clever way (e.g. using Hilbert or Morton curves) to improve data locality (there are ways to do this with compute shaders as well, see e.g. this paper). Anyway, what this definitely does prove is that GPU optimization is a hard topic! Who would have thought :)
If you have any corrections, suggestions, or explanations, feel free to reach me in any convenient way. And, well, thanks for reading :)