Notes on Writing Recursive Shaders


If you’re reading this, you probably know that

  1. Shaders can’t do recursion
  2. Recursion can be implemented with a stack

I recently had to put this knowledge to the test when porting a number of recursive algorithms to compute shaders. The gist is very simple: Starting out with a recursive function

void foo(uint a, uint b)
{
    if (cond(a, b))
    {
        return;
    }
    
    // ...
    
    foo(a - 1, 3);
    foo(a - 1, 5);
}

first create a stack of call arguments

struct Args
{
    uint a;
    uint b;
};

Args stack[MAX];
uint stackPos = 0;

and instead of calling foo, push arguments onto the stack. The entire algorithm can then be written as a loop:


// push initial arguments
stack[stackPos++] = Args(a, b);

do
{
    // pop arguments
    const Args args = stack[--stackPos];

    if (cond(args.a, args.b))
    {
        continue;
    }

    // work ...

    // recurse
    stack[stackPos++] = Args(args.a - 1, 3);
    stack[stackPos++] = Args(args.a - 1, 5);

} while(stackPos > 0);

This is a great, simple scheme, as long as the original function

  1. is tail recursive, meaning the recursive calls are the very last actions
  2. doesn’t return anything

Tail recursion is required because the conversion changes the order of execution. In a truly recursive function, anything placed after a self-call only runs once that entire subtree has finished executing. In the stack scheme, however, a "call" merely pushes arguments onto the stack - nothing has actually happened yet, so code following the pushes would run before the subtrees are processed.
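
To make this concrete, here is a hypothetical non-tail-recursive variant (bar and finalize are made-up names) that the naive conversion would get wrong:

void bar(uint a, uint b)
{
    if (cond(a, b))
    {
        return;
    }

    // ...

    bar(a - 1, 3);
    bar(a - 1, 5);

    // this must run after BOTH subtrees have finished; the naive stack scheme
    // would run it right after pushing the arguments, i.e. too early
    finalize(a, b);
}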

Implementing Return Values

While tail recursion is hard to bring about in general, return values can be implemented (with some constraints). I had to convert a function that looked something like this:

int createTreeNode(int currentIndex)
{
    if (currentIndex == -1)
    {
        // reached a leaf
        int newIndex = gNumTreeNodes++;
        return newIndex;
    }

    gTree[currentIndex].childL = createTreeNode(gTree[currentIndex].childL);
    gTree[currentIndex].childR = createTreeNode(gTree[currentIndex].childR);
    return currentIndex;
}

A simple alternative to returning values is using out parameters. While those exist in HLSL, we can’t store them in a stack since there are no real pointers or references in the language:

void func(out float x); // fine

struct Args
{
    out float x; // no such thing
};

A solution: Add a “pointer” to the arguments that tells the callee where to write its result:

void createTreeNode2(int currentIndex, int nodeToWriteTo, int childToWriteTo)
{
    if (currentIndex == -1)
    {
        // reached a leaf
        int newIndex = gNumTreeNodes++;
        gTree[nodeToWriteTo].children[childToWriteTo] = newIndex;
        return;
    }

    createTreeNode2(gTree[currentIndex].childL, currentIndex, 0);
    createTreeNode2(gTree[currentIndex].childR, currentIndex, 1);
}

Note that we also solved the missing tail recursion in one go here - instead of the caller reacting to a return value, the callee writes directly to the “return site”. At this point, the stack scheme can simply be applied, and we end up with a purely iterative algorithm.
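
For illustration, applying the scheme to createTreeNode2 could look roughly like this sketch, reusing the Args/stack pattern from above with fields matching its parameters (rootIndex, parentNode and childSlot are placeholders for whatever the initial call used):

struct Args
{
    int currentIndex;
    int nodeToWriteTo;
    int childToWriteTo;
};

Args stack[MAX];
uint stackPos = 0;

// push the initial call
stack[stackPos++] = Args(rootIndex, parentNode, childSlot);

do
{
    // pop arguments
    const Args args = stack[--stackPos];

    if (args.currentIndex == -1)
    {
        // reached a leaf
        int newIndex = gNumTreeNodes++;
        gTree[args.nodeToWriteTo].children[args.childToWriteTo] = newIndex;
        continue;
    }

    // recurse into both children
    stack[stackPos++] = Args(gTree[args.currentIndex].childL, args.currentIndex, 0);
    stack[stackPos++] = Args(gTree[args.currentIndex].childR, args.currentIndex, 1);

} while (stackPos > 0);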

Stack Location

We have three options for where to place the stack in a compute shader:

  1. Registers (local variables, globals with static)
  2. Main Memory (any UAV, i.e. a RWStructuredBuffer)
  3. Group-shared Memory (globals with groupshared)

Registers are the obvious candidate, but if the shader is already rather heavy and you’re running low on them, group-shared memory is a good choice, even though we do not actually share any data semantically:

groupshared Args sRecursionStack[STACKSIZE][GROUPSIZE];

// access as sRecursionStack[stackPos][SV_GroupIndex];

Note the unintuitive ordering - the array is indexed stack position first, thread index last, so for a given stack position the entries of neighboring threads are adjacent in memory, rather than each thread’s stack being contiguous. Shared memory is split into “banks” which can be accessed simultaneously; with this layout, neighboring threads fall into different banks and their accesses can be served in parallel instead of conflicting, effectively multiplying the bandwidth (there is more depth to this, see this NVidia article).
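
For completeness, the main-memory option could look like the following sketch, where each thread gets its own slice of a buffer (the buffer name, sizing and the flattened thread index are assumptions):

// one STACKSIZE-sized slice per thread; the host allocates numThreads * STACKSIZE elements
RWStructuredBuffer<Args> gRecursionStack;

// access as gRecursionStack[globalThreadIndex * STACKSIZE + stackPos]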

Implementation and Debugging

For interacting with the stack, I like to create utility functions built around a “global context” like this:

struct Globals
{
    uint groupIndex;
    int stackPos;
    // ..
};

void pushRecursionArgument(inout Globals globals, float3 arg1, float3 arg2 /* ... */)
{
    Args newArgs;
    newArgs.arg1 = arg1;
    newArgs.arg2 = arg2;
    // ...
    sRecursionStack[globals.stackPos++][globals.groupIndex] = newArgs;
}
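
A matching pop helper (a sketch under the same assumptions) keeps the indexing convention in one place:

Args popRecursionArgument(inout Globals globals)
{
    return sRecursionStack[--globals.stackPos][globals.groupIndex];
}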

Do not forget to guard against stack overflow:

do 
{
    // ..
    if (globals.stackPos >= STACKSIZE)
    {
        // stack overflow
        break;
    }
    pushRecursionArgument(/*..*/);
} while(globals.stackPos > 0);

To port an algorithm, I can recommend setting it up naively in C++, then removing return values, and finally applying the stack scheme, verifying along the way, before porting to HLSL. To debug the shader, compile it with -Od -Zi -Qembed_debug to embed PDB information into the DXIL binary and avoid hassle with PDB lookup paths. While I generally prefer RenderDoc, PIX is much more consistent for shader debugging in my experience.
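
For reference, a full dxc invocation might look something like this (the entry point, target profile and file names are just examples):

dxc -T cs_6_6 -E main -Od -Zi -Qembed_debug myShader.hlsl -Fo myShader.dxil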

Since we’re dealing with many loops and nontrivial control flow, TDRs (device removals because of GPU hangs) are common problems as well, and they are inconvenient to debug. Microsoft recommends PIX Remoting to track down TDRs, but I found it more effective to capture the TDR on a single machine and only use remote analysis once the capture file exists, letting Dr. PIX find the dispatch at fault.

A simpler way, of course, is brute force: add emergency breaks to your shaders that bail out after a set number of iterations to find out which one is to blame, then debug it regularly.
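
Such an emergency break can be as simple as a counter (MAX_ITERATIONS is an arbitrary, generous limit):

uint iterations = 0;

do
{
    if (++iterations > MAX_ITERATIONS)
    {
        // emergency break - something is wrong, bail out instead of hanging the GPU
        break;
    }

    // ...

} while (stackPos > 0);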

Wrap Up

Using recursive algorithms on GPUs isn’t something you do every day, but it is very feasible. While working on this, I was consistently surprised by how well shaders containing an ungodly amount of branches and non-unrolled loops performed in practice. Hopefully you were able to take away some useful bits about the topic from this post, and thanks for reading.