# Category Archives: GPU

## Correlated Multi-Jittered Sampling on GPU

### TL;DR

I went down the rabbit hole too deep trying to improve sampling in my light mapper running on ray tracing hardware. Found some random generators suitable for GPU usage, and also adapted a sampling method on the way.

### Motivation

This material is a byproduct of a ray tracing research project I was working on quite a while ago. The project goal was to write a light mapper using IMG’s ray tracing hardware. We wanted to publish our results in a short paper, but it just did not survived the review process unfortunately. Since hardware accelerated ray tracing has became a hype lately big time, I thought it would worth it to share some of the ideas I had during and after the project.

### Overview

It started with a textbook implementation of a path tracer to compute diffuse radiance, using some typical cosine weighted, low-discrepancy random sampling to shoot rays around from each light map texel. The hardware did its job nicely, but obviously, the process was far from optimal. After some performance analysis, it became clear that the biggest issue was the system memory bandwidth caused by the large number of incoherent rays. Even if the hardware is designed to handle incoherence efficiently, the overall memory bandwidth is greatly affected while tracing complex scenes.

So, the first idea was to break down the integration into smaller batches, and try to group rays into spatially more coherent batches. Light maps consists of islands of unwrapped triangles in UV space, the adjacent texels within an island typically have very similar world-space position and tangent space.

If we split the sampling hemisphere into minimal sized disjunctive solid angles, and use sampling directions within the same solid angle for each texel, we will have rays with similar origin and direction in world space, therefore the overall coherence will be much higher than fully random rays. Since subsequent batches will iterate through each solid angle, the process will integrate through the entire hemisphere for each texel.

The above process generates much more coherent first rays, but does not help with further rays on random paths. Because of this, I did not use recursion. Ray tracing is stopped at the first hit, and light map values from the previous bounce are re-used. While this would increase memory bandwidth even further, the gain we had from increasing the coherence of first level rays greatly compensates the loss.

### Generating Sample Directions

In order to implement the batching, we need a sampling method which meets the following criteria:

- It should generate
**low discrepancy random**set of sampling directions. This guarantees fast approximation of the integral and avoids banding artifacts. - The algorithm should make it possible to
**selectively generate****spatially coherent subset**of these sample directions to keep the ray coherence high. - The generated sample point set should be
**unique**from each hemisphere (or light map texel). We cannot re-use the exact same direction set over and over again, since it would cause banding artefacts. - The total number of sample points should be specified at run-time.
- The algorithm should run in a parallel way. Traditional random number generators produce sequence of random numbers using a “seed” value as an internal state and modify it during each call. The problem is that we want to run the generation on many texels at once, with no interaction between the invocations. Additionally, it should not use any pre-computed data, or use any external memory, since memory access on our GPU is (still) expensive.
- The generated point set should have
**cosine-weighted**distribution on an unit hemisphere. It makes approximation faster, so we can use less rays for the same quality. - Obviously, the algorithm should be fast and simple enough to make it feasible to use on GPU, without compromising the overall performance of the light mapper. Not using branching is a benefit.

#### The Original Algorithm

I have found a technical memo [1] from Pixar, which described an algorithm which seemed to be a good starting point. It already fulfils many of our criteria, and it seems to be simple enough to be a good candidate for use on GPUs after some optimisation. It generates good quality, cosine-weighted, low discrepancy pseudo-random point set on a hemisphere, with no “clumping”. It does not use any pre-computed data, it allows to generate points selectively/individually from the full set of points. Both the total number and the number of selected points can be set at run-time.

In a nutshell, it generates points on a unit 2D square first. It does it by subdividing the square into cells, and putting one point at a regular location within each cell. To randomise the samples without clumping, it **reorders the columns and the rows randomly**. To enhance randomness of samples, it adds some jitter to each sample, so they will be randomised even further within a substrata of the cell, which prevents it from produce banding, even if low number of cells are used. It projects the 2D sample point set to a unit radius disk using polar mapping which preserves the spatial distribution of the samples after the projection well. Next, it projects the points on the disk onto a hemisphere above the disk which ensures cosine weighted distribution.

While the overall quality of the generated points is good, the algorithm has some drawbacks, from our point of view:

- It generates points in scanline-order, so it is already possible to selectively address points within the sample set, but if we select a few consecutive points, they will lie on one or more lines somewhere on the hemisphere, instead of forming a coherent group.

- It is designed to run on CPU, in an offline renderer, it is a bit too complex to be fast enough for a GPU, it does not even run in deterministic time, so I had to do something with that.

### Increasing Sample Locality

To address the first problem, I replaced the scan-line ordering of samples with the good old Z-ordering (Morton code) [3]. It made possible to achieve better spatial coherence of consecutive sample points:

Obviously, it will not work on arbitrary number of rows or columns, I have chosen that because it was the simplest to experiment with. If a mapping for an arbitrary number of rows and columns is necessary, K-means clustering [4] of the hemisphere-projected center points could be useful, where is the number of batches. The index mapping calculation can be done offline, and the subset corresponding to the current batch can uploaded to the GPU as constant/uniform values.

### Replacing the Randomisation and Permutation Functions

The second problem is that the original algorithm uses some custom made hash mixing functions to generate random jitter, and permutation of rows and columns.

In their implementations, they followed the method described in Bret Mulvey’s article about Hash Functions [2] to build their permutation functions, and used hill-climbing algorithm to fine-tune their functions. Additionally, their permutation function uses a technique called “Cycle Walking” in order to reduce the output value range width of the hash function without messing up the uniformity of values (more details in [6]). The problem is that it contains a non-deterministic loop in which the number of iterations depend on the range and the produced random value.

### Mixing functions

Before we continue, some explanation might be useful here. A mixing function is a function which maps an integer input range to the same range of values in a random-like manner, where the n is the number of bits used. (A mixing function is a special case of the S-Box [5] in cryptography, where the range of the input and output are different, usually not even power of two). Each input value is mapped to a randomly chosen value in the same range. The mapping is **reversible**, so every unique input is mapped to an unique output value and vice versa. This property guarantees complete collision free mapping, so the value distribution will be uniform.

Mixing functions consist of so called mixing steps performed one after another. They are reversible: their mapping is collision-free, any number of these steps can be combined together without losing overall reversibility.

i. hash ^= constant; ii. hash *= constant; // if constant is odd iii. hash += constant; iv. hash -= constant; v. hash ^= hash >> constant; vi. hash ^= hash << constant; vii. hash += hash << constant; viii. hash -= hash << constant; iix. hash = (hash 5); // if 32 bits

The proof of reversibility is discussed in more detail in the post.

(The steps can be performed on any word width, assuming that they are done in modulo , where n is the number of bits in a word, so explicit modulo step is necessary if n is less than the actual machine word width.)

They measured the quality of a mixing function by evaluating the so-called avalanche property. From the original post: “A function is said to satisfy the *strict avalanche criterion *if, whenever a single input bit is complemented (toggled between 0 and 1), each of the output bits should change with a probability of one half for an arbitrary selection of the remaining input bits.”

To quantify this property, the method calculates *avalanche matrix* for a given mixing function, is the number of bits in the input and the output. Every element in the matrix tells the probability of change of the bit in the output when bit is flipped in the input. A theoretically sound way of calculating the matrix is to exhaustively iterate through all the possible input values and check changes of the output for every input bit change. Since this can lead to impractical run times ( test is needed), so [2] proposes to check the function against a limited number (practically few millions) of inputs, which gives a acceptably accurate approximation.

To evaluate the overall fitness of a mixing function, a [8] value is calculated on the elements of the matrix:

The resulting number directly reflects the quality of the function (the lower the better). This makes possible to compare two different mixing functions and decide which provides better random mapping.

This all works for arbitrary word length , though [2] makes an important observation that we cannot use a mixing function of word width to generate random numbers on a arbitrary range , simply by calculating modulo , this would ruin the distribution of values in the output range. They use so-called “Cycle Walking” to alleviate this problem [6], they repeat calling the mixing function until the output value falls into the desired range . While this method keeps the nice reversible property and gives uniform distribution, it is not desired for us, since it contains a loop with non-deterministic condition.

Since my experiments were not sensitive the exact range (I could easily go on with power of two number of sample directions), I simply omitted the requirement of having arbitrary ranges for my PRNG for a while. Later I modified the algorithm to make it possible to generate values on an arbitrary range (see below).

## Automated Mixing Function Synthesis

[1] used manually crafted mixing functions, which were fine-tuned using a small framework which employs hill-climbing to find more effective coefficients in the functions. To bring this thought one step further, I thought it might worth a try to *synthesise the entire mixing function from scratch in an automated process*.

To help finding a suitable function, I have implemented a small framework in C# which utilises a genetic-like algorithm in order to “evolve” good mixing functions from an initial set of randomly generated functions. The implementation relies on .NET System.Linq.Expressions [7] facility which makes possible to construct and evaluate expressions using ASTs [9] in run-time with an acceptable speed.

The outline of the algorithm is the following:

- Generate an initial pool of functions (few hundreds or thousands) by picking random steps from the elementary steps above, set up random values for constants in the expressions.
- Randomly pick two expression from the pool and evaluate their avalanche property using χ2 test and compare the results.
- Keep the better and throw away the weaker with probability proportional to the quality difference, using logistic function to calculate the acceptance threshold. Create a new one by “mutating” the better one, make a new copy of it, and then perform one of the following modifications on the new instance:
- Alter a constant in one of the steps
- Swap two, randomly selected steps
- Remove a randomly selected step
- Insert a new, randomly selected step at a random location (if the maximum number of steps are not reached)

- Put the new function back to the pool and go to 2, or terminate if enough iteration is done (or on user request).

This algorithm assumes that mixing functions have a correlation between their structure and their quality, we can find a better quality mixing function by tweaking or slightly modifying an existing one. That is, we can reach better quality functions by performing simple improvements on the ones we already know. It is like our search space have some gradient-like property. If we found a mixing function which has higher quality, we can expect that it might lead to an even better one. While it is based on sole intuition (I could not find a theoretical backing to prove that assumption yet), but it apparently works fine. The chi-square test guarantees the quality of the functions we can find with this method.

Steps 3.3 and 3.4. try to help the algorithm to avoid getting stuck in a local optimum. To make it work even better, it might be useful to add an entirely new function from scratch from time to time.

The implementation is able to handle mixing functions with arbitrary power-of-two ranges, up to 32 bits, simply by applying the after steps when necessary.

In order to make avalanche evaluation quicker, the framework uses preliminary testing of a mutated function, calculating an approximate value with significantly lower iterations first. If the result is worse than a proportionally higher threshold, the new function will be rejected quickly, saving a lot of time.

The algorithm led me to some interesting findings. First of all, it finds good candidates relatively quickly, even if the maximum number of allowed steps is low, like 5 or 6.

Second, it seems like that functions with good quality tend to have similar structures, a lot of those consists of alternating set of steps ii. and v., for example:

(result *= 2078995439); (result ^= (result >> 16)); (result *= 3646025353); (result ^= (result >> 10)); (result *= 3911234291); (result ^= (result >> 16));

Additionally, the constants in step ii. are the best if they are large primes (or they can be factored into two primes), closer to the upper limit of the current range . The average of the constants in step v. is around the half of the word bit count. (Readers with proper background in mathematics feel free to send me your insights for these, I am really curious about it).

When the word width is reduced, an extra bitwise AND operation is added. For example, a 7 bit mixing function looks like this:

(result ^= (result >> 4)); (result = ((result * 67) & 127)); (result ^= (result >> 3)); (result = ((result * 101) & 127)); (result ^= (result >> 4))

The constants used in the steps are generated in respect to the word length as well, for example, the shift operators will not shift more or equal to the word length, similarly, the primes chosen for multiplications will be below the upper limit of the current range, etc.

The function will map different input values to different output values within its range, without using external data. The input value can be anything which is unique for a given invocation, like a pixel ID or similar. This makes possible to invoke the function in a parallel way, and each invocation will result in different random numbers. Moreover, it will fulfil the requirement of seeding.

The framework dumps the pool to a text file from time to time to make it possible to monitor the process, or extract the best functions in a simple way.

Here are a few functions found by the framework with their corresponding approximate values, when the maximum number of steps is limited to 5, and :

// X^2 = 0.0042641116: var result = seed; (result ^= (result >> 16)); (result *= 322022693); (result ^= (result >> 14)); (result *= 2235360983); (result ^= (result >> 19)); // X^2 = 0.00438070260000001: var result = seed; (result ^= (result >> 16)); (result *= 3968764237); (result ^= (result >> 13)); (result *= 324413651); (result ^= (result >> 18)); // X^2 = 0.0043953292: var result = seed; (result ^= (result >> 16)); (result *= 2019204679); (result ^= (result >> 12)); (result *= 559761739); (result ^= (result >> 18));

A few other functions, which the same length limit, but :

// X^2 = 0.10595703125: result = seed; (result ^= (result >> 1)); (result = ((result * 15) & 255)); (result ^= (result >> 3)); (result = ((result * 221) & 255)); (result ^= (result >> 4)); // X^2 = 0.11474609375: var result = seed; (result ^= (result >> 1)); (result = ((result * 15) & 255)); (result ^= (result >> 3)); (result = ((result * 93) & 255)); (result ^= (result >> 4)); // X^2 = 0.12060546875: var result = seed; (result ^= (result >> 1)); (result = ((result * 15) & 255)); (result ^= (result >> 3)); (result = ((result * 157) & 255)); (result ^= (result >> 4));

## Mixing Functions on the GPU

With this framework, it is relatively easy to find different mixing functions which consist of the same kind of consecutive steps, so it is possible to pick four, and combine them and implement them vectorized on the GPU. One of those I have chosen for my experiments is implemented in the following GLSL function:

uvec4 Permute4(uint x, uint y, uint seed) { uvec4 value = uvec4(x, seed, seed, seed) & uvec4(127, 127, 0xffffffff, 0xffffffff); value ^= value >> uvec4(3, 3, 16, 16); value = (value * uvec4(107, 67, 322022693, 322022693)) & uvec4(127, 127, 0xffffffff, 0xffffffff); value ^= value >> uvec4(3, 4, 14, 13); value = (value * uvec4(69, 21, 3968764237, 2235360983)) & uvec4(127, 127, 0xffffffff, 0xffffffff); value ^= value >> uvec4(4, 3, 19, 18); return value; }

The function above is four mixing functions combined into one. They all consists of the the same kind of steps performed through all the x, y, z, w channels: x and y are 5 bit, z and w are 32 bit mixing functions. The only difference between the individual channels is the coefficients and bit masks used.

There is a problem though. Using 7 bit mixing function here will result in a repeating random pattern in sampling (which we wanted to avoid). To fix this, we can use an extra step to hide this repetition:

uvec4 Permute4(uint x, uint y, uint seed) { uvec4 value = uvec4(x ^ seed, y ^ seed, seed, seed) & uvec4(127, 127, 0xffffffff, 0xffffffff); value ^= value >> uvec4(3, 3, 16, 16); value = (value * uvec4(107, 67, 322022693, 322022693)) & uvec4(127, 127, 0xffffffff, 0xffffffff); value ^= value >> uvec4(3, 4, 14, 13); value = (value * uvec4(69, 21, 3968764237, 2235360983)) & uvec4(127, 127, 0xffffffff, 0xffffffff); value ^= value >> uvec4(4, 3, 19, 18); return value; }

The first step is “seeding” the x and y values with the values in z and w, which are unique for every unique seed value, so it does an extra, unique permutation for both x and y values.

This single function provides both row/column index permutation and jitter random values. The 5 bit functions are used for row/column permutation (for 1024 samples per texel), the 32 bit functions are for the jitter. Compared to the original ones, it is much more compact, does not use any branching, contains no loops at all, and the overall length is less than the half of the original ones. The obvious drawback is that we need a different permutation function for each word length, and functions with limited word width have short repeating period.

## Generating Permutation for an Arbitrary Range

While it was easy to work with power of two ranges in my experiments, it was necessary to have a function which works on an arbitrary range in practice, so a single function can be used for all purposes.

Since [2] has proven that a simple modulo operation will ruin uniform distribution, I had to come up something else. I do not know whether it is a well known idea, and maybe I reinvented the wheel here, but here is my solution for this problem, which works on arbitrary ranges of , where . (Which is practically since will produce all zeros).

The basic idea is probably not new, it uses a few basic integer operations, when :

uvec4 Rand(uvec4 seed, uvec4 range) { seed ^= uvec4(seed.x >> 19, seed.y >> 19, seed.z >> 13, seed.w >> 19); seed *= uvec4(1619700021, 3466831627, 923620057, 3466831627); seed ^= uvec4(seed.x >> 16, seed.y >> 12, seed.z >> 18, seed.w >> 17); seed *= uvec4(175973783, 2388179301, 3077236438, 2388179301); seed ^= uvec4(seed.x >> 15, seed.y >> 17, seed.z >> 18, seed.w >> 18);uint f = seed & 0x0000ffff; return (r * f) >> 16}

What it does is basically takes the full 32-bit hash value and uses it as a 16.16 fixed point integer, takes the fraction part (which represents values from ), and multiplies it with the requested range, and returns the result as an integer. That is, remaps the fraction values to the interval.

The good avalanche property of the mixing function guarantees that the fraction will have uniform distribution as well (the probability of a particular bit of being either 0 or 1 is very close to 0.5 – 0.5).

It also has a nice property that it will have a full repeating period of instead of the requested range.

It works, but it is not perfect. The distribution of mapped values are still not totally uniform: since is not integer multiple of , sometimes different number of fractional values will be mapped to a particular value in the requested range . For example, in the case of , we request values in the range of while using seed which has values of , but we use the half of it for the fraction values, so the range we map is . The corresponding mapping is the following:0 -> 0, 1 -> 0, 2 -> 1, 3 -> 2, that is two values from range are mapped to the same value – zero – in the range , due to truncating.

To alleviate this problem, all we need to do is add some “jitter” to the mapping:

uvec4 Rand(uvec4 seed, uvec4 range) { seed ^= uvec4(seed.x >> 19, seed.y >> 19, seed.z >> 13, seed.w >> 19); seed *= uvec4(1619700021, 3466831627, 923620057, 3466831627); seed ^= uvec4(seed.x >> 16, seed.y >> 12, seed.z >> 18, seed.w >> 17); seed *= uvec4(175973783, 2388179301, 3077236438, 2388179301); seed ^= uvec4(seed.x >> 15, seed.y >> 17, seed.z >> 18, seed.w >> 18); uint f = seed & 0x0000ffff;return (range * f + (seed % range)) >> 16;}

After the multiplication we add a small value picked from , which pushes the rounding threshold within each interval. (Here we assume seed and consequently has an uniform distribution as well). Even if the total distribution of seed is not perfectly even, it still improves the overall distribution of the values picked from the requested range.

## Random Demo

There is a sample implementation on Shadertoy demonstrating some of the permutation functions found by the framework. It generates not just image but sound as well. The results are quite in accordance of the theoretical expectations (or at least good enough for me :P).

## Conclusion

With these ideas above, I have managed to adapt the original CMJS method, it became possible to generate good quality sampling direction on the GPU with low shader cost, break down the overall process into multiple batches and increase the overall ray coherence. Obviously, there are many ways to go further, but this is how far I could get on my limited time.

## Links

[1] Correlated Multi-Jittered Sampling – A. Kensler

[2] Hash Functions – B. Mulvey

[4] K-Means Clustering – Wikipedia

[6] Ciphers with Arbitrary Finite Domains – J. Black, P. Rogaway

[7] System.Linq.Expression – Microsoft .NET Framework

[8] Chi-Squared Test – Wikipedia

[9] Abstract Syntax Trees – Wikipedia

[10] Logistic Function – Wikipedia