Correlated Multi-Jittered Sampling on GPU


I went down the rabbit hole too deep trying to improve sampling in my light mapper running on ray tracing hardware. Found some random generators suitable for GPU usage, and also adapted a sampling method on the way.


This material is a byproduct of a ray tracing research project I was working on quite a while ago. The project goal was to write a light mapper using IMG’s ray tracing hardware. We wanted to publish our results in a short paper, but it just did not survived the review process unfortunately. Since hardware accelerated ray tracing has became a hype lately big time, I thought  it would worth it to share some of the ideas I had during and after the project.


It started with a textbook implementation of a path tracer to compute diffuse radiance, using some typical cosine weighted, low-discrepancy random sampling to shoot rays around from each light map texel. The hardware did its job nicely, but obviously, the process was far from optimal. After some performance analysis, it became clear that the biggest issue was the system memory bandwidth caused by the large number of incoherent rays. Even if the hardware is designed to handle incoherence efficiently, the overall memory bandwidth is greatly affected while tracing complex scenes.

So, the first idea was to break down the integration into smaller batches, and try to group rays into spatially more coherent batches. Light maps consists of islands of unwrapped triangles in UV space, the adjacent texels within an island  typically have very similar world-space position and tangent space.

If we split the sampling hemisphere into minimal sized disjunctive solid angles, and use sampling directions within the same solid angle for each texel, we will have rays with similar origin and direction in world space, therefore the overall coherence will be much higher than fully random rays. Since subsequent batches will iterate through each solid angle, the process will integrate through the entire hemisphere for each texel.

The above process generates much more coherent first rays, but does not help with further rays on random paths. Because of this, I did not use recursion. Ray tracing is stopped at the first hit, and light map values from the previous bounce are re-used. While this would increase memory bandwidth even further, the gain we had from increasing the coherence of first level rays greatly compensates the loss.

Generating Sample Directions

In order to implement the batching, we need a sampling method which meets the following criteria:

  1. It should generate low discrepancy random set of sampling directions. This guarantees fast approximation of the integral and avoids banding artifacts.
  2. The algorithm should make it possible to selectively generate spatially coherent subset of these sample directions to keep the ray coherence high.
  3. The generated sample point set should be unique from each hemisphere (or light map texel). We cannot re-use the exact same direction set over and over again, since it would cause banding artefacts.
  4. The total number of sample points should be specified at run-time.
  5. The algorithm should run in a parallel way. Traditional random number generators produce sequence of random numbers using a “seed” value as an internal state and modify it during each call. The problem is that we want to run the generation on many texels at once, with no interaction between the invocations. Additionally, it should not use any pre-computed data, or use any external memory, since memory access on our GPU is (still) expensive.
  6. The generated point set should have cosine-weighted distribution on an unit hemisphere. It makes approximation faster, so we can use less rays for the same quality.
  7. Obviously, the algorithm should be fast and simple enough to make it feasible to use on GPU, without compromising the overall performance of the light mapper. Not using branching is a benefit.

The Original Algorithm

I have found a technical memo [1] from Pixar, which described an algorithm which seemed to be a good starting point. It already fulfils many of our criteria, and it seems to be simple enough to be a good candidate for use on GPUs after some optimisation. It generates good quality, cosine-weighted, low discrepancy pseudo-random point set on a hemisphere, with no “clumping”. It does not use any pre-computed data, it allows to generate points selectively/individually from the full set of points. Both the total number and the number of selected points can be set at run-time.

In a nutshell, it generates points on a unit 2D square first. It does it by subdividing the square into cells, and putting one point at a regular location within each cell. To randomise the samples without clumping, it reorders the columns and the rows randomly. To enhance randomness of samples, it adds some jitter to each sample, so they will be randomised even further within a substrata of the cell, which prevents it from produce banding, even if low number of cells are used.  It projects the 2D sample point set to a unit radius disk using polar mapping which preserves the spatial distribution of the samples after the projection well. Next, it projects the points on the disk onto a hemisphere above the disk which ensures cosine weighted distribution.

While the overall quality of the generated points is good, the algorithm has some drawbacks, from our point of view:

  1. It generates points in scanline-order, so it is already possible to selectively address points within the sample set, but if we select a few consecutive points, they will lie on one or more lines somewhere on the hemisphere, instead of forming a coherent group.
  2. It is designed to run on CPU, in an offline renderer, it is a bit too complex to be fast enough for a GPU, it does not even run in deterministic time, so I had to do something with that.

Increasing Sample Locality

To address the first problem, I replaced the scan-line ordering of samples with the good old Z-ordering (Morton code) [3]. It made possible to achieve better spatial coherence of consecutive sample points:

Obviously, it will not work on arbitrary number of rows or columns, I have chosen that because it was the simplest to experiment with. If a mapping for an arbitrary number of rows and columns is necessary, K-means clustering [4] of the hemisphere-projected center points could be useful, where k is the number of batches. The index mapping calculation can be done offline, and the subset corresponding to the current batch can uploaded to the GPU as constant/uniform values.

Replacing the Randomisation and Permutation Functions

The second problem is that the original algorithm uses some custom made hash mixing functions to generate random jitter, and permutation of rows and columns.

In their implementations, they followed the method described in Bret Mulvey’s article about Hash Functions [2] to build their permutation functions, and used hill-climbing algorithm to fine-tune their functions. Additionally, their permutation function uses a technique called “Cycle Walking” in order to reduce the output value range width of the hash function without messing up the uniformity of values (more details in [6]). The problem is that it contains a non-deterministic loop in which the number of iterations depend on the range and the produced random value.

Mixing functions

Before we continue, some explanation might be useful here. A mixing function is a function which maps an integer input range [0..2^n-1] to the same range of values in a random-like manner, where the n is the number of bits used. (A mixing function is a special case of the S-Box [5] in cryptography, where the range of the input and output are different, usually not even power of two). Each input value is mapped to a randomly chosen value in the same range. The mapping is reversible, so every unique input is mapped to an unique output value and vice versa. This property guarantees complete collision free mapping, so the value distribution will be uniform.

Mixing functions consist of so called mixing steps performed one after another. They are reversible: their mapping is collision-free, any number of these steps can be combined together without losing overall reversibility.

i.    hash ^= constant;
ii.   hash *= constant; // if constant is odd
iii.  hash += constant;
iv.   hash -= constant;
v.    hash ^= hash >> constant;
vi.   hash ^= hash << constant;
vii.  hash += hash << constant;
viii. hash -= hash << constant;
iix.  hash = (hash  5); // if 32 bits

The proof of reversibility is discussed in more detail in the post.

(The steps can be performed on any word width, assuming that they are done in modulo 2^n, where n is the number of bits in a word, so explicit modulo step is necessary if n is less than the actual machine word width.)

They measured the quality of a mixing function by evaluating the so-called avalanche property. From the original post: “A function is said to satisfy the strict avalanche criterion if, whenever a single input bit is complemented (toggled between 0 and 1), each of the output bits should change with a probability of one half for an arbitrary selection of the remaining input bits.”

To quantify this property, the method calculates avalanche matrix \bf{A}_{n,n} for a given mixing function, n is the number of bits in the input and the output. Every \bf{A}_{i, j} element in the matrix tells the probability of change of the bit j in the output when i bit is flipped in the input. A theoretically sound way of calculating the matrix is to exhaustively iterate through all the possible input values and check changes of the output for every input bit change. Since this can lead to impractical run times (n \times 2^n test is needed), so [2] proposes to check the function against a limited number (practically few millions) of inputs, which gives a acceptably accurate approximation.

To evaluate the overall fitness of a mixing function, a \chi^2 [8] value is calculated on the elements of the matrix:

\chi^2_{\bf{A}} = \sum_{i,j}^{}{\frac{(0.5 - \bf{A}_{i,j})^2}{0.5}}

The resulting number directly reflects the quality of the function (the lower the better). This makes possible to compare two different mixing functions and decide which provides better random mapping.

This all works for arbitrary word length n, though [2] makes an important observation that we cannot use a mixing function of word width n to generate random numbers on a arbitrary range [0..m-1], m < 2^n, m \in \bf{N^+}, simply by calculating modulo m, this would ruin the distribution of values in the output range. They use so-called “Cycle Walking” to alleviate this problem [6], they repeat calling the mixing function until the output value falls into the desired range [0..m-1]. While this method keeps the nice reversible property and gives uniform distribution, it is not desired for us, since it contains a loop with non-deterministic condition.

Since my experiments were not sensitive the exact range (I could easily go on with power of two number of sample directions), I simply omitted the requirement of having arbitrary ranges for my PRNG for a while. Later I modified the algorithm to make it possible to generate values on an arbitrary range (see below).

Automated Mixing Function Synthesis

[1] used manually crafted mixing functions, which were fine-tuned using a small framework which employs hill-climbing to find more effective coefficients in the functions. To bring this thought one step further, I thought it might worth a try to synthesise the entire mixing function from scratch in an automated process.

To help finding a suitable function, I have implemented a small framework in C# which utilises a genetic-like algorithm in order to “evolve” good mixing functions from an initial set of randomly generated functions. The implementation relies on .NET System.Linq.Expressions [7] facility which makes possible to construct and evaluate expressions using ASTs [9] in run-time with an acceptable speed.

The outline of the algorithm is the following:

  1. Generate an initial pool of functions (few hundreds or thousands) by picking random steps from the elementary steps above, set up random values for constants in the expressions.
  2. Randomly pick two expression from the pool and evaluate their avalanche property using χ2 test and compare the results.
  3. Keep the better and throw away the weaker with probability proportional to the quality difference, using logistic function to calculate the acceptance threshold. Create a new one by “mutating” the better one, make a new copy of it, and then perform one of the following modifications on the new instance:
    1. Alter a constant in one of the steps
    2. Swap two, randomly selected steps
    3. Remove a randomly selected step
    4. Insert a new, randomly selected step at a random location (if the maximum number of steps are not reached)
  4. Put the new function back to the pool and go to 2, or terminate if enough iteration is done (or on user request).

This algorithm assumes that mixing functions have a correlation between their structure and their quality, we can find a better quality mixing function by tweaking or slightly modifying an existing one. That is, we can reach better quality functions by performing simple improvements on the ones we already know. It is like our search space have some gradient-like property. If we found a mixing function which has higher quality, we can expect that it might lead to an even better one. While it is based on sole intuition (I could not find a theoretical backing to prove that assumption yet), but it apparently works fine. The chi-square test guarantees the quality of the functions we can find with this method.

Steps 3.3 and 3.4. try to help the algorithm to avoid getting stuck in a local optimum. To make it work even better, it might be useful to add an entirely new function from scratch from time to time.

The implementation is able to handle mixing functions with arbitrary power-of-two ranges, up to 32 bits, simply by applying the mod{ } 2^{n-1} after steps when necessary.

In order to make avalanche evaluation quicker, the framework uses preliminary testing of a mutated function, calculating an approximate \chi^2 value with significantly lower iterations first. If the result is worse than a proportionally higher threshold, the new function will be rejected quickly, saving a lot of time.

The algorithm led me to some interesting findings. First of all, it finds good candidates relatively quickly, even if the maximum number of allowed steps is low, like 5 or 6.

Second, it seems like that functions with good quality tend to have similar structures, a lot of those consists of  alternating set of steps ii. and v., for example:

(result *= 2078995439);
(result ^= (result >> 16));
(result *= 3646025353);
(result ^= (result >> 10));
(result *= 3911234291);
(result ^= (result >> 16));

Additionally, the constants in step ii. are the best if they are large primes (or they can be factored into two primes), closer to the upper limit of the current range 2^n-1. The average of the constants in step v. is around the half of the word bit count. (Readers with proper background in mathematics feel free to send me your insights for these, I am really curious about it).

When the word width is reduced, an extra bitwise AND operation is added. For example, a 7 bit mixing function looks like this:

(result ^= (result >> 4));
(result = ((result * 67) & 127));
(result ^= (result >> 3));
(result = ((result * 101) & 127));
(result ^= (result >> 4))

The constants used in the steps are generated in respect to the word length as well, for example, the shift operators will not shift more or equal to the word length, similarly, the primes chosen for multiplications will be below the upper limit of the current range, etc.

The function will map different input values to different output values within its range, without using external data. The input value can be anything which is unique for a given invocation, like a pixel ID or similar. This makes possible to invoke the function in a parallel way, and each invocation will result in different random numbers. Moreover, it will fulfil the requirement of seeding.

The framework dumps the pool to a text file from time to time to make it possible to monitor the process, or extract the best functions in a simple way.

Here are a few functions found by the framework with their corresponding approximate \chi^2 values, when the maximum number of steps is limited to 5, and n = 32:

// X^2 = 0.0042641116:
var result = seed;
(result ^= (result >> 16));
(result *= 322022693);
(result ^= (result >> 14));
(result *= 2235360983);
(result ^= (result >> 19));

// X^2 = 0.00438070260000001:
var result = seed;
(result ^= (result >> 16));
(result *= 3968764237);
(result ^= (result >> 13));
(result *= 324413651);
(result ^= (result >> 18));

// X^2 = 0.0043953292:
var result = seed;
(result ^= (result >> 16));
(result *= 2019204679);
(result ^= (result >> 12));
(result *= 559761739);
(result ^= (result >> 18));

A few other functions, which the same length limit, but n = 8:

// X^2 = 0.10595703125:
result = seed;
(result ^= (result >> 1));
(result = ((result * 15) & 255));
(result ^= (result >> 3));
(result = ((result * 221) & 255));
(result ^= (result >> 4));

// X^2 = 0.11474609375:
var result = seed;
(result ^= (result >> 1));
(result = ((result * 15) & 255));
(result ^= (result >> 3));
(result = ((result * 93) & 255));
(result ^= (result >> 4));

// X^2 = 0.12060546875:
var result = seed;
(result ^= (result >> 1));
(result = ((result * 15) & 255));
(result ^= (result >> 3));
(result = ((result * 157) & 255));
(result ^= (result >> 4));

Mixing Functions on the GPU

With this framework, it is relatively easy to find different mixing functions which consist of the same kind of consecutive steps, so it is possible to pick four, and combine them and implement them vectorized on the GPU. One of those I have chosen for my experiments is implemented in the following GLSL function:

uvec4 Permute4(uint x, uint y, uint seed)
        uvec4 value = uvec4(x, seed, seed, seed) & uvec4(127, 127, 0xffffffff, 0xffffffff);
	value ^= value >> uvec4(3, 3, 16, 16);
	value = (value * uvec4(107, 67, 322022693, 322022693)) & uvec4(127, 127, 0xffffffff, 0xffffffff);
	value ^= value >> uvec4(3, 4, 14, 13);
	value = (value * uvec4(69, 21, 3968764237, 2235360983)) & uvec4(127, 127, 0xffffffff, 0xffffffff);
	value ^= value >> uvec4(4, 3, 19, 18);
	return value;

The function above is four mixing functions combined into one. They all consists of the the same kind of steps performed through all the x, y, z, w channels: x and y are 5 bit, z and w are 32 bit mixing functions. The only difference between the individual channels is the coefficients and bit masks used.

There is a problem though. Using 7 bit mixing function here will result in a repeating random pattern in sampling (which we wanted to avoid). To fix this, we can use an extra step to hide this repetition:

uvec4 Permute4(uint x, uint y, uint seed)
    uvec4 value = uvec4(x ^ seed, y ^ seed, seed, seed) & uvec4(127, 127, 0xffffffff, 0xffffffff);
    value ^= value >> uvec4(3, 3, 16, 16);
    value = (value * uvec4(107, 67, 322022693, 322022693)) & uvec4(127, 127, 0xffffffff, 0xffffffff);
    value ^= value >> uvec4(3, 4, 14, 13);
    value = (value * uvec4(69, 21, 3968764237, 2235360983)) & uvec4(127, 127, 0xffffffff, 0xffffffff);
    value ^= value >> uvec4(4, 3, 19, 18);
    return value;

The first step is “seeding” the x and y values with the values in z and w, which are unique for every unique seed value, so it does an extra, unique permutation for both x and y values.

This single function provides both row/column index permutation and jitter random values. The 5 bit functions are used for row/column permutation (for 1024 samples per texel), the 32 bit functions are for the jitter. Compared to the original ones, it is much more compact, does not use any branching, contains no loops at all, and the overall length is less than the half of the original ones. The obvious drawback is that we need a different permutation function for each word length, and functions with limited word width have short repeating period.

Generating Permutation for an Arbitrary Range

While it was easy to work with power of two ranges in my experiments, it was necessary to have a function which works on an arbitrary range in practice, so a single function can be used for all purposes.

Since [2] has proven that a simple modulo operation will ruin uniform distribution, I had to come up something else. I do not know whether it is a well known idea, and maybe I reinvented the wheel here, but here is my solution for this problem, which works on arbitrary ranges of [0..r], where 0 <= r <= 2^{n/2}. (Which is practically 1 < r <= 2^{n/2} since r <= 1 will produce all zeros).

The basic idea is probably not new, it uses a few basic integer operations, when n = 32:

uvec4 Rand(uvec4 seed, uvec4 range)
    seed ^= uvec4(seed.x >> 19, seed.y >> 19, seed.z >> 13, seed.w >> 19);
    seed *= uvec4(1619700021, 3466831627, 923620057, 3466831627);
    seed ^= uvec4(seed.x >> 16, seed.y >> 12, seed.z >> 18, seed.w >> 17);
    seed *= uvec4(175973783, 2388179301, 3077236438, 2388179301);
    seed ^= uvec4(seed.x >> 15, seed.y >> 17, seed.z >> 18, seed.w >> 18);

    uint f = seed & 0x0000ffff;
    return (r * f) >> 16

What it does is basically takes the full 32-bit hash value and uses it as a 16.16 fixed point integer, takes the fraction part (which represents values from [0..1-1/2^{16}]), and multiplies it with the requested range, and returns the result as an integer. That is, remaps the fraction values to the [0...r - 1] interval.

The good avalanche property of the mixing function guarantees that the fraction will have uniform distribution as well (the probability of a particular bit of being either 0 or 1 is very close to 0.5 – 0.5).

It also has a nice property that it will have a full repeating period of 2^{32} instead of the requested range.

It works, but it is not perfect. The distribution of mapped values are still not totally uniform: since r is not integer multiple of 2^{16}, sometimes different number of fractional values will be mapped to a particular value in the requested range r. For example, in the case of r = 3, n = 4, we request values in the range of [0..2] while using seed which has values of [0..2^n] = [0..15], but we use the half of it for the fraction values, so the range we map is [0..3]. The corresponding mapping is the following:0 -> 0, 1 -> 0, 2 -> 1, 3 -> 2, that is two values from range [0..3] are mapped to the same value – zero – in the range [0..2], due to truncating.

To alleviate this problem, all we need to do is add some “jitter” to the mapping:

uvec4 Rand(uvec4 seed, uvec4 range) 
    seed ^= uvec4(seed.x >> 19, seed.y >> 19, seed.z >> 13, seed.w >> 19);
    seed *= uvec4(1619700021, 3466831627, 923620057, 3466831627);
    seed ^= uvec4(seed.x >> 16, seed.y >> 12, seed.z >> 18, seed.w >> 17);
    seed *= uvec4(175973783, 2388179301, 3077236438, 2388179301);
    seed ^= uvec4(seed.x >> 15, seed.y >> 17, seed.z >> 18, seed.w >> 18); 
    uint f = seed & 0x0000ffff;
    return (range * f + (seed % range)) >> 16; 

After the multiplication we add a small value picked from [0..1 / r], which pushes the rounding threshold within each interval.  (Here we assume seed and consequently seed mod r has an uniform distribution as well). Even if the total distribution of seed is not perfectly even, it still improves the overall distribution of the values picked from the requested range.

Random Demo

There is a sample implementation on Shadertoy demonstrating some of the permutation functions found by the framework. It generates not just image but sound as well. The results are quite in accordance of the theoretical expectations (or at least good enough for me :P).


With these ideas above, I have managed to adapt the original CMJS method, it became possible to generate good quality sampling direction on the GPU with low shader cost, break down the overall process into multiple batches and increase the overall ray coherence. Obviously, there are many ways to go further, but this is how far I could get on my limited time.


[1] Correlated Multi-Jittered Sampling  – A. Kensler

[2] Hash Functions – B. Mulvey

[3] Z-Order Curve – Wikipedia

[4] K-Means Clustering – Wikipedia

[5] S-Box – Wikipedia

[6] Ciphers with Arbitrary Finite Domains – J. Black, P. Rogaway

[7] System.Linq.Expression – Microsoft .NET Framework

[8] Chi-Squared Test – Wikipedia

[9] Abstract Syntax Trees – Wikipedia

[10] Logistic Function – Wikipedia


Importing Quake 1 Level Geometry from .map Files


I have this nostalgic relation with Quake/HL series, they were one of my greatest motivation towards game tech. I really admire how they built their worlds upon simple basic concepts, like plane bounded convex polyhedron, what they call “brushes”. The idea is to take a bunch of planes (in a mathematical sense), and use them as boundaries of a convex subspace, which is the intersection of the half-spaces behind them.

Some Basics

The equation, Ax + By + Cz - D = 0 is one practical way to define a plane. Points \bold{p} = (x, y, z) satisfying this equation lie on a plane in 3D Euclidean space, where (A, B, C) = \bold{n} is the coordinates of the normal vector, perpendicular to the plane, and D is the shortest signed distance to the plane from the origin. A different way to visualise it is to take a plane having normal vector \bold{n}, running through the origin, and then shift to the normal direction with D/|\bold{n}| units. It is practical to keep the normal vector \bold{n} unit length, that is, normalise the plane equation parameters.

Using a plane defined by this equation, we can classify the relation of a point to the plane. Substituting a \bold{p_0}=(x_0, y_0, z_0) point to the equation Ax_0 + By_0 + Cz_0 - D = d we will obtain a scalar value d, which is the shortest signed distance between the point and the plane. If it is positive, the point is in front of the plane, if it is negative, it is behind it. In some application, it makes sense to classify points whose distance is zero as they are on the plane. For us, it is unnecessary now. (By the way, due to the numeric inaccuracies of floating point arithmetic, it is not practical to test against exact zero values to determine whether a point lies on the plane. Instead we check if the signed distance is below a small threshold, \epsilon: |Ax_0 + By_0 + Cz_0 - D| < \epsilon.)

Now we have a simple, handy tool to classify points against planes, so we can move on to construct convex polyhedrons using it. We know, if a \bold{p} = (x, y, z) point satisfies the inequality Ax + By + Cz - D < 0 we know that it lies behind the plane.

If we setup a system of m inequalities using plane equations, we can define closed convex polyhedrons within 3D space:

\begin{array}{rcr} A_1x + B_1y + C_1z - D_1 & < & 0 \\ A_2x + B_2y + C_2z - D_2 & < & 0 \\ \vdots & \vdots & \vdots \\ A_mx + B_my + C_mz - D_m & < & 0 \end{array}

where the ith inequality represents a \bold{P_i} = (A_i, B_i, C_i, D_i) plane. If a \bold{p}=(x, y, z) point satisfies all the inequalities then the point is inside the polyhedron. We consider the subspace defined like this valid only if the subspace has a finite, positive volume. This excludes empty, or degenerate forms of subspaces.

Map File Format

[2] gives a detailed description how brushes are defined in Quake 1 map files. .MAP files are simple text files, listing level brushes and entities in a simple structure. A brush consists of a list of face definitions. Each face definition contains three points p_1, p_2, p_3, and some additional properties – texture name, UV offsets, rotation, etc. The three points span the plane of the given face, but is not guaranteed that any of them is at the vertex position. To derive the vertices of the polyhedron, we have to convert these three points to plane equations first, calculating the  normal vector \bold{n'} = (\bold{p_2} - \bold{p_1}) \times  (\bold{p_3} - \bold{p_1}), normalise it: \bold{n} = (A, B, C) = \bold{n'} / |\bold{n'}|, and calculate the parameter D = -\bold{p_1}\bold{n}.

Converting Brushes to Meshes

Once we have the face planes ready, we should convert them to vertices and indices in order to make them suitable to render. [2] proposes a simple method, originated in the Q1 bsp tool: they take a large [-4096..4096]^3 cube, and “slice” it with all the planes of the brush into two parts, and keep only the part behind the bounding planes. What remains is the convex polyhedron we are looking for.

We start with a list of faces defined by vertices which form a closed (and convex) loop around the face. Two consecutive vertices define an edge. To slice a face against a plane, we test all the face edges against it. If both vertices are behind the plane, we keep both. If they are in front, then we omit them. If one of them is behind and the other is in front, we keep the one behind, and store the intersection point between the edge and the plane, instead of the vertex in front of the plane. (This method also provides an easy method to detect redundant planes, i. e. which does not influence the final brush geometry at all.)

This approach seems quite simple, but making it numerically was not entirely straightforward. After all, I came up with the following:

Let M = \{ f_0, f_1, ... , f_m \} is an initial set of faces. Each face is defined by an ordered list of vertices f_i = (v_{i0}, v_{i1}, .., v_{in})M initially contains faces of a [-4096..4096]^3 cube. Let B = \{ p_0, p_1, ... , p_b, f'_0, f'_1, ..., f'_b \} is a set of p_i planes of the input brush we want to convert, with the corresponding f'_i faces, which are also an ordered list of vertices forming an edge loop around the face, initially empty. The clipping algorithm is:

For all p_i \in B:
    Let C is an ordered list of clipped vertices
    // C will contain the vertices of the edge loop 
    // of the new face we get after clipping
    For all f_j \in M:
        Let V is an ordered list of vertices
        // V will contain the vertices of the edge loop
        // around f_j after the clipping is done
        For all v_k \in f_j:
            If v_k is behind p_i:
                Append v_k to the end of V
            If v_{k+1} is on the opposite side of p_i than v_k:
                Let v'_k the intersection point between p_i and the section \overline{v_kv_{k+1}}.
                Append v'_k to the end of both V and C
        Replace vertices in face f_j with V
    If C is not empty:
        For all c_i \in C:
            If c_i \notin  f_j:
                Insert it where it follows the previous vertices 
                    in the right winding order

This will result in a set of faces with all the vertices in the right order, which is easy to assemble into an actual mesh.

Some images of the results, without any textures or materials for now:

The lights are simple point lights, with shadow maps using Poisson disc filter. The light energy and radius values are calculated by an empirical formula, hand tweaked to give somewhat similar results to the original lighting, using the light parameters specified the map file.

The next step is to import all the materials. To be continued….

[1] Plane – WolframAlpha

[2] Quake Map Format

Yet Another Level Demo

This is a test level as well as the previous one. It is a relatively large map so the portal culling comes quite in handy in this case.

Another Indoor Test Level

I have done a quick video by request. Check it out. 🙂

Shadow Maps on Omnidirectional Lights

Theoretically, using six shadow mapped spot lights to piece up a shadow casting point light is not a big deal. Especially if the light geometry calculation for deferred lighting is correct. Well, in my engine, it was not the case. 🙂

There was a hidden calculation error that resulted in a frustum with incorrect corner points that made it bigger than desired. This error caused no problem so far because projected mask texture on spotlights and “discard” codes in fragment shader prevented the extra pixels from being seen. But once I have used spot lights with no mask to render point light, the result was erroneous. It looked so strange and mysterious that it took few hours to find the root of the problem.

Finally, the shadows on point light are working now, and I proudly present some screenshots here.

Shadow map on omnilight. Shadow map on omnilight. Shadow map on omnilight. Shadow map on omnilight.

Scripting is Fully Functional

In an earlier post, I have written that I have implemented Lua script support in the engine. Now it reached its full power since I have added a proper callback mechanism to it. It makes possible for the triggers in the engine to call user specified Lua functions on firing. Unfortunately, SWIG does not help to implement such callbacks, so I had to do it by myself.

Read the rest of this entry

Box Projected Cubemap Environment Mapping

I have added a new technique called … I’d rather not write  it down again…. So, the great thing is that it mimics reflections on shiny surfaces using a simple, but yet quite accurate approximation based on environment cubemap textures.

For more details, see

The hard part was to find a hidden bug in the engine that prevented me to render in-scene cubemap snapshots using a dedicated off-screen compositor chain. But eventually I have managed to find it and it seems that it worth the effort.

The engine automatically places light probes in each room in the level at load time, then it renders a fully lit HDR cubemap texture for each of them. The engine uses these cubemaps and some extra parameters assigned to the probes, to calculate the reflections with the aforementioned method.

First I have experimented with a similar idea that uses spheres to distort the cubemap look-up coordinates, but it was inaccurate. BPCEM gives a more accurate results in box-shaped rooms.

There are some images with the results:

A demo video:

The First Indoor Level

Now the exporter supports nearly all the things required to create an indoor level for the engine. After several days of work and a helping artistic hand from Endre Barath (, we have created the first explorable level in the engine. There is no much to say, aside from that it was a large amount of work from me to make things work. It was mostly debugging and optimization but some new features were also added like color grading, tangent/bitangent generation at loading time,  and multi-material support per model.

Of course, the job is not finished yet, so many things should be fixed, but there is a short video showing the results.

A Blender Exporter is in Progress

Currently, I am working on an exporter for Blender that greatly speeds up content creation process for the engine. It is not complete yet, but I thought that it worth a post.  It supports meshes, instances, materials, textures, point and spotlights.

The exporter saves all meshes in obj format (ugly, I know, but works). It creates Merlin3D scene files with all mesh and light instances. Moreover, it generates effects files automatically, configuring the call of an appropriate GLSL shader and configures the parameters for it. Some of these parameters are the textures taken from texture slots in Blender materials. The exporter supports diffuse, normal, specular and glow maps at this time, but support for further texture types can be easily added. There are also shader parameters coming from material texture settings like specular and glow coefficient, and things like that.

The script extracts several light parameters, such as color, energy, radius (or distance in Blender terminology) spotlight angle or falloff type.

There are a lot of things missing, but the results achieved now are promising… 🙂

Finally, let’s see some screenshots.

The First Teleports

Again, I had some time to play with the code. And again, I have rewritten the game object representation structure to make things simpler inside the engine. I have added a new object type “function”. It  is a non-rendering scene node, making different things to other objects. It can heal, give damage, etc., and teleport. It is activated a physics trigger by touching.

I have implemented  some teleports in the scene using these new objects. The following video demonstrates how it works.