POST :: 07 MAY 2026

Building a BVH for a Rust Raytracer Without Hating Yourself

A practical walkthrough of constructing a Bounding Volume Hierarchy for a CPU raytracer in Rust — from the naive O(n^2) version to a SAH-built tree that pays its complexity off.

#graphics#rust#raytracing#bvh#data-structures

Building a BVH for a Rust Raytracer Without Hating Yourself

A raytracer's runtime is dominated by ray-vs-scene intersection. With a million triangles and a million rays per frame, a naive for triangle in scene loop is 10^12 operations per frame. Nothing renders. The fix, every time, is a spatial acceleration structure, and for triangle soups the standard answer is a Bounding Volume Hierarchy.

I built one for my Rust raytracer. The first version was correct and bad. The second version was correct and 40x faster. This is what I wish I had known going in.

What a BVH actually is

A binary tree where:

Each leaf holds one or a few primitives (triangles, spheres).
Each internal node holds an axis-aligned bounding box (AABB) that fully contains everything in its subtree.
The two children's AABBs may overlap. They are not a partition of space. They are a partition of primitives.

To intersect a ray, you start at the root, test the ray against the AABB, and recurse into children whose AABB the ray hits. Hits propagate up; misses prune entire subtrees.

The cost analogue of a BST is roughly O(log n) per ray for well-built trees on coherent rays. The structure beats kd-trees in build speed and is friendlier to dynamic geometry, which is why nearly every modern renderer (PBRT, Embree, OptiX) ships BVHs.

Step 1: the data layout

The temptation in Rust is to write:

enum BvhNode {
    Internal { bbox: Aabb, left: Box<BvhNode>, right: Box<BvhNode> },
    Leaf { bbox: Aabb, prims: Vec<usize> },
}

This is correct, easy, and slow. Each Box is a pointer chase through an unrelated cache line. For an inner-loop data structure traversed millions of times per frame, this is a waste.

The right shape is a flat array of nodes with indices:

#[derive(Copy, Clone)]
#[repr(C)]
pub struct BvhNode {
    pub bbox: Aabb,           // 24 bytes (min: Vec3, max: Vec3)
    pub left_or_first: u32,   // child index, or first prim index if leaf
    pub prim_count: u32,      // 0 = internal, >0 = leaf
}

prim_count == 0 flags an internal node, and the node is exactly 32 bytes — a single cache line on x86. Children are stored at adjacent indices when possible to keep traversal cache-friendly.

Primitive indices live in a separate Vec<u32> that the leaves index into. The leaves do not own primitives; they reference a flat permutation of primitive IDs.

Step 2: the dumb build, to make sure the rest works

Get traversal working before you optimize the build. The dumbest construction:

fn build_recursive(prims: &mut [u32], depth: u32) -> NodeId {
    let bbox = compute_bbox(prims);
    if prims.len() <= 4 || depth > 32 {
        return push_leaf(bbox, prims);
    }
    let axis = bbox.longest_axis();
    let mid = bbox.centroid()[axis];
    let split = partition(prims, |p| centroid(p)[axis] < mid);
    let left = build_recursive(&mut prims[..split], depth + 1);
    let right = build_recursive(&mut prims[split..], depth + 1);
    push_internal(bbox, left, right)
}

Split along the longest axis at the midpoint, recurse. This works. The tree it produces is awful for unevenly distributed scenes, which is most of them, but you get a working raytracer.

Step 3: traversal, with a stack you own

Recursion is not your friend in a hot loop. Hand-roll the stack:

pub fn intersect(&self, ray: &Ray, tmax: f32) -> Option<Hit> {
    let mut stack: [u32; 64] = [0; 64];
    let mut sp = 0usize;
    let mut closest = tmax;
    let mut hit: Option<Hit> = None;

    let inv_dir = ray.dir.recip();
    stack[sp] = 0;
    sp += 1;

    while sp > 0 {
        sp -= 1;
        let node = &self.nodes[stack[sp] as usize];
        if !node.bbox.intersect(&ray.origin, &inv_dir, closest) {
            continue;
        }
        if node.prim_count > 0 {
            let start = node.left_or_first as usize;
            let end = start + node.prim_count as usize;
            for &pid in &self.prim_ids[start..end] {
                if let Some(h) = self.prims[pid as usize].intersect(ray, closest) {
                    closest = h.t;
                    hit = Some(h);
                }
            }
        } else {
            // Push far child first so near is visited first.
            let l = node.left_or_first;
            let r = l + 1;
            let (near, far) = self.order_children(l, r, ray);
            stack[sp] = far; sp += 1;
            stack[sp] = near; sp += 1;
        }
    }
    hit
}

Two micro-optimizations that matter:

Precompute 1.0 / ray.dir outside the loop. The slab AABB test divides by dir; doing it once saves a million divides.
Visit the near child first so closest shrinks early and prunes more far-side AABB tests.

A 64-deep stack is enough for any sane tree on a 32-bit primitive count. If you blow it, your tree is degenerate, which is a bug.

Step 4: the build that actually pays

Midpoint splits produce trees with bad expected cost. The standard fix is the Surface Area Heuristic (SAH), which estimates the expected ray-traversal cost of a candidate split:

$$C(\text{split}) = C_{trav} + \frac{A_L}{A} N_L \cdot C_{isect} + \frac{A_R}{A} N_R \cdot C_{isect}$$

Where A is the parent's surface area, A_L and A_R are children's, and N_L/N_R are primitive counts. You pick the split that minimizes C.

Evaluating SAH for every possible split is O(n^2) and impractical past ~10k primitives. The trick is binned SAH: bucket primitives into a fixed number of bins (16 is conventional) along each axis, and only evaluate bins - 1 candidate splits per axis.

const BINS: usize = 16;

#[derive(Copy, Clone, Default)]
struct Bin { bbox: Aabb, count: u32 }

fn find_best_split(prims: &[u32], centroids: &[Vec3], bbox: &Aabb)
    -> Option<(usize, f32)>  // (axis, position)
{
    let mut best_cost = f32::INFINITY;
    let mut best = None;
    for axis in 0..3 {
        let lo = bbox.min[axis];
        let hi = bbox.max[axis];
        if hi - lo < 1e-6 { continue; }
        let scale = BINS as f32 / (hi - lo);

        let mut bins = [Bin::default(); BINS];
        for &p in prims {
            let c = centroids[p as usize][axis];
            let b = ((c - lo) * scale) as usize;
            let b = b.min(BINS - 1);
            bins[b].count += 1;
            bins[b].bbox.expand(&prim_bbox(p));
        }

        // Sweep left-to-right and right-to-left to get cumulative areas.
        let mut left_count = [0u32; BINS - 1];
        let mut left_area  = [0f32; BINS - 1];
        let mut right_count = [0u32; BINS - 1];
        let mut right_area  = [0f32; BINS - 1];
        let mut acc = Bin::default();
        for i in 0..BINS - 1 {
            acc.count += bins[i].count;
            acc.bbox.expand_aabb(&bins[i].bbox);
            left_count[i] = acc.count;
            left_area[i]  = acc.bbox.surface_area();
        }
        let mut acc = Bin::default();
        for i in (1..BINS).rev() {
            acc.count += bins[i].count;
            acc.bbox.expand_aabb(&bins[i].bbox);
            right_count[i - 1] = acc.count;
            right_area[i - 1]  = acc.bbox.surface_area();
        }

        for i in 0..BINS - 1 {
            let cost = left_count[i] as f32 * left_area[i]
                     + right_count[i] as f32 * right_area[i];
            if cost < best_cost {
                best_cost = cost;
                let pos = lo + (i + 1) as f32 / BINS as f32 * (hi - lo);
                best = Some((axis, pos));
            }
        }
    }
    best
}

This is the version that turns a 5-second frame into a 100-millisecond frame on a moderately complex scene. The cumulative-sum trick is what gets it to O(n) per node instead of O(n^2).

A leaf-cost check (leaf_cost = N * C_isect) lets you stop subdividing when SAH says splitting is worse than just leaf-testing the primitives. This is the part that makes the tree balanced for traversal, not balanced by count.

Step 5: things that bit me

Centroid bbox vs primitive bbox. The bin axis range should come from the centroid AABB of the input primitives, not the union AABB. Otherwise primitives clump into one bin and the heuristic does nothing.
Degenerate axes. If all centroids coincide on an axis (a wall of triangles), hi - lo is 0 and the scale becomes inf. Skip the axis.
Float precision in the AABB test. Use the Williams et al. slab test with sign-aware bounds. Otherwise a ray grazing an edge oscillates between hit and miss across frames.
Primitives in two boxes. A BVH does not split primitives; it puts each primitive in exactly one leaf. If a long triangle straddles the split, it goes whole into one side. This is fine — the AABBs overlap, and the parent area covers it. Trying to "split spatially" gets you into SBVH territory, which is a bigger commitment.

What this gets you

For my Rust raytracer (motion blur, BVH, multiple importance sampling), moving from midpoint splits to binned SAH on a 200k-triangle scene took the build from 1.2s to 1.8s but the per-frame trace from ~5s to ~110ms. Three percent of the build is spent making fifty times less work for the trace. That is roughly always the right trade.

The next step after this is either an 8-wide BVH (MBVH) for SIMD-friendly traversal, or upgrading to a GPU BVH with stackless traversal. Both are big projects. The version above is the smallest BVH that earns its complexity.

References

Ingo Wald, On Fast Construction of SAH-based Bounding Volume Hierarchies, 2007.
Matt Pharr, Wenzel Jakob, Greg Humphreys, Physically Based Rendering: From Theory to Implementation, 4th edition.
Jacco Bikker's How to build a BVH series — the most readable practical writeup I know of.

arrow_back[RETURN_TO_LOG]

POST :: 07 MAY 2026

Building a BVH for a Rust Raytracer Without Hating Yourself

A practical walkthrough of constructing a Bounding Volume Hierarchy for a CPU raytracer in Rust — from the naive O(n^2) version to a SAH-built tree that pays its complexity off.

#graphics#rust#raytracing#bvh#data-structures

Building a BVH for a Rust Raytracer Without Hating Yourself

I built one for my Rust raytracer. The first version was correct and bad. The second version was correct and 40x faster. This is what I wish I had known going in.

What a BVH actually is

A binary tree where:

Each leaf holds one or a few primitives (triangles, spheres).
Each internal node holds an axis-aligned bounding box (AABB) that fully contains everything in its subtree.
The two children's AABBs may overlap. They are not a partition of space. They are a partition of primitives.

To intersect a ray, you start at the root, test the ray against the AABB, and recurse into children whose AABB the ray hits. Hits propagate up; misses prune entire subtrees.

Step 1: the data layout

The temptation in Rust is to write:

enum BvhNode {
    Internal { bbox: Aabb, left: Box<BvhNode>, right: Box<BvhNode> },
    Leaf { bbox: Aabb, prims: Vec<usize> },
}

This is correct, easy, and slow. Each Box is a pointer chase through an unrelated cache line. For an inner-loop data structure traversed millions of times per frame, this is a waste.

The right shape is a flat array of nodes with indices:

#[derive(Copy, Clone)]
#[repr(C)]
pub struct BvhNode {
    pub bbox: Aabb,           // 24 bytes (min: Vec3, max: Vec3)
    pub left_or_first: u32,   // child index, or first prim index if leaf
    pub prim_count: u32,      // 0 = internal, >0 = leaf
}

prim_count == 0 flags an internal node, and the node is exactly 32 bytes — a single cache line on x86. Children are stored at adjacent indices when possible to keep traversal cache-friendly.

Primitive indices live in a separate Vec<u32> that the leaves index into. The leaves do not own primitives; they reference a flat permutation of primitive IDs.

Step 2: the dumb build, to make sure the rest works

Get traversal working before you optimize the build. The dumbest construction:

fn build_recursive(prims: &mut [u32], depth: u32) -> NodeId {
    let bbox = compute_bbox(prims);
    if prims.len() <= 4 || depth > 32 {
        return push_leaf(bbox, prims);
    }
    let axis = bbox.longest_axis();
    let mid = bbox.centroid()[axis];
    let split = partition(prims, |p| centroid(p)[axis] < mid);
    let left = build_recursive(&mut prims[..split], depth + 1);
    let right = build_recursive(&mut prims[split..], depth + 1);
    push_internal(bbox, left, right)
}

Split along the longest axis at the midpoint, recurse. This works. The tree it produces is awful for unevenly distributed scenes, which is most of them, but you get a working raytracer.

Step 3: traversal, with a stack you own

Recursion is not your friend in a hot loop. Hand-roll the stack:

pub fn intersect(&self, ray: &Ray, tmax: f32) -> Option<Hit> {
    let mut stack: [u32; 64] = [0; 64];
    let mut sp = 0usize;
    let mut closest = tmax;
    let mut hit: Option<Hit> = None;

    let inv_dir = ray.dir.recip();
    stack[sp] = 0;
    sp += 1;

    while sp > 0 {
        sp -= 1;
        let node = &self.nodes[stack[sp] as usize];
        if !node.bbox.intersect(&ray.origin, &inv_dir, closest) {
            continue;
        }
        if node.prim_count > 0 {
            let start = node.left_or_first as usize;
            let end = start + node.prim_count as usize;
            for &pid in &self.prim_ids[start..end] {
                if let Some(h) = self.prims[pid as usize].intersect(ray, closest) {
                    closest = h.t;
                    hit = Some(h);
                }
            }
        } else {
            // Push far child first so near is visited first.
            let l = node.left_or_first;
            let r = l + 1;
            let (near, far) = self.order_children(l, r, ray);
            stack[sp] = far; sp += 1;
            stack[sp] = near; sp += 1;
        }
    }
    hit
}

Two micro-optimizations that matter:

Precompute 1.0 / ray.dir outside the loop. The slab AABB test divides by dir; doing it once saves a million divides.
Visit the near child first so closest shrinks early and prunes more far-side AABB tests.

A 64-deep stack is enough for any sane tree on a 32-bit primitive count. If you blow it, your tree is degenerate, which is a bug.

Step 4: the build that actually pays

Midpoint splits produce trees with bad expected cost. The standard fix is the Surface Area Heuristic (SAH), which estimates the expected ray-traversal cost of a candidate split:

$$C(\text{split}) = C_{trav} + \frac{A_L}{A} N_L \cdot C_{isect} + \frac{A_R}{A} N_R \cdot C_{isect}$$

Where A is the parent's surface area, A_L and A_R are children's, and N_L/N_R are primitive counts. You pick the split that minimizes C.

const BINS: usize = 16;

#[derive(Copy, Clone, Default)]
struct Bin { bbox: Aabb, count: u32 }

fn find_best_split(prims: &[u32], centroids: &[Vec3], bbox: &Aabb)
    -> Option<(usize, f32)>  // (axis, position)
{
    let mut best_cost = f32::INFINITY;
    let mut best = None;
    for axis in 0..3 {
        let lo = bbox.min[axis];
        let hi = bbox.max[axis];
        if hi - lo < 1e-6 { continue; }
        let scale = BINS as f32 / (hi - lo);

        let mut bins = [Bin::default(); BINS];
        for &p in prims {
            let c = centroids[p as usize][axis];
            let b = ((c - lo) * scale) as usize;
            let b = b.min(BINS - 1);
            bins[b].count += 1;
            bins[b].bbox.expand(&prim_bbox(p));
        }

        // Sweep left-to-right and right-to-left to get cumulative areas.
        let mut left_count = [0u32; BINS - 1];
        let mut left_area  = [0f32; BINS - 1];
        let mut right_count = [0u32; BINS - 1];
        let mut right_area  = [0f32; BINS - 1];
        let mut acc = Bin::default();
        for i in 0..BINS - 1 {
            acc.count += bins[i].count;
            acc.bbox.expand_aabb(&bins[i].bbox);
            left_count[i] = acc.count;
            left_area[i]  = acc.bbox.surface_area();
        }
        let mut acc = Bin::default();
        for i in (1..BINS).rev() {
            acc.count += bins[i].count;
            acc.bbox.expand_aabb(&bins[i].bbox);
            right_count[i - 1] = acc.count;
            right_area[i - 1]  = acc.bbox.surface_area();
        }

        for i in 0..BINS - 1 {
            let cost = left_count[i] as f32 * left_area[i]
                     + right_count[i] as f32 * right_area[i];
            if cost < best_cost {
                best_cost = cost;
                let pos = lo + (i + 1) as f32 / BINS as f32 * (hi - lo);
                best = Some((axis, pos));
            }
        }
    }
    best
}

This is the version that turns a 5-second frame into a 100-millisecond frame on a moderately complex scene. The cumulative-sum trick is what gets it to O(n) per node instead of O(n^2).

Step 5: things that bit me

Centroid bbox vs primitive bbox. The bin axis range should come from the centroid AABB of the input primitives, not the union AABB. Otherwise primitives clump into one bin and the heuristic does nothing.
Degenerate axes. If all centroids coincide on an axis (a wall of triangles), hi - lo is 0 and the scale becomes inf. Skip the axis.
Float precision in the AABB test. Use the Williams et al. slab test with sign-aware bounds. Otherwise a ray grazing an edge oscillates between hit and miss across frames.
Primitives in two boxes. A BVH does not split primitives; it puts each primitive in exactly one leaf. If a long triangle straddles the split, it goes whole into one side. This is fine — the AABBs overlap, and the parent area covers it. Trying to "split spatially" gets you into SBVH territory, which is a bigger commitment.

What this gets you

References

Ingo Wald, On Fast Construction of SAH-based Bounding Volume Hierarchies, 2007.
Matt Pharr, Wenzel Jakob, Greg Humphreys, Physically Based Rendering: From Theory to Implementation, 4th edition.
Jacco Bikker's How to build a BVH series — the most readable practical writeup I know of.

// ./log

Building a BVH for a Rust Raytracer Without Hating Yourself

What a BVH actually is

Step 1: the data layout

Step 2: the dumb build, to make sure the rest works

Step 3: traversal, with a stack you own

Step 4: the build that actually pays

Step 5: things that bit me

What this gets you

References

Building a BVH for a Rust Raytracer Without Hating Yourself

What a BVH actually is

Step 1: the data layout

Step 2: the dumb build, to make sure the rest works

Step 3: traversal, with a stack you own

Step 4: the build that actually pays

Step 5: things that bit me

What this gets you

References