A practical walkthrough of constructing a Bounding Volume Hierarchy for a CPU raytracer in Rust — from the naive O(n^2) version to a SAH-built tree that pays its complexity off.
A raytracer's runtime is dominated by ray-vs-scene intersection. With a million triangles and a million rays per frame, a naive for triangle in scene loop is 10^12 operations per frame. Nothing renders. The fix, every time, is a spatial acceleration structure, and for triangle soups the standard answer is a Bounding Volume Hierarchy.
I built one for my Rust raytracer. The first version was correct and bad. The second version was correct and 40x faster. This is what I wish I had known going in.
A binary tree where:
To intersect a ray, you start at the root, test the ray against the AABB, and recurse into children whose AABB the ray hits. Hits propagate up; misses prune entire subtrees.
The cost analogue of a BST is roughly O(log n) per ray for well-built trees on coherent rays. The structure beats kd-trees in build speed and is friendlier to dynamic geometry, which is why nearly every modern renderer (PBRT, Embree, OptiX) ships BVHs.
The temptation in Rust is to write:
enum BvhNode {
Internal { bbox: Aabb, left: Box<BvhNode>, right: Box<BvhNode> },
Leaf { bbox: Aabb, prims: Vec<usize> },
}
This is correct, easy, and slow. Each Box is a pointer chase through an unrelated cache line. For an inner-loop data structure traversed millions of times per frame, this is a waste.
The right shape is a flat array of nodes with indices:
#[derive(Copy, Clone)]
#[repr(C)]
pub struct BvhNode {
pub bbox: Aabb, // 24 bytes (min: Vec3, max: Vec3)
pub left_or_first: u32, // child index, or first prim index if leaf
pub prim_count: u32, // 0 = internal, >0 = leaf
}
prim_count == 0 flags an internal node, and the node is exactly 32 bytes — a single cache line on x86. Children are stored at adjacent indices when possible to keep traversal cache-friendly.
Primitive indices live in a separate Vec<u32> that the leaves index into. The leaves do not own primitives; they reference a flat permutation of primitive IDs.
Get traversal working before you optimize the build. The dumbest construction:
fn build_recursive(prims: &mut [u32], depth: u32) -> NodeId {
let bbox = compute_bbox(prims);
if prims.len() <= 4 || depth > 32 {
return push_leaf(bbox, prims);
}
let axis = bbox.longest_axis();
let mid = bbox.centroid()[axis];
let split = partition(prims, |p| centroid(p)[axis] < mid);
let left = build_recursive(&mut prims[..split], depth + 1);
let right = build_recursive(&mut prims[split..], depth + 1);
push_internal(bbox, left, right)
}
Split along the longest axis at the midpoint, recurse. This works. The tree it produces is awful for unevenly distributed scenes, which is most of them, but you get a working raytracer.
Recursion is not your friend in a hot loop. Hand-roll the stack:
pub fn intersect(&self, ray: &Ray, tmax: f32) -> Option<Hit> {
let mut stack: [u32; 64] = [0; 64];
let mut sp = 0usize;
let mut closest = tmax;
let mut hit: Option<Hit> = None;
let inv_dir = ray.dir.recip();
stack[sp] = 0;
sp += 1;
while sp > 0 {
sp -= 1;
let node = &self.nodes[stack[sp] as usize];
if !node.bbox.intersect(&ray.origin, &inv_dir, closest) {
continue;
}
if node.prim_count > 0 {
let start = node.left_or_first as usize;
let end = start + node.prim_count as usize;
for &pid in &self.prim_ids[start..end] {
if let Some(h) = self.prims[pid as usize].intersect(ray, closest) {
closest = h.t;
hit = Some(h);
}
}
} else {
// Push far child first so near is visited first.
let l = node.left_or_first;
let r = l + 1;
let (near, far) = self.order_children(l, r, ray);
stack[sp] = far; sp += 1;
stack[sp] = near; sp += 1;
}
}
hit
}
Two micro-optimizations that matter:
1.0 / ray.dir outside the loop. The slab AABB test divides by dir; doing it once saves a million divides.closest shrinks early and prunes more far-side AABB tests.A 64-deep stack is enough for any sane tree on a 32-bit primitive count. If you blow it, your tree is degenerate, which is a bug.
Midpoint splits produce trees with bad expected cost. The standard fix is the Surface Area Heuristic (SAH), which estimates the expected ray-traversal cost of a candidate split:
$$C(\text{split}) = C_{trav} + \frac{A_L}{A} N_L \cdot C_{isect} + \frac{A_R}{A} N_R \cdot C_{isect}$$
Where A is the parent's surface area, A_L and A_R are children's, and N_L/N_R are primitive counts. You pick the split that minimizes C.
Evaluating SAH for every possible split is O(n^2) and impractical past ~10k primitives. The trick is binned SAH: bucket primitives into a fixed number of bins (16 is conventional) along each axis, and only evaluate bins - 1 candidate splits per axis.
const BINS: usize = 16;
#[derive(Copy, Clone, Default)]
struct Bin { bbox: Aabb, count: u32 }
fn find_best_split(prims: &[u32], centroids: &[Vec3], bbox: &Aabb)
-> Option<(usize, f32)> // (axis, position)
{
let mut best_cost = f32::INFINITY;
let mut best = None;
for axis in 0..3 {
let lo = bbox.min[axis];
let hi = bbox.max[axis];
if hi - lo < 1e-6 { continue; }
let scale = BINS as f32 / (hi - lo);
let mut bins = [Bin::default(); BINS];
for &p in prims {
let c = centroids[p as usize][axis];
let b = ((c - lo) * scale) as usize;
let b = b.min(BINS - 1);
bins[b].count += 1;
bins[b].bbox.expand(&prim_bbox(p));
}
// Sweep left-to-right and right-to-left to get cumulative areas.
let mut left_count = [0u32; BINS - 1];
let mut left_area = [0f32; BINS - 1];
let mut right_count = [0u32; BINS - 1];
let mut right_area = [0f32; BINS - 1];
let mut acc = Bin::default();
for i in 0..BINS - 1 {
acc.count += bins[i].count;
acc.bbox.expand_aabb(&bins[i].bbox);
left_count[i] = acc.count;
left_area[i] = acc.bbox.surface_area();
}
let mut acc = Bin::default();
for i in (1..BINS).rev() {
acc.count += bins[i].count;
acc.bbox.expand_aabb(&bins[i].bbox);
right_count[i - 1] = acc.count;
right_area[i - 1] = acc.bbox.surface_area();
}
for i in 0..BINS - 1 {
let cost = left_count[i] as f32 * left_area[i]
+ right_count[i] as f32 * right_area[i];
if cost < best_cost {
best_cost = cost;
let pos = lo + (i + 1) as f32 / BINS as f32 * (hi - lo);
best = Some((axis, pos));
}
}
}
best
}
This is the version that turns a 5-second frame into a 100-millisecond frame on a moderately complex scene. The cumulative-sum trick is what gets it to O(n) per node instead of O(n^2).
A leaf-cost check (leaf_cost = N * C_isect) lets you stop subdividing when SAH says splitting is worse than just leaf-testing the primitives. This is the part that makes the tree balanced for traversal, not balanced by count.
hi - lo is 0 and the scale becomes inf. Skip the axis.For my Rust raytracer (motion blur, BVH, multiple importance sampling), moving from midpoint splits to binned SAH on a 200k-triangle scene took the build from 1.2s to 1.8s but the per-frame trace from ~5s to ~110ms. Three percent of the build is spent making fifty times less work for the trace. That is roughly always the right trade.
The next step after this is either an 8-wide BVH (MBVH) for SIMD-friendly traversal, or upgrading to a GPU BVH with stackless traversal. Both are big projects. The version above is the smallest BVH that earns its complexity.