An adaptive deferred shading implementation based on the paper Deferred Adaptive Compute Shading, with wave-level work distribution inspired by Brian Karis's Variable Sized Work. Built with Slang and SlangPy.
The original paper uses a state machine built on a global atomic counter and a groupshared ring queue: a single dispatch serves all pixels, thread groups compete on an `InterlockedAdd` to claim work, and each group alternates between SEARCH (evaluate and enqueue) and SHADE (drain the queue) phases. This introduces cross-group atomic contention, heavy barriers (`DeviceMemoryBarrierWithGroupSync`), and a large groupshared footprint for storing full pixel coordinates.
We replace this with Brian Karis's DistributeWork pattern, a lightweight wave-local producer-consumer model. Each lane evaluates its own pixels and records a shade count; a single `WavePrefixSum` then compacts all work items into a contiguous queue that is consumed in wave-sized batches. This eliminates global atomics entirely, replaces heavy barriers with native wave intrinsics (`WavePrefixSum`, `WaveReadLaneAt`, `WaveActiveBallot`), and reduces groupshared usage from a coordinate ring buffer to just two `uint[32]` arrays.
The screen is divided into 4×4 pixel blocks. Instead of shading every pixel, we shade a sparse subset first and then decide for each remaining pixel whether it needs full shading or can be cheaply interpolated from already-computed neighbors.
Five passes progressively fill in all 16 pixels of each 4×4 block. Pass 0 shades its pixel unconditionally. In passes 1–4, each new pixel sits at the center of 4 already-computed neighbors (some of which belong to adjacent blocks), enabling the shade-or-interpolate decision.
| Pass | Pixels/block | Pixel positions | Neighbor offsets |
|---|---|---|---|
| 0 | 1 | (0,0) | none (unconditional shade) |
| 1 | 1 | (2,2) | (±2, ±2) diagonal corners |
| 2 | 2 | (0,2), (2,0) | (±2, 0), (0, ±2) axis-aligned |
| 3 | 4 | (1,1), (1,3), (3,1), (3,3) | (±1, ±1) diagonal |
| 4 | 8 | remaining 8 positions | (±1, 0), (0, ±1) axis-aligned |
Which pass fills which pixel within a 4×4 block:
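The layout follows directly from the pass table. As a minimal Python sketch (the helper names `pass_index`, `pass_map`, and `neighbors_ready` are illustrative and not part of the repository), the following prints the per-block pass map and checks that every neighbor a pixel reads was filled by an earlier pass, wrapping into adjacent blocks where needed:

```python
# Illustrative helpers only; the shader itself may encode this differently.
NEIGHBOR_OFFSETS = {
    1: [(-2, -2), (2, -2), (-2, 2), (2, 2)],  # diagonal corners
    2: [(-2, 0), (2, 0), (0, -2), (0, 2)],    # axis-aligned, distance 2
    3: [(-1, -1), (1, -1), (-1, 1), (1, 1)],  # diagonal, distance 1
    4: [(-1, 0), (1, 0), (0, -1), (0, 1)],    # axis-aligned, distance 1
}


def pass_index(x: int, y: int) -> int:
    """Pass that fills pixel (x, y); coordinates wrap every 4 pixels."""
    x, y = x % 4, y % 4
    if (x, y) == (0, 0):
        return 0
    if (x, y) == (2, 2):
        return 1
    if (x, y) in ((0, 2), (2, 0)):
        return 2
    if x % 2 == 1 and y % 2 == 1:
        return 3
    return 4


def pass_map():
    """4x4 grid of pass indices, indexed as grid[y][x]."""
    return [[pass_index(x, y) for x in range(4)] for y in range(4)]


def neighbors_ready(x: int, y: int) -> bool:
    """True if all neighbors read by (x, y) belong to an earlier pass."""
    p = pass_index(x, y)
    return all(pass_index(x + dx, y + dy) < p
               for dx, dy in NEIGHBOR_OFFSETS.get(p, []))


for row in pass_map():
    print(" ".join(str(p) for p in row))
# 0 4 2 4
# 4 3 4 3
# 2 4 1 4
# 4 3 4 3
```

Each row of the printed grid is one pixel row of the block, with x increasing to the right.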
For each pixel in passes 1–4, the algorithm reads 4 already-shaded neighbors, converts them to luminance, and computes the variance. If the variance exceeds a threshold (1e-3), the pixel is fully shaded; otherwise it is interpolated as the average of its neighbors.
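The decision can be sketched on the CPU as follows. This is an illustration under stated assumptions rather than the shader code: the Rec. 709 luminance weights, the use of population variance, and the names `luminance` and `shade_or_interpolate` are all assumptions; only the four-neighbor input, the 1e-3 threshold, and the neighbor average come from the description above.

```python
VARIANCE_THRESHOLD = 1e-3  # threshold quoted in the text


def luminance(rgb):
    # Assumption: Rec. 709 weights; the actual shader may use different ones.
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def shade_or_interpolate(neighbors, shade_fn):
    """Decide how to fill one pixel from its four shaded neighbors.

    neighbors: list of four (r, g, b) tuples.
    shade_fn:  zero-argument callable performing the expensive full shade.
    Returns (color, was_shaded).
    """
    lum = [luminance(c) for c in neighbors]
    mean = sum(lum) / len(lum)
    # Assumption: population variance of the neighbor luminances.
    variance = sum((l - mean) ** 2 for l in lum) / len(lum)
    if variance > VARIANCE_THRESHOLD:
        return shade_fn(), True
    # Low variance: average the neighbor colors instead of shading.
    avg = tuple(sum(c[i] for c in neighbors) / len(neighbors) for i in range(3))
    return avg, False
```

A flat neighborhood such as four mid-gray pixels takes the interpolation path, while a checkerboard of black and white neighbors exceeds the threshold and triggers a full shade.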
Passes 1–4 use the DistributeWork pattern (from Brian Karis's Variable Sized Work) to improve GPU wave utilization.
The problem: within a wave of 32 threads, some threads need to shade (expensive) and others only interpolate (cheap). Naively, all 32 threads stay active for the duration of the slowest path, wasting SIMD lanes.
The solution: a producer-consumer model using groupshared memory and wave intrinsics.
- Producer phase: Each lane evaluates its pixels, determines which need shading, and stores the count. Interpolation is performed immediately.
- DistributeWork: Uses `WavePrefixSum` to compute a compact queue of all shade-work items across the wave, then distributes them evenly so every lane gets work. Producer data (block base position + shade mask) is communicated via groupshared arrays.
- Consumer phase (`RunChild`): Each lane shades its assigned pixel, looking up the source lane's block position and selecting the correct sub-pixel via `NthSetBit`.
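The three phases can be emulated sequentially on the CPU to show how work items are compacted and handed out. This is a sketch of the idea, not the Slang implementation: `distribute_work` and `nth_set_bit` are hypothetical Python stand-ins, the exclusive prefix sum plays the role of `WavePrefixSum`, and the binary search for the source lane is an assumption about how a consumer could locate its producer.

```python
from bisect import bisect_right
from itertools import accumulate

WAVE_SIZE = 32  # wave width assumed in the text


def nth_set_bit(mask: int, n: int) -> int:
    """Index of the n-th (0-based) set bit of mask, counting from the LSB."""
    for bit in range(mask.bit_length()):
        if (mask >> bit) & 1:
            if n == 0:
                return bit
            n -= 1
    raise ValueError("mask has fewer set bits than requested")


def distribute_work(shade_masks):
    """Compact per-lane shade masks into wave-sized consumer batches.

    shade_masks[lane] has one bit set per pixel that lane wants shaded.
    Returns a list of batches; each batch is a list of
    (source_lane, pixel_bit) pairs, at most WAVE_SIZE long.
    """
    counts = [bin(m).count("1") for m in shade_masks]
    # Exclusive prefix sum: queue offset of each producer lane.
    offsets = [0] + list(accumulate(counts))[:-1]
    total = sum(counts)
    batches = []
    for base in range(0, total, WAVE_SIZE):
        batch = []
        for item in range(base, min(base + WAVE_SIZE, total)):
            # Last lane whose offset is <= item owns this queue slot.
            src = bisect_right(offsets, item) - 1
            local = item - offsets[src]
            batch.append((src, nth_set_bit(shade_masks[src], local)))
        batches.append(batch)
    return batches


# Lane 0 wants pixels 0 and 2 shaded, lane 1 nothing, lane 2 pixel 3.
print(distribute_work([0b0101, 0b0000, 0b1000]))
# [[(0, 0), (0, 2), (2, 3)]]
```

Every batch except possibly the last is full, so consumer lanes stay busy regardless of how unevenly the shade work was spread across producer lanes.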
Pass 1 and Pass 2 originally had only 1–2 pixels per lane, too few for DistributeWork to provide a benefit. To increase the work density, these passes use a 2×2 super-block mapping: each lane covers a 2×2 group of 4×4 blocks (an 8×8 pixel region), raising the pixels per lane to 4 (pass 1) and 8 (pass 2).
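As a rough illustration of the super-block mapping, the sketch below lists the pixels one lane would evaluate. The function name `lane_pixels`, the super-block indexing, and the traversal order are assumptions; only the 2×2 grouping of 4×4 blocks and the per-pass pixel positions come from the text.

```python
BLOCK = 4  # 4x4 pixel block
SUPER = 2  # 2x2 blocks per lane (8x8 pixel region)

# Per-block pixel positions for the two passes that use super-blocks.
PASS_POSITIONS = {1: [(2, 2)], 2: [(0, 2), (2, 0)]}


def lane_pixels(super_x: int, super_y: int, pass_id: int):
    """Pixels evaluated by the lane that owns super-block (super_x, super_y)."""
    pixels = []
    for by in range(SUPER):
        for bx in range(SUPER):
            # Origin of this 4x4 block in pixel coordinates.
            ox = (super_x * SUPER + bx) * BLOCK
            oy = (super_y * SUPER + by) * BLOCK
            pixels += [(ox + px, oy + py) for px, py in PASS_POSITIONS[pass_id]]
    return pixels


print(lane_pixels(0, 0, 1))       # [(2, 2), (6, 2), (2, 6), (6, 6)]
print(len(lane_pixels(0, 0, 2)))  # 8
```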
| File | Description |
|---|---|
| `EntryPoint.py` | Pipeline orchestration using SlangPy |
| `AdaptiveLightingPass.slang` | Core adaptive lighting: 5 passes + DistributeWork |
| `Shading.slang` | Deferred lighting evaluation (`shade()`) |
| `GBufferPass.slang` | G-Buffer generation compute shader |
| `GBuffer.slang` | G-Buffer texture declarations |
| `LightingPass.slang` | Traditional (non-adaptive) deferred lighting reference |
| `Elevated.slang` | Procedural terrain scene (from Shadertoy/Elevated) |
| `Shadertoy.slang` | Shadertoy compatibility utilities |
    # Adaptive shading (default)
    python EntryPoint.py

    # Traditional deferred shading (reference)
    python EntryPoint.py -reference

Output is saved as `Result.png` (adaptive) or `Reference.png` (reference).

- Python 3.10+
- SlangPy
- NumPy
- imageio




