Adaptive Deferred Shading

An adaptive deferred shading implementation based on the paper Deferred Adaptive Compute Shading, with wave-level work distribution inspired by Brian Karis's Variable Sized Work. Built with Slang and SlangPy.

Improvements over the original paper

The original paper uses a state machine built on a global atomic counter and a groupshared ring queue: a single dispatch serves all pixels, thread groups compete on an InterlockedAdd to claim work, and each group alternates between SEARCH (evaluate and enqueue) and SHADE (drain the queue) phases. This introduces cross-group atomic contention, heavy barriers (DeviceMemoryBarrierWithGroupSync), and a large groupshared footprint for storing full pixel coordinates.

We replace this with Brian Karis's DistributeWork pattern — a lightweight wave-local producer-consumer model. Each lane evaluates its own pixels, records a shade count, and a single WavePrefixSum compacts all work items into a contiguous queue consumed in wave-sized batches. This eliminates global atomics entirely, replaces heavy barriers with native wave intrinsics (WavePrefixSum, WaveReadLaneAt, WaveActiveBallot), and reduces groupshared usage from a coordinate ring buffer to just two uint[32] arrays.

Algorithm Overview

The screen is divided into 4×4 pixel blocks. Instead of shading every pixel, we shade a sparse subset first and then decide for each remaining pixel whether it needs full shading or can be cheaply interpolated from already-computed neighbors.

Pass Progression

Five passes progressively fill in all 16 pixels of each 4×4 block. Each new pixel sits at the center of 4 already-computed neighbors, enabling the shade-or-interpolate decision.

Pass	Pixels/block	Pixel positions	Neighbor offsets
0	1	`(0,0)`	— (unconditional shade)
1	1	`(2,2)`	`(±2, ±2)` diagonal corners
2	2	`(0,2)`, `(2,0)`	`(±2, 0)`, `(0, ±2)` axis-aligned
3	4	`(1,1)`, `(1,3)`, `(3,1)`, `(3,3)`	`(±1, ±1)` diagonal
4	8	remaining 8 positions	`(±1, 0)`, `(0, ±1)` axis-aligned
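
The dependency claim can be checked mechanically. The following is a minimal Python sketch, not part of the project's shader code: it copies the positions and offsets from the table above and confirms that every neighbor a pass reads was filled by an earlier pass. The helper names `pass_of` and `neighbors_ready` are hypothetical, and the sketch assumes blocks tile the screen so that neighbors may fall in adjacent blocks.

```python
# Pixel positions filled by each pass inside one 4x4 block, and the
# neighbor offsets each pass reads (both copied from the table above).
PASS_PIXELS = {
    0: [(0, 0)],
    1: [(2, 2)],
    2: [(0, 2), (2, 0)],
    3: [(1, 1), (1, 3), (3, 1), (3, 3)],
}
NEIGHBOR_OFFSETS = {
    1: [(-2, -2), (2, -2), (-2, 2), (2, 2)],
    2: [(-2, 0), (2, 0), (0, -2), (0, 2)],
    3: [(-1, -1), (1, -1), (-1, 1), (1, 1)],
    4: [(-1, 0), (1, 0), (0, -1), (0, 1)],
}


def pass_of(x, y):
    """Return the pass that fills pixel (x, y); blocks repeat every 4 pixels."""
    local = (x % 4, y % 4)
    for p, pixels in PASS_PIXELS.items():
        if local in pixels:
            return p
    return 4  # the remaining 8 positions


def neighbors_ready(x, y):
    """True if every neighbor read by (x, y) is filled by an earlier pass."""
    p = pass_of(x, y)
    return all(pass_of(x + dx, y + dy) < p for dx, dy in NEIGHBOR_OFFSETS[p])


# Check all 15 conditionally shaded pixels of one block.
all_ready = all(
    neighbors_ready(x, y)
    for y in range(4)
    for x in range(4)
    if pass_of(x, y) > 0
)
print(all_ready)  # prints True
```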

Fill Order

The pass that fills each pixel of a 4×4 block follows directly from the pixel positions listed in the Pass Progression table.
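
As an illustration, the fill-order grid can be rebuilt from the Pass Progression table with a few lines of Python. This sketch assumes positions are written as `(x, y)` with `x` as the column and `y` as the row; the function name `fill_order_grid` is hypothetical.

```python
# Positions for passes 0-3 come from the Pass Progression table;
# pass 4 takes the remaining 8 positions.
PASS_PIXELS = {
    0: [(0, 0)],
    1: [(2, 2)],
    2: [(0, 2), (2, 0)],
    3: [(1, 1), (1, 3), (3, 1), (3, 3)],
}


def fill_order_grid():
    """Return a 4x4 list of rows where grid[y][x] is the pass filling (x, y)."""
    grid = [[4] * 4 for _ in range(4)]  # default: pass 4
    for p, pixels in PASS_PIXELS.items():
        for x, y in pixels:
            grid[y][x] = p
    return grid


for row in fill_order_grid():
    print(" ".join(str(p) for p in row))
# 0 4 2 4
# 4 3 4 3
# 2 4 1 4
# 4 3 4 3
```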

Shade-or-Interpolate Decision

For each pixel in passes 1–4, the algorithm reads 4 already-shaded neighbors, converts them to luminance, and computes the variance. If the variance exceeds a threshold (1e-3), the pixel is fully shaded; otherwise it is interpolated as the average of its neighbors.
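
The decision can be sketched in Python as follows. This is an illustrative CPU model, not the shader itself: the Rec. 709 luminance weights and the use of population variance are assumptions, since the text does not specify either, and `shade_or_interpolate` is a hypothetical name.

```python
THRESHOLD = 1e-3  # variance threshold stated in the text


def luminance(rgb):
    # Assumption: Rec. 709 weights; the shader may use different ones.
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def shade_or_interpolate(neighbors, shade):
    """Return a color for one pixel.

    neighbors: four (r, g, b) tuples that are already computed.
    shade:     zero-argument callable performing the expensive full shading.
    """
    lums = [luminance(c) for c in neighbors]
    mean = sum(lums) / len(lums)
    # Assumption: population variance (divide by N, not N - 1).
    variance = sum((l - mean) ** 2 for l in lums) / len(lums)
    if variance > THRESHOLD:
        return shade()
    # Low variance: average the neighbors per channel.
    return tuple(sum(c[i] for c in neighbors) / len(neighbors) for i in range(3))
```

With four identical neighbors the variance is zero and the pixel is interpolated; with alternating black and white neighbors the variance is 0.25 and the `shade` callable runs.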

Wave Optimization — DistributeWork

Passes 1–4 use the DistributeWork pattern (from Brian Karis's Variable Sized Work) to improve GPU wave utilization.

The problem: within a wave of 32 threads, some threads need to shade (expensive) and others only interpolate (cheap). Naively, all 32 threads stay active for the duration of the slowest path, wasting SIMD lanes.
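
A rough cost model makes the waste concrete. The sketch below is a simplification that counts only how many times the wave has to execute the shading path; it ignores interpolation cost and the overhead of redistributing work, and both function names are hypothetical.

```python
import math

WAVE_SIZE = 32  # wave width used in the text


def naive_shade_steps(shade_counts):
    """Each lane shades its own pixels, so the wave runs the shading path
    as many times as the busiest lane needs."""
    return max(shade_counts)


def compacted_shade_steps(shade_counts):
    """Shade work is pooled across the wave and consumed in wave-sized batches."""
    return math.ceil(sum(shade_counts) / WAVE_SIZE)


# Example: 4 lanes must shade 8 pixels each, the other 28 lanes shade none.
counts = [8] * 4 + [0] * 28
print(naive_shade_steps(counts))      # 8
print(compacted_shade_steps(counts))  # 1
```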

The solution: a producer-consumer model using groupshared memory and wave intrinsics.

Producer phase — Each lane evaluates its pixels, determines which need shading, and stores the count. Interpolation is performed immediately.
DistributeWork — Uses WavePrefixSum to compute a compact queue of all shade-work items across the wave, then distributes them evenly so every lane gets work. Producer data (block base position + shade mask) is communicated via groupshared arrays.
Consumer phase (RunChild) — Each lane shades its assigned pixel, looking up the source lane's block position and selecting the correct sub-pixel via NthSetBit.
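
The three phases can be modeled sequentially in Python. This is a sketch of the idea only: the real shader uses wave intrinsics and groupshared arrays, and the way the source lane is located here (a linear search over the prefix sums) is an assumption rather than the project's method. `distribute_work` and `nth_set_bit` are written for illustration; only the name NthSetBit appears in the source.

```python
WAVE_SIZE = 32


def nth_set_bit(mask, n):
    """Index of the n-th (0-based) set bit of mask, counting from bit 0."""
    for bit in range(mask.bit_length()):
        if (mask >> bit) & 1:
            if n == 0:
                return bit
            n -= 1
    raise ValueError("mask has fewer than n + 1 set bits")


def distribute_work(shade_masks):
    """shade_masks[lane] has bit i set if that lane's i-th pixel needs shading.

    Returns a list of batches. Each batch holds at most WAVE_SIZE items of
    the form (source_lane, sub_pixel_index), one item per consumer lane.
    """
    counts = [bin(m).count("1") for m in shade_masks]
    # Exclusive prefix sum, which is what WavePrefixSum yields per lane.
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    batches = []
    for start in range(0, total, WAVE_SIZE):
        batch = []
        for item in range(start, min(start + WAVE_SIZE, total)):
            # Source lane: the last lane with work whose offset is <= item.
            src = max(
                lane for lane in range(len(offsets))
                if offsets[lane] <= item and counts[lane] > 0
            )
            batch.append((src, nth_set_bit(shade_masks[src], item - offsets[src])))
        batches.append(batch)
    return batches


# Lane 0 needs sub-pixels 0 and 2 shaded, lane 2 needs sub-pixel 1.
masks = [0b0101, 0b0000, 0b0010] + [0] * 29
print(distribute_work(masks))  # [[(0, 0), (0, 2), (2, 1)]]
```

In this model each batch corresponds to one round of the consumer phase, in which every active lane shades exactly one pixel.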

Super-Block Mapping (Pass 1 & 2)

Pass 1 and Pass 2 originally had only 1–2 pixels per lane, too few for DistributeWork to provide a benefit. To increase the work density, these passes use a 2×2 super-block mapping: each lane covers a 2×2 group of 4×4 blocks (an 8×8 pixel region), raising the pixels per lane to 4 (pass 1) and 8 (pass 2).
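
The mapping can be illustrated with a short Python sketch that lists the pixels one lane covers. The 2D lane index `(lane_x, lane_y)` and the function name `super_block_pixels` are assumptions made for the example; the source does not say how lanes are assigned to super-blocks.

```python
BLOCK = 4  # block edge length in pixels

# In-block positions for passes 1 and 2, from the Pass Progression table.
PASS_PIXELS = {1: [(2, 2)], 2: [(0, 2), (2, 0)]}


def super_block_pixels(lane_x, lane_y, pass_index):
    """Absolute pixel coordinates covered by one lane in pass 1 or 2."""
    # Origin of this lane's 8x8 region (a 2x2 group of 4x4 blocks).
    base_x = lane_x * 2 * BLOCK
    base_y = lane_y * 2 * BLOCK
    pixels = []
    for by in range(2):
        for bx in range(2):
            for px, py in PASS_PIXELS[pass_index]:
                pixels.append((base_x + bx * BLOCK + px, base_y + by * BLOCK + py))
    return pixels


print(super_block_pixels(0, 0, 1))       # [(2, 2), (6, 2), (2, 6), (6, 6)]
print(len(super_block_pixels(0, 0, 2)))  # 8
```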

Project Structure

File	Description
`EntryPoint.py`	Pipeline orchestration using SlangPy
`AdaptiveLightingPass.slang`	Core adaptive lighting — 5 passes + DistributeWork
`Shading.slang`	Deferred lighting evaluation (`shade()`)
`GBufferPass.slang`	G-Buffer generation compute shader
`GBuffer.slang`	G-Buffer texture declarations
`LightingPass.slang`	Traditional (non-adaptive) deferred lighting reference
`Elevated.slang`	Procedural terrain scene (from Shadertoy/Elevated)
`Shadertoy.slang`	Shadertoy compatibility utilities

Usage

# Adaptive shading (default)
python EntryPoint.py

# Traditional deferred shading (reference)
python EntryPoint.py -reference

Output is saved as Result.png (adaptive) or Reference.png (reference).

Requirements

Python 3.10+
SlangPy
NumPy
imageio
