Cathedral Architecture

Section I

The Core Idea

Most performance work happens at runtime: you write flexible code, run it, profile it, and chase the slow paths. Cathedral Architecture starts from a different observation — a surprising amount of what we defer to runtime is already known when we write the code. Tensor shapes, model configurations, divisors, memory layouts: in an inference engine, these rarely change once a request begins.

If something is known at compile-time, the compiler can act on it. Bounds can be checked. Divisions can be turned into shifts or multiplications. Dispatch can be resolved before a single instruction runs. The work is the same; it just happens earlier, once, instead of repeatedly on every thread.

The Premise

The line between compile-time and runtime isn't fixed by the language — it's a choice. Cathedral Architecture is the practice of pushing that line as far toward compile-time as the problem allows, and encoding what's known into the type system rather than rediscovering it at runtime.

None of the individual techniques here are new. What's useful is applying them consistently, as a default rather than a last resort, in a domain — LLM inference — where the structure happens to be unusually static.

Section II

When the Architecture Crystallizes

Eric Raymond's "Cathedral and Bazaar" contrasted two ways of organizing a project: planned and structured versus emergent and flexible. That's a useful framing, but it's about how code gets built. The question here is a different one — when the structure of a computation gets fixed. The same algorithm can be expressed so the compiler knows nothing until execution, or so it knows almost everything beforehand.

Decided at Runtime

class Tensor {
    void* data;
    vector<int> shape;
    string dtype;

    void* index(vector<int> coords) {
        return data + calculate_offset(coords);
    }
};

Decided at Compile-Time

template<uint32_t batch, uint32_t seq,
         uint32_t hidden, uint32_t heads>
struct AttentionTensor {
    static constexpr auto dims =
        dimensions<batch,seq,hidden,heads>;
    using dtype = float16_t;

    template<uint32_t b,uint32_t s,
             uint32_t h,uint32_t n>
    DEVICE dtype& at() {
        static_assert(b<batch && s<seq);
        constexpr uint64_t offset =
            compile_time_index<dims,b,s,h,n>();
        return data[offset];
    }
    dtype* data;
};

The version on the left is maximally flexible and resolves everything at runtime. The version on the right trades that flexibility away: shapes are template parameters, the offset is a constexpr, and out-of-bounds access fails to compile. The compiler can see the whole indexing scheme before it emits code.

That trade is the whole game. You give up runtime flexibility you don't actually need, and in exchange you get checking and optimization for free. The skill is knowing which flexibility you don't need — which is exactly where this approach can go wrong if applied carelessly (see Section VII).

Section III

The Four Pillars

I

Compile-Time Knowledge

Encode what's known as types. Check bounds with static_assert. Resolve dispatch and compute layouts in constexpr/consteval rather than at runtime.

II

Zero-Cost Abstractions

Use templates and concepts so abstraction has no runtime footprint. Template instantiation resolves dispatch at compile-time — no virtual calls, no indirection.

III

Types as the Compute Graph

Express dependencies as types rather than runtime values. A transformer layer becomes a type; memory planning becomes constexpr; malformed graphs fail to compile.

IV

Constraints as Verification

Use type constraints to enforce invariants. If shape requirements are encoded as constraints, mismatched shapes simply won't compile.

Pillar I — Compile-Time Knowledge

Take a routine operation: thread_id / sequence_length.

Runtime Division

__device__ uint32_t batch =
    thread_id / seq_len;

Resolved at Compile-Time

template<uint32_t divisor> struct Division {
    DEVICE static uint32_t div(uint32_t value) {
        if constexpr (is_power_of_2(divisor)) {
            return value >> log2_ct(divisor);
        } else {
            constexpr auto params =
                magic_params(divisor);
            return mulhi(value, params.multiplier)
                       >> params.shift;
        }
    }
};

When the divisor is known at compile-time, the expensive integer division never reaches the device. It becomes a shift for powers of two, or a precomputed multiply-and-shift (the classic Granlund–Montgomery "magic number" division) otherwise.

The Obvious Objection

"A compiler already does this. x / 8 is already a shift and x / 10 is already a magic-number multiply at -O2 — for free." That's true, and it's the point. The win here is not the transformation itself; it's extending it to divisors the compiler can't see — values fixed once per request and threaded through layers, function boundaries, and kernel launches where the optimizer has long since lost track of the constant. Hoisting those to template parameters puts them back in the compiler's hands, where the same transformation it already knows how to do becomes applicable again.

On the indexing-bound microkernels we measured, isolating the division operation itself:

~15–40× On the isolated division op microbenchmark, not end-to-end

100% Of constant-divisor divisions removed from device code

~28% Fewer instructions on the measured indexing kernels

Measured on an indexing-bound microkernel against a runtime-divisor baseline on the same hardware; the speedup is for the division operation in isolation, not whole-model throughput. Full setup and per-kernel figures are in the Division Elimination whitepaper.

Pillar II — Zero-Cost Abstractions

Virtual Dispatch

virtual float compute(const Tensor& t) = 0;

struct FastTensor : Tensor {
    float compute(const Tensor& t) override;
};

Template Dispatch

template<TensorConcept T>
DEVICE float compute(const T& tensor) {
    return tensor.compute_impl();
}

The template version has no runtime footprint: instantiation resolves dispatch at compile-time, DEVICE inline forces inlining, concept checking happens during compilation, and there's no virtual table to chase. The abstraction costs nothing at runtime — not as a hope about what the optimizer might do, but as a property of how the code is dispatched.

Compile-Time Diagnostics That Carry Values

A plain static_assert tells you something is wrong. It usually doesn't tell you which values caused it, which means a second debugging round to find them. With a little work, you can bake the offending values into the error message itself.

Plain static_assert

static_assert(A::dims[1] == B::dims[0],
              "Matrix dimension mismatch");

// → error: Matrix dimension mismatch
// You know there's a mismatch. You don't know what the values were.

Value-Carrying Diagnostic

static_assert_printer<
    A::dims[1] == B::dims[0],
    MatMulError::dimension_mismatch,
    static_assert_printer_val_inserter<A::dims[1], B::dims[0]>
>::impl;

// → error: 'nonexistent_value' is not a member of
// 'static_assert_printer_impl<MatMulError::dimension_mismatch,
//  static_assert_printer_val_inserter<512, 768>>'
//
// The enum names the failure; 512 vs 768 are right there.

On Diagnostics

It's a small technique with an outsized payoff: the error tells you what failed and the values that caused it, so the fix doesn't require a separate debugging pass.

Pillar III — Types as the Compute Graph

A Transformer Layer Expressed as a Type

template<ModelConfig config>
struct TransformerBlock {
    using qcur      = MatMul<config, attn_q, attn_norm_mul>;
    using kcur      = MatMul<config, attn_k, attn_norm_mul>;
    using vcur      = MatMul<config, attn_v, attn_norm_mul>;
    using qcur_rope = Rope<config, qcur, inp_pos, rope_freqs>;
    using kcur_rope = Rope<config, kcur, inp_pos, rope_freqs>;
    using kq        = MatMul<config, k_view, q_permute>;
    using kq_soft   = Softmax<config, kq, kq_mask>;
    using kqv       = MatMul<config, v_view, kq_soft>;
    using output    = MatMul<config, attn_output, kqv_merged_cont>;
};

Dependencies between operations are compile-time relationships
The compiler can see the full compute graph in one place
Memory planning is constexpr — requirements computed during compilation
A graph wired up incorrectly won't compile

Pillar IV — Constraints as Verification

Runtime Assertions

void matmul(Tensor a, Tensor b, Tensor out) {
    assert(a.shape[1] == b.shape[0]);
    assert(out.shape[0] == a.shape[0]);
    assert(out.shape[1] == b.shape[1]);
}

Compile-Time Constraints

template<Dims A, Dims B>
    requires(A::dims[1] == B::dims[0])
auto matmul(Tensor<A> a, Tensor<B> b)
    -> Tensor<Dims<A::dims[0], B::dims[1]>>
{
}

The runtime version catches a shape mismatch when it runs — if that path runs during testing. The constrained version catches it at compile-time, every time, for free. Shape errors that can be expressed as constraints stop being a class of bug you can ship. It's a narrow guarantee — it covers what you encode and nothing more — but within that scope it's reliable.

Section IV

Worked Example: Division Elimination

The clearest illustration of the approach is integer division, which our Division Elimination whitepaper covers in detail. Integer division costs roughly 20–40 cycles on current GPUs, and tensor indexing performs several divisions per access. On lightweight kernels, that overhead can dominate.

The runtime instinct: division is expensive, so avoid it where convenient and live with it elsewhere.
The compile-time observation: in an inference engine, most divisors are either constants or set once per request. The few that are truly dynamic can be handled separately. So most divisions can be turned into cheaper operations before the kernel runs — including the ones a stock compiler can't optimize, because by the time the divisor reaches the kernel it's no longer a visible constant.

The Transformation

template<uint32_t divisor> struct Div {
    constexpr uint32_t operator()(uint32_t value) {
        if constexpr (is_pow2(divisor))
            return value >> log2(divisor);
        else
            return magic_divide(value);  // multiply-and-shift
    }
};

100% Of constant-divisor divisions removed from device code

~15–40× On the isolated division op microbenchmark, not end-to-end

~28% Fewer instructions on the measured indexing kernels

These are operation-level and kernel-level figures from indexing-bound microkernels, not whole-model inference throughput — the end-to-end gain depends entirely on how division-bound your kernels are, and for compute-heavy kernels it is small. Hardware, baselines, and per-kernel breakdowns are in the whitepaper.

The pattern generalizes: identify what's actually known ahead of time, encode it in the type, and let the compiler specialize. The division case is just the easiest one to measure.

Section V

Why This Works Now

The pieces of this approach have existed for a while; what's changed is that they've matured enough to use together comfortably.

C++20 consteval & constexpr

Real computation at compile-time. Magic-number parameters and memory layouts can be computed by the compiler rather than at startup.

Mature Template Metaprogramming

Concepts make constraints readable, if constexpr gives zero-cost branching, and fold expressions handle variadic cases without the old boilerplate.

GPU Architecture

Massive parallelism rewards minimal per-thread overhead. L2 cache keeps constant data fast, and intrinsics make multiply-high division practical.

Modern Compilers

Aggressive template instantiation, cross-translation-unit optimization, and dependable constant folding make these patterns viable in production.

Hardware Balance

Arithmetic is cheap; memory bandwidth and control-flow overhead increasingly are not. That shifts the payoff toward eliminating overhead.

Putting It Together

Each of these was usable on its own. The contribution here is mostly assembling them into a consistent way of working rather than inventing any one piece.

Section VI

How to Build This Way

Working Principles

Encode invariants in types. If a property can be expressed as a constraint, expressing it there gets it checked for free.
Compute once. Static constexpr results are calculated during compilation, not recomputed per call.
Prefer compile-time failure. A shape error caught by the compiler never becomes a runtime incident.
Keep the computation visible to the compiler. Opaque boundaries block the optimizations this approach depends on.

Workflow

Typical Cycle

Write → Test → Profile
     → Optimize → Repeat

Compile-Time-First Cycle

Design → Encode → Verify
       → Compile → Deploy

This style front-loads effort into design and type modeling. That's a real cost — the design phase is longer and the types take work to get right. The payoff is that a lot of debugging moves to compile-time, where it's cheaper, and a class of performance surprises disappears. Whether that trade is worth it depends on how stable your problem actually is. For inference engines, it usually is.

Keeping Template Instantiation Under Control

The obvious risk with heavy compile-time work is combinatorial explosion: instantiate a kernel across every configuration dimension and the numbers get ugly fast. The fix is to templatize where it pays and stop where it doesn't — keeping the leaf-level instantiation count small.

The Naive Version — 122,880 instantiations

5 architectures × 4 sizes × 2 devices × 10 batch sizes × 8 sequence lengths × 4 hidden dims × 4 head configs × 2 kv configs × 3 quant types × 2 flash options. Compile time in hours, binary size in gigabytes. Not workable.

What Gets Templated Where

Top level — full templating: architecture selection, quantization strategy, memory layout, workspace sizing, type construction.
Middle level — selective: tensor dimensions where they actually vary, data-type transforms, access patterns.
Leaf level — minimal: input/output types only, not the full configuration space.

~50 Top-level configuration variations in Nihilus

~15 Leaf-level kernel instantiations

750 Total instantiations vs. 122,880 naïve

The Rule of Thumb

Compute liberally near the top of the type hierarchy; instantiate conservatively at the leaves. Most of the practical difficulty in this approach is deciding where that line sits.

Section VII

Anti-Patterns

It's easy to take this idea too far. Things to avoid:

Template metaprogramming for its own sake. The goal is encoding domain knowledge, not computing Fibonacci numbers at compile-time to prove you can.
Over-generalizing. Specialize when specialization buys something; don't build a configuration space you'll never use.
Forcing runtime values to compile-time. Real input data is runtime data. Pretending otherwise is where this approach breaks.
Type machinery that hides intent. If a teammate needs an afternoon to read the types, the abstraction is failing its job.
Treating a string as a diagnostic. Error messages are more useful when they carry the values that caused the failure.
Claiming the compiler's own wins as your own. If a stock -O2 build already does the transformation, you haven't added anything — the value is only in the cases the optimizer can't reach.

And conversely, the parts worth keeping:

Encoding domain knowledge in types
Removing runtime overhead the compiler can resolve instead
Enforcing invariants through constraints
Specializing for cases that are genuinely known ahead of time
Keeping the full computation visible to the compiler
Holding leaf instantiation counts down
Writing diagnostics that carry their values

Section VIII

Where It's Headed

Near-Term

Compile-time auto-tuning. Having the build try several implementations and keep the fastest, rather than tuning by hand.
Richer constraints. Encoding properties like thread-safety or numerical-stability bounds where the type system can carry them.
Cross-kernel optimization. Pipeline-level type analysis to enable kernel fusion and cut down on data movement.
Target-specific specialization. Architecture-aware templating that emits well-suited code per GPU target.

Longer-Term, More Speculative

More checking moves to the compiler, reducing — though not replacing — the runtime testing surface.
Zero-cost abstraction as a checkable property, rather than something you confirm by reading disassembly after the fact.
Whole categories of shape and indexing bugs caught before they ship. Not all bugs — only the ones expressible as constraints.
Optimization decisions made with fuller information, since the compiler can see more of the computation at once.

A Caveat

These are directions, not promises. The further down this list you go, the more it depends on tooling and language features that are still maturing.

Section IX

Getting Involved

More than a library, this is a habit: before reaching for a runtime mechanism, asking whether the relevant fact is already known at compile-time. Often it is.

When you find yourself writing...

if (condition) { }

void* ptr = malloc(size);

virtual void compute() = 0;

static_assert(condition,
    "something went wrong");

...it's worth asking whether this fits.

if constexpr (condition) { }

static byte storage[size];

template<typename T> void compute() { }

static_assert_printer<
    condition,
    ErrorEnum::specific_failure,
    static_assert_printer_val_inserter<values>
>::impl;

The right-hand column isn't always the answer — sometimes the runtime version is genuinely simpler or more appropriate. The point is just to make the question a reflex: is this actually a runtime decision, or have I just been treating it like one?

Where Help Is Useful

Theory

Formalizing the patterns, establishing bounds, working out where the guarantees actually hold.

Implementation

Building libraries and kernels, and finding the cases where the approach helps versus where it doesn't.

Tooling

Better compiler diagnostics, analyzers, and anything that lowers the cost of working this way.

If you've ever stared at a profiler wondering why a kernel spends its time on indexing arithmetic, or wished a shape bug had been caught before it shipped, this is aimed at that. It's a way of thinking about where computation belongs — and the answer is more often "compile-time" than habit suggests.

Section X

Guiding Principles

To summarize, the practices this comes down to:

Encode what's known in types — if it can be checked at compile-time, check it there.
Decide early — move decisions out of runtime when the information is already available.
Keep abstractions free — prefer mechanisms with no runtime footprint over ones you hope get optimized away.
Enforce invariants with constraints — let the type system rule out the bugs it can.
Share what works — keep the techniques documented and open so others can use and critique them.
Mind instantiation cost — compute liberally at the top, instantiate sparingly at the leaves.
Make diagnostics carry their values — an error that names what went wrong saves a debugging pass.

In One Line

"Whatever's already known when you write the code, let the compiler act on it."

Appendix A

An Overview

The Core Idea

When the Architecture Crystallizes

The Four Pillars

Pillar I — Compile-Time Knowledge

Pillar II — Zero-Cost Abstractions

Compile-Time Diagnostics That Carry Values

Pillar III — Types as the Compute Graph

Pillar IV — Constraints as Verification

Worked Example: Division Elimination

Why This Works Now

How to Build This Way

Working Principles

Workflow

Keeping Template Instantiation Under Control

What Gets Templated Where

Anti-Patterns

Where It's Headed

Near-Term

Longer-Term, More Speculative

Getting Involved

Where Help Is Useful

Theory

Implementation

Tooling

Guiding Principles

Further Reading