Cathedral Architecture

An Overview

VERSION 1.3  ·  TORONTO, MARCH 2026

Section I

The Core Idea

Most performance work happens at runtime: you write flexible code, run it, profile it, and chase the slow paths. Cathedral Architecture starts from a different observation — a surprising amount of what we defer to runtime is already known when we write the code. Tensor shapes, model configurations, divisors, memory layouts: in an inference engine, these rarely change once a request begins.

If something is known at compile-time, the compiler can act on it. Bounds can be checked. Divisions can be turned into shifts or multiplications. Dispatch can be resolved before a single instruction runs. The work is the same; it just happens earlier, once, instead of repeatedly on every thread.

The Premise

The line between compile-time and runtime isn't fixed by the language — it's a choice. Cathedral Architecture is the practice of pushing that line as far toward compile-time as the problem allows, and encoding what's known into the type system rather than rediscovering it at runtime.

None of the individual techniques here are new. What's useful is applying them consistently, as a default rather than a last resort, in a domain — LLM inference — where the structure happens to be unusually static.

Section II

When the Architecture Crystallizes

Eric Raymond's "Cathedral and Bazaar" contrasted two ways of organizing a project: planned and structured versus emergent and flexible. That's a useful framing, but it's about how code gets built. The question here is a different one — when the structure of a computation gets fixed. The same algorithm can be expressed so the compiler knows nothing until execution, or so it knows almost everything beforehand.

Decided at Runtime
class Tensor {
    void* data;
    vector<int> shape;
    string dtype;

    void* index(vector<int> coords) {
        return data + calculate_offset(coords);
    }
};
Decided at Compile-Time
template<uint32_t batch, uint32_t seq,
         uint32_t hidden, uint32_t heads>
struct AttentionTensor {
    static constexpr auto dims =
        dimensions<batch,seq,hidden,heads>;
    using dtype = float16_t;

    template<uint32_t b,uint32_t s,
             uint32_t h,uint32_t n>
    DEVICE dtype& at() {
        static_assert(b<batch && s<seq);
        constexpr uint64_t offset =
            compile_time_index<dims,b,s,h,n>();
        return data[offset];
    }
    dtype* data;
};

The version on the left is maximally flexible and resolves everything at runtime. The version on the right trades that flexibility away: shapes are template parameters, the offset is a constexpr, and out-of-bounds access fails to compile. The compiler can see the whole indexing scheme before it emits code.

That trade is the whole game. You give up runtime flexibility you don't actually need, and in exchange you get checking and optimization for free. The skill is knowing which flexibility you don't need — which is exactly where this approach can go wrong if applied carelessly (see Section VII).

Section III

The Four Pillars

I

Compile-Time Knowledge

Encode what's known as types. Check bounds with static_assert. Resolve dispatch and compute layouts in constexpr/consteval rather than at runtime.

II

Zero-Cost Abstractions

Use templates and concepts so abstraction has no runtime footprint. Template instantiation resolves dispatch at compile-time — no virtual calls, no indirection.

III

Types as the Compute Graph

Express dependencies as types rather than runtime values. A transformer layer becomes a type; memory planning becomes constexpr; malformed graphs fail to compile.

IV

Constraints as Verification

Use type constraints to enforce invariants. If shape requirements are encoded as constraints, mismatched shapes simply won't compile.


Pillar I — Compile-Time Knowledge

Take a routine operation: thread_id / sequence_length.

Runtime Division
__device__ uint32_t batch =
    thread_id / seq_len;
Resolved at Compile-Time
template<uint32_t divisor> struct Division {
    DEVICE static uint32_t div(uint32_t value) {
        if constexpr (is_power_of_2(divisor)) {
            return value >> log2_ct(divisor);
        } else {
            constexpr auto params =
                magic_params(divisor);
            return mulhi(value, params.multiplier)
                       >> params.shift;
        }
    }
};

When the divisor is known at compile-time, the expensive integer division never reaches the device. It becomes a shift for powers of two, or a precomputed multiply-and-shift (the classic Granlund–Montgomery "magic number" division) otherwise.

The Obvious Objection

"A compiler already does this. x / 8 is already a shift and x / 10 is already a magic-number multiply at -O2 — for free." That's true, and it's the point. The win here is not the transformation itself; it's extending it to divisors the compiler can't see — values fixed once per request and threaded through layers, function boundaries, and kernel launches where the optimizer has long since lost track of the constant. Hoisting those to template parameters puts them back in the compiler's hands, where the same transformation it already knows how to do becomes applicable again.

On the indexing-bound microkernels we measured, isolating the division operation itself:

~15–40× On the isolated division op microbenchmark, not end-to-end
100% Of constant-divisor divisions removed from device code
~28% Fewer instructions on the measured indexing kernels

Measured on an indexing-bound microkernel against a runtime-divisor baseline on the same hardware; the speedup is for the division operation in isolation, not whole-model throughput. Full setup and per-kernel figures are in the Division Elimination whitepaper.


Pillar II — Zero-Cost Abstractions

Virtual Dispatch
virtual float compute(const Tensor& t) = 0;

struct FastTensor : Tensor {
    float compute(const Tensor& t) override;
};
Template Dispatch
template<TensorConcept T>
DEVICE float compute(const T& tensor) {
    return tensor.compute_impl();
}

The template version has no runtime footprint: instantiation resolves dispatch at compile-time, DEVICE inline forces inlining, concept checking happens during compilation, and there's no virtual table to chase. The abstraction costs nothing at runtime — not as a hope about what the optimizer might do, but as a property of how the code is dispatched.


Compile-Time Diagnostics That Carry Values

A plain static_assert tells you something is wrong. It usually doesn't tell you which values caused it, which means a second debugging round to find them. With a little work, you can bake the offending values into the error message itself.

Plain static_assert
static_assert(A::dims[1] == B::dims[0],
              "Matrix dimension mismatch");

// → error: Matrix dimension mismatch
// You know there's a mismatch. You don't know what the values were.
Value-Carrying Diagnostic
static_assert_printer<
    A::dims[1] == B::dims[0],
    MatMulError::dimension_mismatch,
    static_assert_printer_val_inserter<A::dims[1], B::dims[0]>
>::impl;

// → error: 'nonexistent_value' is not a member of
// 'static_assert_printer_impl<MatMulError::dimension_mismatch,
//  static_assert_printer_val_inserter<512, 768>>'
//
// The enum names the failure; 512 vs 768 are right there.
On Diagnostics

It's a small technique with an outsized payoff: the error tells you what failed and the values that caused it, so the fix doesn't require a separate debugging pass.


Pillar III — Types as the Compute Graph

A Transformer Layer Expressed as a Type
template<ModelConfig config>
struct TransformerBlock {
    using qcur      = MatMul<config, attn_q, attn_norm_mul>;
    using kcur      = MatMul<config, attn_k, attn_norm_mul>;
    using vcur      = MatMul<config, attn_v, attn_norm_mul>;
    using qcur_rope = Rope<config, qcur, inp_pos, rope_freqs>;
    using kcur_rope = Rope<config, kcur, inp_pos, rope_freqs>;
    using kq        = MatMul<config, k_view, q_permute>;
    using kq_soft   = Softmax<config, kq, kq_mask>;
    using kqv       = MatMul<config, v_view, kq_soft>;
    using output    = MatMul<config, attn_output, kqv_merged_cont>;
};

Pillar IV — Constraints as Verification

Runtime Assertions
void matmul(Tensor a, Tensor b, Tensor out) {
    assert(a.shape[1] == b.shape[0]);
    assert(out.shape[0] == a.shape[0]);
    assert(out.shape[1] == b.shape[1]);
}
Compile-Time Constraints
template<Dims A, Dims B>
    requires(A::dims[1] == B::dims[0])
auto matmul(Tensor<A> a, Tensor<B> b)
    -> Tensor<Dims<A::dims[0], B::dims[1]>>
{
}

The runtime version catches a shape mismatch when it runs — if that path runs during testing. The constrained version catches it at compile-time, every time, for free. Shape errors that can be expressed as constraints stop being a class of bug you can ship. It's a narrow guarantee — it covers what you encode and nothing more — but within that scope it's reliable.

Section IV

Worked Example: Division Elimination

The clearest illustration of the approach is integer division, which our Division Elimination whitepaper covers in detail. Integer division costs roughly 20–40 cycles on current GPUs, and tensor indexing performs several divisions per access. On lightweight kernels, that overhead can dominate.

The Transformation
template<uint32_t divisor> struct Div {
    constexpr uint32_t operator()(uint32_t value) {
        if constexpr (is_pow2(divisor))
            return value >> log2(divisor);
        else
            return magic_divide(value);  // multiply-and-shift
    }
};
100% Of constant-divisor divisions removed from device code
~15–40× On the isolated division op microbenchmark, not end-to-end
~28% Fewer instructions on the measured indexing kernels

These are operation-level and kernel-level figures from indexing-bound microkernels, not whole-model inference throughput — the end-to-end gain depends entirely on how division-bound your kernels are, and for compute-heavy kernels it is small. Hardware, baselines, and per-kernel breakdowns are in the whitepaper.

The pattern generalizes: identify what's actually known ahead of time, encode it in the type, and let the compiler specialize. The division case is just the easiest one to measure.

Section V

Why This Works Now

The pieces of this approach have existed for a while; what's changed is that they've matured enough to use together comfortably.

C++20 consteval & constexpr

Real computation at compile-time. Magic-number parameters and memory layouts can be computed by the compiler rather than at startup.

Mature Template Metaprogramming

Concepts make constraints readable, if constexpr gives zero-cost branching, and fold expressions handle variadic cases without the old boilerplate.

GPU Architecture

Massive parallelism rewards minimal per-thread overhead. L2 cache keeps constant data fast, and intrinsics make multiply-high division practical.

Modern Compilers

Aggressive template instantiation, cross-translation-unit optimization, and dependable constant folding make these patterns viable in production.

Hardware Balance

Arithmetic is cheap; memory bandwidth and control-flow overhead increasingly are not. That shifts the payoff toward eliminating overhead.

Putting It Together

Each of these was usable on its own. The contribution here is mostly assembling them into a consistent way of working rather than inventing any one piece.

Section VI

How to Build This Way

Working Principles

Workflow

Typical Cycle
Write → Test → Profile
     → Optimize → Repeat
Compile-Time-First Cycle
Design → Encode → Verify
       → Compile → Deploy

This style front-loads effort into design and type modeling. That's a real cost — the design phase is longer and the types take work to get right. The payoff is that a lot of debugging moves to compile-time, where it's cheaper, and a class of performance surprises disappears. Whether that trade is worth it depends on how stable your problem actually is. For inference engines, it usually is.


Keeping Template Instantiation Under Control

The obvious risk with heavy compile-time work is combinatorial explosion: instantiate a kernel across every configuration dimension and the numbers get ugly fast. The fix is to templatize where it pays and stop where it doesn't — keeping the leaf-level instantiation count small.

The Naive Version — 122,880 instantiations

5 architectures × 4 sizes × 2 devices × 10 batch sizes × 8 sequence lengths × 4 hidden dims × 4 head configs × 2 kv configs × 3 quant types × 2 flash options. Compile time in hours, binary size in gigabytes. Not workable.

What Gets Templated Where

~50 Top-level configuration variations in Nihilus
~15 Leaf-level kernel instantiations
750 Total instantiations vs. 122,880 naïve
The Rule of Thumb

Compute liberally near the top of the type hierarchy; instantiate conservatively at the leaves. Most of the practical difficulty in this approach is deciding where that line sits.

Section VII

Anti-Patterns

It's easy to take this idea too far. Things to avoid:


And conversely, the parts worth keeping:

Section VIII

Where It's Headed

Near-Term

Longer-Term, More Speculative

A Caveat

These are directions, not promises. The further down this list you go, the more it depends on tooling and language features that are still maturing.

Section IX

Getting Involved

More than a library, this is a habit: before reaching for a runtime mechanism, asking whether the relevant fact is already known at compile-time. Often it is.

When you find yourself writing...
if (condition) { }

void* ptr = malloc(size);

virtual void compute() = 0;

static_assert(condition,
    "something went wrong");
...it's worth asking whether this fits.
if constexpr (condition) { }

static byte storage[size];

template<typename T> void compute() { }

static_assert_printer<
    condition,
    ErrorEnum::specific_failure,
    static_assert_printer_val_inserter<values>
>::impl;

The right-hand column isn't always the answer — sometimes the runtime version is genuinely simpler or more appropriate. The point is just to make the question a reflex: is this actually a runtime decision, or have I just been treating it like one?


Where Help Is Useful

Theory

Formalizing the patterns, establishing bounds, working out where the guarantees actually hold.

Implementation

Building libraries and kernels, and finding the cases where the approach helps versus where it doesn't.

Tooling

Better compiler diagnostics, analyzers, and anything that lowers the cost of working this way.

If you've ever stared at a profiler wondering why a kernel spends its time on indexing arithmetic, or wished a shape bug had been caught before it shipped, this is aimed at that. It's a way of thinking about where computation belongs — and the answer is more often "compile-time" than habit suggests.

Section X

Guiding Principles

To summarize, the practices this comes down to:

  1. Encode what's known in types — if it can be checked at compile-time, check it there.
  2. Decide early — move decisions out of runtime when the information is already available.
  3. Keep abstractions free — prefer mechanisms with no runtime footprint over ones you hope get optimized away.
  4. Enforce invariants with constraints — let the type system rule out the bugs it can.
  5. Share what works — keep the techniques documented and open so others can use and critique them.
  6. Mind instantiation cost — compute liberally at the top, instantiate sparingly at the leaves.
  7. Make diagnostics carry their values — an error that names what went wrong saves a debugging pass.
In One Line

"Whatever's already known when you write the code, let the compiler act on it."

Appendix A

Further Reading

Foundations

Stroustrup, B. — The Design and Evolution of C++
Alexandrescu, A. — Modern C++ Design
Abrahams, D. & Gurtovoy, A. — C++ Template Metaprogramming

Performance Engineering

Warren, H. S. — Hacker's Delight (Ch. 10, integer division by constants)
Granlund, T. & Montgomery, P. — Division by Invariant Integers using Multiplication
Fog, A. — Optimizing Software in C++
Lemire, D. — Fast Random Integer Generation

Type Theory

Pierce, B. C. — Types and Programming Languages
Martin-Löf, P. — Intuitionistic Type Theory

Our Whitepapers

Compile-Time Division Elimination for Zero-Overhead Tensor Indexing in CUDA Kernels
Order-Agnostic Constexpr Configuration: A Type-Routed Compile-Time Parameter System with Enforced Uniqueness
More to come.

The biggest wins tend to come from moving work to the right time, not from doing the same work faster — deciding at compile-time what doesn't need to be decided at runtime.

Thanks to Claude (Sonnet) for the conversation on Jun 7 2025 that helped clarify how few transformer dimensions actually vary at runtime — the observation that started this.