Most performance work happens at runtime: you write flexible code, run it, profile it, and chase the slow paths. Cathedral Architecture starts from a different observation — a surprising amount of what we defer to runtime is already known when we write the code. Tensor shapes, model configurations, divisors, memory layouts: in an inference engine, these rarely change once a request begins.
If something is known at compile-time, the compiler can act on it. Bounds can be checked. Divisions can be turned into shifts or multiplications. Dispatch can be resolved before a single instruction runs. The work is the same; it just happens earlier, once, instead of repeatedly on every thread.
The Premise
The line between compile-time and runtime isn't fixed by the language — it's a choice. Cathedral Architecture is the practice of pushing that line as far toward compile-time as the problem allows, and encoding what's known into the type system rather than rediscovering it at runtime.
None of the individual techniques here are new. What's useful is applying them consistently, as a default rather than a last resort, in a domain — LLM inference — where the structure happens to be unusually static.
Section II
When the Architecture Crystallizes
Eric Raymond's "Cathedral and Bazaar" contrasted two ways of organizing a project: planned and structured versus emergent and flexible. That's a useful framing, but it's about how code gets built. The question here is a different one — when the structure of a computation gets fixed. The same algorithm can be expressed so the compiler knows nothing until execution, or so it knows almost everything beforehand.
Decided at Runtime
class Tensor {
void* data;
vector<int> shape;
string dtype;
void* index(vector<int> coords) {
return data + calculate_offset(coords);
}
};
The version on the left is maximally flexible and resolves everything at runtime. The version on the right trades that flexibility away: shapes are template parameters, the offset is a constexpr, and out-of-bounds access fails to compile. The compiler can see the whole indexing scheme before it emits code.
That trade is the whole game. You give up runtime flexibility you don't actually need, and in exchange you get checking and optimization for free. The skill is knowing which flexibility you don't need — which is exactly where this approach can go wrong if applied carelessly (see Section VII).
Section III
The Four Pillars
I
Compile-Time Knowledge
Encode what's known as types. Check bounds with static_assert. Resolve dispatch and compute layouts in constexpr/consteval rather than at runtime.
II
Zero-Cost Abstractions
Use templates and concepts so abstraction has no runtime footprint. Template instantiation resolves dispatch at compile-time — no virtual calls, no indirection.
III
Types as the Compute Graph
Express dependencies as types rather than runtime values. A transformer layer becomes a type; memory planning becomes constexpr; malformed graphs fail to compile.
IV
Constraints as Verification
Use type constraints to enforce invariants. If shape requirements are encoded as constraints, mismatched shapes simply won't compile.
Pillar I — Compile-Time Knowledge
Take a routine operation: thread_id / sequence_length.
When the divisor is known at compile-time, the expensive integer division never reaches the device. It becomes a shift for powers of two, or a precomputed multiply-and-shift (the classic Granlund–Montgomery "magic number" division) otherwise.
The Obvious Objection
"A compiler already does this. x / 8 is already a shift and x / 10 is already a magic-number multiply at -O2 — for free." That's true, and it's the point. The win here is not the transformation itself; it's extending it to divisors the compiler can't see — values fixed once per request and threaded through layers, function boundaries, and kernel launches where the optimizer has long since lost track of the constant. Hoisting those to template parameters puts them back in the compiler's hands, where the same transformation it already knows how to do becomes applicable again.
On the indexing-bound microkernels we measured, isolating the division operation itself:
~15–40×On the isolated division opmicrobenchmark, not end-to-end
100%Of constant-divisor divisions removed from device code
~28%Fewer instructions on the measured indexing kernels
Measured on an indexing-bound microkernel against a runtime-divisor baseline on the same hardware; the speedup is for the division operation in isolation, not whole-model throughput. Full setup and per-kernel figures are in the Division Elimination whitepaper.
The template version has no runtime footprint: instantiation resolves dispatch at compile-time, DEVICE inline forces inlining, concept checking happens during compilation, and there's no virtual table to chase. The abstraction costs nothing at runtime — not as a hope about what the optimizer might do, but as a property of how the code is dispatched.
Compile-Time Diagnostics That Carry Values
A plain static_assert tells you something is wrong. It usually doesn't tell you which values caused it, which means a second debugging round to find them. With a little work, you can bake the offending values into the error message itself.
Plain static_assert
static_assert(A::dims[1] == B::dims[0],
"Matrix dimension mismatch");
// → error: Matrix dimension mismatch
// You know there's a mismatch. You don't know what the values were.
Value-Carrying Diagnostic
static_assert_printer<
A::dims[1] == B::dims[0],
MatMulError::dimension_mismatch,
static_assert_printer_val_inserter<A::dims[1], B::dims[0]>
>::impl;
// → error: 'nonexistent_value' is not a member of
// 'static_assert_printer_impl<MatMulError::dimension_mismatch,
// static_assert_printer_val_inserter<512, 768>>'
//
// The enum names the failure; 512 vs 768 are right there.
On Diagnostics
It's a small technique with an outsized payoff: the error tells you what failed and the values that caused it, so the fix doesn't require a separate debugging pass.
Pillar III — Types as the Compute Graph
A Transformer Layer Expressed as a Type
template<ModelConfig config>
struct TransformerBlock {
using qcur = MatMul<config, attn_q, attn_norm_mul>;
using kcur = MatMul<config, attn_k, attn_norm_mul>;
using vcur = MatMul<config, attn_v, attn_norm_mul>;
using qcur_rope = Rope<config, qcur, inp_pos, rope_freqs>;
using kcur_rope = Rope<config, kcur, inp_pos, rope_freqs>;
using kq = MatMul<config, k_view, q_permute>;
using kq_soft = Softmax<config, kq, kq_mask>;
using kqv = MatMul<config, v_view, kq_soft>;
using output = MatMul<config, attn_output, kqv_merged_cont>;
};
Dependencies between operations are compile-time relationships
The compiler can see the full compute graph in one place
Memory planning is constexpr — requirements computed during compilation
template<Dims A, Dims B>
requires(A::dims[1] == B::dims[0])
auto matmul(Tensor<A> a, Tensor<B> b)
-> Tensor<Dims<A::dims[0], B::dims[1]>>
{
}
The runtime version catches a shape mismatch when it runs — if that path runs during testing. The constrained version catches it at compile-time, every time, for free. Shape errors that can be expressed as constraints stop being a class of bug you can ship. It's a narrow guarantee — it covers what you encode and nothing more — but within that scope it's reliable.
Section IV
Worked Example: Division Elimination
The clearest illustration of the approach is integer division, which our Division Elimination whitepaper covers in detail. Integer division costs roughly 20–40 cycles on current GPUs, and tensor indexing performs several divisions per access. On lightweight kernels, that overhead can dominate.
The runtime instinct: division is expensive, so avoid it where convenient and live with it elsewhere.
The compile-time observation: in an inference engine, most divisors are either constants or set once per request. The few that are truly dynamic can be handled separately. So most divisions can be turned into cheaper operations before the kernel runs — including the ones a stock compiler can't optimize, because by the time the divisor reaches the kernel it's no longer a visible constant.
The Transformation
template<uint32_t divisor> struct Div {
constexpr uint32_t operator()(uint32_t value) {
if constexpr (is_pow2(divisor))
return value >> log2(divisor);
else
return magic_divide(value); // multiply-and-shift
}
};
100%Of constant-divisor divisions removed from device code
~15–40×On the isolated division opmicrobenchmark, not end-to-end
~28%Fewer instructions on the measured indexing kernels
These are operation-level and kernel-level figures from indexing-bound microkernels, not whole-model inference throughput — the end-to-end gain depends entirely on how division-bound your kernels are, and for compute-heavy kernels it is small. Hardware, baselines, and per-kernel breakdowns are in the whitepaper.
The pattern generalizes: identify what's actually known ahead of time, encode it in the type, and let the compiler specialize. The division case is just the easiest one to measure.
Section V
Why This Works Now
The pieces of this approach have existed for a while; what's changed is that they've matured enough to use together comfortably.
C++20 consteval & constexpr
Real computation at compile-time. Magic-number parameters and memory layouts can be computed by the compiler rather than at startup.
Mature Template Metaprogramming
Concepts make constraints readable, if constexpr gives zero-cost branching, and fold expressions handle variadic cases without the old boilerplate.
GPU Architecture
Massive parallelism rewards minimal per-thread overhead. L2 cache keeps constant data fast, and intrinsics make multiply-high division practical.
Modern Compilers
Aggressive template instantiation, cross-translation-unit optimization, and dependable constant folding make these patterns viable in production.
Hardware Balance
Arithmetic is cheap; memory bandwidth and control-flow overhead increasingly are not. That shifts the payoff toward eliminating overhead.
Putting It Together
Each of these was usable on its own. The contribution here is mostly assembling them into a consistent way of working rather than inventing any one piece.
Section VI
How to Build This Way
Working Principles
Encode invariants in types. If a property can be expressed as a constraint, expressing it there gets it checked for free.
Compute once. Static constexpr results are calculated during compilation, not recomputed per call.
Prefer compile-time failure. A shape error caught by the compiler never becomes a runtime incident.
Keep the computation visible to the compiler. Opaque boundaries block the optimizations this approach depends on.
Workflow
Typical Cycle
Write → Test → Profile
→ Optimize → Repeat
Compile-Time-First Cycle
Design → Encode → Verify
→ Compile → Deploy
This style front-loads effort into design and type modeling. That's a real cost — the design phase is longer and the types take work to get right. The payoff is that a lot of debugging moves to compile-time, where it's cheaper, and a class of performance surprises disappears. Whether that trade is worth it depends on how stable your problem actually is. For inference engines, it usually is.
Keeping Template Instantiation Under Control
The obvious risk with heavy compile-time work is combinatorial explosion: instantiate a kernel across every configuration dimension and the numbers get ugly fast. The fix is to templatize where it pays and stop where it doesn't — keeping the leaf-level instantiation count small.
The Naive Version — 122,880 instantiations
5 architectures × 4 sizes × 2 devices × 10 batch sizes × 8 sequence lengths × 4 hidden dims × 4 head configs × 2 kv configs × 3 quant types × 2 flash options. Compile time in hours, binary size in gigabytes. Not workable.
What Gets Templated Where
Top level — full templating: architecture selection, quantization strategy, memory layout, workspace sizing, type construction.
Middle level — selective: tensor dimensions where they actually vary, data-type transforms, access patterns.
Leaf level — minimal: input/output types only, not the full configuration space.
~50Top-level configuration variations in Nihilus
~15Leaf-level kernel instantiations
750Total instantiations vs. 122,880 naïve
The Rule of Thumb
Compute liberally near the top of the type hierarchy; instantiate conservatively at the leaves. Most of the practical difficulty in this approach is deciding where that line sits.
Section VII
Anti-Patterns
It's easy to take this idea too far. Things to avoid:
Template metaprogramming for its own sake. The goal is encoding domain knowledge, not computing Fibonacci numbers at compile-time to prove you can.
Over-generalizing. Specialize when specialization buys something; don't build a configuration space you'll never use.
Forcing runtime values to compile-time. Real input data is runtime data. Pretending otherwise is where this approach breaks.
Type machinery that hides intent. If a teammate needs an afternoon to read the types, the abstraction is failing its job.
Treating a string as a diagnostic. Error messages are more useful when they carry the values that caused the failure.
Claiming the compiler's own wins as your own. If a stock -O2 build already does the transformation, you haven't added anything — the value is only in the cases the optimizer can't reach.
And conversely, the parts worth keeping:
Encoding domain knowledge in types
Removing runtime overhead the compiler can resolve instead
Enforcing invariants through constraints
Specializing for cases that are genuinely known ahead of time
Keeping the full computation visible to the compiler
Holding leaf instantiation counts down
Writing diagnostics that carry their values
Section VIII
Where It's Headed
Near-Term
Compile-time auto-tuning. Having the build try several implementations and keep the fastest, rather than tuning by hand.
Richer constraints. Encoding properties like thread-safety or numerical-stability bounds where the type system can carry them.
Cross-kernel optimization. Pipeline-level type analysis to enable kernel fusion and cut down on data movement.
Target-specific specialization. Architecture-aware templating that emits well-suited code per GPU target.
Longer-Term, More Speculative
More checking moves to the compiler, reducing — though not replacing — the runtime testing surface.
Zero-cost abstraction as a checkable property, rather than something you confirm by reading disassembly after the fact.
Whole categories of shape and indexing bugs caught before they ship. Not all bugs — only the ones expressible as constraints.
Optimization decisions made with fuller information, since the compiler can see more of the computation at once.
A Caveat
These are directions, not promises. The further down this list you go, the more it depends on tooling and language features that are still maturing.
Section IX
Getting Involved
More than a library, this is a habit: before reaching for a runtime mechanism, asking whether the relevant fact is already known at compile-time. Often it is.
When you find yourself writing...
if (condition) { }
void* ptr = malloc(size);
virtual void compute() = 0;
static_assert(condition,
"something went wrong");
The right-hand column isn't always the answer — sometimes the runtime version is genuinely simpler or more appropriate. The point is just to make the question a reflex: is this actually a runtime decision, or have I just been treating it like one?
Where Help Is Useful
Theory
Formalizing the patterns, establishing bounds, working out where the guarantees actually hold.
Implementation
Building libraries and kernels, and finding the cases where the approach helps versus where it doesn't.
Tooling
Better compiler diagnostics, analyzers, and anything that lowers the cost of working this way.
If you've ever stared at a profiler wondering why a kernel spends its time on indexing arithmetic, or wished a shape bug had been caught before it shipped, this is aimed at that. It's a way of thinking about where computation belongs — and the answer is more often "compile-time" than habit suggests.
Section X
Guiding Principles
To summarize, the practices this comes down to:
Encode what's known in types — if it can be checked at compile-time, check it there.
Decide early — move decisions out of runtime when the information is already available.
Keep abstractions free — prefer mechanisms with no runtime footprint over ones you hope get optimized away.
Enforce invariants with constraints — let the type system rule out the bugs it can.
Share what works — keep the techniques documented and open so others can use and critique them.
Mind instantiation cost — compute liberally at the top, instantiate sparingly at the leaves.
Make diagnostics carry their values — an error that names what went wrong saves a debugging pass.
In One Line
"Whatever's already known when you write the code, let the compiler act on it."
Appendix A
Further Reading
Foundations
Stroustrup, B. — The Design and Evolution of C++
Alexandrescu, A. — Modern C++ Design
Abrahams, D. & Gurtovoy, A. — C++ Template Metaprogramming
Performance Engineering
Warren, H. S. — Hacker's Delight (Ch. 10, integer division by constants)
Granlund, T. & Montgomery, P. — Division by Invariant Integers using Multiplication
Fog, A. — Optimizing Software in C++
Lemire, D. — Fast Random Integer Generation
Type Theory
Pierce, B. C. — Types and Programming Languages
Martin-Löf, P. — Intuitionistic Type Theory
Our Whitepapers
Compile-Time Division Elimination for Zero-Overhead Tensor Indexing in CUDA Kernels
Order-Agnostic Constexpr Configuration: A Type-Routed Compile-Time Parameter System with Enforced Uniqueness
More to come.
The biggest wins tend to come from moving work to the right time, not from doing the same work faster —deciding at compile-time what doesn't need to be decided at runtime.
Thanks to Claude (Sonnet) for the conversation on Jun 7 2025 that helped clarify how few transformer dimensions actually vary at runtime — the observation that started this.