During a transformer forward pass, remarkably little is truly runtime-mutable.
Most inference engines are built around flexibility. Nihilus is built around certainty.
Public Technical Preview · Nihilai Collective Corp.
During a transformer forward pass, remarkably little is truly runtime-mutable.
Most inference engines are built around flexibility. Nihilus is built around certainty.
A transformer forward pass contains thousands of values. Only two of them fundamentally change the shape of execution:
Everything else can be known before inference begins.
Cathedral Architecture moves model topology, tensor layouts, memory plans, and dispatch logic as far upstream as possible — collapsing them into a compact architectural representation that resides in GPU constant memory during execution.
The runtime sees a solved problem, not an active one.
Conventional engines cross the CPU/GPU boundary on every token.
Nihilus does not.
A generation request enters the GPU once. The decode loop executes there. Sampling, state updates, and generation control remain inside the running kernel until completion.
Per-token host orchestration overhead: structurally zero.
Many transformer operations exist solely to move bytes.
Reshapes. Permutations. Views. Transposes. Copies.
Nihilus eliminates them. Index transforms for ephemeral operations are composed at compile time into the address arithmetic of adjacent compute operations.
The operations exist in the model graph. They do not exist in the executed kernel.
On conventional engines, tensor-parallel all-reduces are coordinated by the CPU.
On Nihilus, cross-GPU communication happens inside the kernel. No host involvement. No collective latency tax.
The same source compiles for single-GPU and multi-GPU configurations. Scale is not a separate architecture.
The codebase is warning-clean across seven toolchains, sanitizer-clean, and enforces compile-time dispatch throughout. Several constructs are banned at the preprocessor level — not as style preferences, but as enforcement mechanisms for architectural properties.
What the optimizer can see, the optimizer can optimize.
| CPU build time | ~5 seconds |
| CUDA build time | ~15 seconds |
| CUDA binary size | ~1 MB |
| Supported toolchains | 7 |
| Target context length | Up to 131,072 tokens |
| Target model class | Up to 405B parameters |
| Models per binary | Multiple |
| Deployment modes | Interactive & Server |
Nihilus is a commercial licensing product. The public goal is transparency regarding architectural direction, not disclosure of proprietary implementation.
How much of transformer execution can disappear before runtime begins?
Everything in Nihilus follows from that question.