NIHILUS

Public Technical Preview  ·  Nihilai Collective Corp.

During a transformer forward pass, remarkably little is truly runtime-mutable.

Most inference engines are built around flexibility. Nihilus is built around certainty.

Cathedral Architecture™

A transformer forward pass contains thousands of values. Only two of them fundamentally change the shape of execution:

Everything else can be known before inference begins.

Cathedral Architecture moves model topology, tensor layouts, memory plans, and dispatch logic as far upstream as possible — collapsing them into a compact architectural representation that resides in GPU constant memory during execution.

The runtime sees a solved problem, not an active one.
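A minimal C++17 sketch of what "known before inference begins" could look like for a memory plan. The names ModelShape, MemoryPlan, and plan_for are hypothetical and do not come from Nihilus; the dimensions are placeholders. Every buffer offset is computed by the compiler, so the runtime only reads a finished table. In a CUDA build a table like this could be placed in `__constant__` memory, which is not shown here.

```cpp
#include <array>
#include <cstddef>

// Hypothetical model description; these names are illustrative only.
struct ModelShape {
    std::size_t hidden;
    std::size_t ffn;
    std::size_t bytes_per_elem;
};

// Activation buffers one layer needs for a single token, in a fixed order.
enum Buffer : std::size_t { kNormOut, kAttnOut, kFfnUp, kFfnDown, kBufferCount };

struct MemoryPlan {
    std::array<std::size_t, kBufferCount> offset{};
    std::size_t total = 0;
};

constexpr std::size_t align_up(std::size_t n, std::size_t a) {
    return (n + a - 1) / a * a;
}

// Lays the buffers out back to back, each aligned to 256 bytes.
// Evaluated entirely at compile time when given a constant shape.
constexpr MemoryPlan plan_for(ModelShape m) {
    const std::array<std::size_t, kBufferCount> elems{m.hidden, m.hidden, m.ffn, m.hidden};
    MemoryPlan p{};
    std::size_t cursor = 0;
    for (std::size_t i = 0; i < kBufferCount; ++i) {
        p.offset[i] = cursor;
        cursor += align_up(elems[i] * m.bytes_per_elem, 256);
    }
    p.total = cursor;
    return p;
}

// Placeholder dimensions, assumed for illustration.
inline constexpr ModelShape kShape{4096, 11008, 2};
inline constexpr MemoryPlan kPlan = plan_for(kShape);

// The plan is checked by the compiler, not discovered at runtime.
static_assert(kPlan.offset[kNormOut] == 0);
static_assert(kPlan.offset[kFfnUp] == 2 * 8192);
```

Because kPlan is a constant expression, a mistake in the layout becomes a build failure rather than a runtime fault.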

One Generation. One Launch.

Conventional engines cross the CPU/GPU boundary on every token.

Nihilus does not.

A generation request enters the GPU once. The decode loop executes there. Sampling, state updates, and generation control remain inside the running kernel until completion.

Per-token host orchestration overhead: structurally zero.
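The contrast can be sketched on the host in plain C++. This is a simulation under stated assumptions, not the Nihilus kernel: toy_next_token stands in for a full forward pass plus sampling, and boundary_crossings simply counts how often control returns to the caller. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for forward pass + sampling: maps the previous token to the next.
inline std::uint32_t toy_next_token(std::uint32_t prev) {
    return (prev * 7u + 3u) % 11u;
}

struct GenerationResult {
    std::vector<std::uint32_t> tokens;
    std::size_t boundary_crossings = 0;  // times control returned to the host
};

// Conventional shape: the host drives every step, one launch per token.
inline GenerationResult generate_host_driven(std::uint32_t first, std::uint32_t eos,
                                             std::size_t max_tokens) {
    GenerationResult r;
    std::uint32_t tok = first;
    while (r.tokens.size() < max_tokens) {
        tok = toy_next_token(tok);  // launch and synchronize, once per token
        ++r.boundary_crossings;
        r.tokens.push_back(tok);
        if (tok == eos) break;      // stop decision made on the host
    }
    return r;
}

// Persistent shape: one entry; the step, the stop check, and the state
// update all run inside the callee before control returns.
inline GenerationResult generate_single_launch(std::uint32_t first, std::uint32_t eos,
                                               std::size_t max_tokens) {
    GenerationResult r;
    r.boundary_crossings = 1;
    r.tokens.reserve(max_tokens);
    std::uint32_t tok = first;
    for (std::size_t i = 0; i < max_tokens; ++i) {
        tok = toy_next_token(tok);
        r.tokens.push_back(tok);
        if (tok == eos) break;
    }
    return r;
}

// Usage: both shapes produce the same tokens; only the crossing count differs.
inline void example_generation() {
    const auto a = generate_host_driven(1, 0, 32);
    const auto b = generate_single_launch(1, 0, 32);
    assert(a.tokens == b.tokens);
    assert(a.boundary_crossings == a.tokens.size());
    assert(b.boundary_crossings == 1);
}
```

The output is identical in both cases. What changes is that the per-token cost of returning to the host disappears from the second shape.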

Memory Traffic as a First-Class Constraint

Many transformer operations exist solely to move bytes.

Reshapes. Permutations. Views. Transposes. Copies.

Nihilus eliminates them. The index transforms of these layout-only operations are composed at compile time into the address arithmetic of adjacent compute operations.

The operations exist in the model graph. They do not exist in the executed kernel.
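One way to picture this, as a hedged C++17 sketch rather than the engine's actual mechanism: a transpose becomes a type-level swap of extents and strides, and the matrix multiply that consumes it reads through the resulting index function. Layout, RowMajor, Transposed, and matmul are illustrative names, not Nihilus identifiers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Layout known at compile time: extents and strides are template parameters.
template <std::size_t Rows, std::size_t Cols, std::size_t RowStride, std::size_t ColStride>
struct Layout {
    static constexpr std::size_t rows = Rows;
    static constexpr std::size_t cols = Cols;
    static constexpr std::size_t index(std::size_t r, std::size_t c) {
        return r * RowStride + c * ColStride;
    }
};

template <std::size_t Rows, std::size_t Cols>
using RowMajor = Layout<Rows, Cols, Cols, 1>;

// Transpose as a type-level transform: swap extents and strides, move no bytes.
template <class L>
struct TransposeOf;
template <std::size_t R, std::size_t C, std::size_t RS, std::size_t CS>
struct TransposeOf<Layout<R, C, RS, CS>> {
    using type = Layout<C, R, CS, RS>;
};
template <class L>
using Transposed = typename TransposeOf<L>::type;

// C = A * B, with each operand read through its layout's index function.
// The result is written row-major.
template <class LA, class LB>
std::array<float, LA::rows * LB::cols> matmul(const float* a, const float* b) {
    static_assert(LA::cols == LB::rows, "inner dimensions must match");
    std::array<float, LA::rows * LB::cols> c{};
    for (std::size_t i = 0; i < LA::rows; ++i) {
        for (std::size_t j = 0; j < LB::cols; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < LA::cols; ++k) {
                acc += a[LA::index(i, k)] * b[LB::index(k, j)];
            }
            c[i * LB::cols + j] = acc;
        }
    }
    return c;
}

// Usage: identity times W^T, where W is stored row-major as 3x2.
// No transposed copy of W is ever materialized.
inline void example_fused_transpose() {
    const float a[] = {1.0f, 0.0f, 0.0f, 1.0f};
    const float w[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
    const auto c = matmul<RowMajor<2, 2>, Transposed<RowMajor<3, 2>>>(a, w);
    assert(c[1] == 3.0f);  // W^T[0][1] == W[1][0]
}
```

The transpose appears in the source as a type, and in the generated code only as different multipliers in the load address.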

Multi-GPU Without the Host

On conventional engines, tensor-parallel all-reduces are coordinated by the CPU.

On Nihilus, cross-GPU communication happens inside the kernel. No host involvement. No collective latency tax.

The same source compiles for single-GPU and multi-GPU configurations. Scale is not a separate architecture.
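A rough illustration of "the same source" serving both configurations, assuming the device count is a compile-time parameter. Devices are simulated here as array slots in ordinary C++; a real in-kernel exchange between GPUs is not shown, and sharded_dot and all_reduce_sum are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Sums per-device partial results so every device ends up with the total.
// With Devices == 1 the body is discarded at compile time, so the
// single-device instantiation contains no reduction code at all.
template <std::size_t Devices, std::size_t N>
void all_reduce_sum(std::array<std::array<float, N>, Devices>& partial) {
    if constexpr (Devices > 1) {
        for (std::size_t i = 0; i < N; ++i) {
            float total = 0.0f;
            for (std::size_t d = 0; d < Devices; ++d) total += partial[d][i];
            for (std::size_t d = 0; d < Devices; ++d) partial[d][i] = total;
        }
    }
}

// Tensor-parallel dot product: each simulated device owns K / Devices
// contiguous elements, computes a partial sum, then joins the all-reduce.
template <std::size_t Devices, std::size_t K>
float sharded_dot(const std::array<float, K>& x, const std::array<float, K>& w) {
    static_assert(Devices > 0 && K % Devices == 0, "K must split evenly across devices");
    constexpr std::size_t shard = K / Devices;
    std::array<std::array<float, 1>, Devices> partial{};
    for (std::size_t d = 0; d < Devices; ++d) {
        for (std::size_t k = 0; k < shard; ++k) {
            partial[d][0] += x[d * shard + k] * w[d * shard + k];
        }
    }
    all_reduce_sum(partial);
    return partial[0][0];
}

// Usage: one function template, instantiated for one and for two devices.
inline void example_sharded_dot() {
    const std::array<float, 4> x{1.0f, 2.0f, 3.0f, 4.0f};
    const std::array<float, 4> w{1.0f, 1.0f, 1.0f, 1.0f};
    assert(sharded_dot<1>(x, w) == 10.0f);
    assert(sharded_dot<2>(x, w) == 10.0f);
}
```

The point of the sketch is only the shape: scale is a template argument, and the single-device case pays nothing for the multi-device path.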

Engineering Discipline

The codebase is warning-clean across seven toolchains, sanitizer-clean, and enforces compile-time dispatch throughout. Several constructs are banned at the preprocessor level — not as style preferences, but as enforcement mechanisms for architectural properties.

What the optimizer can see, the optimizer can optimize.
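Compile-time dispatch in general can be sketched as below. This is a generic C++17 illustration, not a description of which constructs Nihilus bans or how it dispatches; Activation, activate, and apply are hypothetical names. The choice of operation is a template argument, so each instantiation contains exactly one code path and the inner loop has no branch on the operation kind.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

enum class Activation { Relu, Silu };

// The activation is selected by a template parameter. The untaken branch
// is discarded during instantiation, so nothing is decided at runtime.
template <Activation A>
float activate(float x) {
    if constexpr (A == Activation::Relu) {
        return x > 0.0f ? x : 0.0f;
    } else {
        return x / (1.0f + std::exp(-x));  // SiLU: x * sigmoid(x)
    }
}

// Applies the chosen activation in place. The loop body is a direct,
// inlinable call with no switch and no function pointer.
template <Activation A>
void apply(float* v, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) v[i] = activate<A>(v[i]);
}

// Usage: the operation is fixed where the template is instantiated.
inline void example_dispatch() {
    float v[] = {-1.0f, 2.0f};
    apply<Activation::Relu>(v, 2);
    assert(v[0] == 0.0f && v[1] == 2.0f);
}
```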

Lines of engine code: ~19,000
CPU build time: ~5 seconds
CUDA build time: ~15 seconds
CUDA release binary size: ~1 MB
Supported toolchains: 7
Target context length: up to 131,072 tokens
Target model class: up to 405B parameters
Models per binary: multiple
Deployment modes: interactive and server

What We Are Not Publishing

Nihilus is a commercially licensed product. The goal of this public preview is transparency about architectural direction, not disclosure of proprietary implementation.

How much of transformer execution can disappear before runtime begins?

Everything in Nihilus follows from that question.
