NIHILUS

Public Technical Preview · Nihilai Collective Corp.

During a transformer forward pass, remarkably little is truly runtime-mutable.

Most inference engines are built around flexibility. Nihilus is built around certainty.

Cathedral Architecture™

A transformer forward pass is parameterized by thousands of values. Only two of them change from one request to the next and alter the shape of execution:

Batch size
Sequence length

Everything else can be known before inference begins.

Cathedral Architecture moves model topology, tensor layouts, memory plans, and dispatch logic as far upstream as possible — collapsing them into a compact architectural representation that resides in GPU constant memory during execution.

The runtime sees a solved problem, not an active one.
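As a rough illustration of this split, the sketch below fixes a model description at compile time and leaves only batch size and sequence length for runtime. It is not Nihilus code: the struct names, the field set, the example dimensions, and the 2-byte element size are all assumptions made for the example. On a GPU a description like this could be placed in constant memory; here it is plain C++17 so it can be checked on any host compiler.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical compile-time model description (names and fields are illustrative).
struct ModelTopology {
    std::uint32_t layers;
    std::uint32_t hidden_dim;
    std::uint32_t heads;
    std::uint32_t vocab;
};

// Example dimensions only; fixed before inference begins.
inline constexpr ModelTopology kTopology{32, 4096, 32, 32000};

// The two values that remain runtime-mutable.
struct RuntimeShape {
    std::uint32_t batch_size;
    std::uint32_t sequence_length;
};

// Derived quantities are resolved by the compiler, not at runtime.
constexpr std::uint32_t head_dim(const ModelTopology& m) {
    return m.hidden_dim / m.heads;
}

// Bytes for one [batch, seq, hidden] activation tensor, assuming 2-byte elements.
constexpr std::size_t activation_bytes(const ModelTopology& m, RuntimeShape s) {
    return static_cast<std::size_t>(s.batch_size) * s.sequence_length * m.hidden_dim * 2;
}

// Invalid topologies fail the build instead of failing a request.
static_assert(kTopology.hidden_dim % kTopology.heads == 0, "heads must divide hidden_dim");
static_assert(head_dim(kTopology) == 128);
```

In this sketch the only runtime arithmetic left in `activation_bytes` is the product of the two runtime values with constants the compiler already knows.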

One Generation. One Launch.

Conventional engines cross the CPU/GPU boundary on every token.

Nihilus does not.

A generation request enters the GPU once. The decode loop executes there. Sampling, state updates, and generation control remain inside the running kernel until completion.

Per-token host orchestration overhead: structurally zero.
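The difference in control flow can be sketched on the CPU. The example below is a host-side analogue only, with a toy `next_token` rule standing in for a forward pass plus sampling; every name in it is hypothetical and nothing here reflects how Nihilus schedules work on the GPU. It counts how often control returns to the caller under each loop shape.

```cpp
#include <cstdint>
#include <vector>

// Toy stand-in for "forward pass + sampling"; not a model.
inline std::uint32_t next_token(std::uint32_t prev) { return (prev * 7u + 3u) % 100u; }

struct GenerationResult {
    std::vector<std::uint32_t> tokens;
    std::uint32_t boundary_crossings;  // times control returned to the "host"
};

// Conventional shape: the host drives the loop and checks the stop condition.
inline GenerationResult generate_per_token(std::uint32_t start, std::uint32_t eos,
                                           std::uint32_t max_new) {
    GenerationResult r{{}, 0};
    std::uint32_t tok = start;
    for (std::uint32_t i = 0; i < max_new; ++i) {
        tok = next_token(tok);      // imagine: launch, wait, copy the token back
        ++r.boundary_crossings;     // one crossing per generated token
        r.tokens.push_back(tok);
        if (tok == eos) break;      // stop check happens on the host
    }
    return r;
}

// Single-launch shape: loop, sampling, and stop check live inside one call.
inline GenerationResult generate_single_launch(std::uint32_t start, std::uint32_t eos,
                                               std::uint32_t max_new) {
    GenerationResult r{{}, 1};      // the request crosses the boundary once
    std::uint32_t tok = start;
    for (std::uint32_t i = 0; i < max_new; ++i) {
        tok = next_token(tok);
        r.tokens.push_back(tok);
        if (tok == eos) break;      // stop check stays inside the call
    }
    return r;
}
```

Both functions produce the same tokens; only the number of crossings differs, growing with output length in the first and staying at one in the second.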

Memory Traffic as a First-Class Constraint

Many transformer operations exist solely to move bytes.

Reshapes. Permutations. Views. Transposes. Copies.

Nihilus eliminates them. Index transforms for ephemeral operations are composed at compile time into the address arithmetic of adjacent compute operations.

The operations exist in the model graph. They do not exist in the executed kernel.
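A minimal sketch of the idea, assuming a simple shape-and-strides view (the `View2D` type and helper names are invented for this example and are not Nihilus APIs): a transpose becomes a swap of shape and strides, and the consuming operation reads through the composed offset arithmetic, so no transposed copy is ever written.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 2-D view over a flat buffer: shape plus strides, no data.
struct View2D {
    std::size_t rows, cols;
    std::size_t row_stride, col_stride;
    constexpr std::size_t offset(std::size_t r, std::size_t c) const {
        return r * row_stride + c * col_stride;
    }
};

constexpr View2D row_major(std::size_t rows, std::size_t cols) {
    return {rows, cols, cols, 1};
}

// The transpose moves no bytes; it only swaps shape and strides.
constexpr View2D transpose(View2D v) {
    return {v.cols, v.rows, v.col_stride, v.row_stride};
}

// Consumer: sums one row of the view through the composed index arithmetic.
template <std::size_t N>
constexpr int row_sum(const std::array<int, N>& data, View2D v, std::size_t r) {
    int acc = 0;
    for (std::size_t c = 0; c < v.cols; ++c) acc += data[v.offset(r, c)];
    return acc;
}

inline constexpr std::array<int, 6> kData{1, 2, 3, 4, 5, 6};  // 2x3: [[1,2,3],[4,5,6]]
inline constexpr View2D kT = transpose(row_major(2, 3));       // 3x2 view, no copy

// Row 0 of the transpose is column 0 of the original, and so on.
static_assert(row_sum(kData, kT, 0) == 1 + 4);
static_assert(row_sum(kData, kT, 2) == 3 + 6);
```

Because the view is a compile-time constant here, the offsets in `row_sum` reduce to fixed address arithmetic inside the consumer, which is the effect the paragraph above describes.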

Multi-GPU Without the Host

On conventional engines, tensor-parallel all-reduces are coordinated by the CPU.

On Nihilus, cross-GPU communication happens inside the kernel. No host involvement. No host-coordination latency added to each collective.

The same source compiles for single-GPU and multi-GPU configurations. Scale is not a separate architecture.
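One way a single source can serve both configurations is to make the device count a compile-time parameter. The sketch below is a CPU analogue under that assumption: `all_reduce_sum` and its `WorldSize` parameter are hypothetical, and the summation loop merely stands in for a real in-kernel exchange between devices.

```cpp
#include <array>
#include <cstddef>

// Hypothetical: one partial result per device, with the device count fixed at compile time.
template <std::size_t WorldSize>
constexpr std::array<float, WorldSize> all_reduce_sum(std::array<float, WorldSize> partials) {
    if constexpr (WorldSize > 1) {
        float total = 0.0f;
        for (std::size_t i = 0; i < WorldSize; ++i) total += partials[i];  // stand-in for the exchange
        for (std::size_t i = 0; i < WorldSize; ++i) partials[i] = total;   // every rank holds the sum
    }
    // WorldSize == 1: the branch above is discarded, so the value passes through unchanged.
    return partials;
}

static_assert(all_reduce_sum<1>({2.5f})[0] == 2.5f);
```

With `WorldSize == 1` the reduction branch is discarded at compile time, so the single-device instantiation carries no communication code.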

Engineering Discipline

The codebase is warning-clean across seven toolchains, sanitizer-clean, and enforces compile-time dispatch throughout. Several constructs are banned at the preprocessor level — not as style preferences, but as enforcement mechanisms for architectural properties.

What the optimizer can see, the optimizer can optimize.
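Compile-time dispatch in general can be sketched as follows. The `Op` enum and `apply` function are made up for illustration and say nothing about which constructs Nihilus actually uses or bans. The point is that the operation is chosen by a template argument, so each instantiation contains exactly one branch and an unhandled case fails the build.

```cpp
// Hypothetical operation tag, selected at compile time.
enum class Op { add, mul, max };

template <Op K>
constexpr int apply(int a, int b) {
    if constexpr (K == Op::add) {
        return a + b;
    } else if constexpr (K == Op::mul) {
        return a * b;
    } else {
        // Adding a new enumerator without a branch triggers this at build time.
        static_assert(K == Op::max, "unhandled Op");
        return a > b ? a : b;
    }
}

// No runtime switch exists: each call site names its operation statically.
static_assert(apply<Op::add>(2, 3) == 5);
static_assert(apply<Op::max>(2, 3) == 3);
```

Since the selection is visible to the compiler, each call can be inlined and constant-folded rather than routed through a runtime branch or function pointer.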

Lines of engine code	~19,000
CPU build time	~5 seconds
CUDA build time	~15 seconds
CUDA binary size	~1 MB
Supported toolchains	7
Target context length	Up to 131,072 tokens
Target model class	Up to 405B parameters
Models per binary	Multiple
Deployment modes	Interactive & Server

What We Are Not Publishing

Nihilus is a commercial licensing product. The public goal is transparency regarding architectural direction, not disclosure of proprietary implementation.

Execution scheduling internals
Compile-time architecture generation
Memory orchestration systems
Communication architecture
Quantization implementation details
Performance-critical GPU execution strategies

How much of transformer execution can disappear before runtime begins?

Everything in Nihilus follows from that question.

Philosophy

Cathedral Architecture — the methodology behind the engine.