rtc-digit-count is an algorithm for counting the number of
decimal digits in an unsigned 32- or 64-bit integer. It eliminates the
log₂ → log₁₀ approximation step used by every prior
state-of-the-art approach, replacing it with a direct lookup keyed on the value's
leading-zero count.
Counting decimal digits is a load-bearing primitive in serialization. Every integer written
to JSON, every number formatted to a string, every printf-style numeric conversion hits this
code path. Shaving instructions off this kernel propagates through every workload that
emits numbers as text — which is most of them.
Why
The problem reduces to: given x, find the smallest n such that
x < 10ⁿ. The established techniques — Lemire's and the prior fast-digit-count —
approximate log₁₀(x) from log₂(x) using a magic-constant multiplication
((9 * log2) >> 5, (77 * log2) >> 8), then correct
the off-by-one error with a single threshold comparison.
Both halves of that pipeline are paid for on every call: the multiply, the shift,
and the correction.
rtc-digit-count observes that the multiplication is unnecessary. The leading-zero count
already partitions the integer space into log₂ bins,
and the digit count for each bin is a constant known at compile time. A 33-byte (uint32)
or 65-byte (uint64) lookup table maps the leading-zero count directly to the digit count.
The only correction needed is the same threshold comparison the prior approaches already
paid for — but now without the multiply in front of it.
Zero multiplication. Zero shift. One LZCNT, one L1 load,
one comparison, one branchless add.
The 32-bit path is identical in shape, with a 33-byte LUT and 11-entry threshold table.
Both tables fit comfortably inside a single L1 cache line and stay hot across
any realistic workload.
CMake pulls BenchmarkSuite and rt-ut
automatically via FetchContent. Requires C++20.
Correctness
Every benchmark run is preceded by a bit-exact validation pass:
each candidate algorithm processes the same input set and its output is compared element-by-element
against a reference implementation. Any divergence aborts the run with the exact failing input
printed. No algorithm is benchmarked before it is proven correct on that machine, with that
compiler, on that exact set of random inputs.
Validation runs across the full
Windows MSVC, Ubuntu GCC, Ubuntu Clang, macOS Clang, and macOS GCC
matrix on every push.
Benchmark Methodology
Throughput is measured in megabytes processed per second (MB/s),
benchmarked against Lemire's digit-count and the prior
fast-digit-count approach on identical hardware.
The harness uses adaptive epoch sampling: iterations begin at
60 and double each epoch (60 → 120 → 240 → …) up to a maximum of 1,200 iterations per epoch.
Each epoch evaluates a trailing window of max(iterations / 10, 10) samples, capped
at 100,000. Convergence requires RSE < 2.5% AND
mean shift < 1% epoch-over-epoch simultaneously — the
first epoch satisfying both conditions is retained as the canonical result.
If convergence is never reached before 5 seconds elapse or
the iteration cap is hit, the result is marked non-converged
and excluded from all rankings — only converged results participate in win/tie/loss tallying.
Statistical ties are detected via Welch's t-test on
Bessel-corrected variance, so a small delta inside the noise
floor doesn't get dressed up as a victory.
The output of each call is consumed through do_not_optimize_away()
so the compiler can't eliminate the work, and the random input vectors are regenerated per iteration
to prevent branch predictors and data caches from learning the input distribution.
Tests cover both uint32 and uint64 at batch sizes of
1, 10, 100, 1,000, 10,000, and 100,000 — spanning the per-call regime (where memory latency
dominates) through the large-batch regime (where instruction throughput dominates). The CPU, OS,
and compiler for each run are detected at build time and recorded in that platform's result file.