FlashHadamard: Structured Packed Retrieval Without Codebooks
Thesis
FlashHadamard is a no-codebook vector quantization family for approximate nearest neighbor search. It uses structured block-Hadamard rotation to decorrelate vector coordinates, followed by 4-bit scalar quantization with Gaussian Lloyd-Max levels, packed into nibble-transposed storage for fast ADC scoring via a fused C helper with per-thread top-k selection.
FlashHadamard achieves 87–89% recall@10 at 4 bits/dim on a real 103K×2880D retrieval workload (Gutenberg embeddings), with 8ms query latency and effectively zero metadata overhead. It requires no codebook training, making it fully online and data-oblivious.
FlashHadamard is not a general-purpose KV-cache compression method, not a replacement for SQ8 on quality-sensitive workloads, and not yet compared against the official Google TurboQuant implementation.
Core Contribution
Structured rotation. A seed-derived random permutation + random sign flip + block Fast Walsh-Hadamard Transform (FWHT) decorrelates coordinates in O(d log d) with O(1) metadata (just the seed). This replaces the dense random orthogonal matrix used in the original TurboQuant MSE path, which requires O(d²) metadata storage.
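The rotation step can be sketched in a few lines of NumPy. This is illustrative only: the block size, RNG choice, and permutation layout here are assumptions, not necessarily the repo's exact pipeline.

```python
import numpy as np

def fwht(y):
    """Orthonormal Fast Walsh-Hadamard Transform along the last axis.
    Last-axis length must be a power of two."""
    n = y.shape[-1]
    y = y.copy()
    h = 1
    while h < n:
        y = y.reshape(-1, n // (2 * h), 2, h)
        a = y[:, :, 0, :].copy()
        y[:, :, 0, :] = a + y[:, :, 1, :]
        y[:, :, 1, :] = a - y[:, :, 1, :]
        y = y.reshape(-1, n)
        h *= 2
    return y / np.sqrt(n)

def structured_rotate(x, seed=42, block=256):
    """Seed-derived permutation + sign flips + block FWHT.
    O(d log d) compute, O(1) metadata (just the seed)."""
    d = x.shape[-1]
    assert d % block == 0 and (block & (block - 1)) == 0
    rng = np.random.default_rng(seed)
    perm = rng.permutation(d)
    signs = rng.choice([-1.0, 1.0], size=d)
    z = (x[..., perm] * signs).reshape(-1, block)
    return fwht(z).reshape(x.shape)
```

Because permutation, sign flips, and the normalized FWHT are all orthogonal, the rotation preserves norms exactly, which is what makes it safe for cosine/MSE quantization.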
Group RMS equalization. After rotation, coordinates are grouped (e.g. groups of 16) and each group is scaled by its RMS to normalize variance before quantization. This improves hit@1 without losing recall@10 when the group size is tuned (group_size=16 is the current best on 2880D).
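A minimal sketch of the equalization step, assuming NumPy; the function name and eps handling are placeholders, not the repo's API. The returned per-group scales are what gets folded into the query coefficients at search time.

```python
import numpy as np

def group_rms_equalize(x, group_size=16, eps=1e-12):
    """Scale each contiguous group of coordinates by its RMS so every
    group has unit variance before 4-bit quantization. Returns the
    scaled vector and the per-group scales."""
    g = x.reshape(-1, group_size)
    rms = np.sqrt(np.mean(g * g, axis=1, keepdims=True)) + eps
    return (g / rms).reshape(x.shape), rms.ravel()
```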
Packed nibble ADC. 4-bit codes are packed two per byte and stored in transposed layout (byte-major, row-minor) for cache-friendly sequential access during scoring. Query coefficients are folded with group scales at search time, so the packed scorer is shared across all blockwise variants.
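A sketch of the packing, assuming NumPy; the production layout may interleave bytes differently, but the idea is the same: two 4-bit codes per byte, then a transpose so the scorer streams one byte position across all rows.

```python
import numpy as np

def pack_nibbles_transposed(codes):
    """Pack 4-bit codes (n_rows, d) two per byte, then transpose so
    scoring walks byte positions sequentially across all rows
    (byte-major, row-minor)."""
    n, d = codes.shape
    assert d % 2 == 0 and codes.max() < 16
    lo = codes[:, 0::2].astype(np.uint8)      # even dims -> low nibble
    hi = codes[:, 1::2].astype(np.uint8)      # odd dims  -> high nibble
    packed = lo | (hi << 4)                   # (n, d//2), row-major
    return np.ascontiguousarray(packed.T)     # (d//2, n): byte-major
```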
Fused C helper. A standalone C library (turboquant_packed_adc.c) implements the hot scoring path: per-query byte-table build, 2-byte-fused transposed scorer with 4× row unrolling, optional multi-threaded row sharding, and fused per-thread top-k with merge. No external dependencies.
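The C helper's per-query byte-table idea can be mimicked in NumPy to show the mechanism: each packed byte carries two 4-bit codes, so a 256-entry table per byte position turns scoring into one lookup per two dimensions. The function names and the reconstruction levels below are placeholders, not the library's API.

```python
import numpy as np

def build_byte_tables(q_rot, levels):
    """Per-query lookup tables: tables[j, v] is the partial inner product
    contributed by packed byte value v at byte position j (two 4-bit
    codes). q_rot is the rotated, scale-folded query of length d;
    levels holds the 16 reconstruction levels."""
    d = len(q_rot)
    v = np.arange(256, dtype=np.uint8)
    lo = levels[v & 0x0F]                    # even-dim level per byte value
    hi = levels[v >> 4]                      # odd-dim level per byte value
    q = q_rot.reshape(d // 2, 2)
    return q[:, 0:1] * lo + q[:, 1:2] * hi   # (d//2, 256)

def adc_scores(packed_T, tables):
    """Score all rows; packed_T is (d//2, n_rows) transposed packed codes."""
    scores = np.zeros(packed_T.shape[1])
    for j in range(packed_T.shape[0]):
        scores += tables[j][packed_T[j]]
    return scores
```

The C version fuses two of these byte lookups per step, unrolls four rows at a time, and keeps a per-thread top-k heap, but the arithmetic is identical.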
Canonical Operating Points
Setup: Full Gutenberg, 103210 base vectors (50 held out as queries), 2880 dimensions, cosine metric, k=10, seed=42, 8-thread C helper, Apple M2 Max.
Single-stage (packed-only)
| Lane | hit@1 | recall@10 | p50 ms | bytes/vec | bits/dim |
|---|---|---|---|---|---|
| FlashHadamard-16 (topk) | 80% | 89.4% | 8.7 | 1440 | 4 |
| FlashHadamard-Base (topk) | 86% | 87.8% | 7.9 | 1440 | 4 |
Two-stage (packed shortlist → rerank)
| Config | hit@1 | recall@10 | p50 ms | avg ms | p95 ms | bytes/vec | bits/dim |
|---|---|---|---|---|---|---|---|
| SQ8 rerank M=12 | 94% | 92.8% | 9.2 | 9.4 | 11.0 | 4320 | 12 |
| SQ8 rerank M=20 | 94% | 95.8% | 11.4 | 11.8 | 15.4 | 4320 | 12 |
| FP32 rerank M=20 | 100% | 99.6% | 11.4 | 11.7 | 14.2 | (resident fp32) | - |
Reference baselines
| Method | hit@1 | recall@10 | bytes/vec |
|---|---|---|---|
| float32 exact | 100% | 100% | 11520 |
| SQ8 linear | ~99% | ~99% | 2880 |
| PQ k-means | ~45% | ~67% | 16 + codebooks |
| IVF32 nprobe=4 | 82% | 86.2% | 1444 + centroids |
Recommended serving point
FlashHadamard-16 + SQ8 rerank M=12: best current exhaustive CPU operating point. 94% hit@1, 92.8% recall@10, 9.2ms p50, 4320 bytes/vec (12 bits/dim = 2.7× compression vs float32).
For applications where hit@1 matters most and storage is tight, packed-only FlashHadamard-Base at 4 bits/dim provides 86% hit@1 at 7.9ms.
Latency breakdown
| Stage | p50 ms | Share |
|---|---|---|
| Hadamard transform | 0.15 | 2% |
| C packed scorer (8 threads) | 6.6 | 72% |
| Argpartition / top-k | 0.9 | 10% |
| SQ8 gather + rerank | 0.5 | 5% |
| Python overhead | 1.0 | 11% |
| Total | ~9.2 | 100% |
The packed scorer is the dominant cost (~6.6ms), scanning 103K × 1440 bytes = 149MB per query. This is a consequence of the present exhaustive CPU layout, not a fundamental limit. Shortlist size M has minimal impact on latency (8.5–9.5ms for M=8–20) because the scorer cost is fixed. Rerank overhead is negligible (<0.5ms for M=12).
Robustness
Stable across seeds 42/123/7/999 on vetted 50-query subset. 200-query runs confirm 50-query results within noise (see consumer-plan.md). Earlier numbers in the evidence trail used different setups (non-TopK inner, different holdout splits); the canonical numbers above supersede them.
Naming
- FlashHadamard — the family
- FlashHadamard-16 — group_size=16, best recall@10
- FlashHadamard-Base — no group scaling, best hit@1
Negative Results
| Branch | Result | Reason | Dataset |
|---|---|---|---|
| IVF at 103K | Not recommended | Exhaustive packed topk (8ms) matches or beats all IVF nprobe settings at comparable recall | Gutenberg 103K×2880D |
| Packed block32_dither | Refuted | Per-row dither correction essential; dropping it hurts quality vs non-dithered | Gutenberg 103K×2880D |
| Residual QJL (turboquant_prod) | Refuted | Underperforms simpler MSE path on real code-graph workload at 2–4 bits | Code-graph 1124×768D |
| Tail-similarity memory routing | Refuted | Random block selection outperforms sketch-based routing; similarity selects redundant context, not complementary | 3 Gutenberg novels, Gemma 3 4B |
| Per-dimension whitening | Refuted | Underperforms plain block-Hadamard on recall@10 across 2–4 bits | Code-graph 1124×768D |
| Companding (power β=0.75) | Refuted | Worse than block-Hadamard at all tested bit rates | Code-graph 1124×768D |
| D4 lattice quantization | Refuted | Worse recall + 25× slower encode in Python | Code-graph 1124×768D |
Interpretation
The current evidence supports FlashHadamard as a retrieval quantization family — a way to compress and search high-dimensional vectors without codebook training. It is not yet validated as a general-purpose vector search engine replacement, a KV-cache compression method, or superior to the official TurboQuant implementation.
The strongest practical story is: for embeddings in the 768–2880 dimension range, at 4 bits/dim, FlashHadamard provides 87–89% recall@10 with 8ms query latency and zero training cost. This is competitive with traditional PQ (which requires codebook fitting) and much cheaper in metadata than dense random rotation approaches.
Two-Stage Rerank
The packed quantizer’s errors concentrate at the top-k boundary. A cheap second-stage rerank on a small shortlist fixes most of them.
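A minimal sketch of the two-stage flow, assuming NumPy and a caller-supplied higher-fidelity scorer (SQ8 or fp32); the names here are hypothetical glue, not the evaluator's API.

```python
import numpy as np

def two_stage_search(approx_scores, shortlist_M, k, rerank_fn):
    """Stage 1: approximate packed ADC scores for all rows.
    Stage 2: take the top-M shortlist, rescore with rerank_fn
    (e.g. SQ8-dequantized or fp32 dot products), return top-k."""
    cand = np.argpartition(approx_scores, -shortlist_M)[-shortlist_M:]
    exact = rerank_fn(cand)
    return cand[np.argsort(exact)[::-1][:k]]
```

As long as the true top-k survive into the shortlist, the rerank stage recovers them exactly, which is why a small M suffices when the packed scorer's errors sit near the top-k boundary.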
FP32 rerank (quality ceiling, not honest storage):
| M (shortlist) | hit@1 | recall@10 | p50 ms |
|---|---|---|---|
| no rerank | 80% | 89.4% | 6.8 |
| M=20 | 100% | 99.6% | 11.3 |
| M=50 | 100% | 100% | 23.5 |
SQ8 rerank (honest storage: 12 bits/dim total):
| M | hit@1 | recall@10 | p50 ms | p95 ms | bytes/vec |
|---|---|---|---|---|---|
| 8 | 100% | 78.2% | 10.7 | 19.4 | 4320 |
| 12 | 100% | 96.2% | 11.5 | 20.0 | 4320 |
| 16 | 100% | 97.0% | 12.1 | 21.8 | 4320 |
| 20 | 100% | 97.6% | 12.3 | 23.8 | 4320 |
The sweet spot is M=12 with SQ8 rerank: 94% hit@1, 92.8% recall@10, 8.9ms p50 (with the fused TopK inner), at 4320 bytes/vec (12 bits/dim, 2.7× compression vs float32).
Shortlist size has minimal latency impact (8.5-9.5ms for M=8-20) because the packed scorer cost is fixed at ~6.6ms regardless of M. Reducing M saves rerank time (<0.5ms) but not scorer time.
Latency breakdown (SQ8 rerank M=12 on full 103K):
| Stage | p50 ms | Share |
|---|---|---|
| Hadamard transform | 0.15 | 1% |
| C packed scorer (8 threads) | 6.6 | 89% |
| SQ8 gather + rerank + topk | 0.5 | 7% |
| Python overhead | 0.3 | 3% |
| Total | ~7.5 | 100% |
The packed scorer dominates (6.6ms). It scans 103K × 1440 bytes = 149MB per query. M2 Max theoretical bandwidth floor: ~0.75ms. The 6.6ms includes per-byte LUT lookups, thread sync, and cache effects. Rerank overhead is negligible (<0.5ms for M=12).
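The back-of-envelope behind those figures, as a sketch; the 200 GB/s effective-bandwidth number is an assumption implied by the ~0.75ms floor, not a measured value.

```python
# Scan volume and bandwidth floor for the exhaustive packed scan.
n_rows = 103_210
bytes_per_row = 1440                                # 2880 dims at 4 bits
scan_mb = n_rows * bytes_per_row / 1e6              # ~148.6 MB per query
floor_ms = n_rows * bytes_per_row / 200e9 * 1e3     # ~0.74 ms at 200 GB/s
```

The ~9× gap between the 0.74ms floor and the observed 6.6ms is where the per-byte LUT lookups, thread sync, and cache effects live.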
Fusing rerank into C would save ~0.3ms Python overhead. This is the current floor for exhaustive CPU full-scan on this scorer layout, not a fundamental limit. Paths to materially faster serving:
- Fewer candidates scanned (IVF at 500K+ scale, pre-filtering)
- Fewer bytes per query (lower dim, coarser first stage)
- Different execution substrate (GPU scorer via Metal)
- Smaller shortlist M (if quality holds at M=8)
Note: the 12 bits/dim serving footprint (4-bit packed + 8-bit SQ8 rerank payload) is no longer a pure 4-bit lane. This is a two-tier storage tradeoff: fast packed scan for shortlisting + higher-fidelity SQ8 for boundary disambiguation.
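The two-tier storage arithmetic, spelled out as a sketch; every value follows directly from the bit widths quoted above.

```python
# Two-tier storage footprint for the 2880D workload.
d = 2880
packed_bytes = d * 4 // 8              # 4-bit packed codes -> 1440 bytes
sq8_bytes = d                          # 8-bit SQ8 payload  -> 2880 bytes
total_bytes = packed_bytes + sq8_bytes # 4320 bytes/vec
bits_per_dim = total_bytes * 8 // d    # 12 bits/dim
fp32_bytes = d * 4                     # 11520 bytes/vec uncompressed
compression = fp32_bytes / total_bytes # ~2.67x vs float32
```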
Status
The exhaustive CPU serving path is done enough for the current workload. Further local tweaks (fused C rerank, scorer micro-optimizations) have low expected payoff — the scorer at 6.6ms dominates and is memory-bandwidth constrained.
Open Work
- Sub-3ms serving. Requires a regime change, not local tweaks:
  - GPU scorer (Metal compute pipeline, extending cogni-ml’s infrastructure)
  - IVF prefilter at 500K+ scale (centroid caching ready)
  - New memory layout / execution substrate
- Higher dimensions. Synthetic scaling to 65536D shows tractability, but real recall numbers at very high dim are missing.
- Official TurboQuant comparison. No public implementation was available at time of writing.
- Diversity-based memory routing. The naive similarity routing failed. Coverage/MMR-based selection might succeed but is untested. (Parked.)
- Engine integration. The current evaluator + C helper is a research harness, not a production index path. Integration into pg_sorted_heap as a native quantization mode is the logical next step.
Reproducibility
All results use the repo-owned evaluator and Gutenberg dataset:
```shell
# Exhaustive packed lanes
make bench-turboquant-gutenberg-vetted \
  TURBOQUANT_GUTENBERG_METHODS="turboquant_block16_packed4_topk,turboquant_blockhadamard_packed4_topk"

# IVF sweep with cache
make bench-turboquant-gutenberg-vetted \
  TURBOQUANT_GUTENBERG_METHODS="turboquant_ivf32_block32_packed4" \
  TURBOQUANT_ARGS="--ivf-cache-dir /tmp/ivf_cache --ivf-nprobe 8"

# Screening harness for packed parity
make bench-turboquant-gutenberg-screen
```
Source:
- Evaluator: scripts/bench_turboquant_retrieval.py
- C helper: scripts/turboquant_packed_adc.c
- Screen: scripts/bench_turboquant_packed_screen.py
- Evidence trail: docs/turboquant-consumer-plan.md