FlashHadamard: Structured Packed Retrieval Quantizer + Selective Memory
1. What FlashHadamard Is
FlashHadamard is a no-codebook vector quantization family built on structured block-Hadamard rotation with packed 4-bit nibble ADC scoring. It achieves 91% recall@10 at 5.7ms on a 103K x 2880D real retrieval workload with effectively zero metadata overhead, approaching SQ8 quality at one-eighth the storage. Because the rotation is seed-derived and the scalar quantizer is data-oblivious, no codebook training is needed, so the method is fully online; the packed transposed scoring path with a fused C top-k keeps query latency low.
Core pipeline
vector → normalize → block-Hadamard rotation → group RMS equalize → 4-bit scalar quantize → pack nibbles → transpose
At query time:
query → rotate → fold group scales into coeffs → packed ADC score (C helper) → fused top-k
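The encode pipeline above can be sketched in plain Python. Everything here is a simplified stand-in: the permutation/signs/FWHT rotation matches the seed-derived O(1)-metadata design, but a uniform 4-bit quantizer replaces the Gaussian Lloyd-Max codebook, and the function names are illustrative, not repo API.

```python
import math
import random

def fwht(x):
    """Fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)          # normalize so the transform is orthogonal
    return [v * s for v in x]

def rotate(vec, seed, block=16):
    """Seed-derived permutation + random signs, then per-block FWHT.

    Only the integer seed needs to be stored: O(1) metadata vs the O(d^2)
    dense random rotation of the original TurboQuant."""
    rng = random.Random(seed)
    perm = list(range(len(vec)))
    rng.shuffle(perm)
    x = [vec[p] * rng.choice((-1.0, 1.0)) for p in perm]
    out = []
    for i in range(0, len(x), block):
        out.extend(fwht(x[i:i + block]))
    return out

def quantize4(rotated, group=16):
    """Group RMS equalize, then uniform 4-bit codes.

    Uniform levels are a stand-in for the Gaussian Lloyd-Max quantizer."""
    codes, scales = [], []
    for i in range(0, len(rotated), group):
        g = rotated[i:i + group]
        rms = math.sqrt(sum(v * v for v in g) / len(g)) or 1.0
        scales.append(rms)
        for v in g:
            # map roughly [-2*rms, 2*rms] onto the 16 code levels, clamped
            c = int(round((v / rms + 2.0) / 4.0 * 15.0))
            codes.append(max(0, min(15, c)))
    return codes, scales
```

At query time the same rotate is applied to the query and the per-group scales are folded into the ADC lookup coefficients, so the database side stays pure 4-bit codes.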
What makes it different from original TurboQuant
| Axis | Original TurboQuant | FlashHadamard |
|---|---|---|
| Core math | PolarQuant + 1-bit QJL residual | structured block-Hadamard + scalar quant + packed ADC |
| Rotation | random orthogonal (dense, O(d^2) metadata) | seed-derived permutation + signs + FWHT (O(1) metadata) |
| Residual correction | explicit 1-bit QJL stage | none (refuted on our workloads; see consumer-plan.md) |
| Quantizer | Gaussian Lloyd-Max | Gaussian Lloyd-Max (shared) |
| Scoring | not focused on packed kernel path | packed nibble ADC with transposed layout + fused C top-k |
| Main target | KV-cache + vector search | retrieval / ANN-style packed search + KV memory (proposed) |
| Evidence | paper/blog claims | repo-owned measurements on Gutenberg + code-graph |
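The packed nibble ADC row in the table can be illustrated with a minimal sketch (hypothetical helper names; the real path is the transposed C kernel in scripts/turboquant_packed_adc.c):

```python
def pack_nibbles(codes):
    """Two 4-bit codes per byte: even index -> low nibble, odd -> high nibble."""
    assert len(codes) % 2 == 0
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

def adc_score(packed, luts):
    """Asymmetric distance computation: luts[d][c] holds the query-side
    contribution of code c at dimension d (group scales already folded in)."""
    s = 0.0
    for i, byte in enumerate(packed):
        s += luts[2 * i][byte & 0x0F] + luts[2 * i + 1][byte >> 4]
    return s
```

The production kernel additionally transposes the packed codes so each query LUT entry is reused across many database vectors, and fuses top-k selection into the same C loop.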
Family naming
- FlashHadamard-16 (block16_packed4_topk): group_size=16, best combined quality on Gutenberg (100% hit@1, 91.0% recall@10, 5.7ms)
- FlashHadamard-Base (blockhadamard_packed4_topk): no group scaling, highest recall@10 on some workloads (91.2%)
2. Retrieval Operating Points
Vetted Gutenberg (103260 x 2880D, cosine, 4 bits, k=10):
| Lane | hit@1 | recall@10 | p50 ms | metadata | role |
|---|---|---|---|---|---|
| FlashHadamard-16 (topk) | 100/98% | 91.0% | 5.7 | 0.8 KB | fastest packed |
| FlashHadamard-Base (topk) | 98% | 91.2% | 6.2 | 0.1 KB | highest recall packed |
| block32_dither (dense) | 100% | 91.8% | 17.2 | 0.4 KB | best quality (not packable) |
| SQ8 linear | 98.6% | 99.3% | - | 0 | quality ceiling |
| PQ k-means (16 sub) | 45.3% | 66.6% | - | codebooks | baseline |
Robustness: stable across seeds 42/123/7/999. 200-query runs confirm 50-query vetted results within noise.
Comparisons against other families:
| Family | Strength | Weakness | vs FlashHadamard |
|---|---|---|---|
| SQ8 linear | simple, strong quality | 8x more bytes | FlashHadamard wins on compression + packed latency |
| PQ / IVF-PQ | mature, codebook-driven | training overhead, metadata | FlashHadamard is no-codebook, fully online |
| RaBitQ | very compact, fast | approximation error | FlashHadamard has better recall at 4-bit |
| Original TurboQuant | stronger theory, residual | not our implementation | we are retrieval-engineering-heavy |
3. KV Memory Architecture (Proposed)
Problem
LLM session KV cache grows linearly with context length:
- Qwen 3.5 9B at 8K context: ~2GB FP16 KV cache
- At 32K context: ~8GB — exceeds M2 Max unified memory budget
- cogni-ml already hits Metal memory pressure at 8GB per question
Key observation: most of a long session’s KV cache has already been “read through” — the information it carried is absorbed into the hidden state of subsequent tokens. The full cache is rarely needed simultaneously.
Architecture: Selective Memory
This is NOT a “compressed KV cache” or “KV virtualization”: it is retrieval-augmented selective memory with pg_sorted_heap backing, routing at the text/token level, not the KV tensor level.
┌─────────────────────────────────────────┐
│ HOT TIER (GPU RAM) │
│ - Last N tokens, exact FP16 KV │
│ - Standard llama.cpp attention │
│ - Size: configurable (e.g. last 2K tok) │
│ - Always resident, never evicted │
└─────────────┬───────────────────────────┘
│ overflow (oldest blocks spill)
┌─────────────▼───────────────────────────┐
│ WARM TIER (pg_sorted_heap) │
│ Per block (e.g. 256 tokens): │
│ - Routing sketch: FlashHadamard │
│ on mean-key vector (4-bit, dim=64) │
│ - Payload: SQ8 compressed KV block │
│ - Index: session_id + layer + block_seq │
│ Retrieval: │
│ - Query: current attention state │
│ - FlashHadamard sketch search │
│ - Fetch top-K relevant blocks only │
└─────────────┬───────────────────────────┘
│ eviction (TTL / LRU)
┌─────────────▼───────────────────────────┐
│ COLD TIER (disk / evicted) │
│ - Oldest blocks, rarely accessed │
│ - Can be reloaded if session resumes │
└─────────────────────────────────────────┘
Query flow
- Current tokens attend to HOT tier (standard attention, no change)
- If extended context is needed:
  a. Compute routing query from current attention state
  b. FlashHadamard sketch search on WARM tier blocks
  c. Fetch top-K relevant blocks from pg_sorted_heap
  d. Decompress SQ8 -> FP16, inject into attention window
- Expected warm-tier hit rate: < 5% of stored tokens per query
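Steps (b) and (c) of the flow above reduce to a small sketch-ranking problem. A minimal sketch, assuming the FlashHadamard encoder from section 1 is available as a `quantize` callable (all names hypothetical):

```python
def mean_key_sketch(block_keys, quantize):
    """Routing sketch for a warm-tier block: quantized mean of its key vectors
    (the 4-bit, dim=64 sketch from the WARM-tier diagram)."""
    d = len(block_keys[0])
    mean = [sum(k[i] for k in block_keys) / len(block_keys) for i in range(d)]
    return quantize(mean)

def top_k_blocks(query_sketch, block_sketches, k=5):
    """Rank warm-tier blocks by inner product against the routing query sketch."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    order = sorted(range(len(block_sketches)),
                   key=lambda i: -dot(query_sketch, block_sketches[i]))
    return order[:k]
```

Only the winning block IDs go on to the pg_sorted_heap fetch, which is why the expected warm-tier touch rate stays small.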
Why pg_sorted_heap
- Already runs alongside Cogniformerus on the same PG instance
- Clustered heap: sequential scan for block retrieval
- SQ8 compression already implemented
- HNSW index available for sketch-based routing
- Session isolation via table partitioning or session_id column
- TTL eviction via standard PG maintenance
Minimal llama.cpp integration
Phase A — Exact offload (no routing):
On KV overflow:
serialize oldest block → INSERT INTO kv_blocks(session_id, layer, block_seq, kv_data)
On context window shift:
if block needed: SELECT kv_data FROM kv_blocks WHERE session_id=? AND layer=? AND block_seq=?
decompress → inject into KV cache slots
Phase B — Routed recall:
On KV overflow:
same as Phase A, plus: compute block sketch → store alongside payload
On extended context query:
compute routing query from current hidden state
SELECT kv_data FROM kv_blocks
WHERE session_id=? AND layer=?
ORDER BY sketch <-> query_sketch LIMIT k
decompress top-K → inject into attention
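The Phase B flow can be mocked end to end with an in-memory stand-in for the kv_blocks table (the class and its methods are illustrative, not the real pgshard or pg_sorted_heap API):

```python
class WarmTier:
    """In-memory stand-in for the kv_blocks table backed by pg_sorted_heap."""

    def __init__(self):
        self.rows = []  # (session_id, layer, block_seq, sketch, payload)

    def spill(self, session_id, layer, block_seq, sketch, payload):
        """Phase B overflow path: store the payload with its routing sketch."""
        self.rows.append((session_id, layer, block_seq, sketch, payload))

    def recall(self, session_id, layer, query_sketch, k):
        """Mock of: ORDER BY sketch <-> query_sketch LIMIT k."""
        hits = [r for r in self.rows if r[0] == session_id and r[1] == layer]
        hits.sort(key=lambda r: sum((a - b) ** 2
                                    for a, b in zip(r[3], query_sketch)))
        return [r[4] for r in hits[:k]]
```

Swapping the list for a real table keeps the call sites unchanged, which is the point of prototyping the routing logic before touching llama.cpp.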
cogni-ml hook point
In ML::LLM::Context (cogni-ml src/ml/llm/llama.cr):
- After encode()/decode(): check KV cache size
- If it exceeds the hot-tier budget: call the spill callback
- Spill callback: serialize block, write to PG via Crystal pgshard
- On cache miss: fetch callback, inject via llama_kv_cache_seq_cp
Open questions
- Block sketch quality: Does mean-key FlashHadamard sketch actually correlate with attention relevance? Needs experiment.
- Round-trip latency: Can pg_sorted_heap serve 32KB blocks in < 5ms? Gate for interactive use.
- Layer coupling: Should routing be per-layer or shared across layers? Per-layer is safer but 12x more sketches.
- Warm-up cost: First query after cold start needs to build routing index. Acceptable if session is long.
Experiment plan
Experiment 0: pg_sorted_heap round-trip latency — PASS
Measured on local containerized PG (pg_sorted_heap 0.13.dev, port 30432):
| config | heap | read-id p50 | read+decode p50 | sketch top-5 p50 | verdict |
|---|---|---|---|---|---|
| 1K blocks, 32KB | plain | 0.72ms | 0.78ms | 1.88ms | PASS |
| 1K blocks, 32KB | sorted_heap | 0.69ms | 0.78ms | 1.99ms | PASS |
| 1K blocks, 128KB | plain | 1.60ms | 1.92ms | 5.49ms | PASS |
| 1K blocks, 128KB | sorted_heap | 1.51ms | 1.85ms | 5.62ms | PASS |
| 10K blocks, 32KB | plain | 0.70ms | 0.77ms | 2.21ms | PASS |
| 10K blocks, 32KB | sorted_heap | 0.69ms | 0.81ms | 2.27ms | PASS |
Key findings:
- exact read-by-id + decode comfortably under 1ms at 32KB, under 2ms at 128KB
- no degradation from 1K to 10K blocks on exact reads (0.69-0.72ms stable)
- sorted_heap performs identically to plain heap (no overhead)
- sketch search scales mildly with block count (1.9ms→2.3ms from 1K to 10K at 32KB)
- all gates pass for 32KB blocks; 128KB sketch search is marginal (5.5ms vs 8ms gate)
Conclusion: pg_sorted_heap is a viable warm-tier backing store for KV blocks. 32KB blocks (256 tokens) are the recommended payload size. Script: scripts/bench_kv_offload_exp0.py
Experiment 1A: Embedding-proxy routing quality — PASS
Tested on 4 Gutenberg novels (1K-2K chunks each, block_size=8, k=5):
- FlashHadamard sketch: 65-70% mean overlap with ground truth top-5
- Random baseline: 0-4%
- Recency-only: 0-11%
- Signal ~15x above random, consistently above recency
Caveat: ground truth is exact embedding cosine similarity, NOT actual KV-cache attention weights. This is a necessary but not sufficient condition for routing quality. Script: scripts/bench_kv_routing_exp1a.py
Experiment 1B: LLM-closer routing quality (next)
Goal: validate routing with a ground-truth signal closer to actual autoregressive model behavior.
Approach: for a long text processed by an autoregressive LLM (Qwen 3.5 9B via cogni-ml llama.cpp bindings), measure whether removing/restoring old context blocks affects generation quality. Specifically:
- process a long text (~2K tokens) through the LLM
- at each evaluation point, compare perplexity on the next N tokens with:
  a. full context (baseline)
  b. truncated to last-M tokens only (recency-only)
  c. last-M tokens + top-k blocks retrieved by FlashHadamard sketch
- if (c) is materially closer to (a) than (b), the routing thesis holds at the model level, not just embedding level
Gate:
- PASS: sketch-augmented perplexity materially closer to full context than recency-only
- SOFT: sketch helps but margin is small
- FAIL: no meaningful perplexity improvement from sketch-routed blocks
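The gate can be made quantitative as a recovery fraction. This is one plausible definition, consistent with the “% of context gap closed” numbers reported in the results; the repo scripts may compute it differently:

```python
def gap_recovery(ppl_full, ppl_recency, ppl_routed):
    """Fraction of the recency-induced perplexity gap closed by routed blocks.

    1.0 means sketch-routed context fully recovers full-context quality;
    0.0 means it is no better than the recency-only truncation."""
    return (ppl_recency - ppl_routed) / (ppl_recency - ppl_full)
```

With the Exp 1B numbers (full 12.47, recency 31.51, routed 12.48) this yields roughly 0.999, matching the reported 99.9% recovery.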
Result (Gemma 3 4B, War and Peace excerpt, block_size=64, k=2): PASS
- Full context (618 tokens): ppl=12.47
- Recency only (128 tokens): ppl=31.51
- Sketch + tail (256 tokens): ppl=12.48
- Recovery: 99.9% of context gap closed using 41% of tokens
- Script: cogni-ml/bin/kv_routing_exp1b.cr
- Adversary note: Exp 1B uses block-text embedding similarity as the routing signal, not actual serialized KV-block sketches. This is a strong routing proxy (proven by the perplexity recovery), but still not final proof that sketches computed directly from KV tensors would perform identically. The gap between “text embedding of block” and “mean key vector of KV block” remains untested.
Experiment 2A: Exact token spill/restore — PASS
Proves token-level spill/restore is quality-neutral:
- Full context (618 tokens): ppl=12.47
- Exact restore (618 tokens): ppl=12.47, delta=0.0
- Partial block included (boundary fix from initial 0.98 delta)
- Script: cogni-ml/bin/kv_offload_exp2a.cr
Note: this is token re-evaluation, not KV tensor serialization.
Experiment 2B: Selective context routing proxy — PASS
Tests whether sketch-selected text blocks yield good context:
- Full context (618 tokens): ppl=12.47
- Recency only (128 tokens): ppl=31.51
- Sketch-selected (234 tokens, 2 blocks): ppl=11.93
- 62% memory reduction with equal or better perplexity
- Script: cogni-ml/bin/kv_offload_exp2b.cr
Important caveats:
- This is context selection, not live KV-cache spill/restore. The script re-evaluates from tokens, not from serialized KV tensors.
- The routing sketch is embedding-based (nomic-embed), not KV-tensor-based.
- “Better than full context” is one excerpt, one k=2 — likely from filtering noise in irrelevant middle context, not a general guarantee.
- This supports the thesis for retrieval-augmented selective memory but does NOT yet prove a live KV offload/restore architecture.
What is proven so far (summary):
- pg_sorted_heap can serve blocks fast enough (Exp 0)
- Embedding-based sketch finds relevant blocks (Exp 1A)
- Sketch-selected context improves LLM perplexity over recency-only (Exp 1B, 2B)
- Token-level spill/restore is quality-neutral (Exp 2A)
What is NOT yet proven:
- Live KV tensor serialization/restore through llama.cpp
- KV-tensor-based sketches (vs embedding-based)
- Performance on longer contexts (> 1K tokens)
- Generalization beyond one text excerpt
- That the architecture works under real interactive generation
Decision (2026-04-03): selective memory, not live KV hook
All evidence so far supports retrieval-augmented selective memory (text/token-level routing + re-eval) more strongly than KV-tensor-level virtualization. Live KV hook requires deep llama.cpp surgery with uncertain payoff; selective memory is already proven to work at the text level.
Reframed thesis: FlashHadamard selective memory — a retrieval layer that selects relevant historical context blocks via cheap sketch routing and reinserts them as tokens for re-evaluation.
Live KV hook parked as later optimization: only revisit if re-eval latency becomes the dominant bottleneck after the selective memory path is productionized.
Experiment 3: Selective memory e2e prototype — single-text PASS, robustness FAIL
Single-text result (Gemma 3 4B, 844 tokens, k=5):
- Full context: ppl=9.76
- Recency only: ppl=25.36
- Sketch-routed (PG-backed): ppl=10.23, 74% memory, 97% recovery
- Retrieve overhead: 17.5ms (16ms embed + 1.5ms PG)
- Script: cogniformerus/bin/selective_memory_exp3.cr
Robustness sweep (3 Gutenberg texts, 1024 tokens each, k=3 and k=5): FAIL
- Random block selection consistently outperforms sketch routing
- War and Peace k=3: random 95% recovery vs sketch 78%
- Les Mis k=5: random 92% vs sketch 42%
- Bleak House k=3: random 85% vs sketch 56%
- Script: cogniformerus/bin/selective_memory_robustness.cr
Root cause: embedding similarity to the tail selects REDUNDANT context (semantically close to what model already sees in hot tail), not COMPLEMENTARY context. For perplexity, diverse prefix coverage matters more than semantic similarity to the current window.
Consequence: tail-similarity routing is REFUTED as a general method.
Baseline for any future memory routing is now random blocks, not recency-only. A routing signal that cannot beat random is not useful.
Current status of selective memory track: PARKED
What is validated:
- pg_sorted_heap can serve blocks fast enough (Exp 0)
- Exact token spill/restore is quality-neutral (Exp 2A)
- FlashHadamard quantizer/retrieval (independent of routing)
What is refuted:
- “Retrieve blocks most similar to tail” as routing signal
What is untested:
- Coverage/diversity-based routing (MMR, DPP)
- Recency + diversity hybrid
- Attention-weight-based routing (needs KV access)
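Of the untested directions, coverage/diversity routing is the cheapest to try offline. A minimal MMR-style selector, the textbook maximal-marginal-relevance greedy loop rather than repo code (all names hypothetical; assumes normalized block sketches):

```python
def mmr_select(query, blocks, k=3, lam=0.5):
    """Greedy MMR: balance similarity to the query against redundancy with
    already-selected blocks. lam=1.0 degenerates to the refuted
    tail-similarity routing; lower lam favors complementary coverage."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    selected, candidates = [], list(range(len(blocks)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((dot(blocks[i], blocks[j]) for j in selected),
                             default=0.0)
            return lam * dot(query, blocks[i]) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Per the decision above, this (or any other signal) only earns a live hook if it robustly beats the random-blocks baseline across texts.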
Live KV hook: parked. No investment until a routing signal beats random robustly across texts.
Two independent tracks going forward:
Track A — FlashHadamard quantizer/retrieval (ACTIVE):
- block16_packed4_topk and blockhadamard_packed4_topk validated
- IVF centroid caching + Gutenberg nprobe frontier needed
- Publishable as no-codebook packed ANN quantizer
Track B — Selective memory routing (RESEARCH-ONLY):
- Only offline experiments
- Must beat random as baseline
- Coverage/diversity objectives
- No live hook until routing validated
Original Experiment 2 (replaced):
- Same long session, but with KV offload active
- Compare generation quality (perplexity, coherence) between:
  a. Full KV cache (baseline)
  b. Hot tail only (2K tokens, no recall)
  c. Hot tail + FlashHadamard routed recall (2K + top-5 blocks)
- Gate: (c) should be materially closer to (a) than (b)
4. Relationship to Existing Repo
| Component | Role | Location |
|---|---|---|
| FlashHadamard quantizer | rotation + quant + pack + score | scripts/bench_turboquant_retrieval.py |
| Packed ADC C helper | transposed scoring + fused top-k | scripts/turboquant_packed_adc.c |
| pg_sorted_heap | backing store for warm-tier KV blocks | src/sorted_heap.c |
| SQ8 compression | payload compression for KV blocks | src/svec.c |
| HNSW index | sketch-based routing index | src/sorted_hnsw.c |
| cogni-ml llama bindings | KV cache hook point | ../cogni-ml/src/ml/llm/llama.cr |
| Cogniformerus MCP | session management, embedding | ../cogniformerus/src/cogniformerus/ |
5. Non-goals
- Drop-in replacement for llama.cpp KV cache (too invasive)
- Competing with KIVI / per-layer KV quantization (different problem)
- Publishing a new ANN benchmark paper (we optimize for real consumer, not benchmark leaderboards)
- Replacing SQ8 for small datasets (SQ8 is simpler and higher quality when storage is not the bottleneck)