GraphRAG on sorted_heap

This note evaluates a narrow question:

Can current sorted_heap + current vector search already support a useful GraphRAG-style retrieval workflow, or do we need a new storage/model layer?

The conclusion so far is:

  • 1-hop fact retrieval by source entity already fits sorted_heap well.
  • Naive SQL join-based multi-hop expansion does not expose much advantage.
  • ANY(array_of_seed_ids) expansion does trigger SortedHeapScan, but on warm and medium-scale local benchmarks it still loses to heap+btree on end-to-end latency despite reading fewer blocks.
  • Narrow C helpers for expansion and fused top-K rerank now exist as:
    • sorted_heap_expand_ids(...)
    • sorted_heap_expand_rerank(...)
  • A one-call convenience wrapper now exists as:
    • sorted_heap_graph_rag_scan(...)
  • Those helpers materially improve the sorted_heap path on the synthetic GraphRAG benchmark, though pure heap+btree expansion is still faster on this synthetic workload.
  • Therefore the next promising primitive was correctly a narrow C helper, not a new graph storage engine and not a giant monolithic graph_rag_scan() API.

Existing anchors

The repository already has the main building blocks:

  1. Zone-map pruning on sorted_heap
    • planner hook + SortedHeapScan custom scan
    • supports base-relation restriction on the leading PK columns
  2. Planner-integrated ANN via sorted_hnsw
    • exact ordered results
    • works on both heap tables and sorted_heap tables
  3. Legacy graph traversal precedent
    • svec_graph_scan() in pq.c
    • this is for ANN sidecar graph navigation, not fact graphs
    • still useful as evidence that the extension can host graph-like traversal logic in C

What was benchmarked

Synthetic fact graph schema:

CREATE TABLE facts_heap (
    entity_id   int4 NOT NULL,
    relation_id int2 NOT NULL,
    target_id   int4 NOT NULL,
    embedding   svec(32) NOT NULL,
    payload     text NOT NULL,
    PRIMARY KEY (entity_id, relation_id, target_id)
);

CREATE TABLE facts_sh (
    entity_id   int4 NOT NULL,
    relation_id int2 NOT NULL,
    target_id   int4 NOT NULL,
    embedding   svec(32) NOT NULL,
    payload     text NOT NULL,
    PRIMARY KEY (entity_id, relation_id, target_id)
) USING sorted_heap;

Both tables also receive the same ANN index:

CREATE INDEX ... USING sorted_hnsw (embedding) WITH (m = 16, ef_construction = 64);

Benchmark harness:

  • scripts/bench_graph_rag.py
  • local ephemeral PostgreSQL 18 temp cluster
  • deterministic synthetic fact graph
  • compares:
    • hop1_entity
    • hop1_entity_relation
    • hop2_join
    • hop2_in
    • seed_expand_join
    • seed_expand_in
    • seed_expand_rerank_join
    • seed_expand_rerank_in
    • seed_expand_fn
    • seed_expand_rerank_fn
    • seed_expand_rerank_topk_fn
    • seed_graph_rag_scan_fn

The key comparison is between:

  • join-shaped expansion
  • ANY(array(seed_ids)) expansion

The second shape is the one that allows sorted_heap to expose its pruning logic directly on entity_id.
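The two shapes can be written concretely against the schema above. The seeds source is illustrative (any relation or parameter array of previously retrieved target IDs works); the point is that only the second form leaves a plain entity_id restriction for SortedHeapScan to prune against.

-- join-shaped expansion: the seed set arrives as a relation, so the planner
-- typically picks a join plan that bypasses zone-map pruning
SELECT f.*
FROM seeds s
JOIN facts_sh f ON f.entity_id = s.target_id;

-- ANY-shaped expansion: the seeds are passed as one int4[] parameter ($1),
-- leaving a base-relation restriction on the leading PK column
SELECT f.*
FROM facts_sh f
WHERE f.entity_id = ANY ($1);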

Local findings

Small smoke run

On a tiny graph (300 entities, 4 edges/entity):

  • facts_sh reduced buffer hits strongly for:
    • hop1_entity
    • hop1_entity_relation
    • hop2_in
    • seed_expand_in
  • but end-to-end latency stayed close to heap because the whole dataset was fully warm and tiny

Most importantly:

  • join-shaped expansion largely erased the sorted_heap advantage
  • ANY(array(...)) expansion preserved SortedHeapScan

Medium warm run

On 20K entities, 8 edges/entity (160K rows total), warm local cache:

  • hop1_entity
    • heap: Index Scan
    • sorted_heap: Custom Scan:SortedHeapScan
    • sorted_heap reads fewer blocks at roughly latency parity
  • seed_expand_join
    • bad shape for both
    • sorted_heap is not meaningfully better
  • seed_expand_in
    • sorted_heap does use SortedHeapScan
    • buffer footprint drops
    • but heap+btree still wins on total latency

This means:

current SQL shape can make sorted_heap read less, but executor/custom-scan overhead can still dominate the total time on warm-medium datasets

Medium run with lower shared buffers

On 20K entities, 16 edges/entity (320K rows total), shared_buffers=64MB:

  • hop1_entity
    • sorted_heap stayed strong: fewer hits, same-or-better latency
  • seed_expand_join
    • both paths were much worse
    • heap and sorted_heap were similar, with read noise dominating
  • seed_expand_in
    • heap: lower latency
    • sorted_heap: fewer touched blocks / lower expansion footprint
    • but still slower end-to-end

This is the most important current result:

On a graph larger than a warm toy dataset, sorted_heap already shows the expected locality/pruning behavior for seed expansion, but the current SQL + CustomScan path is not enough to turn that into a consistent latency win over heap+btree.

Design implications

What not to build first

  1. Not a new graph storage engine
    • current evidence does not justify that jump
    • 1-hop retrieval is already good on current storage
  2. Not a giant monolithic svec_graph_rag_scan()
    • it would have to combine:
      • ANN seed retrieval
      • graph expansion
      • rerank
    • this is a large surface area
    • it also risks duplicating planner/index logic from sorted_hnsw

What to build next

The next narrow primitive should be something like:

sorted_heap_expand_ids(
    rel regclass,
    seed_ids int4[],
    relation_filter int2 DEFAULT NULL,
    limit_rows int4 DEFAULT 0
)

Why this shape:

  • ANN seed retrieval can stay in SQL:
    • SELECT target_id FROM facts ORDER BY embedding <=> $query LIMIT K
  • expansion becomes a dedicated low-overhead C primitive
  • it avoids:
    • repeated executor/planner setup
    • generic CustomScan overhead for this narrow use case
  • it keeps the product boundary small:
    • “expand these known entity IDs quickly”

That primitive can later be composed into:

  1. SQL-only GraphRAG
  2. a higher-level helper
  3. maybe a monolithic API if the narrow primitive proves valuable
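A minimal sketch of that composition, assuming the signature above (the seed count of 32 is arbitrary, and the DEFAULTs cover relation_filter and limit_rows):

WITH seeds AS (
    SELECT target_id
    FROM facts_sh
    ORDER BY embedding <=> $1   -- ANN seed retrieval stays in SQL
    LIMIT 32
)
SELECT e.*
FROM sorted_heap_expand_ids(
         'facts_sh'::regclass,
         ARRAY(SELECT target_id FROM seeds)
     ) AS e;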

Helper result

The narrow helpers now exist:

sorted_heap_expand_ids(
    rel regclass,
    seed_ids int4[],
    relation_filter int4 DEFAULT NULL,
    limit_rows int4 DEFAULT 0
)
RETURNS TABLE (
    entity_id int4,
    relation_id int2,
    target_id int4,
    embedding svec,
    payload text
)

and:

sorted_heap_expand_rerank(
    rel regclass,
    seed_ids int4[],
    query svec,
    top_k int4,
    relation_filter int4 DEFAULT NULL,
    limit_rows int4 DEFAULT 0
)
RETURNS TABLE (
    entity_id int4,
    relation_id int2,
    target_id int4,
    payload text,
    distance float8
)

and:

sorted_heap_expand_twohop_rerank(
    rel regclass,
    seed_ids int4[],
    query svec,
    top_k int4,
    hop1_relation_filter int4 DEFAULT NULL,
    hop2_relation_filter int4 DEFAULT NULL,
    limit_rows int4 DEFAULT 0
)
RETURNS TABLE (
    entity_id int4,
    relation_id int2,
    target_id int4,
    payload text,
    distance float8
)

and:

sorted_heap_graph_rag_scan(
    rel regclass,
    query svec,
    ann_k int4,
    top_k int4,
    relation_filter int4 DEFAULT NULL,
    limit_rows int4 DEFAULT 0
)
RETURNS TABLE (
    entity_id int4,
    relation_id int2,
    target_id int4,
    payload text,
    distance float8
)

Their current contract is intentionally narrow:

  • relation must be a sorted_heap table
  • relation must expose the columns:
    • entity_id int4
    • relation_id int2
    • target_id int4
    • embedding svec
    • payload text
  • the functions reuse the zone-map range builder directly
  • they emit fact rows for the known source entity IDs
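A call shape consistent with that contract (the seed IDs and relation filter value are illustrative; that limit_rows = 0 means "no cap" is an assumption suggested by the DEFAULT):

SELECT entity_id, relation_id, target_id, payload, distance
FROM sorted_heap_expand_rerank(
         'facts_sh'::regclass,
         ARRAY[101, 202, 303],  -- seed entity IDs (illustrative)
         $1::svec,              -- query embedding
         10,                    -- top_k
         2,                     -- relation_filter (illustrative)
         0                      -- limit_rows
     )
ORDER BY distance;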

On the medium-pressure benchmark (20K entities, 16 edges/entity, 320K rows, shared_buffers=64MB, fresh backend, runs=3), the helpers produced:

  • facts_heap seed_expand_in: 0.123 ms
  • facts_sh seed_expand_in: 0.285 ms
  • facts_sh seed_expand_fn: 0.165 ms
  • facts_sh seed_expand_rerank_in: 0.369 ms
  • facts_sh seed_expand_rerank_fn: 0.234 ms
  • facts_sh seed_expand_rerank_topk_fn: 0.139 ms
  • facts_sh seed_graph_rag_scan_fn: 0.144 ms

Interpretation:

  • sorted_heap_expand_ids() converts the observed block-pruning/locality advantage into a real latency win over the current SQL + CustomScan path
  • sorted_heap_expand_rerank() removes most of the remaining rerank overhead (0.234 ms vs 0.369 ms for the sorted_heap SQL rerank path), and the fused top-K variant goes further still (0.139 ms vs 0.369 ms)
  • sorted_heap_graph_rag_scan() is only slightly slower than the direct fused helper composition (0.144 ms vs 0.139 ms), so the convenience API does not erase the win
  • pure heap+btree expansion is still faster on this synthetic workload (0.123 ms vs 0.165 ms)

Relation-filtered probes narrow that gap further:

  • facts_heap seed_expand_rel_in: 0.074 ms
  • facts_sh seed_expand_rel_in: 0.151 ms
  • facts_sh seed_expand_rel_fn: 0.108 ms
  • facts_heap seed_expand_rerank_rel_in: 0.087 ms
  • facts_sh seed_expand_rerank_rel_in: 0.167 ms
  • facts_sh seed_expand_rerank_rel_topk_fn: 0.104 ms
  • facts_sh seed_graph_rag_rel_scan_fn: 0.120 ms

So the relation-filtered GraphRAG path is materially better than the current SQL + CustomScan form, but it still does not clearly beat heap+btree on this synthetic corpus. The filtered helper path is nevertheless close enough that a real fact graph, wider payloads, or colder cache state may flip the comparison.

Payload-width sensitivity does matter, but not monotonically.

The benchmark harness now supports --payload-bytes to widen synthetic fact rows and test the claim that locality should matter more once facts stop being tiny strings. On the same medium-pressure setup (20K entities, degree 16, 320K rows, shared_buffers=64MB, fresh backend):

  • with payload_bytes=1024
    • facts_heap seed_expand_in: 0.188 ms
    • facts_sh seed_expand_in: 0.185 ms
    • facts_heap seed_expand_rerank_rel_in: 0.120 ms
    • facts_sh seed_expand_rerank_rel_topk_fn: 0.100 ms
    • facts_sh seed_graph_rag_rel_scan_fn: 0.125 ms
  • with payload_bytes=2048
    • facts_heap seed_expand_in: 0.113 ms
    • facts_sh seed_expand_in: 0.208 ms
    • facts_heap seed_expand_rerank_rel_in: 0.090 ms
    • facts_sh seed_expand_rerank_rel_topk_fn: 0.122 ms
    • facts_sh seed_graph_rag_rel_scan_fn: 0.127 ms

Interpretation:

  • a wider inline payload can make sorted_heap competitive or slightly better on seed expansion
  • but the effect is not monotonic, so “wider payload always helps sorted_heap” is false on this synthetic generator
  • this synthetic text filler is still a weak proxy for real fact payloads because compression/TOAST behavior can change the balance again

So the next falsifier should be a real-dataset GraphRAG harness or a more realistic payload model, not another synthetic-only extrapolation.

Real-text Gutenberg graph

A better falsifier now exists: a harness that uses real Gutenberg paragraphs instead of synthetic payload text. It builds a small text graph:

  • relation 1: book -> paragraph (contains)
  • relation 2: paragraph -> next_paragraph (next)

Embeddings are still deterministic lexical hash vectors, not external model embeddings. That means this harness is good for measuring graph-expansion latency on real text payloads and a real graph topology, but it is not a semantic-quality benchmark.

Two useful runs on shared_buffers=64MB, fresh backend:

64 books x 128 paragraphs/book (14,549 rows):

  • facts_heap seed_expand_rerank_rel_in: 0.071 ms
  • facts_sh seed_expand_rerank_rel_in: 0.088 ms
  • facts_sh seed_expand_rerank_rel_topk_fn: 0.061 ms
  • facts_sh seed_graph_rag_rel_scan_fn: 0.084 ms

128 books x 256 paragraphs/book (58,954 rows):

  • facts_heap seed_expand_rel_in: 0.073 ms
  • facts_sh seed_expand_rel_in: 0.078 ms
  • facts_sh seed_expand_rel_fn: 0.069 ms
  • facts_heap seed_expand_rerank_rel_in: 0.079 ms
  • facts_sh seed_expand_rerank_rel_in: 0.101 ms
  • facts_sh seed_expand_rerank_rel_topk_fn: 0.063 ms
  • facts_sh seed_graph_rag_rel_scan_fn: 0.089 ms

This is the first non-synthetic result that materially weakens the earlier “heap+btree simply wins” story:

  • the plain sorted_heap SQL path is still worse than heap+btree
  • but the fused filtered helper on the real-text Gutenberg graph is already at parity or slightly better than heap+btree on the rerank path
  • the one-call wrapper is close enough that its overhead is visible but not disqualifying

So the narrow-helper direction survives the real-text falsifier better than the short-payload synthetic benchmark suggested.

pgvector parity on the real-text graph

The Gutenberg harness also now supports a comparable pgvector path on the same graph:

  • ANN seeds come from a facts_pgv table with vector(dim) + HNSW
  • graph expansion and exact rerank still happen in PostgreSQL over the fact rows, which is the relevant GraphRAG shape

This is important because a pure ANN benchmark would miss the real product question: how expensive is “ANN seed + graph expansion + exact rerank” as one workflow?
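A sketch of that workflow shape, assuming facts_pgv mirrors the facts_heap column layout, <=> in the seed stage is pgvector's cosine-distance operator, and $1/$2 are the same query vector typed as vector and svec respectively:

WITH seeds AS (
    SELECT target_id
    FROM facts_pgv
    ORDER BY embedding <=> $1   -- ANN seed stage (pgvector HNSW)
    LIMIT 32
),
expanded AS (
    SELECT f.*
    FROM facts_heap f            -- relational graph expansion
    WHERE f.entity_id = ANY (ARRAY(SELECT target_id FROM seeds))
      AND f.relation_id = 2
)
SELECT entity_id, relation_id, target_id, payload,
       embedding <=> $2 AS distance   -- exact rerank over fact rows
FROM expanded
ORDER BY distance
LIMIT 10;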

On fresh-backend runs with shared_buffers=64MB:

64 books x 128 paragraphs/book (14,549 rows):

  • heap rerank baseline: 0.064 ms
  • sorted_heap_expand_rerank(... relation=2): 0.060 ms
  • sorted_heap_graph_rag_scan(... relation=2): 0.075 ms
  • pgvector ANN -> heap expansion -> exact rerank: 0.180 ms

128 books x 256 paragraphs/book (58,954 rows):

  • heap rerank baseline: 0.085 ms
  • sorted_heap_expand_rerank(... relation=2): 0.071 ms
  • sorted_heap_graph_rag_scan(... relation=2): 0.087 ms
  • pgvector ANN -> heap expansion -> exact rerank: 0.295 ms

The buffer footprint matches the latency story:

  • the sorted_heap helper path stays in the hundreds of shared-buffer hits
  • the pgvector path needs several thousand shared-buffer hits before the same exact rerank step

This does not mean pgvector is bad at pure ANN. It means that for this GraphRAG workload shape, once the seed stage is followed by relational graph expansion and exact rerank, the narrow sorted_heap helper path is materially better aligned with the whole workflow than an external ANN seed on a separate table.

zvec parity on the real-text graph

The same Gutenberg harness now also supports a comparable zvec path:

  • ANN seeds come from a temporary zvec HNSW collection built from the same fact rows
  • graph expansion and exact rerank still happen in PostgreSQL over facts_heap

This produced a mixed but useful result.

On the medium real-text slice (64 books x 128 paragraphs/book, 14,549 rows, fresh backend, shared_buffers=64MB):

  • heap rerank baseline: 0.068 ms
  • sorted_heap_expand_rerank(... relation=2): 0.066 ms
  • sorted_heap_graph_rag_scan(... relation=2): 0.082 ms
  • zvec ANN -> heap expansion -> exact rerank: 0.322 ms

So on the medium slice, the zvec path is stable but materially slower than the fused sorted_heap helper. The SQL-side buffer footprint is not the bottleneck there; the external ANN seed stage dominates the total latency.

On the larger real-text slice (128 books x 256 paragraphs/book, 58,954 rows), the result is currently not publishable as a clean latency row:

  • the sorted_heap helper path remains stable:
    • sorted_heap_expand_rerank(... relation=2): 0.070 ms
    • sorted_heap_graph_rag_scan(... relation=2): 0.084 ms
  • the zvec path fails during ANN seed retrieval at ann_k=32

The failure is not coming from PostgreSQL or from the GraphRAG SQL wrapper. A pure zvec-only reproduction on the same 58,954-row lexical-hash corpus shows the same failure mode:

  • for one probe query, topk=8 and topk=10 return valid document IDs
  • topk>=16 returns empty doc.id values after:
    • Failed to find target chunk for index 58379

The Gutenberg GraphRAG harness now turns that into an explicit benchmark error:

  • RuntimeError: zvec returned unmapped doc ids (...)

So the objective conclusion today is narrower than for pgvector:

  • zvec does not currently provide a robust large-slice GraphRAG parity row on this real-text workflow at ann_k=32
  • on the medium slice where it does run, it is materially slower than the fused sorted_heap helper path
  • on the larger slice, the current blocker is zvec ANN seed instability, not PostgreSQL expansion/rerank overhead

That instability is now isolated more sharply by a repo-owned reproducer.

Current threshold signature on the lexical-hash Gutenberg corpus:

  • topk=16, dim=32
  • 64x256, 80x256, 96x256, 112x256 slices are stable
    • 28,661, 36,064, 43,684, 51,166 rows
  • 128x256 fails
    • 58,954 rows
    • first bad probe: query #10
    • returned ids are empty strings after Failed to find target chunk for index 58379

So the current failure signature is not just “large-ish GraphRAG benchmark”. It looks more like a size-thresholded zvec retrieval bug on this corpus shape.

That theory is now falsified by a second repo-owned reproducer on a plain synthetic FP32 corpus:

Current synthetic signature:

  • dim=32, ef_search=64
  • topk=7 already reproduces the issue
  • a compact failing case exists at 4,950 rows
    • nearby controls:
      • 4,900 rows: ok
      • 4,950 rows: bad
      • 5,000 rows: bad
    • topk<=6 is clean on the 4,950-row case
  • failures are non-monotonic by row count
    • bad: 16,000, 20,000, 28,000, 30,000, 45,000, 60,000
    • ok: 24,000, 29,000, 75,000 (100 probe queries still clean at 75k)
  • another local non-monotonic pocket exists around 7k-8k
    • 7,000: ok
    • 7,500: bad
    • 7,800: ok
    • 7,900: bad
  • representative stderr lines:
    • Failed to find target chunk for index 4945
    • Failed to find target chunk for index 14999
    • Failed to find target chunk for index 29999
    • Failed to find target chunk for index 59999

So the stronger objective conclusion is:

  • the failure is not Gutenberg-specific
  • it is not a simple monotonic “too many rows” threshold either
  • the current evidence points to a broader zvec retrieval defect around forward-store / chunk lookup, not to PostgreSQL GraphRAG expansion logic

An upstream-ready summary of the current evidence is maintained alongside these reproducers.

Two more diagnostic observations make that conclusion sharper:

  • when the synthetic bug triggers, the ANN scores still come back while doc.id is empty for the whole result set
    • 4,950 rows, topk=6: valid ids
    • 4,950 rows, topk=7: same score bands, but every doc.id is ''
  • on a larger synthetic case (16,000 rows), exact cosine inspection shows the best-score bucket spans 1000, 2000, ..., 16000, and zvec already returns empty ids at topk=5

That does not prove the internal root cause, but it strongly suggests the ANN ranking stage is still producing plausible scores while the forward-store document lookup stage is failing. A reasonable working hypothesis is that some tied-score / candidate-materialization paths touch unresolved high indexes and poison metadata resolution for the whole returned batch.

Qdrant parity on the real-text graph

The Gutenberg harness now also supports a comparable Qdrant path:

  • ANN seeds come from a local Qdrant HNSW collection built from the same fact rows
  • graph expansion and exact rerank still happen in PostgreSQL over facts_heap

Unlike zvec, this path stayed stable on both the medium and larger real-text slices. The result is simpler:

64 books x 128 paragraphs/book (14,549 rows):

  • heap rerank baseline: 0.074 ms
  • sorted_heap_expand_rerank(... relation=2): 0.062 ms
  • sorted_heap_graph_rag_scan(... relation=2): 0.083 ms
  • Qdrant ANN -> heap expansion -> exact rerank: 1.535 ms

128 books x 256 paragraphs/book (58,954 rows):

  • heap rerank baseline: 0.081 ms
  • sorted_heap_expand_rerank(... relation=2): 0.083 ms
  • sorted_heap_graph_rag_scan(... relation=2): 0.085 ms
  • Qdrant ANN -> heap expansion -> exact rerank: 1.769 ms

So on this GraphRAG workflow shape:

  • Qdrant is robust on the real-text benchmark
  • but its external ANN seed stage dominates end-to-end latency
  • the fused sorted_heap helper remains roughly an order of magnitude faster on the rerank path

That again does not mean Qdrant is a bad vector engine in isolation. It means that when the workflow is “external ANN seed + relational graph expansion + exact rerank inside PostgreSQL”, the narrow in-engine helper path is much better aligned with the total job than a remote vector service.

Robustness rerun

The same real-text Gutenberg harness was then rerun with a larger query set (query_count=64, runs=3) to check whether the earlier 16-query results were just small-sample noise.

The ranking stayed the same on both slices:

  • medium slice (64 x 128):
    • sorted_heap_expand_rerank(... relation=2): 0.062 ms
    • sorted_heap_graph_rag_scan(... relation=2): 0.081 ms
    • pgvector ANN -> heap expansion -> exact rerank: 0.219 ms
    • zvec ANN -> heap expansion -> exact rerank: 0.342 ms
    • Qdrant ANN -> heap expansion -> exact rerank: 1.567 ms
  • larger slice (128 x 256):
    • sorted_heap_expand_rerank(... relation=2): 0.067 ms
    • sorted_heap_graph_rag_scan(... relation=2): 0.088 ms
    • pgvector ANN -> heap expansion -> exact rerank: 0.309 ms
    • Qdrant ANN -> heap expansion -> exact rerank: 1.911 ms
    • zvec remains excluded from this large-slice rerun because the previously observed ann_k=32 instability is still the blocker

So the current GraphRAG conclusion is no longer resting on one short probe set. At least on this real-text Gutenberg workflow, the fused sorted_heap helper still has the best end-to-end latency profile after the query set is expanded.

Two-hop Gutenberg composition

The next adversarial question was whether the current helper story survives a real two-hop workflow, not just the earlier “ANN seeds -> one filtered expansion -> rerank” shape.

The initial Gutenberg falsifier first used a composed path from the existing narrow primitives:

  1. ANN seeds from the fact table
  2. first hop via sorted_heap_expand_ids(..., relation=2)
  3. second hop via sorted_heap_expand_rerank(..., relation=2)

That composition benchmark was intentionally a harsher test than the earlier one-hop helper story, because it asked whether the current primitives were already enough to make multi-hop GraphRAG plausible before inventing a dedicated two-hop helper.

The answer was “yes, barely enough”. That justified one narrow extra helper, not a new storage engine:

sorted_heap_expand_twohop_rerank(...)

This fused helper keeps the same contract shape as the earlier rerank helper, but removes the intermediate SQL/materialization boundary between hop1 and the second-hop rerank.
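Its call shape follows directly from the signature listed earlier (seed IDs come from the ANN stage; relation 2 on both hops matches the paragraph -> next_paragraph graph):

SELECT entity_id, relation_id, target_id, payload, distance
FROM sorted_heap_expand_twohop_rerank(
         'facts_sh'::regclass,
         $1,          -- int4[] ANN seed IDs
         $2::svec,    -- query embedding
         10,          -- top_k
         2,           -- hop1_relation_filter ("next")
         2,           -- hop2_relation_filter ("next")
         0            -- limit_rows
     );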

On the medium real-text slice (64 books x 128 paragraphs/book, 14,549 rows, 32D, query_count=64, runs=3, fresh backend, shared_buffers=64MB):

  • heap baseline, seed_expand2_rerank_rel_in: 0.102 ms
  • plain sorted_heap SQL, seed_expand2_rerank_rel_in: 0.136 ms
  • helper-composed sorted_heap, seed_expand2_rerank_rel_topk_fn: 0.105 ms
  • fused sorted_heap_expand_twohop_rerank(...): 0.081 ms

So on the medium slice, the dedicated helper now does what the composed path only hinted at:

  • it beats heap+btree on latency
  • it materially beats the composed two-hop helper path
  • it also cuts shared-buffer hits strongly (421 vs 1298 for the heap baseline, and 421 vs 662 for the composed helper)

On the larger real-text slice (128 books x 256 paragraphs/book, 58,954 rows, same settings except the larger corpus):

  • heap baseline, seed_expand2_rerank_rel_in: 0.114 ms
  • plain sorted_heap SQL, seed_expand2_rerank_rel_in: 0.153 ms
  • helper-composed sorted_heap, seed_expand2_rerank_rel_topk_fn: 0.111 ms
  • fused sorted_heap_expand_twohop_rerank(...): 0.092 ms

So the larger slice confirms the same shape: the dedicated two-hop helper is not a tiny micro-win on one probe set; it keeps the lead over both heap+btree and the composed helper.

The same medium two-hop slice was also benchmarked against the external ANN seed paths:

  • pgvector ANN -> heap 2-hop expansion -> exact rerank: 0.253 ms
  • zvec ANN -> heap 2-hop expansion -> exact rerank: 0.374 ms
  • Qdrant ANN -> heap 2-hop expansion -> exact rerank: 1.789 ms

So the product-level conclusion stays consistent in the two-hop case as well: the narrow in-engine sorted_heap helper remains the fastest end-to-end GraphRAG path among the tested competitors on this real-text slice.

At higher exact-rerank dimension, the advantage narrows again rather than disappearing:

64 books x 128 paragraphs/book, 384D, query_count=64, runs=3:

  • heap baseline, seed_expand2_rerank_rel_in: 0.225 ms
  • plain sorted_heap SQL, seed_expand2_rerank_rel_in: 0.266 ms
  • helper-composed sorted_heap, seed_expand2_rerank_rel_topk_fn: 0.258 ms
  • fused sorted_heap_expand_twohop_rerank(...): 0.236 ms

Interpretation:

  • the dedicated helper makes two-hop GraphRAG clearly viable on the real-text Gutenberg path
  • the latency win is still not universal; at higher dimensions it narrows toward parity with heap+btree
  • but the locality signal remains stronger than latency alone suggests (1264 shared hits for the fused helper vs 3155 for the heap baseline on the 384D medium run)

So the correct next inference is narrower than “we need a graph storage engine” and also narrower than “we need a broad graph query layer”:

a dedicated but still narrow two-hop helper is justified; anything broader should now be treated as product/API design, not as a prerequisite for making two-hop GraphRAG fast enough to matter.

Higher-dimension rerun

The same medium Gutenberg slice (64 books x 128 paragraphs/book) was then rerun at higher lexical-hash embedding dimensions to test whether the earlier result depended too heavily on the cheap 32D setting.

At 128D (query_count=64, runs=3):

  • heap rerank baseline: 0.107 ms
  • sorted_heap_expand_rerank(... relation=2): 0.090 ms
  • sorted_heap_graph_rag_scan(... relation=2): 0.097 ms
  • pgvector ANN -> heap expansion -> exact rerank: 0.386 ms
  • zvec ANN -> heap expansion -> exact rerank: 0.518 ms
  • Qdrant ANN -> heap expansion -> exact rerank: 1.732 ms

At 384D on the same slice:

  • heap rerank baseline: 0.185 ms
  • sorted_heap_expand_rerank(... relation=2): 0.186 ms
  • sorted_heap_graph_rag_scan(... relation=2): 0.203 ms
  • pgvector ANN -> heap expansion -> exact rerank: 0.815 ms
  • zvec ANN -> heap expansion -> exact rerank: 1.101 ms
  • Qdrant ANN -> heap expansion -> exact rerank: 2.275 ms

This changes the interpretation in one important way:

  • the sorted_heap helper remains clearly best-aligned with the full GraphRAG workflow versus the external ANN paths
  • but the win over the pure heap rerank baseline is dimension-sensitive
  • by 384D, exact rerank cost dominates enough that the fused helper is only at parity with heap+btree rather than clearly ahead

So the current evidence supports a narrower claim than “sorted_heap always wins GraphRAG”:

the fused sorted_heap helper is the best end-to-end path among the tested in-PG and external ANN competitors on this workflow shape, but its advantage over heap+btree narrows substantially as exact rerank dimension grows

One more tuning falsifier was useful here:

  • dropping ann_k from 32 to 24 on the 384D medium slice does reduce latency
  • but it is not a free operating-point improvement
  • a direct result-set comparison for sorted_heap_graph_rag_scan(...) on the 64-query probe set showed mismatches on 62/64 queries versus ann_k=32

So the current faster-than-ann_k=32 settings should be treated as a quality/latency tradeoff, not as a no-regression default recommendation.
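The result-set comparison can be sketched as a one-directional set difference per probe query (run it in both directions for the full symmetric diff; parameters are illustrative):

-- rows returned at ann_k=32 but missing at ann_k=24 for one probe query
(SELECT entity_id, relation_id, target_id
 FROM sorted_heap_graph_rag_scan('facts_sh'::regclass, $1::svec, 32, 10, 2, 0)
 EXCEPT
 SELECT entity_id, relation_id, target_id
 FROM sorted_heap_graph_rag_scan('facts_sh'::regclass, $1::svec, 24, 10, 2, 0));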

One important measurement caveat was also discovered and fixed during this work:

  • direct filtered ORDER BY embedding <=> $query LIMIT K on a base table with a sorted_hnsw index is not a valid GraphRAG baseline for current Phase 1 semantics
  • the automatic sorted_hnsw path is now explicitly costed out when extra base-relation quals are present
  • GraphRAG rerank baselines must therefore materialize the expanded set first, then rerank it

This is enough to falsify the pessimistic branch:

the next useful GraphRAG step is not necessarily a new storage engine; a carefully scoped C primitive can already recover a substantial part of the lost latency

Phase 0 — completed

  • Build local prototype benchmark
  • Falsify naive SQL assumptions

Phase 1 — current

sorted_heap_expand_ids() is implemented and regression-covered.

Phase 2 — current

sorted_heap_expand_rerank() is implemented and regression-covered.

Current success criterion that was met:

  • beats the current sorted_heap SQL seed_expand_in / seed_expand_rerank_in patterns at medium scale

Current gap that remains:

  • pure heap+btree expansion is still faster on this synthetic benchmark

Phase 3 — completed

The GraphRAG composition query now exists:

  • ANN seed in SQL via sorted_hnsw
  • expansion via sorted_heap_expand_ids()
  • rerank via sorted_heap_expand_rerank() or SQL over materialized expansion

Phase 4 — current

sorted_heap_graph_rag_scan() is now implemented as the narrow one-call composition wrapper.

Phase 5 — current

sorted_heap_expand_twohop_rerank() is now implemented as the narrow fused two-hop helper.

Current success criterion that was met:

  • beats the previous composed two-hop helper on the real-text Gutenberg graph
  • beats heap+btree on the medium and larger 32D two-hop slices

Current gap that remains:

  • at 384D, the fused two-hop helper narrows to near-parity with heap+btree rather than keeping a clear lead

Phase 6 — next

Only if the current two-hop and one-call wrappers still leave meaningful headroom:

  • consider a broader wrapper for:
    • ANN seed IDs
    • two-hop expansion
    • rerank
  • or tune candidate count / rerank workload rather than broadening the API

Cogniformerus-style multihop facts

The real missing falsifier was not another paragraph graph slice. It was a benchmark that matches the current cogniformerus multihop question shape:

  • fact 1: person -> parent
  • fact 2: parent -> city
  • query: Where does the parent of Person_i live?

Such a benchmark now exists.

The benchmark builds a deterministic fact graph and measures:

  • latency
  • hit@1
  • hit@k

for the expected final city fact after two-hop expansion and rerank.

Important contract discovery

This benchmark immediately exposed a semantic limitation in the current convenience wrapper:

  • sorted_heap_graph_rag_scan() seeds expansion from ANN target_id
  • that is a good fit for the Gutenberg paragraph -> next_paragraph graph
  • it is not the right seed contract for the fact benchmark above
  • the fact benchmark needs ANN seeds based on entity_id, then:
    • hop 1 on relation 1
    • hop 2 on relation 2

So the current one-call wrapper is still too specialized for this workload shape. The lower-level helper family is fine; the wrapper contract is the narrow part.

That gap is now closed by:

  • sorted_heap_graph_rag_twohop_scan(...)

This wrapper keeps the fact-shaped contract narrow:

  • ANN seed on entity_id
  • hop 1 relation filter
  • hop 2 relation filter
  • final rerank delegated to sorted_heap_expand_twohop_rerank(...)
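The wrapper's full signature is not reproduced in this note; a plausible call shape, with the parameter order assumed by analogy with sorted_heap_graph_rag_scan(rel, query, ann_k, top_k, ...), would be:

-- hypothetical parameter order; only the contract bullets above are confirmed
SELECT *
FROM sorted_heap_graph_rag_twohop_scan(
         'facts_sh'::regclass,
         $1::svec,   -- query embedding (ANN seed on entity_id)
         64,         -- ann_k
         10,         -- top_k
         1,          -- hop 1 relation filter (person -> parent)
         2,          -- hop 2 relation filter (parent -> city)
         0           -- limit_rows
     );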

Early failure that mattered

At 32D, the fact benchmark initially produced very poor answer retrieval. That was a benchmark-quality failure, not a helper failure:

  • the first draft seeded on target_id, which was the wrong graph contract
  • after fixing that, the deterministic query embedding was still too weak at low dimension to make the question reliably retrievable

So the publishable multihop results start at 384D, where the question shape becomes stable enough that latency numbers mean something.

Tuned 384D result

On 5K multihop chains (10K rows total), 64 queries, 3 runs, shared_buffers=64MB, fresh backend, with:

  • ann_k=64
  • sorted_hnsw.ef_search=64
  • ef_construction=200

the current frontier is:

  • heap composed two-hop SQL
    • 0.515 ms
    • hit@1 = 71.9%
    • hit@k = 85.9%
  • sorted_heap composed two-hop helper
    • 0.471 ms
    • hit@1 = 70.3%
    • hit@k = 82.8%
  • sorted_heap_expand_twohop_rerank()
    • 0.442 ms
    • hit@1 = 70.3%
    • hit@k = 82.8%
  • sorted_heap_graph_rag_twohop_scan()
    • 0.417 ms
    • hit@1 = 71.9%
    • hit@k = 84.4%
  • pgvector
    • 1.397 ms
    • hit@1 = 70.3%
    • hit@k = 87.5%
  • zvec
    • 1.076 ms
    • hit@1 = 76.6%
    • hit@k = 96.9%
  • Qdrant
    • 2.921 ms
    • hit@1 = 76.6%
    • hit@k = 96.9%

Interpretation:

  • the fused two-hop helper is now the fastest PostgreSQL path on this fact-shaped workload
  • the new fact-shaped one-call wrapper stays effectively at parity with the fused helper, so this time the convenience API does not erase the win
  • it remains materially faster than pgvector on the same workflow
  • it is not the quality leader at this operating point
  • zvec and Qdrant still win on answer retrieval quality here, but at much higher latency

Seed frontier after the wrapper fix

The next honest question was not API shape but ANN seed quality. That is now measured directly by a dedicated seed-sweep harness in the repository.

This harness keeps the corpus fixed per ef_construction and sweeps:

  • m
  • ann_k
  • sorted_hnsw.ef_search
  • ef_construction

without paying a full temp-cluster and schema rewrite for every single probe point.
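
The sweep structure can be sketched as a nested loop that amortizes the expensive index build across the cheap query-time probes. This is a hypothetical sketch of the control flow only; build_index and run_probe are illustrative names, not the harness's real API.

```python
def sweep(build_index, run_probe,
          ef_construction_values, ann_k_values, ef_search_values):
    """Build the index once per ef_construction, then probe the
    query-time knobs (ann_k, ef_search) without rebuilding."""
    results = []
    for efc in ef_construction_values:
        index = build_index(ef_construction=efc)   # expensive: once per efc
        for ann_k in ann_k_values:
            for ef_search in ef_search_values:     # cheap: per-probe setting
                results.append(((efc, ann_k, ef_search),
                                run_probe(index, ann_k, ef_search)))
    return results
```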

On the same 5K chains / 10K rows / 384D / 64 queries / fresh-backend benchmark, the stable wrapper frontier is now:

  • ef_construction=64, ann_k=64, ef_search=64
    • 0.386 ms
    • hit@1 = 70.3%
    • hit@k = 82.8%
  • ef_construction=200, ann_k=64, ef_search=64
    • 0.393 ms
    • hit@1 = 71.9%
    • hit@k = 84.4%
  • ef_construction=400, ann_k=64, ef_search=64
    • 0.421 ms
    • hit@1 = 70.3%
    • hit@k = 85.9%
  • ef_construction=200, ann_k=64, ef_search=128
    • 0.651 ms
    • hit@1 = 73.4%
    • hit@k = 95.3%
  • ef_construction=400, ann_k=64, ef_search=128
    • 0.663 ms
    • hit@1 = 75.0%
    • hit@k = 95.3%

For a higher-quality but much slower seed tier:

  • ann_k=96, ef_search=64 lands around 2.2-2.4 ms with hit@k = 96.9%

That leads to a narrower, more honest recommendation:

  • if latency is the hard constraint, keep the fast tier near ef_construction=200, ann_k=64, ef_search=64
  • if answer quality matters more, the best balanced point we measured is ef_construction=200, ann_k=64, ef_search=128
  • ef_construction=400 does improve hit@1 slightly at the same 95.3% hit@k, but it does not improve hit@k over 200, so it should not be the default recommendation without a separate build-cost justification

That build-cost justification now exists too on this exact 10K x 384D multihop benchmark:

  • ef_construction=64: 43.716 s to build both ANN indexes
  • ef_construction=200: 80.046 s
  • ef_construction=400: 91.352 s

So the current recommendation is:

  • default to ef_construction=200
  • treat ef_construction=400 as a niche hit@1 knob, not the new default

m frontier on the same multihop benchmark

The next useful falsifier was whether graph degree buys more than another ef_construction increase.

Keeping:

  • ef_construction=200
  • ann_k=64
  • 64 queries
  • 3 runs
  • fresh backend

the m sweep came out as:

  • m=16, ef_search=64
    • 0.405 ms
    • hit@1 = 71.9%
    • hit@k = 87.5%
  • m=24, ef_search=64
    • 0.466 ms
    • hit@1 = 75.0%
    • hit@k = 93.8%
  • m=32, ef_search=64
    • 0.491 ms
    • hit@1 = 78.1%
    • hit@k = 93.8%
  • m=16, ef_search=128
    • 0.672 ms
    • hit@1 = 73.4%
    • hit@k = 95.3%
  • m=24, ef_search=128
    • 0.738 ms
    • hit@1 = 75.0%
    • hit@k = 96.9%
  • m=32, ef_search=128
    • 0.771 ms
    • hit@1 = 76.6%
    • hit@k = 96.9%

The one-off build-cost probe for the same 10K x 384D graph was:

  • m=16, ef_construction=200: 79.425 s
  • m=24, ef_construction=200: 86.562 s
  • m=32, ef_construction=200: 75.404 s

That last m=32 build number should be treated cautiously; it was a single one-off probe and is likely noisy enough that only the query-time frontier is trustworthy here.

The stable conclusion is still clear:

  • m=24 is the best current quality-per-latency tradeoff we measured
  • m=32 buys a little more hit@1, but no additional hit@k
  • so for fact-shaped multihop GraphRAG, the best current balanced point is:
    • m=24
    • ef_construction=200
    • ann_k=64
    • sorted_hnsw.ef_search=128

One more ann_k falsifier matters here too:

  • increasing ann_k above 64 at this m=24 / ef_construction=200 / ef_search=128 point did not help
  • ann_k=80/96/128 all increased latency and reduced hit@k
  • so ann_k=64 remains the current sweet spot, not just a legacy default

Full parity rerun at the balanced point

Re-running the full multihop parity benchmark on that exact setting:

  • m=24
  • ef_construction=200
  • ann_k=64
  • sorted_hnsw.ef_search=128
  • 64 queries
  • 3 runs
  • 384D

produced:

  • heap two-hop SQL
    • 0.762 ms
    • hit@1 = 75.0%
    • hit@k = 96.9%
  • sorted_heap_expand_twohop_rerank()
    • 0.726 ms
    • hit@1 = 75.0%
    • hit@k = 96.9%
  • sorted_heap_graph_rag_twohop_scan()
    • 0.727 ms
    • hit@1 = 75.0%
    • hit@k = 96.9%
  • pgvector
    • 1.244 ms
    • hit@1 = 70.3%
    • hit@k = 85.9%
  • zvec
    • 0.927 ms
    • hit@1 = 76.6%
    • hit@k = 96.9%
  • Qdrant
    • 2.417 ms
    • hit@1 = 76.6%
    • hit@k = 96.9%

That is a materially stronger result than the earlier m=16 baseline:

  • the fused sorted_heap path now matches zvec and Qdrant on hit@k
  • it stays faster than both external paths
  • it also beats pgvector on both latency and answer quality on this workload
  • zvec and Qdrant still keep a small hit@1 edge, so the answer-quality story is now about hit@1, not hit@k

Full parity rerun at the higher-quality point

The next question was whether that remaining hit@1 gap could be closed without giving back the latency lead. Re-running the same full parity benchmark at:

  • m=32
  • ef_construction=200
  • ann_k=64
  • sorted_hnsw.ef_search=128

produced:

  • heap two-hop SQL
    • 0.810 ms
    • hit@1 = 76.6%
    • hit@k = 96.9%
  • sorted_heap_expand_twohop_rerank()
    • 0.774 ms
    • hit@1 = 76.6%
    • hit@k = 96.9%
  • sorted_heap_graph_rag_twohop_scan()
    • 0.786 ms
    • hit@1 = 76.6%
    • hit@k = 96.9%
  • pgvector
    • 1.220 ms
    • hit@1 = 70.3%
    • hit@k = 84.4%
  • zvec
    • 0.874 ms
    • hit@1 = 76.6%
    • hit@k = 96.9%
  • Qdrant
    • 2.487 ms
    • hit@1 = 76.6%
    • hit@k = 96.9%

So the current picture is now more precise:

  • m=24 is still the better quality-per-latency recommendation
  • m=32 is the point where sorted_heap reaches full observed parity with zvec and Qdrant on both hit@1 and hit@k
  • even at that higher-quality point, the sorted_heap helper remains faster than both external paths
  • pgvector remains behind on both latency and answer quality on this workload

AWS ARM64 parity rerun (5K chains)

The next environment-variance adversary check was to rerun the same 5K-chain / 10K-row / 384D fact benchmark on an AWS ARM64 host (4 vCPU, 8 GiB RAM) using the repo-owned benchmark wrapper.

At the previously recommended local balanced point:

  • m=24
  • ef_construction=200
  • ann_k=64
  • sorted_hnsw.ef_search=128
  • 64 queries
  • 3 runs
  • fresh backend

the AWS rerun produced:

  • heap two-hop SQL
    • 1.087 ms
    • hit@1 = 75.0%
    • hit@k = 96.9%
  • sorted_heap_expand_twohop_rerank()
    • 0.947 ms
    • hit@1 = 76.6%
    • hit@k = 98.4%
  • sorted_heap_graph_rag_twohop_scan()
    • 1.004 ms
    • hit@1 = 76.6%
    • hit@k = 98.4%
  • pgvector
    • 1.296 ms
    • hit@1 = 70.3%
    • hit@k = 85.9%
  • zvec
    • 1.646 ms
    • hit@1 = 76.6%
    • hit@k = 96.9%
  • Qdrant
    • 3.396 ms
    • hit@1 = 76.6%
    • hit@k = 96.9%

That is stronger than the local balanced point in one important way:

  • on this AWS rerun, sorted_heap does not just match zvec and Qdrant on hit@k; it exceeds them (98.4% vs 96.9%) while staying faster than both

But the second half of the adversary check matters too. Re-running the same AWS benchmark at the local higher-quality point:

  • m=32
  • ef_construction=200
  • ann_k=64
  • sorted_hnsw.ef_search=128

produced:

  • sorted_heap_graph_rag_twohop_scan()
    • 1.066 ms
    • hit@1 = 76.6%
    • hit@k = 96.9%

So the local m=32 parity story does not carry over unchanged to this AWS ARM64 environment. The portable conclusion is therefore narrower:

  • m=24 / ef_construction=200 / ann_k=64 / ef_search=128 is the current best verified cross-environment point
  • local and AWS frontiers are directionally consistent, but not numerically identical
  • this is exactly why the AWS rerun is worth keeping as a separate falsifier, not merging blindly into the local tuning story

Larger local scale check (10K chains)

The next adversary check was whether the 5K-chain tuning carried forward to a larger local fact graph without retuning.

On 10K chains (20K rows total), 64 queries, 384D, fresh backend:

  • m=24, ef_construction=200, ann_k=64, ef_search=128
    • sorted_heap_graph_rag_twohop_scan() -> 0.885 ms
    • hit@1 = 71.9%
    • hit@k = 92.2%
  • m=32, ef_construction=200, ann_k=64, ef_search=128
    • sorted_heap_graph_rag_twohop_scan() -> 0.972 ms
    • hit@1 = 73.4%
    • hit@k = 93.8%

So the 5K-chain operating point does not generalize unchanged.

The next narrow falsifier was whether this larger-graph drop was just a search beam issue. Sweeping ef_search upward at m=32 gave:

  • ef_search=192
    • 1.310 ms
    • hit@1 = 76.6%
    • hit@k = 95.3%
  • ef_search=256
    • 1.734 ms
    • hit@1 = 78.1%
    • hit@k = 95.3%

That is a useful but incomplete recovery:

  • higher ef_search does recover part of the quality loss
  • it does not recover the earlier 96.9% hit@k local point
  • so the larger-graph gap is not purely a beam-width problem

The next falsifier after that was stronger graph construction. On the same 10K-chain graph, keeping m=32, ann_k=64, and comparing ef_construction=200 vs 400 gave:

  • at ef_search=128
    • ef_construction=200 -> 0.976 ms, hit@1 = 75.0%, hit@k = 93.8%
    • ef_construction=400 -> 1.094 ms, hit@1 = 75.0%, hit@k = 93.8%
  • at ef_search=192
    • ef_construction=200 -> 1.357 ms, hit@1 = 76.6%, hit@k = 95.3%
    • ef_construction=400 -> 1.381 ms, hit@1 = 76.6%, hit@k = 95.3%

So this larger-graph gap is not fixed by a simple ef_construction=400 bump either.

The current best explanation is therefore narrower:

  • the verified 5K-chain local frontier is real
  • the same operating points do not carry forward unchanged to 10K chains
  • and the obvious local rescue knobs (ef_search, ef_construction) only recover part of the drop

That is enough to stop local knob-turning for this pass. The next useful step would be a different class of experiment, not more of the same sweep.

The next adversary check after that was whether this larger-graph caveat was just a local-machine artifact. Re-running the 10K-chain benchmark on the same AWS ARM64 host (4 vCPU, 8 GiB RAM) showed that it is not.

At the same balanced portable point:

  • m=24
  • ef_construction=200
  • ann_k=64
  • sorted_hnsw.ef_search=128

the AWS rerun produced:

  • heap two-hop SQL
    • 1.389 ms
    • hit@1 = 71.9%
    • hit@k = 92.2%
  • sorted_heap_expand_twohop_rerank()
    • 1.190 ms
    • hit@1 = 71.9%
    • hit@k = 92.2%
  • sorted_heap_graph_rag_twohop_scan()
    • 1.248 ms
    • hit@1 = 71.9%
    • hit@k = 92.2%

That essentially matches the larger local result. So the 10K-chain drop is cross-environment robust, not just a local Apple/M-series artifact.

The one meaningful local rescue point transferred cleanly to AWS too. Re-running the 10K-chain benchmark at:

  • m=32
  • ef_construction=200
  • ann_k=64
  • sorted_hnsw.ef_search=192

produced:

  • heap two-hop SQL
    • 1.896 ms
    • hit@1 = 76.6%
    • hit@k = 95.3%
  • sorted_heap_expand_twohop_rerank()
    • 1.617 ms
    • hit@1 = 76.6%
    • hit@k = 95.3%
  • sorted_heap_graph_rag_twohop_scan()
    • 1.687 ms
    • hit@1 = 76.6%
    • hit@k = 95.3%

So the larger-scale picture is now materially stronger:

  • the 10K-chain quality drop is cross-environment robust
  • the best current larger-graph recovery point is also cross-environment robust: m=32 / ef_search=192
  • but even that recovery point does not restore the earlier 5K-chain 98.4% hit@k AWS frontier
  • so the remaining gap is unlikely to be solved by another trivial ef_search or m tweak alone

Exact-seed upper-bound diagnostic

The next root-cause check was to remove ANN approximation from the seed stage entirely. The multihop harness now supports an --exact-seed-diagnostics mode, which replaces ANN seed retrieval with exact brute-force top-K seeds on facts_heap, then reuses the same graph expansion/rerank path.

This matters because it separates two very different explanations:

  • “the remaining gap is caused by approximate ANN seeds”
  • “the remaining gap is already in the benchmark/query/task shape”
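
The exact-seed replacement can be sketched as a brute-force top-K scan. This is an illustrative stand-in for the diagnostic mode, assuming cosine distance as the metric; it is not the harness's actual code.

```python
import heapq
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def exact_seed_topk(rows, query_vec, k):
    """Brute-force replacement for the ANN seed stage.

    rows: (entity_id, embedding) pairs for every fact row.
    Returns the k entity_ids closest to the query, best first.
    """
    scored = ((cosine_distance(emb, query_vec), eid) for eid, emb in rows)
    return [eid for _, eid in heapq.nsmallest(k, scored)]
```

Everything downstream (expansion and rerank) stays identical, which is what lets the diagnostic isolate the seed stage.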

On the 5K-chain balanced local point:

  • m=24
  • ef_construction=200
  • ann_k=64
  • sorted_hnsw.ef_search=128

the exact-seed diagnostic did not improve quality:

  • ANN-seeded sorted_heap_expand_twohop_rerank()
    • 0.702 ms
    • hit@1 = 75.0%
    • hit@k = 96.9%
  • exact-seeded sorted_heap_expand_twohop_rerank()
    • 0.811 ms
    • hit@1 = 75.0%
    • hit@k = 96.9%

And the seed-stage diagnostic showed no hidden ANN loss there either:

  • ANN seeds
    • seed_person_pct = 98.4%
    • expanded_city_pct = 98.4%
    • avg_person_rank = 1.00
    • city_rank_p95 = 6
    • city_rank_max = 17
  • exact seeds
    • seed_person_pct = 98.4%
    • expanded_city_pct = 98.4%
    • avg_person_rank = 1.00
    • city_rank_p95 = 6
    • city_rank_max = 17

So even at 5K, the final 96.9% hit@k is already below seed coverage. But the rerank distribution is still concentrated: the correct city stays within the top 6 for 95% of reachable queries, and the misses come from a small number of sharp outliers.

On the 10K-chain balanced local point:

  • m=24
  • ef_construction=200
  • ann_k=64
  • sorted_hnsw.ef_search=128

the exact-seed diagnostic again did not improve quality:

  • ANN-seeded sorted_heap_expand_twohop_rerank()
    • 0.839 ms
    • hit@1 = 71.9%
    • hit@k = 92.2%
  • exact-seeded sorted_heap_expand_twohop_rerank()
    • 0.947 ms
    • hit@1 = 71.9%
    • hit@k = 92.2%

The seed-stage diagnostic was even more revealing on 10K:

  • ANN seeds
    • seed_person_pct = 96.9%
    • expanded_city_pct = 96.9%
    • avg_person_rank = 1.00
    • city_rank_p95 = 3
    • city_rank_max = 20
  • exact seeds
    • seed_person_pct = 96.9%
    • expanded_city_pct = 96.9%
    • avg_person_rank = 1.00
    • city_rank_p95 = 3
    • city_rank_max = 19

So the larger-graph gap is not coming from missing the correct seed fact. At 10K, seed coverage stays at 96.9%, but final hit@k drops to 92.2%. And it is not a broad rerank collapse either: for 95% of reachable queries the correct city still ranks in the top 3, but a few outliers fall as far as rank 19-20, which is enough to miss top_k = 10.
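
The outlier mechanism is easy to see numerically: a rank distribution can be tightly concentrated and still lose hit@k when a handful of queries rank past top_k. The ranks below are made up for illustration, not the benchmark data.

```python
def hit_at_k(ranks, k=10):
    """ranks: 1-based rank of the correct city fact per reachable query."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)

# 64 queries: 59 rank the correct city in the top 3, but 5 outliers
# land between rank 11 and rank 20, past the top_k = 10 cutoff.
ranks = [1] * 40 + [2] * 14 + [3] * 5 + [11, 15, 17, 19, 20]
```

Five outliers out of 64 are enough to pull hit@k down to roughly 92% even though almost every other query ranks the answer near the top.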

This is a strong falsifier:

  • on this synthetic fact benchmark, the current 5K and 10K frontiers are not ANN-approximation limited at the tested operating points
  • ANN and exact seeds have identical seed coverage on both scales
  • the remaining gap is mostly an outlier-ranking problem, not a broad seed or rerank failure
  • exact seeds cost extra latency but do not recover answer quality
  • so the next meaningful gain is unlikely to come from more seed-ANN tuning alone

The remaining gap now looks more like a property of the task construction, query embedding, or graph benchmark semantics than of sorted_hnsw approximation itself. More specifically: the dominant remaining loss now looks downstream of seed retrieval, not inside it, and it is concentrated in a small set of bad cases rather than a general degradation across the query set.

So the honest story on this fact benchmark is a latency/quality frontier:

  • sorted_heap_expand_twohop_rerank() leads on latency
  • zvec and Qdrant lead on answer quality, at materially higher latency

Path-aware rerank diagnostic

The next falsifier was to keep the same ANN seeds and the same two-hop expansion, but change only the final scorer. The current multihop helper reranks on the hop-2 city fact embedding alone. A path-aware SQL baseline was added to the harness that scores each candidate as:

  • path_distance = (hop1_embedding <=> query) + (hop2_embedding <=> query)
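
The path_distance expression above can be sketched in plain Python. This assumes `<=>` denotes cosine distance at this point in the note; cos_dist is a stand-in for the operator, and the function names are illustrative.

```python
import math

def cos_dist(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) *
                        math.sqrt(sum(x * x for x in b)))

def path_aware_rerank(candidates, query_vec, top_k):
    """Score each candidate two-hop path by summing BOTH hop distances
    to the query, instead of scoring on the hop-2 embedding alone.

    candidates: (fact, hop1_embedding, hop2_embedding) triples.
    """
    scored = [(cos_dist(h1, query_vec) + cos_dist(h2, query_vec), fact)
              for fact, h1, h2 in candidates]
    scored.sort(key=lambda s: s[0])
    return [fact for _, fact in scored[:top_k]]
```

The design point is that a candidate whose hop-2 fact looks plausible but whose hop-1 fact is unrelated to the question now gets penalized by the hop-1 term.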

That simple change materially improved answer quality on the same balanced points:

  • 5K chains, m=24, ef_construction=200, ann_k=64, sorted_hnsw.ef_search=128
    • city-only sorted_heap_graph_rag_twohop_scan()
      • 0.762 ms
      • hit@1 = 75.0%
      • hit@k = 96.9%
    • path-aware SQL rerank on facts_sh
      • 0.957 ms
      • hit@1 = 98.4%
      • hit@k = 98.4%
  • 10K chains, same knobs
    • city-only sorted_heap_graph_rag_twohop_scan()
      • 0.937 ms
      • hit@1 = 71.9%
      • hit@k = 92.2%
    • path-aware SQL rerank on facts_sh
      • 1.179 ms
      • hit@1 = 95.3%
      • hit@k = 96.9%

This is the strongest current architectural signal on the fact-shaped benchmark:

  • the remaining quality gap is not well explained by seed recall
  • it is also not well explained by broad rerank collapse
  • a simple path-aware scorer recovers most of the lost quality with only a modest latency increase

That branch is now implemented locally too:

  • sorted_heap_expand_twohop_path_rerank(...)
  • sorted_heap_graph_rag_twohop_path_scan(...)

And the fused helper beats the SQL path-aware baseline on the same balanced points:

  • 5K chains
    • SQL path-aware baseline: 0.847 ms, hit@1 = 98.4%, hit@k = 98.4%
    • fused helper: 0.726 ms, hit@1 = 98.4%, hit@k = 98.4%
    • one-call wrapper: 0.739 ms, hit@1 = 98.4%, hit@k = 98.4%
  • 10K chains
    • SQL path-aware baseline: 0.942 ms, hit@1 = 95.3%, hit@k = 96.9%
    • fused helper: 0.823 ms, hit@1 = 95.3%, hit@k = 96.9%
    • one-call wrapper: 0.834 ms, hit@1 = 95.3%, hit@k = 96.9%

So for multihop fact retrieval, the next serious question is no longer whether path-aware rerank helps. It does. The next question is whether this new helper/wrapper transfers cleanly to AWS and then to a real cogniformerus-like corpus.

That AWS transfer is now verified too. On AWS ARM64 (4 vCPU, 8 GiB RAM), at the same balanced m=24 / ef_construction=200 / ann_k=64 / ef_search=128 point:

  • 5K chains
    • heap two-hop SQL: 1.088 ms, hit@1 = 75.0%, hit@k = 96.9%
    • city-only wrapper: 1.012 ms, hit@1 = 75.0%, hit@k = 96.9%
    • SQL path-aware baseline: 1.204 ms, hit@1 = 98.4%, hit@k = 98.4%
    • fused helper: 0.955 ms, hit@1 = 98.4%, hit@k = 98.4%
    • one-call path-aware wrapper: 1.018 ms, hit@1 = 98.4%, hit@k = 98.4%
    • pgvector + heap expansion, same path-aware scorer: 1.422 ms, hit@1 = 85.9%, hit@k = 85.9%
    • zvec + heap expansion, same path-aware scorer: 1.720 ms, hit@1 = 100.0%, hit@k = 100.0%
    • Qdrant + heap expansion, same path-aware scorer: 3.435 ms, hit@1 = 100.0%, hit@k = 100.0%
  • 10K chains, same knobs
    • heap two-hop SQL: 1.319 ms, hit@1 = 71.9%, hit@k = 92.2%
    • city-only wrapper: 1.197 ms, hit@1 = 73.4%, hit@k = 93.8%
    • SQL path-aware baseline: 1.436 ms, hit@1 = 96.9%, hit@k = 98.4%
    • fused helper: 1.185 ms, hit@1 = 96.9%, hit@k = 98.4%
    • one-call path-aware wrapper: 1.212 ms, hit@1 = 96.9%, hit@k = 98.4%

So the answer to the transfer question is now yes: the path-aware helper and wrapper survive the AWS move cleanly, and the old larger-scale caveat narrows substantially once the rerank contract is fixed.

This also closes the earlier apples-to-apples gap. Once all engines are scored under the same path-aware contract:

  • sorted_heap is the latency leader
  • zvec and Qdrant hold the strongest observed answer quality
  • pgvector remains behind on both latency and quality at this operating point

One AWS all-engines rerun briefly dropped the sorted_heap path-aware rows to 96.9% / 96.9%, but an immediate sorted_heap-only control and a second full rerun both returned 98.4% / 98.4%. So the portable parity story now has one verified outlier plus two confirming reruns. That was enough to justify the benchmark note, and it directly motivated the repeated-build protocol recorded below.

Repeated-build local variance

A repeated-build harness wraps scripts/bench_graph_rag_multihop.py so that each repeat gets a fresh temp cluster and a fresh HNSW build, then reports median / min / max for selected rows.
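
The per-row reporting can be sketched with the standard library. This is an illustrative sketch of the aggregation only; the field names are made up for the example.

```python
import statistics

def summarize_repeats(rows):
    """rows: per-rebuild measurements for one benchmark row across
    independent rebuilds, e.g. [{"p50_ms": 0.771, "hit1": 98.4}, ...].
    Reports median / min / max latency and the observed quality range."""
    p50s = [r["p50_ms"] for r in rows]
    hit1s = [r["hit1"] for r in rows]
    return {
        "p50_median": statistics.median(p50s),
        "p50_min": min(p50s),
        "p50_max": max(p50s),
        "hit1_range": (min(hit1s), max(hit1s)),
    }
```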

On the balanced local 5K point (m=24 / ef_construction=200 / ann_k=64 / ef_search=128), three independent rebuilds produced:

  • sorted_heap_expand_twohop_path_rerank()
    • p50_ms: median 0.798, range 0.771-0.819
    • hit@1 = 98.4%, hit@k = 98.4% on all three builds
  • sorted_heap_graph_rag_twohop_path_scan()
    • p50_ms: median 0.796, range 0.778-0.804
    • hit@1 = 98.4%, hit@k = 98.4% on all three builds
  • pgvector path-aware parity row
    • p50_ms: median 1.405, range 1.318-1.456
    • hit@1/hit@k: 85.9-89.1%
  • zvec path-aware parity row
    • p50_ms: median 1.076, range 1.053-1.087
    • hit@1 = 100.0%, hit@k = 100.0% on all three builds
  • Qdrant path-aware parity row
    • p50_ms: median 2.799, range 2.792-2.805
    • hit@1 = 100.0%, hit@k = 100.0% on all three builds

So the balanced local path-aware sorted_heap point is not just a lucky single build. The answer quality stayed fixed across rebuilds, and the latency spread was narrow. The remaining variance story now looks more like:

  • local balanced sorted_heap: stable across rebuilds
  • AWS balanced sorted_heap: also stable across repeated builds on the 5K point, with one earlier outlier now downgraded to an anomaly
  • pgvector: measurable quality drift across local rebuilds
  • zvec / Qdrant: stable on this deterministic local fact graph

The AWS repeated-build protocol on the balanced 5K point produced:

  • sorted_heap_expand_twohop_path_rerank()
    • p50_ms: median 0.962, range 0.956-0.965
    • hit@1 = 98.4%, hit@k = 98.4% on all three builds
  • sorted_heap_graph_rag_twohop_path_scan()
    • p50_ms: median 1.025, range 1.018-1.043
    • hit@1 = 98.4%, hit@k = 98.4% on all three builds
  • pgvector path-aware parity row
    • p50_ms: median 1.434, range 1.370-1.493
    • hit@1/hit@k: 84.4-89.1%
  • zvec path-aware parity row
    • p50_ms: median 1.711, range 1.703-1.768
    • hit@1 = 100.0%, hit@k = 100.0% on all three builds
  • Qdrant path-aware parity row
    • p50_ms: median 3.355, range 3.302-3.465
    • hit@1 = 100.0%, hit@k = 100.0% on all three builds

So the current confidence picture is stronger than before:

  • local balanced 5K: repeated-build stable
  • AWS balanced 5K: repeated-build stable
  • larger 10K AWS path-aware rows: repeated-build stable too, but at a lower quality frontier than 5K

The AWS repeated-build protocol on the larger 10K point produced:

  • sorted_heap_expand_twohop_path_rerank()
    • p50_ms: median 1.177, range 1.148-1.191
    • hit@1 = 95.3%, hit@k = 96.9% on all three builds
  • sorted_heap_graph_rag_twohop_path_scan()
    • p50_ms: median 1.236, range 1.211-1.240
    • hit@1 = 95.3%, hit@k = 96.9% on all three builds
  • pgvector path-aware parity row
    • p50_ms: median 1.667, range 1.665-1.676
    • hit@1/hit@k: 76.6-82.8%
  • zvec path-aware parity row
    • p50_ms: median 2.788, range 2.762-2.789
    • hit@1 = 98.4%, hit@k = 100.0% on all three builds
  • Qdrant path-aware parity row
    • p50_ms: median 3.818, range 3.788-3.846
    • hit@1 = 98.4%, hit@k = 100.0% on all three builds

This sharpens the conclusion again:

  • the 10K AWS point is no longer a variance question
  • it is a real scale frontier
  • sorted_heap remains the latency leader there
  • zvec and Qdrant still lead on answer quality

This also falsifies one tempting but wrong simplification:

once the helper is fast, the remaining GraphRAG problem is solved

Not quite. On fact-shaped multihop queries, seed ANN quality and graph construction quality still matter enough that ann_k, ef_search, and the build-time knobs (m, ef_construction) remain first-class tuning parameters. But the old hop-2-only rerank contract was a separate, larger problem, and the new path-aware helper fixes most of it on the current local benchmark.

Current verdict

sorted_heap already has a plausible GraphRAG foundation, and the new helper proves that a narrow C primitive can materially improve the GraphRAG path.

What is now true:

  • SQL-only GraphRAG composition was not enough
  • sorted_heap_expand_ids() is enough to recover a large part of that gap
  • sorted_heap_expand_rerank() recovers most of the rerank overhead on the current sorted_heap path
  • sorted_heap_graph_rag_scan() makes the composition available as a single SQL call without giving back much latency
  • sorted_heap_expand_twohop_rerank() turns the earlier two-hop composition evidence into a real latency win on the real-text Gutenberg slices we tested
  • on the cogniformerus-style person -> parent -> city benchmark, the fused two-hop helper is the fastest PostgreSQL path we tested
  • sorted_heap_graph_rag_twohop_scan() closes the current fact-shaped wrapper gap without materially giving back latency
  • sorted_heap_expand_twohop_path_rerank() upgrades the fact-shaped rerank contract to use hop-1 and hop-2 evidence together
  • sorted_heap_graph_rag_twohop_path_scan() makes that path-aware contract available as a single-call primitive
  • the path-aware helper and wrapper transfer cleanly from local to AWS ARM64 on the same balanced m=24 / ef_construction=200 / ann_k=64 / ef_search=128 point
  • the narrow-helper direction is a justified building block
  • the current helper model already composes into a competitive two-hop real-text GraphRAG path on Gutenberg without requiring a new graph API
  • on the real-text GraphRAG shape, pgvector parity is already materially worse end-to-end than the fused sorted_heap helper path
  • on the fact-shaped AWS path-aware benchmark, sorted_heap is now the fastest verified end-to-end path, while zvec and Qdrant remain the answer quality leaders
  • zvec is stable on the medium slice but currently not robust on the larger real-text slice at ann_k=32
  • Qdrant is robust on both real-text slices but materially slower than the fused sorted_heap helper on the same workflow

What is not yet true:

  • sorted_heap is not yet clearly better than heap+btree on pure expansion latency for this synthetic workload
  • even the relation-filtered GraphRAG path still trails heap+btree slightly on this synthetic benchmark
  • two-hop helper composition is not yet a universal latency win; at higher rerank dimensions it narrows to parity with heap+btree rather than staying clearly ahead
  • the current benchmark suite is still deterministic/synthetic rather than a real cogniformerus corpus, so the remaining generalization gap is about workload realism more than about build variance
  • transfer to a larger real cogniformerus corpus is still unverified; the current fact-shaped benchmark is deterministic and synthetic even though it matches the intended multihop query shape

Actual Butler gate seed-corpus smoke

The next honest step after the synthetic-chain work was to stop guessing and run the path-aware GraphRAG helpers on the actual tiny multihop corpus that cogniformerus already ships in its Butler gate smoke.

This fixture is intentionally tiny:

  • 7 graph facts loaded into facts_heap / facts_sh
  • 2 positive multihop queries
    • Project Atlas -> Orion -> Helsinki
    • Release 13 -> Aurora -> April

So it is not a publishable latency frontier. Its job is narrower:

  • verify that the current path-aware helper and wrapper work on the real Butler gate fact texts and prompts
  • replace the previous blanket statement “real cogniformerus still unverified” with a tighter one: the actual gate seed corpus is covered, but larger real corpora are not

The first local smoke run on this real gate seed corpus used:

  • 384D
  • ann_k=4
  • top_k=4
  • m=24
  • ef_construction=200
  • sorted_hnsw.ef_search=64
  • 5 timing runs on a fresh temp cluster

Result:

  • heap path-aware SQL baseline:
    • p50 0.027 ms
    • hit@1/hit@k = 100/100
  • facts_sh path-aware SQL baseline:
    • p50 0.026 ms
    • hit@1/hit@k = 100/100
  • sorted_heap_expand_twohop_path_rerank():
    • p50 0.017 ms
    • hit@1/hit@k = 100/100
  • sorted_heap_graph_rag_twohop_path_scan():
    • p50 0.045 ms
    • hit@1/hit@k = 100/100

This does not prove scale behavior. It proves something narrower and still useful: the current path-aware GraphRAG helper/wrapper contract works on the actual Butler gate seed facts and prompts, not only on the synthetic person -> parent -> city generator.

One adversary control also mattered here: this was not only a pass at a near-full seed budget. Re-running the same smoke at ann_k=2, top_k=2 still kept both multihop queries at 100/100.

The correct next step is therefore:

tune the current narrow helper family before considering a bigger graph-specific subsystem

That remains the smallest change that can still convert the observed block-pruning advantage into an end-to-end query win.

Real code-corpus prototype

The next honest check after the Butler gate fact smoke was not another synthetic graph. It was the actual cogniformerus code corpus plus the real cross-file question bank already used by Butler’s own code benchmark.

This harness builds a narrow code-GraphRAG shape:

  • each source file is one entity
  • each chunk in that file becomes one fact row
    • entity_id = file_id
    • relation_id = HAS_CHUNK
    • target_id = chunk_id
  • query quality is scored against the real CrossFile benchmark keywords from butler_code_test.cr
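
The corpus-to-fact-row mapping can be sketched directly. The numeric value of the HAS_CHUNK relation id is an assumed constant for the sketch; only the shape of the mapping is taken from the description above.

```python
# Assumed relation id for the sketch; the real harness may use another value.
HAS_CHUNK = 1

def corpus_to_fact_rows(files):
    """files: {file_id: [chunk_id, ...]} for the chunked source tree.
    Yields (entity_id, relation_id, target_id) fact rows, one per chunk,
    matching the facts_heap schema's leading key columns."""
    for file_id, chunk_ids in files.items():
        for chunk_id in chunk_ids:
            yield (file_id, HAS_CHUNK, chunk_id)
```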

This is not a full code graph. It is a bounded falsifier for a simpler claim:

if GraphRAG-style seeded expansion is already useful on a real corpus, it should show up even on the natural file -> chunk expansion shape

The first stable local point used:

  • 40 files
  • 747 chunk rows
  • 6 real CrossFile questions
  • 384D
  • ann_k=16
  • top_k=4
  • m=24
  • ef_construction=200
  • sorted_hnsw.ef_search=64
  • shared_buffers=64MB
  • fresh backend
  • 3 timing runs

Result:

  • direct ANN over raw chunks:
    • heap: p50 0.740 ms
    • sorted_heap: p50 0.712 ms
    • keyword coverage: 63.3%
    • full-keyword hits: 33.3%
  • file-seeded SQL expansion:
    • heap: p50 0.516 ms
    • sorted_heap: p50 0.468 ms
    • same 63.3% keyword coverage
    • same 33.3% full-keyword hits
  • sorted_heap_expand_rerank() helper:
    • p50 0.665 ms
    • same 63.3% keyword coverage
    • same 33.3% full-keyword hits

The important conclusion is narrow but real:

  • the real code corpus branch is now reproducible inside this repository
  • seeded expansion by file preserves answer-support quality on the real CrossFile question set
  • on this code corpus, the current gain is latency, not answer quality
  • the helper is not yet the latency leader on this tiny real corpus; the simple SQL expansion shape still wins locally

This means the next code-corpus GraphRAG step is not “invent a bigger graph API”. It is either:

  • a richer real code-graph relation hypothesis than plain file -> chunk, or
  • a lower-overhead helper path for this very simple expansion contract

Real require-graph falsifier

The obvious next hypothesis was that plain file -> chunk was too weak, and that the real local code graph should help once actual require edges were present.

That hypothesis is now tested in the same harness:

  • 53 local require edges derived from the real cogniformerus source tree
  • relation REQUIRES_FILE
  • two new query shapes:
    • seed_require_twohop_*
    • seed_file_plus_require_in

Stable local result on the same 40-file / 800-row / 6-question point, 3 runs:

  • plain file-seeded expansion:
    • sorted_heap: 0.471 ms
    • keyword coverage: 63.3%
    • full hits: 33.3%
  • file plus required files:
    • sorted_heap: 0.605 ms
    • same 63.3% keyword coverage
    • same 33.3% full hits
  • dependency-only two-hop:
    • sorted_heap: 0.391 ms
    • keyword coverage: 20.0%
    • full hits: 0.0%

So the richer real relation hypothesis is currently refuted on this code corpus:

  • adding dependency files does not improve answer-support quality
  • dependency-only traversal is actively worse because it drops own-file context
  • unioning own files with required files only adds cost, not quality

This is a useful stopping point. The next likely win for real code-GraphRAG is not “just add more code edges”. It is a different retrieval contract or a lower-overhead helper path on the already-good file-seeded shape.

File-summary seed falsifier

The next retrieval-contract hypothesis was also tested locally on the same real code corpus:

  • add one synthetic-but-data-derived summary row per file
  • seed on those summary rows
  • then expand back to the file’s chunk rows

The goal was to test whether the missing factor was simply that chunk-level ANN was a poor way to choose files.

That also failed to improve answer-support quality.

Stable smoke result on the same 40-file / 840-row / 6-question point:

  • summary-seeded expansion:
    • heap: 0.587 ms
    • sorted_heap: 0.564 ms
    • keyword coverage: 63.3%
    • full hits: 33.3%

So the current real code-corpus plateau is now bounded more tightly:

  • plain file-seeded expansion: same quality, lower latency
  • file summaries: same quality, higher latency
  • require edges: no quality gain
  • require-only traversal: quality regression

That strongly suggests the next code-corpus GraphRAG branch should not be “more local graph structure” or “better file seeds” in the same lexical setup. The remaining frontier is more likely one of:

  • a different quality metric / question contract,
  • better embeddings,
  • or a lower-overhead execution path on the already-best file-seeded shape.

Oracle-seed and oracle-rerank diagnostic

The next adversary question was sharper:

is the plateau really about bad file seeds, or is it already downstream in the rerank / evaluation contract?

The harness now includes two explicit oracle diagnostics on the same real code corpus:

  • oracle file seeds
    • choose seed files by benchmark-keyword overlap against the full file text
    • this is not a deployable retrieval contract; it is a diagnostic ceiling
  • prompt-derived lexical rerank
    • keep the same ANN-derived file seeds
    • rerank by lexical overlap with terms extracted from the actual user prompt
    • this is deployable in principle, but much weaker than the oracle signal
  • oracle keyword rerank
    • keep the same ANN-derived file seeds
    • rerank the expanded chunk rows by direct overlap with the benchmark’s gold CrossFile keywords before falling back to embedding distance
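
The prompt-derived lexical rerank contract can be sketched as follows; `lexical_rerank`, the `(payload, distance)` row shape, and the term-extraction rule are illustrative assumptions, not the harness's real interface:

```python
import re

def prompt_terms(prompt):
    """Lowercased terms extracted from the actual user prompt (length > 2)."""
    return {t for t in re.findall(r"[a-z_]+", prompt.lower()) if len(t) > 2}

def lexical_rerank(rows, prompt, top_k=4):
    """Rerank (payload, distance) rows by prompt-term overlap,
    falling back to embedding distance on equal overlap."""
    terms = prompt_terms(prompt)
    def key(row):
        payload, distance = row
        overlap = sum(1 for t in terms if t in payload.lower())
        return (-overlap, distance)  # more overlap first, then nearer
    return sorted(rows, key=key)[:top_k]
```

The oracle variant differs only in where the terms come from: the benchmark's gold keywords instead of `prompt_terms(prompt)`.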

Stable local result, 3 runs, same 40-file / 840-row / 6-question point:

  • plain file-seeded expansion:
    • sorted_heap: 0.443 ms
    • keyword coverage: 63.3%
    • full hits: 33.3%
  • oracle file seeds:
    • sorted_heap: 0.416 ms
    • same 63.3% keyword coverage
    • same 33.3% full hits
  • prompt-derived lexical rerank:
    • sorted_heap: 3.005 ms
    • same 63.3% keyword coverage
    • worse 16.7% full hits
  • oracle keyword rerank:
    • heap: 2.905 ms
    • sorted_heap: 2.944 ms
    • keyword coverage: 90.0%
    • full hits: 66.7%

This is a useful but narrow falsifier:

  • the plateau is not explained by weak file seeds alone
  • richer local graph structure also did not explain it
  • a simple prompt-term rerank at top_k=4 also did not explain it
  • but once the rerank contract is allowed to use the benchmark’s own gold keywords, quality jumps sharply

That does not justify a product claim, because the oracle rerank is using the same keyword signal that the benchmark later scores. It does justify a more targeted next hypothesis:

the remaining quality frontier on the real code corpus is more likely in the query/rerank contract or embedding space than in local graph topology or seed selection

Result-budget and packing diagnostic

The broad “cheap lexical hybrid does not help” claim turned out to be too strong once the same real code-corpus harness was rerun at larger result budgets.

Bounded local sweep, same 40-file / 840-row / 6-question corpus, ann_k=16, 3 runs:

  • plain file-seeded sorted_heap expansion:
    • top_k=4: 0.402 ms, 63.3% keyword coverage, 33.3% full hits
    • top_k=8: 0.460 ms, 68.1% keyword coverage, same 33.3% full hits
    • top_k=16: 0.469 ms, 84.3% keyword coverage, same 33.3% full hits
    • top_k=32: 0.449 ms, 94.3% keyword coverage, 66.7% full hits
  • prompt-derived lexical rerank:
    • top_k=4: 3.005 ms, 63.3%, 16.7%
    • top_k=8: 3.176 ms, 86.7%, 50.0%
    • top_k=12: 3.149 ms, 90.0%, 66.7%
    • top_k=32: 3.147 ms, 96.7%, 83.3%

So the real code-corpus plateau is not just a seed-quality problem. It is also partly a result-budget / packing problem:

  • with more rows, even the plain file-seeded path recovers much more keyword coverage
  • prompt-derived lexical rerank starts to help only once the row budget is not extremely tight

That makes the next bounded hypothesis more specific:

the remaining small-top_k gap is likely about how evidence is packed into a tiny chunk budget, not about choosing better files

One more diagnostic supports that narrower claim. On the original top_k=4 point, a diversity-aware prompt-term rerank was also tested:

  • sorted_heap prompt-diverse rerank:
    • 3.229 ms
    • 76.7% keyword coverage
    • still only 33.3% full hits
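
The diversity-aware variant can be sketched as greedy coverage over prompt terms; the exact harness scoring may differ, and the `(payload, distance)` row shape is an assumption:

```python
def diverse_rerank(rows, terms, top_k=4):
    """Greedy diversity-aware selection: each step takes the row that adds
    the most not-yet-covered prompt terms, nearer distance on ties."""
    remaining = list(rows)          # rows are (payload, distance)
    covered = set()
    picked = []
    while remaining and len(picked) < top_k:
        def key(row):
            payload, distance = row
            new = {t for t in terms if t in payload.lower()} - covered
            return (-len(new), distance)
        best = min(remaining, key=key)
        remaining.remove(best)
        covered |= {t for t in terms if t in best[0].lower()}
        picked.append(best)
    return picked
```

This explains the observed shape of the result: coverage improves (redundant rows are skipped), but full hits do not, because the budget is still too small to hold every needed fragment.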

That is a partial gain in coverage, but still not the qualitative jump needed to make the current small-budget contract compelling.

Code-aware embedding diagnostic

The next bounded hypothesis was exactly what the code corpus suggests:

maybe the remaining gap is not just about rerank logic, but about the fact that the current harness still uses a Gutenberg-style lexical tokenizer that does not understand CamelCase or _snake_case identifiers well

The harness now supports two embedding modes:

  • generic
    • existing lexical hash over generic text tokens
  • code_aware
    • keeps the full code token, but also splits identifiers on _ and CamelCase before hashing
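
The code_aware split rule can be sketched as below; this is a simplification of whatever the harness actually hashes, and the regexes are an assumption about identifier shapes:

```python
import re

def code_aware_tokens(text):
    """code_aware mode: keep the full identifier token, but also emit
    its '_' and CamelCase sub-parts so identifiers match prompt words."""
    tokens = []
    for tok in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        tokens.append(tok.lower())
        for part in filter(None, tok.split("_")):
            # split CamelCase: HierarchicalMemory -> Hierarchical, Memory
            for sub in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z0-9]+", part):
                if sub.lower() != tok.lower():
                    tokens.append(sub.lower())
    return tokens
```

The generic mode, by contrast, would keep only the whole lowercased token, so a prompt word like "memory" never matches `HierarchicalMemory`.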

Stable local comparison on the same real 40-file / 840-row / 6-question point, ann_k=16, top_k=4, 3 runs:

  • plain file-seeded sorted_heap expansion:
    • generic: 0.450 ms, 63.3% keyword coverage, 33.3% full hits
    • code_aware: 0.427 ms, 61.4% keyword coverage, 16.7% full hits
  • prompt-diverse rerank:
    • generic: 3.178 ms, 76.7%, 33.3%
    • code_aware: 3.351 ms, 76.7%, 50.0%
  • oracle keyword rerank:
    • generic: 2.672 ms, 90.0%, 66.7%
    • code_aware: 2.435 ms, 96.7%, 83.3%

This is another mixed but useful falsifier:

  • code-aware tokenization is not a free win by itself
  • plain ANN + file expansion actually got slightly worse
  • but once combined with a diversity-aware rerank, the same code-aware mode did improve the small-budget full_pct

So the current code-corpus frontier is now even narrower:

the next likely win is not “better seeds” or “more edges”, but a tighter coupling between code-aware embeddings and a smarter small-budget rerank / packing contract

Summary-output packing win

The next bounded hypothesis was the most direct one implied by the previous diagnostics:

if the real bottleneck is small-budget packing, then maybe raw chunks are simply the wrong final output unit for this code benchmark

The harness already materializes one summary row per file. The new test keeps the same ANN-derived file seeds, but returns file summaries as the final output rows instead of raw chunks.

Stable local result on the same real 40-file / 840-row / 6-question point, ann_k=16, top_k=4, 3 runs:

  • generic embedding mode:
    • chunk output (seed_file_expand_in, sorted_heap):
      • 0.418 ms
      • 63.3% keyword coverage
      • 33.3% full hits
    • summary output (seed_file_summary_output_in, sorted_heap):
      • 0.200 ms
      • 71.0% keyword coverage
      • 33.3% full hits
    • prompt summary rerank (prompt_summary_rerank_in, sorted_heap):
      • 0.318 ms
      • 73.3% keyword coverage
      • 50.0% full hits
  • code-aware embedding mode:
    • chunk output:
      • 0.418 ms
      • 61.4%
      • 16.7%
    • summary output:
      • 0.207 ms
      • 77.6%
      • 33.3%
    • prompt summary rerank:
      • 0.426 ms
      • 77.6%
      • 33.3%

This is the first clean small-budget win on the real code corpus:

  • summary rows are a better packing unit than raw chunks at top_k=4
  • they improve coverage while also reducing latency
  • in the generic mode, prompt-aware reranking over summaries also improves full_pct

So the current strongest product-facing hypothesis is no longer “better seeds” or “more graph edges”. It is:

for real code GraphRAG, file summaries are a stronger final output unit than raw chunks when the answer budget is tiny

Summary rows as seed unit

The next narrow question was whether summaries are only a better output unit, or also a better seed unit.

That was tested by forcing the ANN seed step to rank only REL_FILE_SUMMARY rows and then keeping the final result set on summaries as well.

Stable local result on the same real 40-file / 840-row / 6-question point, ann_k=16, top_k=4, 3 runs:

  • generic embedding mode:
    • summary output from mixed ANN seeds (seed_file_summary_output_in, sorted_heap):
      • 0.199 ms
      • 71.0% keyword coverage
      • 33.3% full hits
    • summary output from summary-only seeds (summary_seed_summary_output_in, sorted_heap):
      • 0.116 ms
      • 77.6% keyword coverage
      • 33.3% full hits
    • prompt summary rerank from mixed seeds:
      • 0.329 ms
      • 73.3%
      • 50.0%
    • prompt summary rerank from summary-only seeds:
      • 0.541 ms
      • 74.3%
      • 33.3%
  • code-aware embedding mode:
    • mixed-seed summary output:
      • 0.193 ms
      • 77.6%
      • 33.3%
    • summary-only seed summary output:
      • 0.112 ms
      • 64.3%
      • 33.3%

So the current tiny-budget frontier is now split into two clear points:

  • fastest coverage point on this corpus:
    • generic embedding mode
    • summary-only seeds
    • summary output
  • best full-hit point on this corpus:
    • generic embedding mode
    • mixed ANN seeds
    • prompt-aware summary rerank

And one more falsifier is now clear:

summary rows are not universally a better seed unit; the benefit depends on the embedding mode and the final scoring contract

Summary-plus-chunk hybrid output

The next bounded question was whether the best tiny-budget contract should stay purely on summaries, or whether a hybrid output can do better:

use summaries to choose the right files, but also emit one best chunk from each selected file so the final answer set contains both compressed context and one concrete code span

That was tested in two variants:

  • mixed ANN seeds -> summary ranking -> one best chunk per selected file
  • summary-only seeds -> summary ranking -> one best chunk per selected file

Stable local result on the same real 40-file / 840-row / 6-question point, ann_k=16, top_k=4, 3 runs:

  • generic embedding mode:
    • best prior full-hit point: mixed-seed prompt summary rerank
      • 0.363 ms
      • 73.3%
      • 50.0%
    • mixed-seed summary+chunk hybrid:
      • 1.481 ms
      • 84.3%
      • 33.3%
    • summary-seeded summary+chunk hybrid:
      • 1.627 ms
      • 78.1%
      • 50.0%
  • code-aware embedding mode:
    • best prior summary-only point:
      • prompt summary rerank
      • 0.372 ms
      • 77.6%
      • 33.3%
    • mixed-seed summary+chunk hybrid:
      • 1.616 ms
      • 84.3%
      • 50.0%
    • summary-seeded summary+chunk hybrid:
      • 1.688 ms
      • 77.6%
      • 33.3%

So this branch narrows the frontier again:

  • hybrid output is not a universal improvement
  • for the generic mode, pure summary rerank remains the better tiny-budget full-hit point
  • for the code-aware mode, mixed-seed summary+chunk hybrid is the first path that reaches 50.0% full hits at top_k=4

That means the current strongest small-budget choices are now split:

  • generic mode:
    • summaries-only remain the better contract
  • code-aware mode:
    • hybrid summary+chunk output is now the better contract

Fixed-ratio hybrid packing

The previous hybrid branch still left one obvious ambiguity:

was the hybrid result about having both summaries and chunks at all, or just about how many summary slots the tiny top_k=4 budget reserved?

That was tested with two fixed-ratio mixed-seed hybrids:

  • summary-light: 1 summary slot + 3 chunk slots
  • summary-heavy: 3 summary slots + 1 chunk slot
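
The fixed-ratio contract is easy to state precisely. A sketch, assuming both pools arrive pre-ranked best-first (function name and backfill policy are illustrative):

```python
def pack_fixed_ratio(summaries, chunks, summary_slots, top_k=4):
    """Fill a tiny top_k budget with a fixed summary/chunk ratio.
    summaries and chunks are assumed pre-ranked best-first."""
    picked = summaries[:summary_slots] + chunks[:top_k - summary_slots]
    # backfill if one side runs short, so the budget is never wasted
    for pool in (summaries[summary_slots:], chunks[top_k - summary_slots:]):
        for row in pool:
            if len(picked) >= top_k:
                break
            picked.append(row)
    return picked[:top_k]

heavy = pack_fixed_ratio(["s1", "s2", "s3", "s4"], ["c1", "c2"],
                         summary_slots=3)  # summary-heavy: 3 + 1
```

The earlier balanced hybrid corresponds to an even split; summary-heavy reserves three of the four slots for summaries.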

Stable local result on the same real 40-file / 840-row / 6-question point, ann_k=16, top_k=4, 3 runs:

  • generic embedding mode:
    • prior best full-hit point:
      • prompt summary rerank
      • 0.337 ms
      • 73.3%
      • 50.0%
    • prior balanced hybrid:
      • 1.490 ms
      • 84.3%
      • 33.3%
    • summary-light hybrid:
      • 1.753 ms
      • 80.0%
      • 33.3%
    • summary-heavy hybrid:
      • 1.057 ms
      • 86.7%
      • 50.0%
  • code-aware embedding mode:
    • prior best point:
      • balanced hybrid
      • 1.566 ms
      • 84.3%
      • 50.0%
    • summary-light hybrid:
      • 2.246 ms
      • 68.1%
      • 33.3%
    • summary-heavy hybrid:
      • 0.879 ms
      • 84.3%
      • 50.0%

This resolves the remaining hybrid ambiguity:

  • the hybrid win is not about chunks in general
  • it is specifically about reserving a small number of chunk slots while keeping the budget summary-heavy

So the refined tiny-budget frontier is now:

  • generic mode:
    • best latency/full-hit tradeoff: pure prompt summary rerank
    • best coverage at the same full-hit level: summary-heavy hybrid
  • code-aware mode:
    • summary-heavy hybrid is now the strongest point

Summary-heavy hybrid with summary-only seeds

The remaining seed question after the fixed-ratio result was very narrow:

if the winning hybrid is already summary-heavy, should its seed unit also be switched fully to summaries?

That was tested directly against the current summary-heavy mixed-seed hybrid.

Stable local result on the same real 40-file / 840-row / 6-question point, ann_k=16, top_k=4, 3 runs:

  • generic embedding mode:
    • prompt summary rerank:
      • 0.395 ms
      • 73.3%
      • 50.0%
    • mixed-seed summary-heavy hybrid:
      • 1.062 ms
      • 86.7%
      • 50.0%
    • summary-seeded summary-heavy hybrid:
      • 1.175 ms
      • 87.6%
      • 50.0%
  • code-aware embedding mode:
    • prompt summary rerank:
      • 0.390 ms
      • 77.6%
      • 33.3%
    • mixed-seed summary-heavy hybrid:
      • 0.965 ms
      • 84.3%
      • 50.0%
    • summary-seeded summary-heavy hybrid:
      • 0.981 ms
      • 77.6%
      • 33.3%

This closes the seed-unit branch for the current frontier:

  • generic mode:
    • summary-only seeds can squeeze out a tiny extra coverage gain, but they do not improve full hits and they cost more latency than the mixed-seed summary-heavy hybrid
  • code-aware mode:
    • summary-only seeds are clearly worse; the mixed-seed summary-heavy hybrid remains the strongest point

Per-question failure pattern

Aggregate percentages were no longer enough to guide the next branch, so the real code-corpus harness now supports targeted diagnostics:

  • --case-filter
  • --report-questions

That was used to inspect the current best generic and code-aware contracts on the exact CrossFile prompts from butler_code_test.cr.

Stable local diagnostic, same 40-file / 840-row / 6-question point, ann_k=16, top_k=4, 3 runs:

  • generic mode
    • best latency/full-hit point:
      • prompt_summary_rerank_in
    • best coverage/full-hit point:
      • prompt_summary_chunk_hybrid_s3_in
  • code-aware mode
    • best point:
      • prompt_summary_chunk_hybrid_s3_in

The important result is not just the percentages, but which questions stay hard:

  • Response memory policy
    • still misses under all current best contracts
    • current quality stays around 40.0%
  • Streaming overlap
    • still misses under all current best contracts
    • current quality stays around 80.0%
  • Butler response routing
    • generic contracts still miss it
    • code-aware summary-heavy hybrid fixes it to 100.0%
  • Memory store flow
    • generic best contracts already solve it
    • code-aware summary-heavy hybrid still leaves it at 85.7%

This narrows the remaining frontier again:

the next real improvement is likely query-specific or corpus-specific, not a broad packing or seed policy that helps every question equally

The new payload diagnostics make that even more concrete:

  • Response memory policy
    • current best contracts already pull the right file neighborhood:
      • memory/hierarchical.cr
      • memory/pgvector.cr
      • memory/external_store.cr
      • butler/persona.cr
    • the summary-heavy hybrid even surfaces the _micro_only chunk from memory/hierarchical.cr
    • but the remaining miss is about policy nuance, not file choice: the returned rows still do not cover the full combination of _micro_only, refusal/pollution behavior, and external-storage policy
  • Streaming overlap
    • current best contracts already pull the correct file: streaming/controller.cr
    • both summary and chunk rows surface the overlap/chunking topic
    • the remaining miss is about exact constants / same-file granularity: the query still does not close the final 1500 / 100 coverage gap

So for these two stubborn real prompts, the problem has narrowed from “retrieval picked the wrong files” to a much smaller statement:

the current system is usually choosing the right file region, but not yet the exact evidence fragment or policy detail needed to close the benchmark

Same-file local chunk refinement does not rescue the hard prompts

The next bounded hypothesis was:

if the right file is already selected, maybe the fix is simply to give the best file two nearby chunks instead of one

That was tested with a new prompt_summary_chunk_local2_in case:

  • keep the summary-heavy contract
  • keep mixed ANN seeds
  • replace the single best chunk from the top file with a 2-chunk local window around the best chunk anchor
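
The local-window variant is a one-line policy change. One simple clamped-window interpretation, assuming chunks arrive in file order and `anchor_idx` is the best chunk's position (names illustrative):

```python
def local_window(chunks, anchor_idx, width=2):
    """Return `width` adjacent chunks starting at the anchor chunk,
    clamped to file bounds (the 2-chunk local-window variant)."""
    start = max(0, min(anchor_idx, len(chunks) - width))
    return chunks[start:start + width]
```

As the numbers below show, spending the extra slot this way buys nothing on the hard prompts.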

It did not help.

Targeted hard-prompt rerun (Response memory policy + Streaming overlap, ann_k=16, top_k=4, fresh backend):

  • generic mode:
    • existing summary-heavy hybrid:
      • 70.0%
      • 0.945-1.050 ms
    • local 2-chunk refinement:
      • 70.0%
      • 1.660-1.729 ms
  • code-aware mode:
    • existing summary-heavy hybrid:
      • 60.0%
      • 1.041-1.078 ms
    • local 2-chunk refinement:
      • 60.0%
      • 1.689-1.733 ms

Bounded all-question rerun (40 files, 840 rows, 6 real questions, 3 runs):

  • generic mode:
    • existing summary-heavy hybrid:
      • 0.988 ms
      • 86.7%
      • 50.0%
    • local 2-chunk refinement:
      • 1.572 ms
      • 84.3%
      • 33.3%
  • code-aware mode:
    • existing summary-heavy hybrid:
      • 0.979 ms
      • 84.3%
      • 50.0%
    • local 2-chunk refinement:
      • 1.523 ms
      • 84.3%
      • 50.0%

So the next frontier is narrower again:

the missing quality is not solved by a simple “take one more nearby chunk” policy; the remaining problem is finer-grained evidence choice, not just a larger same-file window

Semantic chunk selection is a generic-mode win, but not a universal one

The next bounded question was different from the failed local-window branch:

maybe the chunk budget is fine, and the real problem is that the last chunk is being picked with the wrong scoring rule

That was tested with prompt_summary_chunk_semantic_s3_in:

  • keep the current summary-heavy mixed-seed contract
  • keep the same 3 summaries + 1 chunk budget
  • change only the final chunk selection:
    • old path: lexical-first within the top file
    • new path: semantic-distance-first within the top file
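
The difference between the two selection rules is one comparator. A sketch with an illustrative `(payload, distance)` row shape:

```python
def pick_final_chunk(chunks, prompt_terms, semantic=True):
    """Select the single chunk slot from the top file.
    semantic=True  -> nearest embedding distance wins
    semantic=False -> prompt-term overlap wins, distance breaks ties"""
    if semantic:
        return min(chunks, key=lambda c: c[1])
    def lex_key(c):
        payload, distance = c
        overlap = sum(1 for t in prompt_terms if t in payload.lower())
        return (-overlap, distance)
    return min(chunks, key=lex_key)
```

The semantic rule skips the per-row lexical scan entirely, which is where the latency saving comes from.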

Hard-prompt rerun (Response memory policy + Streaming overlap, fresh backend, ann_k=16, top_k=4):

  • generic mode:
    • old summary-heavy hybrid:
      • 70.0%
      • 0.992 ms
    • semantic chunk selection:
      • 70.0%
      • 0.481 ms
  • code-aware mode:
    • old summary-heavy hybrid:
      • 60.0%
      • 1.035 ms
    • semantic chunk selection:
      • 60.0%
      • 0.429 ms

Full 6-question rerun (40 files, 840 rows, 3 runs):

  • generic mode:
    • old summary-heavy hybrid:
      • 0.976 ms
      • 86.7%
      • 50.0%
    • semantic chunk selection:
      • 0.453 ms
      • 86.7%
      • 50.0%
  • code-aware mode:
    • old summary-heavy hybrid:
      • 0.975 ms
      • 84.3%
      • 50.0%
    • semantic chunk selection:
      • 0.474 ms
      • 77.6%
      • 33.3%

This creates a new mode-specific frontier:

  • generic mode
    • prompt_summary_chunk_semantic_s3_in is now the stronger coverage-preserving hybrid
    • it keeps the same aggregate quality as the old summary-heavy hybrid while cutting latency by roughly half
  • code-aware mode
    • the same semantic swap is not acceptable
    • it buys latency, but loses both coverage and full hits

So the next branch should treat the two embedding modes separately instead of assuming one chunk-selection rule can dominate both.

Prompt-focused file-local snippet extraction

The next successful branch stopped changing retrieval at all.

Instead of asking the SQL layer to return better rows, it asked a narrower question:

if prompt_summary_rerank_in already selects the right files, can we extract better evidence fragments from those files after retrieval?

That is now implemented in the real code-corpus harness as:

  • prompt_summary_snippet_py

Contract:

  • keep the existing prompt_summary_rerank_in SQL seed/output path
  • for each returned summary row, resolve the underlying source file
  • extract a prompt-focused snippet from the full file using:
    • prompt-term matching against code-aware line tokens
    • coverage-greedy anchor selection with method-definition tie-breaks
    • Crystal method-body expansion instead of fixed-radius windows for selected def anchors
    • adjacent helper-method merge for short ? helpers referenced by the selected method body
    • nearby config-initializer merge for short ivar-based helpers, so snippets keep concrete defaults like window_size=1500 / overlap=100
    • append the snippet to the original summary payload instead of replacing the summary row
  • cache (file, prompt) snippets in-process so repeated runs are measured in both cold and warm regimes
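
The coverage-greedy anchor step at the heart of that contract can be sketched as below; this omits the method-body expansion and merge passes, and the names are illustrative:

```python
def pick_snippet_anchors(lines, prompt_terms, max_anchors=2):
    """Coverage-greedy anchor selection: repeatedly pick the line that
    covers the most still-uncovered prompt terms, preferring method
    definitions ('def ') on ties."""
    def terms_on(line):
        low = line.lower()
        return {t for t in prompt_terms if t in low}
    covered, anchors = set(), []
    for _ in range(max_anchors):
        best, best_key = None, None
        for i, line in enumerate(lines):
            if i in anchors:
                continue
            gain = len(terms_on(line) - covered)
            if gain == 0:
                continue
            key = (gain, line.lstrip().startswith("def "))
            if best_key is None or key > best_key:
                best, best_key = i, key
        if best is None:
            break
        anchors.append(best)
        covered |= terms_on(lines[best])
    return anchors
```

In the real layer, each selected anchor is then expanded to the enclosing method body rather than a fixed-radius line window.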

This is a downstream evidence-selection layer, not a new PostgreSQL query primitive. The main value is answer quality on the real cogniformerus CrossFile benchmark.

Verified local result on the stable real code-corpus point (40 files, 840 rows, 6 questions, 384D, ann_k=16, top_k=4, fresh backend):

  • generic embedding mode:
    • prompt_summary_rerank_in:
      • p50 0.343-0.392 ms
      • 73.3% keyword coverage
      • 50.0% full hits
    • prompt_summary_snippet_py:
      • warm-cache p50 0.551-0.698 ms
      • cold first-pass p50 15.316 ms, avg 15.435 ms
      • 100.0% keyword coverage
      • 100.0% full hits
  • code-aware embedding mode:
    • prompt_summary_rerank_in:
      • p50 0.395-0.398 ms
      • 77.6%
      • 33.3%
    • prompt_summary_snippet_py:
      • warm-cache p50 0.623 ms
      • 97.6%
      • 83.3%
    • prompt_symbol_summary_snippet_py:
      • warm-cache p50 0.970-0.989 ms
      • 100.0%
      • 100.0%

Per-question generic rerun on the same corpus shows what the snippet layer actually fixed:

  • now solved at 100.0%:
    • Butler response routing
    • Memory store flow
    • Two-stage answering
    • NLU hybrid classification
    • Response memory policy
    • Streaming overlap

Per-question code-aware rerun with prompt_symbol_summary_snippet_py now also solves the full set at 100.0%, including the old remaining miss:

  • Memory store flow

Interpretation:

  • the remaining plateau on this real code corpus was not primarily file retrieval
    • it was a file-local evidence selection problem
  • the last code-aware miss turned out to be a seed-ranking problem inside the summary path, not a snippet-window problem
    • HierarchicalMemory was already present in the summary row for memory/hierarchical.cr
    • the fix was a bounded symbol-aware variant, prompt_symbol_summary_snippet_py, which:
      • extracts exact prompt symbols like HierarchicalMemory, TwoStageAnswerer, DialogueNLU
      • unions a tiny exact-symbol summary seed set with the existing ANN seeds
      • ranks summary rows by symbol_hits before the older prompt-term score
  • the strongest fix was not “wider windows”
    • it was preserving summary rows while adding code-structured snippets underneath them
  • prompt-focused snippet extraction is the first branch that moves the real code-corpus benchmark from 50.0% to 100.0% full hits at the same tiny-budget top_k=4
  • the current frontier is now split by embedding mode:
    • generic: prompt_summary_snippet_py remains the better latency point
    • code-aware: prompt_symbol_summary_snippet_py is the quality winner
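
The symbol-extraction and symbol-first ranking steps of the symbol-aware variant can be sketched as follows; the regexes are an assumption about what counts as a prompt symbol, and the row shape is illustrative:

```python
import re

def prompt_symbols(prompt):
    """Extract exact code symbols from a user prompt: CamelCase names
    (HierarchicalMemory, DialogueNLU) and snake_case identifiers."""
    camel = re.findall(r"\b[A-Z][a-z0-9]+(?:[A-Z][A-Za-z0-9]*)+\b", prompt)
    snake = re.findall(r"\b[a-z][a-z0-9]*(?:_[a-z0-9]+)+\b", prompt)
    return camel + snake

def rank_summaries(summary_rows, symbols):
    """Rank (file, summary_text, prompt_score) rows by exact symbol
    hits first, then by the older prompt-term score."""
    def key(row):
        _, text, prompt_score = row
        hits = sum(1 for s in symbols if s in text)
        return (-hits, -prompt_score)
    return sorted(summary_rows, key=key)
```

Note that ordinary capitalized words ("How", "Butler") do not match the CamelCase pattern, which keeps the exact-symbol seed set tiny.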

Important caveat:

  • the warm numbers rely on an in-process (file, prompt) snippet cache
  • the cold first-pass cost is still materially higher than pure SQL rerank
  • so this is a quality-oriented contract, not a free latency win
  • the symbol-aware variant is not a generic improvement:
    • in generic mode it gives no quality lift and only adds cost

That code-corpus frontier is now also checked under a repeated-build protocol:

Verified repeated-build result:

  • generic:
    • prompt_summary_snippet_py
      • p50 median 0.613 ms, range 0.543-0.632 ms
      • stable 100.0% / 100.0%
    • prompt_symbol_summary_snippet_py
      • p50 median 0.986 ms, range 0.932-1.047 ms
      • same 100.0% / 100.0%
      • therefore strictly slower on the generic frontier
  • code-aware:
    • prompt_summary_snippet_py
      • p50 median 0.612 ms, range 0.602-0.629 ms
      • stable 97.6% / 83.3%
    • prompt_symbol_summary_snippet_py
      • p50 median 0.963 ms, range 0.928-1.022 ms
      • stable 100.0% / 100.0%

Interpretation:

  • the new symbol-aware code-aware win is build-stable, not a one-off lucky HNSW construction
  • the generic frontier is also build-stable, and the symbol-aware case remains dominated there

That same repeated-build protocol was then rerun on an AWS ARM64 host (4 vCPU, 8 GiB RAM).

Verified AWS repeated-build result:

  • generic:
    • prompt_summary_snippet_py
      • p50 median 0.955 ms, range 0.954-0.960 ms
      • stable 100.0% / 100.0%
    • prompt_symbol_summary_snippet_py
      • p50 median 1.485 ms, range 1.473-1.487 ms
      • same 100.0% / 100.0%
      • still strictly slower on the generic frontier
  • code-aware:
    • prompt_summary_snippet_py
      • p50 median 1.008 ms, range 1.008-1.009 ms
      • stable 97.6% / 83.3%
    • prompt_symbol_summary_snippet_py
      • p50 median 1.541 ms, range 1.537-1.557 ms
      • stable 100.0% / 100.0%

So the code-aware split is now cross-environment verified:

  • generic keeps the older snippet contract
  • code-aware keeps the symbol-aware snippet contract
  • the change in winner is not a local Apple-only artifact

Larger in-repo cogniformerus transfer gate

The previous repeated-build result used the smaller synced cogniformerus/src/cogniformerus slice (40 files, 840 rows after summary + chunk expansion). That was a good stable benchmark, but it was still fair to ask whether the contract would survive a materially larger in-repo code corpus.

The next bounded adversary check therefore reran the same repeated-build protocol on the full cogniformerus repository:

  • source tree: ~/Projects/Crystal/cogniformerus
  • file count: 183 Crystal files
  • prompt set: the same real butler_code_test.cr CrossFile prompts
  • same ANN knobs:
    • 384D
    • ann_k=16
    • ef_search=64
    • ef_construction=200
    • m=24

The old tiny-budget point (top_k=4) did not transfer cleanly:

  • generic prompt_summary_snippet_py
    • p50 0.770 ms
    • 87.1% keyword coverage
    • 66.7% full hits
    • avg_rows 3.67
  • code-aware prompt_symbol_summary_snippet_py
    • p50 1.824 ms
    • 87.6% keyword coverage
    • 66.7% full hits
    • avg_rows 4.00

That is a real transfer gap, but it is not the same kind of failure as the external folding/src miss. The next bounded hypothesis was simply to raise the final result budget while keeping the same seed contract and the same winner cases.

At top_k=8, 3 fresh builds gave:

  • generic prompt_summary_snippet_py
    • p50 median 0.819 ms, range 0.794-0.855 ms
    • stable 100.0% / 100.0%
    • avg_rows 6.33
  • code-aware prompt_symbol_summary_snippet_py
    • p50 median 1.814 ms, range 1.669-2.101 ms
    • stable 100.0% / 100.0%
    • avg_rows 7.50

So the larger in-repo Crystal-side transfer gate is now verified.

The honest correction is:

  • the current real code-corpus winners are not universal at the old top_k=4 budget
  • on the full in-repo corpus, they need a slightly larger final result budget
  • once that budget moves to top_k=8, the current winners recover perfectly without needing a new seed or snippet contract

That narrows the remaining 0.13 real-corpus gap further:

  • ~/Projects/Crystal now has both the small stable slice and a larger full-repo transfer gate
  • the next unverified generalization work was the mixed-language / archive side (~/Projects/C, ~/SrcArchives)

Mixed-language ~/Projects/C adversary gate (pycdc)

The next release-hardening branch widened the code-corpus harness itself:

  • JSON question fixtures are now supported
  • source discovery is no longer hardcoded to *.cr
  • local dependency edges now also understand quoted C/C++ includes: #include "..." -> REQUIRES_FILE
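
The quoted-include edge derivation can be sketched as below; only quoted includes that resolve to a file inside the corpus become edges, and the resolution logic here is a simplification of the harness's:

```python
import posixpath
import re

INCLUDE_RE = re.compile(r'^\s*#\s*include\s+"([^"]+)"')

def require_edges(files):
    """Derive REQUIRES_FILE edges from quoted C/C++ includes.
    files: relative path -> source text. Angle-bracket includes
    (<cstdio>) are system headers and are skipped on purpose."""
    known = set(files)
    edges = []
    for path, text in files.items():
        base = posixpath.dirname(path)
        for line in text.splitlines():
            m = INCLUDE_RE.match(line)
            if not m:
                continue
            target = posixpath.normpath(posixpath.join(base, m.group(1)))
            if target in known:
                edges.append((path, "REQUIRES_FILE", target))
    return edges
```

Resolving relative to the including file's directory is what lets `"../util.h"` style includes land on real corpus files.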

That made it possible to run the same narrow code-GraphRAG benchmark shape on a real mixed-language corpus under ~/Projects/C without inventing a separate harness family.

The first such corpus was pycdc:

  • source tree: ~/Projects/C/pycdc
  • fixture: scripts/fixtures/graph_rag_pycdc_questions.json
  • source extensions:
    • .h
    • .cpp
    • .txt
    • .markdown
  • corpus size:
    • 138 files
    • 1281 rows after summary + chunk expansion
    • 72 local dependency edges from quoted includes

The first smoke run already gave the key split:

  • generic prompt_summary_snippet_py
    • 75.0% keyword coverage
    • 40.0% full hits
  • generic prompt_symbol_summary_snippet_py
    • 90.0%
    • 60.0%
  • code-aware prompt_summary_snippet_py
    • 70.0%
    • 60.0%
  • code-aware prompt_compactseed_require_summary_snippet_fn
    • 100.0%
    • 100.0%

That already falsified the lazy story that mixed-language transfer would look just like the Crystal corpora with only file-summary rerank. On pycdc, the include-aware rescue path matters much more.

Repeated-build verification at top_k=8, 3 fresh builds, then gave:

  • generic prompt_symbol_summary_snippet_py
    • p50 median 0.850 ms, range 0.825-1.118 ms
    • stable 90.0% / 60.0%
    • avg_rows 6.40
  • code-aware prompt_compactseed_require_summary_snippet_fn
    • p50 median 8.006 ms, range 7.799-8.136 ms
    • stable 100.0% / 100.0%
    • avg_rows 5.80

So the first real ~/Projects/C gate is now covered, but it does not produce the same frontier as the Crystal corpora:

  • there is no equally cheap generic 100.0% / 100.0% point here
  • the quality-complete point currently needs the slower helper-backed compact lexical seed + include rescue

That still narrows the 0.13 release gap meaningfully:

  • ~/Projects/Crystal is covered
  • ~/Projects/C is covered
  • the remaining unverified archive-side gate is now ~/SrcArchives

Archive-side ~/SrcArchives gate (ninja/src)

The last remaining real-corpus gap named in the 0.13 plan was the archive side under ~/SrcArchives. The new mixed-language harness path made it possible to cover that without another code change, so the next adversary corpus was:

  • source tree: ~/SrcArchives/apple/ninja/src
  • fixture: scripts/fixtures/graph_rag_ninja_questions.json
  • source extensions:
    • .h
    • .cc
  • corpus size:
    • 103 files
    • 1757 rows after summary + chunk expansion
    • 282 local dependency edges from quoted includes

The first smoke at the current default-ish budget (top_k=8) already gave a useful signal:

  • generic prompt_summary_snippet_py
    • 95.0% keyword coverage
    • 80.0% full hits
  • code-aware prompt_summary_snippet_py
    • 85.0%
    • 80.0%

That differed from pycdc in an important way:

  • the archive corpus was already close on the plain generic path
  • the code-aware path was not stronger here
  • there was no immediate evidence that a dependency-rescue branch was needed

The cheapest falsifier was therefore not a new query contract, but just a small increase in the final result budget. At top_k=12:

  • generic prompt_summary_snippet_py
    • 100.0% / 100.0%
    • p50 0.996 ms on the first smoke
  • code-aware prompt_summary_snippet_py
    • stayed at 85.0% / 80.0%

Repeated-build verification (3 fresh builds) then confirmed the archive-side winner:

  • generic prompt_summary_snippet_py
    • p50 median 0.914 ms, range 0.827-0.921 ms
    • stable 100.0% / 100.0%
    • avg_rows 7.80
  • code-aware prompt_summary_snippet_py
    • p50 median 0.871 ms, range 0.848-0.901 ms
    • stable 85.0% / 80.0%
    • avg_rows 7.60

So the archive-side gate is now covered, and the conclusion is pleasantly narrow:

  • ~/SrcArchives does not require a new rescue contract for the first verified corpus
  • the simple generic summary-snippet path closes ninja/src
  • the only change needed versus the smaller code-corpus points was a small result-budget bump from top_k=8 to top_k=12

This means the 0.13 larger real-corpus verification matrix is now complete in the scoped sense the plan asked for:

  • ~/Projects/Crystal
  • ~/Projects/C
  • ~/SrcArchives

External folding corpus check

The next adversary check was a second real code corpus outside this repository:

  • source tree: folding/src
  • prompt set: butler_folding_test.cr

This surfaced one real harness bug first:

  • scripts/bench_graph_rag_code_corpus.py originally globbed *.cr paths without filtering is_file()
  • on the folding tree that accidentally picked up .crystal-cache directories ending in .cr
  • the harness now filters to real files only
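The fix can be sketched directly. This is an illustrative reconstruction, not the exact code in scripts/bench_graph_rag_code_corpus.py; the function name is hypothetical:

```python
from pathlib import Path

def collect_source_files(root, extensions):
    """Collect real source files under root. Without the is_file() filter,
    a directory whose name ends in a source extension (such as entries under
    a .crystal-cache tree matching '*.cr') gets swept into the corpus."""
    files = []
    for ext in extensions:
        for path in Path(root).rglob(f"*{ext}"):
            if path.is_file():  # this filter was the missing piece
                files.append(path)
    return sorted(files)
```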

Once that was fixed, the external corpus produced a useful repeated-build result. Local 3-build protocol on facts_sh, generic mode, same small-budget point (384D, ann_k=16, top_k=4, ef_search=64, ef_construction=200, m=24, fresh backend):

  • prompt_summary_snippet_py
    • p50 median 1.048 ms, range 0.913-4.141 ms
    • quality drifted across fresh builds: 90.5-100.0% keyword coverage, 83.3-100.0% full hits
  • prompt_lexseed_require_summary_snippet_fn
    • the first non-oracle rescue to 100.0% / 100.0%
    • but under a colder repeated-build protocol it turned out to be much more expensive than the earlier one-build numbers suggested: p50 median 28.266 ms, range 26.887-30.698 ms
  • prompt_compactseed_require_summary_snippet_fn
    • p50 median 5.940 ms, range 5.914-6.128 ms
    • stable 100.0% / 100.0%
  • oracle_prompt_summary_snippet_py
    • on a bounded full rerun it also stayed at 100.0% / 100.0%, but the non-oracle compact-seed rescue already matches that quality, so oracle seeds are no longer the interesting external-generic diagnostic

Interpretation:

  • the old claim that generic external folding was already solved by prompt_summary_snippet_py was too strong
  • the generic baseline is now clearly less robust on this corpus than on the in-repo cogniformerus slice
  • the first full-summary lexical rescue proved that the external gap was solvable, but it was too expensive to be a real frontier
  • the stronger branch was a different lexical-seed representation: a compact per-file seed table built from file path terms, require-target terms, and deduplicated summary tokens
  • that compact-seed rescue still closes the quality gap to 100.0% / 100.0%, but cuts the old full-summary lexical rescue by about 4.8x locally
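The compact representation can be sketched as follows. The function name and exact tokenization rules are assumptions; the sketch only shows the three term sources (path terms, require-target terms, deduplicated summary tokens) and the bounded, deduplicated output that replaces full-payload scoring:

```python
import re

def compact_seed_terms(file_path, require_targets, summary_text, max_terms=64):
    """Hypothetical sketch: build a small deduplicated per-file term set
    instead of substring-scoring multi-kilobyte summary payloads."""
    terms = []
    seen = set()

    def add(tok):
        tok = tok.lower()
        if len(tok) >= 3 and tok not in seen:
            seen.add(tok)
            terms.append(tok)

    # 1. file path components
    for part in re.split(r"[/._-]+", file_path):
        add(part)
    # 2. require/include target components
    for target in require_targets:
        for part in re.split(r"[/._-]+", target):
            add(part)
    # 3. summary tokens, deduplicated and capped
    for tok in re.findall(r"[A-Za-z_]\w+", summary_text):
        add(tok)
        if len(terms) >= max_terms:
            break
    return terms
```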

An isolated timing split then narrowed where that penalty actually sits. On a fresh local 3-run sweep of only the old full-summary helper-backed rescue:

  • generic prompt_lexseed_require_summary_snippet_fn
    • avg fetch ms/query = 10.674
    • avg postprocess ms/query = 8.033
    • 24 snippet-cache misses, 48 hits
    • avg build time per miss = 6.010 ms
  • code-aware prompt_lexseed_require_summary_snippet_fn
    • avg fetch ms/query = 11.016
    • avg postprocess ms/query = 7.742
    • 24 snippet-cache misses, 48 hits
    • avg build time per miss = 5.787 ms

So the external rescue is not primarily a snippet-extraction problem. Even on the isolated cold pass, the dominant term is still the lexical-seed + REQUIRES_FILE fetch path. Snippet generation is a real secondary tax on the first pass, but it is not where the largest win now sits.

A kept-temp-cluster component probe narrowed that one step further. On the same external folding/src corpus:

  • ann alone was cheap: about 0.51 ms median across the 6 real prompts
  • lexical_seed alone was the real dominant stage: about 9.34 ms median
  • rescue_require landed at about 9.28 ms median because it inherits the same lexical-seed cost
  • rescue_lexical_require_summaries was about 9.86 ms median

The summary rows explain why this stage is expensive: REL_FILE_SUMMARY payload length was 80 / 2078 / 5441 bytes at min / median / max on the external corpus. So the rescue is paying to run prompt-term substring scoring against multi-kilobyte summary payloads even before snippet extraction starts.
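The expense is easy to see from the scoring shape. A minimal illustrative version (hypothetical, but it shows why per-row cost grows with both prompt-term count and payload length):

```python
def lexical_score(prompt_terms, payload):
    """Illustrative sketch of full-summary lexical seeding: every prompt
    term is substring-matched against the entire payload, so a ~2 KB median
    (5.4 KB max) summary is scanned per term, per candidate row."""
    payload_lc = payload.lower()
    return sum(1 for term in prompt_terms if term.lower() in payload_lc)
```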

The same external folding/src corpus also answered the code-aware question. At the same repeated-build point:

  • code-aware prompt_summary_snippet_py
    • p50 median 1.080 ms, range 1.048-1.146 ms
    • stable 79.8% / 66.7%
  • code-aware prompt_lexseed_require_summary_snippet_fn
    • p50 median 36.676 ms, range 29.806-40.705 ms
    • stable 100.0% / 100.0%
  • code-aware prompt_compactseed_require_summary_snippet_fn
    • p50 median 5.804 ms, range 5.776-6.510 ms
    • stable 100.0% / 100.0%
  • code-aware oracle_prompt_summary_snippet_py
    • p50 median 1.217 ms, range 1.149-1.303 ms
    • stable 100.0% / 100.0%

So the external folding split is now sharper:

  • both generic and code-aware external folding now have a verified non-oracle rescue to 100.0% / 100.0%
  • the external problem really was a seed-representation problem, not a snippet extraction problem
  • the current external default is the compact-seed rescue, not the old full-summary lexical rescue
  • the old full-summary rescue is now useful mainly as a diagnostic anchor for why the compact representation matters
  • the honest conclusion is narrower:
    • external folding is no longer blocked by an unsolved quality gap
    • it still pays a quality/latency tax relative to the primary cogniformerus code corpus, but that tax is now much smaller than before

That local result also transferred to AWS ARM64 (4 vCPU, 8 GiB RAM) under a fresh 3-build repeated-build protocol:

  • generic prompt_summary_snippet_py
    • p50 median 1.540 ms, range 1.535-1.604 ms
    • stable 90.5% / 83.3%
  • generic prompt_lexseed_require_summary_snippet_fn
    • p50 median 41.960 ms, range 41.747-42.081 ms
    • stable 100.0% / 100.0%
  • generic prompt_compactseed_require_summary_snippet_fn
    • p50 median 8.839 ms, range 8.732-8.846 ms
    • stable 100.0% / 100.0%
  • code-aware prompt_summary_snippet_py
    • p50 median 1.775 ms, range 1.729-1.836 ms
    • stable 79.8% / 66.7%
  • code-aware prompt_lexseed_require_summary_snippet_fn
    • p50 median 60.413 ms, range 60.298-60.660 ms
    • stable 100.0% / 100.0%
  • code-aware prompt_compactseed_require_summary_snippet_fn
    • p50 median 8.392 ms, range 8.329-8.413 ms
    • stable 100.0% / 100.0%

So the compact-seed external rescue is now cross-environment verified, not a local artifact. The speedup over the old full-summary lexical rescue also survives the environment change:

  • generic: 41.960 ms -> 8.839 ms
  • code-aware: 60.413 ms -> 8.392 ms

The external rescue is still slower than the primary in-repo winners, but it is no longer “full-quality only at tens of milliseconds”.

The next honest optimization target therefore changed. Cheap seed-budget cuts were already falsified (ann_k < 16 and lexical-seed LIMIT 1 both got worse), and the timing split located the remaining cost in the lexical-seed stage rather than in snippet extraction. Since the compact lexical seed table already eliminated most of that cost, the next branch is no longer “make lexical seeding viable at all”; it is whether the compact representation can be pushed closer to the primary in-repo code-corpus frontier.

One obvious branch was also falsified directly: truncating lexical scoring to a summary prefix. On the external corpus:

  • left(payload, 512) dropped the rescue query to about 7.9 ms, but quality fell back to 96.7% / 83.3%
  • left(payload, 1024) restored 100.0% / 100.0%, but it no longer sped the query up
  • the narrower threshold sweep (640..992) confirmed there was no useful middle ground: 992 bytes recovered 100.0% / 100.0%, but was still slower than the full-payload rescue

So a naive prefix cut is now a documented dead end. The remaining work is not “look at less text in the same way”; it needs a different lexical-seed representation or a different seed-selection contract altogether.

March 26, 2026: sorted_hnsw.shared_cache GraphRAG branch

A new bounded speed branch looked promising for fact-shaped GraphRAG: turning sorted_hnsw.shared_cache on for the ANN seed step. A direct local probe on a 2K x 384D multihop graph reduced the path-aware wrapper from roughly 0.911 ms total to 0.623 ms, with most of the gain in the ANN stage.

That did not survive the reliability gate.

On the full local 5K-pair, 64-query multihop harness, keeping the same quality knobs (ann_k=64, ef_search=128, ef_construction=200, m=24) but switching only sorted_hnsw.shared_cache from off to on caused all facts_sh ANN-seeded rows to collapse to 0.0% / 0.0%, while the facts_heap baseline stayed correct in the same run.

The strongest evidence from this branch is:

  • the simple direct ANN seed query on facts_sh still returned the expected top rows with shared_cache=on
  • single-query GraphRAG probes could still look correct
  • the failure only showed up on the full same-session multihop harness, which points to a cache lifecycle / reuse bug rather than a general GraphRAG scoring bug

So the current honest conclusion is narrow:

  • sorted_hnsw.shared_cache = on remains a promising performance branch for GraphRAG seed scans
  • it is not currently safe as the default GraphRAG benchmark or release operating point
  • the benchmark harnesses now expose a --shared-cache on|off switch, but the default stays off until this correctness issue is debugged and fixed
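A sketch of how such a switch can be wired on the harness side, assuming the flag is applied as a per-session GUC before the benchmark queries run. The function names here are hypothetical; the real scripts may differ:

```python
import argparse

def build_parser():
    """Sketch of the --shared-cache on|off harness switch."""
    parser = argparse.ArgumentParser()
    # Default stays "off" until the same-session cache-reuse bug is fixed.
    parser.add_argument("--shared-cache", choices=["on", "off"], default="off")
    return parser

def session_setup_sql(args):
    # Applied once per backend session, before any GraphRAG seed scans.
    return f"SET sorted_hnsw.shared_cache = {args.shared_cache};"
```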