GraphRAG on sorted_heap
This note evaluates a narrow question:
Can the current sorted_heap plus the current vector search already support a useful GraphRAG-style retrieval workflow, or do we need a new storage/model layer?
The conclusion so far is:
- 1-hop fact retrieval by source entity already fits sorted_heap well.
- Naive SQL join-based multi-hop expansion does not expose much advantage.
- ANY(array_of_seed_ids) expansion does trigger SortedHeapScan, but on warm and medium-scale local benchmarks it still loses to heap+btree on end-to-end latency despite reading fewer blocks.
- Narrow C helpers for expansion and fused top-K rerank now exist as sorted_heap_expand_ids(...) and sorted_heap_expand_rerank(...).
- A one-call convenience wrapper now exists as sorted_heap_graph_rag_scan(...).
- Those helpers materially improve the sorted_heap path on the synthetic GraphRAG benchmark, though pure heap+btree expansion is still faster on this synthetic workload.
- Therefore the next promising primitive was correctly a narrow C helper, not a new graph storage engine and not a giant monolithic graph_rag_scan() API.
Existing anchors
The repository already has the main building blocks:
- Zone-map pruning on sorted_heap
- Planner hook + SortedHeapScan custom scan
  - supports base-relation restriction on the leading PK columns
- Planner-integrated ANN via sorted_hnsw
  - exact ordered results
  - works on both heap tables and sorted_heap tables
- Legacy graph traversal precedent: svec_graph_scan() in pq.c
  - this is for ANN sidecar graph navigation, not fact graphs
  - still useful as evidence that the extension can host graph-like traversal logic in C
What was benchmarked
Synthetic fact graph schema:
CREATE TABLE facts_heap (
entity_id int4 NOT NULL,
relation_id int2 NOT NULL,
target_id int4 NOT NULL,
embedding svec(32) NOT NULL,
payload text NOT NULL,
PRIMARY KEY (entity_id, relation_id, target_id)
);
CREATE TABLE facts_sh (
entity_id int4 NOT NULL,
relation_id int2 NOT NULL,
target_id int4 NOT NULL,
embedding svec(32) NOT NULL,
payload text NOT NULL,
PRIMARY KEY (entity_id, relation_id, target_id)
) USING sorted_heap;
Both tables also receive the same ANN index:
CREATE INDEX ... USING sorted_hnsw (embedding) WITH (m = 16, ef_construction = 64);
Benchmark harness:
- scripts/bench_graph_rag.py
- local ephemeral PostgreSQL 18 temp cluster
- deterministic synthetic fact graph
- compares: hop1_entity, hop1_entity_relation, hop2_join, hop2_in, seed_expand_join, seed_expand_in, seed_expand_rerank_join, seed_expand_rerank_in, seed_expand_fn, seed_expand_rerank_fn, seed_expand_rerank_topk_fn, seed_graph_rag_scan_fn
The key comparison is between:
- join-shaped expansion
- ANY(array(seed_ids)) expansion

The second shape is the one that lets sorted_heap expose its pruning logic directly on entity_id.
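To make the distinction concrete, here is a hedged sketch of the two shapes against the benchmark schema above (the seeds CTE/table name is illustrative, not the harness's exact SQL):

```sql
-- Join-shaped expansion: the seed set arrives as a relation, so the
-- planner joins it against facts_sh and the zone-map pruning cannot be
-- applied directly to a known list of entity_id values.
SELECT f.*
FROM seeds s
JOIN facts_sh f ON f.entity_id = s.target_id;

-- ANY(array)-shaped expansion: the seed IDs are collapsed into an array
-- first, which lets SortedHeapScan prune on entity_id directly.
SELECT f.*
FROM facts_sh f
WHERE f.entity_id = ANY (ARRAY(SELECT target_id FROM seeds));
```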
Local findings
Small smoke run
On a tiny graph (300 entities, 4 edges/entity):
- facts_sh reduced buffer hits strongly for hop1_entity, hop1_entity_relation, hop2_in, and seed_expand_in
- but end-to-end latency stayed close to heap because the whole dataset was fully warm and tiny
Most importantly:
- join-shaped expansion largely erased the sorted_heap advantage
- ANY(array(...)) expansion preserved SortedHeapScan
Medium warm run
On 20K entities, 8 edges/entity (160K rows total), warm local cache:
- hop1_entity
  - heap: Index Scan
  - sorted_heap: Custom Scan: SortedHeapScan
  - sorted_heap reads fewer blocks and is roughly at latency parity
- seed_expand_join
  - bad shape for both
  - sorted_heap is not meaningfully better
- seed_expand_in
  - sorted_heap does use SortedHeapScan
  - buffer footprint drops
  - but heap+btree still wins on total latency
This means:
the current SQL shape can make sorted_heap read less, but executor/custom-scan overhead can still dominate total time on warm, medium-scale datasets
Medium run with lower shared buffers
On 20K entities, 16 edges/entity (320K rows total), shared_buffers=64MB:
- hop1_entity
  - sorted_heap stayed strong: fewer hits, same-or-better latency
- seed_expand_join
  - both paths were much worse
  - heap and sorted_heap were similar, with read noise dominating
- seed_expand_in
  - heap: lower latency
  - sorted_heap: fewer touched blocks / lower expansion footprint
  - but still slower end-to-end
This is the most important current result:
On a graph larger than a warm toy dataset, sorted_heap already shows the expected locality/pruning behavior for seed expansion, but the current SQL + CustomScan path is not enough to turn that into a consistent latency win over heap+btree.
Design implications
What not to build first
- Not a new graph storage engine
- current evidence does not justify that jump
- 1-hop retrieval is already good on current storage
- Not a giant monolithic svec_graph_rag_scan()
  - it would have to combine:
    - ANN seed retrieval
    - graph expansion
    - rerank
  - this is a large surface area
  - it also risks duplicating planner/index logic from sorted_hnsw
What to build next
The next narrow primitive should be something like:
sorted_heap_expand_ids(
rel regclass,
seed_ids int4[],
relation_filter int2 DEFAULT NULL,
limit_rows int4 DEFAULT 0
)
Why this shape:
- ANN seed retrieval can stay in SQL:
SELECT target_id FROM facts ORDER BY embedding <=> $query LIMIT K
- expansion becomes a dedicated low-overhead C primitive
- it avoids:
- repeated executor/planner setup
- generic CustomScan overhead for this narrow use case
- it keeps the product boundary small:
- “expand these known entity IDs quickly”
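Under that split, the composed query could look like the following sketch (the call syntax is an assumption based on the proposed signature above; the LIMIT value is illustrative):

```sql
-- ANN seed retrieval stays in SQL; expansion goes through the proposed
-- C helper. facts_sh and the column names follow the benchmark schema;
-- $1 is the query vector.
WITH seeds AS (
  SELECT target_id
  FROM facts_sh
  ORDER BY embedding <=> $1
  LIMIT 32
)
SELECT e.*
FROM sorted_heap_expand_ids(
       'facts_sh'::regclass,
       (SELECT array_agg(target_id) FROM seeds)
     ) AS e;
```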
That primitive can later be composed into:
- SQL-only GraphRAG
- a higher-level helper
- maybe a monolithic API if the narrow primitive proves valuable
Helper result
The narrow helpers now exist:
sorted_heap_expand_ids(
rel regclass,
seed_ids int4[],
relation_filter int4 DEFAULT NULL,
limit_rows int4 DEFAULT 0
)
RETURNS TABLE (
entity_id int4,
relation_id int2,
target_id int4,
embedding svec,
payload text
)
and:
sorted_heap_expand_rerank(
rel regclass,
seed_ids int4[],
query svec,
top_k int4,
relation_filter int4 DEFAULT NULL,
limit_rows int4 DEFAULT 0
)
RETURNS TABLE (
entity_id int4,
relation_id int2,
target_id int4,
payload text,
distance float8
)
and:
sorted_heap_expand_twohop_rerank(
rel regclass,
seed_ids int4[],
query svec,
top_k int4,
hop1_relation_filter int4 DEFAULT NULL,
hop2_relation_filter int4 DEFAULT NULL,
limit_rows int4 DEFAULT 0
)
RETURNS TABLE (
entity_id int4,
relation_id int2,
target_id int4,
payload text,
distance float8
)
and:
sorted_heap_graph_rag_scan(
rel regclass,
query svec,
ann_k int4,
top_k int4,
relation_filter int4 DEFAULT NULL,
limit_rows int4 DEFAULT 0
)
RETURNS TABLE (
entity_id int4,
relation_id int2,
target_id int4,
payload text,
distance float8
)
Their current contract is intentionally narrow:
- relation must be a sorted_heap table
- relation must expose the columns: entity_id int4, relation_id int2, target_id int4, embedding svec, payload text
- the function reuses the zone-map range builder directly
- it emits fact rows for known source entity IDs
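Given that contract, a one-call invocation would look roughly like this (parameter values are illustrative, not tuned recommendations):

```sql
-- ANN-seed 32 candidates, expand along relation 2, and return the 10
-- nearest expanded facts by exact distance to the query vector $1.
SELECT entity_id, relation_id, target_id, payload, distance
FROM sorted_heap_graph_rag_scan(
       'facts_sh'::regclass,  -- must be a sorted_heap table
       $1,                    -- query svec
       32,                    -- ann_k: ANN seed count
       10,                    -- top_k: final rerank size
       2,                     -- relation_filter
       0                      -- limit_rows: 0 = no cap
     )
ORDER BY distance;
```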
On the medium-pressure benchmark (20K entities, 16 edges/entity, 320K rows, shared_buffers=64MB, fresh backend, runs=3), the helpers produced:
- facts_heap seed_expand_in: 0.123 ms
- facts_sh seed_expand_in: 0.285 ms
- facts_sh seed_expand_fn: 0.165 ms
- facts_sh seed_expand_rerank_in: 0.369 ms
- facts_sh seed_expand_rerank_fn: 0.234 ms
- facts_sh seed_expand_rerank_topk_fn: 0.139 ms
- facts_sh seed_graph_rag_scan_fn: 0.144 ms
Interpretation:
- sorted_heap_expand_ids() converts the observed block-pruning/locality advantage into a real latency win over the current SQL + CustomScan path
- sorted_heap_expand_rerank() removes most of the remaining rerank overhead; the fused top-K variant is now materially faster than the current sorted_heap SQL rerank path (0.139 ms vs 0.369 ms)
- sorted_heap_graph_rag_scan() is only slightly slower than the direct fused helper composition (0.144 ms vs 0.139 ms), so the convenience API does not erase the win
- pure heap+btree expansion is still faster on this synthetic workload (0.123 ms vs 0.165 ms)
Relation-filtered probes narrow that gap further:
- facts_heap seed_expand_rel_in: 0.074 ms
- facts_sh seed_expand_rel_in: 0.151 ms
- facts_sh seed_expand_rel_fn: 0.108 ms
- facts_heap seed_expand_rerank_rel_in: 0.087 ms
- facts_sh seed_expand_rerank_rel_in: 0.167 ms
- facts_sh seed_expand_rerank_rel_topk_fn: 0.104 ms
- facts_sh seed_graph_rag_rel_scan_fn: 0.120 ms
So the relation-filtered GraphRAG path is materially better than the current SQL + CustomScan form, but it still does not clearly beat heap+btree on this synthetic corpus. The filtered helper path is nevertheless close enough that a real fact graph, wider payloads, or colder cache state may flip the comparison.
Payload-width sensitivity does matter, but not monotonically.
The benchmark harness now supports --payload-bytes to widen synthetic fact rows and test the claim that locality should matter more once facts stop being tiny strings. On the same medium-pressure setup (20K entities, degree 16, 320K rows, shared_buffers=64MB, fresh backend):
- with payload_bytes=1024:
  - facts_heap seed_expand_in: 0.188 ms
  - facts_sh seed_expand_in: 0.185 ms
  - facts_heap seed_expand_rerank_rel_in: 0.120 ms
  - facts_sh seed_expand_rerank_rel_topk_fn: 0.100 ms
  - facts_sh seed_graph_rag_rel_scan_fn: 0.125 ms
- with payload_bytes=2048:
  - facts_heap seed_expand_in: 0.113 ms
  - facts_sh seed_expand_in: 0.208 ms
  - facts_heap seed_expand_rerank_rel_in: 0.090 ms
  - facts_sh seed_expand_rerank_rel_topk_fn: 0.122 ms
  - facts_sh seed_graph_rag_rel_scan_fn: 0.127 ms
Interpretation:
- a wider inline payload can make sorted_heap competitive or slightly better on seed expansion
- but the effect is not monotonic, so "wider payload always helps sorted_heap" is false on this synthetic generator
- this synthetic text filler is still a weak proxy for real fact payloads because compression/TOAST behavior can change the balance again
So the next falsifier should be a real-dataset GraphRAG harness or a more realistic payload model, not another synthetic-only extrapolation.
Real-text Gutenberg graph
A better falsifier harness now exists. It uses real Gutenberg paragraphs instead of synthetic payload text and builds a small text graph:
- relation 1: book -> paragraph (contains)
- relation 2: paragraph -> next_paragraph (next)
Embeddings are still deterministic lexical hash vectors, not external model embeddings. That means this harness is good for measuring graph-expansion latency on real text payloads and a real graph topology, but it is not a semantic-quality benchmark.
Two useful runs on shared_buffers=64MB, fresh backend:
64 books x 128 paragraphs/book (14,549 rows):
- facts_heap seed_expand_rerank_rel_in: 0.071 ms
- facts_sh seed_expand_rerank_rel_in: 0.088 ms
- facts_sh seed_expand_rerank_rel_topk_fn: 0.061 ms
- facts_sh seed_graph_rag_rel_scan_fn: 0.084 ms
128 books x 256 paragraphs/book (58,954 rows):
- facts_heap seed_expand_rel_in: 0.073 ms
- facts_sh seed_expand_rel_in: 0.078 ms
- facts_sh seed_expand_rel_fn: 0.069 ms
- facts_heap seed_expand_rerank_rel_in: 0.079 ms
- facts_sh seed_expand_rerank_rel_in: 0.101 ms
- facts_sh seed_expand_rerank_rel_topk_fn: 0.063 ms
- facts_sh seed_graph_rag_rel_scan_fn: 0.089 ms
This is the first non-synthetic result that materially weakens the earlier “heap+btree simply wins” story:
- the plain sorted_heap SQL path is still worse than heap+btree
- but the fused filtered helper on the real-text Gutenberg graph is already at parity or slightly better than heap+btree on the rerank path
- the one-call wrapper is close enough that its overhead is visible but not disqualifying
So the narrow-helper direction survives the real-text falsifier better than the short-payload synthetic benchmark suggested.
pgvector parity on the real-text graph
The Gutenberg harness also now supports a comparable pgvector path on the same graph:
- ANN seeds come from a facts_pgv table with vector(dim) + HNSW
- graph expansion and exact rerank still happen in PostgreSQL over the fact rows, which is the relevant GraphRAG shape
This is important because a pure ANN benchmark would miss the real product question: how expensive is “ANN seed + graph expansion + exact rerank” as one workflow?
On fresh-backend runs with shared_buffers=64MB:
64 books x 128 paragraphs/book (14,549 rows):
- heap rerank baseline: 0.064 ms
- sorted_heap_expand_rerank(... relation=2): 0.060 ms
- sorted_heap_graph_rag_scan(... relation=2): 0.075 ms
- pgvector ANN -> heap expansion -> exact rerank: 0.180 ms
128 books x 256 paragraphs/book (58,954 rows):
- heap rerank baseline: 0.085 ms
- sorted_heap_expand_rerank(... relation=2): 0.071 ms
- sorted_heap_graph_rag_scan(... relation=2): 0.087 ms
- pgvector ANN -> heap expansion -> exact rerank: 0.295 ms
The buffer footprint matches the latency story:
- the sorted_heap helper path stays around hundreds of shared-buffer hits
- the pgvector path needs several thousand shared-buffer hits before the same exact rerank step
This does not mean pgvector is bad at pure ANN. It means that for this GraphRAG workload shape, once the seed stage is followed by relational graph expansion and exact rerank, the narrow sorted_heap helper path is materially better aligned with the whole workflow than an external ANN seed on a separate table.
zvec parity on the real-text graph
The same Gutenberg harness now also supports a comparable zvec path:
- ANN seeds come from a temporary zvec HNSW collection built from the same fact rows
- graph expansion and exact rerank still happen in PostgreSQL over facts_heap
This produced a mixed but useful result.
On the medium real-text slice (64 books x 128 paragraphs/book, 14,549 rows, fresh backend, shared_buffers=64MB):
- heap rerank baseline: 0.068 ms
- sorted_heap_expand_rerank(... relation=2): 0.066 ms
- sorted_heap_graph_rag_scan(... relation=2): 0.082 ms
- zvec ANN -> heap expansion -> exact rerank: 0.322 ms
So on the medium slice, the zvec path is stable but materially slower than the fused sorted_heap helper. The SQL-side buffer footprint is not the bottleneck there; the external ANN seed stage dominates the total latency.
On the larger real-text slice (128 books x 256 paragraphs/book, 58,954 rows), the result is currently not publishable as a clean latency row:
- the sorted_heap helper path remains stable:
  - sorted_heap_expand_rerank(... relation=2): 0.070 ms
  - sorted_heap_graph_rag_scan(... relation=2): 0.084 ms
- the zvec path fails during ANN seed retrieval at ann_k=32
The failure is not coming from PostgreSQL or from the GraphRAG SQL wrapper. A pure zvec-only reproduction on the same 58,954-row lexical-hash corpus shows the same failure mode:
- for one probe query, topk=8 and topk=10 return valid document IDs
- topk>=16 returns empty doc.id values after: Failed to find target chunk for index 58379
The Gutenberg GraphRAG harness now turns that into an explicit benchmark error:
RuntimeError: zvec returned unmapped doc ids (...)
So the objective conclusion today is narrower than for pgvector:
- zvec does not currently provide a robust large-slice GraphRAG parity row on this real-text workflow at ann_k=32
- on the medium slice where it does run, it is materially slower than the fused sorted_heap helper path
- on the larger slice, the current blocker is zvec ANN seed instability, not PostgreSQL expansion/rerank overhead
That instability is now isolated more sharply by a repo-owned reproducer.
Current threshold signature on the lexical-hash Gutenberg corpus:
- topk=16, dim=32
- the 64x256, 80x256, 96x256, 112x256 slices are stable (28,661, 36,064, 43,684, 51,166 rows)
- the 128x256 slice fails (58,954 rows)
  - first bad probe: query #10
  - returned ids are empty strings after: Failed to find target chunk for index 58379
So the current failure signature is not just “large-ish GraphRAG benchmark”. It looks more like a size-thresholded zvec retrieval bug on this corpus shape.
That theory is now falsified by a second repo-owned reproducer on a plain synthetic FP32 corpus:
Current synthetic signature:
- dim=32, ef_search=64
- topk=7 already reproduces the issue
- a compact failing case exists at 4,950 rows
  - nearby controls: 4,900 rows ok, 4,950 rows bad, 5,000 rows bad
  - topk<=6 is clean on the 4,950-row case
- failures are non-monotonic by row count
  - bad: 16,000, 20,000, 28,000, 30,000, 45,000, 60,000
  - ok: 24,000, 29,000, 75,000 (100 probe queries still clean at 75k)
- another local non-monotonic pocket exists around 7k-8k
  - 7,000: ok; 7,500: bad; 7,800: ok; 7,900: bad
- representative stderr lines:
  - Failed to find target chunk for index 4945
  - Failed to find target chunk for index 14999
  - Failed to find target chunk for index 29999
  - Failed to find target chunk for index 59999
So the stronger objective conclusion is:
- the failure is not Gutenberg-specific
- it is not a simple monotonic “too many rows” threshold either
- the current evidence points to a broader zvec retrieval defect around forward-store / chunk lookup, not to PostgreSQL GraphRAG expansion logic
For an upstream-ready summary of the current evidence, see:
Two more diagnostic observations make that conclusion sharper:
- when the synthetic bug triggers, the ANN scores still come back while doc.id is empty for the whole result set
  - 4,950 rows, topk=6: valid ids
  - 4,950 rows, topk=7: same score bands, but every doc.id is ''
- on a larger synthetic case (16,000 rows), exact cosine inspection shows the best-score bucket spans 1000, 2000, ..., 16000, and zvec already returns empty ids at topk=5
That does not prove the internal root cause, but it strongly suggests the ANN ranking stage is still producing plausible scores while the forward-store document lookup stage is failing. A reasonable working hypothesis is that some tied-score / candidate-materialization paths touch unresolved high indexes and poison metadata resolution for the whole returned batch.
Qdrant parity on the real-text graph
The Gutenberg harness now also supports a comparable Qdrant path:
- ANN seeds come from a local Qdrant HNSW collection built from the same fact rows
- graph expansion and exact rerank still happen in PostgreSQL over facts_heap
Unlike zvec, this path stayed stable on both the medium and larger real-text slices. The result is simpler:
64 books x 128 paragraphs/book (14,549 rows):
- heap rerank baseline: 0.074 ms
- sorted_heap_expand_rerank(... relation=2): 0.062 ms
- sorted_heap_graph_rag_scan(... relation=2): 0.083 ms
- Qdrant ANN -> heap expansion -> exact rerank: 1.535 ms
128 books x 256 paragraphs/book (58,954 rows):
- heap rerank baseline: 0.081 ms
- sorted_heap_expand_rerank(... relation=2): 0.083 ms
- sorted_heap_graph_rag_scan(... relation=2): 0.085 ms
- Qdrant ANN -> heap expansion -> exact rerank: 1.769 ms
So on this GraphRAG workflow shape:
- Qdrant is robust on the real-text benchmark
- but its external ANN seed stage dominates end-to-end latency
- the fused sorted_heap helper remains roughly an order of magnitude faster on the rerank path
That again does not mean Qdrant is a bad vector engine in isolation. It means that when the workflow is “external ANN seed + relational graph expansion + exact rerank inside PostgreSQL”, the narrow in-engine helper path is much better aligned with the total job than a remote vector service.
Robustness rerun
The same real-text Gutenberg harness was then rerun with a larger query set (query_count=64, runs=3) to check whether the earlier 16-query results were just small-sample noise.
The ranking stayed the same on both slices:
- medium slice (64 x 128):
  - sorted_heap_expand_rerank(... relation=2): 0.062 ms
  - sorted_heap_graph_rag_scan(... relation=2): 0.081 ms
  - pgvector ANN -> heap expansion -> exact rerank: 0.219 ms
  - zvec ANN -> heap expansion -> exact rerank: 0.342 ms
  - Qdrant ANN -> heap expansion -> exact rerank: 1.567 ms
- larger slice (128 x 256):
  - sorted_heap_expand_rerank(... relation=2): 0.067 ms
  - sorted_heap_graph_rag_scan(... relation=2): 0.088 ms
  - pgvector ANN -> heap expansion -> exact rerank: 0.309 ms
  - Qdrant ANN -> heap expansion -> exact rerank: 1.911 ms
  - zvec remains excluded from this large-slice rerun because the previously observed ann_k=32 instability is still the blocker
So the current GraphRAG conclusion is no longer resting on one short probe set. At least on this real-text Gutenberg workflow, the fused sorted_heap helper still has the best end-to-end latency profile after the query set is expanded.
Two-hop Gutenberg composition
The next adversarial question was whether the current helper story survives a real two-hop workflow, not just the earlier “ANN seeds -> one filtered expansion -> rerank” shape.
The initial Gutenberg falsifier first used a composed path from the existing narrow primitives:
- ANN seeds from the fact table
- first hop via sorted_heap_expand_ids(..., relation=2)
- second hop via sorted_heap_expand_rerank(..., relation=2)
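The composed path above can be sketched as follows (the exact SQL is an assumption reconstructed from the helper signatures earlier in this note; the LIMIT and top_k values are illustrative):

```sql
-- Two-hop composition from the narrow primitives: hop 1 collects
-- intermediate targets via sorted_heap_expand_ids, hop 2 reranks their
-- facts against the query vector $1.
WITH seeds AS (
  SELECT target_id
  FROM facts_sh
  ORDER BY embedding <=> $1
  LIMIT 32
), hop1 AS (
  SELECT DISTINCT target_id
  FROM sorted_heap_expand_ids(
         'facts_sh'::regclass,
         (SELECT array_agg(target_id) FROM seeds),
         2)   -- relation filter for hop 1
)
SELECT *
FROM sorted_heap_expand_rerank(
       'facts_sh'::regclass,
       (SELECT array_agg(target_id) FROM hop1),
       $1,   -- query vector for the exact rerank
       10,   -- top_k
       2);   -- relation filter for hop 2
```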
That composition benchmark was intentionally a harsher test than the earlier one-hop helper story, because it asked whether the current primitives were already enough to make multi-hop GraphRAG plausible before inventing a dedicated two-hop helper.
The answer was “yes, barely enough”. That justified one narrow extra helper, not a new storage engine:
sorted_heap_expand_twohop_rerank(...)
This fused helper keeps the same contract shape as the earlier rerank helper, but removes the intermediate SQL/materialization boundary between hop1 and the second-hop rerank.
On the medium real-text slice (64 books x 128 paragraphs/book, 14,549 rows, 32D, query_count=64, runs=3, fresh backend, shared_buffers=64MB):
- heap baseline, seed_expand2_rerank_rel_in: 0.102 ms
- plain sorted_heap SQL, seed_expand2_rerank_rel_in: 0.136 ms
- helper-composed sorted_heap, seed_expand2_rerank_rel_topk_fn: 0.105 ms
- fused sorted_heap_expand_twohop_rerank(...): 0.081 ms
So on the medium slice, the dedicated helper now does what the composed path only hinted at:
- it beats heap+btree on latency
- it materially beats the composed two-hop helper path
- it also cuts shared-buffer hits strongly (421 vs 1298 for the heap baseline, and 421 vs 662 for the composed helper)
On the larger real-text slice (128 books x 256 paragraphs/book, 58,954 rows, same settings except the larger corpus):
- heap baseline, seed_expand2_rerank_rel_in: 0.114 ms
- plain sorted_heap SQL, seed_expand2_rerank_rel_in: 0.153 ms
- helper-composed sorted_heap, seed_expand2_rerank_rel_topk_fn: 0.111 ms
- fused sorted_heap_expand_twohop_rerank(...): 0.092 ms
So the larger slice confirms the same shape: the dedicated two-hop helper is not a tiny micro-win on one probe set; it keeps the lead over both heap+btree and the composed helper.
The same medium two-hop slice was also benchmarked against the external ANN seed paths:
- pgvector ANN -> heap 2-hop expansion -> exact rerank: 0.253 ms
- zvec ANN -> heap 2-hop expansion -> exact rerank: 0.374 ms
- Qdrant ANN -> heap 2-hop expansion -> exact rerank: 1.789 ms
So the product-level conclusion stays consistent in the two-hop case as well: the narrow in-engine sorted_heap helper remains the fastest end-to-end GraphRAG path among the tested competitors on this real-text slice.
At higher exact-rerank dimension, the advantage narrows again rather than disappearing:
64 books x 128 paragraphs/book, 384D, query_count=64, runs=3:
- heap baseline, seed_expand2_rerank_rel_in: 0.225 ms
- plain sorted_heap SQL, seed_expand2_rerank_rel_in: 0.266 ms
- helper-composed sorted_heap, seed_expand2_rerank_rel_topk_fn: 0.258 ms
- fused sorted_heap_expand_twohop_rerank(...): 0.236 ms
Interpretation:
- the dedicated helper makes two-hop GraphRAG clearly viable on the real-text Gutenberg path
- the latency win is still not universal; at higher dimensions it narrows toward parity with heap+btree
- but the locality signal remains stronger than latency alone suggests (1264 shared hits for the fused helper vs 3155 for the heap baseline on the 384D medium run)
So the correct next inference is narrower than “we need a graph storage engine” and also narrower than “we need a broad graph query layer”:
a dedicated but still narrow two-hop helper is justified; anything broader should now be treated as product/API design, not as a prerequisite for making two-hop GraphRAG fast enough to matter.
Higher-dimension rerun
The same medium Gutenberg slice (64 books x 128 paragraphs/book) was then rerun at higher lexical-hash embedding dimensions to test whether the earlier result depended too heavily on the cheap 32D setting.
At 128D (query_count=64, runs=3):
- heap rerank baseline: 0.107 ms
- sorted_heap_expand_rerank(... relation=2): 0.090 ms
- sorted_heap_graph_rag_scan(... relation=2): 0.097 ms
- pgvector ANN -> heap expansion -> exact rerank: 0.386 ms
- zvec ANN -> heap expansion -> exact rerank: 0.518 ms
- Qdrant ANN -> heap expansion -> exact rerank: 1.732 ms
At 384D on the same slice:
- heap rerank baseline: 0.185 ms
- sorted_heap_expand_rerank(... relation=2): 0.186 ms
- sorted_heap_graph_rag_scan(... relation=2): 0.203 ms
- pgvector ANN -> heap expansion -> exact rerank: 0.815 ms
- zvec ANN -> heap expansion -> exact rerank: 1.101 ms
- Qdrant ANN -> heap expansion -> exact rerank: 2.275 ms
This changes the interpretation in one important way:
- the sorted_heap helper remains clearly best-aligned with the full GraphRAG workflow versus the external ANN paths
- but the win over the pure heap rerank baseline is dimension-sensitive
- by 384D, exact rerank cost dominates enough that the fused helper is only at parity with heap+btree rather than clearly ahead
So the current evidence supports a narrower claim than “sorted_heap always wins GraphRAG”:
the fused sorted_heap helper is the best end-to-end path among the tested in-PG and external ANN competitors on this workflow shape, but its advantage over heap+btree narrows substantially as exact rerank dimension grows
One more tuning falsifier was useful here:
- dropping ann_k from 32 to 24 on the 384D medium slice does reduce latency
- but it is not a free operating-point improvement
- a direct result-set comparison for sorted_heap_graph_rag_scan(...) on the 64-query probe set showed mismatches on 62/64 queries versus ann_k=32
So the current faster-than-ann_k=32 settings should be treated as a quality/latency tradeoff, not as a no-regression default recommendation.
One important measurement caveat was also discovered and fixed during this work:
- a direct filtered ORDER BY embedding <=> $query LIMIT K on a base table with a sorted_hnsw index is not a valid GraphRAG baseline for current Phase 1 semantics
- the automatic sorted_hnsw path is now explicitly costed out when extra base-relation quals are present
- GraphRAG rerank baselines must therefore materialize the expanded set first, then rerank it
This is enough to falsify the pessimistic branch:
the next useful GraphRAG step is not necessarily a new storage engine; a carefully scoped C primitive can already recover a substantial part of the lost latency
Recommended roadmap
Phase 0 — completed
- Build local prototype benchmark
- Falsify naive SQL assumptions
Phase 1 — current
sorted_heap_expand_ids() is implemented and regression-covered.
Phase 2 — current
sorted_heap_expand_rerank() is implemented and regression-covered.
Current success criterion that was met:
- beats the current sorted_heap SQL seed_expand_in / seed_expand_rerank_in patterns at medium scale
Current gap that remains:
- pure heap+btree expansion is still faster on this synthetic benchmark
Phase 3 — next
Add GraphRAG composition query:
- ANN seed in SQL via sorted_hnsw
- expansion via sorted_heap_expand_ids()
- rerank via sorted_heap_expand_rerank() or SQL over the materialized expansion
Phase 4 — current
sorted_heap_graph_rag_scan() is now implemented as the narrow one-call composition wrapper.
Phase 5 — current
sorted_heap_expand_twohop_rerank() is now implemented as the narrow fused two-hop helper.
Current success criterion that was met:
- beats the previous composed two-hop helper on the real-text Gutenberg graph
- beats heap+btree on the medium and larger 32D two-hop slices
Current gap that remains:
- at 384D, the fused two-hop helper narrows to near-parity with heap+btree rather than keeping a clear lead
Phase 6 — next
Only if the current two-hop and one-call wrappers still leave meaningful headroom:
- consider a broader wrapper for:
- ANN seed IDs
- two-hop expansion
- rerank
- or tune candidate count / rerank workload rather than broadening the API
Cogniformerus-style multihop facts
The real missing falsifier was not another paragraph graph slice. It was a benchmark that matches the current cogniformerus multihop question shape:
- fact 1: person -> parent
- fact 2: parent -> city
- query: "Where does the parent of Person_i live?"
That benchmark now exists.
The benchmark builds a deterministic fact graph and measures:
- latency
- hit@1
- hit@k
for the expected final city fact after two-hop expansion and rerank.
Important contract discovery
This benchmark immediately exposed a semantic limitation in the current convenience wrapper:
- sorted_heap_graph_rag_scan() seeds expansion from the ANN target_id
- that is a good fit for the Gutenberg paragraph -> next_paragraph graph
- it is not the right seed contract for the fact benchmark above
- the fact benchmark needs ANN seeds based on entity_id, then:
  - hop 1 on relation 1
  - hop 2 on relation 2
So the current one-call wrapper is still too specialized for this workload shape. The lower-level helper family is fine; the wrapper contract is the narrow part.
That gap is now closed by:
sorted_heap_graph_rag_twohop_scan(...)
This wrapper keeps the fact-shaped contract narrow:
- ANN seed on entity_id
- hop 1 relation filter
- hop 2 relation filter
- final rerank delegated to sorted_heap_expand_twohop_rerank(...)
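A plausible call shape for the fact-shaped wrapper follows; the parameter list is an assumption inferred from the contract bullets above and the one-hop wrapper's signature, not a documented signature:

```sql
-- Hypothetical call shape: seed on entity_id, hop 1 over relation 1
-- (person -> parent), hop 2 over relation 2 (parent -> city), then
-- exact rerank of the final facts against the query vector $1.
SELECT entity_id, relation_id, target_id, payload, distance
FROM sorted_heap_graph_rag_twohop_scan(
       'facts_sh'::regclass,
       $1,   -- query svec
       64,   -- ann_k
       10,   -- top_k
       1,    -- hop 1 relation filter
       2)    -- hop 2 relation filter
ORDER BY distance;
```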
Early failure that mattered
At 32D, the fact benchmark initially produced very poor answer retrieval. That was a benchmark-quality failure, not a helper failure:
- the first draft seeded on target_id, which was the wrong graph contract
- after fixing that, the deterministic query embedding was still too weak at low dimension to make the question reliably retrievable
So the publishable multihop results start at 384D, where the question shape becomes stable enough that latency numbers mean something.
Tuned 384D result
On 5K multihop chains (10K rows total), 64 queries, 3 runs, shared_buffers=64MB, fresh backend, with:
- ann_k=64
- sorted_hnsw.ef_search=64
- ef_construction=200
the current frontier is:
- heap composed two-hop SQL: 0.515 ms, hit@1 = 71.9%, hit@k = 85.9%
- sorted_heap composed two-hop helper: 0.471 ms, hit@1 = 70.3%, hit@k = 82.8%
- sorted_heap_expand_twohop_rerank(): 0.442 ms, hit@1 = 70.3%, hit@k = 82.8%
- sorted_heap_graph_rag_twohop_scan(): 0.417 ms, hit@1 = 71.9%, hit@k = 84.4%
- pgvector: 1.397 ms, hit@1 = 70.3%, hit@k = 87.5%
- zvec: 1.076 ms, hit@1 = 76.6%, hit@k = 96.9%
- Qdrant: 2.921 ms, hit@1 = 76.6%, hit@k = 96.9%
Interpretation:
- the fused two-hop helper is now the fastest PostgreSQL path on this fact-shaped workload
- the new fact-shaped one-call wrapper stays effectively at parity with the fused helper, so this time the convenience API does not erase the win
- it remains materially faster than pgvector on the same workflow
- it is not the quality leader at this operating point
- zvec and Qdrant still win on answer retrieval quality here, but at much higher latency
Seed frontier after the wrapper fix
The next honest question was not API shape but ANN seed quality. That is now measured directly by a dedicated sweep harness.
This harness keeps the corpus fixed per ef_construction and sweeps:
- m
- ann_k
- sorted_hnsw.ef_search
- ef_construction
without paying a full temp-cluster and schema rewrite for every single probe point.
On the same 5K chains / 10K rows / 384D / 64 queries / fresh-backend benchmark, the stable wrapper frontier is now:
- ef_construction=64, ann_k=64, ef_search=64: 0.386 ms, hit@1 = 70.3%, hit@k = 82.8%
- ef_construction=200, ann_k=64, ef_search=64: 0.393 ms, hit@1 = 71.9%, hit@k = 84.4%
- ef_construction=400, ann_k=64, ef_search=64: 0.421 ms, hit@1 = 70.3%, hit@k = 85.9%
- ef_construction=200, ann_k=64, ef_search=128: 0.651 ms, hit@1 = 73.4%, hit@k = 95.3%
- ef_construction=400, ann_k=64, ef_search=128: 0.663 ms, hit@1 = 75.0%, hit@k = 95.3%
For a higher-quality but much slower seed tier:
- ann_k=96, ef_search=64 lands around 2.2-2.4 ms with hit@k = 96.9%
That leads to a narrower, more honest recommendation:
- if latency is the hard constraint, keep the fast tier near ef_construction=200, ann_k=64, ef_search=64
- if answer quality matters more, the best balanced point we measured is ef_construction=200, ann_k=64, ef_search=128
- ef_construction=400 does improve hit@1 slightly at the same 95.3% hit@k, but it does not improve hit@k over 200, so it should not be the default recommendation without a separate build-cost justification
That build-cost justification now exists too on this exact 10K x 384D multihop benchmark:
- ef_construction=64: 43.716 s to build both ANN indexes
- ef_construction=200: 80.046 s
- ef_construction=400: 91.352 s
So the current recommendation is:
- default to ef_construction=200
- treat ef_construction=400 as a niche hit@1 knob, not the new default
m frontier on the same multihop benchmark
The next useful falsifier was whether graph degree buys more than another ef_construction increase.
Keeping:
- ef_construction=200
- ann_k=64
- 64 queries
- 3 runs
- fresh backend
the m sweep came out as:
- m=16, ef_search=64: 0.405 ms, hit@1 = 71.9%, hit@k = 87.5%
- m=24, ef_search=64: 0.466 ms, hit@1 = 75.0%, hit@k = 93.8%
- m=32, ef_search=64: 0.491 ms, hit@1 = 78.1%, hit@k = 93.8%
- m=16, ef_search=128: 0.672 ms, hit@1 = 73.4%, hit@k = 95.3%
- m=24, ef_search=128: 0.738 ms, hit@1 = 75.0%, hit@k = 96.9%
- m=32, ef_search=128: 0.771 ms, hit@1 = 76.6%, hit@k = 96.9%
The one-off build-cost probe for the same 10K x 384D graph was:
- m=16, ef_construction=200: 79.425 s
- m=24, ef_construction=200: 86.562 s
- m=32, ef_construction=200: 75.404 s
That last m=32 build number should be treated cautiously; it was a single one-off probe and is likely noisy enough that only the query-time frontier is trustworthy here.
The stable conclusion is still clear:
- m=24 is the best current quality-per-latency tradeoff we measured
- m=32 buys a little more hit@1, but no additional hit@k
- so for fact-shaped multihop GraphRAG, the best current balanced point is: m=24, ef_construction=200, ann_k=64, sorted_hnsw.ef_search=128
One more ann_k falsifier matters here too:
- increasing ann_k above 64 at this m=24 / ef_construction=200 / ef_search=128 point did not help
- ann_k=80/96/128 all increased latency and reduced hit@k
- so ann_k=64 remains the current sweet spot, not just a legacy default
Full parity rerun at the balanced point
Re-running the full multihop parity benchmark on that exact setting:
- m=24
- ef_construction=200
- ann_k=64
- sorted_hnsw.ef_search=128
- 64 queries
- 3 runs
- 384D
produced:
- heap two-hop SQL: 0.762 ms, hit@1 = 75.0%, hit@k = 96.9%
- sorted_heap_expand_twohop_rerank(): 0.726 ms, hit@1 = 75.0%, hit@k = 96.9%
- sorted_heap_graph_rag_twohop_scan(): 0.727 ms, hit@1 = 75.0%, hit@k = 96.9%
- pgvector: 1.244 ms, hit@1 = 70.3%, hit@k = 85.9%
- zvec: 0.927 ms, hit@1 = 76.6%, hit@k = 96.9%
- Qdrant: 2.417 ms, hit@1 = 76.6%, hit@k = 96.9%
That is a materially stronger result than the earlier m=16 baseline:
- the fused sorted_heap path now matches zvec and Qdrant on hit@k
- it stays faster than both external paths
- it also beats pgvector on both latency and answer quality on this workload
- zvec and Qdrant still keep a small hit@1 edge, so the answer-quality story is now about hit@1, not hit@k
Full parity rerun at the higher-quality point
The next question was whether that remaining hit@1 gap could be closed without giving back the latency lead. Re-running the same full parity benchmark at:
- m=32
- ef_construction=200
- ann_k=64
- sorted_hnsw.ef_search=128
produced:
- heap two-hop SQL: 0.810 ms, hit@1 = 76.6%, hit@k = 96.9%
- sorted_heap_expand_twohop_rerank(): 0.774 ms, hit@1 = 76.6%, hit@k = 96.9%
- sorted_heap_graph_rag_twohop_scan(): 0.786 ms, hit@1 = 76.6%, hit@k = 96.9%
- pgvector: 1.220 ms, hit@1 = 70.3%, hit@k = 84.4%
- zvec: 0.874 ms, hit@1 = 76.6%, hit@k = 96.9%
- Qdrant: 2.487 ms, hit@1 = 76.6%, hit@k = 96.9%
So the current picture is now more precise:
- m=24 is still the better quality-per-latency recommendation
- m=32 is the point where sorted_heap reaches full observed parity with zvec and Qdrant on both hit@1 and hit@k
- even at that higher-quality point, the sorted_heap helper remains faster than both external paths
- pgvector remains behind on both latency and answer quality on this workload
AWS ARM64 parity rerun (5K chains)
The next environment-variance adversary check was to rerun the same 5K-chain / 10K-row / 384D fact benchmark on an AWS ARM64 host (4 vCPU, 8 GiB RAM) using the repo-owned benchmark wrapper.
At the previously recommended local balanced point:
- m=24
- ef_construction=200
- ann_k=64
- sorted_hnsw.ef_search=128
- 64 queries
- 3 runs
- fresh backend
the AWS rerun produced:
- heap two-hop SQL: 1.087 ms, hit@1 = 75.0%, hit@k = 96.9%
- sorted_heap_expand_twohop_rerank(): 0.947 ms, hit@1 = 76.6%, hit@k = 98.4%
- sorted_heap_graph_rag_twohop_scan(): 1.004 ms, hit@1 = 76.6%, hit@k = 98.4%
- pgvector: 1.296 ms, hit@1 = 70.3%, hit@k = 85.9%
- zvec: 1.646 ms, hit@1 = 76.6%, hit@k = 96.9%
- Qdrant: 3.396 ms, hit@1 = 76.6%, hit@k = 96.9%
That is stronger than the local balanced point in one important way:
- on this AWS rerun, sorted_heap does not just match zvec and Qdrant on hit@k; it exceeds them (98.4% vs 96.9%) while staying faster than both
But the second half of the adversary check matters too. Re-running the same AWS benchmark at the local higher-quality point:
- m=32
- ef_construction=200
- ann_k=64
- sorted_hnsw.ef_search=128
produced:
- sorted_heap_graph_rag_twohop_scan(): 1.066 ms, hit@1 = 76.6%, hit@k = 96.9%
So the local m=32 parity story does not carry over unchanged to this AWS ARM64 environment. The portable conclusion is therefore narrower:
- m=24 / ef_construction=200 / ann_k=64 / ef_search=128 is the current best verified cross-environment point
- local and AWS frontiers are directionally consistent, but not numerically identical
- this is exactly why the AWS rerun is worth keeping as a separate falsifier, not merging blindly into the local tuning story
Larger local scale check (10K chains)
The next adversary check was whether the 5K-chain tuning carried forward to a larger local fact graph without retuning.
On 10K chains (20K rows total), 64 queries, 384D, fresh backend:
- m=24, ef_construction=200, ann_k=64, ef_search=128: sorted_heap_graph_rag_twohop_scan() -> 0.885 ms, hit@1 = 71.9%, hit@k = 92.2%
- m=32, ef_construction=200, ann_k=64, ef_search=128: sorted_heap_graph_rag_twohop_scan() -> 0.972 ms, hit@1 = 73.4%, hit@k = 93.8%
So the 5K-chain operating point does not generalize unchanged.
The next narrow falsifier was whether this larger-graph drop was just a search beam issue. Sweeping ef_search upward at m=32 gave:
- ef_search=192: 1.310 ms, hit@1 = 76.6%, hit@k = 95.3%
- ef_search=256: 1.734 ms, hit@1 = 78.1%, hit@k = 95.3%
That is a useful but incomplete recovery:
- higher ef_search does recover part of the quality loss
- it does not recover the earlier 96.9% hit@k local point
- so the larger-graph gap is not purely a beam-width problem
The next falsifier after that was stronger graph construction. On the same 10K-chain graph, keeping m=32, ann_k=64, and comparing ef_construction=200 vs 400 gave:
- at ef_search=128: ef_construction=200 -> 0.976 ms, hit@1 = 75.0%, hit@k = 93.8%; ef_construction=400 -> 1.094 ms, hit@1 = 75.0%, hit@k = 93.8%
- at ef_search=192: ef_construction=200 -> 1.357 ms, hit@1 = 76.6%, hit@k = 95.3%; ef_construction=400 -> 1.381 ms, hit@1 = 76.6%, hit@k = 95.3%
So this larger-graph gap is not fixed by a simple ef_construction=400 bump either.
The current best explanation is therefore narrower:
- the verified 5K-chain local frontier is real
- the same operating points do not carry forward unchanged to 10K chains
- and the obvious local rescue knobs (ef_search, ef_construction) only recover part of the drop
That is enough to stop local knob-turning for this pass. The next useful step would be a different class of experiment, not more of the same sweep.
The next adversary check after that was whether this larger-graph caveat was just a local-machine artifact. Re-running the 10K-chain benchmark on the same AWS ARM64 host (4 vCPU, 8 GiB RAM) showed that it is not.
At the same balanced portable point:
- m=24
- ef_construction=200
- ann_k=64
- sorted_hnsw.ef_search=128
the AWS rerun produced:
- heap two-hop SQL: 1.389 ms, hit@1 = 71.9%, hit@k = 92.2%
- sorted_heap_expand_twohop_rerank(): 1.190 ms, hit@1 = 71.9%, hit@k = 92.2%
- sorted_heap_graph_rag_twohop_scan(): 1.248 ms, hit@1 = 71.9%, hit@k = 92.2%
That essentially matches the larger local result. So the 10K-chain drop is cross-environment robust, not just a local Apple/M-series artifact.
The one meaningful local rescue point transferred cleanly to AWS too. Re-running the 10K-chain benchmark at:
- m=32
- ef_construction=200
- ann_k=64
- sorted_hnsw.ef_search=192
produced:
- heap two-hop SQL: 1.896 ms, hit@1 = 76.6%, hit@k = 95.3%
- sorted_heap_expand_twohop_rerank(): 1.617 ms, hit@1 = 76.6%, hit@k = 95.3%
- sorted_heap_graph_rag_twohop_scan(): 1.687 ms, hit@1 = 76.6%, hit@k = 95.3%
So the larger-scale picture is now materially stronger:
- the 10K-chain quality drop is cross-environment robust
- the best current larger-graph recovery point is also cross-environment robust: m=32 / ef_search=192
- but even that recovery point does not restore the earlier 5K-chain 98.4% hit@k AWS frontier
- so the remaining gap is unlikely to be solved by another trivial ef_search or m tweak alone
Exact-seed upper-bound diagnostic
The next root-cause check was to remove ANN approximation from the seed stage entirely. The multihop harness now supports an --exact-seed-diagnostics mode, which replaces ANN seed retrieval with exact brute-force top-K seeds on facts_heap, then reuses the same graph expansion/rerank path.
This matters because it separates two very different explanations:
- “the remaining gap is caused by approximate ANN seeds”
- “the remaining gap is already in the benchmark/query/task shape”
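To make the distinction concrete, here is a minimal sketch of the exact-seed side of the diagnostic: brute-force top-K by embedding distance with no index at all (assuming cosine distance; the function name and row shape are illustrative, not the harness flags):

```python
import heapq
import math

def cosine_distance(a, b):
    # 1 - cosine similarity; smaller means closer
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def exact_topk_seeds(query_vec, fact_rows, k):
    """Brute-force top-K seed facts by embedding distance.

    fact_rows: iterable of (entity_id, embedding) pairs.
    This is the diagnostic upper bound: no ANN approximation anywhere,
    so any remaining quality gap cannot be blamed on the seed index.
    """
    return heapq.nsmallest(
        k, fact_rows, key=lambda row: cosine_distance(query_vec, row[1])
    )
```

If this exact variant and the ANN variant return the same seeds but the final answers still miss, the loss is downstream of seed retrieval, which is exactly what the measurements below show.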
On the 5K-chain balanced local point:
- m=24
- ef_construction=200
- ann_k=64
- sorted_hnsw.ef_search=128
the exact-seed diagnostic did not improve quality:
- ANN-seeded sorted_heap_expand_twohop_rerank(): 0.702 ms, hit@1 = 75.0%, hit@k = 96.9%
- exact-seeded sorted_heap_expand_twohop_rerank(): 0.811 ms, hit@1 = 75.0%, hit@k = 96.9%
And the seed-stage diagnostic showed no hidden ANN loss there either:
- ANN seeds: seed_person_pct = 98.4%, expanded_city_pct = 98.4%, avg_person_rank = 1.00, city_rank_p95 = 6, city_rank_max = 17
- exact seeds: seed_person_pct = 98.4%, expanded_city_pct = 98.4%, avg_person_rank = 1.00, city_rank_p95 = 6, city_rank_max = 17
So even at 5K, the final 96.9% hit@k is already below seed coverage. But the rerank distribution is still concentrated: the correct city stays within the top 6 for 95% of reachable queries, and the miss comes from a small number of sharper outliers.
On the 10K-chain balanced local point:
- m=24
- ef_construction=200
- ann_k=64
- sorted_hnsw.ef_search=128
the exact-seed diagnostic again did not improve quality:
- ANN-seeded sorted_heap_expand_twohop_rerank(): 0.839 ms, hit@1 = 71.9%, hit@k = 92.2%
- exact-seeded sorted_heap_expand_twohop_rerank(): 0.947 ms, hit@1 = 71.9%, hit@k = 92.2%
The seed-stage diagnostic was even more revealing on 10K:
- ANN seeds: seed_person_pct = 96.9%, expanded_city_pct = 96.9%, avg_person_rank = 1.00, city_rank_p95 = 3, city_rank_max = 20
- exact seeds: seed_person_pct = 96.9%, expanded_city_pct = 96.9%, avg_person_rank = 1.00, city_rank_p95 = 3, city_rank_max = 19
So the larger-graph gap is not coming from missing the correct seed fact. At 10K, seed coverage stays at 96.9%, but final hit@k drops to 92.2%. And it is not a broad rerank collapse either: for 95% of reachable queries the correct city still ranks in the top 3, but a few outliers fall as far as rank 19-20, which is enough to miss top_k = 10.
This is a strong falsifier:
- on this synthetic fact benchmark, the current 5K and 10K frontiers are not ANN-approximation limited at the tested operating points
- ANN and exact seeds have identical seed coverage on both scales
- the remaining gap is mostly an outlier-ranking problem, not a broad seed or rerank failure
- exact seeds cost extra latency but do not recover answer quality
- so the next meaningful gain is unlikely to come from more seed-ANN tuning alone
The remaining gap now looks more like a property of the task construction, query embedding, or graph benchmark semantics than of sorted_hnsw approximation itself. More specifically: the dominant remaining loss now looks downstream of seed retrieval, not inside it, and it is concentrated in a small set of bad cases rather than a general degradation across the query set.
So the honest story on this fact benchmark is a latency/quality frontier:
- sorted_heap_expand_twohop_rerank() leads on latency
- zvec and Qdrant still lead on answer quality
Path-aware rerank diagnostic
The next falsifier was to keep the same ANN seeds and the same two-hop expansion, but change only the final scorer. The current multihop helper reranks on the hop-2 city fact embedding alone. A path-aware SQL baseline was added to the harness that scores each candidate as:
path_distance = (hop1_embedding <=> query) + (hop2_embedding <=> query)
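A minimal Python sketch of the two scoring contracts, showing why the change matters (assuming a generic distance function; names and candidate row shapes are illustrative, not the helper's C API):

```python
def city_only_score(query, hop2_embedding, distance):
    # original contract: rerank on the hop-2 city fact embedding alone
    return distance(query, hop2_embedding)

def path_aware_score(query, hop1_embedding, hop2_embedding, distance):
    # path-aware contract: hop-1 and hop-2 evidence both count,
    # mirroring path_distance = d(hop1, q) + d(hop2, q)
    return distance(query, hop1_embedding) + distance(query, hop2_embedding)

def rerank(candidates, query, distance, top_k):
    """candidates: (target_id, hop1_embedding, hop2_embedding) tuples."""
    return sorted(
        candidates,
        key=lambda c: path_aware_score(query, c[1], c[2], distance),
    )[:top_k]
```

The failure mode this fixes: two candidates can share an identical (or near-identical) hop-2 city embedding and tie under the city-only scorer, while only one of them is reached through a hop-1 fact that actually matches the query.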
That simple change materially improved answer quality on the same balanced points:
- 5K chains, m=24, ef_construction=200, ann_k=64, sorted_hnsw.ef_search=128
  - city-only sorted_heap_graph_rag_twohop_scan(): 0.762 ms, hit@1 = 75.0%, hit@k = 96.9%
  - path-aware SQL rerank on facts_sh: 0.957 ms, hit@1 = 98.4%, hit@k = 98.4%
- 10K chains, same knobs
  - city-only sorted_heap_graph_rag_twohop_scan(): 0.937 ms, hit@1 = 71.9%, hit@k = 92.2%
  - path-aware SQL rerank on facts_sh: 1.179 ms, hit@1 = 95.3%, hit@k = 96.9%
This is the strongest current architectural signal on the fact-shaped benchmark:
- the remaining quality gap is not well explained by seed recall
- it is also not well explained by broad rerank collapse
- a simple path-aware scorer recovers most of the lost quality with only a modest latency increase
That branch is now implemented locally too:
- sorted_heap_expand_twohop_path_rerank(...)
- sorted_heap_graph_rag_twohop_path_scan(...)
And the fused helper beats the SQL path-aware baseline on the same balanced points:
- 5K chains
  - SQL path-aware baseline: 0.847 ms, hit@1 = 98.4%, hit@k = 98.4%
  - fused helper: 0.726 ms, hit@1 = 98.4%, hit@k = 98.4%
  - one-call wrapper: 0.739 ms, hit@1 = 98.4%, hit@k = 98.4%
- 10K chains
  - SQL path-aware baseline: 0.942 ms, hit@1 = 95.3%, hit@k = 96.9%
  - fused helper: 0.823 ms, hit@1 = 95.3%, hit@k = 96.9%
  - one-call wrapper: 0.834 ms, hit@1 = 95.3%, hit@k = 96.9%
So for multihop fact retrieval, the next serious question is no longer whether path-aware rerank helps. It does. The next question is whether this new helper/wrapper transfers cleanly to AWS and then to a real cogniformerus-like corpus.
That AWS transfer is now verified too. On AWS ARM64 (4 vCPU, 8 GiB RAM), at the same balanced m=24 / ef_construction=200 / ann_k=64 / ef_search=128 point:
- 5K chains
  - heap two-hop SQL: 1.088 ms, hit@1 = 75.0%, hit@k = 96.9%
  - city-only wrapper: 1.012 ms, hit@1 = 75.0%, hit@k = 96.9%
  - SQL path-aware baseline: 1.204 ms, hit@1 = 98.4%, hit@k = 98.4%
  - fused helper: 0.955 ms, hit@1 = 98.4%, hit@k = 98.4%
  - one-call path-aware wrapper: 1.018 ms, hit@1 = 98.4%, hit@k = 98.4%
  - pgvector + heap expansion, same path-aware scorer: 1.422 ms, hit@1 = 85.9%, hit@k = 85.9%
  - zvec + heap expansion, same path-aware scorer: 1.720 ms, hit@1 = 100.0%, hit@k = 100.0%
  - Qdrant + heap expansion, same path-aware scorer: 3.435 ms, hit@1 = 100.0%, hit@k = 100.0%
- 10K chains, same knobs
  - heap two-hop SQL: 1.319 ms, hit@1 = 71.9%, hit@k = 92.2%
  - city-only wrapper: 1.197 ms, hit@1 = 73.4%, hit@k = 93.8%
  - SQL path-aware baseline: 1.436 ms, hit@1 = 96.9%, hit@k = 98.4%
  - fused helper: 1.185 ms, hit@1 = 96.9%, hit@k = 98.4%
  - one-call path-aware wrapper: 1.212 ms, hit@1 = 96.9%, hit@k = 98.4%
So the answer to the transfer question is now yes: the path-aware helper and wrapper survive the AWS move cleanly, and the old larger-scale caveat narrows substantially once the rerank contract is fixed.
This also closes the earlier apples-to-apples gap. Once all engines are scored under the same path-aware contract:
- sorted_heap is the latency leader
- zvec and Qdrant hold the strongest observed answer quality
- pgvector remains behind on both latency and quality at this operating point
One AWS all-engines rerun briefly dropped the sorted_heap path-aware rows to 96.9% / 96.9%, but an immediate sorted_heap-only control and a second full rerun both returned 98.4% / 98.4%. So the portable parity story now has one verified outlier plus two confirming reruns. That was enough to justify the benchmark note, and it directly motivated the repeated-build protocol recorded below.
Repeated-build local variance
A repeated-build harness wraps scripts/bench_graph_rag_multihop.py so each repeat gets a fresh temp cluster and a fresh HNSW build, then reports median / min / max for selected rows.
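The per-row aggregation it reports can be sketched as (illustrative, not the wrapper's actual code):

```python
import statistics

def summarize_repeats(p50_ms_per_build):
    """Collapse one benchmark row's per-rebuild p50 latencies into the
    median / range form reported in this note. Each element comes from
    an independent temp cluster and a fresh HNSW build."""
    return {
        "median": statistics.median(p50_ms_per_build),
        "min": min(p50_ms_per_build),
        "max": max(p50_ms_per_build),
    }
```

Reporting the median across fresh builds (rather than one build's number) is what separates "lucky single build" from a stable operating point below.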
On the balanced local 5K point (m=24 / ef_construction=200 / ann_k=64 / ef_search=128), three independent rebuilds produced:
- sorted_heap_expand_twohop_path_rerank(): p50_ms median 0.798, range 0.771-0.819; hit@1 = 98.4%, hit@k = 98.4% on all three builds
- sorted_heap_graph_rag_twohop_path_scan(): p50_ms median 0.796, range 0.778-0.804; hit@1 = 98.4%, hit@k = 98.4% on all three builds
- pgvector path-aware parity row: p50_ms median 1.405, range 1.318-1.456; hit@1/hit@k: 85.9-89.1%
- zvec path-aware parity row: p50_ms median 1.076, range 1.053-1.087; hit@1 = 100.0%, hit@k = 100.0% on all three builds
- Qdrant path-aware parity row: p50_ms median 2.799, range 2.792-2.805; hit@1 = 100.0%, hit@k = 100.0% on all three builds
So the balanced local path-aware sorted_heap point is not just a lucky single build. The answer quality stayed fixed across rebuilds, and the latency spread was narrow. The remaining variance story now looks more like:
- local balanced sorted_heap: stable across rebuilds
- AWS balanced sorted_heap: also stable across repeated builds on the 5K point, with one earlier outlier now downgraded to an anomaly
- pgvector: measurable quality drift across local rebuilds
- zvec / Qdrant: stable on this deterministic local fact graph
The AWS repeated-build protocol on the balanced 5K point produced:
- sorted_heap_expand_twohop_path_rerank(): p50_ms median 0.962, range 0.956-0.965; hit@1 = 98.4%, hit@k = 98.4% on all three builds
- sorted_heap_graph_rag_twohop_path_scan(): p50_ms median 1.025, range 1.018-1.043; hit@1 = 98.4%, hit@k = 98.4% on all three builds
- pgvector path-aware parity row: p50_ms median 1.434, range 1.370-1.493; hit@1/hit@k: 84.4-89.1%
- zvec path-aware parity row: p50_ms median 1.711, range 1.703-1.768; hit@1 = 100.0%, hit@k = 100.0% on all three builds
- Qdrant path-aware parity row: p50_ms median 3.355, range 3.302-3.465; hit@1 = 100.0%, hit@k = 100.0% on all three builds
So the current confidence picture is stronger than before:
- local balanced 5K: repeated-build stable
- AWS balanced 5K: repeated-build stable
- larger 10K AWS path-aware rows: repeated-build stable too, but at a lower quality frontier than 5K
The AWS repeated-build protocol on the larger 10K point produced:
- sorted_heap_expand_twohop_path_rerank(): p50_ms median 1.177, range 1.148-1.191; hit@1 = 95.3%, hit@k = 96.9% on all three builds
- sorted_heap_graph_rag_twohop_path_scan(): p50_ms median 1.236, range 1.211-1.240; hit@1 = 95.3%, hit@k = 96.9% on all three builds
- pgvector path-aware parity row: p50_ms median 1.667, range 1.665-1.676; hit@1/hit@k: 76.6-82.8%
- zvec path-aware parity row: p50_ms median 2.788, range 2.762-2.789; hit@1 = 98.4%, hit@k = 100.0% on all three builds
- Qdrant path-aware parity row: p50_ms median 3.818, range 3.788-3.846; hit@1 = 98.4%, hit@k = 100.0% on all three builds
This sharpens the conclusion again:
- the 10K AWS point is no longer a variance question; it is a real scale frontier
- sorted_heap remains the latency leader there
- zvec and Qdrant still lead on answer quality
This also falsifies one tempting but wrong simplification:
once the helper is fast, the remaining GraphRAG problem is solved
Not quite. On fact-shaped multihop queries, seed ANN quality and graph construction quality still matter enough that ann_k, ef_search, and the graph build parameters remain first-class tuning knobs. But the old hop-2-only rerank contract was a separate, larger problem, and the new path-aware helper fixes most of it on the current local benchmark.
Current verdict
sorted_heap already has a plausible GraphRAG foundation, and the new helper proves that a narrow C primitive can materially improve the GraphRAG path.
What is now true:
- SQL-only GraphRAG composition was not enough
- sorted_heap_expand_ids() is enough to recover a large part of that gap
- sorted_heap_expand_rerank() recovers most of the rerank overhead on the current sorted_heap path
- sorted_heap_graph_rag_scan() makes the composition available as a single SQL call without giving back much latency
- sorted_heap_expand_twohop_rerank() turns the earlier two-hop composition evidence into a real latency win on the real-text Gutenberg slices we tested
- on the cogniformerus-style person -> parent -> city benchmark, the fused two-hop helper is the fastest PostgreSQL path we tested
- sorted_heap_graph_rag_twohop_scan() closes the current fact-shaped wrapper gap without materially giving back latency
- sorted_heap_expand_twohop_path_rerank() upgrades the fact-shaped rerank contract to use hop-1 and hop-2 evidence together
- sorted_heap_graph_rag_twohop_path_scan() makes that path-aware contract available as a single-call primitive
- the path-aware helper and wrapper transfer cleanly from local to AWS ARM64 on the same balanced m=24 / ef_construction=200 / ann_k=64 / ef_search=128 point
- the narrow-helper direction is a justified building block
- the current helper model already composes into a competitive two-hop real-text GraphRAG path on Gutenberg without requiring a new graph API
- on the real-text GraphRAG shape, pgvector parity is already materially worse end-to-end than the fused sorted_heap helper path
- on the fact-shaped AWS path-aware benchmark, sorted_heap is now the fastest verified end-to-end path, while zvec and Qdrant remain the answer quality leaders
- zvec is stable on the medium slice but currently not robust on the larger real-text slice at ann_k=32
- Qdrant is robust on both real-text slices but materially slower than the fused sorted_heap helper on the same workflow
What is not yet true:
- sorted_heap is not yet clearly better than heap+btree on pure expansion latency for this synthetic workload
- even the relation-filtered GraphRAG path still trails heap+btree slightly on this synthetic benchmark
- two-hop helper composition is not yet a universal latency win; at higher rerank dimensions it narrows to parity with heap+btree rather than staying clearly ahead
- the current benchmark suite is still deterministic/synthetic rather than a real cogniformerus corpus, so the remaining generalization gap is about workload realism more than about build variance
- transfer to a larger real cogniformerus corpus is still unverified; the current fact-shaped benchmark is deterministic and synthetic even though it matches the intended multihop query shape
Actual Butler gate seed-corpus smoke
The next honest step after the synthetic-chain work was to stop guessing and run the path-aware GraphRAG helpers on the actual tiny multihop corpus that cogniformerus already ships in its Butler gate smoke:
- source: cogniformerus/bin/butler_small_model_eval.cr
- repo-owned fixture: scripts/fixtures/graph_rag_butler_gate_seed.json
- harness: scripts/bench_graph_rag_butler_gate.py
This fixture is intentionally tiny:
- 7 graph facts loaded into facts_heap / facts_sh
- 2 positive multihop queries
  - Project Atlas -> Orion -> Helsinki
  - Release 13 -> Aurora -> April
So it is not a publishable latency frontier. Its job is narrower:
- verify that the current path-aware helper and wrapper work on the real Butler gate fact texts and prompts
- replace the previous blanket statement “real cogniformerus still unverified” with a tighter one: the actual gate seed corpus is covered, but larger real corpora are not
The first local smoke run on this real gate seed corpus used:
- 384D
- ann_k=4
- top_k=4
- m=24
- ef_construction=200
- sorted_hnsw.ef_search=64
- 5 timing runs on a fresh temp cluster
Result:
- heap path-aware SQL baseline: p50 0.027 ms, hit@1/hit@k = 100/100
- facts_sh path-aware SQL baseline: p50 0.026 ms, hit@1/hit@k = 100/100
- sorted_heap_expand_twohop_path_rerank(): p50 0.017 ms, hit@1/hit@k = 100/100
- sorted_heap_graph_rag_twohop_path_scan(): p50 0.045 ms, hit@1/hit@k = 100/100
This does not prove scale behavior. It proves something narrower and still useful: the current path-aware GraphRAG helper/wrapper contract works on the actual Butler gate seed facts and prompts, not only on the synthetic person -> parent -> city generator.
One adversary control also mattered here: this was not only a pass at a near-full seed budget. Re-running the same smoke at ann_k=2, top_k=2 still kept both multihop queries at 100/100.
The correct next step is therefore:
tune the current narrow helper family before considering a bigger graph-specific subsystem
That remains the smallest change that can still convert the observed block-pruning advantage into an end-to-end query win.
Real code-corpus prototype
The next honest check after the Butler gate fact smoke was not another synthetic graph. It was the actual cogniformerus code corpus plus the real cross-file question bank already used by Butler’s own code benchmark.
- source tree: cogniformerus/src/cogniformerus
- question source: cogniformerus/bin/butler_code_test.cr
- harness: scripts/bench_graph_rag_code_corpus.py
This harness builds a narrow code-GraphRAG shape:
- each source file is one entity
- each chunk in that file becomes one fact row
  - entity_id = file_id
  - relation_id = HAS_CHUNK
  - target_id = chunk_id
- query quality is scored against the real CrossFile benchmark keywords from butler_code_test.cr
This is not a full code graph. It is a bounded falsifier for a simpler claim:
if GraphRAG-style seeded expansion is already useful on a real corpus, it should show up even on the natural file -> chunk expansion shape
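For reference, the keyword coverage and full-keyword hit percentages reported below can be read via a sketch like this (an assumed reconstruction of the scoring contract, not the harness code):

```python
def score_question(retrieved_chunks, gold_keywords):
    """Score one question against its gold CrossFile keyword list.

    coverage: fraction of gold keywords found anywhere in the retrieved text.
    full_hit: True only when every gold keyword is covered.
    """
    text = "\n".join(retrieved_chunks).lower()
    found = [kw for kw in gold_keywords if kw.lower() in text]
    coverage = len(found) / len(gold_keywords)
    return coverage, coverage == 1.0

def score_corpus(per_question):
    """per_question: list of (coverage, full_hit) pairs -> percentage form."""
    n = len(per_question)
    coverage_pct = 100.0 * sum(c for c, _ in per_question) / n
    full_pct = 100.0 * sum(1 for _, f in per_question if f) / n
    return coverage_pct, full_pct
```

With only 6 questions, full-hit percentages can only move in steps of about 16.7 points, which explains the coarse 33.3% / 50.0% / 66.7% values in the results.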
The first stable local point used:
- 40 files
- 747 chunk rows
- 6 real CrossFile questions
- 384D
- ann_k=16
- top_k=4
- m=24
- ef_construction=200
- sorted_hnsw.ef_search=64
- shared_buffers=64MB
- fresh backend
- 3 timing runs
Result:
- direct ANN over raw chunks:
  - heap: p50 0.740 ms
  - sorted_heap: p50 0.712 ms
  - keyword coverage: 63.3%
  - full-keyword hits: 33.3%
- file-seeded SQL expansion:
  - heap: p50 0.516 ms
  - sorted_heap: p50 0.468 ms
  - same 63.3% keyword coverage
  - same 33.3% full-keyword hits
- sorted_heap_expand_rerank() helper:
  - p50 0.665 ms
  - same 63.3% keyword coverage
  - same 33.3% full-keyword hits
The important conclusion is narrow but real:
- the real code corpus branch is now reproducible inside this repository
- seeded expansion by file preserves answer-support quality on the real CrossFile question set
- on this code corpus, the current gain is latency, not answer quality
- the helper is not yet the latency leader on this tiny real corpus; the simple SQL expansion shape still wins locally
This means the next code-corpus GraphRAG step is not “invent a bigger graph API”. It is either:
- a richer real code-graph relation hypothesis than plain file -> chunk, or
- a lower-overhead helper path for this very simple expansion contract
Real require-graph falsifier
The obvious next hypothesis was that plain file -> chunk was too weak, and that the real local code graph should help once actual require edges were present.
That hypothesis is now tested in the same harness:
- 53 local require edges derived from the real cogniformerus source tree
- relation REQUIRES_FILE
- two new query shapes:
  - seed_require_twohop_*
  - seed_file_plus_require_in
Stable local result on the same 40-file / 800-row / 6-question point, 3 runs:
- plain file-seeded expansion: sorted_heap 0.471 ms, keyword coverage 63.3%, full hits 33.3%
- file plus required files: sorted_heap 0.605 ms, same 63.3% keyword coverage, same 33.3% full hits
- dependency-only two-hop: sorted_heap 0.391 ms, keyword coverage 20.0%, full hits 0.0%
So the richer real relation hypothesis is currently refuted on this code corpus:
- adding dependency files does not improve answer-support quality
- dependency-only traversal is actively worse because it drops own-file context
- unioning own files with required files only adds cost, not quality
This is a useful stopping point. The next likely win for real code-GraphRAG is not “just add more code edges”. It is a different retrieval contract or a lower-overhead helper path on the already-good file-seeded shape.
File-summary seed falsifier
The next retrieval-contract hypothesis was also tested locally on the same real code corpus:
- add one synthetic-but-data-derived summary row per file
- seed on those summary rows
- then expand back to the file’s chunk rows
The goal was to test whether the missing factor was simply that chunk-level ANN was a poor way to choose files.
That also failed to improve answer-support quality.
Stable smoke result on the same 40-file / 840-row / 6-question point:
- summary-seeded expansion:
  - heap: 0.587 ms
  - sorted_heap: 0.564 ms
  - keyword coverage: 63.3%
  - full hits: 33.3%
So the current real code-corpus plateau is now bounded more tightly:
- plain file-seeded expansion: same quality, lower latency
- file summaries: same quality, higher latency
- require edges: no quality gain
- require-only traversal: quality regression
That strongly suggests the next code-corpus GraphRAG branch should not be “more local graph structure” or “better file seeds” in the same lexical setup. The remaining frontier is more likely one of:
- a different quality metric / question contract,
- better embeddings,
- or a lower-overhead execution path on the already-best file-seeded shape.
Oracle-seed and oracle-rerank diagnostic
The next adversary question was sharper:
is the plateau really about bad file seeds, or is it already downstream in the rerank / evaluation contract?
The harness now includes two explicit oracle diagnostics on the same real code corpus:
- oracle file seeds
- choose seed files by benchmark-keyword overlap against the full file text
- this is not a deployable retrieval contract; it is a diagnostic ceiling
- prompt-derived lexical rerank
- keep the same ANN-derived file seeds
- rerank by lexical overlap with terms extracted from the actual user prompt
- this is deployable in principle, but much weaker than the oracle signal
- oracle keyword rerank
- keep the same ANN-derived file seeds
- rerank the expanded chunk rows by direct overlap with the benchmark’s gold CrossFile keywords before falling back to embedding distance
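The oracle keyword rerank can be sketched as a two-key sort: gold-keyword overlap first, embedding distance as the tie-breaker (an illustrative reconstruction, not the harness code):

```python
def oracle_keyword_rerank(candidates, gold_keywords, top_k):
    """Rerank expanded chunk rows by gold-keyword overlap, falling back to
    embedding distance on ties.

    candidates: (chunk_text, embedding_distance) pairs.
    This is a diagnostic ceiling, not a deployable contract: gold_keywords
    is the same signal the benchmark later scores against.
    """
    def key(cand):
        text, distance = cand
        lowered = text.lower()
        overlap = sum(1 for kw in gold_keywords if kw.lower() in lowered)
        # negative overlap so higher overlap sorts first; distance breaks ties
        return (-overlap, distance)

    return sorted(candidates, key=key)[:top_k]
```

The point of the diagnostic is the ceiling it exposes: if even this leaky scorer jumps quality sharply, the candidate pool already contains the evidence and the loss is in the scoring contract.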
Stable local result, 3 runs, same 40-file / 840-row / 6-question point:
- plain file-seeded expansion: sorted_heap 0.443 ms, keyword coverage 63.3%, full hits 33.3%
- oracle file seeds: sorted_heap 0.416 ms, same 63.3% keyword coverage, same 33.3% full hits
- prompt-derived lexical rerank: sorted_heap 3.005 ms, same 63.3% keyword coverage, worse 16.7% full hits
- oracle keyword rerank: heap 2.905 ms, sorted_heap 2.944 ms, keyword coverage 90.0%, full hits 66.7%
This is a useful but narrow falsifier:
- the plateau is not explained by weak file seeds alone
- richer local graph structure also did not explain it
- a simple prompt-term rerank at top_k=4 also did not explain it
- but once the rerank contract is allowed to use the benchmark's own gold keywords, quality jumps sharply
That does not justify a product claim, because the oracle rerank is using the same keyword signal that the benchmark later scores. It does justify a more targeted next hypothesis:
the remaining quality frontier on the real code corpus is more likely in the query/rerank contract or embedding space than in local graph topology or seed selection
Result-budget and packing diagnostic
The broad “cheap lexical hybrid does not help” claim turned out to be too strong once the same real code-corpus harness was rerun at larger result budgets.
Bounded local sweep, same 40-file / 840-row / 6-question corpus, ann_k=16, 3 runs:
- plain file-seeded sorted_heap expansion:
  - top_k=4: 0.402 ms, 63.3% keyword coverage, 33.3% full hits
  - top_k=8: 0.460 ms, 68.1% keyword coverage, same 33.3% full hits
  - top_k=16: 0.469 ms, 84.3% keyword coverage, same 33.3% full hits
  - top_k=32: 0.449 ms, 94.3% keyword coverage, 66.7% full hits
- prompt-derived lexical rerank:
  - top_k=4: 3.005 ms, 63.3%, 16.7%
  - top_k=8: 3.176 ms, 86.7%, 50.0%
  - top_k=12: 3.149 ms, 90.0%, 66.7%
  - top_k=32: 3.147 ms, 96.7%, 83.3%
So the real code-corpus plateau is not just a seed-quality problem. It is also partly a result-budget / packing problem:
- with more rows, even the plain file-seeded path recovers much more keyword coverage
- prompt-derived lexical rerank starts to help only once the row budget is not extremely tight
That makes the next bounded hypothesis more specific:
the remaining small-`top_k` gap is likely about how evidence is packed into a tiny chunk budget, not about choosing better files
One more diagnostic supports that narrower claim. On the original top_k=4 point, a diversity-aware prompt-term rerank was also tested:
- `sorted_heap` prompt-diverse rerank: `3.229 ms`
- `76.7%` keyword coverage
- still only `33.3%` full hits
That is a partial gain in coverage, but still not the qualitative jump needed to make the current small-budget contract compelling.
Code-aware embedding diagnostic
The next bounded hypothesis was exactly what the code corpus suggests:
maybe the remaining gap is not just about rerank logic, but about the fact that the current harness still uses a Gutenberg-style lexical tokenizer that does not understand `CamelCase` or `_snake_case` identifiers well
The harness now supports two embedding modes:
- `generic`: the existing lexical hash over generic text tokens
- `code_aware`: keeps the full code token, but also splits identifiers on `_` and `CamelCase` boundaries before hashing
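The `code_aware` split rule can be sketched as follows; the function name and the exact fragment regex are illustrative assumptions, not the harness implementation:

```python
import re

# Hypothetical sketch of the code_aware tokenization rule: keep the full
# identifier as one token, but also emit its snake_case and CamelCase
# fragments before hashing. Names and regexes here are illustrative.

def code_aware_tokens(text):
    tokens = []
    for raw in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        tokens.append(raw.lower())                 # full token is preserved
        parts = [p for p in raw.split("_") if p]   # split on _
        frags = []
        for p in parts:
            # split CamelCase runs, keeping acronym groups like "NLU" whole
            frags.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z0-9]*|[a-z0-9]+", p))
        for frag in frags:
            if frag.lower() != raw.lower():
                tokens.append(frag.lower())
    return tokens

toks = code_aware_tokens("HierarchicalMemory snake_case_helper")
# yields the full tokens plus fragments like "hierarchical", "memory", "snake"
```

Keeping the full token preserves exact-identifier matches, while the fragments let a query like "memory helper" still land on `HierarchicalMemory`.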
Stable local comparison on the same real 40-file / 840-row / 6-question point, ann_k=16, top_k=4, 3 runs:
- plain file-seeded `sorted_heap` expansion:
  - `generic`: `0.450 ms`, `63.3%` keyword coverage, `33.3%` full hits
  - `code_aware`: `0.427 ms`, `61.4%` keyword coverage, `16.7%` full hits
- prompt-diverse rerank:
  - `generic`: `3.178 ms`, `76.7%`, `33.3%`
  - `code_aware`: `3.351 ms`, `76.7%`, `50.0%`
- oracle keyword rerank:
  - `generic`: `2.672 ms`, `90.0%`, `66.7%`
  - `code_aware`: `2.435 ms`, `96.7%`, `83.3%`
This is another mixed but useful falsifier:
- code-aware tokenization is not a free win by itself
- plain ANN + file expansion actually got slightly worse
- but once combined with a diversity-aware rerank, the same code-aware mode did improve the small-budget `full_pct`
So the current code-corpus frontier is now even narrower:
the next likely win is not “better seeds” or “more edges”, but a tighter coupling between code-aware embeddings and a smarter small-budget rerank / packing contract
Summary-output packing win
The next bounded hypothesis was the most direct one implied by the previous diagnostics:
if the real bottleneck is small-budget packing, then maybe raw chunks are simply the wrong final output unit for this code benchmark
The harness already materializes one summary row per file. The new test keeps the same ANN-derived file seeds, but returns file summaries as the final output rows instead of raw chunks.
Stable local result on the same real 40-file / 840-row / 6-question point, ann_k=16, top_k=4, 3 runs:
- generic embedding mode:
  - chunk output (`seed_file_expand_in`, `sorted_heap`): `0.418 ms`, `63.3%` keyword coverage, `33.3%` full hits
  - summary output (`seed_file_summary_output_in`, `sorted_heap`): `0.200 ms`, `71.0%` keyword coverage, `33.3%` full hits
  - prompt summary rerank (`prompt_summary_rerank_in`, `sorted_heap`): `0.318 ms`, `73.3%` keyword coverage, `50.0%` full hits
- code-aware embedding mode:
  - chunk output: `0.418 ms`, `61.4%`, `16.7%`
  - summary output: `0.207 ms`, `77.6%`, `33.3%`
  - prompt summary rerank: `0.426 ms`, `77.6%`, `33.3%`
This is the first clean small-budget win on the real code corpus:
- summary rows are a better packing unit than raw chunks at `top_k=4`
- they improve coverage while also reducing latency
- in the generic mode, prompt-aware reranking over summaries also improves `full_pct`
So the current strongest product-facing hypothesis is no longer “better seeds” or “more graph edges”. It is:
for real code GraphRAG, file summaries are a stronger final output unit than raw chunks when the answer budget is tiny
Summary rows as seed unit
The next narrow question was whether summaries are only a better output unit, or also a better seed unit.
That was tested by forcing the ANN seed step to rank only `REL_FILE_SUMMARY` rows and then keeping the final result set on summaries as well.
Stable local result on the same real 40-file / 840-row / 6-question point, ann_k=16, top_k=4, 3 runs:
- generic embedding mode:
  - summary output from mixed ANN seeds (`seed_file_summary_output_in`, `sorted_heap`): `0.199 ms`, `71.0%` keyword coverage, `33.3%` full hits
  - summary output from summary-only seeds (`summary_seed_summary_output_in`, `sorted_heap`): `0.116 ms`, `77.6%` keyword coverage, `33.3%` full hits
  - prompt summary rerank from mixed seeds: `0.329 ms`, `73.3%`, `50.0%`
  - prompt summary rerank from summary-only seeds: `0.541 ms`, `74.3%`, `33.3%`
- code-aware embedding mode:
  - mixed-seed summary output: `0.193 ms`, `77.6%`, `33.3%`
  - summary-only seed summary output: `0.112 ms`, `64.3%`, `33.3%`
So the current tiny-budget frontier is now split into two clear points:
- fastest coverage point on this corpus:
- generic embedding mode
- summary-only seeds
- summary output
- best full-hit point on this corpus:
- generic embedding mode
- mixed ANN seeds
- prompt-aware summary rerank
And one more falsifier is now clear:
summary rows are not universally a better seed unit; the benefit depends on the embedding mode and the final scoring contract
Summary-plus-chunk hybrid output
The next bounded question was whether the best tiny-budget contract should stay purely on summaries, or whether a hybrid output can do better:
use summaries to choose the right files, but also emit one best chunk from each selected file so the final answer set contains both compressed context and one concrete code span
That was tested in two variants:
- mixed ANN seeds -> summary ranking -> one best chunk per selected file
- summary-only seeds -> summary ranking -> one best chunk per selected file
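The hybrid contract can be sketched as follows; the row shapes, scores, and function name are illustrative assumptions, not the actual SQL cases:

```python
# Hypothetical sketch of the summary+chunk hybrid output: rank files by
# their summary score, then emit (summary, best chunk) per selected file
# until the tiny top_k budget is exhausted.

def hybrid_output(summaries, chunks, top_k):
    """summaries: {file: (score, payload)}; chunks: {file: [(score, payload), ...]}."""
    ranked_files = sorted(summaries, key=lambda f: summaries[f][0], reverse=True)
    out = []
    for f in ranked_files:
        if len(out) >= top_k:
            break
        out.append(("summary", f, summaries[f][1]))
        if len(out) < top_k and chunks.get(f):
            # one concrete code span per selected file
            best = max(chunks[f], key=lambda c: c[0])
            out.append(("chunk", f, best[1]))
    return out[:top_k]

rows = hybrid_output(
    {"a.cr": (0.9, "sum a"), "b.cr": (0.7, "sum b")},
    {"a.cr": [(0.2, "chunk a1"), (0.8, "chunk a2")], "b.cr": [(0.5, "chunk b1")]},
    top_k=4,
)
```

The design intent is that each selected file contributes both compressed context (its summary) and one concrete code span, instead of spending the whole budget on one representation.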
Stable local result on the same real 40-file / 840-row / 6-question point, ann_k=16, top_k=4, 3 runs:
- generic embedding mode:
  - best prior full-hit point, mixed-seed prompt summary rerank: `0.363 ms`, `73.3%`, `50.0%`
  - mixed-seed summary+chunk hybrid: `1.481 ms`, `84.3%`, `33.3%`
  - summary-seeded summary+chunk hybrid: `1.627 ms`, `78.1%`, `50.0%`
- code-aware embedding mode:
  - best prior summary-only point, prompt summary rerank: `0.372 ms`, `77.6%`, `33.3%`
  - mixed-seed summary+chunk hybrid: `1.616 ms`, `84.3%`, `50.0%`
  - summary-seeded summary+chunk hybrid: `1.688 ms`, `77.6%`, `33.3%`
So this branch narrows the frontier again:
- hybrid output is not a universal improvement
- for the generic mode, pure summary rerank remains the better tiny-budget full-hit point
- for the code-aware mode, mixed-seed summary+chunk hybrid is the first path that reaches `50.0%` full hits at `top_k=4`
That means the current strongest small-budget choices are now split:
- generic mode:
- summaries-only remain the better contract
- code-aware mode:
- hybrid summary+chunk output is now the better contract
Fixed-ratio hybrid packing
The previous hybrid branch still left one obvious ambiguity:
was the hybrid result about having both summaries and chunks at all, or just about how many summary slots the tiny `top_k=4` budget reserved?
That was tested with two fixed-ratio mixed-seed hybrids:
- summary-light: `1` summary slot + `3` chunk slots
- summary-heavy: `3` summary slots + `1` chunk slot
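The fixed-ratio packing rule itself is small enough to sketch directly; the function name and the backfill behavior are assumptions, not the harness implementation:

```python
# Hypothetical sketch of fixed-ratio packing: reserve a fixed number of
# summary slots and chunk slots inside the tiny top_k budget. The 3+1
# "summary-heavy" split is the winning ratio reported below.

def pack_fixed_ratio(summary_rows, chunk_rows, summary_slots, chunk_slots):
    """Both inputs are ranked best-first; returns at most summary_slots + chunk_slots rows."""
    picked = summary_rows[:summary_slots] + chunk_rows[:chunk_slots]
    # backfill if one side runs short, so the budget is never wasted
    budget = summary_slots + chunk_slots
    if len(picked) < budget:
        leftovers = summary_rows[summary_slots:] + chunk_rows[chunk_slots:]
        picked += leftovers[: budget - len(picked)]
    return picked

heavy = pack_fixed_ratio(["s1", "s2", "s3", "s4"], ["c1", "c2"], 3, 1)
```

With `top_k=4`, the summary-heavy call reserves three summary slots and one chunk slot regardless of how the two ranked lists interleave by score.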
Stable local result on the same real 40-file / 840-row / 6-question point, ann_k=16, top_k=4, 3 runs:
- generic embedding mode:
  - prior best full-hit point, prompt summary rerank: `0.337 ms`, `73.3%`, `50.0%`
  - prior balanced hybrid: `1.490 ms`, `84.3%`, `33.3%`
  - summary-light hybrid: `1.753 ms`, `80.0%`, `33.3%`
  - summary-heavy hybrid: `1.057 ms`, `86.7%`, `50.0%`
- code-aware embedding mode:
  - prior best point, balanced hybrid: `1.566 ms`, `84.3%`, `50.0%`
  - summary-light hybrid: `2.246 ms`, `68.1%`, `33.3%`
  - summary-heavy hybrid: `0.879 ms`, `84.3%`, `50.0%`
This resolves the remaining hybrid ambiguity:
- the hybrid win is not about chunks in general
- it is specifically about reserving a small number of chunk slots while keeping the budget summary-heavy
So the refined tiny-budget frontier is now:
- generic mode:
- best latency/full-hit tradeoff: pure prompt summary rerank
- best coverage at the same full-hit level: summary-heavy hybrid
- code-aware mode:
- summary-heavy hybrid is now the strongest point
Summary-heavy hybrid with summary-only seeds
The remaining seed question after the fixed-ratio result was very narrow:
if the winning hybrid is already summary-heavy, should its seed unit also be switched fully to summaries?
That was tested directly against the current summary-heavy mixed-seed hybrid.
Stable local result on the same real 40-file / 840-row / 6-question point, ann_k=16, top_k=4, 3 runs:
- generic embedding mode:
  - prompt summary rerank: `0.395 ms`, `73.3%`, `50.0%`
  - mixed-seed summary-heavy hybrid: `1.062 ms`, `86.7%`, `50.0%`
  - summary-seeded summary-heavy hybrid: `1.175 ms`, `87.6%`, `50.0%`
- code-aware embedding mode:
  - prompt summary rerank: `0.390 ms`, `77.6%`, `33.3%`
  - mixed-seed summary-heavy hybrid: `0.965 ms`, `84.3%`, `50.0%`
  - summary-seeded summary-heavy hybrid: `0.981 ms`, `77.6%`, `33.3%`
This closes the seed-unit branch for the current frontier:
- generic mode:
- summary-only seeds can squeeze out a tiny extra coverage gain, but they do not improve full hits and they cost more latency than the mixed-seed summary-heavy hybrid
- code-aware mode:
- summary-only seeds are clearly worse; the mixed-seed summary-heavy hybrid remains the strongest point
Per-question failure pattern
Aggregate percentages were no longer enough to guide the next branch, so the real code-corpus harness now supports targeted diagnostics:
- `--case-filter`
- `--report-questions`
That was used to inspect the current best generic and code-aware contracts on the exact CrossFile prompts from `butler_code_test.cr`.
Stable local diagnostic, same 40-file / 840-row / 6-question point, ann_k=16, top_k=4, 3 runs:
- generic mode
  - best latency/full-hit point: `prompt_summary_rerank_in`
  - best coverage/full-hit point: `prompt_summary_chunk_hybrid_s3_in`
- code-aware mode
  - best point: `prompt_summary_chunk_hybrid_s3_in`
The important result is not just the percentages, but which questions stay hard:
- `Response memory policy`
  - still misses under all current best contracts
  - current quality stays around `40.0%`
- `Streaming overlap`
  - still misses under all current best contracts
  - current quality stays around `80.0%`
- `Butler response routing`
  - generic contracts still miss it
  - code-aware summary-heavy hybrid fixes it to `100.0%`
- `Memory store flow`
  - generic best contracts already solve it
  - code-aware summary-heavy hybrid still leaves it at `85.7%`
This narrows the remaining frontier again:
the next real improvement is likely query-specific or corpus-specific, not a broad packing or seed policy that helps every question equally
The new payload diagnostics make that even more concrete:
- `Response memory policy`
  - current best contracts already pull the right file neighborhood: `memory/hierarchical.cr`, `memory/pgvector.cr`, `memory/external_store.cr`, `butler/persona.cr`
  - the summary-heavy hybrid even surfaces the `_micro_only` chunk from `memory/hierarchical.cr`
  - but the remaining miss is about policy nuance, not file choice: the returned rows still do not cover the full combination of `_micro_only`, refusal/pollution behavior, and external-storage policy
- `Streaming overlap`
  - current best contracts already pull the correct file: `streaming/controller.cr`
  - both summary and chunk rows surface the overlap/chunking topic
  - the remaining miss is about exact constants / same-file granularity: the query still does not close the final `1500/100` coverage gap
So for these two stubborn real prompts, the problem has narrowed from “retrieval picked the wrong files” to a much smaller statement:
the current system is usually choosing the right file region, but not yet the exact evidence fragment or policy detail needed to close the benchmark
Same-file local chunk refinement does not rescue the hard prompts
The next bounded hypothesis was:
if the right file is already selected, maybe the fix is simply to give the best file two nearby chunks instead of one
That was tested with a new `prompt_summary_chunk_local2_in` case:
- keep the summary-heavy contract
- keep mixed ANN seeds
- replace the single best chunk from the top file with a 2-chunk local window around the best chunk anchor
It did not help.
Targeted hard-prompt rerun (Response memory policy + Streaming overlap, ann_k=16, top_k=4, fresh backend):
- generic mode:
  - existing summary-heavy hybrid: `70.0%`, `0.945-1.050 ms`
  - local 2-chunk refinement: `70.0%`, `1.660-1.729 ms`
- code-aware mode:
  - existing summary-heavy hybrid: `60.0%`, `1.041-1.078 ms`
  - local 2-chunk refinement: `60.0%`, `1.689-1.733 ms`
Bounded all-question rerun (40 files, 840 rows, 6 real questions, 3 runs):
- generic mode:
  - existing summary-heavy hybrid: `0.988 ms`, `86.7%`, `50.0%`
  - local 2-chunk refinement: `1.572 ms`, `84.3%`, `33.3%`
- code-aware mode:
  - existing summary-heavy hybrid: `0.979 ms`, `84.3%`, `50.0%`
  - local 2-chunk refinement: `1.523 ms`, `84.3%`, `50.0%`
So the next frontier is narrower again:
the missing quality is not solved by a simple “take one more nearby chunk” policy; the remaining problem is finer-grained evidence choice, not just a larger same-file window
Semantic chunk selection is a generic-mode win, but not a universal one
The next bounded question was different from the failed local-window branch:
maybe the chunk budget is fine, and the real problem is that the last chunk is being picked with the wrong scoring rule
That was tested with `prompt_summary_chunk_semantic_s3_in`:
- keep the current summary-heavy mixed-seed contract
- keep the same `3 summaries + 1 chunk` budget
- change only the final chunk selection:
- old path: lexical-first within the top file
- new path: semantic-distance-first within the top file
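The swap being tested can be sketched as two one-line selection rules; the chunk record fields here are illustrative assumptions:

```python
# Hypothetical sketch of the chunk-selection swap: same budget, different
# scoring rule for the single chunk slot. Field names are illustrative.

def pick_chunk_lexical(chunks, prompt_terms):
    """Old path: lexical-first, count prompt-term hits in the chunk text."""
    return max(chunks, key=lambda c: sum(t in c["text"].lower() for t in prompt_terms))

def pick_chunk_semantic(chunks):
    """New path: semantic-distance-first, smallest embedding distance wins."""
    return min(chunks, key=lambda c: c["dist"])

chunks = [
    {"text": "window overlap chunking", "dist": 0.42},
    {"text": "unrelated helper", "dist": 0.17},
]
lex = pick_chunk_lexical(chunks, ["overlap", "chunking"])
sem = pick_chunk_semantic(chunks)
```

As the toy input shows, the two rules can pick different chunks from the same file, which is exactly why the swap can move quality without touching the budget.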
Hard-prompt rerun (Response memory policy + Streaming overlap, fresh backend, ann_k=16, top_k=4):
- generic mode:
  - old summary-heavy hybrid: `70.0%`, `0.992 ms`
  - semantic chunk selection: `70.0%`, `0.481 ms`
- code-aware mode:
  - old summary-heavy hybrid: `60.0%`, `1.035 ms`
  - semantic chunk selection: `60.0%`, `0.429 ms`
Full 6-question rerun (40 files, 840 rows, 3 runs):
- generic mode:
  - old summary-heavy hybrid: `0.976 ms`, `86.7%`, `50.0%`
  - semantic chunk selection: `0.453 ms`, `86.7%`, `50.0%`
- code-aware mode:
  - old summary-heavy hybrid: `0.975 ms`, `84.3%`, `50.0%`
  - semantic chunk selection: `0.474 ms`, `77.6%`, `33.3%`
This creates a new mode-specific frontier:
- generic mode
  - `prompt_summary_chunk_semantic_s3_in` is now the stronger coverage-preserving hybrid
  - it keeps the same aggregate quality as the old summary-heavy hybrid while cutting latency by roughly half
- code-aware mode
- the same semantic swap is not acceptable
- it buys latency, but loses both coverage and full hits
So the next branch should treat the two embedding modes separately instead of assuming one chunk-selection rule can dominate both.
Prompt-focused file-local snippet extraction
The next successful branch stopped changing retrieval at all.
Instead of asking the SQL layer to return better rows, it asked a narrower question:
if `prompt_summary_rerank_in` already selects the right files, can we extract better evidence fragments from those files after retrieval?
That is now implemented in the real code-corpus harness as:
prompt_summary_snippet_py
Contract:
- keep the existing `prompt_summary_rerank_in` SQL seed/output path
- for each returned summary row, resolve the underlying source file
- extract a prompt-focused snippet from the full file using:
  - prompt-term matching against code-aware line tokens
  - coverage-greedy anchor selection with method-definition tie-breaks
  - Crystal method-body expansion instead of fixed-radius windows for selected `def` anchors
  - adjacent helper-method merge for short `?` helpers referenced by the selected method body
  - nearby config-initializer merge for short ivar-based helpers, so snippets keep concrete defaults like `window_size=1500` / `overlap=100`
- append the snippet to the original summary payload instead of replacing the summary row
- cache `(file, prompt)` snippets in-process so repeated runs are measured in both cold and warm regimes
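The coverage-greedy anchor step can be sketched as follows, assuming line-level token matching and a `def`-line tie-break; the real harness layers method-body expansion and the merge rules on top of this:

```python
# Hypothetical sketch of coverage-greedy anchor selection: repeatedly pick
# the line that covers the most not-yet-covered prompt terms, breaking
# ties in favor of method-definition lines.

def greedy_anchors(lines, prompt_terms, max_anchors=2):
    uncovered = set(t.lower() for t in prompt_terms)
    anchors = []
    while uncovered and len(anchors) < max_anchors:
        def gain(i):
            hits = sum(t in lines[i].lower() for t in uncovered)
            return (hits, lines[i].lstrip().startswith("def "))  # def tie-break
        best = max(range(len(lines)), key=gain)
        hit_terms = {t for t in uncovered if t in lines[best].lower()}
        if not hit_terms:
            break  # nothing left in this file covers the remaining terms
        anchors.append(best)
        uncovered -= hit_terms
    return anchors

src = ["class Controller", "def window_size", "  1500", "def overlap", "  100"]
picked = greedy_anchors(src, ["window_size", "overlap"])
```

Greedy coverage, rather than a single best-scoring line, is what lets the snippet span multiple distinct prompt topics inside one file.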
This is a downstream evidence-selection layer, not a new PostgreSQL query primitive. The main value is answer quality on the real cogniformerus CrossFile benchmark.
Verified local result on the stable real code-corpus point (40 files, 840 rows, 6 questions, 384D, ann_k=16, top_k=4, fresh backend):
- generic embedding mode:
  - `prompt_summary_rerank_in`: `p50 0.343-0.392 ms`, `73.3%` keyword coverage, `50.0%` full hits
  - `prompt_summary_snippet_py`:
    - warm-cache `p50 0.551-0.698 ms`
    - cold first-pass `p50 15.316 ms`, `avg 15.435 ms`
    - `100.0%` keyword coverage, `100.0%` full hits
- code-aware embedding mode:
  - `prompt_summary_rerank_in`: `p50 0.395-0.398 ms`, `77.6%`, `33.3%`
  - `prompt_summary_snippet_py`:
    - warm-cache `p50 0.623 ms`
    - `97.6%`, `83.3%`
  - `prompt_symbol_summary_snippet_py`:
    - warm-cache `p50 0.970-0.989 ms`
    - `100.0%`, `100.0%`
Per-question generic rerun on the same corpus shows what the snippet layer actually fixed:
- now solved at `100.0%`: `Butler response routing`, `Memory store flow`, `Two-stage answering`, `NLU hybrid classification`, `Response memory policy`, `Streaming overlap`
Per-question code-aware rerun with `prompt_symbol_summary_snippet_py` now also solves the full set at `100.0%`, including the old remaining miss:
- `Memory store flow`
Interpretation:
- the remaining plateau on this real code corpus was not primarily file retrieval
- it was a file-local evidence selection problem
- the last code-aware miss turned out to be a seed-ranking problem inside the summary path, not a snippet-window problem
- `HierarchicalMemory` was already present in the summary row for `memory/hierarchical.cr`
- the fix was a bounded symbol-aware variant, `prompt_symbol_summary_snippet_py`, which:
  - extracts exact prompt symbols like `HierarchicalMemory`, `TwoStageAnswerer`, `DialogueNLU`
  - unions a tiny exact-symbol summary seed set with the existing ANN seeds
  - ranks summary rows by `symbol_hits` before the older prompt-term score
- the strongest fix was not “wider windows”
- it was preserving summary rows while adding code-structured snippets underneath them
- prompt-focused snippet extraction is the first branch that moves the real code-corpus benchmark from `50.0%` to `100.0%` full hits at the same tiny-budget `top_k=4`
- the current frontier is now split by embedding mode:
  - generic: `prompt_summary_snippet_py` remains the better latency point
  - code-aware: `prompt_symbol_summary_snippet_py` is the quality winner
Important caveat:
- the warm numbers rely on an in-process `(file, prompt)` snippet cache
- so this is a quality-oriented contract, not a free latency win
- the symbol-aware variant is not a generic improvement:
- in generic mode it gives no quality lift and only adds cost
That code-corpus frontier is now also checked under a repeated-build protocol:
- `scripts/repeat_graph_rag_code_corpus_builds.py`
- `3` independent fresh temp-cluster builds
- local `facts_sh` only, same stable point: `384D`, `ann_k=16`, `top_k=4`, `ef_search=64`, `ef_construction=200`, `m=24`
- fresh backend
Verified repeated-build result:
- generic:
  - `prompt_summary_snippet_py`: `p50 median 0.613 ms`, range `0.543-0.632 ms`, stable `100.0% / 100.0%`
  - `prompt_symbol_summary_snippet_py`: `p50 median 0.986 ms`, range `0.932-1.047 ms`, same `100.0% / 100.0%`, therefore strictly slower on the generic frontier
- code-aware:
  - `prompt_summary_snippet_py`: `p50 median 0.612 ms`, range `0.602-0.629 ms`, stable `97.6% / 83.3%`
  - `prompt_symbol_summary_snippet_py`: `p50 median 0.963 ms`, range `0.928-1.022 ms`, stable `100.0% / 100.0%`
Interpretation:
- the new symbol-aware code-aware win is build-stable, not a one-off lucky HNSW construction
- the generic frontier is also build-stable, and the symbol-aware case remains dominated there
That same repeated-build protocol was then rerun on an AWS ARM64 host (4 vCPU, 8 GiB RAM) using:
- `scripts/repeat_graph_rag_code_corpus_builds_aws.sh`
- the same `3` fresh builds
- the same minimal synced `cogniformerus` source tree and `butler_code_test.cr` prompt set
Verified AWS repeated-build result:
- generic:
  - `prompt_summary_snippet_py`: `p50 median 0.955 ms`, range `0.954-0.960 ms`, stable `100.0% / 100.0%`
  - `prompt_symbol_summary_snippet_py`: `p50 median 1.485 ms`, range `1.473-1.487 ms`, same `100.0% / 100.0%`, still strictly slower on the generic frontier
- code-aware:
  - `prompt_summary_snippet_py`: `p50 median 1.008 ms`, range `1.008-1.009 ms`, stable `97.6% / 83.3%`
  - `prompt_symbol_summary_snippet_py`: `p50 median 1.541 ms`, range `1.537-1.557 ms`, stable `100.0% / 100.0%`
So the code-aware split is now cross-environment verified:
- generic keeps the older snippet contract
- code-aware keeps the symbol-aware snippet contract
- the change in winner is not a local Apple-only artifact
Larger in-repo cogniformerus transfer gate
The previous repeated-build result used the smaller synced cogniformerus/src/cogniformerus slice (40 files, 840 rows after summary + chunk expansion). That was a good stable benchmark, but it was still fair to ask whether the contract would survive a materially larger in-repo code corpus.
The next bounded adversary check therefore reran the same repeated-build protocol on the full cogniformerus repository:
- source tree: `~/Projects/Crystal/cogniformerus`
- file count: `183` Crystal files
- prompt set: the same real `butler_code_test.cr` CrossFile prompts
- same ANN knobs: `384D`, `ann_k=16`, `ef_search=64`, `ef_construction=200`, `m=24`
The old tiny-budget point (top_k=4) did not transfer cleanly:
- generic `prompt_summary_snippet_py`: `p50 0.770 ms`, `87.1%` keyword coverage, `66.7%` full hits, `avg_rows 3.67`
- code-aware `prompt_symbol_summary_snippet_py`: `p50 1.824 ms`, `87.6%` keyword coverage, `66.7%` full hits, `avg_rows 4.00`
That is a real transfer gap, but it is not the same kind of failure as the external folding/src miss. The next bounded hypothesis was simply to raise the final result budget while keeping the same seed contract and the same winner cases.
At top_k=8, 3 fresh builds gave:
- generic `prompt_summary_snippet_py`: `p50 median 0.819 ms`, range `0.794-0.855 ms`, stable `100.0% / 100.0%`, `avg_rows 6.33`
- code-aware `prompt_symbol_summary_snippet_py`: `p50 median 1.814 ms`, range `1.669-2.101 ms`, stable `100.0% / 100.0%`, `avg_rows 7.50`
So the larger in-repo Crystal-side transfer gate is now verified.
The honest correction is:
- the current real code-corpus winners are not universal at the old `top_k=4` budget
- on the full in-repo corpus, they need a slightly larger final result budget
- once that budget moves to `top_k=8`, the current winners recover perfectly without needing a new seed or snippet contract
That narrows the remaining 0.13 real-corpus gap further:
- `~/Projects/Crystal` now has both the small stable slice and a larger full-repo transfer gate
- the next unverified generalization work was the mixed-language / archive side (`~/Projects/C`, `~/SrcArchives`)
Mixed-language ~/Projects/C adversary gate (pycdc)
The next release-hardening branch widened the code-corpus harness itself:
- JSON question fixtures are now supported
- source discovery is no longer hardcoded to `*.cr`
- local dependency edges now also understand quoted C/C++ includes: `#include "..."` -> `REQUIRES_FILE`
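The quoted-include edge rule can be sketched directly; the edge tuple shape is an illustrative assumption:

```python
import re

# Hypothetical sketch of the quoted-include edge rule: a C/C++ line like
#   #include "ASTree.h"
# becomes a REQUIRES_FILE edge to that local header. Angle-bracket
# includes (<cstdio>) are deliberately ignored as non-local.

INCLUDE_RE = re.compile(r'^\s*#\s*include\s+"([^"]+)"')

def requires_file_edges(src_path, text):
    edges = []
    for line in text.splitlines():
        m = INCLUDE_RE.match(line)
        if m:
            edges.append((src_path, "REQUIRES_FILE", m.group(1)))
    return edges

edges = requires_file_edges(
    "pycdc/ASTree.cpp",
    '#include "ASTree.h"\n#include <cstdio>\n  #include "bytecode.h"\n',
)
```

Restricting edges to quoted includes is what keeps the dependency graph local to the corpus instead of pulling in system headers.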
That made it possible to run the same narrow code-GraphRAG benchmark shape on a real mixed-language corpus under ~/Projects/C without inventing a separate harness family.
The first such corpus was pycdc:
- source tree: `~/Projects/C/pycdc`
- fixture: `scripts/fixtures/graph_rag_pycdc_questions.json`
- source extensions: `.h`, `.cpp`, `.txt`, `.markdown`
- corpus size: `138` files, `1281` rows after summary + chunk expansion, `72` local dependency edges from quoted includes
The first smoke run already gave the key split:
- generic `prompt_summary_snippet_py`: `75.0%` keyword coverage, `40.0%` full hits
- generic `prompt_symbol_summary_snippet_py`: `90.0%`, `60.0%`
- code-aware `prompt_summary_snippet_py`: `70.0%`, `60.0%`
- code-aware `prompt_compactseed_require_summary_snippet_fn`: `100.0%`, `100.0%`
That already falsified the lazy story that mixed-language transfer would look just like the Crystal corpora with only file-summary rerank. On pycdc, the include-aware rescue path matters much more.
Repeated-build verification at top_k=8, 3 fresh builds, then gave:
- generic `prompt_symbol_summary_snippet_py`: `p50 median 0.850 ms`, range `0.825-1.118 ms`, stable `90.0% / 60.0%`, `avg_rows 6.40`
- code-aware `prompt_compactseed_require_summary_snippet_fn`: `p50 median 8.006 ms`, range `7.799-8.136 ms`, stable `100.0% / 100.0%`, `avg_rows 5.80`
So the first real ~/Projects/C gate is now covered, but it does not produce the same frontier as the Crystal corpora:
- there is no equally cheap generic `100.0% / 100.0%` point here
- the quality-complete point currently needs the slower helper-backed compact lexical seed + include rescue
That still narrows the 0.13 release gap meaningfully:
- `~/Projects/Crystal` is covered
- `~/Projects/C` is covered
- the remaining unverified archive-side gate is now `~/SrcArchives`
Archive-side ~/SrcArchives gate (ninja/src)
The last remaining real-corpus gap named in the 0.13 plan was the archive side under ~/SrcArchives. The new mixed-language harness path made it possible to cover that without another code change, so the next adversary corpus was:
- source tree: `~/SrcArchives/apple/ninja/src`
- fixture: `scripts/fixtures/graph_rag_ninja_questions.json`
- source extensions: `.h`, `.cc`
- corpus size: `103` files, `1757` rows after summary + chunk expansion, `282` local dependency edges from quoted includes
The first smoke at the current default-ish budget (top_k=8) already gave a useful signal:
- generic `prompt_summary_snippet_py`: `95.0%` keyword coverage, `80.0%` full hits
- code-aware `prompt_summary_snippet_py`: `85.0%`, `80.0%`
That differed from pycdc in an important way:
- the archive corpus was already close on the plain generic path
- the code-aware path was not stronger here
- there was no immediate evidence that a dependency-rescue branch was needed
The cheapest falsifier was therefore not a new query contract, but just a small increase in the final result budget. At top_k=12:
- generic `prompt_summary_snippet_py`: `100.0% / 100.0%`, `p50 0.996 ms` on the first smoke
- code-aware `prompt_summary_snippet_py`: stayed at `85.0% / 80.0%`
Repeated-build verification (3 fresh builds) then confirmed the archive-side winner:
- generic `prompt_summary_snippet_py`: `p50 median 0.914 ms`, range `0.827-0.921 ms`, stable `100.0% / 100.0%`, `avg_rows 7.80`
- code-aware `prompt_summary_snippet_py`: `p50 median 0.871 ms`, range `0.848-0.901 ms`, stable `85.0% / 80.0%`, `avg_rows 7.60`
So the archive-side gate is now covered, and the conclusion is pleasantly narrow:
- `~/SrcArchives` does not require a new rescue contract for the first verified corpus
- the simple generic summary-snippet path closes `ninja/src`
- the only change needed versus the smaller code-corpus points was a small result-budget bump from `top_k=8` to `top_k=12`
This means the 0.13 larger real-corpus verification matrix is now complete in the scoped sense the plan asked for:
- `~/Projects/Crystal`
- `~/Projects/C`
- `~/SrcArchives`
External folding corpus check
The next adversary check was a second real code corpus outside this repository:
- source tree: `folding/src`
- prompt set: `butler_folding_test.cr`
This surfaced one real harness bug first:
- `scripts/bench_graph_rag_code_corpus.py` originally globbed `*.cr` paths without filtering on `is_file()`
- on the `folding` tree that accidentally picked up `.crystal-cache` directories ending in `.cr`
- the harness now filters to real files only
Once that was fixed, the external corpus produced a useful repeated-build result. Local 3-build protocol on facts_sh, generic mode, same small-budget point (384D, ann_k=16, top_k=4, ef_search=64, ef_construction=200, m=24, fresh backend):
- `prompt_summary_snippet_py`:
  - `p50 median 1.048 ms`, range `0.913-4.141 ms`
  - quality drifted across fresh builds: `90.5-100.0%` keyword coverage, `83.3-100.0%` full hits
- `prompt_lexseed_require_summary_snippet_fn`:
  - the first non-oracle rescue to `100.0% / 100.0%`
  - but under a colder repeated-build protocol it turned out to be much more expensive than the earlier one-build numbers suggested: `p50 median 28.266 ms`, range `26.887-30.698 ms`
- `prompt_compactseed_require_summary_snippet_fn`:
  - `p50 median 5.940 ms`, range `5.914-6.128 ms`
  - stable `100.0% / 100.0%`
- `oracle_prompt_summary_snippet_py`:
  - on a bounded full rerun it also stayed at `100.0% / 100.0%`, but the non-oracle compact-seed rescue already matches that quality, so oracle seeds are no longer the interesting external-generic diagnostic
Interpretation:
- the old claim that generic external folding was already solved by `prompt_summary_snippet_py` was too strong
- the generic baseline is now clearly less robust on this corpus than on the in-repo `cogniformerus` slice
- the first full-summary lexical rescue proved that the external gap was solvable, but it was too expensive to be a real frontier
- the stronger branch was a different lexical-seed representation: a compact per-file seed table built from file path terms, require-target terms, and deduplicated summary tokens
- that compact-seed rescue still closes the quality gap to `100.0% / 100.0%`, but cuts the old full-summary lexical rescue cost by about `4.8x` locally
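The compact per-file seed representation can be sketched as follows; the token sources match the description above, but the exact splitting rules and token budget are assumptions:

```python
import re

# Hypothetical sketch of the compact per-file seed row: one short,
# deduplicated token string per file built from path terms, require
# targets, and summary tokens, instead of multi-KB summary payloads.

def compact_seed_text(path, require_targets, summary, max_tokens=64):
    terms = []
    terms += re.split(r"[/._-]+", path.lower())             # path terms
    for t in require_targets:                               # require targets
        terms += re.split(r"[/._-]+", t.lower())
    terms += re.findall(r"[a-z0-9_]+", summary.lower())     # summary tokens
    seen, out = set(), []
    for t in terms:
        if t and t not in seen:                             # dedupe, keep order
            seen.add(t)
            out.append(t)
    return " ".join(out[:max_tokens])

seed = compact_seed_text(
    "memory/hierarchical.cr",
    ["memory/external_store"],
    "HierarchicalMemory keeps micro and macro tiers. Micro tiers expire.",
)
```

Because each seed row is bounded and deduplicated, prompt-term scoring touches a few hundred bytes per file instead of the multi-kilobyte summary payloads, which is where the reported speedup comes from.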
An isolated timing split then narrowed where that penalty actually sits. On a fresh local 3-run sweep of only the old full-summary helper-backed rescue:
- generic `prompt_lexseed_require_summary_snippet_fn`:
  - avg fetch ms/query = `10.674`
  - avg postprocess ms/query = `8.033`
  - `24` snippet-cache misses, `48` hits
  - avg build time per miss = `6.010 ms`
- code-aware `prompt_lexseed_require_summary_snippet_fn`:
  - avg fetch ms/query = `11.016`
  - avg postprocess ms/query = `7.742`
  - `24` snippet-cache misses, `48` hits
  - avg build time per miss = `5.787 ms`
So the external rescue is not primarily a snippet-extraction problem. Even on the isolated cold pass, the dominant term is still the lexical-seed + REQUIRES_FILE fetch path. Snippet generation is a real secondary tax on the first pass, but it is not where the largest win now sits.
A kept-temp-cluster component probe narrowed that one step further. On the same external folding/src corpus:
- `ann` alone was cheap: about `0.51 ms` median across the 6 real prompts
- `lexical_seed` alone was the real dominant stage: about `9.34 ms` median
- `rescue_require` landed at about `9.28 ms` median because it inherits the same lexical-seed cost
- `rescue_lexical_require_summaries` was about `9.86 ms` median
The summary rows explain why this stage is expensive: REL_FILE_SUMMARY payload length was 80 / 2078 / 5441 bytes at min / median / max on the external corpus. So the rescue is paying to run prompt-term substring scoring against multi-kilobyte summary payloads even before snippet extraction starts.
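That cost profile is easy to reproduce in miniature: substring scoring is linear in payload bytes, and truncating the payload (the `left(payload, n)` idea explored on this corpus) bounds the scan at the risk of dropping evidence that sits late in the payload. A hedged sketch:

```python
# Hypothetical sketch of why the lexical-seed stage is expensive: the
# prompt-term substring score walks the whole summary payload per term,
# so cost scales with payload bytes. Truncation (like SQL left(payload, n))
# bounds the scan but can drop late-payload evidence.

def prompt_term_score(payload, prompt_terms, prefix=None):
    text = payload.lower() if prefix is None else payload[:prefix].lower()
    return sum(1 for t in prompt_terms if t.lower() in text)

# evidence deliberately placed past the first 512 bytes
payload = ("header " * 200) + "window_size 1500 overlap 100"
terms = ["window_size", "overlap"]

full_score = prompt_term_score(payload, terms)             # scans everything
cut_score = prompt_term_score(payload, terms, prefix=512)  # bounded scan
```

The toy case mirrors the observed tradeoff: the bounded scan is cheaper but misses the late evidence entirely, which is the same failure mode the prefix-truncation experiment hits on the real corpus.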
The same external folding/src corpus also answered the code-aware question. At the same repeated-build point:
- code-aware `prompt_summary_snippet_py`: `p50 median 1.080 ms`, range `1.048-1.146 ms`, stable `79.8% / 66.7%`
- code-aware `prompt_lexseed_require_summary_snippet_fn`: `p50 median 36.676 ms`, range `29.806-40.705 ms`, stable `100.0% / 100.0%`
- code-aware `prompt_compactseed_require_summary_snippet_fn`: `p50 median 5.804 ms`, range `5.776-6.510 ms`, stable `100.0% / 100.0%`
- code-aware `oracle_prompt_summary_snippet_py`: `p50 median 1.217 ms`, range `1.149-1.303 ms`, stable `100.0% / 100.0%`
So the external folding split is now sharper:
- both generic and code-aware external folding now have a verified non-oracle rescue to `100.0% / 100.0%`
- the external problem really was a seed-representation problem, not a snippet extraction problem
- the current external default is the compact-seed rescue, not the old full-summary lexical rescue
- the old full-summary rescue is now useful mainly as a diagnostic anchor for why the compact representation matters
- the honest conclusion is narrower:
  - external folding is no longer blocked by an unsolved quality gap
  - it still pays a quality/latency tax relative to the primary `cogniformerus` code corpus, but that tax is now much smaller than before
That local result also transferred to AWS ARM64 (4 vCPU, 8 GiB RAM) under a fresh 3-build repeated-build protocol:
- generic `prompt_summary_snippet_py`: p50 median `1.540 ms`, range `1.535-1.604 ms`, stable `90.5% / 83.3%`
- generic `prompt_lexseed_require_summary_snippet_fn`: p50 median `41.960 ms`, range `41.747-42.081 ms`, stable `100.0% / 100.0%`
- generic `prompt_compactseed_require_summary_snippet_fn`: p50 median `8.839 ms`, range `8.732-8.846 ms`, stable `100.0% / 100.0%`
- code-aware `prompt_summary_snippet_py`: p50 median `1.775 ms`, range `1.729-1.836 ms`, stable `79.8% / 66.7%`
- code-aware `prompt_lexseed_require_summary_snippet_fn`: p50 median `60.413 ms`, range `60.298-60.660 ms`, stable `100.0% / 100.0%`
- code-aware `prompt_compactseed_require_summary_snippet_fn`: p50 median `8.392 ms`, range `8.329-8.413 ms`, stable `100.0% / 100.0%`
So the compact-seed external rescue is now cross-environment verified, not a local artifact. The speedup over the old full-summary lexical rescue also survives the environment change:
- generic: `41.960 ms -> 8.839 ms`
- code-aware: `60.413 ms -> 8.392 ms`
The external rescue is still slower than the primary in-repo winners, but it is no longer “full-quality only at tens of milliseconds”.
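In ratio terms, the ARM64 medians quoted above put the compact-seed rescue at roughly a 4.7x (generic) and 7.2x (code-aware) improvement over the full-summary rescue; a quick arithmetic check:

```python
# Speedup of the compact-seed rescue over the full-summary lexical
# rescue, using the AWS ARM64 p50 medians quoted above (milliseconds).
generic_speedup = 41.960 / 8.839      # generic path
code_aware_speedup = 60.413 / 8.392   # code-aware path

assert round(generic_speedup, 1) == 4.7
assert round(code_aware_speedup, 1) == 7.2
```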
The next honest optimization target therefore changed. Cheap seed-budget cuts were already falsified (ann_k < 16 and lexical-seed LIMIT 1 both got worse), and the timing split shows that further work should focus on reducing the old full-summary lexical-seed cost. The compact lexical seed table already eliminated most of that cost, so the next branch is no longer “make lexical seeding viable at all”; it is whether the compact representation can be pushed closer to the primary in-repo code-corpus frontier.
One obvious branch was also falsified directly: truncating lexical scoring to a summary prefix. On the external corpus:
- `left(payload, 512)` dropped the rescue query to about `7.9 ms`, but quality fell back to `96.7% / 83.3%`
- `left(payload, 1024)` restored `100.0% / 100.0%`, but it no longer sped the query up
- the narrower threshold sweep (`640..992`) confirmed there was no useful middle ground: `992` bytes recovered `100.0% / 100.0%`, but was still slower than the full-payload rescue
So a naive prefix cut is now a documented dead end. The remaining work is not “look at less text in the same way”; it needs a different lexical-seed representation or a different seed-selection contract altogether.
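The shape of that dead end is easy to reproduce in miniature. In this toy sketch (hypothetical payload and terms, not the real corpus; `payload[:n]` stands in for SQL's `left(payload, n)`), the deciding prompt term sits just past the short prefix boundary, so the short prefix loses it while a prefix long enough to recover it scans nearly the whole payload anyway:

```python
def lexical_score(prompt_terms, payload, prefix=None):
    """Toy scorer: count prompt-term hits, optionally on a byte prefix."""
    text = payload if prefix is None else payload[:prefix]
    return sum(text.count(term) for term in prompt_terms)

# Hypothetical summary: the deciding term first appears around byte 600.
payload = ("boilerplate " * 50) + "folding kernel entry point"

assert lexical_score(["folding"], payload, prefix=512) == 0   # quality loss
assert lexical_score(["folding"], payload, prefix=1024) == 1  # quality back,
# ...but at 1024 bytes the prefix covers essentially the whole payload,
# which is why the 640..992 sweep found no useful middle ground.
```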
March 26, 2026: sorted_hnsw.shared_cache GraphRAG branch
A new bounded speed branch looked promising for fact-shaped GraphRAG: turning `sorted_hnsw.shared_cache` on for the ANN seed step. A direct local probe on a 2K x 384D multihop graph reduced the path-aware wrapper from roughly `0.911 ms` total to `0.623 ms`, with most of the gain in the ANN stage.
That did not survive the reliability gate.
On the full local 5K-pair, 64-query multihop harness, keeping the same quality knobs (`ann_k=64`, `ef_search=128`, `ef_construction=200`, `m=24`) but switching only `sorted_hnsw.shared_cache` from `off` to `on` caused all `facts_sh` ANN-seeded rows to collapse to `0.0% / 0.0%`, while the `facts_heap` baseline stayed correct in the same run.
The strongest evidence from this branch is:
- the simple direct ANN seed query on `facts_sh` still returned the expected top rows with `shared_cache=on`
- single-query GraphRAG probes could still look correct
- the failure only showed up on the full same-session multihop harness, which points to a cache lifecycle / reuse bug rather than a general GraphRAG scoring bug
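That failure signature, where each query is correct in isolation but a same-session sequence collapses, is the classic shape of a cache reused without proper keying or invalidation. A toy illustration of the failure mode (purely hypothetical Python, not the extension's actual cache):

```python
class BuggySharedCache:
    """Session-scoped cache with a lifecycle bug: results are cached
    without being keyed on the query, so later lookups reuse stale rows."""

    def __init__(self):
        self._cached = None

    def ann_seed(self, query):
        if self._cached is None:          # first query populates the cache
            self._cached = self._search(query)
        return self._cached               # BUG: later queries reuse it as-is

    @staticmethod
    def _search(query):
        # Stand-in for the real index scan: the "nearest" id for this query.
        return [query * 10]

# Run alone (fresh cache per query), every probe looks correct:
assert BuggySharedCache().ann_seed(1) == [10]
assert BuggySharedCache().ann_seed(2) == [20]

# In one session (one shared cache), only the first query is right:
session = BuggySharedCache()
assert session.ann_seed(1) == [10]
assert session.ann_seed(2) == [10]  # stale seeds -> downstream 0.0% recall
```

This is why single-query probes and the direct seed query passed while the full same-session harness collapsed: the bug only manifests when a second, different query hits the already-populated cache.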
So the current honest conclusion is narrow:
- `sorted_hnsw.shared_cache = on` remains a promising performance branch for GraphRAG seed scans
- it is not currently safe as the default GraphRAG benchmark or release operating point
- the benchmark harnesses now expose a `--shared-cache on|off` switch, but the default stays `off` until this correctness issue is debugged and fixed