Benchmarks
All benchmarks run on PostgreSQL 18, Apple M-series (12 CPU, 64 GB RAM) with shared_buffers=4GB, work_mem=256MB, maintenance_work_mem=2GB. Warm cache, average of 5 runs.
EXPLAIN ANALYZE – query latency and buffer reads
1M rows (71 MB)
| Query | sorted_heap | heap + btree | heap seqscan |
|---|---|---|---|
| Point (1 row) | 0.035 ms / 1 buf | 0.046 ms / 7 bufs | 15.2 ms / 6,370 bufs |
| Narrow (100 rows) | 0.043 ms / 2 bufs | 0.067 ms / 8 bufs | 16.2 ms / 6,370 bufs |
| Medium (5K rows) | 0.434 ms / 33 bufs | 0.492 ms / 52 bufs | 16.1 ms / 6,370 bufs |
| Wide (100K rows) | 7.5 ms / 638 bufs | 8.9 ms / 917 bufs | 17.4 ms / 6,370 bufs |
10M rows (714 MB)
| Query | sorted_heap | heap + btree | heap seqscan |
|---|---|---|---|
| Point (1 row) | 0.034 ms / 1 buf | 0.047 ms / 7 bufs | 117.9 ms / 63,695 bufs |
| Narrow (100 rows) | 0.037 ms / 1 buf | 0.062 ms / 7 bufs | 130.9 ms / 63,695 bufs |
| Medium (5K rows) | 0.435 ms / 32 bufs | 0.549 ms / 51 bufs | 131.0 ms / 63,695 bufs |
| Wide (100K rows) | 7.6 ms / 638 bufs | 8.8 ms / 917 bufs | 131.4 ms / 63,695 bufs |
100M rows (7.8 GB)
| Query | sorted_heap | heap + btree | heap seqscan |
|---|---|---|---|
| Point (1 row) | 0.045 ms / 1 buf | 0.506 ms / 8 bufs | 1,190 ms / 519,906 bufs |
| Narrow (100 rows) | 0.166 ms / 2 bufs | 0.144 ms / 9 bufs | 1,325 ms / 520,782 bufs |
| Medium (5K rows) | 0.479 ms / 38 bufs | 0.812 ms / 58 bufs | 1,326 ms / 519,857 bufs |
| Wide (100K rows) | 7.9 ms / 737 bufs | 10.1 ms / 1,017 bufs | 1,405 ms / 518,896 bufs |
At 100M rows, a point query reads 1 buffer (vs 8 for btree, 519,906 for sequential scan).
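The 1-buffer point read falls out of the sorted layout: with rows physically ordered by key, a per-page [min, max] zone map pins a point query to a single heap page. A toy model of that lookup (my sketch, not the extension's actual code; the page ranges are hypothetical):

```python
# Toy zone map over a sorted heap: one (min_key, max_key) summary per page.
# Because the heap is sorted, page key ranges are disjoint and ordered, so a
# point query overlaps exactly one page -- hence the single buffer read.
pages = [(0, 99), (100, 199), (200, 299), (300, 399)]  # hypothetical ranges

def pages_to_read(lo, hi):
    """Heap pages whose [min, max] key range overlaps the predicate [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(pages) if mx >= lo and mn <= hi]

print(pages_to_read(150, 150))  # point query: [1] -- one page, one buffer
print(pages_to_read(150, 250))  # narrow range: [1, 2]
```

A btree pays a few extra buffers for root-to-leaf traversal before touching the heap, which is the 1-buf vs 7-8-buf gap in the tables above.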
pgbench throughput (TPS)
Prepared mode (-M prepared)
Query planned once, re-executed with parameters. 10 s, 1 client.
| Query | 1M (sh / btree) | 10M (sh / btree) | 100M (sh / btree) |
|---|---|---|---|
| Point | 46.9K / 59.4K | 46.5K / 58.0K | 32.6K / 43.6K |
| Narrow | 22.3K / 29.1K | 22.5K / 28.8K | 17.9K / 18.1K |
| Medium | 3.4K / 5.1K | 3.4K / 4.8K | 2.4K / 2.4K |
| Wide | 295 / 289 | 293 / 286 | 168 / 157 |
Simple mode (-M simple)
Each query parsed, planned, and executed separately. 10 s, 1 client.
| Query | 1M (sh / btree) | 10M (sh / btree) | 100M (sh / btree) |
|---|---|---|---|
| Point | 28.4K / 38.0K | 29.1K / 41.4K | 18.7K / 4.6K |
| Narrow | 19.6K / 24.4K | 21.8K / 27.6K | 7.1K / 5.5K |
| Medium | 3.1K / 3.7K | 3.4K / 4.8K | 2.1K / 1.6K |
| Wide | 198 / 290 | 200 / 286 | 163 / 144 |
At 100M rows in simple mode, sorted_heap wins all query types. Point queries reach 18.7K TPS vs 4.6K for btree (4x).
INSERT and compaction throughput
| Scale | sorted_heap | heap + btree | heap (no index) | compact time |
|---|---|---|---|---|
| 1M | 923K rows/s | 961K | 1.91M | 0.3 s |
| 10M | 908K rows/s | 901K | 1.65M | 3.1 s |
| 100M | 840K rows/s | 1.22M | 2.22M | 41.3 s |
Storage
| Scale | sorted_heap | heap + btree | heap (no index) |
|---|---|---|---|
| 1M | 71 MB | 71 MB (50 + 21) | 50 MB |
| 10M | 714 MB | 712 MB (498 + 214) | 498 MB |
| 100M | 7.8 GB | 7.8 GB (5.7 + 2.1) | 5.7 GB |
sorted_heap stores data and zone map in a single structure, so there is no separate index to create or size. At the measured scales its total footprint matches heap + btree (71 MB vs 71 MB at 1M, 7.8 GB vs 7.8 GB at 100M) and sits roughly 40% above a bare heap with no index.
Vector search
Current local synthetic benchmark (sorted_hnsw Index AM)
Repo-owned harnesses:
python3 scripts/bench_gutenberg_local_dump.py --dump /tmp/cogniformerus_backup/cogniformerus_backup.dump --port 65473
REMOTE_PYTHON=/path/to/python SH_EF=32 EXTRA_ARGS='--sh-ef-construction 200' ./scripts/bench_gutenberg_aws.sh <aws-host> /path/to/repo /path/to/dump 65485
scripts/bench_sorted_hnsw_vs_pgvector.sh /tmp 65485 10000 20 384 10 vector 64 96
python3 scripts/bench_ann_real_dataset.py --dataset nytimes-256 --sample-size 10000 --queries 20 --k 10 --pgv-ef 64 --sh-ef 96 --zvec-ef 64 --qdrant-ef 64
python3 scripts/bench_qdrant_synthetic.py --rows 10000 --queries 20 --dim 384 --k 10 --ef 64
python3 scripts/bench_zvec_synthetic.py --rows 10000 --queries 20 --dim 384 --k 10 --ef 64
Current AWS restored-corpus benchmark (~104K x 2880D, Gutenberg dump)
AWS ARM64 host (4 vCPU, 8 GiB RAM), top-10, restored PostgreSQL custom dump. Ground truth is recomputed by exact heap search on the restored svec table. In the current rerun the stored bench_hnsw_gt table matched the exact heap GT on 100% of the 50 benchmark queries, so the fresh exact heap GT and the historical GT table agree. This rerun uses sorted_hnsw ef_construction=200 and ef_search=32, and the harness reconnects after build before timing ordered scans.
| Method | p50 latency | Recall@10 | Notes |
|---|---|---|---|
| Exact heap (svec) | 458.762 ms | 100.0% | brute-force GT on restored corpus |
| sorted_hnsw (svec) | 1.287 ms | 100.0% | ef_construction=200, ef_search=32, index 404 MB, total 1902 MB |
| sorted_hnsw (hsvec) | 1.404 ms | 100.0% | ef_construction=200, ef_search=32, index 404 MB, total 1032 MB |
| pgvector HNSW (halfvec) | 2.031 ms | 99.8% | ef_search=64, index 804 MB, total 1615 MB |
| zvec HNSW | 50.499 ms | 100.0% | in-process collection, ef=64, ~1.12 GiB on disk |
| Qdrant HNSW | 6.028 ms | 99.2% | local Docker on same AWS host, hnsw_ef=64, 103,260 points |
The precision-matched PostgreSQL comparison on Gutenberg is now sorted_hnsw (hsvec) vs pgvector halfvec: 1.404 ms @ 100.0% versus 2.031 ms @ 99.8%, with total footprint 1032 MB versus 1615 MB. The raw fastest PostgreSQL row on this corpus is still sorted_hnsw (svec) at 1.287 ms, but that uses float32 source storage. sorted_hnsw keeps the same 404 MB index in both cases because the AM stores SQ8 graph state; the storage gain from hsvec appears in the base table and TOAST footprint instead.
Synthetic 10K x 384D cosine corpus, top-10, warm query loop. PostgreSQL methods were rerun across 3 fresh builds and the table below reports median p50 / median recall. Qdrant uses 3 warm measurement passes on one local Docker collection.
| Method | p50 latency | Recall@10 | Notes |
|---|---|---|---|
| Exact heap (svec) | 2.03 ms | 100% | brute-force ground truth |
| sorted_hnsw | 0.158 ms | 100% | shared_cache=on, ef_search=96, index ~5.4 MB |
| pgvector HNSW (vector) | 0.446 ms | 90% median (90-95 range) | ef_search=64, same M=16, ef_construction=64, index ~2.0 MB |
| zvec HNSW | 0.611 ms | 100% | local in-process collection, ef=64 |
| Qdrant HNSW | 1.94 ms | 100% | local Docker, hnsw_ef=64 |
Current local real-dataset sample (nytimes-256-angular)
Repo-owned harness:
python3 scripts/bench_ann_real_dataset.py --dataset nytimes-256 --sample-size 10000 --queries 20 --k 10 --pgv-ef 64 --sh-ef 96 --zvec-ef 64 --qdrant-ef 64
ANN-Benchmarks nytimes-256-angular, sampled to 10K base vectors and 20 queries, top-10. The table below reports medians across 3 full harness runs. Ground truth comes from exact PostgreSQL heap search on the sampled svec corpus.
| Method | p50 latency | Recall@10 | Notes |
|---|---|---|---|
| Exact heap (svec) | 1.557 ms | 100% | brute-force ground truth |
| sorted_hnsw | 0.327 ms | 85.0% median (83.5-85.5 range) | shared_cache=on, ef_search=96, index ~4.1 MB |
| pgvector HNSW (vector) | 0.751 ms | 79.0% median (78.5-79.0 range) | ef_search=64, same M=16, ef_construction=64, index ~13 MB |
| zvec HNSW | 0.403 ms | 99.5% | local in-process collection, ef=64, ~14.1 MB on disk |
| Qdrant HNSW | 1.704 ms | 99.5% | local Docker, hnsw_ef=64 |
This corpus is materially harder than the deterministic synthetic one. It is a better signal for default-parameter recall, while the synthetic table remains useful for controlled same-host engine comparisons and regression tracking.
Current local GraphRAG benchmark (person -> parent -> city, stable fact contract)
Repo-owned harness:
python3 scripts/bench_graph_rag_multihop.py --num-pairs 5000 --query-count 64 --runs 3 --dim 384 --ann-k 64 --top-k 10 --ef-search 128 --ef-construction 200 --m 24 --pgv-ef-search 64 --zvec-ef 64 --qdrant-ef 64 --shared-buffers-mb 64 --backend-mode fresh
Deterministic fact graph, 5K chains / 10K rows total, 384D, top-10. This is the current balanced, stable GraphRAG point for this narrow, fact-shaped retrieval workload. The stable-facing SQL entry point is sorted_heap_graph_rag(...); the table below also shows the underlying helper and wrapper paths, because the harness measures the dispatched execution path directly.
| Method | p50 latency | hit@1 | hit@k | Notes |
|---|---|---|---|---|
| Heap two-hop SQL | 0.762 ms | 75.0% | 96.9% | exact rerank over expanded heap set |
| sorted_heap_graph_rag_twohop_scan() | 0.720 ms | 75.0% | 96.9% | old city-only rerank contract |
| sorted_heap_expand_twohop_rerank() | 0.712 ms | 75.0% | 96.9% | same city-only seed point |
| sorted_heap SQL pathsum baseline | 0.847 ms | 98.4% | 98.4% | hop1_distance + hop2_distance in SQL |
| sorted_heap_graph_rag_twohop_path_scan() | 0.739 ms | 98.4% | 98.4% | fused path-aware wrapper |
| sorted_heap_expand_twohop_path_rerank() | 0.726 ms | 98.4% | 98.4% | same knobs, fused path-aware helper |
| pgvector HNSW + heap expansion | 2.588 ms | 90.6% | 90.6% | path-aware rerank, ef_search=64 |
| zvec HNSW + heap expansion | 2.507 ms | 100.0% | 100.0% | path-aware rerank, ef=64 |
| Qdrant HNSW + heap expansion | 4.947 ms | 100.0% | 100.0% | path-aware rerank, hnsw_ef=64 |
The path-aware helper changes the local conclusion materially: the dominant quality issue on this fact-shaped workload was the old hop-2-only rerank contract, not seed ANN quality. The fused path-aware helper now gives the best local latency/quality point at the same m=24, ef_construction=200, ann_k=64, ef_search=128 operating point.
Under the same path-aware scorer contract, the current local conclusion gets sharper: sorted_heap keeps the latency lead, while zvec and Qdrant reach the strongest observed answer quality on this deterministic fact graph.
A repeated-build local protocol then quantified how much of this is just single-run luck. Using three independent fresh builds of the same 5K/384D balanced point:
- sorted_heap_expand_twohop_path_rerank(): median 0.798 ms, range 0.771-0.819 ms, hit@1 = 98.4%, hit@k = 98.4% on every build
- sorted_heap_graph_rag_twohop_path_scan(): median 0.796 ms, range 0.778-0.804 ms, hit@1 = 98.4%, hit@k = 98.4% on every build
- pgvector path-aware parity row: median 1.405 ms, hit@1/hit@k 85.9-89.1%
- zvec path-aware parity row: median 1.076 ms, 100.0% / 100.0%
- Qdrant path-aware parity row: median 2.799 ms, 100.0% / 100.0%
So on the local balanced point, the current sorted_heap path-aware rows are not a fragile one-off. The latency band is tight, and the answer quality did not drift across the three rebuilds.
shared_cache=on vs off for multihop (post-fix, 0490cc4)
A multi-index shared-cache corruption bug was fixed in commit 0490cc4. Previously, when two or more sorted_hnsw indexes existed in the same database and shared_cache=on, a publish for the second index could overwrite shared memory that an attached cache for the first index was still reading through bare pointers. This caused silent data corruption and 0% retrieval quality on the affected index.
The fix deep-copies all data (L0 neighbors, SQ8 vectors, upper-level neighbor slabs) from shared memory into local palloc’d buffers on attach. The exact failure mode is regression-guarded by the multi-index overwrite phase (B5) in scripts/test_hnsw_chunked_cache.sh.
Post-fix verification on the 5K x 384D multihop benchmark (fresh backends, 3 runs, ann_k=64, ef_search=128, m=24):
python3 scripts/bench_graph_rag_multihop.py \
--num-pairs 5000 --dim 384 --query-count 64 --runs 3 \
--ann-k 64 --top-k 10 --ef-search 128 --ef-construction 200 --m 24 \
--shared-cache {on,off} --backend-mode fresh \
--skip-zvec --skip-qdrant --skip-pgvector
| Method | shared_cache | p50 | hit@1 | hit@k |
|---|---|---|---|---|
| sorted_heap_expand_twohop_path_rerank() | off | 0.780 ms | 98.4% | 98.4% |
| sorted_heap_expand_twohop_path_rerank() | on | 0.766 ms | 98.4% | 98.4% |
| sorted_heap_graph_rag_twohop_path_scan() | off | 0.804 ms | 98.4% | 98.4% |
| sorted_heap_graph_rag_twohop_path_scan() | on | 0.786 ms | 98.4% | 98.4% |
Also verified at 10K x 384D:
python3 scripts/bench_graph_rag_multihop.py \
--num-pairs 10000 --dim 384 --query-count 64 --runs 3 \
--ann-k 64 --top-k 10 --ef-search 128 --ef-construction 200 --m 24 \
--shared-cache {on,off} --backend-mode fresh \
--skip-zvec --skip-qdrant --skip-pgvector
| Method | shared_cache | p50 | hit@1 | hit@k |
|---|---|---|---|---|
| sorted_heap_graph_rag_twohop_path_scan() | off | 0.904 ms | 95.3% | 96.9% |
| sorted_heap_graph_rag_twohop_path_scan() | on | 0.972 ms | 95.3% | 96.9% |
Quality is identical between on and off at both measured scales. Latency at these measured scales is mixed rather than a clear universal win. shared_cache=off is no longer needed as a correctness workaround.
Larger-scale cold-start measurement (100K x 32D, 200K rows, 47 MB index):
The deep-copy fix means attach now copies all L0 neighbors (~38 MB), SQ8 data (~6 MB), and node metadata (~6 MB) from shared memory into local buffers on each fresh backend’s first query. At 200K nodes, this upfront cost exceeds the lazy page-decode path used by shared_cache=off:
# 100K pairs, 32D, m=24, ef_construction=200, ef_search=128, 20 fresh backends per mode
# PG buffer cache warm (shared_buffers=256MB):
shared_cache=off: cold KNN p50=6.31ms
shared_cache=on: cold KNN p50=8.68ms
# PG buffer cache cold (PG restarted between modes, shared_buffers=64MB):
shared_cache=off: cold KNN first=5.3ms p50=5.7ms
shared_cache=on: cold KNN first=8.6ms p50=9.5ms
The off path loads L0 pages lazily (only pages visited by beam search). The on path deep-copies all bulk data upfront under LWLock. At 200K nodes, the upfront copy dominates. Quality remains identical.
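The sizes quoted above line up with simple arithmetic on the stated layout (2*M int32 neighbor slots per node at m=24, 1 byte per dimension for SQ8 at 32D; the ~32-byte per-node metadata record is my assumption, chosen to match the reported ~6 MB):

```python
# Attach-time deep-copy volume at 200K nodes: shared_cache=on pays all of
# this upfront on a fresh backend's first query, while =off decodes only the
# L0 pages the beam search actually visits.
n_nodes, m, dim = 200_000, 24, 32

l0_bytes   = n_nodes * 2 * m * 4   # L0 neighbor lists: 2*M int32 slots/node
sq8_bytes  = n_nodes * dim         # SQ8 vectors: 1 byte per dimension
meta_bytes = n_nodes * 32          # node metadata (assumed ~32 bytes/node)

print(f"L0 neighbors: {l0_bytes / 1e6:.1f} MB")   # ~38.4 MB
print(f"SQ8 vectors:  {sq8_bytes / 1e6:.1f} MB")  # ~6.4 MB
print(f"metadata:     {meta_bytes / 1e6:.1f} MB") # ~6.4 MB
```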
Current conclusion: shared_cache=on is a correctness-safe default but not a performance feature at the measured scales. The deep-copy overhead from the multi-index fix neutralized the original latency benefit.
Synthetic multi-hop depth scaling (relation_path depth 1..5)
Repo-owned harness:
python3 scripts/bench_graph_rag_multidepth.py --num-pairs 5000 --max-depth 5 --query-count 32 --runs 3 --dim 384 --ann-k 64 --top-k 10 --ef-search 128 --ef-construction 200 --m 24 --shared-buffers-mb 64
Deterministic chain-shaped fact graph, 25K rows total, 384D, top-10, fresh backend, path-aware scorer. This is a narrow scaling check for the generic unified syntax:
- relation_path := ARRAY[1]
- relation_path := ARRAY[1,2]
- relation_path := ARRAY[1,2,3]
- relation_path := ARRAY[1,2,3,4]
- relation_path := ARRAY[1,2,3,4,5]
| Method | depth 1 | depth 2 | depth 3 | depth 4 | depth 5 | Quality |
|---|---|---|---|---|---|---|
| Heap SQL path baseline | 0.573 ms | 0.589 ms | 0.613 ms | 0.622 ms | 0.622 ms | 100.0% / 100.0% |
| sorted_heap SQL path baseline | 0.774 ms | 0.732 ms | 0.776 ms | 0.712 ms | 0.772 ms | 100.0% / 100.0% |
| sorted_heap_graph_rag(…, score_mode := 'path') | 0.674 ms | 0.651 ms | 0.643 ms | 0.633 ms | 0.649 ms | 100.0% / 100.0% |
On this synthetic chain benchmark, the unified path-aware function does not show a latency cliff through depth 5; it stays in the same ~0.63-0.67 ms band while preserving 100.0% / 100.0% quality. This is a controlled scaling signal, not a claim about arbitrary deep graph workloads.
Larger synthetic GraphRAG scale envelope (1M and 10M)
Repo-owned harnesses:
- local:
python3 scripts/bench_graph_rag_multidepth.py --num-pairs 200000 --max-depth 5 --query-count 8 --runs 1 --dim 384 --ann-k 64 --top-k 10 --ef-search 128 --ef-construction 64 --m 24 --shared-buffers-mb 64 --max-wal-size-gb 32 --maintenance-work-mem-mb 4096 --table-scope sorted_heap_only
- AWS:
NUM_PAIRS=2000000 MAX_DEPTH=5 QUERY_COUNT=1 RUNS=1 DIM=32 ANN_K=16 TOP_K=10 EF_SEARCH=32 EF_CONSTRUCTION=8 M=8 SHARED_BUFFERS_MB=64 MAX_WAL_SIZE_GB=16 MAINTENANCE_WORK_MEM_MB=2048 TABLE_SCOPE=sorted_heap_only ./scripts/bench_graph_rag_multidepth_aws.sh <aws-host> /path/to/repo 65494
Current verified scale picture:
- local 1M rows (200K pairs x 5 hops, 384D, cheap-build point):
  - generate_csv: 162.667 s
  - load_data: 45.694 s
  - build_indexes: 377.968 s
  - unified path-aware sorted_heap_graph_rag(...) p50: depth 1: 1.250 ms, depth 2: 1.429 ms, depth 3: 1.777 ms, depth 4: 2.181 ms, depth 5: 2.604 ms
  - quality stayed 100.0% / 100.0% at every depth
- local 10M x 384D no longer dies in Python generation after the streaming CSV change; the process stayed near 31 MiB RSS on the early run where the old generator previously blew up.
- local 10M x 64D and AWS 10M x 32D both survive generation and load, but the first practical frontier is still the single sorted_hnsw CREATE INDEX build, not query execution.
The strongest bounded AWS signal so far is:
10M x 32D (2M pairs x 5 hops, ultralight build point):
- generate_csv: 223.114 s
- CSV size: 3.63 GiB
- load_data: 485.177 s
- on the old build path, it then entered a long CREATE INDEX that was still active after ~11-12 minutes in the index-build phase
- temp-cluster footprint reached about 17-18 GiB, with 11-13 GiB disk still free on the AWS host
After removing the per-search visited[] allocation/zeroing from the hot HNSW build loop, the same local diagnostic point that previously established the bottleneck changed materially:
- local 500K x 32D (100K pairs x 5 hops, m=8, ef_construction=8):
  - old total build: 18.1-18.7 s
  - old graph-construction phase: about 18.27 s
  - new total build: 2.777-2.996 s
  - new graph-construction phase: about 2.42-2.59 s
  - page-writing tail stayed small (~0.20-0.24 s)
That moved the AWS 10M x 32D scale branch from “still stuck in CREATE INDEX” to the first real 10M query pass on the same cheap-build point:
- generate_csv: 222.814 s
- load_data: 468.923 s
- build_indexes: 212.553 s
- analyze: 14.562 s
- unified path-aware sorted_heap_graph_rag(...) p50: depth 1: 2.124 ms, depth 2: 3.433 ms, depth 3: 4.611 ms, depth 4: 5.846 ms, depth 5: 5.139 ms
This is not a release-quality GraphRAG point. The build/query knobs here were intentionally ultra-light (ann_k=16, ef_search=32, ef_construction=8, m=8) to get the first 10M latency envelope. At those settings, answer quality on the single-query AWS probe was poor (0.0% / 0.0%) and should be treated as a scale smoke, not a publishable quality frontier.
I then kept that AWS temp cluster alive and ran query-only sweeps on the same cheap-built 10M x 32D graph to test whether query-time tuning alone could recover depth-5 quality without paying for another CREATE INDEX. It did not:
- ef_search=128, ann_k=64 -> depth-5 2.626 ms, 0.0% / 0.0%
- ef_search=128, ann_k=128 -> depth-5 3.040 ms, 0.0% / 0.0%
- ef_search=256, ann_k=128 -> depth-5 3.284 ms, 0.0% / 0.0%
- ef_search=256, ann_k=256 -> depth-5 4.076 ms, returned 10 rows, but still 0.0% / 0.0%
- one pathological point, ef_search=32, ann_k=64, spiked to 1833.809 ms and 259,709.5 shared reads while still returning 0.0% / 0.0%
I then ran the same question on a smaller but shape-matched local 1M x 32D graph (200K pairs x 5 hops) to separate build quality from query contract:
- cheap local build points (m=8, ef_construction=8/32; m=16, ef_construction=32/64) stayed weak at the old narrow query contract (ann_k=64, top_k=10), topping out around 25.0% hit@k
- but with a wider query contract on the same 1M x 32D graph (ann_k=256, top_k=32, ef_search=128), both the strong build (m=24, ef_construction=200) and the much cheaper build (m=16, ef_construction=64) reached the same 96.9% hit@k over 32 queries
- exact heap seeds on that 1M x 32D graph matched the ANN result exactly at the widened point, so once the query budget is large enough, heavier builds were no longer the decisive lever there
That suggested a second AWS 10M x 32D probe on the same cheap build but with the widened local winner (ann_k=256, top_k=32, ef_search=128). The result still failed:
- ANN path-aware sorted_heap_graph_rag(...):
  - top_k=10 -> 1832.345 ms, 0.0% / 0.0%
  - top_k=32 -> 1832.755 ms, 0.0% / 0.0%
- exact heap seeds + the same path-aware expansion/rerank contract: ann_k=256, top_k=32 -> 0.0% / 0.0%
So the cheap-build 10M x 32D graph can produce latency numbers, but its quality is not recoverable by either:
- larger query-time ANN budgets, or
- exact heap seeds on the same low-dimensional corpus
The 10M x 32D frontier is therefore no longer “weak HNSW build quality.” It is the low-dimensional scale contract itself. The next meaningful scale branch is a higher-dimensional 10M point or a different retrieval contract, not another m/ef_construction tweak on the same 32D setup.
The first higher-dimensional calibration point was 64D, and it changed the local picture immediately. On a local 1M x 64D graph (200K pairs x 5 hops), a relatively cheap build (m=16, ef_construction=64) plus the wider query contract (ann_k=256, top_k=32, ef_search=128) gave:
- ANN path-aware sorted_heap_graph_rag(...): 65.6% hit@1, 96.9% hit@k
- exact heap seeds + the same path-aware expansion/rerank contract: 65.6% hit@1, 96.9% hit@k
At top_k=10, the same 1M x 64D point already held 96.9% hit@k, so unlike 32D, the result-budget cliff largely disappeared there.
The first file-backed 10M x 64D AWS attempt on the current ubuntu@dev.rigelstar.com host (4 vCPU, 8 GiB RAM) failed for operational reasons before query timing:
- generate_csv: 414.397 s
- CSV size: 6.67 GiB
- temp dir grew to about 28 GiB
- /tmp fell to 1.3 GiB free (99% used) while the run was still active
That led to a footprint reduction pass in the harness:
- stream fact rows directly into COPY instead of materializing a giant CSV
- drop facts_heap after loading facts_sh in sorted_heap_only mode
- allow query-only reuse from a kept temp cluster
- when facts_heap is not needed, copy rows straight into facts_sh before sorted_heap_compact(...) instead of staging through facts_heap and INSERT .. ORDER BY
That direct facts_sh load path held up on bounded local checks:
- 200K rows (40K pairs x 5 hops, 64D, hop_weight=0.05):
  - old staged path: 6321.453 ms load, depth-5 unified GraphRAG 24.504 ms, 100.0% / 100.0%
  - direct COPY facts_sh: 5638.134 ms load, depth-5 unified GraphRAG 24.312 ms, 100.0% / 100.0%
- 1M rows (200K pairs x 5 hops, 64D, load only):
  - old staged path: 31.392 s
  - direct COPY facts_sh: 28.231 s
So the current conclusion is narrow but useful: for synthetic sorted_heap_only multidepth runs, direct COPY facts_sh is a real ~10% ingest win without giving up the compacted query-time locality that the earlier “skip compaction” falsifier showed we still need.
I also bounded the obvious follow-up on the same direct ordered load: parameterizing the post-load maintenance step as none, merge, or compact.
- 200K rows (40K pairs x 5 hops, 64D, hop_weight=0.05):
  - none: 5159.888 ms load, depth-5 unified GraphRAG 62.148 ms
  - merge: 5749.190 ms load, depth-5 unified GraphRAG 23.277 ms
  - compact: 5626.887 ms load, depth-5 unified GraphRAG 24.591 ms
- 1M rows (200K pairs x 5 hops, 64D, load only):
  - none: 25.820 s
  - merge: 28.142 s
  - compact: 28.108 s
So merge is now exposed as an experiment knob in the multidepth harness, but it is not a proven new default. The current evidence says:
- none is too expensive at query time
- merge is viable on ordered synthetic loads
- merge does not materially beat compact on the larger 1M load point
That leaves compact as the stable default for large-scale multidepth runs.
With that lower-footprint path, the same 10M x 64D AWS point advanced materially further on the same host:
- streamed generate_csv: 0.000 s (csv_bytes=0, no materialized CSV)
- load_data: 916.030 s
- temp dir plateaued around 19 GiB
- filesystem still had about 11 GiB free at the start of CREATE INDEX
The next frontier was no longer disk headroom. On the same 10M x 64D cheap-build point (m=16, ef_construction=64), the old local scan-cache seeding path then failed inside PostgreSQL:
CREATE INDEX facts_sh_ann_idx ON facts_sh USING sorted_hnsw (embedding) WITH (m = 16, ef_construction = 64)
ERROR: invalid memory alloc request size 1280000000
That request size matched the old contiguous local L0 neighbor cache layout (n_nodes * 2 * M * sizeof(int32) at 10M x M=16), so the next fix was to replace local l0_neighbors + sq8_data slabs with page-backed storage. Shared immutable scan caches remain contiguous; local build seeding and local shnsw_load_cache() no longer require a single giant L0 allocation.
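As a sanity check on that diagnosis, the failed request size reproduces exactly from the stated layout:

```python
# Old contiguous local L0 cache: n_nodes * 2 * M * sizeof(int32).
# At 10M nodes with M=16 this crosses PostgreSQL's ~1 GB palloc ceiling,
# which is what raises "invalid memory alloc request size".
n_nodes, m, sizeof_int32 = 10_000_000, 16, 4
alloc = n_nodes * 2 * m * sizeof_int32
print(alloc)  # 1280000000 -- the exact size in the error message
```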
With the chunked local-cache fix in place, the next retained branch was the constrained-memory rerun via:
- sorted_hnsw.build_sq8 = on
- lower-hop synthetic contract (hop_weight = 0.05)
On the same AWS ARM64 host (4 vCPU, 8 GiB RAM, 4 GiB swap), the monolithic 10M x 64D point then completed cleanly:
- load_data: 787.809 s
- build_indexes: 846.795 s
- kept temp cluster: /home/ubuntu/graphrag_tmp/graph_rag_y8cuntf9
That is a real constrained-memory result: the same host that previously hit early large-scale frontiers now built the full monolithic sorted_hnsw index on 10,000,000 rows without being OOM-killed.
The first query-only reuse pass on that exact built graph (query_count=4, runs=1, ann_k=256, top_k=32, ef_search=128, sorted_heap_only) gave:
| Depth | SQL path p50 (hit@1 / hit@k) | Unified GraphRAG p50 (hit@1 / hit@k) |
|---|---|---|
| 1 | 1072.525 ms, 100.0% / 100.0% | 840.607 ms, 100.0% / 100.0% |
| 2 | 1070.437 ms, 75.0% / 100.0% | 2055.479 ms, 75.0% / 100.0% |
| 3 | 1050.793 ms, 50.0% / 100.0% | 2070.894 ms, 50.0% / 100.0% |
| 4 | 1080.123 ms, 50.0% / 100.0% | 2066.717 ms, 50.0% / 100.0% |
| 5 | 1064.405 ms, 75.0% / 100.0% | 2084.155 ms, 75.0% / 100.0% |
So the 10M x 64D monolithic branch is now narrowed sharply:
- constrained-memory monolithic build is viable
- quality stayed aligned between the SQL baseline and the unified GraphRAG path
- the remaining issue is not build survival and not quality drift
- the remaining issue is latency: at depth 2+, the current generic unified GraphRAG path is still about 2x slower than the SQL path baseline on this host and graph
That is exactly why the next scale story should move toward segmentation + pruning rather than trying to turn one monolithic 10M graph into the final operating model for small hosts.
I then added an opt-in stage breakdown path to the multidepth harness via --report-stage-stats, using sorted_heap_graph_rag_stats() after each unified path-aware call. On the local 1M x 64D lower-hop point (hop_weight=0.05, ann_k=256, top_k=32, ef_search=128), that narrowed the runtime picture sharply:
- depth 2:
  - end-to-end unified GraphRAG: 109.222 ms
  - internal stage stats: ann_ms = 107.590, expand_ms = 0.369, rerank_ms = 0.004
- depth 5:
  - end-to-end unified GraphRAG: 110.507 ms
  - internal stage stats: ann_ms = 109.178, expand_ms = 0.691, rerank_ms = 0.011
So the current 1M x 64D depth cost is not expansion-bound. On the widened contract, almost all measured time is already in the ANN seed stage; the multi-hop expansion and rerank work remain sub-millisecond.
I then added an opt-in low-memory sorted_hnsw build path via SET sorted_hnsw.build_sq8 = on. This keeps the final on-disk/index-query contract the same, but the graph is built from SQ8-compressed build vectors instead of a full float32 build slab. The tradeoff is one extra heap scan during CREATE INDEX.
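A minimal sketch of the scalar-quantization idea behind build_sq8 (my illustration, using a per-vector min/scale codec; the extension's exact SQ8 format may differ): each float32 component becomes one byte, cutting the build slab by 4x at a bounded reconstruction error.

```python
def sq8_encode(vec):
    """Quantize float components to one byte each against a per-vector min/scale."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    codes = bytes(round((x - lo) / scale) for x in vec)
    return codes, lo, scale

def sq8_decode(codes, lo, scale):
    """Approximate reconstruction; error per component is at most `scale`."""
    return [c * scale + lo for c in codes]

vec = [0.5, -1.25, 3.0, 0.0]              # toy float32 components
codes, lo, scale = sq8_encode(vec)
restored = sq8_decode(codes, lo, scale)
assert len(codes) == len(vec)             # 1 byte/dim instead of 4
assert all(abs(a - b) <= scale for a, b in zip(vec, restored))
```

The 4-bytes-to-1-byte ratio is where the 4 * N * D -> 1 * N * D build-slab reduction quoted below comes from.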
Bounded local A/B on the same 1M x 64D lower-hop point (m=16, ef_construction=64, ann_k=256, top_k=32, ef_search=128, sorted_heap_only, streamed load):
- build_sq8=off:
  - load_data: 28.044 s
  - build_indexes: 48.606 s
  - unified depth-5 GraphRAG: 111.578 ms, 87.5% / 100.0%
- build_sq8=on:
  - load_data: 27.935 s
  - build_indexes: 46.541 s
  - unified depth-5 GraphRAG: 110.541 ms, 87.5% / 100.0%
So the first retained result is narrow but real:
- the low-memory build path did not regress the measured 1M x 64D GraphRAG quality point
- it slightly improved build time on that point instead of paying a visible penalty for the extra heap scan
- and by construction it cuts the build-vector slab from 4 * N * D bytes to 1 * N * D:
  - 10M x 64D: about 2.56 GB -> 0.64 GB
  - 10M x 384D: about 15.36 GB -> 3.84 GB
So the narrow conclusion is:
- the multi-hop GraphRAG path itself survives to at least 1M rows and gives measured query latencies there
- the old 10M frontier was build-bound, not Python-memory-bound, not CSV-generation-bound, and not an early PostgreSQL failure
- the current build optimization moved that frontier enough to obtain the first real 10M query numbers on a cheap-build point
- the remaining 10M x 32D problem is no longer "can it finish?" and no longer just "is the HNSW build too weak?"
- the stronger falsifier is that even exact seeds fail at that scale and dimensionality, so the next branch is a different scale contract, not a narrower HNSW tuning loop
- 64D is the first scale contract that looks healthy locally at 1M
- on the current AWS host, the 10M x 64D branch now clears ingest, build, and the first query pass; the remaining cheap-build frontier is deep-path quality, not allocator failure
I then added a segmented multidepth harness in scripts/bench_graph_rag_multidepth_segmented.py. This is the first partitioning benchmark, but it is intentionally harness-side, not an extension API: current GraphRAG helpers require a concrete sorted_heap table, so the benchmark fans out across multiple sorted_heap shards and merges the shard-local top-k rows in Python.
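The harness-side fanout-and-merge can be sketched roughly like this (the names and the hash-routing rule are mine; the real harness issues per-shard SQL against concrete sorted_heap tables):

```python
import heapq

def route_shards(shards, key, route):
    """route='exact' hits only the owning shard; route='all' fans out."""
    if route == "exact":
        return [shards[hash(key) % len(shards)]]  # hypothetical routing rule
    return list(shards)

def fanout_topk(shards, key, k, route="all"):
    hits = []
    for shard_query in route_shards(shards, key, route):
        hits.extend(shard_query(key, k))   # shard-local top-k: (distance, row)
    return heapq.nsmallest(k, hits)        # merged global top-k

# Two toy shards returning canned shard-local results:
shard_a = lambda key, k: [(0.10, "a1"), (0.40, "a2")]
shard_b = lambda key, k: [(0.05, "b1"), (0.70, "b2")]
print(fanout_topk([shard_a, shard_b], "q", k=2))  # [(0.05, 'b1'), (0.1, 'a1')]
```

route="all" preserves quality but pays the fanout tax measured below; route="exact" is only valid when the router can actually identify the owning shard.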
The local 1M x 64D lower-hop point (200K pairs, 5 hops, 32 queries, ann_k=256, top_k=32, ef_search=128, m=16, ef_construction=64, build_sq8=on) gives a clean first answer:
- monolithic unified GraphRAG:
  - depth 1: 50.104 ms, 100.0% / 100.0%
  - depth 5: 121.524 ms, 81.2% / 100.0%
- segmented into 8 shards, all-shard fanout:
  - build_indexes: 57.885 s versus monolithic 45.714 s
  - depth 1: 87.677 ms, 100.0% / 100.0%
  - depth 5: 142.472 ms, 81.2% / 100.0%
- segmented into 8 shards, exact routing to the owning shard:
  - depth 1: 10.574 ms, 100.0% / 100.0%
  - depth 5: 16.822 ms, 100.0% / 100.0%
That makes the partitioning lesson explicit:
- segmentation by itself is not a free latency win
- all-shard fanout preserves quality on this benchmark, but pays a clear fanout tax
- the real win appears only when routing/pruning can avoid most shards
- so the next scale story should be “partitioning + pruning contract”, not just “more shards”
This is the correct shape for large knowledge bases:
- on constrained hosts, sharding bounds per-index build memory
- on real workloads, the query path must also prune by tenant / knowledge-base / relation family / time window (or a future segment router), otherwise the system only trades one monolith for a broad fanout query
That local result transferred cleanly to the constrained AWS host on the first full 10M x 64D streamed segmented run (2M pairs, 5 hops, 8 shards, hop_weight=0.05, ann_k=256, top_k=32, ef_search=128, m=16, ef_construction=64, build_sq8=on):
- streamed segmented build-only:
  - generate_csv: 0.000 s
  - load_data: 500.474 s
  - build_indexes: 784.778 s
- segmented, route=exact query-only reuse on the same built graph:
  - depth 1: 126.057 ms, 100.0% / 100.0%
  - depth 2: 261.986 ms, 75.0% / 100.0%
  - depth 3: 259.794 ms, 75.0% / 100.0%
  - depth 4: 258.879 ms, 75.0% / 100.0%
  - depth 5: 258.766 ms, 100.0% / 100.0%
- segmented, route=all query-only reuse on the same built graph:
  - depth 1: 898.440 ms, 100.0% / 100.0%
  - depth 2: 2090.866 ms, 75.0% / 100.0%
  - depth 3: 2089.650 ms, 50.0% / 100.0%
  - depth 4: 2088.114 ms, 50.0% / 100.0%
  - depth 5: 2093.652 ms, 75.0% / 100.0%
That gives the first full large-scale constrained-memory comparison on the same AWS box:
- monolithic 10M x 64D, low-memory build:
  - depth 1 unified GraphRAG: 840.607 ms, 100.0% / 100.0%
  - depth 5 unified GraphRAG: 2084.155 ms, 75.0% / 100.0%
- segmented 10M x 64D, route=all: effectively the same latency/quality envelope as the monolith
- segmented 10M x 64D, route=exact:
  - about 6.7x faster than the monolith at depth 1
  - about 8.1x faster than the monolith at depth 5
  - and still stable at 100.0% / 100.0% for depth 5 on this benchmark
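The quoted speedups follow directly from the p50s reported above:

```python
# Monolithic vs segmented route=exact p50s (ms) from the AWS 10M x 64D runs.
mono  = {1: 840.607, 5: 2084.155}
exact = {1: 126.057, 5: 258.766}
for depth in (1, 5):
    print(f"depth {depth}: {mono[depth] / exact[depth]:.1f}x faster")
# depth 1: 6.7x faster
# depth 5: 8.1x faster
```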
So the large-scale result now matches the local 1M lesson:
- segmentation without pruning is not a performance story
- streamed segmented ingest is operationally better than front-loaded shard CSV materialization
- segmented routing is the first constrained-memory scale path that improves both build viability and query latency on the same host
The next local step was to move routing out of benchmark-only Python and into a real SQL reference path. On a kept local 5K-row segmented smoke cluster (4 shards, 64D, ann_k=32, top_k=8, ef_search=64), the new range-routed SQL path matched the segmented SQL merge baseline on quality and row counts:
- segmented SQL merge, `route=exact`, depth 5: `0.215 ms`, 100.0% / 100.0%
- metadata-routed SQL wrapper, same exact-route point: `0.245 ms`, 100.0% / 100.0%
So the current reference stack is now:
- `sorted_heap_graph_rag_segmented(...)` for explicit candidate shard arrays
- `sorted_heap_graph_rag_routed(...)` for simple metadata-driven range routing
The next local routing pass added an exact-key companion for tenant / KB style selection. On another kept local 5K-row segmented smoke cluster (4 shards, 64D, ann_k=32, top_k=8, ef_search=64), the exact-key routed path stayed aligned with the exact-route segmented SQL merge path:
- exact-route segmented SQL merge, depth 5: `0.183 ms`, 100.0% / 100.0%
- exact-key routed SQL wrapper, same point: `0.202 ms`, 100.0% / 100.0%
That is still not the final routing story for large knowledge bases, but it is the first usable SQL-level bridge from harness-side segmentation to productized segmented GraphRAG, with both:
- range-based routing
- exact-key routing
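Both flavors reduce to a small routing function. A minimal sketch, with hypothetical layout choices (shard count, hash function, and range width are illustrative; the SQL wrappers use their own metadata tables, not this exact logic):

```python
import hashlib

SHARDS = 8  # illustrative shard count


def route_range(entity_id: int, rows_per_shard: int) -> int:
    """Range routing: contiguous entity-id ranges map to consecutive shards."""
    return min(entity_id // rows_per_shard, SHARDS - 1)


def route_exact_key(tenant_key: str) -> int:
    """Exact-key routing: a stable hash of a tenant / knowledge-base key
    deterministically owns exactly one shard."""
    digest = hashlib.sha256(tenant_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % SHARDS


# Range routing prunes by position in the sorted keyspace;
# exact-key routing prunes by ownership of the key.
print(route_range(3_000_000, rows_per_shard=1_250_000))  # -> 2
print(route_exact_key("tenant-42"))
```

Either way, the routing output is a single shard index, which is what turns an all-shard fanout query into a one-shard probe.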
Monolithic vs segmented exact-routed: constrained-memory 10M x 64D comparison
This table consolidates the AWS 10M x 64D results already reported above (monolithic in “Larger synthetic GraphRAG scale envelope”, segmented in “Segmented GraphRAG at scale”) into a direct side-by-side.
Both runs used the same AWS ARM64 host (4 vCPU, 8 GiB RAM, 4 GiB swap), the same workload (2M pairs x 5 hops, hop_weight=0.05), and the same build/query knobs (m=16, ef_construction=64, build_sq8=on, ann_k=256, top_k=32, ef_search=128). Segmented mode used 8 shards.
| | Monolithic | Segmented exact | Segmented all |
|---|---|---|---|
| load_data | 787.8 s | 500.5 s | — |
| build_indexes | 846.8 s | 784.8 s | — |
| depth 1 p50 | 840.6 ms | 126.1 ms | 898.4 ms |
| depth 5 p50 | 2084.2 ms | 258.8 ms | 2093.7 ms |
| depth 1 hit@1/hit@k | 100%/100% | 100%/100% | 100%/100% |
| depth 5 hit@1/hit@k | 75%/100% | 100%/100% | 75%/100% |
| speedup at depth 5 | 1x | 8.1x | ~1x |
Exact commands (monolithic build + query-only reuse):
NUM_PAIRS=2000000 MAX_DEPTH=5 QUERY_COUNT=4 RUNS=1 DIM=64 \
ANN_K=256 TOP_K=32 EF_SEARCH=128 EF_CONSTRUCTION=64 M=16 \
SHARED_BUFFERS_MB=64 MAX_WAL_SIZE_GB=16 HOP_WEIGHT=0.05 \
BUILD_SQ8=on TABLE_SCOPE=sorted_heap_only \
./scripts/bench_graph_rag_multidepth_aws.sh <host> <dir> 65493
Exact commands (segmented build + query-only reuse):
NUM_PAIRS=2000000 MAX_DEPTH=5 QUERY_COUNT=4 RUNS=1 DIM=64 \
ANN_K=256 TOP_K=32 EF_SEARCH=128 EF_CONSTRUCTION=64 M=16 \
SHARED_BUFFERS_MB=64 MAX_WAL_SIZE_GB=16 HOP_WEIGHT=0.05 \
BUILD_SQ8=on SHARDS=8 ROUTE=both \
./scripts/bench_graph_rag_multidepth_segmented_aws.sh <host> <dir> 65494
Conclusion: on the measured 10M x 64D constrained-memory point, segmented exact routing is 8.1x faster at depth 5 with better quality (100%/100% vs 75%/100%), while build time is comparable. All-shard fanout offers no latency benefit — the win comes entirely from shard pruning.
This is not a universal claim. Exact routing requires knowing which shard owns the query entity, which maps to tenant-id / knowledge-base-id style workloads. Queries that cannot be pruned to a single shard will see all-shard-fanout latency (~1x monolithic).
Bounded fanout: routing robustness on 1M x 64D
This measures how the segmented win degrades when routing is imperfect. Instead of routing to exactly one shard (exact) or all shards (all), bounded fanout routes to K adjacent shards (always including the correct one). This simulates imperfect routing where the router knows roughly which shard, but hedges by including neighbors.
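The bounded candidate set can be sketched as a window on the shard ring. This assumes the window starts at the correct shard and wraps around; the harness may place the neighbors differently:

```python
def bounded_candidates(correct_shard: int, fanout: int, shards: int = 8) -> list[int]:
    """A fanout-wide window of adjacent shards that always contains the
    correct one. Assumption: the window starts at the correct shard and
    wraps around the ring."""
    return [(correct_shard + i) % shards for i in range(fanout)]


print(bounded_candidates(6, fanout=4))  # -> [6, 7, 0, 1]
print(bounded_candidates(2, fanout=2))  # -> [2, 3]
```

Because the correct shard is always in the window, quality can only improve over a wrong-shard miss; the cost is querying `fanout` shards instead of one.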
Local, 200K pairs x 5 hops, 64D, 8 shards, 32 queries, 3 runs, ann_k=256, top_k=32, ef_search=128, m=16, ef_construction=64, build_sq8=on, hop_weight=0.05:
python3 scripts/bench_graph_rag_multidepth_segmented.py \
--num-pairs 200000 --max-depth 5 --query-count 32 --runs 3 \
--dim 64 --ann-k 256 --top-k 32 --ef-search 128 \
--ef-construction 64 --m 16 --build-sq8 on \
--shards 8 --route {exact,bounded,all} --fanout {2,4} --hop-weight 0.05
Monolithic baseline (`sorted_heap_graph_rag(...)` on the same workload):
python3 scripts/bench_graph_rag_multidepth.py \
--num-pairs 200000 --max-depth 5 --query-count 32 --runs 3 \
--dim 64 --ann-k 256 --top-k 32 --ef-search 128 \
--ef-construction 64 --m 16 --build-sq8 on \
--hop-weight 0.05 --table-scope sorted_heap_only
Depth-5 comparison (the hardest hop):
| Route mode | Shards hit | p50 (d5) | hit@1 (d5) | hit@k (d5) | Speedup vs monolith |
|---|---|---|---|---|---|
| monolithic | 1 table | 120.7 ms | 81.2% | 100.0% | 1x |
| segmented all (8/8) | 8 | 149.5 ms | 81.2% | 100.0% | 0.8x |
| segmented bounded (4/8) | 4 | 74.1 ms | 93.8% | 100.0% | 1.6x |
| segmented bounded (2/8) | 2 | 35.9 ms | 96.9% | 100.0% | 3.4x |
| segmented exact (1/8) | 1 | 17.2 ms | 100.0% | 100.0% | 7.0x |
Depth-1 comparison:
| Route mode | Shards hit | p50 (d1) | hit@1 (d1) | hit@k (d1) |
|---|---|---|---|---|
| monolithic | 1 table | 49.4 ms | 100.0% | 100.0% |
| segmented all (8/8) | 8 | 94.3 ms | 100.0% | 100.0% |
| segmented bounded (4/8) | 4 | 46.3 ms | 100.0% | 100.0% |
| segmented bounded (2/8) | 2 | 22.8 ms | 100.0% | 100.0% |
| segmented exact (1/8) | 1 | 11.3 ms | 100.0% | 100.0% |
Observations:
- Latency scales roughly linearly with the number of shards hit.
- Quality degrades gracefully: bounded(2/8) still reaches 96.9% hit@1 at depth 5 vs 100.0% for exact, compared to 81.2% for monolithic/all.
- Even bounded(4/8) at half the shards is 1.6x faster than monolithic and has better quality (93.8% vs 81.2% hit@1).
- All-shard fanout is worse than monolithic (fanout overhead dominates).
- The win is not exact-or-nothing — bounded fanout preserves most of the benefit even with imperfect routing.
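The linear-latency observation can be checked directly from the depth-5 table above: dividing each p50 by the number of shards hit should give a roughly constant per-shard cost.

```python
# Depth-5 p50 (ms) vs shards hit, from the local 1M bounded-fanout table
p50_ms = {1: 17.2, 2: 35.9, 4: 74.1, 8: 149.5}

per_shard = {shards: ms / shards for shards, ms in p50_ms.items()}
print(per_shard)

spread = max(per_shard.values()) / min(per_shard.values())
print(round(spread, 3))  # ~1.09: per-shard cost is nearly constant, so latency is near-linear
```

The per-shard cost stays in a narrow 17-19 ms band from exact(1/8) up to all(8/8), which is what "latency scales roughly linearly with shards hit" means operationally.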
Bounded fanout at AWS 10M x 64D scale
Same AWS ARM64 host (4 vCPU, 8 GiB RAM, 4 GiB swap), same workload as the 10M x 64D comparison above (2M pairs x 5 hops, hop_weight=0.05, 8 shards, ann_k=256, top_k=32, ef_search=128, m=16, ef_construction=64, build_sq8=on). Build once, query-only reuse for each route mode. 4 queries, 1 run.
Build: load_data=500.7s, build_indexes=764.2s.
Query commands:
# Build once with --keep-temp --stop-after build_indexes
NUM_PAIRS=2000000 MAX_DEPTH=5 QUERY_COUNT=4 RUNS=1 DIM=64 \
ANN_K=256 TOP_K=32 EF_SEARCH=128 EF_CONSTRUCTION=64 M=16 \
SHARED_BUFFERS_MB=64 MAX_WAL_SIZE_GB=16 HOP_WEIGHT=0.05 \
BUILD_SQ8=on SHARDS=8 ROUTE=exact \
EXTRA_ARGS="--stream-copy --keep-temp --stop-after build_indexes" \
bash scripts/bench_graph_rag_multidepth_segmented_aws.sh \
<host> <dir> 65494
# Query-only reuse for each route mode:
ssh <host> "cd <dir> && python3 scripts/bench_graph_rag_multidepth_segmented.py \
--port 65494 --num-pairs 2000000 --max-depth 5 --query-count 4 --runs 1 \
--dim 64 --hop-weight 0.05 --ann-k 256 --top-k 32 --ef-search 128 \
--ef-construction 64 --m 16 --build-sq8 on --shared-buffers-mb 64 \
--shards 8 --route {exact,bounded,all} [--fanout {2,4}] \
--backend-mode fresh --reuse-temp <kept_temp>"
Depth-5 comparison (monolithic baseline from prior section):
| Route mode | Shards hit | p50 (d5) | hit@1 (d5) | hit@k (d5) | vs monolith |
|---|---|---|---|---|---|
| monolithic | 1 table | 2084 ms | 75% | 100% | 1x |
| segmented all (8/8) | 8 | 2107 ms | 75% | 100% | ~1x |
| segmented bounded (4/8) | 4 | 1042 ms | 75% | 100% | 2.0x |
| segmented bounded (2/8) | 2 | 520 ms | 100% | 100% | 4.0x |
| segmented exact (1/8) | 1 | 259 ms | 100% | 100% | 8.0x |
Depth-1 comparison:
| Route mode | Shards hit | p50 (d1) | hit@1 (d1) | hit@k (d1) | vs monolith |
|---|---|---|---|---|---|
| monolithic | 1 table | 841 ms | 100% | 100% | 1x |
| segmented all (8/8) | 8 | 920 ms | 100% | 100% | ~1x |
| segmented bounded (4/8) | 4 | 473 ms | 100% | 100% | 1.8x |
| segmented bounded (2/8) | 2 | 233 ms | 100% | 100% | 3.6x |
| segmented exact (1/8) | 1 | 137 ms | 100% | 100% | 6.1x |
The bounded-fanout gradient from the local 1M point transfers cleanly to 10M:
- Latency scales linearly with the number of shards hit at both scales.
- bounded(2/8) at 10M is 4.0x faster than monolithic at depth 5, close to the local 3.4x ratio.
- Even bounded(4/8) at half the shards gives a 2.0x win.
- Quality: hit@k stays at 100% across all modes. hit@1 varies with fanout width but stays ≥75%.
- All-shard fanout remains ~1x monolithic (fanout overhead cancels the per-shard size reduction).
Conclusion: bounded fanout remains materially useful on this measured 10M synthetic point. A router that can narrow to 2 out of 8 candidate shards captures ~50% of the exact-routing speedup. The performance gradient is smooth on this benchmark, not a cliff.
Routing-miss tolerance on 1M x 64D
This measures what happens when the router sometimes picks the wrong shards — the correct shard is not always included in the candidate set. Uses --route bounded_recall --fanout 2 --recall-pct N where N% of queries get the correct shard and the rest get 2 adjacent wrong shards. Miss schedule is deterministic per-query via SHA-256 hash (seed=42).
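The deterministic miss schedule can be sketched as a pure function of the seed and the query id. This is a hypothetical reconstruction of the idea; the harness's exact hash input format may differ:

```python
import hashlib


def router_hits(query_id: int, recall_pct: int, seed: int = 42) -> bool:
    """Map (seed, query_id) to a deterministic bucket in [0, 100); the
    correct shard is included iff the bucket falls below recall_pct."""
    digest = hashlib.sha256(f"{seed}:{query_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100 < recall_pct


# Same query id always gets the same verdict, and the hit rate tracks recall_pct
hits = sum(router_hits(q, recall_pct=90) for q in range(1000))
print(hits)  # close to 900
```

Determinism matters here: reruns at the same seed miss on the same queries, so quality differences between runs reflect routing, not random-number drift.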
Local, 200K pairs x 5 hops, 64D, 8 shards, 32 queries, 3 runs, ann_k=256, top_k=32, ef_search=128, m=16, ef_construction=64, build_sq8=on, hop_weight=0.05:
python3 scripts/bench_graph_rag_multidepth_segmented.py \
--num-pairs 200000 --max-depth 5 --query-count 32 --runs 3 \
--dim 64 --ann-k 256 --top-k 32 --ef-search 128 \
--ef-construction 64 --m 16 --build-sq8 on \
--shards 8 --route bounded_recall --fanout 2 \
--recall-pct {100,90,75,50,25,0} --hop-weight 0.05
Depth-5 routing-miss tolerance (fanout=2, 8 shards):
| Router recall | Queries hitting correct shard | p50 (d5) | hit@1 (d5) | hit@k (d5) |
|---|---|---|---|---|
| 100% | 32/32 | 54.5 ms | 96.9% | 100.0% |
| 90% | 29/32 | 64.7 ms | 87.5% | 90.6% |
| 75% | 23/32 | 57.0 ms | 68.8% | 71.9% |
| 50% | 18/32 | 45.4 ms | 53.1% | 56.2% |
| 25% | 10/32 | 79.1 ms | 31.2% | 31.2% |
| 0% | 0/32 | 44.8 ms | 0.0% | 0.0% |
| (ref: monolithic) | — | 120.7 ms | 81.2% | 100.0% |
Observations:
- Quality tracks router recall nearly linearly. There is no sharp cliff.
- A router with 90% recall keeps 87.5% hit@1 — close to the monolithic baseline’s 81.2%, while remaining 2-3x faster.
- A router with 75% recall gives 68.8% hit@1 — below monolithic quality, so the latency win comes at a quality cost.
- Latency stays roughly constant across recall levels because the fanout (2 shards) is fixed. The router-miss doesn’t cost extra I/O — it just returns wrong answers.
- This means: routing quality determines answer quality, not latency. The segmented latency win is structural (fewer rows per shard), but the quality win depends entirely on the router including the correct shard.
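The first observation can be stated as a first-order model: with fanout fixed, expected hit@1 is roughly router recall times the perfect-routing hit@1. Checking it against the depth-5 table above:

```python
# Depth-5 numbers from the routing-miss table above (router recall % -> hit@1 %)
measured = {100: 96.9, 90: 87.5, 75: 68.8, 50: 53.1, 25: 31.2, 0: 0.0}
base = measured[100]  # hit@1 when the correct shard is always included

for recall, hit in sorted(measured.items()):
    predicted = recall / 100 * base
    print(f"recall {recall:>3}%: predicted {predicted:5.1f}, measured {hit:5.1f}")

# Largest model-vs-measurement gap across the sweep
worst = max(abs(r / 100 * base - h) for r, h in measured.items())
print(worst)  # the linear model stays within a handful of points everywhere
```

The model is not exact (the 25% point runs a few points above it, likely because an adjacent wrong shard occasionally still contains the answer), but there is no cliff anywhere in the sweep.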
Routing-miss tolerance at AWS 10M x 64D
Same AWS ARM64 host and workload as the bounded-fanout section above. Build once (load=500.5s, build=803.5s), query-only reuse. 4 queries, 1 run, fanout=2, 8 shards.
# Build: same as bounded-fanout section
# Query-only reuse with bounded_recall:
ssh <host> "cd <dir> && python3 scripts/bench_graph_rag_multidepth_segmented.py \
--port 65494 --num-pairs 2000000 --max-depth 5 --query-count 4 --runs 1 \
--dim 64 --hop-weight 0.05 --ann-k 256 --top-k 32 --ef-search 128 \
--ef-construction 64 --m 16 --build-sq8 on --shared-buffers-mb 64 \
--shards 8 --route bounded_recall --fanout 2 --recall-pct {90,75} \
--backend-mode fresh --reuse-temp <kept_temp>"
Depth-5 routing-miss tolerance at 10M:
| Route mode | Router recall | p50 (d5) | hit@1 (d5) | hit@k (d5) |
|---|---|---|---|---|
| bounded optimistic | 100% (4/4 hit) | 519 ms | 100% | 100% |
| bounded_recall | 90% (3/4 hit) | 521 ms | 75% | 75% |
| bounded_recall | 75% (3/4 hit) | 522 ms | 75% | 75% |
| (ref: monolithic) | — | 2084 ms | 75% | 100% |
| (ref: exact) | 100% | 259 ms | 100% | 100% |
The 90% and 75% recall points produce the same result because with only 4 queries, SHA-256 hash gives 3/4 correct-shard inclusion at both settings. The 1 missed query returns wrong results, giving 75% hit@1.
Despite limited query-count resolution, the directional signal is clear:
- Bounded(2/8) latency stays at ~520 ms regardless of recall (~4x faster than monolithic)
- One missed query out of four drops hit@1 from 100% to 75% — matching the monolithic baseline’s own 75% hit@1
- hit@k drops from 100% to 75% (monolithic gets 100% hit@k because its broader candidate set is more likely to include the target)
This confirms the local finding: routing quality determines answer quality, not latency. The latency win is structural and survives routing errors. At 90% router recall on this 10M point, bounded routing matches the monolithic hit@1 (75%) while staying 4x faster.
A higher query_count run would give finer crossover resolution. On the measured 4-query point, the evidence is narrower: at 90% recall, bounded routing matches the monolithic hit@1 while staying 4x faster, and the true crossover is not shown to be above that point.
Current AWS GraphRAG benchmark (person -> parent -> city, stable fact contract)
Repo-owned harness:
REMOTE_PYTHON=/path/to/python QUERY_COUNT=64 RUNS=3 NUM_PAIRS=5000 DIM=384 ANN_K=64 TOP_K=10 EF_SEARCH=128 EF_CONSTRUCTION=200 M=24 PGV_EF_SEARCH=64 ZVEC_EF=64 QDRANT_EF=64 SHARED_BUFFERS_MB=64 BACKEND_MODE=fresh ./scripts/bench_graph_rag_multihop_aws.sh <aws-host> /path/to/repo 65492
AWS ARM64 host (4 vCPU, 8 GiB RAM), deterministic fact graph, 5K chains / 10K rows total, 384D, top-10, 64 queries, 3 runs. This is the current portable stable multihop GraphRAG point for the narrow fact-shaped contract.
| Method | p50 latency | hit@1 | hit@k | Notes |
|---|---|---|---|---|
| Heap two-hop SQL | 1.088 ms | 75.0% | 96.9% | exact rerank over expanded heap set |
| sorted_heap_expand_twohop_rerank() | 0.952 ms | 75.0% | 96.9% | older city-only helper |
| sorted_heap_graph_rag_twohop_scan() | 1.012 ms | 75.0% | 96.9% | older city-only wrapper |
| sorted_heap SQL pathsum baseline | 1.204 ms | 98.4% | 98.4% | same ANN seeds, hop1_distance + hop2_distance |
| sorted_heap_expand_twohop_path_rerank() | 0.955 ms | 98.4% | 98.4% | fused path-aware helper |
| sorted_heap_graph_rag_twohop_path_scan() | 1.018 ms | 98.4% | 98.4% | fused path-aware wrapper |
| pgvector HNSW + heap expansion | 1.422 ms | 85.9% | 85.9% | path-aware rerank, ef_search=64 |
| zvec HNSW + heap expansion | 1.720 ms | 100.0% | 100.0% | path-aware rerank, ef=64 |
| Qdrant HNSW + heap expansion | 3.435 ms | 100.0% | 100.0% | path-aware rerank, hnsw_ef=64 |
The new AWS result matches the local diagnostic cleanly: the dominant quality loss on this workload was the old hop-2-only rerank contract, not the seed ANN frontier. The path-aware helper preserves the quality gain on ARM64 with only trivial latency cost versus the older helper.
On the apples-to-apples path-aware contract, the portable frontier is now:
- `sorted_heap` fastest
- `zvec` and Qdrant strongest on answer quality
- `pgvector` still behind on both latency and quality at this operating point
One intermediate AWS all-engines rerun temporarily dropped the sorted_heap path-aware rows to 96.9% / 96.9%. An immediate sorted_heap-only control and a second full all-engines rerun both returned the stable 98.4% / 98.4% point above, so the published table uses the confirmed rerun rather than the single outlier.
An AWS repeated-build protocol then tightened that confidence band on the same balanced 5K point. Using three independent fresh builds:
- `sorted_heap_expand_twohop_path_rerank()` median `0.962 ms`, range `0.956-0.965 ms`, hit@1/hit@k = 98.4/98.4 on all three builds
- `sorted_heap_graph_rag_twohop_path_scan()` median `1.025 ms`, range `1.018-1.043 ms`, hit@1/hit@k = 98.4/98.4 on all three builds
- `pgvector` path-aware parity row: median `1.434 ms`, hit@1/hit@k 84.4-89.1
- `zvec` path-aware parity row: median `1.711 ms`, 100.0/100.0
- Qdrant path-aware parity row: median `3.355 ms`, 100.0/100.0
So on the current portable 5K point, the earlier AWS outlier now looks like an anomaly rather than a broad instability. The balanced sorted_heap path-aware rows stayed fixed across all three rebuilds.
The larger 10K-chain AWS rerun now tells a different story than the older city-only benchmark. At the same portable point:
- heap two-hop SQL: `1.319 ms`, hit@1 71.9%, hit@k 92.2%
- city-only `sorted_heap_graph_rag_twohop_scan()`: `1.197 ms`, hit@1 73.4%, hit@k 93.8%
- SQL `pathsum` baseline: `1.436 ms`, hit@1 96.9%, hit@k 98.4%
- `sorted_heap_expand_twohop_path_rerank()`: `1.185 ms`, hit@1 96.9%, hit@k 98.4%
- `sorted_heap_graph_rag_twohop_path_scan()`: `1.212 ms`, hit@1 96.9%, hit@k 98.4%
So the old larger-scale caveat now narrows materially: the main 10K loss was also the city-only rerank contract, not a fundamental collapse of the seed frontier at that scale.
An AWS repeated-build protocol then checked whether the remaining 10K difference was really a build-variance problem. Using three independent fresh builds on the same 10K path-aware point:
- `sorted_heap_expand_twohop_path_rerank()` median `1.177 ms`, range `1.148-1.191 ms`, hit@1/hit@k = 95.3/96.9 on all three builds
- `sorted_heap_graph_rag_twohop_path_scan()` median `1.236 ms`, range `1.211-1.240 ms`, hit@1/hit@k = 95.3/96.9 on all three builds
- `pgvector` path-aware parity row: median `1.667 ms`, hit@1/hit@k 76.6-82.8
- `zvec` path-aware parity row: median `2.788 ms`, 98.4/100.0
- Qdrant path-aware parity row: median `3.818 ms`, 98.4/100.0
So the larger 10K AWS point is now repeated-build stable too. The remaining issue is scale frontier, not build instability: the 10K quality band is lower than 5K, but it stayed fixed across fresh builds.
An exact-seed diagnostic on the local 5K and 10K points did not improve hit@1 or hit@k versus the ANN-seeded sorted_heap helper. So on this benchmark shape, the remaining gap is not explained by ANN approximation alone. The stronger result is that seed coverage itself was identical for ANN and exact seeds: 98.4% at 5K and 96.9% at 10K. So the remaining loss is downstream of seeding. The new rerank-rank diagnostic narrows that further: at 5K, the correct city is still within the top 6 for 95% of reachable queries, and at 10K it is still within the top 3 for 95% of reachable queries. The quality drop is therefore driven by a few severe outliers (max rank 17 at 5K, 20 at 10K), not by a broad collapse.
A path-aware SQL rerank baseline then changed the picture materially. Keeping the same ANN seeds and the same two-hop expansion, but scoring candidates as hop1_distance + hop2_distance, moved the local balanced points to:
- 5K: `0.957 ms`, hit@1 98.4%, hit@k 98.4%
- 10K: `1.179 ms`, hit@1 95.3%, hit@k 96.9%
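The path-aware contract itself is a one-line scoring change. A minimal sketch with illustrative candidate rows (the dict field names are hypothetical, not the extension's actual column names):

```python
def path_aware_rerank(candidates: list[dict], top_k: int) -> list[dict]:
    """Score each two-hop candidate by the summed path distance
    (hop1 + hop2) instead of the hop-2 distance alone."""
    return sorted(
        candidates,
        key=lambda c: c["hop1_distance"] + c["hop2_distance"],
    )[:top_k]


cands = [
    {"city": "A", "hop1_distance": 0.10, "hop2_distance": 0.05},  # strong full path
    {"city": "B", "hop1_distance": 0.90, "hop2_distance": 0.01},  # great hop 2, weak seed
]
# A hop-2-only contract would rank B first; the path-sum contract prefers A
print(path_aware_rerank(cands, top_k=1)[0]["city"])  # -> A
```

This is exactly why the quality jump is so large: the hop-2-only contract rewards candidates whose final hop looks good even when the path to them was weak.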
That branch is now implemented in the extension and verified on both local and AWS ARM64 runs. The fused path-aware helper measured:
- local 5K: `0.726 ms`, hit@1 98.4%, hit@k 98.4%
- local 10K: `0.823 ms`, hit@1 95.3%, hit@k 96.9%
- AWS 5K: `0.955 ms`, hit@1 98.4%, hit@k 98.4%
- AWS 10K: `1.185 ms`, hit@1 96.9%, hit@k 98.4%
So the current strongest portable GraphRAG result is no longer the SQL baseline or the old city-only helper. It is the fused path-aware helper.
Current real code-corpus GraphRAG reference benchmark (cogniformerus CrossFile)
Repo-owned harnesses:
- `python3 scripts/bench_graph_rag_code_corpus.py --runs 3 --backend-mode fresh --ann-k 16 --top-k 4`
- `python3 scripts/repeat_graph_rag_code_corpus_builds.py --repeats 3 --runs 3 --backend-mode fresh`
- `REMOTE_PYTHON=/path/to/python REPEATS=3 RUNS=3 BACKEND_MODE=fresh bash scripts/repeat_graph_rag_code_corpus_builds_aws.sh <aws-host> /path/to/repo 65320`
This benchmark uses the actual cogniformerus source tree (40 files, 840 rows after chunk + summary expansion) and the real CrossFile prompts from butler_code_test.cr. The current real-corpus conclusion is not a single universal winner. The frontier splits by embedding mode:
| Mode | Best case | Local repeated-build p50 | AWS repeated-build p50 | Keyword coverage | Full hits | Notes |
|---|---|---|---|---|---|---|
| generic | prompt_summary_snippet_py | 0.613 ms | 0.955 ms | 100.0% | 100.0% | symbol-aware variant is strictly slower with no quality gain |
| code-aware | prompt_symbol_summary_snippet_py | 0.963 ms | 1.541 ms | 100.0% | 100.0% | exact prompt-symbol rescue is required in summary seeding |
The most important diagnostic result was the old code-aware miss:
- `prompt_summary_snippet_py` on `facts_sh`:
  - local repeated-build: 97.6% keyword coverage, 83.3% full hits
  - AWS repeated-build: 97.6%, 83.3%
- `prompt_symbol_summary_snippet_py` on `facts_sh`:
  - local repeated-build: 100.0%, 100.0%
  - AWS repeated-build: 100.0%, 100.0%
So the code-aware quality win is now both repeated-build stable and cross-environment stable. The change in winner is not a local-only artifact.
Larger in-repo cogniformerus transfer gate
The smaller 40-file code-corpus slice above is a useful stable benchmark, but it is not the only in-repo transfer check anymore. The same repeated-build protocol was rerun on the full cogniformerus repository (183 Crystal files), still using the real CrossFile prompts from butler_code_test.cr.
Control point at the old tiny-budget contract (top_k=4, 1 fresh build):
| Mode | Best case | Local p50 | Keyword coverage | Full hits | Avg returned rows | Notes |
|---|---|---|---|---|---|---|
| generic | prompt_summary_snippet_py | 0.770 ms | 87.1% | 66.7% | 3.67 | larger corpus exposes a result-budget cliff |
| code-aware | prompt_symbol_summary_snippet_py | 1.824 ms | 87.6% | 66.7% | 4.00 | same cliff under code-aware embeddings |
Bounded recovery point (top_k=8, 3 fresh builds):
| Mode | Best case | Local repeated-build p50 | Keyword coverage | Full hits | Avg returned rows | Notes |
|---|---|---|---|---|---|---|
| generic | prompt_summary_snippet_py | 0.819 ms | 100.0% | 100.0% | 6.33 | larger in-repo Crystal transfer now verified |
| code-aware | prompt_symbol_summary_snippet_py | 1.814 ms | 100.0% | 100.0% | 7.50 | same winner, but needs the larger final budget |
Interpretation:
- the current real code-corpus contracts do transfer beyond the tiny 40-file slice
- the dominant larger-corpus issue on the in-repo Crystal side is result budget, not a new retrieval failure
- this larger-corpus Crystal gate is now covered and serves as a transfer check for the benchmark-side code-retrieval logic, not as the stable release contract for GraphRAG
Mixed-language external code-corpus GraphRAG reference benchmark (pycdc)
The code-corpus harness now also supports:
- JSON question fixtures
- configurable source extensions
- quoted local `#include "..."` dependency edges for C/C++ corpora
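The quoted-include edge extraction can be sketched with a small regex pass. This is an illustrative reconstruction of the idea, not the harness's actual parser:

```python
import re

# Quoted local includes only; angle-bracket system includes are ignored
INCLUDE_RE = re.compile(r'^\s*#\s*include\s+"([^"]+)"', re.MULTILINE)


def local_include_edges(source: str) -> list[str]:
    """Return the quoted #include "..." targets of a C/C++ source file,
    i.e. the file's candidate local dependency edges."""
    return INCLUDE_RE.findall(source)


src = '#include <vector>\n#include "graph.h"\n  # include "util/state.h"\n'
print(local_include_edges(src))  # -> ['graph.h', 'util/state.h']
```

Restricting to quoted includes is the key design choice: quoted form conventionally points at in-corpus files, so the edge set stays inside the corpus graph.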
The first mixed-language adversary corpus was pycdc, using a repo-owned fixture in scripts/fixtures/graph_rag_pycdc_questions.json. This run used the real pycdc source tree (138 files, 1281 rows after chunk + summary expansion, 72 local dependency edges) and 3 fresh builds at top_k=8.
| Mode | Best case | Local repeated-build p50 | Keyword coverage | Full hits | Avg returned rows | Notes |
|---|---|---|---|---|---|---|
| generic | prompt_symbol_summary_snippet_py | 0.850 ms | 90.0% | 60.0% | 6.40 | fastest mixed-language point, but it does not close the corpus |
| code-aware | prompt_compactseed_require_summary_snippet_fn | 8.006 ms | 100.0% | 100.0% | 5.80 | helper-backed compact lexical seed + one-hop include rescue closes the corpus |
Interpretation:
- the mixed-language gate is now covered for a real `~/Projects/C` corpus
- the result split is sharper than on the Crystal corpora:
  - the fast generic path plateaus below perfect quality
  - the code-aware include-rescue path closes the corpus, but at a much higher latency
- so `~/Projects/C` is now covered as part of the broader code-corpus reference matrix, even though the fastest generic point remains partial
Archive-side code-corpus GraphRAG reference benchmark (ninja/src)
The same widened harness was then pointed at an archive-side corpus under ~/SrcArchives: apple/ninja/src. This run used a second repo-owned fixture in scripts/fixtures/graph_rag_ninja_questions.json and the local include graph inside ninja/src (103 files, 1757 rows after chunk + summary expansion, 282 dependency edges).
Initial smoke at top_k=8:
| Mode | Best fast case | Local p50 | Keyword coverage | Full hits | Notes |
|---|---|---|---|---|---|
| generic | prompt_summary_snippet_py | 0.898 ms | 95.0% | 80.0% | already close without rescue |
| code-aware | prompt_summary_snippet_py | 0.928 ms | 85.0% | 80.0% | code-aware mode is weaker on this corpus |
Bounded budget probe (top_k=12, 3 fresh builds):
| Mode | Best case | Local repeated-build p50 | Keyword coverage | Full hits | Avg returned rows | Notes |
|---|---|---|---|---|---|---|
| generic | prompt_summary_snippet_py | 0.914 ms | 100.0% | 100.0% | 7.80 | archive-side gate closes with a small result-budget bump |
| code-aware | prompt_summary_snippet_py | 0.871 ms | 85.0% | 80.0% | 7.60 | still not the winner on this corpus |
Interpretation:
- the `~/SrcArchives` side is now covered by a real repeated-build gate
- unlike `pycdc`, this archive corpus does not need a dependency-rescue contract to close
- the winner is the simple generic summary-snippet path at a slightly larger final budget (`top_k=12`)
- the larger real-corpus verification matrix for the narrow `0.13` fact-graph release now spans:
  - `~/Projects/Crystal`
  - `~/Projects/C`
  - `~/SrcArchives`
External folding stress corpus for GraphRAG reference logic (folding/src)
The same harness was then pointed at a second real code corpus outside this repository: folding/src with prompts from butler_folding_test.cr. This is not the primary publishable frontier for the repository, but it is a strong adversary corpus because it falsifies overfit retrieval contracts quickly.
Current repeated-build result:
| Mode | Case | Local repeated-build p50 | AWS repeated-build p50 | Keyword coverage | Full hits | Notes |
|---|---|---|---|---|---|---|
| generic | prompt_summary_snippet_py | 1.048 ms | 1.540 ms | 90.5% | 83.3% | fast baseline drifts below perfect quality on this corpus |
| generic | prompt_compactseed_require_summary_snippet_fn | 5.940 ms | 8.839 ms | 100.0% | 100.0% | compact lexical seed table + helper-backed one-hop REQUIRES_FILE rescue |
| generic | prompt_lexseed_require_summary_snippet_fn | 28.266 ms | 41.960 ms | 100.0% | 100.0% | historical full-summary lexical rescue, now dominated |
| code-aware | prompt_summary_snippet_py | 1.080 ms | 1.775 ms | 79.8% | 66.7% | worse baseline than the primary cogniformerus corpus |
| code-aware | prompt_compactseed_require_summary_snippet_fn | 5.804 ms | 8.392 ms | 100.0% | 100.0% | compact lexical seed table + helper-backed one-hop REQUIRES_FILE rescue |
| code-aware | prompt_lexseed_require_summary_snippet_fn | 36.676 ms | 60.457 ms | 100.0% | 100.0% | historical full-summary lexical rescue, now dominated |
Interpretation:
- the external folding miss was a real seed-selection problem, not a snippet extraction bug
- the rescue is now verified on both local Apple Silicon and AWS ARM64
- the current documented external rescue is no longer the old full-summary lexical path; it is the compact lexical-seed table variant
- compact lexical seeding keeps `100.0% / 100.0%` while cutting the old rescue by about `4.8x` locally and `4.7-7.2x` on AWS, depending on mode
- an isolated local timing split shows the helper-backed rescue is still dominated by lexical-seed + `REQUIRES_FILE` fetch work (~10.7-11.0 ms/query), with snippet postprocess as a secondary cold-start cost (~7.7-8.0 ms/query)
- even the compact rescue is still slower than the primary in-repo winners, so it does not replace them as the default GraphRAG contract
Legacy/manual IVF-PQ benchmark
The sections below are still useful for the explicit IVF-PQ API (svec_ann_scan), but they are no longer the default ANN baseline for the repository. Those measurements target the legacy/manual vector path, not the planner-integrated sorted_hnsw Index AM.
All IVF-PQ benchmarks below use svec_ann_scan (C-level) with residual PQ. 1 Gi k8s pod, PostgreSQL 18.
103K vectors, 2880-dim (Gutenberg corpus)
Residual PQ (M=720, dsub=4), 256 IVF partitions. 100 cross-queries (self-match excluded):
| Config | R@1 | Recall@10 | Avg latency |
|---|---|---|---|
| nprobe=1, PQ-only | 54% | 48% | 5.5 ms |
| nprobe=3, PQ-only | 79% | 71% | 8 ms |
| nprobe=3, rerank=96 | 82% | 74% | 10 ms |
| nprobe=5, rerank=96 | 89% | 86% | 12 ms |
| nprobe=10, rerank=200 | 97% | 94% | 22 ms |
Self-query (vector in dataset): R@1 = 100% at nprobe=3 / 8 ms.
10K vectors, 2880-dim (float32 precision test)
Same corpus, pure svec (float32), nlist=64, M=720 residual PQ. 100 cross-queries:
| Config | R@1 | Recall@10 |
|---|---|---|
| nprobe=1, PQ-only | 56% | 56% |
| nprobe=3, PQ-only | 72% | 82% |
| nprobe=5, rerank=96 | 93% | 93% |
| nprobe=10, rerank=200 | 99% | 99% |
float32 vs float16 precision impact
Tested the same 10K Gutenberg vectors in two configurations:
- float32 (svec): native 32-bit storage, independently trained codebooks
- float16-degraded (hsvec): svec → hsvec → svec roundtrip, independently trained
Result: no measurable recall difference. Float16 precision loss (~1e-7) is 1000× smaller than typical distance gaps between consecutive neighbors (~1e-4). The recall bottleneck is PQ quantization and IVF routing, not input precision. This confirms hsvec is a safe storage choice for ANN workloads.
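The roundtrip itself is easy to reproduce with the standard library's IEEE half-precision packing. A sketch (component magnitudes are illustrative; the actual error scale depends on the embedding's value range):

```python
import struct


def f16_roundtrip(x: float) -> float:
    """Python float -> IEEE 754 half precision -> float, via struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Illustrative small component magnitudes, as in high-dimensional unit vectors,
# plus one exactly-representable value as a sanity check
vals = [0.0123, -0.0371, 0.0049, 0.5]
errs = [abs(f16_roundtrip(v) - v) for v in vals]
print(max(errs))  # tiny absolute error compared with typical neighbor-distance gaps
```

For components of this magnitude the worst-case roundtrip error sits orders of magnitude below typical inter-neighbor distance gaps, which is the mechanism behind the "no measurable recall difference" result above.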
CRUD performance (500K rows, svec(128), prepared mode)
| Operation | eager / heap | lazy / heap | Notes |
|---|---|---|---|
| SELECT PK | 85% | 85% | Index Scan via btree |
| SELECT range 1K | 97% | – | Custom Scan pruning (eager only) |
| Bulk INSERT | 100% | 100% | Always eager |
| DELETE + INSERT | 63% | 63% | INSERT always eager |
| UPDATE non-vec | 46% | 100% | Lazy skips zone map flush |
| UPDATE vec col | 102% | 100% | Parity both modes |
| Mixed OLTP | 83% | 97% | Near-parity with lazy |
Eager mode (default) maintains zone maps on every UPDATE for scan pruning. Lazy mode (sorted_heap.lazy_update = on) trades scan pruning for UPDATE parity with heap. Compact/merge restores pruning.
Self-query vs cross-query
Self-query: query vector is in the dataset (typical RAG case — you embedded documents, now you search them). The vector is always found as its own closest neighbor, so R@1 = 100%.
Cross-query: query vector is NOT in the dataset (e.g., user question embedded at search time). R@1 depends on nprobe and PQ fidelity.
When comparing benchmarks, verify whether self-match is included or excluded. The tables above use cross-query (self-match excluded) for honest comparison.
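The exclusion convention is mechanical: fetch one extra neighbor, drop the query's own row if it appears, keep the top k (the `lim := 11` / positions 2-11 convention from the methodology notes below). A sketch with illustrative ids:

```python
def cross_query_top_k(ranked_ids: list[str], query_id: str, k: int = 10) -> list[str]:
    """Fetch k+1 ranked neighbors, drop the query's own row if present,
    and keep the first k survivors."""
    return [rid for rid in ranked_ids if rid != query_id][:k]


# Self-query: the query's own row comes back at rank 1 and is dropped
ranked = ["q7"] + [f"doc{i}" for i in range(11)]
print(cross_query_top_k(ranked, "q7"))  # 10 docs, self-match excluded
```

Fetching k+1 rather than k is the point: dropping the self-match from a k-sized result would silently shrink the candidate list by one.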
Methodology notes
- EXPLAIN ANALYZE: warm cache (pg_prewarm), average of 5 runs, actual execution time + buffer reads reported
- pgbench: 10 s runtime, 1 client, includes pgbench overhead (connection management, query dispatch); useful for relative throughput comparison
- INSERT: COPY path via `INSERT ... SELECT generate_series()`
- Compact time: wall-clock time for `sorted_heap_compact()` on warm data
- Vector search: 100 random queries from the dataset, self-match excluded by requesting `lim := 11` and taking positions 2-11. Ground truth via exact brute-force cosine (`<=>` operator). Latency measured via `clock_timestamp()` per query in a PL/pgSQL loop (20 queries, warm cache)
- Current local `sorted_hnsw` comparison: deterministic synthetic 10K x 384D corpus via `scripts/bench_sorted_hnsw_vs_pgvector.sh`, 3 fresh builds for PostgreSQL methods, median p50 / median recall reported; Qdrant via `scripts/bench_qdrant_synthetic.py`, 3 warm measurement passes on one local Docker collection; zvec via `scripts/bench_zvec_synthetic.py`, 3 warm measurement passes on one local in-process collection
- Current local real-dataset sample: `scripts/bench_ann_real_dataset.py` on ANN-Benchmarks `nytimes-256-angular`, sampled to 10K base vectors and 20 queries. Ground truth is exact PostgreSQL `svec` heap search on the sampled corpus. Numbers above are medians across 3 full harness runs.