Benchmarks

All benchmarks run on PostgreSQL 18, Apple M-series (12 CPU, 64 GB RAM) with shared_buffers=4GB, work_mem=256MB, maintenance_work_mem=2GB. Warm cache, average of 5 runs.


EXPLAIN ANALYZE – query latency and buffer reads

1M rows (71 MB)

Query             | sorted_heap        | heap + btree       | heap seqscan
Point (1 row)     | 0.035 ms / 1 buf   | 0.046 ms / 7 bufs  | 15.2 ms / 6,370 bufs
Narrow (100 rows) | 0.043 ms / 2 bufs  | 0.067 ms / 8 bufs  | 16.2 ms / 6,370 bufs
Medium (5K rows)  | 0.434 ms / 33 bufs | 0.492 ms / 52 bufs | 16.1 ms / 6,370 bufs
Wide (100K rows)  | 7.5 ms / 638 bufs  | 8.9 ms / 917 bufs  | 17.4 ms / 6,370 bufs

10M rows (714 MB)

Query             | sorted_heap        | heap + btree       | heap seqscan
Point (1 row)     | 0.034 ms / 1 buf   | 0.047 ms / 7 bufs  | 117.9 ms / 63,695 bufs
Narrow (100 rows) | 0.037 ms / 1 buf   | 0.062 ms / 7 bufs  | 130.9 ms / 63,695 bufs
Medium (5K rows)  | 0.435 ms / 32 bufs | 0.549 ms / 51 bufs | 131.0 ms / 63,695 bufs
Wide (100K rows)  | 7.6 ms / 638 bufs  | 8.8 ms / 917 bufs  | 131.4 ms / 63,695 bufs

100M rows (7.8 GB)

Query             | sorted_heap        | heap + btree         | heap seqscan
Point (1 row)     | 0.045 ms / 1 buf   | 0.506 ms / 8 bufs    | 1,190 ms / 519,906 bufs
Narrow (100 rows) | 0.166 ms / 2 bufs  | 0.144 ms / 9 bufs    | 1,325 ms / 520,782 bufs
Medium (5K rows)  | 0.479 ms / 38 bufs | 0.812 ms / 58 bufs   | 1,326 ms / 519,857 bufs
Wide (100K rows)  | 7.9 ms / 737 bufs  | 10.1 ms / 1,017 bufs | 1,405 ms / 518,896 bufs

At 100M rows, a point query reads 1 buffer (vs 8 for btree, 519,906 for sequential scan).
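As a quick consistency check (assuming PostgreSQL's default 8 KB block size), the seqscan buffer counts above line up with the heap sizes reported in the Storage section:

```python
# Sanity check: seqscan buffer counts should roughly equal the heap's size in
# 8 KB pages (PostgreSQL's default block size). Figures are from the tables above.
BLOCK_SIZE = 8192  # bytes; assumes a default-compiled PostgreSQL

def bufs_to_mib(n_buffers: int) -> float:
    """Convert a buffer (page) count from EXPLAIN (ANALYZE, BUFFERS) to MiB."""
    return n_buffers * BLOCK_SIZE / (1024 ** 2)

# 1M-row scale: 6,370 buffers ~= the 50 MB no-index heap
print(round(bufs_to_mib(6_370), 1))    # 49.8
# 10M-row scale: 63,695 buffers ~= the 498 MB no-index heap
print(round(bufs_to_mib(63_695), 1))   # 497.6
```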


pgbench throughput (TPS)

Prepared mode (-M prepared)

Query planned once, re-executed with parameters. 10 s, 1 client.

Query  | 1M (sh / btree) | 10M (sh / btree) | 100M (sh / btree)
Point  | 46.9K / 59.4K   | 46.5K / 58.0K    | 32.6K / 43.6K
Narrow | 22.3K / 29.1K   | 22.5K / 28.8K    | 17.9K / 18.1K
Medium | 3.4K / 5.1K     | 3.4K / 4.8K      | 2.4K / 2.4K
Wide   | 295 / 289       | 293 / 286        | 168 / 157

Simple mode (-M simple)

Each query parsed, planned, and executed separately. 10 s, 1 client.

Query  | 1M (sh / btree) | 10M (sh / btree) | 100M (sh / btree)
Point  | 28.4K / 38.0K   | 29.1K / 41.4K    | 18.7K / 4.6K
Narrow | 19.6K / 24.4K   | 21.8K / 27.6K    | 7.1K / 5.5K
Medium | 3.1K / 3.7K     | 3.4K / 4.8K      | 2.1K / 1.6K
Wide   | 198 / 290       | 200 / 286        | 163 / 144

At 100M rows in simple mode, sorted_heap wins all query types. Point queries reach 18.7K TPS vs 4.6K for btree (4x).


INSERT and compaction throughput

Scale | sorted_heap | heap + btree | heap (no index) | compact time
1M    | 923K rows/s | 961K rows/s  | 1.91M rows/s    | 0.3 s
10M   | 908K rows/s | 901K rows/s  | 1.65M rows/s    | 3.1 s
100M  | 840K rows/s | 1.22M rows/s | 2.22M rows/s    | 41.3 s

Storage

Scale | sorted_heap | heap + btree        | heap (no index)
1M    | 71 MB       | 71 MB (50 + 21)     | 50 MB
10M   | 714 MB      | 712 MB (498 + 214)  | 498 MB
100M  | 7.8 GB      | 7.8 GB (5.7 + 2.1)  | 5.7 GB

sorted_heap stores its data and zone map in a single relation, so no separate index object is needed. In these measurements, though, the total footprint tracks heap + btree rather than bare heap: the zone map currently costs about as much space as the btree it replaces, and only heap without any index stays smaller, by roughly 30% at scale.
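The ~30% figure falls out of the btree's share of the combined footprint in the table above:

```python
# Back-of-envelope from the Storage table: the btree index's share of the
# combined heap + btree footprint at each scale (sizes in MB, from the table).
scales = {"1M": (50, 21), "10M": (498, 214), "100M": (5700, 2100)}
for name, (heap_mb, btree_mb) in scales.items():
    share = btree_mb / (heap_mb + btree_mb)
    print(f"{name}: btree is {share:.0%} of heap + btree")
```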


Current local synthetic benchmark (sorted_hnsw Index AM)

Repo-owned harnesses:

  • python3 scripts/bench_gutenberg_local_dump.py --dump /tmp/cogniformerus_backup/cogniformerus_backup.dump --port 65473
  • REMOTE_PYTHON=/path/to/python SH_EF=32 EXTRA_ARGS='--sh-ef-construction 200' ./scripts/bench_gutenberg_aws.sh <aws-host> /path/to/repo /path/to/dump 65485
  • scripts/bench_sorted_hnsw_vs_pgvector.sh /tmp 65485 10000 20 384 10 vector 64 96
  • python3 scripts/bench_ann_real_dataset.py --dataset nytimes-256 --sample-size 10000 --queries 20 --k 10 --pgv-ef 64 --sh-ef 96 --zvec-ef 64 --qdrant-ef 64
  • python3 scripts/bench_qdrant_synthetic.py --rows 10000 --queries 20 --dim 384 --k 10 --ef 64
  • python3 scripts/bench_zvec_synthetic.py --rows 10000 --queries 20 --dim 384 --k 10 --ef 64

Current AWS restored-corpus benchmark (~104K x 2880D, Gutenberg dump)

AWS ARM64 host (4 vCPU, 8 GiB RAM), top-10, restored PostgreSQL custom dump. Ground truth is recomputed by exact heap search on the restored svec table; in the current rerun the stored bench_hnsw_gt table matched that exact heap GT on all 50 benchmark queries, so the fresh and historical ground truths agree. This rerun uses sorted_hnsw ef_construction=200 and ef_search=32, and the harness reconnects after the build before timing ordered scans.

Method                  | p50 latency | Recall@10 | Notes
Exact heap (svec)       | 458.762 ms  | 100.0%    | brute-force GT on restored corpus
sorted_hnsw (svec)      | 1.287 ms    | 100.0%    | ef_construction=200, ef_search=32, index 404 MB, total 1902 MB
sorted_hnsw (hsvec)     | 1.404 ms    | 100.0%    | ef_construction=200, ef_search=32, index 404 MB, total 1032 MB
pgvector HNSW (halfvec) | 2.031 ms    | 99.8%     | ef_search=64, index 804 MB, total 1615 MB
zvec HNSW               | 50.499 ms   | 100.0%    | in-process collection, ef=64, ~1.12 GiB on disk
Qdrant HNSW             | 6.028 ms    | 99.2%     | local Docker on same AWS host, hnsw_ef=64, 103,260 points

The precision-matched PostgreSQL comparison on Gutenberg is now sorted_hnsw (hsvec) vs pgvector halfvec: 1.404 ms @ 100.0% versus 2.031 ms @ 99.8%, with total footprint 1032 MB versus 1615 MB. The raw fastest PostgreSQL row on this corpus is still sorted_hnsw (svec) at 1.287 ms, but that uses float32 source storage. sorted_hnsw keeps the same 404 MB index in both cases because the AM stores SQ8 graph state; the storage gain from hsvec appears in the base table and TOAST footprint instead.

Synthetic 10K x 384D cosine corpus, top-10, warm query loop. PostgreSQL methods were rerun across 3 fresh builds and the table below reports median p50 / median recall. Qdrant uses 3 warm measurement passes on one local Docker collection.

Method                 | p50 latency | Recall@10                | Notes
Exact heap (svec)      | 2.03 ms     | 100%                     | brute-force ground truth
sorted_hnsw            | 0.158 ms    | 100%                     | shared_cache=on, ef_search=96, index ~5.4 MB
pgvector HNSW (vector) | 0.446 ms    | 90% median (90-95 range) | ef_search=64, same M=16, ef_construction=64, index ~2.0 MB
zvec HNSW              | 0.611 ms    | 100%                     | local in-process collection, ef=64
Qdrant HNSW            | 1.94 ms     | 100%                     | local Docker, hnsw_ef=64

Current local real-dataset sample (nytimes-256-angular)

Repo-owned harness:

  • python3 scripts/bench_ann_real_dataset.py --dataset nytimes-256 --sample-size 10000 --queries 20 --k 10 --pgv-ef 64 --sh-ef 96 --zvec-ef 64 --qdrant-ef 64

ANN-Benchmarks nytimes-256-angular, sampled to 10K base vectors and 20 queries, top-10. The table below reports medians across 3 full harness runs. Ground truth comes from exact PostgreSQL heap search on the sampled svec corpus.

Method                 | p50 latency | Recall@10                      | Notes
Exact heap (svec)      | 1.557 ms    | 100%                           | brute-force ground truth
sorted_hnsw            | 0.327 ms    | 85.0% median (83.5-85.5 range) | shared_cache=on, ef_search=96, index ~4.1 MB
pgvector HNSW (vector) | 0.751 ms    | 79.0% median (78.5-79.0 range) | ef_search=64, same M=16, ef_construction=64, index ~13 MB
zvec HNSW              | 0.403 ms    | 99.5%                          | local in-process collection, ef=64, ~14.1 MB on disk
Qdrant HNSW            | 1.704 ms    | 99.5%                          | local Docker, hnsw_ef=64

This corpus is materially harder than the deterministic synthetic one. It is a better signal for default-parameter recall, while the synthetic table remains useful for controlled same-host engine comparisons and regression tracking.

Current local GraphRAG benchmark (person -> parent -> city, stable fact contract)

Repo-owned harness:

  • python3 scripts/bench_graph_rag_multihop.py --num-pairs 5000 --query-count 64 --runs 3 --dim 384 --ann-k 64 --top-k 10 --ef-search 128 --ef-construction 200 --m 24 --pgv-ef-search 64 --zvec-ef 64 --qdrant-ef 64 --shared-buffers-mb 64 --backend-mode fresh

Deterministic fact graph, 5K chains / 10K rows total, 384D, top-10. This is the current balanced, stable GraphRAG point for the narrow fact-shaped retrieval workload. The stable-facing SQL entry point is sorted_heap_graph_rag(...); the table below also shows the underlying helper and wrapper paths, because the harness measures the dispatched execution path directly.

Method                                   | p50 latency | hit@1  | hit@k  | Notes
Heap two-hop SQL                         | 0.762 ms    | 75.0%  | 96.9%  | exact rerank over expanded heap set
sorted_heap_graph_rag_twohop_scan()      | 0.720 ms    | 75.0%  | 96.9%  | old city-only rerank contract
sorted_heap_expand_twohop_rerank()       | 0.712 ms    | 75.0%  | 96.9%  | same city-only seed point
sorted_heap SQL pathsum baseline         | 0.847 ms    | 98.4%  | 98.4%  | hop1_distance + hop2_distance in SQL
sorted_heap_graph_rag_twohop_path_scan() | 0.739 ms    | 98.4%  | 98.4%  | fused path-aware wrapper
sorted_heap_expand_twohop_path_rerank()  | 0.726 ms    | 98.4%  | 98.4%  | same knobs, fused path-aware helper
pgvector HNSW + heap expansion           | 2.588 ms    | 90.6%  | 90.6%  | path-aware rerank, ef_search=64
zvec HNSW + heap expansion               | 2.507 ms    | 100.0% | 100.0% | path-aware rerank, ef=64
Qdrant HNSW + heap expansion             | 4.947 ms    | 100.0% | 100.0% | path-aware rerank, hnsw_ef=64

The path-aware helper changes the local conclusion materially: the dominant quality issue on this fact-shaped workload was the old hop-2-only rerank contract, not seed ANN quality. The fused path-aware helper now gives the best local latency/quality point at the same m=24, ef_construction=200, ann_k=64, ef_search=128 operating point.

Under the same path-aware scorer contract, the current local conclusion gets sharper: sorted_heap keeps the latency lead, while zvec and Qdrant reach the strongest observed answer quality on this deterministic fact graph.
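The contract difference can be sketched on toy data (illustrative only, not the extension's code): the old contract ranks a chain by its final-hop distance alone, while the path-aware contract ranks it by the sum of per-hop distances.

```python
# Toy sketch of the two rerank contracts: "hop2-only" scores a candidate chain
# by its final-hop distance, the path-aware contract by the per-hop sum.
def rerank(chains, path_aware: bool):
    key = (lambda c: c["hop1"] + c["hop2"]) if path_aware else (lambda c: c["hop2"])
    return sorted(chains, key=key)

chains = [
    {"id": "good", "hop1": 0.10, "hop2": 0.20},   # strong full path
    {"id": "lucky", "hop1": 0.90, "hop2": 0.15},  # weak hop 1, strong hop 2
]
print(rerank(chains, path_aware=False)[0]["id"])  # "lucky" wins on hop 2 alone
print(rerank(chains, path_aware=True)[0]["id"])   # "good" wins on the path sum
```

This is exactly the failure mode the hit@1 columns above show: a hop-2-only rerank can promote a chain whose first hop was a poor match.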

A repeated-build local protocol then quantified how much of this is just single-run luck. Using three independent fresh builds of the same 5K/384D balanced point:

  • sorted_heap_expand_twohop_path_rerank() median 0.798 ms, range 0.771-0.819 ms, hit@1 = 98.4%, hit@k = 98.4% on every build
  • sorted_heap_graph_rag_twohop_path_scan() median 0.796 ms, range 0.778-0.804 ms, hit@1 = 98.4%, hit@k = 98.4% on every build
  • pgvector path-aware parity row median 1.405 ms, hit@1/hit@k 85.9-89.1%
  • zvec path-aware parity row median 1.076 ms, 100.0% / 100.0%
  • Qdrant path-aware parity row median 2.799 ms, 100.0% / 100.0%

So on the local balanced point, the current sorted_heap path-aware rows are not a fragile one-off. The latency band is tight, and the answer quality did not drift across the three rebuilds.

shared_cache=on vs off for multihop (post-fix, 0490cc4)

A multi-index shared-cache corruption bug was fixed in commit 0490cc4. Previously, when two or more sorted_hnsw indexes existed in the same database and shared_cache=on, a publish for the second index could overwrite shared memory that an attached cache for the first index was still reading through bare pointers. This caused silent data corruption and 0% retrieval quality on the affected index.

The fix deep-copies all data (L0 neighbors, SQ8 vectors, upper-level neighbor slabs) from shared memory into local palloc’d buffers on attach. The exact failure mode is regression-guarded by the multi-index overwrite phase (B5) in scripts/test_hnsw_chunked_cache.sh.

Post-fix verification on the 5K x 384D multihop benchmark (fresh backends, 3 runs, ann_k=64, ef_search=128, m=24):

python3 scripts/bench_graph_rag_multihop.py \
  --num-pairs 5000 --dim 384 --query-count 64 --runs 3 \
  --ann-k 64 --top-k 10 --ef-search 128 --ef-construction 200 --m 24 \
  --shared-cache {on,off} --backend-mode fresh \
  --skip-zvec --skip-qdrant --skip-pgvector
Method                                   | shared_cache | p50      | hit@1 | hit@k
sorted_heap_expand_twohop_path_rerank()  | off          | 0.780 ms | 98.4% | 98.4%
sorted_heap_expand_twohop_path_rerank()  | on           | 0.766 ms | 98.4% | 98.4%
sorted_heap_graph_rag_twohop_path_scan() | off          | 0.804 ms | 98.4% | 98.4%
sorted_heap_graph_rag_twohop_path_scan() | on           | 0.786 ms | 98.4% | 98.4%

Also verified at 10K x 384D:

python3 scripts/bench_graph_rag_multihop.py \
  --num-pairs 10000 --dim 384 --query-count 64 --runs 3 \
  --ann-k 64 --top-k 10 --ef-search 128 --ef-construction 200 --m 24 \
  --shared-cache {on,off} --backend-mode fresh \
  --skip-zvec --skip-qdrant --skip-pgvector
Method                                   | shared_cache | p50      | hit@1 | hit@k
sorted_heap_graph_rag_twohop_path_scan() | off          | 0.904 ms | 95.3% | 96.9%
sorted_heap_graph_rag_twohop_path_scan() | on           | 0.972 ms | 95.3% | 96.9%

Quality is identical between on and off at both measured scales. Latency at these measured scales is mixed rather than a clear universal win. shared_cache=off is no longer needed as a correctness workaround.

Larger-scale cold-start measurement (100K x 32D, 200K rows, 47 MB index):

The deep-copy fix means attach now copies all L0 neighbors (~38 MB), SQ8 data (~6 MB), and node metadata (~6 MB) from shared memory into local buffers on each fresh backend’s first query. At 200K nodes, this upfront cost exceeds the lazy page-decode path used by shared_cache=off:

# 100K pairs, 32D, m=24, ef_construction=200, ef_search=128, 20 fresh backends per mode
# PG buffer cache warm (shared_buffers=256MB):
shared_cache=off: cold KNN p50=6.31ms
shared_cache=on:  cold KNN p50=8.68ms

# PG buffer cache cold (PG restarted between modes, shared_buffers=64MB):
shared_cache=off: cold KNN first=5.3ms  p50=5.7ms
shared_cache=on:  cold KNN first=8.6ms  p50=9.5ms

The off path loads L0 pages lazily (only pages visited by beam search). The on path deep-copies all bulk data upfront under LWLock. At 200K nodes, the upfront copy dominates. Quality remains identical.
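The quoted upfront copy sizes follow from the cache layout (n_nodes * 2 * M int32 L0 neighbor slots, one SQ8 byte per dimension per node):

```python
# Reproduce the "~38 MB L0" and "~6 MB SQ8" attach-copy figures for the
# 200K-node, m=24, 32D index described above.
def l0_bytes(n_nodes: int, m: int) -> int:
    return n_nodes * 2 * m * 4   # 2*M int32 neighbor slots per node

def sq8_bytes(n_nodes: int, dim: int) -> int:
    return n_nodes * dim         # one quantized byte per dimension

print(l0_bytes(200_000, 24) / 1e6)   # 38.4 -> the ~38 MB L0 copy
print(sq8_bytes(200_000, 32) / 1e6)  # 6.4  -> the ~6 MB SQ8 copy
```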

Current conclusion: shared_cache=on is a correctness-safe default but not a performance feature at the measured scales. The deep-copy overhead from the multi-index fix neutralized the original latency benefit.

Synthetic multi-hop depth scaling (relation_path depth 1..5)

Repo-owned harness:

  • python3 scripts/bench_graph_rag_multidepth.py --num-pairs 5000 --max-depth 5 --query-count 32 --runs 3 --dim 384 --ann-k 64 --top-k 10 --ef-search 128 --ef-construction 200 --m 24 --shared-buffers-mb 64

Deterministic chain-shaped fact graph, 25K rows total, 384D, top-10, fresh backend, path-aware scorer. This is a narrow scaling check for the generic unified syntax:

  • relation_path := ARRAY[1]
  • relation_path := ARRAY[1,2]
  • relation_path := ARRAY[1,2,3]
  • relation_path := ARRAY[1,2,3,4]
  • relation_path := ARRAY[1,2,3,4,5]

Method                                           | depth 1  | depth 2  | depth 3  | depth 4  | depth 5  | Quality
Heap SQL path baseline                           | 0.573 ms | 0.589 ms | 0.613 ms | 0.622 ms | 0.622 ms | 100.0% / 100.0%
sorted_heap SQL path baseline                    | 0.774 ms | 0.732 ms | 0.776 ms | 0.712 ms | 0.772 ms | 100.0% / 100.0%
sorted_heap_graph_rag(..., score_mode := 'path') | 0.674 ms | 0.651 ms | 0.643 ms | 0.633 ms | 0.649 ms | 100.0% / 100.0%

On this synthetic chain benchmark, the unified path-aware function does not show a latency cliff through depth 5; it stays in the same ~0.63-0.67 ms band while preserving 100.0% / 100.0% quality. This is a controlled scaling signal, not a claim about arbitrary deep graph workloads.

Larger synthetic GraphRAG scale envelope (1M and 10M)

Repo-owned harnesses:

  • local: python3 scripts/bench_graph_rag_multidepth.py --num-pairs 200000 --max-depth 5 --query-count 8 --runs 1 --dim 384 --ann-k 64 --top-k 10 --ef-search 128 --ef-construction 64 --m 24 --shared-buffers-mb 64 --max-wal-size-gb 32 --maintenance-work-mem-mb 4096 --table-scope sorted_heap_only
  • AWS: NUM_PAIRS=2000000 MAX_DEPTH=5 QUERY_COUNT=1 RUNS=1 DIM=32 ANN_K=16 TOP_K=10 EF_SEARCH=32 EF_CONSTRUCTION=8 M=8 SHARED_BUFFERS_MB=64 MAX_WAL_SIZE_GB=16 MAINTENANCE_WORK_MEM_MB=2048 TABLE_SCOPE=sorted_heap_only ./scripts/bench_graph_rag_multidepth_aws.sh <aws-host> /path/to/repo 65494

Current verified scale picture:

  • local 1M rows (200K pairs x 5 hops, 384D, cheap-build point):
    • generate_csv: 162.667 s
    • load_data: 45.694 s
    • build_indexes: 377.968 s
    • unified path-aware sorted_heap_graph_rag(...) p50:
      • depth 1: 1.250 ms
      • depth 2: 1.429 ms
      • depth 3: 1.777 ms
      • depth 4: 2.181 ms
      • depth 5: 2.604 ms
    • quality stayed 100.0% / 100.0% at every depth
  • local 10M x 384D no longer dies in Python generation after the streaming CSV change; the process stayed near 31 MiB RSS on the early run where the old generator previously blew up.
  • local 10M x 64D and AWS 10M x 32D both survive generation and load, but the first practical frontier is still the single sorted_hnsw CREATE INDEX build, not query execution.

The strongest bounded AWS signal so far is:

  • 10M x 32D (2M pairs x 5 hops, ultralight build point)
    • generate_csv: 223.114 s
    • CSV size: 3.63 GiB
    • load_data: 485.177 s
    • on the old build path, it then entered a long CREATE INDEX that was still active after ~11-12 minutes in the index-build phase
    • temp-cluster footprint reached about 17-18 GiB with 11-13 GiB disk still free on the AWS host

After removing the per-search visited[] allocation/zeroing from the hot HNSW build loop, the same local diagnostic point that previously established the bottleneck changed materially:

  • local 500K x 32D (100K pairs x 5 hops, m=8, ef_construction=8)
    • old total build: 18.1-18.7 s
    • old graph-construction phase: about 18.27 s
    • new total build: 2.777-2.996 s
    • new graph-construction phase: about 2.42-2.59 s
    • page-writing tail stayed small (~0.20-0.24 s)
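One standard way to remove per-search visited[] allocation/zeroing is an epoch-stamped visited array; this is a sketch of the general technique, not the extension's actual C code:

```python
# Epoch-stamped visited set: the stamp array is allocated once, and "resetting"
# for a new search is a single counter increment instead of an O(n) memset.
class VisitedSet:
    def __init__(self, n_nodes: int):
        self.stamp = [0] * n_nodes  # allocated once for the whole build
        self.epoch = 0

    def next_search(self):
        self.epoch += 1  # O(1) reset per search

    def check_and_mark(self, node: int) -> bool:
        """Return True if the node was already visited in the current search."""
        if self.stamp[node] == self.epoch:
            return True
        self.stamp[node] = self.epoch
        return False

v = VisitedSet(5)
v.next_search()
print(v.check_and_mark(3))  # False (first visit this search)
print(v.check_and_mark(3))  # True  (already visited)
v.next_search()             # new search: every node reads as unvisited again
print(v.check_and_mark(3))  # False
```

In C the epoch counter would eventually wrap and need a one-time re-zero; Python integers sidestep that detail.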

That moved the AWS 10M x 32D scale branch from “still stuck in CREATE INDEX” to the first real 10M query pass on the same cheap-build point:

  • generate_csv: 222.814 s
  • load_data: 468.923 s
  • build_indexes: 212.553 s
  • analyze: 14.562 s
  • unified path-aware sorted_heap_graph_rag(...) p50:
    • depth 1: 2.124 ms
    • depth 2: 3.433 ms
    • depth 3: 4.611 ms
    • depth 4: 5.846 ms
    • depth 5: 5.139 ms

This is not a release-quality GraphRAG point. The build/query knobs here were intentionally ultra-light (ann_k=16, ef_search=32, ef_construction=8, m=8) to get a first 10M latency envelope. At those settings, answer quality on the single-query AWS probe was poor (0.0% / 0.0%), so the run should be treated as a scale smoke test, not a publishable quality frontier.

I then kept that AWS temp cluster alive and ran query-only sweeps on the same cheap-built 10M x 32D graph to test whether query-time tuning alone could recover depth-5 quality without paying for another CREATE INDEX. It did not:

  • ef_search=128, ann_k=64 -> depth-5 2.626 ms, 0.0% / 0.0%
  • ef_search=128, ann_k=128 -> depth-5 3.040 ms, 0.0% / 0.0%
  • ef_search=256, ann_k=128 -> depth-5 3.284 ms, 0.0% / 0.0%
  • ef_search=256, ann_k=256 -> depth-5 4.076 ms, returned 10 rows, but still 0.0% / 0.0%
  • one pathological point, ef_search=32, ann_k=64, spiked to 1833.809 ms and 259709.5 shared reads while still returning 0.0% / 0.0%

I then ran the same question on a smaller but shape-matched local 1M x 32D graph (200K pairs x 5 hops) to separate build quality from query contract:

  • cheap local build points (m=8, ef_construction=8/32, m=16, ef_construction=32/64) stayed weak at the old narrow query contract (ann_k=64, top_k=10), topping out around 25.0% hit@k
  • but with a wider query contract on the same 1M x 32D graph (ann_k=256, top_k=32, ef_search=128), both
    • the strong build (m=24, ef_construction=200), and
    • the much cheaper build (m=16, ef_construction=64) reached the same 96.9% hit@k over 32 queries
  • exact heap seeds on that 1M x 32D graph matched the ANN result exactly at the widened point, so once the query budget is large enough, heavier builds were no longer the decisive lever there

That suggested a second AWS 10M x 32D probe on the same cheap build but with the widened local winner (ann_k=256, top_k=32, ef_search=128). The result still failed:

  • ANN path-aware sorted_heap_graph_rag(...):
    • top_k=10 -> 1832.345 ms, 0.0% / 0.0%
    • top_k=32 -> 1832.755 ms, 0.0% / 0.0%
  • exact heap seeds + the same path-aware expansion/rerank contract:
    • ann_k=256, top_k=32 -> 0.0% / 0.0%

So the cheap-build 10M x 32D graph can produce latency numbers, but its quality is not recoverable by either:

  • larger query-time ANN budgets, or
  • exact heap seeds on the same low-dimensional corpus

The 10M x 32D frontier is therefore no longer “weak HNSW build quality.” It is the low-dimensional scale contract itself. The next meaningful scale branch is a higher-dimensional 10M point or a different retrieval contract, not another m/ef_construction tweak on the same 32D setup.

The first higher-dimensional calibration point was 64D, and it changed the local picture immediately. On a local 1M x 64D graph (200K pairs x 5 hops), a relatively cheap build (m=16, ef_construction=64) plus the wider query contract (ann_k=256, top_k=32, ef_search=128) gave:

  • ANN path-aware sorted_heap_graph_rag(...): 65.6% hit@1, 96.9% hit@k
  • exact heap seeds + the same path-aware expansion/rerank contract: 65.6% hit@1, 96.9% hit@k

At top_k=10, the same 1M x 64D point already held 96.9% hit@k, so unlike 32D, the result-budget cliff largely disappeared there.

The first file-backed 10M x 64D AWS attempt on the current ubuntu@dev.rigelstar.com host (4 vCPU, 8 GiB RAM) failed for operational reasons before query timing:

  • generate_csv: 414.397 s
  • CSV size: 6.67 GiB
  • temp dir grew to about 28 GiB
  • /tmp fell to 1.3 GiB free (99% used) while the run was still active

That led to a footprint reduction pass in the harness:

  • stream fact rows directly into COPY instead of materializing a giant CSV
  • drop facts_heap after loading facts_sh in sorted_heap_only mode
  • allow query-only reuse from a kept temp cluster
  • when facts_heap is not needed, copy rows straight into facts_sh before sorted_heap_compact(...) instead of staging through facts_heap and INSERT .. ORDER BY
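The "stream fact rows directly into COPY" change can be sketched as a file-like row generator (table and column shapes here are hypothetical placeholders):

```python
# A file-like object that generates CSV rows on demand, so no multi-GiB CSV is
# ever materialized on disk before COPY.
class FactRowStream:
    def __init__(self, n_rows: int, dim: int):
        # placeholder rows: an id column plus `dim` zero-valued embedding columns
        self._rows = (
            f"{i}," + ",".join("0.0" for _ in range(dim)) + "\n" for i in range(n_rows)
        )
        self._buf = b""

    def read(self, size=-1):
        # file-like read() so a COPY ... FROM STDIN driver can pull bytes lazily
        while size < 0 or len(self._buf) < size:
            row = next(self._rows, None)
            if row is None:
                break
            self._buf += row.encode()
        if size < 0:
            out, self._buf = self._buf, b""
        else:
            out, self._buf = self._buf[:size], self._buf[size:]
        return out

print(FactRowStream(3, 2).read().decode(), end="")
# With psycopg2 this could feed the load without an intermediate file, e.g.:
#   cur.copy_expert("COPY facts_sh FROM STDIN WITH (FORMAT csv)", FactRowStream(...))
```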

That direct facts_sh load path held up on bounded local checks:

  • 200K rows (40K pairs x 5 hops, 64D, hop_weight=0.05)
    • old staged path: 6321.453 ms load, depth-5 unified GraphRAG 24.504 ms, 100.0% / 100.0%
    • direct COPY facts_sh: 5638.134 ms load, depth-5 unified GraphRAG 24.312 ms, 100.0% / 100.0%
  • 1M rows (200K pairs x 5 hops, 64D, load only)
    • old staged path: 31.392 s
    • direct COPY facts_sh: 28.231 s

So the current conclusion is narrow but useful: for synthetic sorted_heap_only multidepth runs, direct COPY facts_sh is a real ~10% ingest win without giving up the compacted query-time locality that the earlier “skip compaction” falsifier showed we still need.

I also bounded the obvious follow-up on the same direct ordered load: parameterizing the post-load maintenance step as none, merge, or compact.

  • 200K rows (40K pairs x 5 hops, 64D, hop_weight=0.05)
    • none: 5159.888 ms load, depth-5 unified GraphRAG 62.148 ms
    • merge: 5749.190 ms load, depth-5 unified GraphRAG 23.277 ms
    • compact: 5626.887 ms load, depth-5 unified GraphRAG 24.591 ms
  • 1M rows (200K pairs x 5 hops, 64D, load only)
    • none: 25.820 s
    • merge: 28.142 s
    • compact: 28.108 s

So merge is now exposed as an experiment knob in the multidepth harness, but it is not a proven new default. The current evidence says:

  • none is too expensive at query time
  • merge is viable on ordered synthetic loads
  • merge does not materially beat compact on the larger 1M load point

That leaves compact as the stable default for large-scale multidepth runs.

With that lower-footprint path, the same 10M x 64D AWS point advanced materially further on the same host:

  • streamed generate_csv: 0.000 s (csv_bytes=0, no materialized CSV)
  • load_data: 916.030 s
  • temp dir plateaued around 19 GiB
  • filesystem still had about 11 GiB free at the start of CREATE INDEX

The next frontier was no longer disk headroom. On the same 10M x 64D cheap-build point (m=16, ef_construction=64), the old local scan-cache seeding path then failed inside PostgreSQL:

  • CREATE INDEX facts_sh_ann_idx ON facts_sh USING sorted_hnsw (embedding) WITH (m = 16, ef_construction = 64)
  • ERROR: invalid memory alloc request size 1280000000

That request size matched the old contiguous local L0 neighbor cache layout (n_nodes * 2 * M * sizeof(int32) at 10M x M=16), so the next fix was to replace local l0_neighbors + sq8_data slabs with page-backed storage. Shared immutable scan caches remain contiguous; local build seeding and local shnsw_load_cache() no longer require a single giant L0 allocation.
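The arithmetic behind that diagnosis checks out: PostgreSQL's standard allocator rejects any single palloc request above MaxAllocSize (0x3fffffff bytes, just under 1 GiB), and the old contiguous L0 layout needs more than that at this scale:

```python
# Old contiguous L0 neighbor cache: n_nodes * 2 * M * sizeof(int32)
n_nodes, m, sizeof_int32 = 10_000_000, 16, 4
request = n_nodes * 2 * m * sizeof_int32
print(request)                 # 1280000000 -- the rejected request size
print(request > 0x3FFFFFFF)    # True: exceeds PostgreSQL's MaxAllocSize
```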

With the chunked local-cache fix in place, the next retained branch was the constrained-memory rerun via:

  • sorted_hnsw.build_sq8 = on
  • lower-hop synthetic contract (hop_weight = 0.05)

On the same AWS ARM64 host (4 vCPU, 8 GiB RAM, 4 GiB swap), the monolithic 10M x 64D point then completed cleanly:

  • load_data: 787.809 s
  • build_indexes: 846.795 s
  • kept temp cluster: /home/ubuntu/graphrag_tmp/graph_rag_y8cuntf9

That is a real constrained-memory result: the same host that previously hit early large-scale frontiers now built the full monolithic sorted_hnsw index on 10,000,000 rows without being OOM-killed.

The first query-only reuse pass on that exact built graph (query_count=4, runs=1, ann_k=256, top_k=32, ef_search=128, sorted_heap_only) gave:

  • depth 1
    • SQL path: 1072.525 ms, 100.0% / 100.0%
    • unified GraphRAG path: 840.607 ms, 100.0% / 100.0%
  • depth 2
    • SQL path: 1070.437 ms, 75.0% / 100.0%
    • unified GraphRAG path: 2055.479 ms, 75.0% / 100.0%
  • depth 3
    • SQL path: 1050.793 ms, 50.0% / 100.0%
    • unified GraphRAG path: 2070.894 ms, 50.0% / 100.0%
  • depth 4
    • SQL path: 1080.123 ms, 50.0% / 100.0%
    • unified GraphRAG path: 2066.717 ms, 50.0% / 100.0%
  • depth 5
    • SQL path: 1064.405 ms, 75.0% / 100.0%
    • unified GraphRAG path: 2084.155 ms, 75.0% / 100.0%

So the 10M x 64D monolithic branch is now narrowed sharply:

  • constrained-memory monolithic build is viable
  • quality stayed aligned between the SQL baseline and the unified GraphRAG path
  • the remaining issue is not build survival and not quality drift
  • the remaining issue is latency: at depth 2+, the current generic unified GraphRAG path is still about 2x slower than the SQL path baseline on this host and graph

That is exactly why the next scale story should move toward segmentation + pruning rather than trying to turn one monolithic 10M graph into the final operating model for small hosts.

I then added an opt-in stage breakdown path to the multidepth harness via --report-stage-stats, using sorted_heap_graph_rag_stats() after each unified path-aware call. On the local 1M x 64D lower-hop point (hop_weight=0.05, ann_k=256, top_k=32, ef_search=128), that narrowed the runtime picture sharply:

  • depth 2:
    • end-to-end unified GraphRAG: 109.222 ms
    • internal stage stats:
      • ann_ms = 107.590
      • expand_ms = 0.369
      • rerank_ms = 0.004
  • depth 5:
    • end-to-end unified GraphRAG: 110.507 ms
    • internal stage stats:
      • ann_ms = 109.178
      • expand_ms = 0.691
      • rerank_ms = 0.011

So the current 1M x 64D depth cost is not expansion-bound. On the widened contract, almost all measured time is already in the ANN seed stage; the multi-hop expansion and rerank work remain sub-millisecond.

I then added an opt-in low-memory sorted_hnsw build path via SET sorted_hnsw.build_sq8 = on. This keeps the final on-disk/index-query contract the same, but the graph is built from SQ8-compressed build vectors instead of a full float32 build slab. The tradeoff is one extra heap scan during CREATE INDEX.

Bounded local A/B on the same 1M x 64D lower-hop point (m=16, ef_construction=64, ann_k=256, top_k=32, ef_search=128, sorted_heap_only, streamed load):

  • build_sq8=off
    • load_data: 28.044 s
    • build_indexes: 48.606 s
    • unified depth-5 GraphRAG: 111.578 ms, 87.5% / 100.0%
  • build_sq8=on
    • load_data: 27.935 s
    • build_indexes: 46.541 s
    • unified depth-5 GraphRAG: 110.541 ms, 87.5% / 100.0%

So the first retained result is narrow but real:

  • the low-memory build path did not regress the measured 1M x 64D GraphRAG quality point
  • it slightly improved build time on that point instead of paying a visible penalty for the extra heap scan
  • and by construction it cuts the build-vector slab from 4 * N * D bytes to 1 * N * D
    • 10M x 64D: about 2.56 GiB -> 0.64 GiB
    • 10M x 384D: about 15.36 GiB -> 3.84 GiB
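A minimal per-vector SQ8 sketch makes the slab math concrete. This is illustrative only and not necessarily the extension's exact quantization scheme: each float32 component maps to one uint8 via a per-vector min/scale, so the build slab shrinks from 4*N*D to 1*N*D bytes (plus small per-vector parameters); sizes below are decimal gigabytes.

```python
# Illustrative scalar quantization to 8 bits with per-vector min/scale.
def sq8_quantize(vec):
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0          # guard against constant vectors
    return [round((x - lo) / scale) for x in vec], lo, scale

def sq8_dequantize(codes, lo, scale):
    return [lo + c * scale for c in codes]

vec = [0.0, 0.5, 1.0, -1.0]
codes, lo, scale = sq8_quantize(vec)
approx = sq8_dequantize(codes, lo, scale)
print(max(abs(a - b) for a, b in zip(approx, vec)))  # error stays within one step

# Build-vector slab sizes for the two 10M points quoted above (decimal GB):
print(4 * 10_000_000 * 64 / 1e9, "->", 1 * 10_000_000 * 64 / 1e9)    # 2.56 -> 0.64
print(4 * 10_000_000 * 384 / 1e9, "->", 1 * 10_000_000 * 384 / 1e9)  # 15.36 -> 3.84
```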

So the narrow conclusion is:

  • the multi-hop GraphRAG path itself survives to at least 1M rows and gives measured query latencies there
  • the old 10M frontier was build-bound, not Python-memory-bound, not CSV-generation-bound, and not an early PostgreSQL failure
  • the current build optimization moved that frontier enough to obtain the first real 10M query numbers on a cheap-build point
  • the remaining 10M x 32D problem is no longer “can it finish?” and no longer just “is the HNSW build too weak?”
  • the stronger falsifier is that even exact seeds fail at that scale and dimensionality, so the next branch is a different scale contract, not a narrower HNSW tuning loop
  • 64D is the first scale contract that looks healthy locally at 1M
  • on the current AWS host, the 10M x 64D branch now clears ingest, build, and the first query pass; the remaining cheap-build frontier is deep-path quality, not allocator failure

I then added a segmented multidepth harness in scripts/bench_graph_rag_multidepth_segmented.py. This is the first partitioning benchmark, but it is intentionally harness-side, not an extension API: current GraphRAG helpers require a concrete sorted_heap table, so the benchmark fans out across multiple sorted_heap shards and merges the shard-local top-k rows in Python.
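The harness-side merge step is small; a sketch of the shape (hypothetical result tuples, assuming each shard returns its local top-k sorted by distance):

```python
import heapq

# Merge shard-local top-k result lists into a global top-k: the k smallest
# (distance, row_id) pairs across all shards.
def merge_topk(shard_results, k):
    return heapq.nsmallest(k, heapq.merge(*shard_results))

shard_a = [(0.10, "a1"), (0.40, "a2")]   # already sorted within each shard
shard_b = [(0.05, "b1"), (0.30, "b2")]
print(merge_topk([shard_a, shard_b], k=3))
```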

The local 1M x 64D lower-hop point (200K pairs, 5 hops, 32 queries, ann_k=256, top_k=32, ef_search=128, m=16, ef_construction=64, build_sq8=on) gives a clean first answer:

  • monolithic unified GraphRAG:
    • depth 1: 50.104 ms, 100.0% / 100.0%
    • depth 5: 121.524 ms, 81.2% / 100.0%
  • segmented into 8 shards, all-shard fanout:
    • build_indexes: 57.885 s versus monolithic 45.714 s
    • depth 1: 87.677 ms, 100.0% / 100.0%
    • depth 5: 142.472 ms, 81.2% / 100.0%
  • segmented into 8 shards, exact routing to the owning shard:
    • depth 1: 10.574 ms, 100.0% / 100.0%
    • depth 5: 16.822 ms, 100.0% / 100.0%

That makes the partitioning lesson explicit:

  • segmentation by itself is not a free latency win
  • all-shard fanout preserves quality on this benchmark, but pays a clear fanout tax
  • the real win appears only when routing/pruning can avoid most shards
  • so the next scale story should be “partitioning + pruning contract”, not just “more shards”

This is the correct shape for large knowledge bases:

  • on constrained hosts, sharding bounds per-index build memory
  • on real workloads, the query path must also prune by tenant / knowledge-base / relation family / time window (or a future segment router), otherwise the system only trades one monolith for a broad fanout query

That local result transferred cleanly to the constrained AWS host on the first full 10M x 64D streamed segmented run (2M pairs, 5 hops, 8 shards, hop_weight=0.05, ann_k=256, top_k=32, ef_search=128, m=16, ef_construction=64, build_sq8=on):

  • streamed segmented build-only:
    • generate_csv: 0.000 s
    • load_data: 500.474 s
    • build_indexes: 784.778 s
  • segmented, route=exact query-only reuse on the same built graph:
    • depth 1: 126.057 ms, 100.0% / 100.0%
    • depth 2: 261.986 ms, 75.0% / 100.0%
    • depth 3: 259.794 ms, 75.0% / 100.0%
    • depth 4: 258.879 ms, 75.0% / 100.0%
    • depth 5: 258.766 ms, 100.0% / 100.0%
  • segmented, route=all query-only reuse on the same built graph:
    • depth 1: 898.440 ms, 100.0% / 100.0%
    • depth 2: 2090.866 ms, 75.0% / 100.0%
    • depth 3: 2089.650 ms, 50.0% / 100.0%
    • depth 4: 2088.114 ms, 50.0% / 100.0%
    • depth 5: 2093.652 ms, 75.0% / 100.0%

That gives the first full large-scale constrained-memory comparison on the same AWS box:

  • monolithic 10M x 64D, low-memory build:
    • depth 1 unified GraphRAG: 840.607 ms, 100.0% / 100.0%
    • depth 5 unified GraphRAG: 2084.155 ms, 75.0% / 100.0%
  • segmented 10M x 64D, route=all:
    • effectively the same latency/quality envelope as the monolith
  • segmented 10M x 64D, route=exact:
    • about 6.7x faster than the monolith at depth 1
    • about 8.1x faster than the monolith at depth 5
    • and still stable at 100.0% / 100.0% for depth 5 on this benchmark
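The quoted speedups follow directly from the measured p50s above:

```python
# Monolithic vs segmented route=exact, AWS 10M x 64D p50s (ms) from the tables above.
mono_d1, seg_d1 = 840.607, 126.057
mono_d5, seg_d5 = 2084.155, 258.766
print(round(mono_d1 / seg_d1, 1))  # 6.7x at depth 1
print(round(mono_d5 / seg_d5, 1))  # 8.1x at depth 5
```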

So the large-scale result now matches the local 1M lesson:

  • segmentation without pruning is not a performance story
  • streamed segmented ingest is operationally better than front-loaded shard CSV materialization
  • segmented routing is the first constrained-memory scale path that improves both build viability and query latency on the same host

The next local step was to move routing out of benchmark-only Python and into a real SQL reference path. On a kept local 5K-row segmented smoke cluster (4 shards, 64D, ann_k=32, top_k=8, ef_search=64), the new range-routed SQL path matched the segmented SQL merge baseline on quality and row counts:

  • segmented SQL merge, route=exact, depth 5:
    • 0.215 ms, 100.0% / 100.0%
  • metadata-routed SQL wrapper, same exact-route point:
    • 0.245 ms, 100.0% / 100.0%

So the current reference stack is now:

  • sorted_heap_graph_rag_segmented(...) for explicit candidate shard arrays
  • sorted_heap_graph_rag_routed(...) for simple metadata-driven range routing

The next local routing pass added an exact-key companion for tenant / KB style selection. On another kept local 5K-row segmented smoke cluster (4 shards, 64D, ann_k=32, top_k=8, ef_search=64), the exact-key routed path stayed aligned with the exact-route segmented SQL merge path:

  • exact-route segmented SQL merge, depth 5:
    • 0.183 ms, 100.0% / 100.0%
  • exact-key routed SQL wrapper, same point:
    • 0.202 ms, 100.0% / 100.0%

That is still not the final routing story for large knowledge bases, but it is the first usable SQL-level bridge from harness-side segmentation to productized segmented GraphRAG, with both:

  • range-based routing
  • exact-key routing
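
Both routing modes can be sketched in plain Python. This is a minimal model of the wrapper semantics only, not the SQL implementation; the boundary layout and owner-map shape are assumptions:

```python
import bisect

def make_range_router(boundaries):
    """Route a sortable key to one of len(boundaries)+1 shards.

    boundaries[i] is the exclusive upper bound of shard i; a key equal to
    a boundary falls into the next shard up.
    """
    def route(key):
        return bisect.bisect_right(boundaries, key)
    return route

def make_exact_key_router(mapping):
    """Route tenant / knowledge-base style keys via an explicit owner map."""
    def route(key):
        if key not in mapping:
            raise KeyError(f"no shard owns key {key!r}")
        return mapping[key]
    return route

route_range = make_range_router([1000, 2000, 3000])  # 4 shards by id range
route_exact = make_exact_key_router({"tenant_a": 0, "tenant_b": 2})
route_range(1500)        # -> 1
route_exact("tenant_b")  # -> 2
```

The point of keeping both: range routing covers monotonic entity-id layouts, exact-key routing covers ownership-style layouts where a shard id is looked up, not computed.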

Monolithic vs segmented exact-routed: constrained-memory 10M x 64D comparison

This table consolidates the AWS 10M x 64D results already reported above (monolithic in “Larger synthetic GraphRAG scale envelope”, segmented in “Segmented GraphRAG at scale”) into a direct side-by-side.

Both runs used the same AWS ARM64 host (4 vCPU, 8 GiB RAM, 4 GiB swap), the same workload (2M pairs x 5 hops, hop_weight=0.05), and the same build/query knobs (m=16, ef_construction=64, build_sq8=on, ann_k=256, top_k=32, ef_search=128). Segmented mode used 8 shards.

Metric Monolithic Segmented exact Segmented all
load_data 787.8 s 500.5 s (one shared segmented build)
build_indexes 846.8 s 784.8 s (one shared segmented build)
depth 1 p50 840.6 ms 126.1 ms 898.4 ms
depth 5 p50 2084.2 ms 258.8 ms 2093.7 ms
depth 1 hit@1/hit@k 100%/100% 100%/100% 100%/100%
depth 5 hit@1/hit@k 75%/100% 100%/100% 75%/100%
speedup at depth 5 1x 8.1x ~1x

Exact commands (monolithic build + query-only reuse):

NUM_PAIRS=2000000 MAX_DEPTH=5 QUERY_COUNT=4 RUNS=1 DIM=64 \
  ANN_K=256 TOP_K=32 EF_SEARCH=128 EF_CONSTRUCTION=64 M=16 \
  SHARED_BUFFERS_MB=64 MAX_WAL_SIZE_GB=16 HOP_WEIGHT=0.05 \
  BUILD_SQ8=on TABLE_SCOPE=sorted_heap_only \
  ./scripts/bench_graph_rag_multidepth_aws.sh <host> <dir> 65493

Exact commands (segmented build + query-only reuse):

NUM_PAIRS=2000000 MAX_DEPTH=5 QUERY_COUNT=4 RUNS=1 DIM=64 \
  ANN_K=256 TOP_K=32 EF_SEARCH=128 EF_CONSTRUCTION=64 M=16 \
  SHARED_BUFFERS_MB=64 MAX_WAL_SIZE_GB=16 HOP_WEIGHT=0.05 \
  BUILD_SQ8=on SHARDS=8 ROUTE=both \
  ./scripts/bench_graph_rag_multidepth_segmented_aws.sh <host> <dir> 65494

Conclusion: on the measured 10M x 64D constrained-memory point, segmented exact routing is 8.1x faster at depth 5 with better quality (100%/100% vs 75%/100%), while build time is comparable. All-shard fanout offers no latency benefit — the win comes entirely from shard pruning.

This is not a universal claim. Exact routing requires knowing which shard owns the query entity, which maps to tenant-id / knowledge-base-id style workloads. Queries that cannot be pruned to a single shard will see all-shard-fanout latency (~1x monolithic).

Bounded fanout: routing robustness on 1M x 64D

This measures how the segmented win degrades when routing is imperfect. Instead of routing to exactly one shard (exact) or all shards (all), bounded fanout routes to K adjacent shards (always including the correct one). This simulates imperfect routing where the router knows roughly which shard, but hedges by including neighbors.
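
The bounded candidate set can be sketched as follows. This is a minimal model; the "K adjacent shards" windowing is an assumption about the harness, clamped so the set always contains the correct shard:

```python
def bounded_fanout(correct_shard, shards, fanout):
    """Return `fanout` adjacent shard ids that always include the correct one.

    Models a hedging router: it knows roughly where the entity lives and
    widens the candidate set to neighboring shards, clamped to [0, shards).
    """
    start = max(0, min(correct_shard - fanout // 2, shards - fanout))
    return list(range(start, start + fanout))

bounded_fanout(5, 8, 2)  # -> [4, 5]
bounded_fanout(0, 8, 4)  # -> [0, 1, 2, 3]
```

With fanout equal to the shard count this degenerates to route=all; with fanout 1 it is route=exact.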

Local, 200K pairs x 5 hops, 64D, 8 shards, 32 queries, 3 runs, ann_k=256, top_k=32, ef_search=128, m=16, ef_construction=64, build_sq8=on, hop_weight=0.05:

python3 scripts/bench_graph_rag_multidepth_segmented.py \
  --num-pairs 200000 --max-depth 5 --query-count 32 --runs 3 \
  --dim 64 --ann-k 256 --top-k 32 --ef-search 128 \
  --ef-construction 64 --m 16 --build-sq8 on \
  --shards 8 --route {exact,bounded,all} --fanout {2,4} --hop-weight 0.05

Monolithic baseline (sorted_heap_graph_rag(...) on the same workload):

python3 scripts/bench_graph_rag_multidepth.py \
  --num-pairs 200000 --max-depth 5 --query-count 32 --runs 3 \
  --dim 64 --ann-k 256 --top-k 32 --ef-search 128 \
  --ef-construction 64 --m 16 --build-sq8 on \
  --hop-weight 0.05 --table-scope sorted_heap_only

Depth-5 comparison (the hardest hop):

Route mode Shards hit p50 (d5) hit@1 (d5) hit@k (d5) Speedup vs monolith
monolithic 1 table 120.7 ms 81.2% 100.0% 1x
segmented all (8/8) 8 149.5 ms 81.2% 100.0% 0.8x
segmented bounded (4/8) 4 74.1 ms 93.8% 100.0% 1.6x
segmented bounded (2/8) 2 35.9 ms 96.9% 100.0% 3.4x
segmented exact (1/8) 1 17.2 ms 100.0% 100.0% 7.0x

Depth-1 comparison:

Route mode Shards hit p50 (d1) hit@1 (d1) hit@k (d1)
monolithic 1 table 49.4 ms 100.0% 100.0%
segmented all (8/8) 8 94.3 ms 100.0% 100.0%
segmented bounded (4/8) 4 46.3 ms 100.0% 100.0%
segmented bounded (2/8) 2 22.8 ms 100.0% 100.0%
segmented exact (1/8) 1 11.3 ms 100.0% 100.0%

Observations:

  • Latency scales roughly linearly with the number of shards hit.
  • Quality degrades gracefully: bounded(2/8) still reaches 96.9% hit@1 at depth 5 vs 100.0% for exact, compared to 81.2% for monolithic/all.
  • Even bounded(4/8) at half the shards is 1.6x faster than monolithic and has better quality (93.8% vs 81.2% hit@1).
  • All-shard fanout is worse than monolithic (fanout overhead dominates).
  • The win is not exact-or-nothing — bounded fanout preserves most of the benefit even with imperfect routing.
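
The roughly-linear scaling claim can be checked against the depth-5 table's own numbers with a plain least-squares fit (stdlib only):

```python
def linear_fit(xs, ys):
    """Least-squares slope and intercept for y ≈ a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# depth-5 p50 (ms) vs shards hit, from the local 1M x 64D table above
shards_hit = [1, 2, 4, 8]
p50_ms = [17.2, 35.9, 74.1, 149.5]
slope, intercept = linear_fit(shards_hit, p50_ms)
# slope ≈ 18.9 ms per shard hit; the intercept stays small,
# i.e. per-shard scan cost dominates and fixed fanout overhead is minor
```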

Bounded fanout at AWS 10M x 64D scale

Same AWS ARM64 host (4 vCPU, 8 GiB RAM, 4 GiB swap), same workload as the 10M x 64D comparison above (2M pairs x 5 hops, hop_weight=0.05, 8 shards, ann_k=256, top_k=32, ef_search=128, m=16, ef_construction=64, build_sq8=on). Build once, query-only reuse for each route mode. 4 queries, 1 run.

Build: load_data=500.7s, build_indexes=764.2s.

Query commands:

# Build once with --keep-temp --stop-after build_indexes
NUM_PAIRS=2000000 MAX_DEPTH=5 QUERY_COUNT=4 RUNS=1 DIM=64 \
  ANN_K=256 TOP_K=32 EF_SEARCH=128 EF_CONSTRUCTION=64 M=16 \
  SHARED_BUFFERS_MB=64 MAX_WAL_SIZE_GB=16 HOP_WEIGHT=0.05 \
  BUILD_SQ8=on SHARDS=8 ROUTE=exact \
  EXTRA_ARGS="--stream-copy --keep-temp --stop-after build_indexes" \
  bash scripts/bench_graph_rag_multidepth_segmented_aws.sh \
  <host> <dir> 65494

# Query-only reuse for each route mode:
ssh <host> "cd <dir> && python3 scripts/bench_graph_rag_multidepth_segmented.py \
  --port 65494 --num-pairs 2000000 --max-depth 5 --query-count 4 --runs 1 \
  --dim 64 --hop-weight 0.05 --ann-k 256 --top-k 32 --ef-search 128 \
  --ef-construction 64 --m 16 --build-sq8 on --shared-buffers-mb 64 \
  --shards 8 --route {exact,bounded,all} [--fanout {2,4}] \
  --backend-mode fresh --reuse-temp <kept_temp>"

Depth-5 comparison (monolithic baseline from prior section):

Route mode Shards hit p50 (d5) hit@1 (d5) hit@k (d5) vs monolith
monolithic 1 table 2084 ms 75% 100% 1x
segmented all (8/8) 8 2107 ms 75% 100% ~1x
segmented bounded (4/8) 4 1042 ms 75% 100% 2.0x
segmented bounded (2/8) 2 520 ms 100% 100% 4.0x
segmented exact (1/8) 1 259 ms 100% 100% 8.0x

Depth-1 comparison:

Route mode Shards hit p50 (d1) hit@1 (d1) hit@k (d1) vs monolith
monolithic 1 table 841 ms 100% 100% 1x
segmented all (8/8) 8 920 ms 100% 100% ~1x
segmented bounded (4/8) 4 473 ms 100% 100% 1.8x
segmented bounded (2/8) 2 233 ms 100% 100% 3.6x
segmented exact (1/8) 1 137 ms 100% 100% 6.1x

The bounded-fanout gradient from the local 1M point transfers cleanly to 10M:

  • Latency scales linearly with the number of shards hit at both scales.
  • bounded(2/8) at 10M is 4.0x faster than monolithic at depth 5 — close to the local 3.4x ratio.
  • Even bounded(4/8) at half the shards gives a 2.0x win.
  • Quality: hit@k stays at 100% across all modes. hit@1 varies with fanout width but stays ≥75%.
  • All-shard fanout remains ~1x monolithic (fanout overhead cancels the per-shard size reduction).

Conclusion: bounded fanout remains materially useful on this measured 10M synthetic point. A router that can narrow to 2 out of 8 candidate shards captures ~50% of the exact-routing speedup. The performance gradient is smooth on this benchmark, not a cliff.

Routing-miss tolerance on 1M x 64D

This measures what happens when the router sometimes picks the wrong shards — the correct shard is not always included in the candidate set. Uses --route bounded_recall --fanout 2 --recall-pct N where N% of queries get the correct shard and the rest get 2 adjacent wrong shards. Miss schedule is deterministic per-query via SHA-256 hash (seed=42).
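
One plausible formulation of that deterministic schedule, assuming the seed and query id are hashed together and reduced to a percentage bucket (the harness's exact hashing layout may differ):

```python
import hashlib

def correct_shard_included(query_id, recall_pct, seed=42):
    """Deterministically decide whether the router 'hits' this query.

    Hashes the (seed, query_id) pair with SHA-256 and maps it to a bucket
    in [0, 100); the query gets the correct shard iff bucket < recall_pct.
    Deterministic per query, and monotone in recall_pct by construction.
    """
    digest = hashlib.sha256(f"{seed}:{query_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < recall_pct

hits = sum(correct_shard_included(q, 90) for q in range(32))
```

Because the bucket is fixed per query, raising recall_pct only ever adds hits; this is why small query counts can land on the same hit set at two different recall settings.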

Local, 200K pairs x 5 hops, 64D, 8 shards, 32 queries, 3 runs, ann_k=256, top_k=32, ef_search=128, m=16, ef_construction=64, build_sq8=on, hop_weight=0.05:

python3 scripts/bench_graph_rag_multidepth_segmented.py \
  --num-pairs 200000 --max-depth 5 --query-count 32 --runs 3 \
  --dim 64 --ann-k 256 --top-k 32 --ef-search 128 \
  --ef-construction 64 --m 16 --build-sq8 on \
  --shards 8 --route bounded_recall --fanout 2 \
  --recall-pct {100,90,75,50,25,0} --hop-weight 0.05

Depth-5 routing-miss tolerance (fanout=2, 8 shards):

Router recall Queries hitting correct shard p50 (d5) hit@1 (d5) hit@k (d5)
100% 32/32 54.5 ms 96.9% 100.0%
90% 29/32 64.7 ms 87.5% 90.6%
75% 23/32 57.0 ms 68.8% 71.9%
50% 18/32 45.4 ms 53.1% 56.2%
25% 10/32 79.1 ms 31.2% 31.2%
0% 0/32 44.8 ms 0.0% 0.0%
(ref: monolithic) 120.7 ms 81.2% 100.0%

Observations:

  • Quality tracks router recall nearly linearly. There is no sharp cliff.
  • A router with 90% recall keeps 87.5% hit@1 — close to the monolithic baseline’s 81.2%, while remaining 2-3x faster.
  • A router with 75% recall gives 68.8% hit@1 — below monolithic quality, so the latency win comes at a quality cost.
  • Latency stays roughly constant across recall levels because the fanout (2 shards) is fixed. The router-miss doesn’t cost extra I/O — it just returns wrong answers.
  • This means: routing quality determines answer quality, not latency. The segmented latency win is structural (fewer rows per shard), but the quality win depends entirely on the router including the correct shard.

Routing-miss tolerance at AWS 10M x 64D

Same AWS ARM64 host and workload as the bounded-fanout section above. Build once (load=500.5s, build=803.5s), query-only reuse. 4 queries, 1 run, fanout=2, 8 shards.

# Build: same as bounded-fanout section
# Query-only reuse with bounded_recall:
ssh <host> "cd <dir> && python3 scripts/bench_graph_rag_multidepth_segmented.py \
  --port 65494 --num-pairs 2000000 --max-depth 5 --query-count 4 --runs 1 \
  --dim 64 --hop-weight 0.05 --ann-k 256 --top-k 32 --ef-search 128 \
  --ef-construction 64 --m 16 --build-sq8 on --shared-buffers-mb 64 \
  --shards 8 --route bounded_recall --fanout 2 --recall-pct {90,75} \
  --backend-mode fresh --reuse-temp <kept_temp>"

Depth-5 routing-miss tolerance at 10M:

Route mode Router recall p50 (d5) hit@1 (d5) hit@k (d5)
bounded optimistic 100% (4/4 hit) 519 ms 100% 100%
bounded_recall 90% (3/4 hit) 521 ms 75% 75%
bounded_recall 75% (3/4 hit) 522 ms 75% 75%
(ref: monolithic) 2084 ms 75% 100%
(ref: exact) 100% 259 ms 100% 100%

The 90% and 75% recall points produce the same result because, with only 4 queries, the deterministic SHA-256 schedule includes the correct shard for the same 3 of 4 queries at both settings. The one missed query returns wrong results, giving 75% hit@1.

Despite limited query-count resolution, the directional signal is clear:

  • Bounded(2/8) latency stays at ~520 ms regardless of recall (~4x faster than monolithic)
  • One missed query out of four drops hit@1 from 100% to 75% — matching the monolithic baseline’s own 75% hit@1
  • hit@k drops from 100% to 75% (monolithic gets 100% hit@k because its broader candidate set is more likely to include the target)

This confirms the local finding: routing quality determines answer quality, not latency. The latency win is structural and survives routing errors. At 90% router recall on this 10M point, bounded routing matches the monolithic hit@1 (75%) while staying 4x faster.

A higher query-count run would give finer crossover resolution. With only four queries, this point can show that 90% router recall still matches the monolithic hit@1, but it cannot locate the recall level at which segmented quality actually drops below the monolithic baseline.

Current AWS GraphRAG benchmark (person -> parent -> city, stable fact contract)

Repo-owned harness:

  • REMOTE_PYTHON=/path/to/python QUERY_COUNT=64 RUNS=3 NUM_PAIRS=5000 DIM=384 ANN_K=64 TOP_K=10 EF_SEARCH=128 EF_CONSTRUCTION=200 M=24 PGV_EF_SEARCH=64 ZVEC_EF=64 QDRANT_EF=64 SHARED_BUFFERS_MB=64 BACKEND_MODE=fresh ./scripts/bench_graph_rag_multihop_aws.sh <aws-host> /path/to/repo 65492

AWS ARM64 host (4 vCPU, 8 GiB RAM), deterministic fact graph, 5K chains / 10K rows total, 384D, top-10, 64 queries, 3 runs. This is the current portable stable multihop GraphRAG point for the narrow fact-shaped contract.

Method p50 latency hit@1 hit@k Notes
Heap two-hop SQL 1.088 ms 75.0% 96.9% exact rerank over expanded heap set
sorted_heap_expand_twohop_rerank() 0.952 ms 75.0% 96.9% older city-only helper
sorted_heap_graph_rag_twohop_scan() 1.012 ms 75.0% 96.9% older city-only wrapper
sorted_heap SQL pathsum baseline 1.204 ms 98.4% 98.4% same ANN seeds, hop1_distance + hop2_distance
sorted_heap_expand_twohop_path_rerank() 0.955 ms 98.4% 98.4% fused path-aware helper
sorted_heap_graph_rag_twohop_path_scan() 1.018 ms 98.4% 98.4% fused path-aware wrapper
pgvector HNSW + heap expansion 1.422 ms 85.9% 85.9% path-aware rerank, ef_search=64
zvec HNSW + heap expansion 1.720 ms 100.0% 100.0% path-aware rerank, ef=64
Qdrant HNSW + heap expansion 3.435 ms 100.0% 100.0% path-aware rerank, hnsw_ef=64

The new AWS result matches the local diagnostic cleanly: the dominant quality loss on this workload was the old hop-2-only rerank contract, not the seed ANN frontier. The path-aware helper preserves the quality gain on ARM64 with only trivial latency cost versus the older helper.

On the apples-to-apples path-aware contract, the portable frontier is now:

  • sorted_heap fastest
  • zvec and Qdrant strongest on answer quality
  • pgvector still behind on both latency and quality at this operating point

One intermediate AWS all-engines rerun temporarily dropped the sorted_heap path-aware rows to 96.9% / 96.9%. An immediate sorted_heap-only control and a second full all-engines rerun both returned the stable 98.4% / 98.4% point above, so the published table uses the confirmed rerun rather than the single outlier.

An AWS repeated-build protocol then tightened that confidence band on the same balanced 5K point. Using three independent fresh builds:

  • sorted_heap_expand_twohop_path_rerank() median 0.962 ms, range 0.956-0.965 ms, hit@1/hit@k = 98.4/98.4 on all three builds
  • sorted_heap_graph_rag_twohop_path_scan() median 1.025 ms, range 1.018-1.043 ms, hit@1/hit@k = 98.4/98.4 on all three builds
  • pgvector path-aware parity row median 1.434 ms, hit@1/hit@k 84.4-89.1
  • zvec path-aware parity row median 1.711 ms, 100.0/100.0
  • Qdrant path-aware parity row median 3.355 ms, 100.0/100.0

So on the current portable 5K point, the earlier AWS outlier now looks like an anomaly rather than a broad instability. The balanced sorted_heap path-aware rows stayed fixed across all three rebuilds.

The larger 10K-chain AWS rerun now tells a different story than the older city-only benchmark. At the same portable point:

  • heap two-hop SQL: 1.319 ms, hit@1 71.9%, hit@k 92.2%
  • city-only sorted_heap_graph_rag_twohop_scan(): 1.197 ms, hit@1 73.4%, hit@k 93.8%
  • SQL pathsum baseline: 1.436 ms, hit@1 96.9%, hit@k 98.4%
  • sorted_heap_expand_twohop_path_rerank(): 1.185 ms, hit@1 96.9%, hit@k 98.4%
  • sorted_heap_graph_rag_twohop_path_scan(): 1.212 ms, hit@1 96.9%, hit@k 98.4%

So the old larger-scale caveat now narrows materially: the main 10K loss was also the city-only rerank contract, not a fundamental collapse of the seed frontier at that scale.

An AWS repeated-build protocol then checked whether the remaining 10K difference was really a build-variance problem. Using three independent fresh builds on the same 10K path-aware point:

  • sorted_heap_expand_twohop_path_rerank() median 1.177 ms, range 1.148-1.191 ms, hit@1/hit@k = 95.3/96.9 on all three builds
  • sorted_heap_graph_rag_twohop_path_scan() median 1.236 ms, range 1.211-1.240 ms, hit@1/hit@k = 95.3/96.9 on all three builds
  • pgvector path-aware parity row median 1.667 ms, hit@1/hit@k 76.6-82.8
  • zvec path-aware parity row median 2.788 ms, 98.4/100.0
  • Qdrant path-aware parity row median 3.818 ms, 98.4/100.0

So the larger 10K AWS point is now repeated-build stable too. The remaining issue is scale frontier, not build instability: the 10K quality band is lower than 5K, but it stayed fixed across fresh builds.

An exact-seed diagnostic on the local 5K and 10K points did not improve hit@1 or hit@k versus the ANN-seeded sorted_heap helper. So on this benchmark shape, the remaining gap is not explained by ANN approximation alone. The stronger result is that seed coverage itself was identical for ANN and exact seeds: 98.4% at 5K and 96.9% at 10K. So the remaining loss is downstream of seeding. The new rerank-rank diagnostic narrows that further: at 5K, the correct city is still within the top 6 for 95% of reachable queries, and at 10K it is still within the top 3 for 95% of reachable queries. The quality drop is therefore driven by a few severe outliers (max rank 17 at 5K, 20 at 10K), not by a broad collapse.

A path-aware SQL rerank baseline then changed the picture materially. Keeping the same ANN seeds and the same two-hop expansion, but scoring candidates as hop1_distance + hop2_distance, moved the local balanced points to:

  • 5K: 0.957 ms, hit@1 98.4%, hit@k 98.4%
  • 10K: 1.179 ms, hit@1 95.3%, hit@k 96.9%
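
The contract difference is easy to illustrate on toy distances. The candidate tuples below are illustrative, not the helper's actual row shape:

```python
def rerank(candidates, path_aware):
    """Order two-hop candidates by rerank score.

    candidates: list of (answer, hop1_distance, hop2_distance).
    The old contract scored by hop2 distance only; the path-aware contract
    scores by hop1_distance + hop2_distance, penalizing weak first hops.
    """
    key = (lambda c: c[1] + c[2]) if path_aware else (lambda c: c[2])
    return [c[0] for c in sorted(candidates, key=key)]

# a wrong-branch candidate with a lucky second hop vs the true path
cands = [
    ("paris",  0.10, 0.12),  # correct chain: both hops close
    ("berlin", 0.90, 0.05),  # wrong seed branch, accidentally close hop 2
]
rerank(cands, path_aware=False)  # -> ["berlin", "paris"] (old contract loses)
rerank(cands, path_aware=True)   # -> ["paris", "berlin"]
```

This is exactly the failure mode the hop-2-only contract exposed: a candidate reachable through a poor first hop can still win on second-hop distance alone.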

That branch is now implemented in the extension and verified on both local and AWS ARM64 runs. The fused path-aware helper measured:

  • local 5K: 0.726 ms, hit@1 98.4%, hit@k 98.4%
  • local 10K: 0.823 ms, hit@1 95.3%, hit@k 96.9%
  • AWS 5K: 0.955 ms, hit@1 98.4%, hit@k 98.4%
  • AWS 10K: 1.185 ms, hit@1 96.9%, hit@k 98.4%

So the current strongest portable GraphRAG result is no longer the SQL baseline or the old city-only helper. It is the fused path-aware helper.

Current real code-corpus GraphRAG reference benchmark (cogniformerus CrossFile)

Repo-owned harnesses:

  • python3 scripts/bench_graph_rag_code_corpus.py --runs 3 --backend-mode fresh --ann-k 16 --top-k 4
  • python3 scripts/repeat_graph_rag_code_corpus_builds.py --repeats 3 --runs 3 --backend-mode fresh
  • REMOTE_PYTHON=/path/to/python REPEATS=3 RUNS=3 BACKEND_MODE=fresh bash scripts/repeat_graph_rag_code_corpus_builds_aws.sh <aws-host> /path/to/repo 65320

This benchmark uses the actual cogniformerus source tree (40 files, 840 rows after chunk + summary expansion) and the real CrossFile prompts from butler_code_test.cr. The current real-corpus conclusion is not a single universal winner. The frontier splits by embedding mode:

Mode Best case Local repeated-build p50 AWS repeated-build p50 Keyword coverage Full hits Notes
generic prompt_summary_snippet_py 0.613 ms 0.955 ms 100.0% 100.0% symbol-aware variant is strictly slower with no quality gain
code-aware prompt_symbol_summary_snippet_py 0.963 ms 1.541 ms 100.0% 100.0% exact prompt-symbol rescue is required in summary seeding
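
The coverage metrics are not formally defined above. One plausible formulation, assuming keyword coverage counts expected keywords found in the returned text and a full hit is a prompt with every keyword covered (the harness's actual scoring may differ):

```python
def score_prompts(results):
    """Score retrieval output under a simple keyword-coverage metric.

    results: list of (expected_keywords, returned_text) per prompt.
    Returns (keyword_coverage, full_hits) as fractions in [0, 1].
    """
    covered = expected = full = 0
    for keywords, text in results:
        hit = [k for k in keywords if k.lower() in text.lower()]
        covered += len(hit)
        expected += len(keywords)
        full += len(hit) == len(keywords)
    return covered / expected, full / len(results)

results = [
    (["butler", "dispatch"], "Butler dispatch loop in butler.cr"),
    (["route", "shard"], "the router picks one shard"),
    (["codec"], "no relevant text returned"),
]
coverage, full_hits = score_prompts(results)
# coverage = 4/5, full_hits = 2/3
```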

The most important diagnostic result was the old code-aware miss:

  • prompt_summary_snippet_py on facts_sh
    • local repeated-build: 97.6% keyword coverage, 83.3% full hits
    • AWS repeated-build: 97.6%, 83.3%
  • prompt_symbol_summary_snippet_py on facts_sh
    • local repeated-build: 100.0%, 100.0%
    • AWS repeated-build: 100.0%, 100.0%

So the code-aware quality win is now both repeated-build stable and cross-environment stable. The change in winner is not a local-only artifact.

Larger in-repo cogniformerus transfer gate

The smaller 40-file code-corpus slice above is a useful stable benchmark, but it is not the only in-repo transfer check anymore. The same repeated-build protocol was rerun on the full cogniformerus repository (183 Crystal files), still using the real CrossFile prompts from butler_code_test.cr.

Control point at the old tiny-budget contract (top_k=4, 1 fresh build):

Mode Best case Local p50 Keyword coverage Full hits Avg returned rows Notes
generic prompt_summary_snippet_py 0.770 ms 87.1% 66.7% 3.67 larger corpus exposes a result-budget cliff
code-aware prompt_symbol_summary_snippet_py 1.824 ms 87.6% 66.7% 4.00 same cliff under code-aware embeddings

Bounded recovery point (top_k=8, 3 fresh builds):

Mode Best case Local repeated-build p50 Keyword coverage Full hits Avg returned rows Notes
generic prompt_summary_snippet_py 0.819 ms 100.0% 100.0% 6.33 larger in-repo Crystal transfer now verified
code-aware prompt_symbol_summary_snippet_py 1.814 ms 100.0% 100.0% 7.50 same winner, but needs the larger final budget

Interpretation:

  • the current real code-corpus contracts do transfer beyond the tiny 40-file slice
  • the dominant larger-corpus issue on the in-repo Crystal side is result budget, not a new retrieval failure
  • this larger-corpus Crystal gate is now covered and serves as a transfer check for the benchmark-side code-retrieval logic, not as the stable release contract for GraphRAG

Mixed-language external code-corpus GraphRAG reference benchmark (pycdc)

The code-corpus harness now also supports:

  • JSON question fixtures
  • configurable source extensions
  • quoted local #include "..." dependency edges for C/C++ corpora
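
The quoted-include edge extraction can be sketched with a small regex. The file names below are illustrative, and the harness's actual parser may differ:

```python
import re

# matches only the quoted local form, not <...> system includes
_INCLUDE_RE = re.compile(r'^\s*#\s*include\s+"([^"]+)"', re.MULTILINE)

def local_include_edges(path, source):
    """Return (including_file, included_file) dependency edges for C/C++.

    Only quoted local includes become graph edges; angle-bracket includes
    point outside the corpus and are deliberately skipped.
    """
    return [(path, inc) for inc in _INCLUDE_RE.findall(source)]

src = '#include <cstdio>\n#include "ASTree.h"\n  # include "bytecode.h"\n'
local_include_edges("pycdc.cpp", src)
# -> [("pycdc.cpp", "ASTree.h"), ("pycdc.cpp", "bytecode.h")]
```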

The first mixed-language adversary corpus was pycdc, using a repo-owned fixture in scripts/fixtures/graph_rag_pycdc_questions.json. This run used the real pycdc source tree (138 files, 1281 rows after chunk + summary expansion, 72 local dependency edges) and 3 fresh builds at top_k=8.

Mode Best case Local repeated-build p50 Keyword coverage Full hits Avg returned rows Notes
generic prompt_symbol_summary_snippet_py 0.850 ms 90.0% 60.0% 6.40 fastest mixed-language point, but it does not close the corpus
code-aware prompt_compactseed_require_summary_snippet_fn 8.006 ms 100.0% 100.0% 5.80 helper-backed compact lexical seed + one-hop include rescue closes the corpus

Interpretation:

  • the mixed-language gate is now covered for a real ~/Projects/C corpus
  • the result split is sharper than on the Crystal corpora:
    • the fast generic path plateaus below perfect quality
    • the code-aware include-rescue path closes the corpus, but at a much higher latency
  • so ~/Projects/C is now covered as part of the broader code-corpus reference matrix, even though the fastest generic point remains partial

Archive-side code-corpus GraphRAG reference benchmark (ninja/src)

The same widened harness was then pointed at an archive-side corpus under ~/SrcArchives: apple/ninja/src. This run used a second repo-owned fixture in scripts/fixtures/graph_rag_ninja_questions.json and the local include graph inside ninja/src (103 files, 1757 rows after chunk + summary expansion, 282 dependency edges).

Initial smoke at top_k=8:

Mode Best fast case Local p50 Keyword coverage Full hits Notes
generic prompt_summary_snippet_py 0.898 ms 95.0% 80.0% already close without rescue
code-aware prompt_summary_snippet_py 0.928 ms 85.0% 80.0% code-aware mode is weaker on this corpus

Bounded budget probe (top_k=12, 3 fresh builds):

Mode Best case Local repeated-build p50 Keyword coverage Full hits Avg returned rows Notes
generic prompt_summary_snippet_py 0.914 ms 100.0% 100.0% 7.80 archive-side gate closes with a small result-budget bump
code-aware prompt_summary_snippet_py 0.871 ms 85.0% 80.0% 7.60 still not the winner on this corpus

Interpretation:

  • the ~/SrcArchives side is now covered by a real repeated-build gate
  • unlike pycdc, this archive corpus does not need a dependency-rescue contract to close
  • the winner is the simple generic summary-snippet path at a slightly larger final budget (top_k=12)
  • the larger real-corpus verification matrix for the narrow 0.13 fact-graph release now spans:
    • ~/Projects/Crystal
    • ~/Projects/C
    • ~/SrcArchives

External folding stress corpus for GraphRAG reference logic (folding/src)

The same harness was then pointed at a second real code corpus outside this repository: folding/src with prompts from butler_folding_test.cr. This is not the primary publishable frontier for the repository, but it is a strong adversary corpus because it falsifies overfit retrieval contracts quickly.

Current repeated-build result:

Mode Case Local repeated-build p50 AWS repeated-build p50 Keyword coverage Full hits Notes
generic prompt_summary_snippet_py 1.048 ms 1.540 ms 90.5% 83.3% fast baseline drifts below perfect quality on this corpus
generic prompt_compactseed_require_summary_snippet_fn 5.940 ms 8.839 ms 100.0% 100.0% compact lexical seed table + helper-backed one-hop REQUIRES_FILE rescue
generic prompt_lexseed_require_summary_snippet_fn 28.266 ms 41.960 ms 100.0% 100.0% historical full-summary lexical rescue, now dominated
code-aware prompt_summary_snippet_py 1.080 ms 1.775 ms 79.8% 66.7% worse baseline than the primary cogniformerus corpus
code-aware prompt_compactseed_require_summary_snippet_fn 5.804 ms 8.392 ms 100.0% 100.0% compact lexical seed table + helper-backed one-hop REQUIRES_FILE rescue
code-aware prompt_lexseed_require_summary_snippet_fn 36.676 ms 60.457 ms 100.0% 100.0% historical full-summary lexical rescue, now dominated

Interpretation:

  • the external folding miss was a real seed-selection problem, not a snippet extraction bug
  • the rescue is now verified on both local Apple Silicon and AWS ARM64
  • the current documented external rescue is no longer the old full-summary lexical path; it is the compact lexical-seed table variant
  • compact lexical seeding keeps 100.0% / 100.0% while cutting the old rescue by about 4.8x locally and 4.7-7.2x on AWS, depending on mode
  • an isolated local timing split shows the helper-backed rescue is still dominated by lexical-seed + REQUIRES_FILE fetch work (~10.7-11.0 ms/query) with snippet postprocess as a secondary cold-start cost (~7.7-8.0 ms/query)
  • the old full-summary lexical rescue remains useful as a diagnostic, but it is no longer the external default frontier
  • even the compact rescue is still slower than the primary in-repo winners, so it does not replace them as the default GraphRAG contract

Legacy/manual IVF-PQ benchmark

The sections below are still useful for the explicit IVF-PQ API (svec_ann_scan), but they are no longer the default ANN baseline for the repository. Those measurements target the legacy/manual vector path, not the planner-integrated sorted_hnsw Index AM.

All IVF-PQ benchmarks below use svec_ann_scan (C-level) with residual PQ. 1 Gi k8s pod, PostgreSQL 18.

103K vectors, 2880-dim (Gutenberg corpus)

Residual PQ (M=720, dsub=4), 256 IVF partitions. 100 cross-queries (self-match excluded):

Config R@1 Recall@10 Avg latency
nprobe=1, PQ-only 54% 48% 5.5 ms
nprobe=3, PQ-only 79% 71% 8 ms
nprobe=3, rerank=96 82% 74% 10 ms
nprobe=5, rerank=96 89% 86% 12 ms
nprobe=10, rerank=200 97% 94% 22 ms

Self-query (vector in dataset): R@1 = 100% at nprobe=3 / 8 ms.

10K vectors, 2880-dim (float32 precision test)

Same corpus, pure svec (float32), nlist=64, M=720 residual PQ. 100 cross-queries:

Config R@1 Recall@10
nprobe=1, PQ-only 56% 56%
nprobe=3, PQ-only 72% 82%
nprobe=5, rerank=96 93% 93%
nprobe=10, rerank=200 99% 99%

float32 vs float16 precision impact

Tested the same 10K Gutenberg vectors in two configurations:

  • float32 (svec): native 32-bit storage, independently trained codebooks
  • float16-degraded (hsvec): svec → hsvec → svec roundtrip, independently trained

Result: no measurable recall difference. Float16 precision loss (~1e-7) is 1000× smaller than typical distance gaps between consecutive neighbors (~1e-4). The recall bottleneck is PQ quantization and IVF routing, not input precision. This confirms hsvec is a safe storage choice for ANN workloads.
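
The underlying precision bound can be demonstrated with a stdlib half-precision round trip (struct format 'e'). This shows the generic IEEE float16 relative-error ceiling, not the hsvec storage path itself; per-component absolute loss then depends on component magnitude:

```python
import struct

def f16_roundtrip(x):
    """Round-trip a Python float through IEEE half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# half precision keeps 11 significant bits, so for normal values the
# relative rounding error is bounded by 2**-11 (~4.9e-4)
vals = [0.0123, -0.4567, 0.9991, 3.1415]
max_rel = max(abs(f16_roundtrip(v) - v) / abs(v) for v in vals)
assert max_rel <= 2 ** -11
```

Whether that loss matters is a ratio question: it has to be compared against the distance gaps between competing neighbors, which is exactly the comparison the paragraph above makes.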

CRUD performance (500K rows, svec(128), prepared mode)

Operation eager (% of heap TPS) lazy (% of heap TPS) Notes
SELECT PK 85% 85% Index Scan via btree
SELECT range 1K 97% Custom Scan pruning (eager only)
Bulk INSERT 100% 100% Always eager
DELETE + INSERT 63% 63% INSERT always eager
UPDATE non-vec 46% 100% Lazy skips zone map flush
UPDATE vec col 102% 100% Parity both modes
Mixed OLTP 83% 97% Near-parity with lazy

Eager mode (default) maintains zone maps on every UPDATE for scan pruning. Lazy mode (sorted_heap.lazy_update = on) trades scan pruning for UPDATE parity with heap. Compact/merge restores pruning.
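
A minimal model of the zone-map pruning that eager mode preserves, assuming per-block min/max summaries over a sorted key column (not the extension's on-disk format):

```python
def build_zone_map(rows, block_size):
    """Per-block (min, max, offset) summaries over a key column."""
    return [(min(rows[i:i + block_size]),
             max(rows[i:i + block_size]), i)
            for i in range(0, len(rows), block_size)]

def blocks_for_range(zones, lo, hi):
    """Return offsets of blocks whose [min, max] overlaps [lo, hi]."""
    return [off for mn, mx, off in zones if mx >= lo and mn <= hi]

rows = list(range(100))            # sorted key column
zones = build_zone_map(rows, 10)   # 10 blocks of 10 rows
blocks_for_range(zones, 42, 44)    # -> [40]: one block read, nine pruned
```

An UPDATE that widens a block's min/max without refreshing the summary makes pruning unsound, which is why eager mode pays on every UPDATE and lazy mode gives the pruning up until compact/merge.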

Self-query vs cross-query

Self-query: query vector is in the dataset (typical RAG case — you embedded documents, now you search them). The vector is always found as its own closest neighbor, so R@1 = 100%.

Cross-query: query vector is NOT in the dataset (e.g., user question embedded at search time). R@1 depends on nprobe and PQ fidelity.

When comparing benchmarks, verify whether self-match is included or excluded. The tables above use cross-query (self-match excluded) for honest comparison.
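
The exclusion protocol from the methodology notes can be sketched with a brute-force cosine scan, mirroring the "request k+1, drop the self row" step:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def topk_excluding_self(query_id, dataset, k):
    """Brute-force top-k with the self-match excluded.

    Asks for k+1 neighbors and drops the query's own row, so R@1
    measures a real nearest neighbor rather than the trivial self hit.
    """
    ranked = sorted(range(len(dataset)),
                    key=lambda i: cosine_distance(dataset[query_id],
                                                  dataset[i]))
    return [i for i in ranked[:k + 1] if i != query_id][:k]

ds = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
topk_excluding_self(0, ds, 1)  # -> [1], not the trivial [0]
```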


Methodology notes

  • EXPLAIN ANALYZE: warm cache (pg_prewarm), average of 5 runs, actual execution time + buffer reads reported
  • pgbench: 10 s runtime, 1 client, includes pgbench overhead (connection management, query dispatch); useful for relative throughput comparison
  • INSERT: COPY path via INSERT ... SELECT generate_series()
  • Compact time: wall-clock time for sorted_heap_compact() on warm data
  • Vector search: 100 random queries from the dataset, self-match excluded by requesting lim := 11 and taking positions 2–11. Ground truth via exact brute-force cosine (<=> operator). Latency measured via clock_timestamp() per-query in PL/pgSQL loop (20 queries, warm cache)
  • Current local sorted_hnsw comparison: deterministic synthetic 10K x 384D corpus via scripts/bench_sorted_hnsw_vs_pgvector.sh, 3 fresh builds for PostgreSQL methods, median p50 / median recall reported; Qdrant via scripts/bench_qdrant_synthetic.py, 3 warm measurement passes on one local Docker collection; zvec via scripts/bench_zvec_synthetic.py, 3 warm measurement passes on one local in-process collection
  • Current local real-dataset sample: scripts/bench_ann_real_dataset.py on ANN-Benchmarks nytimes-256-angular, sampled to 10K base vectors and 20 queries. Ground truth is exact PostgreSQL svec heap search on the sampled corpus. Numbers above are medians across 3 full harness runs.