Design: sorted_hnsw Shared Scan Cache
Status
Implemented in:
- f33d620 perf: share decoded sorted_hnsw scan cache across backends
- 27e77a3 perf: cost sorted_hnsw shared-cache startup separately
The original upper-level-only payload was a useful falsifier, but it only cut fresh 50k/384D scans by about 10% and missed the bar. The implemented version therefore promotes the shared slot to a full decoded immutable scan snapshot.
Problem
sorted_hnsw already keeps a decoded scan cache per backend in src/sorted_hnsw.c, keyed by {relid, relfilenode, cache_gen}.
That solves repeated scans inside one backend, but every new backend still pays to rebuild the same immutable scan-side state:
- SQ8 min/scale arrays
- upper-level adjacency
- upper-level nid -> index lookup tables
- scan-cache control structs
The new fixed-graph harness can now measure the gap directly:
- fresh mode: one ordered scan per new backend
- reuse mode: one warmup query, then repeated scans in a single backend
Representative pre-shared-cache 384D result:
- fresh avg_ms = 0.5803
- reuse avg_ms = 0.3007
So roughly half of the observed fresh-backend latency, (0.5803 - 0.3007) / 0.5803 ≈ 48%, is still startup work, not the search itself.
Goal
Reduce cross-backend cold-start cost for ordered sorted_hnsw scans without changing search semantics or weakening crash safety.
Non-goals
- Do not change on-disk page layout.
- Do not make writable-path behavior depend on shared-memory mutation.
- Do not share exact rerank heap/TOAST state.
- Do not start with a general-purpose multi-index cache manager if a smaller falsifier can prove or kill the idea first.
Current architecture
Today each backend does:
- Read metapage.
- Read SQ8 auxiliary pages.
- Decode upper levels into backend-local palloc() allocations.
- Build upper-level lookup tables.
- Keep L0 decoded lazily per backend.
That means fresh mode repeatedly duplicates the same immutable decode work.
Implemented Phase A: single-slot shared immutable cache
Start with one experimental shared cache slot, effective only when the extension is loaded via shared_preload_libraries.
Why a single slot first:
- it is enough to validate the measured bottleneck on the current benchmark
- it keeps invalidation and memory budgeting simple
- it gives a hard falsifier before building a more complex multi-slot cache
What to share
Share immutable decoded scan-side data:
- SQ8 mins
- SQ8 scales
- decoded L0 node headers
- decoded L0 neighbor slab
- decoded L0 SQ8 vectors
- upper-level adjacency
- upper-level nid -> index arrays
- immutable header fields: M, dim, n_nodes, entry_nid, max_level, cache_gen, relid, locator
Keep these backend-local:
- L0 decoded pages and neighbor arrays
- visited bitsets
- candidate/result scratch
- writable-path insert helpers
- rerank heap/slot state
Why this split
The upper-level-only split was safe but insufficient. The dominant remaining fresh-backend tax was the decoded L0 snapshot itself, so the implemented slot stores the full immutable decoded graph and relies on cache_gen invalidation to keep writable-path updates from reusing stale state.
Shared memory layout
Use the existing shared-memory hook pattern already present in src/sorted_heap_scan.c.
Add:
- one control struct in main shared memory
- one fixed-size payload arena reserved at startup
- one LWLock protecting publish/replace of the slot
Use offsets inside the payload arena rather than raw palloc()-style pointer ownership.
Control struct sketch
typedef struct ShnswSharedScanCacheCtl
{
    slock_t        mutex_or_lwlock;
    bool           valid;
    Oid            relid;
    RelFileLocator locator;
    uint64         cache_gen;
    int32          dim;
    int16          M;
    int16          max_level;
    int32          n_nodes;
    int32          entry_nid;
    Size           payload_bytes;
    Size           sq8_mins_off;
    Size           sq8_scales_off;
    Size           upper_entries_off[SHNSW_MAX_LEVELS];
    Size           upper_nbr_idx_off[SHNSW_MAX_LEVELS];
    int32          upper_count[SHNSW_MAX_LEVELS];
} ShnswSharedScanCacheCtl;
Initial payload budget
Start with a fixed budget just for the experimental slot.
Current payload budget: 64 MB.
If the decoded snapshot does not fit, publication bails out and the backend falls back to the normal private decode path.
Backend behavior
On scan-cache lookup:
- Read cache_gen from the metapage.
- Check the shared slot under lock.
- If {relid, locator, cache_gen} match and payload is present:
  - attach backend-local lightweight wrappers to shared immutable arrays
  - skip SQ8/upper decode
- Otherwise:
  - build the normal backend-local cache
  - materialize any deferred L0 pages
  - if the shared slot is empty or stale and the payload fits, publish the decoded immutable snapshot into shared memory
On insert, vacuum tombstone, rebuild, relfilenode swap, or reindex:
- the metapage already bumps cache_gen
- the next backend lookup sees the mismatch and treats the shared slot as stale
- stale slots are replaced lazily on the next eligible scan
Correctness constraints
- the shared cache must never survive a cache_gen mismatch
- any cache_gen mismatch forces a miss
- payload publication must be all-or-nothing
- fallback to backend-local decode must remain correct when shmem is disabled, full, or stale
Falsifier
The branch cleared the original gate:
- fresh 50k/384D fixed-graph average:
  - shared_cache=off -> 1.9213 ms
  - shared_cache=on -> 0.6337 ms
- warm same-backend 50k/384D:
  - shared_cache=off -> 0.2180 ms
  - shared_cache=on -> 0.0917 ms
- make installcheck green
- scripts/test_sorted_hnsw_crash_recovery.sh green
- fresh-backend insert invalidation check:
  - nearest before insert: 833:0.000258
  - nearest after insert: 2001:0.000000
That is enough to keep the branch and stop treating it as speculative.
Validation plan
Use the new benchmark split from scripts/bench_sorted_hnsw_fixed_graph.sh:
- fresh: cross-backend cold-start cost
- reuse: steady-state search cost
Required checks:
- make installcheck REGRESS='pg_sorted_heap sorted_hnsw'
- ./scripts/test_sorted_hnsw_crash_recovery.sh /tmp <port>
- ./scripts/bench_sorted_hnsw_fixed_graph.sh /tmp <port> 2000 8 384 fresh on off
- ./scripts/bench_sorted_hnsw_fixed_graph.sh /tmp <port> 2000 8 384 fresh on on
- ./scripts/bench_sorted_hnsw_fixed_graph.sh /tmp <port> 2000 8 384 reuse on off
- ./scripts/bench_sorted_hnsw_fixed_graph.sh /tmp <port> 2000 8 384 reuse on on
If Phase A works
Only then consider:
- more than one shared slot
- LRU/size-based replacement
- optional shared L0 decode for read-mostly indexes
- explicit observability counters for shared hits/misses/publishes
If Phase A fails
Stop local cache-manager work and accept that the next performance branch is either:
- a larger shared-cache redesign with DSA/DSM, or
- planner/policy work that avoids paying fresh-backend cost in short-lived sessions