Design: sorted_hnsw Shared Scan Cache

Status

Implemented in:

  • f33d620 perf: share decoded sorted_hnsw scan cache across backends
  • 27e77a3 perf: cost sorted_hnsw shared-cache startup separately

The original upper-level-only payload was a useful falsifier, but it only cut fresh 50k/384D scans by about 10% and missed the bar. The implemented version therefore promotes the shared slot to a full decoded immutable scan snapshot.

Problem

sorted_hnsw already keeps a decoded scan cache per backend in src/sorted_hnsw.c, keyed by {relid, relfilenode, cache_gen}.

That solves repeated scans inside one backend, but every new backend still pays to rebuild the same immutable scan-side state:

  • SQ8 min/scale arrays
  • upper-level adjacency
  • upper-level nid -> index lookup tables
  • scan-cache control structs

The new fixed-graph harness can now measure the gap directly:

  • fresh mode: one ordered scan per new backend
  • reuse mode: one warmup query, then repeated scans in a single backend

Representative pre-shared-cache 384D result:

  • fresh avg_ms = 0.5803
  • reuse avg_ms = 0.3007

So roughly half (about 48%: 0.28 ms of the 0.58 ms fresh average) of the observed fresh-backend latency is still startup work, not the search itself.

Goal

Reduce cross-backend cold-start cost for ordered sorted_hnsw scans without changing search semantics or weakening crash safety.

Non-goals

  • Do not change on-disk page layout.
  • Do not make writable-path behavior depend on shared-memory mutation.
  • Do not share exact rerank heap/TOAST state.
  • Do not start with a general-purpose multi-index cache manager if a smaller falsifier can prove or kill the idea first.

Current architecture

Today each backend does:

  1. Read metapage.
  2. Read SQ8 auxiliary pages.
  3. Decode upper levels into backend-local palloc() allocations.
  4. Build upper-level lookup tables.
  5. Decode L0 lazily into backend-local memory.

That means fresh mode repeatedly duplicates the same immutable decode work.

Implemented Phase A: single-slot shared immutable cache

Start with one experimental shared cache slot, effective only when the extension is loaded via shared_preload_libraries.

Why a single slot first:

  • it is enough to validate the measured bottleneck on the current benchmark
  • it keeps invalidation and memory budgeting simple
  • it gives a hard falsifier before building a more complex multi-slot cache

What to share

Share immutable decoded scan-side data:

  • SQ8 mins
  • SQ8 scales
  • decoded L0 node headers
  • decoded L0 neighbor slab
  • decoded L0 SQ8 vectors
  • upper-level adjacency
  • upper-level nid -> index arrays
  • immutable header fields: M, dim, n_nodes, entry_nid, max_level, cache_gen, relid, locator

Keep these backend-local:

  • L0 pages and neighbor arrays decoded on the private fallback path (when the shared slot is absent or stale)
  • visited bitsets
  • candidate/result scratch
  • writable-path insert helpers
  • rerank heap/slot state

Why this split

The upper-level-only split was safe but insufficient. The dominant remaining fresh-backend tax was the decoded L0 snapshot itself, so the implemented slot stores the full immutable decoded graph and relies on cache_gen invalidation to keep writable-path updates from reusing stale state.
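A minimal standalone sketch of that split, with hypothetical names and plain malloc() standing in for backend-local allocation: on a shared-cache hit the backend-local view merely aliases the immutable shared arrays, while the fallback path owns a private copy, and only the owning view ever frees.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical backend-local view of the scan cache: pointers either
 * alias the shared immutable arrays (attach path) or own a private
 * decode (fallback path). Only the owning view frees anything. */
typedef struct ScanView
{
    const float *sq8_mins;
    const float *sq8_scales;
    bool         owns_memory;
} ScanView;

/* Attach path: no allocation, no copy, just aliasing shared arrays. */
static ScanView
attach_shared(const float *mins, const float *scales)
{
    ScanView v = { mins, scales, false };
    return v;
}

/* Fallback path: private decode, simulated here with a copy. */
static ScanView
build_private(const float *mins, const float *scales, int dim)
{
    float *m = malloc(sizeof(float) * dim);
    float *s = malloc(sizeof(float) * dim);

    memcpy(m, mins, sizeof(float) * dim);
    memcpy(s, scales, sizeof(float) * dim);
    {
        ScanView v = { m, s, true };
        return v;
    }
}

static void
release(ScanView *v)
{
    if (v->owns_memory)
    {
        free((void *) v->sq8_mins);
        free((void *) v->sq8_scales);
    }
}
```

The `owns_memory` flag is what keeps visited bitsets, scratch, and writable-path state safely backend-local while the immutable arrays stay shared.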

Shared memory layout

Use the existing shared-memory hook pattern already present in src/sorted_heap_scan.c.

Add:

  • one control struct in main shared memory
  • one fixed-size payload arena reserved at startup
  • one LWLock protecting publish/replace of the slot

Use offsets inside the payload arena rather than raw palloc()-style pointer ownership.
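A standalone sketch of the offset discipline, assuming hypothetical names and a static buffer standing in for the shmem arena: the control struct records byte offsets, and each backend resolves them against its own base address.

```c
#include <stddef.h>

/* 1 KB stand-in for the fixed shmem payload arena reserved at startup. */
static char demo_arena[1024];

/* Resolve an arena offset against this backend's base address. Storing
 * offsets (not pointers) in the control struct keeps the payload valid
 * for every backend, regardless of where the segment is mapped. */
static void *
shm_off2ptr(char *base, size_t off)
{
    return base + off;
}
```

Any backend mapping the arena at a different address computes the same logical slots from the same offsets, which is exactly why raw pointer ownership is off the table.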

Control struct sketch

typedef struct ShnswSharedScanCacheCtl
{
    LWLock     *lock;           /* protects publish/replace of the slot */
    bool        valid;          /* set last, once a snapshot is fully published */

    /* identity: all three must match the metapage, or the slot is stale */
    Oid         relid;
    RelFileLocator locator;
    uint64      cache_gen;

    /* immutable header fields copied from the metapage */
    int32       dim;
    int16       M;
    int16       max_level;
    int32       n_nodes;
    int32       entry_nid;

    /* offsets into the fixed payload arena; never raw pointers */
    Size        payload_bytes;
    Size        sq8_mins_off;
    Size        sq8_scales_off;
    Size        l0_headers_off;     /* decoded L0 node headers */
    Size        l0_neighbors_off;   /* decoded L0 neighbor slab */
    Size        l0_vectors_off;     /* decoded L0 SQ8 vectors */
    Size        upper_entries_off[SHNSW_MAX_LEVELS];
    Size        upper_nbr_idx_off[SHNSW_MAX_LEVELS];
    int32       upper_count[SHNSW_MAX_LEVELS];
} ShnswSharedScanCacheCtl;

Initial payload budget

Start with a fixed budget just for the experimental slot.

Current payload budget: 64 MB.

If the decoded snapshot does not fit, publication bails out and the backend falls back to the normal private decode path.

Backend behavior

On scan-cache lookup:

  1. Read cache_gen from the metapage.
  2. Check the shared slot under lock.
  3. If {relid, locator, cache_gen} match and payload is present:
    • attach backend-local lightweight wrappers to shared immutable arrays
    • skip SQ8/upper decode
  4. Otherwise:
    • build the normal backend-local cache
    • materialize any deferred L0 pages
    • if the shared slot is empty or stale and the payload fits, publish the decoded immutable snapshot into shared memory
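The identity check in step 3 can be sketched as a pure predicate, with hypothetical condensed types (the real check compares the full RelFileLocator):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical condensed slot identity: the fields a lookup compares
 * against the metapage before trusting the shared payload. */
typedef struct SlotId
{
    bool     valid;
    uint32_t relid;
    uint64_t relfilenumber;   /* stands in for the full RelFileLocator */
    uint64_t cache_gen;
} SlotId;

/* Attach only when every identity field matches what the backend just
 * read from the metapage; anything else falls through to step 4. */
static bool
slot_matches(const SlotId *slot, uint32_t relid,
             uint64_t relfilenumber, uint64_t cache_gen)
{
    return slot->valid
        && slot->relid == relid
        && slot->relfilenumber == relfilenumber
        && slot->cache_gen == cache_gen;
}
```

Because the predicate is conjunctive, every miss reason (empty slot, different index, bumped generation) takes the same fallback path.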

On insert, vacuum tombstone, rebuild, relfilenode swap, or reindex:

  • metapage already bumps cache_gen
  • next backend lookup sees mismatch and treats the shared slot as stale
  • stale slots are replaced lazily on the next eligible scan
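The lazy-replacement flow above can be sketched as a toy single-slot simulation (all names hypothetical): writers only bump the metapage generation and never touch shared memory; the stale slot sits untouched until the next scan notices the mismatch.

```c
#include <stdbool.h>
#include <stdint.h>

static uint64_t metapage_gen = 5;   /* authoritative, bumped by writers */
static uint64_t slot_gen = 5;       /* generation of published snapshot */
static bool     slot_valid = true;

static void
writer_insert(void)
{
    metapage_gen++;                 /* no shared-memory mutation here */
}

/* Returns true on a shared-cache hit, false when the scan had to do a
 * private rebuild and lazily republish the slot. */
static bool
scan_lookup(void)
{
    if (slot_valid && slot_gen == metapage_gen)
        return true;
    slot_gen = metapage_gen;        /* rebuild + republish */
    slot_valid = true;
    return false;
}
```

Keeping writers out of shared memory is what preserves the "no writable-path shmem mutation" non-goal.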

Correctness constraints

  • any cache_gen mismatch forces a miss; the shared cache must never serve a stale snapshot
  • payload publication must be all-or-nothing
  • fallback to backend-local decode must remain correct when shmem is disabled, full, or stale
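The all-or-nothing rule, together with the budget fallback, can be sketched like this (hypothetical names; the exclusive lock and the real 64 MB arena are elided in favor of a tiny fixed buffer):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy slot with a tiny fixed budget; the real arena is 64 MB. */
typedef struct PubSlot
{
    bool     valid;
    uint64_t cache_gen;
    size_t   payload_bytes;
    char     payload[64];
} PubSlot;

/* All-or-nothing publication: check the fit first, invalidate, copy
 * the whole snapshot, record identity, and set valid only after
 * everything landed. Oversized snapshots are rejected up front, so the
 * caller simply keeps its private backend-local cache. */
static bool
publish(PubSlot *slot, const char *snap, size_t len, uint64_t gen)
{
    if (len > sizeof(slot->payload))
        return false;               /* over budget: fall back */
    slot->valid = false;
    memcpy(slot->payload, snap, len);
    slot->payload_bytes = len;
    slot->cache_gen = gen;
    slot->valid = true;             /* last store makes it visible */
    return true;
}
```

Note that a rejected publication returns before touching the slot, so an already-published older snapshot stays intact and usable.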

Falsifier

The branch cleared the original gate:

  • fresh 50k/384D fixed-graph average:
    • shared_cache=off -> 1.9213 ms
    • shared_cache=on -> 0.6337 ms
  • warm same-backend 50k/384D:
    • shared_cache=off -> 0.2180 ms
    • shared_cache=on -> 0.0917 ms
  • make installcheck green
  • scripts/test_sorted_hnsw_crash_recovery.sh green
  • fresh-backend insert invalidation check:
    • nearest before insert 833:0.000258
    • nearest after insert 2001:0.000000

That is enough to keep the branch and stop treating it as speculative.

Validation plan

Use the new benchmark split from scripts/bench_sorted_hnsw_fixed_graph.sh:

  • fresh: cross-backend cold-start cost
  • reuse: steady-state search cost

Required checks:

  • make installcheck REGRESS='pg_sorted_heap sorted_hnsw'
  • ./scripts/test_sorted_hnsw_crash_recovery.sh /tmp <port>
  • ./scripts/bench_sorted_hnsw_fixed_graph.sh /tmp <port> 2000 8 384 fresh on off
  • ./scripts/bench_sorted_hnsw_fixed_graph.sh /tmp <port> 2000 8 384 fresh on on
  • ./scripts/bench_sorted_hnsw_fixed_graph.sh /tmp <port> 2000 8 384 reuse on off
  • ./scripts/bench_sorted_hnsw_fixed_graph.sh /tmp <port> 2000 8 384 reuse on on

If Phase A works

Only then consider:

  • more than one shared slot
  • LRU/size-based replacement
  • optional shared L0 decode for read-mostly indexes
  • explicit observability counters for shared hits/misses/publishes

If Phase A fails

Stop local cache-manager work and accept that the next performance branch is either:

  • a larger shared-cache redesign with DSA/DSM, or
  • planner/policy work that avoids paying fresh-backend cost in short-lived sessions