Design: sorted_hnsw Shared Scan Cache

Status

Implemented in:

  • f33d620 perf: share decoded sorted_hnsw scan cache across backends
  • 27e77a3 perf: cost sorted_hnsw shared-cache startup separately

The original upper-level-only payload was a useful falsifier, but it only cut fresh 50k/384D scans by about 10% and missed the bar. The implemented version therefore promotes the shared slot to a full decoded immutable scan snapshot.

Problem

sorted_hnsw already keeps a decoded scan cache per backend in src/sorted_hnsw.c, keyed by {relid, relfilenode, cache_gen}.

That solves repeated scans inside one backend, but every new backend still pays to rebuild the same immutable scan-side state:

  • SQ8 min/scale arrays
  • upper-level adjacency
  • upper-level nid -> index lookup tables
  • scan-cache control structs

The new fixed-graph harness can now measure the gap directly:

  • fresh mode: one ordered scan per new backend
  • reuse mode: one warmup query, then repeated scans in a single backend

Representative pre-shared-cache 384D result:

  • fresh avg_ms = 0.5803
  • reuse avg_ms = 0.3007

So roughly half (about 48%: 0.28 ms of the 0.58 ms fresh average) of the observed fresh-backend latency is still startup work, not the search itself.

Goal

Reduce cross-backend cold-start cost for ordered sorted_hnsw scans without changing search semantics or weakening crash safety.

Non-goals

  • Do not change on-disk page layout.
  • Do not make writable-path behavior depend on shared-memory mutation.
  • Do not share exact rerank heap/TOAST state.
  • Do not start with a general-purpose multi-index cache manager if a smaller falsifier can prove or kill the idea first.

Current architecture

Today each backend does:

  1. Read metapage.
  2. Read SQ8 auxiliary pages.
  3. Decode upper levels into backend-local palloc() allocations.
  4. Build upper-level lookup tables.
  5. Decode L0 lazily into backend-local memory.

That means fresh mode repeatedly duplicates the same immutable decode work.

Implemented Phase A: single-slot shared immutable cache

Start with one experimental shared cache slot, effective only when the extension is loaded via shared_preload_libraries.

Why a single slot first:

  • it is enough to validate the measured bottleneck on the current benchmark
  • it keeps invalidation and memory budgeting simple
  • it gives a hard falsifier before building a more complex multi-slot cache

What to share

Share immutable decoded scan-side data:

  • SQ8 mins
  • SQ8 scales
  • decoded L0 node headers
  • decoded L0 neighbor slab
  • decoded L0 SQ8 vectors
  • upper-level adjacency
  • upper-level nid -> index arrays
  • immutable header fields: M, dim, n_nodes, entry_nid, max_level, cache_gen, relid, locator

Keep these backend-local:

  • L0 pages and neighbor arrays decoded on the private fallback path (when the shared slot is absent or stale)
  • visited bitsets
  • candidate/result scratch
  • writable-path insert helpers
  • rerank heap/slot state

Why this split

The upper-level-only split was safe but insufficient. The dominant remaining fresh-backend tax was the decoded L0 snapshot itself, so the implemented slot stores the full immutable decoded graph and relies on cache_gen invalidation to keep writable-path updates from reusing stale state.
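A minimal standalone sketch of that split, with hypothetical names and plain malloc() standing in for backend-local allocation: on a shared-cache hit the backend-local view merely aliases the immutable shared arrays, while the fallback path owns a private copy, and only the owning view ever frees.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical backend-local view of the scan cache: pointers either
 * alias the shared immutable arrays (attach path) or own a private
 * decode (fallback path). Only the owning view frees anything. */
typedef struct ScanView
{
    const float *sq8_mins;
    const float *sq8_scales;
    bool         owns_memory;
} ScanView;

/* Attach path: no allocation, no copy, just aliasing shared arrays. */
static ScanView
attach_shared(const float *mins, const float *scales)
{
    ScanView v = { mins, scales, false };
    return v;
}

/* Fallback path: private decode, simulated here with a copy. */
static ScanView
build_private(const float *mins, const float *scales, int dim)
{
    float *m = malloc(sizeof(float) * dim);
    float *s = malloc(sizeof(float) * dim);

    memcpy(m, mins, sizeof(float) * dim);
    memcpy(s, scales, sizeof(float) * dim);
    {
        ScanView v = { m, s, true };
        return v;
    }
}

static void
release(ScanView *v)
{
    if (v->owns_memory)
    {
        free((void *) v->sq8_mins);
        free((void *) v->sq8_scales);
    }
}
```

The `owns_memory` flag is what keeps visited bitsets, scratch, and writable-path state safely backend-local while the immutable arrays stay shared.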

Shared memory layout

Use the existing shared-memory hook pattern already present in src/sorted_heap_scan.c.

Add:

  • one control struct in main shared memory
  • one fixed-size payload arena reserved at startup
  • one LWLock protecting publish/replace of the slot

Use offsets inside the payload arena rather than raw palloc()-style pointer ownership.
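A standalone sketch of the offset discipline, assuming hypothetical names and a static buffer standing in for the shmem arena: the control struct records byte offsets, and each backend resolves them against its own base address.

```c
#include <stddef.h>

/* 1 KB stand-in for the fixed shmem payload arena reserved at startup. */
static char demo_arena[1024];

/* Resolve an arena offset against this backend's base address. Storing
 * offsets (not pointers) in the control struct keeps the payload valid
 * for every backend, regardless of where the segment is mapped. */
static void *
shm_off2ptr(char *base, size_t off)
{
    return base + off;
}
```

Any backend mapping the arena at a different address computes the same logical slots from the same offsets, which is exactly why raw pointer ownership is off the table.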

Control struct sketch

typedef struct ShnswSharedScanCacheCtl
{
    LWLock     *lock;           /* protects publish/replace of the slot */
    bool        valid;          /* set last, once a snapshot is fully published */

    /* identity: all three must match the metapage, or the slot is stale */
    Oid         relid;
    RelFileLocator locator;
    uint64      cache_gen;

    /* immutable header fields copied from the metapage */
    int32       dim;
    int16       M;
    int16       max_level;
    int32       n_nodes;
    int32       entry_nid;

    /* offsets into the fixed payload arena; never raw pointers */
    Size        payload_bytes;
    Size        sq8_mins_off;
    Size        sq8_scales_off;
    Size        l0_headers_off;     /* decoded L0 node headers */
    Size        l0_neighbors_off;   /* decoded L0 neighbor slab */
    Size        l0_vectors_off;     /* decoded L0 SQ8 vectors */
    Size        upper_entries_off[SHNSW_MAX_LEVELS];
    Size        upper_nbr_idx_off[SHNSW_MAX_LEVELS];
    int32       upper_count[SHNSW_MAX_LEVELS];
} ShnswSharedScanCacheCtl;

Initial payload budget

Start with a fixed budget just for the experimental slot.

Current payload budget: 64 MB.

If the decoded snapshot does not fit, publication bails out and the backend falls back to the normal private decode path.

Backend behavior

On scan-cache lookup:

  1. Read cache_gen from the metapage.
  2. Check the shared slot under lock.
  3. If {relid, locator, cache_gen} match and payload is present:
    • attach backend-local lightweight wrappers to shared immutable arrays
    • skip SQ8/upper decode
  4. Otherwise:
    • build the normal backend-local cache
    • materialize any deferred L0 pages
    • if the shared slot is empty or stale and the payload fits, publish the decoded immutable snapshot into shared memory
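The identity check in step 3 can be sketched as a pure predicate, with hypothetical condensed types (the real check compares the full RelFileLocator):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical condensed slot identity: the fields a lookup compares
 * against the metapage before trusting the shared payload. */
typedef struct SlotId
{
    bool     valid;
    uint32_t relid;
    uint64_t relfilenumber;   /* stands in for the full RelFileLocator */
    uint64_t cache_gen;
} SlotId;

/* Attach only when every identity field matches what the backend just
 * read from the metapage; anything else falls through to step 4. */
static bool
slot_matches(const SlotId *slot, uint32_t relid,
             uint64_t relfilenumber, uint64_t cache_gen)
{
    return slot->valid
        && slot->relid == relid
        && slot->relfilenumber == relfilenumber
        && slot->cache_gen == cache_gen;
}
```

Because the predicate is conjunctive, every miss reason (empty slot, different index, bumped generation) takes the same fallback path.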

On insert, vacuum tombstone, rebuild, relfilenode swap, or reindex:

  • metapage already bumps cache_gen
  • next backend lookup sees mismatch and treats the shared slot as stale
  • stale slots are replaced lazily on the next eligible scan
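The lazy-replacement flow above can be sketched as a toy single-slot simulation (all names hypothetical): writers only bump the metapage generation and never touch shared memory; the stale slot sits untouched until the next scan notices the mismatch.

```c
#include <stdbool.h>
#include <stdint.h>

static uint64_t metapage_gen = 5;   /* authoritative, bumped by writers */
static uint64_t slot_gen = 5;       /* generation of published snapshot */
static bool     slot_valid = true;

static void
writer_insert(void)
{
    metapage_gen++;                 /* no shared-memory mutation here */
}

/* Returns true on a shared-cache hit, false when the scan had to do a
 * private rebuild and lazily republish the slot. */
static bool
scan_lookup(void)
{
    if (slot_valid && slot_gen == metapage_gen)
        return true;
    slot_gen = metapage_gen;        /* rebuild + republish */
    slot_valid = true;
    return false;
}
```

Keeping writers out of shared memory is what preserves the "no writable-path shmem mutation" non-goal.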

Correctness constraints

  • any cache_gen mismatch forces a miss; the shared cache must never serve a stale snapshot
  • payload publication must be all-or-nothing
  • fallback to backend-local decode must remain correct when shmem is disabled, full, or stale
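The all-or-nothing rule, together with the budget fallback, can be sketched like this (hypothetical names; the exclusive lock and the real 64 MB arena are elided in favor of a tiny fixed buffer):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy slot with a tiny fixed budget; the real arena is 64 MB. */
typedef struct PubSlot
{
    bool     valid;
    uint64_t cache_gen;
    size_t   payload_bytes;
    char     payload[64];
} PubSlot;

/* All-or-nothing publication: check the fit first, invalidate, copy
 * the whole snapshot, record identity, and set valid only after
 * everything landed. Oversized snapshots are rejected up front, so the
 * caller simply keeps its private backend-local cache. */
static bool
publish(PubSlot *slot, const char *snap, size_t len, uint64_t gen)
{
    if (len > sizeof(slot->payload))
        return false;               /* over budget: fall back */
    slot->valid = false;
    memcpy(slot->payload, snap, len);
    slot->payload_bytes = len;
    slot->cache_gen = gen;
    slot->valid = true;             /* last store makes it visible */
    return true;
}
```

Note that a rejected publication returns before touching the slot, so an already-published older snapshot stays intact and usable.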

Falsifier

The branch cleared the original gate:

  • fresh 50k/384D fixed-graph average:
    • shared_cache=off -> 1.9213 ms
    • shared_cache=on -> 0.6337 ms
  • warm same-backend 50k/384D:
    • shared_cache=off -> 0.2180 ms
    • shared_cache=on -> 0.0917 ms
  • make installcheck green
  • scripts/test_sorted_hnsw_crash_recovery.sh green
  • fresh-backend insert invalidation check:
    • nearest before insert 833:0.000258
    • nearest after insert 2001:0.000000

That is enough to keep the branch and stop treating it as speculative.

Validation plan

Use the new benchmark split from scripts/bench_sorted_hnsw_fixed_graph.sh:

  • fresh: cross-backend cold-start cost
  • reuse: steady-state search cost

Required checks:

  • make installcheck REGRESS='pg_sorted_heap sorted_hnsw'
  • ./scripts/test_sorted_hnsw_crash_recovery.sh /tmp <port>
  • ./scripts/bench_sorted_hnsw_fixed_graph.sh /tmp <port> 2000 8 384 fresh on off
  • ./scripts/bench_sorted_hnsw_fixed_graph.sh /tmp <port> 2000 8 384 fresh on on
  • ./scripts/bench_sorted_hnsw_fixed_graph.sh /tmp <port> 2000 8 384 reuse on off
  • ./scripts/bench_sorted_hnsw_fixed_graph.sh /tmp <port> 2000 8 384 reuse on on

If Phase A works

Only then consider:

  • more than one shared slot
  • LRU/size-based replacement
  • optional shared L0 decode for read-mostly indexes
  • explicit observability counters for shared hits/misses/publishes

If Phase A fails

Stop local cache-manager work and accept that the next performance branch is either:

  • a larger shared-cache redesign with DSA/DSM, or
  • planner/policy work that avoids paying fresh-backend cost in short-lived sessions