zvec empty-id retrieval bug

This note is an upstream-ready issue draft for a zvec retrieval defect that showed up during GraphRAG parity work in pg_sorted_heap.

The short version:

  • ANN scores still come back
  • returned doc.id values become empty strings
  • the failure reproduces on both:
    • a real-text Gutenberg GraphRAG corpus
    • a plain synthetic FP32 corpus

So this does not look like a PostgreSQL expansion/rerank bug.

Environment used for the verified local repro:

  • zvec 0.2.0
  • Python package at:
    • /opt/homebrew/Caskroom/miniconda/base/lib/python3.12/site-packages/zvec
  • native extension:
    • /opt/homebrew/Caskroom/miniconda/base/lib/python3.12/site-packages/_zvec.cpython-312-darwin.so

Minimal synthetic reproducer

Repo-owned script: scripts/repro_zvec_synthetic_threshold.py

Command:

python3 scripts/repro_zvec_synthetic_threshold.py --rows 4900,4950,5000 --query-count 5

Current verified output shape:

SYNTH_THRESH|rows=4900|status=ok|first_bad_query=None|sample=None
SYNTH_THRESH|rows=4950|status=bad|first_bad_query=1|sample=['', '', '', '', '', '', '']
SYNTH_THRESH|rows=5000|status=bad|first_bad_query=1|sample=['', '', '', '', '', '', '']

stderr also reports:

Failed to find target chunk for index 4945
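
For triage, the SYNTH_THRESH lines above can be parsed into records so that bad row counts are easy to list. This is a minimal sketch that assumes the pipe-delimited key=value shape shown above stays stable; the helper name parse_synth_thresh is hypothetical and not part of the repro script.

```python
import ast

def parse_synth_thresh(line):
    """Parse one 'SYNTH_THRESH|key=value|...' line into a dict."""
    tag, *fields = line.strip().split("|")
    if tag != "SYNTH_THRESH":
        raise ValueError(f"unexpected tag: {tag!r}")
    record = {}
    for field in fields:
        key, _, raw = field.partition("=")
        record[key] = raw
    record["rows"] = int(record["rows"])
    # first_bad_query and sample are printed as Python literals (None, 1, ['', ...]).
    record["first_bad_query"] = ast.literal_eval(record["first_bad_query"])
    record["sample"] = ast.literal_eval(record["sample"])
    return record

# The three lines from the verified output shape above.
lines = [
    "SYNTH_THRESH|rows=4900|status=ok|first_bad_query=None|sample=None",
    "SYNTH_THRESH|rows=4950|status=bad|first_bad_query=1|sample=['', '', '', '', '', '', '']",
    "SYNTH_THRESH|rows=5000|status=bad|first_bad_query=1|sample=['', '', '', '', '', '', '']",
]
records = [parse_synth_thresh(line) for line in lines]
bad_rows = [r["rows"] for r in records if r["status"] == "bad"]
print(bad_rows)
```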

Current minimal signature:

  • dim=32
  • ef_search=64
  • topk=7
  • rows=4950

Neighbor controls:

  • rows=4900, same params: ok
  • rows=4950, topk<=6: ok
  • rows=4950, topk>=7: bad

The compact repro also survives simple runtime knob changes:

  • memory_limit_mb=8192: bad
  • memory_limit_mb=1024: bad
  • memory_limit_mb=256: bad
  • query_threads=1, optimize_threads=1: bad
  • query_threads=2, optimize_threads=2: bad
  • query_threads=4, optimize_threads=4: bad

So the current minimal case does not look like a trivial thread-count or memory-budget artifact.

It also survives broad HNSW parameter changes on the same compact case:

  • ef_search=16, ef_construction=16, m=8: bad
  • ef_search=16, ef_construction=64, m=16: bad
  • ef_search=32, ef_construction=64, m=16: bad
  • ef_search=64, ef_construction=64, m=16: bad
  • ef_search=128, ef_construction=64, m=16: bad
  • ef_search=64, ef_construction=128, m=16: bad
  • ef_search=64, ef_construction=64, m=8: bad
  • ef_search=64, ef_construction=64, m=32: bad

So the compact failing case does not look like a fragile HNSW tuning artifact either.
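
The HNSW sweep above can be expressed as a small harness that groups configurations by outcome. This is only a sketch: run_case is a hypothetical callable that would build the index and run the 4950 / topk=7 query, and the placeholder below replays the outcomes recorded in the list above rather than calling zvec.

```python
# The eight (ef_search, ef_construction, m) combinations listed above.
HNSW_CONFIGS = [
    {"ef_search": 16, "ef_construction": 16, "m": 8},
    {"ef_search": 16, "ef_construction": 64, "m": 16},
    {"ef_search": 32, "ef_construction": 64, "m": 16},
    {"ef_search": 64, "ef_construction": 64, "m": 16},
    {"ef_search": 128, "ef_construction": 64, "m": 16},
    {"ef_search": 64, "ef_construction": 128, "m": 16},
    {"ef_search": 64, "ef_construction": 64, "m": 8},
    {"ef_search": 64, "ef_construction": 64, "m": 32},
]

def sweep(configs, run_case):
    """Run every config through run_case and group configs by returned status."""
    grouped = {}
    for config in configs:
        status = run_case(**config)
        grouped.setdefault(status, []).append(config)
    return grouped

def replay_recorded_outcome(ef_search, ef_construction, m):
    # Placeholder, NOT a zvec call: every combination above was recorded as bad.
    # Swap in a real runner that returns "ok" or "bad" to repeat the sweep.
    return "bad"

grouped = sweep(HNSW_CONFIGS, replay_recorded_outcome)
print({status: len(configs) for status, configs in grouped.items()})
```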

Stronger diagnostics

On the compact synthetic case:

  • rows=4950, topk=6
    • valid ids come back
  • rows=4950, topk=7
    • scores still come back
    • every doc.id becomes ''

That means the failure is not “query returns nothing”. Ranking still appears to produce plausible scores, but document metadata resolution fails.

Representative observation:

CASE 4950 6 [('1600', 0.12408530712127686), ..., ('2946', 0.14136314392089844)]
CASE 4950 7 [('', 0.12408530712127686), ..., ('', 0.14136314392089844)]
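
The distinction between "no hits" and "hits with unmapped ids" can be made explicit in a check like the one below. It is a sketch that assumes results are available as (doc_id, score) pairs, as in the CASE lines above; classify_hits is a hypothetical helper, and the sample pairs are the first and last entries of the abbreviated observation.

```python
def classify_hits(hits):
    """Classify a list of (doc_id, score) pairs.

    'empty'    : the query returned nothing
    'unmapped' : scores came back but at least one id is '' or None
    'ok'       : every hit carries a non-empty id
    """
    if not hits:
        return "empty"
    if any(doc_id in ("", None) for doc_id, _score in hits):
        return "unmapped"
    return "ok"

# First and last entries of the representative observation (middle elided there).
topk6_hits = [("1600", 0.12408530712127686), ("2946", 0.14136314392089844)]
topk7_hits = [("", 0.12408530712127686), ("", 0.14136314392089844)]

print(classify_hits(topk6_hits), classify_hits(topk7_hits), classify_hits([]))
```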

The bug is also non-monotonic by row count. Verified examples:

  • bad: 4950, 5000, 7500, 7900, 16000, 28000, 30000, 45000, 60000
  • ok: 4900, 7000, 7800, 24000, 75000

So this is not a simple “after N rows everything breaks” threshold.
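
A quick way to confirm the non-monotonic pattern is to sort the verified row counts and count how often the status flips. A single breaking threshold would produce exactly one flip. The sketch below uses only the row counts listed above.

```python
BAD_ROWS = [4950, 5000, 7500, 7900, 16000, 28000, 30000, 45000, 60000]
OK_ROWS = [4900, 7000, 7800, 24000, 75000]

def status_flips(ok_rows, bad_rows):
    """Count status changes when all row counts are walked in ascending order."""
    labeled = sorted([(n, "ok") for n in ok_rows] + [(n, "bad") for n in bad_rows])
    statuses = [status for _rows, status in labeled]
    return sum(1 for prev, cur in zip(statuses, statuses[1:]) if prev != cur)

def has_single_threshold(ok_rows, bad_rows):
    """True only if every ok row count is below every bad row count."""
    return max(ok_rows) < min(bad_rows)

flips = status_flips(OK_ROWS, BAD_ROWS)
print(flips, has_single_threshold(OK_ROWS, BAD_ROWS))
```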

Real-text corroboration

Repo-owned script:

Current verified Gutenberg signature:

  • dim=32
  • topk=16
  • ef_search=64
  • 64x256, 80x256, 96x256, 112x256 slices: ok
  • 128x256 (58,954 rows): bad

Observed failure:

Failed to find target chunk for index 58379

Returned ids are empty strings / unmapped ids for the first bad probe.

Additional context

One larger synthetic case gives another useful hint:

  • rows=16000
  • exact cosine inspection shows the best-score bucket spans 1000, 2000, ..., 16000
  • zvec already returns empty ids at topk=5

This does not prove the internal root cause, but it suggests the failure may depend on candidate-materialization / metadata-fetch paths rather than on the ANN score computation itself.

The shipped native binary also contains the exact failing message plus nearby storage component paths:

/Users/cuiys/workspace/zvec/src/db/index/storage/mmap_forward_store.cc
/Users/cuiys/workspace/zvec/src/db/index/storage/bufferpool_forward_store.cc
Failed to find target chunk for index %d
Encountered empty chunk at index %d

That does not prove which codepath is at fault, but it makes the working component hypothesis narrower: the failure is plausibly in forward-store chunk lookup or chunk materialization rather than in HNSW distance ranking itself.
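
The binary inspection above can be repeated without external tools by scanning the extension for printable ASCII runs, similar to what the strings utility does. This is a sketch: find_strings is a hypothetical helper, and the demo runs against a synthetic temporary file rather than the real extension. To inspect the real binary, pass the _zvec .so path from the environment section.

```python
import os
import re
import tempfile

def find_strings(path, needles, min_len=6):
    """Yield printable ASCII runs in a binary file that contain any needle."""
    with open(path, "rb") as f:
        data = f.read()
    # Runs of at least min_len printable ASCII bytes (space through tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        text = match.group().decode("ascii")
        if any(needle in text for needle in needles):
            yield text

# Synthetic stand-in for a binary, used only to show the helper working.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00\x01Failed to find target chunk for index %d\x00\xffnoise\x00")
    demo_path = tmp.name

demo_hits = list(find_strings(demo_path, ["target chunk", "forward_store"]))
os.unlink(demo_path)
print(demo_hits)
```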

Debug file logging on the compact synthetic case adds one more strong clue. With log_level=DEBUG, query_threads=1, and the same 4950 / topk=7 repro, zvec reports:

Opened IPC with 4950 rows, 2 cols, 2 chunks, is_fixed_batch_size[1] fixed_batch_size[2432]
...
Failed to find target chunk for index 4945
...
Record batch: _zvec_uid_: [
  null,
  null,
  null,
  null,
  null,
  null,
  null
]

So the strongest current reading is:

  • the forward store opens successfully
  • the query reaches the recall/output stage
  • metadata resolution for _zvec_uid_ fails and the returned batch carries null ids even though scores are still present
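
The logged numbers also allow a simple arithmetic check. The sketch below assumes that fixed_batch_size[2432] means each chunk holds 2432 rows and that a fixed-size lookup would compute the chunk as index // fixed_batch_size. Both are readings of the log line, not confirmed against zvec source, so treat the result as a hypothesis to test rather than a root cause.

```python
import math

# Values taken from the debug log above.
rows = 4950
chunks_reported = 2
fixed_batch_size = 2432
failing_index = 4945

# Rows addressable if both reported chunks hold exactly fixed_batch_size rows.
rows_covered = chunks_reported * fixed_batch_size

# Chunks required to hold every row at that fixed size.
chunks_needed = math.ceil(rows / fixed_batch_size)

# ASSUMPTION: a fixed-size lookup maps an index to index // fixed_batch_size.
assumed_chunk = failing_index // fixed_batch_size

print(rows_covered, chunks_needed, assumed_chunk)
print("failing index beyond covered rows:", failing_index >= rows_covered)
```

Under those assumptions, index 4945 would map to chunk 2 while only chunks 0 and 1 are reported, which is consistent with the "Failed to find target chunk" message.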

Why this matters

For the pg_sorted_heap GraphRAG benchmark harness, this bug currently blocks a clean large-slice zvec parity row. PostgreSQL-side expansion+rerank remains stable; the unstable stage is the zvec ANN seed retrieval itself.

Status

Verified locally with repo-owned reproducers. No claim yet about the exact internal root cause inside zvec.