zvec empty-id retrieval bug

This note is an upstream-ready issue draft for a zvec retrieval defect that showed up during GraphRAG parity work in pg_sorted_heap.

The short version:

ANN scores still come back
returned doc.id values become empty strings
the failure reproduces on both:
- a real-text Gutenberg GraphRAG corpus
- a plain synthetic FP32 corpus

So this does not look like a PostgreSQL expansion/rerank bug.

Environment used for the verified local repro:

zvec 0.2.0
Python package at:
- /opt/homebrew/Caskroom/miniconda/base/lib/python3.12/site-packages/zvec
native extension:
- /opt/homebrew/Caskroom/miniconda/base/lib/python3.12/site-packages/_zvec.cpython-312-darwin.so

Minimal synthetic reproducer

Repo-owned script:

scripts/repro_zvec_synthetic_threshold.py

Command:

python3 scripts/repro_zvec_synthetic_threshold.py --rows 4900,4950,5000 --query-count 5

Current verified output shape:

SYNTH_THRESH|rows=4900|status=ok|first_bad_query=None|sample=None
SYNTH_THRESH|rows=4950|status=bad|first_bad_query=1|sample=['', '', '', '', '', '', '']
SYNTH_THRESH|rows=5000|status=bad|first_bad_query=1|sample=['', '', '', '', '', '', '']

stderr also reports:

Failed to find target chunk for index 4945
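
The output lines above are pipe-delimited key=value records, so a small parser is enough to turn a run into a pass/fail summary. The following is a minimal sketch, assuming the field layout shown above stays stable; `parse_synth_thresh` is a hypothetical helper and is not part of the repo script.

```python
import ast

def parse_synth_thresh(line):
    """Parse one SYNTH_THRESH line into a dict (hypothetical helper)."""
    tag, *fields = line.strip().split("|")
    if tag != "SYNTH_THRESH":
        raise ValueError(f"unexpected record tag: {tag!r}")
    record = dict(field.split("=", 1) for field in fields)
    record["rows"] = int(record["rows"])
    # first_bad_query and sample are printed as Python literals (None, int, list).
    record["first_bad_query"] = ast.literal_eval(record["first_bad_query"])
    record["sample"] = ast.literal_eval(record["sample"])
    return record

lines = [
    "SYNTH_THRESH|rows=4900|status=ok|first_bad_query=None|sample=None",
    "SYNTH_THRESH|rows=4950|status=bad|first_bad_query=1|sample=['', '', '', '', '', '', '']",
]
records = [parse_synth_thresh(line) for line in lines]
bad_rows = [r["rows"] for r in records if r["status"] == "bad"]
print(bad_rows)
```

The sketch splits on `|`, so it would need adjusting if a sampled id ever contained that character.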

Current minimal signature:

dim=32
ef_search=64
topk=7
rows=4950

Neighbor controls:

rows=4900, same params: ok
rows=4950, topk<=6: ok
rows=4950, topk>=7: bad
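
The neighbor controls amount to a scan over topk at a fixed row count. A minimal sketch of that scan is below. The `probe` callable is hypothetical: a real one would build the collection and query zvec, and no zvec API is shown here. The stand-in probe only encodes the observations listed above and treats every other combination as ok, which is an assumption made for the demo.

```python
def first_bad_topk(probe, rows, max_topk):
    """Return the smallest topk for which probe(rows, topk) is True, else None."""
    for topk in range(1, max_topk + 1):
        if probe(rows, topk):
            return topk
    return None

def observed_probe(rows, topk):
    # Stand-in that mirrors the listed observations only (assumption);
    # a real probe would report whether zvec returned empty ids.
    return rows == 4950 and topk >= 7

boundary_4950 = first_bad_topk(observed_probe, 4950, 16)
boundary_4900 = first_bad_topk(observed_probe, 4900, 16)
print(boundary_4950, boundary_4900)
```

A linear scan is used instead of bisection because nothing in this note guarantees that failures are monotonic in topk for other row counts.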

The compact repro also survives simple runtime knob changes:

memory_limit_mb=8192: bad
memory_limit_mb=1024: bad
memory_limit_mb=256: bad
query_threads=1, optimize_threads=1: bad
query_threads=2, optimize_threads=2: bad
query_threads=4, optimize_threads=4: bad

So the current minimal case does not look like a trivial thread-count or memory-budget artifact.

It also survives broad HNSW parameter changes on the same compact case:

ef_search=16, ef_construction=16, m=8: bad
ef_search=16, ef_construction=64, m=16: bad
ef_search=32, ef_construction=64, m=16: bad
ef_search=64, ef_construction=64, m=16: bad
ef_search=128, ef_construction=64, m=16: bad
ef_search=64, ef_construction=128, m=16: bad
ef_search=64, ef_construction=64, m=8: bad
ef_search=64, ef_construction=64, m=32: bad

So the compact failing case does not look like a fragile HNSW tuning artifact either.

Stronger diagnostics

On the compact synthetic case:

rows=4950, topk=6
- valid ids come back
rows=4950, topk=7
- scores still come back
- every doc.id becomes ''

That means the failure is not “query returns nothing”. Ranking still appears to produce plausible scores, but document metadata resolution fails.

Representative observation:

CASE 4950 6 [('1600', 0.12408530712127686), ..., ('2946', 0.14136314392089844)]
CASE 4950 7 [('', 0.12408530712127686), ..., ('', 0.14136314392089844)]
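
The distinction between "no results" and "scores without ids" is easy to check mechanically. The sketch below classifies a result list of (doc_id, score) pairs; the tuple shape follows the CASE lines above, and `classify_hits` is a hypothetical helper name.

```python
def classify_hits(hits):
    """Classify a list of (doc_id, score) pairs.

    "empty"    : nothing came back
    "ids_lost" : every hit has a score but an empty or missing id
    "partial"  : only some ids are empty or missing
    "ok"       : every hit carries an id
    """
    if not hits:
        return "empty"
    lost = [doc_id in ("", None) for doc_id, _score in hits]
    if all(lost):
        return "ids_lost"
    if any(lost):
        return "partial"
    return "ok"

good = [("1600", 0.12408530712127686), ("2946", 0.14136314392089844)]
bad = [("", 0.12408530712127686), ("", 0.14136314392089844)]
print(classify_hits(good), classify_hits(bad), classify_hits([]))
```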

The bug is also non-monotonic by row count. Verified examples:

bad: 4950, 5000, 7500, 7900, 16000, 28000, 30000, 45000, 60000
ok: 4900, 7000, 7800, 24000, 75000

So this is not a simple “after N rows everything breaks” threshold.
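
The non-monotonic claim can be checked directly from the two lists: a single row-count threshold could only explain the data if every ok count were smaller than every bad count. A short sketch using only the verified values above:

```python
bad = [4950, 5000, 7500, 7900, 16000, 28000, 30000, 45000, 60000]
ok = [4900, 7000, 7800, 24000, 75000]

def single_threshold_explains(bad, ok):
    """True if some N exists with every ok count below N and every bad count at or above N."""
    return max(ok) < min(bad)

# ok counts that sit above at least one failing count
ok_above_a_bad = sorted(n for n in ok if n > min(bad))

print(single_threshold_explains(bad, ok))
print(ok_above_a_bad)
```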

Real-text corroboration

Repo-owned script:

scripts/repro_zvec_gutenberg_threshold.py

Current verified Gutenberg signature:

dim=32
topk=16
ef_search=64
64x256, 80x256, 96x256, 112x256 slices: ok
128x256 (58,954 rows): bad

Observed failure:

Failed to find target chunk for index 58379

Returned ids are empty strings / unmapped ids for the first bad probe.

Additional context

One larger synthetic case gives another useful hint:

rows=16000
exact cosine inspection shows the best-score bucket spans 1000, 2000, ..., 16000
zvec already returns empty ids at topk=5

This does not prove the internal root cause, but it suggests the failure may depend on candidate-materialization / metadata-fetch paths rather than on the ANN score computation itself.
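
For reference, the kind of exact cosine inspection mentioned above can be done by brute force. The sketch below is not the script used for this note. It assumes scores are cosine distances (smaller is better), which matches the ascending scores in the CASE lines but is not confirmed, and `best_score_bucket` is a hypothetical helper that returns every id tied with the best score.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def best_score_bucket(query, docs, tol=1e-9):
    """Return the ids whose exact cosine distance ties with the best one (within tol)."""
    scored = [(cosine_distance(query, vec), doc_id) for doc_id, vec in docs.items()]
    best = min(dist for dist, _ in scored)
    return sorted(doc_id for dist, doc_id in scored if dist - best <= tol)

# Toy data: "a" and "b" point in the same direction, so they tie exactly.
docs = {"a": [1.0, 0.0], "b": [2.0, 0.0], "c": [0.0, 1.0]}
bucket = best_score_bucket([1.0, 0.0], docs)
print(bucket)
```

A bucket larger than topk means the ANN layer has to pick among equally scored candidates, which is one situation where candidate materialization could matter.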

The shipped native binary also contains the exact failing message plus nearby storage component paths:

/Users/cuiys/workspace/zvec/src/db/index/storage/mmap_forward_store.cc
/Users/cuiys/workspace/zvec/src/db/index/storage/bufferpool_forward_store.cc
Failed to find target chunk for index %d
Encountered empty chunk at index %d

That does not prove which codepath is at fault, but it narrows the working hypothesis about the component: the failure is plausibly in forward-store chunk lookup or chunk materialization rather than in HNSW distance ranking itself.
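
The binary inspection can be repeated without external tools. The sketch below is a rough Python equivalent of `strings | grep`, not the method used for this note; `printable_strings` and `find_markers` are hypothetical helper names, and the demo runs against a small temporary file, not the real extension.

```python
import os
import re
import tempfile

def printable_strings(data, min_len=6):
    """Yield runs of printable ASCII at least min_len bytes long, like `strings`."""
    for match in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
        yield match.group().decode("ascii")

def find_markers(path, needles):
    """Return printable strings in the file that contain any of the needles."""
    with open(path, "rb") as fh:
        data = fh.read()
    return [s for s in printable_strings(data) if any(n in s for n in needles)]

# Demo on a synthetic blob; point find_markers at the native extension to repeat the check.
blob = (b"\x00\x01Failed to find target chunk for index %d\x00\xff"
        b"/src/db/index/storage/mmap_forward_store.cc\x00ab\x00")
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(blob)
hits = find_markers(tmp.name, ["target chunk", "forward_store"])
os.unlink(tmp.name)
print(hits)
```

To repeat the check on the real binary, pass the native extension path from the environment section as `path`.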

Debug file logging on the compact synthetic case adds one more strong clue. With log_level=DEBUG, query_threads=1, and the same 4950 / topk=7 repro, zvec reports:

Opened IPC with 4950 rows, 2 cols, 2 chunks, is_fixed_batch_size[1] fixed_batch_size[2432]
...
Failed to find target chunk for index 4945
...
Record batch: _zvec_uid_: [
  null,
  null,
  null,
  null,
  null,
  null,
  null
]

So the strongest current reading is:

the forward store opens successfully
the query reaches the recall/output stage
metadata resolution for _zvec_uid_ fails and the returned batch carries null ids even though scores are still present
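
The logged numbers also allow one arithmetic sanity check. The sketch below assumes, purely for illustration, that a row is located by integer division of its index by fixed_batch_size. That lookup rule is an assumption and has not been confirmed against zvec source, so this is an observation about the logged values and not a root-cause claim.

```python
rows = 4950
chunks = 2
fixed_batch_size = 2432
failing_index = 4945

# Rows addressable by two full fixed-size chunks.
covered = chunks * fixed_batch_size
# Rows reported by the store beyond those two chunks.
uncovered = rows - covered
# ASSUMED lookup rule: chunk number = index // fixed_batch_size.
assumed_chunk = failing_index // fixed_batch_size

print(covered, uncovered, assumed_chunk, failing_index >= covered)
```

Under that assumption, index 4945 falls past the 4864 rows covered by the two reported chunks and would map to chunk number 2, while only chunks 0 and 1 exist. This does not explain why rows=4900 passes, since this note does not record the chunk layout for that case.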

Why this matters

For the pg_sorted_heap GraphRAG benchmark harness, this bug currently blocks a clean large-slice zvec parity row. PostgreSQL-side expansion+rerank remains stable; the unstable stage is the zvec ANN seed retrieval itself.

Status

Verified locally with repo-owned reproducers. No claim yet about the exact internal root cause inside zvec.