zvec empty-id retrieval bug
This note is an upstream-ready issue draft for a zvec retrieval defect that showed up during GraphRAG parity work in pg_sorted_heap.
The short version:
- ANN scores still come back
- returned doc.id values become empty strings
- the failure reproduces on both:
  - a real-text Gutenberg GraphRAG corpus
  - a plain synthetic FP32 corpus

So this does not look like a PostgreSQL expansion/rerank bug.
Environment used for the verified local repro:
- zvec 0.2.0
- Python package at:
/opt/homebrew/Caskroom/miniconda/base/lib/python3.12/site-packages/zvec
- native extension:
/opt/homebrew/Caskroom/miniconda/base/lib/python3.12/site-packages/_zvec.cpython-312-darwin.so
Minimal synthetic reproducer
Repo-owned script: scripts/repro_zvec_synthetic_threshold.py
Command:
python3 scripts/repro_zvec_synthetic_threshold.py --rows 4900,4950,5000 --query-count 5
Current verified output shape:
SYNTH_THRESH|rows=4900|status=ok|first_bad_query=None|sample=None
SYNTH_THRESH|rows=4950|status=bad|first_bad_query=1|sample=['', '', '', '', '', '', '']
SYNTH_THRESH|rows=5000|status=bad|first_bad_query=1|sample=['', '', '', '', '', '', '']
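The SYNTH_THRESH classification above can be sketched as follows. This is not the repro script itself: run_probe is a hypothetical stand-in for the real zvec query, wired here to simulate the observed behaviour (empty ids at rows >= 4950) so the classification logic is self-contained.

```python
# Sketch of the SYNTH_THRESH classification. `run_probe` is a hypothetical
# stand-in for the real zvec query path; it simulates the observed failure.
def run_probe(rows: int, query_idx: int, topk: int = 7) -> list:
    """Return the ids of the top-k hits for one probe query (simulated)."""
    if rows >= 4950:
        return [""] * topk  # simulate the observed empty-id failure
    return [str(i) for i in range(topk)]

def classify(rows: int, query_count: int = 5) -> str:
    first_bad, sample = None, None
    for q in range(1, query_count + 1):
        ids = run_probe(rows, q)
        if any(i == "" for i in ids):  # any empty id marks the query bad
            first_bad, sample = q, ids
            break
    status = "ok" if first_bad is None else "bad"
    return f"SYNTH_THRESH|rows={rows}|status={status}|first_bad_query={first_bad}|sample={sample}"

for rows in (4900, 4950, 5000):
    print(classify(rows))
```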
stderr also reports:
Failed to find target chunk for index 4945
Current minimal signature:
dim=32, ef_search=64, topk=7, rows=4950
Neighbor controls:
- rows=4900, same params: ok
- rows=4950, topk<=6: ok
- rows=4950, topk>=7: bad
The compact repro also survives simple runtime knob changes:
- memory_limit_mb=8192: bad
- memory_limit_mb=1024: bad
- memory_limit_mb=256: bad
- query_threads=1, optimize_threads=1: bad
- query_threads=2, optimize_threads=2: bad
- query_threads=4, optimize_threads=4: bad
So the current minimal case does not look like a trivial thread-count or memory-budget artifact.
It also survives broad HNSW parameter changes on the same compact case:
- ef_search=16, ef_construction=16, m=8: bad
- ef_search=16, ef_construction=64, m=16: bad
- ef_search=32, ef_construction=64, m=16: bad
- ef_search=64, ef_construction=64, m=16: bad
- ef_search=128, ef_construction=64, m=16: bad
- ef_search=64, ef_construction=128, m=16: bad
- ef_search=64, ef_construction=64, m=8: bad
- ef_search=64, ef_construction=64, m=32: bad
So the compact failing case does not look like a fragile HNSW tuning artifact either.
Stronger diagnostics
On the compact synthetic case:
- rows=4950, topk=6: valid ids come back
- rows=4950, topk=7: scores still come back, but every doc.id becomes ''
That means the failure is not “query returns nothing”. Ranking still appears to produce plausible scores, but document metadata resolution fails.
Representative observation:
CASE 4950 6 [('1600', 0.12408530712127686), ..., ('2946', 0.14136314392089844)]
CASE 4950 7 [('', 0.12408530712127686), ..., ('', 0.14136314392089844)]
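The two CASE lines can be compared directly (values copied from the output above, truncated to the shown entries): the score vectors agree pairwise to full precision, and only id materialization differs.

```python
# First and last shown elements of the topk=6 (good) and topk=7 (bad)
# result lists, copied from the CASE output above.
good = [('1600', 0.12408530712127686), ('2946', 0.14136314392089844)]
bad  = [('',     0.12408530712127686), ('',     0.14136314392089844)]

# Scores agree pairwise; only the id column is blanked out.
scores_match = all(abs(g[1] - b[1]) < 1e-12 for g, b in zip(good, bad))
ids_blank = all(b[0] == '' for b in bad)
print(scores_match, ids_blank)  # True True
```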
The bug is also non-monotonic by row count. Verified examples:
- bad: 4950, 5000, 7500, 7900, 16000, 28000, 30000, 45000, 60000
- ok: 4900, 7000, 7800, 24000, 75000
So this is not a simple “after N rows everything breaks” threshold.
Real-text corroboration
Repo-owned script:
Current verified Gutenberg signature:
- dim=32, topk=16, ef_search=64
- 64x256, 80x256, 96x256, 112x256 slices: ok
- 128x256 (58,954 rows): bad
Observed failure:
Failed to find target chunk for index 58379
Returned ids are empty strings / unmapped ids for the first bad probe.
Additional context
One larger synthetic case gives another useful hint:
- rows=16000: exact cosine inspection shows the best-score bucket spans 1000, 2000, ..., 16000
- zvec already returns empty ids at topk=5
This does not prove the internal root cause, but it suggests the failure may depend on candidate-materialization / metadata-fetch paths rather than on the ANN score computation itself.
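The "exact cosine inspection" mentioned above is straightforward to reproduce independently of zvec. A minimal brute-force ground-truth sketch (synthetic random data, not the repro corpus) looks like this; it is the kind of check that confirms top-score buckets are well defined even when the ANN path returns empty ids:

```python
import numpy as np

# Brute-force exact cosine top-k, independent of any ANN index.
def exact_top_k(corpus: np.ndarray, query: np.ndarray, k: int):
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = c @ q                     # cosine similarity against every row
    order = np.argsort(-sims)[:k]    # best k, highest similarity first
    return [(int(i), float(sims[i])) for i in order]

rng = np.random.default_rng(0)
corpus = rng.standard_normal((16000, 32)).astype(np.float32)
query = corpus[4945]                 # query with a stored row
hits = exact_top_k(corpus, query, 5)
print(hits[0])                       # row 4945 ranks first with sim ~= 1.0
```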
The shipped native binary also contains the exact failing message plus nearby storage component paths:
/Users/cuiys/workspace/zvec/src/db/index/storage/mmap_forward_store.cc
/Users/cuiys/workspace/zvec/src/db/index/storage/bufferpool_forward_store.cc
Failed to find target chunk for index %d
Encountered empty chunk at index %d
That does not prove which codepath is at fault, but it makes the working component hypothesis narrower: the failure is plausibly in forward-store chunk lookup or chunk materialization rather than in HNSW distance ranking itself.
Debug file logging on the compact synthetic case adds one more strong clue. With log_level=DEBUG, query_threads=1, and the same 4950 / topk=7 repro, zvec reports:
Opened IPC with 4950 rows, 2 cols, 2 chunks, is_fixed_batch_size[1] fixed_batch_size[2432]
...
Failed to find target chunk for index 4945
...
Record batch: _zvec_uid_: [
null,
null,
null,
null,
null,
null,
null
]
So the strongest current reading is:
- the forward store opens successfully
- the query reaches the recall/output stage
- metadata resolution for _zvec_uid_ fails, and the returned batch carries null ids even though scores are still present
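One back-of-the-envelope check on the logged numbers is also suggestive, assuming (unverified) that index-to-chunk mapping is a plain division by the fixed batch size: 2 chunks at fixed_batch_size 2432 only cover 4864 rows, and the failing index 4945 lies past that boundary.

```python
# Arithmetic on the values reported in the DEBUG log above. The chunk
# mapping (index // batch) is an assumption, not confirmed zvec behaviour.
rows, chunks, batch = 4950, 2, 2432

covered = chunks * batch               # rows addressable by the reported chunks
failing_index = 4945
target_chunk = failing_index // batch  # chunk that index 4945 would map to

print(covered)       # 4864 < 4950: the last rows fall outside the 2 chunks
print(target_chunk)  # 2, but only chunks 0 and 1 were reported
```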
Why this matters
For the pg_sorted_heap GraphRAG benchmark harness, this bug currently blocks a clean large-slice zvec parity row. PostgreSQL-side expansion+rerank remains stable; the unstable stage is the zvec ANN seed retrieval itself.
Status
Verified locally with repo-owned reproducers. No claim yet about the exact internal root cause inside zvec.