# 2026-02-24 - LanceDB discussion and publishing to docs
## Scope
This page captures the end-to-end discussion around:
- LanceDB risk / failure modes (red-team view)
- A pragmatic migration plan (dual-write / shadow-read / cutover)
- Using GitHub repo + Cloudflare Pages as the publishing surface
- Debugging Cloudflare Pages deployment issues during the MkDocs upgrade
## Discussion notes (verbatim-ish)
### A) LanceDB: failure modes and safeguards (red-team view)
(From the discussion draft in chat)
1) Deletion / destruction risk
- Failure mode: accidental table drop, compaction/vacuum causing irreversible loss; reindex overwriting shards; data directory cleared (container rebuild / wrong volume mount).
- Controls: production read-only account + change approval; table/dir snapshots (object storage versioning); two-person review for destructive ops; pre-flight mount/path fingerprint checks.
- Monitoring: sudden row/file-count drop; inode/capacity anomaly; drop/compact audit events.
- Rollback: restore from snapshot; disable automatic vacuum.
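The pre-flight mount/path fingerprint check mentioned above can be sketched as follows. This is a hedged illustration, not an existing LanceDB feature: the `.volume-id` marker file and the SHA-256 fingerprint scheme are assumptions.

```python
import hashlib
from pathlib import Path

def preflight_check(data_dir: str, expected_fingerprint: str) -> None:
    """Refuse destructive ops unless the mounted path carries the expected
    fingerprint marker (guards against wrong volume mounts / fresh containers)."""
    marker = Path(data_dir) / ".volume-id"  # hypothetical marker file written at provisioning time
    if not marker.exists():
        raise RuntimeError(f"no volume marker in {data_dir}: refusing destructive op")
    actual = hashlib.sha256(marker.read_bytes()).hexdigest()
    if actual != expected_fingerprint:
        raise RuntimeError(f"fingerprint mismatch: {actual} != {expected_fingerprint}")
```

A drop/compact/vacuum tool would call `preflight_check` first and abort on any mismatch, turning "wrong volume mount" from silent data loss into a loud failure.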
2) Missing audit trail
- Failure mode: cannot trace who wrote/deleted/rebuilt index.
- Controls: all writes go through a service layer; force request_id, operator, query/API, affected rows; immutable audit log (WORM / centralized logging).
- Monitoring: audit gaps; write requests without request_id.
3) Injection / data poisoning
- Failure mode: user input spliced into filters; poisoned vector/metadata (overlong strings, NaN/Inf, dim mismatch) causing crashes or retrieval bias; prompt/content injection impacting downstream RAG.
- Controls: parameterized filters, allowlist fields; strict ingest validation (dim, range, NaN, length, schema); content safety + dedupe; tenant isolation by table/dir.
- Monitoring: ingest failure rate; anomaly ratios; per-tenant write bursts; retrieval distribution drift.
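The strict ingest validation (dim, NaN/Inf, length) might look like the sketch below; `EXPECTED_DIM` and `MAX_TEXT_LEN` are assumed limits, not values from the discussion:

```python
import math

EXPECTED_DIM = 768    # assumed embedding dimension
MAX_TEXT_LEN = 4096   # assumed metadata length cap

def validate_record(vector: list[float], text: str, dim: int = EXPECTED_DIM) -> None:
    """Reject poisoned records before they reach the store."""
    if len(vector) != dim:
        raise ValueError(f"dim mismatch: {len(vector)} != {dim}")
    for v in vector:
        # non-numeric values short-circuit before isnan/isinf would raise
        if not isinstance(v, (int, float)) or math.isnan(v) or math.isinf(v):
            raise ValueError("vector contains NaN/Inf/non-numeric value")
    if len(text) > MAX_TEXT_LEN:
        raise ValueError(f"text too long: {len(text)} > {MAX_TEXT_LEN}")
```

Filters built from user input would additionally go through parameterized construction against an allowlist of fields, never string splicing.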
4) Consistency / concurrency
- Failure mode: multi-writer races; partial write success; read-after-write inconsistency; index/data out of sync; cross-process locks failing (NFS/object storage).
- Controls: single-writer or queued writes; atomic batch protocol (stage -> switch pointer); avoid unsupported shared FS; idempotent writes (request_id dedupe).
- Monitoring: index build failures; schema drift; batch retry counts.
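The stage-then-switch-pointer batch protocol can be illustrated with a staging file and an atomic `os.replace`; this is a filesystem-level sketch of the idea, not LanceDB's internal commit protocol:

```python
import json
import os

def atomic_batch_write(path: str, rows: list[dict]) -> None:
    """Stage the full batch to a temp file, then atomically switch the
    pointer (os.replace) so readers never observe a partial write."""
    tmp = path + ".staging"
    with open(tmp, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
        f.flush()
        os.fsync(f.fileno())   # durable before the switch
    os.replace(tmp, path)      # atomic on POSIX: old data stays visible until here
```

Note that this atomicity guarantee is exactly what breaks on NFS or object-storage mounts, which is why unsupported shared filesystems are ruled out above.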
5) Ops and capacity
- Failure mode: index bloat/fragmentation causing latency jitter; OOM/disk-full corrupting data; upgrade incompatibilities.
- Controls: capacity waterlines + quotas; scheduled compaction/optimize (off-peak, rollbackable); load-test baselines; backup + compatibility validation before upgrades.
- Monitoring: P99 latency, disk/mem waterlines, segment growth, crash/restart counts.
- Rollback: blue/green switch to old version + snapshot; keep dual-write window for replay.
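One way to enforce a capacity waterline before compaction or ingest, sketched with stdlib `shutil.disk_usage`; the 85% threshold is an assumption, not a figure from the discussion:

```python
import shutil

def disk_waterline_ok(path: str, max_used_fraction: float = 0.85) -> bool:
    """Return False when disk usage crosses the waterline, so callers can
    block compaction/ingest instead of risking disk-full corruption mid-write."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total < max_used_fraction
```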
### B) Migration plan (parallel period)
1) Build new storage namespace model (session-memory / daily memory / MEMORY.md) + unified IDs
- Full backfill: export old -> bucket by user/session/day -> write to new with checksum/version.
2) Dual-write
- Write path writes old + new.
- New store failures do not block primary path, but must be alerted and replayed.
- Idempotent write (same ID overwrites).
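The dual-write rule above (new-store failures never block the primary path, but are logged and queued for replay) could be sketched as:

```python
import logging

log = logging.getLogger("dual-write")

def dual_write(write_old, write_new, replay_queue: list,
               record_id: str, record: dict) -> None:
    """Primary path writes the old store; new-store failures never block it,
    but are logged and queued for replay. Same ID overwrites (idempotent)."""
    write_old(record_id, record)           # primary path: must succeed
    try:
        write_new(record_id, record)       # best-effort secondary write
    except Exception:
        log.exception("new-store write failed; queued for replay, id=%s", record_id)
        replay_queue.append((record_id, record))
```

A background worker would drain `replay_queue`; because writes are keyed by ID and overwrite, replays and retries stay idempotent.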
3) Shadow-read rollout
- Default reads old.
- Canary traffic also reads new and diffs the output field by field; on mismatch it falls back to old and reports a sample.
- Ramp read-new % to 100.
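The shadow-read canary with field-level diffing might be sketched like this; the record shape (flat dicts) and the ramping mechanism (a plain probability) are assumptions:

```python
import random

def shadow_read(read_old, read_new, key, canary_pct: float, mismatches: list):
    """Serve from the old store; for a canary fraction of reads also hit the
    new store, diff field by field, and record mismatch samples.
    The old result always wins, so a broken new store cannot hurt callers."""
    old = read_old(key)
    if random.random() < canary_pct:       # ramp canary_pct from 0.0 to 1.0
        try:
            new = read_new(key)
        except Exception:
            new = None
        if new != old:
            new = new or {}
            diff = sorted(f for f in set(old) | set(new) if old.get(f) != new.get(f))
            mismatches.append({"key": key, "fields": diff})
    return old
```

The recorded `mismatches` samples feed the diff-rate acceptance metric below; critical fields would be checked with zero tolerance before ramping further.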
4) Cutover and rollback
- Cutover changes only read toggle.
- Rollback = read old + stop new writes (or continue dual-write); keep replay queue + backfill scripts.
Acceptance metrics
- Read consistency: diff rate < 0.1% (critical fields zero tolerance)
- Write reliability: new writes >= 99.9%, replay backlog < 5 min
- Performance: P95 read latency <= +10%
- Observability: QPS/error/hit per namespace traceable; per-user replay/audit possible
Namespace policy
- session-memory: short-term context, strong TTL
- daily memory: day-aggregated increments (facts/preferences)
- MEMORY.md: long-term distilled summary (milestones/stable preferences), versioned with traceable sources
Pruning / dedupe / TTL / archive
- Dedupe: hash + semantic threshold; last-write-wins for same key class (preference/profile/decision)
- TTL: session 7-30d; daily 90-180d; MEMORY.md no TTL but quarterly review
- Archive: TTL expiry -> cold storage (compressed, read-only, per-user encryption) with audit index
### C) Tianming 700-zh summary (business view)
This section is already captured in the deliverable page:
Deliverables -> LanceDB: failure modes and migration
### D) Publishing: GitHub + Cloudflare Pages
We decided to publish to Cloudflare Pages for stability and convenience.
Key build settings (MkDocs Material)
- Production branch: `master`
- Build command: `pip install -r docs-requirements.txt && mkdocs build`
- Output directory: `site`
Deployment debugging highlights
- Root cause of "HTML shown as source / raw markdown": stale Cloudflare `_redirects` rewrites and `_headers` Content-Type overrides left over from the pre-MkDocs era.
- Fix: remove `docs/_redirects` and `docs/_headers`, and ensure the MkDocs build output (not raw markdown) is what gets deployed.
## Pointers
- Deliverable: `../deliverables/lancedb-failure-modes-and-migration.md`
- Templates: `../page-templates/`