2026-02-24 - LanceDB discussion and publishing to docs

Scope

This page captures the end-to-end discussion around:

  • LanceDB risk / failure modes (red-team view)
  • A pragmatic migration plan (dual-write / shadow-read / cutover)
  • Using GitHub repo + Cloudflare Pages as the publishing surface
  • Debugging Cloudflare Pages deployment issues during the MkDocs upgrade

Discussion notes (verbatim-ish)

A) LanceDB: failure modes and safeguards (red-team view)

(From the discussion draft in chat)

1) Deletion / destruction risk

  • Failure mode: accidental table drop, compaction/vacuum causing irreversible loss; reindex overwriting shards; data directory cleared (container rebuild / wrong volume mount).
  • Controls: production read-only account + change approval; table/dir snapshots (object storage versioning); two-person review for destructive ops; pre-flight mount/path fingerprint checks.
  • Monitoring: sudden row/file-count drop; inode/capacity anomaly; drop/compact audit events.
  • Rollback: restore from snapshot; disable automatic vacuum.

2) Missing audit trail

  • Failure mode: cannot trace who wrote/deleted/rebuilt index.
  • Controls: all writes go through a service layer; force request_id, operator, query/API, affected rows; immutable audit log (WORM / centralized logging).
  • Monitoring: audit gaps; write requests without request_id.

3) Injection / data poisoning

  • Failure mode: user input spliced into filters; poisoned vector/metadata (overlong strings, NaN/Inf, dim mismatch) causing crashes or retrieval bias; prompt/content injection impacting downstream RAG.
  • Controls: parameterized filters, allowlist fields; strict ingest validation (dim, range, NaN, length, schema); content safety + dedupe; tenant isolation by table/dir.
  • Monitoring: ingest failure rate; anomaly ratios; per-tenant write bursts; retrieval distribution drift.
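The ingest-validation and allowlist controls above might look like this (a sketch: `EXPECTED_DIM`, `MAX_TEXT_LEN`, and the allowlisted field names are illustrative assumptions, not values from the discussion):

```python
import math

EXPECTED_DIM = 768          # assumed embedding dimension
MAX_TEXT_LEN = 8192         # assumed per-field length cap
ALLOWED_FILTER_FIELDS = {"user_id", "session_id", "day"}  # allowlist

def validate_vector(vec: list[float], dim: int = EXPECTED_DIM) -> None:
    # Reject dim mismatch and NaN/Inf before they reach the index.
    if len(vec) != dim:
        raise ValueError(f"dim mismatch: got {len(vec)}, want {dim}")
    if any(not math.isfinite(x) for x in vec):
        raise ValueError("vector contains NaN/Inf")

def validate_metadata(meta: dict) -> None:
    # Overlong strings are a common poisoning vector; cap them at ingest.
    for k, v in meta.items():
        if isinstance(v, str) and len(v) > MAX_TEXT_LEN:
            raise ValueError(f"field {k!r} exceeds {MAX_TEXT_LEN} chars")

def validate_filter_field(field: str) -> None:
    # Allowlisted fields + parameterized values prevent filter injection.
    if field not in ALLOWED_FILTER_FIELDS:
        raise ValueError(f"filter field {field!r} not allowed")
```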

4) Consistency / concurrency

  • Failure mode: multi-writer races; partial write success; read-after-write inconsistency; index/data out of sync; cross-process locks failing (NFS/object storage).
  • Controls: single-writer or queued writes; atomic batch protocol (stage -> switch pointer); avoid unsupported shared FS; idempotent writes (request_id dedupe).
  • Monitoring: index build failures; schema drift; batch retry counts.

5) Ops and capacity

  • Failure mode: index bloat/fragmentation causing latency jitter; OOM/disk-full corrupting data; upgrade incompatibilities.
  • Controls: capacity waterlines + quotas; scheduled compaction/optimize (off-peak, rollbackable); load-test baselines; backup + compatibility validation before upgrades.
  • Monitoring: P99 latency, disk/mem waterlines, segment growth, crash/restart counts.
  • Rollback: blue/green switch to old version + snapshot; keep dual-write window for replay.
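The capacity-waterline control above reduces to a simple classifier (thresholds here are placeholders; real values come from the load-test baselines the notes call for):

```python
def waterline_level(used: int, total: int,
                    warn: float = 0.70, crit: float = 0.90) -> str:
    """Classify disk/memory usage against waterlines."""
    ratio = used / total
    if ratio >= crit:
        return "critical"   # block new writes, page on-call
    if ratio >= warn:
        return "warning"    # schedule off-peak compaction/cleanup
    return "ok"
```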

B) Migration plan (parallel period)

1) Build new storage namespace model (session-memory / daily memory / MEMORY.md) + unified IDs

  • Full backfill: export old -> bucket by user/session/day -> write to new with checksum/version.

2) Dual-write

  • Write path writes old + new.
  • New store failures do not block primary path, but must be alerted and replayed.
  • Idempotent write (same ID overwrites).
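The dual-write rules above can be sketched as a thin wrapper: old store is the primary path and must succeed, the new store is best-effort with alert + replay, and the same ID simply overwrites (stores are shown as dict-like for illustration; the replay queue would be drained by a background replayer):

```python
import logging
import queue

log = logging.getLogger("dual_write")
replay_queue: queue.Queue = queue.Queue()

def dual_write(record_id: str, record: dict, old_store, new_store) -> None:
    """Write old + new; a new-store failure is logged and queued for
    replay but never propagated to the primary path. Writes are
    idempotent by record_id, so replays are safe to repeat."""
    old_store[record_id] = record      # primary path: must succeed
    try:
        new_store[record_id] = record  # same ID overwrites
    except Exception:
        log.exception("new-store write failed, queued for replay: %s",
                      record_id)
        replay_queue.put((record_id, record))
```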

3) Shadow-read rollout

  • Default reads old.
  • Canary reads new and diffs field-level output; mismatch falls back to old and reports sample.
  • Ramp read-new % to 100.
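The shadow-read rollout above might look like this (a sketch: the `bucket` argument is an assumed 0-99 value derived from something stable like `hash(request_id) % 100`, so the canary population is consistent while `read_new_pct` ramps to 100):

```python
mismatch_samples: list = []   # reported samples for offline analysis

def shadow_read(key, old_store, new_store, read_new_pct: int, bucket: int):
    """Canary read: requests whose bucket falls inside the ramp
    percentage also read new and diff it against old; on mismatch the
    sample is reported and the caller still gets the old value."""
    old_val = old_store.get(key)
    if bucket >= read_new_pct:          # outside the canary: old only
        return old_val
    new_val = new_store.get(key)
    if new_val != old_val:
        mismatch_samples.append((key, old_val, new_val))
        return old_val                  # mismatch falls back to old
    return new_val
```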

4) Cutover and rollback

  • Cutover changes only read toggle.
  • Rollback = read old + stop new writes (or continue dual-write); keep replay queue + backfill scripts.

Acceptance metrics

  • Read consistency: diff rate < 0.1% (zero tolerance for critical fields)
  • Write reliability: new writes >= 99.9%, replay backlog < 5 min
  • Performance: P95 read latency <= +10%
  • Observability: QPS/error/hit per namespace traceable; per-user replay/audit possible

Namespace policy

  • session-memory: short-term context, strong TTL
  • daily memory: day-aggregated increments (facts/preferences)
  • MEMORY.md: long-term distilled summary (milestones/stable preferences), versioned with traceable sources

Pruning / dedupe / TTL / archive

  • Dedupe: hash + semantic threshold; last-write-wins for same key class (preference/profile/decision)
  • TTL: session 7-30d; daily 90-180d; MEMORY.md no TTL but quarterly review
  • Archive: TTL expiry -> cold storage (compressed, read-only, per-user encryption) with audit index
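The hash-based dedupe, last-write-wins, and TTL rules above can be sketched as follows (TTL values use the upper bounds of the ranges stated in the policy; the semantic-threshold half of dedupe would additionally compare embeddings and is out of scope for this sketch):

```python
import hashlib

# Upper bounds of the policy ranges; None = no TTL (quarterly review).
TTL_DAYS = {"session": 30, "daily": 180, "memory_md": None}

def content_hash(text: str) -> str:
    # Exact-duplicate detection; a semantic threshold would complement this.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def upsert_last_write_wins(store: dict, key_class: str,
                           key: str, value: str) -> None:
    # Same key class (preference/profile/decision): last write wins.
    store[(key_class, key)] = (content_hash(value), value)

def is_expired(namespace: str, written_at: float, now: float) -> bool:
    ttl = TTL_DAYS.get(namespace)
    if ttl is None:
        return False        # MEMORY.md: no TTL, quarterly review instead
    return (now - written_at) > ttl * 86400
```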

C) Tianming 700-zh summary (business view)

This section is already captured in the deliverable page:

  • Deliverables -> LanceDB: failure modes and migration

D) Publishing: GitHub + Cloudflare Pages

We decided to publish to Cloudflare Pages for stability and convenience.

Key build settings (MkDocs Material)

  • Production branch: master
  • Build command: pip install -r docs-requirements.txt && mkdocs build
  • Output directory: site

Deployment debugging highlights

  • Root cause of "HTML shown as source / raw markdown": stale Cloudflare _redirects rewrite rules and _headers Content-Type overrides left over from the pre-MkDocs era.
  • Fix: remove docs/_redirects and docs/_headers, ensure MkDocs output is deployed.

Pointers

  • Deliverable: ../deliverables/lancedb-failure-modes-and-migration.md
  • Templates: ../page-templates/