PHI scrubbed before embedding
De-identification at ingest, not at retrieval. The vector store never sees raw PHI. This single decision shortened the compliance review by weeks — reviewers didn't have to audit what wasn't there.
We design and ship compliant AI systems for healthtech teams. Architecture, audit logs, tenant isolation, PHI boundaries, cloud infrastructure — done before they become a problem. We have shipped this. One healthtech client secured hospital partnerships 8 to 10 months ahead of their original timeline because the infrastructure was the unlock.
Reference architecture · anonymized
A Series A healthtech client. RAG over clinical notes. Multi-tenant, multi-hospital. The architecture cleared compliance review on first pass. Below: the decisions that mattered.
De-identification at ingest, not at retrieval. The vector store never sees raw PHI. This single decision shortened the compliance review by weeks — reviewers didn't have to audit what wasn't there.
Metadata filtering applied before retrieval, not after. Each query is constrained to the tenant's namespace at the index level. No cross-tenant leakage path exists at the architectural level — not just at the application layer.
Audit logs capture: query, retrieved chunks, model, prompt, response, timestamp, user. Any answer can be traced back to its sources. This is what hospital security teams asked about — not the model, the audit trail.
Engineers (including us) work in the client's environment with credentials that expire. No data egress. No copies of PHI on developer machines. Enforced at the cloud IAM level, not by policy.
Compliance wasn't the goal — it was the unlock. The AI system we built became the reason hospitals said yes.
What we got wrong first
We initially placed the PHI scrubber after retrieval. The thinking was simple: keep the raw clinical text in the vector store, scrub on the way out, more flexibility for retrieval quality. It tested fine. It failed the compliance conversation. The reviewers' question was not "does the scrubber work" — it was "why does PHI live in the vector store at all?"
The answer was that it shouldn't, and didn't have to. Moving the scrubber to ingest cost a week of refactoring and lost us nothing in retrieval quality (the de-identified text still embeds well — clinical structure carries the signal). It saved the client a month of compliance back-and-forth.
The lesson: in regulated AI, design from what reviewers will ask, not what the model needs.
What this sprint produces
Who it's for
A note on confidentiality
The reference architecture above is generalized from real work. We do not name clients, share repos, or publish architecture diagrams that could identify them. When we work with you, the same protection applies — your stack, your decisions, your repos stay yours.
Ready?