Compliance-grade RAG for tier-1 LATAM banking


Architecture diagram — compliance RAG pipeline, data plane × control plane × compliance plane, LGPD + BCB 4.893 + AI Act Art. 12

Anonymized writeup. Client name + product name omitted under non-compete. Architecture and stack patterns are fully discussable.

Ten thousand tax filings a day. A five-year auditor-replay window. A bank.

The setting: a tier-1 LATAM bank under LGPD (Brazil’s data-protection law, Lei 13.709/2018), where every release crosses a change-advisory board and deploys happen inside controlled windows. The internal team needed an LLM agent to triage Brazilian IRPF filings (classify return type, extract structured fields, flag anomalies for human review) and to log every decision so an auditor can reconstruct it years later. Latency budget for the agent path: ≤8s p95, ≤30s p99. Edge cases drain asynchronously through a human-review queue.

The failure mode is a fine, not a bad demo

Most RAG writeups optimize for answer quality. Here the binding constraint is different: a wrong classification cascades into a wrong tax-bracket assignment, and that ends in a fine plus a regulatory finding. “The model was 87% confident” is not a defensible answer to a regulator. Neither is a decision you can no longer reproduce because the prompt template changed eight quarters ago.

The inputs make it harder. Filings arrive as PDF, scanned image, legacy structured XML, and free-text email: the same field can live in five different places depending on filing year and filer category. And the compliance envelope is rigid. No personal data leaves the customer’s Azure tenancy, so no public LLM APIs — all inference runs inside their Azure OpenAI deployment. Every deploy needs change-board signoff plus a 72-hour staging soak; there is no emergency-hotfix path. Security review blocks any Python dependency that pulls binary wheels from non-standard indices. Runtime is Azure-only, with the Terraform module library cross-built for GCP as a portability hedge.

The Y-statement we committed to:

In the context of regulated tax-document triage, facing hallucination risk that ends in fines and a 5-year auditor-replay obligation, we decided on a single-Postgres RAG pipeline with forced-citation prompting and an append-only audit log, to achieve regulator-defensible decisions at ≤8s p95, accepting ~600ms of citation-validation latency and doubled per-year index storage.

One Postgres carries the whole thing

Retrieval pipeline flow: document ingest through format normalizer, chunker and pgvector to the LLM, citation extractor and confidence scorer; high-confidence auto-decide and the human-review queue both write to the append-only audit log; an eval-harness tap feeds offline regression

fig. 2 — retrieval pipeline · data / control / compliance planes

Ingest starts with a normalizer that converts PDF, image, XML and email into one canonical JSON envelope: {filing_id, year, filer_category, raw_text, raw_pages[], metadata}. PDF and image go through OCR (Azure Document Intelligence, pre-approved by security review). XML gets a pinned XSD parser. Email gets a strict regex extractor that falls through to “needs human” the moment a required field is missing: an extractor that guesses is a liability generator.
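The strict email path described above can be sketched roughly as follows. The envelope fields come from the text; the `Key: value` email layout, the regex patterns, and the names `Envelope`, `NeedsHuman` and `normalize_email` are hypothetical, chosen only to show the "fail to a human, never guess" behavior.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Envelope:
    """Canonical envelope; field names follow the shape given in the text."""
    filing_id: str
    year: int
    filer_category: str
    raw_text: str
    raw_pages: tuple[str, ...] = ()
    metadata: dict = field(default_factory=dict)


class NeedsHuman(Exception):
    """Raised when a required field is missing; the caller routes to review."""


# Hypothetical email layout: one "Key: value" line per required field.
REQUIRED_PATTERNS = {
    "filing_id": re.compile(r"^Filing-ID:\s*(\S+)\s*$", re.MULTILINE),
    "year": re.compile(r"^Year:\s*(\d{4})\s*$", re.MULTILINE),
    "filer_category": re.compile(r"^Category:\s*(\w+)\s*$", re.MULTILINE),
}


def normalize_email(body: str) -> Envelope:
    """Extract required fields strictly; never fill in a missing value."""
    found: dict[str, str] = {}
    for name, pattern in REQUIRED_PATTERNS.items():
        match = pattern.search(body)
        if match is None:
            raise NeedsHuman(f"missing required field: {name}")
        found[name] = match.group(1)
    return Envelope(
        filing_id=found["filing_id"],
        year=int(found["year"]),
        filer_category=found["filer_category"],
        raw_text=body,
        raw_pages=(body,),
        metadata={"source_format": "email"},
    )
```

Raising an exception rather than returning a partial envelope keeps the "needs human" branch impossible to ignore at the call site.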

Chunking is sentence-window (300 tokens, 50 overlap) with year-aware boundaries; Brazilian tax forms carry strict section dividers that make the year detectable. Embeddings come from text-embedding-3-large and land in Postgres + pgvector, indexed per year. The retrieval router runs year-scoped, filer-category-scoped k-NN and returns the top-8 hits with retrieval-method metadata — which embedding model, which index version — so the audit log captures how the hit set was produced, not just what it was.
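A minimal sketch of sentence-window chunking with the 300/50 parameters mentioned above. It assumes whitespace-separated words as a stand-in for the real tokenizer and a naive punctuation-based sentence splitter; the year-aware boundary detection is left out, and all function names are illustrative.

```python
from __future__ import annotations

import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def n_tokens(text: str) -> int:
    """Assumption: whitespace words approximate model tokens."""
    return len(text.split())


def chunk(text: str, max_tokens: int = 300, overlap: int = 50) -> list[str]:
    """Pack whole sentences into windows, sharing up to `overlap` tokens."""
    sentences = split_sentences(text)
    chunks: list[str] = []
    start = 0
    while start < len(sentences):
        end, total = start, 0
        # Always take at least one sentence, even an oversized one.
        while end < len(sentences) and (
            end == start or total + n_tokens(sentences[end]) <= max_tokens
        ):
            total += n_tokens(sentences[end])
            end += 1
        chunks.append(" ".join(sentences[start:end]))
        if end == len(sentences):
            break
        # Step back over trailing sentences to build the overlap,
        # but never back to `start`, so every window makes progress.
        next_start, shared = end, 0
        while (
            next_start - 1 > start
            and shared + n_tokens(sentences[next_start - 1]) <= overlap
        ):
            next_start -= 1
            shared += n_tokens(sentences[next_start])
        start = next_start
    return chunks
```

Keeping sentences whole means the overlap is "up to 50 tokens" rather than exactly 50, which is usually an acceptable trade for chunks that never cut a sentence in half.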

The model must answer in the shape {decision, confidence, citations[]}, where each citation is a span into the source document. Spans are validated post-hoc: cite a span that does not exist and the response is rejected and re-prompted, twice at most, then handed to a human.
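The validate-then-retry loop could look something like the sketch below. The span representation (character offsets plus the quoted text) is an assumption, as is treating an answer with no citations as invalid; `ask_model` is a hypothetical callable standing in for the Azure OpenAI call.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Citation:
    start: int   # character offset into the source (assumed representation)
    end: int
    quote: str   # text the model claims sits at source[start:end]


@dataclass(frozen=True)
class Answer:
    decision: str
    confidence: float
    citations: tuple[Citation, ...]


def citations_valid(answer: Answer, source: str) -> bool:
    """Every cited span must exist in the source and match its quote."""
    if not answer.citations:
        return False  # assumption: an uncited answer is rejected as well
    return all(
        0 <= c.start < c.end <= len(source)
        and source[c.start:c.end] == c.quote
        for c in answer.citations
    )


def decide(
    source: str,
    ask_model: Callable[[str, int], Answer],
    max_retries: int = 2,
) -> Optional[Answer]:
    """One initial attempt plus at most `max_retries` re-prompts."""
    for attempt in range(1 + max_retries):
        answer = ask_model(source, attempt)
        if citations_valid(answer, source):
            return answer
    return None  # caller hands the filing to the human-review queue
```

Returning `None` instead of the last rejected answer makes it harder for an unvalidated decision to leak into the auto-decide path.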

The audit log is not logging. It is the product. An append-only Postgres table, partitioned by year, where every row carries the input hash (SHA-256 of the canonical envelope), the retrieval hit-set hash, the full prompt with its template version, the full model response, the computed confidence, the final decision, and who reviewed it if escalated. The schema is versioned with Flyway. Nightly replication ships partitions to a separate Azure Storage account under a WORM (write-once-read-many) policy that enforces the 5-year retention.
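As a rough illustration of how one audit row might be assembled before the insert, the sketch below hashes a deterministic JSON serialization of the envelope and of the hit set. The column names and the choice of sorted-key JSON as the canonical form are assumptions, not the production schema.

```python
from __future__ import annotations

import hashlib
import json
from typing import Any, Optional


def canonical_bytes(obj: Any) -> bytes:
    """Serialize deterministically so equal content always hashes equally."""
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def sha256_hex(obj: Any) -> str:
    return hashlib.sha256(canonical_bytes(obj)).hexdigest()


def build_audit_row(
    envelope: dict,
    hits: list[dict],
    prompt: str,
    template_version: str,
    response: str,
    confidence: float,
    decision: str,
    reviewer: Optional[str] = None,
) -> dict:
    """Assemble one append-only row; column names are hypothetical."""
    return {
        "input_hash": sha256_hex(envelope),
        # Hit order is preserved on purpose: ranking is part of the evidence.
        "hit_set_hash": sha256_hex(hits),
        "prompt": prompt,
        "template_version": template_version,
        "response": response,
        "confidence": confidence,
        "decision": decision,
        "reviewer": reviewer,  # stays None unless the case was escalated
    }
```

Sorting keys and fixing separators matters for replay: a later re-serialization of the same envelope has to produce the same hash, regardless of dictionary ordering.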

Citations you can’t fake, retrieval you can’t contaminate

pgvector over a dedicated vector DB. One Postgres instance, with the pgvector extension handling retrieval, also carries the audit log and operational state: one backup story, one HA story, one access-control story. At our peak of ~50 retrievals/s, pgvector with HNSW holds the latency budget comfortably. Above ~500/s, Supabase’s 2025 vector-DB benchmark says Pinecone or Qdrant start pulling ahead. That is the reconsider line, and it is an order of magnitude away. The boring choice won, and boring is exactly what you want under a change-advisory board.

Forced citations, then span validation. The citation requirement plus post-hoc validation costs ~600ms when a retry fires. It bought a hallucinated-decision rate of zero in production. Not low: zero, because a fabricated span cannot survive validation against the source. The idea matches Anthropic’s citation API design; we applied it as prompt engineering on Azure OpenAI gpt-4-class models. With the audit gate as the binding constraint, the latency trade was not a hard call.

Retrieval scoped per year. Brazilian tax law shifts meaningfully year to year; a 2018 filing answered with 2024 retrieval context is confidently wrong. Year is therefore a first-class index dimension, and the router rejects cross-year hits unless a query explicitly opts in. Cost: per-year HNSW storage roughly doubles. Benefit: zero cross-year contamination in eval, ever since.
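One way to express the year scoping is to bake it into the query builder and then re-check the returned hits as a second line of defense. This is a sketch only: the table `filing_chunks`, its columns, and the function names are hypothetical; `<=>` is pgvector's cosine-distance operator, and the `%(name)s` placeholders follow the psycopg parameter style.

```python
from __future__ import annotations


def build_knn_query(
    year: int,
    filer_category: str,
    k: int = 8,
    allow_cross_year: bool = False,
) -> tuple[str, dict]:
    """Build a parameterized k-NN query; the caller supplies query_vec."""
    filters = ["filer_category = %(filer_category)s"]
    params: dict = {"filer_category": filer_category, "k": k}
    if not allow_cross_year:
        # Default path: the year filter is always present.
        filters.append("filing_year = %(year)s")
        params["year"] = year
    sql = (
        "SELECT chunk_id, filing_year, "
        "embedding <=> %(query_vec)s::vector AS distance "
        "FROM filing_chunks "
        f"WHERE {' AND '.join(filters)} "
        "ORDER BY embedding <=> %(query_vec)s::vector "
        "LIMIT %(k)s"
    )
    return sql, params


class CrossYearHit(Exception):
    """A hit from another filing year reached the router without opt-in."""


def check_hits(
    hits: list[dict], year: int, allow_cross_year: bool = False
) -> list[dict]:
    """Reject the whole hit set if any row falls outside the scoped year."""
    if not allow_cross_year:
        stray = [h["chunk_id"] for h in hits if h["filing_year"] != year]
        if stray:
            raise CrossYearHit(f"hits outside year {year}: {stray}")
    return hits
```

The post-query check is redundant when the SQL filter works, which is the point: a count of rejected hit sets is exactly the kind of number a cross-year regression gate can assert on.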

The eval harness is the deploy gate

Unit tests tell you the code still runs. They say nothing about whether last week’s prompt tweak quietly broke 2019 filings. The deployment gate is a regression harness over ~500 hand-labeled golden filings, run on every PR and nightly against the last 30 days of production traffic:

python

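from __future__ import annotations  # annotations stay lazy, so the names below need not be imported here

# Filing and EvalRun are the harness's own types; their definitions are not shown.
def fitness_gate(golden: list[Filing], run: EvalRun) -> None:
    """Deploy gate: fails the pipeline, not a dashboard."""
    # Run without `python -O`, which strips assert statements.
    assert run.decision_accuracy(golden) >= 0.985    # accuracy on the golden set
    assert run.hallucinated_citations(golden) == 0   # no citation may fail span validation
    assert run.p95_latency_s <= 8.0                  # agent-path latency budget
    assert run.cross_year_retrievals == 0            # no hit from outside the filing's year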

That harness is also why the human-review queue is a feature rather than a fallback. Every low-confidence decision a reviewer resolves, rationale included, feeds the golden set: the reviewers are labeling the eval data that tightens the system that routes them less work.
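For readers wondering what the gate's inputs might look like, here is a hypothetical, minimal shape for `Filing` and `EvalRun` that exposes the four metrics the gate checks. The field names and the nearest-rank percentile are assumptions; the real harness is not shown in this writeup.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Filing:
    filing_id: str
    expected_decision: str  # the hand-assigned label


@dataclass
class EvalRun:
    decisions: dict[str, str]          # filing_id -> decision the pipeline made
    invalid_citations: dict[str, int]  # filing_id -> citations failing validation
    latencies_s: list[float]           # one end-to-end latency per filing
    cross_year_retrievals: int = 0

    def decision_accuracy(self, golden: list[Filing]) -> float:
        if not golden:
            return 0.0
        correct = sum(
            self.decisions.get(f.filing_id) == f.expected_decision
            for f in golden
        )
        return correct / len(golden)

    def hallucinated_citations(self, golden: list[Filing]) -> int:
        return sum(self.invalid_citations.get(f.filing_id, 0) for f in golden)

    @property
    def p95_latency_s(self) -> float:
        if not self.latencies_s:
            raise ValueError("no latency samples recorded")
        ordered = sorted(self.latencies_s)
        # Nearest-rank percentile, in integer arithmetic: ceil(0.95 * n).
        rank = (95 * len(ordered) + 99) // 100
        return ordered[rank - 1]
```

A filing missing from `decisions` counts as wrong rather than being skipped, so a crash partway through a run lowers accuracy instead of hiding behind a smaller denominator.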

Where this pattern is wrong: an open-domain consumer chatbot, where the citation and audit overhead destroys UX latency for no regulatory payoff; anything past ~500 retrievals/s, where the retrieval layer should move to a dedicated vector DB; multi-tenant systems with cross-tenant data sharing, where the single-table pgvector model gets messy fast. It fits when the domain is regulated (banking, legal, healthcare, tax), the corpus has natural scope boundaries (year, category, jurisdiction), and there is budget for a human queue on low-confidence decisions.

If you build one thing on day 1, build the append-only log. Audit bolted on later means data gaps, and data gaps are audit findings. The rest — forced citations, scoped retrieval, boring storage — is what made the replay cheap.

Stack

Language: Python 3.12
API: FastAPI (single-process, sync handlers — async didn’t add value at this throughput)
Storage: Azure Database for PostgreSQL Flexible Server 16 + pgvector 0.7
LLM: Azure OpenAI Service — gpt-4-class deployment in customer tenant, text-embedding-3-large for embeddings
OCR: Azure Document Intelligence (pre-approved by security review)
Audit: Postgres append-only table + nightly replication to Azure Blob Storage with immutability policy
Infra: Terraform (modular library reused for parallel GCP target — Cloud SQL + Vertex AI + Cloud Storage)
CI/CD: Azure DevOps Pipelines (multi-stage YAML, environments + approvals)
Observability: Azure Monitor + Log Analytics workspace, custom dashboards for confidence histogram + retrieval-latency p99
Eval harness: custom Python harness; ~500 hand-labeled filings as golden set; runs in PR CI + nightly cron
