Architecture diagram — compliance RAG pipeline, data plane × control plane × compliance plane, LGPD + BCB 4.893 + AI Act Art. 12
fig. 1 — Container view · compliance-grade RAG · LATAM banking

Anonymized writeup. Client name + product name omitted under non-compete. Architecture and stack patterns are fully discussable.

Context

Tier-1 LATAM bank. Regulated environment: LGPD (Brazilian data-protection law), internal audit gates on every release, controlled deployment windows (deploys gated by change-advisory board). Internal team needed an LLM agent to triage tax-compliance documents (Brazilian IRPF — personal income tax filings) at scale: classify return type, extract structured fields, surface anomalies for human review, log every decision for auditor replay.

Volume target: ~10K filings/day at peak. Latency budget: ≤8s per filing (p95) for the agent path; the human-review queue handles edge cases asynchronously.

Problem statement

Three concrete problems:

  1. Document heterogeneity. Filings arrive as PDF, scanned image, structured XML (older format), and free-text email. Same field can sit in 5 different places depending on filing year + filer category.
  2. Hallucination risk in regulated context. A wrong classification cascades to wrong tax-bracket assignment → fine + potential regulatory finding. LLM “I’m 87% confident” is not acceptable as the only signal.
  3. Audit replay. Auditors must reconstruct any decision 5+ years later. Need every input, every retrieval hit, every prompt, every model output, every confidence score, and every human override stored immutably.

Constraints

  • LGPD: no personal data leaves the customer’s Azure tenancy. No public LLM API calls. All inference inside customer’s Azure OpenAI deployment.
  • Audit gate: every deploy requires change-board signoff + 72h soak in staging. No emergency hotfixes.
  • Soft real-time: 8s p95 target for the agent path, 30s p99.
  • No Python dependencies that pull binary wheels from non-standard indices (security review blocker).
  • Single-cloud: Azure-only for runtime, but Terraform module library cross-built for GCP for future portability.

Architecture

flowchart LR
    A[Document ingest] --> B[Format normalizer]
    B --> C[Chunker + embedder]
    C --> D[(Postgres + pgvector)]
    D --> E[Retrieval router]
    E --> F[Azure OpenAI gpt-4-class]
    F --> G[Citation extractor]
    G --> H[Confidence scorer]
    H -->|high conf| I[Auto-decide + log]
    H -->|low conf| J[Human-review queue]
    I --> K[(Audit log<br/>append-only)]
    J --> K
    F -.eval harness.-> L[Offline regression]
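
The two branches out of the confidence scorer can be sketched as follows. This is a minimal illustration, not the production scorer: the `Scored` type, the `route` function, and the 0.90 threshold are hypothetical. The only behaviour taken from the writeup is that high-confidence results are auto-decided, low-confidence results go to the human-review queue, and both paths write to the audit log. Treating an unvalidated citation as an automatic escalation is an assumption that follows from the rule that model confidence is never the only signal.

```python
from dataclasses import dataclass

# Hypothetical threshold; the writeup does not state the production value.
AUTO_DECIDE_THRESHOLD = 0.90


@dataclass(frozen=True)
class Scored:
    filing_id: str
    decision: str
    confidence: float       # output of the confidence scorer, 0.0..1.0
    citations_valid: bool   # result of post-hoc span validation


def route(item: Scored, audit_log: list) -> str:
    """Return 'auto' or 'human_review' and append one audit record either way."""
    # A high score alone is not enough: unvalidated citations always escalate.
    if item.citations_valid and item.confidence >= AUTO_DECIDE_THRESHOLD:
        path = "auto"
    else:
        path = "human_review"
    audit_log.append({
        "filing_id": item.filing_id,
        "decision": item.decision,
        "confidence": item.confidence,
        "path": path,
    })
    return path
```

In this sketch the audit log is a plain list so the routing logic can be exercised without a database; in the pipeline above it would be the append-only table.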

Five components carry most of the weight:

  1. Format normalizer. Converts PDF / image / XML / email into a canonical JSON envelope {filing_id, year, filer_category, raw_text, raw_pages[], metadata}. PDF + image go through OCR (Azure Document Intelligence — pre-approved by security review). XML uses a pinned XSD parser. Email uses a strict regex extractor with explicit fallback to “needs human” if the regex misses required fields.
  2. Chunker + embedder. Sentence-window chunking (300 tokens, 50 overlap) with year-aware boundary detection — Brazilian tax forms have strict section dividers that are easy to detect. Embeddings via Azure OpenAI text-embedding-3-large, stored in Postgres + pgvector with a per-year index for retrieval scoping.
  3. Retrieval router. Year-scoped + filer-category-scoped k-NN search. Returns top-8 hits with explicit retrieval-method metadata so the audit log captures which embedding model + which index version produced the hit set.
  4. Citation extractor. Forces the LLM to emit {decision, confidence, citations[]} where citations are spans into the original document. Spans validated post-hoc — if the LLM cites a span that doesn’t exist in the source, the response is rejected and re-prompted (max 2 retries) before falling through to human review.
  5. Audit log. Append-only Postgres table partitioned by year. Every row carries: input hash (SHA-256 of canonical JSON), retrieval hit set hash, full prompt (with prompt-template version), full model response, computed confidence, final decision, who-reviewed (if escalated). Schema versioned via Flyway migrations. Logs are replicated nightly to a separate Azure Storage account with a WORM (write-once-read-many) policy enforcing 5-year retention.
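
The audit row in item 5 could be assembled along these lines. This is a sketch under stated assumptions: the column names, the `build_audit_row` helper, and the choice of sorted-key compact JSON as the canonical form are illustrative, since the writeup only says that the input hash is a SHA-256 over canonical JSON and lists which fields each row carries.

```python
from __future__ import annotations

import hashlib
import json
from datetime import datetime, timezone


def canonical_json(obj) -> bytes:
    """Serialize with sorted keys and fixed separators so equal content hashes equally."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def sha256_hex(obj) -> str:
    """SHA-256 hex digest of the canonical JSON form of obj."""
    return hashlib.sha256(canonical_json(obj)).hexdigest()


def build_audit_row(envelope: dict, hits: list[dict], prompt: str,
                    prompt_template_version: str, response: str,
                    confidence: float, decision: str,
                    reviewed_by: str | None = None) -> dict:
    """Assemble one append-only audit record (column names are hypothetical)."""
    return {
        "filing_id": envelope["filing_id"],
        "filing_year": envelope["year"],          # partition key
        "input_hash": sha256_hex(envelope),
        "retrieval_hits_hash": sha256_hex(hits),
        "prompt": prompt,
        "prompt_template_version": prompt_template_version,
        "model_response": response,
        "confidence": confidence,
        "decision": decision,
        "reviewed_by": reviewed_by,               # None unless escalated
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
```

Sorting keys before hashing means two envelopes with the same content but different key order produce the same digest, which is what lets a replay years later recompute and compare the hash.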

Decision narrative

Three decisions worth calling out — each with trade-offs.

Decision 1: pgvector over a dedicated vector DB. A single Postgres deployment covers retrieval (through the pgvector extension), the audit log, and the operational state. One backup story, one HA story, one access-control story. Trade-off: at our peak query rate (~50 retrievals/s) pgvector with HNSW indexing comfortably handles latency, but we’d reconsider above ~500/s where Pinecone or Qdrant start pulling ahead per Supabase’s 2025 vector-DB benchmark. Operational simplicity won.

Decision 2: Forced-citation prompting + post-hoc span validation. Adding the citation requirement to the prompt + validating spans against the source raises latency by ~600ms (one extra round trip when retry needed) but cuts hallucinated-decision rate dramatically. The pattern is grounded in Anthropic’s citation API design — same idea, applied via prompt engineering on Azure OpenAI gpt-4-class models. Trade-off accepted because audit gate is the binding constraint.
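
A minimal sketch of the post-hoc span check and retry loop follows. The citation shape (`start`, `end`, `text` as character offsets into the source text) and the `call_model` callable are assumptions made for illustration; the writeup only states that citations are spans into the original document, that a response citing a non-existent span is rejected, and that there are at most 2 retries before human review.

```python
from __future__ import annotations

from typing import Callable

MAX_RETRIES = 2  # from the writeup: re-prompt at most twice


def spans_valid(raw_text: str, citations: list[dict]) -> bool:
    """Every citation must point at text that really exists at that offset."""
    if not citations:
        return False  # assumption: a decision with no citation is rejected
    for c in citations:
        start, end = c.get("start"), c.get("end")
        if not (isinstance(start, int) and isinstance(end, int)):
            return False
        if not (0 <= start < end <= len(raw_text)):
            return False
        if raw_text[start:end] != c.get("text"):
            return False
    return True


def decide_with_citations(raw_text: str,
                          call_model: Callable[[str], dict]) -> dict | None:
    """Return a validated response, or None to signal 'send to human review'."""
    for _attempt in range(1 + MAX_RETRIES):  # first try plus retries
        response = call_model(raw_text)
        if spans_valid(raw_text, response.get("citations", [])):
            return response
    return None
```

Comparing the cited text against the slice at the stated offsets is the strictest variant; a production check might normalize whitespace or OCR artifacts first, which is left out here.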

Decision 3: Per-year retrieval scoping. Brazilian tax law changes meaningfully year-to-year. A 2018-filing prompt retrieving 2024 examples produces wrong answers fast. Solution: year is a first-class index dimension. The retrieval router rejects cross-year hits unless the query explicitly opts in. Cost: per-year HNSW index storage roughly doubles. Benefit: zero cross-year contamination in eval.
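
The year-scoped lookup might be expressed as a parameterized query like the one below. The table and column names (`chunks`, `embedding`, `filing_year`, `filer_category`), the `build_scoped_query` helper, and the `%s` placeholder style (as used by psycopg) are all assumptions; the top-8 limit and the default rejection of cross-year hits come from the writeup. `<=>` is pgvector's cosine-distance operator, chosen here for illustration since the writeup does not name the distance metric.

```python
from __future__ import annotations

TOP_K = 8  # from the writeup: the router returns the top-8 hits


def build_scoped_query(query_embedding: list[float], filing_year: int,
                       filer_category: str, allow_cross_year: bool = False
                       ) -> tuple[str, list]:
    """Build a pgvector k-NN query scoped by filer category and, by default, year."""
    # pgvector accepts a vector literal of the form '[x1,x2,...]'.
    vec = "[" + ",".join(repr(float(x)) for x in query_embedding) + "]"
    where = ["filer_category = %s"]
    params: list = [vec, filer_category]
    if not allow_cross_year:  # default: cross-year hits are excluded
        where.append("filing_year = %s")
        params.append(filing_year)
    sql = (
        "SELECT chunk_id, filing_year, embedding <=> %s::vector AS distance "
        "FROM chunks WHERE " + " AND ".join(where) +
        " ORDER BY distance LIMIT %s"
    )
    params.append(TOP_K)
    return sql, params
```

Making the year filter the default and the cross-year search an explicit opt-in mirrors the rule that the router rejects cross-year hits unless the query asks for them.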

Lessons learned

  • Force the LLM to cite + validate spans. The single biggest hallucination-reduction technique. Costs latency, buys regulator-defensible decisions.
  • Append-only audit log is non-negotiable. Build it on day 1. Bolt-on audit later = data gaps = audit finding.
  • Retrieval scope > model size. A smaller model with tightly scoped retrieval beat a larger model with broader retrieval on accuracy + latency + cost.
  • Eval harness is the deployment gate, not unit tests. Regression tests against frozen golden examples (~500 filings, hand-labeled) catch prompt drift faster than any other signal. Runs on every PR + on a nightly batch against last 30 days of production traffic.
  • Human review is a feature, not a fallback. Surfacing low-confidence decisions to humans + capturing their rationale built the dataset that lets the eval harness improve over time.
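
The eval-harness gate from the lessons above can be sketched as a small regression check. The record shape (`filing_id`, `envelope`, `expected_decision`), the `classify` callable, and the baseline and tolerance parameters are hypothetical; the writeup only says the harness runs frozen, hand-labeled golden examples on every PR and nightly.

```python
from __future__ import annotations

from typing import Callable


def run_golden_set(golden: list[dict],
                   classify: Callable[[dict], str]) -> dict:
    """Compare predictions with hand labels; return accuracy and the failing ids."""
    failures = [ex["filing_id"] for ex in golden
                if classify(ex["envelope"]) != ex["expected_decision"]]
    total = len(golden)
    accuracy = (total - len(failures)) / total if total else 0.0
    return {"total": total, "accuracy": accuracy, "failures": failures}


def gate(report: dict, baseline_accuracy: float, tolerance: float = 0.0) -> bool:
    """Pass only if accuracy has not dropped below the recorded baseline."""
    return report["accuracy"] >= baseline_accuracy - tolerance
```

Returning the failing filing ids alongside the score keeps the report useful for reviewers: a blocked PR can point at the specific golden examples that regressed.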

Stack

  • Language: Python 3.12
  • API: FastAPI (single-process, sync handlers — async didn’t add value at this throughput)
  • Storage: Azure Database for PostgreSQL Flexible Server 16 + pgvector 0.7
  • LLM: Azure OpenAI Service — gpt-4-class deployment in customer tenant, text-embedding-3-large for embeddings
  • OCR: Azure Document Intelligence (pre-approved by security review)
  • Audit: Postgres append-only table + nightly replication to Azure Blob Storage with immutability policy
  • Infra: Terraform (modular library reused for parallel GCP target — Cloud SQL + Vertex AI + Cloud Storage)
  • CI/CD: Azure DevOps Pipelines (multi-stage YAML, environments + approvals)
  • Observability: Azure Monitor + Log Analytics workspace, custom dashboards for confidence histogram + retrieval-latency p99
  • Eval harness: custom Python harness; ~500 hand-labeled filings as golden set; runs in PR CI + nightly cron

When this pattern fits

  • Regulated domain (banking, legal, healthcare, tax) requiring auditor replay
  • Document corpus that has natural boundary markers (year, category, jurisdiction)
  • Budget for an internal review queue (human-in-the-loop on low-confidence decisions)

When it doesn’t

  • Open-domain consumer chatbot — overkill; the audit + citation overhead destroys UX latency
  • Throughput >500 retrievals/s — switch retrieval layer to a dedicated vector DB
  • Multi-tenant with cross-tenant data sharing — pgvector single-table model gets messy fast