This is Part 6 of the RAG Enterprise Series. This post covers the supporting infrastructure that applies across all four domains (Parts 2–5). It can be read independently.
The decision of which RAG level to use (covered in Part 1) and the domain-specific application (Parts 2–5) both depend on getting the supporting stack right. Memory architecture, prompt engineering, fine-tuning strategy, and embedding quality are the multipliers on retrieval sophistication — a well-indexed L2 system with good embeddings will often outperform a poorly-constructed L3.
6. Cross-Cutting Supporting Stack
6.1 Memory Architecture — On-the-Go Preparation
A production multi-domain RAG system needs a layered memory architecture:
- Short-term working memory (rebuilt every turn):
  - Current conversation turn
  - Recently retrieved chunks
  - Active reasoning chain
  - Tool call history
- Session memory (persists across turns):
  - Session summaries
  - Active task state
  - Reflection notes
  - Partial results
  - Conversation arc
- Long-term memory (persists across sessions):
  - User profile embeddings
  - Historical interaction summaries
  - Domain knowledge base
  - Fine-tuned fact store
On-the-go memory construction is the key pattern across all four domains: rather than carrying a monolithic chat history, the system reassembles the prompt context from these layers on every turn.
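A minimal sketch of what that can look like. All class, field, and method names here are illustrative rather than taken from the series; the point is that the context is rebuilt from the layers each turn, within a budget, instead of accumulated as raw history:

```python
from dataclasses import dataclass, field

@dataclass
class LayeredMemory:
    # Short-term working memory: rebuilt on every turn
    current_turn: str = ""
    retrieved_chunks: list = field(default_factory=list)
    tool_history: list = field(default_factory=list)
    # Session memory: persists across turns
    session_summary: str = ""
    task_state: dict = field(default_factory=dict)
    # Long-term memory: persists across sessions
    user_profile: str = ""

    def assemble_context(self, max_chars: int = 8000) -> str:
        """Rebuild the prompt context on the fly, most durable layers first."""
        parts = [
            f"User profile: {self.user_profile}",
            f"Session so far: {self.session_summary}",
            f"Task state: {self.task_state}",
            "Retrieved evidence:\n" + "\n".join(self.retrieved_chunks),
            "Tool calls so far: " + "; ".join(self.tool_history),
            f"Current question: {self.current_turn}",
        ]
        context = "\n\n".join(parts)
        return context[:max_chars]  # crude character budget; real systems count tokens
```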
6.2 Prompt Engineering Patterns
Pattern 1 — Role + Constraint + Context + Task:
Role: You are a [DOMAIN_EXPERT_PERSONA].
Constraint: [REGULATORY_GUARDRAILS + HALLUCINATION_PREVENTION]
Context: [RETRIEVED_DOCS + USER_PROFILE]
Task: [SPECIFIC_QUERY]
Output format: [STRUCTURED_OUTPUT_SPEC]
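A sketch of filling Pattern 1's slots; the helper name, arguments, and example values are illustrative:

```python
def build_prompt(persona: str, guardrails: str, docs: list,
                 profile: str, query: str, output_spec: str) -> str:
    """Fill the Role + Constraint + Context + Task template (Pattern 1)."""
    context = "\n---\n".join(docs)
    return (
        f"Role: You are a {persona}.\n"
        f"Constraint: {guardrails} Answer only from the context; "
        "say 'insufficient information' rather than guessing.\n"
        f"Context:\n{context}\n\nUser profile: {profile}\n"
        f"Task: {query}\n"
        f"Output format: {output_spec}\n"
    )

prompt = build_prompt(
    persona="licensed wealth-management advisor",
    guardrails="Do not give individualized investment advice.",
    docs=["Fund fact sheet: ...", "Client risk questionnaire: ..."],
    profile="Risk tolerance: moderate",
    query="Summarize the fund's fee structure.",
    output_spec="Three bullet points, plain English.",
)
```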
Pattern 2 — Chain-of-thought with domain priming:
"Before answering, think step by step:
1. What are the hard constraints on this answer? (regulatory, factual, user-specific)
2. What information do I have that's relevant?
3. What is missing that would change my answer?
4. What is the safest defensible answer given uncertainty?"
Pattern 3 — Retrieval quality self-assessment (for Agentic RAG):
"After reviewing the retrieved documents, answer:
- Do I have sufficient context to answer this question with confidence?
- Is there a specific type of document I'm missing?
- Are any of the retrieved documents potentially outdated or contradictory?
If insufficient: specify what additional retrieval is needed before proceeding."
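In practice Pattern 3 becomes a loop. A sketch, assuming hypothetical `retrieve(query) -> list[str]` and `llm(prompt) -> str` callables and a model that honors the JSON instruction; production code would validate the JSON before trusting it:

```python
import json

SELF_CHECK = (
    "After reviewing the retrieved documents, answer in JSON: "
    '{"sufficient": true|false, "missing": "<document type needed, or empty>"}'
)

def agentic_answer(query, retrieve, llm, max_rounds=3):
    """Pattern 3 as a loop: retrieve, self-assess, re-retrieve until sufficient."""
    docs = retrieve(query)
    for _ in range(max_rounds):
        verdict = json.loads(
            llm(f"{SELF_CHECK}\n\nQuery: {query}\nDocuments:\n" + "\n".join(docs))
        )
        if verdict["sufficient"]:
            break
        docs += retrieve(verdict["missing"])  # targeted follow-up retrieval
    return llm(
        "Answer using only these documents:\n" + "\n".join(docs)
        + f"\n\nQuestion: {query}"
    )
```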
Pattern 4 — Multi-persona validation (for high-stakes decisions):
"Generate the answer from three perspectives:
(A) Optimistic: best-case interpretation of available data
(B) Conservative: what would a cautious senior analyst say?
(C) Devil's advocate: what is the strongest counter-argument?
Then synthesize a final recommendation acknowledging the tension."
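A sketch of how Pattern 4 composes into code, reusing the same hypothetical `llm` callable as in the Pattern 3 sketch; the persona instructions are paraphrases of (A)-(C) above:

```python
PERSPECTIVES = {
    "OPTIMISTIC": "Give the best-case interpretation of the available data.",
    "CONSERVATIVE": "Answer as a cautious senior analyst would.",
    "DEVIL'S ADVOCATE": "Make the strongest counter-argument to the obvious answer.",
}

def multi_persona_answer(query, context, llm):
    """Pattern 4: three separate perspectives, then a synthesis naming the tension."""
    drafts = "\n\n".join(
        f"({name}) " + llm(f"{instruction}\n\nContext:\n{context}\n\nQuestion: {query}")
        for name, instruction in PERSPECTIVES.items()
    )
    return llm(
        "Synthesize a final recommendation that acknowledges the tension "
        f"between these three perspectives:\n\n{drafts}"
    )
```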
7. Embedding Improvements — Detailed Breakdown
Embedding quality is the highest-leverage infrastructure investment for RAG quality, worth making before touching the retrieval architecture level at all.
7.1 Standard Dense Embeddings and Their Limitations
Standard embeddings (OpenAI text-embedding-3-large, Sentence-BERT) fail on:
| Failure Mode | Example |
|---|---|
| Pronoun resolution | "The patient took it twice daily. What is the dosage?" — "it" carries none of the drug's semantics, so the chunk won't match the dosage query |
| Cultural semantic drift | "Luxury" in a Japanese travel context ≠ "luxury" in a Brazilian one |
| Domain vocabulary gap | "CET1 ratio" has weak embedding if not in training corpus |
| Temporal context | "Recent" and "2024" are far apart in embedding space, yet a query for "recent" results should surface 2024 documents |
| Cross-lingual mismatch | French query → English document corpus → semantic gap |
| Negation | "No shellfish" and "shellfish" are almost identical in dense embedding space |
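The negation row is easy to verify with off-the-shelf tooling; a quick check using sentence-transformers (the model choice is illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
a, b = model.encode(["no shellfish", "shellfish"])
print(util.cos_sim(a, b))  # typically very high: the negation barely moves the vector
```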
7.2 Improvements by Technique
Rule-based negation handling: rewrite "no X" as a distinct token such as "NOT_X" before embedding, so negated and affirmative mentions stop colliding in vector space.
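A sketch of the rule; the negator list is illustrative and would need domain tuning (clinical negation detection is its own subfield, e.g. NegEx). Apply the same rewrite to documents and queries so both sides of the match agree:

```python
import re

NEGATORS = r"(?:no|not|without|denies|negative for)"

def mark_negation(text: str) -> str:
    """Rewrite 'no X' as the distinct token 'NOT_X' before embedding or indexing."""
    return re.sub(
        rf"\b{NEGATORS}\s+(\w+)",
        lambda m: f"NOT_{m.group(1)}",
        text,
        flags=re.IGNORECASE,
    )

print(mark_negation("Patient reports no shellfish allergy"))
# -> Patient reports NOT_shellfish allergy
```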
7.3 SSM-Based Encoders for Domain-Specific Embeddings
The Mamba papers open up a design option that isn't widely deployed yet but is a near-term architectural consideration:
Current pattern: use a Transformer-based encoder (a BERT variant) to produce embeddings, then a Transformer-based decoder (a GPT variant) to generate answers. Both components pay the quadratic attention cost in sequence length.
Emerging pattern: Use an SSM-based encoder for long-document embeddings.
Why this matters for enterprise domains with long source documents:
Healthcare:
- EHR summary: 3,000 tokens → BERT-based encoder: O(3,000²) attention
- Clinical note corpus: 500 documents × 3,000 tokens = 1.5M tokens to embed
- Mamba encoder: O(3,000) per document — linear, scales to the full corpus
Wealth Management:
- Offering memorandum: 50,000 tokens
- Standard BERT: impossible to embed in one pass (exceeds the context window)
- Chunking required: loses cross-section coherence
- Mamba encoder: can process the full document → better global embedding
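Back-of-envelope arithmetic for the healthcare numbers above (unit operation counts only; real encoders differ by large constant factors):

```python
n_tokens, n_docs = 3_000, 500

quadratic = n_docs * n_tokens ** 2   # Transformer self-attention: ~n^2 per document
linear = n_docs * n_tokens           # SSM scan: ~n per document

print(f"{quadratic:,} vs {linear:,}")  # 4,500,000,000 vs 1,500,000
print(quadratic // linear)             # the linear scan does 3,000x less work here
```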
The long-context embedding capability of SSM-based encoders is particularly valuable for the prospectus analysis, clinical guideline embedding, and regulatory document embedding use cases described earlier. A single document-level embedding that preserves the global context of a 50,000-token document is not possible with standard Transformer encoders without chunking and averaging — which loses structural coherence.
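For contrast, the chunk-and-average workaround the paragraph describes, sketched with sentence-transformers; the model and chunk size are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def chunked_doc_embedding(text: str, chunk_chars: int = 2000) -> np.ndarray:
    """The standard Transformer-era workaround: embed fixed-size chunks,
    then mean-pool. Each chunk is encoded blind to the others, which is
    exactly where the cross-section coherence is lost."""
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    return model.encode(chunks).mean(axis=0)
```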
Part of the RAG Enterprise Series. Next: Mamba and SSMs — What the Generation Backbone Change Means for RAG.