Medium · GenAI Architecture
How would you evaluate and prevent hallucination in a production LLM app?
Your team has shipped an LLM-powered feature that summarizes legal contracts. A customer reports the model invented a clause that wasn't in the document.
**Questions:**
1. How do you evaluate hallucination at scale before shipping?
2. What architectural changes reduce hallucination risk?
3. How do you monitor for hallucination in production?
**Hints:**
- Evaluation: LLM-as-judge, faithfulness metrics (RAGAS), human spot-checks.
- Architecture: grounding (only answer from retrieved text), citation enforcement, confidence scores.
- Production: user feedback loop, automated consistency checks.
**Solution:**
**Evaluation:**
- Build a golden dataset of contract → expected summary pairs
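- Use the RAGAS faithfulness score: the fraction of claims in the summary that can be inferred from the source context
- Use an LLM-as-judge to check semantic correctness against the expected summary, and spot-check the judge's verdicts with human review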
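A faithfulness-style metric can be sketched as the share of summary claims that a judge marks as supported by the source. The snippet below is a minimal illustration under stated assumptions, not the RAGAS implementation: `Judge`, `naive_judge`, and `faithfulness` are hypothetical names, and the substring-matching judge is only a stand-in for an LLM call.

```python
from typing import Callable, List

# Hypothetical judge signature: True if `claim` is supported by `context`.
# In a real pipeline this would wrap an LLM call; that part is assumed here.
Judge = Callable[[str, str], bool]


def naive_judge(claim: str, context: str) -> bool:
    """Stand-in judge: case-insensitive substring match only."""
    return claim.lower() in context.lower()


def faithfulness(claims: List[str], context: str, judge: Judge = naive_judge) -> float:
    """Fraction of claims the judge marks as supported by the context."""
    if not claims:
        return 1.0  # nothing asserted, so nothing is unsupported
    supported = sum(1 for claim in claims if judge(claim, context))
    return supported / len(claims)


contract = "The tenant shall pay rent monthly. The lease term is 12 months."
summary_claims = [
    "the lease term is 12 months",
    "the tenant may sublet the unit",  # invented clause, absent from the contract
]
score = faithfulness(summary_claims, contract)
```

Running this over every item in the golden dataset and tracking the mean score gives a simple pre-release gate; extracting the claims from a free-text summary is a separate step that this sketch assumes has already been done.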
**Architectural mitigations:**
- Strict grounding: "Answer only using the provided document. If not found, say so."
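- Return citations with character offsets so the app can verify each quoted span against the source text
- Low temperature (0.0-0.2) for factual extraction tasks; this reduces sampling variability but does not by itself guarantee faithful output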
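Citation verification by character offset can be done deterministically in application code. The following is a minimal sketch; the `Citation` shape (half-open `[start, end)` offsets plus the quoted text) and the helper names are assumptions for illustration, not a format any particular model returns.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Citation:
    start: int  # inclusive character offset into the source document
    end: int    # exclusive character offset
    quote: str  # text the model claims appears at [start, end)


def verify_citation(document: str, citation: Citation) -> bool:
    """True only if the offsets are in range and the span matches the quote exactly."""
    if not (0 <= citation.start < citation.end <= len(document)):
        return False
    return document[citation.start:citation.end] == citation.quote


def unverified(document: str, citations: List[Citation]) -> List[Citation]:
    """Citations that fail verification and should be blocked or flagged."""
    return [c for c in citations if not verify_citation(document, c)]


document = "Section 4. Either party may terminate with 30 days notice."
good_quote = "terminate with 30 days notice"
start = document.index(good_quote)
citations = [
    Citation(start, start + len(good_quote), good_quote),
    Citation(0, 20, "Tenant waives all rights"),  # fabricated clause
]
bad = unverified(document, citations)
```

Exact matching is deliberately strict: a summary sentence whose citation fails the check can be dropped or routed to review rather than shown to the customer.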
**Production monitoring:**
- Track user correction rate (explicit feedback)
- Automated checks: does every claim in the summary appear in the source?
- PII detection on outputs
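The user correction rate can be tracked over a rolling window and compared against an alert threshold. This is a minimal sketch: `CorrectionRateMonitor`, the window size, and the threshold are illustrative assumptions, and a production system would typically send these events to its existing metrics stack instead.

```python
from collections import deque


class CorrectionRateMonitor:
    """Rolling share of summaries that users explicitly corrected."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.events = deque(maxlen=window)  # oldest events fall off automatically
        self.threshold = threshold

    def record(self, corrected: bool) -> None:
        self.events.append(corrected)

    @property
    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def should_alert(self) -> bool:
        # Wait for a full window so a single early correction does not trigger an alert.
        return len(self.events) == self.events.maxlen and self.rate > self.threshold


monitor = CorrectionRateMonitor(window=10, threshold=0.2)
for corrected in [False] * 7 + [True] * 3:
    monitor.record(corrected)
alert = monitor.should_alert()
```

Here three corrections in a window of ten exceed the 0.2 threshold, so `alert` is true. The same pattern applies to the automated claim check: record a failure whenever a summary contains a claim that cannot be matched to the source.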