Hard · GenAI Architecture
Design a RAG pipeline for a customer support chatbot with 10M documents.
A B2B SaaS company wants to build a customer support chatbot that answers questions from its internal knowledge base (10M support articles, product docs, and resolved tickets).
**Requirements:**
- Response latency < 3s
- Answers must cite sources
- Must handle multi-turn conversations
- Knowledge base updates daily
**Design the RAG pipeline.**
**Hints:**
1. 10M documents: think about chunking strategy and the vector index at scale (pgvector? Pinecone? Weaviate?).
2. Multi-turn: where do you store conversation history? How do you reformulate follow-up questions?
3. Daily updates: incremental indexing vs. full re-index.
**Solution:**
**Pipeline:**
1. **Ingestion** — Chunk documents (512 tokens, 10% overlap). Embed with text-embedding-3-large. Store in Pinecone or pgvector.
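2. **Query reformulation** — Use an LLM to rewrite each follow-up question into a standalone query, given the recent chat history or a running summary of it. HyDE can optionally be layered on top: embed an LLM-generated hypothetical answer instead of the raw query.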
3. **Retrieval** — Hybrid search: dense (vector) + sparse (BM25). Re-rank with cross-encoder.
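4. **Generation** — GPT-4o with the retrieved chunks as context. The prompt requires every claim to cite the source of the chunk it came from, so each chunk must carry its source metadata (document ID, URL) from ingestion.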
5. **Updates** — Incremental upsert on document change events via webhook.
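Step 3 does not say how the dense and sparse result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than the solution's prescribed method. The chunk IDs, the `k=60` constant, and the function name are illustrative; the fused list would then be passed to the cross-encoder for re-ranking.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of chunk IDs into one ranked list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1 for the best hit in each list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    fused = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return fused[:top_n]


# Hypothetical results from the two retrievers (best hit first).
dense_hits = ["kb-17#2", "kb-04#0", "tkt-981#1", "kb-22#3"]  # vector search
sparse_hits = ["tkt-981#1", "kb-17#2", "doc-55#4"]           # BM25

candidates = reciprocal_rank_fusion([dense_hits, sparse_hits], top_n=3)
print(candidates)  # ['kb-17#2', 'tkt-981#1', 'kb-04#0']
```

RRF uses only ranks, so it avoids having to normalize BM25 scores against cosine similarities. Chunks that appear in both lists rise to the top.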
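**Trade-offs:** pgvector is cheaper to run; a managed service such as Pinecone scales more easily at 10M+ documents, which after chunking can mean several times as many vectors. Cross-encoder re-ranking adds roughly 300 ms of latency, so re-rank only a small set of top candidates to stay within the 3 s budget.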
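A back-of-envelope estimate makes the scale trade-off concrete. The sketch below counts chunks per document for 512-token chunks with 10% overlap and estimates the raw vector storage. The average document length (1,000 tokens) and float32 storage are assumptions, not figures from the problem statement. The 3,072 dimensions are the default output size of text-embedding-3-large.

```python
import math


def chunks_per_doc(doc_tokens, chunk_tokens=512, overlap_frac=0.10):
    """Number of sliding-window chunks needed to cover one document."""
    overlap = int(chunk_tokens * overlap_frac)  # 51 tokens shared between neighbours
    stride = chunk_tokens - overlap             # 461 new tokens per extra chunk
    if doc_tokens <= chunk_tokens:
        return 1
    return 1 + math.ceil((doc_tokens - chunk_tokens) / stride)


def index_size_gib(num_docs, avg_doc_tokens, dims=3072, bytes_per_dim=4):
    """Return (vector count, raw vector storage in GiB), excluding index overhead."""
    vectors = num_docs * chunks_per_doc(avg_doc_tokens)
    return vectors, vectors * dims * bytes_per_dim / 2**30


# Assumed average length of 1,000 tokens per document (hypothetical).
vectors, gib = index_size_gib(10_000_000, avg_doc_tokens=1000)
print(f"{vectors:,} vectors, about {gib:.0f} GiB of raw float32 embeddings")
```

Under these assumptions the index holds about 30M vectors, not 10M, and the raw embeddings alone run to hundreds of GiB. That is why the choice of vector store, and options such as requesting shorter embeddings or quantizing them, matter at this scale.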