Hard · 🤖 GenAI Architecture

Design a RAG pipeline for a customer support chatbot with 10M documents.

A B2B SaaS company wants to build a customer support chatbot that answers questions using their internal knowledge base (10M support articles, product docs, and resolved tickets).

**Requirements:**

- Response latency < 3s
- Answers must cite sources
- Must handle multi-turn conversations
- Knowledge base updates daily

**Design the RAG pipeline.**

💡 Hints (3)

1. 10M documents: think about chunking strategy and vector index at scale (pgvector? Pinecone? Weaviate?).
2. Multi-turn: where do you store conversation history? How do you reformulate follow-up questions?
3. Daily updates: incremental indexing vs full re-index.
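For hint 1, a back-of-envelope estimate shows why the index choice matters. The sketch below is illustrative only. The average document length (1,000 tokens), the chunk size (512 tokens) and the overlap (51 tokens, about 10%) are assumptions, and 3072 is the default output dimension of OpenAI's text-embedding-3-large. The function names are hypothetical.

```python
import math

def chunks_per_doc(doc_tokens: int, chunk_size: int = 512, overlap: int = 51) -> int:
    """Number of sliding-window chunks needed to cover one document."""
    if doc_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap  # tokens advanced per window
    return 1 + math.ceil((doc_tokens - chunk_size) / stride)

def raw_index_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw float32 vector storage, excluding index overhead and metadata."""
    return num_vectors * dims * bytes_per_value

NUM_DOCS = 10_000_000
AVG_DOC_TOKENS = 1_000  # assumption: real corpora vary widely
DIMS = 3072             # default dimension of text-embedding-3-large

per_doc = chunks_per_doc(AVG_DOC_TOKENS)
total_vectors = NUM_DOCS * per_doc
size_gb = raw_index_bytes(total_vectors, DIMS) / 1e9

print(f"{per_doc} chunks/doc, {total_vectors:,} vectors, {size_gb:.1f} GB raw")
# prints "3 chunks/doc, 30,000,000 vectors, 368.6 GB raw"
```

Under these assumptions the index holds about three times as many vectors as there are documents, so capacity planning should be done in vectors, not documents. Lower-dimensional embeddings or quantization would shrink the raw footprint proportionally.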

✅ Solution

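**Pipeline:**

1. **Ingestion:** Chunk documents (512 tokens, 10% overlap). Embed with text-embedding-3-large. Store the vectors in Pinecone or pgvector, keeping the source document ID and URL as metadata on every chunk so answers can cite them.
2. **Query reformulation:** Use an LLM to rewrite each follow-up question into a standalone query, conditioning on the recent turns or a running summary of the chat history. HyDE (embedding a hypothetical answer instead of the raw query) can optionally be layered on top to improve recall.
3. **Retrieval:** Hybrid search: dense (vector) + sparse (BM25). Merge the two result lists, then re-rank the merged candidates with a cross-encoder.
4. **Generation:** GPT-4o with the retrieved chunks as context. The prompt requires every claim to cite the chunk it came from.
5. **Updates:** Incremental upsert on document change events via webhook, deleting the stale chunks of changed or removed documents. This avoids a full re-index for the daily updates.

**Trade-offs:**

- pgvector is cheaper; Pinecone scales better at 10M+ vectors. After chunking, 10M documents produce at least 10M vectors and likely several times that.
- Cross-encoder re-ranking adds ~300ms of latency, so apply it only to a short candidate list to stay within the 3s budget.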
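Steps 1, 3, and 4 can be sketched in plain Python. This is a minimal illustration, not the solution's implementation. The solution does not say how the dense and sparse results are merged; reciprocal rank fusion (RRF) is used here as one common choice, with the conventional constant `k=60`. All function names, chunk IDs, and the prompt wording are hypothetical. Embedding, BM25 search, and the cross-encoder are left out.

```python
from collections import defaultdict

def chunk_tokens(tokens, chunk_size=512, overlap=51):
    """Split a token list into overlapping windows (51/512 is about 10%)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not tokens:
        return []
    stride = chunk_size - overlap
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window reached the end of the document
        start += stride
    return chunks

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists: each list adds 1 / (k + rank) per item."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

def build_prompt(question, chunks):
    """Build a citation-enforcing prompt from (source_id, text) pairs."""
    context = "\n".join(
        f"[{i}] ({source_id}) {text}"
        for i, (source_id, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the numbered sources below. "
        "Cite each claim with its source number, e.g. [1]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

# Hypothetical result lists from the dense and sparse retrievers.
dense = ["kb-12#0", "kb-7#2", "tkt-99#1"]
sparse = ["kb-7#2", "tkt-41#0", "kb-12#0"]
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
# prints ['kb-7#2', 'kb-12#0', 'tkt-41#0', 'tkt-99#1']

prompt = build_prompt(
    "How do I rotate an API key?",
    [("kb-7#2", "Keys can be rotated under Settings."),
     ("kb-12#0", "Old keys stay valid for 24 hours.")],
)
```

RRF uses only rank positions, so the BM25 and cosine-similarity scores never need to be normalized against each other. In a full pipeline the fused list would be truncated before the cross-encoder sees it, and the source IDs in the prompt would map back to article URLs for the citations.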