Building a RAG Pipeline From Scratch in 2026 (Production-Ready)
April 4, 2026 · AI for Developers, RAG, LLMs
Retrieval-Augmented Generation (RAG) is the practical way to ship LLM features that stay accurate and up to date. In 2026, the baseline expectation is a pipeline that can ingest documents continuously, retrieve fast, and generate answers with citations in under a second for common queries. This guide walks you through building a production-grade RAG pipeline from scratch with concrete code and numbers you can implement today.
What a RAG pipeline really is
A RAG system is a data pipeline plus a query-time retrieval and generation flow. The core stages are:
- Ingestion: parse files, clean text, split into chunks, attach metadata
- Embedding: convert chunks into vectors (e.g., 384–1,536 dimensions)
- Indexing: store vectors in a vector database with metadata filters
- Retrieval: find top-k relevant chunks for a query
- Generation: provide chunks to the LLM and compose a response
- Evaluation & monitoring: track accuracy, latency, hallucinations
Think of RAG as a search engine feeding an LLM. If your retrieval is weak, generation will fail. Your biggest ROI comes from document quality, chunking, and retrieval strategy—not from prompt tweaks.
Architecture choices (and why they matter)
Make these decisions early to avoid rework:
- Chunk size: 400–800 tokens is a good default; use overlap (e.g., 80–120 tokens).
- Embedding model: pick a model with stable performance and good multilingual support if needed.
- Vector DB: start with a local option (SQLite + HNSW or embedded DB), then scale to a managed service.
- Metadata filters: store doc_type, source, created_at, tenant_id to support precise filtering.
- Retrieval strategy: hybrid (semantic + keyword) yields better recall than pure vector search.
In 2026, a typical production stack looks like: ingestion in Python/Node, embeddings via a hosted API or local model, vector search in Qdrant/Chroma/Weaviate, and generation with a high-quality model that supports tool usage or structured output.
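These decisions are worth pinning down in one place so they don't drift across jobs. A minimal config sketch (the class and field names are suggestions; the values mirror the defaults used in this guide):

```python
from dataclasses import dataclass

@dataclass
class RagConfig:
    """Pipeline-wide settings; changing most of these means re-indexing."""
    chunk_size: int = 600          # tokens per chunk
    chunk_overlap: int = 100       # tokens shared between neighboring chunks
    embedding_model: str = "text-embedding-3-large"
    top_k: int = 8                 # retrieval candidates per query
    metadata_fields: tuple = ("doc_type", "source", "created_at", "tenant_id")
```

Passing one config object through ingestion, indexing, and retrieval keeps the chunker and the index guaranteed to agree.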
Step 1: Ingestion and normalization
Ingestion is the most under-engineered part of RAG. Garbage-in, hallucinations-out. Normalize early:
- Remove boilerplate (headers, footers, nav)
- Preserve structural hints (headings, bullets)
- Capture metadata (title, URL, timestamps)
If you’re parsing JSON data sources (APIs, logs), validate and format the output with a tool like the JSON Formatter to catch malformed fields before embedding.
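As a minimal sketch of that validation pass (the `text` and `source` field names are hypothetical; adapt them to your schema), you can drop malformed records before they ever reach the embedding step:

```python
import json

def validate_records(raw_lines):
    """Parse JSON lines, keeping only records with the fields we need."""
    valid, rejected = [], []
    for line in raw_lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        # "text" and "source" are example field names, not a standard
        if isinstance(rec.get("text"), str) and rec.get("source"):
            valid.append(rec)
        else:
            rejected.append(line)
    return valid, rejected
```

Logging the rejected lines gives you a cheap early-warning signal when an upstream API changes its schema.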
Python ingestion example
import hashlib
from bs4 import BeautifulSoup

def normalize_html(html, source_url):
    soup = BeautifulSoup(html, "html.parser")
    # Strip boilerplate elements before extracting text
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = "\n".join(
        t.get_text(" ", strip=True)
        for t in soup.find_all(["h1", "h2", "h3", "p", "li"])
    )
    # Content-addressed ID: the same source + text always maps to the same doc
    doc_id = hashlib.sha256((source_url + text).encode()).hexdigest()
    return {
        "id": doc_id,
        "text": text,
        "metadata": {"source": source_url},
    }
Node.js ingestion example
import crypto from "crypto";
import { JSDOM } from "jsdom";

export function normalizeHtml(html, sourceUrl) {
  const dom = new JSDOM(html);
  const document = dom.window.document;
  // Strip boilerplate elements before extracting text
  document.querySelectorAll("script,style,nav,footer").forEach(n => n.remove());
  const blocks = [...document.querySelectorAll("h1,h2,h3,p,li")]
    .map(n => n.textContent.trim())
    .filter(Boolean);
  const text = blocks.join("\n");
  const id = crypto.createHash("sha256").update(sourceUrl + text).digest("hex");
  return { id, text, metadata: { source: sourceUrl } };
}
Step 2: Chunking strategy (the highest leverage step)
Chunking determines recall. Overly large chunks cause irrelevant context. Overly small chunks lose meaning. A proven strategy:
- Chunk size: 600 tokens
- Overlap: 100 tokens
- Respect headings: split at h2/h3 when possible
Store chunk boundaries in metadata so you can show partial quotes later.
Chunking algorithm (language-agnostic)
function chunkText(tokens, size = 600, overlap = 100) {
  const chunks = [];
  let i = 0;
  while (i < tokens.length) {
    chunks.push(tokens.slice(i, i + size));
    i += size - overlap; // step forward, keeping `overlap` tokens of shared context
  }
  return chunks;
}
Step 3: Embedding generation
Embeddings convert text into vectors. The two key constraints are latency and cost. If you expect heavy traffic, use a local embedding model or batch API calls. Always cache embeddings by content hash so you don’t pay twice.
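The hash-based cache mentioned above can be sketched as a thin wrapper; the on-disk layout here (one JSON file per hash in an `embedding_cache/` directory) is an illustrative choice, not a standard:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("embedding_cache")  # hypothetical location; use your own store
CACHE_DIR.mkdir(exist_ok=True)

def cached_embed(texts, embed_fn):
    """Call embed_fn only for texts whose SHA-256 hash is not yet cached."""
    keys = [hashlib.sha256(t.encode()).hexdigest() for t in texts]
    results = {}
    missing = []
    for key, text in zip(keys, texts):
        path = CACHE_DIR / f"{key}.json"
        if path.exists():
            results[key] = json.loads(path.read_text())
        else:
            missing.append((key, text))
    if missing:
        # One batched call for all cache misses
        vectors = embed_fn([t for _, t in missing])
        for (key, _), vec in zip(missing, vectors):
            (CACHE_DIR / f"{key}.json").write_text(json.dumps(vec))
            results[key] = vec
    return [results[k] for k in keys]
```

In production you would typically swap the file store for Redis or a database table, but the hash-then-batch structure stays the same.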
Python embedding example
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
    )
    return [e.embedding for e in resp.data]
JavaScript embedding example
import OpenAI from "openai";

const client = new OpenAI();

export async function embed(texts) {
  const resp = await client.embeddings.create({
    model: "text-embedding-3-large",
    input: texts
  });
  return resp.data.map(d => d.embedding);
}
Step 4: Vector indexing
Most vector databases support metadata filters. Use them. They allow precise scoping (e.g., only docs from a single tenant or date range). Also consider hybrid search: combine vector similarity with keyword matches for higher recall.
Qdrant upsert example (Python)
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=chunk_id, vector=vector, payload=metadata)
        for chunk_id, vector, metadata in chunk_records
    ],
)
Step 5: Retrieval and reranking
Basic vector search returns top-k candidates. A reranker (cross-encoder) can reorder results for better relevance. If you don’t want a second model, do a simple score threshold and cut off weak results.
Retrieval example (pseudo)
results = vector_search(query_embedding, k=8, filter={"tenant_id": "acme"})
reranked = rerank(query, results)
context = join_top_chunks(reranked, max_tokens=1200)
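If you skip the reranker, the score-threshold cut mentioned above can be a few lines; the 0.35 threshold here is an arbitrary starting point to tune against your own relevance judgments:

```python
def cutoff(results, min_score=0.35, max_keep=5):
    """Keep only hits above a similarity threshold, capped at max_keep.

    Assumes results are ordered best-first and each hit carries a "score".
    """
    kept = [r for r in results if r["score"] >= min_score]
    return kept[:max_keep]
```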
For keyword signals, consider a hybrid query:
- Vector search for semantic similarity
- BM25 or keyword search for exact matches
- Merge and dedupe results
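One common way to do the merge-and-dedupe step is reciprocal rank fusion (RRF). A minimal sketch, assuming each input is a list of document IDs already ordered by relevance:

```python
def rrf_merge(vector_hits, keyword_hits, k=60):
    """Merge two ranked ID lists with reciprocal rank fusion.

    Each appearance contributes 1 / (k + rank); duplicates accumulate
    score, so documents found by both searches rise to the top.
    """
    scores = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score normalization between the two systems, which is exactly why it works well for mixing vector and BM25 results.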
Step 6: Prompting and grounding
A good RAG prompt is short, grounded, and explicit. Your prompt should instruct the model to cite the retrieved chunks and refuse if the answer is not supported by context.
Prompt template (compact and strict)
System: You are an expert assistant. Answer only using the provided context.
User: Question: {question}
Context:
{context}
Instructions:
- Cite sources with [1], [2] in-line
- If not in context, say “I don’t know based on the provided sources.”
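Assembling that template in code is mostly string formatting; a sketch that numbers each chunk so the model can cite [1], [2] (the chunk dict shape is an assumption from the ingestion examples):

```python
def build_prompt(question, chunks):
    """Format retrieved chunks with [n] markers for inline citation."""
    context = "\n\n".join(
        f"[{i}] {chunk['text']} (source: {chunk['metadata']['source']})"
        for i, chunk in enumerate(chunks, start=1)
    )
    system = "You are an expert assistant. Answer only using the provided context."
    user = (
        f"Question: {question}\n\nContext:\n{context}\n\n"
        "Instructions:\n"
        "- Cite sources with [1], [2] in-line\n"
        "- If not in context, say \"I don't know based on the provided sources.\""
    )
    return system, user
```

Keeping the numbering in code (rather than asking the model to invent it) makes citations trivially mappable back to chunk IDs when you render the answer.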
Step 7: Evaluation and observability
RAG systems fail silently unless you measure them. Track:
- Answer accuracy: rate by human review or QA tests
- Context recall: did retrieval include the right chunk?
- Latency: p95 < 1.5s for interactive apps
- Cost per query: embeddings + LLM calls
Store traces with query, retrieved chunks, and final answer. A JSON logging format is easy to review and you can validate it quickly with the JSON Formatter.
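A trace can be as simple as one JSON line per query; a sketch of the shape (the field names are suggestions, not a standard):

```python
import json
import time

def log_trace(query, retrieved, answer, latency_ms, path="rag_traces.jsonl"):
    """Append one JSON line per query so traces are easy to grep and replay."""
    record = {
        "ts": time.time(),
        "query": query,
        "retrieved": [c["id"] for c in retrieved],  # chunk IDs only, keeps logs small
        "answer": answer,
        "latency_ms": latency_ms,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

JSONL traces can be replayed through a new retrieval configuration offline, which is the cheapest regression test a RAG system gets.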
Common pitfalls (and how to avoid them)
- Over-chunking: tiny chunks lose context; stick to 500–800 tokens.
- No metadata filters: cross-tenant leakage is a security risk.
- Embedding everything: filter out nav, ads, and boilerplate.
- Ignoring cache: hashing content saves significant cost.
- No evaluation loop: you can’t improve what you can’t measure.
Practical data hygiene tips
Most RAG bugs come from messy data. A few tactics that help:
- Stable IDs: use UUIDs for docs or chunks (create them with the UUID Generator).
- Normalize URLs: encode special characters using the URL Encoder/Decoder before hashing.
- Regex cleanup: remove tracking parameters or repeated whitespace using the Regex Tester to validate patterns.
- Binary payloads: if you must store small binaries in JSON, encode safely with the Base64 Encoder/Decoder.
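For the URL normalization tactic above, a small sketch that strips common tracking parameters and fragments before hashing (the parameter list is illustrative, not exhaustive):

```python
import hashlib
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def canonical_url(url):
    """Drop tracking query params and fragments so equal pages hash equally."""
    parts = urlparse(url)
    query = urlencode(
        [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    )
    return urlunparse((parts.scheme, parts.netloc, parts.path, parts.params, query, ""))

def url_id(url):
    return hashlib.sha256(canonical_url(url).encode()).hexdigest()
```

Without this step, the same article shared via two campaigns gets two different doc IDs and ends up embedded twice.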
Reference implementation outline
Here is a minimal, production-ready RAG pipeline flow you can build in a week:
- Ingestion: nightly job pulls HTML/Markdown/PDFs → normalized text
- Chunking: 600-token chunks with 100 overlap
- Embeddings: batch size 64, cached by SHA-256 hash
- Index: Qdrant or Chroma with metadata filters
- Retrieval: vector top-8 + rerank + cutoff threshold
- Generation: strict prompt with citations
- Observability: JSON logs of query, context, answer, latency
When to go beyond basic RAG
If you’re seeing missing answers, consider these upgrades:
- Hybrid search: combine BM25 + vector results
- Query rewriting: expand queries into multiple subqueries
- Contextual reranking: cross-encoder rerankers boost precision
- Knowledge graph layer: for highly structured data
- Feedback loops: integrate user thumbs-up/down into retraining
Final checklist (ship-ready)
- Chunking tested with real documents
- Embeddings cached and deduped
- Metadata filters enforced per tenant
- Retrieval p95 < 300ms
- Full request trace stored as JSON
- Hallucination fallbacks enabled
Build the boring parts right and RAG becomes one of the most reliable LLM features you can ship. The difference between a demo and a production system is almost always data hygiene, retrieval quality, and observability.
FAQ
What is the fastest way to build a RAG pipeline? The fastest way is to use an off-the-shelf vector DB, a hosted embedding API, and a strict prompt template; you can ship an MVP in 2–3 days with clean data.
What chunk size should I use for RAG in 2026? A 600-token chunk size with 100-token overlap is a reliable default for most developer documentation and knowledge bases.
Do I need hybrid search for RAG? You need hybrid search if exact matches matter or users ask for precise identifiers, because keyword signals improve recall over pure vector search.
How do I prevent hallucinations in RAG? You prevent hallucinations by using strict prompts, high-precision retrieval, and refusing to answer when context is missing.
What latency should a production RAG system target? A production RAG system should target p95 end-to-end latency under 1.5 seconds for interactive use cases.