Building a RAG Pipeline From Scratch in 2026 (Production-Ready)
April 4, 2026 · AI for Developers, RAG, LLMs
Retrieval-Augmented Generation (RAG) is the practical way to ship LLM features that stay accurate and up to date. In 2026, the baseline expectation is a pipeline that can ingest documents continuously, retrieve fast, and generate answers with citations in under a second for common queries. This guide walks you through building a production-grade RAG pipeline from scratch with concrete code and numbers you can implement today.
What a RAG pipeline really is
A RAG system is a data pipeline plus a query-time retrieval and generation flow. The core stages are:
- Ingestion: parse files, clean text, split into chunks, attach metadata
- Embedding: convert chunks into vectors (e.g., 384–1,536 dimensions)
- Indexing: store vectors in a vector database with metadata filters
- Retrieval: find top-k relevant chunks for a query
- Generation: provide chunks to the LLM and compose a response
- Evaluation & monitoring: track accuracy, latency, hallucinations
Think of RAG as a search engine feeding an LLM. If your retrieval is weak, generation will fail. Your biggest ROI comes from document quality, chunking, and retrieval strategy—not from prompt tweaks.
Architecture choices (and why they matter)
Make these decisions early to avoid rework:
- Chunk size: 400–800 tokens is a good default; use overlap (e.g., 80–120 tokens).
- Embedding model: pick a model with stable performance and good multilingual support if needed.
- Vector DB: start with a local option (SQLite + HNSW or embedded DB), then scale to a managed service.
- Metadata filters: store doc_type, source, created_at, tenant_id to support precise filtering.
- Retrieval strategy: hybrid (semantic + keyword) yields better recall than pure vector search.
In 2026, a typical production stack looks like: ingestion in Python/Node, embeddings via a hosted API or local model, vector search in Qdrant/Chroma/Weaviate, and generation with a high-quality model that supports tool usage or structured output.
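These decisions are worth pinning down in one place so they don't drift across jobs. A minimal config sketch (the class and field names are suggestions; the values mirror the defaults used in this guide):

```python
from dataclasses import dataclass

@dataclass
class RagConfig:
    """Pipeline-wide settings; changing most of these means re-indexing."""
    chunk_size: int = 600          # tokens per chunk
    chunk_overlap: int = 100       # tokens shared between neighboring chunks
    embedding_model: str = "text-embedding-3-large"
    top_k: int = 8                 # retrieval candidates per query
    metadata_fields: tuple = ("doc_type", "source", "created_at", "tenant_id")
```

Passing one config object through ingestion, indexing, and retrieval keeps the chunker and the index guaranteed to agree.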
Step 1: Ingestion and normalization
Ingestion is the most under-engineered part of RAG. Garbage-in, hallucinations-out. Normalize early:
- Remove boilerplate (headers, footers, nav)
- Preserve structural hints (headings, bullets)
- Capture metadata (title, URL, timestamps)
If you’re parsing JSON data sources (APIs, logs), validate and format the output with a tool like the JSON Formatter to catch malformed fields before embedding.
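As a minimal sketch of that validation pass (the `text` and `source` field names are hypothetical; adapt them to your schema), you can drop malformed records before they ever reach the embedding step:

```python
import json

def validate_records(raw_lines):
    """Parse JSON lines, keeping only records with the fields we need."""
    valid, rejected = [], []
    for line in raw_lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        # "text" and "source" are example field names, not a standard
        if isinstance(rec.get("text"), str) and rec.get("source"):
            valid.append(rec)
        else:
            rejected.append(line)
    return valid, rejected
```

Logging the rejected lines gives you a cheap early-warning signal when an upstream API changes its schema.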
Python ingestion example
import hashlib
from bs4 import BeautifulSoup

def normalize_html(html, source_url):
    soup = BeautifulSoup(html, "html.parser")
    # Strip boilerplate elements before extracting text
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = "\n".join(
        t.get_text(" ", strip=True)
        for t in soup.find_all(["h1", "h2", "h3", "p", "li"])
    )
    # Content-addressed ID: the same source + text always maps to the same doc
    doc_id = hashlib.sha256((source_url + text).encode()).hexdigest()
    return {
        "id": doc_id,
        "text": text,
        "metadata": {"source": source_url},
    }
Node.js ingestion example
import crypto from "crypto";
import { JSDOM } from "jsdom";

export function normalizeHtml(html, sourceUrl) {
  const dom = new JSDOM(html);
  const document = dom.window.document;
  // Strip boilerplate elements before extracting text
  document.querySelectorAll("script,style,nav,footer").forEach(n => n.remove());
  const blocks = [...document.querySelectorAll("h1,h2,h3,p,li")]
    .map(n => n.textContent.trim())
    .filter(Boolean);
  const text = blocks.join("\n");
  const id = crypto.createHash("sha256").update(sourceUrl + text).digest("hex");
  return { id, text, metadata: { source: sourceUrl } };
}
Step 2: Chunking strategy (the highest leverage step)
Chunking determines recall. Overly large chunks cause irrelevant context. Overly small chunks lose meaning. A proven strategy:
- Chunk size: 600 tokens
- Overlap: 100 tokens
- Respect headings: split at h2/h3 when possible
Store chunk boundaries in metadata so you can show partial quotes later.
Chunking algorithm (language-agnostic)
function chunkText(tokens, size = 600, overlap = 100) {
  const chunks = [];
  let i = 0;
  while (i < tokens.length) {
    chunks.push(tokens.slice(i, i + size));
    i += size - overlap; // step forward, keeping `overlap` tokens of shared context
  }
  return chunks;
}
Step 3: Embedding generation
Embeddings convert text into vectors. The two key constraints are latency and cost. If you expect heavy traffic, use a local embedding model or batch API calls. Always cache embeddings by content hash so you don’t pay twice.
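The hash-based cache mentioned above can be sketched as a thin wrapper; the on-disk layout here (one JSON file per hash in an `embedding_cache/` directory) is an illustrative choice, not a standard:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("embedding_cache")  # hypothetical location; use your own store
CACHE_DIR.mkdir(exist_ok=True)

def cached_embed(texts, embed_fn):
    """Call embed_fn only for texts whose SHA-256 hash is not yet cached."""
    keys = [hashlib.sha256(t.encode()).hexdigest() for t in texts]
    results = {}
    missing = []
    for key, text in zip(keys, texts):
        path = CACHE_DIR / f"{key}.json"
        if path.exists():
            results[key] = json.loads(path.read_text())
        else:
            missing.append((key, text))
    if missing:
        # One batched call for all cache misses
        vectors = embed_fn([t for _, t in missing])
        for (key, _), vec in zip(missing, vectors):
            (CACHE_DIR / f"{key}.json").write_text(json.dumps(vec))
            results[key] = vec
    return [results[k] for k in keys]
```

In production you would typically swap the file store for Redis or a database table, but the hash-then-batch structure stays the same.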
Python embedding example
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
    )
    return [e.embedding for e in resp.data]
JavaScript embedding example
import OpenAI from "openai";

const client = new OpenAI();

export async function embed(texts) {
  const resp = await client.embeddings.create({
    model: "text-embedding-3-large",
    input: texts
  });
  return resp.data.map(d => d.embedding);
}
Step 4: Vector indexing
Most vector databases support metadata filters. Use them. They allow precise scoping (e.g., only docs from a single tenant or date range). Also consider hybrid search: combine vector similarity with keyword matches for higher recall.
Qdrant upsert example (Python)
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=chunk_id, vector=vector, payload=metadata)
        for chunk_id, vector, metadata in chunk_records
    ],
)
Step 5: Retrieval and reranking
Basic vector search returns top-k candidates. A reranker (cross-encoder) can reorder results for better relevance. If you don’t want a second model, do a simple score threshold and cut off weak results.
Retrieval example (pseudo)
results = vector_search(query_embedding, k=8, filter={"tenant_id": "acme"})
reranked = rerank(query, results)
context = join_top_chunks(reranked, max_tokens=1200)
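If you skip the reranker, the score-threshold cut mentioned above can be a few lines; the 0.35 threshold here is an arbitrary starting point to tune against your own relevance judgments:

```python
def cutoff(results, min_score=0.35, max_keep=5):
    """Keep only hits above a similarity threshold, capped at max_keep.

    Assumes results are ordered best-first and each hit carries a "score".
    """
    kept = [r for r in results if r["score"] >= min_score]
    return kept[:max_keep]
```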
For keyword signals, consider a hybrid query:
- Vector search for semantic similarity
- BM25 or keyword search for exact matches
- Merge and dedupe results
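One common way to do the merge-and-dedupe step is reciprocal rank fusion (RRF). A minimal sketch, assuming each input is a list of document IDs already ordered by relevance:

```python
def rrf_merge(vector_hits, keyword_hits, k=60):
    """Merge two ranked ID lists with reciprocal rank fusion.

    Each appearance contributes 1 / (k + rank); duplicates accumulate
    score, so documents found by both searches rise to the top.
    """
    scores = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score normalization between the two systems, which is exactly why it works well for mixing vector and BM25 results.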
Step 6: Prompting and grounding
A good RAG prompt is short, grounded, and explicit. Your prompt should instruct the model to cite the retrieved chunks and refuse if the answer is not supported by context.
Prompt template (compact and strict)
System: You are an expert assistant. Answer only using the provided context.
User: Question: {question}
Context:
{context}
Instructions:
- Cite sources with [1], [2] in-line
- If not in context, say “I don’t know based on the provided sources.”
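Assembling that template in code is mostly string formatting; a sketch that numbers each chunk so the model can cite [1], [2] (the chunk dict shape is an assumption from the ingestion examples):

```python
def build_prompt(question, chunks):
    """Format retrieved chunks with [n] markers for inline citation."""
    context = "\n\n".join(
        f"[{i}] {chunk['text']} (source: {chunk['metadata']['source']})"
        for i, chunk in enumerate(chunks, start=1)
    )
    system = "You are an expert assistant. Answer only using the provided context."
    user = (
        f"Question: {question}\n\nContext:\n{context}\n\n"
        "Instructions:\n"
        "- Cite sources with [1], [2] in-line\n"
        "- If not in context, say \"I don't know based on the provided sources.\""
    )
    return system, user
```

Keeping the numbering in code (rather than asking the model to invent it) makes citations trivially mappable back to chunk IDs when you render the answer.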
Step 7: Evaluation and observability
RAG systems fail silently unless you measure them. Track:
- Answer accuracy: rate by human review or QA tests
- Context recall: did retrieval include the right chunk?
- Latency: p95 < 1.5s for interactive apps
- Cost per query: embeddings + LLM calls
Store traces with query, retrieved chunks, and final answer. A JSON logging format is easy to review and you can validate it quickly with the JSON Formatter.
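A trace can be as simple as one JSON line per query; a sketch of the shape (the field names are suggestions, not a standard):

```python
import json
import time

def log_trace(query, retrieved, answer, latency_ms, path="rag_traces.jsonl"):
    """Append one JSON line per query so traces are easy to grep and replay."""
    record = {
        "ts": time.time(),
        "query": query,
        "retrieved": [c["id"] for c in retrieved],  # chunk IDs only, keeps logs small
        "answer": answer,
        "latency_ms": latency_ms,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

JSONL traces can be replayed through a new retrieval configuration offline, which is the cheapest regression test a RAG system gets.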
Common pitfalls (and how to avoid them)
- Over-chunking: tiny chunks lose context; stick to 500–800 tokens.
- No metadata filters: cross-tenant leakage is a security risk.
- Embedding everything: filter out nav, ads, and boilerplate.
- Ignoring cache: hashing content saves significant cost.
- No evaluation loop: you can’t improve what you can’t measure.
Practical data hygiene tips
Most RAG bugs come from messy data. A few tactics that help:
- Stable IDs: use UUIDs for docs or chunks (create them with the UUID Generator).
- Normalize URLs: encode special characters using the URL Encoder/Decoder before hashing.
- Regex cleanup: remove tracking parameters or repeated whitespace using the Regex Tester to validate patterns.
- Binary payloads: if you must store small binaries in JSON, encode safely with the Base64 Encoder/Decoder.
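For the URL normalization tactic above, a small sketch that strips common tracking parameters and fragments before hashing (the parameter list is illustrative, not exhaustive):

```python
import hashlib
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def canonical_url(url):
    """Drop tracking query params and fragments so equal pages hash equally."""
    parts = urlparse(url)
    query = urlencode(
        [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    )
    return urlunparse((parts.scheme, parts.netloc, parts.path, parts.params, query, ""))

def url_id(url):
    return hashlib.sha256(canonical_url(url).encode()).hexdigest()
```

Without this step, the same article shared via two campaigns gets two different doc IDs and ends up embedded twice.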
Reference implementation outline
Here is a minimal, production-ready RAG pipeline flow you can build in a week:
- Ingestion: nightly job pulls HTML/Markdown/PDFs → normalized text
- Chunking: 600-token chunks with 100 overlap
- Embeddings: batch size 64, cached by SHA-256 hash
- Index: Qdrant or Chroma with metadata filters
- Retrieval: vector top-8 + rerank + cutoff threshold
- Generation: strict prompt with citations
- Observability: JSON logs of query, context, answer, latency
When to go beyond basic RAG
If you’re seeing missing answers, consider these upgrades:
- Hybrid search: combine BM25 + vector results
- Query rewriting: expand queries into multiple subqueries
- Contextual reranking: cross-encoder rerankers boost precision
- Knowledge graph layer: for highly structured data
- Feedback loops: integrate user thumbs-up/down into retraining
Final checklist (ship-ready)
- Chunking tested with real documents
- Embeddings cached and deduped
- Metadata filters enforced per tenant
- Retrieval p95 < 300ms
- Full request trace stored as JSON
- Hallucination fallbacks enabled
Build the boring parts right and RAG becomes one of the most reliable LLM features you can ship. The difference between a demo and a production system is almost always data hygiene, retrieval quality, and observability.
FAQ
What is the fastest way to build a RAG pipeline? The fastest way is to use an off-the-shelf vector DB, a hosted embedding API, and a strict prompt template; you can ship an MVP in 2–3 days with clean data.
What chunk size should I use for RAG in 2026? A 600-token chunk size with 100-token overlap is a reliable default for most developer documentation and knowledge bases.
Do I need hybrid search for RAG? You need hybrid search if exact matches matter or users ask for precise identifiers, because keyword signals improve recall over pure vector search.
How do I prevent hallucinations in RAG? You prevent hallucinations by using strict prompts, high-precision retrieval, and refusing to answer when context is missing.
What latency should a production RAG system target? A production RAG system should target p95 end-to-end latency under 1.5 seconds for interactive use cases.