Building a RAG Pipeline From Scratch in 2026 (Production-Ready)

April 4, 2026 · AI for Developers, RAG, LLMs

Retrieval-Augmented Generation (RAG) is the practical way to ship LLM features that stay accurate and up to date. In 2026, the baseline expectation is a pipeline that can ingest documents continuously, retrieve fast, and generate answers with citations in under a second for common queries. This guide walks you through building a production-grade RAG pipeline from scratch with concrete code and numbers you can implement today.

What a RAG pipeline really is

A RAG system is a data pipeline plus a query-time retrieval and generation flow. The core stages are:

- Ingestion and normalization: clean source documents into plain text plus metadata
- Chunking: split text into retrieval-sized pieces
- Embedding: convert chunks into vectors
- Indexing: store vectors and metadata in a vector database
- Retrieval and reranking: find the most relevant chunks for a query
- Generation: answer with an LLM grounded in the retrieved context
- Evaluation: measure retrieval and answer quality continuously

Think of RAG as a search engine feeding an LLM. If your retrieval is weak, generation will fail. Your biggest ROI comes from document quality, chunking, and retrieval strategy—not from prompt tweaks.

Architecture choices (and why they matter)

Make these decisions early to avoid rework:

- Embedding model: hosted API vs. a local model (a cost and latency trade-off)
- Vector database: Qdrant, Chroma, or Weaviate, with metadata filtering support
- Chunking strategy: size and overlap, fixed before you index at scale
- Retrieval mode: pure vector search vs. hybrid (vector plus keyword)
- Generation model: one that supports tool usage or structured output

In 2026, a typical production stack looks like: ingestion in Python/Node, embeddings via a hosted API or local model, vector search in Qdrant/Chroma/Weaviate, and generation with a high-quality model that supports tool usage or structured output.

Step 1: Ingestion and normalization

Ingestion is the most under-engineered part of RAG. Garbage in, hallucinations out. Normalize early:

- Strip boilerplate markup (scripts, styles, navigation, footers)
- Extract only meaningful blocks (headings, paragraphs, list items)
- Attach source metadata (URL, timestamps) to every document
- Deduplicate by content hash so re-crawls don’t create duplicate chunks

If you’re parsing JSON data sources (APIs, logs), validate and format the output with a tool like the JSON Formatter to catch malformed fields before embedding.

Python ingestion example

import hashlib
from bs4 import BeautifulSoup

def normalize_html(html, source_url):
    soup = BeautifulSoup(html, "html.parser")
    # Drop boilerplate elements that add noise to embeddings
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    # Keep only meaningful text blocks
    text = "\n".join(
        t.get_text(" ", strip=True)
        for t in soup.find_all(["h1", "h2", "h3", "p", "li"])
    )
    # Content-addressed ID: re-ingesting unchanged content is a no-op
    doc_id = hashlib.sha256((source_url + text).encode()).hexdigest()
    return {
        "id": doc_id,
        "text": text,
        "metadata": {"source": source_url},
    }

Node.js ingestion example

import crypto from "node:crypto";
import { JSDOM } from "jsdom";

export function normalizeHtml(html, sourceUrl) {
  const dom = new JSDOM(html);
  const document = dom.window.document;
  // Drop boilerplate elements that add noise to embeddings
  document.querySelectorAll("script,style,nav,footer").forEach(n => n.remove());
  // Keep only meaningful text blocks
  const blocks = [...document.querySelectorAll("h1,h2,h3,p,li")]
    .map(n => n.textContent.trim())
    .filter(Boolean);
  const text = blocks.join("\n");
  // Content-addressed ID: re-ingesting unchanged content is a no-op
  const id = crypto.createHash("sha256").update(sourceUrl + text).digest("hex");
  return { id, text, metadata: { source: sourceUrl } };
}

Step 2: Chunking strategy (the highest leverage step)

Chunking determines recall. Overly large chunks pull irrelevant context into the prompt. Overly small chunks lose meaning. A proven strategy:

- Target roughly 600 tokens per chunk with a 100-token overlap
- Split on semantic boundaries (headings, paragraphs) before falling back to fixed windows
- Keep each chunk self-contained enough to be understood without its neighbors

Store chunk boundaries in metadata so you can show partial quotes later.

Chunking algorithm (language-agnostic)

// Sliding-window chunking over a token array. Start/end indices are
// kept so partial quotes can be reconstructed later.
function chunkText(tokens, size = 600, overlap = 100) {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks = [];
  let i = 0;
  while (i < tokens.length) {
    const end = Math.min(i + size, tokens.length);
    chunks.push({ tokens: tokens.slice(i, end), start: i, end });
    i += size - overlap;
  }
  return chunks;
}
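
Python chunking example

The same windowing in Python over real tokenizer output; a minimal sketch assuming tiktoken and the cl100k_base encoding (swap in your embedding model’s tokenizer):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_text(text, size=600, overlap=100):
    tokens = enc.encode(text)
    chunks = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + size]
        # Decode back to text; keep token offsets for partial quotes
        chunks.append({"text": enc.decode(window), "start": i, "end": i + len(window)})
        i += size - overlap
    return chunks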

Step 3: Embedding generation

Embeddings convert text into vectors. The two key constraints are latency and cost. If you expect heavy traffic, use a local embedding model or batch API calls. Always cache embeddings by content hash so you don’t pay twice (see the caching sketch after the examples below).

Python embedding example

from openai import OpenAI
client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts
    )
    return [e.embedding for e in resp.data]

JavaScript embedding example

import OpenAI from "openai";
const client = new OpenAI();

export async function embed(texts) {
  const resp = await client.embeddings.create({
    model: "text-embedding-3-large",
    input: texts
  });
  return resp.data.map(d => d.embedding);
}
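
Embedding cache example (Python)

A minimal content-hash cache, sketched with an in-memory dict (swap in Redis or SQLite for persistence); embed is the function defined above:

import hashlib

_cache = {}  # in-memory for illustration; use a persistent store in production

def embed_cached(texts):
    keys = [hashlib.sha256(t.encode()).hexdigest() for t in texts]
    missing = [(k, t) for k, t in zip(keys, texts) if k not in _cache]
    if missing:
        # Only pay the API for texts we have never embedded
        vectors = embed([t for _, t in missing])
        for (k, _), vec in zip(missing, vectors):
            _cache[k] = vec
    return [_cache[k] for k in keys]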

Step 4: Vector indexing

Most vector databases support metadata filters. Use them. They allow precise scoping (e.g., only docs from a single tenant or date range). Also consider hybrid search: combine vector similarity with keyword matches for higher recall.

Qdrant upsert example (Python)

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")

# The "docs" collection must already exist with a vector size matching
# your embedding model (3072 dimensions for text-embedding-3-large).
# Note: Qdrant point ids must be unsigned integers or UUIDs, so derive
# chunk_id accordingly rather than reusing a raw sha256 hex string.
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=chunk_id, vector=vector, payload=metadata)
        for chunk_id, vector, metadata in chunk_records
    ],
)
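
Filtered search example (Python)

Scoping a query with metadata filters, sketched with the same client; the tenant_id field is illustrative:

from qdrant_client.models import Filter, FieldCondition, MatchValue

hits = client.search(
    collection_name="docs",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="tenant_id", match=MatchValue(value="acme"))]
    ),
    limit=8,
)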

Step 5: Retrieval and reranking

Basic vector search returns top-k candidates. A reranker (cross-encoder) can reorder results for better relevance. If you don’t want a second model, do a simple score threshold and cut off weak results.

Retrieval example (pseudo)

results = vector_search(query_embedding, k=8, filter={"tenant_id": "acme"})
reranked = rerank(query, results)
context = join_top_chunks(reranked, max_tokens=1200)
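
Reranking example (Python)

A concrete reranker, sketched with sentence-transformers’ CrossEncoder; the ms-marco model is one common public choice, and the assumption is that each result carries its chunk text:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, results, min_score=0.0):
    # Score each (query, chunk) pair jointly; cross-encoders are more
    # accurate than comparing precomputed embeddings
    scores = reranker.predict([(query, r["text"]) for r in results])
    ranked = sorted(zip(results, scores), key=lambda p: p[1], reverse=True)
    # Cut off weak results instead of padding the context window
    return [r for r, s in ranked if s >= min_score]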

For keyword signals, consider a hybrid query that fuses vector and keyword rankings. A minimal sketch of reciprocal rank fusion, assuming you already run a keyword search (e.g., BM25) alongside the vector store:
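
def reciprocal_rank_fusion(result_lists, k=60):
    # Each list is ordered best-first and contains chunk ids.
    # RRF rewards chunks that rank well in any input list.
    scores = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([vector_result_ids, keyword_result_ids])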

Step 6: Prompting and grounding

A good RAG prompt is short, grounded, and explicit. Your prompt should instruct the model to cite the retrieved chunks and refuse if the answer is not supported by context.

Prompt template (compact and strict)

System: You are an expert assistant. Answer only using the provided context.
User: Question: {question}
Context:
{context}
Instructions:
- Cite sources with [1], [2] in-line
- If not in context, say “I don’t know based on the provided sources.”
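
Generation example (Python)

Wiring the template into a chat call, sketched with the OpenAI client from earlier; the model name is a placeholder for whichever generation model you standardized on:

from openai import OpenAI
client = OpenAI()

def answer(question, context, model="gpt-4o"):  # placeholder model name
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are an expert assistant. Answer only using the provided context."},
            {"role": "user", "content": (
                f"Question: {question}\nContext:\n{context}\n"
                "Instructions:\n- Cite sources with [1], [2] in-line\n"
                "- If not in context, say \"I don't know based on the provided sources.\""
            )},
        ],
    )
    return resp.choices[0].message.content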

Step 7: Evaluation and observability

RAG systems fail silently unless you measure them. Track:

- Retrieval quality: does the correct chunk appear in the top-k results?
- Groundedness: is every claim in the answer supported by a retrieved chunk?
- Refusal rate: how often the model says it doesn’t know
- Latency: p50/p95 end-to-end, split by retrieval vs. generation
- Cost per query: embedding, retrieval, and generation combined

Store traces with query, retrieved chunks, and final answer. A JSON logging format is easy to review and you can validate it quickly with the JSON Formatter.
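
Trace logging example (Python)

A minimal trace record as a JSON log line; field names are illustrative:

import json
import time

def log_trace(query, retrieved, answer, latency_ms):
    trace = {
        "ts": time.time(),
        "query": query,
        # Ids and scores are enough to replay a retrieval offline
        "retrieved": [{"id": r["id"], "score": r["score"]} for r in retrieved],
        "answer": answer,
        "latency_ms": latency_ms,
    }
    print(json.dumps(trace))  # route to your log pipeline of choice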

Common pitfalls (and how to avoid them)

- Indexing unnormalized HTML: strip boilerplate before chunking, not after
- Chunks that are too large or too small: start at 600/100 and tune against evals
- Skipping metadata filters: scope by tenant or date to avoid cross-tenant leakage
- Relying on pure vector search for exact identifiers: add keyword signals
- No evaluation loop: you can’t fix retrieval you don’t measure

Practical data hygiene tips

Most RAG bugs come from messy data. A few tactics that help:

- Deduplicate documents by content hash before chunking
- Validate JSON sources before embedding (malformed fields poison chunks)
- Strip navigation, footers, and other boilerplate at ingestion time
- Re-embed only when the content hash changes, not on every crawl

Reference implementation outline

Here is a minimal, production-ready RAG pipeline flow you can build in a week:

1. Crawl or export sources, normalize to text plus metadata, dedupe by hash
2. Chunk at ~600 tokens with 100-token overlap, keeping boundaries in metadata
3. Embed with a hosted API or local model, cached by content hash
4. Upsert vectors and payloads into Qdrant (or Chroma/Weaviate)
5. Retrieve top-k with metadata filters, rerank, apply a score threshold
6. Generate with a strict, citation-forcing prompt
7. Log traces and review retrieval quality regularly

When to go beyond basic RAG

If you’re seeing missing answers, consider these upgrades:

- Hybrid search: add keyword signals when exact identifiers matter
- Cross-encoder reranking: reorder a larger candidate set for precision
- Query rewriting: expand or reformulate vague questions before retrieval
- Tighter chunking: re-chunk problem sources along semantic boundaries

Final checklist (ship-ready)

- Ingestion normalizes, dedupes, and attaches source metadata
- Chunks are ~600 tokens with overlap, boundaries stored
- Embeddings are cached by content hash
- Index supports metadata filters (tenant, date)
- Retrieval uses reranking or a score threshold
- Prompt forces citations and allows refusal
- Traces are logged; p95 latency is under 1.5 seconds

Build the boring parts right and RAG becomes one of the most reliable LLM features you can ship. The difference between a demo and a production system is almost always data hygiene, retrieval quality, and observability.

FAQ

What is the fastest way to build a RAG pipeline? The fastest way is to use an off-the-shelf vector DB, a hosted embedding API, and a strict prompt template; you can ship an MVP in 2–3 days with clean data.

What chunk size should I use for RAG in 2026? A 600-token chunk size with 100-token overlap is a reliable default for most developer documentation and knowledge bases.

Do I need hybrid search for RAG? You need hybrid search if exact matches matter or users ask for precise identifiers, because keyword signals improve recall over pure vector search.

How do I prevent hallucinations in RAG? You prevent hallucinations by using strict prompts, high-precision retrieval, and refusing to answer when context is missing.

What latency should a production RAG system target? A production RAG system should target p95 end-to-end latency under 1.5 seconds for interactive use cases.
