Embedding Models for Semantic Search in Your App (2026 Guide)

March 14, 2026 · AI for Developers, Semantic Search, Vector Databases

Semantic search has moved from “nice to have” to core product feature. Whether you’re building a docs site, a support portal, or in-app search for user-generated content, embeddings let you retrieve meaning instead of keywords. This guide shows how to build semantic search in a real app in 2026: choosing an embedding model, chunking data, storing vectors, and building fast retrieval with code examples.

What semantic search actually does

Traditional search relies on keyword matching. Semantic search turns text into vectors (embeddings) so you can find meaning—even when the query uses different words. “Reset password” can match “forgot my login,” and “refund policy” can match “return money within 30 days.”

At a high level, the pipeline looks like this:

1. Split documents into chunks.
2. Embed each chunk and store the vector alongside its metadata.
3. At query time, embed the user's query with the same model.
4. Retrieve the nearest chunks by vector similarity.
5. Rank, filter, and display the results (or pass them to an LLM).

Choosing an embedding model in 2026

There’s no single “best” embedding model. Your choice depends on cost, latency, hosting, and accuracy. In 2026, most teams choose one of three categories:

- Hosted embedding APIs: fastest to ship; you pay per token and accept network latency.
- Self-hosted open-weight models: more operational work, but tighter latency and cost control at scale.
- Fine-tuned or domain-specific models: the most accurate on specialized corpora, and the biggest investment.

Concrete recommendation: Start with a high-quality hosted embedding for MVP, then evaluate self-hosted models once you exceed ~10M embeddings or need tighter latency control.
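To see where a threshold like ~10M embeddings comes from, a quick back-of-envelope on raw vector storage helps. This sketch assumes float32 vectors and ignores index and metadata overhead:

```python
# Back-of-envelope storage for raw float32 vectors (excludes index overhead).
def vector_storage_gb(num_vectors, dims, bytes_per_dim=4):
    return num_vectors * dims * bytes_per_dim / 1e9

# 10M chunks at 1,536 dimensions:
print(round(vector_storage_gb(10_000_000, 1536), 1))  # 61.4 (GB)
```

Index structures and metadata add meaningful overhead on top of this, so treat the number as a floor, not an estimate of total footprint.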

Key model selection criteria

- Retrieval accuracy on your own content, not just benchmark scores.
- Embedding dimensionality (768–1,536 covers most apps) and its storage and latency cost.
- Price per million tokens, or hosting cost if you run the model yourself.
- Latency and rate limits at your expected ingestion and query volume.
- Maximum input length relative to your chunk size.

Chunking strategy that actually works

Embedding entire documents is a common mistake. Queries are short, and big documents average out meaning. Chunking improves relevance.

A practical baseline:

- Chunk at roughly 300–600 tokens (about 500 words in the Node.js example below).
- Split on paragraph or section boundaries where possible.
- Overlap adjacent chunks by 10–15% so answers that straddle a boundary aren't lost.
- Keep chunks self-contained: prepend the section title if the text doesn't stand alone.

Keep chunk metadata. You’ll need it for filtering and result display.
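As a Python sketch of that idea (sizes mirror the word-based Node.js chunker later in this guide; tune them for your corpus), each chunk can carry its source metadata from day one:

```python
# Sketch: word-based chunking that attaches source metadata to every chunk.
def chunk_with_metadata(doc_id, text, metadata, size=500, overlap=80):
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunks.append({
            "doc_id": doc_id,
            "chunk_index": len(chunks),
            "text": " ".join(words[i:i + size]),
            "metadata": metadata,  # title, url, section, etc.
        })
        i += size - overlap  # step forward, keeping `overlap` words of context
    return chunks
```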

Data format for embeddings

Store each chunk with its vector, source document, and metadata (title, URL, tags, timestamps). Many teams use a JSON structure. Before storing or sending, validate the JSON; the DevToolKit JSON Formatter is a quick way to inspect payloads.

{
  "id": "chunk_9f2d...",
  "doc_id": "docs_reset_password",
  "text": "If you forgot your password, click Reset and check your email...",
  "metadata": {
    "title": "Reset your password",
    "url": "https://example.com/docs/reset-password",
    "section": "Account" 
  },
  "embedding": [0.012, -0.984, 0.112, ...]
}
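A lightweight structural check before storing catches most malformed payloads. The required keys follow the example above; the expected dimension of 1536 is an assumption matching the schema below:

```python
import json

REQUIRED_KEYS = {"id", "doc_id", "text", "metadata", "embedding"}

def validate_chunk(payload, expected_dims=1536):
    """Raise ValueError if a chunk record is structurally invalid."""
    record = json.loads(payload) if isinstance(payload, str) else payload
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if len(record["embedding"]) != expected_dims:
        raise ValueError("embedding dimension mismatch")
    return record
```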

Vector storage options (practical tradeoffs)

There are two common paths:

- Postgres + pgvector: vectors live next to your relational data, with SQL filtering and the operational tooling you already have.
- A dedicated vector database or managed service: built for very large indexes and high query throughput.

For most apps under 5–10 million vectors, Postgres + pgvector is enough. If you’re indexing hundreds of millions, consider a dedicated service.

Example: pgvector table schema

CREATE TABLE doc_chunks (
  id UUID PRIMARY KEY,
  doc_id TEXT NOT NULL,
  chunk_index INT NOT NULL,
  text TEXT NOT NULL,
  metadata JSONB,
  embedding VECTOR(1536)
);

-- IVFFlat learns its partitions from existing rows: build this after bulk loading.
CREATE INDEX ON doc_chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

Need UUIDs for chunk IDs? Use the DevToolKit UUID Generator to seed test data.

End-to-end embedding pipeline (Node.js)

This example shows chunking, embedding, and storage. It’s model-agnostic—replace the embedding call with your provider of choice.

import { Client } from "pg";
import { v4 as uuid } from "uuid";

const pg = new Client({ connectionString: process.env.DATABASE_URL });
await pg.connect();

function chunkText(text, size = 500, overlap = 80) {
  const words = text.split(/\s+/);
  const chunks = [];
  let i = 0;
  while (i < words.length) {
    const slice = words.slice(i, i + size);
    chunks.push(slice.join(" "));
    i += size - overlap;
  }
  return chunks;
}

async function embed(text) {
  // Replace with your embedding provider
  const res = await fetch("https://api.example.com/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json", "Authorization": `Bearer ${process.env.API_KEY}` },
    body: JSON.stringify({ model: "embed-2026", input: text })
  });
  if (!res.ok) throw new Error(`Embedding request failed: ${res.status}`);
  const data = await res.json();
  return data.embedding;
}

async function ingestDoc(doc) {
  const chunks = chunkText(doc.text);
  for (let idx = 0; idx < chunks.length; idx++) {
    const embedding = await embed(chunks[idx]);
    await pg.query(
      `INSERT INTO doc_chunks (id, doc_id, chunk_index, text, metadata, embedding)
       VALUES ($1, $2, $3, $4, $5, $6)`,
      // pgvector expects the text form "[0.1,0.2,...]", so serialize the array
      [uuid(), doc.id, idx, chunks[idx], doc.metadata, JSON.stringify(embedding)]
    );
  }
}

Semantic query and ranking

Once data is embedded and indexed, queries are straightforward: embed the query and search for nearest neighbors. Use cosine similarity for normalized embeddings.

async function search(query, limit = 5) {
  const queryEmbedding = await embed(query);
  const res = await pg.query(
    `SELECT doc_id, text, metadata,
            1 - (embedding <=> $1::vector) AS score
     FROM doc_chunks
     ORDER BY embedding <=> $1::vector
     LIMIT $2`,
    // Serialize to pgvector's text form and cast, since pg sends a plain string
    [JSON.stringify(queryEmbedding), limit]
  );
  return res.rows;
}
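The `<=>` operator in that query is pgvector's cosine distance, so `1 - distance` is cosine similarity. A tiny pure-Python version makes the scoring concrete:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Identical direction scores 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0
```

Because the measure depends only on direction, it ignores vector magnitude, which is why it's the standard choice for normalized embeddings.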

Hybrid search: keyword + semantic

Semantic search alone can miss exact matches (e.g., error codes, function names). Hybrid search combines keyword ranking with vector similarity.
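One common way to merge the two rankings, sketched here rather than prescribed, is reciprocal rank fusion (RRF); `k = 60` is the conventional smoothing constant:

```python
def rrf_merge(keyword_ids, vector_ids, k=60):
    """Merge two ranked ID lists with reciprocal rank fusion."""
    scores = {}
    for ranking in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1 / (k + rank), so agreement across lists wins.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# A hit in both lists outranks a top hit in only one:
print(rrf_merge(["a", "b", "c"], ["b", "d", "a"]))  # ['b', 'a', 'd', 'c']
```

RRF needs only ranks, not comparable scores, which sidesteps the awkward problem of normalizing BM25 scores against cosine similarities.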

Practical optimization tips

- Batch embedding requests during ingestion instead of one API call per chunk.
- Build the vector index after bulk loading, and tune it against your own queries.
- Cache embeddings for frequent queries so identical searches don't pay the API round trip twice.
- Filter by metadata (tags, section, recency) in SQL alongside the vector search.
- Prefer smaller dimensions when quality allows; storage and latency scale with dimensionality.

RAG integration: from search to answers

Semantic search is the retrieval layer of a RAG system. After you get top chunks, pass them into your LLM prompt. Keep it tight—most models perform best with 2–5 chunks (300–600 tokens each).

const results = await search("How do I reset my password?", 4);
const context = results.map(r => `- ${r.text}`).join("\n");
const prompt = `Answer the user using only the context:\n${context}\n\nQuestion: How do I reset my password?`;

Security and privacy considerations

Embeddings inherit the sensitivity of the text they encode, so treat the vector store like any other store of user content. Enforce document-level access filters in every retrieval query, and never return chunks the user isn't authorized to see. If your embedding payload includes IDs, make sure they're URL-safe; the DevToolKit URL Encoder helps when passing identifiers in query parameters.

Monitoring quality (the part most teams skip)

Monitor search quality with actual user behavior. Track:

- Click-through rate on top results.
- Queries with no clicks, or followed by an immediate reformulation.
- Similarity scores of returned chunks (a falling average suggests drift).
- Retrieval latency at p50 and p95.

Build a test set of 50–200 queries with expected results. Run nightly evaluation to detect regressions when switching models or chunking logic.
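A minimal nightly evaluation over such a test set might compute recall@k; the query and document IDs below are hypothetical:

```python
def recall_at_k(results_by_query, expected_by_query, k=5):
    """Fraction of queries whose expected doc appears in the top-k results."""
    hits = sum(
        1 for query, expected in expected_by_query.items()
        if expected in results_by_query.get(query, [])[:k]
    )
    return hits / len(expected_by_query)

# Hypothetical run over a two-query labelled set:
expected = {"reset password": "docs_reset_password", "refunds": "docs_refunds"}
results = {
    "reset password": ["docs_reset_password", "docs_2fa"],
    "refunds": ["docs_billing"],
}
print(recall_at_k(results, expected))  # 0.5
```

Run it against yesterday's number: a drop after a model or chunking change is the regression signal you're looking for.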

Example: simple Python ingestion pipeline

import os
import uuid

import psycopg2
import requests
from psycopg2.extras import Json

API_KEY = os.environ["API_KEY"]

conn = psycopg2.connect("dbname=app user=app")
cur = conn.cursor()


def embed(text):
    r = requests.post(
        "https://api.example.com/embeddings",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "embed-2026", "input": text}
    )
    r.raise_for_status()
    return r.json()["embedding"]


def ingest(doc_id, text, metadata):
    # 1,500-character chunks with a 100-character overlap
    chunks = [text[i:i+1500] for i in range(0, len(text), 1400)]
    for i, chunk in enumerate(chunks):
        emb = embed(chunk)
        cur.execute(
            "INSERT INTO doc_chunks (id, doc_id, chunk_index, text, metadata, embedding) VALUES (%s,%s,%s,%s,%s,%s)",
            # Json() adapts the dict for JSONB; str(emb) yields pgvector's "[...]" text form
            (str(uuid.uuid4()), doc_id, i, chunk, Json(metadata), str(emb))
        )
    conn.commit()

Common mistakes to avoid

- Embedding entire documents instead of chunks.
- Relying on embeddings alone for identifier-heavy content instead of hybrid search.
- Skipping access-control filters in retrieval queries.
- Mixing vectors from different models or versions in one index.
- Launching without a labelled query set to measure quality against.

When to re-embed

Re-embed if you:

- Switch embedding models or providers.
- Change chunk size, overlap, or splitting logic.
- Substantially rewrite or restructure the source documents.

Version your embeddings so you can run A/B tests.
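For the A/B test itself, one common pattern (a sketch, not a prescription) is to hash user IDs into a stable bucket per embedding version; the version names here are placeholders:

```python
import hashlib

def ab_bucket(user_id, versions=("embed-2026-v1", "embed-2026-v2")):
    """Deterministically assign a user to an embedding version for A/B testing."""
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return versions[digest % len(versions)]
```

Because the assignment is a pure function of the user ID, the same user always sees the same index, and you can compare click-through between buckets without storing assignments anywhere.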

Final checklist before launch

- Chunking tuned and metadata stored with every vector.
- Vector index built and queries actually hitting it (check the query plan).
- Access filters enforced on every retrieval path.
- Hybrid search in place if your content is identifier-heavy.
- Evaluation set wired into a nightly run, with embeddings versioned.

FAQ

How many dimensions should embeddings have for semantic search?
Use 768–1,536 dimensions for most apps in 2026 because it balances retrieval quality with storage and latency.

Is pgvector fast enough for production?
Yes, pgvector is fast enough up to roughly 5–10 million vectors when indexed with IVFFLAT or HNSW and proper filtering.

How do I handle private documents in semantic search?
Enforce doc-level filters in every query and never return chunks that the user isn’t authorized to see.

Do I need hybrid search or just embeddings?
Use hybrid search for technical or code-heavy content because keyword matching improves precision on identifiers and error codes.

How often should I re-embed my data?
Re-embed whenever you switch models, change chunking logic, or update documents in a meaningful way.

Recommended Tools & Resources

Level up your workflow with these developer tools:

- Cursor Editor
- Anthropic API
- AI Engineering by Chip Huyen
