Embedding Models for Semantic Search in Your App (2026 Guide)
March 14, 2026 · AI for Developers, Semantic Search, Vector Databases
Semantic search has moved from “nice to have” to core product feature. Whether you’re building a docs site, a support portal, or in-app search for user-generated content, embeddings let you retrieve meaning instead of keywords. This guide shows how to build semantic search in a real app in 2026: choosing an embedding model, chunking data, storing vectors, and building fast retrieval with code examples.
What semantic search actually does
Traditional search relies on keyword matching. Semantic search turns text into vectors (embeddings) so you can find meaning—even when the query uses different words. “Reset password” can match “forgot my login,” and “refund policy” can match “return money within 30 days.”
At a high level, the pipeline looks like this:
- Ingest: Split content into chunks.
- Embed: Convert each chunk into a vector.
- Index: Store vectors in a vector index (pgvector, FAISS, Milvus, Pinecone, etc.).
- Query: Embed the user query and retrieve nearest neighbors.
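Under the hood, "nearest" usually means cosine similarity between the query vector and each stored vector. A minimal sketch (the vectors here are toy values, not real embeddings):

```javascript
// Cosine similarity: dot product divided by the product of the magnitudes.
// Ranges from -1 (opposite) through 0 (orthogonal) to 1 (same direction).
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1
console.log(cosineSimilarity([1, 0], [0, 1])); // 0
```

Vector indexes like pgvector compute this for you (the `<=>` operator below is cosine distance), but it helps to know what the score means.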
Choosing an embedding model in 2026
There’s no single “best” embedding model. Your choice depends on cost, latency, hosting, and accuracy. In 2026, most teams choose one of three categories:
- Hosted API embeddings (fast to ship, pay-per-use). Good for startups and teams optimizing time-to-value.
- Open-source embeddings (self-hosted, predictable cost). Best for high volume or data control.
- Domain-tuned embeddings (finetuned on specific content). Best for very technical corpora or highly specialized language.
Concrete recommendation: Start with a high-quality hosted embedding for MVP, then evaluate self-hosted models once you exceed ~10M embeddings or need tighter latency control.
Key model selection criteria
- Vector dimensionality: 384–3,072 dims are common. Higher dims improve recall but cost more to store and search.
- Latency: Sub-100ms embedding is typical for API models; self-hosted models can reach 10–40ms with an optimized GPU setup.
- Multilingual support: If your app has global users, require it from day one.
- License/compliance: Ensure you can store and process user data under your policy.
Chunking strategy that actually works
Embedding entire documents is a common mistake. Queries are short, and big documents average out meaning. Chunking improves relevance.
A practical baseline:
- Chunk size: 300–700 tokens
- Overlap: 50–100 tokens to preserve context between chunks
- Chunk by structure: Split by headings, paragraphs, or sections when possible
Keep chunk metadata. You’ll need it for filtering and result display.
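Structure-aware chunking can be as simple as splitting on headings first and falling back to fixed-size chunks only for oversized sections. A sketch for Markdown content (the heading regex and overall approach are illustrative, not tied to any particular library):

```javascript
// Split Markdown into sections at headings; each section becomes one chunk.
// Oversized sections can then be re-split with a fixed-size chunker.
function chunkByHeadings(markdown) {
  const lines = markdown.split("\n");
  const chunks = [];
  let current = [];
  for (const line of lines) {
    // Start a new chunk whenever a heading (#, ##, ...) begins a line.
    if (/^#{1,6}\s/.test(line) && current.length > 0) {
      chunks.push(current.join("\n").trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) chunks.push(current.join("\n").trim());
  return chunks.filter(c => c.length > 0);
}
```

Each chunk keeps its heading, which doubles as display metadata for results.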
Data format for embeddings
Store each chunk with its vector, source document, and metadata (title, URL, tags, timestamps). Many teams use a JSON structure. Before storing or sending, validate the JSON; the DevToolKit JSON Formatter is a quick way to inspect payloads.
{
  "id": "chunk_9f2d...",
  "doc_id": "docs_reset_password",
  "text": "If you forgot your password, click Reset and check your email...",
  "metadata": {
    "title": "Reset your password",
    "url": "https://example.com/docs/reset-password",
    "section": "Account"
  },
  "embedding": [0.012, -0.984, 0.112, ...]
}
Vector storage options (practical tradeoffs)
There are two common paths:
- Postgres + pgvector: Great for teams already on Postgres, simple to operate.
- Dedicated vector databases: Faster search at scale, better distributed support.
For most apps under 5–10 million vectors, Postgres + pgvector is enough. If you’re indexing hundreds of millions, consider a dedicated service.
Example: pgvector table schema
CREATE TABLE doc_chunks (
  id UUID PRIMARY KEY,
  doc_id TEXT NOT NULL,
  chunk_index INT NOT NULL,
  text TEXT NOT NULL,
  metadata JSONB,
  embedding VECTOR(1536)
);

CREATE INDEX ON doc_chunks USING ivfflat (embedding vector_cosine_ops);
Need UUIDs for chunk IDs? Use the DevToolKit UUID Generator to seed test data.
End-to-end embedding pipeline (Node.js)
This example shows chunking, embedding, and storage. It’s model-agnostic—replace the embedding call with your provider of choice.
import { Client } from "pg";
import { v4 as uuid } from "uuid";

const pg = new Client({ connectionString: process.env.DATABASE_URL });
await pg.connect();

// Word-based chunking with overlap so context carries across boundaries.
function chunkText(text, size = 500, overlap = 80) {
  const words = text.split(/\s+/);
  const chunks = [];
  let i = 0;
  while (i < words.length) {
    const slice = words.slice(i, i + size);
    chunks.push(slice.join(" "));
    i += size - overlap;
  }
  return chunks;
}

async function embed(text) {
  // Replace with your embedding provider
  const res = await fetch("https://api.example.com/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${process.env.API_KEY}`
    },
    body: JSON.stringify({ model: "embed-2026", input: text })
  });
  const data = await res.json();
  return data.embedding;
}

async function ingestDoc(doc) {
  const chunks = chunkText(doc.text);
  for (let idx = 0; idx < chunks.length; idx++) {
    const embedding = await embed(chunks[idx]);
    await pg.query(
      `INSERT INTO doc_chunks (id, doc_id, chunk_index, text, metadata, embedding)
       VALUES ($1, $2, $3, $4, $5, $6)`,
      // pgvector expects the vector as a string literal like "[0.012,-0.984,...]",
      // so serialize the array before binding it as a parameter.
      [uuid(), doc.id, idx, chunks[idx], doc.metadata, JSON.stringify(embedding)]
    );
  }
}
Semantic query and ranking
Once data is embedded and indexed, queries are straightforward: embed the query and search for nearest neighbors. Use cosine similarity for normalized embeddings.
async function search(query, limit = 5) {
  const queryEmbedding = await embed(query);
  const res = await pg.query(
    `SELECT doc_id, text, metadata,
            1 - (embedding <=> $1) AS score
     FROM doc_chunks
     ORDER BY embedding <=> $1
     LIMIT $2`,
    // pgvector expects the vector as a string literal like "[0.012,-0.984,...]"
    [JSON.stringify(queryEmbedding), limit]
  );
  return res.rows;
}
Hybrid search: keyword + semantic
Semantic search alone can miss exact matches (e.g., error codes, function names). Hybrid search combines keyword ranking with vector similarity.
- Keyword engine: Postgres full-text, Elasticsearch, or Meilisearch.
- Vector engine: pgvector or a vector DB.
- Merge: Combine with a weighted score, e.g. final = 0.6 * vector_score + 0.4 * keyword_score.
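The merge step can be as simple as joining the two result lists by chunk id and applying the weights. A sketch, assuming both engines return scores normalized to 0–1 (the 0.6/0.4 weights are a starting point to tune on your own eval set):

```javascript
// Merge keyword and vector results into one ranked list.
// An id missing from either list scores 0 for that component.
function mergeResults(vectorResults, keywordResults, wVec = 0.6, wKw = 0.4) {
  const scores = new Map();
  for (const r of vectorResults) {
    scores.set(r.id, { ...r, final: wVec * r.score });
  }
  for (const r of keywordResults) {
    const prev = scores.get(r.id);
    if (prev) prev.final += wKw * r.score;
    else scores.set(r.id, { ...r, final: wKw * r.score });
  }
  return [...scores.values()].sort((a, b) => b.final - a.final);
}
```

A chunk that ranks well in both lists beats one that dominates only a single list, which is usually the behavior you want for identifiers and error codes.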
Practical optimization tips
- Normalize embeddings: Most providers do this. If not, normalize to unit length for cosine distance.
- Cache query embeddings: Popular queries repeat; store hash → embedding.
- Batch embedding: Embed in batches of 16–128 to cut API latency.
- Filter before search: If you can filter by language, product, or visibility, do it first.
- Store metadata separately: Keep the vector table lean for faster ANN search.
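Batching is mostly a matter of grouping chunks before the API call. A sketch, assuming the provider accepts an array input and returns one embedding per item in a `data.embeddings` field (the endpoint and response shape are placeholders; adapt to your provider):

```javascript
// Group items into fixed-size batches.
function toBatches(items, batchSize = 64) {
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Embed all texts, one API call per batch instead of per chunk.
async function embedAll(texts, batchSize = 64) {
  const out = [];
  for (const batch of toBatches(texts, batchSize)) {
    const res = await fetch("https://api.example.com/embeddings", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Authorization": `Bearer ${process.env.API_KEY}`
      },
      body: JSON.stringify({ model: "embed-2026", input: batch })
    });
    const data = await res.json();
    // Assumed response shape: { embeddings: [[...], [...], ...] }
    out.push(...data.embeddings);
  }
  return out;
}
```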
RAG integration: from search to answers
Semantic search is the retrieval layer of a RAG system. After you get top chunks, pass them into your LLM prompt. Keep it tight—most models perform best with 2–5 chunks (300–600 tokens each).
const results = await search("How do I reset my password?", 4);
const context = results.map(r => `- ${r.text}`).join("\n");
const prompt = `Answer the user using only the context:\n${context}\n\nQuestion: How do I reset my password?`;
Security and privacy considerations
- PII handling: Avoid embedding sensitive fields directly (SSNs, phone numbers).
- Access control: Enforce doc-level permissions at query time.
- Data retention: Define an embedding deletion strategy if content is removed.
If your embedding payload includes IDs, make sure they’re URL-safe; the DevToolKit URL Encoder helps when passing identifiers in query parameters.
Monitoring quality (the part most teams skip)
Monitor search quality with actual user behavior. Track:
- Click-through rate on top results
- Zero-result searches (even with embeddings)
- Query-to-answer satisfaction (thumbs up/down)
Build a test set of 50–200 queries with expected results. Run nightly evaluation to detect regressions when switching models or chunking logic.
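A nightly evaluation can be as simple as recall@k over that test set: for each query, check whether the expected document appears in the top-k results. A sketch (the search function is assumed to return rows carrying a `doc_id`, as in the query example above; the test-set shape is illustrative):

```javascript
// Compute recall@k: the fraction of test queries whose expected
// document appears in the top-k search results.
async function recallAtK(testSet, searchFn, k = 5) {
  let hits = 0;
  for (const { query, expectedDocId } of testSet) {
    const results = await searchFn(query, k);
    if (results.some(r => r.doc_id === expectedDocId)) hits++;
  }
  return hits / testSet.length;
}
```

Log the score per run; a sudden drop after a model or chunking change is your regression signal.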
Example: simple Python ingestion pipeline
import json
import os
import uuid

import psycopg2
import requests

API_KEY = os.environ["API_KEY"]

conn = psycopg2.connect("dbname=app user=app")
cur = conn.cursor()

def embed(text):
    r = requests.post(
        "https://api.example.com/embeddings",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "embed-2026", "input": text},
    )
    return r.json()["embedding"]

def ingest(doc_id, text, metadata):
    # 1,500-character chunks with a 100-character overlap
    chunks = [text[i:i + 1500] for i in range(0, len(text), 1400)]
    for i, chunk in enumerate(chunks):
        emb = embed(chunk)
        cur.execute(
            "INSERT INTO doc_chunks (id, doc_id, chunk_index, text, metadata, embedding)"
            " VALUES (%s, %s, %s, %s, %s, %s)",
            # JSONB needs serialized JSON; pgvector needs a "[0.1, 0.2, ...]" literal
            (str(uuid.uuid4()), doc_id, i, chunk, json.dumps(metadata), str(emb)),
        )
    conn.commit()
Common mistakes to avoid
- Embedding raw HTML: Strip tags first.
- Skipping overlap: You lose context at chunk boundaries.
- Ignoring filters: Without doc-level access control, users can see private content.
- Evaluating only with “gut feel”: Use a test set and metrics.
When to re-embed
Re-embed if you:
- Change the embedding model
- Change chunking strategy
- Significantly update the document text
Version your embeddings so you can run A/B tests.
Final checklist before launch
- Chunk size and overlap defined
- Embedding model selected and benchmarked
- Vector index configured (IVFFLAT or HNSW)
- Access control enforced at query time
- Quality evaluation dataset in place
FAQ
How many dimensions should embeddings have for semantic search?
For most apps in 2026, 768–1,536 dimensions strike the right balance between retrieval quality, storage, and latency.
Is pgvector fast enough for production?
Yes, pgvector is fast enough up to roughly 5–10 million vectors when indexed with IVFFLAT or HNSW and proper filtering.
How do I handle private documents in semantic search?
Enforce doc-level filters in every query and never return chunks that the user isn’t authorized to see.
Do I need hybrid search or just embeddings?
Use hybrid search for technical or code-heavy content because keyword matching improves precision on identifiers and error codes.
How often should I re-embed my data?
Re-embed whenever you switch models, change chunking logic, or update documents in a meaningful way.