Chanl
Learning AI

Embeddings Turn Text Into Meaning. Here's the Math and the Code

What embeddings are, how similarity search works under the hood, and how to build a semantic search engine, from cosine similarity math to production vector databases.

Dean Grover, Co-founder
March 26, 2026
20 min read
[Illustration: person exploring geometric shapes representing vector space]

A user types "how to fix a leaky faucet" into your search bar. The top result is titled "plumbing repair." Another user writes "cancel my subscription" and lands on "Account Cancellation Policy." Zero keywords overlap. Yet the search engine knows they mean the same thing.

That's embeddings.

Behind every semantic search, every RAG pipeline, every recommendation engine, there's a model turning text into numbers and a distance function deciding which numbers are close. It sounds abstract until you see the actual vectors, write the actual math, and build the actual search. Then it clicks.

That's what this article does. We generate real embeddings, implement cosine similarity from scratch, build a working search engine in 50 lines, compare the leading models head to head, and graduate from in-memory brute force to a production vector database.

What Are Embeddings, Really?

An embedding is a fixed-length array of numbers that captures semantic meaning. Each number represents a position along some learned dimension of meaning. Texts with similar meanings land near each other in this high-dimensional space, regardless of the specific words they use.

The classic example, first demonstrated by Mikolov et al. at Google in 2013: the vector for "king" minus "man" plus "woman" produces a vector close to "queen." The model learned that royalty and gender are separate dimensions, and you can do arithmetic on them. This isn't a trick. It falls directly out of how embedding models are trained: by predicting which words appear near each other in billions of sentences.
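You can see the arithmetic in a toy sketch. The vectors below are made up purely for illustration, with one dimension loosely standing in for "royalty" and another for "gender"; real embeddings have hundreds of dimensions and no such clean labels:

```python
import math

# Hypothetical toy embeddings: [royalty, gender, other] -- invented for illustration
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.1],
    "queen": [0.9, 0.1, 0.1],
    "apple": [0.0, 0.5, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# king - man + woman, element by element
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# The word whose vector points closest to the result
best = max(vectors, key=lambda word: cosine(vectors[word], target))
print(best)  # queen
```

Subtracting "man" cancels the gender component while keeping royalty; adding "woman" restores the other gender value, landing on "queen."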

Modern embedding models output vectors with hundreds or thousands of dimensions. OpenAI's text-embedding-3-small produces 1,536 numbers per text input. Each number is typically a float between -1 and 1. You can't interpret individual dimensions ("dimension 847 means formality"), but the overall pattern is what carries meaning.

Let's generate some real embeddings and look at what comes back.

TypeScript:

typescript
import OpenAI from "openai";
 
const openai = new OpenAI();
 
async function getEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return response.data[0].embedding;
}
 
async function main() {
  const text = "How to fix a leaky faucet";
  const embedding = await getEmbedding(text);
 
  console.log(`Text: "${text}"`);
  console.log(`Dimensions: ${embedding.length}`);
  console.log(`First 10 values: [${embedding.slice(0, 10).map(v => v.toFixed(6)).join(", ")}]`);
  console.log(`Min: ${Math.min(...embedding).toFixed(6)}`);
  console.log(`Max: ${Math.max(...embedding).toFixed(6)}`);
}
 
main();

Python:

python
from openai import OpenAI
 
client = OpenAI()
 
def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding
 
text = "How to fix a leaky faucet"
embedding = get_embedding(text)
 
print(f'Text: "{text}"')
print(f"Dimensions: {len(embedding)}")
print(f"First 10 values: {[round(v, 6) for v in embedding[:10]]}")
print(f"Min: {min(embedding):.6f}")
print(f"Max: {max(embedding):.6f}")

Running either version reports a 1,536-dimension vector of floating-point numbers. The exact values aren't meaningful individually. What matters is the pattern: two texts about plumbing produce similar patterns, while a text about stock trading produces a completely different one. The next section shows exactly how to measure that similarity.

How Does Cosine Similarity Actually Work?

Cosine similarity measures the angle between two vectors. Two vectors pointing in the same direction score 1.0 (identical meaning). Perpendicular vectors score 0.0 (unrelated). Opposite vectors score -1.0. The formula ignores magnitude and focuses purely on direction, which is why a short sentence and a long paragraph about the same topic still score high.

The formula is just the dot product of the two vectors divided by the product of their magnitudes: similarity = (A · B) / (|A| × |B|).
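Here's that formula worked by hand on toy 2-dimensional vectors, before applying it to real embeddings:

```python
import math

a = [1.0, 0.0]
b = [1.0, 1.0]

dot = a[0] * b[0] + a[1] * b[1]       # 1.0
mag_a = math.sqrt(a[0]**2 + a[1]**2)  # 1.0
mag_b = math.sqrt(b[0]**2 + b[1]**2)  # sqrt(2) ≈ 1.414

similarity = dot / (mag_a * mag_b)
print(round(similarity, 4))  # 0.7071 -- cos(45°), the angle between the two vectors
```

The vectors sit 45 degrees apart, and cos(45°) ≈ 0.7071: direction, not length, determines the score.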

[Diagram: vector A ("cat sits on mat") compared with vector B ("feline rests on rug") yields a similarity of 0.88; A compared with vector C ("stock market crashed") yields 0.12.]
Cosine similarity measures the angle between vectors. Smaller angle means higher similarity.

Here's cosine similarity implemented from scratch. No libraries, no dependencies. Five lines of math that power nearly every semantic search system in production.

TypeScript:

typescript
import OpenAI from "openai";
 
const openai = new OpenAI();
 
// Cosine similarity from scratch -- dot product divided by magnitude product
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}
 
async function getEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return response.data[0].embedding;
}
 
async function main() {
  const sentences = [
    "The cat sits on the mat",
    "A feline rests on the rug",
    "The stock market crashed yesterday",
  ];
 
  // Embed all three sentences in a single API call
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: sentences,
  });
  const embeddings = response.data.map(d => d.embedding);
 
  // Compare every pair
  for (let i = 0; i < sentences.length; i++) {
    for (let j = i + 1; j < sentences.length; j++) {
      const score = cosineSimilarity(embeddings[i], embeddings[j]);
      console.log(`"${sentences[i]}" vs "${sentences[j]}"`);
      console.log(`  Similarity: ${score.toFixed(4)}\n`);
    }
  }
}
 
main();

Python:

python
import math
from openai import OpenAI
 
client = OpenAI()
 
# Cosine similarity from scratch -- dot product divided by magnitude product
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(x * x for x in b))
    return dot / (mag_a * mag_b)
 
def get_embeddings(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [d.embedding for d in response.data]
 
sentences = [
    "The cat sits on the mat",
    "A feline rests on the rug",
    "The stock market crashed yesterday",
]
 
embeddings = get_embeddings(sentences)
 
# Compare every pair
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        score = cosine_similarity(embeddings[i], embeddings[j])
        print(f'"{sentences[i]}" vs "{sentences[j]}"')
        print(f"  Similarity: {score:.4f}\n")

You'll see something like: the cat/feline pair scores around 0.85-0.90, while either sentence compared against the stock market sentence drops to 0.10-0.15. The embedding model has never seen these exact sentences before, but it's learned from training data that cats and felines are semantically close while stock markets have nothing to do with either.

That gap between 0.88 and 0.12 is the foundation of everything. It's how search engines find relevant results without keyword overlap, how RAG pipelines retrieve the right context for LLMs, and how recommendation systems surface content you'll actually care about.

Build Along: Semantic Search in 50 Lines

A working semantic search engine needs three steps: embed your documents at index time, embed the user's query at search time, and return the documents with the highest cosine similarity scores. The whole thing fits in about 50 lines.

The architecture is simple. At index time, you embed every document and store the vectors alongside the text. At query time, you embed the search query, compute cosine similarity against every stored vector, and return the top matches.

TypeScript:

typescript
import OpenAI from "openai";
 
const openai = new OpenAI();
 
interface Document {
  text: string;
  embedding: number[];
}
 
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}
 
// Index: embed all documents in one batch API call
async function indexDocuments(texts: string[]): Promise<Document[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return texts.map((text, i) => ({
    text,
    embedding: response.data[i].embedding,
  }));
}
 
// Search: embed query, brute-force compare against all documents
async function search(query: string, docs: Document[], topK = 3) {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
  const queryEmbedding = response.data[0].embedding;
 
  return docs
    .map(doc => ({
      text: doc.text,
      score: cosineSimilarity(queryEmbedding, doc.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
 
async function main() {
  // A small knowledge base about home repair
  const documents = await indexDocuments([
    "To fix a leaky faucet, first turn off the water supply valve under the sink.",
    "Replace worn washers and O-rings to stop faucet drips permanently.",
    "A running toilet usually means the flapper valve needs replacing.",
    "Use plumber's tape on threaded connections to prevent leaks.",
    "Annual HVAC filter replacement improves energy efficiency by 5-15%.",
    "Clogged drains can be cleared with a plunger or drain snake.",
    "Check your water heater's anode rod every 2-3 years to prevent tank corrosion.",
    "Investing in index funds provides broad market exposure at low cost.",
    "The Federal Reserve adjusts interest rates to control inflation.",
    "Quarterly earnings reports drive short-term stock price movements.",
  ]);
 
  console.log(`Indexed ${documents.length} documents\n`);
 
  // Search with natural language -- no keyword matching needed
  const results = await search("my sink is dripping water", documents);
  console.log('Query: "my sink is dripping water"\n');
  for (const result of results) {
    console.log(`  [${result.score.toFixed(4)}] ${result.text}`);
  }
}
 
main();

Python:

python
import math
from openai import OpenAI
 
client = OpenAI()
 
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(x * x for x in b))
    return dot / (mag_a * mag_b)
 
# Index: embed all documents in one batch API call
def index_documents(texts: list[str]) -> list[dict]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [
        {"text": text, "embedding": response.data[i].embedding}
        for i, text in enumerate(texts)
    ]
 
# Search: embed query, brute-force compare against all documents
def search(query: str, docs: list[dict], top_k: int = 3) -> list[dict]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    )
    query_embedding = response.data[0].embedding
 
    scored = [
        {"text": doc["text"], "score": cosine_similarity(query_embedding, doc["embedding"])}
        for doc in docs
    ]
    scored.sort(key=lambda x: x["score"], reverse=True)
    return scored[:top_k]
 
# A small knowledge base about home repair
documents = index_documents([
    "To fix a leaky faucet, first turn off the water supply valve under the sink.",
    "Replace worn washers and O-rings to stop faucet drips permanently.",
    "A running toilet usually means the flapper valve needs replacing.",
    "Use plumber's tape on threaded connections to prevent leaks.",
    "Annual HVAC filter replacement improves energy efficiency by 5-15%.",
    "Clogged drains can be cleared with a plunger or drain snake.",
    "Check your water heater's anode rod every 2-3 years to prevent tank corrosion.",
    "Investing in index funds provides broad market exposure at low cost.",
    "The Federal Reserve adjusts interest rates to control inflation.",
    "Quarterly earnings reports drive short-term stock price movements.",
])
 
print(f"Indexed {len(documents)} documents\n")
 
# Search with natural language -- no keyword matching needed
results = search("my sink is dripping water", documents)
print('Query: "my sink is dripping water"\n')
for r in results:
    print(f"  [{r['score']:.4f}] {r['text']}")

The faucet and washer documents score highest, even though "dripping" and "sink" don't appear in most of the top results. The finance documents score near zero. This is why semantic search matters: the user doesn't need to know the exact vocabulary of your knowledge base.

But this approach has an obvious problem. The search function computes cosine similarity against every document. With 10 documents, that's instant. With 10,000, it takes a few milliseconds. With 10 million, you're waiting seconds per query. The brute-force approach is O(n) per search, and that doesn't scale.
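Brute force can be stretched further before you need a database: normalize every vector once at index time, and cosine similarity collapses to a dot product, so a whole corpus can be scored with one matrix-vector product. A sketch using NumPy (not used elsewhere in this article) with random stand-in embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in corpus: 10,000 random 1,536-dimension "embeddings"
docs = rng.standard_normal((10_000, 1536)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # normalize once at index time

query = rng.standard_normal(1536).astype(np.float32)
query /= np.linalg.norm(query)

# With unit vectors, cosine similarity is just a dot product:
# one matrix-vector product scores the entire corpus
scores = docs @ query

# Top-3 without sorting everything: argpartition is O(n)
top_k = 3
top_idx = np.argpartition(scores, -top_k)[-top_k:]
top_idx = top_idx[np.argsort(scores[top_idx])[::-1]]  # order the winners
print(top_idx, scores[top_idx])
```

This keeps searches fast well into the hundreds of thousands of documents, but it's still O(n) per query; the constant factor shrinks, the scaling doesn't.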

This is where vector databases come in. They use approximate nearest neighbor algorithms (HNSW, IVF, product quantization) to find similar vectors in sub-millisecond time, even across billions of documents. But before we get there, you need to choose the right embedding model.

How to Choose an Embedding Model

The embedding model determines the quality ceiling of your entire search system. No amount of clever retrieval or reranking can fix bad embeddings. Choose the wrong model and your search returns plausible but wrong results. Choose the right one and the rest of the pipeline has room to work.

Here are the current leaders, ranked by the MTEB (Massive Text Embedding Benchmark) leaderboard as of March 2026. MTEB evaluates models across retrieval, classification, clustering, and semantic similarity tasks. It's the closest thing to a standard benchmark for embeddings.

| Model | Provider | Dimensions | MTEB Retrieval | Cost (per 1M tokens) | Self-Hostable |
| --- | --- | --- | --- | --- | --- |
| Gemini Embedding 001 | Google | 768-3072 | ~67.7 | Free (under limits) | No |
| text-embedding-3-large | OpenAI | 256-3072 | ~64.6 | $0.13 | No |
| text-embedding-3-small | OpenAI | 512-1536 | ~61.6 | $0.02 | No |
| Embed v4 | Cohere | 256-1536 | ~65.0 | $0.12 | No |
| BGE-M3 | BAAI | 1024 | ~63.0 | Free | Yes |
| Nomic Embed Text v2 | Nomic AI | 256-768 | ~60.5 | Free | Yes |
| Jina Embeddings v3 | Jina AI | 32-1024 | ~58.0 | $0.02 | License required |
| NV-Embed-v2 | NVIDIA | 4096 | ~62.7 | Free | Yes |

A few patterns jump out of this table. First, the gap between proprietary and open-source models has narrowed significantly. BGE-M3 at 63.0 retrieval is competitive with OpenAI's text-embedding-3-large at 64.6, and it's completely free to run on your own hardware. Second, cost varies by an order of magnitude: OpenAI's small model costs $0.02 per million tokens while the large model costs $0.13 for roughly 3 percentage points of improvement. For most applications, the small model is the right starting point.

Matryoshka embeddings: pay for the dimensions you need

Most modern embedding models support matryoshka representations, named after Russian nesting dolls. The model is trained so that the most important semantic information is packed into the first dimensions. You can truncate a 3,072-dimension vector down to 256 dimensions and still get useful search results.

OpenAI's text-embedding-3-large at 256 dimensions actually outperforms the older text-embedding-ada-002 at its full 1,536 dimensions on MTEB benchmarks. That's a 6x reduction in storage and compute for better quality. Cohere's Embed v4 supports the same trick across 256, 512, 1,024, and 1,536 dimensions.

Here's how to embed the same text with different dimension counts and compare the results. The dimensions parameter truncates the output on the server side, so you pay the same API cost but store less data.

TypeScript:

typescript
import OpenAI from "openai";
 
const openai = new OpenAI();
 
// Cosine similarity -- same implementation as before
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}
 
async function embedWithDimensions(texts: string[], dimensions: number) {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-large",
    input: texts,
    dimensions, // Matryoshka truncation -- server-side
  });
  return response.data.map(d => d.embedding);
}
 
async function main() {
  const texts = [
    "How to fix a leaky faucet",
    "Plumbing repair guide for beginners",
    "Stock market analysis quarterly report",
  ];
 
  // Compare quality at different dimension counts
  for (const dims of [256, 1024, 3072]) {
    const embeddings = await embedWithDimensions(texts, dims);
    const simRelated = cosineSimilarity(embeddings[0], embeddings[1]);
    const simUnrelated = cosineSimilarity(embeddings[0], embeddings[2]);
 
    console.log(`Dimensions: ${dims}`);
    console.log(`  Faucet vs Plumbing: ${simRelated.toFixed(4)}`);
    console.log(`  Faucet vs Stocks:   ${simUnrelated.toFixed(4)}`);
    console.log(`  Separation gap:     ${(simRelated - simUnrelated).toFixed(4)}\n`);
  }
}
 
main();

Python:

python
import math
from openai import OpenAI
 
client = OpenAI()
 
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(x * x for x in b))
    return dot / (mag_a * mag_b)
 
def embed_with_dimensions(texts: list[str], dimensions: int) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
        dimensions=dimensions,  # Matryoshka truncation -- server-side
    )
    return [d.embedding for d in response.data]
 
texts = [
    "How to fix a leaky faucet",
    "Plumbing repair guide for beginners",
    "Stock market analysis quarterly report",
]
 
# Compare quality at different dimension counts
for dims in [256, 1024, 3072]:
    embeddings = embed_with_dimensions(texts, dims)
    sim_related = cosine_similarity(embeddings[0], embeddings[1])
    sim_unrelated = cosine_similarity(embeddings[0], embeddings[2])
 
    print(f"Dimensions: {dims}")
    print(f"  Faucet vs Plumbing: {sim_related:.4f}")
    print(f"  Faucet vs Stocks:   {sim_unrelated:.4f}")
    print(f"  Separation gap:     {sim_related - sim_unrelated:.4f}\n")

You'll see that the separation gap between related and unrelated texts holds remarkably well even at 256 dimensions. The practical takeaway: start with lower dimensions to save storage, only increase if your retrieval quality suffers on your actual data. A 256-dimension vector uses 1 KB. A 3,072-dimension vector uses 12 KB. At a million documents, that's the difference between 1 GB and 12 GB of storage.
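If a provider doesn't offer server-side truncation, matryoshka embeddings can also be truncated client-side; the one catch is that the shortened vector must be renormalized to unit length before cosine comparisons. A minimal sketch, using a made-up vector in place of a real API response:

```python
import math

def truncate_and_normalize(embedding: list[float], dims: int) -> list[float]:
    # Keep the first `dims` values, then rescale to unit length
    truncated = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in truncated))
    return [x / norm for x in truncated]

# Stand-in for a full 3,072-dimension embedding from the API
full = [math.sin(i) for i in range(3072)]

short = truncate_and_normalize(full, 256)
print(len(short))                           # 256
print(round(sum(x * x for x in short), 6))  # 1.0 -- unit length again
```

Without the renormalization step, dot products against the truncated vector would be systematically deflated, because slicing dimensions shrinks the magnitude.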

Vector Databases: Pinecone vs Qdrant vs Chroma vs pgvector

Already running Postgres? Use pgvector. Want zero ops? Pinecone. Best open-source self-hosted option? Qdrant. Prototyping and want something running in 30 seconds? ChromaDB.

That's the short answer. Here's the reasoning.

Vector databases solve the O(n) brute-force problem from our in-memory search engine. Instead of comparing the query against every document, they build index structures (HNSW graphs, IVF clusters) that narrow the search space. A query that would scan 10 million vectors with brute force takes under 10 milliseconds with an HNSW index. The tradeoff is that results are approximate nearest neighbors, not exact, but recall (the fraction of true nearest neighbors found) is typically 95-99% with sensibly tuned index parameters.

| Database | Type | Language | Max Vectors (free tier) | Hybrid Search | Best For |
| --- | --- | --- | --- | --- | --- |
| Pinecone | Managed SaaS | - | 100K (serverless) | Yes | Zero-ops production |
| Qdrant | Self-hosted / Cloud | Rust | 1M (free cloud) | Yes (sparse vectors) | Self-hosted performance |
| ChromaDB | Embedded / Server | Rust (core) | Unlimited (local) | No (dense only) | Prototyping, small datasets |
| pgvector | Postgres extension | C | Unlimited (your Postgres) | Via tsvector | Teams already on Postgres |
| Weaviate | Self-hosted / Cloud | Go | 50K (free sandbox) | Yes (BM25 built-in) | Hybrid search out of the box |
| Milvus | Self-hosted / Cloud | Go/C++ | Unlimited (local) | Yes | GPU-accelerated, large scale |

Let's rebuild our search engine using Qdrant. The core logic stays the same: embed documents, store them, search by similarity. But now the database handles the indexing and approximate search instead of our brute-force loop.

TypeScript:

typescript
import OpenAI from "openai";
import { QdrantClient } from "@qdrant/js-client-rest";
 
const openai = new OpenAI();
const qdrant = new QdrantClient({ url: "http://localhost:6333" });
 
const COLLECTION = "home_repair";
 
async function getEmbeddings(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return response.data.map(d => d.embedding);
}
 
async function createCollection() {
  // Delete if exists, then create fresh
  const collections = await qdrant.getCollections();
  if (collections.collections.some(c => c.name === COLLECTION)) {
    await qdrant.deleteCollection(COLLECTION);
  }
 
  await qdrant.createCollection(COLLECTION, {
    vectors: { size: 1536, distance: "Cosine" },
  });
}
 
async function indexDocuments(texts: string[]) {
  const embeddings = await getEmbeddings(texts);
 
  // Upsert points with text stored as payload metadata
  await qdrant.upsert(COLLECTION, {
    wait: true,
    points: texts.map((text, i) => ({
      id: i,
      vector: embeddings[i],
      payload: { text },
    })),
  });
 
  console.log(`Indexed ${texts.length} documents into Qdrant`);
}
 
async function search(query: string, topK = 3) {
  const [queryEmbedding] = await getEmbeddings([query]);
 
  const results = await qdrant.search(COLLECTION, {
    vector: queryEmbedding,
    limit: topK,
    with_payload: true,
  });
 
  return results.map(r => ({
    text: (r.payload as { text: string }).text,
    score: r.score,
  }));
}
 
async function main() {
  await createCollection();
 
  await indexDocuments([
    "To fix a leaky faucet, first turn off the water supply valve under the sink.",
    "Replace worn washers and O-rings to stop faucet drips permanently.",
    "A running toilet usually means the flapper valve needs replacing.",
    "Use plumber's tape on threaded connections to prevent leaks.",
    "Annual HVAC filter replacement improves energy efficiency by 5-15%.",
    "Clogged drains can be cleared with a plunger or drain snake.",
    "Check your water heater's anode rod every 2-3 years to prevent tank corrosion.",
  ]);
 
  const results = await search("my sink is dripping water");
  console.log('\nQuery: "my sink is dripping water"\n');
  for (const r of results) {
    console.log(`  [${r.score.toFixed(4)}] ${r.text}`);
  }
}
 
main();

Python:

python
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams, PointStruct
 
openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")
 
COLLECTION = "home_repair"
 
def get_embeddings(texts: list[str]) -> list[list[float]]:
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [d.embedding for d in response.data]
 
def create_collection():
    # Recreate collection fresh
    collections = [c.name for c in qdrant.get_collections().collections]
    if COLLECTION in collections:
        qdrant.delete_collection(COLLECTION)
 
    qdrant.create_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    )
 
def index_documents(texts: list[str]):
    embeddings = get_embeddings(texts)
 
    # Upsert points with text stored as payload metadata
    qdrant.upsert(
        collection_name=COLLECTION,
        wait=True,
        points=[
            PointStruct(
                id=i,
                vector=embeddings[i],
                payload={"text": text},
            )
            for i, text in enumerate(texts)
        ],
    )
    print(f"Indexed {len(texts)} documents into Qdrant")
 
def search(query: str, top_k: int = 3) -> list[dict]:
    query_embedding = get_embeddings([query])[0]
 
    results = qdrant.search(
        collection_name=COLLECTION,
        query_vector=query_embedding,
        limit=top_k,
    )
 
    return [{"text": r.payload["text"], "score": r.score} for r in results]
 
create_collection()
 
index_documents([
    "To fix a leaky faucet, first turn off the water supply valve under the sink.",
    "Replace worn washers and O-rings to stop faucet drips permanently.",
    "A running toilet usually means the flapper valve needs replacing.",
    "Use plumber's tape on threaded connections to prevent leaks.",
    "Annual HVAC filter replacement improves energy efficiency by 5-15%.",
    "Clogged drains can be cleared with a plunger or drain snake.",
    "Check your water heater's anode rod every 2-3 years to prevent tank corrosion.",
])
 
results = search("my sink is dripping water")
print('\nQuery: "my sink is dripping water"\n')
for r in results:
    print(f"  [{r['score']:.4f}] {r['text']}")

The search results are identical to our brute-force version, but now the database handles the heavy lifting. With 7 documents, you won't notice a speed difference. At 100,000 documents, the brute-force approach takes hundreds of milliseconds while Qdrant returns results in under 5. At 10 million, Qdrant still returns in under 10 milliseconds. That's the HNSW index doing its job.

To run this locally, start Qdrant with Docker: docker run -p 6333:6333 qdrant/qdrant. The code above works as-is.

How Do You Prepare Real Documents for Embedding?

Production documents need chunking (splitting into coherent pieces), caching (avoiding redundant API calls), and often hybrid search (combining vector similarity with keyword matching). Embedding an entire 10-page document as one vector crushes all the meaning into a single point. The embedding for a document about "refund policies, shipping times, and product specifications" will be a vague average of all three topics, matching none of them well.

Chunking strategies

Chunking splits documents into pieces that each capture a coherent idea. The right chunk size depends on your content, but 300-500 tokens is a strong default. Too large and the embedding averages over too many concepts. Too small and you lose context that makes the text meaningful.

Three common strategies:

Fixed-size chunking splits text every N tokens with some overlap. Simple and predictable, but it ignores document structure. A chunk boundary might land in the middle of a sentence or split a code block in half.

Recursive splitting tries paragraph breaks first, then sentence breaks, then falls back to fixed-size. This preserves natural boundaries in the text and produces more coherent chunks.

Semantic chunking uses embeddings to detect topic shifts within a document. When consecutive sentences have low similarity, that's a natural split point. More expensive (you're embedding sentences individually) but produces the highest quality chunks.

Here's a paragraph-aware text splitter that handles real documents. It splits on paragraph boundaries and carries overlap between consecutive chunks so no information is lost at the edges. (A full recursive splitter would also fall back to sentence boundaries when a single paragraph exceeds the chunk size.)

TypeScript:

typescript
interface Chunk {
  text: string;
  index: number;
  metadata: { start: number; end: number };
}
 
function chunkText(
  text: string,
  maxChunkSize = 500,
  overlap = 50
): Chunk[] {
  // Split into paragraphs first, preserving natural boundaries
  const paragraphs = text.split(/\n\n+/).filter(p => p.trim().length > 0);
  const chunks: Chunk[] = [];
  let currentChunk = "";
  let chunkStart = 0;
  let position = 0;
 
  for (const paragraph of paragraphs) {
    // If adding this paragraph exceeds the limit, finalize current chunk
    if (currentChunk.length + paragraph.length > maxChunkSize && currentChunk.length > 0) {
      chunks.push({
        text: currentChunk.trim(),
        index: chunks.length,
        metadata: { start: chunkStart, end: position },
      });
 
      // Overlap: keep the last N characters from the previous chunk
      const overlapText = currentChunk.slice(-overlap);
      currentChunk = overlapText + " " + paragraph;
      chunkStart = position - overlap;
    } else {
      currentChunk += (currentChunk ? "\n\n" : "") + paragraph;
    }
    position += paragraph.length + 2;
  }
 
  // Don't forget the last chunk
  if (currentChunk.trim().length > 0) {
    chunks.push({
      text: currentChunk.trim(),
      index: chunks.length,
      metadata: { start: chunkStart, end: position },
    });
  }
 
  return chunks;
}
 
// Example usage
const document = `
Refund Policy
 
All purchases can be refunded within 30 days of the original purchase date.
To request a refund, contact our support team with your order number.
 
After 30 days, we offer prorated refunds for annual subscriptions only.
Monthly subscriptions are non-refundable after the billing date.
 
Shipping Policy
 
Standard shipping takes 5-7 business days within the continental US.
Express shipping is available for an additional fee and arrives in 2-3 days.
 
International shipping times vary by destination and typically take 10-21 days.
Import duties and taxes are the responsibility of the buyer.
`;
 
const chunks = chunkText(document, 300, 50);
for (const chunk of chunks) {
  console.log(`--- Chunk ${chunk.index} (${chunk.text.length} chars) ---`);
  console.log(chunk.text);
  console.log();
}

Python:

python
from dataclasses import dataclass
 
@dataclass
class Chunk:
    text: str
    index: int
    start: int
    end: int
 
def chunk_text(
    text: str,
    max_chunk_size: int = 500,
    overlap: int = 50,
) -> list[Chunk]:
    # Split into paragraphs first, preserving natural boundaries
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks: list[Chunk] = []
    current_chunk = ""
    chunk_start = 0
    position = 0
 
    for paragraph in paragraphs:
        # If adding this paragraph exceeds the limit, finalize current chunk
        if len(current_chunk) + len(paragraph) > max_chunk_size and current_chunk:
            chunks.append(Chunk(
                text=current_chunk.strip(),
                index=len(chunks),
                start=chunk_start,
                end=position,
            ))
 
            # Overlap: keep the last N characters from the previous chunk
            overlap_text = current_chunk[-overlap:]
            current_chunk = overlap_text + " " + paragraph
            chunk_start = position - overlap
        else:
            current_chunk += ("\n\n" if current_chunk else "") + paragraph
        position += len(paragraph) + 2
 
    # Don't forget the last chunk
    if current_chunk.strip():
        chunks.append(Chunk(
            text=current_chunk.strip(),
            index=len(chunks),
            start=chunk_start,
            end=position,
        ))
 
    return chunks
 
document = """
Refund Policy
 
All purchases can be refunded within 30 days of the original purchase date.
To request a refund, contact our support team with your order number.
 
After 30 days, we offer prorated refunds for annual subscriptions only.
Monthly subscriptions are non-refundable after the billing date.
 
Shipping Policy
 
Standard shipping takes 5-7 business days within the continental US.
Express shipping is available for an additional fee and arrives in 2-3 days.
 
International shipping times vary by destination and typically take 10-21 days.
Import duties and taxes are the responsibility of the buyer.
"""
 
chunks = chunk_text(document, max_chunk_size=300, overlap=50)
for chunk in chunks:
    print(f"--- Chunk {chunk.index} ({len(chunk.text)} chars) ---")
    print(chunk.text)
    print()

The overlap parameter is important. Without it, a question about the "30-day refund policy for annual subscriptions" might miss the answer because the relevant information spans two chunks. With 50 characters of overlap, the end of one chunk bleeds into the beginning of the next, so content that crosses a boundary still gets captured. Getting chunking right has an outsized impact on retrieval quality, often more than the choice of embedding model.

Hybrid search: when keywords matter

Vector search finds semantically similar results, but sometimes you need exact keyword matches. A customer searching for order number "ORD-2026-4891" needs a lexical match, not a semantic one. Product SKUs, error codes, email addresses, API endpoints, names. These are tokens where exact matching is essential.

Hybrid search combines vector similarity with keyword matching (BM25). You run both searches in parallel, then merge the results using reciprocal rank fusion (RRF). RRF scores each document by summing 1 / (k + rank) across both result lists. The constant k = 60 comes from the 2009 Cormack, Clarke, and Büttcher paper that introduced RRF. Documents that rank highly in both searches bubble to the top.

TypeScript:

typescript
interface SearchResult {
  id: string;
  text: string;
  score: number;
}
 
// Reciprocal Rank Fusion -- merges results from multiple search methods
function reciprocalRankFusion(
  resultSets: SearchResult[][],
  k = 60
): SearchResult[] {
  const scores = new Map<string, { score: number; text: string }>();
 
  for (const results of resultSets) {
    for (let rank = 0; rank < results.length; rank++) {
      const result = results[rank];
      const rrf = 1 / (k + rank + 1); // +1 because rank is 0-indexed
      const existing = scores.get(result.id);
      if (existing) {
        existing.score += rrf;
      } else {
        scores.set(result.id, { score: rrf, text: result.text });
      }
    }
  }
 
  return Array.from(scores.entries())
    .map(([id, { score, text }]) => ({ id, text, score }))
    .sort((a, b) => b.score - a.score);
}
 
// Simplified keyword search -- scores by the fraction of query terms present.
// Real BM25 adds term frequency, IDF, and document-length normalization.
function keywordSearch(query: string, docs: { id: string; text: string }[]): SearchResult[] {
  const queryTerms = query.toLowerCase().split(/\s+/);
 
  return docs
    .map(doc => {
      const docLower = doc.text.toLowerCase();
      // Count how many query terms appear in the document
      const matchCount = queryTerms.filter(term => docLower.includes(term)).length;
      return { ...doc, score: matchCount / queryTerms.length };
    })
    .filter(d => d.score > 0)
    .sort((a, b) => b.score - a.score);
}
 
// Usage: combine vector results with keyword results
function hybridSearch(
  vectorResults: SearchResult[],
  keywordResults: SearchResult[],
  topK = 5
): SearchResult[] {
  const fused = reciprocalRankFusion([vectorResults, keywordResults]);
  return fused.slice(0, topK);
}
 
// Example
const vectorHits: SearchResult[] = [
  { id: "1", text: "Plumbing repair guide for homeowners", score: 0.89 },
  { id: "2", text: "How to replace a faucet washer", score: 0.85 },
  { id: "3", text: "Home maintenance annual checklist", score: 0.72 },
];
 
const keywordHits: SearchResult[] = [
  { id: "2", text: "How to replace a faucet washer", score: 0.8 },
  { id: "4", text: "Faucet brands comparison and reviews", score: 0.6 },
  { id: "1", text: "Plumbing repair guide for homeowners", score: 0.4 },
];
 
const results = hybridSearch(vectorHits, keywordHits);
console.log("Hybrid search results (RRF):\n");
for (const r of results) {
  console.log(`  [${r.score.toFixed(4)}] ${r.text}`);
}

Python:

python
from dataclasses import dataclass
from collections import defaultdict
 
@dataclass
class SearchResult:
    id: str
    text: str
    score: float
 
# Reciprocal Rank Fusion -- merges results from multiple search methods
def reciprocal_rank_fusion(
    result_sets: list[list[SearchResult]],
    k: int = 60,
) -> list[SearchResult]:
    scores: dict[str, dict] = defaultdict(lambda: {"score": 0.0, "text": ""})
 
    for results in result_sets:
        for rank, result in enumerate(results):
            rrf = 1 / (k + rank + 1)  # +1 because rank is 0-indexed
            scores[result.id]["score"] += rrf
            scores[result.id]["text"] = result.text
 
    fused = [
        SearchResult(id=id, text=data["text"], score=data["score"])
        for id, data in scores.items()
    ]
    fused.sort(key=lambda x: x.score, reverse=True)
    return fused
 
# Simplified keyword search -- scores by the fraction of query terms present.
# Real BM25 adds term frequency, IDF, and document-length normalization.
def keyword_search(query: str, docs: list[dict]) -> list[SearchResult]:
    query_terms = query.lower().split()
 
    scored = []
    for doc in docs:
        doc_lower = doc["text"].lower()
        match_count = sum(1 for term in query_terms if term in doc_lower)
        score = match_count / len(query_terms)
        if score > 0:
            scored.append(SearchResult(id=doc["id"], text=doc["text"], score=score))
 
    scored.sort(key=lambda x: x.score, reverse=True)
    return scored
 
# Usage: combine vector results with keyword results
def hybrid_search(
    vector_results: list[SearchResult],
    keyword_results: list[SearchResult],
    top_k: int = 5,
) -> list[SearchResult]:
    fused = reciprocal_rank_fusion([vector_results, keyword_results])
    return fused[:top_k]
 
# Example
vector_hits = [
    SearchResult(id="1", text="Plumbing repair guide for homeowners", score=0.89),
    SearchResult(id="2", text="How to replace a faucet washer", score=0.85),
    SearchResult(id="3", text="Home maintenance annual checklist", score=0.72),
]
 
keyword_hits = [
    SearchResult(id="2", text="How to replace a faucet washer", score=0.8),
    SearchResult(id="4", text="Faucet brands comparison and reviews", score=0.6),
    SearchResult(id="1", text="Plumbing repair guide for homeowners", score=0.4),
]
 
results = hybrid_search(vector_hits, keyword_hits)
print("Hybrid search results (RRF):\n")
for r in results:
    print(f"  [{r.score:.4f}] {r.text}")

Notice how "How to replace a faucet washer" rises to the top. It ranked well in both vector and keyword search, so RRF gives it the highest combined score. A document that only appeared in one search type still shows up, but with a lower fused score. This is the strength of hybrid search: it captures both semantic meaning and exact term relevance.
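To make the fusion concrete, here is the arithmetic for the two documents that appear in both result lists, using k = 60 and the 0-indexed ranks from the example above:

```python
k = 60

# "How to replace a faucet washer" (id 2): rank 1 in vector, rank 0 in keyword
score_washer = 1 / (k + 1 + 1) + 1 / (k + 0 + 1)    # 1/62 + 1/61

# "Plumbing repair guide" (id 1): rank 0 in vector, rank 2 in keyword
score_plumbing = 1 / (k + 0 + 1) + 1 / (k + 2 + 1)  # 1/61 + 1/63

print(f"washer:   {score_washer:.4f}")    # 0.0325
print(f"plumbing: {score_plumbing:.4f}")  # 0.0323
```

The margin is tiny in absolute terms, which is typical of RRF: with k = 60, individual rank differences shift scores only slightly, so a document needs consistent placement across both lists to win.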

Embedding cache

The embedding API call is the most expensive operation in the pipeline, both in latency and cost. If 100 users search for "refund policy," you're paying for 100 identical embedding calls. A simple hash-based cache eliminates this waste entirely.

TypeScript:

typescript
import crypto from "crypto";
import OpenAI from "openai";
 
const openai = new OpenAI();
 
// In-memory cache -- swap for Redis or SQLite in production
const embeddingCache = new Map<string, number[]>();
 
async function getCachedEmbedding(text: string): Promise<number[]> {
  const key = crypto.createHash("sha256").update(text).digest("hex");
 
  if (embeddingCache.has(key)) {
    return embeddingCache.get(key)!; // Cache hit -- zero cost, instant
  }
 
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
 
  const embedding = response.data[0].embedding;
  embeddingCache.set(key, embedding);
  return embedding;
}

Python:

python
import hashlib
from openai import OpenAI
 
client = OpenAI()
 
# In-memory cache -- swap for Redis or SQLite in production
embedding_cache: dict[str, list[float]] = {}
 
def get_cached_embedding(text: str) -> list[float]:
    key = hashlib.sha256(text.encode()).hexdigest()
 
    if key in embedding_cache:
        return embedding_cache[key]  # Cache hit -- zero cost, instant
 
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
 
    embedding = response.data[0].embedding
    embedding_cache[key] = embedding
    return embedding

In production, replace the in-memory map with Redis or your database. The pattern stays the same. Cache invalidation is simple: when a document changes, delete its cache entry and re-embed. Query caches can use a TTL since the same queries tend to cluster in time.

The chunking, caching, and hybrid search patterns here are exactly what production RAG pipelines use under the hood. If you're building an agent with persistent memory, the same embedding and retrieval pipeline powers the semantic memory search that lets agents recall relevant context from past conversations.

Embeddings Are Infrastructure, Not Magic

Embeddings are the retrieval primitive. They turn text into meaning-preserving coordinates, and similarity search finds the closest matches. RAG pipelines use them to ground LLM answers in real documents. Agent memory systems use them to recall relevant context. Knowledge bases use them to make documentation searchable. Scorecards use semantic similarity to evaluate whether an agent's response matches the expected intent.

Four decisions matter:

  1. Embedding model: start with text-embedding-3-small at $0.02/M tokens. Move to text-embedding-3-large if quality demands it. Self-host BGE-M3 if data can't leave your infrastructure.
  2. Vector database: pgvector if you already run Postgres, Qdrant for self-hosted, Pinecone for managed.
  3. Chunk size: 300-500 tokens with overlap.
  4. Search type: hybrid (vector + keyword) beats either method alone.

When your agent connects to external systems through tools and MCP via function calling, the embedding pipeline becomes one of several retrieval sources working in parallel. The vector store handles product docs. A tool queries live inventory. Analytics tells you which source is actually driving answer quality.

The gap between a demo and production? Chunking strategy, embedding caching, hybrid search, and monitoring retrieval quality over time. None of those are hard once you understand the primitives.

Semantic Search, Built In

Chanl's knowledge base handles embedding, chunking, and hybrid search so you can focus on what your agent knows, not how it retrieves.
