A user types "how to fix a leaky faucet" into your search bar. The top result is titled "plumbing repair." Another user writes "cancel my subscription" and lands on "Account Cancellation Policy." Zero keywords overlap. Yet the search engine knows they mean the same thing.
That's embeddings.
Behind every semantic search, every RAG pipeline, every recommendation engine, there's a model turning text into numbers and a distance function deciding which numbers are close. It sounds abstract until you see the actual vectors, write the actual math, and build the actual search. Then it clicks.
That's what this article does. We generate real embeddings, implement cosine similarity from scratch, build a working search engine in 50 lines, compare the leading models head to head, and graduate from in-memory brute force to a production vector database.
What Are Embeddings, Really?
An embedding is a fixed-length array of numbers that captures semantic meaning. Each number represents a position along some learned dimension of meaning. Texts with similar meanings land near each other in this high-dimensional space, regardless of the specific words they use.
The classic example, first demonstrated by Mikolov et al. at Google in 2013: the vector for "king" minus "man" plus "woman" produces a vector close to "queen." The model learned that royalty and gender are separate dimensions, and you can do arithmetic on them. This isn't a trick. It falls directly out of how word2vec was trained: by predicting which words appear near each other in billions of sentences.
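To make that arithmetic concrete, here's a toy sketch with just two hand-picked dimensions, royalty and gender. The numbers are invented for illustration; real models learn thousands of unlabeled axes.

```python
# Toy word vectors over two invented dimensions: [royalty, gender]
vectors = {
    "king":  [0.9, -0.8],
    "queen": [0.9,  0.8],
    "man":   [0.1, -0.8],
    "woman": [0.1,  0.8],
}

# king - man + woman, element-wise
result = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Find the nearest known word by Euclidean distance
def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

closest = min(vectors, key=lambda word: distance(vectors[word], result))
print(closest)  # queen
```

Subtracting "man" cancels the masculine component, adding "woman" restores the feminine one, and the royalty dimension passes through untouched.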
Modern embedding models output vectors with hundreds or thousands of dimensions. OpenAI's text-embedding-3-small produces 1,536 numbers per text input. Each number is typically a float between -1 and 1. You can't interpret individual dimensions ("dimension 847 means formality"), but the overall pattern is what carries meaning.
Let's generate some real embeddings and look at what comes back.
TypeScript:
import OpenAI from "openai";
const openai = new OpenAI();
async function getEmbedding(text: string): Promise<number[]> {
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: text,
});
return response.data[0].embedding;
}
async function main() {
const text = "How to fix a leaky faucet";
const embedding = await getEmbedding(text);
console.log(`Text: "${text}"`);
console.log(`Dimensions: ${embedding.length}`);
console.log(`First 10 values: [${embedding.slice(0, 10).map(v => v.toFixed(6)).join(", ")}]`);
console.log(`Min: ${Math.min(...embedding).toFixed(6)}`);
console.log(`Max: ${Math.max(...embedding).toFixed(6)}`);
}
main();
Python:
from openai import OpenAI
client = OpenAI()
def get_embedding(text: str) -> list[float]:
response = client.embeddings.create(
model="text-embedding-3-small",
input=text,
)
return response.data[0].embedding
text = "How to fix a leaky faucet"
embedding = get_embedding(text)
print(f'Text: "{text}"')
print(f"Dimensions: {len(embedding)}")
print(f"First 10 values: {[round(v, 6) for v in embedding[:10]]}")
print(f"Min: {min(embedding):.6f}")
print(f"Max: {max(embedding):.6f}")
Running either version prints 1,536 floating point numbers. The exact values aren't meaningful individually. What matters is the pattern: two texts about plumbing produce similar patterns, while a text about stock trading produces a completely different one. The next section shows exactly how to measure that similarity.
How Does Cosine Similarity Actually Work?
Cosine similarity measures the angle between two vectors. Two vectors pointing in the same direction score 1.0 (identical meaning). Perpendicular vectors score 0.0 (unrelated). Opposite vectors score -1.0. The formula ignores magnitude and focuses purely on direction, which is why a short sentence and a long paragraph about the same topic still score high.
The formula is just the dot product of the two vectors divided by the product of their magnitudes.
Here's cosine similarity implemented from scratch. No libraries, no dependencies. Five lines of math that power nearly every semantic search system in production.
TypeScript:
import OpenAI from "openai";
const openai = new OpenAI();
// Cosine similarity from scratch -- dot product divided by magnitude product
function cosineSimilarity(a: number[], b: number[]): number {
let dot = 0;
let magA = 0;
let magB = 0;
for (let i = 0; i < a.length; i++) {
dot += a[i] * b[i];
magA += a[i] * a[i];
magB += b[i] * b[i];
}
return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}
async function getEmbedding(text: string): Promise<number[]> {
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: text,
});
return response.data[0].embedding;
}
async function main() {
const sentences = [
"The cat sits on the mat",
"A feline rests on the rug",
"The stock market crashed yesterday",
];
// Embed all three sentences in a single API call
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: sentences,
});
const embeddings = response.data.map(d => d.embedding);
// Compare every pair
for (let i = 0; i < sentences.length; i++) {
for (let j = i + 1; j < sentences.length; j++) {
const score = cosineSimilarity(embeddings[i], embeddings[j]);
console.log(`"${sentences[i]}" vs "${sentences[j]}"`);
console.log(` Similarity: ${score.toFixed(4)}\n`);
}
}
}
main();
Python:
import math
from openai import OpenAI
client = OpenAI()
# Cosine similarity from scratch -- dot product divided by magnitude product
def cosine_similarity(a: list[float], b: list[float]) -> float:
dot = sum(x * y for x, y in zip(a, b))
mag_a = math.sqrt(sum(x * x for x in a))
mag_b = math.sqrt(sum(x * x for x in b))
return dot / (mag_a * mag_b)
def get_embeddings(texts: list[str]) -> list[list[float]]:
response = client.embeddings.create(
model="text-embedding-3-small",
input=texts,
)
return [d.embedding for d in response.data]
sentences = [
"The cat sits on the mat",
"A feline rests on the rug",
"The stock market crashed yesterday",
]
embeddings = get_embeddings(sentences)
# Compare every pair
for i in range(len(sentences)):
for j in range(i + 1, len(sentences)):
score = cosine_similarity(embeddings[i], embeddings[j])
print(f'"{sentences[i]}" vs "{sentences[j]}"')
print(f" Similarity: {score:.4f}\n")
You'll see something like: the cat/feline pair scores around 0.85-0.90, while either sentence compared against the stock market sentence drops to 0.10-0.15. The embedding model has never seen these exact sentences before, but it's learned from training data that cats and felines are semantically close while stock markets have nothing to do with either.
That gap between 0.88 and 0.12 is the foundation of everything. It's how search engines find relevant results without keyword overlap, how RAG pipelines retrieve the right context for LLMs, and how recommendation systems surface content you'll actually care about.
Build Along: Semantic Search in 50 Lines
A working semantic search engine needs three steps: embed your documents at index time, embed the user's query at search time, and return the documents with the highest cosine similarity scores. The whole thing fits in about 50 lines.
The architecture is simple. At index time, you embed every document and store the vectors alongside the text. At query time, you embed the search query, compute cosine similarity against every stored vector, and return the top matches.
TypeScript:
import OpenAI from "openai";
const openai = new OpenAI();
interface Document {
text: string;
embedding: number[];
}
function cosineSimilarity(a: number[], b: number[]): number {
let dot = 0, magA = 0, magB = 0;
for (let i = 0; i < a.length; i++) {
dot += a[i] * b[i];
magA += a[i] * a[i];
magB += b[i] * b[i];
}
return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}
// Index: embed all documents in one batch API call
async function indexDocuments(texts: string[]): Promise<Document[]> {
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: texts,
});
return texts.map((text, i) => ({
text,
embedding: response.data[i].embedding,
}));
}
// Search: embed query, brute-force compare against all documents
async function search(query: string, docs: Document[], topK = 3) {
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: query,
});
const queryEmbedding = response.data[0].embedding;
return docs
.map(doc => ({
text: doc.text,
score: cosineSimilarity(queryEmbedding, doc.embedding),
}))
.sort((a, b) => b.score - a.score)
.slice(0, topK);
}
async function main() {
// A small knowledge base about home repair
const documents = await indexDocuments([
"To fix a leaky faucet, first turn off the water supply valve under the sink.",
"Replace worn washers and O-rings to stop faucet drips permanently.",
"A running toilet usually means the flapper valve needs replacing.",
"Use plumber's tape on threaded connections to prevent leaks.",
"Annual HVAC filter replacement improves energy efficiency by 5-15%.",
"Clogged drains can be cleared with a plunger or drain snake.",
"Check your water heater's anode rod every 2-3 years to prevent tank corrosion.",
"Investing in index funds provides broad market exposure at low cost.",
"The Federal Reserve adjusts interest rates to control inflation.",
"Quarterly earnings reports drive short-term stock price movements.",
]);
console.log(`Indexed ${documents.length} documents\n`);
// Search with natural language -- no keyword matching needed
const results = await search("my sink is dripping water", documents);
console.log('Query: "my sink is dripping water"\n');
for (const result of results) {
console.log(` [${result.score.toFixed(4)}] ${result.text}`);
}
}
main();
Python:
import math
from openai import OpenAI
client = OpenAI()
def cosine_similarity(a: list[float], b: list[float]) -> float:
dot = sum(x * y for x, y in zip(a, b))
mag_a = math.sqrt(sum(x * x for x in a))
mag_b = math.sqrt(sum(x * x for x in b))
return dot / (mag_a * mag_b)
# Index: embed all documents in one batch API call
def index_documents(texts: list[str]) -> list[dict]:
response = client.embeddings.create(
model="text-embedding-3-small",
input=texts,
)
return [
{"text": text, "embedding": response.data[i].embedding}
for i, text in enumerate(texts)
]
# Search: embed query, brute-force compare against all documents
def search(query: str, docs: list[dict], top_k: int = 3) -> list[dict]:
response = client.embeddings.create(
model="text-embedding-3-small",
input=query,
)
query_embedding = response.data[0].embedding
scored = [
{"text": doc["text"], "score": cosine_similarity(query_embedding, doc["embedding"])}
for doc in docs
]
scored.sort(key=lambda x: x["score"], reverse=True)
return scored[:top_k]
# A small knowledge base about home repair
documents = index_documents([
"To fix a leaky faucet, first turn off the water supply valve under the sink.",
"Replace worn washers and O-rings to stop faucet drips permanently.",
"A running toilet usually means the flapper valve needs replacing.",
"Use plumber's tape on threaded connections to prevent leaks.",
"Annual HVAC filter replacement improves energy efficiency by 5-15%.",
"Clogged drains can be cleared with a plunger or drain snake.",
"Check your water heater's anode rod every 2-3 years to prevent tank corrosion.",
"Investing in index funds provides broad market exposure at low cost.",
"The Federal Reserve adjusts interest rates to control inflation.",
"Quarterly earnings reports drive short-term stock price movements.",
])
print(f"Indexed {len(documents)} documents\n")
# Search with natural language -- no keyword matching needed
results = search("my sink is dripping water", documents)
print('Query: "my sink is dripping water"\n')
for r in results:
print(f" [{r['score']:.4f}] {r['text']}")
The faucet and washer documents score highest, even though "dripping" and "sink" don't appear in most of the top results. The finance documents score near zero. This is why semantic search matters: the user doesn't need to know the exact vocabulary of your knowledge base.
But this approach has an obvious problem. The search function computes cosine similarity against every document. With 10 documents, that's instant. With 10,000, it takes a few milliseconds. With 10 million, you're waiting seconds per query. The brute-force approach is O(n) per search, and that doesn't scale.
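A quick back-of-envelope shows the scaling. The throughput number below is an assumption (a reasonably optimized implementation, not interpreted Python), but the linear growth is the point:

```python
# One cosine similarity costs roughly 3 * dims multiply-adds
# (dot product plus both squared magnitudes), and a brute-force
# query touches every document.
dims = 1536
ops_per_doc = 3 * dims
throughput = 1e10  # assumed effective ops/sec -- a hand-wavy figure

for n_docs in (10, 10_000, 10_000_000):
    ms = n_docs * ops_per_doc / throughput * 1000
    print(f"{n_docs:>10,} docs: ~{ms:.3f} ms per query")
```

Whatever throughput you assume, a 1,000x larger corpus means a 1,000x slower query.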
This is where vector databases come in. They use approximate nearest neighbor algorithms (HNSW, IVF, product quantization) to find similar vectors in sub-millisecond time, even across billions of documents. But before we get there, you need to choose the right embedding model.
How to Choose an Embedding Model
The embedding model determines the quality ceiling of your entire search system. No amount of clever retrieval or reranking can fix bad embeddings. Choose the wrong model and your search returns plausible but wrong results. Choose the right one and the rest of the pipeline has room to work.
Here are the current leaders, ranked by the MTEB (Massive Text Embedding Benchmark) leaderboard as of March 2026. MTEB evaluates models across retrieval, classification, clustering, and semantic similarity tasks. It's the closest thing to a standard benchmark for embeddings.
| Model | Provider | Dimensions | MTEB Retrieval | Cost (per 1M tokens) | Self-Hostable |
|---|---|---|---|---|---|
| Gemini Embedding 001 | Google | 768-3072 | ~67.7 | Free (under limits) | No |
| text-embedding-3-large | OpenAI | 256-3072 | ~64.6 | $0.13 | No |
| text-embedding-3-small | OpenAI | 512-1536 | ~61.6 | $0.02 | No |
| Embed v4 | Cohere | 256-1536 | ~65.0 | $0.12 | No |
| BGE-M3 | BAAI | 1024 | ~63.0 | Free | Yes |
| Nomic Embed Text v2 | Nomic AI | 256-768 | ~60.5 | Free | Yes |
| Jina Embeddings v3 | Jina AI | 32-1024 | ~58.0 | $0.02 | License required |
| NV-Embed-v2 | NVIDIA | 4096 | ~62.7 | Free | Yes |
A few patterns jump out of this table. First, the gap between proprietary and open-source models has narrowed significantly. BGE-M3 at ~63.0 retrieval is competitive with OpenAI's text-embedding-3-large at ~64.6, and it's completely free to run on your own hardware. Second, cost varies more than sixfold: OpenAI's small model costs $0.02 per million tokens while the large model costs $0.13, buying roughly 3 points of retrieval improvement. For most applications, the small model is the right starting point.
Matryoshka embeddings: pay for the dimensions you need
Most modern embedding models support matryoshka representations, named after Russian nesting dolls. The model is trained so that the most important semantic information is packed into the first dimensions. You can truncate a 3,072-dimension vector down to 256 dimensions and still get useful search results.
OpenAI's text-embedding-3-large at 256 dimensions actually outperforms the older text-embedding-ada-002 at its full 1,536 dimensions on MTEB benchmarks. That's a 6x reduction in storage and compute for better quality. Cohere's Embed v4 supports the same trick across 256, 512, 1,024, and 1,536 dimensions.
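If you already have full-size vectors stored, you can also truncate them client-side. The one catch: re-normalize the shortened vector to unit length. Plain cosine similarity is scale-invariant, but dot-product shortcuts and databases configured for unit vectors assume normalization, and it costs nothing. A minimal sketch:

```python
import math

def truncate_embedding(embedding: list[float], n: int) -> list[float]:
    # Keep the first n dimensions, then rescale to unit length
    head = embedding[:n]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]  # stand-in for a real 3,072-dim vector
short = truncate_embedding(full, 2)
print(short)                      # ~[0.707, 0.707]
print(sum(x * x for x in short))  # ~1.0 -- unit length restored
```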
Here's how to embed the same text with different dimension counts and compare the results. The dimensions parameter truncates the output on the server side, so you pay the same API cost but store less data.
TypeScript:
import OpenAI from "openai";
const openai = new OpenAI();
// Cosine similarity -- same implementation as before
function cosineSimilarity(a: number[], b: number[]): number {
let dot = 0, magA = 0, magB = 0;
for (let i = 0; i < a.length; i++) {
dot += a[i] * b[i];
magA += a[i] * a[i];
magB += b[i] * b[i];
}
return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}
async function embedWithDimensions(texts: string[], dimensions: number) {
const response = await openai.embeddings.create({
model: "text-embedding-3-large",
input: texts,
dimensions, // Matryoshka truncation -- server-side
});
return response.data.map(d => d.embedding);
}
async function main() {
const texts = [
"How to fix a leaky faucet",
"Plumbing repair guide for beginners",
"Stock market analysis quarterly report",
];
// Compare quality at different dimension counts
for (const dims of [256, 1024, 3072]) {
const embeddings = await embedWithDimensions(texts, dims);
const simRelated = cosineSimilarity(embeddings[0], embeddings[1]);
const simUnrelated = cosineSimilarity(embeddings[0], embeddings[2]);
console.log(`Dimensions: ${dims}`);
console.log(` Faucet vs Plumbing: ${simRelated.toFixed(4)}`);
console.log(` Faucet vs Stocks: ${simUnrelated.toFixed(4)}`);
console.log(` Separation gap: ${(simRelated - simUnrelated).toFixed(4)}\n`);
}
}
main();
Python:
import math
from openai import OpenAI
client = OpenAI()
def cosine_similarity(a: list[float], b: list[float]) -> float:
dot = sum(x * y for x, y in zip(a, b))
mag_a = math.sqrt(sum(x * x for x in a))
mag_b = math.sqrt(sum(x * x for x in b))
return dot / (mag_a * mag_b)
def embed_with_dimensions(texts: list[str], dimensions: int) -> list[list[float]]:
response = client.embeddings.create(
model="text-embedding-3-large",
input=texts,
dimensions=dimensions, # Matryoshka truncation -- server-side
)
return [d.embedding for d in response.data]
texts = [
"How to fix a leaky faucet",
"Plumbing repair guide for beginners",
"Stock market analysis quarterly report",
]
# Compare quality at different dimension counts
for dims in [256, 1024, 3072]:
embeddings = embed_with_dimensions(texts, dims)
sim_related = cosine_similarity(embeddings[0], embeddings[1])
sim_unrelated = cosine_similarity(embeddings[0], embeddings[2])
print(f"Dimensions: {dims}")
print(f" Faucet vs Plumbing: {sim_related:.4f}")
print(f" Faucet vs Stocks: {sim_unrelated:.4f}")
print(f" Separation gap: {sim_related - sim_unrelated:.4f}\n")
You'll see that the separation gap between related and unrelated texts holds remarkably well even at 256 dimensions. The practical takeaway: start with lower dimensions to save storage, and only increase if retrieval quality suffers on your actual data. A 256-dimension vector uses 1 KB as float32. A 3,072-dimension vector uses 12 KB. At a million documents, that's the difference between 1 GB and 12 GB of storage.
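The storage math is easy to sanity-check, assuming 4-byte float32 values per dimension:

```python
def storage_gb(n_docs: int, dims: int) -> float:
    return n_docs * dims * 4 / 1e9  # 4 bytes per float32 dimension

print(storage_gb(1_000_000, 256))   # ~1.0 GB
print(storage_gb(1_000_000, 3072))  # ~12.3 GB
```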
Vector Databases: Pinecone vs Qdrant vs Chroma vs pgvector
Already running Postgres? Use pgvector. Want zero ops? Pinecone. Best open-source self-hosted option? Qdrant. Prototyping and want something running in 30 seconds? ChromaDB.
That's the short answer. Here's the reasoning.
Vector databases solve the O(n) brute-force problem from our in-memory search engine. Instead of comparing the query against every document, they build index structures (HNSW graphs, IVF clusters) that narrow the search space. A query that would scan 10 million vectors with brute force takes under 10 milliseconds with an HNSW index. The tradeoff is that results are approximate nearest neighbors, not exact, but recall against exact search is typically 95-99% with sensible index settings.
| Database | Type | Language | Max Vectors (free tier) | Hybrid Search | Best For |
|---|---|---|---|---|---|
| Pinecone | Managed SaaS | - | 100K (serverless) | Yes | Zero-ops production |
| Qdrant | Self-hosted / Cloud | Rust | 1M (free cloud) | Yes (sparse vectors) | Self-hosted performance |
| ChromaDB | Embedded / Server | Rust (core) | Unlimited (local) | No (dense only) | Prototyping, small datasets |
| pgvector | Postgres extension | C | Unlimited (your Postgres) | Via tsvector | Teams already on Postgres |
| Weaviate | Self-hosted / Cloud | Go | 50K (free sandbox) | Yes (BM25 built-in) | Hybrid search out of the box |
| Milvus | Self-hosted / Cloud | Go/C++ | Unlimited (local) | Yes | GPU-accelerated, large scale |
Let's rebuild our search engine using Qdrant. The core logic stays the same: embed documents, store them, search by similarity. But now the database handles the indexing and approximate search instead of our brute-force loop.
TypeScript:
import OpenAI from "openai";
import { QdrantClient } from "@qdrant/js-client-rest";
const openai = new OpenAI();
const qdrant = new QdrantClient({ url: "http://localhost:6333" });
const COLLECTION = "home_repair";
async function getEmbeddings(texts: string[]): Promise<number[][]> {
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: texts,
});
return response.data.map(d => d.embedding);
}
async function createCollection() {
// Delete if exists, then create fresh
const collections = await qdrant.getCollections();
if (collections.collections.some(c => c.name === COLLECTION)) {
await qdrant.deleteCollection(COLLECTION);
}
await qdrant.createCollection(COLLECTION, {
vectors: { size: 1536, distance: "Cosine" },
});
}
async function indexDocuments(texts: string[]) {
const embeddings = await getEmbeddings(texts);
// Upsert points with text stored as payload metadata
await qdrant.upsert(COLLECTION, {
wait: true,
points: texts.map((text, i) => ({
id: i,
vector: embeddings[i],
payload: { text },
})),
});
console.log(`Indexed ${texts.length} documents into Qdrant`);
}
async function search(query: string, topK = 3) {
const [queryEmbedding] = await getEmbeddings([query]);
const results = await qdrant.search(COLLECTION, {
vector: queryEmbedding,
limit: topK,
with_payload: true,
});
return results.map(r => ({
text: (r.payload as { text: string }).text,
score: r.score,
}));
}
async function main() {
await createCollection();
await indexDocuments([
"To fix a leaky faucet, first turn off the water supply valve under the sink.",
"Replace worn washers and O-rings to stop faucet drips permanently.",
"A running toilet usually means the flapper valve needs replacing.",
"Use plumber's tape on threaded connections to prevent leaks.",
"Annual HVAC filter replacement improves energy efficiency by 5-15%.",
"Clogged drains can be cleared with a plunger or drain snake.",
"Check your water heater's anode rod every 2-3 years to prevent tank corrosion.",
]);
const results = await search("my sink is dripping water");
console.log('\nQuery: "my sink is dripping water"\n');
for (const r of results) {
console.log(` [${r.score.toFixed(4)}] ${r.text}`);
}
}
main();
Python:
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams, PointStruct
openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")
COLLECTION = "home_repair"
def get_embeddings(texts: list[str]) -> list[list[float]]:
response = openai_client.embeddings.create(
model="text-embedding-3-small",
input=texts,
)
return [d.embedding for d in response.data]
def create_collection():
# Recreate collection fresh
collections = [c.name for c in qdrant.get_collections().collections]
if COLLECTION in collections:
qdrant.delete_collection(COLLECTION)
qdrant.create_collection(
collection_name=COLLECTION,
vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
def index_documents(texts: list[str]):
embeddings = get_embeddings(texts)
# Upsert points with text stored as payload metadata
qdrant.upsert(
collection_name=COLLECTION,
wait=True,
points=[
PointStruct(
id=i,
vector=embeddings[i],
payload={"text": text},
)
for i, text in enumerate(texts)
],
)
print(f"Indexed {len(texts)} documents into Qdrant")
def search(query: str, top_k: int = 3) -> list[dict]:
query_embedding = get_embeddings([query])[0]
results = qdrant.search(
collection_name=COLLECTION,
query_vector=query_embedding,
limit=top_k,
)
return [{"text": r.payload["text"], "score": r.score} for r in results]
create_collection()
index_documents([
"To fix a leaky faucet, first turn off the water supply valve under the sink.",
"Replace worn washers and O-rings to stop faucet drips permanently.",
"A running toilet usually means the flapper valve needs replacing.",
"Use plumber's tape on threaded connections to prevent leaks.",
"Annual HVAC filter replacement improves energy efficiency by 5-15%.",
"Clogged drains can be cleared with a plunger or drain snake.",
"Check your water heater's anode rod every 2-3 years to prevent tank corrosion.",
])
results = search("my sink is dripping water")
print('\nQuery: "my sink is dripping water"\n')
for r in results:
print(f" [{r['score']:.4f}] {r['text']}")
The search results are identical to our brute-force version, but now the database handles the heavy lifting. With 7 documents, you won't notice a speed difference. At 100,000 documents, the brute-force approach takes hundreds of milliseconds while Qdrant returns results in under 5. At 10 million, Qdrant still returns in under 10 milliseconds. That's the HNSW index doing its job.
To run this locally, start Qdrant with Docker: docker run -p 6333:6333 qdrant/qdrant. The code above works as-is.
How Do You Prepare Real Documents for Embedding?
Production documents need chunking (splitting into coherent pieces), caching (avoiding redundant API calls), and often hybrid search (combining vector similarity with keyword matching). Embedding an entire 10-page document as one vector crushes all the meaning into a single point. The embedding for a document about "refund policies, shipping times, and product specifications" will be a vague average of all three topics, matching none of them well.
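Caching is the least glamorous of the three but the easiest win: re-indexing runs mostly hit texts you've already embedded. Here's a minimal in-memory sketch; EmbeddingCache and embed_fn are illustrative names, and a production version would back the store with Redis or a database table.

```python
import hashlib

class EmbeddingCache:
    """Skip the embeddings API for texts we've already seen."""

    def __init__(self, embed_fn, model: str):
        self.embed_fn = embed_fn  # callable that hits the embeddings API
        self.model = model
        self.store: dict[str, list[float]] = {}

    def _key(self, text: str) -> str:
        # Key on model + text so a model upgrade invalidates old entries
        return hashlib.sha256(f"{self.model}:{text}".encode()).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key not in self.store:
            self.store[key] = self.embed_fn(text)
        return self.store[key]

# Demo with a fake embedder that counts its calls
calls = []
def fake_embed(text: str) -> list[float]:
    calls.append(text)
    return [0.1, 0.2, 0.3]

cache = EmbeddingCache(fake_embed, "text-embedding-3-small")
cache.get("hello")
cache.get("hello")  # second lookup served from cache
print(len(calls))   # 1
```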
Chunking strategies
Chunking splits documents into pieces that each capture a coherent idea. The right chunk size depends on your content, but 300-500 tokens is a strong default. Too large and the embedding averages over too many concepts. Too small and you lose context that makes the text meaningful.
Three common strategies:
Fixed-size chunking splits text every N tokens with some overlap. Simple and predictable, but it ignores document structure. A chunk boundary might land in the middle of a sentence or split a code block in half.
Recursive splitting tries paragraph breaks first, then sentence breaks, then falls back to fixed-size. This preserves natural boundaries in the text and produces more coherent chunks.
Semantic chunking uses embeddings to detect topic shifts within a document. When consecutive sentences have low similarity, that's a natural split point. More expensive (you're embedding sentences individually) but produces the highest quality chunks.
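The first strategy is simple enough to sketch in a few lines. Word counts stand in for tokens here; a real pipeline would count with a tokenizer such as tiktoken.

```python
def fixed_size_chunks(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    # Slide a window of `size` words, carrying `overlap` words into each next chunk
    words = text.split()
    chunks = []
    step = size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks

sample = " ".join(str(i) for i in range(250))
chunks = fixed_size_chunks(sample, size=100, overlap=20)
print(len(chunks))           # 3
print(chunks[1].split()[0])  # "80" -- starts 20 words before the previous chunk's end
```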
Here's a recursive text splitter that handles real documents. It tries to split on paragraph boundaries first, falls back to sentences, and ensures overlap between consecutive chunks so no information is lost at the edges.
TypeScript:
interface Chunk {
text: string;
index: number;
metadata: { start: number; end: number };
}
function chunkText(
text: string,
maxChunkSize = 500,
overlap = 50
): Chunk[] {
// Split into paragraphs first, preserving natural boundaries
const paragraphs = text.split(/\n\n+/).filter(p => p.trim().length > 0);
const chunks: Chunk[] = [];
let currentChunk = "";
let chunkStart = 0;
let position = 0;
for (const paragraph of paragraphs) {
// If adding this paragraph exceeds the limit, finalize current chunk
if (currentChunk.length + paragraph.length > maxChunkSize && currentChunk.length > 0) {
chunks.push({
text: currentChunk.trim(),
index: chunks.length,
metadata: { start: chunkStart, end: position },
});
// Overlap: keep the last N characters from the previous chunk
const overlapText = currentChunk.slice(-overlap);
currentChunk = overlapText + " " + paragraph;
chunkStart = position - overlap;
} else {
currentChunk += (currentChunk ? "\n\n" : "") + paragraph;
}
position += paragraph.length + 2;
}
// Don't forget the last chunk
if (currentChunk.trim().length > 0) {
chunks.push({
text: currentChunk.trim(),
index: chunks.length,
metadata: { start: chunkStart, end: position },
});
}
return chunks;
}
// Example usage
const document = `
Refund Policy

All purchases can be refunded within 30 days of the original purchase date.
To request a refund, contact our support team with your order number.

After 30 days, we offer prorated refunds for annual subscriptions only.
Monthly subscriptions are non-refundable after the billing date.

Shipping Policy

Standard shipping takes 5-7 business days within the continental US.
Express shipping is available for an additional fee and arrives in 2-3 days.

International shipping times vary by destination and typically take 10-21 days.
Import duties and taxes are the responsibility of the buyer.
`;
const chunks = chunkText(document, 300, 50);
for (const chunk of chunks) {
console.log(`--- Chunk ${chunk.index} (${chunk.text.length} chars) ---`);
console.log(chunk.text);
console.log();
}
Python:
from dataclasses import dataclass
@dataclass
class Chunk:
text: str
index: int
start: int
end: int
def chunk_text(
text: str,
max_chunk_size: int = 500,
overlap: int = 50,
) -> list[Chunk]:
# Split into paragraphs first, preserving natural boundaries
paragraphs = [p for p in text.split("\n\n") if p.strip()]
chunks: list[Chunk] = []
current_chunk = ""
chunk_start = 0
position = 0
for paragraph in paragraphs:
# If adding this paragraph exceeds the limit, finalize current chunk
if len(current_chunk) + len(paragraph) > max_chunk_size and current_chunk:
chunks.append(Chunk(
text=current_chunk.strip(),
index=len(chunks),
start=chunk_start,
end=position,
))
# Overlap: keep the last N characters from the previous chunk
overlap_text = current_chunk[-overlap:]
current_chunk = overlap_text + " " + paragraph
chunk_start = position - overlap
else:
current_chunk += ("\n\n" if current_chunk else "") + paragraph
position += len(paragraph) + 2
# Don't forget the last chunk
if current_chunk.strip():
chunks.append(Chunk(
text=current_chunk.strip(),
index=len(chunks),
start=chunk_start,
end=position,
))
return chunks
document = """
Refund Policy

All purchases can be refunded within 30 days of the original purchase date.
To request a refund, contact our support team with your order number.

After 30 days, we offer prorated refunds for annual subscriptions only.
Monthly subscriptions are non-refundable after the billing date.

Shipping Policy

Standard shipping takes 5-7 business days within the continental US.
Express shipping is available for an additional fee and arrives in 2-3 days.

International shipping times vary by destination and typically take 10-21 days.
Import duties and taxes are the responsibility of the buyer.
"""
chunks = chunk_text(document, max_chunk_size=300, overlap=50)
for chunk in chunks:
print(f"--- Chunk {chunk.index} ({len(chunk.text)} chars) ---")
print(chunk.text)
print()
The overlap parameter is important. Without it, a question about "30-day refund policy for annual subscriptions" might miss the answer because the relevant information spans two chunks. With 50 characters of overlap, the end of one chunk bleeds into the beginning of the next, ensuring that boundary-crossing content still gets captured. Getting chunking right has an outsized impact on retrieval quality, often more than the model itself.
Hybrid search: when keywords matter
Vector search finds semantically similar results, but sometimes you need exact keyword matches. A customer searching for order number "ORD-2026-4891" needs a lexical match, not a semantic one. Product SKUs, error codes, email addresses, API endpoints, names. These are tokens where exact matching is essential.
Hybrid search combines vector similarity with keyword matching (BM25). You run both searches in parallel, then merge the results using reciprocal rank fusion (RRF). RRF scores each document by summing 1 / (k + rank) across both result lists. The constant k = 60 is the standard from the original research paper. Documents that rank highly in both searches bubble to the top.
TypeScript:
interface SearchResult {
id: string;
text: string;
score: number;
}
// Reciprocal Rank Fusion -- merges results from multiple search methods
function reciprocalRankFusion(
resultSets: SearchResult[][],
k = 60
): SearchResult[] {
const scores = new Map<string, { score: number; text: string }>();
for (const results of resultSets) {
for (let rank = 0; rank < results.length; rank++) {
const result = results[rank];
const rrf = 1 / (k + rank + 1); // +1 because rank is 0-indexed
const existing = scores.get(result.id);
if (existing) {
existing.score += rrf;
} else {
scores.set(result.id, { score: rrf, text: result.text });
}
}
}
return Array.from(scores.entries())
.map(([id, { score, text }]) => ({ id, text, score }))
.sort((a, b) => b.score - a.score);
}
// Simple keyword search -- scores by the fraction of query terms present (a stand-in for full BM25, which adds TF and IDF weighting)
function keywordSearch(query: string, docs: { id: string; text: string }[]): SearchResult[] {
const queryTerms = query.toLowerCase().split(/\s+/);
return docs
.map(doc => {
const docLower = doc.text.toLowerCase();
// Count how many query terms appear in the document
const matchCount = queryTerms.filter(term => docLower.includes(term)).length;
return { ...doc, score: matchCount / queryTerms.length };
})
.filter(d => d.score > 0)
.sort((a, b) => b.score - a.score);
}
// Usage: combine vector results with keyword results
function hybridSearch(
vectorResults: SearchResult[],
keywordResults: SearchResult[],
topK = 5
): SearchResult[] {
const fused = reciprocalRankFusion([vectorResults, keywordResults]);
return fused.slice(0, topK);
}
// Example
const vectorHits: SearchResult[] = [
{ id: "1", text: "Plumbing repair guide for homeowners", score: 0.89 },
{ id: "2", text: "How to replace a faucet washer", score: 0.85 },
{ id: "3", text: "Home maintenance annual checklist", score: 0.72 },
];
const keywordHits: SearchResult[] = [
{ id: "2", text: "How to replace a faucet washer", score: 0.8 },
{ id: "4", text: "Faucet brands comparison and reviews", score: 0.6 },
{ id: "1", text: "Plumbing repair guide for homeowners", score: 0.4 },
];
const results = hybridSearch(vectorHits, keywordHits);
console.log("Hybrid search results (RRF):\n");
for (const r of results) {
console.log(` [${r.score.toFixed(4)}] ${r.text}`);
}Python:
from dataclasses import dataclass
from collections import defaultdict
@dataclass
class SearchResult:
id: str
text: str
score: float
# Reciprocal Rank Fusion -- merges results from multiple search methods
def reciprocal_rank_fusion(
result_sets: list[list[SearchResult]],
k: int = 60,
) -> list[SearchResult]:
scores: dict[str, dict] = defaultdict(lambda: {"score": 0.0, "text": ""})
for results in result_sets:
for rank, result in enumerate(results):
rrf = 1 / (k + rank + 1) # +1 because rank is 0-indexed
scores[result.id]["score"] += rrf
scores[result.id]["text"] = result.text
fused = [
SearchResult(id=id, text=data["text"], score=data["score"])
for id, data in scores.items()
]
fused.sort(key=lambda x: x.score, reverse=True)
return fused
# Simple keyword search -- scores by the fraction of query terms present (a stand-in for full BM25, which adds TF and IDF weighting)
def keyword_search(query: str, docs: list[dict]) -> list[SearchResult]:
query_terms = query.lower().split()
scored = []
for doc in docs:
doc_lower = doc["text"].lower()
match_count = sum(1 for term in query_terms if term in doc_lower)
score = match_count / len(query_terms)
if score > 0:
scored.append(SearchResult(id=doc["id"], text=doc["text"], score=score))
scored.sort(key=lambda x: x.score, reverse=True)
return scored
# Usage: combine vector results with keyword results
def hybrid_search(
vector_results: list[SearchResult],
keyword_results: list[SearchResult],
top_k: int = 5,
) -> list[SearchResult]:
fused = reciprocal_rank_fusion([vector_results, keyword_results])
return fused[:top_k]
# Example
vector_hits = [
SearchResult(id="1", text="Plumbing repair guide for homeowners", score=0.89),
SearchResult(id="2", text="How to replace a faucet washer", score=0.85),
SearchResult(id="3", text="Home maintenance annual checklist", score=0.72),
]
keyword_hits = [
SearchResult(id="2", text="How to replace a faucet washer", score=0.8),
SearchResult(id="4", text="Faucet brands comparison and reviews", score=0.6),
SearchResult(id="1", text="Plumbing repair guide for homeowners", score=0.4),
]
results = hybrid_search(vector_hits, keyword_hits)
print("Hybrid search results (RRF):\n")
for r in results:
    print(f" [{r.score:.4f}] {r.text}")
Notice how "How to replace a faucet washer" rises to the top. It ranked well in both vector and keyword search, so RRF gives it the highest combined score. A document that only appeared in one search type still shows up, but with a lower fused score. This is the strength of hybrid search: it captures both semantic meaning and exact term relevance.
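To see why, you can work the RRF arithmetic by hand. This short check uses the document IDs and ranks from the example above (the `rrf` helper is our own shorthand, not part of any library):

```python
# RRF by hand for the example above: k = 60, ranks are 0-indexed,
# and each document's score is the sum of 1/(k + rank + 1) over the
# result lists it appears in.
k = 60

def rrf(*ranks: int) -> float:
    """Sum the reciprocal-rank contribution from each list a doc appears in."""
    return sum(1 / (k + r + 1) for r in ranks)

# Doc 2 ("faucet washer"): rank 1 in vector results, rank 0 in keyword results
doc2 = rrf(1, 0)  # 1/62 + 1/61
# Doc 1 ("plumbing repair"): rank 0 in vector results, rank 2 in keyword results
doc1 = rrf(0, 2)  # 1/61 + 1/63
# Doc 4 ("faucet brands"): keyword results only, rank 1
doc4 = rrf(1)     # 1/62

print(doc2 > doc1 > doc4)  # True
```

Doc 2 never had the single best score in either list, yet its consistent presence in both beats doc 1's first-place vector rank. That consistency bonus is the whole point of rank fusion.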
Embedding cache
The embedding API call is the most expensive operation in the pipeline, both in latency and cost. If 100 users search for "refund policy," you're paying for 100 identical embedding calls. A simple hash-based cache eliminates this waste entirely.
TypeScript:
import crypto from "crypto";
import OpenAI from "openai";
const openai = new OpenAI();
// In-memory cache -- swap for Redis or SQLite in production
const embeddingCache = new Map<string, number[]>();
async function getCachedEmbedding(text: string): Promise<number[]> {
const key = crypto.createHash("sha256").update(text).digest("hex");
if (embeddingCache.has(key)) {
return embeddingCache.get(key)!; // Cache hit -- zero cost, instant
}
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: text,
});
const embedding = response.data[0].embedding;
embeddingCache.set(key, embedding);
return embedding;
}
Python:
import hashlib
from openai import OpenAI
client = OpenAI()
# In-memory cache -- swap for Redis or SQLite in production
embedding_cache: dict[str, list[float]] = {}
def get_cached_embedding(text: str) -> list[float]:
key = hashlib.sha256(text.encode()).hexdigest()
if key in embedding_cache:
return embedding_cache[key] # Cache hit -- zero cost, instant
response = client.embeddings.create(
model="text-embedding-3-small",
input=text,
)
embedding = response.data[0].embedding
embedding_cache[key] = embedding
    return embedding
In production, replace the in-memory map with Redis or your database. The pattern stays the same. Cache invalidation is simple: when a document changes, delete its cache entry and re-embed. Query caches can use a TTL, since the same queries tend to cluster in time.
The chunking, caching, and hybrid search patterns here are exactly what production RAG pipelines use under the hood. If you're building an agent with persistent memory, the same embedding and retrieval pipeline powers the semantic memory search that lets agents recall relevant context from past conversations.
Embeddings Are Infrastructure, Not Magic
Embeddings are the retrieval primitive. They turn text into meaning-preserving coordinates, and similarity search finds the closest matches. RAG pipelines use them to ground LLM answers in real documents. Agent memory systems use them to recall relevant context. Knowledge bases use them to make documentation searchable. Scorecards use semantic similarity to evaluate whether an agent's response matches the expected intent.
Four decisions matter:
- Embedding model: start with text-embedding-3-small at $0.02/M tokens. Move to text-embedding-3-large if quality demands it. Self-host BGE-M3 if data can't leave your infrastructure.
- Vector database: pgvector if you already run Postgres, Qdrant for self-hosted, Pinecone for managed.
- Chunk size: 300-500 tokens with overlap.
- Search type: hybrid (vector + keyword) beats either method alone.
When your agent connects to external systems through tools and MCP via function calling, the embedding pipeline becomes one of several retrieval sources working in parallel. The vector store handles product docs. A tool queries live inventory. Analytics tells you which source is actually driving answer quality.
The gap between a demo and production? Chunking strategy, embedding caching, hybrid search, and monitoring retrieval quality over time. None of those are hard once you understand the primitives.
Semantic Search, Built In
Chanl's knowledge base handles embedding, chunking, and hybrid search so you can focus on what your agent knows, not how it retrieves.
Try Knowledge Base