Session 11 · 02 · 20 min

Text Embeddings

What you'll learn
  • Explain what a text embedding is and why it enables semantic similarity
  • Generate embeddings with the OpenAI text-embedding-3-small model
  • Compute cosine similarity between two texts from scratch

What is a text embedding?

A text embedding is a fixed-length vector of floating-point numbers that encodes the meaning of a piece of text. Texts with similar meaning produce vectors that point in similar directions in high-dimensional space, so the angle between two vectors can serve as a measure of how closely their meanings match. This is what enables semantic search: finding chunks that mean the same thing even when they use different words.

Embedding pipeline
  • Text chunk: raw string
  • Tokenise: model tokeniser
  • Encode: transformer layers
  • Pool: mean / CLS token
  • Vector: 1536 floats
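
The pooling stage is the step that turns a variable number of per-token vectors into one fixed-length vector. The sketch below illustrates mean pooling only, using made-up 4-dimensional token vectors; the helper name mean_pool and the numbers are assumptions for illustration, and it says nothing about how a hosted model pools internally.

```python
def mean_pool(token_vectors: list[list[float]]) -> list[float]:
    """Average per-token vectors into one fixed-length vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError("all token vectors must have the same length")
    count = len(token_vectors)
    # Average each dimension across all tokens.
    return [sum(v[i] for v in token_vectors) / count for i in range(dim)]


# Made-up encoder outputs for a 3-token input (illustrative values only).
token_vectors = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 4.0],
    [2.0, 4.0, 1.0, 4.0],
]
pooled = mean_pool(token_vectors)
print(len(pooled), pooled)
```

However many tokens go in, the pooled vector has the same length as a single token vector, which is why texts of different lengths can be compared directly.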

Generating embeddings with OpenAI

embeddings.py
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment

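def generate_embedding(text: str, model: str = 'text-embedding-3-small') -> list[float]:
    """Return the embedding vector for a single piece of text."""
    # Collapse newlines into spaces so the text is sent as one continuous passage
    text = text.replace('\n', ' ')
    # input accepts a list of strings; we send one and read back the first result
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding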
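
Because the input parameter takes a list, several texts can be embedded in a single request instead of one request per text. The following is a minimal sketch under that assumption; generate_embeddings is a hypothetical helper, not part of the lesson's embeddings.py, and it takes the client as an argument so it can be tried without an API key.

```python
from typing import Any


def generate_embeddings(
    client: Any,
    texts: list[str],
    model: str = "text-embedding-3-small",
) -> list[list[float]]:
    """Embed several texts with one API request (hypothetical helper)."""
    if not texts:
        return []
    cleaned = [t.replace("\n", " ") for t in texts]
    response = client.embeddings.create(input=cleaned, model=model)
    # Each result carries the index of its input; sort so output order matches input order.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


# Usage (requires the openai package and OPENAI_API_KEY in the environment):
#   from openai import OpenAI
#   vectors = generate_embeddings(OpenAI(), ["first text", "second text"])
```

Batching mainly saves round trips; the vectors themselves should match what single-text calls return.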

Cosine similarity from scratch

embeddings.py
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

  • The dot product measures how much the two vectors agree, dimension by dimension
  • Dividing by the product of the magnitudes normalises the score to the range [-1, 1]
  • A score of 1.0 means identical direction (the model treats the texts as semantically equivalent)
  • A score near 0 means the vectors are orthogonal (unrelated meaning)
  • The guard returns 0.0 for zero-magnitude vectors to avoid division by zero
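
The cases listed above can be checked on small hand-made vectors. The sketch below repeats cosine_similarity so that it runs on its own, and adds a hypothetical normalise helper to show that, for unit-length vectors, the plain dot product already equals the cosine. OpenAI's documentation describes its embeddings as normalised to length 1; treat that as an assumption to verify for the model you use.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Same formula as the lesson's function, repeated so this block runs alone.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def normalise(v: list[float]) -> list[float]:
    """Scale a vector to length 1 (hypothetical helper; zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return v if norm == 0 else [x / norm for x in v]


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                # 0.0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])                # close to -1
zero_vector = cosine_similarity([0.0, 0.0], [1.0, 2.0])               # guard returns 0.0

# For unit-length vectors the plain dot product is already the cosine.
a, b = normalise([3.0, 4.0]), normalise([4.0, 3.0])
dot_of_units = sum(x * y for x, y in zip(a, b))
print(same_direction, orthogonal, opposite, zero_vector, dot_of_units)
```

Note that zip stops at the shorter input, so vectors of different lengths are silently truncated; comparing embeddings from the same model avoids this.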

Similarity demo

embeddings.py
query = 'How do transformers handle long-range dependencies?'

texts = [
    'Attention mechanisms allow models to relate distant tokens directly.',
    'The Eiffel Tower is located in Paris, France.',
    'Self-attention captures relationships between all positions in a sequence.',
]

q_emb = generate_embedding(query)
for text in texts:
    t_emb = generate_embedding(text)
    score = cosine_similarity(q_emb, t_emb)
    print(f'{score:.3f}  {text[:60]}')
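
The demo prints scores in input order; sorting by score makes the closest match obvious. The sketch below shows one way to do that. It uses made-up 3-dimensional vectors as stand-ins for real embeddings so it runs without an API key, and rank_by_similarity is a hypothetical helper, so the scores are illustrative and are not model output.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Repeated from the lesson so this block runs on its own.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: list[float],
    candidates: dict[str, list[float]],
) -> list[tuple[float, str]]:
    """Return (score, text) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy stand-in vectors (assumed values, NOT real embeddings).
query_vec = [0.9, 0.1, 0.0]
candidates = {
    "Attention mechanisms allow models to relate distant tokens directly.": [0.8, 0.2, 0.1],
    "The Eiffel Tower is located in Paris, France.": [0.0, 0.1, 0.9],
    "Self-attention captures relationships between all positions in a sequence.": [0.7, 0.3, 0.0],
}

ranked = rank_by_similarity(query_vec, candidates)
for score, text in ranked:
    print(f"{score:.3f}  {text[:60]}")
```

With real embeddings, the same function would take q_emb and a mapping from each text to its t_emb.
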
Knowledge Check
Two texts have a cosine similarity of 0.03. What does this indicate?
Recap — what you just learned
  • An embedding is a fixed-length float vector capturing text meaning; similar texts produce similar vectors
  • text-embedding-3-small produces 1536-dimensional vectors by default and is this lesson's recommended starting point
  • Cosine similarity = dot product divided by product of magnitudes; ranges from -1 to 1
  • Replacing newlines with spaces before embedding is a cheap, common preprocessing step; how much it helps depends on the model
Next up: Semantic Search