Session 11 · 02 · 20 min

Text Embeddings

What you'll learn

▸Explain what a text embedding is and why it enables semantic similarity
▸Generate embeddings with the OpenAI text-embedding-3-small model
▸Compute cosine similarity between two texts from scratch

What is a text embedding?

A text embedding is a fixed-length vector of floating-point numbers that captures the meaning of a piece of text. Two texts with similar meaning produce vectors that point in similar directions in high-dimensional space. This is what enables semantic search: finding chunks that mean the same thing even when they use different words.

Embedding pipeline

Text chunk (raw string) → Tokenise (model tokeniser) → Encode (transformer layers) → Pool (mean / CLS token) → Vector (1536 floats)
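The Pool step is what turns a variable number of token vectors into one fixed-length vector. The sketch below illustrates mean pooling on made-up 3-dimensional token vectors. The helper name `mean_pool` and the numbers are assumptions for illustration only; the pooling actually used inside a hosted embedding model is not something this lesson specifies.

```python
def mean_pool(token_vectors: list[list[float]]) -> list[float]:
    """Average per-token vectors into one fixed-length vector."""
    if not token_vectors:
        raise ValueError('need at least one token vector')
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError('all token vectors must have the same length')
    count = len(token_vectors)
    # Average each dimension across all tokens.
    return [sum(v[i] for v in token_vectors) / count for i in range(dim)]


# Toy token vectors (made-up numbers, not real model output).
short_text = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
long_text = [[1.0, 1.0, 1.0], [2.0, 0.0, 4.0], [0.0, 2.0, 1.0], [1.0, 1.0, 2.0]]

print(mean_pool(short_text))      # [2.0, 1.0, 1.0]
print(len(mean_pool(long_text)))  # 3
```

Two tokens or four, the pooled vector has the same length, which is why texts of different sizes can be compared directly.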

Generating embeddings with OpenAI

embeddings.py

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment

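def generate_embedding(text: str, model: str = 'text-embedding-3-small') -> list[float]:
    """Return the embedding vector for a single piece of text."""
    # Collapse newlines to spaces so line breaks do not affect the input.
    text = text.replace('\n', ' ')
    # The API accepts a list of inputs; we send one and read back its vector.
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding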
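Because `input` is a list, several texts can be sent in one request rather than one request per text. The following is a minimal sketch of that idea, not part of the lesson's `embeddings.py`: the name `generate_embeddings` is hypothetical, and `FakeEmbeddings` is an offline stand-in invented here so the example runs without an API key. Sorting by each item's `index` is a defensive choice that keeps results aligned with the input order.

```python
from types import SimpleNamespace


def generate_embeddings(client, texts: list[str],
                        model: str = 'text-embedding-3-small') -> list[list[float]]:
    """Embed several texts with a single embeddings request."""
    cleaned = [t.replace('\n', ' ') for t in texts]
    response = client.embeddings.create(input=cleaned, model=model)
    # Each returned item carries an index; sort on it so results line up with `texts`.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


class FakeEmbeddings:
    """Offline stand-in for client.embeddings (hypothetical, illustration only)."""

    def create(self, input, model):
        # Fake "embedding": [text length, 0.0]. Items come back reversed on purpose.
        items = [SimpleNamespace(index=i, embedding=[float(len(t)), 0.0])
                 for i, t in enumerate(input)]
        return SimpleNamespace(data=list(reversed(items)))


fake_client = SimpleNamespace(embeddings=FakeEmbeddings())
vectors = generate_embeddings(fake_client, ['ab', 'line one\nline two'])
print(vectors)  # [[2.0, 0.0], [17.0, 0.0]]
```

With the real client, pass `client` from the listing above in place of `fake_client`.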

Cosine similarity from scratch

embeddings.py

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

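①The dot product measures how strongly the two vectors agree, dimension by dimension

②Dividing by the product of the magnitudes removes the effect of vector length and bounds the result to the range [-1, 1]

③A score of 1.0 means the vectors point in exactly the same direction; scores close to 1 indicate very similar meaning

④A score near 0 means the vectors are nearly orthogonal (unrelated meaning)

⑤The guard returns 0.0 for zero-magnitude vectors (all components zero) to avoid division by zero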
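The notes above can be checked on small hand-made vectors. The sketch below repeats the lesson's `cosine_similarity` so it runs on its own; the 2-dimensional inputs are toy values chosen for illustration, not real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
print(cosine_similarity([0.0, 0.0], [1.0, 2.0]))   # 0.0  (zero-magnitude guard)

# Scaling a vector changes its length but not its direction.
same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
print(math.isclose(same_direction, 1.0))  # True
```

The last check shows why cosine similarity suits embeddings: only direction matters, not magnitude.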

Similarity demo

embeddings.py

query = 'How do transformers handle long-range dependencies?'

texts = [
    'Attention mechanisms allow models to relate distant tokens directly.',
    'The Eiffel Tower is located in Paris, France.',
    'Self-attention captures relationships between all positions in a sequence.',
]

q_emb = generate_embedding(query)
for text in texts:
    t_emb = generate_embedding(text)
    score = cosine_similarity(q_emb, t_emb)
    print(f'{score:.3f}  {text[:60]}')
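The demo prints scores in input order. For search, the results are usually sorted so the best match comes first. The following is a minimal sketch of that step, using made-up 2-dimensional vectors in place of real embeddings so it runs offline; `rank_by_similarity` is a hypothetical helper, not part of the lesson's code.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, candidates):
    """Return (score, text) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy vectors standing in for embeddings (illustrative values only).
query_vec = [1.0, 0.0]
candidates = [
    ('orthogonal', [0.0, 3.0]),
    ('same direction', [2.0, 0.0]),
    ('partly aligned', [1.0, 1.0]),
]

ranked = rank_by_similarity(query_vec, candidates)
for score, text in ranked:
    print(f'{score:.3f}  {text}')
# 1.000  same direction
# 0.707  partly aligned
# 0.000  orthogonal
```

In the real demo, each candidate vector would come from `generate_embedding(text)`.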

Knowledge Check

Two texts have a cosine similarity of 0.03. What does this indicate?

Recap — what you just learned

✓An embedding is a fixed-length float vector capturing text meaning; similar texts produce similar vectors
✓text-embedding-3-small produces 1536-dimensional vectors by default and is the model used in this lesson
✓Cosine similarity = dot product divided by product of magnitudes; ranges from -1 to 1
✓Replace newlines with spaces before embedding, as generate_embedding does; line breaks rarely carry meaning and can add noise

Next up: Semantic Search