Session 11 · 02 · 20 min
Text Embeddings
What you'll learn
- ▸Explain what a text embedding is and why it enables semantic similarity
- ▸Generate embeddings with the OpenAI text-embedding-3-small model
- ▸Compute cosine similarity between two texts from scratch
What is a text embedding?
A text embedding is a fixed-length vector of floating-point numbers that captures the meaning of a piece of text. Two texts with similar meaning produce vectors that point in similar directions in high-dimensional space. This is what enables semantic search — finding chunks that mean the same thing even if they use different words.
Embedding pipeline
Text chunk (raw string) → Tokenise (model tokeniser) → Encode (transformer layers) → Pool (mean / CLS token) → Vector (1536 floats)
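As a rough illustration of the Pool step, the sketch below averages per-token vectors into one fixed-length vector. The token vectors are made-up 3-dimensional values and `mean_pool` is a hypothetical helper. Whether a given embedding model pools by mean, by CLS token, or some other way is model-specific, so treat this as one possible strategy rather than a description of any particular model.

```python
def mean_pool(token_vectors: list[list[float]]) -> list[float]:
    """Average per-token vectors into one fixed-length vector (illustrative only)."""
    if not token_vectors:
        raise ValueError('need at least one token vector')
    dim = len(token_vectors[0])
    # Sum each dimension across tokens, then divide by the token count.
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors) for i in range(dim)]


# Hypothetical encoder output: 3 tokens, 3 dimensions each (real models use far more).
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
]
pooled = mean_pool(token_vectors)
print(pooled)  # one vector whose length does not depend on the token count
```

The output length equals the per-token dimension, which is why texts of different lengths can all map to vectors of the same size.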
Generating embeddings with OpenAI
embeddings.py
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment

def generate_embedding(text: str, model: str = 'text-embedding-3-small') -> list[float]:
    text = text.replace('\n', ' ')
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding
Cosine similarity from scratch
embeddings.py
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
①Dot product measures how much the two vectors agree direction-by-direction
②Dividing by the product of magnitudes normalises to the range [-1, 1]
③Score of 1.0 means identical direction (the texts are treated as semantically equivalent)
④Score near 0 means orthogonal (unrelated meaning)
⑤Guard against zero-length vectors to avoid division by zero
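When both vectors already have length 1, the denominator in the function above is 1 and cosine similarity reduces to a plain dot product. The sketch below checks this on made-up 2-d vectors. `normalise` and `dot` are hypothetical helpers, not part of embeddings.py. OpenAI's documentation describes its embeddings as normalised to length 1, which is where this shortcut tends to be useful; verify that assumption before relying on it with any other model.

```python
import math


def normalise(v: list[float]) -> list[float]:
    """Scale v to length 1; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return v if norm == 0 else [x / norm for x in v]


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Made-up 2-d vectors chosen so the arithmetic is easy to follow.
a = [3.0, 4.0]    # length 5
b = [4.0, 3.0]    # length 5
c = [-3.0, -4.0]  # opposite direction to a

full_cosine = dot(a, b) / (5.0 * 5.0)       # dot product divided by both lengths
shortcut = dot(normalise(a), normalise(b))  # same quantity via unit vectors
opposite = dot(normalise(a), normalise(c))  # lower end of the [-1, 1] range

print(round(full_cosine, 6), round(shortcut, 6), round(opposite, 6))
```

The third value shows the negative end of the range: vectors pointing in opposite directions score about -1.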
Similarity demo
embeddings.py
query = 'How do transformers handle long-range dependencies?'
texts = [
    'Attention mechanisms allow models to relate distant tokens directly.',
    'The Eiffel Tower is located in Paris, France.',
    'Self-attention captures relationships between all positions in a sequence.',
]
q_emb = generate_embedding(query)
for text in texts:
    t_emb = generate_embedding(text)
    score = cosine_similarity(q_emb, t_emb)
    print(f'{score:.3f} {text[:60]}')
Knowledge Check
Two texts have a cosine similarity of 0.03. What does this indicate?
Recap — what you just learned
- ✓An embedding is a fixed-length float vector capturing text meaning; similar texts produce similar vectors
- ✓text-embedding-3-small produces 1536-dimensional vectors by default and is the default model in this session's code
- ✓Cosine similarity = dot product divided by product of magnitudes; ranges from -1 to 1
- ✓Always replace newlines in text before embedding — they add noise without adding meaning
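The similarity demo sends one request per text and prints scores in input order. A minimal sketch of a batched, ranked variant follows. `embed_batch` and `rank_by_similarity` are hypothetical names introduced here; the batch call relies on `client.embeddings.create` accepting a list of inputs and returning items that carry an `index` field. The ranking part is exercised with made-up 2-d vectors so it runs without an API key.

```python
import math


def embed_batch(texts: list[str], model: str = 'text-embedding-3-small') -> list[list[float]]:
    """Embed several texts in one request (needs the openai package and OPENAI_API_KEY)."""
    from openai import OpenAI  # imported lazily so the ranking helper below runs offline

    client = OpenAI()
    cleaned = [t.replace('\n', ' ') for t in texts]
    response = client.embeddings.create(input=cleaned, model=model)
    # Sort by each item's index so vectors line up with the input order.
    return [item.embedding for item in sorted(response.data, key=lambda item: item.index)]


def cosine_similarity(a, b):
    """Same formula as the lesson's function, repeated so this block runs alone."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, top_k=3):
    """Return up to top_k (score, position) pairs, highest score first."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Offline check with made-up 2-d vectors standing in for real embeddings.
toy_query = [1.0, 0.0]
toy_docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ranked = rank_by_similarity(toy_query, toy_docs)
print(ranked)
```

With an API key configured, `rank_by_similarity(embed_batch([query])[0], embed_batch(texts))` would rank the demo texts against the query using two requests instead of one per text.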