1) What It Is
- Embeddings = vector representations of data (words, images, users, products, etc.) in a continuous space.
- Embedding similarity = measuring how close two vectors are in that space.
- Intuition: similar things should have embeddings that are close together.
Example: In word embeddings, “king – man + woman ≈ queen”.
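The analogy can be illustrated with a toy sketch. The 3-dimensional vectors below are hand-made for illustration (hypothetical, not taken from any trained model), and `cosine` is a helper defined here; real word embeddings have hundreds of dimensions and the analogy holds only approximately.

```python
import numpy as np

# Hypothetical toy vectors; the axes loosely mean (royalty, male, female).
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.1, 0.1]),
}

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# king - man + woman
target = vocab["king"] - vocab["man"] + vocab["woman"]

# Rank the remaining words by similarity to the target, skipping the inputs.
inputs = ("king", "man", "woman")
candidates = {w: vec for w, vec in vocab.items() if w not in inputs}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)
```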
2) Why It Matters
- Converts raw objects (text, images, users) into numeric representations.
- Enables ML models to compare, cluster, and retrieve similar items.
- Powers modern search, recommendations, semantic understanding.
3) Similarity Measures
a) Cosine similarity
The most common choice. Measures the cosine of the angle between two vectors and ignores their magnitudes.
$\text{cosine}(u,v) = \frac{u \cdot v}{\|u\|\|v\|}$
- Range: -1 (opposite directions) through 0 (orthogonal) to 1 (same direction, regardless of magnitude).
- Example: word embeddings → semantically similar words have cosine close to 1.
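A minimal NumPy sketch of the cosine formula may help make the range concrete. The helper name `cosine` and the sample vectors are illustrative assumptions.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])

same_direction = cosine(u, 10 * u)   # scaling a vector does not change the angle
opposite = cosine(u, -u)             # vectors pointing in opposite directions
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

print(same_direction, opposite, orthogonal)
```

The three values land at the top, bottom, and middle of the range respectively, which shows that cosine similarity depends on direction only.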
b) Dot product
$u \cdot v$
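- Larger = more similar, but the score also grows with vector magnitude; for unit-length vectors it equals cosine similarity.
- Used in neural networks (attention) and in matrix factorization.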
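As a rough illustration of how the dot product relates to cosine similarity, the sketch below compares a raw dot product with the dot product of the same vectors after scaling each to unit length. The vectors and variable names are assumptions chosen for easy arithmetic.

```python
import numpy as np

u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, 4.0, 4.0])   # same direction as u, twice the length

# The raw dot product depends on both direction and length.
raw_dot = float(np.dot(u, v))

# After scaling each vector to unit length, the dot product is the cosine.
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)
normalized_dot = float(np.dot(u_hat, v_hat))

print(raw_dot, normalized_dot)
```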
c) Euclidean distance (L2)
$\|u - v\|_2$
- Smaller = more similar.
d) Manhattan distance (L1)
$\|u - v\|_1$
- Smaller = more similar.
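Both distances can be computed with `np.linalg.norm` by passing the `ord` argument. The vectors below are illustrative and chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 3.0])

# u - v = [-3, -4, 0]
l2 = float(np.linalg.norm(u - v, ord=2))   # sqrt(9 + 16 + 0)
l1 = float(np.linalg.norm(u - v, ord=1))   # 3 + 4 + 0

print(l2, l1)
```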
e) Other measures
- Jaccard similarity (for sets or sparse binary vectors): size of the intersection divided by size of the union.
- Mahalanobis distance (scales differences by the data's covariance).
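These two measures can be sketched in a few lines. The helper names `jaccard` and `mahalanobis`, the tag sets, and the covariance matrix are all illustrative assumptions; in practice the covariance would be estimated from data.

```python
import numpy as np

def jaccard(a, b):
    """|A intersect B| / |A union B| for two Python sets."""
    return len(a & b) / len(a | b)

def mahalanobis(x, mu, cov):
    """sqrt((x - mu)^T cov^{-1} (x - mu)) for 1-D x, mu and a square cov."""
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

tags_a = {"red", "round", "sweet"}
tags_b = {"red", "round", "sour"}
jac = jaccard(tags_a, tags_b)   # 2 shared tags out of 4 distinct tags

# Assumed covariance: variance 4 along the first axis, 1 along the second.
cov = np.array([[4.0, 0.0], [0.0, 1.0]])
mu = np.array([0.0, 0.0])
d_wide = mahalanobis(np.array([2.0, 0.0]), mu, cov)     # along the high-variance axis
d_narrow = mahalanobis(np.array([0.0, 2.0]), mu, cov)   # along the low-variance axis

print(jac, d_wide, d_narrow)
```

Both test points are at Euclidean distance 2 from `mu`, yet the point along the low-variance axis is farther in Mahalanobis terms.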
4) Applications
- NLP:
  - Semantic similarity between sentences/documents.
  - Word embeddings (Word2Vec, GloVe, BERT embeddings).
- Vision:
  - Face recognition → compare embedding similarity of faces.
  - Image retrieval → “find images similar to this one.”
- Recommendation systems:
  - User–item embeddings (matrix factorization, collaborative filtering).
  - “People who liked this also liked that.”
- Clustering & anomaly detection:
  - Group similar embeddings together, flag outliers.
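Retrieval-style applications such as image search or recommendations often reduce to "find the k stored embeddings most similar to a query". The following is a minimal brute-force sketch, assuming a small in-memory matrix of item embeddings; `top_k_cosine` and the sample data are hypothetical, and large collections would typically use an approximate nearest-neighbor index instead.

```python
import numpy as np

def top_k_cosine(query, items, k=2):
    """Return (indices, scores) of the k rows of `items` most similar to `query`."""
    # Normalize rows and query so that a dot product equals cosine similarity.
    items_unit = items / np.linalg.norm(items, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = items_unit @ query_unit
    order = np.argsort(-scores)[:k]          # highest similarity first
    return order.tolist(), scores[order].tolist()

items = np.array([
    [1.0, 0.0],    # item 0
    [0.9, 0.1],    # item 1
    [0.0, 1.0],    # item 2
    [-1.0, 0.0],   # item 3
])
query = np.array([1.0, 0.0])

indices, scores = top_k_cosine(query, items, k=2)
print(indices, scores)
```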
5) Example (Python, Cosine Similarity)
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
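# Two embeddings as 2-D arrays of shape (1, 3); cosine_similarity expects 2-D input
u = np.array([[0.1, 0.8, 0.5]])
v = np.array([[0.2, 0.7, 0.4]])
# Result is a (1, 1) matrix of pairwise similarities
sim = cosine_similarity(u, v)
print("Cosine similarity:", sim[0][0])
Output → about 0.99 (very similar).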
6) Limitations
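- Choice of metric matters: cosine similarity and Euclidean distance can rank the same candidates differently.
- Embedding quality depends on how the embeddings were trained.
- High-dimensional issues: under the curse of dimensionality, pairwise distances tend to concentrate, so nearest and farthest neighbors become harder to tell apart.
- Sensitive to domain shift: embeddings trained on one domain may not generalize to another.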
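The point about metric choice can be seen in a small constructed case. In the sketch below (the vectors are assumptions picked to make the disagreement obvious), cosine similarity prefers candidate `a`, while Euclidean distance prefers candidate `b`.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = np.array([1.0, 0.0])
a = np.array([10.0, 0.0])   # same direction as the query, but far away
b = np.array([1.0, 1.0])    # nearby, but pointing 45 degrees off

cos_a, cos_b = cosine(query, a), cosine(query, b)
dist_a = float(np.linalg.norm(query - a))
dist_b = float(np.linalg.norm(query - b))

# Cosine ranks a first (higher is better); Euclidean ranks b first (lower is better).
print("cosine:", cos_a, cos_b)
print("euclidean:", dist_a, dist_b)
```

Normalizing all vectors to unit length before comparison removes this disagreement, since for unit vectors the two measures produce the same ranking.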
7) Summary
- Embedding similarity = comparing objects in vector space.
- Common metrics: cosine similarity, dot product, Euclidean distance.
- Powers search, recommendations, semantic understanding, vision.
- Critical in LLMs, face recognition, recommendation engines.
