1) What It Is

  • Embeddings = vector representations of data (words, images, users, products, etc.) in a continuous space.
  • Embedding similarity = measuring how close two vectors are in that space.
  • Intuition: similar things should have embeddings that are close together.

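Example: In word embeddings, vector arithmetic can capture analogies: “king – man + woman ≈ queen”, meaning the resulting vector lies close to the embedding of “queen”.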
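
The analogy arithmetic can be sketched with toy vectors. The 2-D vectors below are hypothetical and hand-picked for illustration (axis 0 loosely "royalty", axis 1 loosely "gender"); real word embeddings are learned from data and have far more dimensions.

```python
import numpy as np

# Hypothetical toy embeddings, chosen by hand for illustration only.
vocab = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def cosine(u, v):
    # Cosine similarity of two non-zero 1-D vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# king - man + woman
target = vocab["king"] - vocab["man"] + vocab["woman"]

# Pick the vocabulary word whose vector is most cosine-similar to the target.
best = max(vocab, key=lambda w: cosine(target, vocab[w]))
print(best)  # queen
```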


2) Why It Matters

  • Converts raw objects (text, images, users) into numeric representations.
  • Enables ML models to compare, cluster, and retrieve similar items.
  • Powers modern search, recommendations, semantic understanding.

3) Similarity Measures

a) Cosine similarity

The most common measure. It compares the angle between two vectors and ignores their magnitudes.

$\text{cosine}(u,v) = \frac{u \cdot v}{\|u\|\|v\|}$

  • Range: -1 (opposite direction) to 1 (same direction); 0 means the vectors are orthogonal.
  • Example: word embeddings → semantically similar words tend to have cosine close to 1.
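
A minimal sketch of the formula in plain NumPy, assuming 1-D, non-zero input vectors. The helper name `cosine` is illustrative, not a library function. It checks the three reference points of the range: same direction, opposite direction, and orthogonal.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two 1-D vectors (assumes neither is all zeros)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])

same_direction = cosine(u, 5 * u)   # scaling a vector does not change the angle
opposite = cosine(u, -u)
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

print(same_direction, opposite, orthogonal)
```

Up to floating-point rounding, the three values are 1, -1, and 0.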

b) Dot product

$u \cdot v$

  • Larger = more similar, but the value also grows with vector length; on unit-length vectors it equals cosine similarity.
  • Used in neural networks (attention, matrix factorization).
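
A short sketch of how the dot product depends on magnitude, using two made-up vectors that point in the same direction. After normalizing both to unit length, the dot product reduces to the cosine similarity.

```python
import numpy as np

u = np.array([3.0, 4.0])   # norm 5
v = np.array([6.0, 8.0])   # same direction as u, norm 10

raw_dot = float(u @ v)     # 3*6 + 4*8 = 50.0, inflated by vector length

# Normalize to unit length, then take the dot product again.
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)
normalized_dot = float(u_hat @ v_hat)   # equals the cosine similarity (1 here, up to rounding)

print(raw_dot, normalized_dot)
```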

c) Euclidean distance (L2)

$\|u - v\|_2$

  • Smaller = more similar.
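
For unit-length vectors, Euclidean distance and cosine similarity carry the same information. Expanding the squared norm, with $\|u\| = \|v\| = 1$ assumed:

$\|u - v\|_2^2 = \|u\|^2 + \|v\|^2 - 2\,u \cdot v = 2 - 2\,\text{cosine}(u,v)$

So on normalized embeddings, ranking by smallest Euclidean distance and ranking by largest cosine similarity give the same order.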

d) Manhattan distance (L1)

$\|u - v\|_1$
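
Both distances can be computed with `np.linalg.norm` and its `ord` argument. A small sketch with made-up vectors whose difference is (-3, -4, 0):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 3.0])

# Euclidean (L2): square root of the sum of squared differences.
l2 = float(np.linalg.norm(u - v, ord=2))   # sqrt(9 + 16 + 0) = 5.0

# Manhattan (L1): sum of absolute differences.
l1 = float(np.linalg.norm(u - v, ord=1))   # 3 + 4 + 0 = 7.0

print(l2, l1)
```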

e) Other measures

  • Jaccard similarity (for sets or sparse binary vectors: size of the intersection divided by size of the union).
  • Mahalanobis distance (takes covariance into account).
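
A minimal sketch of both measures. The helper names `jaccard` and `mahalanobis`, the tag sets, and the covariance matrix are illustrative assumptions; in practice the covariance would be estimated from data.

```python
import numpy as np

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

def mahalanobis(u, v, cov):
    """sqrt((u - v)^T cov^{-1} (u - v)) for an invertible covariance matrix."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    # Solve cov @ x = diff rather than forming the explicit inverse.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

tags_a = {"python", "numpy", "ml"}
tags_b = {"python", "ml", "search", "nlp"}
j = jaccard(tags_a, tags_b)   # 2 shared tags / 5 distinct tags = 0.4

# Assumed covariance: variance 4 along axis 0, variance 1 along axis 1.
cov = np.array([[4.0, 0.0],
                [0.0, 1.0]])
d = mahalanobis([2.0, 0.0], [0.0, 0.0], cov)   # Euclidean distance is 2, Mahalanobis is 1

print(j, d)
```

The Mahalanobis distance is smaller than the Euclidean one here because the two points differ along the high-variance axis.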

4) Applications

  • NLP:
    • Semantic similarity between sentences/documents.
    • Word and token embeddings (Word2Vec, GloVe, contextual embeddings from BERT).
  • Vision:
    • Face recognition → compare embedding similarity of faces.
    • Image retrieval → “find images similar to this one.”
  • Recommendation systems:
    • User–item embeddings (matrix factorization, collaborative filtering).
    • “People who liked this also liked that.”
  • Clustering & anomaly detection:
    • Group similar embeddings together, flag outliers.
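
Most of these applications reduce to the same operation: given a query embedding, find the stored embeddings most similar to it. A brute-force sketch using cosine similarity follows. The function name `top_k` and the item matrix are hypothetical, and large collections typically use an approximate nearest-neighbor index instead.

```python
import numpy as np

def top_k(query, items, k=2):
    """Return (indices, scores) of the k rows of `items` most cosine-similar to `query`."""
    # Normalize rows and query so a dot product equals cosine similarity.
    items_norm = items / np.linalg.norm(items, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = items_norm @ query_norm
    order = np.argsort(-scores)[:k]   # highest similarity first
    return order.tolist(), scores[order].tolist()

# Hypothetical item embeddings, one row per item.
items = np.array([
    [1.0, 0.0],    # item 0
    [0.9, 0.1],    # item 1
    [0.0, 1.0],    # item 2
    [-1.0, 0.0],   # item 3
])
query = np.array([1.0, 0.0])

indices, scores = top_k(query, items, k=2)
print(indices)  # [0, 1]
```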

5) Example (Python, Cosine Similarity)

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

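# Two embeddings, each a 2-D array of shape (1, 3):
# cosine_similarity expects one row per sample
u = np.array([[0.1, 0.8, 0.5]])
v = np.array([[0.2, 0.7, 0.4]])

# Returns a (1, 1) matrix of pairwise similarities
sim = cosine_similarity(u, v)
print("Cosine similarity:", sim[0][0])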

Output → approximately 0.99 (very similar).


6) Limitations

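  • Choice of metric matters: cosine similarity and Euclidean distance can rank the same candidates differently, because cosine ignores vector length and Euclidean distance does not.
  • Embedding quality depends on how the embeddings were trained (data, objective, model).
  • High-dimensional issues: under the curse of dimensionality, distances tend to concentrate, so the nearest and farthest neighbors become harder to tell apart.
  • Sensitive to domain shift: embeddings trained on one domain may not generalize to another.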
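
The first limitation can be made concrete with a small sketch. The vectors are made up: candidate `a` points exactly along the query but is much longer, while candidate `b` is close in space but slightly off in direction.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity of two non-zero 1-D vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

query = np.array([1.0, 0.0])
a = np.array([10.0, 0.0])   # same direction as the query, much longer
b = np.array([1.0, 0.5])    # slightly different direction, similar length

cos_a, cos_b = cosine(query, a), cosine(query, b)
dist_a = float(np.linalg.norm(query - a))
dist_b = float(np.linalg.norm(query - b))

closest_by_cosine = "a" if cos_a > cos_b else "b"
closest_by_euclidean = "a" if dist_a < dist_b else "b"
print(closest_by_cosine, closest_by_euclidean)  # a b
```

Normalizing all embeddings to unit length before comparison removes this disagreement, since both metrics then produce the same ranking.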


Summary

  • Embedding similarity = comparing objects in vector space.
  • Common metrics: cosine similarity, dot product, Euclidean distance.
  • Powers search, recommendations, semantic understanding, vision.
  • Critical in LLMs, face recognition, recommendation engines.