Definition

A Support Vector Machine (SVM) is a supervised machine learning algorithm used mainly for classification, and also for regression.

  • Core idea: Find the best separating hyperplane that maximizes the margin between different classes.
  • It uses support vectors (the critical data points closest to the decision boundary) to define that hyperplane.

Intuition

  • Suppose you have two classes of points in 2D. Many lines can separate them.
  • SVM chooses the one that maximizes the margin (the width of the empty band between the nearest points of the two classes, with the boundary running down its middle).
  • A wider margin tends to generalize better to unseen data.

Mathematics

  1. Decision Function (linear case):

$f(x) = w \cdot x + b$

  • $w$: weight vector (normal to the hyperplane)
  • $b$: bias (offset)

  2. Classification Rule:

$\hat{y} = \text{sign}(w \cdot x + b)$

  3. Margin:

$\text{Margin} = \frac{2}{\|w\|}$

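  • This formula assumes $w$ and $b$ are scaled so that the nearest points of each class satisfy $|w \cdot x + b| = 1$.
  • SVM maximizes this margin (equivalently, minimizes $\|w\|$) while classifying the training data correctly, or with as few violations as possible.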
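
The three formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not a trained model: the values of `w` and `b` are hypothetical, and treating $f(x) = 0$ as class +1 is an arbitrary tie-breaking assumption.

```python
import math

def decision_function(w, b, x):
    """Return f(x) = w . x + b for a linear SVM."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Return +1 or -1 according to the sign of f(x) (ties go to +1)."""
    return 1 if decision_function(w, b, x) >= 0 else -1

def margin_width(w):
    """Return the margin 2 / ||w|| (assumes the canonical scaling of w and b)."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical parameters, chosen only for illustration.
w = [3.0, 4.0]
b = -5.0

print(decision_function(w, b, [3.0, 4.0]))  # 20.0
print(classify(w, b, [0.0, 0.0]))           # -1
print(margin_width(w))                      # 0.4
```

Note that a larger $\|w\|$ gives a narrower margin, which is why maximizing the margin is the same as minimizing $\|w\|$.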

Support Vectors

  • The data points closest to the hyperplane.
  • They determine the position and orientation of the decision boundary.
  • All other points could be moved (without crossing the margin) or removed, and the trained hyperplane would stay the same.
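
The claim that only the support vectors matter can be checked on a toy problem. The sketch below assumes scikit-learn and NumPy are installed; the four data points are made up, and a very large `C` is used only to approximate a hard-margin fit.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (made up for illustration).
X = np.array([[0.0, 0.0], [-1.0, 0.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([-1, -1, 1, 1])

# A very large C approximates a hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(clf.support_vectors_)       # the points lying on the margin
print(clf.coef_, clf.intercept_)  # w and b of the fitted hyperplane

# Refit on the support vectors alone: the hyperplane should not change.
sv_idx = clf.support_
clf_sv = SVC(kernel="linear", C=1e6).fit(X[sv_idx], y[sv_idx])
print(np.allclose(clf.coef_, clf_sv.coef_, atol=1e-2))
```

Here the closest pair across the classes is (0, 0) and (2, 0), so those two points should come back as the support vectors, and dropping the other two should leave `coef_` and `intercept_` essentially unchanged.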

Kernel Trick

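  • When data is not linearly separable, SVM can work in a higher-dimensional feature space where the classes become separable.
  • A kernel function computes inner products in that space directly from the original inputs, so the mapping itself never has to be computed explicitly.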
  • Common kernels:
    • Linear
    • Polynomial
    • Radial Basis Function (RBF / Gaussian)
    • Sigmoid

Example: In 2D, points inside a circle vs. points outside it are not linearly separable. Adding the squared radius $x_1^2 + x_2^2$ as a third coordinate makes them separable by a plane.
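
A minimal sketch of the circle example, followed by a check of what "kernel trick" means in practice. The data points, the threshold plane, and the helper names `lift`, `phi`, and `poly_kernel` are all assumptions made for illustration. The second half uses the degree-2 polynomial kernel $k(x, z) = (x \cdot z)^2$ and verifies that it equals an inner product in an explicit feature space.

```python
import math

def lift(point):
    """Explicit feature map: append the squared radius as a third coordinate."""
    x1, x2 = point
    return (x1, x2, x1 * x1 + x2 * x2)

# Made-up data: inner points (class -1) lie on a circle of radius 0.5,
# outer points (class +1) lie on a circle of radius 2.
inner = [(0.5 * math.cos(t), 0.5 * math.sin(t)) for t in range(8)]
outer = [(2.0 * math.cos(t), 2.0 * math.sin(t)) for t in range(8)]

def f(point):
    """Linear decision function in the lifted space: w = (0, 0, 1), b = -2."""
    return lift(point)[2] - 2.0

print(all(f(p) < 0 for p in inner))  # True
print(all(f(p) > 0 for p in outer))  # True

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """k(x, z) = (x . z)^2, computed without ever building phi."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(math.isclose(explicit, poly_kernel(x, z)))  # True
```

The kernel gives the same number as the explicit inner product while only touching the original 2D inputs, which is what makes very high-dimensional (or, for RBF, infinite-dimensional) feature spaces affordable.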


Types of SVMs

  1. Hard-Margin SVM
    • Assumes perfect separation (no misclassification).
    • Works only if data is clean and separable.
  2. Soft-Margin SVM
    • Allows some margin violations and misclassifications, controlled by the penalty parameter $C$.
    • A large $C$ penalizes violations heavily (narrower margin); a small $C$ tolerates more of them (wider margin).
  3. Support Vector Regression (SVR)
    • Extension of SVM for regression tasks.
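    • Fits a function so that the training targets lie within an $\epsilon$-tube around its predictions where possible; errors smaller than $\epsilon$ are not penalized.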
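
The roles of $C$ and $\epsilon$ can be made concrete with the two loss functions usually associated with these variants: the hinge loss for soft-margin classification and the $\epsilon$-insensitive loss for SVR. The sketch below uses the standard soft-margin objective $\frac{1}{2}\|w\|^2 + C \sum_i \max(0, 1 - y_i f(x_i))$; the data, the fixed `w` and `b`, and the function names are assumptions for illustration, and nothing is being optimized here.

```python
def hinge_loss(y, score):
    """Soft-margin penalty: zero once a point is on the correct side with margin >= 1."""
    return max(0.0, 1.0 - y * score)

def soft_margin_objective(w, b, X, y, C):
    """Return 0.5 * ||w||^2 + C * (sum of hinge losses)."""
    reg = 0.5 * sum(wi * wi for wi in w)
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in X]
    return reg + C * sum(hinge_loss(yi, s) for yi, s in zip(y, scores))

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """SVR loss: errors inside the epsilon-tube cost nothing."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Made-up data; the last point sits on the wrong side of the boundary.
X = [(0.0, 0.0), (2.0, 0.0), (0.5, 0.0)]
y = [-1, 1, 1]
w, b = (1.0, 0.0), -1.0

print(soft_margin_objective(w, b, X, y, C=1.0))   # 2.0
print(soft_margin_objective(w, b, X, y, C=10.0))  # 15.5

print(epsilon_insensitive_loss(3.0, 3.125, epsilon=0.25))  # 0.0
print(epsilon_insensitive_loss(3.0, 3.5, epsilon=0.25))    # 0.25
```

The same misclassified point costs ten times more when $C$ goes from 1 to 10, so a trained model with large $C$ would bend the boundary further to avoid it. In the SVR case, a prediction that misses by less than $\epsilon$ contributes no loss at all.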

Advantages

  • Works well with high-dimensional data.
  • Effective when classes are separable or nearly separable.
  • Kernel trick allows flexible decision boundaries.
  • Uses only support vectors → efficient in memory.

Disadvantages

  • Training can be slow on very large datasets.
  • Choice of kernel and parameters is critical.
  • Less interpretable than linear or logistic regression, especially with nonlinear kernels.


Applications

  • Text classification (spam detection, sentiment analysis)
  • Image recognition (face detection, handwriting recognition)
  • Bioinformatics (protein classification, cancer detection)
  • Finance (fraud detection, credit scoring)

In short:
SVMs classify data by finding the maximum-margin hyperplane that separates classes, using only the critical points (support vectors). With the kernel trick, they can also learn nonlinear decision boundaries.