Definition

An outlier is a data point that lies far away from the majority of other observations.
It is an unusual or extreme value that doesn’t follow the general trend of the data.

  • Example: If most people’s heights are between 150 and 190 cm, but one value is 250 cm, that value is an outlier.

Causes of Outliers

  1. Measurement Error – faulty sensors, data entry mistakes.
  2. Natural Variation – genuine extreme values (e.g., very tall people).
  3. Sampling Issues – mixing populations or incorrect sampling method.

Why Outliers Matter

  • Skew Results: Can heavily influence the mean, standard deviation, regression coefficients, and MSE.
  • Model Performance: In regression, one extreme point can pull the regression line toward it.
  • Detection of Rare Events: Outliers can sometimes represent fraud, defects, or important rare cases.
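The pull on a regression line can be seen in a minimal sketch. The data here are hypothetical (eight points lying exactly on y = 2x, plus one invented extreme point), and `least_squares_slope` is an illustrative helper, not a library function.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: eight points lying exactly on y = 2x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 * x for x in xs]

slope_clean = least_squares_slope(xs, ys)

# Add one extreme point (9, 60); the existing trend would give y = 18 there.
slope_with_outlier = least_squares_slope(xs + [9], ys + [60])

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with outlier:    {slope_with_outlier:.2f}")
```

With these made-up numbers, a single added point moves the fitted slope from 2.0 to 4.8, even though the other eight points agree perfectly.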

Methods to Detect Outliers

  1. Statistical Rules
    • Z-score method: If $|z| > 3$, treat as outlier.
    • IQR rule: If a value is below $Q1 - 1.5 \times IQR$ or above $Q3 + 1.5 \times IQR$, treat as outlier.
  2. Visualization
    • Boxplot (points outside whiskers).
    • Scatterplot (isolated points).
  3. Machine Learning Approaches
    • Isolation Forest, DBSCAN, One-Class SVM.
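The two statistical rules can be sketched with Python's standard library. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, the z-score uses the population standard deviation, and the quartiles use the "inclusive" interpolation method. Other conventions give slightly different fences.

```python
import statistics

def zscore_outliers(data, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)  # population std; an assumption of this sketch
    return [x for x in data if abs((x - mean) / std) > threshold]

def iqr_outliers(data, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # "inclusive" quartiles are an assumption; other methods shift the fences.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

scores = [70, 75, 80, 85, 90, 92, 95, 100, 12]

print("z-score rule:", zscore_outliers(scores))
print("IQR rule:    ", iqr_outliers(scores))
```

On these nine scores the IQR rule flags 12 (the fences are 49.5 and 117.5), while the z-score rule flags nothing. The outlier inflates the standard deviation, so its own |z| is only about 2.6. In small samples the z-score rule with a threshold of 3 can miss even an obvious outlier.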

Handling Outliers

  • Keep them: If they are real and important (e.g., rare but valid cases).
  • Remove them: If caused by error or not relevant.
  • Transform them: Apply log or square root transformations to reduce the influence of very large values (suited to positive, right-skewed data).
  • Use robust models: Median-based methods, MAE instead of MSE, or robust regression.
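To illustrate why median-based methods count as robust, the sketch below compares how far the mean and the median move when one extreme value is dropped from a small set of exam scores. The outlier is removed by hand here purely for illustration. In practice it would come from a detection rule.

```python
import statistics

scores = [70, 75, 80, 85, 90, 92, 95, 100, 12]
cleaned = [s for s in scores if s != 12]  # drop the suspected outlier by hand

# How much each summary statistic changes once the outlier is gone.
mean_shift = statistics.fmean(cleaned) - statistics.fmean(scores)
median_shift = statistics.median(cleaned) - statistics.median(scores)

print(f"mean shift:   {mean_shift:.2f}")
print(f"median shift: {median_shift:.2f}")
```

The median moves by 2.5 points (85 to 87.5), while the mean moves more than three times as far. That gap is the practical argument for reporting a median, or choosing a loss such as MAE, when outliers cannot be ruled out.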

Example

Suppose exam scores are: $[70, 75, 80, 85, 90, 92, 95, 100, 12]$

  • Most scores are between 70 and 100, but 12 is extremely low.
  • Mean with outlier = 699/9 ≈ 77.7, without outlier = 687/8 ≈ 85.9 → The outlier drags the average down by about 8 points.

In short:
An outlier is a data point far from the rest. It can signal errors, rare events, or meaningful anomalies. Handling depends on context: sometimes remove, sometimes keep.