Definition
An outlier is a data point that lies far away from the majority of other observations.
It is an unusual or extreme value that doesn’t follow the general trend of the data.
- Example: If most people’s heights fall between 150 and 190 cm, a recorded value of 250 cm is an outlier.
Causes of Outliers
- Measurement Error – faulty sensors, data entry mistakes.
- Natural Variation – genuine extreme values (e.g., very tall people).
- Sampling Issues – mixing populations or incorrect sampling method.
Why Outliers Matter
- Skew Results: Can heavily influence the mean, standard deviation, regression coefficients, and MSE.
- Model Performance: In regression, one extreme point can pull the regression line toward it.
- Detection of Rare Events: Outliers can sometimes represent fraud, defects, or important rare cases.
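To make the regression point concrete, here is a minimal sketch using only plain Python. The data are hypothetical: five points that lie exactly on $y = 2x$, plus one extreme point added for illustration. The helper `fit_line` is an assumed name for a textbook least-squares fit, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: five points exactly on y = 2x.
xs_clean = [1, 2, 3, 4, 5]
ys_clean = [2, 4, 6, 8, 10]

# Same data plus one extreme point (the trend would predict 12 at x = 6).
xs_outlier = xs_clean + [6]
ys_outlier = ys_clean + [40]

slope_clean, intercept_clean = fit_line(xs_clean, ys_clean)
slope_outlier, intercept_outlier = fit_line(xs_outlier, ys_outlier)

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with outlier:    {slope_outlier:.2f}")
```

In this toy data set a single point triples the fitted slope (from 2 to 6), which is the "pull" described above.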
Methods to Detect Outliers
- Statistical Rules
- Z-score method: If $|z| > 3$, where $z = (x - \mu) / \sigma$, treat the value as an outlier.
- IQR rule: Treat a value as an outlier if it is below $Q1 - 1.5 \times IQR$ or above $Q3 + 1.5 \times IQR$, where $IQR = Q3 - Q1$.
- Visualization
- Boxplot (points outside whiskers).
- Scatterplot (isolated points).
- Machine Learning Approaches
- Isolation Forest, DBSCAN, One-Class SVM.
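The two statistical rules can be sketched with Python's standard `statistics` module. This is an illustration under stated assumptions, not a definitive implementation: the function names `zscore_outliers` and `iqr_outliers` and the `heights` list are hypothetical, the z-score uses the population standard deviation, and the quartiles use the default method of `statistics.quantiles` (other quartile conventions give slightly different fences).

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return the values whose |z| exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs((v - mean) / std) > threshold]


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical heights in cm, with one implausible entry.
heights = [152, 155, 158, 160, 162, 164, 165, 167, 168, 170, 171,
           172, 174, 175, 177, 178, 180, 183, 186, 189, 250]

print("z-score rule:", zscore_outliers(heights))
print("IQR rule:    ", iqr_outliers(heights))
```

Both rules flag only the 250 cm entry here. Note that in very small samples the z-score rule can miss an obvious outlier, because the outlier itself inflates the standard deviation; the IQR rule is less affected since quartiles barely move.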
Handling Outliers
- Keep them: If they are real and important (e.g., rare but valid cases).
- Remove them: If caused by error or not relevant.
- Transform them: Apply log or square root transformations to reduce their influence.
- Use robust models: Median-based methods, MAE instead of MSE, or robust regression.
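As a small illustration of why MAE is considered more robust than MSE, the sketch below compares both metrics on a hypothetical list of prediction errors, with and without one large error. The error values and the helper names `mse` and `mae` are made up for the example.

```python
def mse(errors):
    """Mean squared error: large errors are squared, so they dominate."""
    return sum(e * e for e in errors) / len(errors)


def mae(errors):
    """Mean absolute error: each error contributes in proportion to its size."""
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical prediction errors (predicted minus actual).
errors_clean = [1, -2, 1, -3]
errors_outlier = errors_clean + [50]  # one badly mispredicted point

print(f"MSE: {mse(errors_clean):.2f} -> {mse(errors_outlier):.2f}")
print(f"MAE: {mae(errors_clean):.2f} -> {mae(errors_outlier):.2f}")
```

With these numbers the single large error raises the MSE from 3.75 to 503, while the MAE rises from 1.75 to 11.4, so a model trained on MSE is pushed much harder to accommodate that one point.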
Example
Suppose exam scores are: $[70, 75, 80, 85, 90, 92, 95, 100, 12]$
- Most scores are between 70 and 100, but 12 is extremely low.
- Mean with outlier = 77.7, without outlier = 85.9 → The outlier drags the average down by about 8 points.
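The comparison can be checked with a few lines of Python using the standard `statistics` module. The median is included as an extra point of comparison; it is not part of the original example, and the variable names are illustrative.

```python
import statistics

scores = [70, 75, 80, 85, 90, 92, 95, 100, 12]
scores_without = [s for s in scores if s != 12]  # drop the single low score

mean_with = statistics.fmean(scores)
mean_without = statistics.fmean(scores_without)
median_with = statistics.median(scores)
median_without = statistics.median(scores_without)

print(f"mean   with outlier: {mean_with:.1f}, without: {mean_without:.1f}")
print(f"median with outlier: {median_with}, without: {median_without}")
```

The median only moves from 85 to 87.5, a much smaller shift than the mean shows, which is why median-based summaries are often preferred when outliers are present.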
In short:
An outlier is a data point far from the rest. It can signal errors, rare events, or meaningful anomalies. How to handle it depends on context: remove it when it is an error, keep it when it is a genuine and relevant observation.
