Definition
Active learning is a machine learning approach in which a model is trained iteratively and selects the most informative unlabeled data points to be labeled, instead of having everything labeled up front.
- Goal: achieve high accuracy with fewer labeled examples.
- Useful when labeling data is expensive or time-consuming (e.g., medical images, legal documents).
How It Works
- Start with a small labeled dataset + a large pool of unlabeled data.
- Train an initial model.
- Use a query strategy to select the most “valuable” unlabeled samples.
- Send those samples to an oracle (human annotator, expert) for labeling.
- Add them to the training set → retrain the model.
- Repeat until performance is good enough or the labeling budget is exhausted.
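The steps above can be sketched as a small pool-based loop. This is a minimal illustration rather than a reference implementation: the 1-D data, the `oracle` function (a stand-in for a human annotator with a made-up decision threshold of 0.6), the tiny logistic model, and all parameter names are assumptions chosen for the example. It stops on budget only; a "performance is good enough" check would need a held-out labeled set.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(xs, ys, epochs=300, lr=1.0):
    """Fit p(y=1 | x) = sigmoid(w*x + b) with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def oracle(x):
    """Hypothetical annotator: the 'true' label is 1 when x > 0.6."""
    return 1 if x > 0.6 else 0


def active_learning(pool, oracle, n_initial=10, batch_size=5, budget=30, seed=0):
    """Pool-based active learning with uncertainty sampling."""
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)

    # Start with a small labeled set; everything else stays unlabeled.
    labeled_x, unlabeled = unlabeled[:n_initial], unlabeled[n_initial:]
    labeled_y = [oracle(x) for x in labeled_x]

    queries = 0
    while queries < budget and unlabeled:
        # Train (or retrain) on everything labeled so far.
        w, b = train_logistic(labeled_x, labeled_y)

        # Query strategy: most uncertain first (probability closest to 0.5).
        unlabeled.sort(key=lambda x: abs(sigmoid(w * x + b) - 0.5))
        k = min(batch_size, budget - queries)
        batch, unlabeled = unlabeled[:k], unlabeled[k:]

        # Send the batch to the oracle and add it to the training set.
        labeled_x.extend(batch)
        labeled_y.extend(oracle(x) for x in batch)
        queries += len(batch)

    # Final retrain on the full labeled set.
    return train_logistic(labeled_x, labeled_y), labeled_x, labeled_y


if __name__ == "__main__":
    pool = [i / 200 for i in range(200)]
    (w, b), xs, ys = active_learning(pool, oracle)
    print(f"labeled {len(xs)} of {len(pool)} points; w={w:.2f}, b={b:.2f}")
```

In practice the model, the oracle, and the uncertainty score would all be replaced by real components; only the shape of the loop carries over.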
Common Query Strategies
- Uncertainty Sampling
  - Select samples where the model is least confident.
  - Example: for binary classification, pick samples whose predicted probability is ≈ 0.5.
- Query by Committee
  - Train multiple models (a committee).
  - Pick samples where the models disagree most.
- Expected Model Change
  - Choose the samples that would change the model most if labeled.
- Diversity Sampling
  - Pick examples that differ from the existing training data, to cover the input space.
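Two of these strategies reduce to simple scoring functions over the pool. The sketch below shows one common way to score uncertainty (least confidence and entropy over predicted class probabilities) and committee disagreement (vote entropy over the members' predicted labels). The function names and the sample probabilities are illustrative assumptions, not part of any particular library.

```python
import math
from collections import Counter


def least_confidence(probs):
    """1 - max class probability; higher means the model is less confident."""
    return 1.0 - max(probs)


def entropy(probs):
    """Shannon entropy of the predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def vote_entropy(votes):
    """Committee disagreement: entropy of the labels predicted by each member."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())


def select_top_k(scores, k):
    """Indices of the k highest-scoring pool items (ties keep pool order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Uncertainty sampling: predicted probabilities for three pool items.
pool_probs = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]]
uncertainty = [least_confidence(p) for p in pool_probs]
most_uncertain = select_top_k(uncertainty, 1)   # -> [1], the item nearest 0.5

# Query by committee: labels predicted by three models for two pool items.
committee_votes = [["spam", "spam", "spam"], ["spam", "ham", "spam"]]
disagreement = [vote_entropy(v) for v in committee_votes]
most_disputed = select_top_k(disagreement, 1)   # -> [1], the split vote
```

For binary classification, least confidence and entropy rank samples identically; they can differ once there are three or more classes.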
Applications
- Medical AI → radiologists label only the X-rays the model is uncertain about.
- NLP → annotate only ambiguous sentences for intent classification.
- Fraud detection → human reviewers check the transactions the model is uncertain about.
- Image recognition → label only the most informative images.
Example
- You have 100,000 unlabeled emails, and labeling costs $2 each.
- Active learning strategy:
  - Train on 1,000 labeled emails.
  - Pick the 500 most uncertain emails for labeling.
  - Retrain; accuracy typically improves faster than it would with 500 randomly chosen emails.
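To make the cost difference concrete, the arithmetic for this example can be checked in a few lines. One assumption is added here: the 1,000 initial labels are also paid for at $2 each.

```python
COST_PER_LABEL = 2        # dollars, from the example
POOL_SIZE = 100_000       # emails available
INITIAL_LABELS = 1_000    # seed training set (assumed to cost $2 each as well)
QUERIED_LABELS = 500      # one round of uncertainty sampling

label_everything = POOL_SIZE * COST_PER_LABEL
seed_plus_one_round = (INITIAL_LABELS + QUERIED_LABELS) * COST_PER_LABEL

print(f"Label everything:     ${label_everything:,}")      # $200,000
print(f"Seed set + one round: ${seed_plus_one_round:,}")   # $3,000
```

Each further round of 500 queries adds $1,000, so the budget sets how many rounds are affordable.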
Why It’s Important
- Reduces annotation cost.
- Often improves model performance faster, per labeled example, than random sampling.
- Can help with imbalanced datasets, since rare or uncertain cases tend to be prioritized.
Summary
Active learning = model-guided data labeling strategy.
The model queries the most informative unlabeled samples for labeling, so you can reach high performance with less labeled data.
