Definition

Active learning is a machine learning approach where the model is trained iteratively, and it actively selects the most informative data points to be labeled (instead of labeling everything).

  • Goal: achieve high accuracy with fewer labeled examples.
  • Useful when labeling data is expensive or time-consuming (e.g., medical images, legal documents).

How It Works

  1. Start with a small labeled dataset + a large pool of unlabeled data.
  2. Train an initial model.
  3. Use a query strategy to select the most “valuable” unlabeled samples.
  4. Send those samples to an oracle (human annotator, expert) for labeling.
  5. Add them to the training set → retrain the model.
  6. Repeat until performance is good enough or budget is used.

Common Query Strategies

  1. Uncertainty Sampling
    • Select samples where the model is least confident.
    • Example: For binary classification, pick data where predicted probability ≈ 0.5.
  2. Query by Committee
    • Train multiple models (committee).
    • Pick samples where models disagree most.
  3. Expected Model Change
    • Choose data that would most change the model if labeled.
  4. Diversity Sampling
    • Pick examples that are different from existing training data, to cover the input space.

Applications

  • Medical AI → radiologists only label uncertain X-rays.
  • NLP → annotate only ambiguous sentences for intent classification.
  • Fraud detection → human reviewers check uncertain transactions.
  • Image recognition → label only the most informative images.

Example

  • You have 100,000 unlabeled emails, but labeling costs \$2 each.
  • Active learning strategy:
    • Train on 1,000 labeled emails.
    • Pick 500 most uncertain emails for labeling.
    • Retrain, accuracy improves faster than random labeling.

Why It’s Important

  • Reduces annotation cost.
  • Improves model performance faster than random sampling.
  • Helps handle imbalanced datasets (since rare/uncertain cases get prioritized).

Summary
Active learning = model-guided data labeling strategy.
The model queries the most informative unlabeled samples for labeling, so you can reach high performance with less data.