1. Population vs. Sample

In data analysis, a population is the complete set of data values relevant to a specific question.
Analyzing data from the entire population is ideal, but in practice this is often impossible because of:

  • High cost
  • Time constraints
  • Logistical difficulty

2. Why Samples Are Used

A sample is a subset of the population that is intended to represent the whole population. The sample size is the number of members in that subset.

Purpose of using a sample:

  • Make predictions or draw conclusions about the population
  • Reduce cost and time
  • Enable analysis when full population data is unavailable

Example:
Instead of surveying millions of cat owners in Canada, an analyst might survey a sample of a few hundred or a few thousand of them.

If selected carefully, a sample can produce results that are nearly as reliable as results from the full population.
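
To illustrate how close a carefully selected sample can come to the full population, here is a minimal simulation sketch. The population (yearly vet spending per cat owner), its size, and the spending range are all invented for illustration; they are not real survey figures.

```python
import random
from statistics import mean

def compare_sample_to_population(population_size=200_000, sample_size=2_000, seed=42):
    """Return (population mean, sample mean) for a simulated population."""
    rng = random.Random(seed)
    # Hypothetical population: yearly vet spending per cat owner, in dollars.
    population = [rng.uniform(0, 1000) for _ in range(population_size)]
    # Simple random sample, drawn without replacement.
    sample = rng.sample(population, sample_size)
    return mean(population), mean(sample)

population_mean, sample_mean = compare_sample_to_population()
print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
```

In this sketch the sample covers only 1% of the simulated population, yet its mean typically lands within a few dollars of the population mean.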


3. Confidence and Uncertainty in Sampling

The size and quality of a sample affect how confident analysts can be that their conclusions represent the population.

Key trade-off:

  • Smaller samples → faster and cheaper, but more uncertainty
  • Larger samples → more confidence, but higher cost and effort

Because a sample never includes every member of the population, its results always carry some level of uncertainty.
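
One common way to put a number on this trade-off is the approximate margin of error for a surveyed proportion. The sketch below uses the standard large-sample formula z * sqrt(p * (1 - p) / n) and assumes a simple random sample from a much larger population, a 95% confidence level (z = 1.96), and the most conservative proportion p = 0.5. The function name and sample sizes are illustrative choices.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z_score=1.96):
    """Approximate margin of error for a surveyed proportion.

    Assumes a simple random sample from a much larger population.
    z_score=1.96 corresponds to roughly 95% confidence;
    proportion=0.5 is the most conservative (widest) case.
    """
    return z_score * math.sqrt(proportion * (1 - proportion) / sample_size)

for n in (100, 400, 2_500, 10_000):
    print(f"n = {n:>6}: about ±{margin_of_error(n):.1%}")
```

Under this approximation, quadrupling the sample size only halves the margin of error, which is why extra confidence becomes increasingly expensive.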


4. Sampling Bias

Sampling bias occurs when the way a sample is selected makes it systematically unrepresentative of the population.

This happens when:

  • Certain groups are overrepresented
  • Other groups are underrepresented or excluded

Example:

  • A survey of cat owners conducted only via smartphones
  • Cat owners without smartphones are excluded
  • The sample no longer represents all cat owners
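
The smartphone example can be sketched as a small simulation. Everything numeric here is an assumption made for illustration: the 80% smartphone share, the age distributions, and the idea that owners without smartphones tend to be older. The point is only to show how excluding a group shifts the estimate.

```python
import random
from statistics import mean

def simulate_smartphone_survey(population_size=100_000, sample_size=1_000, seed=7):
    """Return (population mean age, random-sample mean, smartphone-only mean)."""
    rng = random.Random(seed)
    population = []
    for _ in range(population_size):
        has_smartphone = rng.random() < 0.8  # assumed share, illustration only
        # Assumed: owners without smartphones tend to be older.
        age = rng.gauss(40, 12) if has_smartphone else rng.gauss(65, 10)
        population.append((age, has_smartphone))

    # Unbiased: every owner can be selected.
    random_sample = rng.sample(population, sample_size)
    # Biased: only owners reachable by smartphone can be selected.
    smartphone_owners = [person for person in population if person[1]]
    biased_sample = rng.sample(smartphone_owners, sample_size)

    def mean_age(people):
        return mean(age for age, _ in people)

    return mean_age(population), mean_age(random_sample), mean_age(biased_sample)

population_age, random_age, biased_age = simulate_smartphone_survey()
print(f"Population mean age:      {population_age:.1f}")
print(f"Random sample mean age:   {random_age:.1f}")
print(f"Smartphone-only mean age: {biased_age:.1f}")
```

Under these assumptions the smartphone-only survey underestimates the average age by several years, and collecting more smartphone responses would not fix it, because the excluded group is never reached.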

5. Random Sampling

Random sampling is a method used to reduce sampling bias.

Definition:

  • Every member of the population has an equal chance of being selected

Benefits:

  • Improves representativeness
  • Reduces systematic bias
  • Increases confidence in conclusions

Example:

  • A cat owner living in an apartment in Ontario and one living in a house in Alberta have the same chance of being selected
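
As a rough sketch of the definition above, Python's standard library function random.sample draws members without replacement so that each one is equally likely to be picked. The population below (its size and its 30/70 split between the two groups) is hypothetical, and draw_simple_random_sample is an illustrative helper name.

```python
import random
from collections import Counter

def draw_simple_random_sample(population, sample_size, seed=None):
    """Draw a sample without replacement; every member is equally likely to be picked."""
    rng = random.Random(seed)
    return rng.sample(population, sample_size)

# Hypothetical population of 50,000 owners, labelled by region and housing type.
population = (
    [("Ontario", "apartment")] * 15_000
    + [("Alberta", "house")] * 35_000
)
sample = draw_simple_random_sample(population, sample_size=2_000, seed=1)

# Compare each group's share of the sample with its share of the population.
shares = {group: count / len(sample) for group, count in Counter(sample).items()}
for group, share in shares.items():
    print(group, f"{share:.1%}")
```

Because no group is favored during selection, each group's share of the sample should land close to its share of the population (30% and 70% in this made-up case).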

6. Role of the Data Analyst

Sample size decisions are often made before data collection, but analysts should still:

  • Understand how the sample was created
  • Confirm that the sample aligns with the business objective
  • Evaluate whether the data is representative of the population

Knowing this helps analysts assess the strengths and limitations of their analysis.
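
One simple way an analyst might check representativeness is to compare each group's share of the sample with its known share of the population. The sketch below assumes that trustworthy population shares are available (for example, from a census); the function name and all figures are hypothetical.

```python
def representation_gaps(sample_counts, population_shares):
    """Return sample share minus population share for each group.

    Positive values mean the group is overrepresented in the sample;
    negative values mean it is underrepresented.
    """
    total = sum(sample_counts.values())
    return {
        group: sample_counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }

# Hypothetical figures, for illustration only.
population_shares = {"apartment": 0.40, "house": 0.60}
sample_counts = {"apartment": 150, "house": 350}

for group, gap in representation_gaps(sample_counts, population_shares).items():
    print(f"{group}: {gap:+.1%}")
```

Here apartment dwellers make up 30% of the sample but 40% of the population, so they are underrepresented by about 10 percentage points, a gap worth noting when reporting conclusions.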


7. Key Takeaway

When full population data is unavailable:

  • Use an appropriate sample size
  • Minimize sampling bias
  • Prefer random sampling when possible

A well-chosen sample allows analysts to draw reliable conclusions while balancing accuracy, cost, and time.