1. Why Data Selection Matters

Data analysts are surrounded by vast amounts of available data, but not all data is useful for every project. A key responsibility of a data analyst is deciding what data to collect and use so that analysis stays focused, efficient, and relevant to the business problem.

Although the volume of available data can feel overwhelming, project goals and business questions usually narrow the choices significantly.


2. Starting with the Problem

Data collection should begin with a clear analytical question.

Example question

  • What is causing increased rush-hour traffic in a city?

This question immediately shapes what kind of data is needed and rules out irrelevant data.


3. Understanding Data Sources

Knowing where data comes from is essential for evaluating its reliability and suitability.

First-party data

  • Collected directly by the analyst or organization
  • Uses internal resources
  • Example: observing and counting cars on city streets
  • Often preferred because the analyst knows exactly how the data was collected

Second-party data

  • Collected by another organization directly from its audience
  • Purchased or shared
  • Example: traffic studies conducted by a local research organization
  • Generally reliable when the collecting organization has subject-matter expertise

Third-party data

  • Collected by external sources not directly involved
  • May pass through multiple organizations
  • Reliability varies
  • Requires careful checks for:
    • Accuracy
    • Bias
    • Credibility

Regardless of the source, all data must be evaluated for trustworthiness and approved for use.


4. Relevance Over Volume

Choosing the right data is more important than collecting large amounts of data.

Example

  • Financial data may not help explain traffic congestion
  • Traffic volume by time of day would be highly relevant

Data analysts must avoid distractions and focus only on data that supports the problem being solved.
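As a minimal sketch of what "traffic volume by time of day" could look like in practice, the Python snippet below counts vehicle observations per hour. The timestamps and the function name volume_by_hour are hypothetical and exist only for illustration; real traffic data would come from an approved source.

```python
from collections import Counter
from datetime import datetime

# Hypothetical observations: one ISO timestamp per vehicle counted.
observations = [
    "2024-03-04T07:45:00",
    "2024-03-04T08:10:00",
    "2024-03-04T08:30:00",
    "2024-03-04T13:05:00",
    "2024-03-04T17:20:00",
    "2024-03-04T17:55:00",
]


def volume_by_hour(timestamps):
    """Count how many vehicles were observed in each hour of the day (0-23)."""
    hours = (datetime.fromisoformat(ts).hour for ts in timestamps)
    return Counter(hours)


hourly = volume_by_hour(observations)

# Print the hours in chronological order with their vehicle counts.
for hour in sorted(hourly):
    print(f"{hour:02d}:00  {hourly[hour]} vehicles")
```

Only the timestamp field is needed for this question; unrelated fields, such as financial figures, would add volume without adding insight.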


5. Population vs. Sample

Population

  • The entire group of interest
  • Example: all cars traveling in a city

Collecting data from the full population is often impractical, so analysts usually work with a sample.

Sample

  • A subset of the population
  • Should be representative of the population so that conclusions generalize
  • Example approaches:
    • Observing traffic at a key intersection
    • Selecting a random subset of traffic records

The sampling method depends on the project’s goals and constraints.
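One way to select a random subset of traffic records is simple random sampling without replacement. The sketch below uses Python's standard random module; the record IDs, the sample size of 50, and the helper name draw_sample are assumptions made for illustration, not values from a real study.

```python
import random


def draw_sample(records, sample_size, seed=None):
    """Return a simple random sample (without replacement) of the records."""
    if sample_size > len(records):
        raise ValueError("sample_size cannot exceed the number of records")
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(records, sample_size)


# Hypothetical population: 1,000 traffic record IDs.
population = list(range(1, 1001))

sample = draw_sample(population, sample_size=50, seed=42)
print(len(sample))
```

Fixing the seed lets another analyst reproduce the same sample. Simple random sampling is only one option; other methods may suit a project's goals and constraints better.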


6. Choosing the Right Data Type

Selecting the correct data type helps ensure meaningful analysis.

Example

  • Traffic record dates stored in date format
  • Enables analysis of:
    • Day-of-week trends
    • Seasonal patterns
    • Recurring congestion periods

Correct data types improve accuracy and analytical flexibility.
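To illustrate why the date type matters, the sketch below converts date strings into Python date objects and derives the weekday and month from them. The sample dates and the helper name describe are hypothetical. The same information is hard to extract reliably while the values are still plain text.

```python
from datetime import date

# Fixed English names, indexed by date.weekday() (Monday is 0).
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical traffic record dates stored as text.
raw_dates = ["2024-03-04", "2024-03-09", "2024-07-15"]

# Converting to a real date type unlocks calendar-aware operations.
parsed = [date.fromisoformat(text) for text in raw_dates]


def describe(d):
    """Derive fields useful for day-of-week and seasonal analysis."""
    return {
        "weekday": WEEKDAYS[d.weekday()],
        "month": d.month,
        "is_weekend": d.weekday() >= 5,
    }


for d in parsed:
    print(d.isoformat(), describe(d))
```

Grouping records by the derived weekday or month is then a straightforward step toward spotting day-of-week trends and seasonal patterns.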


7. Defining the Time Frame

The time period for data collection affects other decisions, such as how to sample and where to store the data.

Historical data

  • Data that already exists
  • Useful when answers are needed quickly

Long-term data collection

  • Used to observe trends over time
  • Influences sampling strategy and storage choices
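When historical data already exists, applying a time frame can be as simple as filtering records by date. The sketch below assumes each record is a dictionary with a "date" field; the records, vehicle counts, and the helper name in_time_frame are hypothetical.

```python
from datetime import date


def in_time_frame(records, start, end):
    """Keep records whose date falls within [start, end], inclusive."""
    return [r for r in records if start <= r["date"] <= end]


# Hypothetical historical traffic records.
records = [
    {"date": date(2023, 12, 30), "vehicles": 410},
    {"date": date(2024, 1, 15), "vehicles": 520},
    {"date": date(2024, 2, 1), "vehicles": 495},
    {"date": date(2024, 4, 10), "vehicles": 530},
]

# Restrict the analysis to the first quarter of 2024.
first_quarter = in_time_frame(records, date(2024, 1, 1), date(2024, 3, 31))
print(len(first_quarter))
```

A narrow window keeps the analysis quick, while a wider one reveals longer trends at the cost of more data to store and process.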


8. Key Takeaways

  • Data selection is a critical analytical decision
  • Business questions guide what data is relevant
  • First-, second-, and third-party data vary in reliability
  • All data must be checked for accuracy and credibility
  • Samples are often more practical than full populations
  • Choosing appropriate data types improves insights
  • Time frame influences data collection strategy

One-sentence summary

Effective data analysis depends on selecting relevant, reliable data by considering data sources, sampling, data types, and time frames in alignment with the problem being solved.