1. Why Data Selection Matters
Data analysts are surrounded by vast amounts of available data, but not all data is useful for every project. A key responsibility of a data analyst is deciding what data to collect and use so that analysis stays focused, efficient, and relevant to the business problem.
Although the volume of available data can feel overwhelming, project goals and business questions usually narrow the choices significantly.
2. Starting with the Problem
Data collection should begin with a clear analytical question.
Example question
- What is causing increased rush-hour traffic in a city?
This question immediately shapes what kind of data is needed and rules out irrelevant data.
3. Understanding Data Sources
Knowing where data comes from is essential for evaluating its reliability and suitability.
First-party data
- Collected directly by the analyst or organization
- Uses internal resources
- Example: observing and counting cars on city streets
- Usually preferred because the analyst knows exactly how the data was collected
Second-party data
- Collected by another organization directly from its audience
- Purchased or shared
- Example: traffic studies conducted by a local research organization
- Generally reliable because the collecting organization has subject-matter expertise and gathered the data firsthand
Third-party data
- Collected by external sources not directly involved
- May pass through multiple organizations
- Reliability varies
- Requires careful checks for:
  - Accuracy
  - Bias
  - Credibility
Regardless of the source, all data must be evaluated for trustworthiness, and the analyst must confirm it is approved for use.
4. Relevance Over Volume
Choosing the right data is more important than collecting large amounts of data.
Example
- Financial data may not help explain traffic congestion
- Traffic volume by time of day would be highly relevant
Data analysts must avoid distractions and focus only on data that supports the problem being solved.
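As a minimal sketch of keeping only the data that supports the question, the snippet below drops an unrelated field from a few made-up records. The field names (hour, vehicle_count, city_budget_musd) and all values are hypothetical and exist only for illustration.

```python
# Hypothetical records that mix relevant fields with an unrelated one.
records = [
    {"hour": 8, "vehicle_count": 1240, "city_budget_musd": 310.5},
    {"hour": 13, "vehicle_count": 610, "city_budget_musd": 310.5},
    {"hour": 17, "vehicle_count": 1385, "city_budget_musd": 310.5},
]

# Fields assumed to bear on the rush-hour traffic question.
relevant_fields = ("hour", "vehicle_count")

# Keep only the relevant fields from each record.
focused = [{field: record[field] for field in relevant_fields} for record in records]

# With the distraction removed, the busiest hour is easy to find.
busiest = max(focused, key=lambda record: record["vehicle_count"])
print(busiest)
```

Deciding which fields count as relevant is still a judgment call that comes from the business question, not from the code.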
5. Population vs. Sample
Population
- The entire group of interest
- Example: all cars traveling in a city
Collecting data from the full population is often impractical.
Sample
- A subset of the population
- Should represent the population accurately
- Example approaches:
  - Observing traffic at a key intersection
  - Selecting a random subset of traffic records
The sampling method depends on the project’s goals and constraints.
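Selecting a random subset of traffic records can be sketched with Python's standard library. The population below is synthetic, and the record layout, the sample size of 500, and the choice of rush hours are all assumptions made for illustration.

```python
import random

# Hypothetical population: one record per vehicle passage, with the hour observed.
population = [{"record_id": i, "hour": i % 24} for i in range(10_000)]

# Seeded generator so the same sample can be drawn again for review.
rng = random.Random(42)

# Simple random sample without replacement: every record is equally likely to be chosen.
sample = rng.sample(population, k=500)

def rush_hour_share(records):
    """Fraction of records observed during the assumed rush hours (7-8 and 17-18)."""
    rush_hours = {7, 8, 17, 18}
    return sum(r["hour"] in rush_hours for r in records) / len(records)

# Compare the sample against the population it is meant to represent.
print(f"population: {rush_hour_share(population):.3f}")
print(f"sample:     {rush_hour_share(sample):.3f}")
```

If the two shares are close, the sample is a reasonable stand-in for the population on this measure. A large gap would suggest a bigger or differently drawn sample is needed.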
6. Choosing the Right Data Type
Selecting the correct data type helps ensure meaningful analysis.
Example
- Traffic record dates stored in a date format rather than as plain text
- Enables analysis of:
  - Day-of-week trends
  - Seasonal patterns
  - Recurring congestion periods
Correct data types improve accuracy and analytical flexibility.
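The benefit of a proper date type can be sketched as follows. The date strings are made-up sample values, assumed to be in ISO format (YYYY-MM-DD). Once parsed, each value exposes its weekday and month directly.

```python
from collections import Counter
from datetime import date

# Hypothetical traffic record dates, stored as plain text.
raw_dates = ["2024-03-04", "2024-03-05", "2024-03-11", "2024-03-12", "2024-03-18"]

# Convert text to real date objects (assumes ISO format, YYYY-MM-DD).
parsed = [date.fromisoformat(text) for text in raw_dates]

# date.weekday() returns 0 for Monday through 6 for Sunday.
weekday_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Day-of-week trend: how many records fall on each weekday.
by_weekday = Counter(weekday_names[d.weekday()] for d in parsed)

# Seasonal pattern: how many records fall in each month.
by_month = Counter(d.month for d in parsed)

print(by_weekday)
print(by_month)
```

With the dates left as text, neither grouping is available without parsing first, and sorting or comparing only works reliably if every string follows the same format.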
7. Defining the Time Frame
The time period the data must cover affects other decisions, such as the sampling strategy and how the data is stored.
Historical data
- Data that already exists
- Useful when answers are needed quickly
Long-term data collection
- Used to observe trends over time
- Influences sampling strategy and storage choices
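When working from historical data, defining the time frame often comes down to filtering existing records to a chosen window. The sketch below assumes a hypothetical list of (date, vehicle count) pairs and an arbitrary first-quarter window. None of the values come from a real study.

```python
from datetime import date

# Hypothetical historical records: (observation date, vehicles counted).
records = [
    (date(2023, 11, 6), 1180),
    (date(2024, 1, 15), 1225),
    (date(2024, 2, 12), 1300),
    (date(2024, 4, 8), 1275),
]

# Assumed analysis window: the first quarter of 2024, inclusive at both ends.
start, end = date(2024, 1, 1), date(2024, 3, 31)

# Keep only the records whose date falls inside the window.
in_window = [(day, count) for day, count in records if start <= day <= end]

print(in_window)
```

A longer window captures more of the trend but means more data to store and sample, which is the trade-off noted above.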
8. Key Takeaways
- Data selection is a critical analytical decision
- Business questions guide what data is relevant
- First-, second-, and third-party data vary in reliability
- All data must be checked for accuracy and credibility
- Samples are often more practical than full populations
- Choosing appropriate data types improves insights
- Time frame influences data collection strategy
One-sentence summary
Effective data analysis depends on selecting relevant, reliable data by considering data sources, sampling, data types, and time frames in alignment with the problem being solved.
