When both variables are quantitative (continuous), the most common goal is to determine whether they move together in a systematic way. Some relationships are clearly linear, while others may be curved or more complex. Two major tools discussed here—Pearson correlation and distance correlation—address different aspects of dependence.
1. Pearson Correlation: Measuring Linear Association
What problem Pearson correlation solves
Pearson correlation is designed to quantify how strongly two continuous variables are linearly associated. In practical terms, it answers questions like:
- When $x$ is above its average, does $y$ also tend to be above its average?
- Do larger values of one variable tend to correspond to larger (or smaller) values of the other?
- Is the relationship reasonably well approximated by a straight line?
Data setup
Assume we observe $n$ paired observations $(x_i, y_i)$ for $i = 1, \dots, n$.
We compute the sample means:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$
Definition and intuition
Pearson correlation is computed as:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \,\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

The numerator measures whether deviations from the mean move together:
- If $(x_i - \bar{x})$ and $(y_i - \bar{y})$ tend to have the same sign, the numerator tends to be positive.
- If one tends to be positive when the other is negative, the numerator tends to be negative.
The denominator rescales the value so that the result becomes a standardized measure.
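As a sanity check, the formula above can be implemented directly. This is a minimal NumPy sketch; the function name `pearson` is just for illustration, and the sample data are made up:

```python
import numpy as np

def pearson(x, y):
    """Sample Pearson correlation, computed directly from the definition."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()           # deviations from the means
    num = np.sum(dx * dy)                         # co-movement of deviations
    den = np.sqrt(np.sum(dx**2) * np.sum(dy**2))  # standardizing denominator
    return num / den

# Larger x tends to pair with larger y -> strong positive correlation.
print(pearson([1, 2, 3, 4], [2, 4, 5, 9]))
```

The result agrees with `np.corrcoef`, which computes the same quantity.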
Range and interpretation
By the Cauchy–Schwarz inequality:

$$-1 \le r \le 1$$
Interpretation uses both magnitude and sign:
- $|r|$ measures strength of linear association.
- $r = 1$: perfect positive linear relationship (all points fall exactly on an increasing straight line).
- $r = -1$: perfect negative linear relationship (all points fall exactly on a decreasing straight line).
- $r = 0$: no linear association (a straight-line pattern is not visible; the variables may still have a nonlinear relationship).
- The sign indicates direction:
  - $r > 0$: larger-than-average $x$ values tend to correspond to larger-than-average $y$ values.
  - $r < 0$: larger-than-average $x$ values tend to correspond to smaller-than-average $y$ values.
When Pearson correlation becomes undefined
Pearson correlation requires variation in both variables. If either variable has no variability, the denominator is zero, making the correlation undefined (numerical software typically reports it as NaN).
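A quick illustration of this edge case on hypothetical toy data; NumPy reports the undefined value as NaN:

```python
import numpy as np

x = np.full(4, 5.0)                  # no variability in x: denominator is zero
y = np.array([1.0, 2.0, 3.0, 4.0])
with np.errstate(divide="ignore", invalid="ignore"):  # silence the 0/0 warning
    r = np.corrcoef(x, y)[0, 1]
print(r)  # nan: the correlation is undefined
```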
2. When Pearson Correlation Becomes Exactly +1 or -1
A key result is that Pearson correlation is exactly $\pm 1$ when the points lie exactly on a straight line:

$$y_i = a + b x_i \quad \text{for all } i$$
In this setting:
- If $b > 0$, the correlation is $+1$.
- If $b < 0$, the correlation is $-1$.
- If $b = 0$, then all $y_i$ are equal to $a$, so $y$ has zero variance and the correlation is undefined.
The reasoning is based on the fact that if $y_i = a + b x_i$, then $\bar{y} = a + b\bar{x}$, so:

$$y_i - \bar{y} = b\,(x_i - \bar{x})$$

All deviations of $y$ are exactly a constant multiple of the deviations of $x$, which produces perfect linear dependence: substituting into the formula for $r$, the numerator becomes $b\sum_i (x_i - \bar{x})^2$ and the denominator $|b|\sum_i (x_i - \bar{x})^2$, so $r = b/|b| = \pm 1$.
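This claim is easy to verify numerically; the sketch below uses arbitrary values $a = 3$ and $b = \pm 2$:

```python
import numpy as np

# Points exactly on y = a + b*x give r = +1 or -1, depending on the sign of b.
x = np.array([0.0, 1.0, 2.0, 3.0])
r_up = np.corrcoef(x, 3.0 + 2.0 * x)[0, 1]    # b > 0 -> r = +1
r_down = np.corrcoef(x, 3.0 - 2.0 * x)[0, 1]  # b < 0 -> r = -1
print(r_up, r_down)
```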
3. Testing Whether the Pearson Correlation Could Be “Due to Chance”
A correlation computed from a sample might appear non-zero just because of random variation, especially in smaller samples. To address this, a classical hypothesis test is used:
- Null hypothesis: the true correlation is zero (no linear association in the population).
- Alternative hypothesis: the true correlation is not zero.
A test statistic is computed:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

Under the null hypothesis, this statistic follows a Student's $t$-distribution with $n - 2$ degrees of freedom.
The p-value is computed as:

$$p = 2\,P\!\left(T_{n-2} \ge |t|\right)$$
If $p < 0.05$, the conclusion is that a correlation this far from zero would be unlikely to arise by chance alone, meaning there is statistically significant evidence of linear association.
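Assuming SciPy is available, the same test can be run with `scipy.stats.pearsonr` and checked against the manual $t$-statistic; the data below are simulated purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)   # true linear dependence plus noise

r, p = stats.pearsonr(x, y)         # r and the two-sided p-value

# Equivalent manual computation of the test statistic:
n = len(x)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)
print(r, p, p_manual)
```

The two p-values agree, since `pearsonr` uses the same null distribution.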
4. Distance Correlation: Measuring General Dependence (Not Just Linear)
Why distance correlation is needed
Pearson correlation is powerful for linear relationships but can completely miss nonlinear dependence.
For example, if $y$ follows a curved pattern in $x$, such as the parabola $y = x^2$, Pearson correlation may be close to zero even though $x$ and $y$ are strongly related.
Distance correlation addresses this gap by measuring statistical dependence more broadly, not restricted to straight-line relationships.
What distance correlation measures
Distance correlation captures the magnitude of dependence between variables. Importantly:
- It does not indicate the direction (no positive/negative notion like Pearson).
- It is designed so that zero indicates independence under broad conditions.
Core construction idea (high-level intuition)
Distance correlation is built from pairwise distances among observations.
For $i, j = 1, \dots, n$:
- Compute pairwise distances $a_{ij} = |x_i - x_j|$ and $b_{ij} = |y_i - y_j|$.
Then “center” these distances by subtracting row means and column means and adding back the grand mean, producing adjusted distances

$$A_{ij} = a_{ij} - \bar{a}_{i\cdot} - \bar{a}_{\cdot j} + \bar{a}_{\cdot\cdot}.$$

This adjustment ensures the distance matrix behaves like a mean-centered representation.
The same procedure is applied to the $b_{ij}$, producing adjusted distances $B_{ij}$.
Distance covariance is then formed by averaging the products $A_{ij} B_{ij}$, and distance variances are formed similarly from $A_{ij}^2$ and $B_{ij}^2$. The empirical (squared) distance correlation is essentially:

$$\mathrm{dCor}^2(x, y) = \frac{\mathrm{dCov}^2(x, y)}{\sqrt{\mathrm{dVar}(x)\,\mathrm{dVar}(y)}},$$

and the reported distance correlation is the positive square root of this quantity.
If the product of distance variances is zero, the distance correlation is undefined (analogous to Pearson needing variability).
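The whole construction can be sketched in a few lines of NumPy. This is a from-scratch illustration for one-dimensional variables, not an optimized implementation; the helper names are arbitrary:

```python
import numpy as np

def dist_corr(x, y):
    """Empirical distance correlation via double-centered distance matrices."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    a = np.abs(x[:, None] - x[None, :])   # pairwise distances a_ij
    b = np.abs(y[:, None] - y[None, :])   # pairwise distances b_ij

    def double_center(d):
        # Subtract row means and column means, add back the grand mean.
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

    A, B = double_center(a), double_center(b)
    dcov2 = (A * B).mean()     # squared distance covariance
    dvar_x = (A * A).mean()    # squared distance variance of x
    dvar_y = (B * B).mean()
    denom = np.sqrt(dvar_x * dvar_y)
    return np.sqrt(dcov2 / denom) if denom > 0 else float("nan")

# Parabola: Pearson is ~0 by symmetry, but distance correlation
# still reveals the dependence between x and y = x^2.
x = np.linspace(-1, 1, 13)
print(np.corrcoef(x, x**2)[0, 1], dist_corr(x, x**2))
```

For production use, optimized implementations exist (e.g. the `dcor` package), but the double-centering logic is the same.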
5. Why Pearson and Distance Correlation Can Differ Dramatically
The examples illustrate the strengths and limitations:
Four points on a straight line
- Pearson is near $-1$, showing an almost perfectly linear negative relationship.
- Distance correlation is near 1, confirming very strong dependence.
Thirteen points on a parabola
- Pearson is approximately 0, because the curved shape cancels linear association.
- Distance correlation is moderately positive, revealing non-linear dependence.
Twenty-five points on a lattice grid
- Pearson is 0 and distance correlation is 0, indicating no dependence pattern in either metric.
Five hundred random points
- Both correlations are close to zero, consistent with weak or absent dependence.
This set of examples reinforces a key lesson:
Pearson correlation detects linear patterns, while distance correlation detects broader dependence patterns.
6. Applying These Measures to Chicago Taxi Data: Distance vs. Time
Practical constraint: very large sample size
The taxi dataset has 217,631 observations, and distance correlation requires all pairwise distances, which scales poorly because it involves an $n \times n$ distance matrix. For $n$ in the hundreds of thousands, that quickly becomes computationally infeasible.
To address this, the analysis uses a 5% random sample without replacement.
The reasoning is that statistical theory supports that estimates computed from a properly drawn random sample can still be reliable and consistent indicators of the full dataset’s behavior, especially when the sample is still large.
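A sketch of the subsampling idea on synthetic data; the gamma/normal model below is purely illustrative (it is not the real taxi data), and it shows that a 5% sample without replacement recovers the full-data Pearson correlation closely:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
dist = rng.gamma(shape=2.0, scale=3.0, size=n)     # synthetic trip miles
mins = 2.5 * dist + rng.normal(scale=4.0, size=n)  # synthetic trip minutes

idx = rng.choice(n, size=n // 20, replace=False)   # 5% sample, no replacement
r_full = np.corrcoef(dist, mins)[0, 1]
r_samp = np.corrcoef(dist[idx], mins[idx])[0, 1]
print(r_full, r_samp)
```

With 10,000 sampled points, the standard error of $r$ is tiny, so the sampled estimate lands very close to the full-data value.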
Results on the sample
- Pearson correlation between trip distance and trip minutes: 0.8145, with a p-value effectively zero.
- Distance correlation: 0.8458.
Interpretation
Both numbers are high, meaning the association is strong. Pearson being high indicates that the relationship is strongly linear: longer distances tend to correspond to longer trip times. Distance correlation being slightly higher suggests that, beyond linearity, there may also be additional structure (for example, different speed regimes such as city vs. freeway driving) that still reflects strong dependence.
The key takeaway is that trip distance is highly informative about trip duration, even though the relationship is not perfectly linear due to variation in traffic, routes, and speeds.
Overall Takeaway
Pearson correlation is an excellent first tool when you care about straight-line association and direction. Distance correlation provides a broader view of dependence and can detect relationships Pearson may miss. In practical analysis, using both helps distinguish whether a relationship is primarily linear or whether meaningful nonlinear dependence exists as well.
