When both variables are quantitative (continuous), the most common goal is to determine whether they move together in a systematic way. Some relationships are clearly linear, while others may be curved or more complex. Two major tools discussed here—Pearson correlation and distance correlation—address different aspects of dependence.


1. Pearson Correlation: Measuring Linear Association

What problem Pearson correlation solves

Pearson correlation is designed to quantify how strongly two continuous variables are linearly associated. In practical terms, it answers questions like:

  • When $x$ is above its average, does $y$ also tend to be above its average?
  • Do larger values of one variable tend to correspond to larger (or smaller) values of the other?
  • Is the relationship reasonably well approximated by a straight line?

Data setup

Assume we observe $n$ paired observations $(x_i, y_i)$ for $i = 1, \dots, n$.
We compute the sample means:

  • $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  • $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$

Definition and intuition

Pearson correlation $r$ is computed as:

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

The numerator measures whether deviations from the mean move together:

  • If $x_i-\bar{x}$ and $y_i-\bar{y}$ tend to have the same sign, the numerator tends to be positive.
  • If one tends to be positive when the other is negative, the numerator tends to be negative.

The denominator rescales the numerator so that the result is a standardized, unit-free measure.
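As a concrete illustration, the formula can be computed directly from its definition. This is a minimal sketch in plain Python (the function name `pearson_r` is our own, not a library API):

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed directly from the definition:
    centered cross-products over the product of root sums of squares."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - xbar) ** 2 for xi in x)) * \
          math.sqrt(sum((yi - ybar) ** 2 for yi in y))
    if den == 0:
        raise ValueError("correlation undefined: a variable has no variability")
    return num / den

# A strong but imperfect positive linear trend
print(pearson_r([1, 2, 3, 4], [2, 4, 5, 8]))  # ≈ 0.98
```

In practice one would use a vetted implementation such as `numpy.corrcoef` or `scipy.stats.pearsonr`; the point here is only to make the numerator/denominator roles visible.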

Range and interpretation

By the Cauchy–Schwarz inequality:

  • $-1 \le r \le 1$

Interpretation uses both magnitude and sign:

  • $|r|$ measures the strength of linear association.
    • $r = 1$: perfect positive linear relationship (all points fall exactly on an increasing straight line).
    • $r = -1$: perfect negative linear relationship (all points fall exactly on a decreasing straight line).
    • $r = 0$: no linear association (no straight-line pattern is visible; a nonlinear relationship may still exist).
  • The sign indicates direction:
    • $r > 0$: larger-than-average $x$ values tend to correspond to larger-than-average $y$ values.
    • $r < 0$: larger-than-average $x$ values tend to correspond to smaller-than-average $y$ values.

When Pearson correlation becomes undefined

Pearson correlation requires variation in both variables. If either variable has no variability, the corresponding sum of squared deviations is zero, so the denominator is zero and the correlation is undefined.


2. When Pearson Correlation Becomes Exactly +1 or -1

A key result is that the Pearson correlation is exactly $\pm 1$ when the points lie exactly on a straight line: $y_i = a + b x_i$.

In this setting:

  • If $b > 0$, the correlation is $+1$.
  • If $b < 0$, the correlation is $-1$.
  • If $b = 0$, then all $y_i$ are equal to $a$, so $y$ has zero variance and the correlation is undefined.

The reasoning is based on the fact that if $y_i = a + b x_i$, then:

  • $\bar{y} = a + b\bar{x}$
  • $y_i - \bar{y} = b(x_i - \bar{x})$

So every deviation of $y$ is exactly a constant multiple of the corresponding deviation of $x$, which produces perfect linear dependence.
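This is easy to verify numerically. A small sketch using NumPy's `corrcoef`:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])

# Points exactly on a line with positive slope (b = 3): r should be +1
r_pos = np.corrcoef(x, 2.0 + 3.0 * x)[0, 1]

# Points exactly on a line with negative slope (b = -2): r should be -1
r_neg = np.corrcoef(x, 5.0 - 2.0 * x)[0, 1]

print(r_pos, r_neg)  # ≈ 1.0 and ≈ -1.0 up to floating-point error
```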


3. Testing Whether the Pearson Correlation Could Be “Due to Chance”

A correlation computed from a sample might appear non-zero just because of random variation, especially in smaller samples. To address this, a classical hypothesis test is used:

  • Null hypothesis: the true correlation is zero (no linear association in the population).
  • Alternative hypothesis: the true correlation is not zero.

A test statistic is computed:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

Under the null hypothesis, this statistic follows a Student's $t$-distribution with $n-2$ degrees of freedom.
The p-value is computed as:

$$p = 2 \cdot P(T > |t|)$$

If $p < 0.05$, the conclusion is that a correlation this large would be unlikely to arise by chance if the true correlation were zero, i.e. there is statistically significant evidence of linear association.


4. Distance Correlation: Measuring General Dependence (Not Just Linear)

Why distance correlation is needed

Pearson correlation is powerful for linear relationships but can completely miss nonlinear dependence.

For example, if $y$ follows a curved pattern like a parabola in $x$, Pearson correlation may be close to zero even though $x$ and $y$ are strongly related.

Distance correlation addresses this gap by measuring statistical dependence more broadly, not restricted to straight-line relationships.

What distance correlation measures

Distance correlation captures the magnitude of dependence between variables. Importantly:

  • It does not indicate the direction (no positive/negative notion like Pearson).
  • It is designed so that zero indicates independence under broad conditions.

Core construction idea (high-level intuition)

Distance correlation is built from pairwise distances among observations.

For $x$:

  • Compute pairwise distances $a_{ij} = |x_i - x_j|$

Then “center” these distances by subtracting row means and column means and adding back the grand mean, producing doubly centered distances $A_{ij}$. This adjustment ensures the distance matrix behaves like a mean-centered representation.

The same procedure is applied to $y$, producing centered distances $B_{ij}$.

Distance covariance is then formed by averaging the products $A_{ij}B_{ij}$, and distance variances are formed similarly from $A_{ij}^2$ and $B_{ij}^2$. The empirical distance correlation is essentially:

$$\mathcal{R}_n^2(x,y)=\frac{\mathcal{V}_n^2(x,y)}{\sqrt{\mathcal{V}_n^2(x,x)\cdot \mathcal{V}_n^2(y,y)}}$$

and the reported distance correlation is the positive square root of this quantity.

If the product of distance variances is zero, the distance correlation is undefined (analogous to Pearson needing variability).
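The construction above can be sketched in a few lines of NumPy. This is a minimal implementation of the empirical quantities described here (it builds full $n \times n$ matrices, so it is only suitable for modest $n$):

```python
import numpy as np

def distance_correlation(x, y):
    """Empirical distance correlation via doubly centered distance matrices."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def centered(v):
        d = np.abs(v[:, None] - v[None, :])  # pairwise distances a_ij
        # subtract row means and column means, add back the grand mean
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

    A, B = centered(x), centered(y)
    dcov2 = (A * B).mean()                    # squared distance covariance
    dvar2 = (A * A).mean() * (B * B).mean()   # product of squared distance variances
    if dvar2 == 0:
        raise ValueError("distance correlation undefined: no variability")
    return np.sqrt(dcov2 / np.sqrt(dvar2))    # positive square root of R_n^2
```

For exactly linear data, this returns 1; for a symmetric parabola, it returns a clearly positive value even though Pearson correlation is near zero.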


5. Why Pearson and Distance Correlation Can Differ Dramatically

The examples illustrate the strengths and limitations:

Four points on a straight line

  • Pearson is near $-1$, showing an almost perfectly linear negative relationship.
  • Distance correlation is near 1, confirming very strong dependence.

Thirteen points on a parabola

  • Pearson is approximately 0, because the curved shape cancels linear association.
  • Distance correlation is moderately positive, revealing non-linear dependence.

Twenty-five points on a lattice grid

  • Pearson is 0 and distance correlation is 0, indicating no dependence pattern in either metric.

Five hundred random points

  • Both correlations are close to zero, consistent with weak or absent dependence.

This set of examples reinforces a key lesson:
Pearson correlation detects linear patterns, while distance correlation detects broader dependence patterns.
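The parabola case is easy to reproduce. A quick sketch showing how symmetric curvature cancels the linear association:

```python
import numpy as np

# Thirteen points on a symmetric parabola
x = np.linspace(-3, 3, 13)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(r)  # the odd-symmetric cross-products cancel, so r is numerically zero
```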


6. Applying These Measures to Chicago Taxi Data: Distance vs. Time

Practical constraint: very large sample size

The taxi dataset has 217,631 observations, and distance correlation requires pairwise distances, which scales poorly because it involves an $n \times n$ distance matrix. That quickly becomes computationally infeasible.

To address this, the analysis uses a 5% random sample without replacement.

The reasoning is that, by standard sampling theory, estimates computed from a properly drawn random sample are reliable and consistent indicators of the full dataset's behavior, especially when the sample itself is still large.
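A sketch of this subsampling step with pandas, using a synthetic stand-in for the taxi table (the column names `trip_miles` and `trip_minutes` are assumptions here, not necessarily the dataset's actual schema):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for the taxi data; a real analysis would load the CSV instead.
df = pd.DataFrame({"trip_miles": rng.uniform(0.5, 20.0, size=10_000)})
df["trip_minutes"] = 3.0 * df["trip_miles"] + rng.normal(0.0, 5.0, size=len(df))

# 5% simple random sample without replacement
sample = df.sample(frac=0.05, replace=False, random_state=0)

r_full = df["trip_miles"].corr(df["trip_minutes"])          # Pearson on all rows
r_samp = sample["trip_miles"].corr(sample["trip_minutes"])  # Pearson on the 5% sample
print(r_full, r_samp)
```

Even with only 5% of the rows, the sample estimate typically lands within a few hundredths of the full-data value, which is what justifies subsampling before the $O(n^2)$ distance-correlation computation.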

Results on the sample

  • Pearson correlation between trip distance and trip minutes: 0.8145, with a p-value effectively zero.
  • Distance correlation: 0.8458.

Interpretation

Both numbers are high, meaning the association is strong. Pearson being high indicates that the relationship is strongly linear: longer distances tend to correspond to longer trip times. Distance correlation being slightly higher suggests that, beyond linearity, there may also be additional structure (for example, different speed regimes such as city vs. freeway driving) that still reflects strong dependence.

The key takeaway is that trip distance is highly informative about trip duration, even though the relationship is not perfectly linear due to variation in traffic, routes, and speeds.


Overall Takeaway

Pearson correlation is an excellent first tool when you care about straight-line association and direction. Distance correlation provides a broader view of dependence and can detect relationships Pearson may miss. In practical analysis, using both helps distinguish whether a relationship is primarily linear or whether meaningful nonlinear dependence exists as well.