Statistical inference involves using observed data to draw conclusions about quantities that cannot be directly observed. In most real-world problems, it is impossible or unethical to observe every case or every possible outcome, so inferences must be made based on limited samples.
1. Purpose of Statistical Inference
Statistical inference seeks to learn about unobserved or unknown quantities from numerical data.
For example, in a clinical trial comparing a new cancer drug to a standard treatment, the true five-year survival probabilities for each treatment in the population are unknown.
Since the entire population cannot be studied, conclusions about the true probabilities—and especially their difference—must be based on a sample of patients.
Even if every patient in the population could be given one treatment or the other, no patient can receive both, so inference is still needed to compare each observed outcome with the unobserved, hypothetical outcome under the other treatment.
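The point about unobserved outcomes can be made concrete with a small simulation. The sketch below is illustrative only: the survival probabilities (0.6 and 0.5), the sample size, and the 50/50 random assignment are assumptions chosen for the example, not values from any real trial. Each simulated patient has an outcome under both treatments, but only the outcome under the assigned treatment is revealed, and the difference in survival is estimated from those revealed outcomes.

```python
import random


def simulate_trial(n, p_new, p_std, seed=0):
    """Simulate n patients; each has two potential outcomes but reveals one."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        # Both potential outcomes exist inside the simulation only;
        # a real trial never observes both for the same patient.
        y_new = 1 if rng.random() < p_new else 0
        y_std = 1 if rng.random() < p_std else 0
        gets_new = rng.random() < 0.5  # randomized assignment (assumed 50/50)
        observed = y_new if gets_new else y_std
        patients.append({"gets_new": gets_new, "observed": observed})
    return patients


def estimated_difference(patients):
    """Difference in observed survival proportions: new minus standard."""
    new = [p["observed"] for p in patients if p["gets_new"]]
    std = [p["observed"] for p in patients if not p["gets_new"]]
    return sum(new) / len(new) - sum(std) / len(std)


patients = simulate_trial(n=2000, p_new=0.6, p_std=0.5)
diff_hat = estimated_difference(patients)
print(f"estimated difference in survival: {diff_hat:.3f}")
```

The estimate varies around the assumed true difference of 0.1 from one simulated sample to the next, which is exactly the uncertainty that inference has to quantify.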
2. Two Types of Estimands (Unobserved Quantities of Interest)
- Potentially observable quantities
  - Values that could, in principle, be observed but are not available in the data.
  - Examples:
    - Future observations of a process.
    - The outcome for a patient under the treatment they did not receive.
- Unobservable parameters
  - Quantities that characterize the underlying process that generated the data.
  - Examples: regression coefficients, overall survival probabilities, or variance parameters.
  - These are not directly observable but determine how the data behave.
The distinction between these two types of quantities is not always strict, but it helps clarify how a statistical model connects data to real-world questions.
3. Notation for Parameters, Data, and Predictions
- θ (theta): Unobserved parameter vector or population quantities of interest.
  Example: true survival probabilities for each treatment group.
- y: Observed data.
  Example: the number of survivors and deaths in each treatment group.
- ỹ (y-tilde): Unknown but potentially observable quantities.
  Example: outcomes that would have been seen if patients had received the other treatment, or outcomes for future patients.
All three symbols may represent multivariate quantities.
Notation conventions:
- Greek letters → parameters.
- Lowercase Roman letters → observed or observable scalars or vectors.
- Uppercase Roman letters → observed or observable matrices.
Vectors are column vectors by default; for instance, $u^T u$ is a scalar and $u u^T$ is an $n \times n$ matrix.
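As a quick check of the column-vector convention, the following sketch computes both products for a small vector. It uses plain Python lists rather than a linear algebra library, and the helper names `inner` and `outer` are introduced here purely for illustration.

```python
def inner(u, v):
    """u^T v for two column vectors stored as flat lists: a scalar."""
    return sum(a * b for a, b in zip(u, v))


def outer(u, v):
    """u v^T: a len(u) x len(v) matrix stored as a list of rows."""
    return [[a * b for b in v] for a in u]


u = [1.0, 2.0, 3.0]  # n = 3

scalar = inner(u, u)   # 1 + 4 + 9 = 14.0
matrix = outer(u, u)   # 3 x 3 matrix

print(scalar)
for row in matrix:
    print(row)
```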
4. Observational Units and Variables
In most studies, data are collected from $n$ observational units.
The dataset is written as $y = (y_1, \dots, y_n)$.
Example: in a clinical trial, $y_i = 1$ if patient $i$ is alive after five years and $y_i = 0$ otherwise.
If several outcomes are measured per unit, each $y_i$ is a vector, and the full dataset $y$ is an $n$-row matrix.
The outcomes $y$ are considered random because their observed values could have been different due to sampling and natural variation in the population.
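To illustrate why $y$ is treated as random, the sketch below draws two samples from the same population and shows that the observed data differ. The population of 10,000 patients with 55% survivors and the sample size of 50 are hypothetical values chosen only for the example.

```python
import random

rng = random.Random(1)

# Hypothetical finite population of survival indicators (1 = alive at five years).
population = [1] * 5500 + [0] * 4500


def draw_sample(n):
    """Return y = (y_1, ..., y_n) for n patients sampled without replacement."""
    return rng.sample(population, n)


y_first = draw_sample(50)
y_second = draw_sample(50)

# Same population, same design, different observed data.
print("survivors in first sample: ", sum(y_first))
print("survivors in second sample:", sum(y_second))
```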
5. Exchangeability
Statistical analyses often begin with the assumption that $y_1, \dots, y_n$ are exchangeable—that is, the joint probability $p(y_1, \dots, y_n)$ does not change when the indexes are permuted.
This means all units are treated symmetrically unless differences are explained by specific variables.
If the ordering itself carries information (as in a time series), the model would not be exchangeable.
Exchangeability is fundamental because it motivates the common modeling assumption that, given $\theta$, the $y_i$ are independent and identically distributed (iid) for $i = 1, \dots, n$, with $\theta \sim p(\theta)$.
In the clinical trial, each patient’s survival indicator $y_i$ can be modeled as iid Bernoulli with unknown survival probability θ.
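The permutation invariance can be verified numerically. The sketch below assumes a Beta(a, b) prior for θ, which is a choice made here for convenience (the text only requires some $p(\theta)$). The joint probability is built up one observation at a time by the chain rule, so the ordering enters the computation explicitly, yet every ordering of the same outcomes gives the same result.

```python
from itertools import permutations


def joint_probability(y, a=1.0, b=1.0):
    """p(y_1, ..., y_n) with y_i | theta iid Bernoulli(theta), theta ~ Beta(a, b).

    Uses the chain rule: the probability that the next outcome is 1, given
    the earlier outcomes, is (a + successes so far) / (a + b + count so far).
    """
    prob, successes = 1.0, 0
    for k, y_i in enumerate(y):
        p_one = (a + successes) / (a + b + k)
        prob *= p_one if y_i == 1 else 1.0 - p_one
        successes += y_i
    return prob


y = (1, 1, 0, 1, 0)  # hypothetical survival indicators for five patients

# Evaluate the joint probability under every distinct ordering of y.
probs = [joint_probability(perm) for perm in set(permutations(y))]
spread = max(probs) - min(probs)

print(f"p(y) = {joint_probability(y):.6f}")
print(f"largest difference across orderings: {spread:.2e}")
```

With a uniform prior (a = b = 1), three survivors, and two deaths, the joint probability is 1/60 for every ordering, up to floating-point rounding.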
6. Explanatory Variables (Covariates)
In many studies, additional variables are recorded for each unit but are not modeled as random.
In the clinical trial, these might include age or previous health conditions.
Such variables are explanatory variables or covariates, denoted by $x$ for a single unit and $X$ for all units.
If there are $k$ explanatory variables and $n$ units, $X$ is an $n \times k$ matrix.
Exchangeability can be extended to pairs $(x_i, y_i)$: after incorporating relevant covariates, the order of the indexes no longer matters.
Conditional on $x$, the distribution of $y$ is the same for all units; two units with the same covariates have identical distributions of outcomes.
Conversely, any explanatory variable can be moved into the outcome $y$ if we choose to model it as random.
This framework provides the foundation for regression analysis, experimental design, and causal modeling.
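A minimal sketch of these ideas follows. The logistic form of the model, the coefficient values, and the three covariate columns (a constant term, age in decades, and an indicator for a prior condition) are all assumptions made for illustration. The sketch shows the $n \times k$ layout of $X$ and that two patients with identical covariates get the same outcome distribution.

```python
import math


def survival_probability(x, beta):
    """P(y = 1 | x) under an assumed logistic model with coefficients beta."""
    linear = sum(b * x_j for b, x_j in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-linear))


# X is n x k: one row per patient, one column per explanatory variable.
# Columns (hypothetical): constant term, age in decades, prior condition (0/1).
X = [
    [1.0, 5.0, 0.0],
    [1.0, 7.0, 1.0],
    [1.0, 5.0, 0.0],  # same covariates as the first patient
]
beta = [2.0, -0.3, -0.8]  # hypothetical coefficients, not estimates

probs = [survival_probability(x, beta) for x in X]
for i, p in enumerate(probs, start=1):
    print(f"patient {i}: P(y = 1 | x) = {p:.3f}")
```

Patients 1 and 3 share a row of $X$ and therefore share a survival probability; patient 2 differs only because the covariates differ, not because of the index.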
7. Hierarchical (Multilevel) Modeling
When data are collected at multiple levels, hierarchical models are appropriate.
For example, suppose two medical treatments are tested in several cities.
Within each city, patients are exchangeable, and across cities, the city-level results can also be treated as exchangeable.
At both levels, explanatory variables may be included:
- Individual level: patient-specific factors such as age or prior health.
- Group level: city-level features such as hospital type or average socioeconomic status.
Conditional on these explanatory variables, exchangeability holds within and across levels.
Hierarchical modeling allows for both individual variation and group-level patterns, enabling information sharing across units while preserving local differences.
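The information sharing described above can be sketched with a simple two-level model. The assumptions here are a Beta(a, b) distribution for city-level survival probabilities, iid Bernoulli outcomes within each city, and hand-picked values of a and b; a full hierarchical analysis would estimate a and b from the data as well. The city counts are hypothetical.

```python
def partially_pooled(survivors, patients, a, b):
    """Posterior mean survival probability for each city.

    Assumes theta_j ~ Beta(a, b) across cities and, within city j,
    y_ij | theta_j iid Bernoulli(theta_j), so the posterior for theta_j
    is Beta(a + s_j, b + n_j - s_j) with mean (a + s_j) / (a + b + n_j).
    """
    return [(a + s) / (a + b + n) for s, n in zip(survivors, patients)]


# Hypothetical data for three cities: survivors and number of patients.
survivors = [9, 30, 2]
patients = [10, 60, 10]

raw = [s / n for s, n in zip(survivors, patients)]
pooled = partially_pooled(survivors, patients, a=5.0, b=5.0)

for r, p in zip(raw, pooled):
    print(f"raw {r:.2f} -> partially pooled {p:.2f}")
```

The small cities move from 0.90 to 0.70 and from 0.20 to 0.35, toward the assumed group-level mean of 0.5, while the large city stays at 0.50: each estimate borrows strength from the group but keeps its local signal.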
8. Core Principles
- Statistical inference connects observed data to unobserved or hypothetical quantities.
- θ represents unobserved parameters; y represents observed data; ỹ represents potential or future observations.
- Exchangeability justifies modeling data as iid given θ.
- Explanatory variables explain systematic differences among units while maintaining conditional exchangeability.
- Hierarchical models extend these principles to multiple levels of grouping and variation.
Together, these ideas form the conceptual and mathematical basis for probabilistic modeling, linking data, parameters, and predictions within the framework of statistical inference.
