Decision trees are motivated by how real-world decisions are often made: not all at once, but through a sequence of smaller, simpler choices. Instead of evaluating every possible alternative simultaneously, an incremental decision-making model breaks a complex decision into manageable steps. At each step, only a limited set of alternatives is considered, and the immediate cost of that decision is minimized. By repeating this process, a complex decision can be resolved efficiently and transparently.

A familiar everyday example is deciding whether milk is still safe to use. The final outcome may be to discard the milk or to use it for baking, but reaching that outcome usually involves intermediate checks, such as smell, expiration date, or appearance. Each check narrows down the possibilities and reduces uncertainty. Decision trees formalize this kind of reasoning in a data-driven and replicable way.


From Intuition to Data-Driven Decisions

Rather than relying on intuition, guesswork, or trial and error, a decision tree constructs its decision steps analytically. At each step, the algorithm considers one feature at a time and generates alternative splits based on observed data. The procedure is deterministic and therefore replicable: the same data will always lead to the same decision rules.

The objective is not merely to split the data, but to do so in a way that makes the distributions of outcomes as different as possible across the resulting segments. These outcome distributions determine the expected costs or consequences of decisions made within each segment. In this sense, decision trees link data patterns directly to decision quality.


Supervised Recursive Partitioning

Decision trees operate under supervision, meaning that a label variable is available. The algorithm uses this label to guide how the data should be partitioned. The core idea is recursive partitioning: the data are split into segments, and each segment may be split again using the same logic.

To understand this, consider a feature $x$ and a label $y$. Examining the values of $x$ alone may reveal no obvious segmentation. However, once $y$ is plotted against $x$, distinct patterns may emerge. These patterns suggest that different ranges of $x$ correspond to systematically different values of $y$.
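To make this concrete, here is a minimal Python sketch using entirely synthetic, hypothetical data: $x$ alone looks featureless, but $y$ shifts once $x$ crosses a threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration: x is uniform and shows no structure on its own,
# but y shifts systematically once x crosses 5.
x = rng.uniform(0, 10, size=200)
y = np.where(x <= 5, 2.0, 8.0) + rng.normal(0, 0.5, size=200)

print("mean y where x <= 5:", round(y[x <= 5].mean(), 2))   # close to 2
print("mean y where x  > 5:", round(y[x > 5].mean(), 2))    # close to 8
```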


Identifying an Optimal Dividing Line

Suppose a single split is made at $x = x_0$, creating two segments: $S_1 = \{x \le x_0\}$ and $S_2 = \{x > x_0\}$. The goal is to choose $x_0$ such that the values of $y$ within each segment are more tightly clustered around their own segment means than around the overall mean of $y$.

Let $\bar{y}_1$ and $\bar{y}_2$ denote the mean values of $y$ in $S_1$ and $S_2$, respectively, and let $\bar{y}$ denote the overall mean. The quality of a split is measured by the criterion $C = \sum_{y \in S_1}(y - \bar{y}_1)^2 + \sum_{y \in S_2}(y - \bar{y}_2)^2$.

A split is considered beneficial if this quantity is smaller than the total variation $\sum (y - \bar{y})^2$. In essence, the algorithm seeks the dividing line that minimizes the within-segment sum of squared deviations.
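As a minimal sketch of this criterion (the function name `split_cost` and the toy data are illustrative, not from the text), $C$ for a candidate split can be computed directly and compared against the total variation:

```python
import numpy as np

def split_cost(x, y, x0):
    """C: within-segment sum of squared deviations for a split at x = x0."""
    left, right = y[x <= x0], y[x > x0]
    if len(left) == 0 or len(right) == 0:          # a one-sided split is no split at all
        return np.inf
    return ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()

# Hypothetical toy data where y jumps between x = 3 and x = 6.
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 1.9, 2.0, 8.2, 7.8, 8.0])

total = ((y - y.mean()) ** 2).sum()                # total variation around the overall mean
print(split_cost(x, y, 3.0), "vs", total)          # 0.1 vs 54.1: the split is beneficial
```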


A Straightforward Search Algorithm

To find the optimal split, the algorithm evaluates each observed value of $x$ as a candidate dividing line. For each candidate, the data are divided into left and right subsets, and the within-segment sum of squared deviations from the segment means is computed. The candidate that yields the smallest value of $C$ is selected as the optimal split.

This exhaustive but simple procedure guarantees that each individual split is optimal according to the specified criterion; the tree as a whole is built greedily, one locally optimal split at a time.
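A minimal sketch of this exhaustive search follows (`best_split` is a hypothetical helper name; the toy data are the same as above):

```python
import numpy as np

def best_split(x, y):
    """Try each observed x value as a dividing line; keep the one with smallest C."""
    best_x0, best_c = None, np.inf
    for x0 in np.unique(x)[:-1]:                   # skip the maximum: it leaves S_2 empty
        left, right = y[x <= x0], y[x > x0]
        c = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if c < best_c:
            best_x0, best_c = x0, c
    return best_x0, best_c

x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 1.9, 2.0, 8.2, 7.8, 8.0])
print(best_split(x, y))                            # (3.0, 0.1): splits where y jumps
```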


Recursive Splitting and Stopping Rules

After the first split is made, each resulting segment is examined separately. If a segment still exhibits variability in $y$, the same procedure is applied again within that segment. This recursive process continues until no further reduction in variability is possible.

When the sum of squared deviations within a segment reaches zero, all observations in that segment share the same value of $y$. At that point, further splitting is unnecessary, and the algorithm stops. The final result is a set of disjoint segments, each associated with a constant outcome.
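Putting the pieces together, one possible sketch of the recursive procedure (`grow` and the nested-dict tree representation are illustrative choices, not a prescribed implementation) stops as soon as a segment's sum of squared deviations reaches zero, within floating-point tolerance:

```python
import numpy as np

def best_split(x, y):
    """Cheapest dividing line by within-segment sum of squares (as sketched above)."""
    candidates = np.unique(x)[:-1]                  # skip the maximum: it leaves S_2 empty
    costs = [((y[x <= x0] - y[x <= x0].mean()) ** 2).sum()
             + ((y[x > x0] - y[x > x0].mean()) ** 2).sum() for x0 in candidates]
    i = int(np.argmin(costs))
    return candidates[i], costs[i]

def grow(x, y, tol=1e-12):
    """Recursively partition until each segment's y values are (nearly) constant."""
    if ((y - y.mean()) ** 2).sum() <= tol or np.unique(x).size == 1:
        return {"predict": float(y.mean())}         # leaf: one constant outcome per segment
    x0, _ = best_split(x, y)
    mask = x <= x0
    return {"x0": float(x0),
            "left": grow(x[mask], y[mask], tol),
            "right": grow(x[~mask], y[~mask], tol)}

x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([2.0, 2.0, 4.0, 8.0, 8.0, 8.0])
print(grow(x, y))   # first split at x0 = 3.0, then the left segment splits again at 2.0
```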


Segments as Rules: Business Logic and Equations

Each segment produced by this process can be expressed either as a set of business rules or as a mathematical equation using indicator functions. For example, a rule-based description might specify outcomes using nested conditions on $x$, while an equivalent mathematical representation assigns values of $y$ using indicator functions $I(\cdot)$, where $I(b)=1$ if condition $b$ is true and $0$ otherwise.
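For concreteness, consider a hypothetical tree (the thresholds and outcome values are purely illustrative) with splits at $x = 2$ and $x = 3$. As business rules: IF $x \le 2$ THEN $y = 2$; ELSE IF $x \le 3$ THEN $y = 4$; ELSE $y = 8$. The equivalent equation is

$$\hat{y} = 2\,I(x \le 2) + 4\,I(2 < x \le 3) + 8\,I(x > 3),$$

where exactly one indicator equals 1 for any given $x$, so each observation receives the constant outcome of its segment.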

This dual representation highlights an important strength of decision trees: they are both interpretable and formally precise.


Decision Trees as Rule-Based Models

At its core, a decision tree is a collection of rules that divide observations into mutually exclusive segments. Each rule corresponds to a path through the tree, defined by feature-based conditions. Across these segments, the distributions of the target variable are intentionally made as different as possible.

In fields such as medicine and public health, this approach is often referred to as recursive partitioning. Regardless of the terminology, the underlying idea remains the same: complex decisions are simplified by breaking them into a sequence of data-driven, locally optimal choices.


Conceptual Summary

Decision trees formalize incremental decision-making by recursively partitioning data based on feature values. Each split is chosen to maximize separation in outcome distributions, thereby reducing uncertainty and decision cost. The resulting model consists of transparent, interpretable rules that can be applied consistently across similar data, making decision trees both analytically rigorous and practically useful.