Purpose of Cluster Profiling

Clustering algorithms such as k-means or hierarchical clustering are effective at discovering latent group structures in data. However, their outputs—cluster IDs—are often difficult to interpret directly. Cluster profiling addresses this limitation by explaining what defines each cluster in terms of the original features.

A common and effective approach to cluster profiling is to train a classification tree that predicts cluster membership. The goal is not to rediscover the clusters, but to translate them into interpretable decision rules that characterize typical observations within each cluster.


Role of Decision Trees in Cluster Profiling

A decision tree is particularly well suited for cluster profiling because:

  • It produces explicit, rule-based descriptions
  • Each leaf node corresponds to a homogeneous subgroup
  • Rules are easy to interpret and communicate
  • Feature thresholds reveal dominant drivers of cluster separation

In this context, the decision tree is used purely as an interpretability tool, not as a clustering method.


Scikit-Learn Decision Tree Module

Scikit-learn provides decision tree implementations in the tree module:

  • tree.DecisionTreeClassifier() for categorical targets
  • tree.DecisionTreeRegressor() for continuous targets
  • tree.plot_tree() for visualizing the learned tree structure
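A minimal sketch of the classifier API on toy data (the feature values and labels below are invented for illustration, with class labels standing in for cluster IDs):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy numeric data: two features, two labels standing in for cluster IDs.
X = [[0.0, 1.0], [0.5, 1.5], [3.0, 0.2], [3.5, 0.1]]
y = [0, 0, 1, 1]

clf = DecisionTreeClassifier(criterion="entropy", max_depth=2)
clf.fit(X, y)

# tree.plot_tree(clf, feature_names=[...]) would render the fitted structure.
print(clf.predict([[0.2, 1.2], [3.2, 0.3]]))
```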

Important Usage Assumptions and Constraints

The Scikit-learn decision tree implementation has several practical constraints that must be respected:

  1. All input features must be numeric
    • Nominal and ordinal variables encoded as strings are not supported
    • Categorical features must be encoded numerically in advance
  2. No missing values are allowed
    • All observations must be complete across all input features
    • Missing-value handling must be done prior to modeling
  3. Variable names are not automatically retained
    • Feature names must be manually supplied when visualizing the tree
    • Otherwise, generic feature indices are displayed

These constraints influence data preprocessing decisions and must be considered when designing the workflow.
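A sketch of the preprocessing these three constraints imply, using a small hypothetical data frame (the column names and values are invented for illustration):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical raw data: a string-encoded categorical and a missing value.
df = pd.DataFrame({
    "temp": [12.0, 25.0, None, 18.0, 22.0],
    "season": ["winter", "summer", "summer", "spring", "summer"],
    "cluster": [0, 1, 1, 0, 1],
})

# Constraint 2: handle missing values before modeling (dropped here).
df = df.dropna()

# Constraint 1: encode the categorical feature numerically (one-hot here).
X = pd.get_dummies(df[["temp", "season"]], columns=["season"])

clf = DecisionTreeClassifier().fit(X, df["cluster"])

# Constraint 3: keep the column names to supply when visualizing, e.g.
# tree.plot_tree(clf, feature_names=list(X.columns))
print(list(X.columns))
```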


Methodology: Characterizing Clusters with a Classification Tree

The process of cluster profiling using a decision tree follows a clear sequence of steps:

  1. Train a clustering model
    • The clustering algorithm discovers latent structure
    • Cluster IDs are assigned to all observations
  2. Train a classification tree
    • The cluster ID is treated as the nominal target variable
    • The same features used for clustering are used as predictors
    • The objective is high classification accuracy and pure leaf nodes
  3. Extract decision rules
    • Focus on leaf nodes with zero impurity
    • These nodes describe dominant patterns within clusters
  4. Interpret clusters
    • Translate decision rules into real-world descriptions
    • Do not expect the tree to recreate the clustering process

A key principle is that clustering has already completed its task. The decision tree does not compete with the clustering algorithm; it explains its results.
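The four steps can be sketched end to end on synthetic data (the two blobs below are invented stand-ins for real features, not the bike-share data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Two synthetic blobs standing in for (humidity, windspeed)-like features.
X = np.vstack([
    rng.normal([70, 10], 3, size=(100, 2)),
    rng.normal([45, 20], 3, size=(100, 2)),
])

# Step 1: clustering assigns the labels we want to explain.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: a shallow tree learns to predict the cluster IDs from the same
# features; it explains the clustering, it does not redo it.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
tree.fit(X, labels)

# Steps 3-4 read rules off the pure leaves; accuracy near 1.0 signals
# that the clusters are cleanly describable.
print(f"profiling accuracy: {tree.score(X, labels):.3f}")
```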


Capital Bike Share Dataset

The dataset used for illustration comes from the Capital Bikeshare program in Washington, DC. It captures bicycle rental activity under varying weather conditions between 2011 and 2012.

Features Used

Three continuous (interval-scale) features are selected:

  • temp: hourly temperature in Celsius
  • humidity: relative humidity in percent
  • windspeed: wind speed in km/h

A total of 10,886 observations are used, all of which are free of missing values across these features.


Determining the Number of Clusters

Both the Elbow Method and the Silhouette Method indicate that a two-cluster solution is optimal. This suggests that the data naturally separates into two dominant weather-related usage patterns.
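A sketch of both diagnostics on synthetic two-blob data (not the bike-share data itself):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (80, 3)), rng.normal(5, 1, (80, 3))])

for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Elbow method: look for the bend in inertia (within-cluster SSE);
    # silhouette method: pick the k with the highest average score.
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```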


Two-Cluster Solution: Summary Statistics

The resulting clusters have nearly equal sizes:

  • Cluster 0: 5,466 observations
  • Cluster 1: 5,420 observations

Feature-wise comparison shows:

  • Mean temperature is nearly identical across clusters
  • Cluster 0 has higher humidity and lower wind speed
  • Cluster 1 has lower humidity and higher wind speed

Temperature does not appear to be a primary discriminating factor between clusters.


Classification Tree Specification

To profile the clusters:

  • Target variable: Cluster ID (nominal)
  • Predictors: temperature, humidity, wind speed
  • Maximum depth: 4 (up to four levels of binary splits)
  • Splitting criterion: Entropy

The limited depth enforces interpretability while allowing sufficient flexibility.
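In scikit-learn terms, this specification is a single estimator; the fit call is commented out because the actual feature matrix and cluster labels are not shown here, and the column names in it are hypothetical:

```python
from sklearn.tree import DecisionTreeClassifier

# Entropy splits, depth capped at 4 for readability.
profiler = DecisionTreeClassifier(criterion="entropy", max_depth=4)
# profiler.fit(X[["temp", "humidity", "windspeed"]], cluster_id)
print(profiler.get_params()["criterion"], profiler.get_params()["max_depth"])
```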


Classification Performance

The classification tree correctly predicts 99.63% of cluster memberships. This indicates that:

  • Cluster boundaries are well defined
  • Weather variables strongly explain cluster membership
  • Clusters are internally consistent

Importantly, the decision tree does not use temperature at any split, reinforcing the earlier observation that temperature plays a minimal role in distinguishing clusters.
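The "temperature is never split on" observation can be checked via the fitted tree's feature_importances_ attribute; below is a synthetic reconstruction (not the real data) in which temperature is drawn identically for both clusters, so its importance comes out as zero:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
n = 100
temp = rng.normal(20, 5, 2 * n)                          # same for both clusters
hum = np.r_[rng.normal(75, 3, n), rng.normal(45, 3, n)]  # separates clusters
wind = np.r_[rng.normal(10, 2, n), rng.normal(20, 2, n)]
X = np.column_stack([temp, hum, wind])
y = np.array([0] * n + [1] * n)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=4,
                              random_state=0).fit(X, y)

# An importance of 0.0 means the feature was never used in any split.
print(dict(zip(["temp", "humidity", "windspeed"],
               tree.feature_importances_.round(3))))
```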


Interpreting Node-Level Decision Rules

Each leaf node corresponds to a rule composed of inequalities involving humidity and wind speed. Leaf nodes with zero entropy represent perfectly pure segments, meaning all observations in that node belong to the same cluster.
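sklearn.tree.export_text prints such rules directly; a sketch on synthetic two-feature data (the feature names and cluster structure are illustrative, not the real dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
# Synthetic stand-ins: cluster 0 is humid, cluster 1 is dry.
X = np.vstack([rng.normal([20, 75], 2, (60, 2)),
               rng.normal([20, 45], 2, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)

# Each printed root-to-leaf path is one rule; leaves with entropy 0 are pure.
print(export_text(tree, feature_names=["temp", "humidity"]))
```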

Dominant Rules for Cluster 0

  • $humidity > 63.5$ and $windspeed \le 27.001$
    Covers 93% of Cluster 0 observations

Additional smaller segments include:

  • $61.5 < humidity \le 63.5$ and $windspeed \le 14$
  • $humidity > 66.5$ and $windspeed > 27.001$

These rules indicate humid conditions with moderate winds.


Dominant Rules for Cluster 1

  • $humidity \le 59.5$
    Covers 93% of Cluster 1 observations

Other minor segments involve:

  • $59.5 < humidity \le 61.5$ and $windspeed > 8$
  • $61.5 < humidity \le 62.5$ and $windspeed > 14$

These rules indicate dry conditions, often accompanied by stronger winds.


Final Cluster Interpretation

Cluster 0: Tourist Season Conditions

  • Represents warm, humid days
  • Typically includes a gentle breeze
  • Weather is consistent with summer tourism activity
  • High humidity is the dominant defining feature

Cluster 1: Typical School and Working Days

  • Represents dry, comfortable weather
  • Common during spring and autumn
  • Lower humidity is the defining characteristic
  • Weather conditions are conducive to routine commuting

Key Conceptual Takeaways

  • Decision trees provide transparent explanations of clusters
  • High classification accuracy validates cluster separability
  • Pure leaf nodes reveal core cluster characteristics
  • Not all original features are necessarily informative
  • Cluster profiling complements clustering rather than replacing it

By combining unsupervised clustering with supervised decision trees, cluster profiling bridges the gap between statistical discovery and human interpretability, transforming abstract cluster labels into actionable and understandable insights.