5.2 Statistical Inference:
Readings from: An Introduction to Statistical Learning with applications in R, https://www.statlearning.com/
5.2.8 Tree-Based Methods
This chapter introduces tree-based methods for regression and classification, which work by stratifying or segmenting (i.e. dividing) the predictor space into simple regions using splitting rules that form a decision tree. These methods are easy to interpret, but usually less accurate than other supervised learning techniques. To improve accuracy, this chapter also covers advanced methods like bagging, random forests, boosting, and Bayesian additive regression trees, which combine many trees to enhance predictions, though this comes with some loss of interpretability.
5.2.8.1 The Basics of Decision Trees
Decision trees are applicable to both regression and classification problems.
5.2.8.1.1 Regression Trees
Decision trees are a class of predictive modeling algorithms that partition the predictor space (the set of all possible values of the input variables) into distinct and non-overlapping regions, using a sequence of decision rules based on the values of the input variables.
Each rule is based on the value of one input variable, and each split divides the data into two branches according to whether the rule is true or false.
In regression, the goal is to predict a numerical (continuous) response variable. The decision tree algorithm recursively splits the predictor space into regions by finding the variable and split value that best separate the data to minimize prediction error. At each split (internal node), the data are divided based on whether the value of a particular predictor variable is less than, or greater than or equal to, a threshold. Once built, the tree can also be used for prediction.
5.2.8.1.2 Prediction via Stratification of the Feature Space
To make a prediction for a new observation, you start at the top of the tree and follow the path dictated by that observation’s values for each predictor. Each internal node represents a decision rule, and each leaf represents the predicted value for the response variable, which is the average of the training responses in that region.
Decision trees segment the predictor space using sequential binary splits based on variable thresholds. At each internal node, the left branch corresponds to observations where the splitting condition is true and the right branch corresponds to the opposite. Each terminal node (leaf) holds the mean response for all training points that fall in that region.
Data Preprocessing
Missing Data: Before fitting the model, any observations with missing values are removed from the dataset. This is necessary because regression methods require a complete set of predictor and response values.
Log Transformation: The response variable is log-transformed (i.e., replaced by its natural logarithm) so that its distribution is closer to normal (bell-shaped). This is a common step in regression analysis when the response variable is skewed or has outliers, because it can make modeling and interpretation more stable and meaningful.
Prediction via Stratification of the Feature Space
Building a regression tree for prediction is a two-step process.
Step 1: Dividing the Predictor Space
The goal is to divide the predictor space—meaning the set of all possible combinations of the input variables \(X_1, X_2, \ldots, X_p\)—into \(J\) distinct, non-overlapping regions \(R_1, R_2, \ldots, R_J\).
Each region is like a box in the multi-dimensional space defined by the features.
Step 2: Making Predictions
For each region \(R_j\), all observations that fall inside it (i.e., all data points whose feature values meet the region’s conditions) are assigned the same prediction.
This prediction is the average (mean) of the response variable (\(y\)) for all training observations in \(R_j\).
If a new observation falls into region \(R_2\), its predicted value is just the mean of the response values for all training data points in \(R_2\).
How are regions constructed?
In theory, the regions could be of any shape or size; they don’t have to be boxes or rectangles. In practice, we restrict ourselves to “boxes” (high-dimensional rectangles) because they are easier to interpret and computationally feasible. Ultimately, the goal is to minimize the Residual Sum of Squares (RSS).
- The ideal partition is the one that minimizes the total sum of squared differences between the observed responses and the mean response in each region:
\[ \sum_{j=1}^J \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2 \]
- \(y_i\) is the actual value for observation \(i\)
- \(\hat{y}_{R_j}\) is the average response for region \(R_j\)
This is called the residual sum of squares (RSS) and is a measure of how well the regions fit the data.
Why Not Try All Possibilities?
Trying every possible way to divide the predictor space into \(J\) regions would require checking an enormous number of combinations, which is computationally impossible for all but the very smallest datasets.
The Greedy, Top-Down Approach (aka Recursive Binary Splitting)
“Top-down” means we start with the entire dataset considered as one region. At each step, we look for the single best way to split one region into two new regions, according to a simple decision rule.
“Greedy” means that at each step, we make the locally optimal choice: we pick the split that gives the greatest immediate reduction in RSS, without considering how this split will affect future splits. We do not look ahead or try to optimize the tree globally; we only optimize at the current step.
Recursive Binary Splitting Step-by-Step Process
For each feature \(X_j\) and for each possible value \(s\) that feature can take:
- Divide the observations into two regions:
\[ R_1(j, s) = \{X \mid X_j < s\} \]
\[ R_2(j, s) = \{X \mid X_j \geq s\} \]
- For each region, calculate the mean of the response for training observations inside that region.
- Compute the sum of squared differences from the mean (RSS) for both regions.
- Find the feature \(j\) and split point \(s\) that result in the smallest total RSS.
- This is the greedy choice, and it is taken as the best split at this step.
- Once the split is made on the data into two regions, repeat this recursively:
- For each existing region, find the best new split (best feature and best cutpoint) that further reduces the RSS within that region.
- At each stage, only one region is split at a time (not all at once). This recursive splitting continues, region by region, always choosing the locally best split at each step. Splitting stops when a region reaches some minimal size (e.g., no region has fewer than five observations) or when another condition is met (such as a minimum reduction in RSS).
- Once the regions \(R_1, R_2, \ldots, R_J\) have been created, the tree is complete. To make a prediction for a new observation, find which region it falls into (by following the tree’s decision rules) and assign it the mean response value of that region, computed from the training data.
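As a concrete illustration of the greedy split search described above, here is a minimal Python sketch (the readings use R; the data, settings, and variable names below are illustrative assumptions, not from the text). It scans every predictor and every candidate cutpoint and returns the pair \((j, s)\) with the smallest total RSS.

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the single split (feature j, cutpoint s) that minimizes total RSS."""
    n, p = X.shape
    best = None  # (j, s, rss)
    for j in range(p):
        values = np.unique(X[:, j])
        # Candidate cutpoints: midpoints between consecutive distinct values,
        # so both resulting regions are guaranteed to be non-empty.
        for s in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or rss < best[2]:
                best = (j, s, rss)
    return best

# Toy data: the response jumps when the first predictor crosses 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))
y = np.where(X[:, 0] < 0.5, 1.0, 5.0) + rng.normal(scale=0.2, size=100)
print(best_split(X, y))  # expect feature 0 with a cutpoint near 0.5
```

Growing a full tree amounts to applying this search recursively inside each resulting region until a stopping rule (such as a minimum of five observations per region) is met.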
Limitations of Regression Decision Trees
The tree presented is likely an over-simplification of the real, potentially complex, relationship among predictors. It does not capture interactions or nuances beyond the splits used.
Advantages of Regression Decision Trees
- They are easy to interpret: the path from the root to any leaf can be described in plain language. A graphical representation makes it clear how predictions are made and which features matter most in different regions.
- Other regression models (such as linear regression, polynomial regression, or more flexible methods) may fit the data more closely, but they are often harder to interpret than a simple tree.
5.2.8.1.2.1 Pruning
The process described above may produce good predictions, but is likely to overfit the data. This is where we can introduce tree pruning to balance the fit and complexity.
Overfitting occurs because, as we continue to split on the training data, the tree will typically make very accurate predictions on the training set (low “training error”), but it will likely perform poorly on new, unseen data (high “test error”), because it has captured too much of the idiosyncratic noise in the training data.
There is a trade-off between complexity and generalization. A smaller tree (with fewer splits and larger regions) generally has:
- Lower variance: Its predictions are less sensitive to small changes in the training data, so its test error is often lower.
- Greater interpretability: It’s easier to visualize and understand.
- Slightly higher bias: It may be less accurate on the training data because it doesn’t fit all the fine details.
Therefore, the goal is to find a tree size (number of splits/leaves) that best balances good generalization (low test error) with simplicity and interpretability.
Naive Stopping Strategy
One naive strategy is to stop growing the tree once the reduction in RSS at each split falls below a threshold (i.e., only split if the improvement is “large enough”). This can prevent overfitting by not making splits that don’t help much. The limitation of this approach is that it can be shortsighted: sometimes an early split may not seem helpful (little reduction in RSS) but is necessary to enable a later, very helpful split. If we don’t allow the early split, we may miss important structure in the data.
Grow Large, Then Prune Strategy
The preferred approach is to grow a large tree first, allowing many splits, even those that seem only marginally useful. Once the large tree (\(T_0\)) is built, we then prune it back, removing splits and branches to produce a series of smaller subtrees. Each subtree is defined by a subset of the splits of the original large tree. The goal is to find the subtree that gives the best performance on new data, i.e., the lowest test error rate.
Subtree selection (the pruning process) starts with an estimate of how well each candidate subtree will predict on new, unseen data. This is done by using methods such as cross-validation or a validation set to measure the prediction error on data not used to build the tree. A challenge is that a large tree has an enormous number of possible subtrees, so it is not feasible to check them all.
To efficiently search for good subtrees, we use a procedure called cost complexity pruning (also known as weakest link pruning). Rather than examining every possible subtree, we consider a sequence of subtrees defined by a tuning parameter \(\alpha\).
For each value of \(\alpha\), we define a measure:
\[ \sum_{m=1}^{|T|} \sum_{i: x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T| \]
- The first term is the total RSS (fit to the training data).
- The second term penalizes large trees (\(|T|\) is the number of terminal nodes in the tree), with the strength of the penalty determined by \(\alpha\).
- When \(\alpha = 0\), we get the fully grown tree (no penalty for complexity).
- As \(\alpha\) increases, the penalty for extra leaves grows, so optimal trees become smaller and simpler.
As you increase \(\alpha\) from zero, branches are pruned from the fully grown tree in a nested, predictable fashion, so that for each \(\alpha\), there’s a single best subtree. The process yields a sequence of trees, ranging from the largest (unpruned) tree to the smallest (just a root node).
To pick the optimal value of \(\alpha\), and hence the best subtree, we use K-fold cross-validation:
- Split the data into \(K\) roughly equal parts (“folds”).
- For each fold:
- Grow a large tree and prune it using the remaining \(K-1\) folds.
- Use the left-out fold to estimate the prediction error for each candidate subtree.
- Repeat for all folds and average the error for each subtree (or each value of \(\alpha\)).
- Choose the value of \(\alpha\) that minimizes the average cross-validated prediction error.
- Return the subtree corresponding to this optimal \(\alpha\).
Other considerations:
- The penalty parameter \(\alpha\) controls the trade-off between goodness of fit (low RSS) and simplicity (small number of leaves).
- This is analogous to regularization in linear models (like Lasso), where a penalty term discourages overfitting by shrinking coefficients or reducing the number of predictors.
- As \(\alpha\) increases, the selected tree becomes smaller, prioritizing generalization over fitting every detail of the training data. The sequence of pruned subtrees is nested. As you prune, each smaller tree is a subset of the previous larger tree.
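As a hedged sketch of how cost complexity pruning is carried out in practice: scikit-learn exposes the same idea through its `ccp_alpha` parameter, and cross-validation can be used to select it. The dataset and settings below are placeholders for illustration only, not from the text.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Placeholder data standing in for a real training set.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# 1. Grow a large tree and recover the alphas corresponding to its nested subtrees.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))

# 2. K-fold cross-validation (K = 5) over alpha; GridSearchCV maximizes a score,
#    so we use negative mean squared error.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": alphas},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

print("chosen alpha:", search.best_params_["ccp_alpha"])
print("terminal nodes in pruned tree:", search.best_estimator_.get_n_leaves())
```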
5.2.8.1.3 Classification Trees
A classification tree is a type of decision tree used specifically for predicting outcomes that are qualitative or categorical—meaning the responses fall into distinct classes or groups (for example, “spam” vs “not spam,” “disease” vs “no disease,” etc.). This is in contrast to a regression tree, which is used when the predicted outcome is quantitative or numerical (for example, predicting a person’s income or the price of a house).
Key differences:
In a regression tree, for any input, the tree outputs the mean value of the training observations in the terminal node (final leaf) where the input belongs. In a classification tree, the output is the most frequently occurring class among the training observations in the terminal node where the input falls.
General Example: Suppose you want to predict whether a customer will buy a product (“Yes” or “No”) based on their age and income. A classification tree will assign each customer to the most common class (“Yes” or “No”) among similar customers (those in the same terminal node), while a regression tree would predict an average value (like expected spending).
Interpreting Classification Trees
When using classification trees, two things are of interest:
- Class Prediction: For each region (i.e., each terminal node of the tree), the tree predicts the most common class among the training data in that region.
- Class Proportions: Besides just the predicted class, we often also care about the proportion of each class in a region. For example, if a node has 90% “Yes” and 10% “No,” we’re more confident in predicting “Yes” than if the split was 51% to 49%.
Growing the Tree: Binary Splitting Both regression and classification trees are built using a process called recursive binary splitting:
At each step, the data is split into two groups (“binary”), based on some criterion, to maximize the “purity” of the resulting groups. This continues until a stopping criterion is reached (like each node has a minimum number of observations, or a maximum depth).
Which Criterion to Use?
Regression Trees: Commonly use Residual Sum of Squares (RSS) to decide where to split. Classification Trees: RSS is not suitable, so other metrics are used.
Classification Error Rate The classification error rate is a simple measure of how often a randomly chosen observation in a region does not belong to the most common class in that region.
\[ E = 1 - \max_k(\hat{p}_{mk}), \] where:
- \(\hat{p}_{mk}\) is the proportion of training observations in node \(m\) belonging to class \(k\).
- \(\max_k(\hat{p}_{mk})\) gives the highest class proportion (i.e., the most common class in the node).
Generalized Example:
If a node has 80% “A” and 20% “B,” then \(E = 1 - 0.8 = 0.2\). This means 20% of the training data in that region would be misclassified if we always predict the most common class.
Node Purity and Alternative Metrics
Classification error rate, while intuitive, is not very sensitive for splitting decisions. Two main alternatives are used in practice because they are more sensitive to how mixed or pure a node is:
1. Gini Index
\[ G = \sum_{k=1}^K \hat{p}_{mk}(1 - \hat{p}_{mk}) \]
- \(K\) is the number of possible classes. Measures the “total variance” among the class proportions. Node purity: If a node contains only one class (\(\hat{p}_{mk} = 1\) for some \(k\)), the Gini index is 0 (maximum purity).
If classes are mixed (\(\hat{p}_{mk}\) are closer to each other), the index is higher (less pure). Used to evaluate the quality of a split—lower is better.
Generalized Example:
For a node with 50% “A” and 50% “B” \(\hat{p}_{mA} = 0.5\), \(\hat{p}_{mB} = 0.5\) \[ G = 0.5(1-0.5) + 0.5(1-0.5) = 0.25 + 0.25 = 0.5 \]
For a pure node (100% “A”) \[ G = 1(1-1) + 0(1-0) = 0 + 0 = 0 \]
2. Entropy (Cross-Entropy or Information Gain)
\[ D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk} \]
- Measures disorder or uncertainty. Entropy is 0 if a node is pure (one class only).
- Like the Gini index, lower entropy means higher purity.
- Used extensively, especially in algorithms like ID3 and C4.5.
Generalized Example:
For a node with all observations in one class: \[ D = -1 \log 1 = 0 \]
For a node with equal proportions \(\hat{p}_{mA} = 0.5, \hat{p}_{mB} = 0.5\):
\[ D = -0.5 \log 0.5 - 0.5 \log 0.5 = -0.5(-0.693) - 0.5(-0.693) = 0.693 \]
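All three impurity measures can be computed directly from the class proportions in a node. The short Python sketch below (illustrative only, not from the text) reproduces the worked numbers above.

```python
import numpy as np

def node_impurities(p):
    """Classification error, Gini index, and entropy for class proportions p in a node."""
    p = np.asarray(p, dtype=float)
    error = 1 - p.max()
    gini = np.sum(p * (1 - p))
    nz = p[p > 0]                      # treat 0 * log(0) as 0
    entropy = -np.sum(nz * np.log(nz))
    return error, gini, entropy

print(node_impurities([0.8, 0.2]))  # error = 0.2, as in the example above
print(node_impurities([0.5, 0.5]))  # gini = 0.5, entropy ~= 0.693
print(node_impurities([1.0, 0.0]))  # pure node: all three measures are 0
```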
Why Prefer Gini or Entropy Over Error Rate? The Gini index and entropy are more sensitive to changes in the node’s composition than the classification error rate. This sensitivity helps the tree algorithm to make better splits by favoring splits that increase the purity of the resulting nodes, even if the most common class does not change.
Importance of Node Purity Node purity is crucial because:
Higher purity nodes mean more confident predictions. Even if both child nodes after a split predict “Yes,” one might be 100% “Yes,” while the other is only 70% “Yes.” Predictions from the pure node are more reliable.
Summary of Key Points:
- Classification trees predict qualitative outcomes by assigning new observations to the most common class in their region.
- The tree is grown by recursive binary splitting, guided by measures of node impurity (Gini index, entropy), not just classification error rate.
- Gini index and entropy are sensitive to how “mixed” or “pure” a node is, making them preferable for tree growth.
- Trees can handle both continuous and categorical predictors.
- Splits that increase node purity are valuable, even if they do not change the most common class, because they yield more confident predictions.
5.2.8.1.4 Trees Versus Linear Models
Model Assumptions
Linear Regression Model:
Linear regression is a classical statistical method that assumes the relationship between the input variables (features) and the output (response) can be modeled as a linear combination of the features. The general mathematical form is:
\[
f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j
\]
- \(X_j\): The value of the \(j\)-th feature (input variable) for a given observation.
- \(\beta_j\): The coefficient representing the effect of the \(j\)-th feature on the response.
- \(\beta_0\): The intercept term.
- \(p\): The number of features.
This model assumes that as you change \(X_j\) by one unit, the output \(f(X)\) changes by exactly \(\beta_j\) units, holding all other features constant. The relationship is strictly additive and linear.
Regression Tree Model: Regression trees (and, by extension, classification trees) make no assumption of linearity. Instead, they partition the feature space into distinct regions and assign a constant prediction to each region. The model can be written as:
\[ f(X) = \sum_{m=1}^{M} c_m \cdot 1_{(X \in R_m)} \]
- \(R_m\): The m-th region, which is a subset of the feature space as defined by the splits in the tree.
- \(c_m\): The predicted value for region \(R_m\) (e.g., the mean response for regression, or the most common class for classification).
- \(1_{(X \in R_m)}\): An indicator function that is 1 if \(X\) is in region \(R_m\), and 0 otherwise.
- \(M\): The number of terminal nodes (leaves) in the tree.
The tree divides the feature space into non-overlapping regions, and any observation falling into a given region gets the same prediction, regardless of the exact values of its features (as long as they fall within that region).
Which Model is Better? The text emphasizes that the choice between linear models and tree-based models depends on the nature of the relationship between features and response:
When Linear Models Work Best: If the relationship is truly linear or well-approximated by a linear function (i.e., the effect of each feature on the response is additive and constant), then linear regression is preferable. It leverages the linear structure, yielding more accurate and efficient predictions.
When Trees Work Best: If the relationship is complex or highly non-linear—meaning the effect of features on response changes depending on the values of other features, or involves abrupt jumps or different regimes—then trees can be superior, since they are able to model such complexity by dividing the feature space into arbitrary regions.
Partitioning Feature Space Trees work by dividing the input (feature) space into rectangular regions, each associated with a constant prediction. Each split is defined by a rule on a single feature—e.g., “if \(X_1 < 0\), go left; else, go right.” This axis-aligned splitting leads to rectangular (or box-shaped) regions in two or more dimensions.
Assessing Model Performance The text suggests that model performance should be evaluated empirically, using metrics such as test error, which estimates how well the model will generalize to new, unseen data. Cross-validation and validation set approaches are standard ways to estimate this error.
Beyond Test Error: Interpretability and Visualization Besides predictive performance, other factors can influence the choice between trees and linear models:
Interpretability: Trees often provide more interpretable models, especially for non-technical audiences. The sequence of splits can be visualized as a flow chart or tree diagram, making it easy to explain decision logic. Visualization: Trees naturally produce visualizations that show how the feature space is partitioned, which is helpful for understanding and communicating the model.
Linear models assume and exploit a linear relationship, resulting in additive effects and straight-line decision boundaries. Trees partition the feature space into regions, allowing for complex, nonlinear, and interaction effects, but with step-like boundaries. The best model depends on the true nature of the data: linear models for linear relationships; trees for nonlinear or interaction-heavy relationships. Model choice should be guided by performance (test error), but also by interpretability and visualization needs.
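A small, hedged experiment along these lines: fit both model types on simulated data with a truly linear signal and with a step-like signal, and compare cross-validated test error. The data-generating functions and settings below are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(400, 2))

# Two simulated responses: one truly linear, one with an abrupt step in X1.
y_linear = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=400)
y_step = np.where(X[:, 0] < 0, -2.0, 2.0) + rng.normal(scale=0.5, size=400)

for name, y in [("linear signal", y_linear), ("step signal", y_step)]:
    for model in (LinearRegression(), DecisionTreeRegressor(max_depth=3, random_state=0)):
        mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
        print(f"{name:13s}  {type(model).__name__:21s}  CV MSE = {mse:.2f}")
```

Under these assumptions, linear regression should win on the linear signal and the tree should win on the step signal, which is the trade-off described above.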
5.2.8.1.5 Advantages and Disadvantages of Trees
- Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression! Deep Explanation: Decision trees work by making a series of simple decisions, each based on the value of a single feature. At each step, you answer a yes/no or categorical question about the data (for example: “Is income greater than $50,000?” or “Is the color red?”), and move down the tree accordingly until you reach a final prediction at a terminal node (leaf).
This process mimics the way people often make decisions in everyday life—by following a sequence of conditional rules. Each split in the tree is easy to understand on its own: you don’t need to understand complex mathematics or interpret coefficients. In contrast, linear regression requires understanding how each feature contributes to the outcome through its coefficient, which can be less intuitive, especially when there are many variables or when the variables are correlated.
- Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters. Deep Explanation: Humans often naturally make decisions by breaking down complex choices into a sequence of simpler ones, focusing on one factor at a time. Decision trees replicate this process:
Each split considers a single factor and its value, just as a human might evaluate one aspect of a situation before moving to the next. The tree’s structure—starting from a root and splitting into branches—resembles the way people follow a set of “if-then-else” rules. This resemblance to human reasoning makes trees feel more “natural” and relatable.
- Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small). Deep Explanation: Decision trees can be visualized as branching diagrams, where each internal node represents a split, each branch represents a possible outcome of the split, and each leaf node represents a prediction.
This visual format is accessible and can be understood without specialized training. Small trees especially can be drawn on paper or a whiteboard, making them practical for presentations or team discussions. Visualizing the flow from root to leaf helps non-experts trace the logic leading to any prediction.
- Trees can easily handle qualitative predictors without the need to create dummy variables. Deep Explanation: Many real-world datasets contain categorical (qualitative) variables, such as color, type, or category. Classical methods like linear regression require these to be converted into numeric forms—typically by creating “dummy” or “indicator” variables (one for each category level).
Decision trees, by contrast, can split directly on categorical variables, assigning different branches to different categories. There is no need for preprocessing or transformation; the tree inherently deals with both numeric and categorical predictors.
Disadvantages of Decision Trees
- Trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen in this book. Deep Explanation: While trees are intuitive and easy to use, they often do not perform as well as more sophisticated models in terms of prediction accuracy. Reasons include:
Trees make sharp splits in the data, which can lead to overfitting—fitting noise rather than signal, especially with small or noisy datasets. Single trees may not capture subtle relationships between variables that more flexible or ensemble methods (like random forests, boosting, or even well-tuned linear models) can exploit. Other methods can smooth over the data, reducing variance and capturing more complex relationships, leading to better predictions on new data.
- Trees can be very non-robust. In other words, a small change in the data can cause a large change in the final estimated tree. Deep Explanation: Trees are sensitive to the specific data used to build them:
If the training data changes slightly—such as by adding or removing a few observations, or if a single value is changed—the structure of the resulting tree can change drastically. This is called instability or lack of robustness. The splits chosen at each step depend heavily on the data available, and a small change early in the tree can propagate through all subsequent splits, leading to an entirely different tree. This instability can reduce confidence in the model, especially when explanations need to be consistent across different samples.
Aggregating Trees: Overcoming Limitations
Deep Explanation: To address the weaknesses of single decision trees, ensemble methods aggregate the predictions of many trees to improve performance and stability:
- Bagging (Bootstrap Aggregating): Multiple trees are built on different random samples of the training data; their predictions are averaged (for regression) or voted on (for classification).
- Random Forests: An extension of bagging that also randomizes which features are considered at each split, further reducing correlation among trees.
- Boosting: Trees are built sequentially, each focusing on correcting errors made by previous ones. Their predictions are combined for better accuracy.
These ensemble methods result in models that are more accurate and robust than a single tree, often rivaling or exceeding the performance of other advanced algorithms.
5.2.8.2 Bagging, Random Forests, Boosting, and Bayesian Additive Regression Trees
This section introduces ensemble methods, which combine many simple models (called weak learners, such as regression or classification trees) to create a more powerful predictive model. The section will cover specific ensemble methods: bagging, random forests, boosting, and Bayesian additive regression trees.
5.2.8.2.1 Bagging
The Bootstrap and Its Role in Bagging Bootstrap Overview: The bootstrap is a statistical technique that involves repeatedly sampling (with replacement) from a dataset to create multiple new samples, known as “bootstrap samples.” Originally, the bootstrap was developed to estimate the variability (such as standard deviation or confidence intervals) of a statistic when the underlying distribution is unknown or when direct calculation is too difficult.
High Variance in Decision Trees Variance in Statistical Models: Variance refers to how much a model’s predictions would change if it were trained on a different dataset from the same process. High variance means small changes in the data can lead to large changes in the model.
Decision Trees: Decision trees are especially prone to high variance. If you randomly split your data in half, fit a tree to each half, the resulting trees could be very different—choosing different variables and splits, and producing different predictions.
Contrast With Low Variance Methods: Methods like linear regression (when the sample size n is much larger than the number of predictors p) tend to have low variance, so results are stable across different samples of the data.
The Principle of Bagging (Bootstrap Aggregation) Bagging Purpose: Bagging is a general approach to reducing the variance of a model. The idea is to take multiple models, each trained on a different sample of the data, and average their predictions.
Mathematical Foundation: Suppose you have \(n\) independent measurements \(Z_1, Z_2, \ldots, Z_n\) each with variance \(\sigma^2\). The variance of their mean, \(\bar{Z}\), is \(\sigma^2 / n\). Thus, averaging reduces variance.
Extension to Models: If you could train \(B\) separate models, each on a different independent dataset, and average their predictions, you would get a result with much lower variance than any individual model.
\[ \hat{f}_{\mathrm{avg}}(x) = \frac{1}{B} \sum_{b=1}^B \hat{f}^b(x) \]
where \(\hat{f}^b(x)\) is the prediction from the \(b\)-th model.
Practical Bagging With Bootstrapping Limitation: In reality, we don’t have multiple independent datasets; we typically have only one. Solution: We use the bootstrap:
From the original dataset, generate B bootstrap samples (random samples with replacement). Train a separate model (e.g., a decision tree) on each bootstrap sample. For prediction, average the predictions of all models (for regression), or use majority vote (for classification).
\[ \hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^B \hat{f}^{*b}(x) \]
where \(\hat{f}^{*b}(x)\) is the prediction from the model trained on the \(b\)-th bootstrap sample.
Bagging for Decision Trees Why It’s Useful: Decision trees are particularly suited for bagging because:
Each tree, grown deep and unpruned, has high variance (unstable, sensitive to data). Averaging many such trees dramatically reduces variance, leading to more reliable predictions. Each tree individually may overfit, but their average prediction is much more stable.
Practical Implementation:
Construct \(B\) deep, unpruned regression trees, each on a different bootstrap sample. For regression: average the predictions of all \(B\) trees. For classification: for each test observation, record the predicted class from each tree, and use the most common class (majority vote).
Performance:
Bagging has been shown to substantially improve prediction accuracy, sometimes using hundreds or thousands of trees to produce a single robust model. The number \(B\) (number of trees) is not very sensitive; typically, using a sufficiently large \(B\) (such as \(B = 100\)) is enough, and large values do not cause overfitting.
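A minimal sketch of bagging regression trees by hand, following the recipe above (bootstrap, fit a deep tree, average). The dataset and names are illustrative assumptions; in practice a library implementation such as scikit-learn's `BaggingRegressor` does the same bookkeeping.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)  # placeholder data
rng = np.random.default_rng(0)
B = 100
trees = []

for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))    # bootstrap sample (with replacement)
    tree = DecisionTreeRegressor(random_state=b)  # deep, unpruned tree
    trees.append(tree.fit(X[idx], y[idx]))

# Bagged prediction: the average of the B trees' predictions.
f_bag = np.mean([t.predict(X) for t in trees], axis=0)
print("training MSE of bagged ensemble:", round(np.mean((y - f_bag) ** 2), 2))
```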
Summary of Bagging’s Deep Principles
Bagging is about reducing the instability (variance) of a model by averaging many versions of it, each trained on a slightly different version of the data (via bootstrapping). It is especially powerful for models like decision trees that are accurate but unstable. Bagging leads to more reliable, stable, and generally more accurate predictions, without increasing the risk of overfitting as the number of trees grows.
5.2.8.2.1.1 Out-of-Bag Error Estimation
Out-of-Bag (OOB) Error Estimation in Bagging
The Challenge of Test Error Estimation
In predictive modeling, an important goal is to estimate how well a model will perform on new, unseen data. Traditionally, this is done using methods like:
Cross-validation: Partitioning the data into several subsets, repeatedly training the model on some subsets and testing it on the others. Validation set approach: Setting aside a portion of the data only for testing.
However, these methods can be computationally expensive, especially when the dataset is large or when model training is slow.
The Structure of Bagging Recall that bagging (bootstrap aggregation) repeatedly fits models (e.g., decision trees) to bootstrap samples of the data:
Bootstrap sample: A sample drawn with replacement from the original dataset, typically the same size as the original data. Key detail: Because sampling is with replacement, about two-thirds of the original data points are included in each bootstrap sample, while about one-third are excluded.
Mathematical Expectation:
For each tree, the probability that a given observation is not selected in a bootstrap sample is \((1 - 1/n)^n\), which tends to \(e^{-1} \approx 0.37\) as n becomes large. So, approximately 37% (about one-third) of the observations are not used to train each tree.
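A quick numerical check of this limit (illustrative only):

```python
import numpy as np
for n in (10, 100, 1_000, 100_000):
    print(n, (1 - 1 / n) ** n)   # probability an observation is left out of one bootstrap sample
print("e^-1 =", np.exp(-1))      # ~0.3679, the limiting value
```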
The Out-of-Bag (OOB) Set
Definition: For each bootstrap sample used to train a tree, the observations not included in that sample are called the “out-of-bag” (OOB) observations for that tree.
Implication: Each observation in the dataset will be OOB for about one-third of the trees in a bagged ensemble of B trees. For example, if you build 150 bagged trees, each observation will be left out of around 50 trees.
Constructing OOB Predictions
For each observation i, consider all trees for which observation i was OOB (i.e., not used in training that tree). For each of these trees, make a prediction for observation i. Aggregate these predictions:
Regression problem: Average the predictions from all the OOB trees for observation i. Classification problem: Take a majority vote among the OOB tree predictions for observation i.
This results in a single OOB prediction for each observation.
Calculating the OOB Error
Once you have an OOB prediction for every observation in the dataset, you can compare these predictions to the true values. Calculate the overall error:
For regression: Compute the mean squared error (MSE) between actual values and OOB predictions. For classification: Compute the classification error rate (fraction of misclassified observations) using the OOB predictions.
This overall error is called the OOB error.
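A hedged sketch of OOB error estimation using scikit-learn's `BaggingRegressor`, whose default base learner is a decision tree; the dataset is a stand-in chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import BaggingRegressor

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)  # placeholder data

bag = BaggingRegressor(n_estimators=200, oob_score=True, random_state=0)
bag.fit(X, y)

# oob_prediction_[i] averages only the trees whose bootstrap sample excluded observation i.
oob_mse = np.mean((y - bag.oob_prediction_) ** 2)
print("OOB estimate of test MSE:", round(oob_mse, 2))
```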
Why OOB Error Is a Valid Test Error Estimate
Independence: For each observation, its OOB prediction is made using only trees that did not see that observation during training. This is conceptually similar to cross-validation, where you test on data not used for training. Efficiency: OOB error estimation does not require explicit data splitting or repeated model retraining—it is a natural byproduct of the bagging process. Equivalence: With a large enough number of trees (B), the OOB error closely approximates the error you would get from leave-one-out cross-validation, where each observation is left out once.
Practical Advantages
No Need for Separate Validation Set: OOB error uses all available data for both training and error estimation, maximizing data efficiency. Computationally Efficient: Especially valuable when working with large datasets, where traditional cross-validation would require retraining the model many times, greatly increasing computational cost. Continuous Monitoring: As you build more trees, you can monitor the OOB error to determine when adding more trees no longer improves performance.
Summary of Deep Principles
OOB error estimation is a built-in, efficient, and unbiased method for assessing the predictive performance of bagged ensemble models. It leverages the structure of the bagging process, requiring no extra data or computational effort compared to traditional cross-validation. OOB error is especially useful for large datasets and complex models, where computational efficiency and data maximization are critical.
5.2.8.2.1.2 Variable Importance Measures
Single Decision Tree Interpretability One of the greatest strengths of a single decision tree is interpretability:
The tree can be visualized as a flowchart or diagram, where the path from the root to a leaf represents a sequence of splits on different variables. By examining the tree, you can easily see which variables are used for splits, at what thresholds, and in what order, giving a clear, visual sense of how predictions are made and which variables are most influential. For example, in a medical diagnosis tree, you could see that “Age > 60” or “Blood Pressure > 140” are key decision points.
Loss of Interpretability in Bagging When you use bagging (bootstrap aggregation) to improve predictive accuracy:
Instead of a single tree, you have a large collection of trees (hundreds or thousands). Each tree is built on a different bootstrap sample of the data, leading to different splits and variable usage across the ensemble. As a result, you can no longer visualize the model as a single tree or easily identify which variables are most important just by looking at a diagram. The price you pay for improved accuracy and stability is reduced interpretability: the model becomes a “black box,” and you lose the straightforward understanding of variable roles that a single tree provides.
Quantifying Variable Importance in Bagged Trees Despite the complexity of the ensemble, it is still possible to summarize the overall importance of each predictor variable across all the trees. This is done using statistical measures that aggregate information from all trees:
For Regression Trees: Residual Sum of Squares (RSS)
RSS Definition: In regression, RSS measures the total squared difference between the observed values and the predicted values. Variable Importance Calculation: For each tree, every time a split is made on a predictor variable, that split reduces the RSS for the training data in that node. The decrease in RSS for that split is credited to the variable used for the split. Aggregation: For each variable, sum all RSS decreases attributed to that variable across all splits and all trees, then average over the number of trees. Interpretation: A large total decrease in RSS for a variable means that splits on this variable frequently and substantially help to explain (predict) the response variable, indicating high importance.
For Classification Trees: Gini Index
Gini Index Definition: The Gini index is a measure of node impurity in classification trees. A lower Gini index means the node is more pure (contains mostly one class). Variable Importance Calculation: For each split in each tree, when a variable is used to split a node, the Gini index for that node decreases (the node becomes purer). The decrease in Gini index is attributed to the variable used. Aggregation: Sum up the total decrease in Gini index for each variable across all splits and all trees, then average over all \(B\) trees. Interpretation: A variable that consistently reduces the Gini index by a large amount is important for classifying the data correctly.
Visualizing Variable Importance
After computing the variable importance scores, you can create a plot (such as a bar chart) showing the relative importance of each variable. The height of each bar represents the mean decrease in RSS (for regression) or Gini index (for classification) attributed to that variable, relative to the largest value. Variables with the largest mean decreases are the most important in the ensemble model.
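As a sketch (not from the text), scikit-learn reports impurity-based importances that correspond to the mean decrease in RSS or Gini described above; the dataset and feature names below are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)  # placeholder data

forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)

importances = pd.Series(
    forest.feature_importances_,
    index=[f"X{j}" for j in range(X.shape[1])],  # illustrative feature names
).sort_values(ascending=False)

print(importances)
# importances.plot.barh()  # optional bar chart of relative importance (needs matplotlib)
```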
5.2.8.2.2 Random Forest
The Motivation for Random Forests Bagging (bootstrap aggregation) improves the prediction accuracy of unstable models like decision trees by averaging the predictions of many trees, each built on a different bootstrap sample. However, bagged trees—while reducing variance—can still be highly correlated with each other if some predictors are very strong.
Correlation Among Trees: If one feature in your dataset is extremely predictive, almost every tree in the bagged ensemble will split first (or high up) on that feature. This causes most of the trees to look similar and make similar mistakes or successes, thus their errors are correlated. Effect on Variance Reduction: Averaging reduces variance most effectively when the individual elements being averaged are uncorrelated. Averaging highly correlated predictions does not yield as much variance reduction as averaging uncorrelated ones.
The Key Innovation: Decorrelating the Trees Random forests introduce a simple but powerful tweak to reduce the correlation among trees:
At each split in each tree, do not consider all predictors as candidates for the split. Instead, randomly select a subset of \(m\) predictors (out of the full set of \(p\) predictors), and only allow the split to be made on one of those \(m\) predictors. This process is repeated independently at every split in every tree, and a new random subset of \(m\) predictors is chosen for each split.
Typical Choice for \(m\):
For classification problems, it is common to use \(m \approx \sqrt{p}\), where \(p\) is the total number of predictors. For regression, often \(m \approx p/3\).
Why Randomly Limit Split Candidates? This restriction may seem counterintuitive—why not use the best variable at every split? But it has a clear and important rationale:
Suppose there is one very strong predictor and several moderately strong ones. In bagging, almost every tree will select the strongest predictor for its top split, causing the trees to be similar and their predictions to be correlated. In random forests, because only a random subset of predictors is considered at each split, the strong predictor is often not even among the candidates. This gives other predictors a chance to be used for splits, leading to more diverse trees.
The Effect: More Diverse, Less Correlated Trees
Increased Diversity: By forcing trees to consider different subsets of predictors, random forests ensure that the trees in the ensemble are more varied in their structure, splits, and predictions. Decreased Correlation: Because the trees are less likely to always split on the same dominant variables, their errors are less likely to be correlated. Greater Variance Reduction: Averaging these less correlated predictions results in a greater reduction in variance of the ensemble’s prediction compared to bagging, improving predictive performance and reliability.
Mathematical/Algorithmic Process
- Bootstrap Sampling: As in bagging, generate \(B\) bootstrap samples from the data.
- Tree Construction:
- For each bootstrap sample, grow an unpruned decision tree.
- At each node, randomly select m predictors (from the total \(p\) predictors).
- Find the best split among only these \(m\) predictors.
- Repeat until the tree is fully grown (no pruning).
- Prediction Aggregation:
- For regression: average the predictions of all trees.
- For classification: use majority vote among the trees.
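A brief illustrative comparison in scikit-learn, where `max_features` plays the role of \(m\); the dataset and settings are assumptions for demonstration only.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=500, n_features=10, noise=1.0, random_state=0)

for m in (10, 3):  # m = p (equivalent to bagging) versus m ~ p/3 (a typical random-forest choice)
    forest = RandomForestRegressor(n_estimators=300, max_features=m, random_state=0)
    mse = -cross_val_score(forest, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"m = {m:2d}: CV MSE = {mse:.2f}")
```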
Insights and Implications
- Random forests generalize bagging: If you set \(m = p\), random forests reduce to bagging.
- Choosing \(m\) is key: Smaller \(m\) increases tree diversity and reduces prediction correlation, but if \(m\) is too small, important predictors may be ignored too often, reducing predictive power.
- No need to worry about overfitting by increasing \(B\): Like bagging, random forests do not overfit as more trees are added; use as many trees as needed until the error rate stabilizes.
- Powerful for many predictors and correlated data: Random forests are especially effective in settings with many predictors and strong correlations among them, such as biological data with thousands of genes, financial prediction with many indicators, or image classification with many pixels.
Summary of Deep Principles
Random forests improve upon bagging by decorrelating trees through random predictor selection at each split. This decorrelation increases the effectiveness of variance reduction through averaging, leading to more reliable, accurate models, especially when predictors are correlated. Random forests are robust, perform well on complex data, and can handle large numbers of predictors without overfitting as the number of trees grows.
5.2.8.2.3 Bayesian Additive Regression Trees (BART)
What is BART (Bayesian Additive Regression Trees)? BART is an ensemble method that uses decision trees as its basic building blocks to model complex relationships in data, especially for regression problems. While related to bagging, random forests, and boosting, BART introduces a unique Bayesian perspective and a different way to generate and update trees.
How Do Bagging, Random Forests, and Boosting Relate to BART?
Bagging and Random Forests: These methods create an ensemble by averaging the predictions of many individual regression trees. Each tree is built independently (separately) using random samples of data (and/or predictors). The final prediction is simply the average (for regression) or majority vote (for classification) across all trees.
Boosting: This approach builds trees sequentially, not independently. Each new tree is trained to correct the errors (residuals) made by the ensemble so far. The final model is a weighted sum of all trees, with each tree focusing on what previous trees missed.
BART’s Position: BART borrows ideas from both:
Like bagging and random forests, each tree is built using randomness. Like boosting, each tree tries to capture signal not yet accounted for by the ensemble so far. The key novelty is in how BART generates and updates its trees, integrating a Bayesian framework.
BART Algorithm: Mathematical and Procedural Details
Let’s walk through the BART algorithm, step by step, using the notation and formulas from the text.
Notation and Initialization
Let \(K\) be the number of regression trees in the ensemble. Let \(B\) be the number of algorithm iterations (draws). For each tree \(k\) and iteration \(b\), \(\hat{f}_k^b(x)\) is the prediction at input \(x\) from the \(k\)th tree at the \(b\)th iteration.
Step 1: Initialization
At the start (first iteration), all trees are initialized to have a single root node (no splits).
The value at the root is set to the overall mean of the response values, divided equally among all K trees: \[ \hat{f}_k^1(x) = \frac{1}{nK} \sum_{i=1}^n y_i \]
for all \(k = 1, \ldots, K\), where \(n\) is the number of data points.
The sum across all trees gives the overall mean: \[ \hat{f}^1(x) = \sum_{k=1}^K \hat{f}_k^1(x) = \frac{1}{n}\sum_{i=1}^n y_i \]
Step 2: Sequential Iterative Updates
For each subsequent iteration \(b = 2, \ldots, B\), the BART algorithm updates each of the \(K\) trees, one at a time. (a) For each tree \(k\): i. Compute the partial residuals For each observation \(i\), calculate the partial residual:
\[ r_i = y_i - \sum_{k'<k} \hat{f}_{k'}^b(x_i) - \sum_{k'>k} \hat{f}_{k'}^{b-1}(x_i) \]
- \(y_i\) is the observed response.
- The first sum subtracts the predictions from all trees already updated in this iteration.
- The second sum subtracts the predictions from all trees yet to be updated (using their values from the previous iteration).
- Update the tree by perturbation
Instead of fitting a completely new tree to the residuals, BART perturbs (modifies) the current tree \(\hat{f}_k^{b-1}(x)\) to improve the fit to the partial residuals.
The perturbation can involve: Changing the structure of the tree: by adding, removing, or changing splits (branches). Changing the predicted values at the terminal nodes (leaves): adjusting the output values at the ends of the branches.
Among all possible perturbations, those that improve the fit to the partial residuals are favored, but some randomness is maintained to explore the space of possible trees.
- Aggregate the predictions
- After updating all K trees for iteration b, the total prediction is: \[ \hat{f}^b(x) = \sum_{k=1}^K \hat{f}_k^b(x) \]
Step 3: Burn-in Period and Final Estimation
The first \(L\) iterations are considered the “burn-in” period and are discarded, since early models may not represent a good fit.
The final prediction for any x is computed by averaging the predictions from the remaining \(B - L\) iterations: \[ \hat{f}(x) = \frac{1}{B-L} \sum_{b=L+1}^B \hat{f}^b(x) \]
This averaging produces the final regression estimate from the BART model.
Additionally, you can use the full set of post-burn-in predictions \(\{\hat{f}^{L+1}(x), \ldots, \hat{f}^B(x)\}\) to compute uncertainty intervals (such as percentiles), which is a strength of the Bayesian approach.
Algorithm 8.3: Pseudocode Summary
Initialize all trees to the same single-node tree with the overall mean response divided among trees.
Repeat for each iteration \(b = 2, \ldots, B\)
- For each tree \(k\)
- Compute the partial residuals for all data points, subtracting the predictions of all other trees.
- Propose a random perturbation to the tree (change structure or leaf values).
- Select perturbations that improve fit to the partial residuals, but maintain randomness.
- Aggregate the predictions from all trees.
- After discarding \(L\) burn-in iterations, average the remaining predictions to get the final model.
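The sketch below is a deliberately simplified, non-Bayesian imitation of this backfitting structure, intended only to make the bookkeeping concrete: computing partial residuals, cycling over the \(K\) trees, and averaging the post-burn-in draws. Real BART perturbs existing trees with an MCMC accept/reject step under priors; here that step is replaced by refitting a shallow tree to the partial residuals, and all names, data, and settings are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)  # placeholder data
n = len(y)
K, B, L = 20, 100, 20                          # trees, iterations, burn-in draws

# Initialization: every tree predicts mean(y)/K, so the ensemble starts at mean(y).
tree_preds = np.full((K, n), y.mean() / K)
post_burn_in = []

for b in range(1, B + 1):
    for k in range(K):
        # Partial residual: y minus the current predictions of all *other* trees.
        r = y - (tree_preds.sum(axis=0) - tree_preds[k])
        # Stand-in for BART's tree perturbation: refit a small tree to r.
        stump = DecisionTreeRegressor(max_depth=2, random_state=b * K + k).fit(X, r)
        tree_preds[k] = stump.predict(X)
    if b > L:                                   # keep only post-burn-in ensemble fits
        post_burn_in.append(tree_preds.sum(axis=0).copy())

f_hat = np.mean(post_burn_in, axis=0)           # average of the retained draws
print("training MSE of averaged fit:", round(np.mean((y - f_hat) ** 2), 2))
```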
Why Does BART Work? What Are Its Strengths?
- Guarding against overfitting: Unlike boosting, which fits new trees as closely as possible to the residuals (potentially overfitting if too many trees are added), BART only perturbs existing trees at each step and typically keeps each tree small. This prevents any single tree from fitting the noise in the data.
- Bayesian model averaging: The algorithm can be interpreted as a Bayesian procedure, where the ensemble of trees is sampled from a posterior distribution, and the averaging step produces a Bayesian posterior mean prediction.
- Uncertainty quantification: By keeping all predictions after burn-in, BART can provide not just a point estimate, but credible intervals (ranges) for predictions, reflecting model uncertainty.
- Minimal tuning required: BART generally performs well “out-of-the-box,” meaning you do not have to spend a lot of effort tuning hyperparameters.
Generalized Practical Scenario
Suppose you are predicting house prices based on dozens of features (size, location, age, number of rooms, etc.), and you suspect there are complex, nonlinear relationships among features and target. You want an ensemble that:
Captures complex patterns (like bagging/random forests/boosting) Avoids overfitting (like boosting might do if unchecked) Gives you a measure of uncertainty for each predicted price
You fit a BART model with a reasonable number of trees and iterations. The model sequentially perturbs its trees to better fit the data without overfitting, and after discarding early (“burn-in”) iterations, averages the remaining predictions for each house. You can use the spread of these predictions to form a “confidence interval” for each price estimate.
- Additive Model Structure BART models the regression function as a sum of \(K\) regression trees:
\[ f(x) = \sum_{k=1}^{K} g(x ; T_k, M_k) \]
Here, \(g(x; T_k, M_k)\) is the prediction of the kth regression tree at input x.
\(T_k\) represents the structure of the kth tree (how it splits the input space).
\(M_k\) represents the set of predicted values at the terminal nodes (leaves) of the kth tree. The total prediction is the sum of predictions from all \(K\) trees.
- Iterative Bayesian Updating via Markov Chain Monte Carlo (MCMC) BART uses an MCMC algorithm to sample from the posterior distribution over possible ensembles of trees. The process iteratively updates each tree, aiming to fit the residuals left by all other trees.
At each MCMC iteration \(b = 1, 2, \ldots, B\) and for each tree \(k = 1, \ldots, K\)
- Partial Residuals For each observation \(i = 1, \ldots, n\), compute the partial residual for the \(k\)th tree: \[ r_i = y_i - \sum_{k' \neq k} \hat{f}_{k'}(x_i) \]
In the BART update described in the text, this is done sequentially for each tree, so the sum over \(k' < k\) uses current iteration values and \(k' > k\) uses previous iteration values:
\[ r_i = y_i - \sum_{k' < k} \hat{f}_{k'}^{b}(x_i) - \sum_{k' > k} \hat{f}_{k'}^{b-1}(x_i) \]
- Tree Perturbation Instead of Full Refitting
Instead of fitting a brand new tree to the residuals (as in boosting), BART proposes a random perturbation to the existing tree structure and/or its leaf values:
Structural changes: Add, remove, or change splits in the tree.
Leaf value changes: Adjust the predicted value at the terminal nodes.
Perturbations are accepted probabilistically, favoring those that improve the fit to the partial residuals (i.e., those that increase posterior probability).
- Posterior Summaries and Burn-in
After discarding the first \(L\) “burn-in” iterations, the remaining sampled models are averaged to form the final prediction: \[ \hat{f}(x) = \frac{1}{B-L} \sum_{b=L+1}^{B} \hat{f}^b(x) \]
where
\[ \hat{f}^b(x) = \sum_{k=1}^{K} \hat{f}_k^b(x) \]
This Bayesian averaging not only gives a point estimate, but also allows you to compute intervals (e.g., a 95% credible interval) for uncertainty quantification by looking at the distribution of the predictions \(\hat{f}^{L+1}(x), \ldots, \hat{f}^B(x)\).
- Regularization and Overfitting Avoidance
Tree size is typically kept small (shallow trees with few splits), limiting each tree’s capacity to overfit.
Bayesian priors are imposed on the structure and predictions of the trees, further regularizing the ensemble.
The perturbation scheme and the Bayesian averaging prevent the ensemble from fitting the training data too closely, a common problem in boosting.
- Posterior Predictive Distribution
For any new input \(x^*\), BART provides a full posterior predictive distribution, not just a single estimate:
For each post-burn-in iteration, compute \(\hat{f}^b(x^*)\). The collection \(\{\hat{f}^{L+1}(x^*), \ldots, \hat{f}^B(x^*)\}\) can be used to compute percentiles or credible intervals, quantifying uncertainty in the prediction.
- Hyperparameters
- \(K\): Number of trees in the ensemble (commonly 100–200).
- \(B\): Number of MCMC iterations (often 1,000–10,000 or more).
- \(L\): Number of burn-in iterations (e.g., 100–200).
- These are chosen based on problem complexity and computational resources, but BART is known to be robust to these choices.
5.2.8.2.3.1 Why Use Trees as Base Learners in Ensembles?
Decision trees are a popular choice for building ensemble methods because:
Flexibility: Trees can model both linear and non-linear relationships. Mixed Data Types: Trees handle both qualitative (categorical) and quantitative (numerical) predictors without requiring transformation or specialized encoding. Automatic Variable Selection: Trees naturally select relevant variables by choosing splits that best separate the data. Interpretability (for single trees): The structure of a tree can be visualized and interpreted.
However, single trees are unstable and prone to overfitting. Ensemble methods combine multiple trees to improve predictive performance and stability.
Bagging (Bootstrap Aggregating) How it works:
Build multiple trees, each on a different random sample (with replacement) of the training data (bootstrap sample). Each tree is grown independently of the others. For prediction, average the results (regression) or take a majority vote (classification) across all trees.
Implications:
Because all trees are trained on similar (but not identical) datasets, and because the trees use all available variables at each split, they often find similar splits and end up structurally similar. This similarity means the ensemble might not explore the full range of possible models (“model space”), and can get stuck in “local optima”—missing better solutions elsewhere in the model space.
Random Forests
How it works:
Like bagging, but with an added layer of randomness: at each split in each tree, only a random subset of predictors is considered as split candidates. Each tree is still trained independently on a bootstrap sample.
Mathematical Mechanism: If there are \(p\) predictors, then at each split a random subset of \(m\) predictors is considered (commonly \(m \approx \sqrt{p}\) for classification and \(m \approx p/3\) for regression), and the best split is chosen among only those \(m\).
Implications:
By restricting the set of variables eligible for splits, trees become more diverse: they don’t always split on the strongest predictor. This decorrelation between trees means the average prediction is more stable and less sensitive to the peculiarities of the data. The ensemble explores a broader range of models than bagging.
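A hedged sketch of this mechanism, assuming the randomForest package and simulated data: setting mtry = p reproduces bagging, while a smaller mtry (here roughly p/3 for regression) gives a random forest.

```r
# Comparing bagging (mtry = p) to a random forest (smaller mtry); illustrative only.
library(randomForest)

set.seed(8)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), ncol = p)
y <- X[, 1] + 0.5 * X[, 2]^2 + rnorm(n)
dat <- data.frame(y = y, X)

bag.fit <- randomForest(y ~ ., data = dat, mtry = p,            ntree = 500)
rf.fit  <- randomForest(y ~ ., data = dat, mtry = floor(p / 3), ntree = 500)

# Out-of-bag mean squared error after all 500 trees, for each ensemble
c(bagging = tail(bag.fit$mse, 1), random_forest = tail(rf.fit$mse, 1))
```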
Boosting
How it works:
Uses the original dataset; does not sample data. Builds trees sequentially—each tree is trained to predict the errors (residuals) of the combined previous trees. Each new tree is “shrunk” (its predictions are weighted less strongly) before being added to the ensemble. This is often called a “slow” or “weak” learning approach.
Mathematical Mechanism: At each step, the model is updated as:
\[ F_{m}(x) = F_{m-1}(x) + \nu \cdot h_{m}(x) \]
Where:
- \(F_{m}(x)\) is the ensemble after m trees,
- \(h_{m}(x)\) is the new tree fit to current residuals,
- \(\nu\) is a learning rate (shrinkage parameter, typically \(0 < \nu < 1\)).
Implications:
Boosting is highly effective at capturing complex patterns. The sequential dependency means each tree “corrects” the errors of the ensemble so far. Without proper regularization (like shrinkage or limiting tree depth), boosting can overfit—fit the training data too closely and perform poorly on new data.
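The update formula above can be translated almost line for line into R. The sketch below uses shallow rpart trees as the base learners on simulated data; the package choice and the parameter values (\(\nu = 0.1\), 100 iterations, depth-2 trees) are illustrative assumptions.

```r
# Boosting for regression: F_m(x) = F_{m-1}(x) + nu * h_m(x), with small trees.
library(rpart)

set.seed(9)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)
train <- data.frame(x = x, y = y)

nu <- 0.1            # shrinkage (learning rate)
M  <- 100            # number of boosting iterations
F_hat <- rep(0, n)   # current ensemble prediction on the training data

for (m in 1:M) {
  train$resid <- train$y - F_hat                        # residuals of ensemble so far
  h_m <- rpart(resid ~ x, data = train,
               control = rpart.control(maxdepth = 2))   # small ("weak") tree
  F_hat <- F_hat + nu * predict(h_m, newdata = train)   # shrunken update
}

mean((train$y - F_hat)^2)   # training MSE after M boosting steps
```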
BART (Bayesian Additive Regression Trees)
How it works:
Like boosting, BART fits trees sequentially, each trying to explain part of the signal not captured by the ensemble so far. Unlike boosting, BART does not fit entirely new trees at each step. Instead, it perturbs (modifies) existing trees from the previous iteration, either by changing their structure (splitting/pruning) or by adjusting the values at the leaves. The update process is stochastic and Bayesian: random proposals for modifying trees are made, with preference given to those that improve the fit to the residuals.
Mathematical Mechanism: Let \(K\) be the number of trees and \(B\) the number of iterations. At each iteration, the prediction is:
\[ \hat{f}^{b}(x) = \sum_{k=1}^{K} \hat{f}_{k}^{b}(x) \]
After discarding \(L\) burn-in iterations, the final prediction is:
\[ \hat{f}(x) = \frac{1}{B-L} \sum_{b=L+1}^{B} \hat{f}^{b}(x) \]
Where each \(\hat{f}_{k}^{b}(x)\) is a tree updated by perturbing the previous tree for that position in the ensemble.
Implications:
BART’s random perturbation mechanism guards against overfitting and local minima by continually exploring new tree structures, rather than fitting residuals as closely as possible. BART provides not only a point estimate but also a distribution of predictions, allowing for uncertainty quantification. The process can be viewed as a Markov chain Monte Carlo algorithm sampling from the posterior distribution over possible ensembles.
Key Differences and Similarities
Bagging and Random Forests: Grow trees independently using random samples (bagging), with random forests introducing additional randomness at each split. Boosting: Builds trees sequentially, each correcting the residuals of the ensemble so far, but risks overfitting without shrinkage. BART: Grows trees sequentially, but instead of fitting new trees, it perturbs existing ones in a Bayesian way, robustly exploring the model space and providing uncertainty measures.
5.2.9 Unsupervised learning
Supervised learning refers to statistical methods such as regression and classification. The goal of supervised learning is to predict a response variable \(Y\) using features \(X_{1}, X_{2}, \ldots, X_{p}\). This chapter, however, focuses on unsupervised learning, where there is no response variable \(Y\). Instead, the aim is to discover patterns or interesting structures within the features themselves. The chapter specifically covers two methods of unsupervised learning: principal components analysis (for visualizing or preprocessing data) and clustering (for finding unknown subgroups in the data).
5.2.9.1 Challenges in unsupervised learning
Supervised learning is well-understood, with clear goals and established tools for evaluating performance, such as predicting a response variable and checking accuracy. In contrast, unsupervised learning is more challenging and subjective because there is no response variable, making it difficult to assess results or check the correctness of findings. Unsupervised learning is often used for exploratory data analysis, and its techniques are important in many fields, such as identifying subgroups in genetic data, grouping customers in online shopping, or personalizing search engine results.
5.2.9.2 Principal Components Analysis (PCA)
Principal Components Analysis (PCA) is a method used to reduce a large set of correlated variables into a smaller set of representative variables (principal components) that explain most of the variability in the data. PCA identifies directions in the data with the most variation and creates new variables along those directions. It is an unsupervised technique, relying only on the features and not on any response variable. PCA is useful for data visualization, dimensionality reduction, and even filling in missing data, and is discussed here as a tool for exploring data without labeled outcomes.
5.2.9.2.1 What are principal components?
In principal components analysis, we aim to reduce the dimension of high-dimensional data and to visualize it. Suppose you have a dataset with \(n\) observations. For each observation, you have measured \(p\) different features or variables, denoted \(X_{1}, X_{2}, \ldots, X_{p}\). You want to visually explore this data, which is a common first step in exploratory data analysis.
One simple approach is to create scatterplots of the data. A scatterplot typically shows two variables at a time, with one on the x-axis and the other on the y-axis. If you have p variables, you could, in theory, make a scatterplot for every possible pair of variables. The number of such pairs is given by the binomial coefficient \(\binom{p}{2} = \frac{p(p-1)}{2}\). For example, if you have 10 features (i.e., \(p = 10\)), there are 45 different scatterplots you could make.
However, as \(p\) increases, the number of possible scatterplots grows quickly and becomes unmanageable. Not only is it impractical to look at so many plots, but most of them won’t be very useful. Each scatterplot only shows the relationship between two features at a time, and therefore contains only a small part of the overall information present in the entire dataset. When \(p\) is large, you need a better way to visualize the structure of the data.
The goal is to find a way to represent the data in a low-dimensional space (for example, two or three dimensions) that still preserves as much of the important information (variation) in the data as possible. If you can find such a representation, you can plot the data in this new space and get a meaningful visualization, even if the original data had many more dimensions.
What Principal Components Analysis (PCA) Does
Principal Components Analysis (PCA) is a mathematical technique that addresses this problem. PCA finds a new set of dimensions, called principal components, which are linear combinations of the original features. Each observation in your data (which originally lived in a p-dimensional space) can now be represented in a lower-dimensional space (for example, two or three principal components), where each new dimension captures as much of the variation (difference among observations) in the data as possible.
The idea is that not all of the original \(p\) variables are equally “interesting” in terms of how much they vary across your observations. PCA finds the directions (in the \(p\)-dimensional space) along which the data varies the most. These directions are called principal components. The first principal component is the direction along which the data varies the most; the second is the direction (orthogonal to the first) along which the remaining variation is maximized, and so on.
How the First Principal Component is Defined
The first principal component is a normalized linear combination of the original variables: \[ Z_{1} = \phi_{11} X_{1} + \phi_{21} X_{2} + \cdots + \phi_{p1} X_{p} \]
The coefficients \(\phi_{11}, \ldots, \phi_{p1}\) are called the loadings for the first principal component. The vector \(\phi_{1} = (\phi_{11}, \phi_{21}, \ldots, \phi_{p1})^{T}\) is called the loading vector.
Normalization means that the sum of the squares of the loadings is equal to 1: \[ \sum_{j=1}^{p} \phi_{j1}^{2} = 1 \]
This normalization is important so that you don’t get arbitrarily large variances by just making the coefficients very large.
Finding the First Principal Component
Suppose your dataset is represented as a matrix \(\mathbf{X}\) of size \(n \times p\), where \(n\) is the number of observations and \(p\) is the number of features. Before finding principal components, you center each variable (feature) so that it has mean zero. This is done by subtracting the mean of each column from every entry in that column.
You look for the linear combination of features (using coefficients \(\phi_{11}, \ldots, \phi_{p1}\)) that gives a new variable \(z_{i1}\) for each observation \(i\):
\[ z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \cdots + \phi_{p1} x_{ip} \]
The goal is to choose the coefficients \(\phi_{11}, \ldots, \phi_{p1}\) so that the sample variance of the \(z_{i1}\) values is as large as possible, subject to the constraint that the sum of squared coefficients is 1:
\[ \underset{\phi_{11}, \ldots, \phi_{p1}}{\operatorname{maximize}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{j1} x_{ij} \right)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1 \]
This is a classic optimization problem, and it can be solved using a method called eigen decomposition from linear algebra. The resulting \(z_{i1}\) values (for \(i = 1, 2, \ldots, n\)) are called the scores of the first principal component. The loading vector \(\phi_{1}\) defines the direction in which the data vary the most. If you project the data points onto this direction, you obtain the principal component scores.
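A minimal R sketch of this computation on simulated data (the dimensions n = 50 and p = 4 are arbitrary): center the columns, take the eigen decomposition of the sample covariance matrix, and read off the first loading vector and its scores.

```r
# First principal component "by hand" via an eigen decomposition (illustrative).
set.seed(1)
n <- 50; p <- 4
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

X_centered <- scale(X, center = TRUE, scale = FALSE)  # center each column to mean zero

S   <- crossprod(X_centered) / n   # sample covariance matrix (divisor n, as in the text)
eig <- eigen(S)

phi1 <- eig$vectors[, 1]           # first loading vector; sum(phi1^2) equals 1
z1   <- X_centered %*% phi1        # scores z_{i1} on the first principal component

# The variance of the scores equals the largest eigenvalue
c(variance_of_scores = mean(z1^2), largest_eigenvalue = eig$values[1])

# Same direction (up to a sign flip) as R's built-in prcomp()
prcomp(X)$rotation[, 1]
```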
Geometric Interpretation
Geometrically, the loading vector \(\phi_{1}\) gives a direction in the original feature space. Each observation (data point) is projected onto this direction, and the resulting value is the score for that observation on the first principal component. In the special case where there are only two features, this direction can be easily drawn as a line on a scatterplot. In higher dimensions, it’s a direction in \(p\)-dimensional space. As an example, if you have two features, the first principal component loading vector might be \(\phi_{11} = 0.839\), \(\phi_{21} = 0.544\). This defines a specific direction in the two-dimensional space of the features.
The Second Principal Component
After finding the first principal component, you can look for a second linear combination of the features (the second principal component) that:
Has the largest possible variance among all linear combinations that are uncorrelated with the first principal component; and Is orthogonal (perpendicular) to the first principal component in the feature space.
Mathematically, the second principal component scores are:
\[ z_{i2} = \phi_{12} x_{i1} + \phi_{22} x_{i2} + \cdots + \phi_{p2} x_{ip} \]
where the loading vector \(\phi_{2} = (\phi_{12}, \phi_{22}, \ldots, \phi_{p2})^T\) is chosen so that \(\phi_{1}\) is orthogonal to \(\phi_{2}\) and the sum of squared coefficients is 1. If you only have two features, the direction of the second principal component is uniquely determined (it must be perpendicular to the first). With more features, there can be several principal components, each defined in the same way (maximizing variance, being orthogonal to all previous components).
Using Principal Components for Visualization
Once you have computed the principal components, you can use them to create low-dimensional representations of the data. For example, you can make a scatterplot of the data using the first two principal component scores as the axes. This gives you a two-dimensional plot that tries to preserve as much of the variation in the original high-dimensional data as possible.
You can also plot the first principal component against the second, or the second against the third, and so on, to explore different aspects of the data’s structure. Geometrically, this is like projecting the data into the subspace spanned by the first few principal components and plotting the projected points.
5.2.9.2.2 Another interpretation of Principal Components
Principal component loading vectors and their relationship with variance. The first two principal component loading vectors in a simulated three-dimensional dataset are shown in Figure 12.2. These two loading vectors span a plane in 3D space. A “plane” in 3D is a flat, two-dimensional surface, and in this context, it is defined by the two principal component loading vectors. This plane is special because it is the plane along which the set of data points (the observations) have the highest variance. In other words, if you were to look for the flattest surface that best fits the spread of the data, the plane defined by the first two principal component directions is that surface.
These Principal Components are directions of maximal variation. Principal component loading vectors were described as directions in feature space along which the data varies the most. That is, PCA finds directions in the dataset such that, if you were to project all your data points onto these directions, the spread (variance) of the projected values would be as large as possible. The principal component scores are the coordinates of the projections of the data points onto these directions.
Principal components can also be interpreted as the low-dimensional linear surfaces that lie closest to the data points. This is an alternative (but equivalent) interpretation: principal components identify low-dimensional linear surfaces (lines, planes, hyperplanes, etc.) that are as close as possible to all of the data points. “Closeness” here is measured using the average squared Euclidean distance from each data point to the surface. The Euclidean distance is the straight-line distance between two points in space.
The first principal component loading vector defines the line in \(p\)-dimensional space that is closest to the \(n\) observations (where \(n\) is the number of observations). “As close as possible” means that the sum of the squared distances from each data point to the line is as small as possible, among all possible lines through the origin in \(p\)-dimensional space. In Figure 6.15, this is illustrated by dashed lines showing the distance from each observation to the line. The appeal of this interpretation is that you find a single dimension that lies “in the middle” of all the data points, providing a summary of the entire dataset in just one dimension.
In this sense of “closest,” the principal components provide low-dimensional approximations to the data. The first two principal components define a plane in \(p\)-dimensional space that is as close as possible (in terms of average squared Euclidean distance) to all \(n\) observations. In general, the first \(M\) principal components (for any \(M\)) define an \(M\)-dimensional linear subspace (hyperplane) that is as close as possible to the data cloud. This interpretation leads to the idea of approximating each element \(x_{ij}\) of the data matrix (the value of feature \(j\) for observation \(i\)) by a sum involving the first \(M\) principal components:
\[ x_{ij} \approx \sum_{m=1}^{M} z_{im} \phi_{jm} \]
Here, \(z_{im}\) is the score of observation \(i\) on principal component \(m\). \(\phi_{jm}\) is the loading of feature \(j\) on principal component \(m\).
In other words, for each observation-feature pair, the value is approximated by multiplying the score of the observation on each principal component by the loading of the feature on that principal component, and summing over the first \(M\) components.
The formal optimization of the best low-dimensional approximation can initially be expressed as:
\[ x_{ij} \approx \sum_{m=1}^{M} a_{im} b_{jm} \]
where \(a_{im}\) and \(b_{jm}\) are numbers to be determined.
The best approximation (in the sense of being closest to the actual data) is the one that minimizes the sum of squared residuals (the squared differences between the actual data and the approximation), summed over all features and all observations:
\[ \underset{A \in \mathbb{R}^{n \times M}, B \in \mathbb{R}^{p \times M}}{\operatorname{minimize}} \left\{ \sum_{j=1}^{p} \sum_{i=1}^{n} \left( x_{ij} - \sum_{m=1}^{M} a_{im} b_{jm} \right)^2 \right\} \]
Here, \(A\) is an \(n \times M\) matrix (with entries \(a_{im}\)), and \(B\) is a \(p \times M\) matrix (with entries \(b_{jm}\)).
It turns out that the solution to this minimization problem is given by the first \(M\) principal component scores and loadings: for the optimal \(A\) and \(B\), \(a_{im} = z_{im}\) (the score of observation \(i\) on component \(m\)), and \(b_{jm} = \phi_{jm}\) (the loading of feature \(j\) on component \(m\)).
This means the smallest possible value of the sum of squared residuals (the best possible approximation) is:
\[ \sum_{j=1}^{p} \sum_{i=1}^{n} \left( x_{ij} - \sum_{m=1}^{M} z_{im} \phi_{jm} \right)^2 \]
The first \(M\) principal component score vectors and the first \(M\) loading vectors together give the best possible linear, \(M\)-dimensional approximation to the data (in terms of squared Euclidean distance). If you choose \(M\) large enough (specifically, \(M = \min(n-1, p)\)), then the approximation becomes exact:
\[ x_{ij} = \sum_{m=1}^{M} z_{im} \phi_{jm} \]
This means that, for a sufficiently large number of principal components, you can reconstruct the entire original dataset exactly from the principal component scores and loadings.
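The following R sketch checks this property numerically on a small simulated data set: with all min(n - 1, p) components, the product of scores and loadings reproduces the centered data matrix exactly (up to floating-point error), while truncating to M = 2 components gives the best two-dimensional approximation.

```r
# Reconstructing the (centered) data from scores and loadings (illustrative check).
set.seed(2)
n <- 20; p <- 5
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

pr  <- prcomp(X, center = TRUE, scale. = FALSE)
Z   <- pr$x          # n x min(n-1, p) matrix of scores z_{im}
Phi <- pr$rotation   # p x min(n-1, p) matrix of loadings phi_{jm}

X_centered      <- scale(X, center = TRUE, scale = FALSE)
X_reconstructed <- Z %*% t(Phi)          # sum over m of z_{im} * phi_{jm}

max(abs(X_centered - X_reconstructed))   # essentially zero (numerical precision)

# With only M = 2 components, the same product gives the best 2-dimensional approximation
M <- 2
X_approx <- Z[, 1:M] %*% t(Phi[, 1:M])
```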
5.2.9.2.3 The Proportion of Variance Explained (PVE) in PCA
When we perform Principal Component Analysis (PCA) on a dataset, we transform the data from its original coordinate system (often, each axis is a measured variable or feature) to a new system defined by the principal components. We want to project the data so that we can capture patterns in the data. For example, if we have a dataset with three features (so every observation is a point in 3D space), we can use PCA to find three new axes (principal components) that are linear combinations of the original axes.
Often, the most important structure in the data can be seen by looking only at the first few principal components (those that capture the most variation in the data). For example, by projecting the data onto the first two principal components, we get a two-dimensional view of the data. If the data really “lives” mostly in a two-dimensional subspace, this projection will preserve most of the relationships between the observations.
In Figure 12.2, the first two principal component loading vectors are used to create a two-dimensional “plane” in the original three-dimensional space. When the data is projected onto this plane, the main patterns are preserved: groups of points that are near each other in 3D also remain close in the 2D projection. Similarly, in the USArrests data example (with 50 observations and 4 variables), projecting the data onto the first two principal components gives a good summary of the main structure in the data.
This leads to an important question: How much information do we lose by keeping only the first few principal components? Or, put differently: How much of the total variation in the data is captured by the first few principal components? This motivates the concept of the proportion of variance explained (PVE) by each principal component.
The total variance in the data is a measure of how much the values of all variables differ from their means across all observations. In PCA, we usually center each variable to have mean zero (subtract the mean of each column from every entry in that column). The total variance is calculated by summing the variances of each variable:
\[ \sum_{j=1}^{p} \operatorname{Var}(X_j) = \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2 \]
- \(p\) is the number of variables (features).
- \(n\) is the number of observations (samples).
- \(x_{ij}\) is the centered value (mean-zero) of variable \(j\) for observation \(i\).
So, for each feature, we square each centered value, sum over all observations, divide by n (to get the variance for that feature), and then sum over all features.
Each principal component is itself a new variable, created as a specific linear combination of the original variables:
\[ z_{im} = \sum_{j=1}^{p} \phi_{jm} x_{ij} \]
\(z_{im}\) is the score of observation \(i\) on principal component \(m\). \(\phi_{jm}\) is the loading (weight) of variable \(j\) on principal component \(m\).
The variance explained by the mth principal component is: \[ \frac{1}{n} \sum_{i=1}^{n} z_{im}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm} x_{ij} \right)^2 \]
This formula squares each score, sums over all observations, and divides by n to get the variance.
The proportion of variance explained (PVE) by the \(m\)th principal component is the fraction of the total variance that is captured by that component:
\[ \text{PVE}_m = \frac{ \sum_{i=1}^n z_{im}^2 }{ \sum_{j=1}^p \sum_{i=1}^n x_{ij}^2 } = \frac{ \sum_{i=1}^n \left( \sum_{j=1}^p \phi_{jm} x_{ij} \right)^2 }{ \sum_{j=1}^p \sum_{i=1}^n x_{ij}^2 } \]
The numerator is the total variance explained by the mth principal component. The denominator is the total variance in the entire dataset.
Each PVE is a positive number between 0 and 1. The sum of the PVEs for all principal components is 1, meaning that together, they account for all the variance in the data. To find the cumulative PVE for the first \(M\) principal components, simply add the PVEs for each of those components.
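In R, the PVE and cumulative PVE can be computed directly from the standard deviations returned by prcomp(); the sketch below uses the built-in USArrests data mentioned above, with the variables standardized.

```r
# PVE and cumulative PVE from prcomp() on the built-in USArrests data.
pr.out <- prcomp(USArrests, scale. = TRUE)

pr.var <- pr.out$sdev^2          # variance explained by each component
pve    <- pr.var / sum(pr.var)   # proportion of variance explained

pve            # one value per component; the values sum to 1
cumsum(pve)    # cumulative PVE of the first M components
```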
Relationship to Data Approximation and Residual Error
Recall from the previous section: the first \(M\) principal components provide the best \(M\)-dimensional linear approximation to the data (in terms of minimizing the squared error between the actual data and the approximation). The total variance in the data can be split into two parts:
The variance explained by the first \(M\) principal components. The mean squared error (MSE) of the approximation using those \(M\) components.
This decomposition is written:
\[ \sum_{j=1}^p \frac{1}{n} \sum_{i=1}^n x_{ij}^2 = \sum_{m=1}^M \frac{1}{n} \sum_{i=1}^n z_{im}^2 + \frac{1}{n} \sum_{j=1}^p \sum_{i=1}^n \left( x_{ij} - \sum_{m=1}^M z_{im} \phi_{jm} \right)^2 \]
The left side is the total variance in the data. The first term on the right is the variance explained by the first M principal components. The second term is the mean squared error of the best M-dimensional approximation.
This relationship means that maximizing the variance explained by the first \(M\) principal components is equivalent to minimizing the mean squared error of the approximation.
The cumulative PVE of the first \(M\) components can also be written in terms of the total and residual sums of squares, in a form analogous to the \(R^2\) statistic:
\[ 1 - \frac{ \sum_{j=1}^p \sum_{i=1}^n \left( x_{ij} - \sum_{m=1}^M z_{im} \phi_{jm} \right)^2 }{ \sum_{j=1}^p \sum_{i=1}^n x_{ij}^2 } = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \]
TSS (Total Sum of Squares): the sum of squared entries in the data matrix. RSS (Residual Sum of Squares): the sum of squared errors of the \(M\)-dimensional approximation.
This formula is analogous to the \(R^2\) statistic in regression, representing the proportion of variance explained by the approximation, or in this case, by the first \(M\) principal components.
5.2.9.2.4 Uniqueness of the Principal Components
A principal component loading vector is a set of coefficients (weights) assigned to the original variables (features) in a dataset. In PCA, each principal component is defined as a particular linear combination of the original variables. The coefficients in this linear combination are called the loadings, and for each principal component, these loadings are collected into a vector—the principal component loading vector. If you have \(p\) variables, the loading vector for the \(m\)th principal component is:
\[ \phi_{1m},\, \phi_{2m},\, \ldots,\, \phi_{pm} \]
This vector describes a direction in the \(p\)-dimensional space of your features.
When we say a principal component loading vector is “unique up to a sign flip,” we mean that the direction defined by the loading vector is unique, but the actual vector you get could point in either direction along the same line. That is, both \(\phi\) and \(-\phi\) define exactly the same line or direction in space. If you run PCA using two different software packages, you may find that the loading vectors for a principal component are the same except for the sign of every element: one package might give [0.5,-0.7,0.5], while another gives [-0.5,0.7,-0.5]. Both are equally valid. The sign does not matter for interpretation, because the principal component represents a direction, not an arrow pointing a particular way along that direction.
In mathematical terms, the principal component loading vector specifies a one-dimensional linear subspace (a line) in p-dimensional space. A line has no preferred direction: the line through the origin defined by \(\phi\) is exactly the same as the line defined by \(-\phi\). Flipping the sign of the loading vector just reverses the direction of the vector, not the subspace it spans. For example, imagine a line through the origin in 3D space. If you specify a direction, say (1,2,3), that line also includes the direction (-1,-2,-3). Both directions describe the same infinite line, just pointing in opposite directions.
The principal component scores for each observation (the projections of the data onto the principal component direction) are also unique up to a sign flip. Suppose the score for observation \(i\) on component \(m\) is \(z_{im}\). If you flip the sign of the loading vector (\(\phi_{jm}\) to \(-\phi_{jm}\)), then the scores for each observation also flip sign (\(z_{im}\) to \(-z_{im}\)). This is because:
\[ z_{im} = \sum_{j=1}^p \phi_{jm} x_{ij} \]
When using principal components to approximate or reconstruct the original data (using the formula \(x_{ij} \approx \sum_{m=1}^M z_{im} \phi_{jm}\)), the product \(z_{im} \phi_{jm}\) is what matters. If you flip the sign of both the loading vector and the scores, you have:
\(z_{im} \to -z_{im}\)
\(\phi_{jm} \to -\phi_{jm}\)
The product \(z_{im} \phi_{jm} \to (-z_{im})(- \phi_{jm}) = z_{im} \phi_{jm}\)
Therefore, the reconstructed or approximated value \(x_{ij}\) does not change at all if both are sign-flipped. The final result is completely unchanged.
In geometric terms, the principal component direction is a line through the origin in feature space. The sign flip corresponds to moving along the line in the opposite direction, but because the line is infinite in both directions, the choice of “which way is positive” is arbitrary. This is analogous to how, in a coordinate system, the direction of axes is arbitrary, and flipping all values along an axis doesn’t change the data’s structure, just the orientation. When comparing principal component results from different sources (e.g., different software, textbooks, or computations), do not be concerned if the loadings or scores appear with opposite signs. As long as the directions (and relationships between variables) are preserved, the analysis and interpretation remain valid and unchanged.
5.2.9.2.5 Deciding How Many Principal Components to Use
Suppose you have a data matrix \(\mathbf{X}\) with \(n\) rows (observations, or samples) and \(p\) columns (variables, or features). The process of Principal Component Analysis (PCA) decomposes this matrix into a set of principal components. Mathematically, the maximum number of distinct principal components you can extract is \(\min(n-1, p)\):
If you have more features than observations (\(p > n-1\)), then you can only get \(n-1\) principal components with nonzero variance, because after centering, the data lies in an \((n-1)\)-dimensional subspace. If you have more observations than features (\(n > p\)), you can get \(p\) principal components.
However, in practice, we are rarely interested in keeping all possible principal components. We want to reduce the dimensionality of the data and retain only those components that capture the main structure or variation in the data.
Our aim is to use the smallest number of principal components that give us a good understanding of the data. “Good understanding” here means being able to capture the main patterns, clusters, or trends present in the data, while discarding noise or less informative variation.
There is no single, universally agreed-upon rule for how many principal components should be kept. The answer depends on the specific context, the data, and the goals of the analysis.
A scree plot is a graphical tool used to help decide how many principal components to retain.
In a scree plot, the x-axis shows the component number (1st, 2nd, 3rd, etc.), and the y-axis shows the proportion of variance explained (PVE) by each component.
The plot often starts high (with the first component explaining a large amount of variance), then drops off as you go to higher-numbered components.
Often, the scree plot shows a sharp drop in variance explained after the first few components, then levels off. The “elbow” in the scree plot is the point where this drop occurs—the plot bends like an elbow. The components before the “elbow” explain a sizable amount of variance and are typically retained. Components after the elbow explain much less variance and are often discarded. For example, if the first two components explain a large proportion of variance, and the third and fourth add little, one might keep only the first two.
If the third principal component explains less than 10% of the variance, and the fourth explains less than half that, then those components add little to our understanding. In such a case, the “elbow” appears after the second component, suggesting that two components are enough.
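A short sketch of a scree plot and its cumulative counterpart in base R, again using USArrests purely as an illustration; the "elbow" is then judged by eye.

```r
# Scree plot and cumulative PVE plot for USArrests (illustrative).
pr.out <- prcomp(USArrests, scale. = TRUE)
pve    <- pr.out$sdev^2 / sum(pr.out$sdev^2)

plot(pve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
```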
This scree plot method is ad hoc (informal and based on visual inspection, not a strict rule). There is no universally accepted, objective criterion for how many principal components are “enough.” What is enough depends on:
The specific dataset (how much structure is in the data, how quickly the variance explained drops off). The area of application (for example, genetics, finance, image analysis, etc.). The goals of the analysis (visualization, pattern discovery, noise reduction, etc.).
PCA is most often used for exploratory data analysis—looking for patterns, clusters, or interesting structure in the data without a target variable. In practice, we:
Examine the first few principal components for interesting patterns. If these components reveal structure, clusters, or trends, we may look at a few more. If the first few components show no interesting structure, it’s unlikely that later components will.
The Process is Iterative and Subjective
If you find interesting patterns in the first few components, you may look at additional components until nothing new is learned. This process is subjective and depends on the analyst’s goals and judgment.
In supervised analysis, we have an outcome variable (for example, in regression or classification). When using PCA for supervised analysis, such as principal components regression (PCR), we can use objective criteria to select the number of components:
The number of principal components to use can be treated as a tuning parameter (just like the number of neighbors in k-NN, or the penalty parameter in ridge regression). We can use methods like cross-validation to select the number that gives the best predictive performance on held-out data.
This makes the process of choosing the number of components more objective and reproducible.
The subjectivity in unsupervised PCA (no outcome variable) reflects the open-ended, exploratory nature of the analysis. The objectivity in supervised PCA (with outcome variable) reflects the well-defined nature of prediction tasks, where the “best” number of components is the one that gives the best predictive accuracy as determined by a clear metric.
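As a hedged illustration of treating the number of components as a tuning parameter, the sketch below fits principal components regression with cross-validation; it assumes the pls package is available and uses the built-in mtcars data purely as an example.

```r
# Principal components regression with CV-chosen number of components (illustrative).
library(pls)

set.seed(1)
pcr.fit <- pcr(mpg ~ ., data = mtcars, scale = TRUE, validation = "CV")

summary(pcr.fit)                             # CV error for each number of components
validationplot(pcr.fit, val.type = "MSEP")   # pick M with the lowest cross-validated MSE
```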
5.2.9.2.6 Other uses for principal components
In Section 6.3.1, regression was performed not on the original variables (features) themselves, but on the principal component score vectors.
Each principal component score vector is a new variable created by projecting the data onto a principal component loading vector (a direction in the original variable space). For a data set with n observations and p variables, the principal component scores form an \(n \times p\) matrix (if all components are used), but you can select just the first \(M\) components to get an \(n \times M\) matrix.
Generalization to Other Statistical Methods
Principal Component Regression (PCR):
Instead of using all \(p\) original variables as predictors in a regression model, you use only the scores from the first \(M\) principal components as predictors. This reduces dimensionality and can help avoid problems like multicollinearity and overfitting, especially when \(p\) is large or when predictors are highly correlated.
Classification:
In classification problems (for example, deciding which category an observation belongs to), you can use the principal component scores as input features to your classifier (such as logistic regression, support vector machines, decision trees, etc.). This can help the classifier focus on the main structure in the data and ignore irrelevant details.
Clustering:
Clustering algorithms (like k-means) can also be applied to the matrix of principal component scores instead of the original data. By using just the principal component scores, you often reveal the true groupings in the data more clearly, since noise and irrelevant dimensions are reduced or removed.
Using the \(n \times M\) Matrix
The \(n \times M\) matrix is formed by keeping only the first \(M\) principal component score vectors (columns), where \(M\) is much smaller than \(p\), the total number of original variables (\(M \ll p\)).
For example, if you start with 100 variables (\(p = 100\)) but most of the variation and structure in the data is explained by just 3 principal components (\(M = 3\)), you can reduce your data to an \(n \times 3\) matrix.
This dimensionality reduction leads to more efficient and often more interpretable analysis.
Why This Leads to Less Noisy Results
In real-world data, not all variables contain useful information (signal). Many variables may be mostly noise or random variation. PCA orders the principal components so that the first component explains the most variance, the second explains the next most, and so on. Often, the signal (meaningful structure and patterns) in the data is captured by the first few principal components.
For example, in gene expression data, thousands of genes (variables) may be measured, but only a few combinations (principal components) capture most of the biological differences between samples.
The later principal components often capture only noise or minor details. By using only the first \(M\) principal components as features, you remove much of the noise, leading to more robust and reliable statistical analysis.
- Start with your original data matrix (\(n\) observations, \(p\) variables).
- Perform PCA to find the principal components.
- Project your data onto the first \(M\) principal components to get an \(n \times M\) matrix of scores.
- Use this matrix as the input for regression, classification, clustering, or any other statistical method.
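A minimal sketch of this workflow in R: compute the principal components, keep the first M score vectors, and hand them to a downstream method (here k-means clustering; the choices M = 2 and three clusters are illustrative assumptions).

```r
# PCA first, then a statistical method applied to the first M score vectors.
pr.out <- prcomp(USArrests, scale. = TRUE)

M      <- 2
scores <- pr.out$x[, 1:M]     # n x M matrix of principal component scores

set.seed(4)
km.out <- kmeans(scores, centers = 3, nstart = 20)  # cluster on the scores
table(km.out$cluster)

# The same n x M matrix could instead be used as the predictor matrix
# in a regression or classification model.
```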
5.2.9.3 Clustering methods
Clustering is a group of techniques used to find subgroups (clusters) within a dataset by grouping together similar observations and separating dissimilar ones. The definition of “similar” or “different” is context-dependent and relies on domain knowledge. For example, in medical data, clustering can be used to find unknown subtypes of a disease by grouping tissue samples with similar measurements. Clustering is an unsupervised learning problem, meaning there is no known outcome variable; the goal is to discover structure (clusters) in the data. In contrast, supervised learning tries to predict a known outcome. Both clustering and principal component analysis (PCA) simplify data, but PCA does so by reducing dimensionality and summarizing variance, while clustering finds homogeneous groups within the observations. Clustering is widely used, including in marketing for “market segmentation,” where the aim is to identify groups of people likely to respond similarly to marketing strategies. There are many clustering methods, but two of the most common are:
- K-means clustering: Partitions data into a pre-specified number of clusters.
- Hierarchical clustering: Does not require the number of clusters to be specified in advance; it creates a tree-like diagram (dendrogram) that shows relationships among observations for all possible numbers of clusters.
Clustering can be done on observations or features, depending on the analysis goal. The rest of the chapter focuses on clustering observations based on their features.
5.2.9.3.1 K-means clustering
K-means clustering is a widely used, conceptually simple, and effective algorithm for dividing a dataset into a set number (\(K\)) of distinct, non-overlapping groups called clusters. The key goal is to partition the data so that:
Each observation (data point) belongs to exactly one cluster. Data points within the same cluster are as similar as possible to each other. Data points in different clusters are as dissimilar as possible.
To use K-means, you must specify in advance the number of clusters (\(K\)) you want to find. The algorithm then automatically assigns each observation to one of these clusters.
Notation and Cluster Assignment
Suppose you have n observations (data points) and \(p\) features (variables). You want to split the observations into \(K\) clusters. Here’s how we formalize this:
Let \(C_1, C_2, ..., C_K\) be the sets of indices for the \(K\) clusters.
- For example, if the 4th, 7th, and 10th observations belong to cluster 2, then \(C_2 = \{4, 7, 10\}\).
These sets must satisfy two important properties:
Every observation must belong to at least one cluster: \[ C_{1} \cup C_{2} \cup ... \cup C_{K} = \{1, 2, ..., n\} \]
No observation belongs to more than one cluster (clusters do not overlap): \[ C_k \cap C_{k'} = \emptyset \quad \text{for all } k \neq k' \]
The Optimization Objective: Within-Cluster Variation
The idea behind K-means is to find a clustering such that the observations within each cluster are as similar as possible. This is measured by within-cluster variation—a measure of how “spread out” the points in a cluster are.
For cluster \(C_k\), the within-cluster variation is denoted \(W(C_k)\). The goal is to find a partition (assignment of points to clusters) that minimizes the total within-cluster variation across all \(K\) clusters.
Mathematically, the objective is: \[ \underset{C_1, ..., C_K}{\operatorname{minimize}} \left\{ \sum_{k=1}^K W(C_k) \right\} \] That is, find the partition of the data into K clusters such that the sum of the within-cluster variations is as small as possible.
How Do We Measure Within-Cluster Variation? The most common way is to use the squared Euclidean distance between observations. For cluster \(C_k\): \[ W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2 \]
- \(|C_k|\) is the number of observations in cluster \(C_k\).
- The double sum \(\sum_{i, i' \in C_k}\) sums over every possible pair of observations in the cluster.
- For each pair, we sum the squared differences across all features (\(p\)).
- This computes the average squared distance between every pair of observations within cluster \(C_k\).
Combining this with the original objective gives: \[ \underset{C_1, ..., C_K}{\operatorname{minimize}} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2 \right\} \]
Why is This Problem Hard?
For \(n\) observations and \(K\) clusters, there are \(K^n\) possible ways to assign each point to a cluster. This number grows astronomically with even moderate \(n\) and \(K\). Therefore, it is infeasible to check every possible assignment to find the absolute best (global optimum).
The K-means Algorithm: A Practical (Local) Solution
Despite the complexity, a simple iterative algorithm can be used to find a local optimum, which is often good enough in practice. This is the classic K-means algorithm:
Algorithm Steps
Initialization:
- Randomly assign each observation to one of the K clusters (so each gets a temporary cluster label).
Iterate until assignments stop changing:
- a) For each cluster, compute the centroid (mean) of all observations assigned to that cluster. The centroid is a vector of length \(p\), with each element being the average value for that feature in the cluster.
- b) Reassign each observation to the cluster whose centroid is closest (using Euclidean distance).
- Repeat steps 2a and 2b until the cluster assignments no longer change (convergence).
Why Does This Work?
- At each iteration, the within-cluster variation (the objective) always decreases or stays the same; it never increases.
- This is because:
- The centroids are calculated to minimize the total squared distance for the current assignments.
- Reassigning observations to the nearest centroid can only decrease the total distance.
- The process stops at a local optimum: a set of assignments where moving a single observation to a different cluster would not decrease the objective further.
Important Note: Local Optima
- The result depends on the initial random assignment. Running the algorithm multiple times with different initializations and choosing the best result (lowest total within-cluster variation) is standard practice.
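In R, kmeans() handles the multiple-initialization practice through its nstart argument, as in the sketch below on simulated two-group data; the objective being minimized is reported as tot.withinss.

```r
# K-means with one random start versus twenty random starts (illustrative).
set.seed(3)
x <- matrix(rnorm(100 * 2), ncol = 2)
x[1:50, 1] <- x[1:50, 1] + 3     # shift half the points to create visible groups
x[1:50, 2] <- x[1:50, 2] - 4

km.single <- kmeans(x, centers = 3, nstart = 1)
km.multi  <- kmeans(x, centers = 3, nstart = 20)

# tot.withinss is the total within-cluster variation; more starts can only help
c(single_start = km.single$tot.withinss, multiple_starts = km.multi$tot.withinss)
```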
The algorithm is demonstrated in the figures:
- Figure 12.7: Shows how the same data can be clustered into 2, 3, or 4 clusters, with each observation colored according to its cluster assignment.
- Figure 12.8: Shows the progression of the K-means algorithm on a toy example. It illustrates:
- Initial random assignments,
- Calculation of initial centroids,
- Iterative reassignments and centroid updates,
- Final clustering after convergence.
Selecting the Number of Clusters \(K\)
To use K-means, you must decide in advance how many clusters you expect in the data. This is a nontrivial problem and is addressed later in the text.
Mathematical Details
- The within-cluster variation can also be rewritten for computational convenience:
\[ \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^p (x_{ij} - \bar{x}_{kj})^2 \]
- Here, \(\bar{x}_{kj}\) is the mean of feature \(j\) in cluster \(k\).
- This identity shows why updating the centroids and reassigning points always decreases the objective.
5.2.9.3.2 Hierarchical clustering
Hierarchical clustering is an alternative to K-means clustering that does not require specifying the number of clusters in advance. Instead, it builds a tree-like structure called a dendrogram to represent how observations are grouped together at different levels of similarity. This approach has the advantage of providing a visual, flexible representation of the data’s clustering structure. The section focuses on bottom-up (agglomerative) hierarchical clustering, which is the most common type, where clusters are formed by progressively merging the closest pairs of observations or groups.
5.2.9.3.2.1 Interpreting a dendrogram
Building and Interpreting a Dendrogram
The Structure of a Dendrogram
- A dendrogram is a tree-like diagram commonly used to visualize the results of hierarchical clustering.
- It is usually drawn “upside down”—the leaves (the bottom ends) represent individual observations, and as you move up the tree, branches join together, combining observations and clusters.
- The process of building a dendrogram starts with each observation as its own separate cluster (the leaves).
- Clusters are then merged together step by step according to some similarity or distance measure, until all observations are combined into one single cluster (the trunk).
How to Read a Dendrogram
- In the dendrogram, each leaf corresponds to one observation.
- Branches form when two leaves (or clusters) are joined together—this represents the merging of two similar observations or groups.
- As you move up the tree, clusters previously formed at lower levels merge into larger clusters.
- The lower a fusion occurs (closer to the bottom), the more similar those observations or clusters are.
- The higher a fusion occurs (closer to the top), the less similar the merged clusters are.
- The vertical axis of the dendrogram reflects the distance or dissimilarity at which clusters are joined.
- For any two observations, you can trace up the tree to the point where they first join; the height of this connection is a measure of their dissimilarity.
Common Misinterpretations
The horizontal axis (left-to-right order) of the dendrogram is arbitrary and can be rearranged at each fusion point without changing the meaning of the tree.
For n observations, there are \(2^{n-1}\) possible orderings. Observations close together horizontally are not necessarily more similar than those far apart unless their branches are directly joined at a low height.
Two observations might be plotted next to each other (horizontally) but actually become part of the same cluster only at a high level (meaning they are not particularly similar). The correct way to assess similarity is by looking at the vertical position where their branches first merge.
Identifying Clusters: Cutting the Dendrogram
- To decide how many clusters to extract from a dendrogram, you make a horizontal cut across the tree at a chosen height.
- Every branch below the cut line is considered a separate cluster.
- The number of clusters equals the number of branches intersected by the cut.
The height of the cut determines the granularity:
- A high cut yields fewer, larger clusters.
- A low cut yields more, smaller clusters.
- Cutting at height zero gives every observation its own cluster.
This approach is flexible: you can extract any number of clusters from a single dendrogram, unlike K-means, which requires a fixed \(K\).
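A minimal R sketch of this procedure: build the dendrogram with hclust() and then cut it with cutree(), which returns cluster labels for any desired number of clusters. The simulated data and the cut choices are illustrative.

```r
# Cutting a dendrogram to obtain different numbers of clusters (illustrative).
set.seed(5)
x <- matrix(rnorm(30 * 2), ncol = 2)

hc <- hclust(dist(x), method = "complete")  # Euclidean distance, complete linkage
plot(hc)                                    # the dendrogram

cutree(hc, k = 2)   # cluster labels from a cut giving 2 clusters
cutree(hc, k = 4)   # a lower cut giving 4 clusters from the same tree
```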
The Nature and Limitations of Hierarchical Clustering
The term hierarchical refers to the fact that clusters at a lower cut (more clusters) are always fully contained within clusters at a higher cut (fewer clusters). This is called “nested structure.”
In some real-world cases, this assumption of nested clusters does not fit the actual data structure.
- The best split into two clusters might be by gender (men vs. women).
- The best split into three clusters might be by nationality (American, Japanese, French).
- These groupings are not nested: the three-way split does not arise from further splitting either of the two gender groups.
As a result, hierarchical clustering can sometimes perform worse than K-means (which does not assume nested clusters), yielding less accurate or meaningful groups for a given number of clusters.
Interpreting a dendrogram requires focusing on vertical fusion heights to assess similarity, not horizontal proximity.
The dendrogram’s flexibility allows you to select any number of clusters after clustering has been performed.
However, the hierarchical nature (nesting) of clusters may not always reflect the true structure in the data, and in such cases, hierarchical clustering can be less accurate than other methods.
5.2.9.3.2.2 The Hierarchical Clustering Algorithm
The Goal: Building a Dendrogram
The end product of hierarchical clustering is a dendrogram, a tree diagram that shows how observations or clusters are combined in a sequence of steps until all are joined together. The dendrogram provides a visual summary of the nested grouping structure found in the data, allowing you to see at a glance which observations are most similar and how clusters merge.
Step 1: Measuring Dissimilarity Between Observations
The process begins by defining a dissimilarity measure between every pair of observations (data points).
The most common dissimilarity measure is Euclidean distance—the straight-line distance between two points in feature space. More generally, any appropriate distance or dissimilarity metric can be used, and the choice may depend on the context or data type.
Step 2: Initial Clusters
At the start (the bottom of the dendrogram), each observation is treated as its own cluster.
For \(n\) observations, there are initially \(n\) clusters, each containing just one data point. For example, if you have 9 data points, you begin with clusters \(\{1\}, \{2\}, ..., \{9\}\).
Step 3: Iterative Merging (Fusing) of Clusters
The algorithm proceeds iteratively: at each step, it finds the two clusters that are most similar (i.e., that have the smallest dissimilarity between them) and merges them into a single cluster.
After the first merge, there are \(n-1\) clusters. After the second, \(n-2\), and so on, until finally all observations are merged into one cluster at the top of the dendrogram.
Step 4: Recalculating Dissimilarities Between Clusters
After each merge, you now have clusters that could contain more than one observation. You must define how to compute the dissimilarity between two clusters, not just between two single points.
This is not obvious: there are many possible ways to do this.
Extending the concept of dissimilarity from a pair of observations to a pair of groups of observations is handled by the notion of linkage.
Linkage Methods: How to Measure Dissimilarity Between Clusters
The linkage method specifies how to calculate the distance between two groups of observations (clusters).
The four most common types of linkage are:
- Complete linkage: The maximum distance between any pair of observations, one from each cluster. This tends to produce compact, spherical clusters since a cluster will only merge with another if all its members are sufficiently close to all members of the other cluster.
- Single linkage: The minimum distance between any pair of observations, one from each cluster. This can lead to long, “trailing” clusters because clusters can be joined by single points that are close, even if the rest of the cluster is far away.
- Average linkage: The average distance between all pairs of observations, one from each cluster. This is a compromise between single and complete linkage.
- Centroid linkage: The distance between the centroids (means) of the two clusters. This can result in a phenomenon called “inversion,” where the height at which two clusters are merged is less than the height of earlier merges, which can make interpretation difficult.
- Start: Compute all pairwise dissimilarities for the \(n\) observations (for \(n\) points, there are \(n(n-1)/2\) pairs). Each observation is its own cluster.
- Iterate: for \(i = n, n-1, \ldots, 2\):
  - Find the two clusters with the smallest inter-cluster dissimilarity (according to the chosen linkage method) and fuse them. Record the height of the fusion in the dendrogram.
  - Recompute the pairwise dissimilarities among the \(i-1\) remaining clusters.
- Repeat this process until all points are merged into a single cluster.
Important Notes and Implications
The linkage method chosen has a significant effect on the resulting dendrogram and the structure of the clusters found.
- Complete and average linkage methods are generally preferred because they tend to produce more balanced, interpretable cluster trees.
- Single linkage can produce “chained” clusters that are not compact.
- Centroid linkage, used mostly in specialized contexts like genomics, can produce confusing inversions.
- The final dendrogram depends not only on the linkage method, but also on the original measure of dissimilarity (e.g., Euclidean distance).
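To see the effect of the linkage choice directly, the sketch below clusters the same simulated data with complete, average, and single linkage; only the linkage argument changes.

```r
# Comparing linkage methods on the same simulated data (illustrative).
set.seed(11)
x <- matrix(rnorm(30 * 2), ncol = 2)
d <- dist(x)                                  # Euclidean dissimilarities

hc.complete <- hclust(d, method = "complete")
hc.average  <- hclust(d, method = "average")
hc.single   <- hclust(d, method = "single")

par(mfrow = c(1, 3))                          # three dendrograms side by side
plot(hc.complete, main = "Complete Linkage", xlab = "", sub = "")
plot(hc.average,  main = "Average Linkage",  xlab = "", sub = "")
plot(hc.single,   main = "Single Linkage",   xlab = "", sub = "")
```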
5.2.9.3.2.3 Choice of Dissimilarity Measure
Euclidean Distance as Dissimilarity
In most clustering examples so far, Euclidean distance has been used as the dissimilarity measure.
For two observations (data points), their Euclidean distance is the square root of the sum of squared differences across all variables/features. Mathematically, for points \(x\) and \(y\) with \(p\) features, \(d(x, y) = \sqrt{\sum_{j=1}^p (x_j - y_j)^2}\)
This measure is intuitive for geometric data: points close together in space are considered similar.
Alternative: Correlation-Based Distance
Sometimes, other measures are more appropriate. For example, correlation-based distance considers two observations similar if their profiles (across features) are highly correlated, even if their values are far apart in Euclidean terms.
This is an unusual use of correlation: typically, correlation measures the relationship between two variables (columns), but here it is used between two observations (rows). Correlation-based distance focuses on the shape or pattern of the observations, not their overall magnitude. For instance, if two shoppers both buy the same items in the same proportions (even if one buys 100x more than the other), they will have a high correlation and thus a small correlation-based distance.
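A hedged sketch of how a correlation-based dissimilarity can be formed in R: because cor() works on columns, the data matrix is transposed so that correlations are computed between observations (rows), and 1 - r is used as the distance. The simulated dimensions are arbitrary.

```r
# Correlation-based dissimilarity between observations (illustrative).
set.seed(6)
x <- matrix(rnorm(20 * 10), ncol = 10)   # 20 observations, 10 features

dd <- as.dist(1 - cor(t(x)))             # dissimilarity between observations (rows)
hc.corr <- hclust(dd, method = "complete")
plot(hc.corr, main = "Complete Linkage with Correlation-Based Distance")
```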
The Critical Impact of the Dissimilarity Measure
The choice of dissimilarity measure is crucial because it fundamentally determines how clusters are formed.
Different measures can yield very different dendrograms (see Figure 12.14 for how linkage method affects cluster shapes). The best choice depends on the data and the purpose of clustering.
Illustration with Figure 12.15
This figure shows three observations (e.g., shoppers) measured on 20 variables (e.g., items).
Observations 1 and 3 have similar values for each variable, so their Euclidean distance is small, but they are weakly correlated (the shapes of their profiles differ), so correlation-based distance is large between them. Observations 1 and 2 have different values for each variable, so their Euclidean distance is large, but the shape of their profiles is similar (e.g., both buy the same items in the same proportions, just at different scales), so correlation-based distance is small.
The Role of Scaling Variables Before Clustering
Why Scaling Matters
Variables may have different units or different variances.
Example: In online shopping data, socks might be purchased much more often than computers. Without scaling, variables with larger magnitude or variance will dominate the dissimilarity calculation, overshadowing variables that are less frequent or have smaller numerical values.
What Happens If We Don’t Scale?
The clustering may reflect mainly the variable(s) with the largest values or most frequent purchases.
In Figure 12.16 (left), clustering based on raw purchase counts means that differences in sock purchases dominate the dissimilarity, and computer purchases have little effect. This could be undesirable if the retailer cares about both product categories equally, or more about computers.
Scaling to Standard Deviation One
By scaling each variable to have mean zero and standard deviation one (often called z-score normalization), every variable contributes equally to the dissimilarity calculation.
In Figure 12.16 (center), after scaling, both socks and computers affect the clustering similarly. This prevents high-frequency (or high-magnitude) variables from dominating.
Scaling for Different Units
If variables are measured in different units (e.g., centimeters vs. kilometers, or number of items vs. dollars spent), scaling is essential to make their contributions comparable.
In Figure 12.16 (right), clustering on the basis of dollars spent means computer purchases (being more expensive) now dominate the dissimilarity.
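The following R sketch illustrates the same point with simulated data (variable names and numbers are invented for illustration): compare the dissimilarities computed from raw counts, from variables scaled to standard deviation one with scale(), and from dollar amounts.

```r
# Hypothetical data: purchase counts for 50 shoppers.
set.seed(2)
socks     <- rpois(50, lambda = 8)     # bought often, cheap
computers <- rpois(50, lambda = 0.3)   # bought rarely, expensive
shoppers  <- cbind(socks, computers)

d_counts  <- dist(shoppers)                            # sock counts dominate
d_scaled  <- dist(scale(shoppers))                     # each variable: mean 0, sd 1
d_dollars <- dist(cbind(2 * socks, 2000 * computers))  # dollars: computers dominate

# The three dissimilarities can lead to very different dendrograms:
plot(hclust(d_counts,  method = "complete"), main = "Raw counts", labels = FALSE)
plot(hclust(d_scaled,  method = "complete"), main = "Scaled",     labels = FALSE)
plot(hclust(d_dollars, method = "complete"), main = "Dollars",    labels = FALSE)
```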
Choosing the Best Approach
The decision to scale or not, and which dissimilarity measure to use, depends on the scientific context and the specific goals of the analysis.
What do you mean by “similar” in your context? Are you interested in grouping by overall activity, by preference profiles, or by some other criterion?
These choices are just as important for K-means clustering as for hierarchical clustering.
Euclidean distance measures absolute differences in values; it is affected by scaling and magnitude. Correlation-based distance measures similarity in patterns or shapes of variable profiles, ignoring overall size. Scaling variables (to standard deviation one) can ensure all variables contribute equally, preventing highly variable features from dominating the results. The combination of the dissimilarity measure and scaling decisions fundamentally determines the clusters you find and their interpretability.
5.2.9.3.3 Practical Issues in Clustering
Practical Issues in Clustering: Why Small Decisions Matter
Clustering is a powerful tool in unsupervised learning, meaning it is used when we do not have any outcome variables or labels for our data. It helps us find groups or structure in the data that might not be obvious. However, successfully applying clustering methods is not as simple as running an algorithm—there are many practical considerations and choices that must be made along the way. Each of these choices can dramatically affect the results you get, and therefore, the conclusions you draw from your analysis.
Key Decisions Before and During Clustering
1. Standardization or Scaling of Variables
Why consider scaling? The measurements for different features (variables) in your dataset might not be on the same scale or might vary in their units or ranges. For example, height could be measured in centimeters (ranging from 150-200), while weight could be in kilograms (ranging from 40-120). If you do not standardize (scale) the variables, features with larger numerical ranges will dominate the calculation of distances (dissimilarities), and thus, the clustering results. Typical approach: Standardizing each variable so that it has mean zero and standard deviation one, which puts all features on an equal footing, is a common practice. This is especially important for distance-based clustering methods like K-means and hierarchical clustering.
2. Decisions Specific to Hierarchical Clustering
- Choice of Dissimilarity Measure: You must decide how to quantify how “different” or “far apart” two observations are. Common choices are Euclidean distance, Manhattan distance, or even correlation-based distances. The chosen measure affects which observations are considered similar and therefore grouped together.
- Choice of Linkage Type: When clusters contain more than one observation, you must decide how to define the distance between two clusters. Do you use the shortest distance between any pair (single linkage), the farthest distance (complete linkage), the average distance (average linkage), or the distance between cluster means (centroid linkage)? Each choice can produce vastly different dendrograms and clustering structures.
- Choosing Where to Cut the Dendrogram: Once the dendrogram is built, you must decide at what height to cut across the tree to form clusters. Cutting at different heights changes the number of clusters you get and how the data are grouped.
3. Decisions Specific to K-Means Clustering
Choosing the Number of Clusters (K): K-means requires the analyst to specify the number of clusters in advance. There is no universally correct value for K, and the clustering outcome can change dramatically with different choices. Analysts often try different values and evaluate the results using domain knowledge or quantitative criteria (such as the elbow method or silhouette score).
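As a sketch of the quantitative side of this choice, the R code below (simulated data) runs K-means for several values of K and plots the total within-cluster sum of squares; a bend (the "elbow") in this curve is one informal guide to choosing K.

```r
# Simulated data with two well-separated groups.
set.seed(3)
x <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```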
5.2.9.3.3.1 Small Decisions with Big Consequences
The Impact of These Decisions
Every one of these decisions can substantially affect the final clustering results. For example:
- Failing to scale variables may cause clusters to reflect only the features with the largest values, ignoring potentially important structure in the other features.
- Using different linkage methods may produce clusters of very different shapes and sizes.
- Choosing the dissimilarity measure can mean the difference between grouping data by overall magnitude vs. by pattern.
- Picking different cut points in a dendrogram changes the number and composition of clusters.
- Selecting different values of K in K-means may reveal or obscure natural groupings in the data.
There is usually no single “correct” answer in clustering. The right choice depends on the data, the context, and the purpose of the analysis. In practice, analysts often try several different approaches, compare the results, and choose the solution that is most interpretable or useful for the application at hand. Any clustering that reveals interesting, valid structure in the data is worth considering.
5.2.9.3.3.2 Validating the Clusters Obtained
Clustering Always Produces Clusters
Whenever we apply a clustering algorithm—whether K-means, hierarchical, or any other method—to a dataset, the algorithm will always produce some grouping of the data into clusters.
This is true even if there is no real, meaningful structure or subgroups present in the data. The clustering algorithm will partition or group the points according to its rules, regardless of whether those groups correspond to any real pattern in the underlying data.
The Central Question: Are the Clusters Real or Just Noise?
The key issue we face after running a clustering algorithm is: Do the clusters we found represent real, meaningful subgroups in the data, or are they merely artifacts created by the algorithm as it tries to organize random fluctuations (noise) in the dataset?
In other words, if there were truly no structure in the data, the algorithm would still force the observations into clusters. How do we know if the clusters we see have any statistical or scientific validity?
The Challenge of Replicability
A natural way to test the validity of clusters is to ask: If we collected a new, independent sample of data from the same population, would we find the same clusters?
If the clustering structure is real, it should appear again in new data. If it was just a consequence of random noise, the clusters in the new dataset are likely to be completely different. However, in practice, obtaining truly independent data can be difficult or expensive, and subtle differences in data collection or preprocessing can lead to different results.
Statistical Evaluation: p-values for Clusters
To address this, researchers have developed various statistical techniques to assign a p-value to a cluster.
A p-value in this context is a measure of how likely it is to see a cluster of a given strength or tightness purely by chance, under some null model where there are no real clusters. If the p-value is very small, it suggests that the observed cluster is unlikely to have occurred by chance, and therefore may represent a real subgroup. These methods might involve:
- Randomly shuffling or permuting the data and repeating the clustering to see how often similar clusters appear by chance (a rough sketch of this idea follows below).
- Creating synthetic datasets from a null distribution (with no cluster structure) and comparing the clustering results.
- Bootstrapping (resampling with replacement from the data) to assess the stability of clusters.
The goal is to provide a statistical measure of confidence or evidence for the existence of each cluster.
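The formal procedures are beyond the scope of these notes, but the R sketch below illustrates the permutation idea in its simplest form (an informal illustration with simulated data, not a published test): compare the within-cluster sum of squares achieved on the real data with what K-means achieves on data whose columns have been independently shuffled, which destroys any joint cluster structure.

```r
set.seed(4)
x <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))

observed <- kmeans(x, centers = 2, nstart = 20)$tot.withinss
null_wss <- replicate(200, {
  x_perm <- apply(x, 2, sample)                 # shuffle each column independently
  kmeans(x_perm, centers = 2, nstart = 20)$tot.withinss
})
mean(null_wss <= observed)  # fraction of shuffled datasets that cluster at least as tightly
```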
Lack of Consensus on Best Practices
Despite the development of many such statistical validation techniques, there is no general agreement among statisticians and data scientists on a single best method for cluster validation.
Different methods may have different assumptions, may be appropriate for different kinds of data or clustering algorithms, and may yield different answers. The field is still evolving, and the best approach may depend on the specific context, the data structure, and the goals of the analysis. Therefore, cluster validation remains a challenging and active area of research.
Further Reading
For more detailed discussion and technical explanation of these methods, the text refers to ESL (The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman), where more advanced or alternative approaches to cluster validation are discussed.
5.2.9.3.3.3 Other Considerations in Clustering
Forced Assignment of Every Observation
Both K-means and hierarchical clustering algorithms are designed to assign every observation in your data set to a cluster, with no exceptions. This means that, no matter how unusual or dissimilar an observation is compared to the rest, it will still be grouped into one of the clusters.
In practice, this can be problematic when the data contain outliers—observations that are very different from the rest of the data, or even from each other.
Example: Outliers Among Clusters
Imagine a scenario where most observations naturally form a small number of genuine, meaningful subgroups (clusters).
For example, suppose you have a dataset of 100 points, with 95 of them forming three clear clusters, and the remaining 5 points being scattered far from these clusters and from each other.
Standard clustering algorithms like K-means or hierarchical clustering will still assign these 5 outliers to one of the existing clusters, because they have no mechanism for leaving observations unassigned.
This forced assignment can “pull” the clusters in unnatural directions, distorting the centroid (mean) or boundaries of the clusters, and making the resulting groups less meaningful or interpretable. The presence of even a few such outliers can have a disproportionate effect on the cluster assignments of all the other points.
Outlier Accommodation: Mixture Models
Mixture models provide an alternative approach to clustering that can better handle outliers.
Rather than forcing each observation into a single cluster, mixture models assign probabilities to each observation for belonging to each cluster. This is called soft clustering. For example, a point might have a 95% probability of belonging to cluster 1, a 4% probability for cluster 2, and a 1% probability for cluster 3. Crucially, mixture models can accommodate outliers by assigning them low probability to every cluster, rather than forcing them into a group where they do not fit. This approach is more flexible and can yield more accurate, meaningful clusters in the presence of outliers or “noise” points. Mixture models are a generalization of K-means and are especially useful in more complex or noisy datasets. More details are available in specialized texts like ESL (“The Elements of Statistical Learning”).
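A brief sketch of soft clustering in R, assuming the mclust package is available (other mixture-model implementations work similarly); in mclust, fit$z holds the estimated probability that each observation belongs to each component.

```r
library(mclust)   # assumes the mclust package is installed
set.seed(5)
x <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

fit <- Mclust(x, G = 2)        # two-component Gaussian mixture
head(round(fit$z, 3))          # soft assignments: membership probabilities per component
table(fit$classification)      # hard assignments, if needed
```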
Sensitivity of Clustering Results to Data Perturbations
Lack of Robustness
Another important issue is that clustering methods, including K-means and hierarchical clustering, are often not robust to small changes in the dataset.
In other words, small perturbations—such as removing a few observations at random, or making small changes to the data—can have a large effect on the resulting clusters. This lack of robustness can make it difficult to interpret cluster assignments as truly reflecting stable, underlying structure.
Illustration of Instability
Suppose you have n observations and cluster them using your chosen algorithm. Now, imagine you randomly remove a few of the n observations (say, 10% of the data), and then cluster the remaining observations again using the same method. You might expect that the clusters you obtain in the second analysis would be similar to those from the first analysis, especially for the observations that were present in both runs.
However, in practice, it is often observed that the two sets of clusters can be quite different—the assignments may change, the shapes and boundaries of clusters may shift, or the number of clusters may even change. This instability can undermine confidence in the clustering results and suggests that the clusters found may not be robust representations of real structure in the data.
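One simple way to see this instability in R (simulated, structure-free data; purely illustrative): cluster the full data, re-cluster after dropping 10% of the observations at random, and cross-tabulate the two labelings for the observations present in both runs.

```r
set.seed(6)
x <- matrix(rnorm(200 * 2), ncol = 2)     # no real cluster structure here

full <- kmeans(x, centers = 3, nstart = 20)
keep <- sample(nrow(x), size = 0.9 * nrow(x))
part <- kmeans(x[keep, ], centers = 3, nstart = 20)

# Cluster labels are arbitrary, so check whether rows and columns line up cleanly.
table(full = full$cluster[keep], partial = part$cluster)
```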
Implications
These issues highlight that clustering is not a foolproof or universally reliable process.
The presence of outliers, the forced assignment of every point, and the sensitivity to small changes all mean that clustering results should be interpreted with caution. Analysts should consider running clustering multiple times on different subsets of the data, or using methods (such as mixture models or robust clustering algorithms) that can better handle outliers and instability. It is also crucial to validate clusters using external knowledge or statistical techniques, rather than relying solely on the output of a single clustering run.
5.2.9.3.3.4 A Tempered Approach to Interpreting the Results of Clustering
The Value and Limitations of Clustering
Clustering is a powerful tool for exploring structure in unlabeled (unsupervised) data, and can reveal natural groupings that are not immediately obvious. However, as described in earlier sections, clustering is not a foolproof process. There are several pitfalls and sources of uncertainty that must be kept in mind to avoid misinterpretation.
The Role of User Decisions in Clustering Results
Small choices can have big consequences. The outcome of a clustering analysis depends heavily on a series of seemingly minor decisions made by the analyst. These include:
- How variables are scaled or standardized (for example, whether features are put on the same scale or left in their original units).
- The choice of distance or dissimilarity measure (such as Euclidean distance, correlation-based distance, or others).
- The type of linkage used in hierarchical clustering (single, complete, average, centroid, etc.).
- The number of clusters to extract (in methods like K-means).
Each of these choices can dramatically change which clusters are found, how the clusters are shaped, and even how many clusters are detected.
Sensitivity Analysis: Trying Different Parameters
Because clustering results are so dependent on analyst choices, it is recommended that clustering be performed multiple times, using different parameter settings.
For example, try clustering the data with and without variable standardization, using different linkage methods, or trying different distance metrics. After running these analyses, compare the results to see which patterns are robust (i.e., consistently appear across many choices) and which are not. Patterns that emerge consistently are more likely to reflect real structure in the data, while patterns that change with small parameter tweaks may be artifacts.
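For instance, a small R sketch of such a sensitivity analysis (simulated data): cut a three-cluster solution from hierarchical clustering under several linkage choices and compare the labelings; columns that agree (up to relabeling of the clusters) suggest structure that is robust to this choice.

```r
set.seed(7)
x <- matrix(rnorm(60 * 2), ncol = 2)

labels <- sapply(c("complete", "average", "single"), function(linkage) {
  cutree(hclust(dist(scale(x)), method = linkage), k = 3)
})
head(labels)   # one column of cluster labels per linkage method
```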
Testing Robustness: Subsets and Perturbations
Clustering algorithms are known to be non-robust: small changes in the data, or in the clustering parameters, can lead to big changes in the results.
To assess robustness, it is advised to cluster multiple random subsets of the data, or to repeat clustering after removing or perturbing some observations. This helps identify which clusters are stable and which are highly sensitive to the particular data points included. Robust clusters—those that persist across many random subsamples—are more likely to be meaningful.
Responsible Reporting and Interpretation
The results of any clustering analysis should be reported with caution.
Clustering output should never be interpreted as representing the absolute or final truth about the underlying structure of a dataset. The clusters identified are always a function of the data, the algorithm, and the choices made by the analyst, and can change with new data or different parameter settings. Therefore, clustering results should be viewed as exploratory: they are valuable for generating hypotheses, guiding further investigation, or suggesting possible groupings for future study.
Scientific Process: Clustering as a Hypothesis Generator
The most appropriate role for clustering in scientific research is as a starting point for developing scientific hypotheses.
Once clusters are found, they should be tested, validated, and explored using independent data whenever possible. Follow-up studies should check if the same clusters are found in new data, or if the clusters are associated with meaningful differences in outcomes or other variables.
Summary of the Approach
- Perform clustering with various choices of standardization, distance measure, and linkage type.
- Cluster on subsets of the data to test robustness.
- Look for consistent, stable patterns across analyses.
- Report findings as exploratory and conditional, not as final or absolute truths.
- Use clustering outcomes to guide further hypothesis development and validation, ideally with independent datasets.
5.2.10 Support Vector Machines
One of the simplest and most elegant classifiers.
Introduction by Visually Explained:
https://www.youtube.com/watch?v=_YPScrckx28
Pros: works well with small datasets, easy to interpret, straightforward to implement.
The support vector machine (SVM) is a popular classification method developed in the 1990s. SVMs are considered strong “out of the box” classifiers that perform well in many situations. The maximal margin classifier is a simple linear classifier, but it is limited because it requires a linear boundary and therefore cannot be applied to most datasets. The support vector classifier is more flexible, and the support vector machine can handle non-linear boundaries. There are also SVMs for multiclass problems, and these methods will be compared to other statistical approaches such as logistic regression.
5.2.10.1 Maximal Margin Classifier
5.2.10.1.1 What is a hyperplane?
In mathematics, a hyperplane is a fundamental geometric concept. In a space with \(p\) dimensions (called \(p\)-dimensional space), a hyperplane is a flat surface that has one dimension less than the ambient space. In other words, it is a subspace of dimension \(p - 1\). The hyperplane does not need to pass through the origin (it is affine): \(\beta_0\) positions the hyperplane, while \((\beta_1, \ldots, \beta_p)\) determines its orientation.
- If \(p\)=2: The space is a plane (think of a sheet of paper). A hyperplane in this space is a straight line. This line divides the plane into two regions.
- If \(p\)=3: The space is three-dimensional (like the world we live in). Here, a hyperplane is a flat, two-dimensional surface—a plane. This plane can divide the three-dimensional space into two parts.
- If \(p > 3\): Though we cannot visualize it easily, the concept still applies. For example, in 4D, a hyperplane is a three-dimensional “slice” of that space.
In two dimensions, a hyperplane is defined by the equation: \[ \beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0 \]
- \(\beta_0, \beta_1, \beta_2\) are constants (parameters or coefficients).
- \(X_1, X_2\) are the coordinates of a point in two dimensions.
A point \(X = (X_1, X_2)^T\) is on the hyperplane if it satisfies this equation.
In \(p\) dimensions, the equation becomes: \[ \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = 0 \]
- \((X_1, X_2, \ldots, X_p)^T\) is any point in p-dimensional space.
- \(\beta_1, \ldots, \beta_p\) are the coefficients for each coordinate.
- \(\beta_0\) is called the intercept or bias.
A point \(X\) lies on the hyperplane if plugging its coordinates into the equation makes the left side exactly zero.
If \(X\) does not satisfy the equation (i.e., the left-hand side is not exactly 0), then we have either: \[ \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p > 0 \]
or
\[ \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p < 0 \]
so that \(X\) lies on one side of the hyperplane or the other.
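A tiny R sketch of this sign check, using an arbitrary hyperplane \(1 + 2X_1 - 3X_2 = 0\) in two dimensions (the coefficients are made up for illustration):

```r
beta0 <- 1; beta <- c(2, -3)
f <- function(x) beta0 + sum(beta * x)   # evaluate the hyperplane equation at a point

f(c(1, 1))   # = 0: the point lies on the hyperplane
f(c(2, 1))   # > 0: the point lies on one side
f(c(0, 1))   # < 0: the point lies on the other side
```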
5.2.10.1.2 Classification Using a Separating Hyperplane
Data represented as matrix \(\mathbf{X}\) with \(n\) rows and \(p\) columns (\(n\) x \(p\)). Each row corresponds to a single observation (or data point) in a p-dimensional space. Generally, we write the \(i\)-th observation as a column vector: \[ x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{pmatrix} \]
for \(i = 1, 2, \ldots, n\). That is, each observation contains \(p\) features (variables, measurements, or inputs), and there are \(n\) such observations.
The dataset is assumed to be labeled for binary classification. Each observation \(x_i\) has a corresponding label \(y_i\), where \(y_i\) takes one of two possible values: -1 or +1. These represent the two classes. For example:
- \(y_i = +1\) might represent “positive” class
- \(y_i = -1\) might represent “negative” class
This labeling is arbitrary; the important point is that there are two classes, and each observation is assigned to one or the other. You may also have a test observation, denoted as \(x^*\), which is a new point in \(p\)-dimensional space that you want to classify. This test point is written as: \[ x^* = \begin{pmatrix} x_1^* \\ x_2^* \\ \vdots \\ x_p^* \end{pmatrix} \]
The values in \(x^*\) are the observed feature measurements for this new data point.
The task is to use the labeled training data \(\{(x_1, y_1), ..., (x_n, y_n)\}\) to build a classifier—a rule or function that, given any new observation \(x^*\), predicts whether it should belong to class +1 or class -1. There are various methods for doing this (such as logistic regression, linear discriminant analysis, or decision trees), but this section focuses on using a separating hyperplane.
A separating hyperplane is a hyperplane that perfectly divides the training data according to their class labels:
All observations of class +1 are on one side of the hyperplane. All observations of class -1 are on the other side.
This is made precise by the following property:
For every \(i\) such that \(y_i = +1\): \[ \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} > 0 \]
For every \(i\) such that \(y_i = -1\): \[ \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} < 0 \]
Which can be written formally as: \[ y_i\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}\right) > 0, \quad \forall i = 1, ..., n \]
- If \(y_i = +1\), then the term in parentheses must be positive.
- If \(y_i = -1\), then the term in parentheses must be negative.
- In both cases, the product is positive.
If a hyperplane exists, it can be used to classify new data points. For any new observation \(x^*\), compute: \[ f(x^*) = \beta_0 + \beta_1 x_1^* + \beta_2 x_2^* + \cdots + \beta_p x_p^* \]
The sign of \(f(x^*)\) determines the predicted class:
- If \(f(x^*) > 0\), assign to class +1.
- If \(f(x^*) < 0\), assign to class -1.
Additionally, the magnitude (absolute value) of \(f(x^*)\) gives information about the confidence of the prediction:
- If \(|f(x^*)|\) is large, the point \(x^*\) lies far from the hyperplane, so the prediction is made with high confidence.
- If \(|f(x^*)|\) is small (close to zero), \(x^*\) lies near the hyperplane, so the prediction is less certain.
Thus, the hyperplane not only divides the space into two regions but also provides a “distance” measure that reflects confidence in the classification.
A classifier based on a separating hyperplane always leads to a linear decision boundary. That is, the rule for classifying new points is based on a linear combination of their features, and the dividing surface between the two classes is flat (no curves).
5.2.10.1.3 Maximal Margin Classifier
When your data can be perfectly separated by a hyperplane (that is, you can draw a flat surface so that all points from one class are on one side and all points from the other class are on the other side), there is not just one unique hyperplane that can do this. In fact, there are infinitely many separating hyperplanes, because a given separating hyperplane can usually be shifted or rotated slightly and still separate the two classes perfectly, as long as it does not cross any of the data points.
So how is the “best” hyperplane defined?
A natural and widely-used principle is to choose the maximal margin hyperplane. This is also known as the optimal separating hyperplane. For any given hyperplane, you can measure the perpendicular distance from each data point to the hyperplane. This is the shortest path from the point to the hyperplane, not just along any direction. For each separating hyperplane, you find the smallest such distance among all data points. This is called the margin—the distance from the hyperplane to the closest point among all the training observations.
The maximal margin hyperplane is the one that maximizes this margin. In other words, it is the separating hyperplane that is as far away as possible from the closest data points in the training set. Mathematically: Out of all possible separating hyperplanes, choose the one where the minimal distance from any training observation to the hyperplane is the greatest possible.
The larger the margin, the more confident the classifier is in its predictions, and the more robust it is to small changes or noise in the data. The hope is that a classifier with a large margin on the training data will also have a large margin on test data, reducing the risk of misclassification. In other words, we are optimizing for generalization.
Once you have the maximal margin hyperplane, you can classify a new data point \(x^*\) by looking at which side of the hyperplane it falls on.
The rule is: \[ f(x^*) = \beta_0 + \beta_1 x_1^* + \beta_2 x_2^* + \cdots + \beta_p x_p^* \]
- If \(f(x^*) > 0\), assign to class +1
- If \(f(x^*) < 0\), assign to class -1
- The coefficients \(\beta_0, \beta_1, ..., \beta_p\) are those that define the maximal margin hyperplane.
The maximal margin hyperplane lies at the center of the widest possible “slab” (region between two parallel hyperplanes) that can be fit between the two classes, such that no points lie inside the slab.
If you examine the data, you will typically find that some points are exactly at the edge of the margin, i.e., they are closest to the maximal margin hyperplane.
These points are called support vectors.
They are “supporting” the maximal margin hyperplane in the sense that if you moved any one of these points, the position of the maximal margin hyperplane would change. Only these points matter for determining the hyperplane: moving any other point (that is farther from the margin) will not affect the maximal margin hyperplane, as long as it does not cross into the margin.
In 2D, you usually see two parallel dashed lines on either side of the maximal margin hyperplane, and the support vectors are the points that lie exactly on these dashed lines.
In any number of dimensions, support vectors are the points that are at the minimal distance from the hyperplane. There must be at least two support vectors (at least one from each class), and there are often more. The property that only the support vectors determine the hyperplane is crucial; it leads to efficient algorithms and is a key idea in support vector machines (SVMs).
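The maximal margin classifier can be approximated in R with the svm() function from the e1071 package by using a linear kernel and a very large cost (the data below are simulated and purely illustrative; e1071's cost argument penalizes margin violations, so a huge value effectively forbids them on separable data).

```r
library(e1071)   # assumes the e1071 package is installed
set.seed(10)
x <- matrix(rnorm(40), ncol = 2)
y <- c(rep(-1, 10), rep(1, 10))
x[y == 1, ] <- x[y == 1, ] + 3             # shift one class so the data are (almost surely) separable
dat <- data.frame(x = x, y = as.factor(y))

fit <- svm(y ~ ., data = dat, kernel = "linear", cost = 1e5, scale = FALSE)
fit$index                                  # row indices of the support vectors
plot(fit, dat)                             # support vectors are drawn as crosses
```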
Overfitting and High Dimensions
While the maximal margin classifier is often successful, it can overfit when the number of features \(p\) is large compared to the number of data points \(n\). That is, in high-dimensional spaces, even random or noisy data can sometimes be perfectly separated, but the resulting hyperplane may not generalize well to new data.
5.2.10.1.4 Construction of the Maximal Margin Classifier
Suppose you have a dataset consisting of n training observations. Each observation is a point in \(p\)-dimensional space, denoted as: \[ x_1, x_2, \ldots, x_n \in \mathbb{R}^p \]
Each observation \(x_i\) is associated with a class label \(y_i\), where \(y_i\) can take on one of two possible values: -1 or +1. That is, \[ y_1, y_2, \ldots, y_n \in \{-1, 1\} \] The goal is to find the maximal margin hyperplane: a separating hyperplane that maximizes the minimum distance (the margin) between itself and any of the training observations. This hyperplane will be used as a classifier.
The Optimization Problem
To find the maximal margin hyperplane, you need to solve the following optimization problem. You are searching for the values of the hyperplane parameters \(\beta_0, \beta_1, \ldots, \beta_p\) and the margin \(M\):
\[ \underset{\beta_0, \beta_1, \ldots, \beta_p, M}{\text{maximize}} \quad M \]
You want to make the margin M as large as possible. Subject to (Constraints):
- Normalization Constraint: \[ \sum_{j=1}^{p} \beta_j^2 = 1 \]
This means the vector of coefficients \((\beta_1, \beta_2, \ldots, \beta_p)\) has length 1 (unit norm). It’s a mathematical trick to ensure that the margin is measured in a consistent way; otherwise, you could make the margin arbitrarily large by simply scaling all the coefficients.
- Margin Constraint: \[ y_i \left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}\right) \geq M \qquad \forall i = 1, \ldots, n \]
This must hold for every observation in the training set.
- If \(y_i = +1\), this constraint requires that the hyperplane function at \(x_i\) is at least M (i.e., at least margin M away from the hyperplane, on the correct side).
- If \(y_i = -1\), the constraint requires the hyperplane function at \(x_i\) is at most -\(M\) (i.e., at least margin M away from the hyperplane, but on the other side).
- In both cases, the left-hand side must be at least \(M\), so all observations are not only on the correct side of the hyperplane, but also at least a distance \(M\) away from it.
Margin Constraint
The constraint \[ y_i\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}\right) \geq M \]
ensures that every point is not just classified correctly, but with some cushion or buffer of size \(M\), which is the margin. This is a stricter requirement than simply being on the correct side.
Normalization Constraint
The constraint \[ \sum_{j=1}^{p} \beta_j^2 = 1 \]
is needed because without it, the coefficients could be scaled up or down arbitrarily, which would make the margin \(M\) meaningless (since both the coefficients and margin would scale together). By fixing the length of the coefficient vector, the margin is a real, interpretable distance.
Invariance to Scaling
If you have a hyperplane defined by \[ \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} = 0 \]
then multiplying all coefficients by any nonzero constant \(k\) (i.e., using \(k \beta_0, k \beta_1, \ldots, k \beta_p\)) gives the same hyperplane:
\[ k\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}\right) = 0 \]
So, the normalization constraint makes sure we’re only considering one scaling for the coefficients.
Perpendicular Distance to the Hyperplane
The formula \[ y_i\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}\right) \]
with the normalization constraint in place, gives the perpendicular distance from the \(i\)-th training observation to the hyperplane. This is because, when the norm of the coefficient vector is 1, the value inside the parentheses is the signed distance from the point to the hyperplane along the normal direction (the direction perpendicular to the hyperplane).
5.2.10.1.5 The Non-separable Case
The Non-separable Case in Classification
Maximal Margin Classifier and Its Limitation
The maximal margin classifier is a method for classifying data that works by finding the separating hyperplane with the largest possible margin between two classes. This approach is only feasible if the data are linearly separable—that is, if there exists at least one hyperplane that puts all data points of one class on one side, and all points of the other class on the other side, with no errors. This requirement is formalized mathematically:
The optimization problem (described in the previous section, involving maximizing \(M\) subject to all training points being at least margin \(M\) from the hyperplane and on the correct side) has a solution with \(M > 0\) if and only if the data are perfectly separable. If even a single data point is on the wrong side (for example, due to noise, outliers, or overlapping distributions), no hyperplane can satisfy all the constraints, and thus the optimization problem has no feasible solution with a positive margin.
When Data Is Not Linearly Separable
In many real-world situations, data from two (or more) classes overlap in feature space. This means:
There may be points from both classes that are intermingled or close together in some regions. There may be outliers, or the boundary between classes may be nonlinear, or the features may not provide enough information to separate the classes perfectly.
In such cases:
No separating hyperplane exists that can perfectly divide the data into two classes with all points correctly classified. The optimization problem for the maximal margin classifier cannot produce a solution where \(M > 0\), because at least one constraint will always be violated (a point will be on the wrong side or within the margin).
What Happens Next?
While perfect separation may not be possible, we still want a classifier that performs well, ideally by finding a boundary that separates the classes as well as possible, even if not perfectly. This leads to the idea of relaxing the constraints of the maximal margin classifier.
Soft Margin and the Support Vector Classifier
Soft margin: Instead of requiring every point to be perfectly classified and at least margin \(M\) away from the hyperplane, we allow some points to:
Be on the wrong side of the hyperplane (misclassified). Lie within the margin (correctly classified but too close to the boundary).
The idea is to find a balance: maximize the margin as much as possible, but permit some violations of the constraints, penalizing them in the optimization. This is called the soft margin approach.
Support Vector Classifier
The support vector classifier is the generalization of the maximal margin classifier to the non-separable case. It introduces new variables (called slack variables) and a penalty parameter to control the trade-off between margin size and classification errors. The optimization problem is adjusted to allow constraint violations, but penalizes them—so the solution is a hyperplane that “almost” separates the classes, but with the best possible trade-off between margin and misclassifications.
Generalization
This issue (no separating hyperplane) can arise in any dimension (\(p\)), with any number of points (\(n\)), and for any arrangement of feature values and labels. The soft margin approach and support vector classifier can be used in all these cases, and are the foundation for support vector machines (SVMs).
5.2.10.2 Support Vector Classifiers
5.2.10.2.1 Overview of the Support Vector Classifier
Support Vector Classifiers: Motivation and Principles
Non-separability of Classes
In real-world data, observations from two classes are not always linearly separable—that is, you cannot always draw a straight line (in two dimensions), a flat plane (in three dimensions), or a hyperplane (in higher dimensions) that perfectly divides the classes so that all points from one class are on one side and all points from the other class are on the other side. This fundamental issue is illustrated in Figure 9.4, where some data points from each class are intermingled in such a way that no hyperplane exists to perfectly separate them.
Even When a Separating Hyperplane Exists
Even in situations where a separating hyperplane does exist, using it for classification may not be the best approach. Why? Because:
- Perfect Separation = Perfect Fit to Training Data: A classifier based solely on a separating hyperplane will always classify all training data perfectly. Every data point will be on the correct side of the hyperplane (by construction).
- Overfitting Risk: If the classifier is forced to fit every training example perfectly, it can become overly sensitive to small changes or outliers in the data.
Sensitivity and Margin
The distance of a data point to the hyperplane is called its margin. The margin is a measure of how confident the classifier is in its prediction for that point: the farther the point is from the hyperplane, the more confident the classifier is.
Sensitivity to Observations
If you add a single new observation, especially if it’s an outlier or on the “edge,” the separating hyperplane can shift dramatically to accommodate it. As shown in Figure 9.5, the addition of just one new point can cause the hyperplane to move significantly, reducing the margin and confidence for other points. This means the maximal margin hyperplane is not robust: it is highly sensitive to minor changes in the data.
Overfitting
Overfitting occurs when a classifier models not only the underlying pattern but also the noise or peculiarities of the training data. A maximal margin classifier that fits every point (including outliers) perfectly may generalize poorly to new, unseen data because it is too closely tailored to the specifics of the training set. The margin may become very small or “tiny,” which indicates low confidence in the classifications.
Why Allow Imperfect Separation? Given this sensitivity and overfitting risk, it is often better to use a classifier that does not require perfect separation of the two classes. In practice, this means:
- Allowing some points to be misclassified (on the wrong side of the hyperplane).
- Allowing some points to be within the margin (too close to the boundary, even if on the correct side).
- Focusing on better classification of most points, rather than perfect classification of all points.
This leads to greater robustness, meaning the classifier is less affected by noise, outliers, or small changes in the data.
The Support Vector Classifier (Soft Margin Classifier)
What Is It?
The support vector classifier (also called the soft margin classifier) generalizes the maximal margin classifier to allow for errors and margin violations. The goal is no longer to find a hyperplane that perfectly separates the classes with the widest possible margin, but rather to find a hyperplane that balances two objectives:
- Maximizing the margin (distance from the hyperplane to the nearest points).
- Minimizing the number and severity of constraint violations (misclassified or margin-violating points).
How Does It Work?
- Slack Variables: Instead of requiring every point to be on the correct side of the margin (as with the hard margin/maximal margin classifier), we introduce variables that allow some points to fall inside the margin or even on the wrong side of the hyperplane.
- Penalty Parameter: The optimization problem includes a parameter that controls how much penalty is assigned to points that violate the margin or are misclassified. This balances the trade-off between margin size and classification errors.
- Soft Margin: The margin is called “soft” because it is not strict; some points can violate it.
What Does It Achieve?
- Most Points Correct: Most training observations will be on the correct side of the margin and hyperplane, but a small subset may not be.
- Some Misclassifications: When perfect separation is impossible, some points will necessarily be misclassified (on the wrong side of the hyperplane).
- Generalization: By not insisting on perfect classification, the support vector classifier often achieves better generalization to new data.
Generalization and Scenarios
These methods apply to any number of features (\(p\)), any number of observations (\(n\)), and any layout of data—even if the classes are highly overlapping, or if outliers are present. The approach is robust to the addition or removal of a few atypical points. It can be adjusted (via the penalty parameter) to tolerate more or fewer errors depending on the specific problem or the cost of misclassification.
Summary of Key Points
- Separating hyperplanes may not exist, or may not be desirable due to overfitting and sensitivity.
- Support vector classifiers allow for misclassification of training data (soft margin), leading to increased robustness and better performance on new data.
- Soft margin classifiers are designed to find the best trade-off between margin width and misclassification, rather than perfect separation.
5.2.10.2.2 Details of the Support Vector Classifier
The support vector classifier is a classification method that finds a hyperplane to separate two classes, but unlike the maximal margin classifier, it allows some flexibility:
Most training observations are separated correctly (on the right side of the hyperplane and outside the margin), a small number may be on the wrong side of the margin, and an even smaller number may be on the wrong side of the hyperplane (misclassified).
This method is particularly useful when perfect separation is not possible or not desirable, due to overlapping classes, noise, or outliers.
The Optimization Problem (Soft Margin SVM)
The support vector classifier is defined formally by solving the following optimization problem.
Variables
- \(\beta_0, \beta_1, \ldots, \beta_p\): coefficients of the hyperplane
- \(M\): the margin (width between the closest points and the hyperplane)
- \(\epsilon_1, \ldots, \epsilon_n\): slack variables, one for each observation, quantifying margin or classification violations
\[ \underset{\beta_0, \beta_1, \ldots, \beta_p, \epsilon_1, \ldots, \epsilon_n, M}{\text{maximize}} \quad M \]
The goal is to make the margin \(M\) as large as possible, while allowing for possible violations.
Constraints
- Normalization: \[ \sum_{j=1}^{p} \beta_j^2 = 1 \]
This ensures the coefficients define the margin in a standardized way, avoiding arbitrary scaling.
- Soft Margin Constraint: \[ y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \geq M(1 - \epsilon_i) \quad \forall i \]
- If \(\epsilon_i = 0\), the \(i\)th observation is on the correct side of the margin (at least distance M from the hyperplane).
- If \(0 < \epsilon_i \leq 1\), the ith observation is inside the margin but on the correct side of the hyperplane (closer than \(M\)).
- If \(\epsilon_i > 1\), the ith observation is on the wrong side of the hyperplane (misclassified).
- Slack variables non-negative:
\[ \epsilon_i \geq 0 \]
Negative slack makes no sense (would mean more than perfect separation).
- Budget on violations (“C constraint”):
\[ \sum_{i=1}^n \epsilon_i \leq C \]
\(C\) is a non-negative parameter (called the tuning parameter or budget). It controls how much total violation to the margin and hyperplane is tolerated across all data.
Interpretation of Each Constraint and Parameter
Slack Variables \(\epsilon_i\)
Each slack variable \(\epsilon_i\) quantifies how much the ith data point violates the margin constraint:
- \(\epsilon_i = 0\): correctly classified, at least margin M away from the hyperplane (ideal case).
- \(0 < \epsilon_i \leq 1\): on the correct side of the hyperplane, but inside the margin (not as confidently classified).
- \(\epsilon_i > 1\): on the wrong side of the hyperplane (misclassified).
Tuning Parameter (\(C\))
The sum of all slack variables is at most \(C\), so:
- Smaller \(C\): Less tolerance for violations—margin is narrower, fewer misclassifications or margin violations are allowed, more “rigid” fit.
- Larger \(C\): More tolerance for violations—margin can be wider, more misclassifications or violations are allowed, less rigid fit.
\(C\) can be chosen using cross-validation or other model selection methods to optimize out-of-sample (test) performance. If \(C = 0\), then all \(\epsilon_i = 0\), and the problem reduces to the maximal margin classifier (hard margin SVM), which only works if the data is perfectly separable.
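In R, the tune() helper in the e1071 package performs this cross-validation over a grid of cost values (sketch with simulated data; note that e1071's cost argument is a penalty on violations, so it behaves roughly like the inverse of the budget \(C\) described above: a large cost corresponds to a small budget).

```r
library(e1071)   # assumes the e1071 package is installed
set.seed(11)
x <- matrix(rnorm(100), ncol = 2)
y <- as.factor(c(rep(-1, 25), rep(1, 25)))
x[y == 1, ] <- x[y == 1, ] + 1.5           # overlapping classes
dat <- data.frame(x = x, y = y)

tuned <- tune(svm, y ~ ., data = dat, kernel = "linear",
              ranges = list(cost = c(0.01, 0.1, 1, 10, 100)))
summary(tuned)        # cross-validation error for each cost value
tuned$best.model      # the support vector classifier refit at the selected cost
```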
Geometric and Statistical Implications
The resulting hyperplane is not affected by points that are well away from the margin and on the correct side—only the points close to or inside the margin (or misclassified) matter. These influential points are called support vectors.
Support vectors are the observations that either:
- Lie exactly on the margin (\(\epsilon_i = 0\) and the margin constraint is tight), or
- Violate the margin (\(\epsilon_i > 0\)), including those that are misclassified (\(\epsilon_i > 1\)).
The position of other (non-support vector) observations does not affect the solution.
Bias-Variance Trade-off (Role of \(C\))
When \(C\) is large:
- Margin is wide.
- Many points violate the margin.
- Many support vectors.
- Classifier is less sensitive to individual points (lower variance), but may underfit (higher bias).
When \(C\) is small:
- Margin is narrow.
- Few points violate the margin.
- Fewer support vectors.
- Classifier fits the training data more closely (lower bias), but may be more sensitive to noise or outliers (higher variance).
Comparison With Other Methods
Support vector classifier depends only on a subset of observations (support vectors), making it robust to outliers or changes far from the decision boundary. Linear discriminant analysis (LDA), in contrast, depends on the means and covariance of all observations in each class.
This means that every observation, even those far from the decision boundary, influences the classifier.
Logistic regression is similar to the support vector classifier in that it is also less sensitive to outlying observations far from the decision boundary.
If you adjust \(C\):
- High \(C\): The classifier allows more violations, the margin is larger, and more points are support vectors.
- Low \(C\): The classifier allows fewer violations, the margin is smaller, and fewer points are support vectors.
As \(C\) changes, the number of support vectors and the width of the margin change accordingly.
5.2.10.3 Support Vector Machines
5.2.10.3.1 Classification with Non-Linear Decision Boundaries
Converting Linear to Non-Linear Classifiers
The main idea is to start from a linear classifier (like a support vector classifier), which constructs a linear decision boundary between two classes, and find a general mechanism to convert it into a classifier that can handle non-linear boundaries between classes. The support vector machine (SVM) accomplishes this in an automatic and computationally efficient way.
Limitations of Linear Classifiers
The support vector classifier is designed for situations where the boundary between two classes is linear; that is, the classes can be separated by a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions). In practice, class boundaries are often non-linear. This means a straight line or plane cannot adequately separate the classes.
Analogy to Regression
In regression (Chapter 7), when the relationship between predictors and outcome is non-linear, the performance of linear regression suffers. To address this, we expand the feature space by including polynomial functions of predictors (e.g., quadratic, cubic terms), so that the model can capture non-linear relationships.
Expanding the Feature Space for Classification
Similarly, for classification, if we suspect (or observe) that the boundary between classes is non-linear, we can enlarge the feature space by including additional features constructed from the original ones, such as:
- Quadratic terms: \(X_1^2, X_2^2, \ldots, X_p^2\)
- Cubic terms: \(X_1^3, X_2^3, \ldots, X_p^3\)
- Higher-order polynomials: \(X_j^d\) for degree \(d\)
- Interaction terms: \(X_j X_{j'}\) for \(j \neq j'\)
- Other non-linear functions
Example of Enlarged Feature Space
Suppose your original predictors are \(X_1, X_2, ..., X_p\). Instead of fitting a support vector classifier using just these \(p\) features, you could expand the feature space to include their squares:
\[ X_1, X_1^2, X_2, X_2^2, ..., X_p, X_p^2 \]
This results in \(2p\) features: each original feature and its square.
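A small R sketch of this idea (simulated data with a circular class boundary, so the true boundary is quadratic in the original features): add the squared features by hand and fit a linear support vector classifier in the enlarged space.

```r
library(e1071)   # assumes the e1071 package is installed
set.seed(12)
x <- matrix(rnorm(200), ncol = 2)
y <- as.factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, 1, -1))   # circular boundary

dat <- data.frame(x1 = x[, 1], x2 = x[, 2],
                  x1sq = x[, 1]^2, x2sq = x[, 2]^2, y = y)
fit <- svm(y ~ x1 + x2 + x1sq + x2sq, data = dat, kernel = "linear", cost = 1)
mean(predict(fit, dat) == dat$y)   # the linear fit in the enlarged space separates the classes well
```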
Mathematical Formulation in the Enlarged Space The optimization problem for the support vector classifier, when using quadratic terms, becomes: \[ \begin{aligned} & \underset{\beta_0, \beta_{11}, \beta_{12}, ..., \beta_{p1}, \beta_{p2}, \epsilon_1, ..., \epsilon_n, M}{\text{maximize}} \quad M \\ & \text{subject to:} \\ & \quad y_i \left( \beta_0 + \sum_{j=1}^{p} \beta_{j1} x_{ij} + \sum_{j=1}^{p} \beta_{j2} x_{ij}^2 \right) \geq M(1 - \epsilon_i), \\ & \quad \sum_{i=1}^n \epsilon_i \leq C, \quad \epsilon_i \geq 0, \\ & \quad \sum_{j=1}^p \sum_{k=1}^2 \beta_{jk}^2 = 1 \end{aligned} \]
- \(y_i\) is the class label for observation i.
- \(x_{ij}\) is the value of feature j for observation i.
- \(\beta_0\) is the intercept.
- \(\beta_{j1}\) is the coefficient for the linear term in feature j.
- \(\beta_{j2}\) is the coefficient for the quadratic term in feature j.
- \(\epsilon_i\) are slack variables allowing for violations of the margin.
- \(M\) is the margin to be maximized.
- \(C\) is a parameter controlling the total allowed margin violations.
Why Does This Lead to Non-Linear Decision Boundaries?
Geometry in the Enlarged Space
In the enlarged feature space (with the quadratic features), the SVM finds a linear boundary (a hyperplane). But when this boundary is projected back to the original feature space, it becomes a non-linear boundary.
The decision boundary in the original space is of the form \(q(x) = 0\), where \(q(x)\) is a quadratic polynomial in the original features. Generally, such equations define curves (like parabolas, ellipses, or more complicated shapes) rather than straight lines.
Generalization to Other Expansions
One can go beyond quadratic terms:
- Cubic terms: \(X_1^3, X_2^3, ...\)
- Interaction terms: \(X_j X_{j'}\) for \(j \neq j'\)
- Higher-order polynomials: up to any desired degree.
- Other functions: trigonometric, exponential, indicator functions, etc.
The more complex the expansion, the more flexible the resulting decision boundary.
Computational Considerations
- Curse of dimensionality: As you add more and more features (especially for high-degree polynomials or many interactions), the number of features can grow very large, quickly becoming computationally infeasible.
- Efficient computation: The key innovation of the support vector machine is that it allows us to enlarge the feature space implicitly (using kernels), enabling the computation of very complex, non-linear boundaries without explicitly computing all new features.
Summary of the Mechanism:
- Linear classifiers work well only when the class boundary is linear.
- When class boundaries are non-linear, one can expand the feature space by adding polynomial (and other) terms.
- In the expanded space, the SVM finds a linear boundary, which corresponds to a non-linear boundary in the original space.
- This approach can be generalized to any type of non-linear expansion, not just polynomials.
- The number of features grows rapidly with the degree and number of predictors, making explicit computation expensive or infeasible for large expansions.
- The support vector machine, by using kernels, allows us to benefit from these expansions efficiently, as it computes inner products in the expanded space without ever constructing the expanded feature vectors.
5.2.10.3.2 The Support Vector Machine
From Support Vector Classifier to SVM
A support vector machine (SVM) is a powerful generalization of the support vector classifier. The support vector classifier finds a linear boundary (a hyperplane) that best separates two classes, possibly allowing some margin violations as controlled by a tuning parameter. However, real-world data often cannot be separated by a straight line (or plane, or hyperplane) in the original feature space, because the relationship between the features and the classes is non-linear.
Enlarging the Feature Space
To solve this, SVMs enlarge the feature space—they implicitly map the original features into a higher-dimensional (even infinite-dimensional) space, where a linear separation might be possible. This mapping is performed not by directly computing new features, but by using a mathematical tool called a kernel.
A kernel is a function that quantifies the similarity between two observations, and mathematically it generalizes the concept of an inner product in the original feature space.
The Inner Product
Given two vectors \(a\) and \(b\) in \(\mathbb{R}^r\), their inner product is \[ \langle a, b \rangle = \sum_{i=1}^{r} a_i b_i \]
This is a measure of similarity, giving a large value when \(a\) and \(b\) point in similar directions. For two observations \(x_i, x_{i'}\) in a \(p\)-dimensional feature space, the inner product is
\[ \langle x_i, x_{i'} \rangle = \sum_{j=1}^{p} x_{ij} x_{i'j} \]
This is the basis of the linear kernel.
The Support Vector Classifier in Terms of Inner Products
It can be shown that the decision function for the linear support vector classifier can be written as: \[ f(x) = \beta_0 + \sum_{i=1}^n \alpha_i \langle x, x_i \rangle \]
where
- \(\beta_0\) is the intercept,
- \(\alpha_i\) are coefficients determined during training, one for each training observation \(x_i\).
However, only the support vectors (points on or inside the margin, or misclassified) have non-zero \(\alpha_i\). For all other points, \(\alpha_i = 0\). Thus, we can rewrite the function more efficiently as:
\[ f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle \]
where \(S\) is the set of indices for the support vectors. Key point: To compute \(f(x)\) for any new point, we only need the inner product between x and each support vector. All other training points play no role in the prediction.
Generalizing the Inner Product: The Kernel Function Suppose, instead of just using the linear inner product, we use a more general function K that measures similarity in a potentially more complex way: \[ K(x_i, x_{i'}) = \text{kernel function applied to } x_i \text{ and } x_{i'} \]
Now, the decision function becomes: \[ f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i) \]
This allows the SVM to find non-linear boundaries in the original feature space, because the kernel \(K\) implicitly maps the data into a higher-dimensional space where a linear separation is possible. You never actually need to compute the coordinates in the high-dimensional space; you only need to compute \(K(x, x_i)\) for pairs of points.
Types of Kernels
Linear Kernel
\[ K(x_i, x_{i'}) = \langle x_i, x_{i'} \rangle = \sum_{j=1}^{p} x_{ij} x_{i'j} \]
This corresponds to the standard support vector classifier: the boundary is linear in the original features.
Polynomial Kernel
\[ K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^p x_{ij} x_{i'j} \right)^d \]
- \(d\) is a positive integer, the degree of the polynomial.
- When \(d = 1\), this is just the linear kernel.
- When \(d > 1\), the boundary in the original feature space becomes more flexible, allowing for curved or more complex shapes. This is equivalent to implicitly fitting a classifier in a space of polynomial features up to degree d.
Radial (Gaussian) Kernel \[ K(x_i, x_{i'}) = \exp\left(-\gamma \sum_{j=1}^{p}(x_{ij} - x_{i'j})^2\right) \]
- \(\gamma\) is a positive parameter that controls the width of the kernel.
- This kernel measures similarity based on distance between points. If two points are close, the value is near 1; if far, it is near 0.
- This kernel is local: only nearby points have significant influence on the decision for a new observation.
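To make the three kernels concrete, here is a minimal base-R sketch (not from the text); the function names and the example vectors are made up purely for illustration.

```r
# Minimal base-R sketch of the three kernels described above.
linear_kernel <- function(x, xp) sum(x * xp)                          # <x, x'>
poly_kernel   <- function(x, xp, d) (1 + sum(x * xp))^d               # (1 + <x, x'>)^d
radial_kernel <- function(x, xp, gamma) exp(-gamma * sum((x - xp)^2))

x  <- c(1, 2, 3)                     # hypothetical observation x_i
xp <- c(2, 0, 1)                     # hypothetical observation x_i'
linear_kernel(x, xp)                 # 5
poly_kernel(x, xp, d = 2)            # (1 + 5)^2 = 36
radial_kernel(x, xp, gamma = 0.5)    # exp(-0.5 * 9), close to 0: the points are far apart
```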
Decision Function With Kernels With any kernel, the SVM decision function is always: \[ f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i) \]
The sign of \(f(x)\) determines the predicted class label. Key insights:
The SVM’s decision for a new point depends only on its similarity (under the kernel) to the support vectors. The choice of kernel determines the flexibility and shape of the decision boundary. Kernels allow SVMs to efficiently compute decision boundaries in spaces that may be vastly higher-dimensional than the original feature space, or even infinite-dimensional (as with the radial kernel).
Kernel trick: By using kernel functions, all computations can be done in terms of the kernel without explicitly transforming data into the higher-dimensional or infinite-dimensional space. This makes SVMs practical even for very large or complex feature spaces, because the actual computations never leave the original feature space. For n training observations, you need only compute the kernel for each pair, i.e., \(K(x_i, x_{i'})\) for all pairs, which is \(\binom{n}{2}\) computations.
Local and Global Behavior
Linear kernel: All points influence the decision globally. Polynomial kernel: Influence can be more complex, depending on the degree. Radial kernel: Only points near the test point (in Euclidean distance) influence its classification—far points have negligible effect.
Generalization
Any function \(K(x_i, x_{i'})\) that satisfies appropriate mathematical properties (symmetric and positive semi-definite) can serve as a kernel. The SVM can thus be adapted to a wide variety of tasks, data types, and notions of similarity.
Summary:
SVM extends the support vector classifier by allowing complex, nonlinear boundaries using kernels. Kernels are functions that generalize the inner product to measure similarity in possibly complex ways. The decision function of an SVM can always be written in terms of kernel evaluations with the support vectors. Linear kernel yields a linear boundary; polynomial kernel yields a polynomial boundary; radial kernel yields a highly flexible, local boundary. The kernel trick allows SVMs to work in implicitly vast or infinite-dimensional spaces without explicit computation in those spaces. Only the support vectors (usually a subset of the training data) determine the boundary; all other observations have no effect on the decision function. The behavior of the SVM decision boundary (global vs. local, linear vs. nonlinear) is determined by the choice of kernel and its parameters.
5.2.10.4 SVMs with More than Two Classes
SVMs With More Than Two Classes
Binary Versus Multi-Class Classification
Until now, the discussion of support vector machines (SVMs) has focused exclusively on binary classification—distinguishing between two classes (for example, “yes” vs. “no”, “disease” vs. “no disease”, “cat” vs. “dog”). In binary SVM, a single separating hyperplane is learned, separating the two classes as well as possible (with a margin). However, in many real-world problems, you must classify observations into one of \(K\) classes, where \(K > 2\). For example, classifying images as “cat”, “dog”, or “rabbit” means \(K = 3\).
The Challenge
SVMs are naturally constructed for two-class separation. There is no single hyperplane that can, in general, simultaneously separate more than two classes. Thus, to use SVMs for multi-class problems, we must adapt the approach.
Two Main Approaches: One-Versus-One and One-Versus-All
5.2.10.4.1 One-Versus-One Classification
One-Versus-One (All-Pairs) Classification
Concept: For \(K\) classes, construct a separate SVM classifier for every possible pair of classes. The number of unique pairs among \(K\) classes is: \[ \binom{K}{2} = \frac{K(K-1)}{2} \]
For each pair \((k, k')\), fit an SVM that distinguishes class \(k\) (coded as +1) from class \(k'\) (coded as -1), ignoring all other classes for that classifier.
Classification Process for a Test Observation
1. For a new observation, run it through each of the \(\binom{K}{2}\) SVMs.
2. Each SVM “votes” for one of its two classes (the one it assigns to the observation).
3. Count up the votes: For each class, count how many times it was selected by the pairwise classifiers.
4. Final assignment: Assign the observation to the class with the most votes across all \(\binom{K}{2}\) classifiers.
Generalization: This approach works for any number \(K > 2\) of classes. Advantage: Each classifier only has to distinguish between two classes at a time, so the problem stays as simple as possible. Disadvantage: The number of classifiers to train and evaluate grows quadratically with \(K\); for large \(K\), this can become computationally expensive.
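As a sketch of the voting step only (assuming the \(\binom{K}{2}\) pairwise SVMs have already been fit by whatever means), the base-R snippet below tallies pairwise predictions for a single test observation; the class names are hypothetical.

```r
# Base-R sketch of one-versus-one majority voting for a single test observation.
# 'pairwise_pred' holds the class chosen by each of the K(K-1)/2 pairwise classifiers.
ovo_vote <- function(pairwise_pred) {
  votes <- table(pairwise_pred)         # count how often each class was selected
  names(votes)[which.max(votes)]        # assign to the class with the most votes
}

# Hypothetical output of the 3 pairwise SVMs when K = 3 (cat / dog / rabbit)
ovo_vote(c("cat", "cat", "rabbit"))     # "cat"
```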
5.2.10.4.2 One-Versus-All Classification
One-Versus-All (One-Versus-Rest) Classification
Concept: For \(K\) classes, construct \(K\) different SVMs. Each SVM is trained to distinguish one class (coded as +1) from all the other classes (coded as -1) combined.
Classification Process for a Test Observation
- For each class \(k\), fit an SVM that separates class k from all other classes.
- The SVM for class \(k\) yields a set of parameters: \(\beta_{0k}, \beta_{1k}, \ldots, \beta_{pk}\)
- For a test observation \(x^* = (x_1^*, x_2^*, \ldots, x_p^*)\), compute the score for each class: \[ f_k(x^*) = \beta_{0k} + \beta_{1k} x_1^* + \beta_{2k} x_2^* + \cdots + \beta_{pk} x_p^* \]
for all \(k = 1, \ldots, K\).
Final assignment: Assign the observation to the class \(k\) with the largest score \(f_k(x^*)\). The largest score corresponds to the highest “confidence” that the observation belongs to that class, according to the SVM for that class.
Generalization: This approach also works for any number \(K > 2\) of classes. Advantage: Only \(K\) classifiers are needed (linear in \(K\)), so it is computationally simpler for large \(K\). Disadvantage: Each classifier must learn to separate one class from a potentially very heterogeneous set (“all other classes”), which can be more difficult, especially if classes are not well separated.
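A minimal base-R sketch of the one-versus-all decision rule, assuming the \(K\) sets of coefficients have already been estimated; every number below is hypothetical.

```r
# Base-R sketch of the one-versus-all decision rule.
# beta0 is a length-K vector of intercepts; B is a K x p matrix whose k-th row holds
# (beta_k1, ..., beta_kp) from the SVM separating class k from all the others.
ova_classify <- function(xstar, beta0, B, class_labels) {
  scores <- beta0 + as.vector(B %*% xstar)   # f_k(x*) for every class k
  class_labels[which.max(scores)]            # assign to the class with the largest score
}

# Hypothetical coefficients for K = 3 classes and p = 2 features
beta0 <- c(0.1, -0.2, 0.0)
B     <- rbind(c(1, -1), c(0.5, 0.5), c(-1, 1))
ova_classify(c(2, 1), beta0, B, c("cat", "dog", "rabbit"))   # "dog"
```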
Mathematical Details and Generality Number of Classifiers
One-versus-one: \(\binom{K}{2} = \frac{K(K-1)}{2}\) SVMs One-versus-all: \(K\) SVMs
Coding of Classes
In each classifier, the classes being compared are coded as +1 and -1. All other classes are ignored (for one-versus-one) or lumped together as -1 (for one-versus-all).
Decision Rule
One-versus-one: Assign by majority vote across all pairwise comparisons. One-versus-all: Assign to the class whose SVM gives the largest value of the decision function for the test observation.
Examples of Application (Generalized) Suppose you have a dataset with \(K\) different animal species, and you want to classify new images based on features (such as color, size, shape, etc.).
One-versus-one: You build a separate SVM for every pair of species (cat vs. dog, cat vs. rabbit, dog vs. rabbit, etc.), then combine their votes for each new image. One-versus-all: For each species, you build an SVM to distinguish that species from all others combined, then, for a new image, assign it to the species whose classifier is most confident.
Summary of the Core Ideas:
SVMs are fundamentally binary classifiers, and do not naturally extend to multi-class problems. Two common strategies make SVMs work for any number of classes:
One-versus-one: Build classifiers for every possible pair of classes and use majority voting. One-versus-all: Build one classifier per class, each distinguishing that class from all others, and assign based on the highest score.
The formulas and decision rules outlined above are general and apply for any number of classes, any type of features, and any SVM kernel (linear, polynomial, radial, etc.). The choice between one-versus-one and one-versus-all depends on computational constraints, the nature of the data, and class separability.
5.2.10.5 Relationship to Logistic Regression
Relationship Between SVMs and Logistic Regression Historical Context and Initial Differences When support vector machines (SVMs) were first introduced, they were seen as fundamentally different from earlier classification methods such as:
Logistic regression (which models the probability of class membership using a logistic function), Linear discriminant analysis (which models class boundaries based on statistical assumptions about the data).
SVMs were notable because:
They sought to find a hyperplane that separates the data as much as possible (maximizing the margin). They allowed for some violations (training points on the “wrong” side of the margin or even the hyperplane). They introduced the idea of kernels to expand the feature space, making it possible to fit non-linear decision boundaries.
This seemed novel: instead of modeling probabilities, the SVM focused on the geometry of the separating boundary.
Modern View: Deep Connections Despite these apparent differences, it turns out that SVMs and classical statistical methods (like logistic regression and ridge regression) are deeply related. In fact, the SVM fitting criterion can be rewritten in the familiar “loss + penalty” form that appears throughout statistics and machine learning.
The SVM Objective Function Let’s formalize the SVM’s objective for a linear support vector classifier. The function to be learned is:
\[ f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p \]
SVM Loss + Penalty Formulation The SVM criterion (from earlier formulas 9.12–9.15) can be expressed as:
\[ \underset{\beta_0, \beta_1, \ldots, \beta_p}{\operatorname{minimize}} \left\{ \sum_{i=1}^{n} \max \left[0, 1 - y_i f(x_i)\right] + \lambda \sum_{j=1}^{p} \beta_j^2 \right\} \]
where:
- \(y_i\) is the class label for observation \(i\), coded as +1 or −1.
- \(f(x_i)\) is the value of the decision function for observation \(i\).
- \(\lambda\) is a nonnegative tuning parameter that controls regularization.
Interpreting the Terms
\(\sum_{i=1}^{n} \max \left[0, 1 - y_i f(x_i)\right]\) is known as the hinge loss:
For each observation \(i\), calculate \(y_i f(x_i)\).
If this value is \(\geq 1\), the loss is zero (the point is on the correct side of the margin).
If this value is \(< 1\), the loss increases linearly as the point moves further into the wrong side or closer to the margin.
\(\lambda \sum_{j=1}^{p} \beta_j^2\) is a ridge penalty (also known as L2 regularization), which penalizes the magnitude of the coefficients to prevent overfitting.
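The hinge-loss-plus-ridge criterion can be written down directly; the base-R sketch below (not the book's code) evaluates it for a given set of coefficients, with toy data only to show the call.

```r
# Base-R sketch of the SVM "hinge loss + ridge penalty" criterion described above.
svm_objective <- function(beta0, beta, X, y, lambda) {
  # y coded as +1 / -1; X is an n x p matrix of predictors
  f       <- beta0 + as.vector(X %*% beta)   # decision function f(x_i)
  hinge   <- sum(pmax(0, 1 - y * f))         # zero whenever y_i * f(x_i) >= 1
  penalty <- lambda * sum(beta^2)            # ridge (L2) penalty on the coefficients
  hinge + penalty
}

# Hypothetical toy data just to illustrate the call
set.seed(1)
X <- matrix(rnorm(20), ncol = 2)
y <- ifelse(X[, 1] + X[, 2] > 0, 1, -1)
svm_objective(beta0 = 0, beta = c(0.5, 0.5), X = X, y = y, lambda = 1)
```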
Tuning Parameter λ
Large \(\lambda\): The penalty on coefficients is strong, so \(\beta_1, \ldots, \beta_p\) are kept small. This allows more margin violations (i.e., more points closer to or on the wrong side of the boundary), leading to a model with higher bias but lower variance (less likely to overfit). Small \(\lambda\): The penalty is weak, so coefficients can be larger to fit the data more closely. This results in fewer margin violations (points are kept further from the decision boundary), leading to lower bias but higher variance (more likely to overfit).
This is mathematically analogous to the trade-off controlled by the parameter C in the earlier SVM constraint \(\sum \epsilon_i \leq C\).
General “Loss + Penalty” Framework The above SVM formulation is a specific case of a very general approach used throughout statistics and machine learning:
\[ \underset{\beta_0, \beta_1, \ldots, \beta_p}{\operatorname{minimize}} \left\{ L(\mathbf{X}, \mathbf{y}, \beta) + \lambda P(\beta) \right\} \]
Where:
- \(L(\mathbf{X}, \mathbf{y}, \beta)\) is a loss function that measures how well the model fits the data.
- \(P(\beta)\) is a penalty function (regularization term) that discourages overly complex models.
- \(\lambda\) controls the trade-off between the loss and the penalty.
Examples:
Ridge regression: Loss is squared error, penalty is sum of squares of coefficients. \[ L(\mathbf{X}, \mathbf{y}, \beta) = \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j \right)^2 \]
\[ P(\beta) = \sum_{j=1}^p \beta_j^2 \]
Lasso: Loss is squared error, penalty is sum of absolute values of coefficients.
\[ P(\beta) = \sum_{j=1}^p |\beta_j| \]
SVM: Loss is hinge loss, penalty is sum of squares of coefficients.
Hinge Loss vs. Logistic Loss
Hinge Loss (SVM): \[ L(\mathbf{X}, \mathbf{y}, \beta) = \sum_{i=1}^{n} \max [0, 1 - y_i f(x_i)] \]
Zero for points “well-classified” (on correct side of margin). Increases linearly for points inside margin or misclassified. Only support vectors (points on or inside the margin) contribute to the loss and thus influence the fit.
Logistic Loss (Logistic Regression):
The loss is never exactly zero but is very small for points far from the boundary. All points influence the fit, but those far from the boundary have much less effect.
Figure 9.12 (not shown here) visually compares these loss functions as a function of \(y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})\).
For hinge loss, the loss is exactly zero for values greater than 1. For logistic loss, the loss is always positive, but rapidly decreases for larger values.
Model Fitting and the Role of Support Vectors
In SVMs, only the support vectors (points with nonzero hinge loss, i.e., those on or inside the margin) influence the classifier’s coefficients. In logistic regression, all points influence the coefficients, though those far from the decision boundary have less effect.
Bias-Variance Tradeoff, Model Selection, and Flexibility
The tuning parameter λ (or equivalently C) is crucial:
It controls how flexible or rigid the boundary is. Too much flexibility (small λ, large coefficients) leads to overfitting (low bias, high variance). Too little flexibility (large λ, small coefficients) leads to underfitting (high bias, low variance).
The right balance is often found using cross-validation or other model selection techniques.
SVMs, Kernels, and Nonlinear Boundaries
SVMs are not unique in using kernels to handle complex, nonlinear boundaries in the data. You can use kernels with logistic regression or other methods, though historically kernels are more common in SVMs. The kernel trick allows you to work in very high- or infinite-dimensional feature spaces without explicitly computing those features.
Extension to Regression: Support Vector Regression
SVMs can also be adapted for regression problems (predicting a continuous value, not a class). Support Vector Regression (SVR):
Instead of minimizing squared error, SVR minimizes a loss that only penalizes residuals outside a specified “margin” (called the ϵ-insensitive loss). This means only points with large prediction errors influence the coefficients, analogous to support vectors in classification.
Summary of Key Mathematical Ideas:
The SVM objective is a combination of hinge loss (which enforces a margin) and an L2 penalty (which controls model complexity), just like regularized regression methods. Logistic regression uses a similar “loss + penalty” form but with a different loss function. Both SVM and logistic regression can be interpreted as fitting a linear or nonlinear boundary, with regularization controlling bias-variance tradeoff. The choice of tuning parameters (like λ or C) is critical for model performance. Loss functions determine which training points have the most influence: only support vectors for SVM, all points (with diminishing influence) for logistic regression. The kernel trick can be applied to SVMs and other models to handle non-linear relationships.
5.2.11 Deep Learning
Deep learning is a rapidly growing area in machine learning and AI, with neural networks as its foundation. Neural networks gained popularity in the late 1980s but later lost favor to other methods like SVMs and random forests, partly because neural networks required more manual tuning and were often outperformed. However, after 2010, neural networks returned as “deep learning,” achieving notable successes in areas like image and speech recognition due to advances in architectures and the availability of large datasets.
The chapter will cover the basics of neural networks and deep learning, including specialized types such as convolutional neural networks (CNNs) for images and recurrent neural networks (RNNs) for sequences. Practical demonstrations will use the R package keras, which connects to the tensorflow software from Google. The material in this chapter is considered somewhat more challenging than the rest of the book.
5.2.11.1 Single Layer Neural Network
Input, Structure, and Output A neural network processes an input vector of \(p\) variables:
\[ X = (X_1, X_2, \ldots, X_p) \]
and constructs a nonlinear prediction function: \[ f(X) \]
to estimate or predict a response variable \(Y\).
What’s unique about a neural network? While earlier models (trees, boosting, generalized additive models) also produce nonlinear predictions, a neural network is characterized by its specific architecture. The structure consists of layers of units: input, hidden, and output.
General Model The neural network with a single hidden layer is described by:
\[ f(X) = \beta_0 + \sum_{k=1}^{K} \beta_k h_k(X) \]
where \(h_k(X)\) is the output or activation from the \(k\)th hidden unit.
But \(h_k(X)\) itself is a nonlinear transformation of a linear combination of inputs:
\[ h_k(X) = g\left(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j\right) \]
where:
- \(w_{k0}\) is a bias (intercept) term for unit k,
- \(w_{kj}\) is the weight from input j to hidden unit k,
- \(g(z)\) is a nonlinear activation function (specified in advance).
So, plugging this in, the model becomes: \[ f(X) = \beta_0 + \sum_{k=1}^{K} \beta_k \, g\left(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j\right) \]
Step-by-Step Construction
- Linear combination: For each hidden unit k, compute the weighted sum of the inputs plus a bias:
\[ z_k = w_{k0} + \sum_{j=1}^{p} w_{kj} X_j \]
- Activation: Apply the activation function: \[ A_k = h_k(X) = g(z_k) \]
- Output layer: Output \(f(X)\) is a linear combination of the activations: \[ f(X) = \beta_0 + \sum_{k=1}^{K} \beta_k A_k \]
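The three steps above translate almost line for line into code. Here is a minimal base-R sketch (not from the text); the weights and inputs are hypothetical, and it uses the ReLU and sigmoid activations introduced just below.

```r
# Base-R sketch of the single-hidden-layer forward pass described above.
single_layer_nn <- function(X, W, w0, beta, beta0, g) {
  # W: K x p weight matrix; w0: length-K biases; beta: length-K output weights
  z <- w0 + as.vector(W %*% X)    # z_k = w_k0 + sum_j w_kj X_j
  A <- g(z)                       # A_k = g(z_k), the hidden-unit activations
  beta0 + sum(beta * A)           # f(X) = beta_0 + sum_k beta_k A_k
}

sigmoid <- function(z) 1 / (1 + exp(-z))
relu    <- function(z) pmax(0, z)

# Hypothetical weights for p = 2 inputs and K = 2 hidden units
W <- rbind(c(1, 1), c(1, -1))
single_layer_nn(X = c(0.5, -1), W = W, w0 = c(0, 0),
                beta = c(1, -1), beta0 = 0, g = relu)   # -1.5
```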
The activation function \(g(z)\) introduces nonlinearity. Common choices: Sigmoid (Logistic) Activation
\[ g(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}} \]
- Smooth, S-shaped curve.
- Output between 0 and 1.
- Historically favored, especially for probability outputs.
ReLU (Rectified Linear Unit)
\[ g(z) = (z)_{+} = \begin{cases} 0 & \text{if } z < 0 \\ z & \text{otherwise} \end{cases} \]
Outputs zero for negative z, and linear for positive z. Popular for modern deep neural networks due to efficiency and ease of optimization.
The Role of Nonlinearity
If g(z) is linear (e.g., g(z)=z), the entire network is a linear model of the inputs, no matter how many hidden units are used. With nonlinear g(z):
The model can capture complex patterns, interactions, and nonlinearities. The hidden units act as learned “basis functions” (see generalized additive models).
Explicit Example: Nonlinear Interaction Suppose you have \(p=2\) inputs, \(X_1, X_2\), and \(K=2\) hidden units with \(g(z) = z^2\) (purely for illustration): Let’s denote weights:
\[ \begin{aligned} \beta_0 &= 0 \\ \beta_1 &= \frac{1}{4} \\ \beta_2 &= -\frac{1}{4} \\ w_{10} &= 0, \quad w_{11} = 1, \quad w_{12} = 1 \\ w_{20} &= 0, \quad w_{21} = 1, \quad w_{22} = -1 \end{aligned} \]
so:
\[ h_1(X) = (0 + X_1 + X_2)^2 = (X_1 + X_2)^2 \]
\[ h_2(X) = (0 + X_1 - X_2)^2 = (X_1 - X_2)^2 \]
Plugging into the output: \[ f(X) = \frac{1}{4}(X_1 + X_2)^2 - \frac{1}{4}(X_1 - X_2)^2 = X_1 X_2 \]
So, this network computes an interaction between features, a type of nonlinearity not directly available to linear models.
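A quick base-R check (not from the text) confirms that this toy network reproduces the interaction \(X_1 X_2\):

```r
# Verify the illustrative network with g(z) = z^2 and the weights listed above.
f <- function(X1, X2) {
  h1 <- (X1 + X2)^2            # hidden unit 1: weights (0, 1, 1)
  h2 <- (X1 - X2)^2            # hidden unit 2: weights (0, 1, -1)
  0 + 0.25 * h1 - 0.25 * h2    # beta_0 = 0, beta_1 = 1/4, beta_2 = -1/4
}
f(3, 2)       # 6, the same as 3 * 2
f(-1.5, 4)    # -6, the same as -1.5 * 4
```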
Fitting the Neural Network: Training
All parameters \(\beta_0, ..., \beta_K, w_{10}, ..., w_{Kp}\) are learned from data. For quantitative (regression) tasks, typically minimize the sum of squared errors:
\[ \sum_{i=1}^{n} (y_i - f(x_i))^2 \] For classification, other loss functions may be used (like cross-entropy).
Neural Network Terminology
Input layer: The original input features. Hidden layer: Units that compute nonlinear transformations (activations) of linear combinations of the inputs. Output layer: Produces the final output by combining the activations. Weights: Parameters connecting inputs to hidden units \(w_{kj}\), and hidden units to output \(\beta_k\).
Biological Analogy
Hidden units are like “neurons”: if their activation \(A_k\) is close to 1 (sigmoid), they “fire”; if close to 0, they are “silent”.
Application Example: Handwritten Digit Recognition
The MNIST dataset: Each image is a \(28 \times 28\) grid of grayscale pixels (\(p = 784\) input features). Each pixel value is an integer from 0 (white) to 255 (black). Neural networks excel at extracting complex patterns from such high-dimensional input data, learning multiple layers of nonlinear transformations to distinguish digits.
Why Nonlinear Activations Are Essential
Without nonlinearity: The output is just a linear function of the input, regardless of the number of hidden units. With nonlinearity: The network can approximate highly complex functions, capture interactions, and fit intricate decision boundaries.
5.2.11.2 Multilayer Neural Networks
Why More Layers?
A single hidden layer with enough units can approximate any function, but using more layers makes learning complex functions much more efficient and practical. Modern neural networks (“deep learning”) often use many hidden layers, each with many units.
Example: Handwritten Digit Recognition (MNIST)
Each image is \(28 \times 28 = 784\) pixels, so the input vector has 784 numbers (each pixel’s gray level). The output is a class label: which digit (0–9) the image represents. The output is represented using one-hot encoding: a vector of 10 entries, with a 1 in the position for the correct digit and 0 elsewhere.
Example: Network Architecture
Input layer: 784 units (one per pixel). First hidden layer: 256 units. Each unit computes a nonlinear function of a weighted sum of all 784 inputs. Second hidden layer: 128 units. Each unit computes a nonlinear function of a weighted sum of all 256 outputs from the previous layer. Output layer: 10 units (one per digit).
Mathematical Representation
First hidden layer:
\[ A_k^{(1)} = g\left(w_{k0}^{(1)} + \sum_{j=1}^{p} w_{kj}^{(1)} X_j\right) \]
for \(k = 1, ..., 256\).
Second hidden layer:
\[ A_\ell^{(2)} = g\left(w_{\ell 0}^{(2)} + \sum_{k=1}^{256} w_{\ell k}^{(2)} A_k^{(1)}\right) \]
for \(\ell = 1, ..., 128\)
Output layer: \[ Z_m = \beta_{m0} + \sum_{\ell=1}^{128} \beta_{m\ell} A_\ell^{(2)} \]
for \(m = 0, ..., 9\) (one per digit).
These Zm values are then passed through a softmax function to convert them to probabilities: \[ f_m(X) = \frac{e^{Z_m}}{\sum_{\ell=0}^9 e^{Z_{\ell}}} \]
This guarantees all outputs are between 0 and 1 and sum to 1, so they can be interpreted as class probabilities.
Training and Loss for Classification
For classification, the loss function used is the cross-entropy (also called negative log-likelihood for multinomial logistic regression):
\[ -\sum_{i=1}^{n} \sum_{m=0}^{9} y_{im} \log(f_m(x_i)) \]
Here, \(y_{im}\) is 1 if observation \(i\) is of class m, 0 otherwise; \(f_m(x_i)\) is the predicted probability for class m on observation \(i\).
The goal is to choose weights and biases to minimize this loss over all the training data.
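Putting the softmax output and the cross-entropy loss together for a single observation, here is a minimal base-R sketch; the scores \(Z_m\) below are invented for illustration.

```r
# Base-R sketch of the softmax transformation and the cross-entropy loss.
softmax       <- function(z) exp(z) / sum(exp(z))              # f_m(X) = e^{Z_m} / sum_l e^{Z_l}
cross_entropy <- function(y_onehot, probs) -sum(y_onehot * log(probs))

# Hypothetical scores Z_0, ..., Z_9 for one image whose true digit is 3
z <- c(0.1, -1, 0.5, 2.0, 0.0, -0.5, 0.3, 1.0, -2.0, 0.2)
p <- softmax(z)
sum(p)                        # 1: the class probabilities sum to one
y <- rep(0, 10); y[4] <- 1    # one-hot label for digit 3 (the entry for m = 3)
cross_entropy(y, p)           # small when the model puts high probability on the true class
```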
Regularization: Preventing Overfitting
Neural networks can have a huge number of parameters (weights and biases), sometimes more than the number of training observations. This flexibility can cause overfitting: the model fits the training data perfectly but performs poorly on new, unseen data.
Two Types of Regularization
Ridge Regularization (L2 penalty):
Adds a penalty proportional to the sum of the squares of all the weights. Encourages the network to keep weights small, which helps generalize better to new data.
Dropout Regularization:
During training, randomly “drops out” (sets to zero) a random subset of units on each update. Forces the network to not rely too heavily on any single path through the network, improving robustness.
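As a hedged sketch only (assuming the R keras package is installed and configured; the layer sizes match the example architecture above, but the dropout rates are illustrative), such a network with dropout might be specified like this:

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = 784) %>%  # first hidden layer
  layer_dropout(rate = 0.4) %>%                                         # dropout regularization
  layer_dense(units = 128, activation = "relu") %>%                     # second hidden layer
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")                       # 10 class probabilities

model %>% compile(
  loss      = "categorical_crossentropy",   # the cross-entropy loss described above
  optimizer = "rmsprop",
  metrics   = "accuracy"
)
```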
Example Applications and Datasets
MNIST
A classic dataset of handwritten digits (0–9), each image is 28×28 pixels. The input vector has 784 features (pixel values). The output is one of 10 classes, encoded as a one-hot vector.
CIFAR100
A dataset of colored images from everyday life, with 100 different classes (such as animals, vehicles, objects, etc.). Each class might have images of different sizes, content, etc. Neural networks are powerful enough to learn to distinguish many complex classes from raw pixel data.
Model Complexity and Parameter Counting
Neural networks can be very large. For example, the example network described has over 235,000 parameters (weights and biases), which is far more than the number of parameters in a typical logistic regression model. If you have 60,000 training images and 235,000+ parameters, you must be careful to use regularization, or the network will “memorize” the training data instead of learning general patterns.
Summary of Mathematical Steps
Forward pass: Input is passed through each layer, applying linear transformations and then nonlinear activation functions. Output: Final layer produces either a number (regression) or class probabilities (classification, via softmax). Loss calculation: Compute error using squared error (regression) or cross-entropy (classification). Training: Adjust weights and biases to minimize the loss, using optimization algorithms like gradient descent. Regularization: Add penalties or use dropout to prevent overfitting.
Generalization to Any Problem
The described structure (input layer, several hidden layers, output layer; weights, biases, activations; nonlinear activation functions; cross-entropy or squared-error loss; regularization) can be adapted to any input/output problem. For images, the input is all pixel values; for tabular data, it’s numerical or categorical features; for text, it can be word or character encodings.
5.2.11.3 Convolutional Neural Networks
What Are CNNs and Why Are They Important? Convolutional neural networks are a special kind of neural network that have achieved huge success in image recognition and classification, especially since around 2010. They have become the main tool for tasks that involve images, such as recognizing objects in photos, identifying faces, or even self-driving car vision systems. The Data: Images as Arrays
An image in a computer is made up of tiny dots called pixels. Each pixel has a value that represents its color and brightness. For color images, each pixel usually has three numbers corresponding to red, green, and blue components (these are called “channels”). For example, an image might be 32 pixels wide and 32 pixels tall. For a color image, every pixel has three numbers (for the three colors), so the image data forms a three-dimensional array:
The first two dimensions are the spatial dimensions (height and width: 32 x 32). The third dimension is the channel (color): 3 (one for each color).
This three-dimensional array is called a feature map in the context of neural networks.
Datasets
Large datasets like CIFAR100 contain tens of thousands of images, each labeled with a class (e.g., “dog,” “cat,” “car,” “tree,” etc.). The CIFAR100 dataset, for instance, has 60,000 images, organized into 100 different classes, grouped into 20 superclasses (like “aquatic mammals,” “flowers,” etc.). Usually, there is a split into a training set (used to teach the network) and a test set (used to see how well the network learned).
How Do CNNs Work? CNNs are designed to recognize patterns in images. They do this by mimicking the way humans look at images: by detecting small details (like edges or colors) and then combining these to identify larger patterns (like eyes, ears, or wheels), and finally recognizing whole objects (like a tiger, a car, or a dog). Step-by-Step Feature Detection
Low-level features: The network first looks for simple patterns—edges, corners, blobs of color. These are very basic building blocks. High-level features: It then combines these simple patterns to detect more complex shapes—like parts of objects (e.g., an eye, an ear, a wheel). Whole object detection: By combining the presence of these parts, the network can figure out what object is in the image.
Analogy Think of how a child learns to draw a face: first, they learn to draw circles and lines (low-level features), then they arrange them to make eyes, mouth, ears (higher-level features), and finally, the whole face (object recognition).
The Architecture: Convolution and Pooling Layers A CNN is built from two main types of layers: 1. Convolution Layers
These layers scan small regions of the image (often called “filters” or “kernels”) and look for patterns. Each filter is like a little template that slides over the image and checks how much the template matches the underlying pixels. The filter’s job is to detect a specific feature (like an edge or a spot of color) wherever it appears in the image.
Mathematical Operation (Generalized)
For each position in the image, the filter computes a weighted sum of the pixel values covered by the filter. If the pattern matches, the output is high; if not, it’s low. The same filter is used across the whole image, which allows the network to find the same feature no matter where it appears.
2. Pooling Layers
These layers reduce the size of the image representation, while keeping the most important information. The most common pooling operation is “max pooling,” which splits the image into small regions and takes the maximum value in each region. Pooling makes the network more robust to small changes in the image (like slight shifts or distortions).
Building a Hierarchy of Features
By stacking many convolution and pooling layers, the network can build up a hierarchy:
Early layers find basic patterns. Middle layers combine these into more complex features. Final layers recognize entire objects.
The network can learn to recognize, for instance, a tiger by first detecting stripes, colors, shapes of eyes and ears, and then combining these clues to decide “this is a tiger.”
Example from the Text
The figure in the text shows how a cartoon image of a tiger might be recognized by a CNN:
The network finds small features (edges, spots, curves). It combines them to make compound features (eyes, ears). These compound features are used to decide that the image is a tiger.
Why Are CNNs Special?
Shared weights: Each convolution filter is used across the whole image, dramatically reducing the number of parameters compared to a fully-connected network. Translation invariance: Because the same filter is used everywhere, the network can recognize the same object even if it appears in a different part of the image. Efficiency: By only connecting each output to a small region of the input, CNNs are much more efficient for images than standard neural networks.
Generalization to Any Similar Problem
Although CNNs are best known for image recognition, the same principles can be applied to any data that has a spatial or sequential structure (for example, audio signals, time series, or even text). The key is the ability to detect local patterns and then build up more complex understanding through a hierarchy of layers.
Summary of Key Concepts and Mathematical Steps
Input: An image is represented as a 3D array (height x width x channels). Convolution layers: Slide filters over the image, detect local patterns. Pooling layers: Reduce the size of the representation, keep important features. Hierarchical learning: By stacking layers, learn increasingly complex features. Final classification: Use the learned features to assign the image to a class (like “tiger” or “car”). Training: Adjust the filters and other weights to improve accuracy on labeled images.
5.2.11.3.1 Convolution Layer
What Is a Convolution Layer? The Big Picture A convolution layer is a fundamental building block in convolutional neural networks (CNNs), which are used to process images and other data with a grid-like structure. The main role of a convolution layer is to look for patterns or features (like edges, lines, textures, or shapes) in small regions of an image, and to do this across the whole image. Convolution Filters: Pattern Detectors
The convolution layer is made up of many convolution filters (also called “kernels”). What is a filter? Think of a filter as a small square or rectangle of numbers. Each filter is like a tiny template that is designed to detect a specific pattern in part of the image.
For example, one filter might detect horizontal edges (like the boundary between a tiger’s stripe and its fur), another might detect vertical edges, and others might look for corners or spots.
How does a filter work? The filter is slid (or “scanned”) over the image, one small patch at a time.
In each position, the filter checks how much the local region of the image matches the pattern encoded in the filter.
The Convolution Operation: Multiply and Add
At each position, the filter and the corresponding patch of the image are multiplied element-wise (each number in the filter is multiplied by the corresponding number in the image patch). All these products are then added up to get a single number for that location. This process is called a convolution.
Example (Generalized): Imagine you have a small image patch and a filter, both as tiny grids of numbers. If the numbers in the image patch and the numbers in the filter “match up” (meaning the pattern is present), the result will be a large number (high score). If they don’t match up, the result will be smaller. This multiplication-and-adding is done for every position in the image, so that for every location, we get a score that tells us how much the pattern in the filter is present there. Output: The Convolved Image (Feature Map)
After the filter has scanned the entire image, we get a new “image” (called a feature map or convolved image) that shows where the filter’s pattern was found most strongly. If you have multiple filters, you get multiple feature maps, each highlighting a different kind of pattern in the original image.
Mathematical Example (Accessible Explanation) Suppose you have an image represented as a grid of numbers (imagine a 4x3 matrix for simplicity):
\[ \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \\ j & k & l \end{bmatrix} \]
And you have a filter that is a smaller grid, say 2x2: \[ \begin{bmatrix} \alpha & \beta \\ \gamma & \delta \end{bmatrix} \]
To apply the filter:
- Place the filter on the top-left of the image.
- Multiply corresponding numbers together and add them up:
- For the top-left 2x2 patch: \(a\alpha + b\beta + d\gamma + e\delta\)
- Move the filter one step to the right, and repeat for the next 2x2 patch: \(b\alpha + c\beta + e\gamma + f\delta\)
- Continue this process for every possible location where the filter fits in the image.
- The result is a new, smaller grid (the “convolved image”), where each value represents how much the filter’s pattern was found at that spot.
The size of the convolved image is smaller than the original, because the filter can only be centered on regions where it fully fits within the image.
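The multiply-and-add scan can be written directly. The base-R sketch below (not the book's code) performs a "valid" 2D convolution; the image and filter values are made up just to show the shapes.

```r
# Base-R sketch of the convolution operation: slide the filter, multiply element-wise, sum.
convolve2d <- function(img, filt) {
  fr <- nrow(filt); fc <- ncol(filt)
  out <- matrix(0, nrow(img) - fr + 1, ncol(img) - fc + 1)   # the smaller convolved image
  for (i in seq_len(nrow(out))) {
    for (j in seq_len(ncol(out))) {
      patch <- img[i:(i + fr - 1), j:(j + fc - 1)]
      out[i, j] <- sum(patch * filt)                         # element-wise multiply, then add
    }
  }
  out
}

# Hypothetical 4 x 3 image and 2 x 2 filter, matching the shapes in the example above
img  <- matrix(1:12, nrow = 4, byrow = TRUE)
filt <- matrix(c(1, 0, 0, 1), nrow = 2)    # responds strongly to a diagonal pattern
convolve2d(img, filt)                      # a 3 x 2 convolved image (feature map)
```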
Why Is This Useful?
Highlighting features: If a region of the image is similar to the filter’s pattern, the result at that location will be high. This means the filter has “found” its pattern there. Detecting local patterns everywhere: By scanning the filter across the whole image, the convolution layer can detect where patterns like edges, corners, or textures occur, no matter where they are in the image.
Multiple Filters, Multiple Channels
In practice, images have multiple color channels (red, green, blue), so each filter actually has weights for each channel. If you use K different filters, you get K output “feature maps”—think of these as new layers of information that capture different patterns.
Applying Nonlinearity
After the convolution, it’s common to apply a nonlinear activation function (like ReLU, which outputs the value itself if it’s positive or zero otherwise) to the result. This helps the network capture more complex patterns.
Summary
The convolution layer is a pattern detector, scanning for specific features in an image. Each filter is a small matrix of numbers, designed to find a particular local pattern. The filter is slid across the image, and at each location, the corresponding values are multiplied and summed. The output is a new image (feature map) showing where the pattern was detected. Many filters can be used at once, each producing its own feature map. This process is the foundation of how CNNs “see” and interpret image data.
5.2.11.3.2 Pooling layers
What Is a Pooling Layer? A pooling layer is a special type of layer used in convolutional neural networks (CNNs) to make the information from an image more manageable and to help the network focus on the most important features. Pooling layers take a large image or feature map and summarize it, creating a smaller version that keeps only the most significant information.
Why Do We Need Pooling?
Reduce size: Images and feature maps can be very large. By shrinking them, pooling makes computations easier and faster, and helps prevent the network from overfitting (memorizing details that are not important). Preserve important features: Pooling keeps the most important features while discarding less relevant or redundant information. Location invariance: Pooling helps the network recognize features even if they move a little in the image. For example, a cat’s eye might be in a slightly different spot in two pictures, but pooling helps the network still recognize it.
What Is Max Pooling? Max pooling is the most common kind of pooling. Here’s how it works:
Divide the image into blocks: The image (or feature map) is divided into non-overlapping small blocks, usually 2x2 squares. Find the maximum in each block: For each block, look at all the numbers (which might represent brightness, color, or the presence of a feature) and keep only the largest number. Create a new, smaller image: The collection of these maximum values forms a new, smaller image.
Step-by-Step Example Suppose you have this 4x4 grid of numbers (could be pixel values or feature activations):
\[ \begin{bmatrix} 1 & 2 & 5 & 3 \\ 3 & 0 & 1 & 2 \\ 2 & 1 & 3 & 4 \\ 1 & 1 & 2 & 0 \end{bmatrix} \]
Let’s apply max pooling with 2x2 blocks: Block 1 (top-left): \[ \begin{bmatrix} 1 & 2 \\ 3 & 0 \end{bmatrix} \]
Maximum is 3. Block 2 (top-right):
\[ \begin{bmatrix} 5 & 3 \\ 1 & 2 \end{bmatrix} \]
Maximum is 5. Block 3 (bottom-left):
\[ \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix} \]
Maximum is 2. Block 4 (bottom-right):
\[ \begin{bmatrix} 3 & 4 \\ 2 & 0 \end{bmatrix} \]
Maximum is 4. So, after max pooling, the output is a 2x2 grid: \[ \begin{bmatrix} 3 & 5 \\ 2 & 4 \end{bmatrix} \]
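The same computation in code: a minimal base-R sketch (not from the text) that reproduces the 2x2 max-pooling example above.

```r
# Base-R sketch of 2 x 2 max pooling over non-overlapping blocks.
max_pool_2x2 <- function(m) {
  out <- matrix(0, nrow(m) / 2, ncol(m) / 2)
  for (i in seq_len(nrow(out))) {
    for (j in seq_len(ncol(out))) {
      block <- m[(2 * i - 1):(2 * i), (2 * j - 1):(2 * j)]   # one 2 x 2 block
      out[i, j] <- max(block)                                # keep only the largest value
    }
  }
  out
}

m <- matrix(c(1, 2, 5, 3,
              3, 0, 1, 2,
              2, 1, 3, 4,
              1, 1, 2, 0), nrow = 4, byrow = TRUE)
max_pool_2x2(m)   # the 2 x 2 grid with entries 3, 5, 2, 4
```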
What Does This Achieve?
Compression: The original 4x4 grid becomes a 2x2 grid—a much smaller summary. Highlighting important features: If any value in a block is large (for example, if a certain edge or pattern is detected strongly in that area), that value will be preserved in the summary. Location invariance: It doesn’t matter exactly where the large value is within the block; as long as it’s there, it will be kept. So if a feature shifts a little within the block, it’s still recognized.
Other Types of Pooling While max pooling is most common, there are other pooling methods:
Average pooling: Takes the average of all the values in the block. Min pooling: Takes the minimum value in the block. But max pooling tends to work best for most image tasks, as it preserves the strongest responses.
Why Is Pooling Important in Neural Networks?
Faster computation and less memory: Smaller images/feature maps mean the next layers have less work to do. Helps generalization: By summarizing, pooling prevents overfitting to tiny details, helping the model recognize features even if they are not in exactly the same place in every image. Used after convolution: Typically, pooling is applied after convolution layers to further distill the detected features.
Summary
A pooling layer reduces the size of an image or feature map by summarizing local regions (blocks). Max pooling keeps the largest value in each block. This makes the data smaller, more manageable, and helps the network focus on the most important features, with some tolerance for small shifts or changes in position. The process is repeated for every block, creating a new, smaller summary image or feature map.
5.2.11.3.3 Architecture of a convolution neural network
What is a Convolutional Neural Network (CNN)? A convolutional neural network (CNN) is a type of machine learning model that is especially good at recognizing patterns in images. For example, it could look at a picture and recognize if it is a cat, a dog, or a car.
The Structure of a CNN
1. Input Layer
This is where the image goes into the network. Images are made up of pixels, and in color images, each pixel has three numbers (one for red, one for green, one for blue). For a standard color image, think of it as a 3D block: height × width × number of colors (channels).
For example, a 32 × 32 color image has shape 32 (height) × 32 (width) × 3 (channels).
2. Convolution Layer
This is the main building block of a CNN. Imagine a small window (called a filter or kernel) sliding over the image and looking at a few pixels at a time. Each filter looks for a specific pattern, like edges or textures. The result of applying one filter is a new 2D image called a “feature map.” If you use several filters, you get several feature maps, stacked together as a 3D block.
Example:
If you have 6 filters and each filter scans a 32 × 32 image, you get 6 new 32 × 32 images (feature maps). The number of filters = number of output “channels.”
Analogy:
Filters in CNNs are like different colored glasses, each highlighting a different feature in the image.
3. Pooling Layer
This layer reduces the size of the feature maps, making the network faster and helping it focus on the most important information. The most common pooling method is “max-pooling,” which keeps the largest number in each small region. For example, a 2 × 2 max-pool looks at every 2 × 2 block and only keeps the biggest number from each block. This reduces the height and width by half (so a 32 × 32 feature map becomes 16 × 16 after 2 × 2 pooling).
Why Pooling?
It makes the network smaller and more efficient. It helps the network be less sensitive to small changes in the image (like if an object moves a little).
4. Repeating Layers
It’s common to repeat convolution and pooling layers several times. After each pooling, the image gets smaller, but we often use more filters to capture more complex patterns.
5. Flattening
After several convolution and pooling layers, the 3D block of numbers (feature maps) is flattened into a long list. This step is like unrolling the block into a single row.
6. Fully Connected Layer
This layer works like a traditional neural network. Each number from the flattened list connects to every “neuron” in this layer. The last fully connected layer outputs one number for each class (for example, 100 numbers if there are 100 categories to choose from).
7. Output Layer and Softmax
The final layer uses a function called “softmax” to turn the output numbers into probabilities. For example, if there are 100 categories, the softmax function makes sure the output numbers add up to 1 and represent the likelihood of the image being in each category.
Formula for Softmax: If the last layer gives numbers \(z_1, z_2, ..., z_{100}\), the softmax probability for class \(j\) is:
\[ \text{softmax}(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{100} e^{z_k}} \]
- \(e\) is a mathematical constant (about 2.718).
- This formula makes the biggest \(z_j\) turn into the highest probability.
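As a hedged sketch only (assuming the R keras package; the filter counts and layer sizes are illustrative, not taken from the text), an architecture of this kind for 32 x 32 x 3 images and 100 classes could be specified as:

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                padding = "same", input_shape = c(32, 32, 3)) %>%   # convolution layer
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%                     # 2 x 2 max pooling
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu",
                padding = "same") %>%                               # deeper layer, more filters
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%                                               # unroll the 3D feature maps
  layer_dense(units = 512, activation = "relu") %>%                 # fully connected layer
  layer_dense(units = 100, activation = "softmax")                  # 100 class probabilities

summary(model)   # shows layer shapes and parameter counts
```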
Data Augmentation
Sometimes, to help the network learn better, we make small changes to the training images (like rotating, shifting, or adding blur). This creates new, slightly different images from the original ones but keeps the same label. It helps the network not get fooled by small changes and “regularizes” the learning, making it perform better on new images.
Common Scenario Generalized Imagine you want a computer to tell if a photo is of a dog or a cat. You use a CNN:
The image goes in. The network uses filters to scan for features (like fur, ears, noses). After several steps of reducing and combining features, the network makes a guess—dog or cat.
This is the same process, regardless of whether you are classifying animals, vehicles, or any other objects in pictures.
5.2.11.3.4 Data Augmentation
What is Data Augmentation?
Data augmentation is a powerful trick used when training computer models to recognize images. The idea is simple: you take each original training image and create many new, slightly altered versions of it. These changes are made in such a way that a human would still recognize the object in the image, but the computer now has more examples to learn from.
Why Do We Augment Data?
The goal is to increase the variety of images available for training, without needing to actually collect more real-world data. This helps the model learn to recognize objects even if they look a bit different, which is important because objects in real life can appear in many ways (different positions, lighting, or sizes).
How is it Done?
Distortions (changes) commonly used include:
Zoom: Making the object appear closer or farther. Shift: Moving the image left, right, up, or down. Shear: Skewing the image diagonally. Rotation: Rotating the image by small angles. Flip: Flipping the image horizontally (like seeing a mirror image).
For example, if you have a picture of a cat, you can create new training images by flipping it, zooming in, or rotating it slightly.
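In the R keras package, distortions of this kind can be generated on the fly during training; the sketch below is a hedged example and the specific ranges are hypothetical.

```r
library(keras)

# On-the-fly augmentation: each training batch is randomly distorted with the
# transformations listed above (rotation, shift, shear, zoom, horizontal flip).
augmenter <- image_data_generator(
  rotation_range     = 20,     # small random rotations (degrees)
  width_shift_range  = 0.1,    # horizontal shifts (fraction of width)
  height_shift_range = 0.1,    # vertical shifts (fraction of height)
  shear_range        = 0.1,    # shearing
  zoom_range         = 0.1,    # zooming in and out
  horizontal_flip    = TRUE    # mirror images
)
```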
What’s the Benefit?
Prevents Overfitting: If your model only ever sees perfect, unaltered images, it may learn to memorize those exact examples and not generalize to new, unseen images. Augmentation helps the model learn the important features that define an object, not just memorize. Regularization: In machine learning, regularization is a strategy to prevent overfitting. Data augmentation acts as a kind of regularization by “fattening” the cloud of training data, meaning the model sees many variations of each example, making it more robust. No Need to Store All Images: Instead of saving all the new, altered images, we can create them “on the fly” (in real time) during training, which saves computer memory.
Analogy
Imagine teaching someone to recognize apples. If you only show them one photo of an apple, they might not recognize an apple from a different angle or with a bite taken out. But if you show them apples of all shapes, sizes, and colors, they’ll be able to recognize any apple in real life.
5.2.11.3.5 Results Using a Pretrained Classifier
What is a Pretrained Classifier?
Instead of training a model from scratch (which takes a lot of time and data), we can use a model that has already been trained on a huge collection of images, such as the “ImageNet” dataset. This model (like “resnet50”) has learned to recognize thousands of different objects. We can use this model to classify new images, even ones it’s never seen before.
How Does Classification Work?
When you give the model a new image, it tries to guess what’s in the image by assigning probabilities to different possible categories. For example, for an image of a flamingo, the model might say:
Flamingo: 83% Spoonbill: 17% White stork: 0%
These numbers mean the model is most confident the image is a flamingo, but there’s a chance it could be a spoonbill.
Example Table Explanation The following table is an example of how a classifier might output its guesses:
| Image Label | Model’s Top Guess (probability) | Second Guess (probability) | Third Guess (probability) |
|---|---|---|---|
| flamingo | flamingo (0.83) | spoonbill (0.17) | white stork (0.00) |
| hawk | kite (0.60) | great grey owl (0.09) | robin (0.06) |
| hawk (zoomed out) | fountain (0.35) | nail (0.12) | hook (0.07) |
| dog | Tibetan terrier (0.56) | Lhasa (0.32) | cocker spaniel (0.03) |
| cat | Old English sheepdog (0.82) | Shih-Tzu (0.04) | Persian cat (0.04) |
| Cape weaver | jacamar (0.28) | macaw (0.12) | robin (0.12) |
The model gives its best guesses for each image. Sometimes it gets confused, especially if the image is unusual (like zoomed out or a rare viewpoint).
Why Use Pretrained Models?
Efficiency: They save a huge amount of time and resources, since training from scratch on millions of images is expensive. Transfer Learning: You can use the knowledge from a pretrained model for new tasks. For example, you can use the early layers (which have learned basic shapes and textures) and only retrain the final layers for your specific problem. This process is called weight freezing: you “freeze” (don’t change) the early layer weights, and only update the last layers during training on your smaller dataset.
Analogy
Think of a pretrained model like a student who has already studied a lot of subjects. If you want them to specialize in one new subject, you don’t need to teach them everything from the beginning—just the extra details for that subject.
Step-by-Step General Scenario (No Specific Values Needed)
Start with a set of labeled images (cats, dogs, birds, etc.). Apply data augmentation to multiply your training images by creating altered versions (rotations, flips, zooms, shifts). Train a CNN (or use a pretrained one) to learn to recognize patterns in these images. For new images, the model predicts the probability for each possible label (for example, “dog: 70%, cat: 25%, rabbit: 5%”). If using a pretrained model, you can “freeze” most of the layers (don’t update them) and only train the last few layers on your own data. The output can be used for many purposes (classifying photos, detecting objects, etc.).
Key Terms and Concepts Explained
Data Augmentation: Creating new training examples by altering existing ones. Overfitting: When a model memorizes the training data instead of learning to generalize. Regularization: Techniques to prevent overfitting. Data augmentation is one type. Pretrained Classifier: A model trained on a huge, diverse dataset, ready to be reused for new tasks. Weight Freezing: Keeping the early layers of a model unchanged and only training the last layers on new data. Probability Output: The model’s confidence in each possible label for a given image.
5.2.11.4 Document Classification
What Is Document Classification?
Document classification is a process where we train a computer to look at a piece of text (like a review, email, tweet, or news article) and decide what kind of document it is or what it’s about. For example, the computer could figure out whether a review is positive or negative (sentiment analysis), or what topic a news article is about.
Example Scenario (Generalized)
Imagine you have lots of customer reviews for a product. You want a program to automatically decide if each review is positive (“I love this!”) or negative (“I hate this!”), instead of reading them all yourself.
The Challenge With Text
Text is messy: People use different words, slang, make spelling mistakes, or write reviews of different lengths. We need to turn text into numbers: Computers work with numbers, so we need a way to convert words into a format a computer can understand and use to make predictions.
Featurization: Turning Text into Numbers
Featurize means to create a set of “features” (predictors) from the text. Think of features as ingredients in a recipe: each ingredient has a quantity, and the combination determines the final dish (or, for us, the classification).
The Bag-of-Words Model
This is the most basic and common way to featurize text:
Dictionary: Make a list of the most common words in your collection of documents (like the top 10,000 words). Feature Vector: For each document, make a list that is as long as the number of words in the dictionary.
Each position in the list represents a word. If the word is present in the document, put a 1 in that position; if it’s not, put a 0.
Binary Vector: Because we only care about presence/absence (1 or 0), this is called a binary feature vector.
Example (No Real Values) Suppose your dictionary is [“dog”, “cat”, “happy”, “sad”]. For the review “I am happy with my dog”, your feature vector would look like:
“dog”: 1 (present) “cat”: 0 (not present) “happy”: 1 (present) “sad”: 0 (not present)
So, your document becomes [1, 0, 1, 0].
Dealing With Document Length
Reviews can be short or long. By using the bag-of-words, every document gets a vector of the same length (equal to the dictionary size), regardless of how many words are in the review. Sometimes, very rare words or words not in the dictionary are marked as “unknown” (UNK).
The Feature Matrix (Big Table of Numbers)
If you have 25,000 reviews and use the top 10,000 words in your dictionary, you get a big table (called a “matrix”) with 25,000 rows (one for each review) and 10,000 columns (one for each word). Each entry is 1 if the word is in that review, or 0 if it isn’t. Sparse Matrix: Most entries are 0, because most reviews use only a small number of words from the dictionary. This makes the matrix “sparse,” and there are special ways to store it efficiently.
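A minimal base-R sketch (not from the text) that builds such a binary document-term matrix from a tiny dictionary and a few made-up reviews:

```r
# Base-R sketch of the bag-of-words featurization described above.
dictionary <- c("dog", "cat", "happy", "sad")
docs <- c("I am happy with my dog",
          "my cat makes me sad",
          "happy happy dog")

featurize <- function(doc, dictionary) {
  words <- strsplit(tolower(doc), "\\s+")[[1]]
  as.integer(dictionary %in% words)        # 1 if the dictionary word appears, 0 otherwise
}

X <- t(sapply(docs, featurize, dictionary = dictionary))
colnames(X) <- dictionary
X   # one row per document, one column per word; mostly zeros (a sparse matrix)
```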
What Models Are Used? Two main types of models are mentioned:
Lasso Logistic Regression
A mathematical model for predicting if something is in one category or another (for example, positive or negative). Uses a technique called “regularization” to avoid overfitting (memorizing the training data rather than learning general rules). In the plot, the x-axis shows \(-\log(\lambda)\), which is related to how much regularization is used. A higher \(\lambda\) means more regularization.
Neural Network (Two Hidden Layers)
A model inspired by the brain, with layers of “neurons” that learn to recognize patterns in the data. The x-axis in the neural network plot shows “epochs,” which are the number of times the model has seen all the training data.
Both types of models try to find patterns between the presence or absence of words and the sentiment (positive or negative) of the review.
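As a hedged sketch (assuming the glmnet package; the data below are random placeholders rather than real reviews), a lasso logistic regression on such a feature matrix could be fit as:

```r
library(glmnet)

# Hypothetical document-term matrix X (200 documents, 50 dictionary words) and labels y
set.seed(1)
X <- matrix(rbinom(200 * 50, 1, 0.1), nrow = 200)
y <- rbinom(200, 1, 0.5)          # 1 = positive review, 0 = negative review

cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 1)   # alpha = 1 gives the lasso
plot(cv_fit)                      # performance across the regularization path (varying lambda)
coef(cv_fit, s = "lambda.min")    # words with nonzero coefficients at the selected lambda
```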
Accuracy and Overfitting
Accuracy: The fraction of documents the model correctly classifies. Overfitting: When a model does very well on the training data but not on new, unseen data. Both models in the example achieve about 88% accuracy on test data, even though their training accuracy (how well they fit known data) can be higher.
Mathematical Formulas Explained Logistic Regression for Two Classes The formula given is:
\[ \log \left(\frac{\operatorname{Pr}(Y=1 \mid X)}{\operatorname{Pr}(Y=0 \mid X)}\right) = Z_{1} - Z_{0} \]
- This formula calculates the “log-odds” of the document being in class 1 (for example, positive) versus class 0 (negative).
- \(\operatorname{Pr}(Y=1 \mid X)\) means the probability that the sentiment is positive, given the features \(X\) (the presence/absence of words).
- \(Z_1\) and \(Z_0\) are scores the model calculates for each class.
- The bigger \(Z_1\) compared to \(Z_0\), the more likely the document is positive.
Expanded, it becomes: \[ = (\beta_{10} - \beta_{00}) + \sum_{\ell=1}^{K_2} (\beta_{1\ell} - \beta_{0\ell}) A_{\ell}^{(2)} \] Here, \(\beta_{10}\) and \(\beta_{00}\) are coefficients (numbers the model learns). The sum adds up contributions from each “hidden” feature the model finds.
Limitations of the Bag-of-Words Model
Ignores context: It only cares whether a word is present, not what words are next to it.
Bag-of-n-grams: To add some context, we can count pairs (2-grams), triples (3-grams), etc., of words that appear together (see the sketch after this list).
Example: “not good” (negative) is different from “good” (positive).
Sequence models: More advanced models look at the order of all words, not just which words are present.
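Here is a minimal sketch (my own toy example) of how 2-grams can be extracted from a tokenized sentence; in a bag-of-n-grams model these pairs would be added to the dictionary alongside single words.

```r
# A minimal sketch: adjacent word pairs (2-grams) from a toy sentence
words <- c("this", "movie", "is", "not", "good")
bigrams <- paste(head(words, -1), tail(words, -1))
bigrams   # "this movie" "movie is" "is not" "not good" -- "not good" keeps the negation
```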
| Concept | Simple Explanation | Why It Matters |
|---|---|---|
| Bag-of-Words | Turn each document into a list of 1s and 0s for each word | Lets the computer handle text as numbers |
| Sparse Matrix | Most entries are 0, only a few are 1 | Efficient to store and process |
| Lasso Logistic Regression | A model that predicts categories and avoids overfitting | Finds the most important words |
| Neural Network | Model inspired by the brain, with layers that find patterns | Can learn complex relationships |
| n-grams | Pairs or groups of words that appear together | Adds some context to the analysis |
General Scenario (No Specific Values)
Gather a set of documents (like reviews), each labeled by their type or sentiment. Build a dictionary of common words (or word pairs). For each document, record whether each word (or word pair) in the dictionary is present. Organize the data in a big table (matrix), where each row is a document, and each column is a word. Train a model (such as logistic regression or a neural network) on this data. Use the model to predict the label of new documents, based on which words they contain.
5.2.11.4.1 Recurrent Neural Networks
What Are Recurrent Neural Networks (RNNs)? Key Idea
Many types of data are sequential: the order of the data matters. RNNs are a type of neural network designed specifically to work with sequential data.
Why Does Sequence Matter? Think of things like:
Sentences or reviews: The meaning depends on the order of words. Weather data: Today’s temperature depends on previous days. Stock prices: Tomorrow’s price is related to past prices. Speech/music/handwriting: Each sound or letter depends on what came before.
Examples of Sequential Data
Text: Reviews, articles, tweets—RNNs can help figure out the sentiment or meaning by looking at the order of words. Weather/time series: Predicting future weather using past measurements. Finance: Predicting future stock prices from past prices. Audio: Recognizing speech or music by analyzing sequences of sounds. Handwriting: Turning handwritten digits or words into digital text (optical character recognition).
How Does an RNN Work? The Structure
The input is a sequence: \(X = \{X_1, X_2, ..., X_L\}\) - Each \(X_\ell\) is a part of the sequence (e.g., a word, a time step). The RNN processes each item in the sequence one at a time, in order. At each step, it keeps track of what it has seen so far (its “memory”).
Schematic
At each time step \(\ell\), the RNN:
Takes the current input \(X_\ell\). Combines it with its previous “memory” or activation \(A_{\ell-1}\). Produces a new activation \(A_\ell\). Optionally, outputs a prediction \(O_\ell\).
Usually, only the final output \(O_L\) is used (e.g., overall sentiment for a review). Diagram Explanation
The left side of the diagram is a compact way to show the entire network. The right side “unrolls” the network to show what happens at each step.
Mathematical Details Representing the Sequence
Each input \(X_\ell\) can be a vector, such as a “one-hot” encoding (a way to represent words using 0s and a single 1). The sequence is processed one step at a time: \(X_1\), then \(X_2\), etc.
The Hidden Layer (Activations) At each time step \(\ell\), the RNN computes a new activation value for each hidden unit in the network: \[ A_{\ell k} = g \left( w_{k0} + \sum_{j=1}^{p} w_{kj} X_{\ell j} + \sum_{s=1}^{K} u_{ks} A_{\ell-1,s} \right) \] Let’s break this down:
- \(k\) indexes the hidden units (there are \(K\) of them).
- \(w_{kj}\): weights connecting the input \(X_{\ell}\) (with \(p\) features) to the hidden unit.
- \(u_{ks}\): weights connecting the previous hidden activations to the current ones.
- \(g(\cdot)\): activation function, like ReLU, which adds non-linearity (makes the network more powerful).
- \(w_{k0}\): bias term (extra number added in).
- \(X_{\ell j}\): the \(j\)th component of the input at step \(\ell\).
- \(A_{\ell-1, s}\): the \(s\)th hidden activation at the previous step.
Big picture: Each hidden activation at step \(\ell\) combines information from the current input and the hidden activations from the previous step. The Output At each step, you can compute an output: \[ O_{\ell} = \beta_0 + \sum_{k=1}^{K} \beta_k A_{\ell k} \]
- \(\beta_0\): output bias term.
- \(\beta_k\): weights for each hidden activation.
Usually, only the final output \(O_L\) (after the last element in the sequence) is used for prediction. Weight Sharing
The same weights (\(\mathbf{W}, \mathbf{U}, \mathbf{B}\)) are used at every step. This is called weight sharing. This allows the RNN to process sequences of any length, reusing what it has learned at each step.
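To see the recursion and the weight sharing in one place, here is a minimal sketch of the forward pass in base R; the dimensions, the choice of ReLU for \(g\), and all weight values are made up rather than taken from the text.

```r
# A minimal sketch of an RNN forward pass: the same W, U, w0, and output weights are
# reused at every step (weight sharing); dimensions and values are arbitrary.
relu <- function(z) pmax(z, 0)

rnn_forward <- function(X, W, U, w0, beta, beta0) {
  A <- rep(0, nrow(W))                           # A_0: the initial "memory"
  for (l in 1:nrow(X)) {
    A <- relu(w0 + W %*% X[l, ] + U %*% A)       # A_l = g(w0 + W X_l + U A_{l-1})
  }
  beta0 + sum(beta * A)                          # O_L = beta_0 + sum_k beta_k A_{L,k}
}

set.seed(1)
p <- 4; K <- 3; L <- 6
X  <- matrix(rnorm(L * p), L, p)                 # one sequence of length L
W  <- matrix(rnorm(K * p), K, p)
U  <- matrix(rnorm(K * K), K, K)
w0 <- rnorm(K); beta <- rnorm(K); beta0 <- 0
rnn_forward(X, W, U, w0, beta, beta0)            # the final output O_L
```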
Training the RNN
The goal is to adjust the weights (\(\mathbf{W}, \mathbf{U}, \mathbf{B}\)) to minimize the difference between the predictions and the actual values. For regression (predicting numbers), the loss for a single example is: \[ (Y - O_L)^2 \]
where \(Y\) is the true value, and \(O_L\) is the prediction at the end of the sequence.
For a dataset with n examples, the total loss is:
\[ \sum_{i=1}^{n} (y_i - o_{iL})^2 \]
where \(y_i\) is the true value for the ith sequence, and \(o_{iL}\) is its predicted value. Why Use RNNs?
RNNs remember what has happened before in a sequence, which helps them make better predictions about what comes next or the overall meaning. They are suited to tasks where order and context are crucial (unlike models that treat each data point independently).
When Are Intermediate Outputs Used?
Sometimes, you want an output at every step (for example, translating each word in a sentence to another language). Other times, only the final output is needed (like predicting if a review is positive or negative after reading the whole thing).
Generalized Scenario
You have a dataset where the order of items matters (e.g., words in a review, daily temperatures, notes in music). You represent each item in the sequence as a vector (for example, a “one-hot” vector for a word). An RNN processes each item in order, updating its “memory” at each step. At the end, the RNN uses all the information it has seen to make a prediction (or, if needed, makes predictions at each step). The RNN is trained by adjusting its weights to make its predictions as close as possible to the correct answers, over many examples.
5.2.11.4.1.1 Sequential Models for Document Classification
Sequential Models for Document Classification: The Big Idea We want a computer to read a document (like a movie review) and predict something about it (like whether it’s positive or negative). The earlier approach, the bag-of-words, ignored word order. Now, we want to use the actual sequence of words as they appear to make our prediction.
The Dimensionality Challenge: Why Not Just Use One-Hot Vectors? One-hot encoding is a way to represent each word as a long list (vector) of 0s and a single 1. For example, if your dictionary contains 10,000 words, the word “cat” might be represented by a vector where the 4,512th entry is 1, and the rest are all 0. Every word gets its own spot in this long list. Problem: This is very inefficient.
Each word is a huge list of mostly zeros. With a big dictionary, it takes a lot of memory and computation.
Solution: Word Embeddings Instead of one-hot vectors, we use embeddings to represent each word as a much shorter list of numbers (for example, a list of 50 or 100 numbers instead of 10,000). These numbers capture information about the meaning and usage of the word. Embedding Space
Every word is mapped to a location in a lower-dimensional space (the embedding space). Similar words (like synonyms) end up close to each other in this space. Each word’s embedding is a vector of real numbers (not just 0s and 1s), and its entries are typically all nonzero.
How Embeddings Are Used
We construct a big matrix \(\mathbf{E}\), where each column represents one word in the dictionary, and each row represents one of the embedding dimensions.
The shape of \(\mathbf{E}\): \(m \times D\) (where m is the number of embedding dimensions, and \(D\) is the dictionary size). To get the embedding for a word, you look up its column in \(\mathbf{E}\).
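As a small sketch (with made-up values), the embedding matrix can be stored as an ordinary \(m \times D\) matrix whose columns are named by the dictionary words, so looking up a word’s embedding is just column selection.

```r
# A minimal sketch: an embedding matrix E (m x D) with random, made-up values
set.seed(2)
m <- 4                                                   # embedding dimension
vocab <- c("dog", "cat", "happy", "sad", "good", "bad")  # hypothetical dictionary
E <- matrix(rnorm(m * length(vocab)), nrow = m,
            dimnames = list(NULL, vocab))
E[, "happy"]   # the m-dimensional embedding used in place of a long one-hot vector
```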
Where Do Embeddings Come From? There are two main ways:
Learned During Training: The neural network learns the best embeddings for your specific task. Pretrained Embeddings: You use a set of embeddings that were learned from a huge amount of text before, such as word2vec or GloVe. These embeddings can be kept fixed (weight freezing) or further adjusted during your training.
Practical Representation
Now, every document is represented as a sequence of embedding vectors (one for each word in the document). If documents are of different lengths:
We pick a maximum length \(L\). Documents shorter than \(L\) get padded (filled up) at the beginning with zeros, so all documents have the same length.
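A minimal sketch of the padding step, assuming documents have already been converted to sequences of (made-up) word indices and using 0 as the padding value:

```r
# A minimal sketch: left-pad integer-coded documents to a common length L (0 = padding)
pad_left <- function(doc, L) {
  c(rep(0L, max(0L, L - length(doc))), tail(doc, L))  # longer documents keep only their last L words
}

docs <- list(c(5L, 9L, 2L),                      # a short document (3 word indices)
             c(7L, 1L, 4L, 4L, 8L, 3L, 2L))      # a longer one (7 word indices)
t(sapply(docs, pad_left, L = 5))                 # every row now has exactly L = 5 entries
```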
Feeding Into an RNN
Each document becomes a sequence of vectors: \(X = \{X_1, X_2, ..., X_L\}\), with each \(X_\ell\) an embedding vector. The RNN reads these vectors one at a time, from the first word to the last.
The RNN Model Structure
The RNN keeps a “hidden state” or memory as it moves through the sequence. At each word, it updates its memory based on the current word and what it remembers from the previous words. After reading the whole document, the RNN uses its final memory to make a prediction (like the sentiment).
Parameters
Embedding matrix E: Each word’s embedding. Weight matrices \(\mathbf{W}, \mathbf{U}, \mathbf{B}\): Control how the RNN processes the sequence.
- \(\mathbf{W}\): Connects input to hidden layer.
- \(\mathbf{U}\): Connects previous hidden state to current hidden state.
- \(\mathbf{B}\): Output layer weights.
The number of parameters depends on the embedding size, the number of hidden units, and the dictionary size.
Model Training and Performance
The model is trained to minimize the difference between its predictions and the true labels (using a loss function like squared error or cross-entropy). More advanced RNNs, such as LSTM (Long Short-Term Memory) networks, address the difficulty of carrying information from early in a long sequence through to the end. An LSTM maintains two kinds of memory, short-term and long-term, allowing the model to retain important information for longer.
Results and Improvements
Simple RNNs can be outperformed by more advanced models such as LSTMs. LSTMs can improve accuracy, but they are slower to train. The best models combine good embeddings, enough hidden units, and proper training.
Generalized Example (No Specific Values) Suppose you want to classify any sequence where order matters (not just text—could be DNA, actions, etc.).
Build a dictionary of all possible items (words, actions, etc.). Assign each item an embedding (a short vector of numbers). For each input sequence, turn it into a list of embedding vectors. Feed the sequence, one embedding at a time, into an RNN. The RNN updates its memory each step, then uses its final state to make a prediction. Train the RNN to minimize error on many labeled examples.
5.2.11.4.1.2 Time Series Forecasting
What is Time Series Forecasting? Time series refers to any data collected over time—such as daily stock trading volumes, temperatures over months, or website visits by the hour. The key thing is that the order of the data matters: what happens today depends on what happened yesterday (and maybe days before that).
Types of Data and Variables In the example, three types of daily statistics are tracked:
Trading volume: How much is traded each day, relative to a recent average. Return: How much a value (like a stock index) changes from one day to the next. Volatility: How much prices move around, measured day by day.
Each day is represented by a group of these measurements.
- Why is This Different from Other Problems? Most data problems assume that each example is independent. In time series, this is not true:
Autocorrelation: Values close in time often look similar (for example, trading volume today probably looks a lot like yesterday). Lag: The effect of a past value on the current value, measured by how many days apart they are.
This means that yesterday’s (or previous days’) values are useful for predicting today’s.
- How Do We Build the Input for a Model? To predict the value at a certain time (say, today), we use:
The values from the last several days (how many is a parameter you choose, called the lag). All the types of measurements for each of those days.
Each prediction is based on a mini-sequence: a short stretch of recent days’ data, not the entire series at once. Generalized Example: Say you want to predict tomorrow’s temperature. You might use measurements from the last week (temperature, humidity, wind) as your input.
- How Does the Model Learn? For every day (after the first few used for the lag), you create a pair:
Input: The lagged window of past days’ data. Target: The value you want to predict (for example, today’s trading volume).
You repeat this for each day in your dataset, creating many input/target pairs for training.
- RNN Forecaster An RNN (Recurrent Neural Network) is a model that is designed to learn from sequences—making it ideal for time series.
It processes each day’s data in order, keeping a kind of “memory” of what it has seen. It can use not only the main variable you care about (like trading volume), but also other variables (like return and volatility) from the past.
Structure:
Input: A sequence of vectors (one for each day in the lag window, each containing all the variables for that day). Output: The predicted value for the next day.
How Do We Evaluate the Model? A common way to measure how well the model predicts is the \(R^2\) (R-squared) statistic, which measures how much of the variation in the data the model can explain. Higher is better, with 1 being perfect.
Comparison With Other Models
- Autoregression (AR)
A traditional, simpler method is autoregression: predicting today’s value as a weighted sum of previous days’ values. The formula for an order-\(L\) AR model (using the last L days) is: \[ \hat{v}_t = \hat{\beta}_0 + \hat{\beta}_1 v_{t-1} + \hat{\beta}_2 v_{t-2} + \cdots + \hat{\beta}_L v_{t-L} \]
- \(\hat{\beta}_0\) is like a baseline, and each \(\hat{\beta}_i\) tells how much the value from i days ago matters. You can also add in lagged versions of other variables (like return or volatility) as extra predictors.
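Here is a minimal sketch of fitting such an AR model with ordinary least squares in R; the series is simulated rather than real trading data, and the lag length \(L = 5\) is an arbitrary choice.

```r
# A minimal sketch: an order-L autoregression fit with lm() on a simulated series
set.seed(3)
v <- as.numeric(arima.sim(list(ar = 0.7), n = 300))   # hypothetical stationary series

L <- 5
lagged <- embed(v, L + 1)            # column 1 = v_t, columns 2..(L+1) = v_{t-1}, ..., v_{t-L}
dat <- as.data.frame(lagged)
names(dat) <- c("v_t", paste0("lag", 1:L))

ar_fit <- lm(v_t ~ ., data = dat)    # v_t ~ b0 + b1 v_{t-1} + ... + bL v_{t-L}
coef(ar_fit)
summary(ar_fit)$r.squared            # in-sample R^2 of the AR(L) fit
```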
- Feedforward Neural Network
Another model is a regular neural network that just takes all the lagged values as a single input vector (“flattening” the sequence), rather than stepping through them in order. This model can capture more complex (nonlinear) relationships.
- How Do These Models Compare?
All three types—AR, RNN, and feedforward neural network—can perform pretty well on time series forecasting. The RNN and feedforward neural net can sometimes capture more complex patterns than AR, especially if the relationships are nonlinear. However, the difference in performance might not be huge for some problems, depending on the data.
- Improving the Models Adding Day-of-Week Effects
Trading patterns often depend on the day of the week (e.g., more trades on Mondays or Fridays). You can add a day_of_week variable to the model.
This is done by one-hot encoding: creating a set of binary (0 or 1) variables, one for each possible day.
This extra information helps all models do better, because it gives them a regular pattern to learn from.
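A minimal sketch of the one-hot encoding step (with made-up days), using base R’s model.matrix; the resulting indicator columns would simply be appended to the lagged predictors.

```r
# A minimal sketch: one-hot encoding a day-of-week variable
day <- factor(c("Mon", "Tue", "Wed", "Thu", "Fri", "Mon", "Fri"),
              levels = c("Mon", "Tue", "Wed", "Thu", "Fri"))
model.matrix(~ day - 1)   # one 0/1 indicator column per weekday
```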
Using LSTM (Long Short-Term Memory)
LSTM is a special type of RNN that remembers information for longer periods. Using LSTM can sometimes lead to small improvements, especially when the effects of past days last a long time.
- Practical Considerations
Picking how many past days (lag length) to use is important. Too few may miss important history; too many may add noise. All these models need to be carefully trained and evaluated, often using separate data for testing (not used in training) to see if they generalize well.
5.2.11.4.1.3 Summary of RNNs
Only the Beginning
The examples shown (like document classification and time series) are just the start. RNNs are powerful tools for handling any data where order matters.
There Are Many Advanced Variations of RNNs 1. Convolutional Networks for Sequences
Sometimes, instead of using an RNN, we can use a convolutional neural network (CNN) designed for sequences. In this approach, the sequence (like a row of word embeddings) is treated like a one-dimensional image. A convolution filter (a small sliding window) moves along the sequence, looking for important patterns—like phrases or groups of words—regardless of where they appear. This helps the model learn to recognize specific combinations of words or short sequences that are important for the task.
Analogy:
It’s like scanning a sentence for a particular phrase or structure, no matter where it shows up.
2. Stacking Hidden Layers
The simple RNN described earlier had just one hidden layer (one memory being updated at each step). Stacked RNNs: You can add more hidden layers, creating a deeper network.
The output of one layer (the sequence of activations, \(A_\ell\)) becomes the input to the next hidden layer.
This allows the model to learn more complex and abstract patterns in the sequence. Analogy:
Think of each layer as a different level of understanding, where each layer builds on what the previous one learned.
3. Bidirectional RNNs
In the standard RNN, the sequence is read from start to finish, using only the past to predict the future. Bidirectional RNNs read the sequence both forwards and backwards.
They can use information from both earlier and later in the sequence to make better predictions.
Example:
Understanding the meaning of a word in a sentence often depends on both the words before and after it.
4. Sequence-to-Sequence (Seq2Seq) Models
Sometimes, you want your model to output a sequence, not just a single value.
Example: Translating a sentence from one language to another.
Seq2Seq models use two RNNs:
Encoder: Reads the input sequence and summarizes it into a “hidden state”. Decoder: Uses this hidden state to generate the output sequence (word by word).
The hidden units in these models are thought to capture the meaning of the whole input sequence, so they can generate meaningful outputs in a different order or language.
Why These Variations Matter
These enhancements let RNNs tackle more complicated problems, like:
Understanding the meaning of a document. Translating languages. Summarizing text. Generating music or captions.
Training and Using RNNs Is Hard — But Software Helps
Training RNNs (finding the best settings for all their internal numbers) involves lots of calculations and can be slow even for powerful computers. Luckily, modern software (like TensorFlow, PyTorch, Keras) handles most of the complexity for you. This makes it possible for many people to use these powerful models without being experts in the math or programming behind them.
Everyday Use: Real-World Applications
Many tools you use daily, like Google Translate, use advanced versions of RNNs (and their descendants). These models are built and trained by teams of experts, using huge amounts of data and computational power. As a result, you get accurate translations and other language tools, even though the underlying models are incredibly complex.
Generalized Scenario Imagine you want to solve any problem where the input is a sequence (words, actions, measurements) and the output could be a single result (like sentiment) or another sequence (like a translated sentence). By choosing and combining different RNN variations, you can build a model that:
Learns from the order and context of items. Understands meaning across the whole sequence. Can generate outputs that are also sequences, not just single labels.
5.2.11.5 When to Use Deep Learning
Deep Learning: Where It Excels Deep learning (using neural networks with many layers) has been very successful, especially in:
Image recognition (like identifying objects in photos, scanning X-rays, or reading handwriting). Speech and language tasks (like translating text, generating speech, or summarizing documents). Time series forecasting (predicting things that change over time, like weather or stock prices).
These successes are often reported in the news and scientific journals.
The Big Question: Should We Always Use Deep Learning? With all the hype, it’s tempting to think we should use deep learning for every data problem. But the text argues that we shouldn’t discard older, simpler tools just because deep learning is popular. To explore this, the authors revisit a classic regression problem: predicting the salary of a baseball player based on their past year’s performance.
The Example: Salary Prediction Here’s the setup, generalized:
Goal: Predict a target value (like salary) using information from the past (performance stats, age, etc.). Data: Each data point (player) has multiple features (variables). Split: The data is split into a training set (to build the model) and a test set (to see how well the model works on new data).
Three Types of Models Compared 1. Linear Regression
What is it? A simple model that finds the best straight-line relationship between the input variables and the target. How many “knobs” (parameters) does it have? One for each variable plus one for the overall level (the intercept). How does it work? Adds up the variables, each multiplied by a learned weight.
2. Lasso Regression
What is it? A linear regression with a twist: it tries to set as many weights as possible to zero, keeping only the most important variables (feature selection). Why? To avoid overfitting and make the model simpler and easier to interpret. How is it tuned? Uses cross-validation: tries different levels of “push towards zero” and picks the best one on unseen data.
3. Neural Network
What is it? A much more flexible model with many “hidden” units (neurons) and nonlinear transformations (like ReLU, which outputs zero for negative numbers and the input itself for positive numbers). How many parameters? Can have dozens, hundreds, or thousands of parameters, depending on its size. How does it work? Each input is sent through layers of calculations, allowing the model to learn complex patterns.
| Model | # Parameters | Mean Absolute Error | Test Set \(R^2\) |
|---|---|---|---|
| Linear Regression | few | low | high |
| Lasso | fewer | lowest | still high |
| Neural Network | many | slightly higher | still high |
All three models gave similar performance on the test data. The simpler models were much easier and faster to fit. The neural network needed much more tuning and trial-and-error to get reasonable results.
Interpretation and Simplicity
Linear models are easy to understand: you can look at the coefficients and see how much each variable contributes to the prediction. Lasso regression helps even more by dropping unimportant variables (making the model “sparse”). Neural networks are often called a “black box” because it’s hard to see what’s going on inside—hard to interpret which inputs matter most.
Occam’s Razor Principle Occam’s razor says that when you have several ways to do something equally well, you should pick the simplest.
Simpler models are easier to explain, less likely to break if the data changes, and faster to use.
Selection Bias Warning
If you choose your best model by checking which one worked best on your training data, you might fool yourself. Always check the model’s performance on new (test) data that wasn’t used for model selection.
When Deep Learning is Really Useful
Large Datasets: Deep learning shines when you have huge amounts of data (thousands or millions of examples). Complex Relationships: When the relationship between input and output is very complicated or nonlinear. Interpretability Not Needed: If you don’t care about being able to explain exactly how the model works.
Practical Recommendations
Try the simple models first. If they work just as well, use them—they’re easier to explain, faster, and less likely to break. Use deep learning when:
You have lots of data. The problem is very complex (like image or speech recognition). You don’t need to explain every prediction.
Always compare performance and complexity. Only use more complex models if they really give better results.
| Concept | Explanation | Why It Matters |
|---|---|---|
| Linear regression | Fits a straight-line relationship between variables and target | Simple, interpretable, quick to fit |
| Lasso | Linear regression with variable selection | Simplifies the model, avoids overfitting |
| Neural Network | Flexible, can capture complex relationships | Needs more data, harder to interpret, more complex |
| Occam’s razor | Prefer the simplest model that works | Reduces risk of overfitting, easier to explain |
| Selection bias | Picking a model based on the same data used for fitting | Can give overly optimistic results |
Generalized Scenario Whenever you need to predict an outcome (like price, score, or risk) from a set of features:
Start simple: Try linear or lasso models. Compare performance: If a simple model works well, use it. Try more complex models if needed: If the simple models aren’t good enough and you have lots of data, try a neural network or other advanced tool. Balance accuracy, simplicity, and interpretability: Pick the model that gives the best tradeoff for your needs.
5.2.11.6 Fitting a Neural Network
What Does “Fitting a Neural Network” Mean?
Neural network: A model inspired by the brain, made up of layers of “neurons” (units) connected by “weights” (parameters). Fitting: Adjusting the parameters (weights and biases) so the neural network does a good job predicting the correct output (target) for each input in your data.
The Structure of a Simple Neural Network Suppose you have:
Inputs: Several features for each data point (could be measurements, scores, etc.). Hidden layer: A group of units, each doing a calculation based on the inputs. Output: The final prediction from the network.
The Parameters Weights (\(w_{kj}\)): Numbers that control how much each input influences each unit in the hidden layer. Biases (\(w_{k0}\), \(\beta_0\)): Extra numbers added to each unit’s calculation, giving the model more flexibility. Output weights (\(\beta_k\)): Each hidden unit’s output is multiplied by a weight before being added up for the final prediction.
The Prediction Formula The neural network makes a prediction for each input \(x_i\) using: \[ f(x_i) = \beta_0 + \sum_{k=1}^K \beta_k \, g\left(w_{k0} + \sum_{j=1}^p w_{kj} x_{ij}\right) \]
Let’s break this down:
- \(x_{ij}\): The \(j\)-th feature of the \(i\)-th data point.
- \(p\): The number of input features.
- \(K\): The number of hidden units.
- \(g(\cdot)\): An activation function (like ReLU or sigmoid), which adds nonlinearity.
- \(\sum_{j=1}^p w_{kj} x_{ij}\): Weighted sum of inputs for the k-th hidden unit.
- \(w_{k0}\): Bias for the \(k\)-th hidden unit.
- \(\beta_k\): How much the k-th hidden unit’s output contributes to the final prediction.
- \(\beta_0\): Final bias added to the prediction.
General scenario: Think of each hidden unit as a “mini-expert” looking at all the inputs and producing a signal. The network combines all these mini-experts’ signals to make its final prediction.
The Training Objective: Minimizing Error We want our predictions to be as close as possible to the real values for all data points. We measure this with a squared-error loss (written here as one-half of the sum of squared errors):
\[ \frac{1}{2} \sum_{i=1}^{n}(y_i - f(x_i))^2 \]
- \(y_i\): The true answer for data point i.
- \(f(x_i)\): The neural network’s prediction for i.
- \(n\): Number of data points.
The goal is to find the values of all the weights and biases that make this error as small as possible.
Why Is Fitting Hard?
The error function can have many hills and valleys (it is “non-convex”), so there can be many possible “best” (minimum) points.
Local minimum: A point that is lower than all nearby points, but not necessarily the lowest possible. Global minimum: The lowest point overall.
Neural networks have many parameters, making the landscape even more complicated. The solution you find depends on where you start (your initial guess for the parameters).
How Is Fitting Done? Gradient Descent The Gradient Descent Algorithm
Start with an initial guess for all weights and biases (collectively called \(\theta^0\)). Calculate the error (how far off the predictions are from the actual answers). Change the parameters a tiny bit in the direction that makes the error go down (this is “going downhill”). Repeat steps 2–3 until you can’t reduce the error any more or you detect overfitting.
In math:
\[ R(\theta) = \frac{1}{2} \sum_{i=1}^{n} (y_i - f_\theta(x_i))^2 \]
At each step, update \(\theta\) to a new value that lowers \(R(\theta)\).
Visual analogy: Imagine you’re on a hilly landscape (the error surface). Gradient descent is like walking downhill, always taking the steepest step down, until you reach a valley (minimum error).
Challenges: Local vs. Global Minimum
Sometimes you might get stuck in a “local minimum” (a valley that’s not the lowest possible). If you’re lucky (or have a good starting point), you reach the “global minimum” (lowest valley).
Strategies to Make Fitting Work Better
Slow Learning:
Take small steps (small changes to parameter values) to avoid skipping over the lowest points. Stop the process before the model starts fitting the noise in the data (overfitting).
Regularization:
Add a penalty to the error for having large weights or too many nonzero weights. Two common types:
Lasso regularization: Pushes some weights to zero (simplifies the model). Ridge regularization: Penalizes large weights (keeps them small).
Step-by-Step: Gradient Descent for Neural Networks
Initialize parameters: Start with random values for all weights and biases (\(\theta^0\)), set \(t = 0\). Iterate:
Compute the error. Calculate in which direction a small change will reduce the error the most (\(\delta\)). Update parameters: \(\theta^{t+1} = \theta^t + \delta\). Repeat until the error doesn’t go down anymore.
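The sketch below runs plain gradient descent on a deliberately simple, non-convex one-dimensional \(R(\theta)\) (my own toy function, not the neural-network objective) to show both the update rule and the dependence on the starting point.

```r
# A minimal sketch: gradient descent on a toy non-convex R(theta) with two dips
R     <- function(theta) (theta^2 - 1)^2 + 0.3 * theta      # two local minima, near -1 and +1
gradR <- function(theta) 4 * theta * (theta^2 - 1) + 0.3    # its derivative

descend <- function(theta, rho = 0.05, steps = 200) {
  for (t in 1:steps) theta <- theta - rho * gradR(theta)    # step downhill: theta - rho * gradient
  theta
}

descend(theta = -1.5)   # reaches the dip near -1 (here the global minimum)
descend(theta =  1.5)   # gets stuck in the shallower dip near +1 (a local minimum)
```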
Illustration (from Figure 10.17)
The error function \(R(\theta)\) can have multiple dips. Gradient descent will take you from your starting point to the nearest dip. Sometimes this is the global minimum (the lowest dip), sometimes it’s just a local minimum (not as low as possible).
5.2.11.6.1 Backpropagation
The Big Picture: Why Backpropagation? When training a neural network, our goal is to find the set of parameters (weights and biases, collectively called \(\theta\)) that make the model’s predictions as close as possible to the real answers for all examples in the training data. We measure “closeness” using a function called the objective or loss (for example, mean squared error, \(R(\theta)\)). The lower this number, the better our model is doing. But how do we actually adjust the parameters to make this number smaller?
The Gradient: Our Direction Finder
The gradient is a mathematical tool that tells us, at any set of parameter values, which direction will make the objective (error) rise the fastest. Since we want to reduce the error, we go in the opposite direction of the gradient.
Mathematically:
\[ \nabla R(\theta^m) = \left. \frac{\partial R(\theta)}{\partial \theta} \right|_{\theta = \theta^m} \]
- \(\nabla R(\theta^m)\): The gradient of \(R\) at the current guess for parameters \(\theta^m\).
- This is a vector: for each parameter in \(\theta\), the gradient tells you how much increasing or decreasing that parameter will change the loss.
Updating the Parameters: Gradient Descent To improve the model, we move from our current parameters \(\theta^m\) to new parameters \(\theta^{m+1}\) by taking a small step in the direction that most reduces the error: \[ \theta^{m+1} \leftarrow \theta^{m} - \rho \nabla R(\theta^m) \]
- \(\rho\) (rho): The learning rate—a small number that controls how big a step we take. If the learning rate is too big, we might “overshoot” the minimum; if it’s too small, learning will take a long time.
Why Does This Work?
If you imagine the error function as a landscape of hills and valleys, the gradient always points uphill. By moving in the opposite direction, you go downhill, reducing the error. If the gradient is zero (flat), you are at a minimum—hopefully the lowest (global minimum), but possibly just a local dip.
Calculating the Gradient: Chain Rule and Backpropagation The Chain Rule
The objective function depends on the parameters in a complicated, layered way because of how neural networks are built. To compute how changing any parameter affects the output, we use the chain rule from calculus, which breaks down the effect into steps, layer by layer.
Example: Calculating Derivatives Suppose we have:
\[ R_i(\theta) = \frac{1}{2}\left(y_i - \beta_0 - \sum_{k=1}^K \beta_k g(w_{k0} + \sum_{j=1}^p w_{kj} x_{ij})\right)^2 \]
We want to know:
How does the loss change if we adjust \(\beta_k\) (the output weight)? How does the loss change if we adjust \(w_{kj}\) (a hidden layer weight)?
- With respect to \(\beta_k\): \[ \frac{\partial R_i(\theta)}{\partial \beta_k} = -(y_i - f_\theta(x_i)) \cdot g(z_{ik}) \]
- \(y_i - f_\theta(x_i)\): The residual (difference between true value and predicted value).
- \(g(z_{ik})\): The output of the k-th hidden unit for input \(i\).
- With respect to \(w_{kj}\): \[ \frac{\partial R_i(\theta)}{\partial w_{kj}} = -(y_i - f_\theta(x_i)) \cdot \beta_k \cdot g'(z_{ik}) \cdot x_{ij} \]
- \(g'(z_{ik})\): The derivative of the activation function (how sensitive it is at that point).
- \(x_{ij}\): The \(j\)-th input for the \(i\)-th example.
These formulas show how the “blame” for an error gets distributed backwards through the network, from the output layer to each hidden unit, and then to each input connection.
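To make the chain-rule formulas concrete, here is a minimal sketch that evaluates both gradients for a single observation, using a sigmoid activation and randomly generated values for the data and parameters (everything here is hypothetical).

```r
# A minimal sketch: the two gradients above for one observation in a one-hidden-layer net
g  <- function(z) 1 / (1 + exp(-z))          # activation function
gp <- function(z) g(z) * (1 - g(z))          # its derivative g'(z)

set.seed(4)
p <- 3; K <- 2
x <- rnorm(p); y <- 1.5                      # made-up input and response
W <- matrix(rnorm(K * p), K, p); w0 <- rnorm(K)
beta <- rnorm(K); beta0 <- 0.2

z   <- as.vector(w0 + W %*% x)               # z_k = w_k0 + sum_j w_kj x_j
fx  <- beta0 + sum(beta * g(z))              # prediction f_theta(x)
res <- y - fx                                # residual (y - f_theta(x))

grad_beta <- -res * g(z)                     # dR_i / dbeta_k = -(y - f) * g(z_k)
grad_W    <- outer(-res * beta * gp(z), x)   # dR_i / dw_kj  = -(y - f) * beta_k * g'(z_k) * x_j
grad_beta
grad_W                                       # a K x p matrix, one gradient per weight w_kj
```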
The Backpropagation Algorithm: What’s Really Happening
Forward pass: Compute the predictions for all examples in the data using current parameter values. Compute error: Measure how far off each prediction is (the residual). Backward pass (Backpropagation):
Use the chain rule to pass the residuals backward through the network. Calculate, for each parameter, how much changing it would have reduced the error. This tells you the gradient for each parameter.
Update parameters: Take a small step for each parameter in the direction that most reduces the error.
Why Is This Called Backpropagation?
“Backpropagation” means propagating (passing) the error backward through the network, so every parameter gets its share of the blame (and can adjust itself to reduce future errors).
Why Is This Efficient?
Even for large networks, the chain rule lets us efficiently and systematically compute all the gradients with a small number of calculations—much faster than if we tried to do each one from scratch.
Generalized Scenario Suppose you have any prediction problem (could be images, text, numbers, etc.):
Build a neural network model. For each batch of training data, use backpropagation to figure out how to adjust each parameter to reduce the errors. Repeat the process, gradually improving performance.
5.2.11.6.2 Regularization and Stochastic Gradient Descent
Gradient Descent in Practice
Gradient Descent is a process where the neural network slowly adjusts its parameters (weights and biases) to make better predictions. It does this by repeatedly looking at how much its predictions differ from the correct answers (the error or loss), then nudging the parameters to reduce this error.
But in real life, especially with large datasets and big neural networks, some challenges arise:
- Minibatches and Stochastic Gradient Descent (SGD)
When you have a huge amount of data, it’s inefficient to look at every single example for every tiny adjustment to the parameters. Instead, you can randomly pick a small group of examples (called a minibatch) and use just these to estimate how to update the parameters. Doing this repeatedly, with new random minibatches each time, is called Stochastic Gradient Descent (SGD).
“Stochastic” means “random”—the updates are based on random samples. This is now the standard approach for training deep neural networks.
Why use minibatches?
Faster updates: You don’t have to wait to process all your data before making each tiny improvement. Memory efficient: You only need to load a small part of the data at any time. Adds a bit of randomness: This can help the model avoid getting stuck in a bad local minimum (a suboptimal solution).
General scenario: Imagine you have thousands of photos for training an image classifier. Instead of adjusting your model after looking at every photo, you look at a handful (say, 128) at a time, learn from them, then move on to the next handful.
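A minimal sketch of minibatch SGD on a simulated least-squares problem (not a neural network, and all sizes here — 2,000 observations, batches of 128, 20 epochs — are arbitrary); the point is just the loop structure: shuffle, split into minibatches, and take one small step per batch.

```r
# A minimal sketch: minibatch stochastic gradient descent for least squares
set.seed(5)
n <- 2000; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta_true <- rnorm(p)
y <- as.vector(X %*% beta_true + rnorm(n))

theta <- rep(0, p); rho <- 0.05; batch_size <- 128

for (epoch in 1:20) {
  idx <- sample(n)                                             # new random order each epoch
  batches <- split(idx, ceiling(seq_along(idx) / batch_size))  # roughly n / 128 minibatches
  for (b in batches) {
    Xb <- X[b, , drop = FALSE]; yb <- y[b]
    grad  <- -2 * crossprod(Xb, yb - Xb %*% theta) / length(b) # gradient from this batch only
    theta <- theta - rho * as.vector(grad)                     # one small downhill step
  }
}
max(abs(theta - beta_true))   # should be small: the noisy updates still find the right answer
```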
- Regularization: Preventing Overfitting
Overfitting is when a model gets too good at matching the training data, but performs poorly on new, unseen data. Regularization adds rules or penalties to the learning process so the model doesn’t become too complex or memorize the data.
Types of Regularization a. Ridge Regularization (L2)
Adds a penalty term to the loss function: \(\lambda \sum_j \theta_j^2\) Encourages the weights to stay small. \(\lambda\) is a tuning parameter that controls how strong the penalty is.
b. Lasso Regularization (L1)
Adds a penalty based on the sum of the absolute values of the weights. Tends to push some weights all the way to zero, making the model simpler and easier to interpret.
c. Dropout
Randomly “drops out” (temporarily removes) some neurons during each training step. Forces the network to not rely too heavily on any single connection, making it more robust.
See Figure 10.19: On the left, every neuron is always connected. On the right, some neurons (grey) are ignored in each training round, chosen at random. d. Early Stopping
While training, watch how well the model does on a separate “validation” set (data not used for training). If the validation error starts to increase (even as the training error keeps going down), stop training! This means the model is starting to overfit.
- Training in Epochs
Epoch: One full pass over the entire training dataset. In practice, you do many epochs, checking after each one how well the model is doing on the validation data.
Example: If you have 48,000 training examples and use minibatches of 128, you’ll do about 375 minibatch updates in one epoch.
- Putting It All Together: The Loss Function with Regularization When training with regularization, the objective (loss) function might look like:
\[ R(\theta; \lambda) = \text{Loss from predictions} + \lambda \times \text{Penalty on weights} \]
For example, in classification: \[ R(\theta ; \lambda) = -\sum_{i=1}^{n} \sum_{m=0}^{9} y_{im} \log(f_m(x_i)) + \lambda \sum_j \theta_j^2 \]
The first part measures how well the model predicts the correct class (using log-likelihood). The second part (multiplied by \(\lambda\)) discourages large weights.
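Written as code, the penalized objective is just the prediction loss plus \(\lambda\) times the penalty; the sketch below assumes you already have a one-hot label matrix y, a matrix of predicted class probabilities f, and a vector of weights theta (all made up here).

```r
# A minimal sketch: ridge-penalized classification loss
penalized_loss <- function(y, f, theta, lambda) {
  -sum(y * log(f)) + lambda * sum(theta^2)   # negative log-likelihood + lambda * L2 penalty
}

y <- diag(3)                                 # three examples, one-hot over three classes
f <- matrix(c(0.8, 0.1, 0.1,
              0.2, 0.7, 0.1,
              0.1, 0.2, 0.7), nrow = 3, byrow = TRUE)
penalized_loss(y, f, theta = rnorm(10), lambda = 0.01)
```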
- Why These Tricks Matter
Minibatch SGD: Makes learning scalable and efficient for massive datasets. Regularization (ridge, lasso, dropout, early stopping): Keeps models from becoming overly complex, improving their ability to generalize to new data. Validation set: Used to decide when to stop training and how to set parameters like \(\lambda\), without “cheating” by looking at the test set.
Generalized Scenario If you’re training a neural network for any problem (speech recognition, image classification, forecasting, etc.):
Use minibatches to update parameters quickly and efficiently. Apply regularization techniques (ridge, lasso, dropout, early stopping) to keep the model from overfitting. Monitor performance on a validation set to decide when to stop and how to tune parameters. Repeat over multiple epochs to improve the model.
5.2.11.6.3 Dropout Learning
What is Dropout? Dropout is a technique used in training neural networks to help the model generalize better and avoid “overfitting.” Overfitting happens when a model becomes too specialized to the training data and fails to perform well on new, unseen data.
How Does Dropout Work?
When training a neural network, each layer has several “units” (neurons). In dropout, for each training example (each time you feed data into the network while learning), you randomly turn off (drop out) a certain fraction \(\phi\) of these units. “Turning off” a unit means it does not participate in that round of learning: it acts as if it is not there at all for that training example.
Key Points:
Random selection: The units to drop out are picked at random, and this is done differently for every example during training. Fraction dropped: If, for example, \(\phi = 0.5\), then half the units in the layer are dropped out at each step.
Why Do We Use Dropout?
Prevents over-specialization: If the same units always work together, they can form “shortcuts”—relying on each other too much and not learning to work alone. Forces robustness: Because a unit can’t count on others always being there, it has to learn features that are useful on their own. Regularization: Like other forms of regularization (such as ridge or lasso), dropout helps the network avoid fitting the training data too perfectly, improving its ability to generalize.
What Happens to the Remaining Units?
The units that are not dropped out are still active and contribute to the network’s output. To make up for the missing units, the outputs of the remaining units are scaled up by a factor of \(\frac{1}{1-\phi}\). For example, if half the units are dropped out (\(\phi = 0.5\)), the surviving units are each doubled in strength for that training step.
How is Dropout Implemented?
In practice, dropout is done by setting the outputs (activations) of the dropped units to zero during the forward pass. This is done only during training. When making predictions on new data (testing), all units are used; because the surviving activations were already scaled up by \(\frac{1}{1-\phi}\) during training, no further adjustment is needed (in the alternative formulation, no scaling is done during training and the outputs are instead scaled down by \(1-\phi\) at test time).
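A minimal sketch of the training-time operation on a single layer’s activations (values are made up); this is the “inverted” form in which survivors are scaled up by \(1/(1-\phi)\) during training, so predictions need no extra scaling.

```r
# A minimal sketch: inverted dropout applied to one layer's activations during training
dropout <- function(a, phi) {
  keep <- runif(length(a)) > phi            # each unit survives with probability 1 - phi
  (a * keep) / (1 - phi)                    # dropped units become 0; survivors are scaled up
}

set.seed(6)
a <- runif(8)                               # hypothetical activations of one hidden layer
dropout(a, phi = 0.5)                       # roughly half the entries are now exactly 0
```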
Visual Example (Generalized) Imagine a neural network as a team of people working together on a task:
Normally, everyone is always present. With dropout, for each new project (training example), you randomly tell some people to stay home. The people who show up have to work harder (their output is scaled up), and over time, everyone learns to be flexible and not rely too much on any single team member.
Generalized Scenario If you are training a neural network (for images, text, forecasting, etc.):
For every training example, randomly pick some neurons to ignore (drop out). The remaining neurons work harder to “cover” for the missing ones (by scaling up their outputs). This random dropping out happens every time you show the network a new training example. When you use the model for real predictions, all neurons are used, and because the surviving outputs were already rescaled during training, no further adjustment is needed.
5.2.11.6.4 Network Tuning
What is Network Tuning? Network tuning refers to the process of adjusting various parts of a neural network and its training procedure to get the best possible results. This is also called “hyperparameter tuning,” because most of these settings are not learned by the model itself—they are chosen by the person training the model.
Key Choices in Network Tuning 1. Number of Hidden Layers and Units per Layer
Hidden layers: These are layers between the input and the output. More layers can allow the network to learn more complex patterns. Units per layer: Each layer contains several units (neurons). More units allow the network to capture more information at each layer.
Modern practice: It’s now common to use many units per layer, sometimes even hundreds or thousands, because regularization techniques (like dropout or L2/L1 penalties) help prevent the network from becoming too complex and overfitting.
2. Regularization Tuning Parameters
Dropout rate (\(\phi\)): What fraction of units to randomly drop out during training. Common values are between 0.2 and 0.5, but it depends on the problem. Lasso (\(\lambda\), L1 regularization): Controls how strongly to push weights towards exactly zero, which can make the model simpler. Ridge (\(\lambda\), L2 regularization): Controls how strongly to keep weights small, which helps prevent the model from becoming overly flexible.
These values can be set differently for each layer to tailor the regularization to different parts of the network.
3. Details of Stochastic Gradient Descent (SGD)
Batch size: How many training examples you use to compute each update to the network. Small batch sizes make updates noisier but can help the model escape bad local minima. Number of epochs: How many times you go through the entire training set. More epochs give the model more opportunities to learn, but too many can lead to overfitting. Data augmentation: Techniques to artificially expand the training data by creating slightly altered versions (for example, rotating images, adding noise, etc.). This can help the model generalize better.
Why Do These Choices Matter?
Each of these settings can have a big impact on how well the model learns and how well it performs on new, unseen data. There’s no one-size-fits-all answer—what works best depends on the data and the problem.
The Process of Tuning
Set initial values for all parameters (number of layers, units, regularization, etc.). Train the network and evaluate its performance on a validation set (data not used for training). Adjust the parameters (for example, try different dropout rates or batch sizes) and retrain. Repeat until you find a combination that gives good results without overfitting.
This process often involves a lot of trial and error.
What Happens If You Tune Carelessly?
If you make the network too big and don’t use enough regularization, it will overfit—memorize the training data but fail on new data. If you make it too small, it won’t have enough capacity to learn the patterns in the data. If you tune too much based on the validation set, you can even start to overfit to that set!
Example (Generalized Scenario) Imagine you’re building a neural network to classify images or predict sales:
You can choose to have 2 or 3 hidden layers, with dozens or hundreds of units per layer. You set a dropout rate (how many neurons to ignore during each training step). You pick a batch size (how many images/sales to use per update). You train for a certain number of epochs (passes through the data). You try different values for these choices, watching to see which combination gives the best performance on unseen data. If you get great results on training but poor on new data, you increase regularization or decrease network size. If your results are poor everywhere, you try making the network bigger, training longer, or tweaking the learning rate.
Tuning a neural network is a careful balancing act:
You want it big and powerful enough to learn patterns, but not so big it memorizes everything. You use regularization, dropout, and careful training practices to keep it under control. You experiment, measure, and adjust, always checking performance on data the model hasn’t seen before.
5.2.11.7 Interpolation and Double Descent
- Bias-Variance Trade-off and the Classic “U-shape” What is the bias-variance trade-off?
In machine learning, the bias-variance trade-off is a fundamental concept:
Bias is the error from making strong assumptions in your model (e.g., assuming a straight line when the true relationship is curved). Variance is the error from a model being too sensitive to small fluctuations in the training data (i.e., it “wiggles” too much).
The best models are usually in the middle: not too simple (high bias, low variance) and not too complex (low bias, high variance).
Visualizing the trade-off:
Imagine plotting model flexibility (how complex the model is) on the x-axis, and error on new data (test error) on the y-axis. Test error typically forms a U-shape:
Too simple: error is high (left side—high bias). Just right: error is low (bottom of the U). Too complex: error goes up again (right side—high variance).
Training error (error on the data used to fit the model) always goes down as flexibility increases—it never goes up.
- Interpolating the Training Data
Interpolation means creating a model that fits the training data perfectly—training error is exactly zero. Traditionally, this was considered bad, because the model could be just memorizing the training data, leading to very high test error (bad predictions for new data).
- The Double Descent Phenomenon What is double descent?
Recent research found that in certain cases, making your model even more flexible (even past the point where it can perfectly fit the training data) can sometimes make the test error go down again after a brief spike. This is called double descent:
The test error curve goes down (left side of the U), then up (the classic rise in the U-shape), then down again (the “second descent”) as the model gets even more complex.
Why is it called “double descent”?
Because the test error first descends, then ascends, then descends again, instead of following a single U-shape.
- Illustration with a Generalized Example Suppose you want to predict a value Y based on some input X. Here’s a generalized version of the scenario: You collect some data: pairs of inputs \(X\) and outputs \(Y\). You fit a model to this data using a method called a spline, which lets you adjust how “wiggly” the curve is (the number of wiggles or “degrees of freedom”).
Degrees of freedom (\(d\)): How many “knobs” the model has to fit the data—higher d means more wiggles.
Low degrees of freedom (small \(d\)): The fitted curve is smooth and may miss some details in the data. Degrees of freedom equals the number of data points (\(d = n\)): The curve fits every data point exactly—zero training error (interpolates the data). Degrees of freedom greater than number of data points (\(d > n\)): There are many ways to fit the data exactly, and the chosen solution is the “smoothest” one according to a mathematical rule (minimum-norm).
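The sketch below imitates the \(d > n\) regime with a made-up data set and an arbitrary Gaussian basis: because there are more basis functions than observations, many coefficient vectors fit the data exactly, and the Moore-Penrose pseudoinverse (MASS::ginv) returns the minimum-norm one.

```r
# A minimal sketch: a minimum-norm exact fit when the number of basis functions exceeds n
library(MASS)                                         # for ginv()

set.seed(7)
n <- 20
x <- sort(runif(n)); y <- sin(4 * x) + rnorm(n, sd = 0.3)    # made-up data

d <- 50                                               # d > n: more "knobs" than observations
centers <- seq(0, 1, length.out = d)
B <- outer(x, centers, function(x, c) exp(-50 * (x - c)^2))  # n x d Gaussian basis matrix

coef_mn <- ginv(B) %*% y                              # minimum-norm solution among the exact fits
mean((y - B %*% coef_mn)^2)                           # training error is (numerically) about zero
```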
- What Happens to Training and Test Error?
Training error: Drops to zero when the model becomes flexible enough to fit all data points (interpolation). Test error:
Follows a U-shape for low to moderate flexibility. Spikes up when interpolation is first reached—model becomes extremely “wiggly” and unstable. Drops again when flexibility is increased even further and a “smooth” solution is chosen from among the many possible fits.
Visual summary from the figures:
With moderate flexibility, the model is not too wiggly and generalizes well. At the interpolation threshold (model is just flexible enough to memorize the data), the model becomes unstable and test error spikes. With even more flexibility, the model can interpolate smoothly, and test error can go down again.
- Why Does This Happen?
At the interpolation threshold (model just fits the data), there’s often only one way to fit all points, and it’s usually very wild and unstable. With even more flexibility, there are many ways to fit the data, so you can pick the smoothest one, which tends to generalize better.
- Application to Neural Networks and Deep Learning
This “double descent” is not just for splines—it can happen in neural networks too. When neural networks have a huge number of parameters (many layers, many units), they can fit the training data exactly (zero error). Surprisingly, in some cases, these highly flexible networks still perform very well on new data (test set)—especially in “high signal-to-noise” problems (the pattern in the data is strong and clear).
- Key Takeaways and Warnings
Double descent does not break the classic bias-variance trade-off; it just shows that the relationship between flexibility and error can be more complicated than a simple U-shape. Most regularized methods do not show double descent: Regularization (like ridge or lasso) prevents the model from interpolating the data, so you don’t see this phenomenon. Zero training error can sometimes be good (as with “support vector machines” or well-regularized neural networks), but it depends on the details—especially the signal-to-noise ratio. In practice: It is usually safer and more reliable to use regularization, early stopping, and other techniques to avoid overfitting, rather than relying on double descent. The shape of the error curve can depend on how you measure “flexibility”. Depending on your choice of x-axis (e.g., number of parameters, basis functions, etc.), the error curve may not always look like a U or double descent.
- Final Summary
The classic wisdom is to avoid models that fit the training data perfectly, because they tend to perform poorly on new data (overfitting). Double descent shows there are special cases (especially with neural networks and splines) where making the model even more flexible can make test error go down again, provided the model picks a smooth solution among many possible fits. Regularization, early stopping, and smoothness constraints are better-understood and safer ways to get good test error, and are widely used in practice. The bias-variance trade-off is still important and should always be kept in mind. Double descent is a fascinating phenomenon, but not a rule to replace the basics of model selection and regularization.