Digital Mathematical Notebook
Statistical Learning and Large Data
This page follows the 172-slide course deck in its original order: clustering, principal components, supervised classification, nonparametric smoothing, cross-validation, resampling, large-feature linear models, penalization, supervised dimension reduction, and ultrahigh-dimensional screening.
A single review marker remains in the late feature-screening block, where the final covariate-information notation is harder to read cleanly from the slides than the surrounding ideas.
The first half is about learning tasks and computational assessment: unsupervised structure, dimension reduction, classification, smoothing, and resampling. The second half shifts to what changes when the feature space itself becomes large, unstable, or ultra-high dimensional.
Cluster analysis, PCA, and an MDS addendum organize the opening lectures.
Classification, smoothing, cross-validation, bootstrap, and permutation logic form the middle.
Ridge, LASSO, subset selection, SDR, SIS, and ISIS address large or ultra-large feature spaces.
Source map and reconstruction notes
The notebook follows the slide blocks in sequence: Cluster Analysis; Principal Components Analysis with an MDS addendum; Supervised Classification; Non-parametric Regression and Density Estimation; Cross-Validation; Resampling; Linear Models in Large Feature Spaces; Penalizations via Ridge and LASSO; Traditional Feature Selection; Supervised Dimension Reduction; and Feature Screening for ultra-large spaces.
Textbook references that appear directly on the slides are kept in the reading section at the end. Where a slide is mainly visual, the page recreates the mathematical point in new SVG figures or explorable widgets rather than embedding the slide image.
Cluster analysis
The deck begins with unsupervised classification: there are no labels to predict, only a cloud of points in feature space and the question of whether a useful partition or hierarchy emerges from dissimilarity.
Hierarchical clustering does not return one partition and one value of \(K\). It returns a nested sequence of merges summarized by a dendrogram, and the cut height determines the partition read off from that tree.
K-means fixes a target number of clusters \(K\) in advance and optimizes a partition directly: $$\sum_{k=1}^{K}\sum_{i:\,C(i)=k}\lVert x_i-\mu_k\rVert^2.$$ It is fast and practical, but the final solution depends strongly on initialization and only reaches a local minimum.
The slides separate three decisions that are often blurred together: choosing a point-to-point distance, choosing a linkage rule between clusters, and choosing how many groups to retain. Hierarchical clustering uses the first two to build a nested family of partitions. K-means skips the linkage step and alternates assignment and centroid updates instead.
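As a concrete anchor for that alternation, here is a minimal numpy sketch of Lloyd's algorithm. The function name, random initialization, and stopping rule are illustrative choices rather than anything prescribed by the deck; running several restarts and keeping the lowest within-cluster sum of squares is the usual guard against bad local minima.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # Random initialization; the final partition depends strongly on it.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: send each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster
        # (this sketch assumes no cluster goes empty; restart if one does).
        new = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new, centroids):
            break
        centroids = new
    within_ss = ((X - centroids[labels]) ** 2).sum()  # the criterion above
    return labels, centroids, within_ss
```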
The silhouette diagnostic appears as the main quality score for a given partition: $$d_{i,C}=\frac{1}{\#(C)}\sum_{\ell\in C}d(x_i,x_\ell), \qquad a_i=d_{i,C(i)}, \qquad b_i=\min_{C\neq C(i)} d_{i,C},$$ $$\operatorname{sil}_i=\frac{b_i-a_i}{\max\{a_i,b_i\}}, \qquad \operatorname{Sil}(K)=\frac{1}{N}\sum_{i=1}^{N}\operatorname{sil}_i.$$
High silhouette values mean the partition is coherent, but the slides explicitly warn that the criterion is not monotone in \(K\). Later pages add other ways to choose the number of clusters: dissimilarity curves, stability or reproducibility measures, and BIC when a mixture model is available.
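The silhouette formulas translate directly into code. A sketch, following the deck's convention that \(d_{i,C(i)}\) averages over all members of the cluster including \(x_i\) itself (its own term is zero):

```python
import numpy as np

def average_silhouette(X, labels):
    """Sil(K): mean of the per-point silhouette values defined above."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # d(x_i, x_l)
    clusters = np.unique(labels)
    sil = np.empty(len(X))
    for i in range(len(X)):
        # Mean dissimilarity from point i to every cluster: d_{i,C}.
        d_iC = {c: D[i, labels == c].mean() for c in clusters}
        a_i = d_iC[labels[i]]
        b_i = min(d_iC[c] for c in clusters if c != labels[i])
        sil[i] = (b_i - a_i) / max(a_i, b_i)
    return sil.mean()
```

Evaluating this over a grid of \(K\) values produces exactly the non-monotone curve the slides warn about.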
| Method | Core idea | What the slides stress |
|---|---|---|
| Hierarchical | Merge clusters step by step using a linkage rule. | Produces a dendrogram and a nested family of partitions, not one final answer. |
| K-means | Alternate centroid updates with reassignment. | Optimizes within-cluster sum of squares, depends on initialization, and needs \(K\). |
| Variants | PAM, self-organizing maps, fuzzy \(k\)-means, Gaussian mixtures. | Useful when prototypes, soft partitions, or model-based interpretations matter. |
Principal components analysis and an MDS addendum
PCA enters as a general-purpose reduction technique: capture the main directions of variation, build a low-dimensional representation, and create composite predictors without using a response.
The slides treat location as irrelevant and center each feature first. Writing the centered sample variance-covariance matrix as \(S\) (the deck suppresses the usual \(1/(n-1)\) factor), PCA uses the eigen-decomposition $$S=\sum_{m=1}^{p}\lambda_m\phi_m\phi_m^{\top}, \qquad \lambda_1\ge \cdots \ge \lambda_p\ge 0.$$
The loadings are the coordinates of each eigenvector \(\phi_m\) in the original basis. The scores are the values of the new composite feature on each observation: $$z_{im}=\phi_m^{\top}x_i.$$
The geometric point is simple and powerful: for any dimension \(m\), the first \(m\) principal directions define the \(m\)-dimensional linear subspace that lies closest to the data cloud in a least-squares sense. That is why the slides move naturally from eigenvalues to scree plots, biplots, and cumulative explained variance.
Percentage of variance explained appears explicitly as $$\operatorname{PVE}_m=\frac{\lambda_m}{\sum_{k=1}^{p}\lambda_k}, \qquad \operatorname{CPVE}_m=\sum_{j=1}^{m}\operatorname{PVE}_j.$$
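All of those quantities fall out of one eigen-decomposition. A minimal sketch under the deck's conventions (centered features, scatter matrix without the \(1/(n-1)\) factor):

```python
import numpy as np

def pca(X):
    """PCA via eigen-decomposition of the centered scatter matrix S."""
    Xc = X - X.mean(axis=0)            # center each feature first
    S = Xc.T @ Xc                      # the deck's S (no 1/(n-1) factor)
    lam, phi = np.linalg.eigh(S)       # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]      # reorder so lambda_1 >= ... >= lambda_p
    lam, phi = lam[order], phi[:, order]
    Z = Xc @ phi                       # scores z_im = phi_m^T x_i
    pve = lam / lam.sum()              # PVE_m; np.cumsum(pve) gives CPVE_m
    return lam, phi, Z, pve
```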
A short addendum then introduces multidimensional scaling. In the Euclidean case, the deck notes that MDS reduces to another eigen-decomposition problem, so it sits naturally next to PCA as a geometry-first view of dimension reduction before the supervised material begins.
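The deck only gestures at that connection, but the Euclidean case is worth making concrete. A standard classical-scaling (Torgerson) sketch, not taken from the slides: double-center the squared distance matrix and eigendecompose.

```python
import numpy as np

def classical_mds(D, m=2):
    """Classical MDS from a Euclidean distance matrix D (n x n)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    B = -0.5 * J @ (D ** 2) @ J             # double-centered Gram matrix
    lam, V = np.linalg.eigh(B)
    idx = np.argsort(lam)[::-1][:m]         # top-m eigenpairs
    # Coordinates: eigenvectors scaled by sqrt of (nonnegative) eigenvalues.
    return V[:, idx] * np.sqrt(np.maximum(lam[idx], 0))
```

When D comes from Euclidean points, this recovers (up to rotation and sign) the same configuration PCA gives on the centered coordinates, which is exactly why the addendum sits next to PCA.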
Supervised classification
The page now moves from unlabeled structure to labeled prediction. Here the slides are careful about model choice, thresholding, and the gap between training error and test error.
We observe pairs \((x_i,y_i)\) with a categorical response \(Y\in\{1,\dots,K\}\). The goal is prediction: build a rule \(x\mapsto \widehat y(x)\) with low misclassification error on new data, not merely low error on the training sample used to fit the classifier.
Four model families anchor the lecture block: logistic regression, LDA, QDA, and \(K\)-nearest neighbors. Logistic regression models posterior class probabilities directly in the binary case. LDA and QDA instead model the class-conditional feature distributions \(f_k(x)\), then approximate the Bayes classifier through discriminant functions. KNN drops the parametric model and classifies by local majority vote.
Binary logistic regression: $$\log\frac{p(x)}{1-p(x)}=\beta_0+\beta^{\top}x.$$ LDA assumes $$X\mid Y=k\sim N_p(\mu_k,\Sigma),$$ leading to the familiar linear discriminant score $$\delta_k(x)=x^{\top}\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^{\top}\Sigma^{-1}\mu_k+\log\pi_k.$$ QDA keeps the Gaussian assumption but allows a separate covariance \(\Sigma_k\) for each class.
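The discriminant score \(\delta_k(x)\) is straightforward to compute once the pooled covariance is estimated. A minimal sketch with plug-in estimates under the shared-\(\Sigma\) assumption (the function name and interface are illustrative):

```python
import numpy as np

def lda_predict(X, y, X_new):
    """Classify rows of X_new by the largest plug-in score delta_k(x)."""
    classes = np.unique(y)
    n = len(y)
    pi = np.array([np.mean(y == k) for k in classes])          # priors
    mu = np.array([X[y == k].mean(axis=0) for k in classes])   # class means
    # Pooled within-class covariance (the shared-Sigma assumption).
    Sigma = sum((X[y == k] - mu[i]).T @ (X[y == k] - mu[i])
                for i, k in enumerate(classes)) / (n - len(classes))
    Sinv = np.linalg.inv(Sigma)
    scores = np.array([X_new @ Sinv @ mu[i] - 0.5 * mu[i] @ Sinv @ mu[i]
                       + np.log(pi[i]) for i in range(len(classes))])
    return classes[scores.argmax(axis=0)]
```

Replacing the pooled \(\Sigma\) with per-class covariances \(\Sigma_k\) (and adding the \(-\tfrac12\log\lvert\Sigma_k\rvert\) and quadratic terms) turns the same skeleton into QDA.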
The classification block spends real time on confusion matrices, ROC curves, sensitivity, specificity, and predictive values. That matters especially when classes are unbalanced: low overall error can hide poor detection of the minority class, so threshold choice and class-specific rates become essential.
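A sketch of the class-specific rates, assuming binary labels coded 0/1 and that both classes occur among the truths and the predictions:

```python
import numpy as np

def binary_rates(y_true, y_pred):
    """Sensitivity, specificity, and predictive values from the 2x2 table."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {"sensitivity": tp / (tp + fn),   # true-positive rate
            "specificity": tn / (tn + fp),   # true-negative rate
            "ppv": tp / (tp + fp),           # positive predictive value
            "npv": tn / (tn + fn)}           # negative predictive value
```

Sweeping the classification threshold and recording (1 - specificity, sensitivity) at each value traces out the ROC curve the slides discuss.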
| Procedure | Bias / flexibility | What the deck emphasizes |
|---|---|---|
| Logistic regression | Linear decision surface in feature space. | Probability modeling and threshold tuning. |
| LDA | Lower variance under a shared-covariance Gaussian model. | Can outperform QDA when the true boundary is close to linear or data are limited. |
| QDA | More flexible, but with higher variance. | Useful when covariance structure differs substantially across classes. |
| KNN | Nonparametric and local. | Works through neighborhoods rather than a fitted parametric discriminant. |
Nonparametric regression, smoothing, and kernel density estimation
Once the response is continuous, the slides pause the parametric model-building game and ask what happens when the data themselves are allowed to suggest the shape of the systematic component.
LOWESS fits local weighted least-squares models around each target point. The main tuning choice is the fraction \(q\in(0,1)\) of the sample used in each neighborhood, together with the local polynomial degree and the weighting function.
The slides explicitly list the drawback: a smoother does not hand back one explicit global equation for the regression function. It is excellent for visualization, diagnosis, and local signal extraction, but less convenient when a compact parametric law is the end goal.
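A minimal sketch of one local fit, using the classic tricube weights and a degree-1 local polynomial; the robustness iterations of full LOWESS are omitted, and the sketch assumes distinct \(x\) values and a span large enough for a stable weighted fit:

```python
import numpy as np

def lowess_at(x, y, x0, q=0.5):
    """Locally weighted linear fit evaluated at a single target point x0."""
    n = len(x)
    k = max(3, int(np.ceil(q * n)))       # neighborhood size from the span q
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[:k]            # the k nearest observations
    h = dist[idx].max()                   # local bandwidth
    w = (1 - (dist[idx] / h) ** 3) ** 3   # tricube weights
    # Weighted least squares for a local line centered at x0.
    A = np.column_stack([np.ones(k), x[idx] - x0])
    W = np.diag(w)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y[idx])
    return beta[0]                        # fitted value at x0
```

Evaluating this over a grid of target points traces out the smooth curve; shrinking q makes it wigglier, enlarging q makes it smoother.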
LOWESS appears twice in the deck: first as a direct regression tool, then as a diagnostic companion to parametric regression, where plotting residuals against predictors or fitted values can reveal mean patterns that a postulated model missed. The same lecture then turns to kernel density estimation, where the tuning question becomes bandwidth selection rather than local span.
Kernel density estimation is written in the usual form $$\widehat f_h(x)=\frac{1}{nh}\sum_{i=1}^{n}K\!\left(\frac{x-x_i}{h}\right),$$ and the slides use the bandwidth \(h\) to illustrate the bias-variance dilemma: large \(h\) reduces variance but oversmooths the structure, while small \(h\) preserves detail at the cost of instability.
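The estimator is nearly a one-liner in vectorized form. A sketch with a Gaussian kernel (any symmetric density works as \(K\)):

```python
import numpy as np

def kde(grid, data, h):
    """Gaussian-kernel density estimate f_hat_h evaluated on a grid."""
    u = (grid[:, None] - data[None, :]) / h          # (x - x_i) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    return K.sum(axis=1) / (len(data) * h)           # (1/nh) * sum_i K(...)
```

Plotting kde(grid, data, h) for a large and a small h on the same axes reproduces the oversmoothing-versus-instability picture from the slides.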
Cross-validation
The cross-validation lecture is framed as a computational replacement for analytic formulas when the question is out-of-sample predictive accuracy.
The slides make a clean distinction between in-sample error and out-of-sample error. For flexible procedures, the training sample can be fit extremely closely while true predictive performance remains much worse. Cross-validation estimates the out-of-sample criterion directly by repeatedly withholding part of the data during fitting.
For regression, the in-sample criterion is $$\operatorname{MSE}_{\mathrm{in}}=\frac{1}{n}\sum_{i=1}^{n}(y_i-\widehat y_i)^2,$$ while \(k\)-fold cross-validation estimates out-of-sample error by averaging foldwise test errors: $$\operatorname{CV}_k=\frac{1}{k}\sum_{j=1}^{k}\operatorname{MSE}_j.$$
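A generic sketch of \(\operatorname{CV}_k\); the fit and predict callables are placeholders for whatever procedure is being assessed:

```python
import numpy as np

def cv_mse(X, y, fit, predict, k=10, seed=0):
    """k-fold CV: average the foldwise held-out MSEs, as in CV_k above."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        model = fit(X[train], y[train])       # fit sees only the training part
        resid = y[test] - predict(model, X[test])
        errs.append(np.mean(resid ** 2))      # MSE_j on the held-out fold
    return np.mean(errs)                      # CV_k
```

For ordinary least squares, for example, fit could be lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0] and predict could be lambda b, X: X @ b.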
The same cross-validation curve can be used in two different ways. First, to estimate predictive accuracy for a fixed method. Second, to choose a tuning parameter or model size by minimizing the estimated out-of-sample error. The slides keep those roles conceptually separate even though the same computation supports both.
As \(k\) increases toward leave-one-out, the training folds become larger and the overestimation bias of the out-of-sample error estimate drops, but its variance increases because the training sets overlap almost completely and the foldwise fits become highly correlated. The deck presents the familiar practical compromise: values around five or ten often balance cost, variance, and bias well.
Bootstrap, permutation methods, and resampling logic
The resampling lecture shifts from predictive assessment to uncertainty assessment: not “how well will this predict?” but “how variable is this estimator, and what would its sampling distribution look like?”
In the one-sample nonparametric bootstrap, the empirical distribution \(\widehat F_n\) stands in for the unknown data-generating distribution \(F\). Resampling with replacement from the observed data generates pseudo-samples that mimic the variability of the statistic across repeated samples of size \(n\).
If \(\widehat\theta^{*(1)},\dots,\widehat\theta^{*(B)}\) are the bootstrap replicates of a statistic, the bootstrap standard-error estimate is their empirical standard deviation, and the percentile interval uses the empirical quantiles of the bootstrap distribution: $$[\widehat G^{-1}(\alpha/2),\,\widehat G^{-1}(1-\alpha/2)].$$
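Both the standard error and the percentile interval fall out of the same loop. A sketch for a one-sample statistic (the median here is only an illustrative default):

```python
import numpy as np

def bootstrap_ci(x, stat=np.median, B=2000, alpha=0.05, seed=0):
    """Nonparametric bootstrap SE and percentile interval for stat(x)."""
    rng = np.random.default_rng(seed)
    # Resampling with replacement from the data is sampling from F_hat_n.
    reps = np.array([stat(rng.choice(x, size=len(x), replace=True))
                     for _ in range(B)])
    se = reps.std(ddof=1)                                   # bootstrap SE
    lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2])  # percentile CI
    return se, (lo, hi)
```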
The deck then broadens the picture: multi-sample bootstrap for two-population problems, parametric bootstrap when a parametric family is credible, and random permutations when the goal is a null distribution under no association rather than a sampling distribution around the observed effect.
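The permutation idea differs from the bootstrap in essentially one line: labels are shuffled without replacement to enforce the null of no association. A two-sample difference-in-means sketch:

```python
import numpy as np

def perm_test(x1, x2, B=5000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x1, x2])
    obs = x1.mean() - x2.mean()
    null = np.empty(B)
    for b in range(B):
        perm = rng.permutation(pooled)      # relabel under "no association"
        null[b] = perm[:len(x1)].mean() - perm[len(x1):].mean()
    return np.mean(np.abs(null) >= np.abs(obs))
```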
Linear models, multicollinearity, and what breaks when \(p\) grows
Before introducing penalties, the deck revisits ordinary least squares and the classical assumptions that make it attractive.
The working linear model is $$y_i=\beta_0+\beta^{\top}x_i+\varepsilon_i,$$ with least-squares fit $$\widehat\beta=(X^{\top}X)^{-1}X^{\top}Y, \qquad \widehat\sigma^2=\frac{\lVert Y-X\widehat\beta\rVert^2}{n-(p+1)}.$$
The slides review unbiasedness, Gauss-Markov, Gaussian error assumptions, residual diagnostics, and generalized linear models. But the real purpose of the section is to set up the failure mode that motivates the rest of the notebook: when predictors are strongly interdependent or \(p\) is large relative to \(n\), \(X^{\top}X\) becomes ill-conditioned, coefficient variance inflates, and overfitting becomes easy.
Pairwise correlations, partial \(R^2\), and variance inflation factors enter here as diagnostics. The slides present ridge, LASSO, and dimension reduction as three different responses to the same instability.
A dedicated aside shows that omitting a relevant variable \(Z\) correlated with \(X\) does not just enlarge error variance: it changes the expected value of the least-squares coefficients and induces a persistent bias.
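The standard one-line calculation behind that aside, written here under the hypothetical full model \(Y=X\beta+Z\gamma+\varepsilon\) with \(Z\) omitted from the fit, is $$\mathbb E\big[\widehat\beta\big]=(X^{\top}X)^{-1}X^{\top}\mathbb E[Y]=\beta+(X^{\top}X)^{-1}X^{\top}Z\,\gamma,$$ so the bias term vanishes only when \(X^{\top}Z=0\) (orthogonal predictors) or \(\gamma=0\) (an irrelevant \(Z\)).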
This is also where the notebook separates prediction from inference, and model specification from model assessment. The later regularization slides are not sold as “fixing the true model”; they are introduced as ways to trade a small amount of bias for a large reduction in variance and overfitting.
Ridge regression and LASSO
Penalization appears as constrained least squares: shrink the coefficient vector to stabilize estimation, improve out-of-sample accuracy, and sometimes force sparsity.
The two canonical penalties are $$\widehat\beta_{\mathrm{ridge}}=\arg\min_{\beta}\{\lVert Y-X\beta\rVert^2+\lambda\lVert\beta\rVert_2^2\},$$ $$\widehat\beta_{\mathrm{lasso}}=\arg\min_{\beta}\{\lVert Y-X\beta\rVert^2+\lambda\lVert\beta\rVert_1\}.$$ The slides repeatedly note that features should be scaled first, because unlike ordinary least squares, these procedures are not scale-equivariant.
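Ridge even has a closed form, which makes the scaling advice easy to honor in code. A sketch with centered response and standardized features (LASSO has no closed form and is typically solved by coordinate descent instead):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge coefficients on standardized features."""
    # Standardize first: these penalized fits are not scale-equivariant.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    p = Xs.shape[1]
    # (X'X + lambda I) is invertible for any lambda > 0, even when p > n
    # or X'X is ill-conditioned -- the stabilization the deck is after.
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
```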
Ridge is introduced historically as a response to multicollinearity. LASSO arrives later as a sparse alternative that can perform soft feature selection. Both improve accuracy by shrinking the coefficient vector, but they do so with different geometry: the Euclidean ball of ridge keeps all coefficients alive, while the diamond-shaped \(L^1\) constraint makes exact zeros much more likely.
The slides visualize out-of-sample MSE as the sum of squared bias and variance. As \(\lambda\) increases, variance drops, bias rises, and the best value must be found computationally, usually with cross-validation.
The deck briefly points beyond ridge and LASSO toward Elastic Net, fused and grouped penalties, and non-convex options such as SCAD and related oracle-property literature.
Traditional feature selection
After penalization, the deck returns to the older feature-selection viewpoint: explicitly choose which predictors stay in the model and which disappear.
Best subset and stepwise procedures are treated as “hard” selection methods. In the slide geometry this is an \(L^0\)-style constraint: count nonzero coefficients, tune the allowed model size by cross-validation or model-selection criteria, and then refit on the surviving variables.
Selection keeps a subset of the original coordinates. Dimension reduction instead builds new composite predictors as linear combinations. The notebook keeps those ideas separate because the high-dimensional slides use both.
The selection block explicitly includes \(C_p\), AIC, BIC, adjusted \(R^2\), and cross-validated test error. That is the right place to keep model selection distinct from the later use of cross-validation for penalty tuning.
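For small \(p\), the "hard" \(L^0\) search can be written out directly. A sketch scored by a Gaussian BIC (up to additive constants; the intercept is counted among the parameters), with exhaustive enumeration that is feasible only for modest max_size and \(p\):

```python
import numpy as np
from itertools import combinations

def best_subset_bic(X, y, max_size=5):
    """Exhaustive L0-style subset search, scored by BIC."""
    n, p = X.shape
    best_bic, best_S = np.inf, None
    for size in range(1, max_size + 1):
        for S in combinations(range(p), size):
            A = np.column_stack([np.ones(n), X[:, list(S)]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            bic = n * np.log(rss / n) + (size + 1) * np.log(n)
            if bic < best_bic:
                best_bic, best_S = bic, S
    return best_S, best_bic
```

Swapping BIC for AIC, \(C_p\), adjusted \(R^2\), or cross-validated test error changes only the scoring line; the search itself stays the same.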
The lecture then connects these classical ideas to modern mixed-integer optimization. MIP-BOOST appears as a concrete example: exact \(L^0\) selection made computationally viable by careful tuning, warm starts, and whitening to handle collinearity more effectively than a naive subset search.
Supervised dimension reduction
Principal component regression is not the end of the story, because the directions of greatest feature variance need not be the directions most relevant to the response. The slides therefore introduce supervised dimension reduction.
Sufficient dimension reduction is formulated through the smallest subspace \(\mathcal S_{Y\mid X}\) such that $$Y \perp X \mid A^{\top}X, \qquad \mathcal S_{Y\mid X}=\operatorname{span}(A_{p\times d}).$$ The structural dimension \(d\) is part of the inferential problem, not a tuning afterthought.
The simplest SDR method in the deck is sliced inverse regression. Instead of regressing \(Y\) on \(X\), look at the inverse regression of \(X\) on \(Y\), slice the response, compute slice means \(\bar X_h\), and build $$M=\sum_{h=1}^{H}\frac{n_h}{n}\bar X_h\bar X_h^{\top},$$ then perform an eigen-analysis of \(S^{-1}M\). Under the stated linearity conditions, the leading eigenspaces recover directions inside the central subspace.
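A compact sketch of the SIR eigen-analysis, assuming a continuous response (sliced by sorting) and \(n > p\) so that \(S\) is invertible; the slice count H is a tuning choice:

```python
import numpy as np

def sir_directions(X, y, H=10, d=2):
    """Sliced inverse regression: leading eigenvectors of S^{-1} M."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n                          # sample covariance
    slices = np.array_split(np.argsort(y), H)  # slice the sorted response
    M = np.zeros((p, p))
    for idx in slices:
        xbar = Xc[idx].mean(axis=0)            # slice mean of (centered) X
        M += (len(idx) / n) * np.outer(xbar, xbar)
    # S^{-1} M is not symmetric, so eig may return tiny imaginary parts.
    lam, V = np.linalg.eig(np.linalg.solve(S, M))
    order = np.argsort(lam.real)[::-1]
    return V[:, order[:d]].real                # estimated SDR directions
```

The number of slices H and the retained dimension d are exactly where the structural-dimension question from above re-enters.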
The same block reinterprets LDA as supervised dimension reduction for categorical responses: the between-class covariance is rescaled by within-class covariance, and the resulting discriminant directions serve both for visualization and for classification.
For choosing the structural dimension \(d\), the slides list several routes: diagnostics and scree-like plots from eigenvalues, sequential tests for tail eigenvalues, BIC-like criteria when likelihoods are available, and bootstrap-based stability assessment inspired by Ye and Weiss.
Sure independence screening and ultrahigh-dimensional feature spaces
The last lecture block changes asymptotic scale entirely: \(p\) may grow with \(n\), even very quickly, so full-model fitting and even one-shot penalization can become unstable or computationally unrealistic.
The slides first explain why the usual least-squares “re-mapping” through \(X^{\top}X\) becomes unreliable. As dimension rises, spurious correlations grow, \(X^{\top}X\) becomes singular or nearly so, and trying to account for all predictor interdependence can hurt more than it helps.
Fan and Lv’s SIS proposal therefore starts from marginal utilities only: $$\omega=X^{\top}Y\propto r_{X,Y},$$ rank the predictors by \(|\omega_j|\), keep the top \(d\ll p\) variables, and apply a heavier selection method on that reduced set. Under suitable assumptions and \(d=n/\log n\), the slides summarize the sure screening property as $$\Pr(M^{\ast}\subseteq M_d)\to 1 \qquad \text{as } n\to\infty.$$
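The screening step itself is only a ranking. A sketch with standardized features, so the marginal utilities are proportional to the correlations \(r_{X_j,Y}\):

```python
import numpy as np

def sis(X, y, d=None):
    """Sure independence screening: keep the top-d marginal utilities."""
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))                  # the deck's d = n / log n
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each feature
    yc = y - y.mean()
    omega = Xs.T @ yc                           # marginal utilities omega_j
    return np.argsort(np.abs(omega))[::-1][:d]  # indices of retained features
```

The returned index set is then handed to LASSO, SCAD, or another selector, exactly in the "screen first, select later" spirit described next.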
A key conceptual point is that screening is meant to be conservative. The target is not an immediately tiny model. The target is to avoid false negatives early, then let LASSO, SCAD, Dantzig, or another follow-up selector refine the reduced set. The iterative version ISIS appears precisely because some relevant variables have weak marginal traces and only become visible after partial residualization.
The deck highlights the shift from fixed \(p\) to \(p=p(n)\), sometimes with \(\log p = O(n^{\alpha})\). That perspective is what makes “screen first, select later” a genuinely statistical idea rather than just a computational shortcut.
The final research-slide extension on covariate information numbers is clearly about a model-free marginal utility for screening, and the paper reference is readable. The exact density-information notation in that final formula block is harder to transcribe confidently from the deck, so the page keeps the idea but does not overcommit to every symbol there.
Examples, key takeaways, and further reading
The slide deck closes by mixing textbook material with research-oriented examples. The page keeps that same balance: concrete datasets, a compact recap, and the references that actually appear in the slides.
Silhouette plots, Gaussian-mixture partitions, and stability-based checks appear as the main unsupervised examples.
The SIS material revisits the classic Golub leukemia data to compare sparsity, screening, and classification error.
The final covariate-information (CIS) extension is motivated by transcriptomic data, where \(p \gg n\) is the norm rather than the exception.
Unsupervised grouping, supervised prediction, density estimation, and dimension reduction are related but not interchangeable.
Cross-validation, bootstrap, and permutation methods are presented as replacements for analytic formulas when those formulas are too narrow.
The second half of the deck is about feature-space size and instability, not only about having many observations.
Subset selection, shrinkage, PCA, SIR, and SIS solve related problems with different geometric and statistical trade-offs.
Introduction to Statistical Learning is the explicit backbone cited across the clustering, classification, cross-validation, and regularization blocks.
The slides cite Elements of Statistical Learning, Tibshirani’s original LASSO paper, Elastic Net, grouped penalties, fused LASSO, and non-convex penalties such as SCAD.
Key references include Li (1991) for SIR, Cook’s Regression Graphics, and the later SDR review literature cited directly in the slides.
Fan and Lv (2008) anchor the SIS section, followed by extensions to GLMs, model-free screening, and the later SIS software package reference.