Digital Mathematical Notebook

Statistical Learning and Large Data

This page follows the 172-slide course deck in its original order: clustering, principal components, supervised classification, nonparametric smoothing, cross-validation, resampling, large-feature linear models, penalization, supervised dimension reduction, and ultrahigh-dimensional screening.

The PDF is the main source throughout. Short connective text is added only to make the slide sequence readable as one continuous study page rather than a slide dump.

A single review marker remains in the late feature-screening block, where the final covariate-information notation is harder to read cleanly from the slides than the surrounding ideas.

Course arc

The first half is about learning tasks and computational assessment: unsupervised structure, dimension reduction, classification, smoothing, and resampling. The second half shifts to what changes when the feature space itself becomes large, unstable, or ultra-high dimensional.

Unsupervised geometry

Cluster analysis, PCA, and an MDS addendum organize the opening lectures.

Prediction and assessment

Classification, smoothing, cross-validation, bootstrap, and permutation logic form the middle.

High-dimensional regimes

Ridge, LASSO, subset selection, SDR, SIS, and ISIS address large or ultra-large feature spaces.

Source map and reconstruction notes

The notebook follows the slide blocks in sequence: Cluster Analysis; Principal Components Analysis with an MDS addendum; Supervised Classification; Non-parametric Regression and Density Estimation; Cross-Validation; Resampling; Linear Models in Large Feature Spaces; Penalizations via Ridge and LASSO; Traditional Feature Selection; Supervised Dimension Reduction; and Feature Screening for ultra-large spaces.

Textbook references that appear directly on the slides are kept in the reading section at the end. Where a slide is mainly visual, the page recreates the mathematical point in new SVG figures or explorable widgets rather than embedding the slide image.

Opening block • Unsupervised classification

Cluster analysis

The deck begins with unsupervised classification: there are no labels to predict, only a cloud of points in feature space and the question of whether a useful partition or hierarchy emerges from dissimilarity.

Partitioning versus nesting

Hierarchical clustering does not return one partition and one value of \(K\). It returns a nested sequence of merges summarized by a dendrogram, and the cut height determines the partition read off from that tree.

K-means is different

K-means fixes a target number of clusters \(K\) in advance and directly minimizes the within-cluster sum of squares over partitions: $$\sum_{k=1}^{K}\sum_{i:\,C(i)=k}\lVert x_i-\mu_k\rVert^2.$$ It is fast and practical, but the final solution depends strongly on initialization and is only guaranteed to reach a local minimum.

The slides separate three decisions that are often blurred together: choosing a point-to-point distance, choosing a linkage rule between clusters, and choosing how many groups to retain. Hierarchical clustering uses the first two to build a nested family of partitions. K-means skips the linkage step and alternates assignment and centroid updates instead.
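The assignment-and-update alternation can be sketched in a few lines. This is a minimal NumPy illustration of Lloyd-style K-means with random restarts, not code from the deck; the function name and defaults are mine.

```python
import numpy as np

def kmeans(X, K, n_init=10, max_iter=100, seed=0):
    """Lloyd's algorithm with random restarts: alternate assignment to the
    nearest centroid with centroid updates, and keep the best local minimum."""
    rng = np.random.default_rng(seed)
    best = (np.inf, None, None)
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(max_iter):
            # assignment step: nearest centroid in squared Euclidean distance
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # update step: centroid = mean of assigned points (keep old if empty)
            new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(K)])
            if np.allclose(new, centers):
                break
            centers = new
        # final assignment and objective for this restart
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        inertia = d2.min(axis=1).sum()
        if inertia < best[0]:
            best = (inertia, labels, centers)
    inertia, labels, centers = best
    return labels, centers, inertia
```

The restarts matter precisely because of the slide warning: each run only reaches a local minimum of the within-cluster sum of squares.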

The silhouette diagnostic appears as the main quality score for a given partition: $$d_{i,C}=\frac{1}{\#(C)}\sum_{\ell\in C}d(x_i,x_\ell), \qquad a_i=d_{i,C(i)}, \qquad b_i=\min_{C\neq C(i)} d_{i,C},$$ $$\operatorname{sil}_i=\frac{b_i-a_i}{\max\{a_i,b_i\}}, \qquad \operatorname{Sil}(K)=\frac{1}{N}\sum_{i=1}^{N}\operatorname{sil}_i.$$

High silhouette values mean the partition is coherent, but the slides explicitly warn that the criterion is not monotone in \(K\). Later pages add other ways to choose the number of clusters: dissimilarity curves, stability or reproducibility measures, and BIC when a mixture model is available.
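The silhouette formulas above translate directly into code. This NumPy sketch follows the slide definition literally, so \(a_i\) averages over the full own cluster (including the zero distance from the point to itself); the function name is mine.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette width Sil(K) for a given partition,
    following the slide definitions of d_{i,C}, a_i, b_i, and sil_i."""
    # full pairwise Euclidean distance matrix
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    sil = np.empty(len(X))
    for i in range(len(X)):
        # d_{i,C}: average distance from point i to each cluster C
        d_iC = {c: D[i, labels == c].mean() for c in clusters}
        a = d_iC[labels[i]]                                     # own cluster
        b = min(v for c, v in d_iC.items() if c != labels[i])   # nearest other
        sil[i] = (b - a) / max(a, b)
    return sil.mean()
```

Evaluating this over a range of \(K\) gives the \(\operatorname{Sil}(K)\) curve, keeping in mind the slide warning that the criterion is not monotone in \(K\).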

Each method is summarized below by its core idea and what the slides stress.

Hierarchical: merge clusters step by step using a linkage rule. Produces a dendrogram and a nested family of partitions, not one final answer.

K-means: alternate centroid updates with reassignment. Optimizes the within-cluster sum of squares, depends on initialization, and needs \(K\).

Variants: PAM, self-organizing maps, fuzzy \(k\)-means, Gaussian mixtures. Useful when prototypes, soft partitions, or model-based interpretations matter.

Cluster explorer

Compare a direct \(K\)-means partition with a hierarchical cut and read the silhouette widths next to it.

Method

Number of clusters

The widget stays close to the slides: partition quality is shown through silhouette widths rather than through a generic clustering score.

Second block • Unsupervised dimension reduction

Principal components analysis and an MDS addendum

PCA enters as a general-purpose reduction technique: capture the main directions of variation, build a low-dimensional representation, and create composite predictors without using a response.

Centered cloud, rotated coordinates

The slides treat location as irrelevant and center each feature first. Writing the centered sample variance-covariance matrix as \(S\) (the deck suppresses the usual \(1/(n-1)\) factor), PCA uses the eigen-decomposition $$S=\sum_{m=1}^{p}\lambda_m\phi_m\phi_m^{\top}, \qquad \lambda_1\ge \cdots \ge \lambda_p\ge 0.$$

Scores and loadings

The loadings are the coordinates of each eigenvector \(\phi_m\) in the original basis. The scores are the values of the new composite feature on each observation: $$z_{im}=\phi_m^{\top}x_i.$$

The geometric point is simple and powerful: for any dimension \(m\), the first \(m\) principal directions define the \(m\)-dimensional linear subspace that lies closest to the data cloud in a least-squares sense. That is why the slides move naturally from eigenvalues to scree plots, biplots, and cumulative explained variance.

Percentage of variance explained appears explicitly as $$\operatorname{PVE}_m=\frac{\lambda_m}{\sum_{k=1}^{p}\lambda_k}, \qquad \operatorname{CPVE}_m=\sum_{j=1}^{m}\operatorname{PVE}_j.$$
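The whole pipeline above, eigen-decomposition, scores, and PVE, fits in one short NumPy sketch. As on the slides, the \(1/(n-1)\) factor is suppressed in \(S\); it cancels in the variance ratios. The function name is mine.

```python
import numpy as np

def pca(X):
    """PCA via eigen-decomposition of the centered cross-product matrix S.
    Returns eigenvalues, loadings Phi, scores Z, and the PVE vector."""
    Xc = X - X.mean(axis=0)           # center each feature first
    S = Xc.T @ Xc                     # 1/(n-1) suppressed, as in the deck
    lam, Phi = np.linalg.eigh(S)      # eigh returns ascending order
    order = np.argsort(lam)[::-1]
    lam, Phi = lam[order], Phi[:, order]   # lambda_1 >= ... >= lambda_p
    Z = Xc @ Phi                      # scores z_{im} = phi_m' (x_i - xbar)
    pve = lam / lam.sum()             # PVE_m; np.cumsum(pve) gives CPVE
    return lam, Phi, Z, pve
```

Because \(Z^{\top}Z=\Phi^{\top}S\Phi\) is diagonal, the score columns are uncorrelated, which is exactly the rotated-coordinates picture from the slides.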

A short addendum then introduces multidimensional scaling. In the Euclidean case, the deck notes that MDS reduces to another eigen-decomposition problem, so it sits naturally next to PCA as a geometry-first view of dimension reduction before the supervised material begins.

PCA explorer

Toggle raw units versus standardized features and see how the principal directions and the retained reconstruction change.

Feature scaling

Retained components

This mirrors the deck’s warning that scaling is not an afterthought: unlike least-squares regression, whose fitted values are unchanged when features are rescaled, PCA reacts non-trivially when features are measured on very different scales.

Third block • Labeled prediction

Supervised classification

The page now moves from unlabeled structure to labeled prediction. Here the slides are careful about model choice, thresholding, and the gap between training error and test error.

Classification setup

Observe pairs \((x_i,y_i)\) with a categorical response \(Y\in\{1,\dots,K\}\). The goal is prediction: build a rule \(x\mapsto \widehat y(x)\) with low misclassification on new data, not merely low error on the training sample used to fit the classifier.

Four model families anchor the lecture block: logistic regression, LDA, QDA, and \(K\)-nearest neighbors. Logistic regression models posterior class probabilities directly in the binary case. LDA and QDA instead model the class-conditional feature distributions \(f_k(x)\), then approximate the Bayes classifier through discriminant functions. KNN drops the parametric model and classifies by local majority vote.

Binary logistic regression: $$\log\frac{p(x)}{1-p(x)}=\beta_0+\beta^{\top}x.$$ LDA assumes $$X\mid Y=k\sim N_p(\mu_k,\Sigma),$$ leading to the familiar linear discriminant score $$\delta_k(x)=x^{\top}\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^{\top}\Sigma^{-1}\mu_k+\log\pi_k.$$ QDA keeps the Gaussian assumption but allows a separate covariance \(\Sigma_k\) for each class.
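The LDA discriminant score is easy to evaluate once class means, the shared covariance, and the priors are in hand. This is a minimal NumPy sketch of the slide formula, assuming those quantities are already estimated; the function names are mine.

```python
import numpy as np

def lda_scores(x, means, Sigma, priors):
    """Linear discriminant scores delta_k(x) from the slide formula.
    means: (K, p) class means, Sigma: shared covariance, priors: pi_k."""
    Sinv = np.linalg.inv(Sigma)
    return np.array([
        x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(pk)
        for m, pk in zip(means, priors)
    ])

def lda_predict(x, means, Sigma, priors):
    """Classify x into the class with the largest discriminant score."""
    return int(np.argmax(lda_scores(x, means, Sigma, priors)))
```

With equal priors and a shared covariance, the scores tie exactly at the midpoint between two class means, which is the linear boundary the slides draw.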

A restatement of the slide comparison between Bayes, LDA, and QDA boundaries: when the Bayes boundary is linear, LDA is often closer; when it is curved, QDA can adapt better, since unequal class covariances make a quadratic boundary more appropriate.
Assessment is not just accuracy

The classification block spends real time on confusion matrices, ROC curves, sensitivity, specificity, and predictive values. That matters especially when classes are unbalanced: low overall error can hide poor detection of the minority class, so threshold choice and class-specific rates become essential.

Each procedure is summarized below by its bias and flexibility profile and what the deck emphasizes.

Logistic regression: linear decision surface in feature space. Probability modeling and threshold tuning.

LDA: lower variance under a shared-covariance Gaussian model. Can outperform QDA when the true boundary is close to linear or data are limited.

QDA: more flexible, but with higher variance. Useful when covariance structure differs substantially across classes.

KNN: nonparametric and local. Works through neighborhoods rather than a fitted parametric discriminant.

Fourth block • Flexible regression without a fixed equation

Nonparametric regression, smoothing, and kernel density estimation

Once the response is continuous, the slides pause the parametric model-building game and ask what happens if the data themselves suggest the systematic shape.

LOWESS

LOWESS fits local weighted least-squares models around each target point. The main tuning choice is the fraction \(q\in(0,1)\) of the sample used in each neighborhood, together with the local polynomial degree and the weighting function.
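The local fit at one target point can be sketched directly: pick the fraction \(q\) of nearest neighbors, weight them with the usual tricube kernel, and solve a small weighted least-squares problem. This is a NumPy sketch of the generic LOWESS recipe, not the deck's implementation; the function name and defaults are mine.

```python
import numpy as np

def lowess_fit(x, y, x0, q=0.5, degree=1):
    """Local weighted least-squares fit at one target point x0.
    The fraction q of the sample nearest x0 forms the neighborhood,
    and tricube weights down-weight its more distant members."""
    n = len(x)
    k = max(degree + 1, int(np.ceil(q * n)))
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[:k]              # q-nearest neighbors of x0
    h = dist[idx].max()                     # neighborhood half-width
    w = (1 - (dist[idx] / h) ** 3) ** 3     # tricube kernel weights
    # weighted least squares on a local polynomial basis centered at x0
    B = np.vander(x[idx] - x0, degree + 1, increasing=True)
    sw = np.sqrt(w)
    beta = np.linalg.lstsq(B * sw[:, None], y[idx] * sw, rcond=None)[0]
    return beta[0]                          # intercept = fitted value at x0
```

Sweeping `x0` over a grid traces out the smooth curve; the span `q`, the `degree`, and the weight function are exactly the three tuning choices named above.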

What smoothing does not give you

The slides explicitly list the drawback: a smoother does not hand back one explicit global equation for the regression function. It is excellent for visualization, diagnosis, and local signal extraction, but less convenient when a compact parametric law is the end goal.

LOWESS appears twice in the deck: first as a direct regression tool, then as a diagnostic companion to parametric regression, where plotting residuals against predictors or fitted values can reveal mean patterns that a postulated model missed. The same lecture then turns to kernel density estimation, where the tuning question becomes bandwidth selection rather than local span.

Kernel density estimation is written in the usual form $$\widehat f_h(x)=\frac{1}{nh}\sum_{i=1}^{n}K\!\left(\frac{x-x_i}{h}\right),$$ and the slides use the bandwidth \(h\) to illustrate the bias-variance dilemma: large \(h\) reduces variance but oversmooths the structure, while small \(h\) preserves detail at the cost of instability.
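The estimator is one vectorized line once a kernel is chosen. This NumPy sketch uses the Gaussian kernel; the function name is mine.

```python
import numpy as np

def kde_gaussian(x_grid, data, h):
    """Gaussian kernel density estimate f_hat_h evaluated on a grid.
    Large h smooths away structure; small h tracks the sample closely."""
    u = (x_grid[:, None] - data[None, :]) / h        # (x - x_i) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    return K.sum(axis=1) / (len(data) * h)           # (1/nh) sum K(.)
```

Plotting the same data with several values of `h` reproduces the bias-variance picture from the slides: one parameter, two failure modes.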

Smoothing explorer

Switch between the LOWESS regression view and the kernel-density view to see how one tuning parameter controls smoothness in both settings.

Mode

Tuning parameter

LOWESS and KDE are different procedures, but the same notebook section uses them to teach the same modeling lesson: flexibility must be tuned rather than assumed.

Fifth block • Model assessment for prediction

Cross-validation

The cross-validation lecture is framed as a computational replacement for analytic formulas when the question is out-of-sample predictive accuracy.

The slides make a clean distinction between in-sample error and out-of-sample error. For flexible procedures, the training sample can be fit extremely closely while true predictive performance remains much worse. Cross-validation estimates the out-of-sample criterion directly by repeatedly withholding part of the data during fitting.

For regression, the in-sample criterion is $$\operatorname{MSE}_{\mathrm{in}}=\frac{1}{n}\sum_{i=1}^{n}(y_i-\widehat y_i)^2,$$ while \(k\)-fold cross-validation estimates out-of-sample error by averaging foldwise test errors: $$\operatorname{CV}_k=\frac{1}{k}\sum_{j=1}^{k}\operatorname{MSE}_j.$$
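The foldwise averaging is simple enough to write generically. This NumPy sketch takes the fitting and prediction routines as callables so any regression procedure can be plugged in; the function name and interface are mine.

```python
import numpy as np

def kfold_cv_mse(X, y, fit, predict, k=5, seed=0):
    """k-fold cross-validation: shuffle, split into k folds, and average
    the foldwise test MSEs to estimate out-of-sample error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    mses = []
    for j in range(k):
        test = folds[j]
        train = np.concatenate([folds[m] for m in range(k) if m != j])
        model = fit(X[train], y[train])            # fit without fold j
        resid = y[test] - predict(model, X[test])  # predict on withheld fold
        mses.append(np.mean(resid ** 2))
    return np.mean(mses)                           # CV_k
```

For ordinary least squares, `fit` can be `lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]` and `predict` simply `lambda b, X: X @ b`.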

Model assessment versus model selection

The same cross-validation curve can be used in two different ways. First, to estimate predictive accuracy for a fixed method. Second, to choose a tuning parameter or model size by minimizing the estimated out-of-sample error. The slides keep those roles conceptually separate even though the same computation supports both.

As \(k\) increases toward leave-one-out, the training folds become larger and the overestimation bias of out-of-sample error drops, but the variance of the estimate increases because the folds overlap much more. The deck presents the familiar practical compromise: values around five or ten often balance cost, variance, and bias well.

Sixth block • Sampling variability by computation

Bootstrap, permutation methods, and resampling logic

The resampling lecture shifts from predictive assessment to uncertainty assessment: not “how well will this predict?” but “how variable is this estimator, and what would its sampling distribution look like?”

In the one-sample nonparametric bootstrap, the empirical distribution \(\widehat F_n\) stands in for the unknown data-generating distribution \(F\). Resampling with replacement from the observed data generates pseudo-samples that mimic the variability of the statistic across repeated samples of size \(n\).

If \(\widehat\theta^{*(1)},\dots,\widehat\theta^{*(B)}\) are the bootstrap replicates of a statistic, the bootstrap standard-error estimate is their empirical standard deviation, and the percentile interval uses the empirical quantiles of the bootstrap distribution: $$[\widehat G^{-1}(\alpha/2),\,\widehat G^{-1}(1-\alpha/2)].$$
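The replicate loop and the percentile read-off fit in a few lines. This NumPy sketch implements the one-sample nonparametric bootstrap for any statistic passed as a callable; the function name and defaults are mine.

```python
import numpy as np

def bootstrap_percentile(data, stat, B=2000, alpha=0.05, seed=0):
    """Nonparametric bootstrap: resample with replacement from the data,
    recompute the statistic B times, and return the empirical standard
    error of the replicates plus the percentile interval."""
    rng = np.random.default_rng(seed)
    reps = np.array([stat(rng.choice(data, size=len(data), replace=True))
                     for _ in range(B)])
    lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
    return reps.std(ddof=1), (lo, hi)
```

Calling it with `stat=np.mean` recovers the textbook example: the bootstrap standard error approximates \(s/\sqrt{n}\), and the interval brackets the observed mean.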

The deck then broadens the picture: multi-sample bootstrap for two-population problems, parametric bootstrap when a parametric family is credible, and random permutations when the goal is a null distribution under no association rather than a sampling distribution around the observed effect.

Resampling explorer

Move between cross-validation, bootstrap, and permutation views to keep their targets clearly distinct.

Mode

The same “split and repeat” visual language hides different inferential goals. This widget keeps the contrast explicit instead of lumping all resampling together.

Transition block • Review before penalization

Linear models, multicollinearity, and what breaks when \(p\) grows

Before introducing penalties, the deck revisits ordinary least squares and the classical assumptions that make it attractive.

The working linear model is $$y_i=\beta_0+\beta^{\top}x_i+\varepsilon_i,$$ with least-squares fit $$\widehat\beta=(X^{\top}X)^{-1}X^{\top}Y, \qquad \widehat\sigma^2=\frac{\lVert Y-X\widehat\beta\rVert^2}{n-(p+1)}.$$

The slides review unbiasedness, Gauss-Markov, Gaussian error assumptions, residual diagnostics, and generalized linear models. But the real purpose of the section is to set up the failure mode that motivates the rest of the notebook: when predictors are strongly interdependent or \(p\) is large relative to \(n\), \(X^{\top}X\) becomes ill-conditioned, coefficient variance inflates, and overfitting becomes easy.
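The instability is easy to see numerically through the condition number of \(X^{\top}X\). This is a small illustrative NumPy experiment, not slide code; the variable names and the \(10^{-3}\) noise level are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
z = rng.normal(size=n)
# nearly collinear pair: the second predictor is the first plus tiny noise
X_coll = np.column_stack([z, z + 1e-3 * rng.normal(size=n)])
# a well-behaved pair of independent predictors, for comparison
X_indep = np.column_stack([z, rng.normal(size=n)])

# the condition number of X'X measures how unstable the normal equations are
cond_coll = np.linalg.cond(X_coll.T @ X_coll)
cond_indep = np.linalg.cond(X_indep.T @ X_indep)
```

The collinear design blows the condition number up by several orders of magnitude, which is exactly the coefficient-variance inflation the text describes.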

Multicollinearity

Pairwise correlations, partial \(R^2\), and variance inflation factors enter here as diagnostics. The slides present ridge, LASSO, and dimension reduction as three different responses to the same instability.

Omitted variable bias

A dedicated aside shows that omitting a relevant variable \(Z\) correlated with \(X\) does not just enlarge error variance: it changes the expected value of the least-squares coefficients and induces a persistent bias.

This is also where the notebook separates prediction from inference, and model specification from model assessment. The later regularization slides are not sold as “fixing the true model”; they are introduced as ways to trade a small amount of bias for a large reduction in variance and overfitting.

Main high-dimensional block • Penalized least squares

Ridge regression and LASSO

Penalization appears as constrained least squares: shrink the coefficient vector to stabilize estimation, improve out-of-sample accuracy, and sometimes force sparsity.

The two canonical penalties are $$\widehat\beta_{\mathrm{ridge}}=\arg\min_{\beta}\{\lVert Y-X\beta\rVert^2+\lambda\lVert\beta\rVert_2^2\},$$ $$\widehat\beta_{\mathrm{lasso}}=\arg\min_{\beta}\{\lVert Y-X\beta\rVert^2+\lambda\lVert\beta\rVert_1\}.$$ The slides repeatedly note that features should be scaled first, because unlike ordinary least squares, these procedures are not scale-equivariant.

Ridge is introduced historically as a response to multicollinearity. LASSO arrives later as a sparse alternative that can perform soft feature selection. Both improve accuracy by shrinking the coefficient vector, but they do so with different geometry: the Euclidean ball of ridge keeps all coefficients alive, while the diamond-shaped \(L^1\) constraint makes exact zeros much more likely.
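Ridge has a closed form, and the zero-producing mechanism of the \(L^1\) penalty reduces to a scalar soft-thresholding operation (for an orthonormal design, the LASSO solution is exactly soft-thresholded least squares). This NumPy sketch shows both, assuming standardized features; the function names are mine.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate (X'X + lam I)^{-1} X'y
    on standardized features."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def soft_threshold(z, t):
    """The scalar operation behind LASSO coordinate updates:
    shrink toward zero and set exact zeros below the threshold t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
```

At \(\lambda=0\) ridge reproduces least squares; as \(\lambda\) grows the coefficient norm shrinks, while `soft_threshold` shows why the diamond-shaped constraint produces exact zeros and the Euclidean ball does not.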

Bias-variance trade-off

The slides visualize out-of-sample MSE as the sum of squared bias and variance. As \(\lambda\) increases, variance drops, bias rises, and the best value must be found computationally, usually with cross-validation.

Beyond convex penalties

The deck briefly points beyond ridge and LASSO toward Elastic Net, fused and grouped penalties, and non-convex options such as SCAD and related oracle-property literature.

Regularization explorer

Watch the least-squares optimum move inside either a circular ridge constraint or a diamond-shaped LASSO constraint while the coefficients shrink.

Penalty

Penalty strength

The left panel follows the exact visual language of the slides: residual contours plus a changing constraint set, not a generic coefficient dashboard.

Follow-up block • Hard selection and model criteria

Traditional feature selection

After penalization, the deck returns to the older feature-selection viewpoint: explicitly choose which predictors stay in the model and which disappear.

Best subset and stepwise procedures are treated as “hard” selection methods. In the slide geometry this is an \(L^0\)-style constraint: count nonzero coefficients, tune the allowed model size by cross-validation or model-selection criteria, and then refit on the surviving variables.

Feature selection versus dimension reduction

Selection keeps a subset of the original coordinates. Dimension reduction instead builds new composite predictors as linear combinations. The notebook keeps those ideas separate because the high-dimensional slides use both.

Criteria in the slides

The selection block explicitly includes \(C_p\), AIC, BIC, adjusted \(R^2\), and cross-validated test error. That is the right place to keep model selection distinct from the later use of cross-validation for penalty tuning.

The lecture then connects these classical ideas to modern mixed-integer optimization. MIP-BOOST appears as a concrete example: exact \(L^0\) selection made computationally viable by careful tuning, warm starts, and whitening to handle collinearity more effectively than a naive subset search.

Late block • Composite predictors with supervision

Supervised dimension reduction

Principal component regression is not the end of the story because the response variable may care about directions that do not explain the most feature variance. The slides therefore introduce supervised dimension reduction.

The slide logic in one diagram: high-dimensional predictors \(X\in\mathbb{R}^p\) are projected through \(A^{\top}X\) onto a central subspace of dimension \(d\), which then feeds the model for \(Y\). The goal is to estimate a low-dimensional subspace that retains the information in \(X\) relevant for the dependence of \(Y\) on the predictors.
Central subspace

Sufficient dimension reduction is formulated through the smallest subspace \(\mathcal S_{Y\mid X}\) such that $$Y \perp X \mid A^{\top}X, \qquad \mathcal S_{Y\mid X}=\operatorname{span}(A_{p\times d}).$$ The structural dimension \(d\) is part of the inferential problem, not a tuning afterthought.

The simplest SDR method in the deck is sliced inverse regression. Instead of regressing \(Y\) on \(X\), look at the inverse regression of \(X\) on \(Y\), slice the response, compute slice means \(\bar X_h\), and build $$M=\sum_{h=1}^{H}\frac{n_h}{n}\bar X_h\bar X_h^{\top},$$ then perform an eigen-analysis of \(S^{-1}M\). Under the stated linearity conditions, the leading eigenspaces recover directions inside the central subspace.
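The SIR recipe above translates almost line by line into code. This NumPy sketch slices the sorted response into \(H\) groups of nearly equal size; the function name, the default \(H\), and the \(1/n\) covariance convention are mine.

```python
import numpy as np

def sir_directions(X, y, H=10, d=1):
    """Sliced inverse regression sketch: slice the sorted response into H
    groups, form slice means of centered X, build the between-slice matrix
    M, and take the leading eigenvectors of S^{-1} M."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n                          # sample covariance (1/n version)
    slices = np.array_split(np.argsort(y), H)  # H slices of the response
    M = np.zeros((p, p))
    for h in slices:
        xbar = Xc[h].mean(axis=0)              # slice mean, centered
        M += (len(h) / n) * np.outer(xbar, xbar)
    vals, vecs = np.linalg.eig(np.linalg.solve(S, M))
    rank = np.argsort(-vals.real)
    return vecs[:, rank[:d]].real              # estimated central-subspace basis
```

On data where \(Y\) depends on a single linear combination of the predictors, the leading eigenvector aligns with that direction, which is the sense in which the eigenspaces recover the central subspace.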

The same block reinterprets LDA as supervised dimension reduction for categorical responses: the between-class covariance is rescaled by within-class covariance, and the resulting discriminant directions serve both for visualization and for classification.

Choosing \(d\)

The slides list several routes: diagnostics and scree-like plots from eigenvalues, sequential tests for tail eigenvalues, BIC-like criteria when likelihoods are available, and bootstrap-based stability assessment inspired by Ye and Weiss.

Final block • \(p \gg n\) and new asymptotics

Sure independence screening and ultrahigh-dimensional feature spaces

The last lecture block changes asymptotic scale entirely: \(p\) may grow with \(n\), even very quickly, so full-model fitting and even one-shot penalization can become unstable or computationally unrealistic.

The slides first explain why the usual least-squares route through \(X^{\top}X\) becomes unreliable. As dimension rises, spurious correlations grow, \(X^{\top}X\) becomes singular or nearly so, and trying to account for all predictor interdependence can hurt more than it helps.

Fan and Lv’s SIS proposal therefore starts from marginal utilities only: with standardized features, $$\omega=X^{\top}Y,$$ whose components \(\omega_j\) are proportional to the marginal correlations between \(X_j\) and \(Y\). Rank the predictors by \(|\omega_j|\), keep the top \(d\ll p\) variables, and apply a heavier selection method on that reduced set. Under suitable assumptions and \(d=\lfloor n/\log n\rfloor\), the slides summarize the sure screening property as $$\Pr(M^{\ast}\subseteq M_d)\to 1 \qquad \text{as } n\to\infty.$$

A key conceptual point is that screening is meant to be conservative. The target is not an immediately tiny model. The target is to avoid false negatives early, then let LASSO, SCAD, Dantzig, or another follow-up selector refine the reduced set. The iterative version ISIS appears precisely because some relevant variables have weak marginal traces and only become visible after partial residualization.
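The one-pass screening step itself is trivially cheap, which is the whole point in ultrahigh dimension. This NumPy sketch ranks features by absolute marginal correlation and keeps the top \(d\), defaulting to \(\lfloor n/\log n\rfloor\); the function name is mine.

```python
import numpy as np

def sis(X, y, d=None):
    """Sure independence screening sketch: rank features by absolute
    marginal correlation with y and keep the top d indices."""
    n, p = X.shape
    if d is None:
        d = int(np.floor(n / np.log(n)))       # Fan-Lv default model size
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize features
    ys = (y - y.mean()) / y.std()
    omega = Xs.T @ ys / n                      # marginal correlations
    return np.argsort(-np.abs(omega))[:d]      # indices of the top d features
```

The returned index set is deliberately generous: it is the reduced space on which LASSO, SCAD, or another selector then does the real work, and residualizing against the kept variables before re-screening gives the ISIS iteration.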

Screening explorer

Compare a one-pass marginal ranking with a single residual iteration to see why the deck treats ISIS as a rescue mechanism for weaker signals.

Procedure

Retained variables

The toy example is deliberately small, but the logic is the same one emphasized in the slides: marginal learning first, structured selection second.

New asymptotics

The deck highlights the shift from fixed \(p\) to \(p=p(n)\), sometimes with \(\log p = O(n^{\alpha})\). That perspective is what makes “screen first, select later” a genuinely statistical idea rather than just a computational shortcut.

Review marker

The final research-slide extension on covariate information numbers is clearly about a model-free marginal utility for screening, and the paper reference is readable. The exact density-information notation in that final formula block is harder to transcribe confidently from the deck, so the page keeps the idea but does not overcommit to every symbol there.

Closing section • What the deck leaves you with

Examples, key takeaways, and further reading

The slide deck closes by mixing textbook material with research-oriented examples. The page keeps that same balance: concrete datasets, a compact recap, and the references that actually appear in the slides.

Clustering diagnostics

Silhouette plots, Gaussian-mixture partitions, and stability-based checks appear as the main unsupervised examples.

Leukemia gene expression

The SIS material revisits the classic Golub leukemia data to compare sparsity, screening, and classification error.

Transcriptomic screening

The final CIS extension is motivated by transcriptomic data, where \(p \gg n\) is the norm rather than the exception.

Learning tasks stay distinct

Unsupervised grouping, supervised prediction, density estimation, and dimension reduction are related but not interchangeable.

Assessment is computational

Cross-validation, bootstrap, and permutation methods are presented as replacements for analytic formulas when those formulas are too narrow.

Large \(p\) is not large \(n\)

The second half of the deck is about feature-space size and instability, not only about having many observations.

Selection and reduction are separate

Subset selection, shrinkage, PCA, SIR, and SIS solve related problems with different geometric and statistical trade-offs.

Broader penalization context

The slides cite Elements of Statistical Learning, Tibshirani’s original LASSO paper, Elastic Net, grouped penalties, fused LASSO, and non-convex penalties such as SCAD.

Supervised dimension reduction

Key references include Li (1991) for SIR, Cook’s Regression Graphics, and the later SDR review literature cited directly in the slides.

Ultra-high-dimensional screening

Fan and Lv (2008) anchor the SIS section, followed by extensions to GLMs, model-free screening, and the later SIS software package reference.