Digital Mathematical Notebook
Time Series 📈
Time series are ordered observations whose temporal dependence changes what “data analysis” even means. Once order matters, we have to think about stochastic processes, lagged dependence, stationarity, forecast uncertainty, and model validation in a different way than for i.i.d. data.
Why this matters
Energy demand, sales, inventory, traffic, and climate planning all rely on dependence-aware prediction.
Industrial sensors, medical signals, and infrastructure streams are judged by change through time, not isolated points.
Macroeconomic indicators, volatility, and interventions require dynamic rather than static reasoning.
Measurements from astronomy, neuroscience, and climate science are often generated by latent dynamical systems.
Counts, arrivals, clicks, and token streams carry temporal dependence that breaks i.i.d. assumptions.
A useful forecast is not only a point estimate but also a calibrated picture of what may happen next.
What a Time Series Is
A time series is not only a list of values. Formally, one usually begins with a stochastic process $\{X_t : t \in \mathcal{T}\}$ indexed by time, and then one observes one realization $x_1,\dots,x_T$ of that process. The process is the probabilistic object; the observed sequence is a sample path.
DefinitionBox
Process versus realization. In discrete time, $\mathcal{T}$ is often $\mathbb{Z}$ or $\{1,\dots,T\}$. The variables $X_t$ are random, while the observed values $x_t$ are fixed once the sample has been recorded. Many statistical statements are about the process even though only one realization is available.
ExampleBox
Univariate versus multivariate. Daily temperature is a univariate series. Electricity demand, temperature, and humidity observed together form a multivariate series $Y_t \in \mathbb{R}^m$.
RemarkBox
Discrete and continuous time. This page focuses mainly on discrete-time models. Continuous-time processes require different tools and are not simply the same formulas with smaller time steps.
KeyIdeaBox
Why i.i.d. intuition fails. In a time series, nearby observations may carry information about each other. Shuffling the data destroys precisely the dependence structure that most forecasting, filtering, and signal-extraction methods are trying to learn.
Core Structure of Time Series
A useful descriptive decomposition separates persistent structure from residual variation, but the parts should not be confused. Trend, seasonality, cycles, calendar effects, interventions, and noise describe different mechanisms.
DefinitionBox
Level and trend. Level is the local baseline. Trend is a smoother long-run change in the mean, which may be deterministic or stochastic.
DefinitionBox
Seasonality and cycles. Seasonality repeats with a known or fixed period, while cycles are recurrent but not locked to a single calendar frequency.
DefinitionBox
Shocks and interventions. Holidays, policy changes, outages, and structural breaks are not interchangeable with trend or seasonality; they often need explicit intervention terms or regime-sensitive modeling.
RemarkBox
Additive and multiplicative viewpoints. Additive thinking uses $y_t = \ell_t + s_t + r_t$ when the scale of fluctuations is roughly constant. Multiplicative thinking uses $y_t = \ell_t s_t r_t$ when variability grows with the level; logs often convert multiplicative structure into an additive one.
Stochastic Foundations
The classical time-series vocabulary is built from moments indexed by lag. The lag operator $B X_t = X_{t-1}$ helps encode models compactly, while covariance functions explain what “dependence across time” means in second-order terms.
FormulaBlock
Moments and lagged dependence. $$ \mu_t = \mathbb{E}[X_t], \qquad \gamma_t(h) = \mathrm{Cov}(X_t, X_{t+h}), \qquad \rho_t(h) = \frac{\gamma_t(h)}{\sqrt{\gamma_t(0)\gamma_{t+h}(0)}}. $$
Weak stationarity asks for $\mu_t \equiv \mu$ and $\gamma_t(h) = \gamma(h)$ independent of $t$. Strict stationarity is stronger: the joint distribution of $(X_{t_1},\dots,X_{t_k})$ is invariant under time shifts.
DefinitionBox
White noise. A white-noise sequence usually means zero mean, constant variance, and zero autocovariance for all nonzero lags. It need not be independent unless extra assumptions are added.
RemarkBox
i.i.d. noise is stronger. Independent and identically distributed noise is white noise, but the converse can fail. Gaussian white noise is a special case where uncorrelatedness and independence coincide.
More on weak stationarity, strict stationarity, and ergodic intuition
Weak stationarity is a second-order statement, so it is enough for covariance-based tools such as ACF, spectral density, linear prediction, and many classical ARMA results. Strict stationarity controls full joint distributions and is therefore stronger. Ergodic intuition explains why averages computed along one long realization can sometimes approximate ensemble averages, but stationarity alone does not automatically guarantee every ergodic property.
Dependence, ACF, and PACF
The autocorrelation function summarizes linear dependence between values $h$ periods apart. The partial autocorrelation at lag $h$ asks what remains after linearly accounting for the shorter lags in between.
FormulaBlock
ACF. $$ \rho(h) = \frac{\gamma(h)}{\gamma(0)} $$
for a weakly stationary process. Sample autocorrelations estimate these lagged correlations from one realization.
DefinitionBox
PACF. The lag-$h$ PACF is the correlation between $X_t$ and $X_{t-h}$ after linearly removing the effect of $X_{t-1},\dots,X_{t-h+1}$. It is often estimated through regression or Levinson-Durbin recursions.
RemarkBox
Use the cutoff story carefully. The textbook heuristic “AR gives PACF cutoff, MA gives ACF cutoff” is a useful first guide for idealized low-order models, not a law of nature. Finite samples, mixed models, seasonal terms, and transformations all blur the picture.
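ExampleBox
Python sketch: sample ACF and PACF. A minimal sketch, assuming numpy and statsmodels are available; the simulated AR(1) series, the coefficient 0.7, and the lag count are illustrative choices, not recommendations.
```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

# Simulate an AR(1) realization with phi = 0.7 as illustrative data.
rng = np.random.default_rng(0)
T = 500
x = np.zeros(T)
eps = rng.normal(size=T)
for t in range(1, T):
    x[t] = 0.7 * x[t - 1] + eps[t]

# The sample ACF should decay roughly geometrically, while the sample PACF
# should be large at lag 1 and close to zero afterwards.
print(acf(x, nlags=10))
print(pacf(x, nlags=10))
```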
Stationarity, Differencing, and Transformations
Stationarity matters because many models describe dependence through constant lag structure. But not every nonstationarity should be attacked the same way: variance stabilization, detrending, differencing, and seasonal adjustment are distinct operations.
DefinitionBox
Deterministic trend. A smooth mean function $m_t$ can be removed by regression or decomposition when the trend is modeled as a deterministic component.
DefinitionBox
Stochastic trend. A unit-root process such as a random walk carries nonstationarity in the accumulation of shocks; differencing addresses a different mechanism from simple detrending.
FormulaBlock
Differencing operators. $$ \nabla X_t = (1-B)X_t, \qquad \nabla_s X_t = (1-B^s)X_t. $$
RemarkBox
Do not conflate detrending, differencing, and seasonal adjustment. Detrending removes an estimated smooth mean. Differencing removes persistent low-frequency behavior by applying lag operators. Seasonal adjustment targets periodic structure. A log or Box-Cox style transform instead acts on scale, not on temporal dependence directly.
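ExampleBox
Python sketch: differencing with pandas. A minimal sketch of the operators above using `Series.diff`; the monthly index and the trend-plus-seasonal signal are made up for illustration.
```python
import numpy as np
import pandas as pd

# Hypothetical monthly series with a linear trend and period-12 seasonality.
idx = pd.date_range("2015-01-01", periods=120, freq="MS")
t = np.arange(120)
y = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 12), index=idx)

d1 = y.diff()               # (1 - B) y_t: first difference, removes the linear trend
d12 = y.diff(12)            # (1 - B^12) y_t: seasonal difference at period 12
d1_d12 = y.diff(12).diff()  # both operators applied in sequence
print(d1_d12.dropna().head())
```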
Classical Baselines and Task Distinctions
Strong time-series practice begins with baselines and with clean task definitions. Forecasting, filtering, smoothing, and imputation are related but not identical.
ExampleBox
Forecasting. Predict future values $X_{T+h}$ using information available up to time $T$.
ExampleBox
Filtering and smoothing. Estimate latent states from noisy data, either using past data only or the entire sample.
ExampleBox
Imputation. Fill missing observations while respecting temporal structure and uncertainty.
DefinitionBox
Baseline models. White noise, mean forecast, naive forecast, seasonal naive forecast, and drift forecast are not trivialities. They tell us whether a more elaborate model is actually learning structure beyond persistence and season repetition.
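ExampleBox
Python sketch: baseline forecasts. A minimal numpy sketch of the naive, seasonal naive, and drift baselines; the helper names and the toy series are mine, chosen only for illustration.
```python
import numpy as np

def naive_forecast(y, h):
    # Repeat the last observed value h steps ahead.
    return np.repeat(y[-1], h)

def seasonal_naive_forecast(y, h, m):
    # Repeat the most recent full seasonal cycle of period m.
    return np.array([y[-m + (i % m)] for i in range(h)])

def drift_forecast(y, h):
    # Extrapolate the average historical change per step.
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return y[-1] + slope * np.arange(1, h + 1)

y = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 14.0])
print(naive_forecast(y, 3))
print(drift_forecast(y, 3))
print(seasonal_naive_forecast(y, 4, m=3))
```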
AR, MA, and ARMA Models
Autoregressive and moving-average models describe dependence through lag polynomials. They are simple enough to analyze and rich enough to explain many practical diagnostics.
FormulaBlock
AR($p$). $$ X_t = c + \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + \varepsilon_t. $$
Stationarity is tied to the roots of $1-\phi_1 z-\cdots-\phi_p z^p$: they must lie outside the unit circle.
FormulaBlock
MA($q$). $$ X_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}. $$
Invertibility concerns whether the model can be uniquely represented in terms of past observations. The roots of $1+\theta_1 z+\cdots+\theta_q z^q$ must lie outside the unit circle.
RemarkBox
Stationarity and invertibility are not the same condition. AR stationarity controls whether the process is well-behaved in time. MA invertibility controls whether past shocks can be recovered from the observed process in a stable way. They solve different identifiability problems.
Why the root condition matters
The lag-polynomial view rewrites an AR process as $\phi(B)X_t = c + \varepsilon_t$. If $\phi(z)$ has roots on or inside the unit circle, the formal inverse $\phi(B)^{-1}$ does not generate an absolutely summable impulse response, so shocks do not decay in a stationary way. The MA invertibility condition is the analogous requirement for expressing the model as a stable infinite AR representation.
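ExampleBox
Python sketch: checking the AR root condition. A minimal numpy sketch (the helper name `ar_is_stationary` is mine) that computes the roots of $1-\phi_1 z-\cdots-\phi_p z^p$ and checks whether they all lie outside the unit circle.
```python
import numpy as np

def ar_is_stationary(phi):
    """True if 1 - phi_1 z - ... - phi_p z^p has all roots outside the unit circle."""
    # numpy.roots expects coefficients ordered from highest degree to the constant term.
    coeffs = np.r_[-np.asarray(phi, dtype=float)[::-1], 1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(ar_is_stationary([0.7]))       # True: root at 1/0.7, outside the unit circle
print(ar_is_stationary([1.0]))       # False: unit root (random walk)
print(ar_is_stationary([0.5, 0.6]))  # False: a root falls inside the unit circle
```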
ARIMA, SARIMA, and the Box–Jenkins Workflow
ARIMA extends ARMA by modeling the differenced series. Seasonal ARIMA adds seasonal differencing and seasonal lag polynomials. The “I” stands for “integrated”: the ARMA structure is fitted to the differenced series, and forecasts are integrated (cumulatively summed) back to the original scale.
FormulaBlock
ARIMA and SARIMA notation. $$ \Phi(B^s)\phi(B)\nabla^d \nabla_s^D X_t = c + \Theta(B^s)\theta(B)\varepsilon_t. $$
Here $(p,d,q)$ describe nonseasonal AR, differencing, and MA orders, while $(P,D,Q)_s$ describe seasonal terms with period $s$.
KeyIdeaBox
Identification. Inspect the series, choose transformations, decide whether differencing is needed, then use residual diagnostics and likelihood-based comparison rather than ACF/PACF alone.
RemarkBox
Diagnostics. A fitted model should leave residuals behaving like white noise relative to the structure you intended to capture. Good in-sample fit is not enough if the residual dependence remains.
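ExampleBox
Python sketch: a SARIMA fit with residual diagnostics. A minimal statsmodels sketch on a made-up monthly series; the $(1,1,1)(1,1,1)_{12}$ orders are purely illustrative and not a recommended default.
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_ljungbox

# Hypothetical monthly series: trend + period-12 seasonality + noise.
rng = np.random.default_rng(1)
t = np.arange(144)
idx = pd.date_range("2012-01-01", periods=144, freq="MS")
y = pd.Series(0.3 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(size=144), index=idx)

model = sm.tsa.SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
res = model.fit(disp=False)
print(res.summary())
# Residuals should behave like white noise; a large Ljung-Box p-value is consistent with that.
print(acorr_ljungbox(res.resid, lags=[12]))
```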
Decomposition, Smoothing, and ETS
Smoothing is about extracting persistent structure from noisy observations. Exponential smoothing is not merely heuristic weighting; in many cases it has an explicit state-space interpretation.
DefinitionBox
Classical decomposition. Split a series into trend, seasonal, and remainder components, often with moving averages or smoother seasonal extraction.
DefinitionBox
STL-style decomposition. Seasonal-Trend decomposition using Loess is more flexible than classical fixed decomposition and handles changing seasonal structure more gracefully.
DefinitionBox
ETS. ETS summarizes models by Error, Trend, and Seasonality components. Many exponential-smoothing recursions arise as optimal filters for corresponding state-space models.
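ExampleBox
Python sketch: Holt-Winters exponential smoothing. A minimal sketch using statsmodels `ExponentialSmoothing`; additive trend and seasonality are assumptions matched to the made-up series, not defaults to copy.
```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical monthly series with an additive trend and period-12 seasonality.
rng = np.random.default_rng(2)
t = np.arange(96)
idx = pd.date_range("2016-01-01", periods=96, freq="MS")
y = pd.Series(0.2 * t + 4 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.5, size=96), index=idx)

fit = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=12).fit()
print(fit.forecast(12))  # 12-step-ahead point forecasts
```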
State-Space Models and Kalman Filtering
The state-space viewpoint treats the observed series as a noisy measurement of an evolving latent state. This unifies local-level models, structural time-series models, exponential smoothing, and many forecasting systems with missing data or time-varying covariates.
FormulaBlock
Linear Gaussian state space. $$ y_t = Z_t \alpha_t + \varepsilon_t, \qquad \alpha_{t+1} = T_t \alpha_t + R_t \eta_t, $$
with observation noise $\varepsilon_t$ and state noise $\eta_t$. Filtering estimates $\alpha_t$ using data up to time $t$; smoothing uses future data as well.
RemarkBox
Filtering, smoothing, forecasting. Filtering is online estimation of the current state. Smoothing retrospectively improves state estimates using the full sample. Forecasting pushes the state forward beyond the observed window.
KeyIdeaBox
Kalman update. The filter balances model confidence and observation noise: trust the data more when the measurement is reliable; trust the prior more when the observation is noisy.
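ExampleBox
Python sketch: Kalman filtering for the local level model. In the scalar local level case ($y_t = \mu_t + \varepsilon_t$, $\mu_{t+1} = \mu_t + \eta_t$) the recursions collapse to a few lines; this is a minimal hand-rolled sketch with the noise variances treated as known inputs.
```python
import numpy as np

def local_level_filter(y, sigma_eps2, sigma_eta2, a0=0.0, p0=1e7):
    """Kalman filter for the local level model (a minimal scalar sketch)."""
    a, p = a0, p0                       # prior state mean and variance
    filtered = np.empty(len(y))
    for t, yt in enumerate(y):
        f = p + sigma_eps2              # innovation variance
        k = p / f                       # Kalman gain: trust the data more when p >> sigma_eps2
        a = a + k * (yt - a)            # update the state mean with the new observation
        p = p * (1.0 - k)               # update the state variance
        filtered[t] = a
        p = p + sigma_eta2              # predict: uncertainty grows between observations
    return filtered

rng = np.random.default_rng(3)
level = np.cumsum(rng.normal(scale=0.1, size=200))   # hypothetical latent level (random walk)
obs = level + rng.normal(scale=1.0, size=200)        # noisy measurements of that level
print(local_level_filter(obs, sigma_eps2=1.0, sigma_eta2=0.01)[-5:])
```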
Regression with Exogenous Variables, VAR, and Cointegration
Time-series regression with exogenous variables is not the same as fully multivariate endogenous modeling. If one series predicts another, we still need to ask whether that predictor is treated as external or jointly generated inside the system.
DefinitionBox
Dynamic regression / ARIMAX. Regress on exogenous covariates while modeling the serial dependence of the errors, often with ARIMA-type error terms.
FormulaBlock
VAR. $$ Y_t = c + A_1 Y_{t-1} + \cdots + A_p Y_{t-p} + u_t. $$
Every component is modeled as depending on past values of all components.
DefinitionBox
Cointegration and VECM. If components are nonstationary but some linear combination is stationary, a VECM captures both short-run differences and long-run equilibrium correction.
RemarkBox
Dependence is not structural causality. Granger-style predictability is about incremental forecasting content, not a full structural causal claim. Useful dependence tests should not be oversold as mechanistic proof.
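ExampleBox
Python sketch: fitting a VAR. A minimal statsmodels sketch on a hypothetical two-variable system; the simulated coefficients, the lag search range, and the variable names are illustrative only.
```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Simulate a simple VAR(1) system where x depends on its own past and on past y.
rng = np.random.default_rng(4)
n = 300
data = np.zeros((n, 2))
for t in range(1, n):
    data[t, 0] = 0.5 * data[t - 1, 0] + 0.2 * data[t - 1, 1] + rng.normal()
    data[t, 1] = 0.3 * data[t - 1, 1] + rng.normal()
df = pd.DataFrame(data, columns=["x", "y"])

res = VAR(df).fit(maxlags=4, ic="aic")    # choose the lag order by AIC
print(res.k_ar)                           # selected number of lags
print(res.forecast(df.values[-res.k_ar:], steps=5))
```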
Volatility Models and the Frequency-Domain View
Not all time-series structure lives in the conditional mean. Financial returns often exhibit weak serial dependence in the mean but strong dependence in squared residuals. Separately, periodic behavior can be easier to diagnose in frequency space than in the time domain.
FormulaBlock
ARCH / GARCH intuition. $$ \varepsilon_t = \sigma_t z_t, \qquad \sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2. $$
The conditional variance evolves through time, so shocks can cluster even if the mean process is simple.
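ExampleBox
Python sketch: simulating GARCH(1,1). A minimal simulation of the recursion above; the parameter values are chosen only so that $\alpha+\beta<1$, and fitting real volatility models is usually done with a dedicated library rather than by hand.
```python
import numpy as np

def simulate_garch11(T, omega, alpha, beta, seed=0):
    """Simulate a GARCH(1,1) return path and its conditional variance."""
    rng = np.random.default_rng(seed)
    eps = np.zeros(T)
    sigma2 = np.full(T, omega / (1.0 - alpha - beta))  # start at the unconditional variance
    for t in range(1, T):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
        eps[t] = np.sqrt(sigma2[t]) * rng.normal()
    return eps, sigma2

eps, sigma2 = simulate_garch11(1000, omega=0.1, alpha=0.1, beta=0.85)
# Returns are nearly uncorrelated, but squared returns show strong persistence (volatility clustering).
print(np.corrcoef(eps[1:], eps[:-1])[0, 1], np.corrcoef(eps[1:] ** 2, eps[:-1] ** 2)[0, 1])
```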
FormulaBlock
Periodogram intuition. $$ I(\omega_k) = \frac{1}{T}\left|\sum_{t=1}^{T} x_t e^{-i \omega_k t}\right|^2. $$
Peaks in the periodogram point to dominant frequencies, complementing ACF-based seasonality diagnostics.
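ExampleBox
Python sketch: a raw periodogram via the FFT. A minimal numpy sketch on a made-up series with a period-12 component; the peak frequency should land near 1/12 cycles per observation.
```python
import numpy as np

def periodogram(x):
    """Raw periodogram I(omega_k) = |sum_t x_t e^{-i omega_k t}|^2 / T."""
    x = np.asarray(x, dtype=float) - np.mean(x)  # remove the mean before looking for peaks
    T = len(x)
    power = np.abs(np.fft.rfft(x)) ** 2 / T
    freqs = np.fft.rfftfreq(T)                   # frequencies in cycles per observation
    return freqs, power

rng = np.random.default_rng(5)
t = np.arange(240)
x = np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.5, size=240)

freqs, power = periodogram(x)
print(freqs[np.argmax(power[1:]) + 1])  # dominant nonzero frequency, close to 1/12
```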
Forecasting Workflow, Rolling Evaluation, and Probabilistic Uncertainty
Out-of-sample evaluation in time series must preserve chronology. Random shuffling breaks the forecasting problem by leaking future information into the training set.
DefinitionBox
Point forecasts versus distributional forecasts. A point forecast gives a central value such as the conditional mean or median. A probabilistic forecast describes a full predictive distribution or at least interval forecasts at stated coverage levels.
FormulaBlock
Prediction intervals. $$ \hat{y}_{T+h|T} \pm c \,\hat{\sigma}_h $$
for an appropriate multiplier $c$ under a chosen predictive distributional assumption.
RemarkBox
Calibration and sharpness. Good probabilistic forecasts are calibrated, so coverage is honest, and sharp, so the intervals are not unnecessarily wide. One without the other is not enough.
RemarkBox
Metrics. MAE and RMSE evaluate point forecasts. Interval coverage and width, pinball loss, or CRPS-style scores evaluate probabilistic quality more directly.
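ExampleBox
Python sketch: rolling-origin splits and pinball loss. A minimal sketch of chronology-preserving evaluation plus a quantile (pinball) loss; the helper names, split sizes, and naive baseline are mine and purely illustrative.
```python
import numpy as np

def rolling_origin_splits(n, initial, horizon, step=1):
    """Yield (train_indices, test_indices) pairs that never leak future data into training."""
    for end in range(initial, n - horizon + 1, step):
        yield np.arange(end), np.arange(end, end + horizon)

def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss at level q; lower is better."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.maximum(q * diff, (q - 1.0) * diff))

y = np.arange(30, dtype=float) + np.random.default_rng(6).normal(size=30)
for train_idx, test_idx in rolling_origin_splits(len(y), initial=20, horizon=3, step=5):
    forecast = np.repeat(y[train_idx[-1]], len(test_idx))  # naive point forecast as the baseline
    print(len(train_idx), pinball_loss(y[test_idx], forecast, q=0.5))
```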
Beyond Forecasting: Anomaly Detection, Classification, Clustering, Regression
Forecasting provides much of the theory, but time-series structure also matters for other supervised and unsupervised tasks.
ExampleBox
Anomaly detection. Deviations can be point anomalies, collective anomalies, or regime shifts. Temporal context matters: the same value can be normal in one phase and anomalous in another.
ExampleBox
Classification and regression. The target may be a label or scalar attached to a whole sequence or a subsequence. Feature extraction, shape-based comparison, or learned sequence encoders may all be appropriate.
ExampleBox
Clustering. Distances such as Euclidean or DTW, feature summaries, or learned embeddings define different notions of “similar behavior,” so the modeling goal has to be explicit; a small DTW sketch follows these examples.
ExampleBox
Monitoring and intervention. Sequential surveillance often mixes forecasting, residual control charts, changepoint detection, and intervention analysis rather than solving one isolated task.
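ExampleBox
Python sketch: dynamic time warping distance. Because the clustering example above mentions DTW, here is a minimal dynamic-programming sketch for two univariate sequences; production use would normally rely on an optimized implementation and a windowing constraint.
```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two 1-D sequences (simple O(len(a) * len(b)) version)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of: insertion, deletion, or match of the previous alignment step.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Two series with the same shape but slightly shifted timing: DTW stays small.
print(dtw_distance([0, 1, 2, 3, 2, 1], [0, 0, 1, 2, 3, 2, 1]))
```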
Modern Machine Learning and Deep Learning for Time Series
Modern methods do not erase the classical theory. They extend it with different inductive biases, larger function classes, and new scaling behavior, but issues such as leakage, nonstationarity, uncertainty, and evaluation remain.
DefinitionBox
Feature-based ML. Lagged features, calendar covariates, rolling summaries, and exogenous regressors often make linear models, trees, and boosted ensembles competitive baselines; a lag-feature sketch follows this section.
DefinitionBox
Sequence models. RNNs, LSTMs, GRUs, and TCNs encode different locality and memory biases for sequential data, especially with nonlinear dynamics or large multi-series datasets.
DefinitionBox
Transformers and foundation-style models. Attention-based models can capture long-range interactions and large covariate sets, but they do not automatically dominate simpler baselines, especially on smaller or cleaner forecasting tasks.
RemarkBox
Honest comparison matters. Strong recent benchmarks repeatedly show that classical models, good baselines, and careful data preprocessing remain hard to beat. Large architectures are most compelling when the data regime, covariate richness, or cross-series scale actually justifies them.
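ExampleBox
Python sketch: leakage-safe lag features. A minimal pandas sketch for the feature-based route mentioned above; the lag set, window length, and helper name are arbitrary choices for illustration.
```python
import pandas as pd

def make_lag_features(y, lags=(1, 2, 7), windows=(7,)):
    """Build lagged values and rolling means that use only past information."""
    df = pd.DataFrame({"y": y})
    for k in lags:
        df[f"lag_{k}"] = df["y"].shift(k)
    for w in windows:
        # shift(1) before rolling so the window never includes the current target value.
        df[f"rollmean_{w}"] = df["y"].shift(1).rolling(w).mean()
    return df.dropna()

y = pd.Series(range(30), dtype=float)
print(make_lag_features(y).head())
```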
Common Misconceptions
A few recurring mistakes create a large fraction of bad time-series modeling decisions.
RemarkBox
“Time series is just regression with timestamps.” Temporal dependence changes both the model class and the validation protocol. Order is part of the problem, not a feature column to append at the end.
RemarkBox
“Differencing fixes every nonstationarity.” Differencing can be essential for stochastic trends but can also over-difference, amplify noise, and destroy level information if applied mechanically.
RemarkBox
“Higher in-sample fit means better forecasts.” Overfitting can improve in-sample likelihood while harming multi-step predictive performance and calibration out of sample.
RemarkBox
“White noise and i.i.d. are the same.” White noise is an uncorrelated second-order condition; independence is stronger.
RemarkBox
“Transformers always beat classical models.” Performance depends on data regime, horizon, covariates, evaluation design, and the strength of the baseline. Bigger is not automatically better.
RemarkBox
“All nonstationarity should be removed.” Some models represent nonstationary components directly. Removing every visible trend can erase interpretable structure that the model should actually learn.
Practical Modeling Checklist
Good time-series work is often less about finding one magical architecture and more about respecting a careful sequence of questions.
RemarkBox
1. Inspect sampling and missingness. Check regularity, gaps, duplicated timestamps, aggregation choices, and calendar alignment.
RemarkBox
2. Visualize structure first. Look for level shifts, trend, seasonality, interventions, variance changes, and obvious anomalies.
RemarkBox
3. Build strong baselines. Naive, seasonal naive, drift, and simple smoothing are often surprisingly hard to beat.
RemarkBox
4. Diagnose dependence. Use ACF/PACF, residual plots, and domain knowledge instead of relying on one formal test alone.
RemarkBox
5. Match the model to the task. Univariate forecast, exogenous regression, multivariate system, latent-state model, or anomaly monitor?
RemarkBox
6. Validate chronologically. Use rolling-origin or walk-forward designs. Never let future data leak into the training window.
RemarkBox
7. Evaluate uncertainty. Prediction intervals and forecast distributions should be checked for calibration as well as width.
RemarkBox
8. Monitor drift and regime change. Models that worked last quarter may fail after a structural break, policy change, or sensor shift.
Applications
The same backbone of dependence, uncertainty, and structure shows up across very different domains.
ExampleBox
Economics and macro indicators. Growth, inflation, unemployment, and interventions often combine trend, seasonal adjustment, and multivariate dynamics.
ExampleBox
Energy and climate. Demand responds to weather, holidays, and long-run structural change, so exogenous covariates and seasonal effects matter.
ExampleBox
Traffic and operations. Queueing, travel times, and service demand often need multiple horizons, online updates, and intervention awareness.
ExampleBox
Finance. Mean prediction is often modest, but volatility clustering and risk quantification are central.
ExampleBox
Healthcare and physiology. Vital signs and biosignals require noise-robust filtering, anomaly detection, and often irregularly sampled data handling.
ExampleBox
Event streams and language counts. Traffic, clicks, arrivals, and token counts behave like sequences with bursts, seasonality, and intervention effects.
Takeaways and Further Reading
Time series require dependence-aware thinking. Forecasting is only one task, but it supplies a large share of the core mathematics: lag structure, stationarity, uncertainty, state estimation, and chronological validation.
KeyIdeaBox
Summary. Classical models still matter because they express interpretable dependence structures, provide strong baselines, and often scale well to uncertainty-aware forecasting. State-space thinking unifies many apparently different tools. Modern deep models are most useful when they are compared honestly against those foundations rather than replacing them rhetorically.
FurtherReadingBox
Curated reading. Forecasting: Principles and Practice for clear forecasting fundamentals and practical workflow.
Penn State STAT 510 and the NIST/SEMATECH e-Handbook for applied coverage of the same foundations.
statsmodels time-series documentation and the state-space user guide for implementation-oriented details.
VAR, VECM, and cointegration tools are organized in the statsmodels vector autoregression guide.
Use the NIST and STAT 510 materials to connect periodogram intuition with formal spectral analysis.
For a broad transformer-oriented survey, see Transformers in Time Series, but read it after the classical foundations on this page.