Machine Learning I CMPUT267

Chapter 1 - Introduction

Welcome to ML!

Chapter 2 - Introduction to Probabilistic Modeling

Probability is the extension of logic from the discrete to the continuous

Probability Theory and Random Variables

An experiment has an outcome (in the outcome space). An event is a meaningful set of outcomes. Probabilistic questions about an experiment are expressed through random variables \(X\), each of which has its own outcome space \(\mathcal{X}\).

Defining Distributions

Discrete Distributions and PMFs

A probability mass function (PMF) is a function \(p : \mathcal{X}\to[0,1]\) where \(\displaystyle\sum\limits_{x\in\mathcal{X}}p(x)=1\); it describes the probability of a particular outcome \(x\in\mathcal{X}\) occurring

The Bernoulli distribution describes an experiment with two outcomes (success/failure); it has a single parameter \(\alpha\) that defines the probability of the experiment succeeding.

The (discrete) uniform distribution describes the distribution where every outcome is equally likely to occur.

The Poisson distribution describes the probability of a certain number of incidents occurring over a fixed timespan. Its parameter \(\lambda\) is the expected number of incidents in that timespan.

Continuous Distributions and PDFs

The continuous analog to the PMF is the probability density function (PDF), defined as a function \(p : \mathcal{X}\to [0, \infty)\) where \(\displaystyle\int_{\mathcal{X}}p(x)\,dx=1\)

The continuous uniform distribution is the continuous analog to the discrete uniform distribution; it has PDF \(p(x)=\dfrac{1}{b-a}\) where \(a\) and \(b\) define the bounds of the uniform range.

The exponential distribution is the continuous analog to the Poisson distribution (it models the waiting time between incidents) and is defined over \(\mathcal{X}=(0, \infty)\) as \(p(x)=\lambda e^{-\lambda x}\) with parameter \(\lambda\).

The Gaussian distribution or normal distribution, \(p(x)=\dfrac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\) with parameters \(\mu\in\mathbb{R}\) and \(\sigma^2>0\), describes many natural phenomena, as well as the emergent distribution of sample averages of not-necessarily-Gaussian populations (central limit theorem, roughly).

The Laplace distribution is similar to the Gaussian, but more peaked around the mean. It is defined as \(p(x)=\dfrac{1}{2b}e^{- \frac{1}{b}|x-\mu|}\) with parameters \(\mu \in \mathbb{R}\) and \(b > 0\).

The Gamma distribution generalizes the exponential distribution and is defined over \(\mathcal{X}=(0, \infty)\) as \(p(x)=\dfrac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}\) with parameters \(\alpha, \beta > 0\) (where \(\Gamma(\cdot)\) is the gamma function)
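
As a quick check of this relationship: setting \(\alpha=1\) (and using \(\Gamma(1)=1\)) recovers the exponential PDF with rate \(\beta\):

\[p(x)=\frac{\beta^{1}}{\Gamma(1)}x^{0}e^{-\beta x}=\beta e^{-\beta x}\]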

Multivariate Random Variables

Joint Distributions

We define the joint probability mass function \(p:\mathcal{X}\times\mathcal{Y}\to[0, 1]\) and corresponding joint probability distribution \(P\) such that \(p(x, y):=P(X=x, Y=y)\) for random variables \(X\) and \(Y\) with outcome spaces \(\mathcal{X}\) and \(\mathcal{Y}\), respectively.

\(d\)-dimensional Random Variables

We generalize joint distributions to \(d\) dimensions: we consider a \(d\)-dimensional random variable \(\vec{X}=(X_1, X_2, \dots, X_d)\) with vector-valued outcomes \(\vec{x}=(x_1, x_2, \dots, x_d)\), where each \(x_i\) is chosen from some corresponding outcome space \(\mathcal{X}_i\).

Marginal Distributions

A marginal distribution of a subset \(A\subseteq\set{X_1, X_2, \dots, X_d}\) of the components of \(\vec{X}\) is found by summing/integrating over the remaining variables \(\set{X_1, X_2, \dots, X_d}\setminus A\).

Conditional Probability

We define the conditional probability \(p(y|x)\) of \(Y\) given \(X\) as \(p(y|x):=\dfrac{p(x, y)}{p(x)}\) where \(p(x)>0\)

Product Rule: \(p(x, y)=p(x|y)p(y)=p(y|x)p(x)\)

Bayes' Rule: \(p(x|y)=\dfrac{p(y|x)p(x)}{p(y)}\) (follows directly from the product rule)

Variables \(X\) and \(Y\) are independent if \(p(x, y)=p(x)p(y)\), i.e. if the conditional and marginal distributions are the same; more generally, \(X_1, \dots, X_d\) are independent if \(p(x_1, x_2, \dots, x_d)=p(x_1)p(x_2)\dots p(x_d)\). Intuitively, the variables don't "affect" each other.

Expectations and Moments

Expected Value

The expected value \(\mathbb{E}[X]\) of random variable \(X\) is the average of \(n\to \infty\) samples of \(X\). Numerically, for discrete \(X\), \(\mathbb{E}[X]:=\displaystyle\sum\limits_{x\in\mathcal{X}}x p(x)\) and for continuous \(X\), \(\mathbb{E}[X]:=\displaystyle\int\limits_{\mathcal{X}} x p(x) \, dx\)
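
For example, for a Bernoulli variable \(X\) with parameter \(\alpha\): \(\mathbb{E}[X]=\displaystyle\sum\limits_{x\in\set{0,1}}x\,p(x)=0\cdot(1-\alpha)+1\cdot\alpha=\alpha\).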

We can extend expectation to the other types of distributions we've seen (writing only the body of the expectation expression for brevity)

Law of total expectation: \(\mathbb{E}[Y]=\mathbb{E}[\mathbb{E}[Y|X]]\) (the outer expectation is over \(X\), the inner over \(Y\)).

Variance

We define the variance \(\text{Var}[X]\) of random variable \(X\) as \(\text{Var}[X] := \mathbb{E}[(X-\mathbb{E}[X])^2]\). This measures how "concentrated" the distribution of \(X\) is around its mean

Aside: covariance is symmetric, bilinear, and satisfies \(\text{Cov}[X,X]=\text{Var}[X]\geq 0\), so it behaves like an inner product on (centred) random variables; correspondingly, \(\sqrt{\text{Var}[X-Y]}\) behaves like a distance between random variables.

Covariance

Just as variance measures the "spread" of a single variable, the covariance \(\text{Cov}[X, Y]\) of \(X\) and \(Y\) measures how the two variables vary together (their joint, linear co-variability). We have \(\text{Cov}[X, Y]:=\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]=\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y]\)

We define the correlation \(\text{Corr}[X, Y]\) of \(X\) and \(Y\) as a normalized version of covariance that measures how (linearly) related the two variables are. We have \(\text{Corr}[X, Y]:=\dfrac{\text{Cov}[X, Y]}{\sqrt{\text{Var}[X]}\sqrt{\text{Var}[Y]}}\in [-1, 1]\)

Properties of Variance and Covariance

Properties
For random variables \(X, Y\) and \(c\in \mathbb{R}\), the following hold

  1. \(\mathbb{E}[cX]=c\mathbb{E}[X]\)
  2. \(\mathbb{E}[X+Y]=\mathbb{E}[X]+\mathbb{E}[Y]\) (linearity of expectation)
  3. \(\text{Cov}[X, X]=\text{Var}[X]\geq 0\)
  4. \(\text{Var}[c]=0\)
  5. \(\text{Var}[cX]=c^2\text{Var}[X]\)
  6. \(\text{Var}[X+Y]=\text{Var}[X]+\text{Var}[Y]+2\text{Cov}[X, Y]\)

Furthermore, if \(X\) and \(Y\) are independent:

  1. \(\mathbb{E}[XY]=\mathbb{E}[X]\mathbb{E}[Y]\)
  2. \(\text{Var}[X+Y]=\text{Var}[X]+\text{Var}[Y]\)
  3. \(\text{Cov}[X, Y]=0\)

Extra: (more) Formal Probability Theory

In probability theory, an experiment has an outcome \(\omega\); all the possible outcomes form the outcome space/sample space \(\Omega\). An event is a set of outcomes that denote a "type of thing"; these form the event space \(\mathcal{E} \subseteq \mathcal{P}(\Omega)\) of the experiment.

The concept of a random variable is a formalization and generalization of the notion of an experiment on which we can define probabilistic questions.

Axioms of Probability (Appendix)

\((\Omega, \mathcal{E})\) is a measurable space if the outcome space/sample space \(\Omega\) is non-empty and the event space \(\mathcal{E} \subseteq \mathcal{P}(\Omega)\) has the following properties

  1. \(A \in \mathcal{E} \implies \overline{A} \in \mathcal{E}\)
  2. \(A_1, A_2, \dots \in \mathcal{E} \implies \displaystyle\bigcup_{i=1}^{\infty}A_i \in \mathcal{E}\)
  3. \(\mathcal{E}\) is non-empty

A probability measure/probability distribution \(P : \mathcal{E} \to [0, 1]\) must satisfy the axioms of probability:

  1. \(P(\Omega) = 1\)
  2. \(A_1, A_2, \dots \in \mathcal{E}\) and \(A_i \cap A_j = \emptyset\) for all \(i \neq j\) \(\implies\) \(\displaystyle P\left(\bigcup_{i=1}^{\infty}A_i\right) = \sum\limits_{i=1}^{\infty}P(A_i)\)

\((\Omega, \mathcal{E}, P)\) is a probability space if all the above properties hold.

A random variable is a function \(X : \Omega \to \mathcal{X}\) that defines a transformation between the experiment's probability space and its own; it induces the probability space \((\mathcal{X}, \mathcal{E}_X, P_X)\)

Using random variables lets us define probabilities in terms of boolean statements instead of set notation; \(P_X(X = x)\) is equivalent to \(P(\set{\omega : X(\omega) = x})\), for example

The space \((\mathcal{X}, \mathcal{E}_X, P_X)\) as defined above is a valid probability space, so the same axioms and rules apply.

Asides

We define a mixture as a sort of "compound distribution", whose PDF is defined as a normalized linear combination (convex combination) of other PDFs, i.e. \(p(x) = \displaystyle\sum\limits_{i=1}^{n} w_i p_i (x)\) where the weights \(w_1, \dots, w_n\) are all non-negative and sum to \(1\).
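
A minimal sketch of evaluating such a mixture PDF (assuming NumPy, Gaussian components, and made-up weights and parameters):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # PDF of N(mu, sigma^2) evaluated at x
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def mixture_pdf(x, weights, mus, sigmas):
    # convex combination of component PDFs: sum_i w_i * p_i(x)
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

# two components; weights are non-negative and sum to 1
print(mixture_pdf(0.5, weights=[0.3, 0.7], mus=[0.0, 2.0], sigmas=[1.0, 0.5]))
```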

Chapter 3 - Estimation

Estimators

An estimator \(\hat{X}\) of random variables \(X_1 \dots X_n\) is a function of those variables (so itself a random variable) that we construct in order to estimate some quantity of their distribution that we (presumably) know some, but not all, information about

The bias of estimator \(\hat{X}\) is \(\mathbb{E}[\hat{X} - X]\), i.e. the expected (signed) difference ("vertical offset") from the true value

To find the expected value, variance, etc. of an estimator, plug it into the formula for that property, then evaluate

Sample Mean

The sample mean \(\bar{X} = \displaystyle\dfrac{1}{n}\sum\limits_{i=1}^{n} X_i\) can be used as an estimator for the (common) mean of the random variables \(X_1 \dots X_n\)

Confidence

Confidence interval: \(P(|\hat{X} - \mu| \leq \varepsilon) > 1-\delta\) indicates that the true mean \(\mu\) is in the interval \([\hat{X}-\varepsilon, \hat{X}+\varepsilon]\) with probability at least \(1-\delta\)

Since the variance (usually) is a function of \(n\), more samples → tighter confidence intervals

Confidence Inequalities

Confidence inequalities relate the values of \(\delta\), \(\varepsilon\), and \(n\), which lets us bound confidence intervals

Hoeffding's inequality: \(P(|\hat{X} - \mathbb{E}[\hat{X}]| \geq \varepsilon) \leq 2 \exp\left(-\dfrac{2n\varepsilon^2}{(b-a)^2}\right)\) for samples from distributions bounded in \([a, b]\)

Chebyshev's inequality: \(P(|\hat{X} - \mathbb{E}[\hat{X}]| \geq \varepsilon) \leq \dfrac{\sigma^2}{n\varepsilon^2}\) for (possibly unbounded) distributions with variance \(\sigma^2\)

Since the left-hand side of both inequalities describes a confidence interval, we can equate the right-hand side to \(\delta\) and solve; this determines an actual interval (i.e. defines \(\varepsilon\)) in terms of \(\delta\), \(n\), and the bounds or \(\sigma\)
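
Setting Hoeffding's right-hand side equal to \(\delta\) and solving gives \(\varepsilon = (b-a)\sqrt{\ln(2/\delta)/(2n)}\); a minimal sketch of that calculation (assuming i.i.d. samples bounded in \([a, b]\)):

```python
import math

def hoeffding_epsilon(n, delta, a, b):
    # solve 2*exp(-2*n*eps^2 / (b-a)^2) = delta for eps
    return (b - a) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

# e.g. 1000 samples in [0, 1] and delta = 0.05 gives eps of roughly 0.043
print(hoeffding_epsilon(n=1000, delta=0.05, a=0.0, b=1.0))
```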

Consistency

An estimator \(\hat{X}\) is consistent if, as \(n\to\infty\) (i.e. more and more samples are considered), \(\hat{X}\) converges to the true quantity it is estimating

The convergence rate of an estimator \(\hat{X}\) is "how fast" the estimator approaches the true value, expressed in big-O notation

Complexity

The complexity of an estimator \(\hat{X}\) is the number of samples required to guarantee an error of at most \(\varepsilon\) with probability \(1-\delta\)

Error

The mean-squared error \(\text{MSE}[\hat{X}]\) is a way to calculate the expected (squared) error of an estimator \(\hat{X}\)
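
Writing \(\theta\) for the true value being estimated (a symbol introduced here just for this identity), the standard bias-variance decomposition of the mean-squared error is

\[\text{MSE}[\hat{X}]=\mathbb{E}[(\hat{X}-\theta)^2]=\text{Var}[\hat{X}]+\left(\mathbb{E}[\hat{X}]-\theta\right)^2=\text{Var}[\hat{X}]+\text{Bias}[\hat{X}]^2\]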

Chapter 4 - Optimization

The goal of optimization is (often) to find some parameter \(\vec{w}\) in the possible set of parameters \(\mathcal{W}\) to minimize some objective function \(c : \mathcal{W} \to \mathbb{R}\) that calculates the "cost" or "error" of an estimation.

A common objective function is the squared error: \(c(\vec{w}) = \displaystyle\sum\limits_{i=1}^{n} (\vec{x_i}^{\top}\vec{w} - y_i)^2\)

Stationary Points

For continuous \(\mathcal{W}\), the \(\arg\min\) must lie at a stationary point, i.e. a point where \(c'(w_0) = 0\)

We use the second derivative test to figure out the type of point

Gradient Descent

Generally, datasets are much too large to calculate stationary points symbolically, so they must be approximated numerically

The Taylor series \(c(a) + \displaystyle\sum\limits_{n=1}^{\infty}\frac{c^{(n)}(a)}{n!}(w-a)^n\) of \(c(w)\) around a point \(a\) approximates \(c(w)\) within its radius of convergence (i.e. locally around \(a\))

Second-order

Second-order gradient descent: stationary points can be "estimated" using a Taylor series by picking a nearby point \(w_0\), truncating the Taylor series at second order, differentiating to get an expression for \(c'(w)\), and equating it to \(0\); the resulting \(w\) value \(w_1\) becomes the next estimate
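
Concretely (this is the standard Newton-step manipulation the paragraph above describes): truncating the Taylor series at second order around \(w_0\) gives \(c'(w)\approx c'(w_0)+c''(w_0)(w-w_0)\); setting this to \(0\) and solving for \(w\) gives the next estimate

\[w_1 = w_0 - \frac{c'(w_0)}{c''(w_0)}\]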

First-order

Without the second derivative, we must pick a constant stepsize \(\eta\) to move by each time: \(w_{t+1} = w_t - \eta\, c'(w_t)\)

Vector Gradient Descent

Adaptive Step Size

If the step size is too small, it takes a long time to find a stationary point; if the step size is too large, the search "may bounce" around the point without converging

Our ideal step size is \(\eta_t = \arg\min\limits_{\eta \in \mathbb{R}^{+}} c(\vec{w_t}-\eta \nabla c(\vec{w}_t))\), but this is usually too expensive to compute exactly

We try the largest "reasonable" step size \(\eta_{\text{max}}\). If it reduces the cost, we use that; otherwise, we decrease \(\eta_{\text{max}}\) and try again until the cost is reduced

Generally, we reduce \(\eta\) using the rule \(\eta \to \tau\eta\), for some \(\tau \in [0.5, 0.9]\)
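
A minimal sketch of this shrinking rule (the cost function `c` and its gradient `grad_c` are assumed to be supplied by the caller; `w` can be a float or a NumPy array):

```python
def backtracking_step(c, grad_c, w, eta_max=1.0, tau=0.7):
    # start from the largest "reasonable" stepsize and shrink it (eta -> tau*eta)
    # until the candidate step actually reduces the cost
    g = grad_c(w)
    eta = eta_max
    while c(w - eta * g) >= c(w) and eta > 1e-12:
        eta *= tau
    return w - eta * g
```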

Heuristic Alternative Example

Heuristic search: A search that uses previous information to help solve the current problem

We define \(\vec{g_t} = \nabla c(\vec{w_t})\) and choose \(\eta_t = \left(1+\sum\limits_{j=1}^{d} |g_{t, j}|\right)^{-1}\)

Testing for Optimality and Uniqueness

If a function is convex (i.e. its second derivative is always non-negative), then all stationary points are global minima. As such, gradient descent is optimal (assuming it converges to a stationary point)

We may constrain the bounds in which we want to search for our optimization. As such, the function's boundary points may (or may not) need to be considered (evaluated) to see if they are global minima

Identifiability: the ability to identify the true, possibly unique solution

Equivalence under constant shift/scaling: adding any constant, or multiplying by a constant \(a > 0\), doesn't change the solution, i.e. \(\arg\min\limits_{\vec{w}\in \mathbb{R}^{d}}c(\vec{w}) = \arg\min\limits_{\vec{w}\in \mathbb{R}^{d}}a\, c(\vec{w}) = \arg\min\limits_{\vec{w}\in \mathbb{R}^{d}}c(\vec{w})+a\)

Chapter 5 - Parameter Estimation

In probabilistic modelling, our task is to approximate (model) the true distribution of a set of observations (dataset). Since distributions are defined uniquely by their parameters, our goal is to identify and estimate them given the dataset.

The hypothesis space or function class \(\mathcal{F}\) is the set of all possible distributions we consider for an unlabelled dataset \(\mathcal{D} = \set{x_i}^{n}_{i=1}\). We assume the true distribution \(\star{f}\) of \(\mathcal{D}\) is in \(\mathcal{F}\)

An estimation strategy is an algorithm that picks an \(f \in \mathcal{F}\) to approximate \(\star{f}\)

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is the process of picking the \(f \in \mathcal{F}\) that makes seeing the dataset \(\mathcal{D}\) under distribution \(f\) the most likely; we choose \(f\) to maximize that likelihood \(p(\mathcal{D}|f)\), so \(f_{\text{MLE}} = \arg\max\limits_{f\in\mathcal{F}}p(\mathcal{D}|f)\)

Recall that, assuming independent samples in \(\mathcal{D}\), we have \(p(\mathcal{D}|f) = \displaystyle\prod_{i=1}^{n} p(x_i|f)\), for \(x_i \in \mathcal{D}\)

When \(\mathcal{F}\) is finite (i.e. there is a discrete list of possible parameters to consider), we can just calculate each \(p(\mathcal{D}|f)\) and pick the \(f\) with the largest likelihood. Otherwise, for infinite (implying continuous) \(\mathcal{F}\), we need to use continuous optimization techniques like gradient descent to find the maximizing \(f\)

Log-likelihood

Optimizing the likelihood directly for continuous \(\mathcal{F}\) is painful; minimizing the negative log-likelihood, \(f_{\text{MLE}} = \arg\min\limits_{f\in\mathcal{F}}-\ln{p(\mathcal{D}|f)}\), yields the same \(f_{\text{MLE}}\) (since \(\ln\) is monotone) while transforming products into sums, making gradients easier to compute.
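
For example (a standard derivation, assuming \(\mathcal{D}=\set{x_i}_{i=1}^{n}\) with \(x_i\in\set{0,1}\) sampled from a Bernoulli with parameter \(\alpha\)):

\[-\ln p(\mathcal{D}\mid\alpha) = -\sum\limits_{i=1}^{n}\big[x_i\ln\alpha+(1-x_i)\ln(1-\alpha)\big]\,; \quad \text{setting the derivative to } 0 \text{ gives } \alpha_{\text{MLE}}=\frac{1}{n}\sum\limits_{i=1}^{n}x_i\]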

MLE for Conditional Distributions

Conditional distributions provide additional auxiliary information that will inform the estimate of the true distribution. In machine learning, most parameter estimation happens for conditional distributions; we often make predictions given many features (auxiliary variables).

Given two random variables \(X\) and \(Y\) and \(\mathcal{D} = \set{(x_i, y_i)}^{n}_{i=1}\), we want to estimate parameters \(\vec{w}\) for \(p(y|x)\). We use the product rule \(p(x_i, y_i) = p(y_i | x_i)p(x_i)\): because \(p(y|x)\) uses both \(y\) and \(x\), we need to consider both \(x_i\) and \(y_i\), i.e. \(p(x_i, y_i)\). We find: \[\displaystyle\ln{p(\mathcal{D}|\vec{w})} = \ln{\prod_{i=1}^{n} p(x_i, y_i | \vec{w})} = \ln{\prod_{i=1}^{n} p(y_i|[x_i, \vec{w}])p(x_i)} = \sum\limits_{i=1}^{n} \ln{p(y_i | [x_i, \vec{w}])} + \sum\limits_{i=1}^{n}\ln{p(x_i)}\]

Using the PDF \(p\) for whatever distribution we want to estimate, we can compute the gradient with respect to \(\vec{w}\) by computing each partial derivative, then either evaluate directly or perform gradient descent.

Since the second sum doesn't depend on \(\vec{w}\), we can formulate this as the minimization problem \(\displaystyle \arg\min_{\vec{w}\in \mathbb{R}^{k}}- \sum\limits_{i=1}^{n} \ln{p(y_i | [x_i, \vec{w}])}\)

When \(Y\) is a discrete distribution, we can simplify the summation into cases: for each possible value of \(y\), the \(x_i\) that appear with that \(y\) follow their own distribution (of the same type), with its own parameters. So, partitioning the \(x_i\) by the value of \(y_i\) gives a different set of parameters per class, all contained in \(\vec{w}\), and each set of parameters is estimated separately.

Maximum a Posteriori (MAP) Estimation

Maximum a Posteriori Estimation (MAP) is the process of picking the most probable \(f \in \mathcal{F}\) given dataset \(\mathcal{D}\); we choose \(f\) to maximize \(p(f|\mathcal{D})\), so \(f_{\text{MAP}} = \arg\max\limits_{f\in\mathcal{F}}p(f|\mathcal{D})\)

We calculate the posterior distribution \(p(f|\mathcal{D})\) using Bayes' rule: \(\displaystyle p(f|\mathcal{D}) = \dfrac{p(\mathcal{D}|f) p(f)}{p(\mathcal{D})}\). \(p(\mathcal{D})\) doesn't depend on \(f\), so we derive \(f_{\text{MAP}} = \arg\max\limits_{f\in\mathcal{F}} p(\mathcal{D}|f)p(f)\)

Log-likelihood can be used for MAP as well

Note: priors cannot really be calculated, because they are defined in terms of other values we are trying to compute. Instead, we choose priors that we think match our situation, or that have useful properties (more on that later). The prior is a degree of freedom (a modelling choice).

Bayesian Estimation

MLE and MAP are point estimates because they provide a single vector \(\vec{w}\) as an estimation; they don't give insight into how changes in the parameter space are reflected in the accuracy of the estimation. In addition, defining distributions only by their parameters means that non-standard (e.g. skewed, multimodal) distributions cannot be represented or estimated.

Bayesian approaches aim to derive the entire posterior distribution \(p(f|\mathcal{D})\) instead of "summarizing" it; as such, they provide more information than just a single value for \(f\).

We calculate the posterior \(p(f|\mathcal{D})\) by updating the prior \(p(f)\) with the data \(\mathcal{D}\); often, \(p(f)\) and \(p(f|\mathcal{D})\) follow the same type of distribution (known as a conjugate prior), where \(P(f|\mathcal{D})\) will have lower variance because it is equipped with the information in \(\mathcal{D}\).

Considering the entire posterior lets us reason about the range of plausible parameters, i.e. the level of confidence around the estimated parameters. Using the posterior variance, we can reason about whether our estimate is likely near-optimal.

Computing the Posterior with Conjugate Priors

Using Bayes' rule, we find \(p(w|\mathcal{D}) = \dfrac{p(\mathcal{D}|w)p(w)}{p(\mathcal{D})}\). \(p(\mathcal{D}|w)\) and \(p(w)\) are known, and we can compute the normalizing constant \(p(\mathcal{D})\) with \(p(\mathcal{D}) = \displaystyle\int_{} p(\mathcal{D}|w)p(w) \, dw\)

A prior \(p(w)\) is a conjugate prior to a likelihood \(p(\mathcal{D}|w)\) if the resulting posterior \(p(w|\mathcal{D})\) and prior have the same type of distribution.

Gradient Descent for Parameter Estimation

We can also find point estimates numerically instead of analytically, using gradient descent with the rule \(\vec{w_{t+1}} = \vec{w_{t}} - \eta_t \nabla c(\vec{w_t})\), where the cost function \(c(\vec{w})\) follows from dropping constants from the expansion of \(-\ln p(x_i \mid \vec{w})\), or \(-\ln p(y_i | [x_i , \vec{w}])\) for conditional distributions

Chapter 6 - Stochastic Gradient Descent

Although gradient descent is a very adaptable approach to finding stationary points, it has a high computation cost because it requires iterating over the entire dataset. Stochastic approximation involves picking a random subsample of the dataset from which to calculate the gradient. This strategy works empirically due to the scale of datasets used in real applications of machine learning.

Since the gradient of a single sample \(\nabla c_k(\vec{w})\) is an unbiased (though high-variance) estimator for the true gradient \(\nabla c(\vec{w}) = \displaystyle\frac{1}{n} \sum\limits_{i=1}^{n} \nabla c_i(\vec{w})\), the sample average of \(b\) random gradient samples \(\displaystyle\frac{1}{b} \sum\limits_{k \in B} \nabla c_k(\vec{w})\) will be a lower-variance unbiased estimator for the true gradient. As \(b\) grows, the variance gets smaller; thus, with a high enough \(b\), we can use this sample average instead.

SGD Algorithm

```python
import numpy as np

# stochastic gradient descent for batch size b = 1
# D: array of n datapoints; grad_i(x, w): gradient of the per-sample cost c_x(w)
def sgd(D, grad_i, d, epochs):
    w = np.random.randn(d)            # starting estimate: vector w with random values
    for epoch in range(1, epochs + 1):
        np.random.shuffle(D)          # randomly shuffle the order of D
        eta = 1.0 / epoch             # step size is inverse of epoch number (adaptive step size)
        for x in D:                   # for each of the n datapoints
            g = grad_i(x, w)          # multivariable version of c_x'(w)
            w = w - eta * g           # gradient descent update rule
    return w                          # estimate of a point where the gradient of c(w) is 0
```

Note that instead of shuffling the dataset each time we descend, we shuffle it once, then keep using the next \(b\) elements to construct the estimate. This is more efficient than randomizing every time, while guaranteeing that every datapoint gets looked at the same number (\(\pm 1\)) of times. One full pass over the data is called an epoch, and may be repeated an arbitrary number of times.

Generally, setting \(b=1\) gives too high a variance; we usually pick a \(b > 1\) and add another inner loop to calculate the sample-average gradient over the \(b\) datapoints. Note that a higher \(b\) likely leads to a better estimate of the true gradient (and thus more accurate gradient descent), but requires more computation per step. Thus, a balanced \(b\) should be chosen with care.
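
A minimal mini-batch sketch along these lines (assuming NumPy and a caller-supplied per-sample gradient function `grad_i(x, w)`):

```python
import numpy as np

def minibatch_sgd(D, grad_i, d, b=32, epochs=10):
    # D: array of n datapoints; grad_i(x, w): gradient of the per-sample cost at x
    w = np.random.randn(d)                      # random starting estimate
    for epoch in range(1, epochs + 1):
        np.random.shuffle(D)                    # shuffle once per epoch
        eta = 1.0 / epoch                       # decaying stepsize
        for start in range(0, len(D), b):
            batch = D[start:start + b]          # next b datapoints
            g = np.mean([grad_i(x, w) for x in batch], axis=0)  # sample-average gradient
            w = w - eta * g                     # gradient descent update
    return w
```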

Stepsize Selection

In the algorithm above, we used the epoch number \(p\) to calculate stepsize as \(\eta = p^{-1}\), since it gets smaller as we perform more iterations of gradient descent while assuring each datapoint has the same relative weight.

In the AdaGrad algorithm, stepsize is calculated as \(\vec{\eta} = \dfrac{1}{\sqrt{1+\overline{\vec{g_t}}}} = (1 + \overline{\vec{g_t}})^{-\frac{1}{2}}\), with \(\overline{\vec{g_t}} = \overline{\vec{g_{t-1}}}+\displaystyle\frac{1}{d}\sum\limits_{j=1}^{d} (\vec{g_{t,j}})^2\), where \(\vec{g_t}\) is the mini-batch gradient on iteration \(t\). So, the larger the gradients seen so far, the smaller the current stepsize, but we still have \(\vec{\eta} \to \vec{1}\) when the gradients have been near \(\vec{0}\), implying convergence around the true value of \(\vec{w}\). Note that AdaGrad can also use a vector stepsize that scales the stepsize independently in each dimension.
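
A minimal sketch of the per-coordinate variant mentioned at the end (one squared-gradient accumulator per dimension, using \(1\) in place of the usual small constant to mirror the \(1+\overline{\vec{g_t}}\) form above; the course formula instead averages squared gradients over dimensions):

```python
import numpy as np

def adagrad_update(w, g, accum):
    # accum: running sum of squared gradients, one entry per dimension
    accum = accum + g ** 2
    eta = 1.0 / np.sqrt(1.0 + accum)   # stepsize shrinks where large gradients were seen
    return w - eta * g, accum

# usage inside an SGD loop, with accum initialized to np.zeros(d):
#   w, accum = adagrad_update(w, g_t, accum)
```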

Time Complexity

If it takes \(O(d)\) to compute \(\nabla c_i(\vec{w})\) for one \(i \in \mathbb{N}\), then each mini-batch update in SGD takes \(O(bd)\), which seems better than the regular \(O(nd)\) for "regular" gradient descent, since we usually have \(n \gg b\). However, since regular gradient descent has more accurate update steps, SGD will almost certainly take more steps (denoted \(k\)) to reach a stationary point.

If stochastic gradient descent performs better overall, we have \(k_{\text{SGD}}bd \leq k_{\text{GD}}nd\), i.e. \(k_{\text{SGD}} \leq k_{\text{GD}}\dfrac{n}{b}\). Empirically, this is almost always the case in real applications, since practical use involves very large \(n\).

Chapter 7 - Intro to Prediction Problems

Machine learning problems come in many forms; a useful ontology (set of properties to categorize) for machine learning problems has the following dimensions

Forms include (from notes): supervised, semisupervised, and unsupervised learning, completion under missing features, structured prediction, learning to rank, statistical relational learning, active learning, and time series prediction.

Supervised Learning

Supervised learning uses a dataset \(\mathcal{D}\) of observations \(x \in \mathcal{X}\) paired with targets \(y \in \mathcal{Y}\). Usually, observations are vectors in some space (e.g. \(\mathbb{R}^d\)), where an arbitrary member of this space is an instance/sample. An entry in a sample is a feature/attribute; these are given when defining the problem and describe the data in aggregate.

Aside: prediction algorithms are generally designed for inputs of real-valued vectors, but most data doesn't trivially fit into this format. So, we must consider how to encode different types of data as vectors in some Euclidean space, where algorithms can be applied. Such a mapping is called an embedding. In this course, we will assume data is already embedded.

Classification vs. Regression

In both classification and regression problems, we seek to predict targets based on features. Regression problems have a continuous target space \(\mathcal{Y}\) (e.g. \(\mathcal{Y} \subseteq \mathbb{R}\)), whereas classification problems have discrete target space containing various classes/categories (e.g. \(\mathcal{Y} = \set{\text{healthy}, \text{diseased}}\)).

Classification problems can be further divided into multi-class problems, where each datapoint can have one label \(\ell \in\mathcal{L}\), i.e. \(\mathcal{Y} = \mathcal{L}\), and multi-label problems, where a datapoint can have multiple labels, i.e. \(\mathcal{Y} = \mathcal{P}(\mathcal{L})\).

Optimal Classification and Regression Models

To create an optimal classification or regression model, we must define a cost function \(\text{cost} : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^{+}\) that measures the cost of a predictor function \(f : \mathcal{X} \to \mathcal{Y}\). \(\text{cost}(\hat{y}, y)\) is the cost for predicting \(\hat{y}\) (from \(f\)) when the true answer is \(y\). We seek to minimize the expected cost.

Cost Functions

The "simplest" cost function is the \(0\)-\(1\) cost function, where \(\text{cost}(\hat{y}, y) = 0\) when \(y = \hat{y}\) and \(1\) otherwise.

For discrete \(\mathcal{Y}\) (→ classification problems), we can define a different cost for any combination of values of \(\hat{y}\) and \(y\), i.e. a cost matrix over \(\mathcal{Y}\times\mathcal{Y}\). This lets us specify when some kinds of inaccurate predictions are more problematic than others (e.g. false negatives are usually worse than false positives in a medical setting).

For regression problems, we must use continuous functions of \(\hat{y}\) and \(y\) so that the cost is defined for any combination of the two. Common choices are the squared error \((\hat{y}-y)^2\) and the absolute error \(|\hat{y}-y|\).

Deriving Optimal Predictors

For classification problems, we can express \(\mathbb{E}[C] = \mathbb{E}[\text{cost}(f(\mathcal{X}), \mathcal{Y})]\) as \(\displaystyle \int_{\mathcal{X}} \sum\limits_{y\in\mathcal{Y}} \text{cost}(f(\vec{x}), y)p(\vec{x}, y) \, d\vec{x}\) \(= \displaystyle \int_{\mathcal{X}} p(\vec{x}) \sum\limits_{y\in\mathcal{Y}} \text{cost}(f(\vec{x}), y)p(y | \vec{x}) \, d\vec{x}\)

So, the optimal predictor is \(\star{f}(\vec{x})=\arg\min\limits_{\hat{y}\in\mathcal{Y}} \mathbb{E}[C|\mathcal{X}=\vec{x}]=\arg\min\limits_{\hat{y}\in\mathcal{Y}} \displaystyle\sum\limits_{y\in\mathcal{Y}} \text{cost}(\hat{y}, y)p(y|\vec{x})\). Thus, the optimal predictor depends on the cost function. For the \(0\)-\(1\) cost, this evaluates to \(\arg\max\limits_{y\in\mathcal{Y}}p(y|\vec{x})\).
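
To see the \(0\)-\(1\) case explicitly: for any candidate prediction \(\hat{y}\),

\[\sum\limits_{y\in\mathcal{Y}} \text{cost}(\hat{y}, y)p(y|\vec{x}) = \sum\limits_{y\neq\hat{y}} p(y|\vec{x}) = 1 - p(\hat{y}|\vec{x}),\]

which is minimized by the \(\hat{y}\) with the largest \(p(\hat{y}|\vec{x})\).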

For regression problems, \(\mathcal{Y}\) is continuous, so we integrate over it instead, i.e. \(\mathbb{E}[C] = \displaystyle \int_{\mathcal{X}} p(\vec{x}) \int_{\mathcal{Y}} \text{cost}(f(\vec{x}), y)p(y | \vec{x}) \, dy \, d\vec{x}\). For the squared error, \(\star{f}(\vec{x}) = \mathbb{E}[Y|\vec{x}]\)

Reducible and Irreducible Error

There are two types of error in our prediction: reducible error comes from how far the trained model \(f(\vec{x})\) is from the optimal predictor \(\star{f}(\vec{x})\) (e.g. \(\mathbb{E}[Y|\vec{x}]\) for squared error); this can theoretically be reduced to \(0\) by finding \(\star{f}\), or (in practice) a suitably close estimator. The irreducible error comes from the variability inherent to \(\mathcal{Y}\) given \(\vec{x}\), and as such cannot be reduced by improving the estimator.

Chapter 8 - Linear and Polynomial Regression

Given a dataset \(\mathcal{D}\), we want to find a functional form \(f\) that describes the relationship between the features and the target. Like for distributions, the function \(f\) will have parameters \(\vec{w}\).

For linear \(f\), we have \(f(\vec{x})= w_0 + w_1 x_1 + w_2 x_2 + \dots\), where the parameters are \(\vec{w}=(w_0, w_1, \dots)\). We can express this as the dot product \(f(\vec{x})=\displaystyle\sum\limits_{j=0}^{d}w_j x_j = \vec{x}^{\top}\vec{w}\), where we define \(x_0=1\) for ease of notation.

Maximum Likelihood Formulation

We formulate linear regression probabilistically: \(Y\) is a random variable whose (linear) relationship with \(X\), the random variable representing a possible input vector \(\vec{x}\), we would like to learn. \(Y\) is defined as \(\displaystyle\sum\limits_{j=0}^{d}w_j x_j \, + \, \varepsilon\), where \(\vec{w}=(w_0, w_1,\dots, w_d)\) are the true parameters (which we aim to learn) and \(\varepsilon\) is a noise term following a zero-mean Gaussian distribution \(\mathcal{N}(0, \sigma^2)\), independent of \(X\).

So, as an MLE problem, for \(\mathcal{F}\subseteq \mathbb{R}^{d+1}\), we have \(\vec{w}_\text{MLE}=\arg\min\limits_{\vec{w}\in\mathcal{F} \subset \mathbb{R}^{d+1}} -\displaystyle\sum\limits_{i=1}^{n}\ln{p(y_i \mid \vec{x_i}, \vec{w})}\). Since we know this is Gaussian, we can expand using the Gaussian PDF to find \(\dots = \arg\min\limits_{\vec{w}\in\mathcal{F} } \displaystyle\sum\limits_{i=1}^{n}{(y_i - \vec{x_i}^\top \vec{w})^2}\). So, our predictor is \(\hat{y_i}=\vec{x_i}^{\top}\vec{w}\).

Finding a Linear Regression Solution

We can minimize the squared error using the cost function \(c(\vec{w})=\displaystyle \frac{1}{n}\sum\limits_{i=1}^{n} \frac{1}{2}(\vec{x_i}^{\top}\vec{w}-y_i)^2\), where \(f(\vec{x_i})=\vec{x_i}^{\top}\vec{w}\). We find that the partial derivative with respect to weight \(w_j\) is \(\dfrac{\partial c}{\partial w_j}(\vec{w})=\displaystyle \frac{1}{n} \sum\limits_{i=1}^{n}(\vec{x_i}^{\top}\vec{w}-y_i)x_{ij}\), where \(j\) indexes an attribute of datapoint \(\vec{x_i}\).

This gives us a system of \(d+1\) equations: for each \(j\in\set{0,\dots,d}\), \(\dfrac{\partial c}{\partial w_j}(\vec{w})=0\). Written as a vector, this is \(\nabla c(\vec{w})=\displaystyle \frac{1}{n} \sum\limits_{i=1}^{n}(\vec{x_i}^{\top}\vec{w}-y_i)\vec{x_i}=\vec{0}\). We can use linear algebra to find a solution to this system, but (as with most things) approximating it with (stochastic) gradient descent is usually more practical.
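
A minimal sketch of the linear-algebra route, using NumPy's least-squares solver on made-up data (the first column of `X` plays the role of the bias feature \(x_0 = 1\)):

```python
import numpy as np

# toy data: 4 samples, a bias column of ones plus one feature
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.9, 5.1, 7.0])

# solve argmin_w ||Xw - y||^2, i.e. the stationary-point system above
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)        # roughly [1.0, 2.0]: intercept and slope
print(X @ w)    # predictions x_i^T w
```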

Polynomial Regression

Most real-life problems don't have linear solutions; we can extend linear regression to produce non-linear predictions by applying a non-linear transformation \(\phi\) to the data and performing linear regression in the transformed feature space (the resulting predictor is linear in \(\phi(\vec{x})\) but non-linear in \(\vec{x}\)).

Univariate Case

To estimate a univariate polynomial function of degree \(p\), we define the polynomial \(f(x)=\displaystyle\sum\limits_{j=0}^{p}w_j \phi_j(x) = \vec{\phi}^{\top}\vec{w}\) with parameters \(\vec{w}\). Here, \(\vec{\phi}=(\phi_0(x), \phi_1(x), \dots, \phi_p(x))\), where \(\phi_j(x)=x^j\) is a set of basis functions that transforms the polynomial regression problem into a linear one. We then solve the linear regression as if our inputs were \(\phi_0(x), \phi_1(x), \dots, \phi_p(x)\)

This can also be thought of as transforming the original dataset \(\mathcal{D}=\set{(x_i, y_i)}\) into \(\mathcal{D}_\phi =\set{(\phi(x_i), y_i)}\), where \(\phi(x_i) = \begin{bmatrix} \phi_0(x_i) \\ \vdots \\ \phi_p(x_i) \end{bmatrix}\), and performing a normal linear regression on that, whose parameters become the coefficients of the polynomial regression.
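
A minimal sketch of this transform-then-fit recipe for the univariate case (NumPy, made-up data, degree \(p=2\)):

```python
import numpy as np

def poly_features(x, p):
    # phi(x_i) = (x_i^0, x_i^1, ..., x_i^p) for each scalar input, stacked as rows
    return np.vstack([x ** j for j in range(p + 1)]).T

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 1.3, 2.1, 3.2, 5.1])        # roughly quadratic toy data

Phi = poly_features(x, p=2)                    # the transformed dataset D_phi
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # linear regression in phi-space
print(w)                                       # coefficients w_0, w_1, w_2
print(Phi @ w)                                 # polynomial predictions
```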

Multivariate Case

For multivariate inputs \(\vec{x}\), each transformation \(\phi_j : \mathcal{X}\to \mathbb{R}\) produces one term \(\phi_j(\vec{x})\) of the polynomial. E.g. for \(\vec{x}= \begin{bmatrix} a \\ b \end{bmatrix}\) and degree \(2\), our polynomial basis \(\vec{\phi}\) would be \(\set{1, a, b, ab, a^{2}, b^{2}}\), and we are optimizing the function \(f\left(\begin{bmatrix} a \\ b \end{bmatrix}\right)=w_0 + w_1 a + w_2 b + w_3 ab + w_4 a^{2}+ w_5 b^{2}=\displaystyle \sum\limits_{j=0}^{5}w_j \phi_j \left(\begin{bmatrix} a \\ b \end{bmatrix}\right)\)

Chapter 9 - Generalization Errors and Evaluation of Models

A model's training error is the total cost of a model across all the training data.

A model's generalization error is the expected cost of a model across all the possible datapoints, i.e. the whole data space. It measures how well a model is likely to perform on data it has never seen before.

High generalization error can be caused by overfitting, where the function has been overly adapted to the training data specifically; it matches the noise of the data instead of the trend. In the extreme case, an overfitted function simply memorizes the dataset: it returns \(y_i\) if input \(x_i\) is in the dataset and something arbitrary (e.g. \(0\)) otherwise.

Underfitting can occur when the model is not complex enough to represent the data. However, this can be more easily detected, since both generalization and training error will remain high.

Test Sets

A test set is a subset of the dataset \(\mathcal{D}\) left out of training and used to test the model's performance; in particular to test for overfitting. Since the model hasn't seen the testing data before, it can't have overfit to it.

Overfitting has likely occurred if the testing error is significantly higher than the training error, i.e. the generalization error is high.

A good choice for a model is one that minimizes the testing error, since this seems to generalize the best. Relative to this point, models with lower complexity likely underfit and models with higher complexity likely overfit.

Underfitting has likely occurred if the next increase in complexity (e.g. additional power in polynomial regression) decreases both the training and testing error. As such, the "optimal" model is the one with the smallest testing error.

Making Statistical Claims about Error

Confidence Intervals

If we take the sample average of \(m\) error samples of function \(f\)'s predictions on dataset \(\mathcal{D}\), our sample average error is \(\overline{E}=\displaystyle \frac{1}{m} \sum\limits_{i=1}^{m}c_i(f)\), where \(c_i(f)\) is the error on sample \(i\) calculated by the cost \(c\) (e.g. squared error, etc.). The sample average error follows a Student's t-distribution

Confidence intervals let us measure not only the performance of one model, but also compare multiple models. If for given \(\varepsilon_1\) and \(\varepsilon_2\) we have \(\overline{E_1}+\varepsilon_1 < \overline{E_2}-\varepsilon_2\) (i.e. the intervals don't overlap), we know that \(f_1\) is statistically significantly better than \(f_2\) at the corresponding confidence level.

Parametric Tests

If we want to simply rank models \(f_1\) and \(f_2\) without precise error bars, we can perform a parametric test by testing a model on a dataset and calculating the probability that the observed result would happen by chance alone; this probability is called the \(p\)-value. By convention, a \(p\)-value of \(\alpha = 0.05\) or smaller is taken to imply statistical significance.

Binomial Test

We can perform the binomial test between two models \(f_1\) and \(f_2\) by counting, over \(m\) held-out samples, the number of samples \(k\) on which \(f_1\) makes a better prediction than \(f_2\); under the null hypothesis that the two models are equally good, \(k\) follows a \(\text{Binomial}(m, 0.5)\) distribution, from which we compute the \(p\)-value.
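
A minimal sketch of computing the two-sided \(p\)-value for such a count (standard library only; the counts are made up):

```python
from math import comb

def binomial_test_p(k, m, q=0.5):
    # two-sided p-value under the null hypothesis k ~ Binomial(m, q):
    # sum the probabilities of all outcomes at least as unlikely as the observed k
    pmf = [comb(m, i) * q**i * (1 - q)**(m - i) for i in range(m + 1)]
    return sum(p for p in pmf if p <= pmf[k])

# f_1 beats f_2 on 16 of 20 test samples -> p is roughly 0.012
print(binomial_test_p(k=16, m=20))
```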

Other Tests

Although the structure of the testing is constant, different distributions can be used for different tests when necessary.

Choosing a parametric test is like choosing a learning model: strong assumptions can lead to faster learning/stronger results, but are more prone to bias/poor conclusions when the assumptions are wrong, whereas weaker assumptions are safer but need more data

Types of Errors

Type I error: the null hypothesis is rejected when it shouldn't have been (false positive), e.g. because a test was used under incorrect assumptions

Type II error: a test fails to reject the null hypothesis even though it is false, i.e. a real effect goes undetected (false negative)

Chapter 10 - Regularization and Constraint of the Hypothesis Space

Note: although the course notes motivate the actual regularization formulae, I didn't find that they explained this concept in a clear way, particularly when first introducing it. I would recommend reading this article on the subject first to provide some background before reading the course notes.

Regularization refers to a set of techniques used to reduce overfitting by trading training accuracy for generalizability. This is done by adding an additional term to the cost function that penalizes large coefficients in \(\vec{w}\), since this often indicates overfitting.

Deriving Regularization as MAP

If we define linear regressions as a MAP problem (instead of an MLE problem, like we have been), we must choose a prior \(p(\vec{w})\), since we are considering \(p( \vec{w}|\mathcal{D})\). This prior is picked to regularize overfitting. Two common choices are the Gaussian Prior (\(\ell_2\) norm) and the Laplace Prior (\(\ell_1\) norm).

In the following priors, \(\lambda\) is a hyperparameter (i.e. a parameter of the learning process itself). It ends up acting as the regularization parameter; the higher the \(\lambda\), the higher the penalty for having large coefficients.

Gaussian Prior (\(\ell_2\) Norm, Ridge regularization)

Assume each component \(w_j\) of \(\vec{w}\) has Gaussian prior \(\mathcal{N}(0, \dfrac{\sigma^2}{\lambda})\) with the following assumptions

We choose \(\dfrac{\sigma^2}{\lambda}\) for the prior's variance in order to create a regularization parameter (namely \(\lambda\)).

Since each prior \(p(w_j)\) is Gaussian, the prior \(p(\vec{w})\) is defined as \(p(\vec{w})=p(w_1)\times p(w_2)\times \dots \times p(w_d)\). Taking the negative log-likelihood and expanding the Gaussian PDF yields

\[-\ln{p(\vec{w})}=-\displaystyle\sum\limits_{j=1}^{d}\ln{p(w_j)}=\dots=\frac{d}{2}\ln\frac{2\pi \sigma^2}{\lambda} + \boxed{\frac{\lambda}{2\sigma^2}\sum\limits_{j=1}^{d}w^2_j}\]

The first term is constant, so it doesn't affect the selection of \(\vec{w}\).

Since we are using MAP, we consider the prior and the likelihood, so we want to find the \(\arg\min\) of their sum, so we get \[\arg\min\limits_{\vec{w}\in \mathbb{R}^{d+1}}-[\ln(p(\vec{y}|[\vec{X}, \vec{w}]))+\ln(p(\vec{w}))] = \dots = \arg\min\limits_{\vec{w}\in \mathbb{R}^{d+1}} \frac{1}{2}\sum\limits_{i=1}^{n}(\vec{x_i}^{\top}\vec{w} - y_i)^{2}\, + \frac{\lambda}{2}\sum\limits_{j=1}^{d}w^2_j\]

Since \(\vec{x_i}^{\top}\vec{w}\) is our prediction \(\hat{y_i}\), we simply have our regular MAP cost function plus the regularization term, here \(\displaystyle\frac{\lambda}{2}\sum\limits_{j=1}^{d}w^2_j\).

The gradient of our regularized cost function is \(\nabla c_i(\vec{w})=\begin{bmatrix}(\vec{x_i}^{\top}\vec{w}-y_i) \\ (\vec{x_i}^{\top}\vec{w}-y_i) x_{i1} + \lambda{w_1} \\ (\vec{x_i}^{\top}\vec{w}-y_i) x_{i2} + \lambda{w_2} \\ \vdots \end{bmatrix}\)

We can simplify the problem (and the PDF) by assuming the mean of each \(w_j\) is \(0\); this is reasonable if the data is centred so that the intercept term \(w_0\) is absorbed. Accordingly, we don't regularize the \(w_0\) term either.

Laplace Prior (\(\ell_1\) Norm, Lasso regularization)

We define and motivate the Laplace prior in the same way as the Gaussian prior, but assume each component \(w_j\) of \(\vec{w}\) follows a Laplace distribution instead.

Performing the same derivation, we find the regularization term to be \(\displaystyle\frac{\lambda}{2}\sum\limits_{j=1}^{d}|w_j|\), the difference being that we no longer square \(w_j\). As a result, the Laplace prior is more likely to return solutions where some \(w_j\) are exactly \(0\) (i.e. sparse solutions).

Expectation and Variance for MLE vs. MAP

In the (univariate) case where we have one parameter \(w\) (with "true" value \(\omega\)) over dataset \(\mathcal{D}=(X,Y)\), we have found in [[Chapter 5 - Parameter Estimation]] that the MLE estimation \(w_{\text{MLE}}(\mathcal{D})=\dfrac{\sum_{i=1}^{n}X_i Y_i}{\sum_{i=1}^{n}X_i^2}\) is an unbiased estimator of \(\omega\); we can find variance using the formula (in the course notes on [[CMPUT 267 Notes.pdf|page 107]]).

However, the MAP estimate \(w_{\text{MAP}}(\mathcal{D})=\dfrac{\sum_{i=1}^{n}X_i Y_i}{\lambda+S_n}\) (where \(S_n = \sum_{i=1}^{n}X_i^2\)) is not an unbiased estimate of \(\omega\); we find that \(\displaystyle\mathbb{E}[w_{\text{MAP}}(\mathcal{D})]=\omega\, \mathbb{E}\left[\frac{S_n}{\lambda+S_n}\right]\). So, it is unbiased when \(\lambda=0\) (i.e. when the regularization penalty is \(0\), as expected), but becomes more biased as \(\lambda\to \infty\)

Bias-Variance Tradeoff Revisited

We find that we can reduce the mean-squared error by incurring some bias; this is optimal as long as the variance is decreased more than the squared bias increases.

Bias-Variance Quadrants

  1. Low bias, low variance: \(\mathcal{F}\) is large enough (i.e. has enough "detail") to represent \(\star{f}\) properly. The more complex \(\star{f}\) is, the more samples are required
  2. Low bias, high variance: \(\mathcal{F}\) is complex enough to represent \(\star{f}\), but we don't have enough samples (\(n\) is too small)
  3. High bias, low variance: \(\mathcal{F}\) is not complex enough to represent \(\star{f}\)
  4. High bias, high variance: Bad Bad Bad. Likely not enough data and the model \(\mathcal{F}\) is not complex enough to represent \(\star{f}\).

Choosing the Right Model

Bias-variance Tradeoff (again)

Generally, we find above that the smaller our dataset is, the more we should err on the side of simpler models, so that we don't end up in the useless high-bias, high-variance category.

We can hold out a small validation set: train a bunch of different models on the training data, then use the validation set to evaluate which model would likely perform the best, keeping the test set for the final evaluation.

Inductive Bias

We can use inductive biases to improve models: these are biases that we are aware of (from our prior knowledge) that we can use to prune the function space.

However, this starts to nudge ML models from things that should be able to learn from any context (i.e. from any dataset) to things more tailored to situations useful to us.

Chapter 11 - Logistic Regression and Linear Classifiers

Logistic regression lets us apply the machinery of regression to classification problems by interpreting the output of the learned regression function as the parameter of a class distribution (the probability of the positive class), which we then turn into a classification

We learned that classification requires estimating \(p(y|\vec{x})\). For binary classification, i.e. \(\mathcal{Y}=\set{0, 1}\), \(p(y|\vec{x})\) must follow a Bernoulli distribution, which has one parameter \(\alpha\); \(\alpha(\vec{x})=p(y=1|\vec{x})\). We can use logistic regression to parametrize and learn this \(\alpha(\vec{x})\).

Parametrization for Binary Classification

Essentially, we perform linear regression but constrain the range of the function to \((0, 1)\), since this is the space over which binary classification is (somewhat) defined. We constrain using the sigmoid function \(\sigma : \mathbb{R}\to (0, 1), z \mapsto (1+\exp(-z))^{-1}\); we aim to learn weights \(\vec{w}\) such that \(p(y=1|\vec{x})=\sigma(\vec{x}^{\top}\vec{w})\)

Since this prediction is in \((0, 1)\), not \(\set{0, 1}\), we must convert \((0, 1)\to\set{0, 1}\). Generally, if \(f(\vec{x})>0.5\), it is classified as \(1\), otherwise \(0\). So, the function we are using is \(\text{round}\).

The distribution of the actual probabilities is still useful; if most of \(f\)'s values are above \(0.9\) or below \(0.1\), the model is more confident than if most values hover around \(0.5\), even if the "cutoff point" is the same.

MLE For Logistic Regression

Like in any MLE estimation, we choose \(f\) to maximize \(p(\mathcal{D}|f)\) for dataset \(\mathcal{D}\), so our objective (also called the cross-entropy) is \(c(\vec{w}) = \dfrac{1}{n} \displaystyle\sum\limits_{i=1}^{n}-\ln{p(y_i | \vec{x_i})}\). We find that the partial derivative of the per-sample cost with respect to \(w_j\) is \((p_i-y_i)x_{ij}\), where \(p_i=\sigma(\vec{x_i}^{\top}\vec{w})\)

We can also incorporate a Gaussian or Laplace prior to derive a regularized version: for a Gaussian prior (\(\ell_2\) regularization), we find the gradient descent rule \(\vec{w_{t+1}}=\vec{w_t}-\eta_t (\sigma(\vec{x_i}^{\top}\vec{w_t})-y_i)\vec{x_i} - \eta_{t}\dfrac{\lambda}{n}\vec{w}_t\)
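
A minimal sketch of this \(\ell_2\)-regularized update rule (NumPy, single-sample updates, a made-up decaying stepsize schedule):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sgd(X, y, lam=0.1, epochs=100):
    # X: (n, d+1) array whose first column is the bias feature 1; y: labels in {0, 1}
    n, k = X.shape
    w = np.zeros(k)
    for t in range(1, epochs + 1):
        eta = 1.0 / t                                    # decaying stepsize
        for i in np.random.permutation(n):
            p = sigmoid(X[i] @ w)                        # predicted P(y = 1 | x_i)
            w = w - eta * (p - y[i]) * X[i] - eta * (lam / n) * w   # gradient + penalty
    return w
```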

Logistic Regression for Linear Classification

Because \(\vec{x}^{\top}\vec{w}=0\) represents a hyperplane that splits the input space into two regions, logistic regression is a linear classifier (even though \(\vec{w}\) was learned with the non-linear \(\sigma\)).

Aside: this is what classifiers do: partition the input space into equivalence classes. If a classifier is linear, this boundary is shaped like a hyperplane, but it can (and often should) be shaped differently.

Issues with Minimizing MLE

Why didn't we directly try to minimize \(c(\vec{w}) = \dfrac{1}{n} \displaystyle\sum\limits_{i=1}^{n}(\sigma(\vec{x_i}^{\top}\vec{w})-y_i)^2\) (or the equivalent with a different cost function) instead of this Bernoulli skedaddling? Because that is a non-convex optimization problem (whereas the cross-entropy objective above is convex in \(\vec{w}\)).

Chapter 12 - Bayesian Linear Regression

In the previous chapters, we used point estimates MLE and MAP to characterize linear regression. In this chapter, we use Bayesian estimation to do the same, namely Bayesian estimation of the conditional distribution \(p(y|\vec{x})\).

The following derivations consider the univariate case with one weight \(w\) for simplicity.

Posterior Distribution \(p(w|\mathcal{D})\) for Known Noise Variance

Once again, we make the assumption that \(p(y|x)\sim \mathcal{N}(\mu=xw,\sigma^2)\) and that each weight \(w_i\) has a Gaussian prior \(p(w_i)\sim \mathcal{N}\left(0, \dfrac{\sigma^{2}}{\lambda}\right)\) (we take the prior \(p(w)\) to be conjugate to the likelihood \(p(\mathcal{D}|w)\), i.e. Gaussian in this case). From Bayes' rule, \(p(w|\mathcal{D})=\dfrac{p(\mathcal{D}|w)p(w)}{p(\mathcal{D})}=\displaystyle \frac{p(w)\prod_{i=1}^{n}p(y_i|x_i, w)}{\int p(w)\prod_{i=1}^{n}p(y_i|x_i, w) \, dw}\).

We can derive that \(p(w|\mathcal{D})\sim \mathcal{N}(\mu_n, \sigma^2_n)\) where \(\sigma^2_n=\dfrac{\sigma^2}{\sum_{i=1}^{n}x_i^{2}+\lambda}\) and \(\mu_n=\dfrac{\sum_{i=1}^{n}x_i y_i +\lambda\mu _0}{\sum_{i=1}^{n}x_i^2+\lambda}\).

Since this is a Bayesian estimation, we can establish a credible interval with \(\sigma^2_n\), since this term gets smaller as \(n\to \infty\). Namely, for a \(\delta\) credible interval, we compute \(p(w\in [a, b]|\mathcal{D})=\delta\)
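
A minimal sketch computing \(\mu_n\), \(\sigma^2_n\), and a central credible interval under these assumptions (standard library only; the data, \(\sigma^2\), and \(\lambda\) are made up):

```python
from statistics import NormalDist

def posterior_and_interval(xs, ys, sigma2, lam, mu0=0.0, delta=0.95):
    # posterior p(w | D) = N(mu_n, sigma_n^2), for known noise variance sigma2
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sigma2_n = sigma2 / (sxx + lam)
    mu_n = (sxy + lam * mu0) / (sxx + lam)
    post = NormalDist(mu_n, sigma2_n ** 0.5)
    # central interval containing w with posterior probability delta
    lo, hi = post.inv_cdf((1 - delta) / 2), post.inv_cdf((1 + delta) / 2)
    return mu_n, sigma2_n, (lo, hi)

print(posterior_and_interval(xs=[0.0, 1.0, 2.0], ys=[0.1, 2.1, 3.9], sigma2=1.0, lam=1.0))
```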

Posterior Distribution \(p(w|\mathcal{D})\) for Unknown Noise Variance

Generally, we don't know the variance \(\sigma^2\) when taking a linear regression since we don't understand the data's noise before analyzing the data itself.

If the parameters of our prior's distribution are \(\mu_0\in \mathbb{R}\) and \(\lambda_0, a_0, b_0 > 0\), we find that the posterior over the weight and the noise variance is \(p([w, \sigma^2]|\mathcal{D})=\text{NIG}(\mu_n, \lambda_n, a_n, b_n)\), where \(\lambda_n=\displaystyle\sum\limits_{i=1}^{n}x^2_i+\lambda_0\), \(\mu_n=\dfrac{\sum_{i=1}^{n}x_i y_i+\lambda_0 \mu_0}{\lambda_n}\), \(a_n=a_0+\frac{1}{2}n\), \(b_n=b_0+ \dfrac{1}{2}\left(\displaystyle\sum\limits_{i=1}^{n}y^{2}_i+\lambda_0\mu_{0}^{2}-\lambda_n\mu_n^2\right)\)

Under NIG, the variance on the choice for the weight \(w\) is given by \(\dfrac{b_n}{{(a_n -1)}\lambda_n}\)

The Posterior Predictive Distribution

Why go to the trouble of using Bayesian Linear Regression?

It is more useful to reason about the variability of the predictions made using \(w\), rather than the variability of \(w\) itself under a particular distribution. Specifically, we want to know the distribution of the prediction given \(x\) and \(\mathcal{D}\), i.e. \(p(y \mid x, \mathcal{D})\)

The posterior predictive distribution \(p(y|x,\mathcal{D})=\displaystyle\int_{w}p(w|\mathcal{D})\,p(y|[x, w])\, dw\) is a method of model averaging that weights each model (value of \(w\)) proportionally to how probable it is under the posterior.