-
Introduction
-
Types
-
Supervised Learning
-
Classification
- binary classification
- multiclass classification
- Regression
- Unsupervised Learning
- Reinforcement Learning
-
Concepts
- Parametric vs non-parametric models
- The curse of dimensionality
- Overfitting
-
Model selection
- cross validation (CV); a k-fold sketch follows this list
- No free lunch theorem
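A minimal k-fold CV sketch, assuming numpy arrays X, y and any model object with fit/predict methods (all names here are illustrative, not a library API):

```python
import numpy as np

def k_fold_cv_score(model, X, y, k=5, seed=0):
    """Average held-out accuracy over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)  # random partition
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train], y[train])                   # fit on k-1 folds
        scores.append(np.mean(model.predict(X[test]) == y[test]))
    return np.mean(scores)
```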
-
Probability
-
Interpretations
-
Frequentist
- probabilities represent long-run frequencies of events
-
Bayesian
- probability is used to quantify our uncertainty about something
- can model uncertainty about events that do not have long-run frequencies
-
Concepts
-
Discrete random variables
- Probability mass function, pmf
- state space
- indicator function
-
Fundamental rules
- product rule
- sum rule
- Bayes rule
- Independence and conditional independence
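In standard notation, the three rules read:

```latex
\begin{align}
p(X)        &= \textstyle\sum_Y p(X, Y)        && \text{(sum rule)} \\
p(X, Y)     &= p(X \mid Y)\, p(Y)              && \text{(product rule)} \\
p(Y \mid X) &= \frac{p(X \mid Y)\, p(Y)}{p(X)} && \text{(Bayes rule)}
\end{align}
```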
-
Continuous random variables
- cumulative distribution function, cdf
- probability density function, pdf
- Quantiles
- Mean and variance
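The standard definitions of these quantities, for reference:

```latex
\begin{align}
F(x) &= P(X \le x), \qquad p(x) = \tfrac{d}{dx} F(x), \qquad
x_\alpha = F^{-1}(\alpha) \;\text{(the $\alpha$ quantile)} \\
\mathbb{E}[X] &= \int x\, p(x)\, dx, \qquad
\operatorname{var}[X] = \mathbb{E}\big[X^2\big] - \big(\mathbb{E}[X]\big)^2
\end{align}
```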
-
Some common discrete distributions
-
Binomial
- Bin(n, θ)
-
Bernoulli
- Ber(θ)
-
Multinomial
- Mu(n, θ)
-
Multinoulli
- Cat(θ)
- The empirical distribution
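The pmfs listed above are all available in scipy.stats; a sketch (scipy has no separate multinoulli, so Cat(θ) is written as a multinomial with n = 1):

```python
from scipy import stats

print(stats.binom.pmf(3, n=10, p=0.5))   # Bin(n, theta) pmf at k = 3
print(stats.bernoulli.pmf(1, p=0.3))     # Ber(theta)
print(stats.multinomial.pmf([2, 1, 1], n=4, p=[0.5, 0.25, 0.25]))  # Mu(n, theta)
print(stats.multinomial.pmf([0, 1, 0], n=1, p=[0.5, 0.25, 0.25]))  # Cat(theta)
```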
-
Some common continuous distributions
-
Gaussian (normal) distribution
- N(μ, σ²)
-
Laplace distribution
- Lap(μ, b)
-
The gamma distribution
- Ga(a, b)
- gamma function, Γ(a)
-
The beta distribution
- Beta(a, b)
-
Pareto distribution
- Pareto(k, m)
- long tails
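The listed densities via scipy.stats (a sketch; beware scipy's parameterizations, which differ from the book's: norm takes scale = σ rather than σ², gamma takes scale = 1/b for rate b, and pareto's form differs slightly from Pareto(k, m)):

```python
from scipy import stats

print(stats.norm.pdf(0.0, loc=0.0, scale=1.0))       # N(mu, sigma^2)
print(stats.laplace.pdf(0.0, loc=0.0, scale=1.0))    # Lap(mu, b)
print(stats.gamma.pdf(1.0, a=2.0, scale=1.0 / 3.0))  # Ga(a, b) with rate b = 3
print(stats.beta.pdf(0.5, a=2.0, b=2.0))             # Beta(a, b)
print(stats.pareto.pdf(2.0, b=3.0))                  # Pareto, shape b (scipy form)
```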
-
Joint probability distributions
- Covariance and correlation
- Multivariate Gaussian, Multivariate Normal (MVN)
- Multivariate Student t distribution
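For reference, the MVN density in D dimensions:

```latex
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) =
\frac{1}{(2\pi)^{D/2}\, |\boldsymbol{\Sigma}|^{1/2}}
\exp\!\Big( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top}
\boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Big)
```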
-
Dirichlet distribution
- Dir(x|α)
- Transformations of random variables
-
Monte Carlo approximation
-
Information theory
-
Entropy
- a measure of a random variable's uncertainty (computed in the sketch at the end of this section)
-
KL divergence/Relative Entropy
- a measure of the dissimilarity of two probability distributions
-
Cross Entropy
-
Mutual information
-
Conditional Entropy
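For discrete distributions these quantities relate via KL(p‖q) = H(p, q) − H(p), and mutual information is the special case I(X; Y) = KL(p(x, y) ‖ p(x) p(y)). A numpy sketch of the first three:

```python
import numpy as np

def entropy(p):
    """H(p) in bits; zero-probability terms contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log2 q_k (assumes q_k > 0 wherever p_k > 0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def kl_divergence(p, q):
    """KL(p || q) = H(p, q) - H(p) >= 0, with equality iff p == q."""
    return cross_entropy(p, q) - entropy(p)

p, q = [0.5, 0.5], [0.9, 0.1]
print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))
```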
-
Generative Models for Discrete Data
-
Bayesian concept learning
- Likelihood
- Prior
- Posterior
- MLE
- MAP
- The beta-binomial model (conjugate update sketched below)
- The Dirichlet-multinomial model
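The beta-binomial update in a few lines: with a Beta(a, b) prior on θ and N1 successes / N0 failures observed, the posterior is Beta(a + N1, b + N0), so the posterior mean, MAP, and MLE all come out in closed form (the prior and data values below are assumptions for illustration):

```python
a, b = 2.0, 2.0   # Beta prior pseudo-counts (assumed values)
N1, N0 = 7, 3     # observed successes / failures (assumed data)

post_a, post_b = a + N1, b + N0           # conjugacy: posterior is Beta
posterior_mean = post_a / (post_a + post_b)
map_estimate = (post_a - 1) / (post_a + post_b - 2)
mle = N1 / (N1 + N0)
print(posterior_mean, map_estimate, mle)
```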
-
Naive Bayes classifiers
- Feature selection using mutual information
- Gaussian models
- Bayesian statistics
- Frequentist statistics
- Linear regression
- Logistic Regression
- Generalized linear models and the exponential family
- Directed graphical models (Bayes nets)
- Mixture models and the EM algorithm
- Latent linear models
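For the "Naive Bayes classifiers" entry above, a from-scratch Bernoulli naive Bayes sketch (binary features, add-one smoothing; illustrative, not a library API):

```python
import numpy as np

class BernoulliNaiveBayes:
    """Binary-feature naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.log_prior = np.log([np.mean(y == c) for c in self.classes])
        # theta[c, j] = p(x_j = 1 | y = c), from smoothed counts
        self.theta = np.array([(X[y == c].sum(axis=0) + 1.0)
                               / ((y == c).sum() + 2.0)
                               for c in self.classes])
        return self

    def predict(self, X):
        # log p(y = c | x) up to a constant: log prior + sum_j log p(x_j | c)
        log_lik = (X @ np.log(self.theta).T
                   + (1 - X) @ np.log(1 - self.theta).T)
        return self.classes[np.argmax(self.log_prior + log_lik, axis=1)]
```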
-
Sparse linear models
- feature selection / sparsity
-
Kernels
-
Introduction
- it is not always clear how best to represent some kinds of objects as fixed-size feature vectors
-
deep learning
- define a generative model for the data, and use the inferred latent representation and/or the parameters of the model as features
-
kernel function
- measures the similarity between objects without requiring them to be preprocessed into feature-vector format (see the RBF sketch below)
- Support vector machines (SVMs)
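One widely used kernel is the squared-exponential (RBF) kernel, k(x, x′) = exp(−‖x − x′‖² / (2ℓ²)); a numpy sketch, where ell is the length-scale:

```python
import numpy as np

def rbf_kernel(X1, X2, ell=1.0):
    """Gram matrix K[i, j] = k(X1[i], X2[j])."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2 * X1 @ X2.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2 * ell**2))
```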
-
Gaussian processes
-
Introduction
- previously we inferred p(θ|D); GPs instead infer p(f|D)
- Bayesian inference over functions themselves
-
Gaussian processes or GPs
- defines a prior over functions, which can be converted into a posterior over functions once we have seen some data
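A minimal GP regression sketch under that view: with a squared-exponential prior kernel (repeated here so the block is self-contained) and Gaussian observation noise, the posterior over f at test points is Gaussian with closed-form mean and covariance:

```python
import numpy as np

def sq_exp_kernel(X1, X2, ell=1.0):
    """Squared-exponential Gram matrix (same kernel as the Kernels sketch)."""
    d = (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
         - 2 * X1 @ X2.T)
    return np.exp(-np.maximum(d, 0.0) / (2 * ell**2))

def gp_posterior(X_train, y_train, X_test, ell=1.0, noise=0.1):
    """Posterior mean and covariance of f(X_test) given noisy y_train."""
    K = sq_exp_kernel(X_train, X_train, ell) + noise**2 * np.eye(len(X_train))
    K_s = sq_exp_kernel(X_train, X_test, ell)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = sq_exp_kernel(X_test, X_test, ell) - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov
```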
-
Adaptive basis function models
-
adaptive basis-function model (ABM)
- dispense with kernels altogether, and try to learn useful features φ(x) directly from the input data
- Boosting
- Ensemble learning
-
Markov and hidden Markov models
- probabilistic models for sequences of observations
- Markov models
- Hidden Markov models
-
State space models
-
state space model or SSM
- just like an HMM, except the hidden states are continuous
-
Undirected graphical models (Markov random fields)
-
Introduction
- undirected graphical model (UGM), also called a Markov random field (MRF) or Markov network
-
Advantages
- they are symmetric and therefore more “natural” for certain domains
- discriminative UGMs, which define conditional densities of the form p(y|x), can work better than discriminative DGMs
-
Disadvantages
- the parameters are less interpretable and less modular
- parameter estimation is computationally more expensive
- Markov random field (MRF)
- Conditional random fields (CRFs)
- Structural SVMs
-
Exact inference for graphical models
-
Introduction
- forwards-backwards algorithm
- generalize these exact inference algorithms to arbitrary graphs
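As a concrete instance, the (normalized) forwards pass for an HMM computes the filtered posteriors p(z_t | x_{1:t}); a numpy sketch with assumed inputs:

```python
import numpy as np

def hmm_forward(A, B, pi):
    """Filtered posteriors alpha[t] = p(z_t | x_{1:t}).

    A:  (K, K) transitions, A[i, j] = p(z_t = j | z_{t-1} = i)
    B:  (T, K) emission likelihoods, B[t, k] = p(x_t | z_t = k)
    pi: (K,)  initial state distribution
    """
    T, K = B.shape
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = B[t] * (alpha[t - 1] @ A)  # predict, then update
        alpha[t] /= alpha[t].sum()            # normalize for stability
    return alpha
```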
-
Variational inference
-
Introduction
- approximate inference methods
-
variational inference
- reduces inference to an optimization problem (maximizing the ELBO; see the identity below)
- often gives us the speed benefits of MAP estimation but the statistical benefits of the Bayesian approach
- More variational inference
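The "inference as optimization" view in one identity: since log p(D) is fixed, maximizing the evidence lower bound L(q) is equivalent to minimizing KL(q(θ) ‖ p(θ|D)):

```latex
\log p(\mathcal{D})
  = \underbrace{\mathbb{E}_{q(\theta)}\!\left[\log p(\theta, \mathcal{D})
      - \log q(\theta)\right]}_{\text{ELBO } \mathcal{L}(q)}
  + \mathrm{KL}\!\left(q(\theta) \,\|\, p(\theta \mid \mathcal{D})\right)
```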
-
Monte Carlo inference
-
Introduction
-
Monte Carlo approximation
- generate some (unweighted) samples from the posterior
- compute any quantity of interest
- non-iterative methods
- iterative methods
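A minimal sketch of the core idea: draw samples, then average a function of them to approximate its expectation:

```python
import numpy as np

# Monte Carlo estimate of E[x^2] for x ~ N(0, 1) (true value is 1).
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
print(np.mean(samples**2))  # ~= 1.0
```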
-
Markov chain Monte Carlo (MCMC) inference
- Gibbs sampling
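A Gibbs sampling sketch for a standard bivariate Gaussian with correlation rho: alternate draws from the exact full conditionals, each of which is univariate Gaussian:

```python
import numpy as np

def gibbs_bivariate_gaussian(rho=0.8, n_samples=5000, seed=0):
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    samples = np.zeros((n_samples, 2))
    sd = np.sqrt(1 - rho**2)  # conditional std for unit-variance marginals
    for i in range(n_samples):
        x1 = rng.normal(rho * x2, sd)  # x1 | x2 ~ N(rho*x2, 1 - rho^2)
        x2 = rng.normal(rho * x1, sd)  # x2 | x1 ~ N(rho*x1, 1 - rho^2)
        samples[i] = x1, x2
    return samples

print(np.corrcoef(gibbs_bivariate_gaussian().T)[0, 1])  # ~= 0.8
```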
-
Clustering
-
Introduction
-
Clustering
- the process of grouping similar objects together
- flat clustering, also called partitional clustering (K-means sketch after this list)
- hierarchical clustering
- Graphical model structure learning
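K-means, the classic flat clustering algorithm, as a numpy sketch: alternate assigning points to the nearest centroid and recomputing centroids until the centroids stop moving:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids (keep the old one if a cluster goes empty)
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j)
             else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```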
-
Latent variable models for discrete data
-
Introduction
- symbols or tokens
- bag of words
- Distributed state LVMs for discrete data
-
Latent Dirichlet allocation (LDA)
-
Quantitatively evaluating LDA as a language model
- Perplexity
- Fitting using (collapsed) Gibbs sampling
- Fitting using batch variational inference
- Fitting using online variational inference
- Determining the number of topics
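A hedged scikit-learn sketch covering several of the items above: LatentDirichletAllocation supports batch and online variational inference via learning_method, and perplexity gives the quantitative score (the count matrix here is synthetic, standing in for a vectorized corpus):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic document-term count matrix (rows = docs, cols = vocab terms).
rng = np.random.default_rng(0)
X = rng.poisson(0.5, size=(200, 100))

# learning_method='batch' -> batch VI; 'online' -> online/stochastic VI
lda = LatentDirichletAllocation(n_components=10, learning_method='online',
                                random_state=0)
lda.fit(X)
print(lda.perplexity(X))  # lower is better; vary n_components to choose K
```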
-
Extensions of LDA
- Correlated topic model
- Dynamic topic model
- LDA-HMM
- Supervised LDA
-
Deep Learning
- Introduction
- Deep generative models
- Deep neural networks