-
Introduction
-
Types
-
Supervised Learning
-
Classification
- binary classification
- multiclass classification
- Regression
- Unsupervised Learning
- Reinforcement Learning
-
Concepts
- Parametric vs non-parametric models
- The curse of dimensionality
- Overfitting
-
Model selection
- cross validation (CV); a k-fold sketch follows this list
- No free lunch theorem
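A minimal k-fold CV sketch, assuming numpy arrays X, y and any model object with fit/predict methods (all names here are illustrative, not a library API):

```python
import numpy as np

def k_fold_cv_score(model, X, y, k=5, seed=0):
    """Average held-out accuracy over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)  # random partition
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train], y[train])                   # fit on k-1 folds
        scores.append(np.mean(model.predict(X[test]) == y[test]))
    return np.mean(scores)
```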
-
Probability
-
Interpretations
-
Frequentist
- probabilities represent long-run frequencies of events
-
Bayesian
- probability is used to quantify our uncertainty about something
- can model uncertainty about events that do not have long-run frequencies
-
Concepts
-
Discrete random variables
- Probability mass function, pmf
- state space
- indicator function
-
Fundamental rules
- product rule
- sum rule
- Bayes rule
- Independence and conditional independence
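In standard notation, the three rules read:

```latex
\begin{align}
p(X)        &= \textstyle\sum_Y p(X, Y)        && \text{(sum rule)} \\
p(X, Y)     &= p(X \mid Y)\, p(Y)              && \text{(product rule)} \\
p(Y \mid X) &= \frac{p(X \mid Y)\, p(Y)}{p(X)} && \text{(Bayes rule)}
\end{align}
```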
-
Continuous random variables
- cumulative distribution function, cdf
- probability density function, pdf
- Quantiles
- Mean and variance
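The standard definitions of these quantities, for reference:

```latex
\begin{align}
F(x) &= P(X \le x), \qquad p(x) = \tfrac{d}{dx} F(x), \qquad
x_\alpha = F^{-1}(\alpha) \;\text{(the $\alpha$ quantile)} \\
\mathbb{E}[X] &= \int x\, p(x)\, dx, \qquad
\operatorname{var}[X] = \mathbb{E}\big[X^2\big] - \big(\mathbb{E}[X]\big)^2
\end{align}
```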
-
Some common discrete distributions
-
Binomial
- Bin(n, θ)
-
Bernoulli
- Ber(θ)
-
Multinomial
- Mu(n, θ)
-
Multinoulli
- Cat(θ)
- The empirical distribution
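The pmfs listed above are all available in scipy.stats; a sketch (scipy has no separate multinoulli, so Cat(θ) is written as a multinomial with n = 1):

```python
from scipy import stats

print(stats.binom.pmf(3, n=10, p=0.5))   # Bin(n, theta) pmf at k = 3
print(stats.bernoulli.pmf(1, p=0.3))     # Ber(theta)
print(stats.multinomial.pmf([2, 1, 1], n=4, p=[0.5, 0.25, 0.25]))  # Mu(n, theta)
print(stats.multinomial.pmf([0, 1, 0], n=1, p=[0.5, 0.25, 0.25]))  # Cat(theta)
```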
-
Some common continuous distributions
-
Gaussian (normal) distribution
- N(μ, σ²)
-
Laplace distribution
- Lap(μ, b)
-
The gamma distribution
- Ga(a, b)
- gamma function, Γ(a)
-
The beta distribution
- Beta(a, b)
-
Pareto distribution
- Pareto(k, m)
- long tails
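The listed densities via scipy.stats (a sketch; beware scipy's parameterizations, which differ from the book's: norm takes scale = σ rather than σ², gamma takes scale = 1/b for rate b, and pareto's form differs slightly from Pareto(k, m)):

```python
from scipy import stats

print(stats.norm.pdf(0.0, loc=0.0, scale=1.0))       # N(mu, sigma^2)
print(stats.laplace.pdf(0.0, loc=0.0, scale=1.0))    # Lap(mu, b)
print(stats.gamma.pdf(1.0, a=2.0, scale=1.0 / 3.0))  # Ga(a, b) with rate b = 3
print(stats.beta.pdf(0.5, a=2.0, b=2.0))             # Beta(a, b)
print(stats.pareto.pdf(2.0, b=3.0))                  # Pareto, shape b (scipy form)
```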
-
Joint probability distributions
- Covariance and correlation
- Multivariate Gaussian, Multivariate Normal (MVN)
- Multivariate Student t distribution
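For reference, the MVN density in D dimensions:

```latex
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) =
\frac{1}{(2\pi)^{D/2}\, |\boldsymbol{\Sigma}|^{1/2}}
\exp\!\Big( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top}
\boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Big)
```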
-
Dirichlet distribution
- Dir(x|α)
- Transformations of random variables
-
Monte Carlo approximation
-
Information theory
-
Entropy
- a measure of a random variable's uncertainty (computed in the sketch at the end of this section)
-
KL divergence/Relative Entropy
- a measure of the dissimilarity of two probability distributions
-
Cross Entropy
-
Mutual information
-
Conditional Entropy
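For discrete distributions these quantities relate via KL(p‖q) = H(p, q) − H(p), and mutual information is the special case I(X; Y) = KL(p(x, y) ‖ p(x) p(y)). A numpy sketch of the first three:

```python
import numpy as np

def entropy(p):
    """H(p) in bits; zero-probability terms contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log2 q_k (assumes q_k > 0 wherever p_k > 0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def kl_divergence(p, q):
    """KL(p || q) = H(p, q) - H(p) >= 0, with equality iff p == q."""
    return cross_entropy(p, q) - entropy(p)

p, q = [0.5, 0.5], [0.9, 0.1]
print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))
```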
-
Generative Models for Discrete Data
-
Bayesian concept learning
- Likelihood
- Prior
- Posterior
- MLE
- MAP
- The beta-binomial model (conjugate update sketched below)
- The Dirichlet-multinomial model
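The beta-binomial update in a few lines: with a Beta(a, b) prior on θ and N1 successes / N0 failures observed, the posterior is Beta(a + N1, b + N0), so the posterior mean, MAP, and MLE all come out in closed form (the prior and data values below are assumptions for illustration):

```python
a, b = 2.0, 2.0   # Beta prior pseudo-counts (assumed values)
N1, N0 = 7, 3     # observed successes / failures (assumed data)

post_a, post_b = a + N1, b + N0           # conjugacy: posterior is Beta
posterior_mean = post_a / (post_a + post_b)
map_estimate = (post_a - 1) / (post_a + post_b - 2)
mle = N1 / (N1 + N0)
print(posterior_mean, map_estimate, mle)
```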
-
Naive Bayes classifiers
- Feature selection using mutual information
- Gaussian models
- Bayesian statistics
- Frequentist statistics
- Linear regression
- Logistic Regression
- Generalized linear models and the exponential family
- Directed graphical models (Bayes nets)
- Mixture models and the EM algorithm
- Latent linear models
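For the "Naive Bayes classifiers" entry above, a from-scratch Bernoulli naive Bayes sketch (binary features, add-one smoothing; illustrative, not a library API):

```python
import numpy as np

class BernoulliNaiveBayes:
    """Binary-feature naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.log_prior = np.log([np.mean(y == c) for c in self.classes])
        # theta[c, j] = p(x_j = 1 | y = c), from smoothed counts
        self.theta = np.array([(X[y == c].sum(axis=0) + 1.0)
                               / ((y == c).sum() + 2.0)
                               for c in self.classes])
        return self

    def predict(self, X):
        # log p(y = c | x) up to a constant: log prior + sum_j log p(x_j | c)
        log_lik = (X @ np.log(self.theta).T
                   + (1 - X) @ np.log(1 - self.theta).T)
        return self.classes[np.argmax(self.log_prior + log_lik, axis=1)]
```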
-
Sparse linear models
- feature selection / sparsity
-
Kernels
-
Introduction
- it is not always clear how best to represent some kinds of objects as fixed-size feature vectors
-
deep learning
- define a generative model for the data, and use the inferred latent representation and/or the parameters of the model as features
-
kernel function
- measures the similarity between objects without requiring them to be preprocessed into feature-vector format (see the RBF sketch below)
- Support vector machines (SVMs)
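One widely used kernel is the squared-exponential (RBF) kernel, k(x, x′) = exp(−‖x − x′‖² / (2ℓ²)); a numpy sketch, where ell is the length-scale:

```python
import numpy as np

def rbf_kernel(X1, X2, ell=1.0):
    """Gram matrix K[i, j] = k(X1[i], X2[j])."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2 * X1 @ X2.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2 * ell**2))
```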
-
Gaussian processes
-
Introduction
- previously we inferred p(θ|D); GPs instead infer p(f|D)
- Bayesian inference over functions themselves
-
Gaussian processes or GPs
- defines a prior over functions, which can be converted into a posterior over functions once we have seen some data
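A minimal GP regression sketch under that view: with a squared-exponential prior kernel (repeated here so the block is self-contained) and Gaussian observation noise, the posterior over f at test points is Gaussian with closed-form mean and covariance:

```python
import numpy as np

def sq_exp_kernel(X1, X2, ell=1.0):
    """Squared-exponential Gram matrix (same kernel as the Kernels sketch)."""
    d = (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
         - 2 * X1 @ X2.T)
    return np.exp(-np.maximum(d, 0.0) / (2 * ell**2))

def gp_posterior(X_train, y_train, X_test, ell=1.0, noise=0.1):
    """Posterior mean and covariance of f(X_test) given noisy y_train."""
    K = sq_exp_kernel(X_train, X_train, ell) + noise**2 * np.eye(len(X_train))
    K_s = sq_exp_kernel(X_train, X_test, ell)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = sq_exp_kernel(X_test, X_test, ell) - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov
```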
-
Adaptive basis function models
-
adaptive basis-function model (ABM)
- dispense with kernels altogether, and try to learn useful features φ(x) directly from the input data
- Boosting
- Ensemble learning
-
Markov and hidden Markov models
- probabilistic models for sequences of observations
- Markov models
- Hidden Markov models
-
State space models
-
state space model or SSM
- just like an HMM, except the hidden states are continuous
-
Undirected graphical models (Markov random fields)
-
Introduction
- undirected graphical model (UGM), also called a Markov random field (MRF) or Markov network
-
Advantages
- they are symmetric and therefore more “natural” for certain domains
- discriminative UGMs, which define conditional densities of the form p(y|x), can work better than discriminative DGMs
-
Disadvantages
- the parameters are less interpretable and less modular
- parameter estimation is computationally more expensive
- Markov random field (MRF)
- Conditional random fields (CRFs)
- Structural SVMs
-
Exact inference for graphical models
-
Introduction
- forwards-backwards algorithm
- generalize these exact inference algorithms to arbitrary graphs
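As a concrete instance, the (normalized) forwards pass for an HMM computes the filtered posteriors p(z_t | x_{1:t}); a numpy sketch with assumed inputs:

```python
import numpy as np

def hmm_forward(A, B, pi):
    """Filtered posteriors alpha[t] = p(z_t | x_{1:t}).

    A:  (K, K) transitions, A[i, j] = p(z_t = j | z_{t-1} = i)
    B:  (T, K) emission likelihoods, B[t, k] = p(x_t | z_t = k)
    pi: (K,)  initial state distribution
    """
    T, K = B.shape
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = B[t] * (alpha[t - 1] @ A)  # predict, then update
        alpha[t] /= alpha[t].sum()            # normalize for stability
    return alpha
```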
-
Variational inference
-
Introduction
- approximate inference methods
-
variational inference
- reduces inference to an optimization problem (maximizing the ELBO; see the identity below)
- often gives us the speed benefits of MAP estimation but the statistical benefits of the Bayesian approach
- More variational inference
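The "inference as optimization" view in one identity: since log p(D) is fixed, maximizing the evidence lower bound L(q) is equivalent to minimizing KL(q(θ) ‖ p(θ|D)):

```latex
\log p(\mathcal{D})
  = \underbrace{\mathbb{E}_{q(\theta)}\!\left[\log p(\theta, \mathcal{D})
      - \log q(\theta)\right]}_{\text{ELBO } \mathcal{L}(q)}
  + \mathrm{KL}\!\left(q(\theta) \,\|\, p(\theta \mid \mathcal{D})\right)
```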
-
Monte Carlo inference
-
Introduction
-
Monte Carlo approximation
- generate some (unweighted) samples from the posterior
- compute any quantity of interest
- non-iterative methods
- iterative methods
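A minimal sketch of the core idea: draw samples, then average a function of them to approximate its expectation:

```python
import numpy as np

# Monte Carlo estimate of E[x^2] for x ~ N(0, 1) (true value is 1).
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
print(np.mean(samples**2))  # ~= 1.0
```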
-
Markov chain Monte Carlo (MCMC) inference
- Gibbs sampling
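A Gibbs sampling sketch for a standard bivariate Gaussian with correlation rho: alternate draws from the exact full conditionals, each of which is univariate Gaussian:

```python
import numpy as np

def gibbs_bivariate_gaussian(rho=0.8, n_samples=5000, seed=0):
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    samples = np.zeros((n_samples, 2))
    sd = np.sqrt(1 - rho**2)  # conditional std for unit-variance marginals
    for i in range(n_samples):
        x1 = rng.normal(rho * x2, sd)  # x1 | x2 ~ N(rho*x2, 1 - rho^2)
        x2 = rng.normal(rho * x1, sd)  # x2 | x1 ~ N(rho*x1, 1 - rho^2)
        samples[i] = x1, x2
    return samples

print(np.corrcoef(gibbs_bivariate_gaussian().T)[0, 1])  # ~= 0.8
```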
-
Clustering
-
Introduction
-
Clustering
- the process of grouping similar objects together
- flat clustering, also called partitional clustering (K-means sketch after this list)
- hierarchical clustering
- Graphical model structure learning
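K-means, the classic flat clustering algorithm, as a numpy sketch: alternate assigning points to the nearest centroid and recomputing centroids until the centroids stop moving:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids (keep the old one if a cluster goes empty)
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j)
             else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```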
-
Latent variable models for discrete data
-
Introduction
- symbols or tokens
- bag of words
- Distributed state LVMs for discrete data
-
Latent Dirichlet allocation (LDA)
-
Quantitatively evaluating LDA as a language model
- Perplexity
- Fitting using (collapsed) Gibbs sampling
- Fitting using batch variational inference
- Fitting using online variational inference
- Determining the number of topics
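A hedged scikit-learn sketch covering several of the items above: LatentDirichletAllocation supports batch and online variational inference via learning_method, and perplexity gives the quantitative score (the count matrix here is synthetic, standing in for a vectorized corpus):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic document-term count matrix (rows = docs, cols = vocab terms).
rng = np.random.default_rng(0)
X = rng.poisson(0.5, size=(200, 100))

# learning_method='batch' -> batch VI; 'online' -> online/stochastic VI
lda = LatentDirichletAllocation(n_components=10, learning_method='online',
                                random_state=0)
lda.fit(X)
print(lda.perplexity(X))  # lower is better; vary n_components to choose K
```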
-
Extensions of LDA
- Correlated topic model
- Dynamic topic model
- LDA-HMM
- Supervised LDA
-
Deep Learning
- Introduction
- Deep generative models
- Deep neural networks