1. Mathematical Foundations -- Refresher
- Linear Algebra
- Applied Probability
- Differential Equations and Calculus
- Optimization Theory

1. Foundations of Learning
- Formal Learning Model
- Generalization and Overfitting
- Empirical Risk Minimization (ERM) -- stated formally below
- ERM with Inductive Bias
- Bounding the probability of error -- confidence and accuracy
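
For reference, the ERM rule named above can be stated in one line; this is the standard formulation (notation mine, and it may differ from the course slides):

```latex
L_S(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\!\left[ h(x_i) \neq y_i \right],
\qquad
h_{\mathrm{ERM}} \in \operatorname*{arg\,min}_{h \in \mathcal{H}} L_S(h)
```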

2. PAC Learning
- PAC Learning Framework
- VC Dimension
- Sample Complexity -- a reference bound is stated below
- Learnability Conditions
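
As a reference point for the sample-complexity bullet above, the standard realizable-case PAC bound for a finite hypothesis class (a standard result; the lecture's notation may differ):

```latex
m \;\ge\; \frac{1}{\epsilon} \left( \ln\lvert\mathcal{H}\rvert + \ln\frac{1}{\delta} \right)
\;\;\Longrightarrow\;\;
\Pr\!\left[ L_{\mathcal{D}}(h_{\mathrm{ERM}}) \le \epsilon \right] \ge 1 - \delta
```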

3. Linear Learning Models
- Linear decision boundary -- binary classifier
- Perceptron algorithm -- batch & stochastic (sketch below)
- Proof of convergence
- Inseparable case
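
A minimal sketch of the stochastic perceptron update (NumPy; labels in {-1, +1}; function and variable names are mine, not the lecture's):

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Stochastic perceptron; X is (n, d), labels y in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi             # rotate w toward the example
                b += yi
                mistakes += 1
        if mistakes == 0:                # separable case: converged
            break
    return w, b
```

In the inseparable case the inner loop never reaches zero mistakes, which is why the epoch cap (and, in lecture, the convergence proof's margin assumption) matters.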

4. Principal Component Analysis
- Linear Regression (LR), Least Mean Squares (LMS)
- Spectral theorem, similarity transform, eigenvectors, diagonalization, spectral factorization
- PCA -- quadratic forms, multivariate Gaussian density, isodensity surfaces, Principal Axes Theorem
- Eigenvectors and Eigenvalues
- Covariance Matrix -- diagonalizing the covariance matrix (sketch below), KL-transform, e.g. the bivariate case
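
A compact sketch of PCA as diagonalization of the sample covariance matrix, tying together the bullets above (NumPy; names are mine):

```python
import numpy as np

def pca(X, k):
    """Project X (n, d) onto its top-k principal axes."""
    Xc = X - X.mean(axis=0)            # center the data
    C = np.cov(Xc, rowvar=False)       # (d, d) sample covariance
    vals, vecs = np.linalg.eigh(C)     # eigh: C is symmetric
    order = np.argsort(vals)[::-1]     # eigenvalues in descending order
    W = vecs[:, order[:k]]             # top-k eigenvectors = principal axes
    return Xc @ W, vals[order]         # projected data, full spectrum
```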

5. Curse of Dimensionality
- The curse of dimensionality
- Volume in high-dimensional space -- Stirling approximation, hypersphere volume (sketch below)
- Example: Gaussian distributions in high-dimensional space
- Notes: CMU, Princeton
- Books: Bishop Ch. 1 (pp. 33-37), Hastie Ch. 2 (pp. 22-23), Hamming Ch. 9 (pp. 58-66)
- Slides: Lecture Slides
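
A quick illustration of the hypersphere-volume point above: the unit ball fills a vanishing fraction of its bounding cube as the dimension grows (here via the exact Gamma function rather than its Stirling approximation):

```python
from math import gamma, pi

def unit_ball_volume(d):
    """Volume of the unit d-ball: pi^(d/2) / Gamma(d/2 + 1)."""
    return pi ** (d / 2) / gamma(d / 2 + 1)

for d in (1, 2, 3, 10, 20, 50):
    # fraction of the enclosing cube [-1, 1]^d filled by the unit ball
    print(d, unit_ball_volume(d) / 2 ** d)
```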

6. Bayesian Decision Theory
- Review of probability distributions, random variables, joint and marginal probabilities
- Bayes Theorem, prior and posterior distributions (stated below)
- Decision Rules -- Continuous and Discrete Features
- Bayesian vs Frequentist Approaches
- General Bayesian Decision Theory
- Maximum A Posteriori (MAP) -- conjugate priors
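
For reference, Bayes' theorem and the resulting two-class decision rule (standard statement; notation may differ from the slides):

```latex
P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)},
\qquad
\text{decide } \omega_1 \text{ iff } P(\omega_1 \mid x) > P(\omega_2 \mid x)
```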

7. Parameter Estimation -- MLE
- Maximum Likelihood Estimation (MLE) -- conditional independence, MLE for Bernoulli and Gaussian distributions (sketch below), sample complexity & PAC learning
- MLE and KL-divergence -- Hartley's Information and Shannon's entropy, cross-entropy, KL-divergence minimization
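
The Bernoulli and Gaussian MLEs above have closed forms: the sample frequency and the sample mean/variance, obtained by setting the gradient of the log-likelihood to zero. A sketch on synthetic data (NumPy; the true parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli: the MLE of theta is the sample frequency of ones
coin = rng.binomial(1, 0.3, size=500)
theta_hat = coin.mean()

# Gaussian: MLEs are the sample mean and the (biased, 1/n) sample variance
z = rng.normal(loc=2.0, scale=1.5, size=500)
mu_hat, var_hat = z.mean(), z.var()    # np.var divides by n by default
```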

7. Parameter Estimation -- MAP & NB
- Maximum A Posteriori (MAP) Estimation
- MLE vs. MAP
- MAP for binomial and multinomial distributions
- Bayes rule -- AIDS test example (worked below with illustrative numbers)
- Naive Bayes classifier -- continuous and discrete features, text classification example
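
A worked version of the Bayes-rule test example above. The lecture's actual numbers aren't given here, so the prevalence and test accuracies below are illustrative assumptions only:

```python
prior = 0.001          # assumed disease prevalence (illustrative)
sensitivity = 0.99     # P(positive | infected), assumed
specificity = 0.99     # P(negative | healthy), assumed

# total probability of a positive test
p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)
# Bayes rule: P(infected | positive)
posterior = sensitivity * prior / p_pos
print(posterior)       # ~0.09: a positive test is still probably a false alarm
```

The punchline is the base-rate effect: with a rare condition, even a 99%-accurate test yields a posterior under 10%.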

8. Logistic Regression
- Naive Bayes recap -- Gaussian NB as a linear classifier, generative vs. discriminative classifiers
- Defining Logistic Regression -- Linear Fit to Log-Odds, softmax
- Solving Logistic Regression -- an alternative perspective on log-odds, the logistic sigmoid, MLE and the negative log-likelihood; Taylor expansion and the Newton-Raphson update for linear and logistic regression (sketch below)
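
A sketch of the Newton-Raphson (IRLS) update for logistic regression referenced above (NumPy; y in {0, 1}; the small damping term is my addition for numerical stability, not part of the derivation):

```python
import numpy as np

def logreg_newton(X, y, iters=10, damp=1e-6):
    """Newton-Raphson for logistic regression; X is (n, d), y in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # logistic sigmoid
        g = X.T @ (p - y)                    # gradient of the negative log-likelihood
        S = p * (1.0 - p)                    # diagonal Hessian weights
        H = X.T @ (X * S[:, None]) + damp * np.eye(X.shape[1])  # X^T S X
        w -= np.linalg.solve(H, g)           # Newton step: w <- w - H^{-1} g
    return w
```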

9. Kernel Density Estimation
- Density Estimation Basics -- non-parametric density estimation, histogram-based estimates, Parzen windows (sketch below), smooth kernels
- Bandwidth Selection, Bias-variance tradeoff (digression)
- Multivariate density estimation, Product kernels, Unimodal and Bimodal distribution KDE
- Applications of KDE
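
A minimal Parzen-window sketch with a Gaussian kernel and bandwidth h, as in the first bullet (NumPy; names mine):

```python
import numpy as np

def gaussian_kde(x_grid, samples, h):
    """f_hat(x) = (1 / nh) * sum_i K((x - x_i) / h), K = standard normal pdf."""
    diffs = (x_grid[:, None] - samples[None, :]) / h
    K = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    return K.mean(axis=1) / h   # average kernel mass, scaled by 1/h
```

The bandwidth h drives the bias-variance tradeoff in the second bullet: small h gives a spiky, high-variance estimate, large h an over-smoothed, biased one.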

10. Support Vector Machines
- Maximum Margin Classifier -- Bayes decision boundary, restricted Bayes optimal classifier, linear SVM classifier; primal formulation (sketch below), problems and solutions
- Lagrange Duality -- Karush-Kuhn-Tucker (KKT) conditions, Quadratic programming
- Dual Formulation of SVM
- Kernel Tricks -- Mapping to Higher Dimensions, Mercer's theorem
- Soft Margin
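
The lecture's emphasis is the primal/dual derivation and the kernel trick; as a runnable complement, here is a soft-margin linear SVM trained by stochastic subgradient descent (a Pegasos-style sketch, not the QP formulation from lecture; labels in {-1, +1}):

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, epochs=50, seed=0):
    """Primal soft-margin linear SVM via stochastic subgradient descent."""
    rng = np.random.default_rng(seed)
    w, t = np.zeros(X.shape[1]), 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            t += 1
            eta = 1.0 / (lam * t)        # decreasing step size
            w *= 1.0 - eta * lam         # shrinkage from the L2 regularizer
            if y[i] * (w @ X[i]) < 1:    # inside the margin: hinge subgradient
                w += eta * y[i] * X[i]
    return w
```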

11. Matrix Factorization
- Singular Value Decomposition (SVD) (sketch below); the cocktail party problem
- Independent Component Analysis (ICA) -- linear vs. statistical independence
- Methods: projection pursuit, infomax (mutual information), and MLE
- FastICA
- Non-negative Matrix Factorization (NMF)
- Dictionary Learning
- Autoencoders
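
A quick SVD sketch in NumPy: factor, reconstruct, and form the best rank-2 approximation (Eckart-Young). The data here is a random placeholder:

```python
import numpy as np

X = np.random.default_rng(0).random((6, 4))
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(s) @ Vt
print(np.allclose(X, (U * s) @ Vt))                # exact reconstruction: True
X2 = (U[:, :2] * s[:2]) @ Vt[:2]                   # best rank-2 approximation
```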

15. Stochastic Gradient Descent (SGD)
- Gradient Descent Basics
- Stochastic vs Batch Gradient Descent
- Learning Rate Scheduling
- Convergence Properties
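
A minimal SGD loop on the squared loss with a decaying learning-rate schedule (a sketch; the particular schedule is illustrative, not the one from lecture):

```python
import numpy as np

def sgd_least_squares(X, y, lr0=0.1, epochs=100, seed=0):
    """SGD for least squares: one example per update, decaying step size."""
    rng = np.random.default_rng(seed)
    w, t = np.zeros(X.shape[1]), 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):    # stochastic: one example at a time
            t += 1
            lr = lr0 / (1.0 + 0.01 * t)      # learning-rate schedule
            w -= lr * (X[i] @ w - y[i]) * X[i]   # single-example gradient
    return w
```

Batch gradient descent replaces the inner loop with one full-gradient step per epoch; the decaying step size is what the convergence results require.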

16. k-means Clustering
- Clustering Basics
- Hard k-means
- Soft k-means
- Gaussian Mixture Models (GMM)
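
A sketch of hard k-means (Lloyd's algorithm) in NumPy; soft k-means and GMMs replace the hard argmin assignment with graded responsibilities:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Hard k-means: alternate nearest-center assignment and mean update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: nearest center for each point
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # update step: mean of each cluster (empty clusters keep old center)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels
```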

17. Expectation Maximization (EM)
- EM Algorithm
- Gaussian Mixture Models (GMM)
- Convergence Properties
- Applications of EM
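
A compact EM loop for a one-dimensional Gaussian mixture, alternating the E and M steps (a sketch; initialization and the stopping rule are simplified):

```python
import numpy as np

def em_gmm_1d(x, k, iters=50, seed=0):
    """EM for a 1-D Gaussian mixture model on samples x of shape (n,)."""
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)                    # mixing weights
    mu = rng.choice(x, size=k, replace=False)   # initialize means at data points
    var = np.full(k, x.var())
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = P(component j | x_i)
        logp = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
                - 0.5 * (x[:, None] - mu) ** 2 / var)
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimates of weights, means, variances
        n = r.sum(axis=0)
        pi = n / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n
    return pi, mu, var
```

Each iteration is guaranteed not to decrease the data log-likelihood, which is the core of the convergence discussion above.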

18. Automatic Differentiation
- Forward Mode AD
- Reverse Mode AD
- Backpropagation
- Applications in Deep Learning
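
Forward-mode AD in miniature via dual numbers (a sketch; reverse mode / backpropagation instead propagates adjoints backward through the computation graph):

```python
class Dual:
    """Dual number a + b*eps with eps^2 = 0: forward-mode AD in one class."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val,
                    self.val * o.dot + self.dot * o.val)  # product rule
    __rmul__ = __mul__

x = Dual(3.0, 1.0)       # seed the derivative: dx/dx = 1
y = x * x + 2 * x + 1    # f(x) = x^2 + 2x + 1
print(y.val, y.dot)      # 16.0 and f'(3) = 2*3 + 2 = 8.0
```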

19. Nonlinear Embedding Approaches
- Manifold Learning
- t-SNE
- UMAP
- Applications in Visualization
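
A typical t-SNE invocation for visualization, assuming scikit-learn is available (the course materials may use different tooling; the data here is a random placeholder):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(200, 50))   # placeholder features
emb = TSNE(n_components=2, perplexity=30.0).fit_transform(X)
print(emb.shape)   # (200, 2): one 2-D point per input row
```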

20. Model Comparison I
- Bias-Variance Tradeoff
- No Free Lunch Theorem
- Confusion Matrix
- Cross-Validation
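
A from-scratch binary confusion matrix, as covered above (a sketch; rows index the actual class, columns the predicted class):

```python
import numpy as np

def confusion_matrix(y_true, y_pred):
    """2x2 counts for binary labels in {0, 1}: m[actual, predicted]."""
    m = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m

print(confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1]))
```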

21. Model Comparison II
- Cross-Validation and Hyperopt
- Expected Value Framework
- Visualizing Model Performance
- Receiver Operating Characteristics (ROC)
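
A minimal ROC computation: sweep the decision threshold down the sorted scores and accumulate true- and false-positive rates (a sketch; names mine):

```python
import numpy as np

def roc_points(scores, labels):
    """(FPR, TPR) pairs as the threshold sweeps down; labels in {0, 1}."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(-scores)                        # descending by score
    labels = labels[order]
    tpr = np.cumsum(labels) / labels.sum()             # true-positive rate
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()   # false-positive rate
    return fpr, tpr
```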

22. Model Calibration
- Calibration Techniques
- Reliability Diagrams
- Platt Scaling
- Isotonic Regression
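
A sketch of Platt scaling: fit sigmoid(a*s + b) to held-out classifier scores by gradient descent on the log loss (Platt's original method uses a regularized Newton fit; this plain gradient version is a simplification):

```python
import numpy as np

def platt_scale(scores, labels, lr=0.1, iters=5000):
    """Fit P(y=1 | s) = sigmoid(a*s + b) on held-out (score, label) pairs."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    a, b = 1.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        err = p - labels                  # gradient of the log loss w.r.t. a*s+b
        a -= lr * (err * scores).mean()
        b -= lr * err.mean()
    return a, b
```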

23. Convolutional Neural Networks (CNNs)
- Building Blocks
- Skip Connections
- Fully Convolutional Networks
- Semantic Segmentation
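
The basic CNN building block, a 2-D convolution with no padding, written out explicitly (a sketch; deep-learning libraries implement this far more efficiently and, strictly speaking, compute cross-correlation):

```python
import numpy as np

def conv2d(img, kernel, stride=1):
    """Valid 2-D cross-correlation of a single-channel image with a kernel."""
    kh, kw = kernel.shape
    H = (img.shape[0] - kh) // stride + 1   # output height
    W = (img.shape[1] - kw) // stride + 1   # output width
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            patch = img[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = (patch * kernel).sum()
    return out
```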

24. Word Embedding
- Bag of Words
- Word2Vec
- GloVe
- Applications in NLP
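
A from-scratch bag-of-words vectorizer, the baseline representation that Word2Vec and GloVe improve on with dense, learned vectors (a sketch; tokenization is naive whitespace splitting):

```python
from collections import Counter

def bag_of_words(docs):
    """Term-count vectors over a shared, sorted vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    counts = [Counter(d.lower().split()) for d in docs]
    vectors = [[c[w] for w in vocab] for c in counts]   # missing words count 0
    return vocab, vectors

vocab, vecs = bag_of_words(["the cat sat", "the dog sat down"])
```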