Algorithmic & Information Theoretic Methods in Data Science (ECE6980)

Overview

We will study algorithmic, statistical, and information theoretic tools to understand the fundamentals of data science. Some of the questions we will ask are:
  • How much data do you need? Sample complexity
  • How much time do you need? Computational complexity
  • How much storage do you need? Streaming and space complexity
  • Is the data safe with you? Privacy
  • Can you tolerate noise in the data? Robustness
  • Can you handle distributed data? Communication complexity
Tentative list of topics:
  • Distances between distributions: total variation, KL divergence, chi-squared, etc. Concentration inequalities.
  • Learning discrete distributions
  • Minimax setting and sample complexity
  • Information theory basics
  • Data compression and learning
  • Fano's inequality and Le Cam's method
  • Gaussian mixtures, metric entropy and covering numbers
  • Little data over ginormous domains and their applications
  • Small sample hypothesis testing
  • Good-Turing probability estimation, missing mass
  • Robustness of statistical algorithms, Huber contamination model, Tukey median
  • Streaming algorithms, memory vs sample complexity
  • Distributed machine learning problems, communication vs sample complexity
  • Privacy and machine learning, samples vs privacy
  • Estimating discrete distribution properties, e.g., entropy estimation, etc.
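To make the first topic concrete, here is a minimal sketch of the three divergences for discrete distributions; the example pmfs are made up for illustration:

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance: half the L1 distance between pmfs."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(0.5 * np.abs(p - q).sum())

def kl_divergence(p, q):
    """KL divergence D(p || q) in nats; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def chi_squared(p, q):
    """Chi-squared divergence between p and q; assumes q > 0 everywhere."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum((p - q) ** 2 / q))

# Two example distributions over a 3-element domain (arbitrary choices).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(tv_distance(p, q))    # here 0.5 * (0.1 + 0.1 + 0) ≈ 0.1
print(kl_divergence(p, q))
print(chi_squared(p, q))
```

These divergences are related by standard inequalities covered in class, e.g., Pinsker's inequality TV(p, q) ≤ sqrt(D(p || q) / 2), which the example above satisfies.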
The first few lectures will cover preliminaries such as concentration inequalities and basic information theory.
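As a first taste of sample complexity, the sketch below simulates learning a discrete distribution with the empirical (plug-in) estimator and watches its total variation error shrink as the sample size grows; the support size and random seed are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10                            # support size (arbitrary example value)
p = rng.dirichlet(np.ones(k))     # a randomly drawn "true" distribution

def empirical_tv_error(n, trials=200):
    """Average TV error of the empirical estimator over repeated samples of size n."""
    errs = []
    for _ in range(trials):
        counts = rng.multinomial(n, p)          # draw n i.i.d. samples from p
        errs.append(0.5 * np.abs(counts / n - p).sum())
    return float(np.mean(errs))

# Error should decay on the order of sqrt(k / n) as n grows.
for n in [100, 1000, 10000]:
    print(n, empirical_tv_error(n))
```

Quantifying exactly how many samples are needed to drive this error below a target epsilon, and showing no estimator can do much better, is the kind of minimax question studied in the course.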

Follow the Code of Academic Integrity. Collaboration on assignments is allowed; copying is not. Acknowledge any discussions and other sources used. Assignments must be submitted before the deadline.