Algorithmic and Information Theoretic Methods in Data Science
Algorithmic & Information Theoretic
Methods in Data Science (ECE6980)
Overview
We will study algorithmic, statistical, and information
theoretic tools to understand the fundamentals of data science.
Some of the questions we will ask are:
-
How much data do you need?
Sample complexity
- How much time to you need?
Computational complexity
-
How much storage do you need?
Streaming and space complexity
-
Is the data safe with you?
Privacy
-
Can you tolerate noise in the data?
Robustness
-
Can you handle distributed data?
Communication complexity
Tentative list of topics:
-
Distance between distributions: total variation, kl-divergence,
chi-square, etc.
Concentration inequalities.
-
Learning discrete distributions
-
Minimax-setting and sample complexity
-
Information theory basics
-
Data compression and learning
-
Fano's Inequality and LeCam's method
-
Gaussian mixtures, metric entropy and covering numbers
-
Little data over ginormous domains and their applications
-
Small sample hypothesis testing
-
Good-Turing probability estimation, missing mass
-
Robustness of statistical algorithms, Huber contamination model, Tukey median
-
Streaming algorithms, memory vs sample complexity
-
Distributed machine learning problems, communication vs sample
complexity
-
Privacy and machine learning - samples vs privacy
-
Estimating discrete distribution properties, e.g., entropy estimation, etc.
First few lectures will include preliminaries such as
concentration inequalities, and basic information theory.
Follow the Code of academic
integrity. Collaboration is allowed in the assignments, copying
is not. Acknowledge the discussions and other sources
used. Assignments should be submitted before the deadline.